You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine

Journal: Journal of Data Mining and Digital Humanities (Historical Documents and...)


Abstract

Layout analysis (the identification of zones and their classification) is, along with line segmentation, the first step in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of a digitized book and producing noisy output. We show that most segmenters focus on pixel classification and that polygonization of this output has not been used as a target in the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift the task, for efficiency, from pixel classification-based polygonization to object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter substantially outperforms the former on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents, as well as a new package, YALTAi, which injects YOLOv5 into the segmentation pipeline of Kraken 4.1.
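To make the idea of "isothetic rectangles" concrete, the following minimal Python sketch reduces a zone polygon to its axis-aligned bounding rectangle and writes it in the normalized `class x_center y_center width height` label format used for YOLOv5 training files. It is an illustration only, not code from YALTAi or Kraken: the function names, the sample polygon, and the page dimensions are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def polygon_to_isothetic_rect(polygon: List[Point]) -> Rect:
    """Return the axis-aligned (isothetic) bounding rectangle of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def rect_to_yolo_line(class_id: int, rect: Rect,
                      page_width: float, page_height: float) -> str:
    """Format a rectangle as a normalized YOLO label line.

    Coordinates are divided by the page size so that all values lie in [0, 1].
    """
    x_min, y_min, x_max, y_max = rect
    x_center = (x_min + x_max) / 2 / page_width
    y_center = (y_min + y_max) / 2 / page_height
    width = (x_max - x_min) / page_width
    height = (y_max - y_min) / page_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical zone polygon (pixel coordinates) on a 1000 x 2000 pixel page.
zone = [(100, 200), (500, 180), (520, 900), (90, 920)]
rect = polygon_to_isothetic_rect(zone)
label = rect_to_yolo_line(0, rect, page_width=1000, page_height=2000)
print(label)
```

The rectangle discards the exact outline of the zone, which is the trade-off the abstract describes: a simpler detection target in exchange for a coarser region shape.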

Disciplines

Digital Humanities

