Data Diversity in handwritten text recognition. Challenge or opportunity?

Conference: Digital Humanities 2022 (2022-07-25 - 2022-07-29)
Editor(s): DH2022 Local Organizing Committee
Pages: 160-165

Abstract

In this paper, we present approaches to handling diversity in large collections of training data for text acquisition pipelines, specifically handwritten text recognition (HTR) for medieval manuscripts in Latin and French. Present throughout medieval Europe, Latin is one of the most used written languages of the time on this continent, if not the most used, while French has had, from a relatively early date (around the 12th century, judging from preserved manuscripts), a vernacular production that soon became one of the most prominent in Western Europe, influencing the written culture of its neighbours from its central position. Combined, they provide a case study whose diversity and general scope could, we hope, yield results with broader applicability, even beyond medieval Western manuscripts. Heterogeneity or diversity in the collections can result from intrinsic features (e.g. linguistic, palaeographic, or diachronic variation in the sources), but also from extrinsic features (aim and provenance of transcriptions, idiosyncrasies of transcribers…). We propose to approach both types of diversity by reusing several open datasets from various research projects in diverse fields and involving many collaborators. We add a double focus, linguistic (Latin vs. French manuscripts) and graphic (abbreviated vs. normalised transcriptions). We hope to overcome, to some extent, the issue of linguistic diversity and to propose a common, modular pipeline for different languages that are related but differ in their inner structure and declension mechanisms. In this attempt, we strive to answer more specifically the following questions: (a) To what extent can we (and should we) pool HTR training material across preexisting datasets and even related languages (and is it worth the effort)? (b) Do approaches that separate image-to-text prediction from further linguistic normalisation (abbreviation expansion, for instance) perform better for that goal than straightforward “image to normalised text” approaches?

Disciplines

Digital humanities

Further reading

Other work from the École on the same topics.

Digital humanities

SegmOnto: A Controlled Vocabulary to Describe and Process Digital Facsimiles

Research publication
- Simon Gabay, Ariane Pinche, Kelly Christensen, Jean-Baptiste Camps

Intelligence artificielle et institutions patrimoniales [Artificial intelligence and heritage institutions]

Video
- Emmanuelle Bermès

Enhancing Arabic Maghribi Handwritten Text Recognition with RASAM 2: A Comprehensive Dataset and Benchmarking

Research publication
- Chahan Vidal-Gorène, Clément Salah, Noëmie Lucas, Aliénor Decours-Perez, Antoine Perrier

Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs

Research publication
- Chahan Vidal-Gorène, Nadi Tomeh, Victoria Khurshudyan

Generative Artificial Intelligence and Historical Research: Challenges, Potentials, and Limitations. Application of RAG to French Parliamentary Debates of the Third Republic (1881-1940)

Research publication
- Aurélien Pellet, Julien Perez, Marie Puren

Accountable AI for Authentic Records?

Video

Optimizing HTR and Reading Order Strategies for Chinese Imperial Editions with Few-Shot Learning

Research publication
- Marie Bizais-Lillig, Chahan Vidal-Gorène, Boris Dupin

Detecting and Deciphering Damaged Medieval Armenian Inscriptions Using YOLO and Vision Transformers

Research publication
- Chahan Vidal-Gorène, Aliénor Decours-Perez