- Congrès : Computational Humanities Research Conference (CHR) 2022 (2022-12-12 - 2022-12-14)
Résumé
As more and more projects openly release ground truth for handwritten text recognition (HTR), we expect the quality of automatic transcription to improve on unseen data. Getting models robust to scribal and material changes is a necessary step for specific data mining tasks. However, evaluation of HTR results requires ground truth to compare prediction statistically. In the context of modern languages, successful attempts to evaluate quality have been done using lexical features or n-grams.This, however, proves difficult in the context of spelling variation that both Old French and Latin have, even more so in the context of sometime heavily abbreviated manuscripts. We propose a new method based on deep learning where we attempt to categorize each line error rate into four error rate ranges (0 < 10% < 25% < 50% < 100%) using three different encoder (GRU with Attention, BiLSTM, TextCNN). To train these models, we propose a new dataset engineering approach using early stopped model, as an alternative to rule-based fake predictions. Our model largely outperforms the n-gram approach. We also provide an example application to qualitatively analyse our classifier, using classification on new prediction on a sample of 1,800 manuscripts ranging from the 9 th century to the 15 th .
Disciplines
Partager sur les réseaux sociaux
Publications de chercheur
CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts
Publication de chercheur
Communication dans un congrès
- Date de parution : 2024
Layout Analysis Dataset with SegmOnto
Publication de chercheur
Communication dans un congrès
- Date de parution : 2024
Les registres médiévaux de Notre Dame : une archive numérique ouverte de la vie du chapitre
Publication de chercheur
Communication dans un congrès
- Date de parution : 2024