- In Document Analysis and Recognition – ICDAR 2021 Workshops
- Publisher: Springer International Publishing
- Pages: 265-281
Abstract
Arabic scripts raise numerous issues in text recognition and layout analysis. To overcome these, several datasets and methods have been proposed in recent years. Although the latter focus on common scripts and layouts, many Arabic writings and written traditions remain under-resourced. We therefore propose a new dataset comprising 300 images representative of the handwritten production in Arabic Maghrebi scripts. This dataset is the result of collaborative work undertaken in the first quarter of 2021, and it offers several levels of annotation and transcription. The article sheds light on the specificities of these writings and manuscripts and highlights the challenges they pose for recognition. The collaborative tools used to create the dataset are assessed, and the dataset itself is evaluated with state-of-the-art methods in layout analysis. The word-based text recognition method tested on these writings achieves a character error rate (CER) of 4.8% on average. The pipeline described offers practical feedback on the rapid creation of data and the training of effective handwritten text recognition (HTR) systems for Arabic scripts and non-Latin scripts in general.
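For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between the reference transcription and the recognizer output, divided by the reference length. The function names `levenshtein` and `cer` are illustrative assumptions; the abstract does not describe the authors' evaluation code or how their average was taken.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hyp` into `ref` (classic dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against the ground-truth `ref`."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example strings, not taken from the dataset.
    print(f"{cer('kitten', 'sitting'):.2%}")
```

Because Python strings are sequences of Unicode code points, the same sketch applies to Arabic-script text, although combining marks are then counted as separate characters.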
Researcher publications
CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts
Researcher publication
Conference paper
- Publication date: 2024
Layout Analysis Dataset with SegmOnto
Researcher publication
Conference paper
- Publication date: 2024
Les registres médiévaux de Notre Dame : une archive numérique ouverte de la vie du chapitre
Researcher publication
Conference paper
- Publication date: 2024