- In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
- Publisher: European Language Resources Association (ELRA)
- Pages: 22-27
Abstract
Classical Armenian, Old Georgian and Syriac are digitally under-resourced languages. Although many printed critical editions and dictionaries are available, there is currently a lack of fully tagged corpora that could be reused for automatic text analysis. In this paper, we introduce an ongoing project of lemmatization and POS-tagging for these languages, relying on a recurrent neural network (RNN), specific morphological tags and dedicated datasets. For this paper, we have combined different corpora previously processed by automatic out-of-context lemmatization and POS-tagging, then manually proofread by the collaborators of the GREgORI Project (UCLouvain, Louvain-la-Neuve, Belgium). We intend to compare a rule-based approach and an RNN approach, using PIE as specialized by Calfa (Paris, France). We present first results here. We reach a mean accuracy of 91.63% in lemmatization and of 92.56% in POS-tagging. The datasets constituted and used for this project are not yet representative of the variation of these languages across the centuries, but they are homogeneous and yield tangible results, paving the way for further analysis of wider corpora.