- Congrès : TEI 2022 conference : Text as data (2022-09-13 - 2022-09-16)
Résumé
Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source’s information as possible. To take full advantage of textual documents, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images’ content on a large scale. The TEI seems to provide the perfect format to capture both an image’s formal and textual data (Janès et al. 2021). However, this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated. But a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with various situations. The solution proposed by the Gallic(orpor)a project1 attempted to deal with such a contradiction, focusing on French historical documents produced between the 15th and the 18th c. It aims to enrich the digital facsimiles distributed by the French National Library (BnF).
Disciplines
Partager sur les réseaux sociaux
Publications de chercheur
CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts
Publication de chercheur
Communication dans un congrès
- Date de parution : 2024
Layout Analysis Dataset with SegmOnto
Publication de chercheur
Communication dans un congrès
- Date de parution : 2024
Les registres médiévaux de Notre Dame : une archive numérique ouverte de la vie du chapitre
Publication de chercheur
Communication dans un congrès
- Date de parution : 2024