A processing chain for extracting and providing online access to annotated and semantically enriched historical data. The AGODA project

Conference: Digital Humanities 2022 (2022-07-25 - 2022-07-29)

Abstract

The AGODA project is one of five pilot projects supported by the DataLab of the Bibliothèque nationale de France. It aims to create an online platform that facilitates the exploration and use of the parliamentary debates of the Chamber of Deputies published in the Journal officiel from 1881 to 1940. Within the DataLab, we are working on a test subcorpus, the parliamentary cycle from 1889 to 1893, so that we can test our hypotheses on a smaller dataset.

Over the past sixty years, a great deal of work has been done on parliamentary debates, which are a valuable source for historians, political scientists, sociologists and linguists. Access to digitised and OCR-processed debates seems to have a positive effect on the number of historical works using these documents, and the same effect can be observed in other disciplines that use contemporary debates. AGODA is thus part of a wider movement to facilitate the use and analysis of parliamentary data, following the example of ParlaClarin and ParlaMint, which propose to produce comparable and multilingual Parliamentary Proceedings Corpora according to the XML-TEI standard. Naomi Truan has also produced a corpus of parliamentary debates encoded in XML-TEI. Producing this type of resource facilitates the publication of works that exploit the data to better understand French political discourse.

Between 1881 and 1899, 2596 issues of the Journal officiel were published (50791 JPG images). The debates are also available in TXT format, but they were put online without extensive post-correction: the quality of the OCR is not sufficient to provide a satisfactory online browsing experience, and it could have a negative impact on the analyses performed on these texts. We therefore chose to run OCR on the text again to obtain a better-quality result. We use the PERO OCR-based solution developed by the SODUCO project.
The OCR output is obtained in JSON format; we are developing Python scripts to convert this output into an XML file that conforms to the chosen TEI model. This model is formalised as an adapted XML schema, generated from an ODD. We chose to use the ODD created by ParlaClarin, which can easily be adapted to annotate historical parliamentary debates. In France, the rules for transcribing debates were set in the 19th century; the records of today's debates are therefore very similar to those produced during the Third Republic. The TEI-encoded corpus will be stored in an eXist-db database and visualised using the TEI Publisher application, which can transform the source data into HTML web pages. The parliamentary debates will thus be made available to online users as a digital edition and integrated into an application context.

We will also present the first analyses we have carried out on this corpus with "bag-of-words" techniques, which are relatively insensitive to the quality of the OCR. We first used topic modelling, an unsupervised learning method that uncovers the latent semantic structures of a corpus of texts without relying on semantic or lexical resources. This method is well suited to the study of parliamentary debates. Alternatively, word embeddings can reduce the dimension of the original space from several tens of thousands of word forms to about a hundred axes, after which classical data science tools such as clustering or correlation analysis can be applied to the reduced space. Word embeddings have already proved useful for the study of parliamentary debates. We used a continuous bag-of-words model for dimension reduction and an unsupervised clustering algorithm, in this case DBSCAN, to group words into clusters.
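To make the JSON-to-TEI conversion step more concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual script. The input layout (a `page` number and a list of `regions`, each holding a list of `lines`), the function name `ocr_json_to_tei`, and the header wording are all assumptions made for illustration; the real OCR output and the ParlaClarin-based TEI model are likely to be richer.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(tag: str) -> str:
    """Return the tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def ocr_json_to_tei(ocr_json: str, title: str) -> str:
    """Convert one page of OCR output (hypothetical layout) to a TEI string.

    Assumed input, for illustration only:
    {"page": 12, "regions": [{"lines": ["first line", "second line"]}]}
    """
    page = json.loads(ocr_json)

    # Serialise TEI as the default namespace (no prefix on element names).
    ET.register_namespace("", TEI_NS)
    root = ET.Element(tei("TEI"))

    # Minimal header: title, publication and source statements.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Unpublished working file."
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "Converted from OCR output."

    # Body: one page break marker, then one <p> per OCR region.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    div = ET.SubElement(body, tei("div"))
    ET.SubElement(div, tei("pb"), {"n": str(page["page"])})
    for region in page["regions"]:
        # Join the non-empty lines of a region into a single paragraph.
        joined = " ".join(l.strip() for l in region["lines"] if l.strip())
        if joined:
            ET.SubElement(div, tei("p")).text = joined

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Invented sample text, not taken from the corpus.
    sample = json.dumps(
        {"page": 12, "regions": [{"lines": ["The sitting is", "opened."]}]}
    )
    print(ocr_json_to_tei(sample, "Sample sitting"))
```

A sketch like this only produces a flat sequence of paragraphs. Identifying speakers, interruptions and sittings, and validating the result against the schema generated from the ODD, would be separate steps.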

Disciplines

Digital humanities
