Dominique Stutzmann


2021

Named Entity Recognition for French medieval charters
Sergio Torres Aguilar | Dominique Stutzmann
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

This paper presents the process of annotating and modelling a corpus to automatically detect named entities in medieval charters in French. It introduces a new annotated corpus and a new system which outperforms state-of-the-art libraries. Charters are legal documents and among the most important historical sources for medieval studies, as they reflect economic and social dynamics as well as the evolution of literacy and writing practices. Automatic detection of named entities greatly improves access to these unstructured texts and facilitates historical research. The experiments described here are based on a corpus of about 500k words (1200 charters) drawn from three charter collections of the 13th and 14th centuries. We annotated the corpus and then trained two state-of-the-art NLP libraries for Named Entity Recognition (spaCy and Flair) and a custom neural model (Bi-LSTM-CRF). The evaluation shows that all three models achieve high performance on the test set and generalize well to two external corpora unseen during training. This paper describes the corpus and the annotation model, and discusses the issues related to the linguistic processing of medieval French and formulaic discourse, so as to interpret the results within a larger historical perspective.

2020

Books of Hours. the First Liturgical Data Set for Text Segmentation.
Amir Hazem | Béatrice Daille | Christopher Kermorvant | Dominique Stutzmann | Marie-Laurence Bonhomme | Martin Maarand | Mélodie Boillet
Proceedings of the Twelfth Language Resources and Evaluation Conference

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has scarcely been studied because of its manuscript nature, its length and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, its often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) its hierarchical entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, its textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, which consists of Latin transcriptions of 300 books of hours generated by Handwritten Text Recognition (HTR), which is like Optical Character Recognition (OCR) but for handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.

Hierarchical Text Segmentation for Medieval Manuscripts
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Christopher Kermorvant | Louis Chevalier
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we address the segmentation of books of hours, Latin devotional manuscripts of the late Middle Ages that exhibit challenging issues: a complex hierarchical entangled structure, variable content, noisy transcriptions with no sentence markers, and strong correlations between sections for which topical information is no longer sufficient to draw segmentation boundaries. We show that the main state-of-the-art segmentation methods are either inefficient or inapplicable for books of hours and propose a bottom-up greedy approach that considerably enhances the segmentation results. We stress the importance of such hierarchical segmentation for historians who wish to explore the overarching differences between books of hours and the conceptions of the Church that underlie them.

2019

Transcription automatique et segmentation thématique de livres d’heures manuscrits [Automatic transcription and thematic segmentation of Books of Hours]
Béatrice Daille | Amir Hazem | Christopher Kermorvant | Martin Maarand | Marie-Laurence Bonhomme | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Traitement Automatique des Langues, Volume 60, Numéro 3 : TAL et humanités numériques [NLP and Digital Humanities]

Réutilisation de Textes dans les Manuscrits Anciens (Text Reuse in Ancient Manuscripts)
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

In this article, we address the problem of text reuse in the liturgical books of the Middle Ages. More specifically, we study the textual variations of the Obsecro Te prayer, which is often found in books of hours. Manual observation of 772 copies of the Obsecro Te revealed more than 21,000 textual variants. In order to extract and categorise them automatically, we first propose a lexico-semantic classification at the level of word n-grams, and then report the performance of several state-of-the-art approaches to the automatic matching of textual variants of the Obsecro Te.

Towards Automatic Variant Analysis of Ancient Devotional Texts
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

In this paper, we address the issue of text reuse in liturgical manuscripts of the Middle Ages. More specifically, we study variant readings of the Obsecro Te prayer, part of the devotional Books of Hours often used by Christians as guidance for their daily prayers. We aim to automatically extract and categorise pairs of words and expressions that exhibit variant relations. For this purpose, we adopt a linguistic classification that characterizes the variants better than edit operations do. Then, we study the evolution of Obsecro Te texts along temporal and geographical axes. Finally, we contrast several unsupervised state-of-the-art approaches for the automatic extraction of Obsecro Te variants. Based on the manual observation of 772 Obsecro Te copies, which show more than 21,000 variants, we show that the proposed methodology is helpful for an automatic study of variants and may serve as a basis for analyzing and presenting useful information from devotional texts.