Marco Sorbi
2026
Weakly Supervised Named Entity Recognition for Historical Texts
Marco Sorbi | Laurent Moccozet | Stephane Marchand-Maillet
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026
Named Entity Recognition (NER) has emerged as a critical task in natural language processing, particularly for extracting meaningful information from unstructured text. Although traditional approaches rely heavily on large annotated datasets, recent advances have explored weak supervision techniques to address the limitations of resource-intensive annotation processes. Historical texts pose unique challenges for this task because of their linguistic peculiarities. Several supervised approaches exist for texts of this domain, but they require domain experts to carry out lengthy manual annotation of the documents of interest. To address this issue, this paper explores how recent weakly supervised NER techniques can be adapted to historical texts and analyzes their suitability for this domain. The experiments show that domain-specific architectures can be effectively trained on low-resource corpora with weak supervision over a small set of entity labels. Using only 10% of the annotations, these architectures retain more than 80% of the F1-score achieved with full supervision.
2024
RCnum: A Semantic and Multilingual Online Edition of the Geneva Council Registers from 1545 to 1550
Pierrette Bouillon | Christophe Chazalon | Sandra Coram-Mekkey | Gilles Falquet | Johanna Gerlach | Stephane Marchand-Maillet | Laurent Moccozet | Jonathan Mutal | Raphael Rubino | Marco Sorbi
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
The RCnum project is funded by the Swiss National Science Foundation and aims to produce a multilingual and semantically rich online edition of the Registers of the Geneva Council from 1545 to 1550. Combining multilingual NLP, history and paleography, this collaborative project will clear hurdles inherent to handwritten texts in 16th-century Middle French while allowing for easy access to and interactive consultation of these archives.