Normalizing without Modernizing: Keeping Historical Wordforms of Middle French while Reducing Spelling Variants

Raphael Rubino; Johanna Gerlach; Jonathan Mutal; Pierrette Bouillon

doi:10.18653/v1/2024.findings-naacl.215

Abstract

Conservation of historical documents benefits from computational methods by alleviating the manual labor related to digitization and modernization of textual content. Languages usually evolve over time and keeping historical wordforms is crucial for diachronic studies and digital humanities. However, spelling conventions did not necessarily exist when texts were originally written and orthographic variations are commonly observed depending on scribes and time periods. In this study, we propose to automatically normalize orthographic wordforms found in historical archives written in Middle French during the 16th century without fully modernizing textual content. We leverage pre-trained models in a low resource setting based on a manually curated parallel corpus and produce additional resources with artificial data generation approaches. Results show that causal language models and knowledge distillation improve over a strong baseline, thus validating the proposed methods.

Anthology ID: 2024.findings-naacl.215
Volume: Findings of the Association for Computational Linguistics: NAACL 2024
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3394–3402
URL: https://aclanthology.org/2024.findings-naacl.215/
DOI: 10.18653/v1/2024.findings-naacl.215
Cite (ACL): Raphael Rubino, Johanna Gerlach, Jonathan Mutal, and Pierrette Bouillon. 2024. Normalizing without Modernizing: Keeping Historical Wordforms of Middle French while Reducing Spelling Variants. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3394–3402, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Normalizing without Modernizing: Keeping Historical Wordforms of Middle French while Reducing Spelling Variants (Rubino et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-naacl.215.pdf
