Eva Schaeffer-Lacroix


2024

Compilation of a Synthetic Judeo-French Corpus
Iglika Nikolova-Stoupak | Gaël Lejeune | Eva Schaeffer-Lacroix
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

This short paper describes the derivation of synthetic Judeo-French text. Judeo-French is one of a number of rare languages spoken and written by Jewish communities within a particular temporal and geographical frame (in this case, 11th- to 14th-century France). Resources in the language are very limited, and its presence in contemporary Natural Language Processing (NLP) is practically non-existent. This work outlines the compilation of a synthetic Judeo-French corpus. To this end, a pipeline of transformations is applied to Old French text from the same general time period, yielding text that is as reliable as possible with respect to the phonological, morphological and lexical characteristics attested in Judeo-French. Ultimately, the goal is for this synthetic corpus to be used in standard NLP tasks, such as Neural Machine Translation (NMT), as a form of data augmentation.