Clamença Poujade


2024

Loflòc: A Morphological Lexicon for Occitan using Universal Dependencies
Marianne Vergez-Couret | Myriam Bras | Aleksandra Miletić | Clamença Poujade
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents Loflòc (Lexic obèrt flechit Occitan – Open Inflected Lexicon of Occitan), a morphological lexicon for Occitan. Even though the lexicon no longer occupies the same place in the NLP pipeline since the advent of large language models, it remains a crucial resource for low-resourced languages. Occitan is a Romance language spoken in the south of France and in parts of Italy and Spain. It is not recognized as an official language in France and no standard variety is shared across the area. To the best of our knowledge, Loflòc is the first publicly available lexicon for Occitan. It contains 650 thousand entries for 57 thousand lemmas. Each entry is accompanied by the corresponding Universal Dependencies Part-of-Speech tag. We show that the lexicon has solid coverage on the existing freely available corpora of Occitan in four major dialects. Coverage gaps on multi-dialect corpora are overwhelmingly driven by dialectal variation, which affects both open and closed classes. Based on this analysis we propose directions for future improvements.

2020

Building a Universal Dependencies Treebank for Occitan
Aleksandra Miletic | Myriam Bras | Marianne Vergez-Couret | Louise Esher | Clamença Poujade | Jean Sibille
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper outlines the ongoing effort of creating the first treebank for Occitan, a low-resourced regional language spoken mainly in the south of France. We briefly present the global context of the project and report on its current status. We adopt the Universal Dependencies framework for this project. Our methodology is based on two main principles. Firstly, in order to guarantee the annotation quality, we use the agile annotation approach. Secondly, we rely on pre-processing using existing tools (taggers and parsers) to facilitate the work of human annotators, mainly through a delexicalized cross-lingual parsing approach. We present the results available at this point (annotation guidelines and a sub-corpus annotated with PoS tags and lemmas) and give the timeline for the rest of the work.

A Four-Dialect Treebank for Occitan: Building Process and Parsing Experiments
Aleksandra Miletic | Myriam Bras | Marianne Vergez-Couret | Louise Esher | Clamença Poujade | Jean Sibille
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Occitan is a Romance language spoken mainly in the south of France. It has no official status in the country, is not standardized, and displays important diatopic variation resulting in a rich system of dialects. Recently, a first treebank for this language was created. However, this corpus is based exclusively on texts in the Lengadocian dialect. Our paper describes the work aimed at extending the existing corpus with content in three new dialects, namely Gascon, Provençau and Lemosin. We describe both the annotation of initial content in these new varieties of Occitan and experiments allowing us to identify the most efficient method for further enrichment of the corpus. We observe that parsing models trained on Occitan dialects achieve better results than a delexicalized model trained on other Romance languages despite the latter training corpus being much larger (20K vs 900K tokens). The results of the native Occitan models show an important impact of cross-dialectal lexical variation, whereas syntactic variation seems to affect the systems less. We hope that the resulting corpus, incorporating several Occitan varieties, will facilitate the training of robust NLP tools, capable of processing all kinds of Occitan texts.