2024
pdf
bib
abs
CorpusArièja: Building an Annotated Corpus with Variation in Occitan
Clamenca Poujade
|
Myriam Bras
|
Assaf Urieli
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
The Occitan language is a less resourced language and is classified as ‘in danger’ by the UNESCO. Thereby, it is important to build resources and tools that can help to safeguard and develop the digitisation of the language. CorpusArièja is a collection of 72 texts (just over 41,000 tokens) in the Occitan language of the French department of Ariège. The majority of the texts needed to be digitised and pass within an Optical Character Recognition. This corpus contains dialectal and spelling variation, but is limited to prose, without diachronic variation or genre variation. It is an annotated corpus with two levels of lemmatisation, POS tags and verbal inflection. One of the main aims of the corpus is to enable the conception of tools that can automatically annotate all Occitan texts, regardless of the dialect or spelling used. The Ariège territory is interesting because it includes the two variations that we focus on, dialectal and spelling. It has plenty of authors that write in their native language, their variety of Occitan.
pdf
bib
abs
Loflòc: A Morphological Lexicon for Occitan using Universal Dependencies
Marianne Vergez-Couret
|
Myriam Bras
|
Aleksandra Miletić
|
Clamença Poujade
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents Loflòc (Lexic obèrt flechit Occitan – Open Inflected Lexicon of Occitan), a morphological lexicon for Occitan. Even though the lexicon no longer occupies the same place in the NLP pipeline since the advent of large language models, it remains a crucial resource for low-resourced languages. Occitan is a Romance language spoken in the south of France and in parts of Italy and Spain. It is not recognized as an official language in France and no standard variety is shared across the area. To the best of our knowledge, Loflòc is the first publicly available lexicon for Occitan. It contains 650 thousand entries for 57 thousand lemmas. Each entry is accompanied by the corresponding Universal Dependencies Part-of-Speech tag. We show that the lexicon has solid coverage on the existing freely available corpora of Occitan in four major dialects. Coverage gaps on multi-dialect corpora are overwhelmingly driven by dialectal variation, which affects both open and closed classes. Based on this analysis we propose directions for future improvements.
2020
pdf
bib
abs
A Four-Dialect Treebank for Occitan: Building Process and Parsing Experiments
Aleksandra Miletic
|
Myriam Bras
|
Marianne Vergez-Couret
|
Louise Esher
|
Clamença Poujade
|
Jean Sibille
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Occitan is a Romance language spoken mainly in the south of France. It has no official status in the country, it is not standardized and displays important diatopic variation resulting in a rich system of dialects. Recently, a first treebank for this language was created. However, this corpus is based exclusively on texts in the Lengadocian dialect. Our paper describes the work aimed at extending the existing corpus with content in three new dialects, namely Gascon, Provençau and Lemosin. We describe both the annotation of initial content in these new varieties of Occitan and experiments allowing us to identify the most efficient method for further enrichment of the corpus. We observe that parsing models trained on Occitan dialects achieve better results than a delexicalized model trained on other Romance languages despite the latter training corpus being much larger (20K vs 900K tokens). The results of the native Occitan models show an important impact of cross-dialectal lexical variation, whereas syntactic variation seems to affect the systems less. We hope that the resulting corpus, incorporating several Occitan varieties, will facilitate the training of robust NLP tools, capable of processing all kinds of Occitan texts.
pdf
bib
abs
Building a Universal Dependencies Treebank for Occitan
Aleksandra Miletic
|
Myriam Bras
|
Marianne Vergez-Couret
|
Louise Esher
|
Clamença Poujade
|
Jean Sibille
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper outlines the ongoing effort of creating the first treebank for Occitan, a low-ressourced regional language spoken mainly in the south of France. We briefly present the global context of the project and report on its current status. We adopt the Universal Dependencies framework for this project. Our methodology is based on two main principles. Firstly, in order to guarantee the annotation quality, we use the agile annotation approach. Secondly, we rely on pre-processing using existing tools (taggers and parsers) to facilitate the work of human annotators, mainly through a delexicalized cross-lingual parsing approach. We present the results available at this point (annotation guidelines and a sub-corpus annotated with PoS tags and lemmas) and give the timeline for the rest of the work.