A Four-Dialect Treebank for Occitan: Building Process and Parsing Experiments

Aleksandra Miletic, Myriam Bras, Marianne Vergez-Couret, Louise Esher, Clamença Poujade, Jean Sibille


Abstract
Occitan is a Romance language spoken mainly in the south of France. It has no official status in the country, is not standardized, and displays considerable diatopic variation, resulting in a rich system of dialects. A first treebank for this language was created recently, but it is based exclusively on texts in the Lengadocian dialect. This paper describes work aimed at extending the existing corpus with content in three new dialects: Gascon, Provençau and Lemosin. We describe both the annotation of initial content in these new varieties of Occitan and the experiments that allowed us to identify the most efficient method for further enriching the corpus. We observe that parsing models trained on Occitan dialects achieve better results than a delexicalized model trained on other Romance languages, even though the latter's training corpus is much larger (900K tokens, compared with 20K for the Occitan models). The results of the native Occitan models show a substantial impact of cross-dialectal lexical variation, whereas syntactic variation seems to affect the systems less. We hope that the resulting corpus, which incorporates several Occitan varieties, will facilitate the training of robust NLP tools capable of processing all kinds of Occitan texts.
Anthology ID:
2020.vardial-1.13
Volume:
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venues:
COLING | VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
Pages:
140–149
URL:
https://aclanthology.org/2020.vardial-1.13
Cite (ACL):
Aleksandra Miletic, Myriam Bras, Marianne Vergez-Couret, Louise Esher, Clamença Poujade, and Jean Sibille. 2020. A Four-Dialect Treebank for Occitan: Building Process and Parsing Experiments. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 140–149, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Cite (Informal):
A Four-Dialect Treebank for Occitan: Building Process and Parsing Experiments (Miletic et al., VarDial 2020)
PDF:
https://aclanthology.org/2020.vardial-1.13.pdf