OcWikiDialects: A Wikipedia Dataset With Rich Metadata for Occitan Dialect Identification

Oriane Nédey; Rachel Bawden; Thibault Clérice; Benoît Sagot

Abstract

Occitan is a Romance language spoken mostly in the South of France and characterised by rich dialectal variation, which can pose problems for certain NLP tools. These difficulties are largely attributable to the scarcity of dialect-annotated corpora, in a context where linguistic classification within the Occitan dialect continuum is still debated and major nomenclatures, such as ISO 639, fail to provide granular codes for varieties below the generic "Occitan" label. In this paper, we introduce OcWikiDialects, a new dataset comprising articles from the Occitan Wikipedia. The corpus features rich metadata, including dialect labels, and is segmented at both paragraph and sentence levels. Combining it with previously released datasets, we explore approaches to Occitan dialect identification by training three types of model on up to 8 labels: linear SVM classifiers based on word and character n-grams, FastText classifiers based on pretrained vectors, and BERT-based neural classifiers adapted through fine-tuning. Evaluations across in- and out-of-domain test sets demonstrate the substantial impact of our new dataset on the task. However, a peak macro-averaged F1 score of 58.15 underscores persistent challenges for underrepresented Occitan varieties, a finding supported by our per-dialect analysis. Code, dataset and models are available at https://github.com/DEFI-COLaF/OcWikiDialects.

Anthology ID: 2026.vardial-1.4
Volume: Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Month: March
Year: 2026
Address: Rabat, Morocco
Venues: VarDial | WS
Publisher: Association for Computational Linguistics
Pages: 45–57
URL: https://aclanthology.org/2026.vardial-1.4/
Cite (ACL): Oriane Nédey, Rachel Bawden, Thibault Clérice, and Benoît Sagot. 2026. OcWikiDialects: A Wikipedia Dataset With Rich Metadata for Occitan Dialect Identification. In Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 45–57, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): OcWikiDialects: A Wikipedia Dataset With Rich Metadata for Occitan Dialect Identification (Nédey et al., VarDial 2026)
PDF: https://aclanthology.org/2026.vardial-1.4.pdf
