RoDia: A New Dataset for Romanian Dialect Identification from Speech

Rotaru Codruț, Nicolae Ristea, Radu Ionescu


Abstract
We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset comprises a diverse collection of speech samples from five distinct regions of Romania, covering both urban and rural environments, and totals 2 hours of manually annotated speech data. Along with the dataset, we introduce a set of competitive models to serve as baselines for future research. The top-scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We therefore believe that RoDia is a valuable resource that will stimulate research on the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
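The abstract reports both a macro and a micro F1 score. As a minimal sketch of how the two differ, the Python below computes both from toy predictions. The labels `region_a`, `region_b`, `region_c` and the helper `f1_scores` are hypothetical illustrations, not names or data from RoDia, and this is not the authors' evaluation code.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    # Macro F1: compute F1 per class, then take the unweighted mean.
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(per_class)

    # Micro F1: pool the counts over all classes, then compute one F1.
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / denom if denom else 0.0
    return macro, micro

# Toy, imbalanced example with hypothetical region labels (not RoDia data).
y_true = ["region_a"] * 4 + ["region_b", "region_b", "region_c"]
y_pred = ["region_a"] * 4 + ["region_b", "region_c", "region_b"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In the single-label setting, micro F1 equals plain accuracy, while macro F1 weights every class equally, so it drops when less frequent classes are poorly recognized. A macro score below the micro score, as reported above, is consistent with uneven per-class performance.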
Anthology ID:
2024.findings-naacl.20
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
279–286
URL:
https://aclanthology.org/2024.findings-naacl.20
Cite (ACL):
Rotaru Codruț, Nicolae Ristea, and Radu Ionescu. 2024. RoDia: A New Dataset for Romanian Dialect Identification from Speech. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 279–286, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
RoDia: A New Dataset for Romanian Dialect Identification from Speech (Codruț et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.20.pdf