Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning

Mingfei Lau; Qian Chen (陈千); Yeming Fang; Tingting Xu; Tongzhou Chen; Pavel Golik

doi:10.18653/v1/2025.acl-long.370

Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning

Mingfei Lau, Qian Chen, Yeming Fang, Tingting Xu, Tongzhou Chen, Pavel Golik

Abstract

Our quality audit for three widely used public multilingual speech datasets Mozilla Common Voice 17.0, FLEURS, and VoxPopuli shows that in some languages, these datasets suffer from significant quality issues. We believe addressing these issues will make these datasets more useful as evaluation sets, and improve downstream models. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the process of Automatic Speech Recognition (ASR) dataset creation. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness in creating robust and reliable speech data resources.

Anthology ID:: 2025.acl-long.370
Volume:: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2025
Address:: Vienna, Austria
Editors:: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 7466–7492
Language:
URL:: https://aclanthology.org/2025.acl-long.370/
DOI:: 10.18653/v1/2025.acl-long.370
Bibkey:
Cite (ACL):: Mingfei Lau, Qian Chen, Yeming Fang, Tingting Xu, Tongzhou Chen, and Pavel Golik. 2025. Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7466–7492, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):: Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning (Lau et al., ACL 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.acl-long.370.pdf

PDF Cite Search Fix data