VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis

Emily Ahn, Eleanor Chodroff


Abstract
Cross-linguistic phonetic analysis has long been limited by data scarcity and insufficient computational resources. In the past few years, the availability of large-scale cross-linguistic spoken corpora has increased dramatically, but the data still require considerable computational power and processing for downstream phonetic analysis. To facilitate large-scale cross-linguistic phonetic research, we release the VoxCommunis Corpus, which contains acoustic models, pronunciation lexicons, and word- and phone-level alignments, derived from the publicly available Mozilla Common Voice Corpus. The current release includes data from 36 languages. The corpus also contains acoustic-phonetic measurements, which currently consist of formant frequencies (F1–F4) from all vowel quartiles. Major advantages of this corpus for phonetic analysis include the number of available languages, the large amount of speech per language, and the fact that most language datasets have dozens to hundreds of contributing speakers. We demonstrate the utility of this corpus for downstream phonetic research in a descriptive analysis of language-specific vowel systems, as well as an analysis of “uniformity” in vowel realization across languages. The VoxCommunis Corpus is free to download and use under a CC0 license.
Anthology ID:
2022.lrec-1.566
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5286–5294
URL:
https://aclanthology.org/2022.lrec-1.566
Cite (ACL):
Emily Ahn and Eleanor Chodroff. 2022. VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5286–5294, Marseille, France. European Language Resources Association.
Cite (Informal):
VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis (Ahn & Chodroff, LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.566.pdf
Data
Common Voice, Multilingual LibriSpeech, VoxClamantis