A Corpus for Large-Scale Phonetic Typology

Elizabeth Salesky; Eleanor Chodroff; Tiago Pimentel; Matthew Wiesner; Ryan Cotterell; Alan W. Black; Jason Eisner

doi:10.18653/v1/2020.acl-main.415

Abstract

A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality readings. Our corpus and scripts are publicly available for non-commercial use at https://voxclamantisproject.github.io.

Anthology ID: 2020.acl-main.415
Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2020
Address: Online
Editors: Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 4526–4546
URL: https://aclanthology.org/2020.acl-main.415/
DOI: 10.18653/v1/2020.acl-main.415
Cite (ACL): Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner, Ryan Cotterell, Alan W Black, and Jason Eisner. 2020. A Corpus for Large-Scale Phonetic Typology. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4526–4546, Online. Association for Computational Linguistics.
Cite (Informal): A Corpus for Large-Scale Phonetic Typology (Salesky et al., ACL 2020)
PDF: https://aclanthology.org/2020.acl-main.415.pdf
Video: http://slideslive.com/38928945
