M. Elizabeth Garza
2020
Massively Multilingual Pronunciation Modeling with WikiPron
Jackson L. Lee | Lucas F.E. Ashby | M. Elizabeth Garza | Yeonju Lee-Sikka | Sean Miller | Alan Wong | Arya D. McCarthy | Kyle Gorman
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges we faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.