M. Elizabeth Garza

2020

We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.

Co-authors

Lucas F.E. Ashby 1
Kyle Gorman 1
Jackson L. Lee 1
Yeonju Lee-Sikka 1
Arya D. McCarthy 1
Sean Miller 1
Alan Wong 1

Venues

LREC 1
