Lexicon Design for Transcription of Spontaneous Voice Messages

Michal Gishri, Vered Silber-Varod, Ami Moyal


Abstract
Building a comprehensive pronunciation lexicon is crucial to the success of any speech recognition engine. The first stage of lexicon design involves compiling a comprehensive word list that keeps the Out-Of-Vocabulary (OOV) word rate to a minimum. The second stage involves providing optimized phonemic representations for all lexical items on the list. The research presented here focuses on the first stage of lexicon design, word list compilation, and describes the methodologies employed in compiling a pronunciation lexicon for American English voice message transcription using speech recognition. The lexicon design is based on a topic domain structure with a target of 90% word coverage for each domain. This differs somewhat from standard approaches, in which probable words are extracted from textual corpora. This paper raises four issues involved in lexicon design for the transcription of spontaneous voice messages: the inclusion of interjections and other characteristics common to spontaneous speech; the identification of unique messaging terminology; the relative ratio of proper nouns to common words; and the overall size of the lexicon.
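The coverage target and OOV rate mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a frequency-ranked selection over a tokenized sample for one topic domain, and the function names (`word_list_for_coverage`, `oov_rate`) and the toy message are hypothetical.

```python
from collections import Counter


def word_list_for_coverage(tokens, target=0.90):
    """Return the shortest frequency-ranked word list that covers at least
    `target` of the running words in `tokens` (an assumed selection rule)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    selected, covered = [], 0
    # Most frequent words first; ties broken alphabetically for determinism.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if total == 0 or covered / total >= target:
            break
        selected.append(word)
        covered += freq
    return selected


def oov_rate(tokens, lexicon):
    """Fraction of running words in `tokens` that are missing from `lexicon`."""
    if not tokens:
        return 0.0
    vocab = set(lexicon)
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


# Toy "voice message" sample for a single domain (invented for illustration).
domain_tokens = "hi it's me call me back when you get this um call me".split()
lexicon = word_list_for_coverage(domain_tokens, target=0.90)
print(len(lexicon), "words selected; OOV rate:", oov_rate(domain_tokens, lexicon))
```

In practice the OOV rate would be measured on held-out messages rather than on the sample used to build the list, and the selection would be repeated per topic domain.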
Anthology ID:
L10-1644
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/953_Paper.pdf
Cite (ACL):
Michal Gishri, Vered Silber-Varod, and Ami Moyal. 2010. Lexicon Design for Transcription of Spontaneous Voice Messages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Lexicon Design for Transcription of Spontaneous Voice Messages (Gishri et al., LREC 2010)