A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition

Abir Masmoudi; Mariem Ellouze Khmekhem; Yannick Estève; Lamia Hadrich Belguith; Nizar Habash

Abstract

This paper describes an effort to create a corpus and a phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), is a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models. The method proposed here for automatically generating the phonetic dictionary is rule based: we define a set of pronunciation rules and a lexicon of exceptions. To assess the phonetic rules, we evaluate the pronunciation dictionary on two types of corpora. The word error rate of the word grapheme-to-phoneme mapping is around 9%.
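As a rough illustration of the rule-based approach summarized above, the sketch below looks a word up in a lexicon of exceptions first and otherwise applies grapheme-to-phoneme rules from left to right. The rules, the exception entry, the phoneme symbols, and the function names are all hypothetical placeholders and are not taken from the paper. The error measure assumes that "word error rate" means the fraction of words whose generated pronunciation differs from a reference pronunciation.

```python
from typing import Dict, List

# Hypothetical rewrite rules (grapheme sequence -> phonemes); toy symbols only,
# not the pronunciation rules defined in the paper.
RULES: Dict[str, List[str]] = {
    "sh": ["S"],
    "s": ["s"],
    "h": ["h"],
    "a": ["a"],
    "b": ["b"],
    "t": ["t"],
}

# Hypothetical lexicon of exceptions: whole words that the rules would get wrong.
EXCEPTIONS: Dict[str, List[str]] = {
    "bash": ["b", "e", "S"],
}


def to_phonemes(word: str) -> List[str]:
    """Return a pronunciation: exception lookup first, then rule application."""
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    # Try longer graphemes first so that "sh" is matched before "s".
    graphemes = sorted(RULES, key=len, reverse=True)
    phonemes: List[str] = []
    i = 0
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                phonemes.extend(RULES[g])
                i += len(g)
                break
        else:
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phonemes


def word_error_rate(words: List[str], reference: Dict[str, List[str]]) -> float:
    """Fraction of words whose generated pronunciation differs from the reference."""
    errors = sum(1 for w in words if to_phonemes(w) != reference[w])
    return errors / len(words)
```

In this toy setup, `to_phonemes("tash")` is produced by the rules alone, while `to_phonemes("bash")` comes from the exception lexicon; `word_error_rate` can then be computed against any hand-checked reference dictionary.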

Anthology ID: L14-1385
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 306–310
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/454_Paper.pdf
Cite (ACL): Abir Masmoudi, Mariem Ellouze Khmekhem, Yannick Estève, Lamia Hadrich Belguith, and Nizar Habash. 2014. A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 306–310, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition (Masmoudi et al., LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/454_Paper.pdf
