Automatic Long Audio Alignment and Confidence Scoring for Conversational Arabic Speech

Mohamed Elmahdy, Mark Hasegawa-Johnson, Eiman Mustafawi


Abstract
This paper proposes a framework for long audio alignment of conversational Arabic speech. Accurate alignments help in many speech processing tasks, such as audio indexing, acoustic model (AM) training for speech recognizers, and audio summarization and retrieval. We have collected more than 1,400 hours of conversational Arabic along with the corresponding human-generated, non-aligned transcriptions. Automatic audio segmentation is performed using a split-and-merge approach. A biased language model (LM) is trained on the corresponding text after a pre-processing stage. Because non-standard Arabic dominates conversational speech, a graphemic pronunciation model (PM) is used. Alignment is performed in two passes. In the first pass, a generic standard Arabic AM is used with the biased LM and the graphemic PM for fast speech recognition. In the second pass, a more restricted LM is generated for each audio segment, and unsupervised acoustic model adaptation is applied. The recognizer output is aligned with the processed transcriptions using the Levenshtein algorithm. The proposed approach achieved an initial alignment accuracy of 97.8-99.0%, depending on the amount of disfluencies. A confidence scoring metric is proposed to accept or reject the aligner output. Using confidence scores, it was possible to reject the majority of misaligned segments, resulting in an alignment accuracy of 99.0-99.8%, depending on the speech domain and the amount of disfluencies.
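As an illustration of the final step described in the abstract, matching recognizer output against the transcription with the Levenshtein algorithm, the following is a minimal sketch. It is not the authors' implementation. It assumes word-level tokens and unit costs for substitution, insertion, and deletion, which the abstract does not specify. The function name `align_words` and the sample tokens are hypothetical.

```python
from typing import List, Optional, Tuple

# One aligned position: (transcript word, recognized word).
# None on either side marks a deletion or an insertion.
Pair = Tuple[Optional[str], Optional[str]]


def align_words(ref: List[str], hyp: List[str]) -> Tuple[int, List[Pair]]:
    """Word-level Levenshtein alignment with unit costs (an assumption)."""
    n, m = len(ref), len(hyp)

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)

    # Trace back from the bottom-right cell to recover the word pairing.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # transcript word not recognized
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # extra recognized word
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


if __name__ == "__main__":
    transcript = "a b c d".split()
    recognized = "a x c".split()
    distance, pairs = align_words(transcript, recognized)
    print(distance)
    for ref_word, hyp_word in pairs:
        print(ref_word, hyp_word)
```

Positions where both sides of a pair are equal can serve as anchors: in a setup like the one described, the timestamps the recognizer attaches to those words could be transferred to the matching transcript words.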
Anthology ID:
L14-1372
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3062–3066
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/434_Paper.pdf
Cite (ACL):
Mohamed Elmahdy, Mark Hasegawa-Johnson, and Eiman Mustafawi. 2014. Automatic Long Audio Alignment and Confidence Scoring for Conversational Arabic Speech. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3062–3066, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Automatic Long Audio Alignment and Confidence Scoring for Conversational Arabic Speech (Elmahdy et al., LREC 2014)