Translations of the Callhome Egyptian Arabic corpus for conversational speech translation

Gaurav Kumar, Yuan Cao, Ryan Cotterell, Chris Callison-Burch, Daniel Povey, Sanjeev Khudanpur


Abstract
Translation of the output of automatic speech recognition (ASR) systems, also known as speech translation, has recently attracted considerable research interest. This is especially true for programs such as DARPA BOLT, which focus on improving spontaneous human-human conversation across languages. However, this research is hindered by the dearth of datasets developed for this explicit purpose. For Egyptian Arabic-English, in particular, no parallel speech-transcription-translation dataset exists in the same domain. In order to support research in speech translation, we introduce the Callhome Egyptian Arabic-English Speech Translation Corpus. This supplements the existing LDC corpus with four reference translations for each utterance in the transcripts. The result is a three-way parallel dataset of Egyptian Arabic speech, transcriptions and English translations.
Anthology ID:
2014.iwslt-papers.13
Volume:
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers
Month:
December 4-5
Year:
2014
Address:
Lake Tahoe, California
Editors:
Marcello Federico, Sebastian Stüker, François Yvon
Venue:
IWSLT
SIG:
SIGSLT
Pages:
244–248
URL:
https://aclanthology.org/2014.iwslt-papers.13
Cite (ACL):
Gaurav Kumar, Yuan Cao, Ryan Cotterell, Chris Callison-Burch, Daniel Povey, and Sanjeev Khudanpur. 2014. Translations of the Callhome Egyptian Arabic corpus for conversational speech translation. In Proceedings of the 11th International Workshop on Spoken Language Translation: Papers, pages 244–248, Lake Tahoe, California.
Cite (Informal):
Translations of the Callhome Egyptian Arabic corpus for conversational speech translation (Kumar et al., IWSLT 2014)
PDF:
https://aclanthology.org/2014.iwslt-papers.13.pdf