LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition

Benjamin Beilharz, Xin Sun, Sariya Karimova, Stefan Riezler


Abstract
We present a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies. The quality of audio and sentence alignments has been checked by a manual evaluation, showing that speech alignment quality is in general very high. The sentence alignment quality is comparable to well-used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for German speech recognition and for end-to-end German-to-English speech translation.
Anthology ID:
2020.lrec-1.441
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3590–3594
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.441
Cite (ACL):
Benjamin Beilharz, Xin Sun, Sariya Karimova, and Stefan Riezler. 2020. LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3590–3594, Marseille, France. European Language Resources Association.
Cite (Informal):
LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition (Beilharz et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.441.pdf
Data
LibriVoxDeEn, LibriSpeech, WikiMatrix