IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation

Giuseppe Della Corte, Sara Stymne


Abstract
We discuss a set of methods for the creation of IESTAC: an English-Italian speech and text parallel corpus designed for the training of end-to-end speech-to-text machine translation models and publicly released as part of this work. We first mapped English LibriVox audiobooks and their corresponding English Project Gutenberg e-books to Italian e-books with a set of three complementary methods. Then we aligned the English and the Italian texts using both traditional Gale-Church-based alignment methods and a recently proposed tool that performs bilingual sentence alignment by computing the cosine similarity of multilingual sentence embeddings. Finally, we forced the alignment between the English audiobooks and the English side of our textual parallel corpus with a forced alignment tool based on text-to-speech and dynamic time warping. For each step, we provide the reader with a critical discussion based on detailed evaluation and comparison of the results of the different methods.
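To illustrate the embedding-based alignment step mentioned in the abstract, here is a minimal sketch of matching sentences by cosine similarity. It is not the tool used in the paper: the function names `cosine` and `align`, the `threshold` parameter, and the greedy best-match strategy are all assumptions made for illustration, and the inputs are expected to be precomputed sentence vectors from some multilingual encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(src_vecs, tgt_vecs, threshold=0.5):
    """Greedily pair each source sentence with its most similar target sentence.

    Hypothetical simplification: real aligners also handle one-to-many
    links and enforce monotonic order; this sketch does neither.
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    if not tgt_vecs:
        return []
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        # Keep only links whose similarity clears the (assumed) threshold.
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs
```

In practice the vectors would come from a multilingual sentence encoder, so that an English sentence and its Italian translation land close together in the same space; low-scoring sentences are left unaligned rather than forced into a pair.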
Anthology ID:
2020.nlpbt-1.5
Volume:
Proceedings of the First International Workshop on Natural Language Processing Beyond Text
Month:
November
Year:
2020
Address:
Online
Editors:
Giuseppe Castellucci, Simone Filice, Soujanya Poria, Erik Cambria, Lucia Specia
Venue:
nlpbt
Publisher:
Association for Computational Linguistics
Pages:
41–50
URL:
https://aclanthology.org/2020.nlpbt-1.5
DOI:
10.18653/v1/2020.nlpbt-1.5
Cite (ACL):
Giuseppe Della Corte and Sara Stymne. 2020. IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation. In Proceedings of the First International Workshop on Natural Language Processing Beyond Text, pages 41–50, Online. Association for Computational Linguistics.
Cite (Informal):
IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation (Della Corte & Stymne, nlpbt 2020)
PDF:
https://aclanthology.org/2020.nlpbt-1.5.pdf
Code:
giuseppe-della-corte/iestac
Data:
Europarl-ST, LibriSpeech, MuST-C