Speechformer: Reducing Information Loss in Direct Speech Translation

Sara Papi; Marco Gaido; Matteo Negri; Marco Turchi

doi:10.18653/v1/2021.emnlp-main.127

Abstract

Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields, including speech translation. However, the Transformer’s quadratic complexity with respect to the input sequence length prevents its adoption as is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. Therefore, potentially useful linguistic information is not accessible to higher-level layers in the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level according to more informed linguistic criteria. Experiments on three language pairs (en→de/es/nl) show the efficacy of our solution, with gains of up to 0.8 BLEU on the standard MuST-C corpus and of up to 4.0 BLEU in a low-resource scenario.
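
The quadratic cost mentioned in the abstract can be made concrete with a little arithmetic. The sketch below only counts the entries of a single attention score matrix (one head, one layer). The utterance length, the 4x reduction factor, and the third option of shortening only the keys and values are assumptions chosen for illustration; the abstract does not specify how Speechformer reduces attention memory.

```python
def attention_score_entries(query_len: int, key_len: int) -> int:
    """Number of entries in one attention score matrix (one head, one layer)."""
    return query_len * key_len


# Hypothetical utterance: 10 s of audio at 100 feature frames per second.
frames = 1000
# Hypothetical reduction factor, chosen only for illustration.
factor = 4

# Full self-attention over every frame: cost grows with frames ** 2.
full = attention_score_entries(frames, frames)

# Fixed subsampling before the encoder: cheaper, but frames are merged
# before any higher-level layer can inspect them.
subsampled = attention_score_entries(frames // factor, frames // factor)

# Assumed alternative: keep every query position, shorten only keys/values.
kv_reduced = attention_score_entries(frames, frames // factor)

print(full, subsampled, kv_reduced)  # 1000000 62500 250000
```

Under these assumptions, up-front subsampling shrinks the score matrix by the square of the factor but discards input resolution, whereas reducing only one side of the attention product shrinks it linearly while keeping one output position per input frame.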

Anthology ID: 2021.emnlp-main.127
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1698–1706
URL: https://aclanthology.org/2021.emnlp-main.127
DOI: 10.18653/v1/2021.emnlp-main.127
Cite (ACL): Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2021. Speechformer: Reducing Information Loss in Direct Speech Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1698–1706, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Speechformer: Reducing Information Loss in Direct Speech Translation (Papi et al., EMNLP 2021)
PDF: https://aclanthology.org/2021.emnlp-main.127.pdf
Video: https://aclanthology.org/2021.emnlp-main.127.mp4
Code: sarapapi/fbk-fairseq
Data: MuST-C
