@inproceedings{alastruey-etal-2024-unveiling,
title = "Unveiling the Role of Pretraining in Direct Speech Translation",
author = "Alastruey, Belen and
G{\'a}llego, Gerard and
Costa-juss{\`a}, Marta",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.630",
pages = "11259--11265",
abstract = "Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists on pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the model trained from scratch can achieve comparable performance to the pretrained one, while reducing the training time.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="alastruey-etal-2024-unveiling">
<titleInfo>
<title>Unveiling the Role of Pretraining in Direct Speech Translation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Belen</namePart>
<namePart type="family">Alastruey</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Gerard</namePart>
<namePart type="family">Gállego</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Marta</namePart>
<namePart type="family">Costa-jussà</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yaser</namePart>
<namePart type="family">Al-Onaizan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohit</namePart>
<namePart type="family">Bansal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Miami, Florida, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists on pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the model trained from scratch can achieve comparable performance to the pretrained one, while reducing the training time.</abstract>
<identifier type="citekey">alastruey-etal-2024-unveiling</identifier>
<location>
<url>https://aclanthology.org/2024.emnlp-main.630</url>
</location>
<part>
<date>2024-11</date>
<extent unit="page">
<start>11259</start>
<end>11265</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Unveiling the Role of Pretraining in Direct Speech Translation
%A Alastruey, Belen
%A Gállego, Gerard
%A Costa-jussà, Marta
%Y Al-Onaizan, Yaser
%Y Bansal, Mohit
%Y Chen, Yun-Nung
%S Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, USA
%F alastruey-etal-2024-unveiling
%X Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists on pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the model trained from scratch can achieve comparable performance to the pretrained one, while reducing the training time.
%U https://aclanthology.org/2024.emnlp-main.630
%P 11259-11265
Markdown (Informal)
[Unveiling the Role of Pretraining in Direct Speech Translation](https://aclanthology.org/2024.emnlp-main.630) (Alastruey et al., EMNLP 2024)
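The abstract above describes the proposed fix only at a high level (a "subtle change in the decoder cross-attention"); the concrete modification is specified in the paper itself, not in this record. As a purely illustrative sketch of the general idea, biasing a Transformer decoder toward the speech encoder's output from early in training, the PyTorch snippet below adds a learnable scalar gate to the cross-attention branch of a pre-norm decoder layer. The class name, the gate, and all hyperparameters are assumptions made for illustration and should not be read as the authors' actual method.

```python
# Purely illustrative sketch -- NOT the modification proposed in the paper, whose
# details are not given in this record. It shows one generic way a Transformer
# decoder layer could be nudged to use encoder (speech) information from the start
# of training: a learnable scalar gate on the cross-attention branch.
import torch
import torch.nn as nn


class GatedCrossAttentionDecoderLayer(nn.Module):
    """Pre-norm Transformer decoder layer with a gated cross-attention branch."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm_self = nn.LayerNorm(d_model)
        self.norm_cross = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        # Hypothetical choice: a learnable scalar that scales the cross-attention
        # output, letting training adjust how strongly source information is mixed in.
        self.cross_gate = nn.Parameter(torch.ones(1))

    def forward(self, tgt, memory, tgt_mask=None):
        # Masked self-attention over previously generated target tokens.
        x = self.norm_self(tgt)
        x, _ = self.self_attn(x, x, x, attn_mask=tgt_mask, need_weights=False)
        tgt = tgt + x
        # Cross-attention over the speech encoder states, scaled by the gate.
        y = self.norm_cross(tgt)
        y, _ = self.cross_attn(y, memory, memory, need_weights=False)
        tgt = tgt + self.cross_gate * y
        # Position-wise feed-forward block.
        return tgt + self.ffn(self.norm_ffn(tgt))


if __name__ == "__main__":
    layer = GatedCrossAttentionDecoderLayer()
    tgt = torch.randn(2, 10, 512)      # (batch, target tokens, d_model)
    memory = torch.randn(2, 200, 512)  # (batch, speech encoder frames, d_model)
    causal = torch.triu(torch.full((10, 10), float("-inf")), diagonal=1)
    out = layer(tgt, memory, tgt_mask=causal)
    print(out.shape)  # torch.Size([2, 10, 512])
```

In this sketch the gate is just another trainable parameter, so the optimizer can raise or lower the contribution of source information without any change to the training schedule; the paper's actual mechanism may differ.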