Improving Speech Translation by Fusing Speech and Text

Wenbiao Yin, Zhicheng Liu, Chengqi Zhao, Tao Wang, Jian Tong, Rong Ye


Abstract
In speech translation, leveraging multimodal data to improve model performance and address the limitations of individual modalities has proven highly effective. In this paper, we harness the complementary strengths of speech and text to improve speech translation. However, speech and text are disparate modalities, and we observe three aspects of the modality gap that impede their integration in a speech translation model. To tackle these gaps, we propose **Fuse**-**S**peech-**T**ext (**FuseST**), a cross-modal model which supports three distinct input modalities for translation: speech, text and fused speech-text. We leverage multiple techniques for cross-modal alignment and conduct a comprehensive analysis to assess their impact on speech translation, machine translation and fused speech-text translation. We evaluate FuseST on the MuST-C, GigaST and newstest benchmarks. Experiments show that the proposed FuseST achieves an average of 34.0 BLEU on MuST-C En→De/Es/Fr, 1.1 BLEU above the previous state of the art. Further experiments demonstrate that, unlike previous work, FuseST does not degrade on the MT task; instead, it yields an average improvement of 3.2 BLEU over the pre-trained MT model. Code is available at https://github.com/WenbiaoYin/FuseST.
Anthology ID:
2023.findings-emnlp.414
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6262–6273
URL:
https://aclanthology.org/2023.findings-emnlp.414
DOI:
10.18653/v1/2023.findings-emnlp.414
Cite (ACL):
Wenbiao Yin, Zhicheng Liu, Chengqi Zhao, Tao Wang, Jian Tong, and Rong Ye. 2023. Improving Speech Translation by Fusing Speech and Text. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6262–6273, Singapore. Association for Computational Linguistics.
Cite (Informal):
Improving Speech Translation by Fusing Speech and Text (Yin et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.414.pdf