STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Qingkai Fang, Rong Ye, Lei Li, Yang Feng, Mingxuan Wang


Abstract
How can we learn a better speech representation for end-to-end speech-to-text translation (ST) when labeled data are limited? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST, but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate this discrepancy. Specifically, we mix up the representation sequences of the two modalities, feed both the unimodal speech sequences and the multimodal mixed sequences to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
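The core ideas in the abstract can be sketched in a few lines: mix speech and text representations at aligned positions, then regularize the predictions made from the unimodal and mixed inputs toward each other. The snippet below is a minimal, illustrative sketch only; it is not the authors' implementation (see the linked ictnlp/stemm repository for that). It assumes pre-aligned, equal-length representation sequences and uses a symmetric Jensen-Shannon divergence as one plausible choice of self-learning regularizer; the function names and shapes are ours.

```python
import numpy as np

def manifold_mixup(speech_seq, text_seq, p=0.5, rng=None):
    """Illustrative position-level mixup: at each aligned position,
    use the speech representation with probability p, otherwise keep
    the text embedding. Both inputs have shape (T, d) and are assumed
    pre-aligned (the paper aligns at the word level)."""
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(len(text_seq)) < p          # True -> take speech
    return np.where(mask[:, None], speech_seq, text_seq), mask

def js_divergence(p, q, eps=1e-12):
    """Symmetric JS divergence between two output distributions,
    usable as a regularizer between the unimodal and mixed predictions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy aligned representations: T=4 positions, d=3 dimensions.
speech = np.full((4, 3), 2.0)
text = np.zeros((4, 3))
mixed, mask = manifold_mixup(speech, text, p=0.5)
# Each row of `mixed` comes wholly from one modality.
```

In the actual method, the mixed sequence and the pure speech sequence are decoded in parallel by a shared translation model, and the divergence between their token-level output distributions is minimized alongside the translation loss.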
Anthology ID:
2022.acl-long.486
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
7050–7062
URL:
https://aclanthology.org/2022.acl-long.486
DOI:
10.18653/v1/2022.acl-long.486
Cite (ACL):
Qingkai Fang, Rong Ye, Lei Li, Yang Feng, and Mingxuan Wang. 2022. STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7050–7062, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation (Fang et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.486.pdf
Code
 ictnlp/stemm
Data
MuST-C