CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation

Xiaohu Zhao, Haoran Sun, Yikun Lei, Shaolin Zhu, Deyi Xiong


Abstract
Deep neural networks have demonstrated their capacity to extract features from speech inputs. However, these features may include non-linguistic speech factors such as timbre and speaker identity, which are not directly related to translation. In this paper, we propose a content-centric speech representation disentanglement learning framework for speech translation, CCSRD, which decomposes speech representations into content representations and non-linguistic representations via representation disentanglement learning. CCSRD consists of a content encoder that encodes linguistic content information from the speech input, a non-content encoder that models non-linguistic speech features, and a disentanglement module that learns disentangled representations with a cyclic reconstructor, a feature reconstructor, and a speaker classifier trained in a multi-task learning manner. Experiments on the MuST-C benchmark dataset demonstrate that CCSRD achieves an average improvement of +0.9 BLEU in two settings across five translation directions over the baseline, outperforming state-of-the-art end-to-end speech translation models and cascaded models.
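The abstract's two-branch design can be illustrated with a minimal sketch: speech features pass through a content encoder and a non-content encoder, a reconstructor rebuilds the input from both branches, and a speaker classifier is attached to the non-content branch only, so speaker identity is pushed out of the content representation. All module names, layer sizes, and loss weights below are illustrative assumptions, not the paper's actual implementation; the translation loss on the content branch is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCSRDSketch(nn.Module):
    """Hypothetical sketch of content/non-content disentanglement
    (dimensions and architecture are assumptions, not the paper's code)."""

    def __init__(self, feat_dim=80, hid=64, n_speakers=10):
        super().__init__()
        # content branch: intended to keep linguistic information
        self.content_enc = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU())
        # non-content branch: intended to absorb timbre / speaker identity
        self.noncontent_enc = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU())
        # reconstructor: rebuild the input features from both branches,
        # so together they retain the full information of the signal
        self.reconstructor = nn.Linear(2 * hid, feat_dim)
        # speaker classifier attached to the non-content branch only
        self.spk_clf = nn.Linear(hid, n_speakers)

    def forward(self, feats, spk_labels):
        # feats: (batch, time, feat_dim); spk_labels: (batch,)
        c = self.content_enc(feats)       # content representation
        s = self.noncontent_enc(feats)    # non-linguistic representation
        recon = self.reconstructor(torch.cat([c, s], dim=-1))
        loss_recon = F.mse_loss(recon, feats)
        # pool over time before classifying the speaker
        loss_spk = F.cross_entropy(self.spk_clf(s).mean(dim=1), spk_labels)
        # a translation loss on `c` would be added here in the full model
        return c, loss_recon + loss_spk
```

In this sketch the multi-task objective is a plain sum of the reconstruction and speaker-classification losses; the paper trains its cyclic reconstructor, feature reconstructor, and speaker classifier jointly, so this collapses those components into the simplest form that still shows the division of labor between the two encoders.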
Anthology ID:
2023.findings-emnlp.394
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5920–5932
URL:
https://aclanthology.org/2023.findings-emnlp.394
DOI:
10.18653/v1/2023.findings-emnlp.394
Cite (ACL):
Xiaohu Zhao, Haoran Sun, Yikun Lei, Shaolin Zhu, and Deyi Xiong. 2023. CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5920–5932, Singapore. Association for Computational Linguistics.
Cite (Informal):
CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation (Zhao et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.394.pdf