DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation

Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Ye Zhongyi, Linli Xu


Abstract
While diffusion generative models have achieved great success on image generation tasks, incorporating them efficiently and effectively into speech generation, especially translation tasks, remains a non-trivial problem. Specifically, due to the low information density of speech data, the transformed discrete speech unit sequence is much longer than the corresponding text transcription, posing significant challenges to existing auto-regressive models. Furthermore, naively applying discrete diffusion to the speech unit sequence while disregarding the structure of the continuous space is suboptimal and degrades generation performance significantly. In this paper, we propose a novel diffusion model that applies the diffusion forward process in the continuous speech representation space while employing the diffusion backward process in the discrete speech unit space. In this way, we preserve the semantic structure of the continuous speech representation space in the diffusion process and integrate the continuous and discrete diffusion models. We conduct extensive experiments on the textless direct speech-to-speech translation task, where the proposed method achieves results comparable to the computationally intensive auto-regressive baselines (500 steps on average) with significantly fewer decoding steps (50 steps).
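The idea of noising in the continuous space while working with discrete units can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each speech unit corresponds to a centroid in a codebook, that the forward step uses a standard Gaussian corruption controlled by a coefficient `alpha_bar` (values near 1 mean little noise), and that noisy vectors are mapped back to units by nearest-centroid lookup. The names `nearest_unit`, `forward_noise_units`, and `codebook`, as well as the toy sizes, are hypothetical.

```python
import math
import random


def nearest_unit(vec, codebook):
    """Return the index of the codebook centroid closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def forward_noise_units(units, codebook, alpha_bar, rng):
    """Noise unit embeddings in continuous space, then re-quantize to units.

    Assumed corruption: sqrt(alpha_bar) * centroid + sqrt(1 - alpha_bar) * eps,
    with eps drawn from a standard Gaussian.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    noisy_units = []
    for u in units:
        noisy_vec = [signal * c + noise * rng.gauss(0.0, 1.0) for c in codebook[u]]
        noisy_units.append(nearest_unit(noisy_vec, codebook))
    return noisy_units


rng = random.Random(0)
# Toy codebook: 8 hypothetical units, each a 4-dimensional centroid.
codebook = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
units = [0, 3, 3, 7, 1]

clean = forward_noise_units(units, codebook, alpha_bar=1.0, rng=rng)  # no noise
noisy = forward_noise_units(units, codebook, alpha_bar=0.1, rng=rng)  # heavy noise
```

With `alpha_bar=1.0` the sequence is returned unchanged, and as `alpha_bar` shrinks a unit is more likely to be replaced by one whose centroid lies nearby, which is one way the corruption could respect the geometry of the continuous space rather than swapping units uniformly at random.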
Anthology ID:
2023.emnlp-main.709
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
11573–11583
URL:
https://aclanthology.org/2023.emnlp-main.709
DOI:
10.18653/v1/2023.emnlp-main.709
Cite (ACL):
Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Ye Zhongyi, and Linli Xu. 2023. DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11573–11583, Singapore. Association for Computational Linguistics.
Cite (Informal):
DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation (Zhu et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.709.pdf
Video:
https://aclanthology.org/2023.emnlp-main.709.mp4