TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

Xize Cheng, Rongjie Huang, Linjun Li, Zehan Wang, Tao Jin, Aoxiong Yin, Chen Feiyang, Xinyu Duan, Baoxing Huai, Zhou Zhao


Abstract
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. However, talking head translation, which converts audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges compared to audio speech: (1) Existing methods invariably rely on cascaded pipelines that synthesize via both audio and text, introducing latency and compounding errors. (2) Talking head translation has a limited set of reference frames: if the generated translation exceeds the length of the original speech, the video sequence must be padded by repeating frames, leading to jarring video transitions. In this work, we propose TransFace, a model for talking head translation that directly translates audio-visual speech into audio-visual speech in other languages. It consists of a speech-to-unit translation model that converts audio speech into discrete units and a unit-based audio-visual speech synthesizer, Unit2Lip, that re-synthesizes synchronized audio-visual speech from those discrete units in parallel. Furthermore, we introduce a Bounded Duration Predictor, which ensures isometric talking head translation and prevents duplicated reference frames. Experiments demonstrate that Unit2Lip significantly improves synchronization and boosts inference speed by a factor of 4.35 on LRS2, and that TransFace achieves BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T with 100% isochronous translations. Samples are available at https://transface-demo.github.io.
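The Bounded Duration Predictor is only named in the abstract, so the following is a minimal sketch, in plain Python, of one plausible reading of the idea: clamp the per-unit durations so the synthesized clip never outgrows the source video and no reference frames need repeating. The function bounded_durations, its rescale-and-floor strategy, and the frame counts in the example are illustrative assumptions, not the paper's actual formulation.

    import math

    def bounded_durations(raw_durations, max_total_frames):
        """Hypothetical post-processing for a bounded duration predictor.

        raw_durations: per-unit frame counts predicted upstream (assumed).
        max_total_frames: frames available in the source talking-head video.

        If the predicted total exceeds the source length, rescale every
        duration so the output stays isometric with the source and no
        reference frames need to be repeated.
        """
        total = sum(raw_durations)
        if total <= max_total_frames:
            return list(raw_durations)  # already fits; nothing to do
        scale = max_total_frames / total
        # Floor after scaling, keeping each unit at least one frame.
        bounded = [max(1, math.floor(d * scale)) for d in raw_durations]
        # Trim any overshoot the one-frame floor may reintroduce
        # (degenerate inputs could drop a unit to zero frames here).
        while sum(bounded) > max_total_frames:
            bounded[bounded.index(max(bounded))] -= 1
        return bounded

    # Example: four discrete units predicted to span 12 frames, but the
    # source clip only provides 10 reference frames.
    print(bounded_durations([2, 4, 3, 3], 10))  # -> [1, 3, 2, 2], total 8 <= 10

A real implementation would more likely bound durations inside the predictor itself (e.g., during decoding) rather than as a post-hoc rescale, but the invariant is the same: the total synthesized length never exceeds the source length.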
Anthology ID: 2024.findings-acl.593
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9973–9986
URL: https://aclanthology.org/2024.findings-acl.593
DOI: 10.18653/v1/2024.findings-acl.593
Cite (ACL): Xize Cheng, Rongjie Huang, Linjun Li, Zehan Wang, Tao Jin, Aoxiong Yin, Chen Feiyang, Xinyu Duan, Baoxing Huai, and Zhou Zhao. 2024. TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9973–9986, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation (Cheng et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.593.pdf