T3M: Text Guided 3D Human Motion Synthesis from Speech

Wenshuo Peng, Kaipeng Zhang, Sai Qian Zhang


Abstract
Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed T3M. Unlike traditional approaches, T3M allows precise control over motion synthesis via textual input, enhancing the degree of diversity and user customization. The experimental results demonstrate that T3M greatly outperforms the state-of-the-art methods in both quantitative metrics and qualitative evaluations. We have publicly released our code at https://github.com/Gloria2tt/naacl2024.git
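The abstract only sketches the high-level idea of conditioning speech-driven motion generation on text. As a rough, hypothetical illustration (not the T3M architecture), the toy model below fuses per-frame audio features with a prompt embedding before decoding a pose sequence; all module names, feature dimensions, and the concatenation-based fusion are assumptions made for illustration.

```python
# Hypothetical sketch of text-guided, speech-driven motion synthesis.
# NOT the T3M architecture: modules, dimensions, and the fusion strategy
# (concatenating audio and text features) are illustrative assumptions.
import torch
import torch.nn as nn


class TextGuidedMotionGenerator(nn.Module):
    def __init__(self, audio_dim=128, text_dim=64, hidden_dim=256, motion_dim=165):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)   # per-frame audio features
        self.text_proj = nn.Linear(text_dim, hidden_dim)     # sentence-level text embedding
        self.decoder = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, motion_dim)        # e.g. per-frame pose parameters

    def forward(self, audio_feats, text_emb):
        # audio_feats: (B, T, audio_dim); text_emb: (B, text_dim)
        a = self.audio_proj(audio_feats)
        t = self.text_proj(text_emb).unsqueeze(1).expand(-1, a.size(1), -1)
        fused = torch.cat([a, t], dim=-1)                     # broadcast text over time, then fuse
        h, _ = self.decoder(fused)
        return self.head(h)                                   # (B, T, motion_dim) motion sequence


if __name__ == "__main__":
    model = TextGuidedMotionGenerator()
    audio = torch.randn(2, 100, 128)   # 100 frames of precomputed audio features
    text = torch.randn(2, 64)          # embedding of a guiding prompt, e.g. "wave the right hand"
    print(model(audio, text).shape)    # torch.Size([2, 100, 165])
```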
Anthology ID:
2024.findings-naacl.74
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1168–1177
URL:
https://aclanthology.org/2024.findings-naacl.74
DOI:
10.18653/v1/2024.findings-naacl.74
Cite (ACL):
Wenshuo Peng, Kaipeng Zhang, and Sai Qian Zhang. 2024. T3M: Text Guided 3D Human Motion Synthesis from Speech. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1168–1177, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
T3M: Text Guided 3D Human Motion Synthesis from Speech (Peng et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.74.pdf