RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding

Yisi Liu; Chenyang Wang; Hanjo Kim; Raniya Khan; Gopala Anumanchipalli

doi:10.18653/v1/2025.acl-demo.37

Abstract

Voice conversion has emerged as a pivotal technology in numerous applications ranging from assistive communication to entertainment. In this paper, we present RT-VC, a zero-shot real-time voice conversion system that delivers ultra-low latency and high-quality performance. Our approach leverages an articulatory feature space to naturally disentangle content and speaker characteristics, facilitating more robust and interpretable voice transformations. Additionally, the integration of differentiable digital signal processing (DDSP) enables efficient vocoding directly from articulatory features, significantly reducing conversion latency. Experimental evaluations demonstrate that, while maintaining synthesis quality comparable to the current state-of-the-art (SOTA) method, RT-VC achieves a CPU latency of 61.4 ms, a 13.3% reduction in latency relative to that method.
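The abstract reports the RT-VC latency and the relative reduction, but not the latency of the system it is compared against. As a rough sanity check, the sketch below back-computes that implied baseline, assuming the 13.3% reduction is measured relative to the baseline's own CPU latency. The function name `implied_baseline_ms` is hypothetical, and the resulting figure is an inference from the two numbers in the abstract, not a value quoted from the paper.

```python
def implied_baseline_ms(latency_ms: float, reduction: float) -> float:
    """Return the baseline latency implied by a relative reduction.

    Assumes latency_ms = baseline * (1 - reduction), i.e. the reduction
    is expressed as a fraction of the baseline latency.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be a fraction in [0, 1)")
    return latency_ms / (1.0 - reduction)


# Figures taken from the abstract: 61.4 ms and a 13.3% reduction.
rt_vc_ms = 61.4
baseline_ms = implied_baseline_ms(rt_vc_ms, 0.133)
saved_ms = baseline_ms - rt_vc_ms

print(f"implied baseline: {baseline_ms:.1f} ms, saved: {saved_ms:.1f} ms")
```

Under that assumption the comparison system would sit at roughly 70.8 ms, so the saving is on the order of 9 ms per conversion step.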

Anthology ID: 2025.acl-demo.37
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Pushkar Mishra, Smaranda Muresan, Tao Yu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 385–393
URL: https://aclanthology.org/2025.acl-demo.37/
DOI: 10.18653/v1/2025.acl-demo.37
Cite (ACL): Yisi Liu, Chenyang Wang, Hanjo Kim, Raniya Khan, and Gopala Anumanchipalli. 2025. RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 385–393, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding (Liu et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-demo.37.pdf