Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners
Rongjie Huang | Chunlei Zhang | Yongqi Wang | Dongchao Yang | Jinchuan Tian | Zhenhui Ye | Luping Liu | Zehan Wang | Ziyue Jiang | Xuankai Chang | Jiatong Shi | Chao Weng | Zhou Zhao | Dong Yu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Large language models (LLMs) have successfully served as a general-purpose interface across multiple tasks and languages, whereas voice LLMs are mostly adapted for a specific purpose (either single-task or monolingual), leaving the advantages of LLMs, especially for low-resource language processing and zero-shot task generalization, under-exploited in the audio community. To bridge the gap, we introduce Make-A-Voice, a multi-modal voice LLM, and conduct a comprehensive study of its capability to handle multiple tasks and languages. When trained on ~200K hours of six-language data for four voice generation applications, Make-A-Voice exhibits notable advantages: 1) as a scalable learner, improving performance with end-to-end local and global multiscale transformers; 2) as a multitask learner, sharing common knowledge across modalities (speech/singing) by adjusting prompts, and exhibiting in-context learning by generalizing to unseen tasks it was not explicitly trained on; and 3) as a multilingual learner, alleviating data scarcity in low-resource languages by including rich-resource language training data. Experimental results demonstrate that Make-A-Voice achieves superior audio quality and style similarity compared with competitive baselines in monolingual/cross-lingual voice generation. Audio samples are available at https://M-Voice.github.io
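To make the "local and global multiscale transformers" concrete, below is a minimal PyTorch sketch of that hierarchical pattern over discrete audio tokens: a global transformer models context at the patch level, and a local transformer predicts tokens within each patch conditioned on the global context. This is not the authors' implementation; the class name `MultiscaleVoiceLM`, the patch size, and all dimensions are illustrative assumptions.

```python
# Illustrative sketch of a global+local multiscale transformer LM over
# discrete audio tokens (not the Make-A-Voice code; sizes are assumed).
import torch
import torch.nn as nn

class MultiscaleVoiceLM(nn.Module):
    def __init__(self, vocab=1024, d_global=512, d_local=256,
                 patch=8, n_global=6, n_local=3, n_heads=8):
        super().__init__()
        self.patch = patch
        self.tok_emb = nn.Embedding(vocab, d_local)
        # Global model: one position per patch of `patch` audio tokens.
        self.patch_proj = nn.Linear(patch * d_local, d_global)
        g_layer = nn.TransformerEncoderLayer(d_global, n_heads,
                                             batch_first=True)
        self.global_tf = nn.TransformerEncoder(g_layer, n_global)
        self.ctx_proj = nn.Linear(d_global, d_local)
        # Local model: token-level prediction inside each patch.
        l_layer = nn.TransformerEncoderLayer(d_local, n_heads // 2,
                                             batch_first=True)
        self.local_tf = nn.TransformerEncoder(l_layer, n_local)
        self.head = nn.Linear(d_local, vocab)

    def forward(self, tokens):               # tokens: (B, T), T % patch == 0
        B, T = tokens.shape
        P = T // self.patch
        x = self.tok_emb(tokens)             # (B, T, d_local)
        # Group token embeddings into patches; the causal mask keeps each
        # patch from attending to future patches. (A real LM would also
        # shift inputs so no position conditions on its own target; that
        # detail is omitted here for brevity.)
        patches = self.patch_proj(x.view(B, P, -1))
        mask = nn.Transformer.generate_square_subsequent_mask(P)
        g = self.global_tf(patches, mask=mask)        # (B, P, d_global)
        # Broadcast each patch's global context to its tokens, then run
        # the local model causally within every patch.
        ctx = self.ctx_proj(g).unsqueeze(2).expand(-1, -1, self.patch, -1)
        local_in = (x.view(B, P, self.patch, -1) + ctx)
        local_in = local_in.reshape(B * P, self.patch, -1)
        lmask = nn.Transformer.generate_square_subsequent_mask(self.patch)
        h = self.local_tf(local_in, mask=lmask)
        return self.head(h).view(B, T, -1)   # next-token logits

# Usage: logits = MultiscaleVoiceLM()(torch.randint(0, 1024, (2, 64)))
```

The design intuition behind such hierarchies is efficiency: attention cost concentrates in the global model, which runs over T/patch positions instead of T, so sequence length (and hence training data and audio duration) can scale further than with a single flat transformer.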