Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment

Zhiqing Hong; Rongjie Huang; Xize Cheng; Yongqi Wang; Ruiqi Li; Fuming You; Zhou Zhao; Zhimeng Zhang

doi:10.18653/v1/2024.acl-long.339

Abstract

A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention has been paid to exploring song synthesis. In this work, we propose a novel task called Text-to-Song synthesis which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.

Anthology ID: 2024.acl-long.339
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6248–6261
URL: https://aclanthology.org/2024.acl-long.339/
DOI: 10.18653/v1/2024.acl-long.339
Cite (ACL): Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, and Zhimeng Zhang. 2024. Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6248–6261, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment (Hong et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.339.pdf
