OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation

Tanvir Mahmud, Diana Marculescu


Abstract
Audio separation in real-world scenarios, where mixtures contain a variable number of sources, presents significant challenges due to limitations of existing models, such as over-separation, under-separation, and dependence on predefined training sources. We propose OpenSep, a novel framework that leverages large language models (LLMs) for automated audio separation, eliminating the need for manual intervention and overcoming source limitations. OpenSep uses textual inversion to generate captions from audio mixtures with off-the-shelf audio captioning models, effectively parsing the sound sources present. It then employs few-shot LLM prompting to extract detailed audio properties of each parsed source, facilitating separation in unseen mixtures. Additionally, we introduce a multi-level extension of the mix-and-separate training framework to enhance modality alignment by separating single source sounds and mixtures simultaneously. Extensive experiments demonstrate OpenSep’s superiority in precisely separating new, unseen, and variable sources in challenging mixtures, outperforming SOTA baseline methods. Code is released at https://github.com/tanvir-utexas/OpenSep.git.
Anthology ID:
2024.emnlp-main.735
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13244–13260
URL:
https://aclanthology.org/2024.emnlp-main.735
Cite (ACL):
Tanvir Mahmud and Diana Marculescu. 2024. OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13244–13260, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation (Mahmud & Marculescu, EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.735.pdf
Data:
 2024.emnlp-main.735.data.zip