OpenSep: Leveraging Large Language Models with Textual Inversion for
Open World Audio Separation (Supplementary)
We propose a novel end-to-end audio separation framework that automatically separates all individual sound sources from noisy mixtures in challenging open-world scenarios, including unseen and noisy sources, without manual intervention.
Under review at ACL Rolling Review, 2024 (June Cycle).
OpenSep (Proposed): This is our proposed model. We use the end-to-end model without manual prompting or text conditioning. The best match between parsed sources and manually curated condition prompts is used for demonstration.
AudioSep: AudioSep [1] is a text-conditioned audio separator used as a baseline. We directly provide manually curated source prompts for conditional separation.
CLIPSep: CLIPSep [2] is a text-conditioned audio separator used as a baseline. We directly provide manually curated source prompts for conditional separation.
MixIT+PIT: This is an unconditional baseline combining mixture invariant training, MixIT [3], and permutation invariant training, PIT [4]. Since these models rely on post-processing to parse source entities from their predictions, we manually align the best-matched predictions with the parsed sources (a sketch of this alignment step is given below).
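As a rough illustration of this alignment step, one could embed each predicted track and each parsed source prompt in a shared audio-text space and solve a one-to-one assignment. The sketch below is illustrative only: embed_audio and embed_text are hypothetical placeholders for any audio-text embedding model (e.g., a CLAP-style encoder) and are not part of OpenSep's or any baseline's actual API.

```python
# Minimal sketch: match separated tracks to parsed source prompts via
# cosine similarity in a shared audio-text embedding space.
# NOTE: embed_audio / embed_text are hypothetical placeholders, not a real API.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_predictions(tracks, prompts, embed_audio, embed_text):
    """Assign each separated track to the prompt it most likely renders."""
    a = np.stack([embed_audio(t) for t in tracks])   # (n_tracks, dim)
    p = np.stack([embed_text(s) for s in prompts])   # (n_prompts, dim)
    a /= np.linalg.norm(a, axis=1, keepdims=True)    # L2-normalize rows
    p /= np.linalg.norm(p, axis=1, keepdims=True)
    sim = a @ p.T                                    # pairwise cosine similarity
    rows, cols = linear_sum_assignment(-sim)         # maximize total similarity
    return {prompts[c]: tracks[r] for r, c in zip(rows, cols)}
```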
Example results on real-world audio mixtures
Example 1 – "Someone is coughing while a woman speaks, and a cat meows later.”
Source1: “A woman speaks”
Source 2: “A cat meows”
Source 3: “Someone is coughing”
[Audio players: Mixture, plus Source 1–3 predictions from OpenSep, AudioSep, CLIPSep, and MixIT+PIT]
* All baselines show large spectral overlap of the “woman talking” sound in the other two source predictions. OpenSep precisely disentangles all three sources, minimizing spectral overlap across sources while preserving spectral details.
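Here, “spectral overlap” refers to shared time-frequency energy between two separated tracks. As a hedged illustration only (this is not the paper's evaluation protocol), such overlap could be quantified as the cosine similarity between STFT magnitude spectrograms; lower pairwise scores across predictions would indicate cleaner disentanglement.

```python
# Illustrative overlap score between two equal-length separated tracks:
# cosine similarity of their STFT magnitude spectrograms (0 = disjoint,
# 1 = identical energy distribution). Not the paper's official metric.
import numpy as np
from scipy.signal import stft

def spectral_overlap(x, y, sr=16000, nperseg=1024):
    _, _, X = stft(x, fs=sr, nperseg=nperseg)
    _, _, Y = stft(y, fs=sr, nperseg=nperseg)
    mx, my = np.abs(X).ravel(), np.abs(Y).ravel()
    return float(mx @ my / (np.linalg.norm(mx) * np.linalg.norm(my) + 1e-8))
```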
Example 2 – "A woman talks in a microphone, while several children yelling in the background."
Source1: “A woman talks in a microphone”
Source 2: “Several children yelling”
[Audio players: Mixture, plus Source 1–2 predictions from OpenSep, AudioSep, CLIPSep, and MixIT+PIT]
* Given the dominant, noisy “children yelling” sound, all baselines can hardly separate the “woman talking” sound. OpenSep significantly reduces noise in the “woman talking” prediction, while preserving the spectral details of the noisy “children yelling” sound.
Example 3 – "A woman speaks while music is being played, with sounds of frying foods in the background."
Source1: “A woman speaks”
Source 2: “Music is being played”
Source 3: “Frying foods”
[Audio players: Mixture, plus Source 1–3 predictions from OpenSep, AudioSep, CLIPSep, and MixIT+PIT]
* We can observe dominant “woman talks” spectral content in the “frying foods” prediction for most baselines. CLIPSep largely reduces this overlap, but horizontal spectral content from “music plays” remains visible. In contrast, OpenSep largely reduces such spectral overlap in all three components while preserving all details.
Example 4 – "A beep sound followed by wind blows."
Source1: “A beep sound”
Source 2: “Wind blows”
[Audio players: Mixture, plus Source 1–2 predictions from OpenSep, AudioSep, CLIPSep, and MixIT+PIT]
* In this mixture, the “beep sound” is only present at the beginning, with the loud, noisy “wind blows” sound spread over the rest of the spectrogram. Most baselines leak noisy spectral content into the “beep sound” prediction while losing spectral content in the “wind blows” prediction. In contrast, OpenSep disentangles this noisy mixture with a significant reduction of spectral overlap.
Example 5 – “Someone is playing guitar with whistle blowing, and a man talks afterwards.”
Source 1: “Someone is playing guitar”
Source 2: “Whistle blowing”
Source 3: “A man talks”
[Audio players: Mixture, plus Source 1–3 predictions from OpenSep, AudioSep, CLIPSep, and MixIT+PIT]
* In the initial phase, the “whistle blows” and “guitar plays” sounds are present, while the “man talks” sound appears at the end. All baselines show significant spectral overlap between the “whistle blows” and “guitar plays” sounds. In contrast, OpenSep precisely separates both of these challenging components, while also reducing background content in the “man talks” prediction.
Example 6 – "Phone rings followed by a woman talks."
Source1: “Phone rings”
Source 2: “A woman talks”
[Audio players: Mixture, plus Source 1–2 predictions from OpenSep, AudioSep, CLIPSep, and MixIT+PIT]
* The “phone rings” sound appears mostly at the beginning, followed by the “woman talks” sound at the end. Compared to all baselines, OpenSep disentangles both sources from this challenging mixture more sharply, highlighting its effectiveness in practice.
References

[1] Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, and Wenwu Wang. 2023. Separate anything you describe. arXiv preprint arXiv:2308.05037.

[2] Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, and Taylor Berg-Kirkpatrick. 2023. CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos. In The Eleventh International Conference on Learning Representations.

[3] Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron Weiss, Kevin Wilson, and John Hershey. 2020. Unsupervised sound separation using mixture invariant training. Advances in Neural Information Processing Systems, 33:3846–3857.

[4] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. 2017. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 241–245. IEEE.