Unsupervised Sounding Pixel Learning

Yining Zhang, Yanli Ji, Yang Yang


Abstract
Sounding source localization is a challenging cross-modal task because audio and visual features are difficult to align. Although supervised cross-modal methods achieve encouraging performance, they rely on heavy manual annotation, which is expensive and inefficient, so unsupervised solutions are valuable. In this paper, we propose an **U**nsupervised **S**ounding **P**ixel **L**earning (USPL) approach that enables pixel-level sounding source localization in an unsupervised paradigm. First, we design a mask-augmentation-based multi-instance contrastive learning scheme for unsupervised cross-modal coarse localization, which aligns audio-visual features to obtain coarse sounding maps. Second, we present an *Unsupervised Sounding Map Refinement (SMR)* module, which employs visual semantic affinity learning to explore inter-pixel relations among features at adjacent coordinates. This helps recover the boundaries of the coarse sounding maps and yields fine sounding maps. Finally, a *Sounding Pixel Segmentation (SPS)* module is presented to realize audio-supervised semantic segmentation. Extensive experiments on the AVSBench-S4 and VGGSound datasets show encouraging results compared with previous SOTA methods.
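As a rough illustration of the coarse localization step mentioned in the abstract, the sketch below scores every pixel feature against an audio embedding by cosine similarity and trains with a multi-instance contrastive (InfoNCE-style) loss. It is not the authors' implementation: the function names, the max-pooling over pixels, and the temperature of 0.07 are assumptions made for illustration, and the mask augmentation is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def coarse_sounding_map(audio_vec, visual_feats):
    """Return an H x W map of audio-to-pixel cosine similarities.

    visual_feats is an H x W grid (list of rows) of C-dimensional features.
    """
    return [[cosine(audio_vec, pixel) for pixel in row] for row in visual_feats]

def mil_contrastive_loss(audio_batch, visual_batch, temperature=0.07):
    """InfoNCE over a batch; each audio-image score is the max over pixels.

    Pair i (audio_batch[i], visual_batch[i]) is treated as the positive and
    all other images in the batch as negatives (an illustrative assumption).
    """
    scores = [
        [max(max(row) for row in coarse_sounding_map(a, v)) / temperature
         for v in visual_batch]
        for a in audio_batch
    ]
    loss = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the row maximum for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in row))
        loss += log_denom - row[i]
    return loss / len(scores)

# Toy usage: a 2 x 2 frame with 2-dimensional pixel features.
audio = [1.0, 0.0]
frame = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [-1.0, 0.0]]]
sounding_map = coarse_sounding_map(audio, frame)

# Toy batch: each audio clip matches the image at the same index.
frame_a = [[[1.0, 0.0], [1.0, 0.0]]]
frame_b = [[[0.0, 1.0], [0.0, 1.0]]]
loss_matched = mil_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [frame_a, frame_b])
loss_swapped = mil_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [frame_b, frame_a])
```

In this toy setup the pixel whose feature points in the same direction as the audio embedding receives the highest score, and the loss is lower when audio clips are paired with their own images than when the pairing is swapped.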
Anthology ID:
2023.emnlp-main.777
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12610–12620
URL:
https://aclanthology.org/2023.emnlp-main.777
DOI:
10.18653/v1/2023.emnlp-main.777
Cite (ACL):
Yining Zhang, Yanli Ji, and Yang Yang. 2023. Unsupervised Sounding Pixel Learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12610–12620, Singapore. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Sounding Pixel Learning (Zhang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.777.pdf
Video:
https://aclanthology.org/2023.emnlp-main.777.mp4