Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi


Abstract
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning. Prevailing learning paradigms of audio-text connections have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in a tri-modal embedding space implicitly. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about $2^{21} \approx 2\text{M}$ supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
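As a rough illustration of the pivoting idea in the abstract, the sketch below shows how zero-shot audio classification could work once audio and text embeddings live in one shared space: an audio clip is assigned the label whose text embedding is most similar to it. This is not the authors' implementation. The function names `cosine` and `zero_shot_classify` are hypothetical, and the three-dimensional vectors are hand-made toy values standing in for the outputs of an audio encoder and a text encoder that were each aligned to the same image encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the audio embedding."""
    scores = {
        label: cosine(audio_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy text embeddings for class names (assumed values, not real model outputs).
label_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.2, 0.9],
    "rain": [0.1, 0.9, 0.1],
}

# Toy embedding of one audio clip in the same shared space (assumed values).
audio_embedding = [0.8, 0.2, 0.1]

predicted, scores = zero_shot_classify(audio_embedding, label_embeddings)
print(predicted)
```

No audio-text pair is used at any point in this sketch: the two modalities become comparable only because both are assumed to have been aligned to images beforehand.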
Anthology ID:
2022.naacl-main.333
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
4492–4507
URL:
https://aclanthology.org/2022.naacl-main.333
DOI:
10.18653/v1/2022.naacl-main.333
Cite (ACL):
Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, and Yejin Choi. 2022. Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4492–4507, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer (Zhao et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.333.pdf
Code:
zhaoyanpeng/vipant
Data:
AudioCaps, AudioSet, COCO, Clotho, ESC-50