Cross-Modal Discrete Representation Learning

Alex Liu; SouYoung Jin; Cheng-I Lai; Andrew Rouditchenko; Aude Oliva; James Glass

doi:10.18653/v1/2022.acl-long.215

Abstract

In contrast to recent advances focusing on high-level representation learning across modalities, in this work we present a self-supervised learning framework that learns a representation capturing finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space, created via vector quantization, that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space, such that cross-modal object/action localization can be performed without direct supervision. We show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
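The two ideas named in the abstract, a codebook shared across modalities and an objective that pulls the code distributions of paired inputs together, can be illustrated with a small sketch. This is only one possible reading under stated assumptions: the abstract does not give the formula, so the softmax over negative squared distances, the symmetric cross-entropy, the `temperature` parameter, and all function names and toy values below are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(features, codebook):
    """Hard vector quantization: index of the nearest code for each feature."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

def code_distribution(features, codebook, temperature=1.0):
    """Soft distribution over codes, averaged over all fine-grained units.

    Assumption: a softmax over negative squared distances; the abstract
    does not specify how the distribution is computed.
    """
    totals = [0.0] * len(codebook)
    for f in features:
        logits = [-sq_dist(f, c) / temperature for c in codebook]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        for k, e in enumerate(exps):
            totals[k] += e / z
    return [t / len(features) for t in totals]

def code_matching_loss(p, q, eps=1e-8):
    """Symmetric cross-entropy between two code distributions (illustrative)."""
    ce_pq = -sum(a * math.log(b + eps) for a, b in zip(p, q))
    ce_qp = -sum(b * math.log(a + eps) for a, b in zip(p, q))
    return ce_pq + ce_qp

# Toy shared codebook with three 2-D codes (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
video_frames = [[0.9, 1.1], [1.0, 0.8]]    # stand-in for visual frame features
audio_frames = [[1.1, 0.9], [0.1, -0.1]]   # stand-in for spoken-word features
mismatched = [[-1.0, 0.9], [-0.9, 1.1]]    # stand-in for an unrelated clip

p_video = code_distribution(video_frames, codebook)
p_audio = code_distribution(audio_frames, codebook)
p_other = code_distribution(mismatched, codebook)

loss_paired = code_matching_loss(p_video, p_audio)
loss_unpaired = code_matching_loss(p_video, p_other)
```

In this toy setup the paired video and audio features share mass on the same code, so `loss_paired` comes out lower than `loss_unpaired`; minimizing such a quantity for paired inputs is the kind of pressure that would encourage both modalities to use the same codes for the same concept.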

Anthology ID: 2022.acl-long.215
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 3013–3035
URL: https://aclanthology.org/2022.acl-long.215/
DOI: 10.18653/v1/2022.acl-long.215
Cite (ACL): Alexander Liu, SouYoung Jin, Cheng-I Lai, Andrew Rouditchenko, Aude Oliva, and James Glass. 2022. Cross-Modal Discrete Representation Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3013–3035, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Cross-Modal Discrete Representation Learning (Liu et al., ACL 2022)
PDF: https://aclanthology.org/2022.acl-long.215.pdf
Video: https://aclanthology.org/2022.acl-long.215.mp4
Data: ImageNet, MSR-VTT, Places205
