Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment

Tao Jin, Wang Lin, Ye Wang, Linjun Li, Xize Cheng, Zhou Zhao


Abstract
Transformer-based methods have become mainstream in multimodal sequential learning. Intra- and inter-modality interactions are captured by the query-key associations of multi-head attention. In this way, the calculated multimodal contexts (attentional results) are expected to be relevant to the query modality. However, in the existing literature, the degree of alignment between the different attentional results calculated for the same query is under-explored. Motivated by this concern, we propose a new constrained scheme called Multimodal Contextual Contrast (MCC), which aligns the multiple attentional results from both local and global perspectives, making information capture more efficient. Concretely, the calculated attentional results of different modalities are mapped into a common feature space; attentional vectors with the same query are treated as a positive group, and the remaining sets are negative. From the local perspective, we sample negative groups for a positive group by randomly changing the sequential step of one specific context while keeping the others unchanged. From the coarse global perspective, we divide all the contextual groups into two sets (i.e., aligned and unaligned), making the total score of the aligned set relatively large. We extend the vector inner product operation to more than two inputs and calculate the alignment score for each multimodal group. Considering that the computational complexity scales exponentially with the number of modalities, we adopt stochastic expectation approximation (SEA) in practice. Extensive experimental results on several tasks demonstrate the effectiveness of our contributions.
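The local contrast described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption rather than the authors' implementation: the multi-input inner product is taken to be the sum over feature dimensions of the elementwise product, the loss is an InfoNCE-style softmax over one positive and a few sampled negatives, and the names `group_score`, `local_contrast_loss`, `num_negatives`, and `temperature` are hypothetical.

```python
import numpy as np


def group_score(vectors):
    """Assumed multi-input inner product: sum_d prod_m v_m[d].

    For two vectors this reduces to the ordinary dot product.
    """
    return float(np.prod(np.stack(vectors), axis=0).sum())


def local_contrast_loss(contexts, t, num_negatives=8, temperature=1.0, rng=None):
    """InfoNCE-style loss for the positive group at query step t (illustrative).

    contexts: list of arrays of shape (T, D) with T >= 2, one per attentional
    result, all computed for the same query sequence and already mapped into
    a common feature space.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_steps = contexts[0].shape[0]

    # Positive group: every context taken at the same query step t.
    scores = [group_score([c[t] for c in contexts])]

    # Local negatives: move ONE context to a different step, keep the rest.
    # Sampling a few negatives instead of enumerating every combination is
    # a stand-in for the stochastic approximation mentioned in the abstract.
    for _ in range(num_negatives):
        m = int(rng.integers(len(contexts)))
        s = int(rng.integers(num_steps - 1))
        if s >= t:
            s += 1  # skip t so the sampled step always differs from t
        group = [c[t] for c in contexts]
        group[m] = contexts[m][s]
        scores.append(group_score(group))

    # Negative log-softmax of the positive score, computed stably.
    logits = np.array(scores) / temperature
    mx = logits.max()
    log_sum_exp = mx + np.log(np.exp(logits - mx).sum())
    return float(log_sum_exp - logits[0])
```

Minimizing this quantity pushes the score of the same-query group above the scores of groups in which one context has been shifted to another step, which is the local alignment idea the abstract describes; the global aligned/unaligned split is not covered by this sketch.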
Anthology ID:
2024.acl-long.287
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5247–5265
URL:
https://aclanthology.org/2024.acl-long.287/
DOI:
10.18653/v1/2024.acl-long.287
Cite (ACL):
Tao Jin, Wang Lin, Ye Wang, Linjun Li, Xize Cheng, and Zhou Zhao. 2024. Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5247–5265, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment (Jin et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.287.pdf