L2Dir: Integrating L_2-Norm and Directional Alignment for Unsupervised Contrastive Representation Learning in Multimodal Retrieval

Tianyu Zong; Rui Dai; Hongzhu Yi; Yuanxiang Wang; Zhenghao Zhang; Zhenyu Guan; Yujia Yang; Bingkang Shi; Yueyang Ding; Xiangxiang Chu; Kaikui Liu; Jungang Xu

Abstract

Multimodal representation learning primarily relies on contrastive objectives such as InfoNCE to align diverse modalities. However, these methods focus almost exclusively on directional alignment and often neglect the intrinsic role of embedding magnitudes (L2-norm) in the contrastive process. To bridge this gap, we propose L2Dir, a plug-and-play framework designed to jointly optimize L2-norm alignment and Directional consistency. As a highly efficient solution, L2Dir does not require extra data, distillation, or external supervision. It can be integrated seamlessly into existing pipelines by employing a lightweight MLP to reconstruct magnitudes from frozen backbone features. Extensive evaluations across 95 tasks using the UniIR and VLM2Vec-V2 frameworks demonstrate that L2Dir yields consistent and significant performance gains over established baselines across various backbones and scales, indicating that explicit magnitude modeling is a versatile and potent strategy for refining unsupervised multimodal representations. The source code for L2Dir in VLM2Vec-V2 is available in the supplementary materials.
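The observation that cosine-based contrastive objectives attend only to direction can be checked numerically. The following is a minimal sketch, not the L2Dir method itself: it implements a standard InfoNCE loss over L2-normalized embeddings and shows that rescaling the query embeddings leaves the loss unchanged, so any information carried by the norm is invisible to the objective. The function names, the toy vectors, and the temperature of 0.07 are illustrative assumptions and do not come from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm, discarding its magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, candidates, temperature=0.07):
    """Mean InfoNCE loss with cosine similarity.

    queries[i] is assumed to match candidates[i]; every other candidate
    in the batch acts as a negative. The temperature is an assumed value.
    """
    q = [l2_normalize(v) for v in queries]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, cj)) / temperature for cj in c]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


# Toy embeddings (hypothetical values, chosen only for illustration).
queries = [[1.0, 0.2, 0.0], [0.1, 1.0, 0.3]]
candidates = [[0.9, 0.1, 0.1], [0.0, 0.8, 0.5]]

# Same directions, ten times the magnitude.
scaled_queries = [[10.0 * x for x in v] for v in queries]

loss = info_nce(queries, candidates)
loss_scaled = info_nce(scaled_queries, candidates)
norms = [math.sqrt(sum(x * x for x in v)) for v in scaled_queries]

print(abs(loss - loss_scaled) < 1e-9)  # the loss ignores the rescaling
```

Because normalization happens inside the loss, the two calls differ only by floating-point rounding even though the query norms differ by a factor of ten. A magnitude-aware approach therefore has to model the norm explicitly, which is the gap the abstract describes.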

Anthology ID: 2026.acl-long.1604
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 34717–34740
URL: https://aclanthology.org/2026.acl-long.1604/
Cite (ACL): Tianyu Zong, Rui Dai, Hongzhu Yi, Yuanxiang Wang, Zhenghao Zhang, Zhenyu Guan, Yujia Yang, Bingkang Shi, Yueyang Ding, Xiangxiang Chu, Kaikui Liu, and Jungang Xu. 2026. L2Dir: Integrating L_2-Norm and Directional Alignment for Unsupervised Contrastive Representation Learning in Multimodal Retrieval. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34717–34740, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): L2Dir: Integrating L_2-Norm and Directional Alignment for Unsupervised Contrastive Representation Learning in Multimodal Retrieval (Zong et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1604.pdf
Checklist: 2026.acl-long.1604.checklist.pdf