Exploring the Potential of Dense Information in Multimodal Alignment

Zhiyuan Fan, Zhihong Chen, Benyou Wang


Abstract
Despite the success of data augmentation in improving CLIP models, existing methods that use LLMs or SAM to enrich the information in captions still suffer from several limitations, including insufficient detail and excessive hallucinations. These limitations compromise alignment and mask the true potential of dense information, which can lead to erroneous conclusions about CLIP’s ability to handle rich data and impede the development of more effective models. To address them, we introduce a novel pipeline that generates highly detailed, factually accurate captions for images, facilitating in-depth analysis of the potential of dense information in multimodal alignment. Contrary to previous findings, our investigation reveals that lengthening captions boosts performance across diverse benchmarks, even surpassing the effectiveness of meticulously crafted hard negative samples. Building on these insights, we introduce DELIP, which enhances both foundational multimodal alignment and compositional reasoning abilities. Finally, we explore strategies to expand the context window of the text encoder, unlocking the potential of richer data for CLIP and paving the way for advancements in leveraging dense information for multimodal alignment.
Anthology ID:
2024.findings-acl.797
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13440–13451
URL:
https://aclanthology.org/2024.findings-acl.797
Cite (ACL):
Zhiyuan Fan, Zhihong Chen, and Benyou Wang. 2024. Exploring the Potential of Dense Information in Multimodal Alignment. In Findings of the Association for Computational Linguistics ACL 2024, pages 13440–13451, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Exploring the Potential of Dense Information in Multimodal Alignment (Fan et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.797.pdf