Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP

Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon


Abstract
A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.
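The abstract does not state how the reweighting is computed. As a minimal sketch only, assume each token carries a non-negative weight that scales its share of attention before normalization, so that a weight of 1 for every token reduces to an ordinary softmax. The function names `reweighted_attention` and `pool` are hypothetical, chosen for illustration, and are not taken from the paper or from any CLIP implementation.

```python
import math


def reweighted_attention(scores, weights):
    """Softmax over `scores` with each token's share scaled by its weight.

    Assumption (not from the source): weights multiply the exponentiated
    scores, so uniform weights give back the plain softmax.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    peak = max(scores)  # subtract the max for numerical stability
    unnormalized = [w * math.exp(s - peak) for s, w in zip(scores, weights)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("at least one token needs a positive weight")
    return [u / total for u in unnormalized]


def pool(token_vectors, attention):
    """Attention-weighted sum of token vectors (one pooled vector)."""
    dim = len(token_vectors[0])
    return [
        sum(a * vec[d] for a, vec in zip(attention, token_vectors))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Toy example: three tokens with equal raw scores; the middle token
    # is emphasized with twice the weight of the others.
    attention = reweighted_attention([0.0, 0.0, 0.0], [1.0, 2.0, 1.0])
    pooled = pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], attention)
    print(attention, pooled)
```

In this toy setup, raising one token's weight shifts the pooled vector toward that token's representation, which is the kind of emphasis control the abstract describes; the weights could in principle be set by hand or learned from data.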
Anthology ID:
2024.findings-emnlp.837
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14330–14345
URL:
https://aclanthology.org/2024.findings-emnlp.837
Cite (ACL):
Eunji Kim, Kyuhong Shim, Simyung Chang, and Sungroh Yoon. 2024. Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14330–14345, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP (Kim et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.837.pdf