Data or Language Supervision: What Makes CLIP Better than DINO?

Yiming Liu; Yuhui Zhang; Dhruba Ghosh; Ludwig Schmidt; Serena Yeung-Levy

doi:10.18653/v1/2025.findings-emnlp.98

Abstract

CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP’s language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings—using the same architecture, dataset, and training configuration—achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.
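The abstract contrasts the usual language-supervision objective with a sigmoid-loss variant. As a rough illustration of how the two differ, the following sketch compares a generic CLIP-style symmetric softmax contrastive loss with a generic pairwise sigmoid loss on a small image-text similarity matrix. It is not the authors' training code: the function names, the `temperature`, `scale`, and `bias` values, and the toy matrices are all assumptions chosen for illustration.

```python
import math


def softmax_contrastive_loss(sim, temperature=0.07):
    """Symmetric softmax (InfoNCE-style) loss over an n x n similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs sit on
    the diagonal. The temperature value is an illustrative assumption.
    """
    n = len(sim)

    def mean_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # -log softmax of the matched pair
        return total / n

    columns = [list(c) for c in zip(*sim)]  # text-to-image direction
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(columns))


def sigmoid_pairwise_loss(sim, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: every (image, text) pair is scored independently.

    Matched pairs get label +1, all others -1. The scale and bias values are
    illustrative assumptions.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            z = label * (scale * sim[i][j] + bias)
            # -log(sigmoid(z)) equals softplus(-z), computed stably
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n


# Toy similarity matrices (hypothetical): correctly matched vs. swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

for name, loss_fn in [("softmax", softmax_contrastive_loss),
                      ("sigmoid", sigmoid_pairwise_loss)]:
    print(name, loss_fn(aligned), loss_fn(swapped))
```

Both losses are lower for the correctly matched batch. The main structural difference is that the softmax loss normalizes each row and column over the whole batch, while the sigmoid loss treats every pair as an independent binary decision.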

Anthology ID: 2025.findings-emnlp.98
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1868–1874
URL: https://aclanthology.org/2025.findings-emnlp.98/
DOI: 10.18653/v1/2025.findings-emnlp.98
Cite (ACL): Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, and Serena Yeung-Levy. 2025. Data or Language Supervision: What Makes CLIP Better than DINO? In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1868–1874, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Data or Language Supervision: What Makes CLIP Better than DINO? (Liu et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.98.pdf
Checklist: 2025.findings-emnlp.98.checklist.pdf
