On the Difference of BERT-style and CLIP-style Text Encoders

Zhihong Chen, Guiming Chen, Shizhe Diao, Xiang Wan, Benyou Wang


Abstract
Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, with BERT being one of its representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially for its vision models, which achieve excellent performance on a broad range of vision tasks. However, few studies have examined the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders through three sets of experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. The experimental analyses show that although CLIP-style text encoders underperform BERT-style ones on general text understanding tasks, they are equipped with a unique ability for cross-modal association, i.e., synesthesia, which more closely resembles human perception.
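To make the comparison concrete, below is a minimal sketch of how one might extract sentence embeddings from a BERT-style and a CLIP-style text encoder for side-by-side analysis. It assumes the Hugging Face transformers library and the checkpoints bert-base-uncased and openai/clip-vit-base-patch32; these are illustrative choices, not necessarily the exact models or pooling strategy used in the paper.

```python
# Illustrative sketch: obtain comparable sentence vectors from a
# BERT-style encoder (MLM-pretrained) and a CLIP-style text encoder
# (contrastively aligned with images). Checkpoints are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel, CLIPTokenizer, CLIPModel

text = "a photo of a red apple on a wooden table"

# BERT-style encoder: mean-pool the final hidden states as a sentence vector.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    inputs = bert_tok(text, return_tensors="pt")
    hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
    bert_emb = hidden.mean(dim=1)               # (1, 768)

# CLIP-style encoder: use the projected text features that CLIP aligns
# with image embeddings during contrastive pretraining.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    inputs = clip_tok(text, return_tensors="pt")
    clip_emb = clip.get_text_features(**inputs)  # (1, 512)

print(bert_emb.shape, clip_emb.shape)
```

Because the two embeddings live in different spaces (BERT's hidden space vs. CLIP's joint image-text space), probing tasks such as those in the paper are needed to compare them meaningfully rather than comparing the vectors directly.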
Anthology ID:
2023.findings-acl.866
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13710–13721
URL:
https://aclanthology.org/2023.findings-acl.866
DOI:
10.18653/v1/2023.findings-acl.866
Cite (ACL):
Zhihong Chen, Guiming Chen, Shizhe Diao, Xiang Wan, and Benyou Wang. 2023. On the Difference of BERT-style and CLIP-style Text Encoders. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13710–13721, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
On the Difference of BERT-style and CLIP-style Text Encoders (Chen et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.866.pdf