Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability

Yejun Yoon, Seunghyun Yoon, Kunwoo Park


Abstract
This paper addresses the challenge of assessing the representativeness of news thumbnail images, which are often the first visual content readers encounter when an article is disseminated on social media. We focus on whether a news image represents the actors discussed in the news text. To address this challenge, we introduce NewsTT, a manually annotated dataset of 1,000 news thumbnail image and text pairs. We found that pretrained vision and language models, such as BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, pretrained models may have limited capability to match the visual and textual appearances of news actors. We hypothesize that learning to contrast news text with its counterfactual, in which named entities are replaced, can enhance the cross-modal matching ability of vision and language models. We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to this hypothesis. We found that this simple method improves performance in assessing news thumbnail representativeness, supporting our hypothesis. Code and data can be accessed at https://github.com/ssu-humane/news-images-acl24.
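The idea of contrasting a news text with its counterfactual can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes the entity substitutions are supplied by hand, that the embeddings are plain lists of floats standing in for encoder outputs, and that the loss is a two-candidate softmax with an illustrative temperature of 0.07. The names make_counterfactual, cosine, and counterfactual_contrastive_loss are hypothetical.

```python
import math


def make_counterfactual(text, entity_swaps):
    """Return a copy of `text` with each named entity replaced.

    `entity_swaps` maps an original entity string to its substitute.
    Assumption: entities are given explicitly rather than detected by a tagger.
    """
    for original, substitute in entity_swaps.items():
        text = text.replace(original, substitute)
    return text


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def counterfactual_contrastive_loss(image_vec, text_vec, cf_text_vec,
                                    temperature=0.07):
    """Negative log-probability of the original text under a softmax over
    {original text, counterfactual text}, scored against the image.

    The counterfactual text acts as a hard negative for the image.
    """
    pos = cosine(image_vec, text_vec) / temperature
    neg = cosine(image_vec, cf_text_vec) / temperature
    # Numerically stable log-sum-exp over the two candidates.
    m = max(pos, neg)
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos


# Toy usage: the headline and names below are made up for illustration.
headline = "Alice Kim met Bob Lee in Seoul."
counterfactual = make_counterfactual(headline, {"Alice Kim": "Carol Park"})

image_vec = [1.0, 0.0]      # placeholder for an image embedding
text_vec = [0.9, 0.1]       # placeholder for the original text embedding
cf_text_vec = [0.1, 0.9]    # placeholder for the counterfactual embedding
loss = counterfactual_contrastive_loss(image_vec, text_vec, cf_text_vec)
```

In this sketch the loss is small when the image embedding is closer to the original text than to the counterfactual, and large in the opposite case, which is the behavior a contrastive objective of this kind is meant to encourage in the encoders.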
Anthology ID:
2024.findings-acl.534
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9009–9024
URL:
https://aclanthology.org/2024.findings-acl.534
Cite (ACL):
Yejun Yoon, Seunghyun Yoon, and Kunwoo Park. 2024. Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability. In Findings of the Association for Computational Linguistics ACL 2024, pages 9009–9024, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability (Yoon et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.534.pdf