Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation

Anastasia Kritharoula, Maria Lymperaiou, Giorgos Stamou


Abstract
Visual Word Sense Disambiguation (VWSD) is a novel and challenging task whose goal is to retrieve, from a set of candidates, the image that best represents the meaning of an ambiguous word within a given context. In this paper, we take a substantial step towards understanding this task by applying a varied set of approaches. Since VWSD is primarily a text-image retrieval task, we explore the latest transformer-based methods for multimodal retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge bases to enhance the given phrases and resolve ambiguity related to the target word. We also study VWSD as a unimodal problem by converting it to text-to-text and image-to-image retrieval, as well as question-answering (QA), to fully explore the capabilities of relevant models. To tap into the implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide explainable answer generation. Finally, we train a learning-to-rank (LTR) model to combine our different modules, achieving competitive ranking results. Extensive experiments on VWSD yield valuable insights that can drive future research directions.
Anthology ID: 2023.emnlp-main.807
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 13053–13077
URL: https://aclanthology.org/2023.emnlp-main.807
DOI: 10.18653/v1/2023.emnlp-main.807
Cite (ACL): Anastasia Kritharoula, Maria Lymperaiou, and Giorgos Stamou. 2023. Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13053–13077, Singapore. Association for Computational Linguistics.
Cite (Informal): Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation (Kritharoula et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.807.pdf
Video: https://aclanthology.org/2023.emnlp-main.807.mp4