Shibingfeng Zhang


2023

GPL at SemEval-2023 Task 1: WordNet and CLIP to Disambiguate Images
Shibingfeng Zhang | Shantanu Nath | Davide Mazzaccara
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Given a word in context, the task of Visual Word Sense Disambiguation consists of selecting the correct image among a set of candidates. To select the correct image, we propose a solution blending text augmentation and multimodal models. Text augmentation leverages the fine-grained semantic annotation from WordNet to get a better representation of the textual component. We then compare this sense-augmented text to the set of images using the pre-trained multimodal models CLIP and ViLT. Our system was ranked 16th for the English language, achieving 68.5 points for hit rate and 79.2 for mean reciprocal rank.
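The two reported metrics can be illustrated with a minimal sketch. It assumes that "hit rate" means the share of instances whose top-ranked image is the gold image (hit@1) and that mean reciprocal rank averages 1/rank of the gold image over all instances. The function names and the toy image identifiers below are hypothetical and do not come from the authors' system or the task's scoring script.

```python
from typing import Sequence, Tuple


def reciprocal_rank(ranked: Sequence[str], gold: str) -> float:
    """Return 1/rank of the gold image in a ranked list, or 0.0 if it is absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / position
    return 0.0


def hit_rate_and_mrr(
    rankings: Sequence[Sequence[str]], golds: Sequence[str]
) -> Tuple[float, float]:
    """Compute hit@1 and mean reciprocal rank over a set of instances.

    Assumption: each ranking lists candidate image ids from best to worst.
    """
    if len(rankings) != len(golds):
        raise ValueError("rankings and golds must have the same length")
    if not golds:
        raise ValueError("at least one instance is required")
    n = len(golds)
    # A hit is counted only when the gold image is ranked first.
    hits = sum(1 for r, g in zip(rankings, golds) if r and r[0] == g)
    rr_total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds))
    return hits / n, rr_total / n


# Toy example with hypothetical image ids: gold image ranked 1st, 2nd, and 3rd.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
golds = ["a", "a", "a"]
hit_rate, mrr = hit_rate_and_mrr(rankings, golds)
```

In this toy case only the first instance counts as a hit, while the reciprocal ranks are 1, 1/2, and 1/3, so mean reciprocal rank is higher than hit rate, the same relationship seen in the reported scores.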