Lawal Abdulmujeeb


2025

Correctly identifying idiomatic expressions remains a major challenge in Natural Language Processing (NLP), as their meanings often cannot be inferred directly from their individual words. SemEval-2025 Task 1 introduces two subtasks, A and B, designed to test models' ability to interpret idioms using multimodal data comprising both text and images. This paper focuses on Subtask A, where the goal is to determine which of several candidate images best represents the intended meaning of an idiomatic expression in a given sentence. To address this, we employed a two-stage approach. First, we used GPT-4o to analyze each sentence, extracting relevant keywords and sentiment to better capture the idiomatic usage. This processed information was then passed to a CLIP-ViT model, which ranked the candidate images by their relevance to the idiomatic expression. Our results showed that this approach performed significantly better than directly feeding sentences and idiomatic compounds into the models without preprocessing. Specifically, our method achieved a Top-1 accuracy of 0.67 in English, whereas performance in Portuguese was notably lower at 0.23. These findings highlight both the promise of multimodal approaches to idiom interpretation and the challenges posed by language-specific differences in model performance.
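The ranking stage of the pipeline above can be sketched as follows. This is a minimal illustration rather than the system's actual code: a CLIP-style model scores each candidate image by the cosine similarity between its embedding and a text embedding (here, the text side would be built from GPT-4o's extracted keywords and sentiment). The embeddings below are mock random vectors standing in for real CLIP-ViT outputs.

```python
import numpy as np

def rank_images(text_embedding, image_embeddings):
    """Rank candidate images by cosine similarity to a text embedding,
    as a CLIP-style model does (higher similarity = better match)."""
    # L2-normalize so the dot product equals cosine similarity
    text = text_embedding / np.linalg.norm(text_embedding)
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = imgs @ text          # one cosine similarity per candidate image
    order = np.argsort(-scores)   # indices sorted best-match first
    return order, scores

# Mock embeddings: in the real pipeline these would come from CLIP-ViT,
# with the text embedding derived from the GPT-4o-processed sentence.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)          # one query text embedding
image_embs = rng.normal(size=(5, 8))   # five candidate image embeddings
order, scores = rank_images(text_emb, image_embs)
print("ranking (best first):", order.tolist())
```

The Top-1 prediction is simply `order[0]`, the image whose embedding lies closest in direction to the text embedding.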