Davide Napolitano


2025

Idiomatic expressions present significant challenges for natural language understanding systems because their meanings often diverge from the literal interpretation. While prior work has focused on textual idiom detection, the role of visual content in reasoning about idiomaticity remains underexplored. This study introduces a Chain-of-Thought reasoning framework that enhances idiomatic comprehension by ranking images based on their relevance to a compound expression in context, requiring the system to distinguish between idiomatic and literal meanings. We comprehensively evaluate our approach by quantitatively analyzing the performance improvements achieved by integrating textual and visual information in the ranking process under different prompting settings. Our empirical findings provide insights into the capabilities of visual Large Language Models to establish meaningful correlations between idiomatic content and its visual counterpart, suggesting promising directions for multimodal language understanding.