Hatice Kose
2026
VisAffect at MWE-2026 AdMIRe 2: IMMCAN Idiom Multimodal Cross-Attention Network
Barış Bilen | Ali Azmoudeh | Hazım Kemal Ekenel | Hatice Kose
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
We address AdMIRe 2.0, a static image ranking task in which a sentence containing a potentially idiomatic expression is paired with five image–caption candidates, and the goal is to rank the candidates by semantic compatibility with the intended idiomatic or literal meaning. We propose IMMCAN, which keeps XLM-R and Jina-CLIP-v2 frozen and learns a lightweight two-stage cross-attention fusion (caption–image grounding followed by idiom-to-multimodal conditioning) to predict a compatibility score per candidate. We also evaluate caption-only augmentation via back-translation and synonym substitution, and compare regression and rank-class formulations. On AdMIRe 1.0, the text-only variant achieves higher test top-image accuracy than VLM-grounded modeling. In contrast, in the zero-shot setting on AdMIRe 2.0, adding visual patch grounding improves both accuracy and NDCG, indicating better cross-lingual ranking transfer.
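To make the two reported metrics concrete, the following is a minimal sketch of how per-candidate compatibility scores could be turned into a ranking and scored with top-image accuracy and NDCG. It is not the shared task's official scorer. The function names, the candidate identifiers, and the relevance assignment (the gold-best of n candidates receives gain n, the last receives gain 1, with a log2 position discount) are assumptions made for illustration; the abstract does not specify the exact NDCG variant.

```python
import math


def rank_by_score(scores):
    """Return candidate ids sorted by descending compatibility score.

    `scores` maps a candidate id to a predicted score (hypothetical format).
    """
    return sorted(scores, key=scores.get, reverse=True)


def top_image_accuracy(predicted_rankings, gold_rankings):
    """Fraction of items whose top predicted candidate equals the gold top candidate."""
    hits = sum(
        1 for pred, gold in zip(predicted_rankings, gold_rankings)
        if pred[0] == gold[0]
    )
    return hits / len(gold_rankings)


def ndcg(predicted_ranking, gold_ranking):
    """NDCG of one predicted ranking against the gold ranking.

    Assumption: relevance is derived from the gold position, so the best of
    n candidates has gain n and the worst has gain 1.
    """
    n = len(gold_ranking)
    relevance = {cand: n - pos for pos, cand in enumerate(gold_ranking)}

    def dcg(ranking):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(relevance[c] / math.log2(i + 2) for i, c in enumerate(ranking))

    return dcg(predicted_ranking) / dcg(gold_ranking)


# Illustrative item with five candidates; the scores are made up.
scores = {"img1": 0.2, "img2": 0.9, "img3": 0.5, "img4": 0.1, "img5": 0.4}
gold = ["img2", "img3", "img5", "img1", "img4"]
predicted = rank_by_score(scores)
print(predicted, top_image_accuracy([predicted], [gold]), ndcg(predicted, gold))
```

Under these assumptions, a ranking identical to the gold order yields an NDCG of 1, and any other ordering yields a lower value, while top-image accuracy only checks the first position.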