A Unified Turkic Idiom Understanding Benchmark: Idiom Detection and Semantic Retrieval Across Five Turkic Languages

Gözde Aslantaş; Tunga Gungor

Abstract

Idiomatic expressions are culturally grounded and semantically opaque, which makes them difficult for multilingual natural language processing systems to interpret. Despite the large speaker population of Turkic languages, resources covering monolingual and cross-lingual idioms and their meanings are limited. We introduce the first unified benchmark for idiom understanding across Turkish, Azerbaijani, Turkmen, Gagauz, and Uzbek. The compiled datasets include token-level idiom span annotations. We develop models for two tasks: idiom identification and semantic retrieval. We evaluate seven models for idiom identification and nine embedding models for semantic retrieval under several fine-tuning schemes, using standard dense retrieval metrics. This benchmark provides a basis for studying idiomatic phenomena in Turkic languages and clarifies how idiomatic meanings are shared, altered, or diverge across languages.

Anthology ID: 2026.sigturk-1.4
Volume: Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Kemal Oflazer, Abdullatif Köksal, Onur Varol
Venues: SIGTURK | WS
Publisher: Association for Computational Linguistics
Pages: 38–51
URL: https://aclanthology.org/2026.sigturk-1.4/
Cite (ACL): Gözde Aslantaş and Tunga Gungor. 2026. A Unified Turkic Idiom Understanding Benchmark: Idiom Detection and Semantic Retrieval Across Five Turkic Languages. In Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026), pages 38–51, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): A Unified Turkic Idiom Understanding Benchmark: Idiom Detection and Semantic Retrieval Across Five Turkic Languages (Aslantaş & Gungor, SIGTURK 2026)
PDF: https://aclanthology.org/2026.sigturk-1.4.pdf
