Ümit Atlamaz
2026
TimeRes: A Turkish Benchmark For Evaluating Temporal Understanding of Large Language Models
Habib Yağız Demir | Ümit Atlamaz | Susan Üsküdarlı
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Temporal information is an essential part of communication, and understanding language requires processing it effectively. Despite recent advances, Large Language Models (LLMs) still struggle with temporal understanding. Existing benchmarks primarily focus on English and underexplore how linguistic structure contributes to temporal meaning. As a result, temporal understanding in languages other than English remains largely understudied. In this paper, we introduce TimeRes, a Turkish benchmark for evaluating temporal understanding of LLMs. TimeRes aims to investigate comprehension of Reichenbach’s temporal points and reported speech through date arithmetic. Our dataset includes 4,600 questions across 4 tasks at two levels of complexity, and presents a paired question formulation to distinguish temporal discourse understanding from temporal arithmetic capabilities. We evaluated six LLMs and demonstrated that models struggle to resolve reported speech and fail to generalize across word order variations.
2025
Text Extraction and Script Completion in Images of Arabic Script-Based Calligraphy: A Thesis Proposal
Dilara Zeynep Gürer | Ümit Atlamaz | Şaziye Betül Özateş
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Arabic calligraphy carries rich historical information and meaning. However, the complexity of its artistic elements and the absence of a consistent baseline make text extraction from such works highly challenging. In this paper, we provide an in-depth analysis of the unique obstacles in processing and interpreting these images, including the variability in calligraphic styles, the influence of artistic distortions, and the challenges posed by missing or damaged text elements. We explore potential solutions by leveraging state-of-the-art architectures and deep learning models, including visual language models, to improve text extraction and script completion.
2024
ImplicaTR: A Granular Dataset for Natural Language Inference and Pragmatic Reasoning in Turkish
Mustafa Halat | Ümit Atlamaz
Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024)
We introduce ImplicaTR, a linguistically informed diagnostic dataset designed to evaluate semantic and pragmatic reasoning capabilities of Natural Language Inference (NLI) models in Turkish. Existing Turkish NLI datasets treat NLI as determining whether a sentence pair represents entailment, contradiction, or a neutral relation. Such datasets do not distinguish between semantic entailment and pragmatic implicature, which linguists have long recognized as separate inference types. ImplicaTR addresses this by testing NLI models’ ability to differentiate between entailment and implicature, thus assessing their pragmatic reasoning skills. The dataset consists of 19,350 semi-automatically generated sentence pairs covering implicature, entailment, contradiction, and neutral relations. We evaluated various models (BERT, Gemma, Llama-2, and Mistral) on ImplicaTR and found that these models can reach up to 98% accuracy on semantic and pragmatic reasoning. We also fine-tuned various models on subsets of ImplicaTR to test the abilities of NLI models to generalize across unseen implicature contexts. Our results indicate that model performance is highly dependent on the diversity of linguistic expressions within each subset, highlighting a weakness in the abstract generalization capabilities of large language models regarding pragmatic reasoning. We share all the code, models, and the dataset.