OCRTurk: A Comprehensive OCR Benchmark for Turkish

Deniz Yılmaz, Evren Ayberk Munis, Cagri Toraman, Süha Kağan Köse, Burak Aktaş, Mehmet Can Baytekin, Bilge Kaan Görür


Abstract
Document parsing is now widely used in applications such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking document parsing models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage of low-resource settings such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading on most element-wise metrics (figures being an exception) and attaining the best Normalized Edit Distance scores on the easy, medium, and hard subsets. We also observe performance variation by document type: models perform well on non-academic articles, while slide decks are the most challenging.
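For readers unfamiliar with the Normalized Edit Distance metric named above, the following is a minimal sketch of one common formulation: the Levenshtein distance between a predicted string and a reference string, divided by the length of the longer one. The normalization choice and the function names `levenshtein` and `normalized_edit_distance` are assumptions made for illustration; the abstract does not specify the exact variant used in OCRTurk.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Assumed normalization: divide by the longer string's length (0.0 = identical)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


# An OCR output that confuses the dotless "ı" with "i" twice in a four-letter word.
print(normalized_edit_distance("ilik", "ılık"))  # 0.5
```

Under this formulation lower scores are better, and Turkish-specific characters such as "ı" count as full substitution errors when a model maps them to their ASCII look-alikes.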
Anthology ID:
2026.sigturk-1.16
Volume:
Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Kemal Oflazer, Abdullatif Köksal, Onur Varol
Venues:
SIGTURK | WS
Publisher:
Association for Computational Linguistics
Pages:
197–208
URL:
https://aclanthology.org/2026.sigturk-1.16/
Cite (ACL):
Deniz Yılmaz, Evren Ayberk Munis, Cagri Toraman, Süha Kağan Köse, Burak Aktaş, Mehmet Can Baytekin, and Bilge Kaan Görür. 2026. OCRTurk: A Comprehensive OCR Benchmark for Turkish. In Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026), pages 197–208, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
OCRTurk: A Comprehensive OCR Benchmark for Turkish (Yılmaz et al., SIGTURK 2026)
PDF:
https://aclanthology.org/2026.sigturk-1.16.pdf