Digitizing Old Ukrainian Texts: A Prompt-Based OCR Pipeline and Evaluation Dataset

Dmytro Chaplynskyi, Hanna Dydyk-Meush


Abstract
We present a methodology and an open dataset for OCR of handwritten index cards containing a scholarly transcription of an early 17th-century Ukrainian polemical text, Perestoroha by Iov Boretskyi (Lviv, 1605–1606). The 430 cards, produced by 20th-century researchers, preserve the text in Old Ukrainian orthography with archaic diacritics, titlos, superscript letters, and ligatures that make automated recognition non-trivial. We develop a prompt-based OCR pipeline driven by a custom instruction set designed iteratively from the source material’s orthographic conventions. The pipeline is evaluated against human-proofread ground truth in proprietary and open-source configurations using identical instructions and evaluation data. The proprietary configuration with extended thinking at maximum budget (Claude Opus 4.7, xhigh) achieves a Character Error Rate of 2.5%; an Opus 4.6 baseline at the default 2,048-token thinking budget — used for the first batch of the released dataset — reaches 4.2%; and two open-source Qwen3.6 variants running locally on consumer hardware reach 14.6% (dense 27B) and 14.8% (35B-A3B MoE). We release the fully digitized text aligned at line level to 300 DPI scanned images, as both a scholarly digital resource and training data for future OCR systems targeting Old Slavic manuscripts.
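The headline numbers above are Character Error Rates. As a minimal sketch of how such a figure is commonly computed, the snippet below divides the Levenshtein edit distance between the proofread reference and the OCR output by the reference length. The function names `levenshtein` and `cer` are illustrative, and the code assumes no Unicode normalization or whitespace handling beyond what the caller applies; the abstract does not state which conventions the authors used, and for text with combining diacritics that choice can change the result.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length.

    Assumption: characters are counted as Python code points, so a base
    letter plus a combining mark counts as two characters.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Toy strings, not data from the paper.
    print(f"{cer('kitten', 'sitting'):.3f}")
```

Under this definition a CER of 2.5% corresponds to roughly one character-level edit per 40 reference characters.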
Anthology ID:
2026.unlp-1.7
Volume:
Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
Month:
May
Year:
2026
Address:
Lviv, Ukraine
Editor:
Mariana Romanyshyn
Venue:
UNLP
Publisher:
Association for Computational Linguistics
Pages:
58–66
URL:
https://aclanthology.org/2026.unlp-1.7/
Cite (ACL):
Dmytro Chaplynskyi and Hanna Dydyk-Meush. 2026. Digitizing Old Ukrainian Texts: A Prompt-Based OCR Pipeline and Evaluation Dataset. In Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026), pages 58–66, Lviv, Ukraine. Association for Computational Linguistics.
Cite (Informal):
Digitizing Old Ukrainian Texts: A Prompt-Based OCR Pipeline and Evaluation Dataset (Chaplynskyi & Dydyk-Meush, UNLP 2026)
PDF:
https://aclanthology.org/2026.unlp-1.7.pdf