Digitizing Old Ukrainian Texts: A Prompt-Based OCR Pipeline and Evaluation Dataset

Dmytro Chaplynskyi; Hanna Dydyk-Meush

Abstract

We present a methodology and an open dataset for OCR of handwritten index cards containing a scholarly transcription of an early 17th-century Ukrainian polemical text, Perestoroha by Iov Boretskyi (Lviv, 1605–1606). The 430 cards, produced by 20th-century researchers, preserve the text in Old Ukrainian orthography with archaic diacritics, titlos, superscript letters, and ligatures that make automated recognition non-trivial. We develop a prompt-based OCR pipeline driven by a custom instruction set designed iteratively from the source material’s orthographic conventions. The pipeline is evaluated against human-proofread ground truth in proprietary and open-source configurations using identical instructions and evaluation data. The proprietary configuration with extended thinking at maximum budget (Claude Opus 4.7, xhigh) achieves a Character Error Rate of 2.5%; an Opus 4.6 baseline at the default 2,048-token thinking budget — used for the first batch of the released dataset — reaches 4.2%; and two open-source Qwen3.6 variants running locally on consumer hardware reach 14.6% (dense 27B) and 14.8% (35B-A3B MoE). We release the fully digitized text aligned at line level to 300 DPI scanned images, as both a scholarly digital resource and training data for future OCR systems targeting Old Slavic manuscripts.
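The Character Error Rate figures above are conventionally computed as the character-level edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch of that standard definition, not the authors' evaluation script: the function names `levenshtein` and `cer` are hypothetical, and the NFC normalization step is an assumption added here because combining diacritics would otherwise inflate the count.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    # Assumption: both strings are NFC-normalized so that precomposed and
    # decomposed forms of the same letter compare as equal.
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a CER of 2.5% corresponds to roughly one character edit per 40 characters of ground truth.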

Anthology ID: 2026.unlp-1.7
Volume: Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
Month: May
Year: 2026
Address: Lviv, Ukraine
Editor: Mariana Romanyshyn
Venue: UNLP
Publisher: Association for Computational Linguistics
Pages: 58–66
URL: https://aclanthology.org/2026.unlp-1.7/
Cite (ACL): Dmytro Chaplynskyi and Hanna Dydyk-Meush. 2026. Digitizing Old Ukrainian Texts: A Prompt-Based OCR Pipeline and Evaluation Dataset. In Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026), pages 58–66, Lviv, Ukraine. Association for Computational Linguistics.
Cite (Informal): Digitizing Old Ukrainian Texts: A Prompt-Based OCR Pipeline and Evaluation Dataset (Chaplynskyi & Dydyk-Meush, UNLP 2026)
PDF: https://aclanthology.org/2026.unlp-1.7.pdf
