A Question of Confidence: Using OCR Technology for Script analysis

Antonia Karaisl


Abstract
This article proposes a method that uses the Tesseract OCR engine to aid palaeographic analysis and scribal identification. It repurposes the confidence score reported by the OCR engine and applies different methods of visualization to surface differences between font families, script types and manuscript hands.
Anthology ID:
2023.nlp4dh-1.20
Volume:
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Month:
December
Year:
2023
Address:
Tokyo, Japan
Editors:
Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter
Venues:
NLP4DH | IWCLUL
Publisher:
Association for Computational Linguistics
Pages:
162–171
URL:
https://aclanthology.org/2023.nlp4dh-1.20
Cite (ACL):
Antonia Karaisl. 2023. A Question of Confidence: Using OCR Technology for Script analysis. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 162–171, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
A Question of Confidence: Using OCR Technology for Script analysis (Karaisl, NLP4DH-IWCLUL 2023)
PDF:
https://aclanthology.org/2023.nlp4dh-1.20.pdf