BibTeX
@inproceedings{thomas-etal-2024-leveraging,
    title = "Leveraging {LLM}s for Post-{OCR} Correction of Historical Newspapers",
    author = "Thomas, Alan  and
      Gaizauskas, Robert  and
      Lu, Haiping",
    editor = "Sprugnoli, Rachele  and
      Passarotti, Marco",
    booktitle = "Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lt4hala-1.14",
    pages = "116--121",
    abstract = "Poor OCR quality continues to be a major obstacle for humanities scholars seeking to make use of digitised primary sources such as historical newspapers. Typical approaches to post-OCR correction employ sequence-to-sequence models for a neural machine translation task, mapping erroneous OCR texts to accurate reference texts. We shift our focus towards the adaptation of generative LLMs for a prompt-based approach. By instruction-tuning Llama 2 and comparing it to a fine-tuned BART on BLN600, a parallel corpus of 19th century British newspaper articles, we demonstrate the potential of a prompt-based approach in detecting and correcting OCR errors, even with limited training data. We achieve a significant enhancement in OCR quality with Llama 2 outperforming BART, achieving a 54.51{\%} reduction in the character error rate against BART{'}s 23.30{\%}. This paves the way for future work leveraging generative LLMs to improve the accessibility and unlock the full potential of historical texts for humanities research.",
}
MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="thomas-etal-2024-leveraging">
    <titleInfo>
        <title>Leveraging LLMs for Post-OCR Correction of Historical Newspapers</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Alan</namePart>
        <namePart type="family">Thomas</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Robert</namePart>
        <namePart type="family">Gaizauskas</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Haiping</namePart>
        <namePart type="family">Lu</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2024-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Rachele</namePart>
            <namePart type="family">Sprugnoli</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Marco</namePart>
            <namePart type="family">Passarotti</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>ELRA and ICCL</publisher>
            <place>
                <placeTerm type="text">Torino, Italia</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Poor OCR quality continues to be a major obstacle for humanities scholars seeking to make use of digitised primary sources such as historical newspapers. Typical approaches to post-OCR correction employ sequence-to-sequence models for a neural machine translation task, mapping erroneous OCR texts to accurate reference texts. We shift our focus towards the adaptation of generative LLMs for a prompt-based approach. By instruction-tuning Llama 2 and comparing it to a fine-tuned BART on BLN600, a parallel corpus of 19th century British newspaper articles, we demonstrate the potential of a prompt-based approach in detecting and correcting OCR errors, even with limited training data. We achieve a significant enhancement in OCR quality with Llama 2 outperforming BART, achieving a 54.51% reduction in the character error rate against BART’s 23.30%. This paves the way for future work leveraging generative LLMs to improve the accessibility and unlock the full potential of historical texts for humanities research.</abstract>
    <identifier type="citekey">thomas-etal-2024-leveraging</identifier>
    <location>
        <url>https://aclanthology.org/2024.lt4hala-1.14</url>
    </location>
    <part>
        <date>2024-05</date>
        <extent unit="page">
            <start>116</start>
            <end>121</end>
        </extent>
    </part>
</mods>
</modsCollection>
Endnote
%0 Conference Proceedings
%T Leveraging LLMs for Post-OCR Correction of Historical Newspapers
%A Thomas, Alan
%A Gaizauskas, Robert
%A Lu, Haiping
%Y Sprugnoli, Rachele
%Y Passarotti, Marco
%S Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
%D 2024
%8 May
%I ELRA and ICCL
%C Torino, Italia
%F thomas-etal-2024-leveraging
%X Poor OCR quality continues to be a major obstacle for humanities scholars seeking to make use of digitised primary sources such as historical newspapers. Typical approaches to post-OCR correction employ sequence-to-sequence models for a neural machine translation task, mapping erroneous OCR texts to accurate reference texts. We shift our focus towards the adaptation of generative LLMs for a prompt-based approach. By instruction-tuning Llama 2 and comparing it to a fine-tuned BART on BLN600, a parallel corpus of 19th century British newspaper articles, we demonstrate the potential of a prompt-based approach in detecting and correcting OCR errors, even with limited training data. We achieve a significant enhancement in OCR quality with Llama 2 outperforming BART, achieving a 54.51% reduction in the character error rate against BART’s 23.30%. This paves the way for future work leveraging generative LLMs to improve the accessibility and unlock the full potential of historical texts for humanities research.
%U https://aclanthology.org/2024.lt4hala-1.14
%P 116-121
Markdown (Informal)
[Leveraging LLMs for Post-OCR Correction of Historical Newspapers](https://aclanthology.org/2024.lt4hala-1.14) (Thomas et al., LT4HALA-WS 2024)
ACL
Alan Thomas, Robert Gaizauskas, and Haiping Lu. 2024. [Leveraging LLMs for Post-OCR Correction of Historical Newspapers](https://aclanthology.org/2024.lt4hala-1.14). In *Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024*, pages 116–121, Torino, Italia. ELRA and ICCL.