Leveraging External Knowledge for Historical Document Restoration via Retrieval-Augmented Large Language Models

Gabeen Kim; Kyeongpil Kang

Abstract

Historical documents act as invaluable knowledge archives but often suffer from illegibility due to physical deterioration and damage. While existing restoration methods based on masked language modeling effectively utilize local context, they struggle to restore named entities that require external historical knowledge. To address this limitation, we introduce a novel framework for historical document restoration that leverages large language models with retrieval-augmented generation (RAG). By combining the implicit knowledge of pre-trained LLMs with explicitly retrieved external context, our model ARI effectively mitigates the challenge of inferring context-dependent proper nouns. Extensive experiments on Korean historical documents demonstrate that our approach significantly outperforms baselines, achieving substantial gains in restoring both general characters and named entities. Furthermore, comprehensive evaluations including expert assessments confirm that ARI serves as a practical tool for domain experts, promising to accelerate the analysis of historical records.

Anthology ID: 2026.findings-acl.2148
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 43290–43304
URL: https://aclanthology.org/2026.findings-acl.2148/
Cite (ACL): Gabeen Kim and Kyeongpil Kang. 2026. Leveraging External Knowledge for Historical Document Restoration via Retrieval-Augmented Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 43290–43304, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Leveraging External Knowledge for Historical Document Restoration via Retrieval-Augmented Large Language Models (Kim & Kang, Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.2148.pdf
Checklist: 2026.findings-acl.2148.checklist.pdf
