Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gomez, Tony Montes, Arturo Rodriguez Herrera, Ruben Manrique


Abstract
This paper makes two contributions. First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that uses a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
Anthology ID:
2024.nlp4dh-1.13
Volume:
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month:
November
Year:
2024
Address:
Miami, USA
Editors:
Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
132–139
URL:
https://aclanthology.org/2024.nlp4dh-1.13
Cite (ACL):
Laura Manrique-Gomez, Tony Montes, Arturo Rodriguez Herrera, and Ruben Manrique. 2024. Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 132–139, Miami, USA. Association for Computational Linguistics.
Cite (Informal):
Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction (Manrique-Gomez et al., NLP4DH 2024)
PDF:
https://aclanthology.org/2024.nlp4dh-1.13.pdf