BART for Post-Correction of OCR Newspaper Text

Elizabeth Soper, Stanley Fujimoto, Yen-Yun Yu

Abstract
Optical character recognition (OCR) from newspaper page images is susceptible to noise due to degradation of old documents and variation in typesetting. In this report, we present a novel approach to OCR post-correction. We cast error correction as a translation task, and fine-tune BART, a transformer-based sequence-to-sequence language model pretrained to denoise corrupted text. We are the first to use sentence-level transformer models for OCR post-correction, and our best model achieves a 29.4% improvement in character accuracy over the original noisy OCR text. Our results demonstrate the utility of pretrained language models for dealing with noisy text.
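The approach described in the abstract — casting OCR post-correction as sequence-to-sequence "translation" with a fine-tuned BART — can be sketched with the Hugging Face transformers library. The sketch below is illustrative only: the checkpoint (facebook/bart-base), learning rate, decoding settings, and toy sentence pair are assumptions, not the authors' actual configuration, and the character-accuracy helper uses one common definition of the metric (1 minus length-normalized edit distance), which may differ from the paper's exact evaluation.

# Hypothetical sketch: OCR post-correction as seq2seq translation with BART.
# Checkpoint, hyperparameters, and data are illustrative assumptions.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Toy parallel data: noisy OCR output paired with its clean transcription.
noisy = ["Tbe qnick hrown fox jumps ovcr the lazy d0g."]
clean = ["The quick brown fox jumps over the lazy dog."]

batch = tokenizer(noisy, return_tensors="pt", padding=True, truncation=True)
labels = tokenizer(clean, return_tensors="pt", padding=True, truncation=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

# One fine-tuning step: BART learns to map corrupted text to clean text.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()

# Inference: beam search generates a corrected hypothesis.
model.eval()
with torch.no_grad():
    out = model.generate(**batch, num_beams=4, max_length=64)
hypothesis = tokenizer.decode(out[0], skip_special_tokens=True)

def char_accuracy(hyp: str, ref: str) -> float:
    """Character accuracy as 1 - (Levenshtein distance / reference length)."""
    prev = list(range(len(ref) + 1))  # dynamic-programming edit distance
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = curr
    return 1.0 - prev[-1] / len(ref)

print(hypothesis, char_accuracy(hypothesis, clean[0]))

In practice the model would be fine-tuned over many epochs on sentence-aligned pairs of OCR output and ground-truth text; the single optimizer step above only illustrates the training signal.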
Anthology ID:
2021.wnut-1.31
Volume:
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Month:
November
Year:
2021
Address:
Online
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
284–290
URL:
https://aclanthology.org/2021.wnut-1.31
DOI:
10.18653/v1/2021.wnut-1.31
Cite (ACL):
Elizabeth Soper, Stanley Fujimoto, and Yen-Yun Yu. 2021. BART for Post-Correction of OCR Newspaper Text. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 284–290, Online. Association for Computational Linguistics.
Cite (Informal):
BART for Post-Correction of OCR Newspaper Text (Soper et al., WNUT 2021)
PDF:
https://aclanthology.org/2021.wnut-1.31.pdf