Stanley Fujimoto


2021

BART for Post-Correction of OCR Newspaper Text
Elizabeth Soper | Stanley Fujimoto | Yen-Yun Yu
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Optical character recognition (OCR) from newspaper page images is susceptible to noise due to degradation of old documents and variation in typesetting. In this report, we present a novel approach to OCR post-correction. We cast error correction as a translation task, and fine-tune BART, a transformer-based sequence-to-sequence language model pretrained to denoise corrupted text. We are the first to use sentence-level transformer models for OCR post-correction, and our best model achieves a 29.4% improvement in character accuracy over the original noisy OCR text. Our results demonstrate the utility of pretrained language models for dealing with noisy text.
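
To make the reported metric concrete, the following is a minimal sketch of how character accuracy could be computed for noisy and corrected text against a ground-truth transcription. The abstract does not define the metric, so the definition used here (one minus the Levenshtein distance divided by the reference length) is an assumption, as are the function names `levenshtein` and `character_accuracy` and the example strings. It is not the authors' evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    # Keep the shorter string in the inner loop so the row buffers stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(hypothesis: str, reference: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - levenshtein(hypothesis, reference) / len(reference))


if __name__ == "__main__":
    # Hypothetical strings for illustration only; not data from the paper.
    reference = "The quick brown fox"
    noisy_ocr = "Tbe qu1ck hrown f0x"
    corrected = "The quick brown fox"

    before = character_accuracy(noisy_ocr, reference)
    after = character_accuracy(corrected, reference)
    print(f"accuracy before correction: {before:.3f}")
    print(f"accuracy after correction:  {after:.3f}")
```

Under this assumed definition, an improvement can be reported either as the absolute difference between the two accuracies or relative to the baseline; the abstract does not say which convention the 29.4% figure follows.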