Effective Synthetic Data and Test-Time Adaptation for OCR Correction

Shuhao Guan, Cheng Xu, Moule Lin, Derek Greene


Abstract
Post-OCR technology is used to correct errors in the text produced by OCR systems. This study introduces a method for constructing post-OCR synthetic data with different noise levels using weak supervision. We define Character Error Rate (CER) thresholds for “effective” and “ineffective” synthetic data, allowing us to create more useful multi-noise level synthetic datasets. Furthermore, we propose Self-Correct-Noise Test-Time Adaptation (SCN-TTA), which combines self-correction and noise generation mechanisms. SCN-TTA allows a model to dynamically adjust to test data without relying on labels, effectively handling proper nouns in long texts and further reducing CER. In our experiments, we evaluate a range of models, including multiple PLMs and LLMs. Results indicate that our method yields models that are effective across diverse text types. Notably, the ByT5 model achieves a CER reduction of 68.67% without relying on manually annotated data.
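The abstract relies on two quantities: the Character Error Rate and the percentage by which correction reduces it. The sketch below is illustrative only and is not the authors' code. It assumes CER is the character-level Levenshtein distance divided by the reference length, and that "CER reduction" is measured relative to the CER of the uncorrected OCR output. The function names and the `low`/`high` bounds are hypothetical; the abstract does not state the thresholds the paper uses.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def cer_reduction(cer_before: float, cer_after: float) -> float:
    """Relative CER reduction in percent (assumed to be relative to cer_before)."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return 100.0 * (cer_before - cer_after) / cer_before


def keep_effective(pairs: List[Tuple[str, str]],
                   low: float, high: float) -> List[Tuple[str, str]]:
    """Keep (noisy, clean) pairs whose CER lies in [low, high].

    The bounds are hypothetical placeholders, not the paper's thresholds.
    """
    return [(noisy, clean) for noisy, clean in pairs
            if low <= cer(noisy, clean) <= high]
```

For example, `cer("tbe cat", "the cat")` is one substitution over seven reference characters, so a filter with placeholder bounds of 0.05 and 0.5 would keep that pair while dropping both an exact copy and a fully garbled string.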
Anthology ID:
2024.emnlp-main.862
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
15412–15425
URL:
https://aclanthology.org/2024.emnlp-main.862
Cite (ACL):
Shuhao Guan, Cheng Xu, Moule Lin, and Derek Greene. 2024. Effective Synthetic Data and Test-Time Adaptation for OCR Correction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15412–15425, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Effective Synthetic Data and Test-Time Adaptation for OCR Correction (Guan et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.862.pdf