Effect of data quality on the automated identification of register features in Eighteenth Century Collections Online

Aatu Liimatta


Abstract
Many large-scale investigations of textual data rely on the automated identification of linguistic features. However, if the textual data is of lower quality, automated identification of linguistic features, particularly more complex ones, can be severely hampered. Data quality problems are particularly prominent in large datasets of historical text that have been made machine-readable using optical character recognition (OCR) technology, but it is unclear how much the identification of individual linguistic features is affected by noisy OCR output, and how features of varying complexity are affected differently. In this paper, I analyze the effect of OCR quality on the automated identification of the set of linguistic features commonly used for multi-dimensional register analysis (MDA) by comparing their observed frequencies in the OCR-processed Eighteenth Century Collections Online (ECCO) with those in a clean baseline (ECCO-TCP). The results show that the identification of most features is increasingly disrupted as OCR quality decreases, but different features start degrading at different OCR quality levels and do so at different rates.
Anthology ID:
2023.nlp4dh-1.6
Volume:
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Month:
December
Year:
2023
Address:
Tokyo, Japan
Editors:
Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter
Venues:
NLP4DH | IWCLUL
Publisher:
Association for Computational Linguistics
Pages:
45–51
URL:
https://aclanthology.org/2023.nlp4dh-1.6
Cite (ACL):
Aatu Liimatta. 2023. Effect of data quality on the automated identification of register features in Eighteenth Century Collections Online. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 45–51, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
Effect of data quality on the automated identification of register features in Eighteenth Century Collections Online (Liimatta, NLP4DH-IWCLUL 2023)
PDF:
https://aclanthology.org/2023.nlp4dh-1.6.pdf