2024
Text Length and the Function of Intentionality: A Case Study of Contrastive Subreddits
Emily Sofi Ohman | Aatu Liimatta
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Text length is of central concern in natural language processing (NLP) tasks, yet it is very much under-researched. In this paper, we use social media data, specifically Reddit, to explore the function of text length and intentionality by contrasting subreddits of the same topic where one is considered more serious/professional/academic and the other more relaxed/beginner/layperson. We hypothesize that word choices are more deliberate and intentional in the more in-depth and professional subreddits with texts subsequently becoming longer as a function of this intentionality. We argue that this has deep implications for many applied NLP tasks such as emotion and sentiment analysis, fake news and disinformation detection, and other modeling tasks focused on social media and similar platforms where users interact with each other via the medium of text.
2023
Measuring the distribution of Hume’s Scotticisms in the ECCO collection
Iiro Tiihonen | Aatu Liimatta | Lidia Pivovarova | Tanja Säily | Mikko Tolonen
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
This short paper studies the distribution of Scotticisms from a list compiled by David Hume in a large collection of 18th-century publications. We use regular expression search to find the items on the list in the ECCO collection, and then apply regression analysis to test whether the distribution of Scotticisms in works first published in Scotland is significantly different from the distribution of Scotticisms in works first published in England. We further refine our analysis to trace the influence of variables such as publication date, genre and author’s country of origin.
Effect of data quality on the automated identification of register features in Eighteenth Century Collections Online
Aatu Liimatta
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Many large-scale investigations of textual data are based on the automated identification of various linguistic features. However, if the textual data is of lower quality, automated identification of linguistic features, particularly more complex ones, can be severely hampered.
Data quality problems are particularly prominent with large datasets of historical text which have been made machine-readable using optical character recognition (OCR) technology, but it is unclear how much the identification of individual linguistic features is affected by the dirty OCR, and how features of varying complexity are influenced differently.
In this paper, I analyze the effect of OCR quality on the automated identification of the set of linguistic features commonly used for multi-dimensional register analysis (MDA) by comparing their observed frequencies in the OCR-processed Eighteenth Century Collections Online (ECCO) and a clean baseline (ECCO-TCP). The results show that the identification of most features is disturbed more as the OCR quality decreases, but different features start degrading at different OCR quality levels and do so at different rates.