Iiro Tiihonen


2023

Measuring the distribution of Hume’s Scotticisms in the ECCO collection
Iiro Tiihonen | Aatu Liimatta | Lidia Pivovarova | Tanja Säily | Mikko Tolonen
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

This short paper studies the distribution of Scotticisms from a list compiled by David Hume in a large collection of 18th-century publications. We use regular expression search to find the items on the list in the ECCO collection, and then apply regression analysis to test whether the distribution of Scotticisms in works first published in Scotland is significantly different from the distribution of Scotticisms in works first published in England. We further refine our analysis to trace the influence of variables such as publication date, genre and author’s country of origin.

2022

Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model
Iiro Rastas | Yann Ciarán Ryan | Iiro Tiihonen | Mohammadreza Qaraei | Liina Repo | Rohit Babbar | Eetu Mäkelä | Mikko Tolonen | Filip Ginter
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works with a mean absolute error of less than 7 years. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.