Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model

Iiro Rastas, Yann Ciarán Ryan, Iiro Tiihonen, Mohammadreza Qaraei, Liina Repo, Rohit Babbar, Eetu Mäkelä, Mikko Tolonen, Filip Ginter


Abstract
In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works, on average, with less than 7 years absolute error. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.
Anthology ID:
2022.lchange-1.7
Volume:
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venues:
ACL | LChange
Publisher:
Association for Computational Linguistics
Pages:
68–77
URL:
https://aclanthology.org/2022.lchange-1.7
DOI:
10.18653/v1/2022.lchange-1.7
Cite (ACL):
Iiro Rastas, Yann Ciarán Ryan, Iiro Tiihonen, Mohammadreza Qaraei, Liina Repo, Rohit Babbar, Eetu Mäkelä, Mikko Tolonen, and Filip Ginter. 2022. Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, pages 68–77, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model (Rastas et al., LChange 2022)
PDF:
https://aclanthology.org/2022.lchange-1.7.pdf