Iiro Rastas


2022

pdf bib
Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model
Iiro Rastas | Yann Ciarán Ryan | Iiro Tiihonen | Mohammadreza Qaraei | Liina Repo | Rohit Babbar | Eetu Mäkelä | Mikko Tolonen | Filip Ginter
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works, on average, with less than 7 years absolute error. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.

2021

pdf bib
Finnish Paraphrase Corpus
Jenna Kanerva | Filip Ginter | Li-Hsin Chang | Iiro Rastas | Valtteri Skantsi | Jemina Kilpeläinen | Hanna-Mari Kupari | Jenna Saarni | Maija Sevón | Otto Tarkka
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all paraphrase pairs in our corpus 98% are manually classified to be paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility in high quality paraphrase selection in terms of both cost and quality.