Linguistic change and historical periodization of Old Literary Finnish

Niko Partanen, Khalid Alnajjar, Mika Hämäläinen, Jack Rueter


Abstract
In this study, we normalized and lemmatized an Old Literary Finnish corpus using a lemmatization model trained on texts from Agricola. We analyse the error types that appear in different decades, and use word error rate (WER) and the individual error types as proxies for measuring linguistic innovation and change. We show that the proposed approach works: the errors are connected to accumulating changes and innovations, which also result in a continuous decrease in the accuracy of the model. The described error types also guide further work on improving these models and document the currently observed issues. We have also trained word embeddings for four centuries of lemmatized Old Literary Finnish, which are available on Zenodo.
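To make the WER measure concrete, the following is a minimal sketch of one common way to compute it: token-level Levenshtein distance divided by the length of the reference. It is not the authors' code. The abstract does not say how WER was computed, so the function names, the per-decade dictionary layout, and the sample tokens are all hypothetical.

```python
from typing import Dict, Sequence, Tuple


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Token-level edit distance divided by the number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def wer_by_decade(
    samples: Dict[int, Tuple[Sequence[str], Sequence[str]]]
) -> Dict[int, float]:
    """Map each decade to the WER of its (reference, hypothesis) pair."""
    return {decade: word_error_rate(ref, hyp)
            for decade, (ref, hyp) in samples.items()}


# Hypothetical gold lemmas vs. model output; not taken from the corpus.
gold = ["minä", "olla", "hyvä", "paimen"]
predicted = ["minä", "olla", "hyuä", "paimen"]
print(word_error_rate(gold, predicted))  # 0.25
```

Under this definition, a WER that rises from one decade to the next would be read as the texts drifting further from the Agricola-era language the model was trained on.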
Anthology ID:
2021.lchange-1.4
Volume:
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | IJCNLP | LChange
Publisher:
Association for Computational Linguistics
Pages:
21–27
URL:
https://aclanthology.org/2021.lchange-1.4
DOI:
10.18653/v1/2021.lchange-1.4
PDF:
https://aclanthology.org/2021.lchange-1.4.pdf