Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

Mika Hämäläinen, Niko Partanen, Khalid Alnajjar


Abstract
Texts written in Old Literary Finnish represent the first literary works ever written in Finnish, starting from the 16th century. Several projects in Finland have digitized old publications and made them available for research use. However, applying modern NLP methods to such data poses great challenges. In this paper, we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy on texts written by Agricola and 87.7% accuracy on other contemporary out-of-domain text. Our method has been made freely available on Zenodo and GitHub.
Anthology ID:
2021.jeptalnrecital-taln.18
Volume:
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale
Month:
June
Year:
2021
Address:
Lille, France
Editors:
Pascal Denis, Natalia Grabar, Amel Fraisse, Rémi Cardon, Bernard Jacquemin, Eric Kergosien, Antonio Balvet
Venue:
JEP/TALN/RECITAL
Publisher:
ATALA
Pages:
189–198
URL:
https://aclanthology.org/2021.jeptalnrecital-taln.18
Cite (ACL):
Mika Hämäläinen, Niko Partanen, and Khalid Alnajjar. 2021. Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography. In Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, pages 189–198, Lille, France. ATALA.
Cite (Informal):
Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography (Hämäläinen et al., JEP/TALN/RECITAL 2021)
PDF:
https://aclanthology.org/2021.jeptalnrecital-taln.18.pdf
Code:
mikahama/murre