Benchmarking Finnish Lemmatizers across Historical and Contemporary Texts

Emily Öhman, Leo Huovinen, Mika Hämäläinen


Abstract
Lemmatization is crucial in natural language processing (NLP) for languages like Finnish, where complex inflectional morphology significantly affects downstream tasks such as parsing, named entity recognition, and sentiment analysis. This study evaluates the accuracy and efficiency of several Finnish lemmatizers on the Project Gutenberg corpus, which includes diverse Finnish-language texts from different periods. Notably, this is the first study to employ Trankit for Finnish lemmatization, providing novel insights into its performance. The study also emphasizes the integration of Murre preprocessing, which yields substantial improvements in lemmatization results. By comparing traditional and neural-network-based approaches, this paper aims to help NLP practitioners working with Finnish select tools based on dataset characteristics and processing constraints.
Anthology ID:
2025.iwclul-1.2
Volume:
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages
Month:
December
Year:
2025
Address:
Joensuu, Finland
Editors:
Mika Hämäläinen, Michael Rießler, Eiaki V. Morooka, Lev Kharlashkin
Venues:
IWCLUL | WS
SIG:
SIGUR
Publisher:
Association for Computational Linguistics
Pages:
6–11
URL:
https://aclanthology.org/2025.iwclul-1.2/
Cite (ACL):
Emily Öhman, Leo Huovinen, and Mika Hämäläinen. 2025. Benchmarking Finnish Lemmatizers across Historical and Contemporary Texts. In Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages, pages 6–11, Joensuu, Finland. Association for Computational Linguistics.
Cite (Informal):
Benchmarking Finnish Lemmatizers across Historical and Contemporary Texts (Öhman et al., IWCLUL 2025)
PDF:
https://aclanthology.org/2025.iwclul-1.2.pdf