Lemmatization of Polish Multi-word Expressions

Magdalena Król; Aleksander Smywiński-Pohl; Zbigniew Kaleta; Paweł Lewkowicz

doi:10.18653/v1/2025.emnlp-main.1126

Abstract

This paper explores the lemmatization of multi-word expressions (MWEs) and proper names in Polish – tasks complicated by linguistic irregularities and historical factors. Instead of using rule-based methods, we apply a machine learning approach with fine-tuned plT5 and mT5 models. We trained and validated the models on enhanced gold-standard data from the 2019 PolEval task and evaluated the impact of additional fine-tuning on a silver-standard dataset derived from Wikipedia. Two setups were tested: one without context, and one using left-side context of the target MWE. Our best model achieved 86.23% AccCS (Accuracy Case-Sensitive), 89.43% AccCI (Accuracy Case-Insensitive), and a combined score of 88.79%, setting a new state-of-the-art for Polish MWE and named entity lemmatization, as confirmed by the PolEval maintainers. We also evaluated optimization and quantization techniques to reduce model size and inference time with minimal quality loss.
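The three reported figures are related, and a small sketch may make that concrete. The abstract does not define the combined score. The weights below (0.2 for AccCS, 0.8 for AccCI) are an assumption taken from the PolEval 2019 lemmatization task scoring, and they do reproduce the reported 88.79% from the two accuracies. The function names and the toy phrases are hypothetical, and the official evaluator may normalize strings differently.

```python
# Sketch of the evaluation metrics named in the abstract.
# Assumption: combined score = 0.2 * AccCS + 0.8 * AccCI (PolEval 2019 convention).

def acc_cs(predictions, references):
    """Case-sensitive accuracy: share of exact string matches."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

def acc_ci(predictions, references):
    """Case-insensitive accuracy: share of matches after lowercasing."""
    matches = sum(p.lower() == r.lower() for p, r in zip(predictions, references))
    return matches / len(references)

def combined_score(cs, ci, w_cs=0.2, w_ci=0.8):
    """Weighted combination of the two accuracies (weights are assumed)."""
    return w_cs * cs + w_ci * ci

# Toy data (hypothetical): the second prediction differs from the reference only in case.
references = ["Uniwersytet Jagielloński", "Akademia Górniczo-Hutnicza"]
predictions = ["Uniwersytet Jagielloński", "akademia górniczo-hutnicza"]

toy_cs = acc_cs(predictions, references)   # 0.5
toy_ci = acc_ci(predictions, references)   # 1.0
toy_score = combined_score(toy_cs, toy_ci)

# Consistency check against the figures reported in the abstract.
reported = round(combined_score(86.23, 89.43), 2)  # 88.79
```

Under this weighting, casing errors cost comparatively little: a lemma that is correct except for capitalization still counts toward the term that carries 80% of the score.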

Anthology ID: 2025.emnlp-main.1126
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 22148–22157
URL: https://aclanthology.org/2025.emnlp-main.1126/
DOI: 10.18653/v1/2025.emnlp-main.1126
Cite (ACL): Magdalena Król, Aleksander Smywiński-Pohl, Zbigniew Kaleta, and Paweł Lewkowicz. 2025. Lemmatization of Polish Multi-word Expressions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22148–22157, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Lemmatization of Polish Multi-word Expressions (Król et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1126.pdf
Checklist: 2025.emnlp-main.1126.checklist.pdf