BibTeX
@inproceedings{barbu-etal-2025-improving,
title = "Improving {E}stonian Text Simplification through Pretrained Language Models and Custom Datasets",
author = "Barbu, Eduard and
Muru, Meeri-Ly and
Malva, Sten Marcus",
editor = "Angelova, Galia and
Kunilovskaya, Maria and
Escribe, Marie and
Mitkov, Ruslan",
booktitle = "Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era",
month = sep,
year = "2025",
address = "Varna, Bulgaria",
publisher = "INCOMA Ltd., Shoumen, Bulgaria",
url = "https://aclanthology.org/2025.ranlp-1.16/",
pages = "133--142",
abstract = "This paper presents a method for text simplification based on two neural architectures: a neural machine translation (NMT) model and a fine-tuned large language model (LLaMA). Given the scarcity of existing resources for Estonian, a new dataset was created by combining manually translated corpora with GPT-4.0-generated simplifications. OpenNMT was selected as a representative NMT-based system, while LLaMA was fine-tuned on the constructed dataset. Evaluation shows LLaMA outperforms OpenNMT in grammaticality, readability, and meaning preservation. These results underscore the effectiveness of large language models for text simplification in low-resource language settings. The complete dataset, fine-tuning scripts, and evaluation pipeline are provided in a publicly accessible supplementary package to support reproducibility and adaptation to other languages."
}

MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="barbu-etal-2025-improving">
<titleInfo>
<title>Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets</title>
</titleInfo>
<name type="personal">
<namePart type="given">Eduard</namePart>
<namePart type="family">Barbu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Meeri-Ly</namePart>
<namePart type="family">Muru</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sten</namePart>
<namePart type="given">Marcus</namePart>
<namePart type="family">Malva</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-09</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era</title>
</titleInfo>
<name type="personal">
<namePart type="given">Galia</namePart>
<namePart type="family">Angelova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Maria</namePart>
<namePart type="family">Kunilovskaya</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Marie</namePart>
<namePart type="family">Escribe</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ruslan</namePart>
<namePart type="family">Mitkov</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>INCOMA Ltd., Shoumen, Bulgaria</publisher>
<place>
<placeTerm type="text">Varna, Bulgaria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>This paper presents a method for text simplification based on two neural architectures: a neural machine translation (NMT) model and a fine-tuned large language model (LLaMA). Given the scarcity of existing resources for Estonian, a new dataset was created by combining manually translated corpora with GPT-4.0-generated simplifications. OpenNMT was selected as a representative NMT-based system, while LLaMA was fine-tuned on the constructed dataset. Evaluation shows LLaMA outperforms OpenNMT in grammaticality, readability, and meaning preservation. These results underscore the effectiveness of large language models for text simplification in low-resource language settings. The complete dataset, fine-tuning scripts, and evaluation pipeline are provided in a publicly accessible supplementary package to support reproducibility and adaptation to other languages.</abstract>
<identifier type="citekey">barbu-etal-2025-improving</identifier>
<location>
<url>https://aclanthology.org/2025.ranlp-1.16/</url>
</location>
<part>
<date>2025-09</date>
<extent unit="page">
<start>133</start>
<end>142</end>
</extent>
</part>
</mods>
</modsCollection>

Endnote
%0 Conference Proceedings
%T Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets
%A Barbu, Eduard
%A Muru, Meeri-Ly
%A Malva, Sten Marcus
%Y Angelova, Galia
%Y Kunilovskaya, Maria
%Y Escribe, Marie
%Y Mitkov, Ruslan
%S Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
%D 2025
%8 September
%I INCOMA Ltd., Shoumen, Bulgaria
%C Varna, Bulgaria
%F barbu-etal-2025-improving
%X This paper presents a method for text simplification based on two neural architectures: a neural machine translation (NMT) model and a fine-tuned large language model (LLaMA). Given the scarcity of existing resources for Estonian, a new dataset was created by combining manually translated corpora with GPT-4.0-generated simplifications. OpenNMT was selected as a representative NMT-based system, while LLaMA was fine-tuned on the constructed dataset. Evaluation shows LLaMA outperforms OpenNMT in grammaticality, readability, and meaning preservation. These results underscore the effectiveness of large language models for text simplification in low-resource language settings. The complete dataset, fine-tuning scripts, and evaluation pipeline are provided in a publicly accessible supplementary package to support reproducibility and adaptation to other languages.
%U https://aclanthology.org/2025.ranlp-1.16/
%P 133-142

Markdown (Informal)
[Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets](https://aclanthology.org/2025.ranlp-1.16/) (Barbu et al., RANLP 2025)

ACL
Eduard Barbu, Meeri-Ly Muru, and Sten Marcus Malva. 2025. [Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets](https://aclanthology.org/2025.ranlp-1.16/). In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 133–142, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.