Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Chahan Vidal-Gorène; Bastien Kindt; Florian Cafiero

Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero

Abstract

Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.

Anthology ID:: 2026.loreslm-1.28
Volume:: Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Month:: March
Year:: 2026
Address:: Rabat, Morocco
Editors:: Hansi Hettiarachchi, Tharindu Ranasinghe, Alistair Plum, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Venue:: LoResLM
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 324–334
Language:
URL:: https://aclanthology.org/2026.loreslm-1.28/
DOI:
Bibkey:
Cite (ACL):: Chahan Vidal-Gorène, Bastien Kindt, and Florian Cafiero. 2026. Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac. In Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026), pages 324–334, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):: Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac (Vidal-Gorène et al., LoResLM 2026)
Copy Citation:
PDF:: https://aclanthology.org/2026.loreslm-1.28.pdf

PDF Cite Search Fix data