What Resources Matter for Interlinear Glossing? Using LLMs and RAG for the Low-Resource Mapudungun Language

Anaís Almendra; Arianna Bisazza; Claudio Gutierrez; Felipe Hasler

Abstract

Interlinear glossing is essential for the study and revitalization of endangered languages, but it remains a time-consuming process that requires extensive linguistic expertise. Recent advances in Large Language Models (LLMs) offer a potential solution. In this work, we use the Gemini 2.5 Pro model to automatically generate interlinear glosses for Mapudungun, an endangered language spoken in Chile and Argentina. We investigate which resources, supplied to the model through Retrieval-Augmented Generation (RAG), yield the best results, comparing a formal grammar, a dictionary, a small annotated corpus, and a combination of all three. Our evaluation shows that while dictionary integration causes a significant degradation in performance, grounding the model with a structured corpus maximizes accuracy relative to the resources employed. Notably, we find that a remarkably small dataset of 589 meaning units provides enough normative guidance to significantly improve performance on the morphological tagging task. This work highlights the viability of using minimally annotated corpora to assist in the documentation of morphologically complex languages.

Anthology ID: 2026.americasnlp-6.6
Volume: Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
Venues: AmericasNLP | WS
Publisher: Association for Computational Linguistics
Pages: 64–73
URL: https://aclanthology.org/2026.americasnlp-6.6/
Cite (ACL): Anaís Almendra, Arianna Bisazza, Claudio Gutierrez, and Felipe Hasler. 2026. What Resources Matter for Interlinear Glossing? Using LLMs and RAG for the Low-Resource Mapudungun Language. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 64–73, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): What Resources Matter for Interlinear Glossing? Using LLMs and RAG for the Low-Resource Mapudungun Language (Almendra et al., AmericasNLP 2026)
PDF: https://aclanthology.org/2026.americasnlp-6.6.pdf
