Corpora duplication for NLP in low-resource languages: A case study of Nahuatl

Juan-José Guzmán-Landa; Juan-Manuel Torres-Moreno; Luis Moreno Jimenez; Elvys Linhares-Pontes; Miguel Figueroa-Saavedra; Graham Ranger; Martha Lorena Avendaño Garrido

Abstract

In this paper, we aim to answer the following question: could corpus duplication be useful in Natural Language Processing (NLP) for low-resource languages? In these languages (or pi-languages), corpora available for training Large Language Models are virtually non-existent. Specifically, we study the impact of corpus expansion in Nahuatl, an agglutinative and polysynthetic Amerindian pi-language characterised by extensive dialectal variation. Our goal is to increase the size of Nahuatl corpora, which currently consist of a limited number of tokens, through controlled duplication techniques. Our experimental setup employs incremental duplication alongside appropriate corpus balancing, with the objective of training embeddings optimised for downstream NLP tasks. Consequently, static embeddings were trained and evaluated on a sentence-level semantic similarity task. Our results show a significant improvement in performance when incremental duplication is applied, compared to results obtained without corpus expansion. To our knowledge, this technique has not yet been explored in this field.
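The abstract does not spell out the duplication procedure, so the following is only a minimal sketch of one possible reading of "incremental duplication": the same corpus is repeated 1, 2, ..., k times, and each expanded version could then be used to train a separate set of embeddings. The function names, the toy sentences, and the whitespace tokenisation are all assumptions made for illustration; the corpus balancing step mentioned above is not covered here.

```python
def duplicate_corpus(sentences, factor):
    """Return the corpus repeated `factor` times (factor=1 leaves it unchanged).

    Hypothetical helper: the paper's actual duplication scheme may differ.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return list(sentences) * factor


def incremental_duplicates(sentences, max_factor):
    """Yield (factor, expanded_corpus) for factor = 1 .. max_factor."""
    for factor in range(1, max_factor + 1):
        yield factor, duplicate_corpus(sentences, factor)


def token_count(sentences):
    """Count whitespace-separated tokens (an assumed, simplistic tokenisation)."""
    return sum(len(sentence.split()) for sentence in sentences)


# Placeholder sentences standing in for a real corpus (not Nahuatl data).
toy_corpus = ["placeholder sentence one", "placeholder sentence two here"]

# Token count of each expanded corpus, keyed by duplication factor.
sizes = {factor: token_count(corpus)
         for factor, corpus in incremental_duplicates(toy_corpus, 3)}
```

In this reading, the token count grows linearly with the duplication factor while the vocabulary stays the same, so any change in embedding quality would come from the additional training exposure rather than from new data.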

Anthology ID: 2026.americasnlp-6.11
Volume: Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
Venues: AmericasNLP | WS
Publisher: Association for Computational Linguistics
Pages: 115–127
URL: https://aclanthology.org/2026.americasnlp-6.11/
Cite (ACL): Juan Jose Guzman Landa, Juan-Manuel Torres-Moreno, Luis Moreno Jimenez, Elvys Linhares Pontes, Miguel Figueroa-Saavedra, Graham Ranger, and Martha Lorena Avendaño Garrido. 2026. Corpora duplication for NLP in low-resource languages: A case study of Nahuatl. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 115–127, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Corpora duplication for NLP in low-resource languages: A case study of Nahuatl (Guzman Landa et al., AmericasNLP 2026)
PDF: https://aclanthology.org/2026.americasnlp-6.11.pdf
Supplementary material: 2026.americasnlp-6.11.SupplementaryMaterial.zip
