Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language

Raphaël Merx, Aso Mahmudi, Katrina Langford, Leo Alberto de Araujo, Ekaterina Vylomova


Abstract
This study explores the use of large language models (LLMs) for translating English into Mambai, a low-resource Austronesian language spoken in Timor-Leste by approximately 200,000 native speakers. Leveraging a novel corpus derived from a Mambai language manual and additional sentences translated by a native speaker, we examine the efficacy of few-shot LLM prompting for machine translation (MT) in this low-resource context. Our methodology involves the strategic selection of parallel sentences and dictionary entries for prompting, aiming to enhance translation accuracy, using open-source and proprietary LLMs (Llama 2 70B, Mixtral 8x7B, GPT-4). We find that including dictionary entries in prompts, together with a mix of sentences retrieved through TF-IDF and semantic embeddings, significantly improves translation quality. However, translation accuracy varies between test sets, highlighting the importance of diverse corpora for evaluating low-resource MT. This research provides insights into few-shot LLM prompting for low-resource MT and makes available an initial corpus for the Mambai language.
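
The prompting approach summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt template, and the toy data are all assumptions, and the Mambai strings are placeholders rather than real translations. Only TF-IDF retrieval is shown; the semantic-embedding retrieval mentioned in the abstract would need an embedding model and is left out here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split English text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for tokenized docs, plus the IDF table."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF so that no term gets a zero or negative weight.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, pairs, k=2):
    """Return the k (English, Mambai) pairs whose English side is closest to query."""
    vecs, idf = tfidf_vectors([tokenize(src) for src, _ in pairs])
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(pairs)), key=lambda i: cosine(q, vecs[i]), reverse=True)
    return [pairs[i] for i in ranked[:k]]


def build_prompt(query, pairs, dictionary, k=2):
    """Assemble a few-shot prompt from dictionary hits and retrieved sentence pairs."""
    lines = ["Translate from English to Mambai.", ""]
    # Look up each distinct query word in the (hypothetical) bilingual dictionary.
    entries = [(w, dictionary[w]) for w in dict.fromkeys(tokenize(query)) if w in dictionary]
    if entries:
        lines.append("Dictionary entries:")
        lines += [f"- {w}: {t}" for w, t in entries]
        lines.append("")
    lines.append("Example translations:")
    for src, tgt in retrieve(query, pairs, k):
        lines += [f"English: {src}", f"Mambai: {tgt}", ""]
    lines += [f"English: {query}", "Mambai:"]
    return "\n".join(lines)


# Toy data: the target sides are placeholders, not real Mambai.
PAIRS = [
    ("Where is the school?", "<MAMBAI_1>"),
    ("The child goes to school.", "<MAMBAI_2>"),
    ("I drink water.", "<MAMBAI_3>"),
]
DICTIONARY = {"school": "<MAMBAI_WORD_SCHOOL>", "water": "<MAMBAI_WORD_WATER>"}

if __name__ == "__main__":
    print(build_prompt("The school is far.", PAIRS, DICTIONARY))
```

The resulting string would be sent to an LLM as the prompt. A mixed retrieval strategy, as described in the abstract, could fill some of the `k` example slots from an embedding-based ranking instead of the TF-IDF ranking.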
Anthology ID: 2024.eurali-1.1
Volume: Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Atul Kr. Ojha, Sina Ahmadi, Silvie Cinková, Theodorus Fransen, Chao-Hong Liu, John P. McCrae
Venues: EURALI | WS
Publisher: ELRA and ICCL
Pages: 1–11
URL: https://aclanthology.org/2024.eurali-1.1
Cite (ACL): Raphaël Merx, Aso Mahmudi, Katrina Langford, Leo Alberto de Araujo, and Ekaterina Vylomova. 2024. Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language. In Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024, pages 1–11, Torino, Italia. ELRA and ICCL.
Cite (Informal): Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language (Merx et al., EURALI-WS 2024)
PDF: https://aclanthology.org/2024.eurali-1.1.pdf