Retrieval-Augmented Machine Translation with Unstructured Knowledge

Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou


Abstract
Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance MT models. However, a large amount of world knowledge is organized in unstructured documents and might not be fully paired across languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs’ retrieval-augmented MT ability. RAGtrans contains 169K MT samples collected via GPT-4o and human translators. Documents in various languages are also provided to supply knowledge for these samples. Based on RAGtrans, we further propose a multi-task training method that teaches LLMs to use information from multilingual documents during translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.6-3.1 BLEU and 1.0-2.0 COMET scores in En-Zh, and by 1.7-2.9 BLEU and 2.1-2.7 COMET scores in En-De. We also summarize the critical difficulties that current LLMs face on this task.
Anthology ID:
2025.findings-emnlp.313
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5858–5871
URL:
https://aclanthology.org/2025.findings-emnlp.313/
Cite (ACL):
Jiaan Wang, Fandong Meng, Yingxue Zhang, and Jie Zhou. 2025. Retrieval-Augmented Machine Translation with Unstructured Knowledge. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5858–5871, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Retrieval-Augmented Machine Translation with Unstructured Knowledge (Wang et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.313.pdf
Checklist:
2025.findings-emnlp.313.checklist.pdf