OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

Raphaël Merx; Hanna Suominen; Trevor Cohn; Ekaterina Vylomova

Abstract

Health machine translation (MT) is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in the health domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization’s e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.

Anthology ID: 2025.wmt-1.8
Volume: Proceedings of the Tenth Conference on Machine Translation
Month: November
Year: 2025
Address: Suzhou, China
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 142–160
URL: https://aclanthology.org/2025.wmt-1.8/
Cite (ACL): Raphael Merx, Hanna Suominen, Trevor Cohn, and Ekaterina Vylomova. 2025. OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages. In Proceedings of the Tenth Conference on Machine Translation, pages 142–160, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages (Merx et al., WMT 2025)
PDF: https://aclanthology.org/2025.wmt-1.8.pdf
