BOUQuET : dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

Pierre Andrews; Mikel Artetxe; Mariano Coria Meglioli; Marta R. Costa-jussà; Joe Chuang; David Dale; Mark Duppenthaler; Nathanial Paul Ekberg; Cynthia Gao; Daniel Edward Licht; Jean Maillard; Alexandre Mourachko; Christophe Ropers; Safiyyah Saleem; Eduardo Sánchez; Ioannis Tsiamas; Arina Turkatenko; Albert Ventayol-Boada; Shireen Yates

doi:10.18653/v1/2025.emnlp-main.1400

Abstract

BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark, and a broader collaborative initiative. The dataset is handcrafted in 8 non-English languages (Arabic, covering both Egyptian Arabic and Modern Standard Arabic; French; German; Hindi; Indonesian; Mandarin Chinese; Russian; and Spanish). These source languages are among the most widely spoken, so they have the potential to serve as pivot languages that enable more accurate translations. The dataset is multicentric to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is especially suitable for crowd-sourced extension, for which we are launching a call aiming at collecting a multi-way parallel corpus covering any written language. The dataset is freely available at https://huggingface.co/datasets/facebook/bouquet.

Anthology ID: 2025.emnlp-main.1400
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 27515–27535
URL: https://aclanthology.org/2025.emnlp-main.1400/
DOI: 10.18653/v1/2025.emnlp-main.1400
Cite (ACL): Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-jussà, Joe Chuang, David Dale, Mark Duppenthaler, Nathanial Paul Ekberg, Cynthia Gao, Daniel Edward Licht, Jean Maillard, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Eduardo Sánchez, Ioannis Tsiamas, Arina Turkatenko, Albert Ventayol-Boada, and Shireen Yates. 2025. BOUQuET : dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27515–27535, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): BOUQuET : dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation (Andrews et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1400.pdf
Checklist: 2025.emnlp-main.1400.checklist.pdf
