Xu Huang
Nanjing
2025
Findings of the WMT25 Terminology Translation Task: Terminology is Useful Especially for Good MTs
Kirill Semenov | Xu Huang | Vilém Zouhar | Nathaniel Berger | Dawei Zhu | Arturo Oncevay | Pinzhen Chen
Proceedings of the Tenth Conference on Machine Translation
The WMT25 Terminology Translation Task releases new resources in high-stakes domains and investigates the capabilities of translation systems to accurately and consistently translate specialized terms. This year, we feature new domain and language coverage over previous editions, introducing two distinct tracks: (1) sentence-level translation in the information technology domain for English→German, English→Russian, and English→Spanish, and (2) document-level translation in the finance domain for English↔Traditional Chinese with a document-level one-to-many dictionary. Participants are challenged to translate texts under three modes: no terminology, proper terminology, and random terminology, allowing for a causal analysis of terminology utility. Evaluation combines overall quality, terminology accuracy, and terminology consistency. This shared task attracted broad participation, with 13 teams submitting 20 systems in Track 1 and 4 teams participating in Track 2. The results show that providing proper terminology consistently boosts both overall translation quality and term accuracy, whereas reliance on random terminology yields smaller gains. Despite the near-saturation of sentence-level benchmarks, document-level finance translation still falls short, indicating an urgent need for long-form evaluation and more robust metrics tailored to professional domains.
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Xu Huang | Wenhao Zhu | Hanxu Hu | Conghui He | Lei Li | Shujian Huang | Fei Yuan
Findings of the Association for Computational Linguistics: EMNLP 2025
Existing multilingual benchmarks focus primarily on language understanding tasks. There is a lack of benchmarks to measure comprehensive critical capabilities of large language models (LLMs) across diverse languages, including instruction following, reasoning, code generation, and long context understanding. To bridge this gap, we develop BenchMAX, a multi-way multilingual benchmark that covers 10 diverse tasks, to evaluate LLMs’ general abilities across many languages. To ensure high data quality, each sample is post-edited by three native annotators after machine-translating from English into 16 languages. Extensive experiments on BenchMAX reveal uneven utilization of core capabilities across languages, emphasizing the performance gaps that scaling model size alone does not resolve. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models
Hengyu Luo | Zihao Li | Joseph Attieh | Sawal Devkota | Ona de Gibert | Xu Huang | Shaoxiong Ji | Peiqin Lin | Bhavani Sai Praneeth Varma Mantina | Ananda Sreenidhi | Raúl Vázquez | Mengjie Wang | Samea Yusofi | Fei Yuan | Jörg Tiedemann
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary languages. Evaluating these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks suffer from inconsistency across different benchmarks, being disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this critical challenge of fragmented and inconsistent multilingual evaluation, we introduce GlotEval, a unified and lightweight framework that systematically integrates 27 benchmarks under a standardized ISO 639-3 language identifier system, allowing for seamless incorporation of new benchmarks. Supporting nine key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, intrinsic evaluation, instruction following and reasoning), spanning over dozens to hundreds of languages, GlotEval uniquely enables language-specific, cross-benchmark analysis and non-English-centric evaluations at a scale previously less practical for many researchers. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval’s applicability for multilingual and language-specific evaluations.
2024
Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation
Xu Huang | Zhirui Zhang | Xiang Geng | Yichao Du | Jiajun Chen | Shujian Huang
Findings of the Association for Computational Linguistics: ACL 2024
This study investigates how Large Language Models (LLMs) leverage source and reference data in machine translation evaluation task, aiming to better understand the mechanisms behind their remarkable performance in this task. We design the controlled experiments across various input modes and model types, and employ both coarse-grained and fine-grained prompts to discern the utility of source versus reference information. We find that reference information significantly enhances the evaluation accuracy, while surprisingly, source information sometimes is counterproductive, indicating LLMs’ inability to fully leverage the cross-lingual capability when evaluating translations. Further analysis of the fine-grained evaluation and fine-tuning experiments show similar results. These findings also suggest a potential research direction for LLMs that fully exploits the cross-lingual capability of LLMs to achieve better performance in machine translation evaluation tasks.
2023
IMTLab: An Open-Source Platform for Building, Evaluating, and Diagnosing Interactive Machine Translation Systems
Xu Huang | Zhirui Zhang | Ruize Gao | Yichao Du | Lemao Liu | Guoping Huang | Shuming Shi | Jiajun Chen | Shujian Huang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
We present IMTLab, an open-source end-to-end interactive machine translation (IMT) system platform that enables researchers to quickly build IMT systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. IMTLab treats the whole interactive translation process as a task-oriented dialogue with a human-in-the-loop setting, in which human interventions can be explicitly incorporated to produce high-quality, error-free translations. To this end, a general communication interface is designed to support the flexible IMT architectures and user policies. Based on the proposed design, we construct a simulated and real interactive environment to achieve end-to-end evaluation and leverage the framework to systematically evaluate previous IMT systems. Our simulated and manual experiments show that the prefix-constrained decoding approach still gains the lowest editing cost in the end-to-end evaluation, while BiTIIMT achieves comparable editing cost with a better interactive experience.
Co-authors
- Shujian Huang (书剑 黄) 3
- Jiajun Chen 2
- Yichao Du 2
- Fei Yuan 2
- Zhirui Zhang 2
- Joseph Attieh 1
- Nathaniel Berger 1
- Pinzhen Chen 1
- Sawal Devkota 1
- Ruize Gao 1
- Xiang Geng 1
- Conghui He 1
- Hanxu Hu 1
- Guoping Huang 1
- Shaoxiong Ji 1
- Lei Li 1
- Zihao Li 1
- Peiqin Lin 1
- Lemao Liu 1
- Hengyu Luo 1
- Bhavani Sai Praneeth Varma Mantina 1
- Arturo Oncevay 1
- Kirill Semenov 1
- Shuming Shi 1
- Ananda Sreenidhi 1
- Jörg Tiedemann 1
- Raúl Vázquez 1
- Mengjie Wang 1
- Samea Yusofi 1
- Dawei Zhu 1
- Wenhao Zhu 1
- Vilém Zouhar 1
- Ona de Gibert 1