Hanxu Hu


2025

pdf bib
LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
Ruijie Hou | Yueyang Jiao | Hanxu Hu | Yingming Li | Wai Lam | Huajian Zhang | Hongyuan Lu
Findings of the Association for Computational Linguistics: EMNLP 2025

The problem of data contamination is now almost inevitable during the development of large language models (LLMs), with the training data commonly integrating those evaluation benchmarks even unintentionally. This problem subsequently makes it hard to benchmark LLMs fairly. Instead of constructing contamination-free datasets (quite hard), we propose a novel framework, LNE-Blocking, to restore model performance prior to contamination on potentially leaked datasets. Our framework consists of two components: contamination detection and disruption operation. For the prompt, the framework first uses the contamination detection method, LNE, to assess the extent of contamination in the model. Based on this, it adjusts the intensity of the disruption operation, Blocking, to elicit non-memorized responses from the model. Our framework is the first to efficiently restore the model’s greedy decoding performance. This comes with a strong performance on multiple datasets with potential leakage risks, and it consistently achieves stable recovery results across different models and varying levels of data contamination. We release the code at https://github.com/RuijieH/LNE-Blocking to facilitate research.

pdf bib
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Xu Huang | Wenhao Zhu | Hanxu Hu | Conghui He | Lei Li | Shujian Huang | Fei Yuan
Findings of the Association for Computational Linguistics: EMNLP 2025

Existing multilingual benchmarks focus primarily on language understanding tasks. There is a lack of benchmarks to measure comprehensive critical capabilities of large language models (LLMs) across diverse languages, including instruction following, reasoning, code generation, and long context understanding. To bridge this gap, we develop BenchMAX, a multi-way multilingual benchmark that covers 10 diverse tasks, to evaluate LLMs’ general abilities across many languages. To ensure high data quality, each sample is post-edited by three native annotators after machine-translating from English into 16 languages. Extensive experiments on BenchMAX reveal uneven utilization of core capabilities across languages, emphasizing the performance gaps that scaling model size alone does not resolve. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.

pdf bib
Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents
Hanxu Hu | Jannis Vamvas | Rico Sennrich
Findings of the Association for Computational Linguistics: EMNLP 2025

LLMs have paved the way for truly simple document-level machine translation, but challenges such as omission errors remain. In this paper, we study a simple method for handling document-level machine translation, by leveraging previous contexts in a multi-turn conversational manner. Specifically, by decomposing documents into segments and iteratively translating them while maintaining previous turns, this method ensures coherent translations without additional training, and can fully re-use the KV cache of previous turns thus minimizing computational overhead. We further propose a ‘source-primed’ method that first provides the whole source document before multi-turn translation. We empirically show this multi-turn method outperforms both translating entire documents in a single turn and translating each segment independently according to multiple automatic metrics in representative LLMs, establishing a strong baseline for document-level translation using LLMs.

pdf bib
Fine-Tuning Large Language Models with Sequential Instructions
Hanxu Hu | Simon Yu | Pinzhen Chen | Edoardo Ponti
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We find that existing instruction-tuned models usually struggle to adhere to a query with multiple intentions, which impairs their performance when the completion of several tasks is demanded by a single command. Hence, this paper teaches models to respond to sequential instructions. Our first attempt stems from a task-driven perspective, manually creating additional intermediate tasks to train multilingual and visual question answering. Next, we develop an automatic and generic process that turns instructions in existing data into diverse and complex task chains. Models that underwent sequential instruction tuning follow a list of instructions better and deliver higher results in coding, maths, and open-ended generation. Moreover, we put forward a new benchmark named SeqEval to evaluate a model’s ability to follow all the instructions in a sequence, which further corroborates the benefits of our sequential instruction tuning method.

2024

pdf bib
CLEANEVAL: Clean Evaluation on Contaminated Large Language Models
Wenhong Zhu | Hongkun Hao | Zhiwei He | Yun-Ze Song | Jiao Yueyang | Yumeng Zhang | Hanxu Hu | Yiran Wei | Rui Wang | Hongyuan Lu
Findings of the Association for Computational Linguistics: NAACL 2024

We are currently in an era of fierce competition among various large language models (LLMs), continuously pushing the boundaries of benchmark performance. However, genuinely assessing the capabilities of these LLMs has become a challenging and critical issue due to potential data contamination. In this paper, we propose a novel and valuable method, Clean-Eval, which mitigates the issue of data contamination and evaluates the LLMs more cleanly. Clean-Eval employs a neural-based model to paraphrase and back-translate the contaminated data into a candidate set, generating expressions with the same meaning but in different surface forms. A semantic detector is then used to filter those generated low-quality samples to narrow down this candidate set. Candidates with moderate BLEURT scores against the original samples are selected as the final evaluation set. According to human assessment, this set is almost semantically equivalent to the original contamination set but expressed differently. We conduct experiments on 20 existing benchmarks across diverse tasks, and results demonstrate that Clean-Eval substantially restores the actual evaluation results on contaminated LLMs under both few-shot learning and fine-tuning scenarios.

2023

pdf bib
Improving User Controlled Table-To-Text Generation Robustness
Hanxu Hu | Yunqing Liu | Zhongyi Yu | Laura Perez-Beltrachini
Findings of the Association for Computational Linguistics: EACL 2023

In this work we study user controlled table-to-text generation where users explore the content in a table by selecting cells and reading a natural language description thereof automatically produce by a natural language generator. Such generation models usually learn from carefully selected cell combinations (clean cell selections); however, in practice users may select unexpected, redundant, or incoherent cell combinations (noisy cell selections). In experiments, we find that models perform well on test sets coming from the same distribution as the train data but their performance drops when evaluated on realistic noisy user inputs. We propose a fine-tuning regime with additional user-simulated noisy cell selections. Models fine-tuned with the proposed regime gain 4.85 BLEU points on user noisy test cases and 1.4 on clean test cases; and achieve comparable state-of-the-art performance on the ToTTo dataset.

pdf bib
Meta-learning For Vision-and-language Cross-lingual Transfer
Hanxu Hu | Frank Keller
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)