Xudong Han - ACL Anthology

Xudong Han

2026

Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.

SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning
Renxi Wang | Honglin Mu | Liqun Ma | Lizhi Lin | Yunlong Feng | Timothy Baldwin | Xudong Han | Haonan Li
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Long-context understanding has emerged as a critical capability for large language models (LLMs). However, evaluating this ability remains challenging. We present SCALAR, a benchmark designed to assess citation-grounded long-context reasoning in academic writing. SCALAR leverages academic papers and their citation structure to automatically generate high-quality ground-truth labels without human annotation. It features controllable difficulty levels and a dynamic updating mechanism that mitigates data contamination. The benchmark includes two tasks: a multiple-choice QA format and a cloze-style citation prediction. We evaluate a range of state-of-the-art LLMs and find that the multiple-choice task effectively distinguishes model capabilities—while human experts achieve over 90% accuracy, most models struggle. The cloze-style task is even more challenging, with no model exceeding 40% accuracy. SCALAR provides a domain-grounded, continuously updating framework for tracking progress in citation-based long-context understanding. Code and data will be publicly released.

2025

NAT: Enhancing Agent Tuning with Negative Samples
Renxi Wang | Xudong Han | Yixuan Zhang | Timothy Baldwin | Haonan Li
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Interaction trajectories between agents and environments have proven effective in tuning LLMs into task-specific agents. However, constructing these trajectories, especially successful trajectories, is often computationally and time intensive due to the relatively low success rates of even the most advanced LLMs, such as GPT-4 and Claude. Additionally, common training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL) not only require large volumes of data but also have specific demands regarding the trajectories used. For instance, existing SFT approaches typically utilize only positive examples, limiting their efficiency in low-resource scenarios. To address this, we introduce Negative-Aware Training (NAT), a straightforward yet effective method that leverages both successful and failed trajectories for fine-tuning, maximizing the utility of limited resources. Experimental results demonstrate that NAT consistently surpasses existing methods, including SFT, DPO, and PPO, across various tasks.

As large language models (LLMs) continue to evolve, leaderboards play a significant role in steering their development. Existing leaderboards often prioritize model capabilities while overlooking safety concerns, leaving a significant gap in responsible AI development. To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of the other. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
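The difference between averaging and a distance-to-optimal score can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the version below is an assumption: both metrics are taken to lie in [0, 1], the optimum is the point (1, 1), and the score is one minus the normalized Euclidean distance to that point. The function names are hypothetical.

```python
import math


def average_score(performance: float, safety: float) -> float:
    """Plain mean of the two metrics (the baseline the abstract contrasts with)."""
    return (performance + safety) / 2.0


def distance_to_optimal_score(performance: float, safety: float) -> float:
    """Assumed formula: 1 minus the normalized Euclidean distance to (1, 1).

    Both inputs are assumed to lie in [0, 1]. The result is 1.0 at the
    optimum and 0.0 when both metrics are 0.
    """
    distance = math.sqrt((1.0 - performance) ** 2 + (1.0 - safety) ** 2)
    return 1.0 - distance / math.sqrt(2.0)


# Two hypothetical models with the same mean but different balance.
balanced = (0.7, 0.7)
lopsided = (1.0, 0.4)

for name, (perf, safe) in {"balanced": balanced, "lopsided": lopsided}.items():
    print(name, average_score(perf, safe), distance_to_optimal_score(perf, safe))
```

Under this assumed formula the two models tie on the plain average, while the balanced model ranks higher on the distance-based score, which is the incentive the abstract describes.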

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
Honglin Mu | Han He | Yuxin Zhou | Yunlong Feng | Yang Xu | Libo Qin | Xiaoming Shi | Zeming Liu | Xudong Han | Qi Shi | Qingfu Zhu | Wanxiang Che
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language model (LLM) safety is a critical issue, with numerous studies employing red team testing to enhance model security. Among these, jailbreak methods explore potential vulnerabilities by crafting malicious prompts that induce model outputs contrary to safety alignments. Existing black-box jailbreak methods often rely on model feedback, repeatedly submitting queries with detectable malicious instructions during the attack search process. Although these approaches are effective, the attacks may be intercepted by content moderators during the search process. We propose an improved transfer attack method that guides malicious prompt construction by locally training a mirror model of the target black-box model through benign data distillation. This method offers enhanced stealth, as it does not involve submitting identifiable malicious instructions to the target model during the search phase. Our approach achieved a maximum attack success rate of 92%, or a balanced value of 80% with an average of 1.5 detectable jailbreak queries per sample against GPT-3.5 Turbo on a subset of AdvBench. These results underscore the need for more robust defense mechanisms.

Demographics and Democracy: Benchmarking LLMs’ Gender Bias and Political Leaning in European Parliament
Jinrui Yang | Xudong Han | Timothy Baldwin
Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)

Loki: An Open-Source Tool for Fact Verification
Haonan Li | Xudong Han | Hao Wang | Yuxia Wang | Minghan Wang | Rui Xing | Yilin Geng | Zenan Zhai | Preslav Nakov | Timothy Baldwin
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

We introduce Loki, an open-source tool designed to address the growing problem of misinformation. Loki adopts a human-centered approach, striking a balance between the quality of fact-checking and the cost of human involvement. It decomposes the fact-checking task into a five-step pipeline: breaking down long texts into individual claims, assessing their check-worthiness, generating queries, retrieving evidence, and verifying the claims. Instead of fully automating the claim verification process, Loki provides essential information at each step to assist human judgment, especially for general users such as journalists and content moderators. Moreover, it has been optimized for latency, robustness, and cost efficiency at a commercially usable level. Loki is released under an MIT license and is available on GitHub. We also provide a video presenting the system and its capabilities.

2024

Do-Not-Answer: Evaluating Safeguards in LLMs
Yuxia Wang | Haonan Li | Xudong Han | Preslav Nakov | Timothy Baldwin
Findings of the Association for Computational Linguistics: EACL 2024

With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to identify potential risks through the evaluation of “dangerous capabilities” in order to responsibly deploy LLMs. Here we aim to facilitate this process. In particular, we collect an open-source dataset to evaluate the safeguards in LLMs, to facilitate the deployment of safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We assess the responses of six popular LLMs to these instructions, and we find that simple BERT-style classifiers can achieve results that are comparable to GPT-4 on automatic safety evaluation. Our data and code are available at https://github.com/Libr-AI/do-not-answer

A Chinese Dataset for Evaluating the Safeguards in Large Language Models
Yuxia Wang | Zenan Zhai | Haonan Li | Xudong Han | Shom Lin | Zhenxuan Zhang | Angela Zhao | Preslav Nakov | Timothy Baldwin
Findings of the Association for Computational Linguistics: ACL 2024

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks. Previous studies have proposed comprehensive taxonomies of LLM risks, as well as corresponding prompts that can be used to examine LLM safety. However, the focus has been almost exclusively on English. We aim to broaden LLM safety research by introducing a dataset for the safety evaluation of Chinese LLMs, and extending it to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments over five LLMs show that region-specific risks are the prevalent risk type. Warning: this paper contains example data that may be offensive, harmful, or biased. Our data is available at https://github.com/Libr-AI/do-not-answer.

Demystifying Instruction Mixing for Fine-tuning Large Language Models
Renxi Wang | Haonan Li | Minghao Wu | Yuxia Wang | Xudong Han | Chiyu Zhang | Timothy Baldwin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Instruction tuning significantly enhances the performance of large language models (LLMs) across various tasks. However, the procedure for optimizing the mixing of instruction datasets for LLM fine-tuning is still poorly understood. This study categorizes instructions into three primary types: NLP downstream tasks, coding, and general chat. We explore the effects of instruction tuning on different combinations of datasets on LLM performance, and find that certain instruction types are more advantageous for specific applications but can negatively impact other areas. This work provides insights into instruction mixtures, laying the foundations for future research.

2023

Fair Enough: Standardizing Evaluation and Model Selection for Fairness Research in NLP
Xudong Han | Timothy Baldwin | Trevor Cohn
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Modern NLP systems exhibit a range of biases, which a growing literature on model debiasing attempts to correct. However, current progress is hampered by a plurality of definitions of bias, means of quantification, and an oftentimes vague relation between debiasing algorithms and theoretical measures of bias. This paper seeks to clarify the current situation and plot a course for meaningful progress in fair learning, with two key contributions: (1) making clear inter-relations among the current gamut of methods, and their relation to fairness theory; and (2) addressing the practical problem of model selection, which involves a trade-off between fairness and accuracy and has led to systemic issues in fairness research. Putting them together, we make several recommendations to help shape future work.

Uncertainty Estimation for Debiased Models: Does Fairness Hurt Reliability?
Gleb Kuzmin | Artem Vazhentsev | Artem Shelmanov | Xudong Han | Simon Suster | Maxim Panov | Alexander Panchenko | Timothy Baldwin
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

2022

Systematic Evaluation of Predictive Fairness
Xudong Han | Aili Shen | Trevor Cohn | Timothy Baldwin | Lea Frermann
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Mitigating bias in training on biased datasets is an important open problem. Several techniques have been proposed; however, the typical evaluation regime is very limited, considering very narrow data conditions. For instance, the effect of target class imbalance and stereotyping is under-studied. To address this gap, we examine the performance of various debiasing methods across multiple tasks, spanning binary classification (Twitter sentiment), multi-class classification (profession prediction), and regression (valence prediction). Through extensive experimentation, we find that data conditions have a strong influence on relative model performance, and that general conclusions cannot be drawn about method efficacy when evaluating only on standard datasets, as is current practice in fairness research.

Does Representational Fairness Imply Empirical Fairness?
Aili Shen | Xudong Han | Trevor Cohn | Timothy Baldwin | Lea Frermann
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

NLP technologies can cause unintended harms if learned representations encode sensitive attributes of the author, or predictions systematically vary in quality across groups. Popular debiasing approaches, like adversarial training, remove sensitive information from representations in order to reduce disparate performance; however, the relation between representational fairness and empirical (performance) fairness has not been systematically studied. This paper fills this gap, and proposes a novel debiasing method building on contrastive learning to encourage a latent space that separates instances based on target label, while mixing instances that share protected attributes. Our results show the effectiveness of our new method and, more importantly, show across a set of diverse debiasing methods that representational fairness does not imply empirical fairness. This work highlights the importance of aligning and understanding the relation of the optimization objective and final fairness target.

FairLib: A Unified Framework for Assessing and Improving Fairness
Xudong Han | Aili Shen | Yitong Li | Lea Frermann | Timothy Baldwin | Trevor Cohn
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper presents FairLib, an open-source Python library for assessing and improving model fairness. It provides a systematic framework for quickly accessing benchmark datasets, reproducing existing debiasing baseline models, developing new methods, evaluating models with different metrics, and visualizing their results. Its modularity and extensibility enable the framework to be used for diverse types of inputs, including natural language, images, and audio. We implement 14 debiasing methods, including pre-processing, at-training-time, and post-processing approaches. The built-in metrics cover the most commonly acknowledged fairness criteria and can be further generalized and customized for fairness evaluation.

Balancing out Bias: Achieving Fairness Through Balanced Training
Xudong Han | Timothy Baldwin | Trevor Cohn
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Group bias in natural language processing tasks manifests as disparities in system error rates across texts authored by different demographic groups, typically disadvantaging minority groups. Dataset balancing has been shown to be effective at mitigating bias; however, existing approaches do not directly account for correlations between author demographics and linguistic variables, limiting their effectiveness. To achieve Equal Opportunity fairness, such as equal job opportunity without regard to demographics, this paper introduces a simple, but highly effective, objective for countering bias using balanced training. We extend the method in the form of a gated model, which incorporates protected attributes as input, and show that it is effective at reducing bias in predictions through demographic input perturbation, outperforming all other bias mitigation techniques when combined with balanced training.

Towards Fair Dataset Distillation for Text Classification
Xudong Han | Aili Shen | Yitong Li | Lea Frermann | Timothy Baldwin | Trevor Cohn
Proceedings of the Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)

With the growing prevalence of large-scale language models, their energy footprint and potential to learn and amplify historical biases are two pressing challenges. Dataset distillation (DD), which shrinks a dataset by learning a small number of synthetic samples that encode the information in the original dataset, is a method for reducing the cost of model training; however, its impact on fairness has not been studied. We investigate how DD impacts group bias, with experiments over two language classification tasks, concluding that vanilla DD preserves the bias of the dataset. We then show how existing debiasing methods can be combined with DD to produce models that are fair and accurate, at reduced training cost.

Optimising Equal Opportunity Fairness in Model Training
Aili Shen | Xudong Han | Trevor Cohn | Timothy Baldwin | Lea Frermann
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Real-world datasets often encode stereotypes and societal biases. Such biases can be implicitly captured by trained models, leading to biased predictions and exacerbating existing societal preconceptions. Existing debiasing methods, such as adversarial training and removing protected information from representations, have been shown to reduce bias. However, a disconnect between fairness criteria and training objectives makes it difficult to reason theoretically about the effectiveness of different techniques. In this work, we propose two novel training objectives which directly optimise for the widely-used criterion of equal opportunity, and show that they are effective in reducing bias while maintaining high performance over two classification tasks.
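As a rough illustration of the equal opportunity criterion mentioned above, the sketch below measures it as the gap in true positive rate (TPR) between demographic groups for a binary classifier. This is a common way to quantify the criterion, not the training objectives proposed in the paper; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def true_positive_rates(y_true, y_pred, groups):
    """Per-group TPR: fraction of gold-positive instances predicted positive."""
    positives = defaultdict(int)
    hits = defaultdict(int)
    for gold, pred, group in zip(y_true, y_pred, groups):
        if gold == 1:
            positives[group] += 1
            if pred == 1:
                hits[group] += 1
    # Groups with no gold-positive instances are skipped.
    return {group: hits[group] / positives[group] for group in positives}


def equal_opportunity_gap(y_true, y_pred, groups):
    """Largest TPR difference between any two groups (0.0 means parity)."""
    rates = true_positive_rates(y_true, y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy data: group "a" has both positives recovered, group "b" only one of two.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
groups = ["a", "a", "b", "b", "a", "b"]

print(true_positive_rates(y_true, y_pred, groups))
print(equal_opportunity_gap(y_true, y_pred, groups))
```

A gap of zero would mean that positive instances are recovered equally often in every group; larger values indicate that some group is disadvantaged.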

2021

Diverse Adversaries for Mitigating Bias in Training
Xudong Han | Timothy Baldwin | Trevor Cohn
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Adversarial learning can learn fairer and less biased models of language processing than standard training. However, current adversarial techniques only partially mitigate the problem of model bias, added to which their training procedures are often unstable. In this paper, we propose a novel approach to adversarial learning based on the use of multiple diverse discriminators, whereby discriminators are encouraged to learn orthogonal hidden representations from one another. Experimental results show that our method substantially improves over standard adversarial removal methods, in terms of reducing bias and stability of training.

Evaluating Debiasing Techniques for Intersectional Biases
Shivashankar Subramanian | Xudong Han | Timothy Baldwin | Trevor Cohn | Lea Frermann
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Bias is pervasive for NLP models, motivating the development of automatic debiasing techniques. Evaluation of NLP debiasing methods has largely been limited to binary attributes in isolation, e.g., debiasing with respect to binary gender or race; however, many corpora involve multiple such attributes, possibly with higher cardinality. In this paper we argue that a truly fair model must consider ‘gerrymandering’ groups which comprise not only single attributes, but also intersectional groups. We evaluate a form of bias-constrained model which is new to NLP, as well as an extension of the iterative nullspace projection technique which can handle multiple identities.

Decoupling Adversarial Training for Fair NLP
Xudong Han | Timothy Baldwin | Trevor Cohn
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2019

Grounding learning of modifier dynamics: An application to color naming
Xudong Han | Philip Schulz | Trevor Cohn
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Grounding is crucial for natural language understanding. An important subtask is to understand modified color expressions, such as “light blue”. We present a model of color modifiers that, compared with previous additive models in RGB space, learns more complex transformations. In addition, we present a model that operates in the HSV color space. We show that certain adjectives are better modeled in that space. To account for all modifiers, we train a hard ensemble model that selects a color space depending on the modifier-color pair. Experimental results show significant and consistent improvements compared to the state-of-the-art baseline model.

Co-authors

Monojit Choudhury 2

Artem Shelmanov 2

Zhenxuan Zhang 2

Utkarsh Agarwal 1

Emad A. Alghamdi 1

Debopriyo Banerjee 1

Rishabh Bhardwaj 1

Junaid Hamid Bhat 1

Shivam Chauhan 1

Mukund Choudhary 1

Rocktim Jyoti Das 1

Ali El Filali 1

Samujjwal Ghosh 1

Gurpreet Gosal 1

Iryna Gurevych 1

Alok Anil Jadhav 1

Rituraj Joshi 1

Tatsuki Kuribayashi 1

Zhengzhong Liu 1

Parvez Mullah 1

Alexander Panchenko 1

Onkar Arun Pandit 1

Soujanya Poria 1

Lalit Pradhan 1

Zainul Abedien Ahmed Quraishi 1

Gokulakrishnan Ramakrishnan 1

Hector Xuguang Ren 1

Sunil Kumar Sahu 1

Philip Schulz 1

Neha Sengupta 1

Avraham Sheinin 1

Awantika Shukla 1

Aaryamonvikram Singh 1

Shivashankar Subramanian 1

Natalia Vassilieva 1

Artem Vazhentsev 1

Yaodong Yang (杨耀东) 1

Youliang Yuan 1

Bingchen Zhao 1
