Xuming Hu - ACL Anthology

Xuming Hu

2025

UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models
Boyang Xue | Fei Mi | Qi Zhu | Hongru Wang | Rui Wang | Sheng Wang | Erxin Yu | Xuming Hu | Kam-Fai Wong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite demonstrating impressive capabilities, Large Language Models (LLMs) still often struggle to accurately express the factual knowledge they possess, especially in cases where the LLMs’ knowledge boundaries are ambiguous. To improve LLMs’ factual expressions, we propose the UAlign framework, which leverages Uncertainty estimations to represent knowledge boundaries, and then explicitly incorporates these representations as input features into prompts for LLMs to Align with factual knowledge. First, we prepare the dataset on knowledge question-answering (QA) samples by calculating two uncertainty estimations, including confidence score and semantic entropy, to represent the knowledge boundaries for LLMs. Subsequently, using the prepared dataset, we train a reward model that incorporates uncertainty estimations and then employ the Proximal Policy Optimization (PPO) algorithm for factuality alignment on LLMs. Experimental results indicate that, by integrating uncertainty representations in LLM alignment, the proposed UAlign can significantly enhance the LLMs’ capacities to confidently answer known questions and refuse unknown questions on both in-domain and out-of-domain tasks, showing reliability improvements and good generalizability over various prompt- and training-based baselines.
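
The two uncertainty estimations named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the confidence score is assumed here to be the share of sampled answers that agree with a reference answer, and the meaning clusters needed for semantic entropy are assumed to be given (exact-match strings stand in for a real semantic clustering step).

```python
import math
from collections import Counter


def confidence_score(sampled_answers, reference):
    """Share of sampled answers that match the reference (assumed definition)."""
    matches = sum(1 for answer in sampled_answers if answer == reference)
    return matches / len(sampled_answers)


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples for one question; identical strings share a cluster.
samples = ["paris", "paris", "lyon", "paris"]
score = confidence_score(samples, reference="paris")
entropy = semantic_entropy(samples)
```

A question whose samples all land in one cluster gets zero entropy, while samples spread evenly over many clusters push the entropy toward its maximum, which is one way to read an ambiguous knowledge boundary.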

Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?
Leyi Pan | Aiwei Liu | Shiyu Huang | Yijian Lu | Xuming Hu | Lijie Wen | Irwin King | Philip S. Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies.

Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis
Junzhuo Li | Bo Wang | Xiuze Zhou | Peijie Jiang | Jia Liu | Xuming Hu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show MoE models achieve 31% higher per-layer efficiency via a “mid-activation, late-amplification” pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a “basic-refinement” framework: shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.

Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
Shaobo Wang | Xiangqi Jin | Ziming Wang | Jize Wang | Jiajun Zhang | Kaixin Li | Zichen Wen | Zhong Li | Conghui He | Xuming Hu | Linfeng Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model’s predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4× speedup.

MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection
Yibo Yan | Shen Wang | Jiahao Huo | Philip S. Yu | Xuming Hu | Qingsong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Mathematical error detection in educational settings presents a significant challenge for Multimodal Large Language Models (MLLMs), requiring a sophisticated understanding of both visual and textual mathematical content along with complex reasoning capabilities. Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of **identifying and categorizing student errors in multimodal mathematical contexts**. Therefore, we introduce **MathAgent, a novel Mixture-of-Math-Agent framework** specifically designed to address these challenges. Our approach decomposes error detection into three phases with specialized agents: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. This architecture enables more accurate processing of multimodal mathematical content by explicitly modeling the relationships between multimodal problems and student solution steps. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification and 3% improvement in error categorization compared to baseline models. Furthermore, MathAgent has been successfully deployed in an educational platform serving over one million K-12 students, achieving nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.

Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction
Yinghui Li | Shang Qin | Jingheng Ye | Haojing Huang | Yangning Li | Shu-Yu Guo | Libo Qin | Xuming Hu | Wenhao Jiang | Hai-Tao Zheng | Philip S. Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Recently, Large Language Models (LLMs) have been widely studied by researchers for their roles in various downstream NLP tasks. As a fundamental task in the NLP field, Chinese Grammatical Error Correction (CGEC) aims to correct all potential grammatical errors in the input sentences. Previous studies have shown that LLMs’ performance as correctors on CGEC remains unsatisfactory due to the challenging nature of the task. To help the CGEC field better adapt to the era of LLMs, we rethink the roles of LLMs in the CGEC task so that they can be better utilized and explored in CGEC. Considering the rich grammatical knowledge stored in LLMs and their powerful semantic understanding capabilities, we utilize LLMs as explainers to provide explanation information to the CGEC small models during error correction, aiming to enhance performance. We also use LLMs as evaluators to bring more reasonable CGEC evaluations, thus alleviating the problems caused by the subjectivity of the CGEC task. In particular, our work is also an active exploration of how LLMs and small models can better collaborate in downstream tasks. Extensive experiments and detailed analyses on widely used datasets verify the effectiveness of our intuition and the proposed methods.

Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis
Haoming Huang | Yibo Yan | Jiahao Huo | Xin Zou | Xinfeng Li | Kun Wang | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce **PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing.** By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the function of key components in the circuit and how the attention pattern dynamics contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit’s effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation. Our code can be found at https://github.com/halfmorepiece/PhantomCircuit.

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
Yunkai Dang | Mengxi Gao | Yibo Yan | Xin Zou | Yanggan Gu | Jungang Li | Jingyu Wang | Peijie Jiang | Aiwei Liu | Jia Liu | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. However, existing studies have concentrated mainly on visual–textual misalignment, leaving largely unexplored the MLLMs’ ability to preserve an originally correct answer when confronted with misleading information. We reveal a response uncertainty phenomenon: across nine standard datasets, twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue. To systematically quantify this vulnerability, we propose a two-stage evaluation pipeline: (1) elicit each model’s original response on unperturbed inputs; (2) inject explicit (false-answer hints) and implicit (contextual contradictions) misleading instructions, and compute the misleading rate, defined as the fraction of correct-to-incorrect flips. Leveraging the most susceptible examples, we curate the Multimodal Uncertainty Benchmark (MUB), a collection of image–question pairs stratified into low, medium, and high difficulty based on how many of twelve state-of-the-art MLLMs they mislead. Extensive evaluation on twelve open-source and five closed-source models reveals high uncertainty: average misleading rates exceed 86%, with explicit cues over 67.19% and implicit cues over 80.67%. To reduce the misleading rate, we then fine-tune all open-source MLLMs on a compact 2,000-sample mixed-instruction dataset, reducing misleading rates to 6.97% (explicit) and 32.77% (implicit), boosting consistency by nearly 29.37% on highly deceptive inputs, and slightly improving accuracy on standard benchmarks.
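
The misleading rate described in this abstract can be sketched in a few lines. This is an illustration rather than the benchmark's official scoring code: the function name is hypothetical, and the denominator is assumed to be the number of answers that were correct before the misleading cue was injected.

```python
def misleading_rate(correct_before, correct_after):
    """Fraction of originally correct answers that flip to incorrect.

    Both arguments are equal-length lists of booleans, one entry per sample:
    whether the model answered correctly before and after the misleading cue.
    The choice of denominator is an assumption of this sketch.
    """
    originally_correct = sum(1 for before in correct_before if before)
    if originally_correct == 0:
        return 0.0
    flips = sum(
        1
        for before, after in zip(correct_before, correct_after)
        if before and not after
    )
    return flips / originally_correct


# Hypothetical results for four samples.
before = [True, True, False, True]
after = [False, True, False, False]
rate = misleading_rate(before, after)
```

Samples that were already wrong before the cue do not count as flips, so they affect neither the numerator nor the denominator here.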

Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation
Junzhuo Li | Bo Wang | Xiuze Zhou | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Mixture-of-Experts (MoE) models offer immense capacity via sparsely gated expert subnetworks, yet adapting them to multiple domains without catastrophic forgetting remains an open challenge. Existing approaches either incur prohibitive computation, suffer cross-domain interference, or require separate runs per domain. We propose DES-MoE, a dynamic expert specialization framework for multi-domain adaptation of Mixture-of-Experts models. DES-MoE addresses catastrophic forgetting through three innovations: (1) an adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation, (2) real-time expert-domain correlation mapping to isolate domain-specific gradients, and (3) a three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters. Evaluated on six domains (math, code, law, etc.), DES-MoE matches single-domain ESFT performance while training one unified model, reduces forgetting by 89% compared to full fine-tuning as domains scale from 2 to 6, and achieves 68% faster convergence than conventional methods. Our work establishes dynamic expert isolation as a scalable paradigm for multi-task MoE adaptation.

Internal Chain-of-Thought: Empirical Evidence for Layer‐wise Subtask Scheduling in LLMs
Zhipeng Yang | Junzhuo Li | Siyu Xia | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We show that large language models (LLMs) exhibit an internal chain-of-thought: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world TRACE benchmark, observing the same stepwise dynamics. Together, our results enhance LLM transparency by showing their capacity to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.

VLA-Mark: A cross modal watermark for large vision-language alignment models
Shuliang Liu | Zheng Qi | Jesse Jiaxi Xu | Yibo Yan | Junyan Zhang | He Geng | Aiwei Liu | Peijie Jiang | Jia Liu | Yik-Cheung Tam | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.

DiscoverGPT: Multi-task Fine-tuning Large Language Model for Related Table Discovery
Xuming Hu | Xiao Qin | Chuan Lei | Asterios Katsifodimos | Zhengyuan Shen | Balasubramaniam Srinivasan | Huzefa Rangwala
Findings of the Association for Computational Linguistics: NAACL 2025

Natural language understanding over tabular data has played a significant role in data discovery tasks such as joinable and unionable table search. State-of-the-art approaches adopt large language models (LLMs) pre-trained over massive text corpora to learn and evaluate the table semantic relatedness. Existing methods typically follow a pretrain-and-finetune paradigm, namely fine-tuning an LLM using tabular data with table relatedness labels. To enhance the model’s understanding of tabular data, recent studies include auxiliary tasks such as entity resolution and column type classification in the fine-tuning phase. Despite the performance gains from these supervisions, there is a lack of study on how they complement or even conflict with each other, leading to subpar performance on the final data discovery tasks. In this paper, we propose a simple yet effective multi-task fine-tuning framework named DiscoverGPT that holistically discovers and leverages the intricate relationships among the supervisions to optimize the performance on the data discovery task. Moreover, DiscoverGPT is plug-and-play, allowing a broad range of open-domain auxiliary tasks to be incorporated by utilizing the generative power of LLMs. We demonstrate the usability and effectiveness of DiscoverGPT with baseline comparisons and ablation studies. DiscoverGPT outperforms the best performing baseline by up to 7% in F1 score.

PolyJoin: Semantic Multi-key Joinable Table Search in Data Lakes
Xuming Hu | Chuan Lei | Xiao Qin | Asterios Katsifodimos | Christos Faloutsos | Huzefa Rangwala
Findings of the Association for Computational Linguistics: NAACL 2025

Given a query table, how can we effectively discover multi-key joinable tables on the web? This can be seen as a retrieval task, where users search the web for tables related to an existing one. Searching and discovering such joinable tables is critical to data analysts and data scientists for reporting, establishing correlations and training machine learning models. Existing joinable table search methods have mostly focused on single key (unary) joins, where a single column is the join key. However, these methods are ineffective when dealing with join keys composed of multiple columns (n-ary joins), which are prevalent in web table corpora. In this paper, we introduce PolyJoin, which finds multi-key semantically-joinable tables on the web, given a query table. PolyJoin employs a multi-key encoder and a novel self-supervised training method to generate the representations of multiple join keys, preserving the alignment across multiple columns. In particular, PolyJoin is equipped with a hierarchical contrastive learning technique to further enhance the model’s semantic understanding of multi-key joinable tables. PolyJoin outperforms the state-of-the-art methods by 2.89% and 3.67% with respect to MAP@30 and R@30 on two real-world web table benchmarks, respectively.
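
For readers unfamiliar with the retrieval metrics reported in this abstract, the following sketch shows one common way to compute MAP@k and R@k (with k = 30 in the reported results). It is not taken from the paper: the function names are hypothetical, the average precision is normalized by min(number of relevant tables, k), which is one of several conventions in use, and each ranked list is assumed to contain no duplicates.

```python
def average_precision_at_k(ranked, relevant, k):
    """Average precision over the top-k results of one query."""
    relevant = set(relevant)
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(len(relevant), k)  # normalization convention is an assumption
    return score / denom if denom else 0.0


def recall_at_k(ranked, relevant, k):
    """Share of relevant items that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def mean_average_precision_at_k(queries, k):
    """Mean of AP@k over (ranked, relevant) pairs, one pair per query table."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in queries) / len(queries)


# Hypothetical ranking of candidate tables for one query table.
ranked_tables = ["t1", "t2", "t3", "t4"]
joinable = {"t1", "t3"}
ap = average_precision_at_k(ranked_tables, joinable, k=3)
recall = recall_at_k(ranked_tables, joinable, k=3)
```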

Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
Kening Zheng | Junkai Chen | Yibo Yan | Xin Zou | Huiyu Zhou | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

Hallucination issues continue to affect multimodal large language models (MLLMs), with existing research mainly addressing object-level or attribute-level hallucinations, neglecting the more complex relation hallucinations that require advanced reasoning. Current benchmarks for relation hallucinations lack detailed evaluation and effective mitigation, and their datasets often suffer from biases due to systematic annotation processes. To address these challenges, we introduce Reefknot, a comprehensive benchmark targeting relation hallucinations, comprising over 20,000 real-world samples. We provide a systematic definition of relation hallucinations, integrating perceptive and cognitive perspectives, and construct a relation-based corpus using the Visual Genome scene graph dataset. Our comparative evaluation reveals significant limitations in current MLLMs’ ability to handle relation hallucinations. Additionally, we propose a novel confidence-based mitigation strategy, which reduces the hallucination rate by an average of 9.75% across three datasets, including Reefknot. Our work offers valuable insights for achieving trustworthy multimodal intelligence. The dataset and code are released at https://github.com/JackChen-seu/Reefknot.

EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models
Jiamin Su | Yibo Yan | Fangteng Fu | Zhang Han | Jingheng Ye | Xiang Liu | Jiahao Huo | Huiyu Zhou | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (i) reliance on handcrafted features that limit generalizability, (ii) difficulty in capturing fine-grained traits like coherence and argumentation, and (iii) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose **EssayJudge**, the **first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits**. By leveraging MLLMs’ strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research. Our dataset and code will be available upon acceptance.

MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models
Jiahao Huo | Yibo Yan | Xu Zheng | Yuanhuiyi Lyu | Xin Zou | Zhihua Wei | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to **reformulate the task of multimodal MU in the era of MLLMs**, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we **develop a novel geometry-constrained gradient ascent method MMUnlearner**. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that fine-tune MLLMs with VQA data directly through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code will be released upon acceptance.

StructFact: Reasoning Factual Knowledge from Structured Data with Large Language Models
Sirui Huang | Yanggan Gu | Zhonghao Li | Xuming Hu | Li Qing | Guandong Xu
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) have made significant strides in natural language processing by leveraging their ability to comprehend and reason with factual knowledge. However, a significant amount of factual knowledge is stored in structured data, which has unique characteristics not typically encountered in the unstructured texts used for pretraining LLMs. To evaluate the capability of LLMs in handling facts structurally stored, we introduce a benchmark called StructFact, which includes meticulously annotated factual questions, spanning five tasks that reflect the intrinsic properties of structured data. This benchmark aims to delineate the strengths and limitations of LLMs in reasoning with structured data for knowledge-intensive tasks in practical applications. Extensive experiments conducted on 10 common LLMs have yielded several insights, one notable finding being that these models struggle significantly with the heterogeneity of structured data during reasoning.

Unlocking Speech Instruction Data Potential with Query Rewriting
Yonghua Hei | Yibo Yan | Shuliang Liu | Huiyu Zhou | Linfeng Zhang | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

End-to-end Large Speech Language Models (**LSLMs**) demonstrate strong potential in response latency and speech comprehension capabilities, showcasing general intelligence across speech understanding tasks. However, the ability to follow speech instructions has not been fully realized due to the lack of datasets and heavily biased training tasks. Leveraging the rich ASR datasets, previous approaches have used Large Language Models (**LLMs**) to continue the linguistic information of speech to construct speech instruction datasets. Yet, due to the gap between LLM-generated results and real human responses, the continuation methods further amplify these shortcomings. Given the high costs of collecting and annotating speech instruction datasets by humans, using speech synthesis to construct large-scale speech instruction datasets has become a balanced and robust alternative. Although modern Text-To-Speech (**TTS**) models have achieved near-human-level synthesis quality, it is challenging to appropriately convert out-of-distribution text instruction to speech due to the limitations of the training data distribution in TTS models. To address this issue, we propose a query rewriting framework with multi-LLM knowledge fusion, employing multiple agents to annotate and validate the synthesized speech, making it possible to construct high-quality speech instruction datasets without relying on human annotation. Experiments show that this method can transform text instructions into distributions more suitable for TTS models for speech synthesis through zero-shot rewriting, increasing data usability from 72% to 93%. It also demonstrates unique advantages in rewriting tasks that require complex knowledge and context-related abilities.

A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
Yibo Yan | Jiamin Su | Jianxiang He | Fangteng Fu | Xu Zheng | Yuanhuiyi Lyu | Kun Wang | Shen Wang | Qingsong Wen | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides **the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs)**. We review over 200 studies published since 2021, and examine the state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore the multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into the future direction for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.

SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning
Junkai Chen | Zhijie Deng | Kening Zheng | Yibo Yan | Shuliang Liu | PeiJun Wu | Peijie Jiang | Jia Liu | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

As Multimodal Large Language Models (MLLMs) develop, their potential security issues have become increasingly prominent. **Machine Unlearning (MU)**, as an effective strategy for forgetting specific knowledge in training data, has been widely used in privacy protection. However, *MU for safety in MLLMs has yet to be fully explored*. To address this issue, we propose SafeEraser, a safety unlearning benchmark for MLLMs, consisting of 3,000 images and 28.8K VQA pairs. We comprehensively evaluate unlearning methods from two perspectives: **_forget quality_** and **_model utility_**. Our findings show that existing MU methods struggle to maintain model performance while implementing the forget operation and often suffer from **_over-forgetting_**. Hence, we introduce **Prompt Decouple (PD) Loss** to alleviate over-forgetting by decoupling prompts during the unlearning process. To quantitatively measure over-forgetting mitigated by PD Loss, we propose a new metric called **Safe Answer Refusal Rate (SARR)**. Experimental results demonstrate that combining PD Loss with existing unlearning methods can effectively prevent over-forgetting and achieve a decrease of 79.5% in the SARR metric of LLaVA-7B and LLaVA-13B, while maintaining forget quality and model utility. Our code and dataset will be released upon acceptance. **Warning: This paper contains examples of harmful language and images, and reader discretion is recommended.**

Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models
Yanggan Gu | Junzhuo Li | Sirui Huang | Xin Zou | Zhenghua Li | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

Aligning small language models (SLMs) with human values typically involves distilling preference knowledge from large language models (LLMs). However, existing distillation methods model preference knowledge in teacher LLMs by comparing pairwise responses, overlooking the extent of difference between responses. This limitation hinders student SLMs from capturing the nuanced preferences for multiple responses. In this paper, we propose a Preference-Aligned Distillation (PAD) framework, which models the teacher’s preference knowledge as a probability distribution over all potential preferences, thereby providing more nuanced supervisory signals. Our insight in developing PAD is rooted in the demonstration that language models can serve as reward functions, reflecting their intrinsic preferences. Based on this, PAD comprises three key steps: (1) sampling diverse responses using high-temperature sampling; (2) computing rewards for both teacher and student to construct their intrinsic preference; and (3) training the student’s intrinsic preference distribution to align with the teacher’s. Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches, achieving over 20% improvement on AlpacaEval 2 and Arena-Hard, indicating superior alignment with human preferences. Notably, on MT-Bench, using the Gemma model family, the student trained by PAD surpasses its teacher, further validating the effectiveness of our PAD.
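
Steps (2) and (3) of this abstract can be made concrete with a small sketch. It is only an illustration under stated assumptions: the abstract does not say how rewards become a distribution over preferences, so a Plackett-Luce model over all orderings of the sampled responses is assumed here, the alignment objective is assumed to be a KL divergence, and all function names and reward values are hypothetical.

```python
import math
from itertools import permutations


def ranking_distribution(rewards):
    """Plackett-Luce probability of every ordering of the responses (assumed model).

    rewards[i] is the scalar reward a model assigns to response i.
    Returns a dict mapping each ordering (tuple of indices) to its probability.
    """
    dist = {}
    for order in permutations(range(len(rewards))):
        prob = 1.0
        remaining = list(order)
        while remaining:
            weights = [math.exp(rewards[i]) for i in remaining]
            prob *= weights[0] / sum(weights)  # chance the next item is picked first
            remaining = remaining[1:]
        dist[order] = prob
    return dist


def kl_divergence(p, q):
    """KL(p || q) between two distributions over the same set of orderings."""
    return sum(p[o] * math.log(p[o] / q[o]) for o in p if p[o] > 0)


# Hypothetical rewards for three sampled responses.
teacher = ranking_distribution([2.0, 1.0, 0.0])
student = ranking_distribution([0.5, 0.4, 0.3])
loss = kl_divergence(teacher, student)  # quantity a student would minimize
```

Unlike a pairwise comparison, the distribution keeps the size of the reward gaps: a teacher that strongly prefers one response puts most of its mass on orderings that start with it.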

A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models
Shuliang Liu | Hongyi Liu | Aiwei Liu | Duan Bingchen | Zheng Qi | Yibo Yan | He Geng | Peijie Jiang | Jia Liu | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. This paper introduces a proactive defense paradigm, shifting from passive post hoc detection to anticipatory mitigation strategies. We propose a Three Pillars framework: (1) Knowledge Credibility, fortifying the integrity of training and deployed data; (2) Inference Reliability, embedding self-corrective mechanisms during reasoning; and (3) Input Robustness, enhancing the resilience of model interfaces against adversarial attacks. Through a comprehensive survey of existing techniques and a comparative meta-analysis, we demonstrate that proactive defense strategies offer up to 63% improvement over conventional methods in misinformation prevention, despite non-trivial computational overhead and generalization challenges. We argue that future research should focus on co-designing robust knowledge foundations, reasoning certification, and attack-resistant interfaces to ensure LLMs can effectively counter misinformation across varied domains.

LLM Agents for Education: Advances and Applications
Zhendong Chu | Shen Wang | Jian Xie | Tinghui Zhu | Yibo Yan | Jingheng Ye | Aoxiao Zhong | Xuming Hu | Jing Liang | Philip S. Yu | Qingsong Wen
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Model (LLM) agents are transforming education by automating complex pedagogical tasks and enhancing both teaching and learning processes. In this survey, we present a systematic review of recent advances in applying LLM agents to address key challenges in educational settings, such as feedback comment generation, curriculum design, etc. We analyze the technologies enabling these agents, including representative datasets, benchmarks, and algorithmic frameworks. Additionally, we highlight key challenges in deploying LLM agents in educational settings, including ethical issues, hallucination and overreliance, and integration with existing educational ecosystems. Beyond the core technical focus, we include in Appendix A a comprehensive overview of domain-specific educational agents, covering areas such as science learning, language learning, and professional development.

PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
Song Dai | Yibo Yan | Jiamin Su | Zihao Dongfang | Yubo Gao | Yonghua Hei | Jungang Li | Junyan Zhang | Sicheng Tao | Zhuoran Gao | Xuming Hu
Findings of the Association for Computational Linguistics: EMNLP 2025

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce **PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation.** PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.

Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs?
Junyan Zhang | Yiming Huang | Shuliang Liu | Yubo Gao | Xuming Hu
Findings of the Association for Computational Linguistics: EMNLP 2025

The rapid adoption of LLMs has overshadowed the potential advantages of traditional BERT-like models in text classification. This study challenges the prevailing “LLM-centric” trend by systematically comparing three categories of methods, *i.e.,* BERT-like model fine-tuning, LLM internal state utilization, and LLM zero-shot inference, across six challenging datasets. Our findings reveal that BERT-like models often outperform LLMs. We further categorize datasets into three types, perform PCA and probing experiments, and identify task-specific model strengths: BERT-like models excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics or world knowledge. Subsequently, we conducted experiments on a broader range of text classification tasks to demonstrate the generalizability of our findings. We further investigated how the relative performance of different models varies under different levels of data availability. Finally, based on these findings, we propose **TaMAS**, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs. Code is available at [https://github.com/jyzhang2002/TaMAS-TextClass](https://github.com/jyzhang2002/TaMAS-TextClass).

DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction
Jian Chen | Zhenyan Chen | Xuming Hu | Peilin Zhou | Yining Hua | Han Fang | Cissy Hing Yee Choy | Xinmei Ke | Jingfeng Luo | Zixuan Yuan
Findings of the Association for Computational Linguistics: EMNLP 2025

Natural Language to SQL (NL2SQL) provides a new model-centric paradigm that simplifies database access for non-technical users by converting natural language queries into SQL commands. Recent advancements, particularly those integrating Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) reasoning, have made significant strides in enhancing NL2SQL performance. However, challenges such as inaccurate task decomposition and keyword extraction by LLMs remain major bottlenecks, often leading to errors in SQL generation. While existing datasets aim to mitigate these issues by fine-tuning models, they struggle with over-fragmentation of tasks and lack of domain-specific keyword annotations, limiting their effectiveness. To address these limitations, we present DeKeyNLU, a novel dataset which contains 1,500 meticulously annotated QA pairs aimed at refining task decomposition and enhancing keyword extraction precision for the RAG pipeline. Fine-tuned with DeKeyNLU, we propose DeKeySQL, a RAG-based NL2SQL pipeline that employs three distinct modules for user question understanding, entity retrieval, and generation to improve SQL generation accuracy. We benchmarked multiple model configurations within the DeKeySQL RAG pipeline. Experimental results demonstrate that fine-tuning with DeKeyNLU significantly improves SQL generation accuracy on both BIRD (62.31% to 69.10%) and Spider (84.2% to 88.7%) dev datasets.

2024

LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments
Junzhe Chen | Xuming Hu | Shuodi Liu | Shiyu Huang | Wei-Wei Tu | Zhaofeng He | Lijie Wen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in large language models (LLMs) have revealed their potential for achieving autonomous agents possessing human-level intelligence. However, existing benchmarks for evaluating LLM Agents either use static datasets, potentially leading to data leakage or focus only on single-agent scenarios, overlooking the complexities of multi-agent interactions. There is a lack of a benchmark that evaluates the diverse capabilities of LLM agents in multi-agent, dynamic environments. To this end, we introduce LLMArena, a novel and easily extensible framework for evaluating the diverse capabilities of LLM in multi-agent dynamic environments. LLMArena encompasses seven distinct gaming environments, employing Trueskill scoring to assess crucial abilities in LLM agents, including spatial reasoning, strategic planning, numerical reasoning, risk assessment, communication, opponent modeling, and team collaboration. We conduct an extensive experiment and human evaluation among different sizes and types of LLMs, showing that LLMs still have a significant journey ahead in their development towards becoming fully autonomous agents, especially in opponent modeling and team collaboration. We hope LLMArena could guide future research towards enhancing these capabilities in LLMs, ultimately leading to more sophisticated and practical applications in dynamic, multi-agent settings.

MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
Jiahao Huo | Yibo Yan | Boren Hu | Yutao Yue | Xuming Hu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Projecting visual features into word embedding space has become a significant fusion strategy adopted by Multimodal Large Language Models (MLLMs). However, its internal mechanisms have yet to be explored. Inspired by multilingual research, we identify domain-specific neurons in multimodal large language models. Specifically, we investigate the distribution of domain-specific neurons and the mechanism of how MLLMs process features from diverse domains. Furthermore, we propose a three-stage framework for language model modules in MLLMs when handling projected image features, and verify this hypothesis using logit lens. Extensive experiments indicate that while current MLLMs exhibit Visual Question Answering (VQA) capability, they may not fully utilize domain-specific information. Manipulating domain-specific neurons properly will result in a 10% change of accuracy at most, shedding light on the development of cross-domain, all-encompassing MLLMs in the future. The source code is available at https://anonymous.4open.science/r/MMNeuron.

Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications
Weize Liu | Yinlong Xu | Hongxia Xu | Jintai Chen | Xuming Hu | Jian Wu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recently, large language models (LLMs) have achieved tremendous breakthroughs in the field of NLP, but still lack understanding of their internal neuron activities when processing different languages. We designed a method to convert dense LLMs into fine-grained MoE architectures, and then visually studied the multilingual activation patterns of LLMs through expert activation frequency heatmaps. Through comprehensive experiments on different model families, different model sizes, and different variants, we analyzed the similarities and differences in the internal neuron activation patterns of LLMs when processing different languages. Specifically, we investigated the distribution of high-frequency activated experts, multilingual shared experts, whether multilingual activation patterns are related to language families, and the impact of instruction tuning on activation patterns. We further explored leveraging the discovered differences in expert activation frequencies to guide sparse activation and pruning. Experimental results demonstrated that our method significantly outperformed random expert pruning and even exceeded the performance of unpruned models in some languages. Additionally, we found that configuring different pruning rates for different layers based on activation level differences could achieve better results. Our findings reveal the multilingual processing mechanisms within LLMs and utilize these insights to offer new perspectives for applications such as sparse activation and model pruning.

MarkLLM: An Open-Source Toolkit for LLM Watermarking
Leyi Pan | Aiwei Liu | Zhiwei He | Zitian Gao | Xuandong Zhao | Yijian Lu | Binglin Zhou | Shuliang Liu | Xuming Hu | Lijie Wen | Irwin King | Philip S. Yu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Watermarking for Large Language Models (LLMs), which embeds imperceptible yet algorithmically detectable signals in model outputs to identify LLM-generated text, has become crucial in mitigating the potential misuse of LLMs. However, the abundance of LLM watermarking algorithms, their intricate mechanisms, and the complex evaluation procedures and perspectives pose challenges for researchers and the community to easily understand, implement and evaluate the latest advancements. To address these issues, we introduce MarkLLM, an open-source toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework for implementing LLM watermarking algorithms, while providing user-friendly interfaces to ensure ease of access. Furthermore, it enhances understanding by supporting automatic visualization of the underlying mechanisms of these algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools spanning three perspectives, along with two types of automated evaluation pipelines. Through MarkLLM, we aim to support researchers while improving the comprehension and involvement of the general public in LLM watermarking technology, fostering consensus and driving further advancements in research and application. Our code is available at https://github.com/THU-BPM/MarkLLM.

Evaluating Robustness of Generative Search Engine on Adversarial Factoid Questions
Xuming Hu | Xiaochuan Li | Junzhe Chen | Yinghui Li | Yangning Li | Xiaoguang Li | Yasheng Wang | Qun Liu | Lijie Wen | Philip Yu | Zhijiang Guo
Findings of the Association for Computational Linguistics: ACL 2024

Generative search engines have the potential to transform how people seek information online, but generated responses from existing large language models (LLMs)-backed generative search engines may not always be accurate. Nonetheless, retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable part of a claim. To this end, we propose evaluating the robustness of generative search engines in the realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning incorrect responses. Through a comprehensive human evaluation of various generative search engines, such as Bing Chat, PerplexityAI, and YouChat across diverse queries, we demonstrate the effectiveness of adversarial factual questions in inducing incorrect responses. Moreover, retrieval-augmented generation exhibits a higher susceptibility to factual errors compared to LLMs without retrieval. These findings highlight the potential security risks of these systems and emphasize the need for rigorous evaluation before deployment. The dataset and code will be publicly available.

On the Robustness of Document-Level Relation Extraction Models to Entity Name Variations
Shiao Meng | Xuming Hu | Aiwei Liu | Fukun Ma | Yawen Yang | Shuang Li | Lijie Wen
Findings of the Association for Computational Linguistics: ACL 2024

Driven by the demand for cross-sentence and large-scale relation extraction, document-level relation extraction (DocRE) has attracted increasing research interest. Despite the continuous improvement in performance, we find that existing DocRE models which initially perform well may make more mistakes when merely changing the entity names in the document, hindering the generalization to novel entity names. To this end, we systematically investigate the robustness of DocRE models to entity name variations in this work. We first propose a principled pipeline to generate entity-renamed documents by replacing the original entity names with names from Wikidata. By applying the pipeline to DocRED and Re-DocRED datasets, we construct two novel benchmarks named Env-DocRED and Env-Re-DocRED for robustness evaluation. Experimental results show that both three representative DocRE models and two in-context learned large language models consistently lack sufficient robustness to entity name variations, particularly on cross-sentence relation instances and documents with more entities. Finally, we propose an entity variation robust training method which not only improves the robustness of DocRE models but also enhances their understanding and reasoning capabilities. We further verify that the basic idea of this method can be effectively transferred to in-context learning for DocRE as well.

LongGenBench: Long-context Generation Benchmark
Xiang Liu | Peijie Dong | Xuming Hu | Xiaowen Chu
Findings of the Association for Computational Linguistics: EMNLP 2024

Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans across lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating that LLMs respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API accessed and open source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API accessed models, and the Qwen2 series exhibiting the least degradation in LongGenBench among open source models.

Refiner: Restructure Retrieved Content Efficiently to Advance Question-Answering Capabilities
Zhonghao Li | Xuming Hu | Aiwei Liu | Kening Zheng | Sirui Huang | Hui Xiong
Findings of the Association for Computational Linguistics: EMNLP 2024

Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models
Weize Liu | Guocong Li | Kai Zhang | Bang Du | Qiyuan Chen | Xuming Hu | Hongxia Xu | Jintai Chen | Jian Wu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have achieved remarkable advancements in natural language processing. However, the massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments. While techniques such as chain-of-thought (CoT) distillation have displayed promise in distilling LLMs into small language models (SLMs), there is a risk that distilled SLMs may still inherit flawed reasoning and hallucinations from LLMs. To address these issues, we propose a twofold methodology: First, we introduce a novel method for distilling the self-evaluation capability from LLMs into SLMs, aiming to mitigate the adverse effects of flawed reasoning and hallucinations inherited from LLMs. Second, we advocate for distilling more comprehensive thinking by incorporating multiple distinct CoTs and self-evaluation outputs, to ensure a more thorough and robust knowledge transfer into SLMs. Experiments on three NLP benchmarks demonstrate that our method significantly improves the performance of distilled SLMs, offering a new perspective for developing more effective and efficient SLMs in resource-constrained environments.

2023

AMR-based Network for Aspect-based Sentiment Analysis
Fukun Ma | Xuming Hu | Aiwei Liu | Yawen Yang | Shuang Li | Philip S. Yu | Lijie Wen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment classification task. Many recent works have used dependency trees to extract the relation between aspects and contexts and have achieved significant improvements. However, further improvement is limited due to the potential mismatch between the dependency tree as a syntactic structure and the sentiment classification as a semantic task. To alleviate this gap, we replace the syntactic dependency tree with the semantic structure named Abstract Meaning Representation (AMR) and propose a model called AMR-based Path Aggregation Relational Network (APARN) to take full advantage of semantic structures. In particular, we design the path aggregator and the relation-enhanced self-attention mechanism that complement each other. The path aggregator extracts semantic features from AMRs under the guidance of sentence information, while the relation-enhanced self-attention mechanism in turn improves sentence features with refined semantic information. Experimental results on four public datasets demonstrate 1.13% average F1 improvement of APARN in ABSA when compared with state-of-the-art baselines.

Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis
Xuming Hu | Zhijiang Guo | Zhiyang Teng | Irwin King | Philip S. Yu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Multimodal relation extraction (MRE) is the task of identifying the semantic relationships between two entities based on the context of the sentence image pair. Existing retrieval-augmented approaches mainly focused on modeling the retrieved textual knowledge, but this may not be able to accurately identify complex relations. To improve the prediction, this research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image. We further develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities. Extensive experiments and analyses show that the proposed method is able to effectively select and compare evidence across modalities and significantly outperforms state-of-the-art models.

RAPL: A Relation-Aware Prototype Learning Approach for Few-Shot Document-Level Relation Extraction
Shiao Meng | Xuming Hu | Aiwei Liu | Shuang Li | Fukun Ma | Yawen Yang | Lijie Wen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

How to identify semantic relations among entities in a document when only a few labeled documents are available? Few-shot document-level relation extraction (FSDLRE) is crucial for addressing the pervasive data scarcity problem in real-world scenarios. Metric-based meta-learning is an effective framework widely adopted for FSDLRE, which constructs class prototypes for classification. However, existing works often struggle to obtain class prototypes with accurate relational semantics: 1) To build a prototype for a target relation type, they aggregate the representations of all entity pairs holding that relation, while these entity pairs may also hold other relations, thus disturbing the prototype. 2) They use a set of generic NOTA (none-of-the-above) prototypes across all tasks, neglecting that the NOTA semantics differ in tasks with different target relation types. In this paper, we propose a relation-aware prototype learning method for FSDLRE to strengthen the relational semantics of prototype representations. By judiciously leveraging the relation descriptions and realistic NOTA instances as guidance, our method effectively refines the relation prototypes and generates task-specific NOTA prototypes. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by an average of 2.61% F₁ across various settings of two FSDLRE benchmarks.
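
For readers unfamiliar with the metric-based setup this abstract builds on, the sketch below shows the generic baseline: a class prototype is the mean of the support embeddings for a relation, and a query is assigned to the nearest prototype. It does not implement RAPL's relation-aware refinement or its task-specific NOTA prototypes. The 2-D embeddings, relation labels, and function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_prototypes(support):
    """Map each relation label to the mean of its support embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}

def classify(query, prototypes):
    """Return the label whose prototype is closest to the query."""
    return min(prototypes, key=lambda label: euclidean(query, prototypes[label]))

# Hypothetical entity-pair embeddings grouped by relation label.
support = {
    "founded_by": [[1.0, 0.0], [0.8, 0.2]],
    "located_in": [[0.0, 1.0], [0.2, 0.8]],
    "NOTA": [[0.5, 0.5]],  # a single generic none-of-the-above prototype
}

prototypes = build_prototypes(support)
prediction = classify([0.9, 0.1], prototypes)
print(prediction)
```

Because every support pair holding a relation is averaged into its prototype, a pair that also holds other relations pulls the prototype away from the target semantics, which is the first weakness the abstract describes.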

Exploring the Compositional Generalization in Context Dependent Text-to-SQL Parsing
Aiwei Liu | Wei Liu | Xuming Hu | Shuang Li | Fukun Ma | Yawen Yang | Lijie Wen
Findings of the Association for Computational Linguistics: ACL 2023

In the context-dependent Text-to-SQL task, the generated SQL statements are refined iteratively based on the user input utterance from each interaction. The input text from each interaction can be viewed as component modifications to the previous SQL statements, which could be further extracted as the modification patterns. Since these modification patterns could also be combined with other SQL statements, the models are supposed to have the compositional generalization to these novel combinations. This work is the first exploration of compositional generalization in context-dependent Text-to-SQL scenarios. To facilitate related studies, we constructed two challenging benchmarks named CoSQL-CG and SParC-CG by recombining the modification patterns and existing SQL statements. The following experiments show that almost all current models struggle on our proposed benchmarks. Furthermore, we found that better aligning the previous SQL statements with the input utterance could give models better combinatorial generalization ability. Based on these observations, we propose a method named p-align to improve the combinatorial generalization of Text-to-SQL models. Further experiments validate the effectiveness of our model.

Enhancing Cross-lingual Natural Language Inference by Soft Prompting with Multilingual Verbalizer
Shuang Li | Xuming Hu | Aiwei Liu | Yawen Yang | Fukun Ma | Philip S. Yu | Lijie Wen
Findings of the Association for Computational Linguistics: ACL 2023

Cross-lingual natural language inference is a fundamental problem in cross-lingual language understanding. Many recent works have used prompt learning to address the lack of annotated parallel corpora in XNLI. However, these methods adopt discrete prompting by simply translating the templates to the target language and need external expert knowledge to design the templates. Besides, discrete prompts of human-designed template words are not trainable vectors and cannot be migrated to target languages in the inference stage flexibly. In this paper, we propose a novel Soft prompt learning framework with the Multilingual Verbalizer (SoftMV) for XNLI. SoftMV first constructs cloze-style questions with soft prompts for the input sample. Then we leverage bilingual dictionaries to generate an augmented multilingual question for the original question. SoftMV adopts a multilingual verbalizer to align the representations of original and augmented multilingual questions into a unified semantic space with consistency regularization. Experimental results on XNLI demonstrate that SoftMV can achieve state-of-the-art performance and significantly outperform the previous methods under the few-shot and full-shot cross-lingual transfer settings.

Automatic Table Union Search with Tabular Representation Learning
Xuming Hu | Shen Wang | Xiao Qin | Chuan Lei | Zhengyuan Shen | Christos Faloutsos | Asterios Katsifodimos | George Karypis | Lijie Wen | Philip S. Yu
Findings of the Association for Computational Linguistics: ACL 2023

Given a data lake of tabular data as well as a query table, how can we retrieve all the tables in the data lake that can be unioned with the query table? Table union search constitutes an essential task in data discovery and preparation as it enables data scientists to navigate massive open data repositories. Existing methods identify unionability based on column representations (word surface forms or token embeddings) and column relation represented by column representation similarity. However, the semantic similarity obtained between column representations is often insufficient to reveal latent relational features that describe the column relation between a pair of columns and is not robust to table noise. To address these issues, in this paper, we propose a multi-stage self-supervised table union search framework called AutoTUS, which represents column relation as a vector (the column relational representation) and learns this representation in a multi-stage manner so that it better describes column relation for unionability prediction. In particular, the large language model powered contextualized column relation encoder is updated by adaptive clustering and pseudo label classification iteratively so that a better column relational representation can be learned. Moreover, to improve the robustness of the model against table noise, we propose a table noise generator to add table noise to the training table data. Experiments on real-world datasets as well as a synthetic test set augmented with table noise show that AutoTUS achieves a 5.2% performance gain over the SOTA baseline.

Entity-to-Text based Data Augmentation for various Named Entity Recognition Tasks
Xuming Hu | Yong Jiang | Aiwei Liu | Zhongqiang Huang | Pengjun Xie | Fei Huang | Lijie Wen | Philip S. Yu
Findings of the Association for Computational Linguistics: ACL 2023

Data augmentation techniques have been used to alleviate the problem of scarce labeled data in various NER tasks (flat, nested, and discontinuous NER tasks). Existing augmentation techniques either manipulate the words in the original text that break the semantic coherence of the text, or exploit generative models that ignore preserving entities in the original text, which impedes the use of augmentation techniques on nested and discontinuous NER tasks. In this work, we propose a novel Entity-to-Text based data augmentation technique named EnTDA to add, delete, replace or swap entities in the entity list of the original texts, and adopt these augmented entity lists to generate semantically coherent and entity preserving texts for various NER tasks. Furthermore, we introduce a diversity beam search to increase the diversity during the text generation process. Experiments on thirteen NER datasets across three tasks (flat, nested, and discontinuous NER tasks) and two settings (full data and low resource settings) show that EnTDA could bring more performance improvements compared to the baseline augmentation techniques.

GDA: Generative Data Augmentation Techniques for Relation Extraction Tasks
Xuming Hu | Aiwei Liu | Zeqi Tan | Xin Zhang | Chenwei Zhang | Irwin King | Philip S. Yu
Findings of the Association for Computational Linguistics: ACL 2023

Relation extraction (RE) tasks show promising performance in extracting relations from two entities mentioned in sentences, given sufficient annotations available during training. Such annotations would be labor-intensive to obtain in practice. Existing work adopts data augmentation techniques to generate pseudo-annotated sentences beyond limited annotations. These techniques neither preserve the semantic consistency of the original sentences when rule-based augmentations are adopted, nor preserve the syntax structure of sentences when expressing relations using seq2seq models, resulting in less diverse augmentations. In this work, we propose a dedicated augmentation technique for relational texts, named GDA, which uses two complementary modules to preserve both semantic consistency and syntax structures. We adopt a generative formulation and design a multi-tasking solution to achieve synergies. Furthermore, GDA adopts entity hints as the prior knowledge of the generative model to augment diverse sentences. Experimental results in three datasets under a low-resource setting showed that GDA could bring 2.0% F1 improvements compared with no augmentation technique.

2022

Domain-Specific NER via Retrieving Correlated Samples
Xin Zhang | Yong Jiang | Xiaobin Wang | Xuming Hu | Yueheng Sun | Pengjun Xie | Meishan Zhang
Proceedings of the 29th International Conference on Computational Linguistics

Successful machine-learning-based Named Entity Recognition models can fail on texts from some special domains, for instance, Chinese addresses and e-commerce titles, which require adequate background knowledge. Such texts are also difficult for human annotators. In fact, we can obtain some potentially helpful information from correlated texts, which share some common entities, to help text understanding. Then, one can easily reason out the correct answer by referencing correlated samples. In this paper, we suggest enhancing NER models with correlated samples. We draw correlated samples with the sparse BM25 retriever from large-scale in-domain unlabeled data. To explicitly simulate the human reasoning process, we perform a training-free entity type calibration by majority voting. To capture correlation features in the training stage, we suggest modeling correlated samples with a transformer-based multi-instance cross-encoder. Empirical results on datasets of the above two domains show the efficacy of our methods.
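
As a rough illustration of the training-free calibration step described above, the sketch below relabels a mention by majority vote over the types that the same surface string received in retrieved samples. The function name, the data layout, and the tie-breaking rule are assumptions made for this example, not the paper's implementation.

```python
from collections import Counter


def calibrate_entity_type(mention, predicted_type, correlated_predictions):
    """Return the majority entity type for `mention`.

    correlated_predictions is a hypothetical list of
    (mention_text, entity_type) pairs predicted on retrieved samples.
    """
    # Count the types assigned to the same surface string elsewhere.
    votes = Counter(t for m, t in correlated_predictions if m == mention)
    # The original prediction also casts one vote, so it is kept when
    # no correlated sample contains the mention.
    votes[predicted_type] += 1
    top_count = max(votes.values())
    winners = [t for t, c in votes.items() if c == top_count]
    # Assumed tie-break: prefer the model's own prediction.
    return predicted_type if predicted_type in winners else sorted(winners)[0]
```

With two correlated samples labeling a mention `LOC` and the model predicting `ORG`, the vote is 2 to 1 and the mention is relabeled `LOC`.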

Scene Graph Modification as Incremental Structure Expanding
Xuming Hu | Zhijiang Guo | Yu Fu | Lijie Wen | Philip S. Yu
Proceedings of the 29th International Conference on Computational Linguistics

A scene graph is a semantic representation that expresses the objects, attributes, and relationships between objects in a scene. Scene graphs play an important role in many cross modality tasks, as they are able to capture the interactions between images and texts. In this paper, we focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query. Unlike previous approaches that rebuilt the entire scene graph, we frame SGM as a graph expansion task by introducing the incremental structure expanding (ISE). ISE constructs the target graph by incrementally expanding the source graph without changing the unmodified structure. Based on ISE, we further propose a model that iterates between nodes prediction and edges prediction, inferring more accurate and harmonious expansion decisions progressively. In addition, we construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets. Experiments on four benchmarks demonstrate the effectiveness of our approach, which surpasses the previous state-of-the-art model by large margins.

Character-level White-Box Adversarial Attacks against Transformers via Attachable Subwords Substitution
Aiwei Liu | Honghai Yu | Xuming Hu | Shu’ang Li | Li Lin | Fukun Ma | Yawen Yang | Lijie Wen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We propose the first character-level white-box adversarial attack method against transformer models. The intuition of our method comes from the observation that words are split into subtokens before being fed into the transformer models and the substitution between two close subtokens has a similar effect to character modification. Our method mainly contains three steps. First, a gradient-based method is adopted to find the most vulnerable words in the sentence. Then we split the selected words into subtokens to replace the original tokenization result from the transformer tokenizer. Finally, we utilize an adversarial loss to guide the substitution of attachable subtokens, in which the Gumbel-softmax trick is introduced to ensure gradient propagation. Meanwhile, we introduce visual and length constraints in the optimization process to achieve minimum character modifications. Extensive experiments on both sentence-level and token-level tasks demonstrate that our method could outperform the previous attack methods in terms of success rate and edit distance. Furthermore, human evaluation verifies that our adversarial examples could preserve their original labels.
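
For readers unfamiliar with the Gumbel-softmax trick mentioned above, the following is a minimal pure-Python sketch of the relaxed sampling step only. It is not the paper's attack: the function name and the example logits are assumptions, and gradient propagation itself would require an automatic differentiation framework.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot sample over candidate substitutions."""
    noisy = []
    for logit in logits:
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures push the output closer to a one-hot vector, while higher temperatures make it smoother.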

Query-based Instance Discrimination Network for Relational Triple Extraction
Zeqi Tan | Yongliang Shen | Xuming Hu | Wenqi Zhang | Xiaoxia Cheng | Weiming Lu | Yueting Zhuang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Joint entity and relation extraction has been a core task in the field of information extraction. Recent approaches usually consider the extraction of relational triples from a stereoscopic perspective, either learning a relation-specific tagger or separate classifiers for each relation type. However, they still suffer from error propagation, relation redundancy and a lack of high-level connections between triples. To address these issues, we propose a novel query-based approach to construct instance-level representations for relational triples. By metric-based comparison between query embeddings and token embeddings, we can extract all types of triples in one step, thus eliminating the error propagation problem. In addition, we learn the instance-level representation of relational triples via contrastive learning. In this way, relational triples can not only enclose rich class-level semantics but also access high-order global connections. Experimental results show that our proposed method achieves the state of the art on five widely used benchmarks.

CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking
Xuming Hu | Zhijiang Guo | GuanYu Wu | Aiwei Liu | Lijie Wen | Philip Yu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The explosion of misinformation spreading in the media ecosystem calls for automated fact-checking. While misinformation spans both geographic and linguistic boundaries, most work in the field has focused on English. Datasets and tools available in other languages, such as Chinese, are limited. In order to bridge this gap, we construct CHEF, the first CHinese Evidence-based Fact-checking dataset of 10K real-world claims. The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet. Further, we develop established baselines and a novel approach that is able to model the evidence retrieval as a latent variable, allowing joint training with the veracity prediction model in an end-to-end fashion. Extensive experiments show that CHEF will provide a challenging testbed for the development of fact-checking systems designed to retrieve and reason over non-English claims.

HiURE: Hierarchical Exemplar Contrastive Learning for Unsupervised Relation Extraction
Shuliang Liu | Xuming Hu | Chenwei Zhang | Shu’ang Li | Lijie Wen | Philip Yu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Unsupervised relation extraction aims to extract the relationship between entities from natural language sentences without prior information on relational scope or distribution. Existing works either utilize self-supervised schemes to refine relational feature signals by iteratively leveraging adaptive clustering and classification that provoke gradual drift problems, or adopt instance-wise contrastive learning which unreasonably pushes apart those sentence pairs that are semantically similar. To overcome these defects, we propose a novel contrastive learning framework named HiURE, which has the capability to derive hierarchical signals from relational feature space using cross hierarchy attention and effectively optimize relation representation of sentences under exemplar-wise contrastive learning. Experimental results on two public datasets demonstrate the advanced effectiveness and robustness of HiURE on unsupervised relation extraction when compared with state-of-the-art models.

2021

Gradient Imitation Reinforcement Learning for Low Resource Relation Extraction
Xuming Hu | Chenwei Zhang | Yawen Yang | Xiaohe Li | Li Lin | Lijie Wen | Philip S. Yu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Low-resource Relation Extraction (LRE) aims to extract relation facts from limited labeled corpora when human annotation is scarce. Existing works either utilize a self-training scheme to generate pseudo labels, which causes the gradual drift problem, or leverage a meta-learning scheme that does not solicit feedback explicitly. To alleviate selection bias due to the lack of feedback loops in existing LRE learning paradigms, we developed a Gradient Imitation Reinforcement Learning method to encourage pseudo label data to imitate the gradient descent direction on labeled data and bootstrap its optimization capability through trial and error. We also propose a framework called GradLRE, which handles two major scenarios in low-resource relation extraction. Besides the scenario where unlabeled data is sufficient, GradLRE handles the situation where no unlabeled data is available, by exploiting a contextualized augmentation method to generate data. Experimental results on two public datasets demonstrate the effectiveness of GradLRE on low-resource relation extraction when compared with baselines.

Semi-supervised Relation Extraction via Incremental Meta Self-Training
Xuming Hu | Chenwei Zhang | Fukun Ma | Chenyao Liu | Lijie Wen | Philip S. Yu
Findings of the Association for Computational Linguistics: EMNLP 2021

To reduce the human effort of obtaining large-scale annotations, Semi-Supervised Relation Extraction methods aim to leverage unlabeled data in addition to learning from limited samples. Existing self-training methods suffer from the gradual drift problem, where noisy pseudo labels on unlabeled data are incorporated during training. To alleviate the noise in pseudo labels, we propose a method called MetaSRE, where a Relation Label Generation Network generates accurate quality assessments of pseudo labels by (meta) learning from the successful and failed attempts on the Relation Classification Network as an additional meta-objective. To reduce the influence of noisy pseudo labels, MetaSRE adopts a pseudo label selection and exploitation scheme which assesses pseudo label quality on unlabeled samples and only exploits high-quality pseudo labels in a self-training fashion to incrementally augment labeled samples for both robustness and accuracy. Experimental results on two public datasets demonstrate the effectiveness of the proposed approach.
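
The selection-and-exploitation idea above can be pictured with a small sketch: keep only pseudo labels whose quality score clears a cutoff, then add them to the labeled pool for the next self-training round. The fixed threshold, the tuple layout, and both function names are assumptions for illustration; the paper's actual selection scheme may differ.

```python
def select_pseudo_labels(candidates, threshold=0.9):
    """Keep (sentence, label) pairs whose quality score is high enough.

    candidates is a hypothetical list of
    (sentence, pseudo_label, quality_score) tuples.
    """
    return [(s, label) for s, label, score in candidates if score >= threshold]


def self_training_round(labeled, candidates, threshold=0.9):
    """Return the labeled pool augmented with the selected pseudo labels."""
    # Build a new list so the caller's labeled pool is not mutated.
    return list(labeled) + select_pseudo_labels(candidates, threshold)
```

Repeating this round with a retrained classifier grows the labeled set incrementally while filtering out low-quality pseudo labels.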

2020

SelfORE: Self-supervised Relational Feature Learning for Open Relation Extraction
Xuming Hu | Lijie Wen | Yusong Xu | Chenwei Zhang | Philip Yu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Open relation extraction is the task of extracting open-domain relation facts from natural language sentences. Existing works either utilize heuristics or distant-supervised annotations to train a supervised classifier over pre-defined relations, or adopt unsupervised methods with additional assumptions that have less discriminative power. In this work, we propose a self-supervised framework named SelfORE, which exploits weak, self-supervised signals by leveraging a large pretrained language model for adaptive clustering on contextualized relational features, and bootstraps the self-supervised signals by improving contextualized features in relation classification. Experimental results on three datasets show the effectiveness and robustness of SelfORE on open-domain Relation Extraction when compared with competitive baselines.

Co-authors

Chenwei Zhang 5

Asterios Katsifodimos 3

Christos Faloutsos 2

Yuanhuiyi Lyu 2

Huzefa Rangwala 2

Zhengyuan Shen 2

Linfeng Zhang 2

Duan Bingchen 1

Xiaoxia Cheng 1

Cissy Hing Yee Choy 1

Zihao Dongfang 1

Zhongqiang Huang 1

Haojing Huang 1

Haoming Huang 1

George Karypis 1

Zhenghua Li (李正华) 1

Yongliang Shen 1

Balasubramaniam Srinivasan 1

Yik-Cheung Tam 1

Jesse Jiaxi Xu 1

Meishan Zhang 1

Xuandong Zhao 1

Hai-Tao Zheng 1

Yueting Zhuang 1

Venues