Xuefeng Bai (白雪峰) - ACL Anthology

Xuefeng Bai

Also published as: 雪峰白

2025

Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
Deng Qiyuan | Xuefeng Bai | Kehai Chen | Yaowei Wang | Liqiang Nie | Min Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources.In this paper, we hypothesize that during off-policy training, while the ranking order of output generated by policy changes, their overall distribution remains relatively stable.This stability allows the conversion of the sampling process from the target policy into a computationallyefficient re-ranking of preference data.Building on this hypothesis, we propose a new framework that leverages the model’s intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preference reordering. Extensive experiments and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while avoiding about 300x computational overheads.

Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent
Xingzuo Li | Kehai Chen | Yunfei Long | Xuefeng Bai | Yong Xu | Min Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language model (LLM) agents typically adopt a step-by-step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep-rooted one-pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address the issue, this paper proposes a novel framework called Generator-Assistant Stepwise Rollback (GA-Rollback) to induce better decision-making for LLM agents. Particularly, GA-Rollback utilizes a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detection of incorrect actions. Moreover, we introduce two additional strategies tailored for the rollback scenario to further improve its effectiveness. Extensive experiments show that GA-Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA-Rollback can function as a robust plug-and-play module, integrating seamlessly with other methods.

Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning
Yingjie Zhu | Xuefeng Bai | Kehai Chen | Yang Xiang | Jun Yu | Min Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through three self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs’ performance on fundamental and downstream graph learning tasks, as well as enhancing their robustness against complex visual graphs.

System Report for CCL25-Eval Task 3: Hallucination Mitigation in Chinese Abstract Meaning Representation Parsing with a Multi-Agent Approach
Rongbo Chen | Xuefeng Bai | Kehai Chen | Min Zhang
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

"This paper introduces our system for the Fifth Chinese Abstract Meaning Representation(CAMR) Parsing Evaluation task at the 24th China National Conference on ComputationalLinguistics (CCL 2025). Our framework formulates both CAMR parsing and document-level coreference resolution as sequence-to-sequence generation tasks, employing large languagemodels (LLMs) to produce linearized CAMR sequences and coreference sequences. To mitigate hallucinations in generated graphs, we design a multi-agent system comprising: (1) two detection agents for automated error detection and hallucination identification; (2) a refinement agent that corrects graph structures based on detected inconsistencies. Experimental results show that:(1) recent LLMs, especially Qwen-3, achieve promising performance in CAMR parsing; (2)the proposed multi-agent system can effectively identify and correct hallucinations of CAMR predictions; and (3) sequence-to-sequence methods exhibit significant limitations in document-level coreference resolution due to context length constraints."

The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Yihong Tang | Kehai Chen | Xuefeng Bai | Zheng-Yu Niu | Bo Wang | Jie Liu | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created by villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety. We further introduce Coupling Margin Sampling (CMS) into coupling detection to enhance the model’s ability to handle high-risk scenarios. Experimental results demonstrate that our approach improves safety metrics while maintaining utility.

Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation
Andong Chen | Yuchen Song | Kehai Chen | Xuefeng Bai | Muyun Yang | Liqiang Nie | Jie Liu | Tiejun Zhao | Min Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Visual information has been introduced for enhancing machine translation (MT), and its effectiveness heavily relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing the multimodel MT. Particularly, we build heuristic feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence without the supervision of visual information, which breaks the high-cost bottleneck of image annotation in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT, especially achieving an average improvement of more than 14 BLEU points on Multi30K and MSCOCO multimodal MT benchmarks.

HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track
Xuchen Wei | Yangxin Wu | Yaoyin Zhang | Henglyu Liu | Kehai Chen | Xuefeng Bai | Min Zhang
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

This paper presents HITSZ’s submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of 28.88 for English-to-Indic directions and 27.86 for Indic-to-English directions. Furthermore, we investigated the Chain-of-Thought (CoT) method. While this method showed potential for significant translation quality improvements on successfully parsed outputs (e.g. a 13.84 BLEU increase for Tamil-to-English), we observed challenges in ensuring the model consistently adheres to the required CoT output format.

LLM-based Translation Inference with Iterative Bilingual Understanding
Andong Chen | Kehai Chen | Yang Xiang | Xuefeng Bai | Muyun Yang | Yang Feng | Tiejun Zhao | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2025

The remarkable understanding and generation capabilities of large language models (LLMs) have greatly improved translation performance. However, incorrect understanding of the sentence to be translated can degrade translation quality. To address this issue, we proposed a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of LLMs and the dual characteristics of translation tasks. The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately. Furthermore, the dual characteristics allow IBUT to generate effective cross-lingual feedback, iteratively refining contextual understanding, thereby reducing errors and improving translation performance. Experimental results showed that the proposed IBUT outperforms several strong comparison methods, especially being generalized to multiple domains (e.g., news, commonsense, and cultural translation benchmarks).

Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance
Andong Chen | Lianzhang Lou | Kehai Chen | Xuefeng Bai | Yang Xiang | Muyun Yang | Tiejun Zhao | Min Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have shown remarkable performance in general translation tasks. However, the increasing demand for high-quality translations that are not only adequate but also fluent and elegant. To assess the extent to which current LLMs can meet these demands, we introduce a suitable benchmark (PoetMT) for translating classical Chinese poetry into English. This task requires not only adequacy in translating culturally and historically significant content but also a strict adherence to linguistic fluency and poetic elegance. Our study reveals that existing LLMs fall short of this task. To address these issues, we propose RAT, a Retrieval-Augmented machine Translation method that enhances the translation process by incorporating knowledge related to classical poetry. Additionally, we propose an automatic evaluation metric based on GPT-4, which better assesses translation quality in terms of adequacy, fluency, and elegance, overcoming the limitations of traditional metrics.

Overview of CCL25-Eval Task 5: Chinese Classical Poetry Appreciation Evaluation (CCPA) Task
Zhenwu Pei | Yingjie Zhu | Rongbo Chen | Xuefeng Bai | Kehai Chen | Min Zhang
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

"This paper presents a review of CCL2025-Eval Task 5: Appreciation Evaluation (CCPA). The primary aim of this task is to evaluate the ability of lan-guage models in performing deep semantic understanding and aesthetic appreciation of Chinese classical poetry. The evaluation comprises two tracks: (1) Poetic content understanding, which examines models’ ability to interpret both fine-grained and coarse-grained semantics; (2) Poetic emotion recognition, which evaluates models’ capacity to identify and analyze emotional expressions. A total of 55 teams registered for the task, among which 7 teams provided valid submissions. The paper provides an in-depth analysis of the submissions and results from all participating teams."

2024

面向中文抽象语义表示解析的大模型评估与增强
Rongbo Chen (陈荣波) | Zhenwu Pei (裴振武) | Xuefeng Bai (白雪峰) | Kehai Chen (陈科海) | Min Zhang (张民)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

“本文介绍了我们在第二十三届中文计算语言学大会中文抽象语义表示解析评测任务中提交的参赛系统。中文抽象语义表示(Chinese Abstract Meaning Representa-tion,CAMR)以一个单根可遍历的有向无环图表示中文句子的语义。本系统选择大语言模型作为解决方案。我们首先系统地评估了当下中文大语言模型在AMR解析任务上的性能,在此基础上基于图融合算法整合性能较高的大模型预测结果,最终得到预测的CAMR图。实验结果表明,1)现有大模型已经具备一定的少样本中文AMR解析能力;2)基于微调中文大模型的AMR解析系统能够取得相较以往最优系统更强的性能;3)图融合算法能够进一步增强基于大模型的CAMR解析系统的性能。”

DUAL-REFLECT: Enhancing Large Language Models for Reflective Translation through Dual Learning Feedback Mechanisms
Andong Chen | Lianzhang Lou | Kehai Chen | Xuefeng Bai | Yang Xiang | Muyun Yang | Tiejun Zhao | Min Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Recently, large language models (LLMs) enhanced by self-reflection have achieved promising performance on machine transla004 tion. The key idea is guiding LLMs to generate translation with human-like feedback. However, existing self-reflection methods lack effective feedback information, limiting the translation performance. To address this, we introduce a DUAL-REFLECT framework, leveraging the dual learning of translation tasks to provide effective feedback, thereby enhancing the models’ self-reflective abilities and improving translation performance. The application of this method across various translation tasks has proven its effectiveness in improving translation accuracy and eliminating ambiguities, especially in translation tasks with low-resource language pairs.

Paying More Attention to Source Context: Mitigating Unfaithful Translations from Large Language Model
Hongbin Zhang | Kehai Chen | Xuefeng Bai | Yang Xiang | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2024

Large language models (LLMs) have showcased their remarkable capabilities to handle various downstream tasks, including multilingual machine translation ability. Despite their impressive performance, decoder-only LLMs lack an explicit alignment between source and target contexts, leading to translation that may not faithfully represent the original content. To address this, we propose three learning strategies to encourage LLMs to pay more attention to the source context during translation: 1) adjusting attention weights on the source context by adaptive attention re-weighting; 2) suppressing the irrelevant target prefix using contrastive decoding; 3) avoiding excessive reliance on the target prefix through target-constrained tuning. To verify the effectiveness of our model, we curate a new dataset specifically focusing on unfaithful translations generated by LLMs. Experimental results on both human-collected and general test sets verify the effectiveness of our model across multiple language pairs. Further human evaluation demonstrates the efficacy of our method in reducing hallucinatory translation and improving the fidelity of translations.

Chinese Vision-Language Understanding Evaluation
Jiangkuo Wang | Linwei Zheng | Kehai Chen | Xuefeng Bai | Min Zhang
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

“This paper introduces our systems submitted for the Chinese Vision-Language Understanding Evaluation task at the 23rd Chinese Computational Linguistics Conference.In this competition, we utilized X2-VLM and CCLM models to participate in various subtasks such as image-text retrieval, visual grounding, visual dialogue, and visual question answering. Additionally, we employed other models to assess performance on certain subtasks. We optimized our models and successfully applied them to these different tasks”

Question-guided Knowledge Graph Re-scoring and Injection for Knowledge Graph Question Answering
Yu Zhang | Kehai Chen | Xuefeng Bai | Zhao Kang | Quanjiang Guo | Min Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Knowledge graph question answering (KGQA) involves answering natural language questions by leveraging structured information stored in a knowledge graph. Typically, KGQA initially retrieve a targeted subgraph from a large-scale knowledge graph, which serves as the basis for reasoning models to address queries. However, the retrieved subgraph inevitably brings distraction information for knowledge utilization, impeding the model’s ability to perform accurate reasoning. To address this issue, we propose a Question-guided Knowledge Graph Re-scoring method (Q-KGR) to eliminate noisy pathways for the input question, thereby focusing specifically on pertinent factual knowledge.Moreover, we introduce Knowformer, a parameter-efficient method for injecting the re-scored knowledge graph into large language models to enhance their ability to perform factual reasoning.Extensive experiments on multiple KGQA benchmarks demonstrate the superiority of our method over existing systems.

2023

Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation
Yulong Chen | Huajian Zhang | Yijie Zhou | Xuefeng Bai | Yueguan Wang | Ming Zhong | Jianhao Yan | Yafu Li | Judy Li | Xianchao Zhu | Yue Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most existing cross-lingual summarization (CLS) work constructs CLS corpora by simply and directly translating pre-annotated summaries from one language to another, which can contain errors from both summarization and translation processes. To address this issue, we propose ConvSumX, a cross-lingual conversation summarization benchmark, through a new annotation schema that explicitly considers source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, with each covering 3 language directions. We conduct thorough analysis on ConvSumX and 3 widely-used manually annotated CLS corpora and empirically find that ConvSumX is more faithful towards input text. Additionally, based on the same intuition, we propose a 2-Step method, which takes both conversation and summary as input to simulate human annotation process. Experimental results show that 2-Step method surpasses strong baselines on ConvSumX under both automatic and human evaluation. Analysis shows that both source input text and summary are crucial for modeling cross-lingual summaries.

CCL23-Eval 任务2系统报告:WestlakeNLP,基于生成式大语言模型的中文抽象语义表示解析(System Report for CCL23-Eval Task 2: WestlakeNLP, Investigating Generative Large Language Models for Chinese AMR Parsing)
Wenyang Gao (高文炀) | Xuefeng Bai (白雪峰) | Yue Zhang (张岳)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

“本文介绍了我们在第二十二届中文计算语言学大会中文抽象语义表示解析评测任务中提交的参赛系统。中文抽象语义表示(Chinese Abstract Meaning Representa-tion,CAMR)不仅以图的方式表示句子的语义,还保证了概念对齐和关系对齐。近期,生成式大规模语言模型在诸多自然语言处理任务上展现了优秀的生成能力和泛化能力。受此启发,我们选择微调Baichuan-7B模型来以端到端的形式从文本直接生成序列化的CAMR。实验结果表明,我们的系统能够在不依赖于词性、依存句法信息以及复杂规则的前提下取得了同现有方法可比的性能。”

Exploiting Abstract Meaning Representation for Open-Domain Question Answering
Cunxiang Wang | Zhikun Xu | Qipeng Guo | Xiangkun Hu | Xuefeng Bai | Zheng Zhang | Yue Zhang
Findings of the Association for Computational Linguistics: ACL 2023

The Open-Domain Question Answering (ODQA) task involves retrieving and subsequently generating answers from fine-grained relevant passages within a database. Current systems leverage Pretrained Language Models (PLMs) to model the relationship between questions and passages. However, the diversity in surface form expressions can hinder the model’s ability to capture accurate correlations, especially within complex contexts. Therefore, we utilize Abstract Meaning Representation (AMR) graphs to assist the model in understanding complex semantic information. We introduce a method known as Graph-as-Token (GST) to incorporate AMRs into PLMs. Results from Natural Questions (NQ) and TriviaQA (TQ) demonstrate that our GST method can significantly improve performance, resulting in up to 2.44/3.17 Exact Match score improvements on NQ/TQ respectively. Furthermore, our method enhances robustness and outperforms alternative Graph Neural Network (GNN) methods for integrating AMRs. To the best of our knowledge, we are the first to employ semantic graphs in ODQA.

2022

Cross-domain Generalization for AMR Parsing
Xuefeng Bai | Sen Yang | Leyang Cui | Linfeng Song | Yue Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Abstract Meaning Representation (AMR) parsing aims to predict an AMR graph from textual input. Recently, there has been notable growth in AMR parsing performance. However, most existing work focuses on improving the performance in the specific domain, ignoring the potential domain dependence of AMR parsing systems. To address this, we extensively evaluate five representative AMR parsers on five domains and analyze challenges to cross-domain AMR parsing. We observe that challenges to cross-domain AMR parsing mainly arise from the distribution shift of words and AMR concepts. Based on our observation, we investigate two approaches to reduce the domain distribution divergence of text and AMR features, respectively. Experimental results on two out-of-domain test sets show the superiority of our method.

The Cross-lingual Conversation Summarization Challenge
Yulong Chen | Ming Zhong | Xuefeng Bai | Naihao Deng | Jing Li | Xianchao Zhu | Yue Zhang
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We propose the shared task of cross-lingual conversation summarization, ConvSumX Challenge, opening new avenues for researchers to investigate solutions that integrate conversation summarization and machine translation. This task can be particularly useful due to the emergence of online meetings and conferences. We use a new benchmark, covering 2 real-world scenarios and 3 language directions, including a low-resource language, for evaluation. We hope that ConvSumX can motivate research to go beyond English and break the barrier for non-English speakers to benefit from recent advances of conversation summarization.

Graph Pre-training for AMR Parsing and Generation
Xuefeng Bai | Yulong Chen | Yue Zhang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Abstract meaning representation (AMR) highlights the core semantic information of text in a graph structure. Recently, pre-trained language models (PLMs) have advanced tasks of AMR parsing and AMR-to-text generation, respectively. However, PLMs are typically pre-trained on textual data, thus are sub-optimal for modeling structural knowledge. To this end, we investigate graph self-supervised training to improve the structure awareness of PLMs over AMR graphs. In particular, we introduce two graph auto-encoding strategies for graph-to-graph pre-training and four tasks to integrate text and graph information during pre-training. We further design a unified framework to bridge the gap between pre-training and fine-tuning tasks. Experiments on both AMR parsing and AMR-to-text generation show the superiority of our model. To our knowledge, we are the first to consider pre-training on semantic graphs.

Semantic-based Pre-training for Dialogue Understanding
Xuefeng Bai | Linfeng Song | Yue Zhang
Proceedings of the 29th International Conference on Computational Linguistics

Pre-trained language models have made great progress on dialogue tasks. However, these models are typically trained on surface dialogue text, thus are proven to be weak in understanding the main semantic meaning of a dialogue context. We investigate Abstract Meaning Representation (AMR) as explicit semantic knowledge for pre-training models to capture the core semantic information in dialogues during pre-training. In particular, we propose a semantic-based pre-training framework that extends the standard pre-training framework (Devlin et al.,2019) by three tasks for learning 1) core semantic units, 2) semantic relations and 3) the overall semantic representation according to AMR graphs. Experiments on the understanding of both chit-chats and task-oriented dialogues show the superiority of our model. To our knowledge, we are the first to leverage a deep semantic representation for dialogue pre-training.

2021

Semantic Representation for Dialogue Modeling
Xuefeng Bai | Yulong Chen | Linfeng Song | Yue Zhang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Although neural models have achieved competitive results in dialogue systems, they have shown limited ability in representing core semantics, such as ignoring important entities. To this end, we exploit Abstract Meaning Representation (AMR) to help dialogue modeling. Compared with the textual input, AMR explicitly provides core semantic knowledge and reduces data sparsity. We develop an algorithm to construct dialogue-level AMR graphs from sentence-level AMRs and explore two ways to incorporate AMRs into dialogue systems. Experimental results on both dialogue understanding and response generation tasks show the superiority of our model. To our knowledge, we are the first to leverage a formal semantic representation into neural dialogue modeling.

2020

Online Back-Parsing for AMR-to-Text Generation
Xuefeng Bai | Linfeng Song | Yue Zhang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

AMR-to-text generation aims to recover a text containing the same meaning as an input AMR graph. Current research develops increasingly powerful graph encoders to better represent AMR graphs, with decoders based on standard language modeling being used to generate outputs. We propose a decoder that back predicts projected AMR graphs on the target sentence during text generation. As the result, our outputs can better preserve the input meaning than standard decoders. Experiments on two AMR benchmarks show the superiority of our model over the previous state-of-the-art system based on graph Transformer.

Co-authors

Muyun Yang (杨沐昀) 4

Tiejun Zhao (赵铁军) 4

Lianzhang Lou 2

Wenyang Gao (高文炀) 1

Quanjiang Guo 1

Cunxiang Wang 1

Jiangkuo Wang 1

Huajian Zhang 1

Hongbin Zhang 1

Venues