Wen Xiao - ACL Anthology

2025

LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback
Bofei Gao | Zefan Cai | Runxin Xu | Peiyi Wang | Ce Zheng | Runji Lin | Keming Lu | Dayiheng Liu | Chang Zhou | Wen Xiao | Tianyu Liu | Baobao Chang
Findings of the Association for Computational Linguistics: ACL 2025

Recently, mathematical verifiers have achieved success in mathematical reasoning tasks by validating the correctness of solutions generated by policy models. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess the solutions. To mitigate this insufficiency of binary labels, we introduce step-wise natural language feedback as rationale labels, that is, the correctness of each step and the detailed explanations. In this paper, we propose Math-Minos, a natural language feedback-enhanced verifier built with automatically generated training data and a two-stage training paradigm for effective training and efficient inference. Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier in both verification and reinforcement learning, and can also significantly alleviate the data-demanding problems of the reward model, with an over 700% data efficiency improvement.

Proceedings of The 5th New Frontiers in Summarization Workshop
Yue Dong | Wen Xiao | Haopeng Zhang | Rui Zhang | Ori Ernst | Lu Wang | Fei Liu
Proceedings of The 5th New Frontiers in Summarization Workshop

2024

Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
Yu Fu | Wen Xiao | Jia Chen | Jiachen Li | Evangelos Papalexakis | Aichi Chien | Yue Dong
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

Recent studies reveal that Large Language Models (LLMs) face challenges in balancing safety with utility, particularly when processing long texts for NLP tasks like summarization and translation. Despite defenses against malicious short questions, the ability of LLMs to safely handle dangerous long content, such as manuals teaching illicit activities, remains unclear. Our work aims to develop robust defenses for LLMs in processing malicious documents alongside benign NLP task queries. We introduce a defense dataset comprised of safety-related examples and propose single-task and mixed-task losses for instruction tuning. Our empirical results demonstrate that LLMs can significantly enhance their capacity to safely manage dangerous content with appropriate instruction tuning. Additionally, strengthening the defenses of tasks most susceptible to misuse is effective in protecting LLMs against processing harmful information. We also observe that trade-offs between utility and safety exist in defense strategies, where Llama2, utilizing our proposed approach, displays a significantly better balance compared to Llama1.

Personalized Abstractive Summarization by Tri-agent Generation Pipeline
Wen Xiao | Yujia Xie | Giuseppe Carenini | Pengcheng He
Findings of the Association for Computational Linguistics: EACL 2024

Tailoring outputs from large language models, like ChatGPT, to implicit user preferences remains a challenge despite their impressive generative capabilities. In this paper, we propose a tri-agent generation pipeline comprising a generator, an instructor, and an editor to enhance output personalization. The generator produces an initial output, the instructor automatically generates editing instructions based on user preferences, and the editor refines the output to align with those preferences. The inference-only large language model (ChatGPT) serves as both the generator and editor, with a smaller model acting as the instructor to guide output generation. We train the instructor using editor-steered reinforcement learning, leveraging feedback from a large-scale editor model to optimize instruction generation. Experimental results on two abstractive summarization datasets demonstrate the effectiveness of our approach in generating outputs that better meet user expectations.

Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
Yu Fu | Yufei Li | Wen Xiao | Cong Liu | Yue Dong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent developments in balancing the usefulness and safety of Large Language Models (LLMs) have raised a critical question: Are mainstream NLP tasks adequately aligned with safety considerations? Our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various NLP tasks. For instance, LLMs can effectively summarize malicious long documents but often refuse to translate them. This discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integrity of tasks traditionally deemed more robust, such as translation and question-answering (QA). Moreover, the concurrent use of multiple NLP tasks with lesser safety alignment increases the risk of LLMs inadvertently processing harmful content. We demonstrate these vulnerabilities in various safety-aligned LLMs, particularly Llama2 models, Gemini and GPT-4, indicating an urgent need for strengthening safety alignments across a broad spectrum of NLP tasks.

2023

Proceedings of the 4th New Frontiers in Summarization Workshop
Yue Dong | Wen Xiao | Lu Wang | Fei Liu | Giuseppe Carenini
Proceedings of the 4th New Frontiers in Summarization Workshop

Entity-based SpanCopy for Abstractive Summarization to Improve the Factual Consistency
Wen Xiao | Giuseppe Carenini
Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)

Discourse-aware techniques, including entity-aware approaches, play a crucial role in summarization. In this paper, we propose an entity-based SpanCopy mechanism to tackle the entity-level factual inconsistency problem in abstractive summarization, i.e., reducing the mismatched entities between the generated summaries and the source documents. Complemented by a Global Relevance component to identify summary-worthy entities, our approach demonstrates improved factual consistency while preserving saliency on four summarization datasets, contributing to the effective application of discourse-aware methods to summarization tasks.

Discourse Structure Extraction from Pre-Trained and Fine-Tuned Language Models in Dialogues
Chuyuan Li | Patrick Huber | Wen Xiao | Maxime Amblard | Chloe Braud | Giuseppe Carenini
Findings of the Association for Computational Linguistics: EACL 2023

Discourse processing suffers from data sparsity, especially for dialogues. As a result, we explore approaches to infer latent discourse structures for dialogues, based on attention matrices from Pre-trained Language Models (PLMs). We investigate multiple auxiliary tasks for fine-tuning and show that the dialogue-tailored Sentence Ordering task performs best. To locate and exploit discourse information in PLMs, we propose an unsupervised and a semi-supervised method. Our proposals achieve encouraging results on the STAC corpus, with F1 scores of 57.2 and 59.3 for the unsupervised and semi-supervised methods, respectively. When restricted to projective trees, our scores improve to 63.3 and 68.1.

2022

PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Wen Xiao | Iz Beltagy | Giuseppe Carenini | Arman Cohan
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 multi-document summarization datasets from 3 different domains on zero-shot, few-shot and full-supervised settings, PRIMERA outperforms current state-of-the-art dataset-specific and pre-trained models on most of these settings with large margins.

Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation
Raymond Li | Wen Xiao | Linzi Xing | Lanjun Wang | Gabriel Murray | Giuseppe Carenini
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The multi-head self-attention mechanism of the transformer model has been thoroughly investigated recently. In one vein of study, researchers are interested in understanding why and how transformers work. In another vein, researchers propose new attention augmentation methods to make transformers more accurate, efficient and interpretable. In this paper, we combine these two lines of research in a human-in-the-loop pipeline to first discover important task-specific attention patterns. Then those patterns are injected, not only to smaller models, but also to the original model. The benefits of our pipeline and discovered patterns are demonstrated in two case studies with extractive summarization and topic segmentation. After discovering interpretable patterns in BERT-based models fine-tuned for the two downstream tasks, experiments indicate that when we inject the patterns into attention heads, the models show considerable improvements in accuracy and efficiency.

2021

W-RST: Towards a Weighted RST-style Discourse Framework
Patrick Huber | Wen Xiao | Giuseppe Carenini
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Aiming for a better integration of data-driven and linguistically-inspired approaches, we explore whether RST Nuclearity, assigning a binary assessment of importance between text segments, can be replaced by automatically generated, real-valued scores, in what we call a Weighted-RST framework. In particular, we find that weighted discourse trees from auxiliary tasks can benefit key NLP downstream applications, compared to nuclearity-centered approaches. We further show that real-valued importance distributions partially and interestingly align with the assessment and uncertainty of human annotators.

Predicting Discourse Trees from Transformer-based Neural Summarizers
Wen Xiao | Patrick Huber | Giuseppe Carenini
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Previous work indicates that discourse information benefits summarization. In this paper, we explore whether this synergy between discourse and summarization is bidirectional, by inferring document-level discourse trees from pre-trained neural summarizers. In particular, we generate unlabeled RST-style discourse trees from the self-attention matrices of the transformer model. Experiments across models and datasets reveal that the summarizer learns both dependency- and constituency-style discourse information, which is typically encoded in a single head, covering long- and short-distance discourse dependencies. Overall, the experimental results suggest that the learned discourse information is general and transferable across domains.

Demoting the Lead Bias in News Summarization via Alternating Adversarial Learning
Linzi Xing | Wen Xiao | Giuseppe Carenini
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In news articles, lead bias is a common phenomenon that usually dominates the learning signals for neural extractive summarizers, severely limiting their performance on data with different or even no bias. In this paper, we introduce a novel technique to demote lead bias and make the summarizer focus more on the content semantics. Experiments on two news corpora with different degrees of lead bias show that our method can effectively demote the model’s learned lead bias and improve its generality on out-of-distribution data, with little to no performance loss on in-distribution data.

T3-Vis: visual analytic for Training and fine-Tuning Transformers in NLP
Raymond Li | Wen Xiao | Lanjun Wang | Hyeju Jang | Giuseppe Carenini
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Transformers are the dominant architecture in NLP, but their training and fine-tuning are still very challenging. In this paper, we present the design and implementation of a visual analytic framework for assisting researchers in this process, by providing them with valuable insights about the model’s intrinsic properties and behaviours. Our framework offers an intuitive overview that allows the user to explore different facets of the model (e.g., hidden states, attention) through interactive visualization, and provides a suite of built-in algorithms that compute the importance of model components and different parts of the input sequence. Case studies and feedback from a user focus group indicate that the framework is useful, and suggest several improvements. Our framework is available at: https://github.com/raymondzmc/T3-Vis.

KW-ATTN: Knowledge Infused Attention for Accurate and Interpretable Text Classification
Hyeju Jang | Seojin Bang | Wen Xiao | Giuseppe Carenini | Raymond Ng | Young ji Lee
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Text classification has wide-ranging applications in various domains. While neural network approaches have drastically advanced performance in text classification, they tend to be powered by a large amount of training data, and interpretability is often an issue. As a step towards better accuracy and interpretability, especially on small data, in this paper we present a new knowledge-infused attention mechanism, called KW-ATTN (KnoWledge-infused ATTentioN), to incorporate high-level concepts from external knowledge bases into Neural Network models. We show that KW-ATTN outperforms baseline models using only words, as well as other approaches using concepts, in classification accuracy, which indicates that high-level concepts help model prediction. Furthermore, crowdsourced human evaluation suggests that additional concept information helps interpretability of the model.

2020

Do We Really Need That Many Parameters In Transformer For Extractive Summarization? Discourse Can Help !
Wen Xiao | Patrick Huber | Giuseppe Carenini
Proceedings of the First Workshop on Computational Approaches to Discourse

The multi-head self-attention of popular transformer models is widely used within Natural Language Processing (NLP), including for the task of extractive summarization. With the goal of analyzing and pruning the parameter-heavy self-attention mechanism, there are multiple approaches proposing more parameter-light self-attention alternatives. In this paper, we present a novel parameter-lean self-attention mechanism using discourse priors. Our new tree self-attention is based on document-level discourse information, extending the recently proposed “Synthesizer” framework with another lightweight alternative. We show empirical results that our tree self-attention approach achieves competitive ROUGE scores on the task of extractive summarization. When compared to the original single-head transformer model, the tree attention approach reaches similar performance at both the EDU and sentence levels, despite the significant reduction of parameters in the attention component. We further significantly outperform the 8-head transformer model at the sentence level when applying a more balanced hyper-parameter setting, requiring an order of magnitude fewer parameters.

Systematically Exploring Redundancy Reduction in Summarizing Long Documents
Wen Xiao | Giuseppe Carenini
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Our analysis of large summarization datasets indicates that redundancy is a very serious problem when summarizing long documents. Yet, redundancy reduction has not been thoroughly investigated in neural summarization. In this work, we systematically explore and compare different ways to deal with redundancy when summarizing long documents. Specifically, we organize existing methods into categories based on when and how the redundancy is considered. Then, in the context of these categories, we propose three additional methods balancing non-redundancy and importance in a general and flexible way. In a series of experiments, we show that our proposed methods achieve the state-of-the-art with respect to ROUGE scores on two scientific paper datasets, Pubmed and arXiv, while reducing redundancy significantly.

2019

Extractive Summarization of Long Documents by Combining Global and Local Context
Wen Xiao | Giuseppe Carenini
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In this paper, we propose a novel neural single-document extractive summarization model for long documents, incorporating both the global context of the whole document and the local context within the current topic. We evaluate the model on two datasets of scientific papers, Pubmed and arXiv, where it outperforms previous work, both extractive and abstractive models, on ROUGE-1, ROUGE-2 and METEOR scores. We also show that, consistent with our goal, the benefits of our method become stronger as we apply it to longer documents. Rather surprisingly, an ablation study indicates that the benefits of our model seem to come exclusively from modeling the local context, even for the longest documents.
