Yue Dong - ACL Anthology

Yue Dong

2025

ExpertGenQA: Open-ended QA generation in Specialized Domains
Haz Sameen Shahgir | Chansong Lim | Jia Chen | Evangelos E. Papalexakis | Yue Dong
Findings of the Association for Computational Linguistics: EMNLP 2025

Generating high-quality question–answer (QA) pairs for specialized technical domains is essential for advancing knowledge comprehension, yet remains challenging. Existing methods often yield generic or shallow questions that fail to reflect the depth and structure of expert-written examples. We propose ExpertGenQA, a generation protocol that combines few-shot prompting with dual categorization by topic and question style to produce more diverse and cognitively meaningful QA pairs. ExpertGenQA achieves twice the efficiency of standard few-shot methods while maintaining 94.4% topic coverage. Unlike LLM-based judges, which often favor surface fluency, Bloom’s Taxonomy analysis shows that ExpertGenQA better captures expert-level cognitive complexity. When used to train retrieval systems, our questions improve top-1 accuracy by 13.02%, demonstrating their practical value for domain-specific applications.

Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment
Pedram Zaree | Md Abdullah Al Mamun | Quazi Mishkatul Alam | Yue Dong | Ihsen Alouani | Nael Abu-Ghazaleh
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms, including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves a 91.2% attack success rate (ASR), vs. 67.9% for the original attack on Llama2-7B-chat/AdvBench, using less than a third of the generation time).

Proceedings of The 5th New Frontiers in Summarization Workshop
Yue Dong | Wen Xiao | Haopeng Zhang | Rui Zhang | Ori Ernst | Lu Wang | Fei Liu
Proceedings of The 5th New Frontiers in Summarization Workshop

HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models
Trishna Chakraborty | Udita Ghosh | Xiaopan Zhang | Fahim Faisal Niloy | Yue Dong | Jiachen Li | Amit Roy-Chowdhury | Chengyu Song
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene–task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40× higher than base prompts. Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies — highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.

2024

Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations
Lei Yu | Meng Cao | Jackie CK Cheung | Yue Dong
Findings of the Association for Computational Linguistics: EMNLP 2024

State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that misalign with world knowledge. To explore the mechanistic causes of these hallucinations, we create diagnostic datasets with subject-relation queries and adapt interpretability methods to trace hallucinations through internal model representations. We discover two general and distinct mechanistic causes of hallucinations shared across LMs (Llama-2, Pythia, GPT-J): 1) insufficient subject attribute knowledge in lower layer MLPs, and 2) failure to select the correct object attribute in upper layer attention heads. We also found that these two internal mechanistic causes of hallucinations are reflected in external manifestations. Based on insights from our mechanistic analysis, we propose a novel hallucination mitigation method through targeted restoration of the LM’s internal fact recall pipeline, demonstrating superior performance compared to baselines.

Source-Free Domain Adaptation for Question Answering with Masked Self-training
Maxwell J. Yin | Boyu Wang | Yue Dong | Charles Ling
Transactions of the Association for Computational Linguistics, Volume 12

Previous unsupervised domain adaptation (UDA) methods for question answering (QA) require access to source domain data while fine-tuning the model for the target domain. Source domain data may, however, contain sensitive information and should be protected. In this study, we investigate a more challenging setting, source-free UDA, in which we have only the pretrained source model and target domain data, without access to source domain data. We propose a novel self-training approach to QA models that integrates a specially designed mask module for domain adaptation. The mask is auto-adjusted to extract key domain knowledge when trained on the source domain. To maintain previously learned domain knowledge, certain mask weights are frozen during adaptation, while other weights are adjusted to mitigate domain shifts with pseudo-labeled samples generated in the target domain. Our empirical results on four benchmark datasets suggest that our approach significantly enhances the performance of pretrained QA models on the target domain, and even outperforms models that have access to the source data during adaptation.

Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation
G M Shahariar | Jia Chen | Jiachen Li | Yue Dong
Findings of the Association for Computational Linguistics: EMNLP 2024

Recent studies show that text-to-image (T2I) models are vulnerable to adversarial attacks, especially with noun perturbations in text prompts. In this study, we investigate the impact of adversarial attacks on different POS tags within text prompts on the images generated by T2I models. We create a high-quality dataset for realistic POS tag token swapping and perform gradient-based attacks to find adversarial suffixes that mislead T2I models into generating images with altered tokens. Our empirical results show that the attack success rate (ASR) varies significantly among different POS tag categories, with nouns, proper nouns, and adjectives being the easiest to attack. We explore the mechanism behind the steering effect of adversarial suffixes, finding that the number of critical tokens and information fusion vary among POS tags, while features like suffix transferability are consistent across categories.

Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
Yu Fu | Yufei Li | Wen Xiao | Cong Liu | Yue Dong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent developments in balancing the usefulness and safety of Large Language Models (LLMs) have raised a critical question: Are mainstream NLP tasks adequately aligned with safety considerations? Our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various NLP tasks. For instance, LLMs can effectively summarize malicious long documents but often refuse to translate them. This discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integrity of tasks traditionally deemed more robust, such as translation and question-answering (QA). Moreover, the concurrent use of multiple NLP tasks with lesser safety alignment increases the risk of LLMs inadvertently processing harmful content. We demonstrate these vulnerabilities in various safety-aligned LLMs, particularly Llama2 models, Gemini and GPT-4, indicating an urgent need for strengthening safety alignments across a broad spectrum of NLP tasks.

Vulnerabilities of Large Language Models to Adversarial Attacks
Yu Fu | Erfan Shayegan | Md. Mamun Al Abdullah | Pedram Zaree | Nael Abu-Ghazaleh | Yue Dong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)

This tutorial serves as a comprehensive guide on the vulnerabilities of Large Language Models (LLMs) to adversarial attacks, an interdisciplinary field that blends perspectives from Natural Language Processing (NLP) and Cybersecurity. As LLMs become more complex and integrated into various systems, understanding their security attributes is crucial. However, current research indicates that even safety-aligned models are not impervious to adversarial attacks that can result in incorrect or harmful outputs. The tutorial first lays the foundation by explaining safety-aligned LLMs and concepts in cybersecurity. It then categorizes existing research based on different types of learning architectures and attack methods. We highlight the existing vulnerabilities of unimodal LLMs, multi-modal LLMs, and systems that integrate LLMs, focusing on adversarial attacks designed to exploit weaknesses and mislead AI systems. Finally, the tutorial delves into the potential causes of these vulnerabilities and discusses potential defense mechanisms.

Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks
Haz Shahgir | Xianghao Kong | Greg Ver Steeg | Yue Dong
Findings of the Association for Computational Linguistics: ACL 2024

The widespread use of Text-to-Image (T2I) models in content generation requires careful examination of their safety, including their robustness to adversarial attacks. Despite extensive research on adversarial attacks, the reasons for their effectiveness remain underexplored. This paper presents an empirical study on adversarial attacks against T2I models, focusing on analyzing factors associated with attack success rates (ASR). We introduce a new attack objective, entity swapping using adversarial suffixes, and two gradient-based attack algorithms. Human and automatic evaluations reveal the asymmetric nature of ASRs on entity swap: for example, it is easier to replace “human” with “robot” in the prompt “a human dancing in the rain.” with an adversarial suffix, but the reverse replacement is significantly harder. We further propose probing metrics to establish indicative signals from the model’s beliefs to the adversarial ASR. We identify conditions that result in a success probability of 60% for adversarial attacks and others where this likelihood drops below 5%. The code and data are available at https://github.com/Patchwork53/AsymmetricAttack

PAT-Questions: A Self-Updating Benchmark for Present-Anchored Temporal Question-Answering
Jannat Meem | Muhammad Rashid | Yue Dong | Vagelis Hristidis
Findings of the Association for Computational Linguistics: ACL 2024

Existing work on Temporal Question Answering (TQA) has predominantly focused on questions anchored to specific timestamps or events (e.g. ‘Who was the US president in 1970?’). Little work has studied questions whose temporal context is relative to the present time (e.g. ‘Who was the previous US president?’). We refer to this problem as Present-Anchored Temporal QA (PATQA). PATQA poses unique challenges: (1) large language models (LLMs) may have outdated knowledge, (2) complex temporal relationships (e.g. ‘before’, ‘previous’) are hard to reason about, (3) multi-hop reasoning may be required, and (4) the gold answers of benchmarks must be continuously updated. To address these challenges, we introduce the PAT-Questions benchmark, which includes single and multi-hop temporal questions. The answers in PAT-Questions can be automatically refreshed by re-running SPARQL queries on a knowledge graph, if available. We evaluate several state-of-the-art LLMs and a SOTA temporal reasoning model (TEMPREASON-T5) on PAT-Questions through direct prompting and retrieval-augmented generation (RAG). The results highlight the limitations of existing solutions in PATQA and motivate the need for new methods to improve PATQA reasoning capabilities.

EcoRank: Budget-Constrained Text Re-ranking Using Large Language Models
Muhammad Rashid | Jannat Meem | Yue Dong | Vagelis Hristidis
Findings of the Association for Computational Linguistics: ACL 2024

Large Language Models (LLMs) have achieved state-of-the-art performance in text re-ranking. This process includes queries and candidate passages in the prompts, utilizing pointwise, listwise, and pairwise prompting strategies. A limitation of these ranking strategies with LLMs is their cost: the process can become expensive due to API charges, which are based on the number of input and output tokens. We study how to maximize the re-ranking performance given a budget, by navigating the vast search spaces of prompt choices, LLM APIs, and budget splits. We propose a suite of budget-constrained methods to perform text re-ranking using a set of LLM APIs. Our most efficient method, called EcoRank, is a two-layered pipeline that jointly optimizes decisions regarding budget allocation across prompt strategies and LLM APIs. Our experimental results on four popular QA and passage reranking datasets show that EcoRank outperforms other budget-aware supervised and unsupervised baselines.
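
The cost arithmetic behind a budget split can be illustrated with a minimal Python sketch. It is not EcoRank's actual optimization: the API names, per-token prices, token counts, the 50/50 split, and the choice of pointwise-then-pairwise stages are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Api:
    """Per-token pricing for one LLM API (all prices here are made up)."""
    name: str
    usd_per_1k_input: float
    usd_per_1k_output: float


def call_cost(api: Api, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single prompt/response pair."""
    return (input_tokens * api.usd_per_1k_input
            + output_tokens * api.usd_per_1k_output) / 1000.0


def affordable_calls(api: Api, budget_usd: float,
                     input_tokens: int, output_tokens: int) -> int:
    """Number of calls of this size that fit within the budget."""
    return int(budget_usd // call_cost(api, input_tokens, output_tokens))


# Hypothetical APIs: a cheap one and a stronger, more expensive one.
cheap = Api("cheap-model", usd_per_1k_input=0.0005, usd_per_1k_output=0.0015)
strong = Api("strong-model", usd_per_1k_input=0.01, usd_per_1k_output=0.03)

budget = 0.02                # total dollars available for one query (assumed)
first_stage = budget * 0.5   # assumed 50/50 split between the two stages

# Stage 1: pointwise relevance check (query + one passage, short answer).
n_pointwise = affordable_calls(cheap, first_stage,
                               input_tokens=300, output_tokens=2)
# Stage 2: pairwise comparison (query + two passages) with the stronger API.
n_pairwise = affordable_calls(strong, budget - first_stage,
                              input_tokens=550, output_tokens=2)
print(n_pointwise, n_pairwise)
```

Under these assumed prices the cheap API affords far more calls than the strong one for the same spend, which is the trade-off a budget-aware method has to navigate when choosing prompt strategies and APIs.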

Can Textual Unlearning Solve Cross-Modality Safety Alignment?
Trishna Chakraborty | Erfan Shayegani | Zikui Cai | Nael B. Abu-Ghazaleh | M. Salman Asif | Yue Dong | Amit Roy-Chowdhury | Chengyu Song
Findings of the Association for Computational Linguistics: EMNLP 2024

Recent studies reveal that integrating new modalities into large language models (LLMs), such as vision-language models (VLMs), creates a new attack surface that bypasses existing safety training techniques like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where all input modalities are ultimately fused into the language space, we explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our empirical evaluation across seven datasets demonstrates promising transferability: textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8%, and in some cases to nearly 2%, for both text-based and vision-text-based attacks, while preserving utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but significantly increases computational demands.

Biasly: An Expert-Annotated Dataset for Subtle Misogyny Detection and Mitigation
Brooklyn Sheppard | Anna Richter | Allison Cohen | Elizabeth Smith | Tamara Kneese | Carolyne Pelletier | Ioana Baldini | Yue Dong
Findings of the Association for Computational Linguistics: ACL 2024

Using novel approaches to dataset development, the Biasly dataset captures the nuance and subtlety of misogyny in ways that are unique within the literature. Built in collaboration with multi-disciplinary experts and annotators themselves, the dataset contains annotations of movie subtitles, capturing colloquial expressions of misogyny in North American film. The open-source dataset can be used for a range of NLP tasks, including binary and multi-label classification, severity score regression, and text generation for rewrites. In this paper, we discuss the methodology used, analyze the annotations obtained, provide baselines for each task using common NLP algorithms, and furnish error analyses to give insight into model behaviour when fine-tuned on the Biasly dataset.

Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
Yu Fu | Wen Xiao | Jia Chen | Jiachen Li | Evangelos Papalexakis | Aichi Chien | Yue Dong
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

Recent studies reveal that Large Language Models (LLMs) face challenges in balancing safety with utility, particularly when processing long texts for NLP tasks like summarization and translation. Despite defenses against malicious short questions, the ability of LLMs to safely handle dangerous long content, such as manuals teaching illicit activities, remains unclear. Our work aims to develop robust defenses for LLMs in processing malicious documents alongside benign NLP task queries. We introduce a defense dataset comprised of safety-related examples and propose single-task and mixed-task losses for instruction tuning. Our empirical results demonstrate that LLMs can significantly enhance their capacity to safely manage dangerous content with appropriate instruction tuning. Additionally, strengthening the defenses of tasks most susceptible to misuse is effective in protecting LLMs against processing harmful information. We also observe that trade-offs between utility and safety exist in defense strategies, where Llama2, utilizing our proposed approach, displays a significantly better balance compared to Llama1.

2023

Proceedings of the 4th New Frontiers in Summarization Workshop
Yue Dong | Wen Xiao | Lu Wang | Fei Liu | Giuseppe Carenini
Proceedings of the 4th New Frontiers in Summarization Workshop

Inverse Reinforcement Learning for Text Summarization
Yu Fu | Deyi Xiong | Yue Dong
Findings of the Association for Computational Linguistics: EMNLP 2023

We introduce inverse reinforcement learning (IRL) as an effective paradigm for training abstractive summarization models, imitating human summarization behaviors. Our IRL model estimates the reward function using a suite of important sub-rewards for summarization and concurrently optimizes the policy network. Experimental results across datasets in different domains (CNN/DailyMail and WikiHow) and various model sizes (BART-base and BART-large) demonstrate the superiority of our proposed IRL model for summarization over MLE and RL baselines. The resulting summaries exhibit greater similarity to human-crafted gold references, outperforming MLE and RL baselines on metrics such as ROUGE, coverage, novelty, compression ratio, factuality, and human evaluations.

2022

Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization
Meng Cao | Yue Dong | Jackie Cheung
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

State-of-the-art abstractive summarization systems often generate hallucinations; i.e., content that is not directly inferable from the source text. Despite being assumed to be incorrect, we find that much hallucinated content is actually consistent with world knowledge, which we call factual hallucinations. Including these factual hallucinations in a summary can be beneficial because they provide useful background information. In this work, we propose a novel detection approach that separates factual from non-factual hallucinations of entities. Our method is based on an entity’s prior and posterior probabilities according to pre-trained and finetuned masked language models, respectively. Empirical results suggest that our method vastly outperforms two baselines in both accuracy and F1 scores and has a strong correlation with human judgments on factuality classification tasks. Furthermore, we use our method as a reward signal to train a summarization system using an off-line reinforcement learning (RL) algorithm that can significantly improve the factuality of generated summaries while maintaining the level of abstractiveness.

Learning with Rejection for Abstractive Text Summarization
Meng Cao | Yue Dong | Jingyi He | Jackie Chi Kit Cheung
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

State-of-the-art abstractive summarization systems frequently hallucinate content that is not supported by the source document, mainly due to noise in the training dataset. Existing methods opt to drop the noisy samples or tokens from the training set entirely, reducing the effective training set size and creating an artificial propensity to copy words from the source. In this work, we propose a training objective for abstractive summarization based on rejection learning, in which the model learns whether or not to reject potentially noisy tokens. We further propose a regularized decoding objective that penalizes non-factual candidate summaries during inference by using the rejection probability learned during training. We show that our method considerably improves the factuality of generated summaries in automatic and human evaluations when compared to five baseline models, and that it does so while increasing the abstractiveness of the generated summaries.

Text Generation with Text-Editing Models
Eric Malmi | Yue Dong | Jonathan Mallinson | Aleksandr Chuklin | Jakub Adamek | Daniil Mirylenka | Felix Stahlberg | Sebastian Krause | Shankar Kumar | Aliaksei Severyn
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts

Text-editing models have recently become a prominent alternative to seq2seq models for monolingual text-generation tasks such as grammatical error correction, text simplification, and style transfer. These tasks share a common trait: they exhibit a large amount of textual overlap between the source and target texts. Text-editing models take advantage of this observation and learn to generate the output by predicting edit operations applied to the source sequence. In contrast, seq2seq models generate outputs word-by-word from scratch, which makes them slow at inference time. Text-editing models provide several benefits over seq2seq models, including faster inference speed, higher sample efficiency, and better control and interpretability of the outputs. This tutorial provides a comprehensive overview of text-editing models and current state-of-the-art approaches, analyzing their pros and cons. We discuss challenges related to deployment and how these models help to mitigate hallucination and bias, both pressing challenges in the field of text generation.

Faithful to the Document or to the World? Mitigating Hallucinations via Entity-Linked Knowledge in Abstractive Summarization
Yue Dong | John Wieting | Pat Verga
Findings of the Association for Computational Linguistics: EMNLP 2022

Existing abstractive summarization systems are hampered by content hallucinations in which models generate text that is not directly inferable from the source alone. Annotations from prior work have shown that some of these hallucinations, while being ‘unfaithful’ to the source, are nonetheless factual. Our analysis in this paper suggests that these factual hallucinations occur as a result of the prevalence of factual yet unfaithful entities in summarization datasets. We find that these entities are not aberrations, but instead examples of additional world knowledge being readily used to latently connect entities and concepts – in this case connecting entities in the source document to those in the target summary. In our analysis and experiments, we demonstrate that connecting entities to an external knowledge base can lend provenance to many of these unfaithful yet factual entities, and further, this knowledge can be used to improve the factuality of summaries without simply making them more extractive.

2021

Proceedings of the Third Workshop on New Frontiers in Summarization
Giuseppe Carenini | Jackie Chi Kit Cheung | Yue Dong | Fei Liu | Lu Wang
Proceedings of the Third Workshop on New Frontiers in Summarization

On-the-Fly Attention Modulation for Neural Generation
Yue Dong | Chandra Bhagavatula | Ximing Lu | Jena D. Hwang | Antoine Bosselut | Jackie Chi Kit Cheung | Yejin Choi
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Discourse-Aware Unsupervised Summarization for Long Scientific Documents
Yue Dong | Andrei Mircea | Jackie Chi Kit Cheung
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We propose an unsupervised graph-based ranking model for extractive summarization of long scientific documents. Our method assumes a two-level hierarchical graph representation of the source document, and exploits asymmetrical positional cues to determine sentence importance. Results on the PubMed and arXiv datasets show that our approach outperforms strong unsupervised baselines by wide margins in automatic metrics and human evaluation. In addition, it achieves performance comparable to many state-of-the-art supervised approaches which are trained on hundreds of thousands of examples. These results suggest that patterns in the discourse structure are a strong signal for determining importance in scientific articles.

Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents
Rui Meng | Khushboo Thaker | Lei Zhang | Yue Dong | Xingdi Yuan | Tong Wang | Daqing He
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Faceted summarization provides briefings of a document from different perspectives. Readers can quickly comprehend the main points of a long document with the help of a structured outline. However, little research has been conducted on this subject, partially due to the lack of large-scale faceted summarization datasets. In this study, we present FacetSum, a faceted summarization benchmark built on Emerald journal articles, covering a diverse range of domains. Different from traditional document-summary pairs, FacetSum provides multiple summaries, each targeted at specific sections of a long document, including the purpose, method, findings, and value. Analyses and empirical results on our dataset reveal the importance of bringing structure into summaries. We believe FacetSum will spur further advances in summarization research and foster the development of NLP systems that can leverage the structured information in both long texts and summaries.

2020

Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
Yao Lu | Yue Dong | Laurent Charlin
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Multi-document summarization is a challenging task for which there exist few large-scale datasets. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.

Factual Error Correction for Abstractive Summarization Models
Meng Cao | Yue Dong | Jiapeng Wu | Jackie Chi Kit Cheung
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Neural abstractive summarization systems have achieved promising progress, thanks to the availability of large-scale datasets and models pre-trained with self-supervised methods. However, ensuring the factual consistency of the generated summaries for abstractive summarization systems is a challenge. We propose a post-editing corrector module to address this issue by identifying and correcting factual errors in generated summaries. The neural corrector model is pre-trained on artificial examples that are created by applying a series of heuristic transformations on reference summaries. These transformations are inspired by the error analysis of state-of-the-art summarization model outputs. Experimental results show that our model is able to correct factual errors in summaries generated by other neural summarization models and outperforms previous models on factual consistency evaluation on the CNN/DailyMail dataset. We also find that transferring from artificial error correction to downstream settings is still very challenging.

Multi-Fact Correction in Abstractive Text Summarization
Yue Dong | Shuohang Wang | Zhe Gan | Yu Cheng | Jackie Chi Kit Cheung | Jingjing Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Pre-trained neural abstractive summarization systems have dominated extractive strategies on news summarization performance, at least in terms of ROUGE. However, system-generated abstractive summaries often face the pitfall of factual inconsistency: generating incorrect facts with respect to the source text. To address this challenge, we propose Span-Fact, a suite of two factual correction models that leverage knowledge learned from question answering models to make corrections in system-generated summaries via span selection. Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text, while retaining the syntactic structure of summaries generated by abstractive summarization models. Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.

2019

EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing
Yue Dong | Zichao Li | Mehdi Rezagholizadeh | Jackie Chi Kit Cheung
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present the first sentence simplification model that learns explicit edit operations (ADD, DELETE, and KEEP) via a neural programmer-interpreter approach. Most current neural sentence simplification systems are variants of sequence-to-sequence models adopted from machine translation. These methods learn to simplify sentences as a byproduct of the fact that they are trained on complex-simple sentence pairs. By contrast, our neural programmer-interpreter is directly trained to predict explicit edit operations on targeted parts of the input sentence, resembling the way that humans perform simplification and revision. Our model outperforms previous state-of-the-art neural sentence simplification models (without external knowledge) by large margins on three benchmark text simplification corpora in terms of SARI (+0.95 WikiLarge, +1.89 WikiSmall, +1.41 Newsela), and is judged by humans to produce overall better and simpler output sentences.
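
To make the edit-operation view concrete, here is a minimal Python sketch of an interpreter that executes a program of KEEP, DELETE, and ADD operations over a source sentence. The function name `apply_edits`, the tuple encoding of ADD, and the example sentence are illustrative assumptions, not the paper's implementation; the neural model that predicts the operations is not shown.

```python
from typing import List, Tuple, Union

# An edit program is a list of operations. KEEP and DELETE each consume one
# source token; ("ADD", word) emits a new word without consuming any.
# This encoding is an assumption made for illustration.
Edit = Union[str, Tuple[str, str]]


def apply_edits(source: List[str], program: List[Edit]) -> List[str]:
    """Execute an edit program over the source tokens, left to right."""
    output: List[str] = []
    pos = 0  # index of the next unconsumed source token
    for op in program:
        if op == "KEEP":
            output.append(source[pos])
            pos += 1
        elif op == "DELETE":
            pos += 1
        elif isinstance(op, tuple) and op[0] == "ADD":
            output.append(op[1])
        else:
            raise ValueError(f"unknown edit operation: {op!r}")
    if pos != len(source):
        raise ValueError("program did not consume every source token")
    return output


# Hypothetical example: replace two complex words with simpler ones.
source = "the feline sat upon the mat".split()
program = ["KEEP", "DELETE", ("ADD", "cat"), "KEEP",
           "DELETE", ("ADD", "on"), "KEEP", "KEEP"]
simplified = apply_edits(source, program)
print(" ".join(simplified))  # prints "the cat sat on the mat"
```

Because every operation is tied to a position in the source, the program itself is an inspectable record of what was changed, which is one reason edit-based outputs are easier to interpret than free-form generation.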

Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses
Matt Grenander | Yue Dong | Jackie Chi Kit Cheung | Annie Louis
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Sentence position is a strong feature for news summarization, since the lead often (but not always) summarizes the key points of the article. In this paper, we show that recent neural systems excessively exploit this trend, which, although powerful for many inputs, is detrimental when summarizing documents where important content should be extracted from later parts of the article. We propose two techniques to make systems sensitive to the importance of content in different parts of the article. The first technique employs ‘unbiased’ data, i.e., randomly shuffled sentences of the source document, to pretrain the model. The second technique uses an auxiliary ROUGE-based loss that encourages the model to distribute importance scores throughout a document by mimicking sentence-level ROUGE scores on the training data. We show that these techniques significantly improve the performance of a competitive reinforcement-learning-based extractive system, with the auxiliary loss being more powerful than pretraining.
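
The two techniques named in this abstract can be sketched roughly as follows. Everything here is an illustrative assumption rather than the paper's implementation: unigram-overlap F1 stands in for sentence-level ROUGE, the loss is written as a KL divergence between softmax-normalized scores, and the function names are hypothetical.

```python
import math
import random
from collections import Counter
from typing import List


def shuffle_sentences(sentences: List[str], seed: int = 0) -> List[str]:
    """Return a shuffled copy so that sentence position carries no signal."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, a simplified stand-in for sentence-level ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def softmax(values: List[float]) -> List[float]:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def auxiliary_loss(model_scores: List[float], sentences: List[str],
                   reference_summary: str) -> float:
    """KL(target || predicted), with the target built from overlap scores (assumed form)."""
    target = softmax([unigram_f1(s, reference_summary) for s in sentences])
    predicted = softmax(model_scores)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted))
```

Under these assumptions the loss is zero when the model's scores reproduce the per-sentence overlap scores and grows as the two distributions diverge, regardless of where in the document the high-overlap sentences sit.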

2018

BanditSum: Extractive Summarization as a Contextual Bandit
Yue Dong | Yikang Shen | Eric Crawford | Herke van Hoof | Jackie Chi Kit Cheung
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this work, we propose a novel method for training neural networks to perform single-document extractive summarization without heuristically-generated extractive labels. We call our approach BanditSum as it treats extractive summarization as a contextual bandit (CB) problem, where the model receives a document to summarize (the context), and chooses a sequence of sentences to include in the summary (the action). A policy gradient reinforcement learning algorithm is used to train the model to select sequences of sentences that maximize ROUGE score. We perform a series of experiments demonstrating that BanditSum is able to achieve ROUGE scores that are better than or comparable to the state-of-the-art for extractive summarization, and converges using significantly fewer update steps than competing approaches. In addition, we show empirically that BanditSum performs significantly better than competing approaches when good summary sentences appear late in the source document.
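
The contextual-bandit framing in this abstract can be sketched as sampling a sequence of sentences and weighting its log-probability by a reward. This is a minimal sketch under stated assumptions, not the authors' code: the per-sentence `affinities` are taken as given positive numbers, sampling is done without replacement in proportion to affinity, and the `baseline` term is a common variance-reduction choice assumed here for illustration.

```python
import math
import random
from typing import List, Tuple


def sample_summary(affinities: List[float], k: int,
                   rng: random.Random) -> Tuple[List[int], float]:
    """Sample k distinct sentence indices in proportion to affinity.

    Returns the chosen indices (the action) and the log-probability
    of having drawn that exact sequence.
    """
    remaining = list(range(len(affinities)))
    chosen: List[int] = []
    log_prob = 0.0
    for _ in range(k):
        weights = [affinities[i] for i in remaining]
        total = sum(weights)
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        log_prob += math.log(affinities[pick] / total)
        chosen.append(pick)
        remaining.remove(pick)
    return chosen, log_prob


def surrogate_loss(log_prob: float, reward: float, baseline: float = 0.0) -> float:
    """REINFORCE-style surrogate: minimizing it raises the probability of
    sampled summaries whose reward (e.g. ROUGE) exceeds the baseline."""
    return -(reward - baseline) * log_prob


rng = random.Random(0)
indices, log_prob = sample_summary([0.1, 0.7, 0.2, 0.5], k=2, rng=rng)
loss = surrogate_loss(log_prob, reward=0.4, baseline=0.3)
```

In a real training loop the affinities would come from a neural model and the loss would be differentiated with an autodiff library; plain floats are used here only to show the quantities involved.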

A Hierarchical Neural Attention-based Text Classifier
Koustuv Sinha | Yue Dong | Jackie Chi Kit Cheung | Derek Ruths
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Deep neural networks have been displaying superior performance over traditional supervised classifiers in text classification. They learn to extract useful features automatically when a sufficient amount of data is presented. However, along with the growth in the number of documents comes an increase in the number of categories, which often results in poor performance of multiclass classifiers. In this work, we use external knowledge in the form of topic category taxonomies to aid classification by introducing a deep hierarchical neural attention-based classifier. Our model performs better than or comparable to state-of-the-art hierarchical models at significantly lower computational cost while maintaining high interpretability.

Co-authors

Nael Abu-Ghazaleh 2

Giuseppe Carenini 2

Trishna Chakraborty 2

Vagelis Hristidis 2

Muhammad Rashid 2

Amit Roy-Chowdhury 2

Md. Mamun Al Abdullah 1

Nael B. Abu-Ghazaleh 1

Md Abdullah Al Mamun 1

Quazi Mishkatul Alam 1

Ihsen Alouani 1

M. Salman Asif 1

Ioana Baldini 1

Chandra Bhagavatula 1

Antoine Bosselut 1

Laurent Charlin 1

Jackie CK Cheung 1

Aleksandr Chuklin 1

Allison Cohen 1

Eric Crawford 1

Matt Grenander 1

Jena D. Hwang 1

Tamara Kneese 1

Xianghao Kong 1

Sebastian Krause 1

Shankar Kumar 1

Jonathan Mallinson 1

Andrei Mircea 1

Daniil Mirylenka 1

Fahim Faisal Niloy 1

Evangelos E. Papalexakis 1

Evangelos Papalexakis 1

Carolyne Pelletier 1

Mehdi Rezagholizadeh 1

Aliaksei Severyn 1

G. M. Shahariar 1

Haz Sameen Shahgir 1

Erfan Shayegan 1

Erfan Shayegani 1

Brooklyn Sheppard 1

Koustuv Sinha 1

Elizabeth Smith 1

Felix Stahlberg 1

Khushboo Thaker 1

Greg Ver Steeg 1

Shuohang Wang 1

Deyi Xiong (德意熊) 1

Maxwell J. Yin 1

Haopeng Zhang 1

Xiaopan Zhang 1

Herke van Hoof 1

Venues