Yangfeng Ji - ACL Anthology

Yangfeng Ji

2025

Improving Aspect-Based Summarization via Contrastive Learning with Anchored Negative Examples
Elizabeth Palmieri | Yangfeng Ji
Proceedings of The 5th New Frontiers in Summarization Workshop

Text summarization helps users manage information overload, but traditional methods can be cumbersome when seeking specific details within a document. Aspect-based text summarization addresses this by using a query to guide which information should be summarized. However, distinguishing relevant from irrelevant information for a given aspect remains challenging in LLM-based summarization models. In this work, we propose utilizing contrastive learning to encourage LLMs to focus on aspect-related signals during training. We further design two variants of the learning algorithm, aspect-anchored and summary-anchored, corresponding to the strategies used in constructing negative examples. Evaluation with two representative LLM families (Llama 2 and Pythia) and two benchmark datasets (AnyAspect and CovidET) demonstrates the proposed methods’ strong performance compared to their supervised fine-tuning and zero-shot counterparts, highlighting contrastive learning as a promising direction for aspect-based text summarization.

A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension
Saahith Janapati | Yangfeng Ji
Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)

The performance of Large Language Models (LLMs) on natural language tasks can be improved through both supervised fine-tuning (SFT) and in-context learning (ICL), which operate via distinct mechanisms. SFT updates the model’s weights by minimizing loss on training data, whereas ICL leverages task demonstrations embedded in the prompt, without changing the model’s parameters. This study investigates the effects of these learning paradigms on the hidden representations of LLMs using Intrinsic Dimension (ID). We use ID to estimate the number of degrees of freedom between representations extracted from LLMs as they perform specific natural language tasks. We first explore how the ID of LLM representations evolves during SFT and how it varies due to the number of demonstrations in ICL. We then compare the IDs induced by SFT and ICL and find that ICL consistently induces a higher ID compared to SFT, suggesting that representations generated during ICL reside in higher dimensional manifolds in the embedding space.

Syntactic Blind Spots: How Misalignment Leads to LLMs’ Mathematical Errors
Dane A Williamson | Yangfeng Ji | Matthew B. Dwyer
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)

Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.

Unsupervised Concept Vector Extraction for Bias Control in LLMs
Hannah Cyberey | Yangfeng Ji | David Evans
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate these biases, but most work studies biases as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of “gender” is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model’s representation. We develop a projection-based method that enables precise steering of model predictions and demonstrate its effectiveness in mitigating gender bias in LLMs and show that it also generalizes to racial bias.

The Good, the Bad, and the Debatable: A Survey on the Impacts of Data for In-Context Learning
Stephanie Schoch | Yangfeng Ji
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

In-context learning is an emergent learning paradigm that enables an LLM to learn an unseen task by seeing a number of demonstrations in the context window. The quality of the demonstrations is of paramount importance as 1) context window size limitations restrict the number of demonstrations that can be presented to the model, and 2) the model must identify the task and potentially learn new, unseen input-output mappings from the limited demonstration set. An increasing body of work has also shown the sensitivity of predictions to perturbations on the demonstration set. Given this importance, this work presents a survey on the current literature pertaining to the relationship between data and in-context learning. We present our survey in three parts: the “good” – qualities that are desirable when selecting demonstrations, the “bad” – qualities of demonstrations that can negatively impact the model, as well as issues that can arise in presenting demonstrations, and the “debatable” – qualities of demonstrations with mixed results or factors modulating data impacts.

Do Prevalent Bias Metrics Capture Allocational Harms from LLMs?
Hannah Cyberey | Yangfeng Ji | David Evans
The Sixth Workshop on Insights from Negative Results in NLP

Allocational harms occur when resources or opportunities are unfairly withheld from specific groups. Many proposed bias measures ignore the discrepancy between predictions, which are what the proposed methods consider, and decisions that are made as a result of those predictions. Our work examines the reliability of current bias metrics in assessing allocational harms arising from predictions of large language models (LLMs). We evaluate their predictive validity and utility for model selection across ten LLMs and two allocation tasks. Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes. Our work highlights the need to account for how model predictions are used in decisions, in particular in contexts where they are influenced by how limited resources are allocated.

In-Context Learning (and Unlearning) of Length Biases
Stephanie Schoch | Yangfeng Ji
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models have demonstrated strong capabilities to learn in-context, where exemplar input-output pairings are appended to the prompt for demonstration. However, existing work has demonstrated the ability of models to learn lexical and label biases in-context, which negatively impacts both performance and robustness of models. The impact of other statistical data biases remains under-explored, which this work aims to address. We specifically investigate the impact of length biases on in-context learning. We demonstrate that models do learn length biases in the context window for their predictions, and further empirically analyze the factors that modulate the level of bias exhibited by the model. In addition, we show that learning length information in-context can be used to counter the length bias that has been encoded in models (e.g., via fine-tuning). This reveals the power of in-context learning in debiasing model prediction behaviors without the need for costly parameter updates.

Monte Carlo Sampling for Analyzing In-Context Examples
Stephanie Schoch | Yangfeng Ji
The Sixth Workshop on Insights from Negative Results in NLP

Prior works have shown that in-context learning is brittle to presentation factors such as the order, number, and choice of selected examples. However, ablation-based guidance on selecting the number of examples may ignore the interplay between different presentation factors. In this work we develop a Monte Carlo sampling-based method to study the impact of number of examples while explicitly accounting for effects from order and selected examples. We find that previous guidance on how many in-context examples to select does not always generalize across different sets of selected examples and orderings, and whether one-shot settings outperform zero-shot settings is highly dependent on the selected example. Additionally, inspired by data valuation, we apply our sampling method to in-context example selection to select examples that perform well across different orderings. We find a negative result, that while performance is robust to ordering and number of examples, there is an unexpected performance degradation compared to random sampling.

2024

Addressing Both Statistical and Causal Gender Fairness in NLP Models
Hannah Chen | Yangfeng Ji | David Evans
Findings of the Association for Computational Linguistics: NAACL 2024

Statistical fairness stipulates equivalent outcomes for every protected group, whereas causal fairness prescribes that a model makes the same prediction for an individual regardless of their protected characteristics. Counterfactual data augmentation (CDA) is effective for reducing bias in NLP models, yet models trained with CDA are often evaluated only on metrics that are closely tied to the causal fairness notion; similarly, sampling-based methods designed to promote statistical fairness are rarely evaluated for causal fairness. In this work, we evaluate both statistical and causal debiasing methods for gender bias in NLP models, and find that while such methods are effective at reducing bias as measured by the targeted metric, they do not necessarily improve results on other bias metrics. We demonstrate that combinations of statistical and causal debiasing techniques are able to reduce bias measured through both types of metrics.
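The two fairness notions contrasted in this abstract can be illustrated with a toy sketch. This is not the paper's evaluation code or its metrics: the function names, the keyword-based `toy_model`, and the `swap_gender` helper are all hypothetical, chosen only to show the difference between a group-level gap (statistical) and a per-individual counterfactual check (causal).

```python
def statistical_gap(predictions, groups):
    """Largest difference in positive-prediction rate between groups
    (a simple stand-in for a statistical fairness measure)."""
    rates = {}
    for g in set(groups):
        preds = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values())


def counterfactual_flip_rate(model, inputs, swap):
    """Fraction of inputs whose prediction changes when the protected
    attribute is swapped (a simple stand-in for a causal fairness measure)."""
    flips = sum(model(x) != model(swap(x)) for x in inputs)
    return flips / len(inputs)


def swap_gender(text):
    # Hypothetical counterfactual edit: swap "he" and "she" tokens only.
    table = {"he": "she", "she": "he"}
    return " ".join(table.get(tok, tok) for tok in text.split())


def toy_model(text):
    # Deliberately biased toy classifier: predicts 1 only if "he" appears.
    return 1 if "he" in text.split() else 0


gap = statistical_gap([1, 0, 1, 1], ["a", "a", "b", "b"])
flip_rate = counterfactual_flip_rate(
    toy_model, ["he is a doctor", "she is a doctor"], swap_gender
)
```

A model can score well on one of these measures and poorly on the other, which is the kind of mismatch the abstract reports between targeted and non-targeted bias metrics.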

2023

REV: Information-Theoretic Evaluation of Free-Text Rationales
Hanjie Chen | Faeze Brahman | Xiang Ren | Yangfeng Ji | Yejin Choi | Swabha Swayamdipta
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generating free-text rationales is a promising step towards explainable NLP, yet evaluating such rationales remains a challenge. Existing metrics have mostly focused on measuring the association between the rationale and a given label. We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label. We investigate this research problem from an information-theoretic perspective using conditional V-information (Hewitt et al., 2021). More concretely, we propose a metric called REV (Rationale Evaluation with conditional V-information), to quantify the amount of new, label-relevant information in a rationale beyond the information already available in the input or the label. Experiments across four benchmarks with reasoning tasks, including chain-of-thought, demonstrate the effectiveness of REV in evaluating rationale-label pairs, compared to existing metrics. We further demonstrate REV is consistent with human judgments on rationale evaluations and provides more sensitive measurements of new information in free-text rationales. When used alongside traditional performance metrics, REV provides deeper insights into models’ reasoning and prediction processes.

PLAtE: A Large-scale Dataset for List Page Web Extraction
Aidan San | Yuan Zhuang | Jan Bakus | Colin Lockard | David Ciemiewicz | Sandeep Atluri | Kevin Small | Yangfeng Ji | Heba Elfardy
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) benchmark dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items encompassing the tasks of: (1) finding product list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset. We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks comparing their strengths and weaknesses both quantitatively and qualitatively.

Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values
Stephanie Schoch | Ritwick Mishra | Yangfeng Ji
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Although Shapley values have been shown to be highly effective for identifying harmful training instances, dataset size and model complexity constraints limit the ability to apply Shapley-based data valuation to fine-tuning large pre-trained language models. To address this, we propose TS-DShapley, an algorithm that reduces computational cost of Shapley-based data valuation through: 1) an efficient sampling-based method that aggregates Shapley values computed from subsets for valuation of the entire training set, and 2) a value transfer method that leverages value information extracted from a simple classifier trained using representations from the target language model. Our experiments applying TS-DShapley to select data for fine-tuning BERT-based language models on benchmark natural language understanding (NLU) datasets show that TS-DShapley outperforms existing data selection methods. Further, TS-DShapley can filter fine-tuning data to increase language model performance compared to training with the full fine-tuning dataset.

2022

Identifying the Source of Vulnerability in Explanation Discrepancy: A Case Study in Neural Text Classification
Ruixuan Tang | Hanjie Chen | Yangfeng Ji
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Some recent works observed the instability of post-hoc explanations when input side perturbations are applied to the model. This raises the interest and concern in the stability of post-hoc explanations. However, the remaining question is: is the instability caused by the neural network model or the post-hoc explanation method? This work explores the potential source that leads to unstable post-hoc explanations. To separate the influence from the model, we propose a simple output probability perturbation method. Compared to prior input side perturbation methods, the output probability perturbation method can circumvent the neural model’s potential effect on the explanations and allow the analysis on the explanation method. We evaluate the proposed method with three widely-used post-hoc explanation methods (LIME (Ribeiro et al., 2016), Kernel Shapley (Lundberg and Lee, 2017a), and Sample Shapley (Strumbelj and Kononenko, 2010)). The results demonstrate that the post-hoc methods are stable, barely producing discrepant explanations under output probability perturbations. The observation suggests that neural network models may be the primary source of fragile explanations.

FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows
Jianqiao Zhao | Yanyang Li | Wanyu Du | Yangfeng Ji | Dong Yu | Michael Lyu | Liwei Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Despite recent progress in open-domain dialogue evaluation, how to develop automatic metrics remains an open problem. We explore the potential of dialogue evaluation featuring dialog act information, which was hardly explicitly modeled in previous methods. However, defined at the utterance level in general, dialog act is of coarse granularity, as an utterance can contain multiple segments possessing different functions. Hence, we propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it. To utilize segment act flows, sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval. This framework provides a reference-free approach for dialog evaluation by finding pseudo-references. Extensive experiments against strong baselines on three benchmark datasets demonstrate the effectiveness and other desirable characteristics of our FlowEval, pointing out a potential path for better dialogue evaluation.

Self-training with Two-phase Self-augmentation for Few-shot Dialogue Generation
Wanyu Du | Hanjie Chen | Yangfeng Ji
Findings of the Association for Computational Linguistics: EMNLP 2022

In task-oriented dialogue systems, response generation from meaning representations (MRs) often suffers from limited training examples, due to the high cost of annotating MR-to-Text pairs. Previous works on self-training leverage fine-tuned conversational models to automatically generate pseudo-labeled MR-to-Text pairs for further fine-tuning. However, some self-augmented data may be noisy or uninformative for the model to learn from. In this work, we propose a two-phase self-augmentation procedure to generate high-quality pseudo-labeled MR-to-Text pairs: the first phase selects the most informative MRs based on model’s prediction uncertainty; with the selected MRs, the second phase generates accurate responses by aggregating multiple perturbed latent representations from each MR. Empirical experiments on two benchmark datasets, FewShotWOZ and FewShotSGD, show that our method generally outperforms existing self-training methods on both automatic and human evaluations.

Balanced Adversarial Training: Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models
Hannah Chen | Yangfeng Ji | David Evans
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Traditional (fickle) adversarial examples involve finding a small perturbation that does not change an input’s true label but confuses the classifier into outputting a different prediction. Conversely, obstinate adversarial examples occur when an adversary finds a small perturbation that preserves the classifier’s prediction but changes the true label of an input. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine learnt models to fickle adversarial examples. We show that standard adversarial training methods focused on reducing vulnerability to fickle adversarial examples may make a model more vulnerable to obstinate adversarial examples, with experiments for both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.

White-box Testing of NLP models with Mask Neuron Coverage
Arshdeep Sekhon | Yangfeng Ji | Matthew Dwyer | Yanjun Qi
Findings of the Association for Computational Linguistics: NAACL 2022

Recent literature has seen growing interest in using black-box strategies like CheckList for testing the behavior of NLP models. Research on white-box testing has developed a number of methods for evaluating how thoroughly the internal behavior of deep models is tested, but they are not applicable to NLP models. We propose a set of white-box testing methods that are customized for transformer-based NLP models. These include MASK NEURON COVERAGE (MNCOVER) that measures how thoroughly the attention layers in models are exercised during testing. We show that MNCOVER can refine testing suites generated by CheckList by substantially reducing them in size, by more than 60% on average, while retaining failing tests, thereby concentrating the fault detection power of the test suite. Further, we show how MNCOVER can be used to guide CheckList input generation, evaluate alternative NLP testing methods, and drive data augmentation to improve accuracy.

Contrastive Data and Learning for Natural Language Processing
Rui Zhang | Yangfeng Ji | Yue Zhang | Rebecca J. Passonneau
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts

Current NLP models heavily rely on effective representation learning algorithms. Contrastive learning is one such technique to learn an embedding space such that similar data sample pairs have close representations while dissimilar samples stay far apart from each other. It can be used in supervised or unsupervised settings using different loss functions to produce task-specific or general-purpose representations. While it has originally enabled the success for vision tasks, recent years have seen a growing number of publications in contrastive NLP. This first line of works not only delivers promising performance improvements in various NLP tasks, but also provides desired characteristics such as task-agnostic sentence representation, faithful text generation, data-efficient learning in zero-shot and few-shot settings, interpretability and explainability. In this tutorial, we aim to provide a gentle introduction to the fundamentals of contrastive learning approaches and the theory behind them. We then survey the benefits and the best practices of contrastive learning for various downstream NLP applications including Text Classification, Question Answering, Summarization, Text Generation, Interpretability and Explainability, Commonsense Knowledge and Reasoning, Vision-and-Language. This tutorial intends to help researchers in the NLP and computational linguistics community to understand this emerging topic and promote future research directions of using contrastive learning for NLP applications.
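The idea of pulling similar pairs together and pushing dissimilar samples apart can be sketched with an InfoNCE-style loss. This is one common contrastive objective, chosen here as an assumption; the tutorial abstract does not name a specific loss, and the function names and the temperature value below are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor: small when the anchor is closer
    to the positive than to every negative, large otherwise."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
near = [0.9, 0.1]   # almost the same direction as the anchor
far = [0.0, 1.0]    # orthogonal to the anchor

loss_good = info_nce(anchor, near, [far])  # positive is the similar vector
loss_bad = info_nce(anchor, far, [near])   # positive is the dissimilar vector
```

Here `loss_good` is much smaller than `loss_bad`, so minimizing the loss rewards an embedding space in which the positive pair is the closest pair.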

Pathologies of Pre-trained Language Models in Few-shot Fine-tuning
Hanjie Chen | Guoqing Zheng | Ahmed Awadallah | Yangfeng Ji
Proceedings of the Third Workshop on Insights from Negative Results in NLP

Although adapting pre-trained language models with few examples has shown promising performance on text classification, there is a lack of understanding of where the performance gain comes from. In this work, we propose to answer this question by interpreting the adaptation behavior using post-hoc explanations from model predictions. By modeling feature statistics of explanations, we discover that (1) without fine-tuning, pre-trained models (e.g. BERT and RoBERTa) show strong prediction bias across labels; (2) although few-shot fine-tuning can mitigate the prediction bias and demonstrate promising prediction performance, our analysis shows models gain performance improvement by capturing non-task-related features (e.g. stop words) or shallow data patterns (e.g. lexical overlaps). These observations alert that pursuing model performance with fewer examples may incur pathological prediction behavior, which requires further sanity check on model predictions and careful design in model evaluations in few-shot fine-tuning.

2021

HittER: Hierarchical Transformers for Knowledge Graph Embeddings
Sanxing Chen | Xiaodong Liu | Jianfeng Gao | Jian Jiao | Ruofei Zhang | Yangfeng Ji
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This paper examines the challenging problem of learning representations of entities and relations in a complex multi-relational knowledge graph. We propose HittER, a Hierarchical Transformer model to jointly learn Entity-relation composition and Relational contextualization based on a source entity’s neighborhood. Our proposed model consists of two different Transformer blocks: the bottom block extracts features of each entity-relation pair in the local neighborhood of the source entity and the top block aggregates the relational information from outputs of the bottom block. We further design a masked entity prediction task to balance information from the relational context and the source entity itself. Experimental results show that HittER achieves new state-of-the-art results on multiple link prediction datasets. We additionally propose a simple approach to integrate HittER into BERT and demonstrate its effectiveness on two Freebase factoid question answering datasets.

SideControl: Controlled Open-domain Dialogue Generation via Additive Side Networks
Wanyu Du | Yangfeng Ji
Findings of the Association for Computational Linguistics: EMNLP 2021

Transformer-based pre-trained language models boost the performance of open-domain dialogue systems. Prior works leverage Transformer-based pre-trained language models to generate texts with desired attributes in two general approaches: (1) gradient-based methods: updating all latent representations of pre-trained models with gradients from attribute models; (2) weighted-decoding methods: re-ranking beam candidates from pre-trained models with attribute functions. However, gradient-based methods lead to high computation cost and can easily get overfitted on small training sets, while weighted-decoding methods are inherently constrained by the low-variance high-bias pre-trained model. In this work, we propose a novel approach to control the generation of Transformer-based pre-trained language models: the SideControl framework, which leverages a novel control attributes loss to incorporate useful control signals, and is shown to perform well with very limited training samples. We evaluate our proposed method on two benchmark open-domain dialogue datasets, and results show that the SideControl framework has better controllability, higher generation quality and better sample-efficiency than existing gradient-based and weighted-decoding baselines.

Contextualizing Variation in Text Style Transfer Datasets
Stephanie Schoch | Wanyu Du | Yangfeng Ji
Proceedings of the 14th International Conference on Natural Language Generation

Text style transfer involves rewriting the content of a source sentence in a target style. Despite there being a number of style tasks with available data, there has been limited systematic discussion of how text style datasets relate to each other. This understanding, however, is likely to have implications for selecting multiple data sources for model training. While it is prudent to consider inherent stylistic properties when determining these relationships, we also must consider how a style is realized in a particular dataset. In this paper, we conduct several empirical analyses of existing text style datasets. Based on our results, we propose a categorization of stylistic and dataset properties to consider when utilizing or comparing text style datasets.

Explaining Neural Network Predictions on Sentence Pairs via Learning Word-Group Masks
Hanjie Chen | Song Feng | Jatin Ganhotra | Hui Wan | Chulaka Gunasekara | Sachindra Joshi | Yangfeng Ji
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Explaining neural network models is important for increasing their trustworthiness in real-world applications. Most existing methods generate post-hoc explanations for neural network models by identifying individual feature attributions or detecting interactions between adjacent features. However, for models with text pairs as inputs (e.g., paraphrase identification), existing methods are not sufficient to capture feature interactions between two texts and their simple extension of computing all word-pair interactions between two texts is computationally inefficient. In this work, we propose the Group Mask (GMASK) method to implicitly detect word correlations by grouping correlated words from the input text pair together and measure their contribution to the corresponding NLP tasks as a whole. The proposed method is evaluated with two different model architectures (decomposable attention model and BERT) across four datasets, including natural language inference and paraphrase identification tasks. Experiments show the effectiveness of GMASK in providing faithful explanations to these models.

Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing
Sanchit Sinha | Hanjie Chen | Arshdeep Sekhon | Yangfeng Ji | Yanjun Qi
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Interpretability methods like Integrated Gradient and LIME are popular choices for explaining natural language model predictions with relative word importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stake areas like medicine or finance. Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text. Via a small portion of word-level swaps, these adversarial perturbations aim to make the resulting text semantically and spatially similar to its seed input (therefore sharing similar interpretations). Simultaneously, the generated examples achieve the same prediction label as the seed yet are given a substantially different explanation by the interpretation methods. Our experiments generate fragile interpretations to attack two SOTA interpretation methods, across three popular Transformer models and on three different NLP datasets. We observe that the rank order correlation and top-K intersection score drops by over 20% when less than 10% of words are perturbed on average. Further, rank-order correlation keeps decreasing as more words get perturbed. Furthermore, we demonstrate that candidates generated from our method have good quality metrics.

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann | Tosin Adewumi | Karmanya Aggarwal | Pawan Sasanka Ammanamanchi | Anuoluwapo Aremu | Antoine Bosselut | Khyathi Raghavi Chandu | Miruna-Adriana Clinciu | Dipanjan Das | Kaustubh Dhole | Wanyu Du | Esin Durmus | Ondřej Dušek | Chris Chinenye Emezue | Varun Gangal | Cristina Garbacea | Tatsunori Hashimoto | Yufang Hou | Yacine Jernite | Harsh Jhamtani | Yangfeng Ji | Shailza Jolly | Mihir Kale | Dhruv Kumar | Faisal Ladhak | Aman Madaan | Mounica Maddela | Khyati Mahajan | Saad Mahamood | Bodhisattwa Prasad Majumder | Pedro Henrique Martins | Angelina McMillan-Major | Simon Mille | Emiel van Miltenburg | Moin Nadeem | Shashi Narayan | Vitaly Nikolaev | Andre Niyongabo Rubungo | Salomey Osei | Ankur Parikh | Laura Perez-Beltrachini | Niranjan Ramesh Rao | Vikas Raunak | Juan Diego Rodriguez | Sashank Santhanam | João Sedoc | Thibault Sellam | Samira Shaikh | Anastasia Shimorina | Marco Antonio Sobrevilla Cabezudo | Hendrik Strobelt | Nishant Subramani | Wei Xu | Diyi Yang | Akhila Yerukola | Jiawei Zhou
Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for the 2021 shared task at the associated GEM Workshop.

Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)
Song Feng | Siva Reddy | Malihe Alikhani | He He | Yangfeng Ji | Mohit Iyyer | Zhou Yu
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

2020

“This is a Problem, Don’t You Agree?” Framing and Bias in Human Evaluation for Natural Language Generation
Stephanie Schoch | Diyi Yang | Yangfeng Ji
Proceedings of the 1st Workshop on Evaluating NLG Evaluation

Despite recent efforts reviewing current human evaluation practices for natural language generation (NLG) research, the lack of reported question wording and the potential for framing effects or cognitive biases influencing results have been widely overlooked. In this opinion paper, we detail three possible framing effects and cognitive biases that could be imposed on human evaluation in NLG. Based on this, we make a call for increased transparency for human evaluation in NLG and propose the concept of human evaluation statements. We make several recommendations for design details to report that could potentially influence results, such as question wording, and suggest that reporting pertinent design details can help increase comparability across studies as well as reproducibility of results.

Pointwise Paraphrase Appraisal is Potentially Problematic
Hannah Chen | Yangfeng Ji | David Evans
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

The prevailing approach for training and evaluating paraphrase identification models is constructed as a binary classification problem: the model is given a pair of sentences, and is judged by how accurately it classifies pairs as either paraphrases or non-paraphrases. This pointwise-based evaluation method does not match the objective of most real-world applications well, so the goal of our work is to understand how models that perform well under pointwise evaluation may fail in practice and to find better methods for evaluating paraphrase identification models. As a first step towards that goal, we show that although the standard way of fine-tuning BERT for paraphrase identification by pairing two sentences as one sequence results in a model with state-of-the-art performance, that model may perform poorly on simple tasks like identifying pairs with two identical sentences. Moreover, we show that these models may even assign a pair of randomly selected sentences a higher paraphrase score than a pair of identical ones.

Generating Hierarchical Explanations on Text Classification via Feature Interaction Detection
Hanjie Chen | Guangtao Zheng | Yangfeng Ji
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Generating explanations for neural networks has become crucial for their real-world applications with respect to reliability and trustworthiness. In natural language processing, existing methods usually provide important features, which are words or phrases selected from an input text, as an explanation, but ignore the interactions between them. This makes it challenging for humans to interpret an explanation and connect it to the model prediction. In this work, we build hierarchical explanations by detecting feature interactions. Such explanations visualize how words and phrases are combined at different levels of the hierarchy, which can help users understand the decision-making of black-box models. The proposed method is evaluated with three neural text classifiers (LSTM, CNN, and BERT) on two benchmark datasets, via both automatic and human evaluations. Experiments show the effectiveness of the proposed method in providing explanations that are both faithful to models and interpretable to humans.

Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers
Hanjie Chen | Yangfeng Ji
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

To build an interpretable neural text classifier, most of the prior work has focused on designing inherently interpretable models or finding faithful explanations. A new line of work on improving model interpretability has just started, and many existing methods require either prior information or human annotations as additional inputs in training. To address this limitation, we propose the variational word mask (VMASK) method to automatically learn task-specific important words and reduce irrelevant information on classification, which ultimately improves the interpretability of model predictions. The proposed method is evaluated with three neural text classifiers (CNN, LSTM, and BERT) on seven benchmark text classification datasets. Experiments show the effectiveness of VMASK in improving both model prediction accuracy and interpretability.

Reevaluating Adversarial Examples in Natural Language
John Morris | Eli Lifland | Jack Lanchantin | Yangfeng Ji | Yanjun Qi
Findings of the Association for Computational Linguistics: EMNLP 2020

State-of-the-art attacks on NLP models lack a shared definition of what constitutes a successful attack. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows some linguistic constraints. We then analyze the outputs of two state-of-the-art synonym substitution attacks. We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.

Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory
Hannah Chen | Yangfeng Ji | David Evans
Findings of the Association for Computational Linguistics: EMNLP 2020

Most NLP datasets are manually labeled, so suffer from inconsistent labeling or limited size. We propose methods for automatically improving datasets by viewing them as graphs with expected semantic properties. We construct a paraphrase graph from the provided sentence pair labels, and create an augmented dataset by directly inferring labels from the original sentence pairs using a transitivity property. We use structural balance theory to identify likely mislabelings in the graph, and flip their labels. We evaluate our methods on paraphrase models trained using these datasets starting from a pretrained BERT model, and find that the automatically-enhanced training sets result in more accurate models.

The Amazing World of Neural Language Generation
Yangfeng Ji | Antoine Bosselut | Thomas Wolf | Asli Celikyilmaz
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Neural Language Generation (NLG) – using neural network models to generate coherent text – is among the most promising methods for automated text creation. Recent years have seen a paradigm shift in neural text generation, caused by the advances in deep contextual language modeling (e.g., LSTMs, GPT, GPT2) and transfer learning (e.g., ELMo, BERT). While these tools have dramatically improved the state of NLG, particularly for low-resource tasks, state-of-the-art NLG models still face many challenges: a lack of diversity in generated text, commonsense violations in depicted situations, difficulties in making use of factual information, and difficulties in designing reliable evaluation metrics. In this tutorial, we will present an overview of the current state-of-the-art in neural network architectures, and how they shaped recent research directions in text generation. We will discuss how and why these models succeed/fail at generating coherent text, and provide insights on several applications.

A Tale of Two Linkings: Dynamically Gating between Schema Linking and Structural Linking for Text-to-SQL Parsing
Sanxing Chen | Aidan San | Xiaodong Liu | Yangfeng Ji
Proceedings of the 28th International Conference on Computational Linguistics

In Text-to-SQL semantic parsing, selecting the correct entities (tables and columns) for the generated SQL query is both crucial and challenging; the parser is required to connect the natural language (NL) question and the SQL query to the structured knowledge in the database. We formulate two linking processes to address this challenge: schema linking which links explicit NL mentions to the database and structural linking which links the entities in the output SQL with their structural relationships in the database schema. Intuitively, the effectiveness of these two linking processes changes based on the entity being generated, thus we propose to dynamically choose between them using a gating mechanism. Integrating the proposed method with two graph neural network-based semantic parsers together with BERT representations demonstrates substantial gains in parsing accuracy on the challenging Spider dataset. Analyses show that our proposed method helps to enhance the structure of the model output when generating complicated SQL queries and offers more explainable predictions.

2019

An Empirical Comparison on Imitation Learning and Reinforcement Learning for Paraphrase Generation
Wanyu Du | Yangfeng Ji
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Generating paraphrases from given sentences involves decoding words step by step from a large vocabulary. To learn a decoder, supervised learning, which maximizes the likelihood of tokens, always suffers from exposure bias. Although both reinforcement learning (RL) and imitation learning (IL) have been widely used to alleviate the bias, the lack of direct comparison leads to only a partial picture of their benefits. In this work, we present an empirical study on how RL and IL can help boost the performance of generating paraphrases, with the pointer-generator as a base model. Experiments on the benchmark datasets show that (1) imitation learning is consistently better than reinforcement learning; and (2) the pointer-generator models with imitation learning outperform the state-of-the-art methods by a large margin.

2018

Neural Text Generation in Stories Using Entity Representations as Context
Elizabeth Clark | Yangfeng Ji | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We introduce an approach to neural text generation that explicitly represents entities mentioned in the text. Entity representations are vectors that are updated as the text proceeds; they are designed specifically for narrative text like fiction or news stories. Our experiments demonstrate that modeling entities offers a benefit in two automatic evaluations: mention generation (in which a model chooses which entity to mention next and which words to use in the mention) and selection between a correct next sentence and a distractor from later in the same story. We also conduct a human evaluation on automatically generated text in story contexts; this study supports our emphasis on entities and suggests directions for further research.

2017

Dynamic Entity Representations in Neural Language Models
Yangfeng Ji | Chenhao Tan | Sebastian Martschat | Yejin Choi | Noah A. Smith
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Understanding a long document requires tracking how entities are introduced and evolve over time. We present a new type of language model, EntityNLM, that can explicitly model entities, dynamically update their representations, and contextually generate their mentions. Our model is generative and flexible; it can model an arbitrary number of entities in context while generating each entity mention at an arbitrary length. In addition, it can be used for several different tasks such as language modeling, coreference resolution, and entity prediction. Experimental results with all these tasks demonstrate that our model consistently outperforms strong baselines and prior work.

Neural Discourse Structure for Text Categorization
Yangfeng Ji | Noah A. Smith
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We show that discourse structure, as defined by Rhetorical Structure Theory and provided by an existing discourse parser, benefits text categorization. Our approach uses a recursive neural network and a newly proposed attention mechanism to compute a representation of the text that focuses on salient content, from the perspective of both RST and the task. Experiments consider variants of the approach and illustrate its strengths and weaknesses.

2016

Multiplicative Representations for Unsupervised Semantic Role Induction
Yi Luan | Yangfeng Ji | Hannaneh Hajishirzi | Boyang Li
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A Latent Variable Recurrent Neural Network for Discourse-Driven Language Models
Yangfeng Ji | Gholamreza Haffari | Jacob Eisenstein
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets
Michel Galley | Chris Brockett | Alessandro Sordoni | Yangfeng Ji | Michael Auli | Chris Quirk | Margaret Mitchell | Jianfeng Gao | Bill Dolan
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Better Document-level Sentiment Analysis from RST Discourse Parsing
Parminder Bhatia | Yangfeng Ji | Jacob Eisenstein
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
Alessandro Sordoni | Michel Galley | Michael Auli | Chris Brockett | Yangfeng Ji | Margaret Mitchell | Jian-Yun Nie | Jianfeng Gao | Bill Dolan
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

One Vector is Not Enough: Entity-Augmented Distributed Semantics for Discourse Relations
Yangfeng Ji | Jacob Eisenstein
Transactions of the Association for Computational Linguistics, Volume 3

Discourse relations bind smaller linguistic units into coherent texts. Automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked arguments. A more subtle challenge is that it is not enough to represent the meaning of each argument of a discourse relation, because the relation may depend on links between lower-level components, such as entity mentions. Our solution computes distributed meaning representations for each discourse argument by composition up the syntactic parse tree. We also perform a downward compositional pass to capture the meaning of coreferent entity mentions. Implicit discourse relations are then predicted from these two representations, obtaining substantial improvements on the Penn Discourse Treebank.

Closing the Gap: Domain Adaptation from Explicit to Implicit Discourse Relations
Yangfeng Ji | Gongbo Zhang | Jacob Eisenstein
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Representation Learning for Text-level Discourse Parsing
Yangfeng Ji | Jacob Eisenstein
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Extracting Lexically Divergent Paraphrases from Twitter
Wei Xu | Alan Ritter | Chris Callison-Burch | William B. Dolan | Yangfeng Ji
Transactions of the Association for Computational Linguistics, Volume 2

We present MultiP (Multi-instance Learning Paraphrase Model), a new model suited to identify paraphrases within the short messages on Twitter. We jointly model paraphrase relations between word and sentence pairs and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve performance competitive with a state-of-the-art method which combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from yet complement previous methods; combining our model with previous work significantly outperforms the state-of-the-art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community.

Mining Themes and Interests in the Asperger’s and Autism Community
Yangfeng Ji | Hwajung Hong | Rosa Arriaga | Agata Rozga | Gregory Abowd | Jacob Eisenstein
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

2013

Discriminative Improvements to Distributional Sentence Similarity
Yangfeng Ji | Jacob Eisenstein
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Co-authors

William B. Dolan 3

Noah A. Smith 3

Antoine Bosselut 2

Chris Brockett 2

Michel Galley 2

Margaret Mitchell 2

Arshdeep Sekhon 2

Alessandro Sordoni 2

Gregory Abowd 1

Tosin Adewumi 1

Karmanya Aggarwal 1

Malihe Alikhani 1

Pawan Sasanka Ammanamanchi 1

Anuoluwapo Aremu 1

Sandeep Atluri 1

Ahmed Awadallah 1

Parminder Bhatia 1

Faeze Brahman 1

Chris Callison-Burch 1

Asli Celikyilmaz 1

Khyathi Raghavi Chandu 1

David Ciemiewicz 1

Elizabeth Clark 1

Miruna Clinciu 1

Kaustubh Dhole 1

Ondřej Dušek 1

Matthew Dwyer 1

Matthew B. Dwyer 1

Chris Chinenye Emezue 1

Jatin Ganhotra 1

Cristina Garbacea 1

Sebastian Gehrmann 1

Chulaka Gunasekara 1

Gholamreza Haffari 1

Hannaneh Hajishirzi 1

Tatsunori B. Hashimoto 1

Saahith Janapati 1

Yacine Jernite 1

Harsh Jhamtani 1

Shailza Jolly 1

Sachindra Joshi 1

Faisal Ladhak 1

Jack Lanchantin 1

Colin Lockard 1

Michael R. Lyu 1

Mounica Maddela 1

Khyati Mahajan 1

Saad Mahamood 1

Bodhisattwa Prasad Majumder 1

Pedro Henrique Martins 1

Sebastian Martschat 1

Angelina McMillan-Major 1

Ritwick Mishra 1

Shashi Narayan 1

Vitaly Nikolaev 1

Elizabeth Palmieri 1

Rebecca J. Passonneau 1

Laura Perez-Beltrachini 1

Niranjan Ramesh Rao 1

Juan Diego Rodriguez 1

Andre Niyongabo Rubungo 1

Sashank Santhanam 1

Thibault Sellam 1

Samira Shaikh 1

Anastasia Shimorina 1

Sanchit Sinha 1

Marco Antonio Sobrevilla Cabezudo 1

Hendrik Strobelt 1

Nishant Subramani 1

Swabha Swayamdipta 1

Emiel Van Miltenburg 1

Dane A Williamson 1

Akhila Yerukola 1

Dong Yu (于东) 1

Jianqiao Zhao 1

Guangtao Zheng 1

Guoqing Zheng 1

Venues