Chenhao Tan


pdf bib
What to Learn, and How: Toward Effective Learning from Rationales
Samuel Carton | Surya Kanoria | Chenhao Tan
Findings of the Association for Computational Linguistics: ACL 2022

Learning from rationales seeks to augment model prediction accuracy using human-annotated rationales (i.e. subsets of input tokens) that justify their chosen labels, often in the form of intermediate or multitask supervision. While intuitive, this idea has proven elusive in practice. We make two observations about human rationales via empirical analyses:1) maximizing rationale supervision accuracy is not necessarily the optimal objective for improving model accuracy; 2) human rationales vary in whether they provide sufficient information for the model to exploit for prediction.Building on these insights, we propose several novel loss functions and learning strategies, and evaluate their effectiveness on three datasets with human rationales. Our results demonstrate consistent improvements over baselines in both label and rationale accuracy, including a 3% accuracy improvement on MultiRC. Our work highlights the importance of understanding properties of human explanations and exploiting them accordingly in model training.

pdf bib
Explaining Why: How Instructions and User Interfaces Impact Annotator Rationales When Labeling Text Data
Jamar Sullivan Jr. | Will Brackenbury | Andrew McNutt | Kevin Bryson | Kwam Byll | Yuxin Chen | Michael Littman | Chenhao Tan | Blase Ur
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In the context of data labeling, NLP researchers are increasingly interested in having humans select rationales, a subset of input tokens relevant to the chosen label. We conducted a 332-participant online user study to understand how humans select rationales, especially how different instructions and user interface affordances impact the rationales chosen. Participants labeled ten movie reviews as positive or negative, selecting words and phrases supporting their label as rationales. We varied the instructions given, the rationale-selection task, and the user interface. Participants often selected about 12% of input tokens as rationales, but selected fewer if unable to drag over multiple tokens at once. Whereas participants were near unanimous in their data labels, they were far less consistent in their rationales. The user interface affordances and task greatly impacted the types of rationales chosen. We also observed large variance across participants.

pdf bib
On the Diversity and Limits of Human Explanations
Chenhao Tan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A growing effort in NLP aims to build datasets of human explanations. However, it remains unclear whether these datasets serve their intended goals. This problem is exacerbated by the fact that the term explanation is overloaded and refers to a broad range of notions with different properties and ramifications. Our goal is to provide an overview of the diversity of explanations, discuss human limitations in providing explanations, and ultimately provide implications for collecting and using human explanations in NLP.Inspired by prior work in psychology and cognitive sciences, we group existing human explanations in NLP into three categories: proximal mechanism, evidence, and procedure. These three types differ in nature and have implications for the resultant explanations. For instance, procedure is not considered explanation in psychology and connects with a rich body of work on learning from instructions. The diversity of explanations is further evidenced by proxy questions that are needed for annotators to interpret and answer “why is [input] assigned [label]”. Finally, giving explanations may require different, often deeper, understandings than predictions, which casts doubt on whether humans can provide valid explanations in some tasks.

pdf bib
Human-Centered Evaluation of Explanations
Jordan Boyd-Graber | Samuel Carton | Shi Feng | Q. Vera Liao | Tania Lombrozo | Alison Smith-Renner | Chenhao Tan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts

The NLP community are increasingly interested in providing explanations for NLP models to help people make sense of model behavior and potentially improve human interaction with models. In addition to computational challenges in generating these explanations, evaluations of the generated explanations require human-centered perspectives and approaches. This tutorial will provide an overview of human-centered evaluations of explanations. First, we will give a brief introduction to the psychological foundation of explanations as well as types of NLP model explanations and their corresponding presentation, to provide the necessary background. We will then present a taxonomy of human-centered evaluation of explanations and dive into depth in the two categories: 1) evaluation based on human-annotated explanations; 2) evaluation with human-subjects studies. We will conclude by discussing future directions. We will also adopt a flipped format to maximize the in- teractive components for the live audience.


pdf bib
Decision-Focused Summarization
Chao-Chun Hsu | Chenhao Tan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Relevance in summarization is typically de- fined based on textual information alone, without incorporating insights about a particular decision. As a result, to support risk analysis of pancreatic cancer, summaries of medical notes may include irrelevant information such as a knee injury. We propose a novel problem, decision-focused summarization, where the goal is to summarize relevant information for a decision. We leverage a predictive model that makes the decision based on the full text to provide valuable insights on how a decision can be inferred from text. To build a summary, we then select representative sentences that lead to similar model decisions as using the full text while accounting for textual non-redundancy. To evaluate our method (DecSum), we build a testbed where the task is to summarize the first ten reviews of a restaurant in support of predicting its future rating on Yelp. DecSum substantially outperforms text-only summarization methods and model-based explanation methods in decision faithfulness and representativeness. We further demonstrate that DecSum is the only method that enables humans to outperform random chance in predicting which restaurant will be better rated in the future.

pdf bib
Investigating the Effect of Natural Language Explanations on Out-of-Distribution Generalization in Few-shot NLI
Yangqiaoyu Zhou | Chenhao Tan
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Although neural models have shown strong performance in datasets such as SNLI, they lack the ability to generalize out-of-distribution (OOD). In this work, we formulate a few-shot learning setup and examine the effects of natural language explanations on OOD generalization. We leverage the templates in the HANS dataset and construct templated natural language explanations for each template. Although generated explanations show competitive BLEU scores against ground truth explanations, they fail to improve prediction performance. We further show that generated explanations often hallucinate information and miss key elements that indicate the label.

pdf bib
On Positivity Bias in Negative Reviews
Madhusudhan Aithal | Chenhao Tan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Prior work has revealed that positive words occur more frequently than negative words in human expressions, which is typically attributed to positivity bias, a tendency for people to report positive views of reality. But what about the language used in negative reviews? Consistent with prior work, we show that English negative reviews tend to contain more positive words than negative words, using a variety of datasets. We reconcile this observation with prior findings on the pragmatics of negation, and show that negations are commonly associated with positive words in negative reviews. Furthermore, in negative reviews, the majority of sentences with positive words express negative opinions based on sentiment classifiers, indicating some form of negation.


pdf bib
Characterizing the Value of Information in Medical Notes
Chao-Chun Hsu | Shantanu Karnwal | Sendhil Mullainathan | Ziad Obermeyer | Chenhao Tan
Findings of the Association for Computational Linguistics: EMNLP 2020

Machine learning models depend on the quality of input data. As electronic health records are widely adopted, the amount of data in health care is growing, along with complaints about the quality of medical notes. We use two prediction tasks, readmission prediction and in-hospital mortality prediction, to characterize the value of information in medical notes. We show that as a whole, medical notes only provide additional predictive power over structured information in readmission prediction. We further propose a probing framework to select parts of notes that enable more accurate predictions than using all notes, despite that the selected information leads to a distribution shift from the training data (“all notes”). Finally, we demonstrate that models trained on the selected valuable information achieve even better predictive performance, with only 6.8%of all the tokens for readmission prediction.

pdf bib
Evaluating and Characterizing Human Rationales
Samuel Carton | Anirudh Rathore | Chenhao Tan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Two main approaches for evaluating the quality of machine-generated rationales are: 1) using human rationales as a gold standard; and 2) automated metrics based on how rationales affect model behavior. An open question, however, is how human rationales fare with these automatic metrics. Analyzing a variety of datasets and models, we find that human rationales do not necessarily perform well on these metrics. To unpack this finding, we propose improved metrics to account for model-dependent baseline performance. We then propose two methods to further characterize rationale quality, one based on model retraining and one on using “fidelity curves” to reveal properties such as irrelevance and redundancy. Our work leads to actionable suggestions for evaluating and characterizing rationales.


pdf bib
Many Faces of Feature Importance: Comparing Built-in and Post-hoc Feature Importance in Text Classification
Vivian Lai | Zheng Cai | Chenhao Tan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Feature importance is commonly used to explain machine predictions. While feature importance can be derived from a machine learning model with a variety of methods, the consistency of feature importance via different methods remains understudied. In this work, we systematically compare feature importance from built-in mechanisms in a model such as attention values and post-hoc methods that approximate model behavior such as LIME. Using text classification as a testbed, we find that 1) no matter which method we use, important features from traditional models such as SVM and XGBoost are more similar with each other, than with deep learning models; 2) post-hoc methods tend to generate more similar important features for two models than built-in methods. We further demonstrate how such similarity varies across instances. Notably, important features do not always resemble each other better when two models agree on the predicted label than when they disagree.

pdf bib
What Gets Echoed? Understanding the “Pointers” in Explanations of Persuasive Arguments
David Atkinson | Kumar Bhargav Srinivasan | Chenhao Tan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Explanations are central to everyday life, and are a topic of growing interest in the AI community. To investigate the process of providing natural language explanations, we leverage the dynamics of the /r/ChangeMyView subreddit to build a dataset with 36K naturally occurring explanations of why an argument is persuasive. We propose a novel word-level prediction task to investigate how explanations selectively reuse, or echo, information from what is being explained (henceforth, explanandum). We develop features to capture the properties of a word in the explanandum, and show that our proposed features not only have relatively strong predictive power on the echoing of a word in an explanation, but also enhance neural methods of generating explanations. In particular, while the non-contextual properties of a word itself are more valuable for stopwords, the interaction between the constituent parts of an explanandum is crucial in predicting the echoing of content words. We also find intriguing patterns of a word being echoed. For example, although nouns are generally less likely to be echoed, subjects and objects can, depending on their source, be more likely to be echoed in the explanations.

pdf bib
Measuring Online Debaters’ Persuasive Skill from Text over Time
Kelvin Luu | Chenhao Tan | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 7

Online debates allow people to express their persuasive abilities and provide exciting opportunities for understanding persuasion. Prior studies have focused on studying persuasion in debate content, but without accounting for each debater’s history or exploring the progression of a debater’s persuasive ability. We study debater skill by modeling how participants progress over time in a collection of debates from We build on a widely used model of skill in two-player games and augment it with linguistic features of a debater’s content. We show that online debaters’ skill levels do tend to improve over time. Incorporating linguistic profiles leads to more robust skill estimation than winning records alone. Notably, we find that an interaction feature combining uncertainty cues (hedging) with terms strongly associated with either side of a particular debate (fightin’ words) is more predictive than either feature on its own, indicating the importance of fine- grained linguistic features.

pdf bib
No Permanent Friends or Enemies: Tracking Relationships between Nations from News
Xiaochuang Han | Eunsol Choi | Chenhao Tan
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Understanding the dynamics of international politics is important yet challenging for civilians. In this work, we explore unsupervised neural models to infer relations between nations from news articles. We extend existing models by incorporating shallow linguistics information and propose a new automatic evaluation metric that aligns relationship dynamics with manually annotated key events. As understanding international relations requires carefully analyzing complex relationships, we conduct in-person human evaluations with three groups of participants. Overall, humans prefer the outputs of our model and give insightful feedback that suggests future directions for human-centered models. Furthermore, our model reveals interesting regional differences in news coverage. For instance, with respect to US-China relations, Singaporean media focus more on “strengthening” and “purchasing”, while US media focus more on “criticizing” and “denouncing”.


pdf bib
LSTMs Exploit Linguistic Attributes of Data
Nelson F. Liu | Omer Levy | Roy Schwartz | Chenhao Tan | Noah A. Smith
Proceedings of the Third Workshop on Representation Learning for NLP

While recurrent neural networks have found success in a variety of natural language processing applications, they are general models of sequential data. We investigate how the properties of natural language data affect an LSTM’s ability to learn a nonlinguistic task: recalling elements from its input. We find that models trained on natural language data are able to recall tokens from much longer sequences than models trained on non-language sequential data. Furthermore, we show that the LSTM learns to solve the memorization task by explicitly using a subset of its neurons to count timesteps in the input. We hypothesize that the patterns and structure in natural language data enable LSTMs to learn by providing approximate ways of reducing loss, but understanding the effect of different training data on the learnability of LSTMs remains an open question.

pdf bib
Neural Models for Documents with Metadata
Dallas Card | Chenhao Tan | Noah A. Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most real-world document collections involve various types of metadata, such as author, source, and date, and yet the most commonly-used approaches to modeling text corpora ignore this information. While specialized models have been developed for particular applications, few are widely used in practice, as customization typically requires derivation of a custom inference algorithm. In this paper, we build on recent advances in variational inference methods and propose a general neural framework, based on topic models, to enable flexible incorporation of metadata and allow for rapid exploration of alternative models. Our approach achieves strong performance, with a manageable tradeoff between perplexity, coherence, and sparsity. Finally, we demonstrate the potential of our framework through an exploration of a corpus of articles about US immigration.


pdf bib
Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts
Chenhao Tan | Dallas Card | Noah A. Smith
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding how ideas relate to each other is a fundamental question in many domains, ranging from intellectual history to public communication. Because ideas are naturally embedded in texts, we propose the first framework to systematically characterize the relations between ideas based on their occurrence in a corpus of documents, independent of how these ideas are represented. Combining two statistics—cooccurrence within documents and prevalence correlation over time—our approach reveals a number of different ways in which ideas can cooperate and compete. For instance, two ideas can closely track each other’s prevalence over time, and yet rarely cooccur, almost like a “cold war” scenario. We observe that pairwise cooccurrence and prevalence correlation exhibit different distributions. We further demonstrate that our approach is able to uncover intriguing relations between ideas through in-depth case studies on news articles and research papers.

pdf bib
Dynamic Entity Representations in Neural Language Models
Yangfeng Ji | Chenhao Tan | Sebastian Martschat | Yejin Choi | Noah A. Smith
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Understanding a long document requires tracking how entities are introduced and evolve over time. We present a new type of language model, EntityNLM, that can explicitly model entities, dynamically update their representations, and contextually generate their mentions. Our model is generative and flexible; it can model an arbitrary number of entities in context while generating each entity mention at an arbitrary length. In addition, it can be used for several different tasks such as language modeling, coreference resolution, and entity prediction. Experimental results with all these tasks demonstrate that our model consistently outperforms strong baselines and prior work.


pdf bib
The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter
Chenhao Tan | Lillian Lee | Bo Pang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Corpus of Sentence-level Revisions in Academic Writing: A Step towards Understanding Statement Strength in Communication
Chenhao Tan | Lillian Lee
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


pdf bib
Hedge Detection as a Lens on Framing in the GMO Debates: A Position Paper
Eunsol Choi | Chenhao Tan | Lillian Lee | Cristian Danescu-Niculescu-Mizil | Jennifer Spindel
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics


pdf bib
Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
Bin Lu | Chenhao Tan | Claire Cardie | Benjamin K. Tsou
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies