Junyi Jessy Li - ACL Anthology

Junyi Jessy Li

2026

Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs
Junyi Jessy Li | Yang Janet Liu | Kanishka Misra | Valentina Pyatkin | William Sheffield
Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026)

The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This begs important questions in education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula.We present a new course, "Computational Discourse and Natural Language Generation". The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.

Wugnectives: Novel Entity Inferences of Language Models from Discourse Connectives
Daniel Brubaker | William Sheffield | Junyi Jessy Li | Kanishka Misra
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The role of world knowledge has been particularly crucial to predict the discourse connective that marks the discourse relation between two arguments, with language models (LMs) being generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present Wugnectives, a dataset of 8,880 stimuli that evaluates LMs’ inferences about novel entities in contexts where connectives link the entities to particular attributes. On investigating 17 different LMs at various scales, and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs’ overall performance across connective type, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs.We release Wugnectives at https://github.com/kanishkamisra/wugnectives

Strategic Dialogue Assessment: The Crooked Path to Innocence
Anshun Asher Zheng | Junyi Jessy Li | David I. Beaver
Dialogue & Discourse Volume 17

Language is often used strategically, particularly in high-stakes, adversarial settings, yet most work on pragmatics and LLMs centers on cooperative settings. This leaves a gap in the systematic understanding of strategic communication in adversarial settings. To address this, we introduce SDA (Strategic Dialogue Assessment), a framework grounded in Gricean and game-theoretic pragmatics to assess strategic use of language. It adapts the ME Game jury function to make it empirically estimable for analyzing dialogue. Our approach incorporates two key adaptations: a commitment-based taxonomy of discourse moves, which provides a finer-grained account of strategic effects, and the use of estimable proxies grounded in Gricean maxims to operationalize abstract constructs such as credibility. Together, these adaptations build on discourse theory by treating discourse as the strategic management of commitments, enabling systematic evaluation of how conversational moves advance or undermine discourse goals. We further derive three interpretable metrics - Benefit at Turn (BAT), Penalty at Turn (PAT), and Normalized Relative Benefit at Turn (NRBAT) - to quantify the perceived strategic effects of discourse moves. We also present CPD (the Crooked Path Dataset), an annotated dataset of real courtroom cross-examinations, to demonstrate the framework’s effectiveness. Using these tools, we evaluate a range of LLMs and show that LLMs generally exhibit limited pragmatic understanding of strategic language. While model size shows an increase in performance on our metrics, reasoning ability does not help and largely hurts, introducing overcomplication and internal confusion.

2025

Behavioral Analysis of Information Salience in Large Language Models
Jan Trienes | Jörg Schlötterer | Junyi Jessy Li | Christin Seifert
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.

Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)
Michael Strube | Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Amir Zeldes | Chuyuan Li
Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)

Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs
William Sheffield | Kanishka Misra | Valentina Pyatkin | Ashwini Deo | Kyle Mahowald | Junyi Jessy Li
Findings of the Association for Computational Linguistics: ACL 2025

Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects,as exemplified by the diverse uses of the particle *just* (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English *just*, a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.

2024

Learning to Refine with Fine-Grained Natural Language Feedback
Manya Wadhwa | Xinyu Zhao | Junyi Jessy Li | Greg Durrett
Findings of the Association for Computational Linguistics: EMNLP 2024

Recent work has explored the capability of large language models (LLMs) to identify and correct errors in LLM-generated responses. These refinement approaches frequently evaluate what sizes of models are able to do refinement for what problems, but less attention is paid to what effective feedback for refinement looks like. In this work, we propose looking at refinement with feedback as a composition of three distinct LLM competencies: (1) detection of bad generations; (2) fine-grained natural language critique generation; (3) refining with fine-grained feedback. The first step can be implemented with a high-performing discriminative model and steps 2 and 3 can be implemented either via prompted or fine-tuned LLMs. A key property of the proposed Detect, Critique, Refine (“DCR”) method is that the step 2 critique model can give fine-grained feedback about errors, made possible by offloading the discrimination to a separate model in step 1. We show that models of different capabilities benefit from refining with DCR on the task of improving factual consistency of document grounded summaries. Overall, DCR consistently outperforms existing end-to-end refinement approaches and current trained models not fine-tuned for factuality critiquing.

Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)
Michael Strube | Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Amir Zeldes | Chuyuan Li
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)

Detection and Measurement of Syntactic Templates in Generated Text
Chantal Shaib | Yanai Elazar | Junyi Jessy Li | Byron C Wallace
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The diversity of text can be measured beyond word-level features, however existing diversity evaluation focuses primarily on word-level features. Here we propose a method for evaluating diversity over syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates (e.g., strings comprising parts-of-speech) and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference textsWe find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning or alignment processes such as RLHF. The connection between templates in generated text and the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data.We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions.Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.

InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification
Jan Trienes | Sebastian Joseph | Jörg Schlötterer | Christin Seifert | Kyle Lo | Wei Xu | Byron Wallace | Junyi Jessy Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text simplification aims to make technical texts more accessible to laypeople but often results in deletion of information and vagueness. This work proposes InfoLossQA, a framework to characterize and recover simplification-induced information loss in form of question-and-answer (QA) pairs. Building on the theory of Questions Under Discussion, the QA pairs are designed to help readers deepen their knowledge of a text. First, we collect a dataset of 1,000 linguist-curated QA pairs derived from 104 LLM simplifications of English medical study abstracts. Our analyses of this data reveal that information loss occurs frequently, and that the QA pairs give a high-level overview of what information was lost. Second, we devise two methods for this task: end-to-end prompting of open-source and commercial language models, and a natural language inference pipeline. With a novel evaluation framework considering the correctness of QA pairs and their linguistic suitability, our expert evaluation reveals that models struggle to reliably identify information loss and applying similar standards as humans at what constitutes information loss.

Wrapper Boxes for Faithful Attribution of Model Predictions to Training Data
Yiheng Su | Junyi Jessy Li | Matthew Lease
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Can we preserve the accuracy of neural models while also providing faithful explanations of model decisions to training data? We propose a “wrapper box” pipeline: training a neural model as usual and then using its learned feature representation in classic, interpretable models to perform prediction. Across seven language models of varying sizes, including four large language models (LLMs), two datasets at different scales, three classic models, and four evaluation metrics, we first show that the predictive performance of wrapper classic models is largely comparable to the original neural models. Because classic models are transparent, each model decision is determined by a known set of training examples that can be directly shown to users. Our pipeline thus preserves the predictive performance of neural language models while faithfully attributing classic model decisions to training data. Among other use cases, such attribution enables model decisions to be contested based on responsible training instances. Compared to prior work, our approach achieves higher coverage and correctness in identifying which training data to remove to change a model decision. To reproduce findings, our source code is online at: https://github.com/SamSoup/WrapperBox.

Do *they* mean ‘us’? Interpreting Referring Expression variation under Intergroup Bias
Venkata S Govindarajan | Matianyu Zang | Kyle Mahowald | David Beaver | Junyi Jessy Li
Findings of the Association for Computational Linguistics: EMNLP 2024

The variations between in-group and out-group speech (intergroup bias) are subtle and could underlie many social phenomena like stereotype perpetuation and implicit bias. In this paper, we model intergroup bias as a tagging task on English sports comments from forums dedicated to fandom for NFL teams. We curate a dataset of over 6 million game-time comments from opposing perspectives (the teams in the game), each comment grounded in a non-linguistic description of the events that precipitated these comments (live win probabilities for each team). Expert and crowd annotations justify modeling the bias through tagging of implicit and explicit referring expressions and reveal the rich, contextual understanding of language and the world required for this task. For large-scale analysis of intergroup variation, we use LLMs for automated tagging, and discover that LLMs occasionally perform better when prompted with linguistic descriptions of the win probability at the time of the comment, rather than numerical probability. Further, large-scale tagging of comments using LLMs uncovers linear variations in the form of referent across win probabilities that distinguish in-group and out-group utterances.

Which questions should I answer? Salience Prediction of Inquisitive Questions
Yating Wu | Ritika Mangla | Alexandros G. Dimakis | Greg Durrett | Junyi Jessy Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Inquisitive questions — open-ended, curiosity-driven questions people ask as they read — are an integral part of discourse processing and comprehension. Recent work in NLP has taken advantage of question generation capabilities of LLMs to enhance a wide range of applications. But the space of inquisitive questions is vast: many questions can be evoked from a given context. So which of those should be prioritized to find answers? Linguistic theories, unfortunately, have not yet provided an answer to this question. This paper presents QSalience, a salience predictor of inquisitive questions. QSalience is instruction-tuned over our dataset of linguist-annotated salience scores of 1,766 (context, question) pairs. A question scores high on salience if answering it would greatly enhance the understanding of the text. We show that highly salient questions are empirically more likely to be answered in the same article, bridging potential questions with Questions Under Discussion. We further validate our findings by showing that answering salient questions is an indicator of summarization quality in news.

FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence
Sebastian Joseph | Lily Chen | Jan Trienes | Hannah Göke | Monika Coers | Wei Xu | Byron Wallace | Junyi Jessy Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated from three LLMs (i.e., GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added by LLMs. Using FactPICO, we benchmark a range of existing factuality metrics, including the newly devised ones based on LLMs. We find that plain language summarization of medical evidence is still challenging, especially when balancing between simplicity and factuality, and that existing metrics correlate poorly with expert judgments on the instance level.

Language Models (Mostly) Do Not Consider Emotion Triggers When Predicting Emotion
Smriti Singh | Cornelia Caragea | Junyi Jessy Li
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Situations and events evoke emotions in humans, but to what extent do they inform the prediction of emotion detection models? This work investigates how well human-annotated emotion triggers correlate with features that models deemed salient in their prediction of emotions. First, we introduce a novel dataset EmoTrigger, consisting of 900 social media posts sourced from three different datasets; these were annotated by experts for emotion triggers with high agreement. Using EmoTrigger, we evaluate the ability of large language models (LLMs) to identify emotion triggers, and conduct a comparative analysis of the features considered important for these tasks between LLMs and fine-tuned models. Our analysis reveals that emotion triggers are largely not considered salient features for emotion prediction models, instead there is intricate interplay between various features and the task of emotion detection.

Sarcasm Detection in a Disaster Context
Tiberiu Sosea | Junyi Jessy Li | Cornelia Caragea
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

During natural disasters, people often use social media platforms such as Twitter to ask for help, to provide information about the disaster situation, or to express contempt about the unfolding event or public policies and guidelines. This contempt is in some cases expressed as sarcasm or irony. Understanding this form of speech in a disaster-centric context is essential to improving natural language understanding of disaster-related tweets. In this paper, we introduce HurricaneSARC, a dataset of 15,000 tweets annotated for intended sarcasm, and provide a comprehensive investigation of sarcasm detection using pre-trained language models. Our best model is able to obtain as much as 0.70 F1 on our dataset. We also demonstrate that the performance on HurricaneSARC can be improved by leveraging intermediate task transfer learning

2023

How people talk about each other: Modeling Generalized Intergroup Bias and Emotion
Venkata Subrahmanyan Govindarajan | Katherine Atwell | Barea Sinno | Malihe Alikhani | David I. Beaver | Junyi Jessy Li
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Current studies of bias in NLP rely mainly on identifying (unwanted or negative) bias towards a specific demographic group. While this has led to progress recognizing and mitigating negative bias, and having a clear notion of the targeted group is necessary, it is not always practical. In this work we extrapolate to a broader notion of bias, rooted in social science and psychology literature. We move towards predicting interpersonal group relationship (IGR) - modeling the relationship between the speaker and the target in an utterance - using fine-grained interpersonal emotions as an anchor. We build and release a dataset of English tweets by US Congress members annotated for interpersonal emotion - the first of its kind, and ‘found supervision’ for IGR labels; our analyses show that subtle emotional signals are indicative of different biases. While humans can perform better than chance at identifying IGR given an utterance, we show that neural models perform much better; furthermore, a shared encoding between IGR and interpersonal perceived emotion enabled performance gains in both tasks.

Discourse Analysis via Questions and Answers: Parsing Dependency Structures of Questions Under Discussion
Wei-Jen Ko | Yating Wu | Cutter Dalton | Dananjay Srinivas | Greg Durrett | Junyi Jessy Li
Findings of the Association for Computational Linguistics: ACL 2023

Automatic discourse processing is bottlenecked by data: current discourse formalisms pose highly demanding annotation tasks involving large taxonomies of discourse relations, making them inaccessible to lay annotators. This work instead adopts the linguistic framework of Questions Under Discussion (QUD) for discourse analysis and seeks to derive QUD structures automatically. QUD views each sentence as an answer to a question triggered in prior context; thus, we characterize relationships between sentences as free-form questions, in contrast to exhaustive fine-grained taxonomies. We develop the first-of-its-kind QUD parser that derives a dependency structure of questions over full documents, trained using a large, crowdsourced question-answering dataset DCQA (Ko et al., 2022). Human evaluation results show that QUD dependency parsing is possible for language models trained with this crowdsourced, generalizable annotation scheme. We illustrate how our QUD structure is distinct from RST trees, and demonstrate the utility of QUD analysis in the context of document simplification. Our findings show that QUD parsing is an appealing alternative for automatic discourse processing.

Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)
Michael Strube | Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Amir Zeldes
Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)

Elaborative Simplification as Implicit Questions Under Discussion
Yating Wu | William Sheffield | Kyle Mahowald | Junyi Jessy Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Automated text simplification, a technique useful for making text more accessible to people such as children and emergent bilinguals, is often thought of as a monolingual translation task from complex sentences to simplified sentences using encoder-decoder models. This view fails to account for elaborative simplification, where new information is added into the simplified text. This paper proposes to view elaborative simplification through the lens of the Question Under Discussion (QUD) framework, providing a robust way to investigate what writers elaborate upon, how they elaborate, and how elaborations fit into the discourse context by viewing elaborations as explicit answers to implicit questions. We introduce ELABQUD, consisting of 1.3K elaborations accompanied with implicit QUDs, to study these phenomena. We show that explicitly modeling QUD (via question generation) not only provides essential understanding of elaborative simplification and how the elaborations connect with the rest of the discourse, but also substantially improves the quality of elaboration generation.

Multilingual Simplification of Medical Texts
Sebastian Joseph | Kathryn Kazanas | Keziah Reina | Vishnesh Ramanathan | Wei Xu | Byron Wallace | Junyi Jessy Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Automated text simplification aims to produce simple versions of complex texts. This task is especially useful in the medical domain, where the latest medical findings are typically communicated via complex and technical articles. This creates barriers for laypeople seeking access to up-to-date medical findings, consequently impeding progress on health literacy. Most existing work on medical text simplification has focused on monolingual settings, with the result that such evidence would be available only in just one language (most often, English). This work addresses this limitation via multilingual simplification, i.e., directly simplifying complex texts into simplified texts in multiple languages. We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages: English, Spanish, French, and Farsi. We evaluate fine-tuned and zero-shot models across these languages with extensive human assessments and analyses. Although models can generate viable simplified texts, we identify several outstanding challenges that this dataset might be used to address.

Counterfactual Probing for the Influence of Affect and Specificity on Intergroup Bias
Venkata Subrahmanyan Govindarajan | David Beaver | Kyle Mahowald | Junyi Jessy Li
Findings of the Association for Computational Linguistics: ACL 2023

While existing work on studying bias in NLP focues on negative or pejorative language use, Govindarajan et al. (2023) offer a revised framing of bias in terms of intergroup social context, and its effects on language behavior. In this paper, we investigate if two pragmatic features (specificity and affect) systematically vary in different intergroup contexts — thus connecting this new framing of bias to language output. Preliminary analysis finds modest correlations between specificity and affect of tweets with supervised intergroup relationship (IGR) labels. Counterfactual probing further reveals that while neural models finetuned for predicting IGR reliably use affect in classification, the model’s usage of specificity is inconclusive.

Summarizing, Simplifying, and Synthesizing Medical Evidence using GPT-3 (with Varying Success)
Chantal Shaib | Millicent Li | Sebastian Joseph | Iain Marshall | Junyi Jessy Li | Byron Wallace
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large language models, particularly GPT-3, are able to produce high quality summaries ofgeneral domain news articles in few- and zero-shot settings. However, it is unclear if such models are similarly capable in more specialized domains such as biomedicine. In this paper we enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given no supervision. We consider bothsingle- and multi-document settings. In the former, GPT-3 is tasked with generating regular and plain-language summaries of articles describing randomized controlled trials; in thelatter, we assess the degree to which GPT-3 is able to synthesize evidence reported acrossa collection of articles. We design an annotation scheme for evaluating model outputs, withan emphasis on assessing the factual accuracy of generated summaries. We find that whileGPT-3 is able to summarize and simplify single biomedical articles faithfully, it strugglesto provide accurate aggregations of findings over multiple documents. We release all data,code, and annotations used in this work.

QUDeval: The Evaluation of Questions Under Discussion Discourse Parsing
Yating Wu | Ritika Mangla | Greg Durrett | Junyi Jessy Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Questions Under Discussion (QUD) is a versatile linguistic framework in which discourse progresses as continuously asking questions and answering them. Automatic parsing of a discourse to produce a QUD structure thus entails a complex question generation task: given a document and an answer sentence, generate a question that satisfies linguistic constraints of QUD and can be grounded in an anchor sentence in prior context. These questions are known to be curiosity-driven and open-ended. This work introduces the first framework for the automatic evaluation of QUD parsing, instantiating the theoretical constraints of QUD in a concrete protocol. We present QUDeval, a dataset of fine-grained evaluation of 2,190 QUD questions generated from both fine-tuned systems and LLMs. Using QUDeval, we show that satisfying all constraints of QUD is still challenging for modern LLMs, and that existing evaluation metrics poorly approximate parser quality. Encouragingly, human-authored QUDs are scored highly by our human evaluators, suggesting that there is headroom for further progress on language modeling to improve both QUD parsing and QUD evaluation.

Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses
Yaman Kumar | Swapnil Parekh | Somesh Singh | Junyi Jessy Li | Rajiv Ratn Shah | Changyou Chen
Dialogue & Discourse Volume 14

Deep-learning based Automatic Essay Scoring (AES) systems are being actively used in various high-stake applications in education and testing. However, little research has been put to understand and interpret the black-box nature of deep-learning-based scoring algorithms. While previous studies indicate that scoring models can be easily fooled, in this paper, we explore the reason behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., large change in output score with a little change in input essay content) and overstability (i.e., little change in output scores with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite getting trained as “end-to-end” models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without the requirement of any context making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that rich linguistic features such as parts-of-speech and morphology are encoded by them. Further, we also find that the models have learnt dataset biases, making them oversensitive. The presence of a few words with high co-occurrence with a certain score class makes the model associate the essay sample with that score. This causes score changes in ∼95% of samples with an addition of only a few words. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracies. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.

Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models
Hongli Zhan | Desmond C. Ong | Junyi Jessy Li
Findings of the Association for Computational Linguistics: EMNLP 2023

The emotions we experience involve complex processes; besides physiological aspects, research in psychology has studied cognitive appraisals where people assess their situations subjectively, according to their own values (Scherer, 2005). Thus, the same situation can often result in different emotional experiences. While the detection of emotion is a well-established task, there is very limited work so far on the automatic prediction of cognitive appraisals. This work fills the gap by presenting CovidET-Appraisals, the most comprehensive dataset to-date that assesses 24 appraisal dimensions, each with a natural language rationale, across 241 Reddit posts. CovidET-Appraisals presents an ideal testbed to evaluate the ability of large language models — excelling at a wide range of NLP tasks — to automatically assess and explain cognitive appraisals. We found that while the best models are performant, open-sourced LLMs fall short at this task, presenting a new challenge in the future development of emotionally intelligent models. We release our dataset at https://github.com/honglizhan/CovidET-Appraisals-Public.

Unsupervised Extractive Summarization of Emotion Triggers
Tiberiu Sosea | Hongli Zhan | Junyi Jessy Li | Cornelia Caragea
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding what leads to emotions during large-scale crises is important as it can provide groundings for expressed emotions and subsequently improve the understanding of ongoing disasters. Recent approaches trained supervised models to both detect emotions and explain emotion triggers (events and appraisals) via abstractive summarization. However, obtaining timely and qualitative abstractive summaries is expensive and extremely time-consuming, requiring highly-trained expert annotators. In time-sensitive, high-stake contexts, this can block necessary responses. We instead pursue unsupervised systems that extract triggers from text. First, we introduce CovidET-EXT, augmenting (Zhan et al., 2022)’s abstractive dataset (in the context of the COVID-19 crisis) with extractive triggers. Second, we develop new unsupervised learning models that can jointly detect emotions and summarize their triggers. Our best approach, entitled Emotion-Aware Pagerank, incorporates emotion information from external sources combined with a language understanding module, and outperforms strong baselines. We release our data and code at https://github.com/tsosea2/CovidET-EXT.

2022

Why Do You Feel This Way? Summarizing Triggers of Emotions in Social Media Posts
Hongli Zhan | Tiberiu Sosea | Cornelia Caragea | Junyi Jessy Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Crises such as the COVID-19 pandemic continuously threaten our world and emotionally affect billions of people worldwide in distinct ways. Understanding the triggers leading to people’s emotions is of crucial importance. Social media posts can be a good source of such analysis, yet these texts tend to be charged with multiple emotions, with triggers scattering across multiple sentences. This paper takes a novel angle, namely, emotion detection and trigger summarization, aiming to both detect perceived emotions in text, and summarize events and their appraisals that trigger each emotion. To support this goal, we introduce CovidET (Emotions and their Triggers during Covid-19), a dataset of ~1,900 English Reddit posts related to COVID-19, which contains manual annotations of perceived emotions and abstractive summaries of their triggers described in the post. We develop strong baselines to jointly detect emotions and summarize emotion triggers. Our analyses show that CovidET presents new challenges in emotion-specific summarization, as well as multi-emotion detection in long social media posts.

Political Ideology and Polarization: A Multi-dimensional Approach
Barea Sinno | Bernardo Oviedo | Katherine Atwell | Malihe Alikhani | Junyi Jessy Li
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Analyzing ideology and polarization is of critical importance in advancing our grasp of modern politics. Recent research has made great strides towards understanding the ideological bias (i.e., stance) of news media along the left-right spectrum. In this work, we instead take a novel and more nuanced approach for the study of ideology based on its left or right positions on the issue being discussed. Aligned with the theoretical accounts in political science, we treat ideology as a multi-dimensional construct, and introduce the first diachronic dataset of news articles whose ideological positions are annotated by trained political scientists and linguists at the paragraph level. We showcase that, by controlling for the author’s stance, our method allows for the quantitative and temporal measurement and analysis of polarization as a multidimensional ideological distance. We further present baseline models for ideology prediction, outlining a challenging task distinct from stance detection.

Impact of Evaluation Methodologies on Code Summarization
Pengyu Nie | Jiyang Zhang | Junyi Jessy Li | Ray Mooney | Milos Gligoric
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

There has been a growing interest in developing machine learning (ML) models for code summarization tasks, e.g., comment generation and method naming. Despite substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and test sets, were not well studied. Specifically, no prior work on code summarization considered the timestamps of code and comments during evaluation. This may lead to evaluations that are inconsistent with the intended use cases. In this paper, we introduce the time-segmented evaluation methodology, which is novel to the code summarization research community, and compare it with the mixed-project and cross-project methodologies that have been commonly used. Each methodology can be mapped to some use cases, and the time-segmented methodology should be adopted in the evaluation of ML models for code summarization. To assess the impact of methodologies, we collect a dataset of (code, comment) pairs with timestamps to train and evaluate several recent ML models for code summarization. Our experiments show that different methodologies lead to conflicting evaluation results. We invite the community to expand the set of methodologies used in evaluations.

FALTE: A Toolkit for Fine-grained Annotation for Long Text Evaluation
Tanya Goyal | Junyi Jessy Li | Greg Durrett
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

A growing swath of NLP research is tackling problems related to generating long text, including tasks such as open-ended story generation, summarization, dialogue, and more. However, we currently lack appropriate tools to evaluate these long outputs of generation models: classic automatic metrics such as ROUGE have been shown to perform poorly, and newer learned metrics do not necessarily work wellfor all tasks and domains of text. Human rating and error analysis remains a crucial component for any evaluation of long text generation. In this paper, we introduce FALTE, a web-based annotation toolkit designed to address this shortcoming. Our tool allows researchers to collect fine-grained judgments of text quality from crowdworkers using an error taxonomy specific to the downstream task. Using the taskinterface, annotators can select and assign error labels to text span selections in an incremental paragraph-level annotation workflow. The latter functionality is designed to simplify the document-level task into smaller units and reduce cognitive load on the annotators. Our tool has previously been used to run a large-scale annotation study that evaluates the coherence of long generated summaries, demonstrating its utility.

Proceedings of the 3rd Workshop on Computational Approaches to Discourse
Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Michael Strube | Amir Zeldes
Proceedings of the 3rd Workshop on Computational Approaches to Discourse

Training Dynamics for Text Summarization Models
Tanya Goyal | Jiacheng Xu | Junyi Jessy Li | Greg Durrett
Findings of the Association for Computational Linguistics: ACL 2022

Pre-trained language models (e.g. BART) have shown impressive results when fine-tuned on large summarization datasets. However, little is understood about this fine-tuning process, including what knowledge is retained from pre-training time or how content selection and generation strategies are learnt across iterations. In this work, we analyze the training dynamics for generation models, focusing on summarization. Across different datasets (CNN/DM, XSum, MediaSum) and summary properties, such as abstractiveness and hallucination, we study what the model learns at different stages of its fine-tuning process. We find that a propensity to copy the input is learned early in the training process consistently across all datasets studied. On the other hand, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, though this behavior is more varied across domains. Based on these observations, we explore complementary approaches for modifying training: first, disregarding high-loss tokens that are challenging to learn and second, disregarding low-loss tokens that are learnt very quickly in the latter stages of the training process. We show that these simple training modifications allow us to configure our model to achieve different goals, such as improving factuality or improving abstractiveness.

longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks.
Venelin Kovatchev | Trina Chatterjee | Venkata S Govindarajan | Jifan Chen | Eunsol Choi | Gabriella Chronis | Anubrata Das | Katrin Erk | Matthew Lease | Junyi Jessy Li | Yating Wu | Kyle Mahowald
Proceedings of the First Workshop on Dynamic Adversarial Data Collection

Developing methods to adversarially challenge NLP systems is a promising avenue for improving both model performance and interpretability. Here, we describe the approach of the team “longhorns” on Task 1 of the First Workshop on Dynamic Adversarial Data Collection (DADC), which asked teams to manually fool a model on an Extractive Question Answering task. Our team finished first (pending validation), with a model error rate of 62%. We advocate for a systematic, linguistically informed approach to formulating adversarial questions, and we describe the results of our pilot experiments, as well as our official submission.

Evaluating Factuality in Text Simplification
Ashwin Devaraj | William Sheffield | Byron Wallace | Junyi Jessy Li
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automated simplification models aim to make input texts more readable. Such methods have the potential to make complex information accessible to a wider audience, e.g., providing access to recent medical literature which might otherwise be impenetrable for a lay reader. However, such models risk introducing errors into automatically simplified texts, for instance by inserting statements unsupported by the corresponding original text, or by omitting key information. Providing more readable but inaccurate versions of texts may in many cases be worse than providing no such access at all. The problem of factual accuracy (and the lack thereof) has received heightened attention in the context of summarization models, but the factuality of automatically simplified texts has not been investigated. We introduce a taxonomy of errors that we use to analyze both references drawn from standard simplification datasets and state-of-the-art model outputs. We find that errors often appear in both that are not captured by existing evaluation metrics, motivating a need for research into ensuring the factual accuracy of automated simplification models.

Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Oliver Lemon | Dilek Hakkani-Tur | Junyi Jessy Li | Arash Ashrafzadeh | Daniel Hernández Garcia | Malihe Alikhani | David Vandyke | Ondřej Dušek
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

ProtoTEx: Explaining Model Decisions with Prototype Tensors
Anubrata Das | Chitrank Gupta | Venelin Kovatchev | Matthew Lease | Junyi Jessy Li
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present ProtoTEx, a novel white-box NLP classification architecture based on prototype networks (Li et al., 2018). ProtoTEx faithfully explains model decisions based on prototype tensors that encode latent clusters of training examples. At inference time, classification decisions are based on the distances between the input text and the prototype tensors, explained via the training examples most similar to the most influential prototypes. We also describe a novel interleaved training algorithm that effectively handles classes characterized by ProtoTEx indicative features. On a propaganda detection task, ProtoTEx accuracy matches BART-large and exceeds BERTlarge with the added benefit of providing faithful explanations. A user study also shows that prototype-based explanations help non-experts to better recognize propaganda in online news.

SNaC: Coherence Error Detection for Narrative Summarization
Tanya Goyal | Junyi Jessy Li | Greg Durrett
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Progress in summarizing long texts is inhibited by the lack of appropriate evaluation frameworks. A long summary that appropriately covers the facets of that text must also present a coherent narrative, but current automatic and human evaluation methods fail to identify gaps in coherence. In this work, we introduce SNaC, a narrative coherence evaluation framework for fine-grained annotations of long summaries. We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie summaries. Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowdworkers. Furthermore, we show that the collected annotations allow us to benchmark past work in coherence modeling and train a strong classifier for automatically localizing coherence errors in generated summaries. Finally, our SNaC framework can support future work in long document summarization and coherence evaluation, including improved summarization modeling and post-hoc summary correction.

Text Simplification of College Admissions Instructions: A Professionally Simplified and Verified Corpus
Zachary W. Taylor | Maximus H. Chu | Junyi Jessy Li
Proceedings of the 29th International Conference on Computational Linguistics

Access to higher education is critical for minority populations and emergent bilingual students. However, the language used by higher education institutions to communicate with prospective students is often too complex; concretely, many institutions in the US publish admissions application instructions far above the average reading level of a typical high school graduate, often near the 13th or 14th grade level. This leads to an unnecessary barrier between students and access to higher education. This work aims to tackle this challenge via text simplification. We present PSAT (Professionally Simplified Admissions Texts), a dataset with 112 admissions instructions randomly selected from higher education institutions across the US. These texts are then professionally simplified, and verified and accepted by subject-matter experts who are full-time employees in admissions offices at various institutions. Additionally, PSAT comes with manual alignments of 1,883 original-simplified sentence pairs. The result is a first-of-its-kind corpus for the evaluation and fine-tuning of text simplification systems in a high-stakes genre distinct from existing simplification resources.

How Do We Answer Complex Questions: Discourse Structure of Long-form Answers
Fangyuan Xu | Junyi Jessy Li | Eunsol Choi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Long-form answers, consisting of multiple sentences, can provide nuanced and comprehensive answers to a broader set of questions. To better understand this complex and understudied task, we study the functional structure of long-form answers collected from three datasets, ELI5, WebGPT and Natural Questions. Our main goal is to understand how humans organize information to craft complex answers. We develop an ontology of six sentence-level functional roles for long-form answers, and annotate 3.9k sentences in 640 answer paragraphs. Different answer collection methods manifest in different discourse structures. We further analyze model-generated answers – finding that annotators agree less with each other when annotating model-generated answers compared to annotating human-written answers. Our annotated data enables training a strong classifier that can be used for automatic analysis. We hope our work can inspire future research on discourse-level modeling and evaluation of long-form QA systems.

Emotion analysis and detection during COVID-19
Tiberiu Sosea | Chau Pham | Alexander Tekle | Cornelia Caragea | Junyi Jessy Li
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Understanding emotions that people express during large-scale crises helps inform policy makers and first responders about the emotional states of the population as well as provide emotional support to those who need such support. We present CovidEmo, a dataset of ~3,000 English tweets labeled with emotions and temporally distributed across 18 months. Our analyses reveal the emotional toll caused by COVID-19, and changes of the social narrative and associated emotions over time. Motivated by the time-sensitive nature of crises and the cost of large-scale annotation efforts, we examine how well large pre-trained language models generalize across domains and timeline in the task of perceived emotion prediction in the context of COVID-19. Our analyses suggest that cross-domain information transfers occur, yet there are still significant gaps. We propose semi-supervised learning as a way to bridge this gap, obtaining significantly better performance using unlabeled data from the target domain.

The Role of Context and Uncertainty in Shallow Discourse Parsing
Katherine Atwell | Remi Choi | Junyi Jessy Li | Malihe Alikhani
Proceedings of the 29th International Conference on Computational Linguistics

Discourse parsing has proven to be useful for a number of NLP tasks that require complex reasoning. However, over a decade since the advent of the Penn Discourse Treebank, predicting implicit discourse relations in text remains challenging. There are several possible reasons for this, and we hypothesize that models should be exposed to more context as it plays an important role in accurate human annotation; meanwhile adding uncertainty measures can improve model accuracy and calibration. To thoroughly investigate this phenomenon, we perform a series of experiments to determine 1) the effects of context on human judgments, and 2) the effect of quantifying uncertainty with annotator confidence ratings on model accuracy and calibration (which we measure using the Brier score (Brier et al, 1950)). We find that including annotator accuracy and confidence improves model accuracy, and incorporating confidence in the model’s temperature function can lead to models with significantly better-calibrated confidence measures. We also find some insightful qualitative results regarding human and model behavior on these datasets.

Learning to Describe Solutions for Bug Reports Based on Developer Discussions
Sheena Panthaplackel | Junyi Jessy Li | Milos Gligoric | Ray Mooney
Findings of the Association for Computational Linguistics: ACL 2022

When a software bug is reported, developers engage in a discussion to collaboratively resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend and delaying its implementation. To expedite bug resolution, we propose generating a concise natural language description of the solution by synthesizing relevant content within the discussion, which encompasses both natural language and source code. We build a corpus for this task using a novel technique for obtaining noisy supervision from repository changes linked to bug reports, with which we establish benchmarks. We also design two systems for generating a description during an ongoing discussion by classifying when sufficient context for performing the task emerges in real-time. With automated and human evaluation, we find this task to form an ideal testbed for complex reasoning in long, bimodal dialogue context.

Discourse Comprehension: A Question Answering Framework to Represent Sentence Connections
Wei-Jen Ko | Cutter Dalton | Mark Simmons | Eliza Fisher | Greg Durrett | Junyi Jessy Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

While there has been substantial progress in text comprehension through simple factoid question answering, more holistic comprehension of a discourse still presents a major challenge (Dunietz et al., 2020). Someone critically reflecting on a text as they read it will pose curiosity-driven, often open-ended questions, which reflect deep understanding of the content and require complex reasoning to answer (Ko et al., 2020; Westera et al., 2020). A key challenge in building and evaluating models for this type of discourse comprehension is the lack of annotated data, especially since collecting answers to such questions requires high cognitive load for annotators.This paper presents a novel paradigm that enables scalable data collection targeting the comprehension of news documents, viewing these questions through the lens of discourse. The resulting corpus, DCQA (Discourse Comprehension by Question Answering), captures both discourse and semantic links between sentences in the form of free-form, open-ended questions. On an evaluation set that we annotated on questions from Ko et al. (2020), we show that DCQA provides valuable supervision for answering open-ended questions. We additionally design pre-training methods utilizing existing question-answering resources, and use synthetic data to accommodate unanswerable questions.

Using Developer Discussions to Guide Fixing Bugs in Software
Sheena Panthaplackel | Milos Gligoric | Junyi Jessy Li | Raymond Mooney
Findings of the Association for Computational Linguistics: EMNLP 2022

Automatically fixing software bugs is a challenging task. While recent work showed that natural language context is useful in guiding bug-fixing models, the approach required prompting developers to provide this context, which was simulated through commit messages written after the bug-fixing code changes were made. We instead propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for any additional information from developers. For this, we augment standard bug-fixing datasets with bug report discussions. Using these newly compiled datasets, we demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.

2021

Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Haizhou Li | Gina-Anne Levow | Zhou Yu | Chitralekha Gupta | Berrak Sisman | Siqi Cai | David Vandyke | Nina Dethlefs | Yan Wu | Junyi Jessy Li
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Paragraph-level Simplification of Medical Texts
Ashwin Devaraj | Iain Marshall | Byron Wallace | Junyi Jessy Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.

Proceedings of the 2nd Workshop on Computational Approaches to Discourse
Chloé Braud | Christian Hardmeier | Junyi Jessy Li | Annie Louis | Michael Strube | Amir Zeldes
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

Elaborative Simplification: Content Addition and Explanation Generation in Text Simplification
Neha Srikanth | Junyi Jessy Li
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Where Are We in Discourse Relation Recognition?
Katherine Atwell | Junyi Jessy Li | Malihe Alikhani
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Discourse parsers recognize the intentional and inferential relationships that organize extended texts. They have had a great influence on a variety of NLP tasks as well as theoretical studies in linguistics and cognitive science. However it is often difficult to achieve good results from current discourse models, largely due to the difficulty of the task, particularly recognizing implicit discourse relations. Recent developments in transformer-based models have shown great promise on these analyses, but challenges still remain. We present a position paper which provides a systematic analysis of the state of the art discourse parsers. We aim to examine the performance of current discourse parsing models via gradual domain shift: within the corpus, on in-domain texts, and on out-of-domain texts, and discuss the differences between the transformer-based models and the previous models in predicting different types of implicit relations both inter- and intra-sentential. We conclude by describing several shortcomings of the existing models and a discussion of how future work should approach this problem.

Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)
Royi Lachmy | Ziyu Yao | Greg Durrett | Milos Gligoric | Junyi Jessy Li | Ray Mooney | Graham Neubig | Yu Su | Huan Sun | Reut Tsarfaty
Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)

Did they answer? Subjective acts and intents in conversational discourse
Elisa Ferracane | Greg Durrett | Junyi Jessy Li | Katrin Erk
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Discourse signals are often implicit, leaving it up to the interpreter to draw the required inferences. At the same time, discourse is embedded in a social context, meaning that interpreters apply their own assumptions and beliefs when resolving these inferences, leading to multiple, valid interpretations. However, current discourse data and frameworks ignore the social aspect, expecting only a single ground truth. We present the first discourse dataset with multiple and subjective interpretations of English conversation in the form of perceived conversation acts and intents. We carefully analyze our dataset and create computational models to (1) confirm our hypothesis that taking into account the bias of the interpreters leads to better predictions of the interpretations, (2) and show disagreements are nuanced and require a deeper understanding of the different contextual factors. We share our dataset and code at http://github.com/elisaF/subjective_discourse.

How to apply for financial aid: Exploring perplexity and jargon in texts for non-expert audiences
Laura Manor | Yiran Su | Zachary W. Taylor | Junyi Jessy Li
Proceedings of the Society for Computation in Linguistics 2021

2020

Help! Need Advice on Identifying Advice
Venkata Subrahmanyan Govindarajan | Benjamin Chen | Rebecca Warholic | Katrin Erk | Junyi Jessy Li
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Humans use language to accomplish a wide variety of tasks - asking for and giving advice being one of them. In online advice forums, advice is mixed in with non-advice, like emotional support, and is sometimes stated explicitly, sometimes implicitly. Understanding the language of advice would equip systems with a better grasp of language pragmatics; practically, the ability to identify advice would drastically increase the efficiency of advice-seeking online, as well as advice-giving in natural language generation systems. We present a dataset in English from two Reddit advice forums - r/AskParents and r/needadvice - annotated for whether sentences in posts contain advice or not. Our analysis reveals rich linguistic phenomena in advice discourse. We present preliminary models showing that while pre-trained language models are able to capture advice better than rule-based systems, advice identification is challenging, and we identify directions for future research.

Assessing Discourse Relations in Language Generation from GPT-2
Wei-Jen Ko | Junyi Jessy Li
Proceedings of the 13th International Conference on Natural Language Generation

Recent advances in NLP have been attributed to the emergence of large-scale pre-trained language models. GPT-2, in particular, is suited for generation tasks given its left-to-right language modeling objective, yet the linguistic quality of its generated text has largely remain unexplored. Our work takes a step in understanding GPT-2’s outputs in terms of discourse coherence. We perform a comprehensive study on the validity of explicit discourse relations in GPT-2’s outputs under both organic generation and fine-tuned scenarios. Results show GPT-2 does not always generate text containing valid discourse relations; nevertheless, its text is more aligned with human expectation in the fine-tuned scenario. We propose a decoupled strategy to mitigate these problems and highlight the importance of explicitly modeling discourse information.

Detecting Perceived Emotions in Hurricane Disasters
Shrey Desai | Cornelia Caragea | Junyi Jessy Li
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural disasters (e.g., hurricanes) affect millions of people each year, causing widespread destruction in their wake. People have recently taken to social media websites (e.g., Twitter) to share their sentiments and feelings with the larger community. Consequently, these platforms have become instrumental in understanding and perceiving emotions at scale. In this paper, we introduce HurricaneEmo, an emotion dataset of 15,000 English tweets spanning three hurricanes: Harvey, Irma, and Maria. We present a comprehensive study of fine-grained emotions and propose classification tasks to discriminate between coarse-grained emotion groups. Our best BERT model, even after task-guided pre-training which leverages unlabeled Twitter data, achieves only 68% accuracy (averaged across all groups). HurricaneEmo serves not only as a challenging benchmark for models but also as a valuable resource for analyzing emotions in disaster-centric domains.

Proceedings of the First Workshop on Computational Approaches to Discourse
Chloé Braud | Christian Hardmeier | Junyi Jessy Li | Annie Louis | Michael Strube
Proceedings of the First Workshop on Computational Approaches to Discourse

An Annotated Dataset of Discourse Modes in Hindi Stories
Swapnil Dhanwal | Hritwik Dutta | Hitesh Nankani | Nilay Shrivastava | Yaman Kumar | Junyi Jessy Li | Debanjan Mahata | Rakesh Gosangi | Haimin Zhang | Rajiv Ratn Shah | Amanda Stent
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes argumentative, narrative, descriptive, dialogic and informative. We present a detailed account of the entire data collection and annotation processes. The annotations have a very high inter-annotator agreement (0.87 k-alpha). We analyze the data in terms of label distributions, part of speech tags, and sentence lengths. We characterize the performance of various classification algorithms on this dataset and perform ablation studies to understand the nature of the linguistic models suitable for capturing the nuances of the embedded discourse structures in the presented corpus.

Learning to Update Natural Language Comments Based on Code Changes
Sheena Panthaplackel | Pengyu Nie | Milos Gligoric | Junyi Jessy Li | Raymond Mooney
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We formulate the novel task of automatically updating an existing natural language comment based on changes in the body of code it accompanies. We propose an approach that learns to correlate changes across two distinct language representations, to generate a sequence of edits that are applied to the existing comment to reflect the source code modifications. We train and evaluate our model using a dataset that we collected from commit histories of open-source software projects, with each example consisting of a concurrent update to a method and its corresponding comment. We compare our approach against multiple baselines using both automatic metrics and human evaluation. Results reflect the challenge of this task and that our model outperforms baselines with respect to making edits.

Inquisitive Question Generation for High Level Text Comprehension
Wei-Jen Ko | Te-yuan Chen | Yiyan Huang | Greg Durrett | Junyi Jessy Li
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Inquisitive probing questions come naturally to humans in a variety of settings, but is a challenging task for automatic systems. One natural type of question to ask tries to fill a gap in knowledge during text comprehension, like reading a news article: we might ask about background information, deeper reasons behind things occurring, or more. Despite recent progress with data-driven approaches, generating such questions is beyond the range of models trained on existing datasets. We introduce INQUISITIVE, a dataset of ~19K questions that are elicited while a person is reading through a document. Compared to existing datasets, INQUISITIVE questions target more towards high-level (semantic and discourse) comprehension of text. We show that readers engage in a series of pragmatic strategies to seek information. Finally, we evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions although the task is challenging, and highlight the importance of context to generate INQUISITIVE questions.

2019

Plain English Summarization of Contracts
Laura Manor | Junyi Jessy Li
Proceedings of the Natural Legal Language Processing Workshop 2019

Unilateral legal contracts, such as terms of service, play a substantial role in modern digital life. However, few read these documents before accepting the terms within, as they are too long and the language too complicated. We propose the task of summarizing such legal documents in plain English, which would enable users to have a better understanding of the terms they are accepting. We propose an initial dataset of legal text snippets paired with summaries written in plain English. We verify the quality of these summaries manually, and show that they involve heavy abstraction, compression, and simplification. Initial experiments show that unsupervised extractive summarization methods do not perform well on this task due to the level of abstraction and style differences. We conclude with a call for resource and technique development for simplification and style transfer for legal language.

Linguistically-Informed Specificity and Semantic Plausibility for Dialogue Generation
Wei-Jen Ko | Greg Durrett | Junyi Jessy Li
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Sequence-to-sequence models for open-domain dialogue generation tend to favor generic, uninformative responses. Past work has focused on word frequency-based approaches to improving specificity, such as penalizing responses with only common words. In this work, we examine whether specificity is solely a frequency-related notion and find that more linguistically-driven specificity measures are better suited to improving response informativeness. However, we find that forcing a sequence-to-sequence model to be more specific can expose a host of other problems in the responses, including flawed discourse and implausible semantics. We rerank our model’s outputs using externally-trained classifiers targeting each of these identified factors. Experiments show that our final model using linguistically motivated specificity and plausibility reranking improves the informativeness, reasonableness, and grammatically of responses.

Adaptive Ensembling: Unsupervised Domain Adaptation for Political Document Analysis
Shrey Desai | Barea Sinno | Alex Rosenfeld | Junyi Jessy Li
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Insightful findings in political science often require researchers to analyze documents of a certain subject or type, yet these documents are usually contained in large corpora that do not distinguish between pertinent and non-pertinent documents. In contrast, we can find corpora that label relevant documents but have limitations (e.g., from a single source or era), preventing their use for political science research. To bridge this gap, we present adaptive ensembling, an unsupervised domain adaptation framework, equipped with a novel text classification model and time-aware training to ensure our methods work well with diachronic corpora. Experiments on an expert-annotated dataset show that our framework outperforms strong benchmarks. Further analysis indicates that our methods are more stable, learn better representations, and extract cleaner corpora for fine-grained analysis.

From News to Medical: Cross-domain Discourse Segmentation
Elisa Ferracane | Titan Page | Junyi Jessy Li | Katrin Erk
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

The first step in discourse analysis involves dividing a text into segments. We annotate the first high-quality small-scale medical corpus in English with discourse segments and analyze how well news-trained segmenters perform on this domain. While we expectedly find a drop in performance, the nature of the segmentation errors suggests some problems can be addressed earlier in the pipeline, while others would require expanding the corpus to a trainable size to learn the nuances of the medical domain.

Unsupervised Adversarial Domain Adaptation for Implicit Discourse Relation Classification
Hsin-Ping Huang | Junyi Jessy Li
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Implicit discourse relations are not only more challenging to classify, but also to annotate, than their explicit counterparts. We tackle situations where training data for implicit relations are lacking, and exploit domain adaptation from explicit relations (Ji et al., 2015). We present an unsupervised adversarial domain adaptive network equipped with a reconstruction component. Our system outperforms prior works and other adversarial benchmarks for unsupervised domain adaptation. Additionally, we extend our system to take advantage of labeled data if some are available.

Evaluating Discourse in Structured Text Representations
Elisa Ferracane | Greg Durrett | Junyi Jessy Li | Katrin Erk
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Discourse structure is integral to understanding a text and is helpful in many NLP tasks. Learning latent representations of discourse is an attractive alternative to acquiring expensive labeled discourse data. Liu and Lapata (2018) propose a structured attention mechanism for text classification that derives a tree over a text, akin to an RST discourse tree. We examine this model in detail, and evaluate on additional discourse-relevant tasks and datasets, in order to assess whether the structured attention improves performance on the end task and whether it captures a text’s discourse structure. We find the learned latent trees have little to no structure and instead focus on lexical cues; even after obtaining more structured trees with proposed model modifications, the trees are still far from capturing discourse structure when compared to discourse dependency trees from an existing discourse parser. Finally, ablation studies show the structured attention provides little benefit, sometimes even hurting performance.

2018

Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions
Eric Holgate | Isabel Cachola | Daniel Preoţiuc-Pietro | Junyi Jessy Li
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Vulgar words are employed in language use for several different functions, ranging from expressing aggression to signaling group identity or the informality of the communication. This versatility of usage of a restricted set of words is challenging for downstream applications and has yet to be studied quantitatively or using natural language processing techniques. We introduce a novel data set of 7,800 tweets from users with known demographic traits where all instances of vulgar words are annotated with one of the six categories of vulgar word use. Using this data set, we present the first analysis of the pragmatic aspects of vulgarity and how they relate to social factors. We build a model able to predict the category of a vulgar word based on the immediate context it appears in with 67.4 macro F1 across six classes. Finally, we demonstrate the utility of modeling the type of vulgar word use in context by using this information to achieve state-of-the-art performance in hate speech detection on a benchmark data set.

A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature
Benjamin Nye | Junyi Jessy Li | Roma Patel | Yinfei Yang | Iain Marshall | Ani Nenkova | Byron Wallace
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a corpus of 5,000 richly annotated abstracts of medical articles describing clinical randomized controlled trials. Annotations include demarcations of text spans that describe the Patient population enrolled, the Interventions studied and to what they were Compared, and the Outcomes measured (the ‘PICO’ elements). These spans are further annotated at a more granular level, e.g., individual interventions within them are marked and mapped onto a structured medical vocabulary. We acquired annotations from a diverse set of workers with varying levels of expertise and cost. We describe our data collection process and the corpus itself in detail. We then outline a set of challenging NLP tasks that would aid searching of the medical literature and the practice of evidence-based medicine.

Expressively vulgar: The socio-dynamics of vulgarity and its effects on sentiment analysis in social media
Isabel Cachola | Eric Holgate | Daniel Preoţiuc-Pietro | Junyi Jessy Li
Proceedings of the 27th International Conference on Computational Linguistics

Vulgarity is a common linguistic expression and is used to perform several linguistic functions. Understanding their usage can aid both linguistic and psychological phenomena as well as benefit downstream natural language processing applications such as sentiment analysis. This study performs a large-scale, data-driven empirical analysis of vulgar words using social media data. We analyze the socio-cultural and pragmatic aspects of vulgarity using tweets from users with known demographics. Further, we collect sentiment ratings for vulgar tweets to study the relationship between the use of vulgar words and perceived sentiment and show that explicitly modeling vulgar words can boost sentiment analysis performance.

2017

Aggregating and Predicting Sequence Labels from Crowd Annotations
An Thanh Nguyen | Byron Wallace | Junyi Jessy Li | Ani Nenkova | Matthew Lease
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite sequences being core to NLP, scant work has considered how to handle noisy sequence labels from multiple annotators for the same text. Given such annotations, we consider two complementary tasks: (1) aggregating sequential crowd labels to infer a best single set of consensus annotations; and (2) using crowd annotations as training data for a model that can predict sequences in unannotated text. For aggregation, we propose a novel Hidden Markov Model variant. To predict sequences in unannotated text, we propose a neural approach using Long Short Term Memory. We evaluate a suite of methods across two different applications and text genres: Named-Entity Recognition in news articles and Information Extraction from biomedical abstracts. Results show improvement over strong baselines. Our source code and data are available online.

2016

The Role of Discourse Units in Near-Extractive Summarization
Junyi Jessy Li | Kapil Thadani | Amanda Stent
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The Instantiation Discourse Relation: A Corpus Analysis of Its Properties and Improved Detection
Junyi Jessy Li | Ani Nenkova
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Improving the Annotation of Sentence Specificity
Junyi Jessy Li | Bridget O’Daniel | Yi Wu | Wenli Zhao | Ani Nenkova
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce improved guidelines for annotation of sentence specificity, addressing the issues encountered in prior work. Our annotation provides judgements of sentences in context. Rather than binary judgements, we introduce a specificity scale which accommodates nuanced judgements. Our augmented annotation procedure also allows us to define where in the discourse context the lack of specificity can be resolved. In addition, the cause of the underspecification is annotated in the form of free text questions. We present results from a pilot annotation with this new scheme and demonstrate good inter-annotator agreement. We found that the lack of specificity distributes evenly among immediate prior context, long distance prior context and no prior context. We find that missing details that are not resolved in the the prior context are more likely to trigger questions about the reason behind events, “why” and “how”. Our data is accessible at http://www.cis.upenn.edu/~nlp/corpora/lrec16spec.html

2015

Detecting Content-Heavy Sentences: A Cross-Language Case Study
Junyi Jessy Li | Ani Nenkova
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Addressing Class Imbalance for Improved Recognition of Implicit Discourse Relations
Junyi Jessy Li | Ani Nenkova
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

Assessing the Discourse Factors that Influence the Quality of Machine Translation
Junyi Jessy Li | Marine Carpuat | Ani Nenkova
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Cross-lingual Discourse Relation Analysis: A corpus study and a semi-supervised classification system
Junyi Jessy Li | Marine Carpuat | Ani Nenkova
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Reducing Sparsity Improves the Recognition of Implicit Discourse Relations
Junyi Jessy Li | Ani Nenkova
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

Co-authors

Christian Hardmeier 6

Michael Strube 6

Malihe Alikhani 5

Milos Gligoric 5

Kyle Mahowald 5

William Sheffield 5

Katherine Atwell 4

Sebastian Joseph 4

Matthew Lease 4

Sharid Loáiciga 4

Tiberiu Sosea 4

Elisa Ferracane 3

Venkata Subrahmanyan Govindarajan 3

Iain Marshall 3

Kanishka Misra 3

Sheena Panthaplackel 3

David I. Beaver 2

Isabel Cachola 2

Marine Carpuat 2

Cutter Dalton 2

Ashwin Devaraj 2

Venkata S Govindarajan 2

Venelin Kovatchev 2

Ritika Mangla 2

Raymond Mooney 2

Daniel Preoţiuc-Pietro 2

Valentina Pyatkin 2

Jörg Schlötterer 2

Christin Seifert 2

Chantal Shaib 2

Zachary W. Taylor 2

David Vandyke 2

Arash Ashrafzadeh 1

Daniel Brubaker 1

Trina Chatterjee 1

Benjamin Chen 1

Changyou Chen 1

Gabriella Chronis 1

Maximus H. Chu 1

Nina Dethlefs 1

Swapnil Dhanwal 1

Alexandros G. Dimakis 1

Hritwik Dutta 1

Ondřej Dušek 1

Daniel Hernández Garcia 1

Rakesh Gosangi 1

Chitralekha Gupta 1

Chitrank Gupta 1

Dilek Hakkani-Tur 1

Hsin-Ping Huang 1

Kathryn Kazanas 1

Gina-Anne Levow 1

Yang Janet Liu 1

Debanjan Mahata 1

Hitesh Nankani 1

Graham Neubig 1

An Thanh Nguyen 1

Desmond C. Ong 1

Bernardo Oviedo 1

Bridget O’Daniel 1

Swapnil Parekh 1

Chau Minh Pham 1

Vishnesh Ramanathan 1

Alex Rosenfeld 1

Nilay Shrivastava 1

Berrak Sisman 1

Neha Srikanth 1

Dananjay Srinivas 1

Alexander Tekle 1

Kapil Thadani 1

Reut Tsarfaty 1

Rebecca Warholic 1

Matianyu Zang 1

Anshun Asher Zheng 1

Venues