Kathleen McKeown

Also published as: Kathy McKeown, Kathleen R. McKeown, Kathleen Mckeown

2025

pdf bib abs
Data Caricatures: On the Representation of African American Language in Pretraining Corpora
Nicholas Deas | Blake Vente | Amith Ananthram | Jessica A Grieser | Desmond U. Patton | Shana Kleiner | James R. Shepard Iii | Kathleen McKeown
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With a combination of quantitative experiments, human judgments, and qualitative analyses, we evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We specifically focus on the sources, variation, and naturalness of included AAL texts representing the AAL speaking community. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as few as 0.007% and at most 0.18% of documents. We also find that more than 25% of AAL texts in C4 may be perceived as inappropriate for LLMs to generate and to reinforce harmful stereotypes. Finally, we find that most automated filters are more likely to conserve White Mainstream English (WME) texts over AAL in pretraining corpora.

pdf bib abs
Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution
Milad Alshomary | Narutatsu Ri | Marianna Apidianaki | Ajay Patel | Smaranda Muresan | Kathleen McKeown
Proceedings of the 31st International Conference on Computational Linguistics

Recent state-of-the-art authorship attribution methods learn authorship representations of text in a latent, uninterpretable space, which hinders their usability in real-world applications. We propose a novel approach for interpreting learned embeddings by identifying representative points in the latent space and leveraging large language models to generate informative natural language descriptions of the writing style associated with each point. We evaluate the alignment between our interpretable and latent spaces and demonstrate superior prediction agreement over baseline methods. Additionally, we conduct a human evaluation to assess the quality of these style descriptions and validate their utility in explaining the latent space. Finally, we show that human performance on the challenging authorship attribution task improves by +20% on average when aided with explanations from our method.

pdf bib abs
Summarization of Opinionated Political Documents with Varied Perspectives
Nicholas Deas | Kathleen McKeown
Proceedings of the 31st International Conference on Computational Linguistics

Global partisan hostility and polarization has increased, and this polarization is heightened around presidential elections. Models capable of generating accurate summaries of diverse perspectives can help reduce such polarization by exposing users to alternative perspectives. In this work, we introduce a novel dataset and task for independently summarizing each political perspective in a set of passages from opinionated news articles. For this task, we propose a framework for evaluating different dimensions of perspective summary performance. We benchmark 11 summarization models and LLMs of varying sizes and architectures through both automatic and human evaluation. While recent models like GPT-4o perform well on this task, we find that all models struggle to generate summaries that are faithful to the intended perspective. Our analysis of summaries focuses on how extraction behavior is impacted by features of the input documents.

pdf bib abs
ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media
Kung-Hsiang Huang | Hou Pong Chan | Kathleen McKeown | Heng Ji
Proceedings of the 31st International Conference on Computational Linguistics

Considerable advancements have been made to tackle the misrepresentation of information derived from reference articles in the domains of fact-checking and faithful summarization. However, an unaddressed aspect remains - the identification of social media posts that manipulate information within associated news articles. This task presents a significant challenge, primarily due to the prevalence of personal opinions in such posts. We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information. To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles. Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance. Additionally, we have developed a simple yet effective basic model that outperforms LLMs significantly on the ManiTweet dataset. Finally, we have conducted an exploratory analysis of human-written tweets, unveiling intriguing connections between manipulation and the domain and factuality of news articles, as well as revealing that manipulated sentences are more likely to encapsulate the main story or consequences of a news outlet.

pdf bib abs
Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding
Melanie Subbiah | Akankshya Mishra | Grace Kim | Liyan Tang | Greg Durrett | Kathleen McKeown
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.

pdf bib abs
Layered Insights: Generalizable Analysis of Human Authorial Style by Leveraging All Transformer Layers
Milad Alshomary | Nikhil Reddy Varimalla | Vishal Anand | Smaranda Muresan | Kathleen McKeown
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We propose a new approach for the authorship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based models. We evaluate our approach on two popular authorship attribution models and three evaluation datasets, in in-domain and out-of-domain scenarios. We found that utilizing various transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in a much stronger performance. Our analysis gives further insights into how our model’s different layers get specialized in representing certain linguistic aspects that we believe benefit the model when tested out of the domain.

pdf bib abs
Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions
Nicholas Deas | Kathleen McKeown
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce and study artificial impressions–patterns in LLMs’ internal representations of prompts that resemble human impressions and stereotypes based on language. We fit linear probes on generated prompts to predict impressions according to the two-dimensional Stereotype Content Model (SCM). Using these probes, we study the relationship between impressions and downstream model behavior as well as prompt features that may inform such impressions. We find that LLMs inconsistently report impressions when prompted, but also that impressions are more consistently linearly decodable from their hidden representations. Additionally, we show that artificial impressions of prompts are predictive of the quality and use of hedging in model responses. We also investigate how particular content, stylistic, and dialectal features in prompts impact LLM impressions.

pdf bib abs
Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment
Yunfan Zhang | Kathleen McKeown | Smaranda Muresan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism — the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.

Hallucination is a persistent challenge in large language models (LLMs), where even with rigorous quality control, models often generate distorted facts. This paradox, in which error generation continues despite high-quality training data, calls for a deeper understanding of the underlying LLM mechanisms. To address it, we propose a novel concept: knowledge overshadowing, where model’s dominant knowledge can obscure less prominent knowledge during text generation, causing the model to fabricate inaccurate details. Building on this idea, we introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. Central to our approach is the log-linear law, which predicts that the rate of factual hallucination increases linearly with the logarithmic scale of (1) Knowledge Popularity, (2) Knowledge Length, and (3) Model Size. The law provides a means to preemptively quantify hallucinations, offering foresight into their occurrence even before model training or inference. Built on the overshadowing effect, we propose a new decoding strategy CoDA, to mitigate hallucinations, which notably enhances model factuality on Overshadow (27.9%), MemoTrap (13.1%) and NQ-Swap (18.3%). Our findings not only deepen understandings of the underlying mechanisms behind hallucinations but also provide actionable insights for developing more predictable and controllable language models.

Hallucination is a persistent challenge in large language models (LLMs), where even with rigorous quality control, models often generate distorted facts. This paradox, in which error generation continues despite high-quality training data, calls for a deeper understanding of the underlying LLM mechanisms. To address it, we propose a novel concept: knowledge overshadowing, where model’s dominant knowledge can obscure less prominent knowledge during text generation, causing the model to fabricate inaccurate details. Building on this idea, we introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. Central to our approach is the log-linear law, which predicts that the rate of factual hallucination increases linearly with the logarithmic scale of (1) Knowledge Popularity, (2) Knowledge Length, and (3) Model Size. The law provides a means to preemptively quantify hallucinations, offering foresight into their occurrence even before model training or inference. Built on overshadowing effect, we propose a new decoding strategy CoDa, to mitigate hallucinations, which notably enhance model factuality on Overshadow (27.9%), MemoTrap (13.1%) and NQ-Swap (18.3%). Our findings not only deepen understandings of the underlying mechanisms behind hallucinations but also provide actionable insights for developing more predictable and controllable language models.

pdf bib abs
Reranking-based Generation for Unbiased Perspective Summarization
Narutatsu Ri | Nicholas Deas | Kathleen McKeown
Findings of the Association for Computational Linguistics: ACL 2025

Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model–based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.

Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect real-world complexity, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce **ModelingBench**, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. To solve these challenges, we present **ModelingAgent**, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges. All the codes are released for future research.

pdf bib abs
Mining Contextualized Visual Associations from Images for Creativity Understanding
Ananya Sahu | Amith Ananthram | Kathleen McKeown
Proceedings of the 18th International Natural Language Generation Conference

Understanding another person’s creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7m creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.

pdf bib abs
Counterfactual Simulatability of LLM Explanations for Generation Tasks
Marvin Limpijankit | Yanda Chen | Melanie Subbiah | Nicholas Deas | Kathleen McKeown
Proceedings of the 18th International Natural Language Generation Conference

LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. Counterfactual simulatability measures how well an explanation allows users to infer the model’s output on related counterfactuals and has been previously studied for yes/no question answering. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict their outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that evaluating counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.

pdf bib abs
Forecasting Conversation Derailments Through Generation
Yunfan Zhang | Kathleen McKeown | Smaranda Muresan
Proceedings of the 18th International Natural Language Generation Conference

Forecasting conversation derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, despite language models’ success at identifying offensive speech present in conversations, they struggle to forecast future conversation derailments. In contrast to prior work that predicts conversation outcomes solely based on the past conversation history, our approach samples multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM. It predicts the conversation outcome based on the consensus of these trajectories. We also experimented with leveraging socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of future conversation trajectories surpasses state-of-the-art results on English conversation derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.

pdf bib abs
StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples
Ajay Patel | Jiacheng Zhu | Justin Qiu | Zachary Horvitz | Marianna Apidianaki | Kathleen McKeown | Chris Callison-Burch
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. However, the contrastive triplets often used for training these representations may vary in both style and content, leading to potential content leakage in the representations. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings. We use a large language model to create a synthetic dataset of near-exact paraphrases with controlled style variations, and produce positive and negative examples across 40 distinct style features for precise contrastive learning. We assess the quality of our synthetic data and embeddings through human and automatic evaluations. StyleDistance enhances the content-independence of style embeddings, which generalize to real-world benchmarks and outperform leading style representations in downstream applications.

pdf bib
Evaluating Defeasible Reasoning in LLMs with DEFREASING
Emily Allaway | Kathleen McKeown
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

pdf bib abs
AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization
Mukur Gupta | Nikhil Reddy Varimalla | Nicholas Deas | Melanie Subbiah | Kathleen McKeown
Proceedings of The 5th New Frontiers in Summarization Workshop

Large Language Models (LLMs) have achieved impressive performance in text summarization and are increasingly deployed in real-world applications. However, these systems often inherit associative and framing biases from pre-training data, leading to inappropriate or unfair outputs in downstream tasks. In this work, we present AdvSumm (Adversarial Summarization), a domain-agnostic training framework designed to mitigate bias in text summarization through improved generalization. Inspired by adversarial robustness, AdvSumm introduces a novel Perturber component that applies gradient-guided perturbations at the embedding level of Sequence-to-Sequence models, enhancing the model’s robustness to input variations. We empirically demonstrate that AdvSumm effectively reduces different types of bias in summarization—specifically, name-nationality bias and political framing bias—without compromising summarization quality. Compared to standard transformers and data augmentation techniques like back-translation, AdvSumm achieves stronger bias mitigation performance across benchmark datasets.

2024

pdf bib abs
Parallel Structures in Pre-training Data Yield In-Context Learning
Yanda Chen | Chen Zhao | Zhou Yu | Kathleen McKeown | He He
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Pre-trained language models (LMs) are capable of in-context learning (ICL): they can adapt to a task with only a few examples given in the prompt without any parameter update. However, it is unclear where this capability comes from as there is a stark distribution shift between pre-training text and ICL prompts. In this work, we study what patterns of the pre-training data contribute to ICL. We find that LMs’ ICL ability depends on parallel structures in the pre-training data—pairs of phrases following similar templates in the same context window. Specifically, we detect parallel structures by checking whether training on one phrase improves prediction of the other, and conduct ablation experiments to study their effect on ICL. We show that removing parallel structures in the pre-training data reduces LMs’ ICL accuracy by 51% (vs 2% from random ablation). This drop persists even when excluding common patterns such as n-gram repetitions and long-range dependency, showing the diversity and generality of parallel structures. A closer look at the detected parallel structures indicates that they cover diverse linguistic tasks and span long distances in the data.

pdf bib abs
Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models
Zachary Horvitz | Jingru Chen | Rahul Aditya | Harshvardhan Srivastava | Robert West | Zhou Yu | Kathleen McKeown
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. We investigate whether large language models (LLMs) can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to “unfun” jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset where we find that GPT-4’s synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.

pdf bib abs
Exceptions, Instantiations, and Overgeneralization: Insights into How Language Models Process Generics
Emily Allaway | Chandra Bhagavatula | Jena D. Hwang | Kathleen McKeown | Sarah-Jane Leslie
Computational Linguistics, Volume 50, Issue 4 - December 2024

Large language models (LLMs) have garnered a great deal of attention for their exceptional generative performance on commonsense and reasoning tasks. In this work, we investigate LLMs’ capabilities for generalization using a particularly challenging type of statement: generics. Generics express generalizations (e.g., birds can fly) but do so without explicit quantification. They are notable because they generalize over their instantiations (e.g., sparrows can fly) yet hold true even in the presence of exceptions (e.g., penguins do not). For humans, these generic generalizations play a fundamental role in cognition, concept acquisition, and intuitive reasoning. We investigate how LLMs respond to and reason about generics. To this end, we first propose a framework grounded in pragmatics to automatically generate both exceptions and instantiations – collectively exemplars. We make use of focus—a pragmatic phenomenon that highlights meaning-bearing elements in a sentence—to capture the full range of interpretations of generics across different contexts of use. This allows us to derive precise logical definitions for exemplars and operationalize them to automatically generate exemplars from LLMs. Using our system, we generate a dataset of ∼370kexemplars across ∼17k generics and conduct a human validation of a sample of the generated data. We use our final generated dataset to investigate how LLMs reason about generics. Humans have a documented tendency to conflate universally quantified statements (e.g., all birds can fly) with generics. Therefore, we probe whether LLMs exhibit similar overgeneralization behavior in terms of quantification and in property inheritance. We find that LLMs do show evidence of overgeneralization, although they sometimes struggle to reason about exceptions. Furthermore, we find that LLMs may exhibit similar non-logical behavior to humans when considering property inheritance from generics.

pdf bib abs
STORYSUMM: Evaluating Faithfulness in Story Summarization
Melanie Subbiah | Faisal Ladhak | Akankshya Mishra | Griffin Thomas Adams | Lydia Chilton | Kathleen McKeown
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, StorySumm, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.

pdf bib abs
MASIVE: Open-Ended Affective State Identification in English and Spanish
Nicholas Deas | Elsbeth Turcan | Ivan Ernesto Perez Mejia | Kathleen McKeown
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In the field of emotion analysis, much NLP research focuses on identifying a limited number of discrete emotion categories, often applied across languages. These basic sets, however, are rarely designed with textual data in mind, and culture, language, and dialect can influence how particular emotions are interpreted. In this work, we broaden our scope to a practically unbounded set of affective states, which includes any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each. We then define the new problem of affective state identification for language generation models framed as a masked span prediction task. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states. Additionally, we show that pretraining on MASIVE improves model performance on existing emotion benchmarks. Finally, through machine translation experiments, we find that native speaker-written data is vital to good performance on this task.

The goal of text style transfer is to transform the style of texts while preserving their original meaning, often with only a few examples of the target style. Existing style transfer methods generally rely on the few-shot capabilities of large language models or on complex controllable text generation approaches that are inefficient and underperform on fluency metrics. We introduce TinyStyler, a lightweight but effective approach, which leverages a small language model (800M params) and pre-trained authorship embeddings to perform efficient, few-shot text style transfer. We evaluate on the challenging task of authorship style transfer and find TinyStyler outperforms strong approaches such as GPT-4. We also evaluate TinyStyler’s ability to perform text attribute style transfer (formal ↔ informal) with automatic and human evaluations and find that the approach outperforms recent controllable text generation methods.

pdf bib abs
Social Orientation: A New Feature for Dialogue Analysis
Todd Morrill | Zhaoyuan Deng | Yanda Chen | Amith Ananthram | Colin Wayne Leach | Kathleen McKeown
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

There are many settings where it is useful to predict and explain the success or failure of a dialogue. Circumplex theory from psychology models the social orientations (e.g., Warm-Agreeable, Arrogant-Calculating) of conversation participants and can be used to predict and explain the outcome of social interactions. Our work is novel in its systematic application of social orientation tags to modeling conversation outcomes. In this paper, we introduce a new data set of dialogue utterances machine-labeled with social orientation tags. We show that social orientation tags improve task performance, especially in low-resource settings, on both English and Chinese language benchmarks. We also demonstrate how social orientation tags help explain the outcomes of social interactions when used in neural models. Based on these results showing the utility of social orientation tags for dialogue outcome prediction tasks, we release our data sets, code, and models that are fine-tuned to predict social orientation tags on dialogue utterances.

People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work in summarization metrics and Large Language Models (LLMs) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting perspectives of any groups of people, and we propose four reference-free automatic metrics by measuring the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.

Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence- level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model’s size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.

Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, not model size, is the key to the LLM’s zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.

pdf bib abs
Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers
Melanie Subbiah | Sean Zhang | Lydia B. Chilton | Kathleen McKeown
Transactions of the Association for Computational Linguistics, Volume 12

We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.

2023

pdf bib abs
Generating EDU Extracts for Plan-Guided Summary Re-Ranking
Griffin Adams | Alex Fabbri | Faisal Ladhak | Noémie Elhadad | Kathleen McKeown
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Two-step approaches, in which summary candidates are generated-then-reranked to return a single summary, can improve ROUGE scores over the standard single-step approach. Yet, standard decoding methods (i.e., beam search, nucleus sampling, and diverse beam search) produce candidates with redundant, and often low quality, content. In this paper, we design a novel method to generate candidates for re-ranking that addresses these issues. We ground each candidate abstract on its own unique content plan and generate distinct plan-guided abstracts using a model’s top beam. More concretely, a standard language model (a BART LM) auto-regressively generates elemental discourse unit (EDU) content plans with an extractive copy mechanism. The top K beams from the content plan generator are then used to guide a separate LM, which produces a single abstractive candidate for each distinct plan. We apply an existing re-ranker (BRIO) to abstractive candidates generated from our method, as well as baseline decoding methods. We show large relevance improvements over previously published methods on widely used single document news article corpora, with ROUGE-2 F1 gains of 0.88, 2.01, and 0.38 on CNN / Dailymail, NYT, and Xsum, respectively. A human evaluation on CNN / DM validates these results. Similarly, on 1k samples from CNN / DM, we show that prompting GPT-3 to follow EDU plans outperforms sampling-based methods by by 1.05 ROUGE-2 F1 points. Code to generate and realize plans is available at https://github.com/griff4692/edu-sum.

pdf bib abs
Unsupervised Selective Rationalization with Noise Injection
Adam Storek | Melanie Subbiah | Kathleen McKeown
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A major issue with using deep learning models in sensitive applications is that they provide no explanation for their output. To address this problem, unsupervised selective rationalization produces rationales alongside predictions by chaining two jointly-trained components, a rationale generator and a predictor. Although this architecture guarantees that the prediction relies solely on the rationale, it does not ensure that the rationale contains a plausible explanation for the prediction. We introduce a novel training technique that effectively limits generation of implausible rationales by injecting noise between the generator and the predictor. Furthermore, we propose a new benchmark for evaluating unsupervised selective rationalization models using movie reviews from existing datasets. We achieve sizeable improvements in rationale plausibility and task accuracy over the state-of-the-art across a variety of tasks, including our new benchmark, while maintaining or improving model faithfulness.

pdf bib abs
Faking Fake News for Real Fake News Detection: Propaganda-Loaded Training Data Generation
Kung-Hsiang Huang | Kathleen McKeown | Preslav Nakov | Yejin Choi | Heng Ji
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite recent advances in detecting fake news generated by neural models, their results are not readily applicable to effective detection of human-written disinformation. What limits the successful transfer between them is the sizable gap between machine-generated fake news and human-authored ones, including the notable differences in terms of style and underlying intent. With this in mind, we propose a novel framework for generating training examples that are informed by the known styles and strategies of human-authored propaganda. Specifically, we perform self-critical sequence training guided by natural language inference to ensure the validity of the generated articles, while also incorporating propaganda techniques, such as appeal to authority and loaded language. In particular, we create a new training dataset, PropaNews, with 2,256 examples, which we release for future use. Our experimental results show that fake news detectors trained on PropaNews are better at detecting human-written disinformation by 3.62–7.69% F1 score on two public datasets.

In this paper, we present a novel approach for data-to-text generation that addresses the limitations of current methods that primarily focus on specific types of structured data. Our proposed method aims to improve performance in multi-task training, zero-shot and few-shot scenarios by providing a unified representation that can handle various forms of structured data such as tables, knowledge graph triples, and meaning representations. We demonstrate that our proposed approach can effectively adapt to new structured forms, and can improve performance in comparison to current methods. For example, our method resulted in a 66% improvement in zero-shot BLEU scores when transferring models trained on table inputs to a knowledge graph dataset. Our proposed method is an important step towards a more general data-to-text generation framework.

pdf bib abs
Penguins Don’t Fly: Reasoning about Generics through Instantiations and Exceptions
Emily Allaway | Jena D. Hwang | Chandra Bhagavatula | Kathleen McKeown | Doug Downey | Yejin Choi
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Generics express generalizations about the world (e.g., birds can fly) that are not universally true (e.g., newborn birds and penguins cannot fly). Commonsense knowledge bases, used extensively in NLP, encode some generic knowledge but rarely enumerate such exceptions and knowing when a generic statement holds or does not hold true is crucial for developing a comprehensive understanding of generics. We present a novel framework informed by linguistic theory to generate exemplars—specific cases when a generic holds true or false. We generate ~19k exemplars for ~650 generics and show that our framework outperforms a strong GPT-3 baseline by 12.8 precision points. Our analysis highlights the importance of linguistic theory-based controllability for generating exemplars, the insufficiency of knowledge bases as a source of exemplars, and the challenges exemplars pose for the task of natural language inference.

pdf bib abs
Faithfulness-Aware Decoding Strategies for Abstractive Summarization
David Wan | Mengwen Liu | Kathleen McKeown | Markus Dreyer | Mohit Bansal
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Despite significant progress in understanding and improving faithfulness in abstractive summarization, the question of how decoding strategies affect faithfulness is less studied. We present a systematic study of the effect of generation techniques such as beam search and nucleus sampling on faithfulness in abstractive summarization. We find a consistent trend where beam search with large beam sizes produces the most faithful summaries while nucleus sampling generates the least faithful ones. We propose two faithfulness-aware generation methods to further improve faithfulness over current generation techniques: (1) ranking candidates generated by beam search using automatic faithfulness metrics and (2) incorporating lookahead heuristics that produce a faithfulness score on the future summary. We show that both generation methods significantly improve faithfulness across two datasets as evaluated by four automatic faithfulness metrics and human evaluation. To reduce computational cost, we demonstrate a simple distillation approach that allows the model to generate faithful summaries with just greedy decoding.

pdf bib abs
When Do Pre-Training Biases Propagate to Downstream Tasks? A Case Study in Text Summarization
Faisal Ladhak | Esin Durmus | Mirac Suzgun | Tianyi Zhang | Dan Jurafsky | Kathleen McKeown | Tatsunori Hashimoto
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Large language models (LLMs) are subject to sociocultural and other biases previously identified using intrinsic evaluations. However, when and how these intrinsic biases in pre-trained LM representations propagate to downstream, fine-tuned NLP tasks like summarization is not well understood. In this work, we investigate one type of bias—name-nationality bias—and trace it from the pre-training stage to a downstream summarization task across multiple summarization modeling choices. We show that these biases manifest themselves as hallucinations in summarization, leading to factually incorrect summaries. We also find that this propagation of biases is algorithm-dependent: more abstractive models allow biases to propagate more directly to downstream tasks as hallucinated facts. Building on these observations, we further analyze how changes to the adaptation method and fine-tuning data set affect name nationality biases and show that while they can reduce the overall rate of hallucinations, they do not change the types of biases that do appear.

pdf bib abs
Evaluation of African American Language Bias in Natural Language Generation
Nicholas Deas | Jessica Grieser | Shana Kleiner | Desmond Patton | Elsbeth Turcan | Kathleen McKeown
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

While biases disadvantaging African American Language (AAL) have been uncovered in models for tasks such as speech recognition and toxicity detection, there has been little investigation of these biases for language generation models like ChatGPT. We evaluate how well LLMs understand AAL in comparison to White Mainstream English (WME), the encouraged “standard” form of English taught in American classrooms. We measure large language model performance on two tasks: a counterpart generation task, where a model generates AAL given WME and vice versa, and a masked span prediction (MSP) task, where models predict a phrase hidden from their input. Using a novel dataset of AAL texts from a variety of regions and contexts, we present evidence of dialectal bias for six pre-trained LLMs through performance gaps on these tasks.

Missing information is a common issue of dialogue summarization where some information in the reference summaries is not covered in the generated summaries. To address this issue, we propose to utilize natural language inference (NLI) models to improve coverage while avoiding introducing factual inconsistencies. Specifically, we use NLI to compute fine-grained training signals to encourage the model to generate content in the reference summaries that have not been covered, as well as to distinguish between factually consistent and inconsistent generated sentences. Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach in balancing coverage and faithfulness, validated with automatic metrics and human evaluations. Additionally, we compute the correlation between commonly used automatic metrics with human judgments in terms of three different dimensions regarding coverage and factual consistency to provide insight into the most suitable metric for evaluating dialogue summaries.

pdf bib abs
Improving Long Dialogue Summarization with Semantic Graph Representation
Yilun Hua | Zhaoyuan Deng | Kathleen McKeown
Findings of the Association for Computational Linguistics: ACL 2023

Although Large Language Models (LLMs) are successful in abstractive summarization of short dialogues, summarization of long dialogues remains challenging. To address this challenge, we propose a novel algorithm that processes complete dialogues comprising thousands of tokens into topic-segment-level Abstract Meaning Representation (AMR) graphs, which explicitly capture the dialogue structure, highlight salient semantics, and preserve high-level information. We also develop a new text-graph attention to leverage both graph semantics and a pretrained LLM that exploits the text. Finally, we propose an AMR node selection loss used jointly with conventional cross-entropy loss, to create additional training signals that facilitate graph feature encoding and content selection. Experiments show that our system outperforms the state-of-the-art models on multiple long dialogue summarization datasets, especially in low-resource settings, and generalizes well to out-of-domain data.

We present a new fact-checking benchmark, Check-COVID, that requires systems to verify claims about COVID-19 from news using evidence from scientific articles. This approach to fact-checking is particularly challenging as it requires checking internet text written in everyday language against evidence from journal articles written in formal academic language. Check-COVID contains 1, 504 expert-annotated news claims about the coronavirus paired with sentence-level evidence from scientific journal articles and veracity labels. It includes both extracted (journalist-written) and composed (annotator-written) claims. Experiments using both a fact-checking specific system and GPT-3.5, which respectively achieve F1 scores of 76.99 and 69.90 on this task, reveal the difficulty of automatically fact-checking both claim types and the importance of in-domain data for good performance. Our data and models are released publicly at https://github.com/posuer/Check-COVID.

pdf bib abs
On the Relation between Sensitivity and Accuracy in In-Context Learning
Yanda Chen | Chen Zhao | Zhou Yu | Kathleen McKeown | He He
Findings of the Association for Computational Linguistics: EMNLP 2023

In-context learning (ICL) suffers from oversensitivity to the prompt, making it unreliable in real-world scenarios. We study the sensitivity of ICL with respect to multiple perturbation types. First, we find that label bias obscures the true sensitivity, and therefore prior work may have significantly underestimated ICL sensitivity. Second, we observe a strong negative correlation between ICL sensitivity and accuracy: predictions sensitive to perturbations are less likely to be correct. Motivated by these findings, we propose SenSel, a few-shot selective prediction method that abstains from sensitive predictions. Experiments on ten classification datasets show that SenSel consistently outperforms two commonly used confidence-based and entropy-based baselines on abstention decisions.

pdf bib abs
Learning Interpretable Style Embeddings via Prompting LLMs
Ajay Patel | Delip Rao | Ansh Kothary | Kathleen McKeown | Chris Callison-Burch
Findings of the Association for Computational Linguistics: EMNLP 2023

Style representation learning builds content-independent representations of author style in text. To date, no large dataset of texts with stylometric annotations on a wide range of style dimensions has been compiled, perhaps because the linguistic expertise to perform such annotation would be prohibitively expensive. Therefore, current style representation approaches make use of unsupervised neural methods to disentangle style from content to create style vectors. These approaches, however, result in uninterpretable representations, complicating their usage in downstream applications like authorship attribution where auditing and explainability is critical. In this work, we use prompting to perform stylometry on a large number of texts to generate a synthetic stylometry dataset. We use this synthetic data to then train human-interpretable style representations we call LISA embeddings. We release our synthetic dataset (StyleGenome) and our interpretable style embedding model (LISA) as resources.

pdf bib
PhraseSumm: Abstractive Short Phrase Summarization
Kasturi Bhattacharjee | Kathleen McKeown | Rashmi Gangadharaiah
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

pdf bib abs
Towards Detecting Harmful Agendas in News Articles
Melanie Subbiah | Amrita Bhattacharjee | Yilun Hua | Tharindu Kumarage | Huan Liu | Kathleen McKeown
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Manipulated news online is a growing problem which necessitates the use of automated systems to curtail its spread. We argue that while misinformation and disinformation detection have been studied, there has been a lack of investment in the important open challenge of detecting harmful agendas in news articles; identifying harmful agendas is critical to flag news campaigns with the greatest potential for real world harm. Moreover, due to real concerns around censorship, harmful agenda detectors must be interpretable to be effective. In this work, we propose this new task and release a dataset, NewsAgendas, of annotated news articles for agenda identification. We show how interpretable systems can be effective on this task and demonstrate that they can perform comparably to black-box models.

2022

pdf bib abs
Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization
Faisal Ladhak | Esin Durmus | He He | Claire Cardie | Kathleen McKeown
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs as one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on the abstractiveness spectrum. We then show that the Maximum Likelihood Estimation (MLE) baseline as well as recently proposed methods for improving faithfulness, fail to consistently improve over the control at the same level of abstractiveness. Finally, we learn a selector to identify the most faithful and abstractive summary for a given document, and show that this system can attain higher faithfulness scores in human evaluations while being more abstractive than the baseline system on two datasets. Moreover, we show that our system is able to achieve a better faithfulness-abstractiveness trade-off than the control at the same level of abstractiveness.

Query-focused summaries of foreign-language, retrieved documents can help a user understand whether a document is actually relevant to the query term. A standard approach to this problem is to first translate the source documents and then perform extractive summarization to find relevant snippets. However, in a cross-lingual setting, the query term does not necessarily appear in the translations of relevant documents. In this work, we show that constrained machine translation and constrained post-editing can improve human relevance judgments by including a query term in a summary when its translation appears in the source document. We also present several strategies for selecting only certain documents for regeneration which yield further improvements

pdf bib abs
Using Structured Content Plans for Fine-grained Syntactic Control in Pretrained Language Model Generation
Fei-Tzin Lee | Miguel Ballesteros | Feng Nan | Kathleen McKeown
Proceedings of the 29th International Conference on Computational Linguistics

Large pretrained language models offer powerful generation capabilities, but cannot be reliably controlled at a sub-sentential level. We propose to make such fine-grained control possible in pretrained LMs by generating text directly from a semantic representation, Abstract Meaning Representation (AMR), which is augmented at the node level with syntactic control tags. We experiment with English-language generation of three modes of syntax relevant to the framing of a sentence - verb voice, verb tense, and realization of human entities - and demonstrate that they can be reliably controlled, even in settings that diverge drastically from the training distribution. These syntactic aspects contribute to how information is framed in text, something that is important for applications such as summarization which aim to highlight salient information.

pdf bib
Proceedings of the Workshop on Automatic Summarization for Creative Writing
Kathleen Mckeown
Proceedings of the Workshop on Automatic Summarization for Creative Writing

This paper introduces the shared task of summrizing documents in several creative domains, namely literary texts, movie scripts, and television scripts. Summarizing these creative documents requires making complex literary interpretations, as well as understanding non-trivial temporal dependencies in texts containing varied styles of plot development and narrative structure. This poses unique challenges and is yet underexplored for text summarization systems. In this shared task, we introduce four sub-tasks and their corresponding datasets, focusing on summarizing books, movie scripts, primetime television scripts, and daytime soap opera scripts. We detail the process of curating these datasets for the task, as well as the metrics used for the evaluation of the submissions. As part of the CREATIVESUMM workshop at COLING 2022, the shared task attracted 18 submissions in total. We discuss the submissions and the baselines for each sub-task in this paper, along with directions for facilitating future work.

Understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. One such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. We create the first benchmark dataset, SafeText, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. We utilize SafeText to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.

pdf bib abs
Read Top News First: A Document Reordering Approach for Multi-Document News Summarization
Chao Zhao | Tenghao Huang | Somnath Basu Roy Chowdhury | Muthu Kumar Chandrasekaran | Kathleen McKeown | Snigdha Chaturvedi
Findings of the Association for Computational Linguistics: ACL 2022

A common method for extractive multi-document news summarization is to re-formulate it as a single-document summarization problem by concatenating all documents as a single meta-document. However, this method neglects the relative importance of documents. We propose a simple approach to reorder the documents according to their relative importance before concatenating and summarizing them. The reordering makes the salient content easier to learn by the summarization model. Experiments show that our approach outperforms previous state-of-the-art methods with more complex architectures.

pdf bib abs
Seeded Hierarchical Clustering for Expert-Crafted Taxonomies
Anish Saha | Amith Ananthram | Emily Allaway | Heng Ji | Kathleen McKeown
Findings of the Association for Computational Linguistics: EMNLP 2022

Practitioners from many disciplines (e.g., political science) use expert-crafted taxonomies to make sense of large, unlabeled corpora. In this work, we study Seeded Hierarchical Clustering (SHC): the task of automatically fitting unlabeled data to such taxonomies using a small set of labeled examples. We propose HierSeed, a novel weakly supervised algorithm for this task that uses only a small set of labeled seed examples in a computation and data efficient manner. HierSeed assigns documents to topics by weighing document density against topic hierarchical structure. It outperforms unsupervised and supervised baselines for the SHC task on three real-world datasets.

An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system’s information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.

In real-world scenarios with naturally occurring datasets, reference summaries are noisy and may contain information that cannot be inferred from the source text. On large news corpora, removing low quality samples has been shown to reduce model hallucinations. Yet, for smaller, and/or noisier corpora, filtering is detrimental to performance. To improve reference quality while retaining all data, we propose a new approach: to selectively re-write unsupported reference sentences to better reflect source data. We automatically generate a synthetic dataset of positive and negative revisions by corrupting supported sentences and learn to revise reference sentences with contrastive learning. The intensity of revisions is treated as a controllable attribute so that, at inference, diverse candidates can be over-generated-then-rescored to balance faithfulness and abstraction. To test our methods, we extract noisy references from publicly available MIMIC-III discharge summaries for the task of hospital-course summarization, and vary the data on which models are trained. According to metrics and human evaluation, models trained on revised clinical references are much more faithful, informative, and fluent than models trained on original or filtered data.

pdf bib abs
What Do Users Care About? Detecting Actionable Insights from User Feedback
Kasturi Bhattacharjee | Rashmi Gangadharaiah | Kathleen McKeown | Dan Roth
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track

Users often leave feedback on a myriad of aspects of a product which, if leveraged successfully, can help yield useful insights that can lead to further improvements down the line. Detecting actionable insights can be challenging owing to large amounts of data as well as the absence of labels in real-world scenarios. In this work, we present an aggregation and graph-based ranking strategy for unsupervised detection of these insights from real-world, noisy, user-generated feedback. Our proposed approach significantly outperforms strong baselines on two real-world user feedback datasets and one academic dataset.

pdf bib abs
Legal and Political Stance Detection of SCOTUS Language
Noah Bergam | Emily Allaway | Kathleen Mckeown
Proceedings of the Natural Legal Language Processing Workshop 2022

We analyze publicly available US Supreme Court documents using automated stance detection. In the first phase of our work, we investigate the extent to which the Court’s public-facing language is political. We propose and calculate two distinct ideology metrics of SCOTUS justices using oral argument transcripts. We then compare these language-based metrics to existing social scientific measures of the ideology of the Supreme Court and the public. Through this cross-disciplinary analysis, we find that justices who are more responsive to public opinion tend to express their ideology during oral arguments. This observation provides a new kind of evidence in favor of the attitudinal change hypothesis of Supreme Court justice behavior. As a natural extension of this political stance detection, we propose the more specialized task of legal stance detection with our new dataset SC-stance, which matches written opinions to legal questions. We find competitive performance on this dataset using language adapters trained on legal documents.

2021

pdf bib abs
InfoSurgeon: Cross-Media Fine-grained Information Consistency Checking for Fake News Detection
Yi Fung | Christopher Thomas | Revanth Gangi Reddy | Sandeep Polisetty | Heng Ji | Shih-Fu Chang | Kathleen McKeown | Mohit Bansal | Avi Sil
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

To defend against machine-generated fake news, an effective mechanism is urgently needed. We contribute a novel benchmark for fake news detection at the knowledge element level, as well as a solution for this task which incorporates cross-media consistency checking to detect the fine-grained knowledge elements making news articles misinformative. Due to training data scarcity, we also formulate a novel data synthesis method by manipulating knowledge elements within the knowledge graph to generate noisy training data with specific, hard to detect, known inconsistencies. Our detection approach outperforms the state-of-the-art (up to 16.8% accuracy gain), and more critically, yields fine-grained explanations.

pdf bib abs
Cross-language Sentence Selection via Data Augmentation and Rationale Training
Yanda Chen | Chris Kedzie | Suraj Nair | Petra Galuscakova | Rui Zhang | Douglas Oard | Kathleen McKeown
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper proposes an approach to cross-language sentence selection in a low-resource setting. It uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model. Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data. Moreover, when a rationale training secondary objective is applied to encourage the model to match word alignment hints from a phrase-based statistical machine translation model, consistent improvements are seen across three language pairs (English-Somali, English-Swahili and English-Tagalog) over a variety of state-of-the-art baselines.

pdf bib abs
Improving Factual Consistency of Abstractive Summarization via Question Answering
Feng Nan | Cicero Nogueira dos Santos | Henghui Zhu | Patrick Ng | Kathleen McKeown | Ramesh Nallapati | Dejiao Zhang | Zhiguo Wang | Andrew O. Arnold | Bing Xiang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

A commonly observed problem with the state-of-the art abstractive summarization models is that the generated summaries can be factually inconsistent with the input documents. The fact that automatic summarization may produce plausible-sounding yet inaccurate summaries is a major concern that limits its wide application. In this paper we present an approach to address factual consistency in summarization. We first propose an efficient automatic evaluation metric to measure factual consistency; next, we propose a novel learning algorithm that maximizes the proposed metric during model training. Through extensive experiments, we confirm that our method is effective in improving factual consistency and even overall quality of the summaries, as judged by both automatic metrics and human evaluation.

pdf bib abs
Event-Centric Natural Language Processing
Muhao Chen | Hongming Zhang | Qiang Ning | Manling Li | Heng Ji | Kathleen McKeown | Dan Roth
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Tutorial Abstracts

This tutorial targets researchers and practitioners who are interested in AI technologies that help machines understand natural language text, particularly real-world events described in the text. These include methods to extract the internal structures of an event regarding its protagonist(s), participant(s) and properties, as well as external structures concerning memberships, temporal and causal relations of multiple events. This tutorial will provide audience with a systematic introduction of (i) knowledge representations of events, (ii) various methods for automated extraction, conceptualization and prediction of events and their relations, (iii) induction of event processes and properties, and (iv) a wide range of NLU and commonsense understanding tasks that benefit from aforementioned techniques. We will conclude the tutorial by outlining emerging research problems in this area.

pdf bib abs
A Unified Feature Representation for Lexical Connotations
Emily Allaway | Kathleen McKeown
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Ideological attitudes and stance are often expressed through subtle meanings of words and phrases. Understanding these connotations is critical to recognizing the cultural and emotional perspectives of the speaker. In this paper, we use distant labeling to create a new lexical resource representing connotation aspects for nouns and adjectives. Our analysis shows that it aligns well with human judgments. Additionally, we present a method for creating lexical representations that capture connotations within the embedding space and show that using the embeddings provides a statistically significant improvement on the task of stance detection when data is limited.

pdf bib abs
Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings
Kailash Karthik Saravanakumar | Miguel Ballesteros | Muthu Kumar Chandrasekaran | Kathleen McKeown
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We propose a method for online news stream clustering that is a variant of the non-parametric streaming K-means algorithm. Our model uses a combination of sparse and dense document representations, aggregates document-cluster similarity along these multiple representations and makes the clustering decision using a neural classifier. The weighted document-cluster similarity model is learned using a novel adaptation of the triplet loss into a linear classification objective. We show that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings for clustering. Our model achieves a new state-of-the-art on a standard stream clustering dataset of English documents.

A key challenge for abstractive summarization is ensuring factual consistency of the generated summary with respect to the original document. For example, state-of-the-art models trained on existing datasets exhibit entity hallucination, generating names of entities that are not present in the source document. We propose a set of new metrics to quantify the entity-level factual consistency of generated summaries and we show that the entity hallucination problem can be alleviated by simply filtering the training data. In addition, we propose a summary-worthy entity classification task to the training process as well as a joint entity and summary generation approach, which yield further improvements in entity level metrics.

Typical ASR systems segment the input audio into utterances using purely acoustic information, which may not resemble the sentence-like units that are expected by conventional machine translation (MT) systems for Spoken Language Translation. In this work, we propose a model for correcting the acoustic segmentation of ASR models for low-resource languages to improve performance on downstream tasks. We propose the use of subtitles as a proxy dataset for correcting ASR acoustic segmentation, creating synthetic acoustic utterances by modeling common error modes. We train a neural tagging model for correcting ASR acoustic segmentation and show that it improves downstream performance on MT and audio-document cross-language information retrieval (CLIR).

Timeline Summarization identifies major events from a news collection and describes them following temporal order, with key dates tagged. Previous methods generally generate summaries separately for each date after they determine the key dates of events. These methods overlook the events’ intra-structures (arguments) and inter-structures (event-event connections). Following a different route, we propose to represent the news articles as an event-graph, thus the summarization becomes compressing the whole graph to its salient sub-graph. The key hypothesis is that the events connected through shared arguments and temporal order depict the skeleton of a timeline, containing events that are semantically related, temporally coherent and structurally salient in the global event graph. A time-aware optimal transport distance is then introduced for learning the compression model in an unsupervised manner. We show that our approach significantly improves on the state of the art on three real-world datasets, including two public standard benchmarks and our newly collected Timeline100 dataset.

pdf bib abs
A Bag of Tricks for Dialogue Summarization
Muhammad Khalifa | Miguel Ballesteros | Kathleen McKeown
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Dialogue summarization comes with its own peculiar challenges as opposed to news or scientific articles summarization. In this work, we explore four different challenges of the task: handling and differentiating parts of the dialogue belonging to multiple speakers, negation understanding, reasoning about the situation, and informal language understanding. Using a pretrained sequence-to-sequence language model, we explore speaker name substitution, negation scope highlighting, multi-task learning with relevant tasks, and pretraining on in-domain data. Our experiments show that our proposed techniques indeed improve summarization performance, outperforming strong baselines.

pdf bib abs
Emotion-Infused Models for Explainable Psychological Stress Detection
Elsbeth Turcan | Smaranda Muresan | Kathleen McKeown
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The problem of detecting psychological stress in online posts, and more broadly, of detecting people in distress or in need of help, is a sensitive application for which the ability to interpret models is vital. Here, we present work exploring the use of a semantically related task, emotion detection, for equally competent but more explainable and human-like psychological stress detection as compared to a black-box model. In particular, we explore the use of multi-task learning as well as emotion-based language model fine-tuning. With our emotion-infused models, we see comparable results to state-of-the-art BERT. Our analysis of the words used for prediction show that our emotion-infused models mirror psychological components of stress.

pdf bib abs
Adversarial Learning for Zero-Shot Stance Detection on Social Media
Emily Allaway | Malavika Srikanth | Kathleen McKeown
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Stance detection on social media can help to identify and understand slanted news or commentary in everyday life. In this work, we propose a new model for zero-shot stance detection on Twitter that uses adversarial learning to generalize across topics. Our model achieves state-of-the-art performance on a number of unseen test topics with minimal computational costs. In addition, we extend zero-shot stance detection to topics not previously considered, highlighting future directions for zero-shot transfer.

Unsupervised clustering aims at discovering the semantic categories of data according to some distance measured in the representation space. However, different categories often overlap with each other in the representation space at the beginning of the learning process, which poses a significant challenge for distance-based clustering in achieving good separation between different categories. To this end, we propose Supporting Clustering with Contrastive Learning (SCCL) – a novel framework to leverage contrastive learning to promote better separation. We assess the performance of SCCL on short text clustering and show that SCCL significantly advances the state-of-the-art results on most benchmark datasets with 3%-11% improvement on Accuracy and 4%-15% improvement on Normalized Mutual Information. Furthermore, our quantitative analysis demonstrates the effectiveness of SCCL in leveraging the strengths of both bottom-up instance discrimination and top-down clustering to achieve better intra-cluster and inter-cluster distances when evaluated with the ground truth cluster labels.

pdf bib abs
Semantic Categorization of Social Knowledge for Commonsense Question Answering
Gengyu Wang | Xiaochen Hou | Diyi Yang | Kathleen McKeown | Jing Huang
Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing

Large pre-trained language models (PLMs) have led to great success on various commonsense question answering (QA) tasks in an end-to-end fashion. However, little attention has been paid to what commonsense knowledge is needed to deeply characterize these QA tasks. In this work, we proposed to categorize the semantics needed for these tasks using the SocialIQA as an example. Building upon our labeled social knowledge categories dataset on top of SocialIQA, we further train neural QA models to incorporate such social knowledge categories and relation information from a knowledge base. Unlike previous work, we observe our models with semantic categorizations of social knowledge can achieve comparable performance with a relatively simple model and smaller size compared to other complex approaches.

2020

pdf bib abs
Exploring Content Selection in Summarization of Novel Chapters
Faisal Ladhak | Bryan Li | Yaser Al-Onaizan | Kathleen McKeown
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present a new summarization task, generating summaries of novel chapters using summary/chapter pairs from online study guides. This is a harder task than the news summarization task, given the chapter length as well as the extreme paraphrasing and generalization found in the summaries. We focus on extractive summarization, which requires the creation of a gold-standard set of extractive summaries. We present a new metric for aligning reference summary sentences with chapter sentences to create gold extracts and also experiment with different alignment methods. Our experiments demonstrate significant improvement over prior alignment approaches for our task as shown through automatic metrics and a crowd-sourced pyramid analysis.

pdf bib
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)
Kathy McKeown | Douglas W. Oard | Elizabeth | Richard Schwartz
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)

pdf bib abs
Subtitles to Segmentation: Improving Low-Resource Speech-to-TextTranslation Pipelines
David Wan | Zhengping Jiang | Chris Kedzie | Elsbeth Turcan | Peter Bell | Kathy McKeown
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)

In this work, we focus on improving ASR output segmentation in the context of low-resource language speech-to-text translation. ASR output segmentation is crucial, as ASR systems segment the input audio using purely acoustic information and are not guaranteed to output sentence-like segments. Since most MT systems expect sentences as input, feeding in longer unsegmented passages can lead to sub-optimal performance. We explore the feasibility of using datasets of subtitles from TV shows and movies to train better ASR segmentation models. We further incorporate part-of-speech (POS) tag and dependency label information (derived from the unsegmented ASR outputs) into our segmentation model. We show that this noisy syntactic information can improve model accuracy. We evaluate our models intrinsically on segmentation quality and extrinsically on downstream MT performance, as well as downstream tasks including cross-lingual information retrieval (CLIR) tasks and human relevance assessments. Our model shows improved performance on downstream tasks for Lithuanian and Bulgarian.

pdf bib abs
Event-Guided Denoising for Multilingual Relation Learning
Amith Ananthram | Emily Allaway | Kathleen McKeown
Proceedings of the 28th International Conference on Computational Linguistics

General purpose relation extraction has recently seen considerable gains in part due to a massively data-intensive distant supervision technique from Soares et al. (2019) that produces state-of-the-art results across many benchmarks. In this work, we present a methodology for collecting high quality training data for relation extraction from unlabeled text that achieves a near-recreation of their zero-shot and few-shot results at a fraction of the training cost. Our approach exploits the predictable distributional structure of date-marked news articles to build a denoised corpus – the extraction process filters out low quality examples. We show that a smaller multilingual encoder trained on this corpus performs comparably to the current state-of-the-art (when both receive little to no fine-tuning) on few-shot and standard relation benchmarks in English and Spanish despite using many fewer examples (50k vs. 300mil+).

pdf bib abs
Detecting Urgency Status of Crisis Tweets: A Transfer Learning Approach for Low Resource Languages
Efsun Sarioglu Kayi | Linyong Nan | Bohan Qu | Mona Diab | Kathleen McKeown
Proceedings of the 28th International Conference on Computational Linguistics

We release an urgency dataset that consists of English tweets relating to natural crises, along with annotations of their corresponding urgency status. Additionally, we release evaluation datasets for two low-resource languages, i.e. Sinhala and Odia, and demonstrate an effective zero-shot transfer from English to these two languages by training cross-lingual classifiers. We adopt cross-lingual embeddings constructed using different methods to extract features of the tweets, including a few state-of-the-art contextual embeddings such as BERT, RoBERTa and XLM-R. We train classifiers of different architectures on the extracted features. We also explore semi-supervised approaches by utilizing unlabeled tweets and experiment with ensembling different classifiers. With very limited amounts of labeled data in English and zero data in the low resource languages, we show a successful framework of training monolingual and cross-lingual classifiers using deep learning methods which are known to be data hungry. Specifically, we show that the recent deep contextual embeddings are also helpful when dealing with very small-scale datasets. Classifiers that incorporate RoBERTa yield the best performance for English urgency detection task, with F1 scores that are more than 25 points over our baseline classifier. For the zero-shot transfer to low resource languages, classifiers that use LASER features perform the best for Sinhala transfer while XLM-R features benefit the Odia transfer the most.

pdf bib abs
Controllable Meaning Representation to Text Generation: Linearization and Data Augmentation Strategies
Chris Kedzie | Kathleen McKeown
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We study the degree to which neural sequence-to-sequence models exhibit fine-grained controllability when performing natural language generation from a meaning representation. Using two task-oriented dialogue generation benchmarks, we systematically compare the effect of four input linearization strategies on controllability and faithfulness. Additionally, we evaluate how a phrase-based data augmentation method can improve performance. We find that properly aligning input sequences during training leads to highly controllable generation, both when training from scratch or when fine-tuning a larger pre-trained model. Data augmentation further improves control on difficult, randomly generated utterance plans.

In this paper, we propose a neural architecture and a set of training methods for ordering events by predicting temporal relations. Our proposed models receive a pair of events within a span of text as input and they identify temporal relations (Before, After, Equal, Vague) between them. Given that a key challenge with this task is the scarcity of annotated data, our models rely on either pretrained representations (i.e. RoBERTa, BERT or ELMo), transfer and multi-task learning (by leveraging complementary datasets), and self-training techniques. Experiments on the MATRES dataset of English documents establish a new state-of-the-art on this task.

pdf bib abs
Zero-Shot Stance Detection: A Dataset and Model using Generalized Topic Representations
Emily Allaway | Kathleen McKeown
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Stance detection is an important component of understanding hidden influences in everyday life. Since there are thousands of potential topics to take a stance on, most with little to no training data, we focus on zero-shot stance detection: classifying stance from no training examples. In this paper, we present a new dataset for zero-shot stance detection that captures a wider range of topics and lexical variation than in previous datasets. Additionally, we propose a new model for stance detection that implicitly captures relationships between topics using generalized topic representations and show that this model improves performance on a number of challenging linguistic phenomena.

Unsupervised domain adaptation addresses the problem of leveraging labeled data in a source domain to learn a well-performing model in a target domain where labels are unavailable. In this paper, we improve upon a recent theoretical work (Zhang et al., 2019b) and adopt the Margin Disparity Discrepancy (MDD) unsupervised domain adaptation algorithm to solve the cross-lingual text labeling problems. Experiments on cross-lingual document classification and NER demonstrate the proposed domain adaptation approach advances the state-of-the-art results by a large margin. Specifically, we improve MDD by efficiently optimizing the margin loss on the source domain via Virtual Adversarial Training (VAT). This bridges the gap between theory and the loss function used in the original work Zhang et al.(2019b), and thereby significantly boosts the performance. Our numerical results also indicate that VAT can remarkably improve the generalization performance of both domains for various domain adaptation approaches.

pdf bib abs
WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization
Faisal Ladhak | Esin Durmus | Claire Cardie | Kathleen McKeown
Findings of the Association for Computational Linguistics: EMNLP 2020

We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct cross-lingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost efficient during inference.

pdf bib abs
Towards Augmenting Lexical Resources for Slang and African American English
Alyssa Hwang | William R. Frey | Kathleen McKeown
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Researchers in natural language processing have developed large, robust resources for understanding formal Standard American English (SAE), but we lack similar resources for variations of English, such as slang and African American English (AAE). In this work, we use word embeddings and clustering algorithms to group semantically similar words in three datasets, two of which contain high incidence of slang and AAE. Since high-quality clusters would contain related words, we could also infer the meaning of an unfamiliar word based on the meanings of words clustered with it. After clustering, we compute precision and recall scores using WordNet and ConceptNet as gold standards and show that these scores are unimportant when the given resources do not fully represent slang and AAE. Amazon Mechanical Turk and expert evaluations show that clusters with low precision can still be considered high quality, and we propose the new Cluster Split Score as a metric for machine-generated clusters. These contributions emphasize the gap in natural language processing research for variations of English and motivate further work to close it.

pdf bib abs
Incorporating Terminology Constraints in Automatic Post-Editing
David Wan | Chris Kedzie | Faisal Ladhak | Marine Carpuat | Kathleen McKeown
Proceedings of the Fifth Conference on Machine Translation

Users of machine translation (MT) may want to ensure the use of specific lexical terminologies. While there exist techniques for incorporating terminology constraints during inference for MT, current APE approaches cannot ensure that they will appear in the final translation. In this paper, we present both autoregressive and non-autoregressive models for lexically constrained APE, demonstrating that our approach enables preservation of 95% of the terminologies and also improves translation quality on English-German benchmarks. Even when applied to lexically constrained MT output, our approach is able to improve preservation of the terminologies. However, we show that our models do not learn to copy constraints systematically and suggest a simple data augmentation technique that leads to improved performance and robustness.

2019

pdf bib abs
AMPERSAND: Argument Mining for PERSuAsive oNline Discussions
Tuhin Chakrabarty | Christopher Hidey | Smaranda Muresan | Kathy McKeown | Alyssa Hwang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Argumentation is a type of discourse where speakers try to persuade their audience about the reasonableness of a claim by presenting supportive arguments. Most work in argument mining has focused on modeling arguments in monologues. We propose a computational model for argument mining in online persuasive discussion forums that brings together the micro-level (argument as product) and macro-level (argument as process) models of argumentation. Fundamentally, this approach relies on identifying relations between components of arguments in a discussion thread. Our approach for relation prediction uses contextual information in terms of fine-tuning a pre-trained language model and leveraging discourse relations based on Rhetorical Structure Theory. We additionally propose a candidate selection method to automatically predict what parts of one’s argument will be targeted by other participants in the discussion. Our models obtain significant improvements compared to recent state-of-the-art approaches using pointer networks and a pre-trained language model.

pdf bib abs
Detecting and Reducing Bias in a High Stakes Domain
Ruiqi Zhong | Yanda Chen | Desmond Patton | Charlotte Selous | Kathy McKeown
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Gang-involved youth in cities such as Chicago sometimes post on social media to express their aggression towards rival gangs and previous research has demonstrated that a deep learning approach can predict aggression and loss in posts. To address the possibility of bias in this sensitive application, we developed an approach to systematically interpret the state of the art model. We found, surprisingly, that it frequently bases its predictions on stop words such as “a” or “on”, an approach that could harm social media users who have no aggressive intentions. To tackle this bias, domain experts annotated the rationales, highlighting words that explain why a tweet is labeled as “aggression”. These new annotations enable us to quantitatively measure how justified the model predictions are, and build models that drastically reduce bias. Our study shows that in high stake scenarios, accuracy alone cannot guarantee a good system and we need new evaluation methods.

pdf bib abs
Automatically Inferring Gender Associations from Language
Serina Chang | Kathy McKeown
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In this paper, we pose the question: do people talk about women and men in different ways? We introduce two datasets and a novel integration of approaches for automatically inferring gender associations from language, discovering coherent word clusters, and labeling the clusters for the semantic concepts they represent. The datasets allow us to compare how people write about women and men in two different settings – one set draws from celebrity news and the other from student reviews of computer science professors. We demonstrate that there are large-scale differences in the ways that people talk about women and men and that these differences vary across domains. Human evaluations show that our methods significantly outperform strong baselines.

pdf bib abs
Dreaddit: A Reddit Dataset for Stress Analysis in Social Media
Elsbeth Turcan | Kathy McKeown
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

Stress is a nigh-universal human experience, particularly in the online world. While stress can be a motivator, too much stress is associated with many negative health outcomes, making its identification useful across a range of domains. However, existing computational research typically only studies stress in domains such as speech, or in short genres such as Twitter. We present Dreaddit, a new text corpus of lengthy multi-domain social media data for the identification of stress. Our dataset consists of 190K posts from five different categories of Reddit communities; we additionally label 3.5K total segments taken from 3K posts using Amazon Mechanical Turk. We present preliminary supervised learning methods for identifying stress, both neural and traditional, and analyze the complexity and diversity of the data and characteristics of each category.

pdf bib abs
IMHO Fine-Tuning Improves Claim Detection
Tuhin Chakrabarty | Christopher Hidey | Kathy McKeown
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Claims are the central component of an argument. Detecting claims across different domains or data sets can often be challenging due to their varying conceptualization. We propose to alleviate this problem by fine-tuning a language model using a Reddit corpus of 5.5 million opinionated claims. These claims are self-labeled by their authors using the internet acronyms IMO/IMHO (in my (humble) opinion). Empirical results show that using this approach improves the state of art performance across four benchmark argumentation data sets by an average of 4 absolute F1 points in claim detection. As these data sets include diverse domains such as social media and student essays this improvement demonstrates the robustness of fine-tuning on this novel corpus.

pdf bib abs
Fixed That for You: Generating Contrastive Claims with Semantic Edits
Christopher Hidey | Kathy McKeown
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Understanding contrastive opinions is a key component of argument generation. Central to an argument is the claim, a statement that is in dispute. Generating a counter-argument then requires generating a response in contrast to the main claim of the original argument. To generate contrastive claims, we create a corpus of Reddit comment pairs self-labeled by posters using the acronym FTFY (fixed that for you). We then train neural models on these pairs to edit the original claim and produce a new claim with a different view. We demonstrate significant improvement over a sequence-to-sequence baseline in BLEU score and a human evaluation for fluency, coherence, and contrast.

pdf bib abs
A Robust Abstractive System for Cross-Lingual Summarization
Jessica Ouyang | Boya Song | Kathy McKeown
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present a robust neural abstractive summarization system for cross-lingual summarization. We construct summarization corpora for documents automatically translated from three low-resource languages, Somali, Swahili, and Tagalog, using machine translation and the New York Times summarization corpus. We train three language-specific abstractive summarizers and evaluate on documents originally written in the source languages, as well as on a fourth, unseen language: Arabic. Our systems achieve significantly higher fluency than a standard copy-attention summarizer on automatically translated input documents, as well as comparable content selection.

pdf bib abs
Neural Network Alignment for Sentential Paraphrases
Jessica Ouyang | Kathy McKeown
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present a monolingual alignment system for long, sentence- or clause-level alignments, and demonstrate that systems designed for word- or short phrase-based alignment are ill-suited for these longer alignments. Our system is capable of aligning semantically similar spans of arbitrary length. We achieve significantly higher recall on aligning phrases of four or more words and outperform state-of-the- art aligners on the long alignments in the MSR RTE corpus.

pdf bib abs
Identifying therapist conversational actions across diverse psychotherapeutic approaches
Fei-Tzin Lee | Derrick Hull | Jacob Levine | Bonnie Ray | Kathy McKeown
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

While conversation in therapy sessions can vary widely in both topic and style, an understanding of the underlying techniques used by therapists can provide valuable insights into how therapists best help clients of different types. Dialogue act classification aims to identify the conversational “action” each speaker takes at each utterance, such as sympathizing, problem-solving or assumption checking. We propose to apply dialogue act classification to therapy transcripts, using a therapy-specific labeling scheme, in order to gain a high-level understanding of the flow of conversation in therapy sessions. We present a novel annotation scheme that spans multiple psychotherapeutic approaches, apply it to a large and diverse corpus of psychotherapy transcripts, and present and discuss classification results obtained using both SVM and neural network-based models. The results indicate that identifying the structure and flow of therapeutic actions is an obtainable goal, opening up the opportunity in the future to provide therapeutic recommendations tailored to specific client situations.

pdf bib abs
A Good Sample is Hard to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models
Chris Kedzie | Kathleen McKeown
Proceedings of the 12th International Conference on Natural Language Generation

Deep neural networks (DNN) are quickly becoming the de facto standard modeling method for many natural language generation (NLG) tasks. In order for such models to truly be useful, they must be capable of correctly generating utterances for novel meaning representations (MRs) at test time. In practice, even sophisticated DNNs with various forms of semantic control frequently fail to generate utterances faithful to the input MR. In this paper, we propose an architecture agnostic self-training method to sample novel MR/text utterance pairs to augment the original training data. Remarkably, after training on the augmented data, even simple encoder-decoder models with greedy decoding are capable of generating semantically correct utterances that are as good as state-of-the-art outputs in both automatic and human evaluations of quality.

2018

Gang-involved youth in cities such as Chicago have increasingly turned to social media to post about their experiences and intents online. In some situations, when they experience the loss of a loved one, their online expression of emotion may evolve into aggression towards rival gangs and ultimately into real-world violence. In this paper, we present a novel system for detecting Aggression and Loss in social media. Our system features the use of domain-specific resources automatically derived from a large unlabeled corpus, and contextual representations of the emotional and semantic content of the user’s recent tweets as well as their interactions with other users. Incorporating context in our Convolutional Neural Network (CNN) leads to a significant improvement.

pdf bib abs
Content Selection in Deep Learning Models of Summarization
Chris Kedzie | Kathleen McKeown | Hal Daumé III
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We carry out experiments with deep learning models of summarization across the domains of news, personal stories, meetings, and medical articles in order to understand how content selection is performed. We find that many sophisticated features of state of the art extractive summarizers do not improve performance over simpler models. These results suggest that it is easier to create a summarizer for a new domain than previous work suggests and bring into question the benefit of deep learning models for summarization for those domains that do have massive datasets (i.e., news). At the same time, they suggest important questions for new research in summarization; namely, new forms of sentence representations or external knowledge sources are needed that are better suited to the sumarization task.

pdf bib abs
Predictive Embeddings for Hate Speech Detection on Twitter
Rohan Kshirsagar | Tyrus Cukuvac | Kathy McKeown | Susan McGregor
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

We present a neural-network based approach to classifying online hate speech in general, as well as racist and sexist speech in particular. Using pre-trained word embeddings and max/mean pooling from simple, fully-connected transformations of these embeddings, we are able to predict the occurrence of hate speech on three commonly used publicly available datasets. Our models match or outperform state of the art F1 performance on all three datasets using significantly fewer parameters and minimal feature preprocessing compared to previous methods.

2017

pdf bib abs
SMARTies: Sentiment Models for Arabic Target entities
Noura Farra | Kathy McKeown
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We consider entity-level sentiment analysis in Arabic, a morphologically rich language with increasing resources. We present a system that is applied to complex posts written in response to Arabic newspaper articles. Our goal is to identify important entity “targets” within the post along with the polarity expressed about each target. We achieve significant improvements over multiple baselines, demonstrating that the use of specific morphological representations improves the performance of identifying both important targets and their sentiment, and that the use of distributional semantic clusters further boosts performances for these representations, especially when richer linguistic resources are not available.

pdf bib abs
Crowd-Sourced Iterative Annotation for Narrative Summarization Corpora
Jessica Ouyang | Serina Chang | Kathy McKeown
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present an iterative annotation process for producing aligned, parallel corpora of abstractive and extractive summaries for narrative. Our approach uses a combination of trained annotators and crowd-sourcing, allowing us to elicit human-generated summaries and alignments quickly and at low cost. We use crowd-sourcing to annotate aligned phrases with the text-to-text generation techniques needed to transform each phrase into the other. We apply this process to a corpus of 476 personal narratives, which we make available on the Web.

pdf bib abs
Domain-Adaptable Hybrid Generation of RDF Entity Descriptions
Or Biran | Kathleen McKeown
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

RDF ontologies provide structured data on entities in many domains and continue to grow in size and diversity. While they can be useful as a starting point for generating descriptions of entities, they often miss important information about an entity that cannot be captured as simple relations. In addition, generic approaches to generation from RDF cannot capture the unique style and content of specific domains. We describe a framework for hybrid generation of entity descriptions, which combines generation from RDF data with text extracted from a corpus, and extracts unique aspects of the domain from the corpus to create domain-specific generation systems. We show that each component of our approach significantly increases the satisfaction of readers with the text across multiple applications and domains.

pdf bib abs
Analyzing the Semantic Types of Claims and Premises in an Online Persuasive Forum
Christopher Hidey | Elena Musi | Alyssa Hwang | Smaranda Muresan | Kathy McKeown
Proceedings of the 4th Workshop on Argument Mining

Argumentative text has been analyzed both theoretically and computationally in terms of argumentative structure that consists of argument components (e.g., claims, premises) and their argumentative relations (e.g., support, attack). Less emphasis has been placed on analyzing the semantic types of argument components. We propose a two-tiered annotation scheme to label claims and premises and their semantic types in an online persuasive forum, Change My View, with the long-term goal of understanding what makes a message persuasive. Premises are annotated with the three types of persuasive modes: ethos, logos, pathos, while claims are labeled as interpretation, evaluation, agreement, or disagreement, the latter two designed to account for the dialogical nature of our corpus. We aim to answer three questions: 1) can humans reliably annotate the semantic types of argument components? 2) are types of premises/claims positioned in recurrent orders? and 3) are certain types of claims and/or premises more likely to appear in persuasive messages than in non-persuasive messages?

2016

pdf bib abs
Automatically Processing Tweets from Gang-Involved Youth: Towards Detecting Loss and Aggression
Terra Blevins | Robert Kwiatkowski | Jamie MacBeth | Kathleen McKeown | Desmond Patton | Owen Rambow
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Violence is a serious problems for cities like Chicago and has been exacerbated by the use of social media by gang-involved youths for taunting rival gangs. We present a corpus of tweets from a young and powerful female gang member and her communicators, which we have annotated with discourse intention, using a deep read to understand how and what triggered conversations to escalate into aggression. We use this corpus to develop a part-of-speech tagger and phrase table for the variant of English that is used and a classifier for identifying tweets that express grieving and aggression.

pdf bib
Identifying Causal Relations Using Parallel Wikipedia Articles
Christopher Hidey | Kathy McKeown
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Mining Paraphrasal Typed Templates from a Plain Text Corpus
Or Biran | Terra Blevins | Kathleen McKeown
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
An Entity-Focused Approach to Generating Company Descriptions
Gavin Saldanha | Or Biran | Kathleen McKeown | Alfio Gliozzo
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Social Proof: The Impact of Author Traits on Influence Detection
Sara Rosenthal | Kathy McKeown
Proceedings of the First Workshop on NLP and Computational Social Science

2015

pdf bib
System Combination for Machine Translation through Paraphrasing
Wei-Yun Ma | Kathleen McKeown
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Discourse Planning with an N-gram Model of Relations
Or Biran | Kathleen McKeown
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Modeling Reportable Events as Turning Points in Narrative
Jessica Ouyang | Kathleen McKeown
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Predicting Salient Updates for Disaster Summarization
Chris Kedzie | Kathleen McKeown | Fernando Diaz
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Annotating Targets of Opinions in Arabic using Crowdsourcing
Noura Farra | Kathy McKeown | Nizar Habash
Proceedings of the Second Workshop on Arabic Natural Language Processing

pdf bib
PDTB Discourse Parsing as a Tagging Task: The Two Taggers Approach
Or Biran | Kathleen McKeown
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib
I Couldn’t Agree More: The Role of Conversational Structure in Agreement and Disagreement Detection in Online Discussions
Sara Rosenthal | Kathy McKeown
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2014

pdf bib abs
Towards Automatic Detection of Narrative Structure
Jessica Ouyang | Kathy McKeown
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present novel computational experiments using William Labov’s theory of narrative analysis. We describe his six elements of narrative structure and construct a new corpus based on his most recent work on narrative. Using this corpus, we explore the correspondence between Labovs elements of narrative structure and the implicit discourse relations of the Penn Discourse Treebank, and we construct a mapping between the elements of narrative structure and the discourse relation classes of the PDTB. We present first experiments on detecting Complicating Actions, the most common of the elements of narrative structure, achieving an f-score of 71.55. We compare the contributions of features derived from narrative analysis, such as the length of clauses and the tenses of main verbs, with those of features drawn from work on detecting implicit discourse relations. Finally, we suggest directions for future research on narrative structure, such as applications in assessing text quality and in narrative generation.

pdf bib
Columbia NLP: Sentiment Detection of Sentences and Subjective Phrases in Social Media
Sara Rosenthal | Kathy McKeown | Apoorv Agarwal
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science
Cristian Danescu-Niculescu-Mizil | Jacob Eisenstein | Kathleen McKeown | Noah A. Smith
Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science

2013

pdf bib
Classifying Taxonomic Relations between Pairs of Wikipedia Articles
Or Biran | Kathleen McKeown
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Cluster-based Web Summarization
Yves Petinot | Kathleen McKeown | Kapil Thadani
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Supervised Sentence Fusion with Single-Stage Inference
Kapil Thadani | Kathleen McKeown
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Using a Supertagged Dependency Language Model to Select a Good Translation in System Combination
Wei-Yun Ma | Kathleen McKeown
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Aggregated Word Pair Features for Implicit Discourse Relation Disambiguation
Or Biran | Kathleen McKeown
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Columbia NLP: Sentiment Detection of Subjective Phrases in Social Media
Sara Rosenthal | Kathy McKeown
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
Semantic Technologies in IBM Watson
Alfio Gliozzo | Or Biran | Siddharth Patwardhan | Kathleen McKeown
Proceedings of the Fourth Workshop on Teaching NLP and CL

pdf bib
Sentence Compression with Joint Structural Inference
Kapil Thadani | Kathleen McKeown
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

2012

pdf bib abs
Phrase-level System Combination for Machine Translation Based on Target-to-Target Decoding
Wei-Yun Ma | Kathleen McKeown
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In this paper, we propose a novel lattice-based MT combination methodology that we call Target-to-Target Decoding (TTD). The combination process is carried out as a “translation” from backbone to the combination result. This perspective suggests the use of existing phrase-based MT techniques in the combination framework. We show how phrase extraction rules and confidence estimations inspired from machine translation improve results. We also propose system-specific LMs for estimating N-gram consensus. Our results show that our approach yields a strong improvement over the best single MT system and competes with other state-of-the-art combination systems.

pdf bib abs
Lost & Found in Translation: Impact of Machine Translated Results on Translingual Information Retrieval
Kristen Parton | Nizar Habash | Kathleen McKeown
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In an ideal cross-lingual information retrieval (CLIR) system, a user query would generate a search over documents in a different language and the relevant results would be presented in the user’s language. In practice, CLIR systems are typically evaluated by judging result relevance in the document language, to factor out the effects of translating the results using machine translation (MT). In this paper, we investigate the influence of four different approaches for integrating MT and CLIR on both retrieval accuracy and user judgment of relevancy. We create a corpus with relevance judgments for both human and machine translated results, and use it to quantify the effect that MT quality has on end-to-end relevance. We find that MT errors result in a 16-39% decrease in mean average precision over the ground truth system that uses human translations. MT errors also caused relevant sentences to appear irrelevant – 5-19% of sentences were relevant in human translation, but were judged irrelevant in MT. To counter this degradation, we present two hybrid retrieval models and two automatic MT post-editing techniques and show that these approaches substantially mitigate the errors and improve the end-to-end relevance.

pdf bib abs
Learning to Automatically Post-Edit Dropped Words in MT
Jacob Mundt | Kristen Parton | Kathleen McKeown
Workshop on Post-Editing Technology and Practice

Automatic post-editors (APEs) can improve adequacy of MT output by detecting and reinserting dropped content words, but the location where these words are inserted is critical. In this paper, we describe a probabilistic approach for learning reinsertion rules for specific languages and MT systems, as well as a method for synthesizing training data from reference translations. We test the insertion logic on MT systems for Chinese to English and Arabic to English. Our adaptive APE is able to insert within 3 words of the best location 73% of the time (32% in the exact location) in Arabic-English MT output, and 67% of the time in Chinese-English output (30% in the exact location), and delivers improved performance on automated adequacy metrics over a previous rule-based approach to insertion. We consider how particular aspects of the insertion problem make it particularly amenable to machine learning solutions.

pdf bib
Can Automatic Post-Editing Make MT More Meaningful
Kristen Parton | Nizar Habash | Kathleen McKeown | Gonzalo Iglesias | Adrià de Gispert
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf bib abs
Annotating Agreement and Disagreement in Threaded Discussion
Jacob Andreas | Sara Rosenthal | Kathleen McKeown
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We introduce a new corpus of sentence-level agreement and disagreement annotations over LiveJournal and Wikipedia threads. This is the first agreement corpus to offer full-document annotations for threaded discussions. We provide a methodology for coding responses as well as an implemented tool with an interface that facilitates annotation of a specific response while viewing the full context of the thread. Both the results of an annotator questionnaire and high inter-annotator agreement statistics indicate that the annotations collected are of high quality.

pdf bib
Detecting and Correcting Syntactic Errors in Machine Translation Using Feature-Based Lexicalized Tree Adjoining Grammars
Wei-Yun Ma | Kathleen McKeown
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012)

pdf bib
Detecting and Correcting Syntactic Errors in Machine Translation Using Feature-Based Lexicalized Tree Adjoining Grammars
Wei-Yun Ma | Kathleen McKeown
International Journal of Computational Linguistics & Chinese Language Processing, Volume 17, Number 4, December 2012-Special Issue on Selected Papers from ROCLING XXIV

pdf bib
Detecting Influencers in Written Online Conversations
Or Biran | Sara Rosenthal | Jacob Andreas | Kathleen McKeown | Owen Rambow
Proceedings of the Second Workshop on Language in Social Media

2011

pdf bib
System Combination for Machine Translation Based on Text-to-Text Generation
Wei-Yun Ma | Kathleen Mckeown
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Identifying Event Descriptions using Co-training with Online News Summaries
William Yang Wang | Kapil Thadani | Kathleen McKeown
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Information Status Distinctions and Referring Expressions: An Empirical Study of References to People in News Summaries
Advaith Siddharthan | Ani Nenkova | Kathleen McKeown
Computational Linguistics, Volume 37, Issue 4 - December 2011

pdf bib
Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
Sara Rosenthal | Kathleen McKeown
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
Kapil Thadani | Kathleen McKeown
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
A Hierarchical Model of Web Summaries
Yves Petinot | Kathleen McKeown | Kapil Thadani
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Towards Strict Sentence Intersection: Decoding and Evaluation Strategies
Kapil Thadani | Kathleen McKeown
Proceedings of the Workshop on Monolingual Text-To-Text Generation

2010

pdf bib
“Got You!”: Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling
William Yang Wang | Kathleen McKeown
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
MT Error Detection for Cross-Lingual Question Answering
Kristen Parton | Kathleen McKeown
Coling 2010: Posters

pdf bib abs
Towards Semi-Automated Annotation for Prepositional Phrase Attachment
Sara Rosenthal | William Lipovsky | Kathleen McKeown | Kapil Thadani | Jacob Andreas
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper investigates whether high-quality annotations for tasks involving semantic disambiguation can be obtained without a major investment in time or expense. We examine the use of untrained human volunteers from Amazons Mechanical Turk in disambiguating prepositional phrase (PP) attachment over sentences drawn from the Wall Street Journal corpus. Our goal is to compare the performance of these crowdsourced judgments to the annotations supplied by trained linguists for the Penn Treebank project in order to indicate the viability of this approach for annotation projects that involve contextual disambiguation. The results of our experiments on a sample of the Wall Street Journal corpus show that invoking majority agreement between multiple human workers can yield PP attachments with fairly high precision. This confirms that a crowdsourcing approach to syntactic annotation holds promise for the generation of training corpora in new domains and genres where high-quality annotations are not available and difficult to obtain.

pdf bib abs
Building a Bank of Semantically Encoded Narratives
David K. Elson | Kathleen R. McKeown
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We propose a methodology for a novel type of discourse annotation whose model is tuned to the analysis of a text as narrative. This is intended to be the basis of a story bank resource that would facilitate the automatic analysis of narrative structure and content. The methodology calls for annotators to construct propositions that approximate a reference text, by selecting predicates and arguments from among controlled vocabularies drawn from resources such as WordNet and VerbNet. Annotators then integrate the propositions into a conceptual graph that maps out the entire discourse; the edges represent temporal, causal and other relationships at the level of story content. Because annotators must identify the recurring objects and themes that appear in the text, they also perform coreference resolution and word sense disambiguation as they encode propositions. We describe a collection experiment and a method for determining inter-annotator agreement when multiple annotators encode the same short story. Finally, we describe ongoing work toward extending the method to integrate the annotators interpretations of character agency (the goals, plans and beliefs that are relevant, yet not explictly stated in the text).

pdf bib
Time-Efficient Creation of an Accurate Sentence Fusion Corpus
Kathleen McKeown | Sara Rosenthal | Kapil Thadani | Coleman Moore
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Extracting Social Networks from Literary Fiction
David Elson | Nicholas Dames | Kathleen McKeown
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Corpus Creation for New Genres: A Crowdsourced Approach to PP Attachment
Mukund Jha | Jacob Andreas | Kapil Thadani | Sara Rosenthal | Kathleen McKeown
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
Tense and Aspect Assignment in Narrative Discourse
David Elson | Kathleen McKeown
Proceedings of the 6th International Natural Language Generation Conference