Jonathan K. Kummerfeld - ACL Anthology

Jonathan K. Kummerfeld

2025

Aligning AI Research with the Needs of Clinical Coding Workflows: Eight Recommendations Based on US Data Analysis and Critical Review
Yidong Gan | Maciej Rybinski | Ben Hachey | Jonathan K. Kummerfeld
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Clinical coding is crucial for healthcare billing and data analysis. Manual clinical coding is labour-intensive and error-prone, which has motivated research towards full automation of the process. However, our analysis, based on US English electronic health records and automated coding research using these records, shows that widely used evaluation methods are not aligned with real clinical contexts. For example, evaluations that focus on the top 50 most common codes are an oversimplification, as there are thousands of codes used in practice. This position paper aims to align AI coding research more closely with practical challenges of clinical coding. Based on our analysis, we offer eight specific recommendations, suggesting ways to improve current evaluation methods. Additionally, we propose new AI-based methods beyond automated coding, suggesting alternative approaches to assist clinical coders in their workflows.

Less is More: Explainable and Efficient ICD Code Prediction with Clinical Entities
James C. Douglas | Yidong Gan | Ben Hachey | Jonathan K. Kummerfeld
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Clinical coding, assigning standardized codes to medical notes, is critical for epidemiological research, hospital planning, and reimbursement. Neural coding models generally process entire discharge summaries, which are often lengthy and contain information that is not relevant to coding. We propose an approach that combines Named Entity Recognition (NER) and Assertion Classification (AC) to filter for clinically important content before supervised code prediction. On MIMIC-IV, a standard evaluation dataset, our approach achieves near-equivalent performance to a state-of-the-art full-text baseline while using only 22% of the content and reducing training time by over half. Additionally, mapping model attention to complete entity spans yields coherent, clinically meaningful explanations, capturing coding-relevant modifiers such as acuity and laterality. We release a newly annotated NER+AC dataset for MIMIC-IV, designed specifically for ICD coding. Our entity-centric approach lays a foundation for more transparent and cost-effective assisted coding.

Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association
Jonathan K. Kummerfeld | Aditya Joshi | Mark Dras
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association

Simple and Effective Baselines for Code Summarisation Evaluation
Jade Robinson | Jonathan K. Kummerfeld
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association

Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This allows us to also make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.

Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL
Wichayaporn Wongkamjan | Yanze Wang | Feng Gu | Denis Peskoff | Jonathan K. Kummerfeld | Jonathan May | Jordan Lee Boyd-Graber
Findings of the Association for Computational Linguistics: ACL 2025

An increasingly common socio-technical problem is people being taken in by offers that sound “too good to be true”, where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strategic reasoning. This requires extracting logical forms representing proposals—agreements that players suggest during communication—and computing their relative rewards using agents’ value functions. Combined with text-based features, this can improve our deception detection. Our method detects human deception with a high precision when compared to a Large Language Model approach that flags many true messages as deceptive. Future human-AI interaction tools can build on our methods for deception detection by triggering friction to give users a chance of interrogating suspicious proposals.

Personalized Help for Optimizing Low-Skilled Users’ Strategy
Feng Gu | Wichayaporn Wongkamjan | Jordan Lee Boyd-Graber | Jonathan K. Kummerfeld | Denis Peskoff | Jonathan May
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

AIs can beat humans in game environments; however, how helpful those agents are to human remains understudied. We augment Cicero, a natural language agent that demonstrates superhuman performance in Diplomacy, to generate both move and message advice based on player intentions. A dozen Diplomacy games with novice and experienced players, with varying advice settings, show that some of the generated advice is beneficial. It helps novices compete with experienced players and in some instances even surpass them. The mere presence of advice can be advantageous, even if players do not follow it.

2024

More Victories, Less Cooperation: Assessing Cicero’s Diplomacy Play
Wichayaporn Wongkamjan | Feng Gu | Yanze Wang | Ulf Hermjakob | Jonathan May | Brandon M. Stewart | Jonathan K. Kummerfeld | Denis Peskoff | Jordan Lee Boyd-Graber
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The boardgame Diplomacy is a challenging setting for communicative and cooperative artificial intelligence. The most prominent communicative Diplomacy AI, Cicero, has excellent strategic abilities, exceeding human players. However, the best Diplomacy players master communication, not just tactics, which is why the game has received attention as an AI challenge. This work seeks to understand the degree to which Cicero succeeds at communication. First, we annotate in-game communication with abstract meaning representation to separate in-game tactics from general language. Second, we run two dozen games with humans and Cicero, totaling over 200 human-player hours of competition. While AI can consistently outplay human players, AI-Human communication is still limited because of AI’s difficulty with deception and persuasion. This shows that Cicero relies on strategy and has not yet reached the full promise of communicative and cooperative AI.

Do LLMs Generate Creative and Visually Accessible Data visualisations?
Clarissa Miranda-Pena | Andrew Reeson | Cécile Paris | Josiah Poon | Jonathan K. Kummerfeld
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

Data visualisation is a valuable task that combines careful data processing with creative design. Large Language Models (LLMs) are now capable of responding to a data visualisation request in natural language with code that generates accurate data visualisations (e.g., using Matplotlib), but what about human-centered factors, such as the creativity and accessibility of the data visualisations? In this work, we study human perceptions of creativity in the data visualisations generated by LLMs, and propose metrics for accessibility. We generate a range of visualisations using GPT-4 and Claude-2 with controlled variations in prompt and inference parameters, to encourage the generation of different types of data visualisations for the same data. Subsets of these data visualisations are presented to people in a survey with questions that probe human perceptions of different aspects of creativity and accessibility. We find that the models produce visualisations that are novel, but not surprising. Our results also show that our accessibility metrics are consistent with human judgements. In all respects, the LLMs underperform visualisations produced by human-written code. To go beyond the simplest requests, these models need to become aware of human-centered factors, while maintaining accuracy.

A Comparative Multidimensional Analysis of Empathetic Systems
Andrew Lee | Jonathan K. Kummerfeld | Larry Ann | Rada Mihalcea
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, empathetic dialogue systems have received significant attention.While some researchers have noted limitations, e.g., that these systems tend to generate generic utterances, no study has systematically verified these issues. We survey 21 systems, asking what progress has been made on the task. We observe multiple limitations of current evaluation procedures. Most critically, studies tend to rely on a single non-reproducible empathy score, which inadequately reflects the multidimensional nature of empathy. To better understand the differences between systems, we comprehensively analyze each system with automated methods that are grounded in a variety of aspects of empathy. We find that recent systems lack three important aspects of empathy: specificity, reflection levels, and diversity. Based on our results, we discuss problematic behaviors that may have gone undetected in prior evaluations, and offer guidance for developing future systems.

Do Text-to-Vis Benchmarks Test Real Use of Visualisations?
Hy Nguyen | Xuefei He | Andrew Reeson | Cecile Paris | Josiah Poon | Jonathan K. Kummerfeld
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models are able to generate code for visualisations in response to simple user requests.This is a useful application and an appealing one for NLP research because plots of data provide grounding for language.However, there are relatively few benchmarks, and those that exist may not be representative of what users do in practice.This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark. This shows that new benchmarks are needed to support the development of systems that truly address users’ visualisation needs.These observations will guide future data creation, highlighting which features hold genuine significance for users.

2023

Chat Disentanglement: Data for New Domains and Methods for More Accurate Annotation
Sai R. Gouravajhala | Andrew M. Vernier | Yiming Shi | Zihan Li | Mark S. Ackerman | Jonathan K. Kummerfeld
Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association

Conversation disentanglement is the task of taking a log of intertwined conversations from a shared channel and breaking the log into individual conversations. The standard datasets for disentanglement are in a single domain and were annotated by linguistics experts with careful training for the task. In this paper, we introduce the first multi-domain dataset and a study of annotation by people without linguistics expertise or extensive training. We experiment with several variations in interfaces, conducting user studies with domain experts and crowd workers. We also test a hypothesis from prior work that link-based annotation is more accurate, finding that it actually has comparable accuracy to set-based annotation. Our new dataset will support the development of more useful systems for this task, and our experimental findings suggest that users are capable of improving the usefulness of these systems by accurately annotating their own data.

Empathy Identification Systems are not Accurately Accounting for Context
Andrew Lee | Jonathan K. Kummerfeld | Larry An | Rada Mihalcea
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Understanding empathy in text dialogue data is a difficult, yet critical, skill for effective human-machine interaction. In this work, we ask whether systems are making meaningful progress on this challenge. We consider a simple model that checks if an input utterance is similar to a small set of empathetic examples. Crucially, the model does not look at what the utterance is a response to, i.e., the dialogue context. This model performs comparably to other work on standard benchmarks and even outperforms state-of-the-art models for empathetic rationale extraction by 16.7 points on T-F1 and 4.3 on IOU-F1. This indicates that current systems rely on the surface form of the response, rather than whether it is suitable in context. To confirm this, we create examples with dialogue contexts that change the interpretation of the response and show that current systems continue to label utterances as empathetic. We discuss the implications of our findings, including improvements for empathetic benchmarks and how our model can be an informative baseline.

Interactive Text-to-SQL Generation via Editable Step-by-Step Explanations
Yuan Tian | Zheng Zhang | Zheng Ning | Toby Jia-Jun Li | Jonathan K. Kummerfeld | Tianyi Zhang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Relational databases play an important role in business, science, and more. However, many users cannot fully unleash the analytical power of relational databases, because they are not familiar with database languages such as SQL. Many techniques have been proposed to automatically generate SQL from natural language, but they suffer from two issues: (1) they still make many mistakes, particularly for complex queries, and (2) they do not provide a flexible way for non-expert users to validate and refine incorrect queries. To address these issues, we introduce a new interaction mechanism that allows users to directly edit a step-by-step explanation of a query to fix errors. Our experiments on multiple datasets, as well as a user study with 24 participants, demonstrate that our approach can achieve better performance than multiple SOTA approaches. Our code and datasets are available at https://github.com/magic-YuanTian/STEPS.

2022

Leveraging Similar Users for Personalized Language Modeling with Limited Data
Charles Welch | Chenxi Gu | Jonathan K. Kummerfeld | Veronica Perez-Rosas | Rada Mihalcea
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Personalized language models are designed and trained to capture language patterns specific to individual users. This makes them more accurate at predicting what a user will write. However, when a new user joins a platform and not enough text is available, it is harder to build effective personalized language models. We propose a solution for this problem, using a model trained on users that are similar to a new user. In this paper, we explore strategies for finding the similarity between new users and existing ones and methods for using the data from existing users who are a good match. We further explore the trade-off between available data for new users and how well their language can be modeled.

Using Paraphrases to Study Properties of Contextual Embeddings
Laura Burdick | Jonathan K. Kummerfeld | Rada Mihalcea
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We use paraphrases as a unique source of data to analyze contextualized embeddings, with a particular focus on BERT. Because paraphrases naturally encode consistent word and phrase semantics, they provide a unique lens for investigating properties of embeddings. Using the Paraphrase Database’s alignments, we study words within paraphrases as well as phrase representations. We find that contextual embeddings effectively handle polysemous words, but give synonyms surprisingly different representations in many cases. We confirm previous findings that BERT is sensitive to word order, but find slightly different patterns than prior work in terms of the level of contextualization across BERT’s layers.

2021

Quantifying and Avoiding Unfair Qualification Labour in Crowdsourcing
Jonathan K. Kummerfeld
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Extensive work has argued in favour of paying crowd workers a wage that is at least equivalent to the U.S. federal minimum wage. Meanwhile, research on collecting high quality annotations suggests using a qualification that requires workers to have previously completed a certain number of tasks. If most requesters who pay fairly require workers to have completed a large number of tasks already then workers need to complete a substantial amount of poorly paid work before they can earn a fair wage. Through analysis of worker discussions and guidance for researchers, we estimate that workers spend approximately 2.25 months of full time effort on poorly paid tasks in order to get the qualifications needed for better paid tasks. We discuss alternatives to this qualification and conduct a study of the correlation between qualifications and work quality on two NLP tasks. We find that it is possible to reduce the burden on workers while still collecting high quality data.

Analyzing the Surprising Variability in Word Embedding Stability Across Languages
Laura Burdick | Jonathan K. Kummerfeld | Rada Mihalcea
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Word embeddings are powerful representations that form the foundation of many natural language processing architectures, both in English and in other languages. To gain further insight into word embeddings, we explore their stability (e.g., overlap between the nearest neighbors of a word in different embedding spaces) in diverse languages. We discuss linguistic properties that are related to stability, drawing out insights about correlations with affixing, language gender systems, and other features. This has implications for embedding use, particularly in research that uses them to study language trends.

Exploring Self-Identified Counseling Expertise in Online Support Forums
Allison Lahnala | Yuntian Zhao | Charles Welch | Jonathan K. Kummerfeld | Lawrence C An | Kenneth Resnicow | Rada Mihalcea | Verónica Pérez-Rosas
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Micromodels for Efficient, Explainable, and Reusable Systems: A Case Study on Mental Health
Andrew Lee | Jonathan K. Kummerfeld | Larry An | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2021

Many statistical models have high accuracy on test benchmarks, but are not explainable, struggle in low-resource scenarios, cannot be reused for multiple tasks, and cannot easily integrate domain expertise. These factors limit their use, particularly in settings such as mental health, where it is difficult to annotate datasets and model outputs have significant impact. We introduce a micromodel architecture to address these challenges. Our approach allows researchers to build interpretable representations that embed domain knowledge and provide explanations throughout the model’s decision process. We demonstrate the idea on multiple mental health tasks: depression classification, PTSD classification, and suicidal risk assessment. Our systems consistently produce strong results, even in low-resource scenarios, and are more interpretable than alternative methods.

Learning to Learn End-to-End Goal-Oriented Dialog From Related Dialog Tasks
Janarthanan Rajendran | Jonathan K. Kummerfeld | Satinder Baveja
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

For each goal-oriented dialog task of interest, large amounts of data need to be collected for end-to-end learning of a neural dialog system. Collecting that data is a costly and time-consuming process. Instead, we show that we can use only a small amount of data, supplemented with data from a related dialog task. Naively learning from related data fails to improve performance as the related data can be inconsistent with the target task. We describe a meta-learning based method that selectively learns from the related dialog task data. Our approach leads to significant accuracy improvements in an example dialog task.

2020

Inconsistencies in Crowdsourced Slot-Filling Annotations: A Typology and Identification Methods
Stefan Larson | Adrian Cheung | Anish Mahendran | Kevin Leach | Jonathan K. Kummerfeld
Proceedings of the 28th International Conference on Computational Linguistics

Slot-filling models in task-driven dialog systems rely on carefully annotated training data. However, annotations by crowd workers are often inconsistent or contain errors. Simple solutions like manually checking annotations or having multiple workers label each sample are expensive and waste effort on samples that are correct. If we can identify inconsistencies, we can focus effort where it is needed. Toward this end, we define six inconsistency types in slot-filling annotations. Using three new noisy crowd-annotated datasets, we show that a wide range of inconsistencies occur and can impact system performance if not addressed. We then introduce automatic methods of identifying inconsistencies. Experiments on our new datasets show that these methods effectively reveal inconsistencies in data, though there is further scope for improvement.

Exploring the Value of Personalized Word Embeddings
Charles Welch | Jonathan K. Kummerfeld | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we introduce personalized word embeddings, and examine their value for language modeling. We compare the performance of our proposed prediction model when using personalized versus generic word representations, and study how these representations can be leveraged for improved performance. We provide insight into what types of words can be more accurately predicted when building personalized models. Our results show that a subset of words belonging to specific psycholinguistic categories tend to vary more in their representations across users and that combining generic and personalized word embeddings yields the best performance, with a 4.7% relative reduction in perplexity. Additionally, we show that a language model using personalized word embeddings can be effectively used for authorship attribution.

Compositional Demographic Word Embeddings
Charles Welch | Jonathan K. Kummerfeld | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Word embeddings are usually derived from corpora containing text from many individuals, thus leading to general purpose representations rather than individually personalized representations. While personalized embeddings can be useful to improve language model performance and other language processing tasks, they can only be computed for people with a large amount of longitudinal data, which is not the case for new users. We propose a new form of personalized word embeddings that use demographic-specific word representations derived compositionally from full or partial demographic information for a user (i.e., gender, age, location, religion). We show that the resulting demographic-aware word representations outperform generic word representations on two tasks for English: language modeling and word associations. We further explore the trade-off between the number of available attributes and their relative effectiveness and discuss the ethical implications of using them.

Iterative Feature Mining for Constraint-Based Data Collection to Increase Data Diversity and Model Robustness
Stefan Larson | Anthony Zheng | Anish Mahendran | Rishi Tekriwal | Adrian Cheung | Eric Guldan | Kevin Leach | Jonathan K. Kummerfeld
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Diverse data is crucial for training robust models, but crowdsourced text often lacks diversity as workers tend to write simple variations from prompts. We propose a general approach for guiding workers to write more diverse text by iteratively constraining their writing. We show how prior workflows are special cases of our approach, and present a way to apply the approach to dialog tasks such as intent classification and slot-filling. Using our method, we create more challenging versions of test sets from prior dialog datasets and find dramatic performance drops for standard models. Finally, we show that our approach is complementary to recent work on improving data diversity, and training on data collected with our approach leads to more robust models.

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
Charles Welch | Rada Mihalcea | Jonathan K. Kummerfeld
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initializing with embeddings trained on in-domain data.

A Novel Workflow for Accurately and Efficiently Crowdsourcing Predicate Senses and Argument Labels
Youxuan Jiang | Huaiyu Zhu | Jonathan K. Kummerfeld | Yunyao Li | Walter Lasecki
Findings of the Association for Computational Linguistics: EMNLP 2020

Resources for Semantic Role Labeling (SRL) are typically annotated by experts at great expense. Prior attempts to develop crowdsourcing methods have either had low accuracy or required substantial expert annotation. We propose a new multi-stage crowd workflow that substantially reduces expert involvement without sacrificing accuracy. In particular, we introduce a unique filter stage based on the key observation that crowd workers are able to almost perfectly filter out incorrect options for labels. Our three-stage workflow produces annotations with 95% accuracy for predicate labels and 93% for argument labels, which is comparable to expert agreement. Compared to prior work on crowdsourcing for SRL, we decrease expert effort by 4x, from 56% to 14% of cases. Our approach enables more scalable annotation of SRL, and could enable annotation of NLP tasks that have previously been considered too complex to effectively crowdsource.

2019

An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction
Stefan Larson | Anish Mahendran | Joseph J. Peper | Christopher Clarke | Andrew Lee | Parker Hill | Jonathan K. Kummerfeld | Kevin Leach | Michael A. Laurenzano | Lingjia Tang | Jason Mars
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope—i.e., queries that do not fall into any of the system’s supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. We evaluate a range of benchmark classifiers on our dataset along with several different out-of-scope identification schemes. We find that while the classifiers perform well on in-scope intent classification, they struggle to identify out-of-scope queries. Our dataset and evaluation fill an important gap in the field, offering a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.

Outlier Detection for Improved Data Quality and Diversity in Dialog Systems
Stefan Larson | Anish Mahendran | Andrew Lee | Jonathan K. Kummerfeld | Parker Hill | Michael A. Laurenzano | Johann Hauswald | Lingjia Tang | Jason Mars
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In a corpus of data, outliers are either errors: mistakes in the data that are counterproductive, or are unique: informative samples that improve model robustness. Identifying outliers can lead to better datasets by (1) removing noise in datasets and (2) guiding collection of additional data to fill gaps. However, the problem of detecting both outlier types has received relatively little attention in NLP, particularly for dialog systems. We introduce a simple and effective technique for detecting both erroneous and unique samples in a corpus of short texts using neural sentence embeddings combined with distance-based outlier detection. We also present a novel data collection pipeline built atop our detection technique to automatically and iteratively mine unique data samples while discarding erroneous samples. Experiments show that our outlier detection technique is effective at finding errors while our data collection pipeline yields highly diverse corpora that in turn produce more robust intent classification and slot-filling models.

A Large-Scale Corpus for Conversation Disentanglement
Jonathan K. Kummerfeld | Sai R. Gouravajhala | Joseph J. Peper | Vignesh Athreya | Chulaka Gunasekara | Jatin Ganhotra | Siva Sankalp Patel | Lazaros C Polymenakos | Walter Lasecki
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Disentangling conversations mixed together in a single stream of messages is a difficult task, made harder by the lack of large manually annotated datasets. We created a new dataset of 77,563 messages manually annotated with reply-structure graphs that both disentangle conversations and define internal conversation structure. Our data is 16 times larger than all previously released datasets combined, the first to include adjudication of annotation disagreements, and the first to include context. We use our data to re-examine prior work, in particular, finding that 89% of conversations in a widely used dialogue corpus are either missing messages or contain extra messages. Our manually-annotated data presents an opportunity to develop robust data-driven methods for conversation disentanglement, which will help advance dialogue research.

SLATE: A Super-Lightweight Annotation Tool for Experts
Jonathan K. Kummerfeld
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Many annotation tools have been developed, covering a wide variety of tasks and providing features like user management, pre-processing, and automatic labeling. However, all of these tools use Graphical User Interfaces, and often require substantial effort to install and configure. This paper presents a new annotation tool that is designed to fill the niche of a lightweight interface for users with a terminal-based workflow. SLATE supports annotation at different scales (spans of characters, tokens, and lines, or a document) and of different types (free text, labels, and links), with easily customisable keybindings, and unicode support. In a user study comparing with other tools it was consistently the easiest to install and use. SLATE fills a need not met by existing systems, and has already been used to annotate two corpora, one of which involved over 250 hours of annotation effort.

DSTC7 Task 1: Noetic End-to-End Response Selection
Chulaka Gunasekara | Jonathan K. Kummerfeld | Lazaros Polymenakos | Walter Lasecki
Proceedings of the First Workshop on NLP for Conversational AI

Goal-oriented dialogue in complex domains is an extremely challenging problem and there are relatively few datasets. This task provided two new resources that presented different challenges: one was focused but small, while the other was large but diverse. We also considered several new variations on the next utterance selection problem: (1) increasing the number of candidates, (2) including paraphrases, and (3) not including a correct option in the candidate set. Twenty teams participated, developing a range of neural network models, including some that successfully incorporated external data to boost performance. Both datasets have been publicly released, enabling future work to build on these results, working towards robust goal-oriented dialogue systems.

2018

World Knowledge for Abstract Meaning Representation Parsing
Charles Welch | Jonathan K. Kummerfeld | Song Feng | Rada Mihalcea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Factors Influencing the Surprising Instability of Word Embeddings
Laura Wendlandt | Jonathan K. Kummerfeld | Rada Mihalcea
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this paper, we consider one aspect of embedding spaces, namely their stability. We show that even relatively high frequency words (100-200 occurrences) are often unstable. We provide empirical evidence for how various factors contribute to the stability of word embeddings, and we analyze the effects of stability on downstream tasks.

Effective Crowdsourcing for a New Type of Summarization Task
Youxuan Jiang | Catherine Finegan-Dollak | Jonathan K. Kummerfeld | Walter Lasecki
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Most summarization research focuses on summarizing the entire given text, but in practice readers are often interested in only one aspect of the document or conversation. We propose targeted summarization as an umbrella category for summarization tasks that intentionally consider only parts of the input data. This covers query-based summarization, update summarization, and a new task we propose where the goal is to summarize a particular aspect of a document. However, collecting data for this new task is hard because directly asking annotators (e.g., crowd workers) to write summaries leads to data with low accuracy when there are a large number of facts to include. We introduce a novel crowdsourcing workflow, Pin-Refine, that allows us to collect high-quality summaries for our task, a necessary step for the development of automatic systems.

Data Collection for Dialogue System: A Startup Perspective
Yiping Kang | Yunqi Zhang | Jonathan K. Kummerfeld | Lingjia Tang | Jason Mars
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

Industrial dialogue systems such as Apple Siri and Google Now rely on large scale diverse and robust training data to enable their sophisticated conversation capability. Crowdsourcing provides a scalable and inexpensive way of data collection but collecting high quality data efficiently requires thoughtful orchestration of the crowdsourcing jobs. Prior study of this topic have focused on tasks only in the academia settings with limited scope or only provide intrinsic dataset analysis, lacking indication on how it affects the trained model performance. In this paper, we present a study of crowdsourcing methods for a user intent classification task in our deployed dialogue system. Our task requires classification of 47 possible user intents and contains many intent pairs with subtle differences. We consider different crowdsourcing job types and job prompts and analyze quantitatively the quality of the collected data and the downstream model performance on a test set of real user queries from production logs. Our observation provides insights into designing efficient crowdsourcing jobs and provide recommendations for future dialogue system data collection process.

Improving Text-to-SQL Evaluation Methodology
Catherine Finegan-Dollak | Jonathan K. Kummerfeld | Li Zhang | Karthik Ramanathan | Sesh Sadasivam | Rui Zhang | Dragomir Radev
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

To be informative, an evaluation must measure how well systems generalize to realistic unseen data. We identify limitations of and propose improvements to current evaluations of text-to-SQL systems. First, we compare human-generated and automatically generated questions, characterizing properties of queries necessary for real-world applications. To facilitate evaluation on multiple datasets, we release standardized and improved versions of seven existing datasets and one new text-to-SQL dataset. Second, we show that the current division of data into training and test sets measures robustness to variations in the way questions are asked, but only partially tests how well systems generalize to new queries; therefore, we propose a complementary dataset split for evaluation of future work. Finally, we demonstrate how the common practice of anonymizing variables during evaluation removes an important challenge of the task. Our observations highlight key difficulties, and our methodology enables effective measurement of future development.

2017

Identifying Products in Online Cybercrime Marketplaces: A Dataset for Fine-grained Domain Adaptation
Greg Durrett | Jonathan K. Kummerfeld | Taylor Berg-Kirkpatrick | Rebecca Portnoff | Sadia Afroz | Damon McCoy | Kirill Levchenko | Vern Paxson
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own “fine-grained domain” in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.

Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection
Youxuan Jiang | Jonathan K. Kummerfeld | Walter S. Lasecki
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Linguistically diverse datasets are critical for training and evaluating robust machine learning systems, but data collection is a costly process that often requires experts. Crowdsourcing the process of paraphrase generation is an effective means of expanding natural language datasets, but there has been limited analysis of the trade-offs that arise when designing tasks. In this paper, we present the first systematic study of the key factors in crowdsourcing paraphrase collection. We consider variations in instructions, incentives, data domains, and workflows. We manually analyzed paraphrases for correctness, grammaticality, and linguistic diversity. Our observations provide new insight into the trade-offs between accuracy and diversity in crowd responses that arise as a result of task design, providing guidance for future paraphrase generation procedures.

Parsing with Traces: An O(n4) Algorithm and a Structural Representation
Jonathan K. Kummerfeld | Dan Klein
Transactions of the Association for Computational Linguistics, Volume 5

General treebank analyses are graph structured, but parsers are typically restricted to tree structures for efficiency and modeling reasons. We propose a new representation and algorithm for a class of graph structures that is flexible enough to cover almost all treebank structures, while still admitting efficient learning and inference. In particular, we consider directed, acyclic, one-endpoint-crossing graph structures, which cover most long-distance dislocation, shared argumentation, and similar tree-violating linguistic phenomena. We describe how to convert phrase structure parses, including traces, to our new representation in a reversible manner. Our dynamic program uniquely decomposes structures, is sound and complete, and covers 97.3% of the Penn English Treebank. We also implement a proof-of-concept parser that recovers a range of null elements and trace types.

2015

An Empirical Analysis of Optimization for Max-Margin NLP
Jonathan K. Kummerfeld | Taylor Berg-Kirkpatrick | Dan Klein
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2013

Error-Driven Analysis of Challenges in Coreference Resolution
Jonathan K. Kummerfeld | Dan Klein
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

An Empirical Examination of Challenges in Chinese Parsing
Jonathan K. Kummerfeld | Daniel Tse | James R. Curran | Dan Klein
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output
Jonathan K. Kummerfeld | David Hall | James R. Curran | Dan Klein
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Robust Conversion of CCG Derivations to Phrase Structure Trees
Jonathan K. Kummerfeld | Dan Klein | James R. Curran
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

Mention Detection: Heuristics for the OntoNotes annotations
Jonathan K. Kummerfeld | Mohit Bansal | David Burkett | Dan Klein
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

2010

Morphological Analysis Can Improve a CCG Parser for English
Matthew Honnibal | Jonathan K. Kummerfeld | James R. Curran
Coling 2010: Posters

Faster Parsing by Supertagger Adaptation
Jonathan K. Kummerfeld | Jessika Roesner | Tim Dawborn | James Haggerty | James R. Curran | Stephen Clark
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

Faster parsing and supertagging model estimation
Jonathan K. Kummerfeld | Jessika Roesner | James Curran
Proceedings of the Australasian Language Technology Association Workshop 2009

2008

Classification of Verb Particle Constructions with the Google Web1T Corpus
Jonathan K. Kummerfeld | James R. Curran
Proceedings of the Australasian Language Technology Association Workshop 2008

Co-authors

Stefan Larson 4

Anish Mahendran 4

Verónica Pérez-Rosas 4

Jordan Lee Boyd-Graber 3

Laura Burdick 3

Youxuan Jiang 3

Denis Peskoff 3

Wichayaporn Wongkamjan 3

Taylor Berg-Kirkpatrick 2

Adrian Cheung 2

Catherine Finegan-Dollak 2

Sai R. Gouravajhala 2

Chulaka Gunasekara 2

Michael A. Laurenzano 2

Joseph J. Peper 2

Andrew Reeson 2

Jessika Roesner 2

Mark S. Ackerman 1

Lawrence C An 1

Vignesh Athreya 1

Satinder Baveja 1

David Burkett 1

Stephen Clark 1

Christopher Clarke 1

James C. Douglas 1

Jatin Ganhotra 1

James Haggerty 1

Johann Hauswald 1

Ulf Hermjakob 1

Matthew Honnibal 1

Allison Lahnala 1

Kirill Levchenko 1

Toby Jia-Jun Li 1

Clarissa Miranda-Pena 1

Siva Sankalp Patel 1

Lazaros C. Polymenakos 1

Lazaros Polymenakos 1

Rebecca Portnoff 1

Dragomir Radev 1

Janarthanan Rajendran 1

Karthik Ramanathan 1

Kenneth Resnicow 1

Jade Robinson 1

Maciej Rybinski 1

Sesh Sadasivam 1

Brandon M. Stewart 1

Rishi Tekriwal 1

Andrew M. Vernier 1

Anthony Zheng 1

Venues