Rebecca J. Passonneau

Also published as: Rebecca Passonneau

2025

Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta | Candace Ross | David Pantoja | Rebecca J. Passonneau | Megan Ung | Adina Williams
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to select a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and lower quality examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART Filtering on three multiple choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48% on average, while increasing Pearson correlation with rankings from ChatBot Arena, a more open-ended human evaluation setting. Our method enables us to be more efficient, whether we are using SMART Filtering to make new benchmarks more challenging, or to revitalize older, human generated datasets, while still preserving the relative model rankings.

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil | Vipul Gupta | Sarkar Snigdha Sarathi Das | Rebecca Passonneau
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)

Large language models (LLMs) have become ubiquitous, thus it is important to understand their risks and limitations, such as their propensity to generate harmful output. This includes smaller LLMs, which are important for settings with constrained compute resources, such as edge devices. Detection of LLM harm typically requires human annotation, which is expensive to collect. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we compare harm annotation from three state-of-the-art large LLMs with each other and with humans. We find that the smaller models differ with respect to harmfulness. We also find that large LLMs show low to moderate agreement with humans.

2024

Sociodemographic Bias in Language Models: A Survey and Forward Path
Vipul Gupta | Pranav Narayanan Venkit | Shomir Wilson | Rebecca Passonneau
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Sociodemographic bias in language models (LMs) has the potential for harm when deployed in real-world settings. This paper presents a comprehensive survey of the past decade of research on sociodemographic bias in LMs, organized into a typology that facilitates examining the different aims: types of bias, quantifying bias, and debiasing techniques. We track the evolution of the latter two questions, then identify current trends and their limitations, as well as emerging techniques. To guide future research towards more effective and reliable solutions, and to help authors situate their work within this broad landscape, we conclude with a checklist of open questions.

2023

The Sentiment Problem: A Critical Survey towards Deconstructing Sentiment Analysis
Pranav Venkit | Mukund Srinath | Sanjana Gautam | Saranya Venkatraman | Vipul Gupta | Rebecca Passonneau | Shomir Wilson
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We conduct an inquiry into the sociotechnical aspects of sentiment analysis (SA) by critically examining 189 peer-reviewed papers on their applications, models, and datasets. Our investigation stems from the recognition that SA has become an integral component of diverse sociotechnical systems, exerting influence on both social and technical users. By delving into sociological and technological literature on sentiment, we unveil distinct conceptualizations of this term in domains such as finance, government, and medicine. Our study exposes a lack of explicit definitions and frameworks for characterizing sentiment, resulting in potential challenges and biases. To tackle this issue, we propose an ethics sheet encompassing critical inquiries to guide practitioners in ensuring equitable utilization of SA. Our findings underscore the significance of adopting an interdisciplinary approach to defining sentiment in SA and offer a pragmatic solution for its implementation.

Answer-state Recurrent Relational Network (AsRRN) for Constructed Response Assessment and Feedback Grouping
Zhaohui Li | Susan Lloyd | Matthew Beckman | Rebecca Passonneau
Findings of the Association for Computational Linguistics: EMNLP 2023

STEM educators must trade off the ease of assessing selected response (SR) questions, like multiple choice, with constructed response (CR) questions, where students articulate their own reasoning. Our work addresses a CR type new to NLP but common in college STEM, consisting of multiple questions per context. To relate the context, the questions, the reference responses, and students’ answers, we developed an Answer-state Recurrent Relational Network (AsRRN). In recurrent time-steps, relation vectors are learned for specific dependencies in a computational graph, where the nodes encode the distinct types of text input. AsRRN incorporates contrastive loss for better representation learning, which improves performance and supports student feedback. AsRRN was developed on a new dataset of 6,532 student responses to three, two-part CR questions. AsRRN outperforms classifiers based on LLMs, a previous relational network for CR questions, and few-shot learning with GPT-3.5. Ablation studies show the distinct contributions of AsRRN’s dependency structure, the number of time steps in the recurrence, and the contrastive loss.

2022

CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning
Sarkar Snigdha Sarathi Das | Arzoo Katiyar | Rebecca Passonneau | Rui Zhang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Named Entity Recognition (NER) in Few-Shot setting is imperative for entity tagging in low resource domains. Existing approaches only learn class-specific semantic features and intermediate representations from source domains. This affects generalizability to unseen target domains, resulting in suboptimal performances. To this end, we present CONTaiNER, a novel contrastive learning technique that optimizes the inter-token distribution distance for Few-Shot NER. Instead of optimizing class-specific attributes, CONTaiNER optimizes a generalized objective of differentiating between token categories based on their Gaussian-distributed embeddings. This effectively alleviates overfitting issues originating from training domains. Our experiments in several traditional test domains (OntoNotes, CoNLL’03, WNUT ’17, GUM) and a new large scale Few-Shot NER dataset (Few-NERD) demonstrate that on average, CONTaiNER outperforms previous methods by 3%-13% absolute F1 points while showing consistent performance trends, even in challenging scenarios where previous approaches could not achieve appreciable performance.

A POMDP Dialogue Policy with 3-way Grounding and Adaptive Sensing for Learning through Communication
Maryam Zare | Alan Wagner | Rebecca Passonneau
Findings of the Association for Computational Linguistics: EMNLP 2022

Agents to assist with rescue, surgery, and similar activities could collaborate better with humans if they could learn new strategic behaviors through communication. We introduce a novel POMDP dialogue policy for learning from people. The policy has 3-way grounding of language in the shared physical context, the dialogue context, and persistent knowledge. It can learn distinct but related games, and can continue learning across dialogues for complex games. A novel sensing component supports adaptation to information-sharing differences across people. The single policy performs better than oracle policies customized to specific games and information behavior.

Contrastive Data and Learning for Natural Language Processing
Rui Zhang | Yangfeng Ji | Yue Zhang | Rebecca J. Passonneau
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts

Current NLP models heavily rely on effective representation learning algorithms. Contrastive learning is one such technique to learn an embedding space such that similar data sample pairs have close representations while dissimilar samples stay far apart from each other. It can be used in supervised or unsupervised settings using different loss functions to produce task-specific or general-purpose representations. While it has originally enabled the success for vision tasks, recent years have seen a growing number of publications in contrastive NLP. This first line of works not only delivers promising performance improvements in various NLP tasks, but also provides desired characteristics such as task-agnostic sentence representation, faithful text generation, data-efficient learning in zero-shot and few-shot settings, interpretability and explainability. In this tutorial, we aim to provide a gentle introduction to the fundamentals of contrastive learning approaches and the theory behind them. We then survey the benefits and the best practices of contrastive learning for various downstream NLP applications including Text Classification, Question Answering, Summarization, Text Generation, Interpretability and Explainability, Commonsense Knowledge and Reasoning, Vision-and-Language. This tutorial intends to help researchers in the NLP and computational linguistics community to understand this emerging topic and promote future research directions of using contrastive learning for NLP applications.

2021

ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences
Yanjun Gao | Ting-Hao Huang | Rebecca J. Passonneau
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves performance comparable to two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.

A Semantic Feature-Wise Transformation Relation Network for Automatic Short Answer Grading
Zhaohui Li | Yajur Tomar | Rebecca J. Passonneau
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Automatic short answer grading (ASAG) is the task of assessing students’ short natural language responses to objective questions. It is a crucial component of new education platforms, and could support more wide-spread use of constructed response questions to replace cognitively less challenging multiple choice questions. We propose a Semantic Feature-wise transformation Relation Network (SFRN) that exploits the multiple components of ASAG datasets more effectively. SFRN captures relational knowledge among the questions (Q), reference answers or rubrics (R), and labeled student answers (A). A relation network learns vector representations for the elements of QRA triples, then combines the learned representations using learned semantic feature-wise transformations. We apply translation-based data augmentation to address the two problems of limited training data, and high data skew for multi-class ASAG tasks. Our model has up to 11% performance improvement over state-of-the-art results on the benchmark SemEval-2013 datasets, and surpasses custom approaches designed for a Kaggle challenge, demonstrating its generality.

Learning Clause Representation from Dependency-Anchor Graph for Connective Prediction
Yanjun Gao | Ting-Hao Huang | Rebecca J. Passonneau
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

Semantic representation that supports the choice of an appropriate connective between pairs of clauses inherently addresses discourse coherence, which is important for tasks such as narrative understanding, argumentation, and discourse parsing. We propose a novel clause embedding method that applies graph learning to a data structure we refer to as a dependency-anchor graph. The dependency-anchor graph incorporates two kinds of syntactic information, constituency structure and dependency relations, to highlight the subject and verb phrase relation. This enhances coherence-related aspects of representation. We design a neural model to learn a semantic representation for clauses from graph convolution over latent representations of the subject and verb phrase. We evaluate our method on two new datasets: a subset of a large corpus where the source texts are published novels, and a new dataset collected from students’ essays. The results demonstrate a significant improvement over tree-based models, confirming the importance of emphasizing the subject and verb phrase. The performance gap between the two datasets illustrates the challenges of analyzing students’ written text, plus a potential evaluation task for coherence modeling and an application for suggesting revisions to students.

2020

Dialogue Policies for Learning Board Games through Multimodal Communication
Maryam Zare | Ali Ayub | Aishan Liu | Sweekar Sudhakara | Alan Wagner | Rebecca Passonneau
Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue

This paper presents MDP policy learning for agents to learn strategic behavior–how to play board games–during multimodal dialogues. Policies are trained offline in simulation, with dialogues carried out in a formal language. The agent has a temporary belief state for the dialogue, and a persistent knowledge store represented as an extensive-form game tree. How well the agent learns a new game from a dialogue with a simulated partner is evaluated by how well it plays the game, given its dialogue-final knowledge state. During policy training, we control for the simulated dialogue partner’s level of informativeness in responding to questions. The agent learns best when its trained policy matches the current dialogue partner’s informativeness. We also present a novel data collection for training natural language modules. Human subjects who engaged in dialogues with a baseline system rated the system’s language skills as above average. Further, results confirm that human dialogue partners also vary in their informativeness.

2019

Automated Pyramid Summarization Evaluation
Yanjun Gao | Chen Sun | Rebecca J. Passonneau
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Pyramid evaluation was developed to assess the content of paragraph length summaries of source texts. A pyramid lists the distinct units of content found in several reference summaries, weights content units by how many reference summaries they occur in, and produces three scores based on the weighted content of new summaries. We present an automated method that is more efficient, more transparent, and more complete than previous automated pyramid methods. It is tested on a new dataset of student summaries, and historical NIST data from extractive summarizers.

Rubric Reliability and Annotation of Content and Argument in Source-Based Argument Essays
Yanjun Gao | Alex Driban | Brennan Xavier McManus | Elena Musi | Patricia Davies | Smaranda Muresan | Rebecca J. Passonneau
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present a unique dataset of student source-based argument essays to facilitate research on the relations between content, argumentation skills, and assessment. Two classroom writing assignments were given to college students in a STEM major, accompanied by a carefully designed rubric. The paper presents a reliability study of the rubric, showing it to be highly reliable, and initial annotation of the essays for content and argumentation.

2018

PyrEval: An Automated Method for Summary Content Analysis
Yanjun Gao | Andrew Warner | Rebecca Passonneau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts
Mohit Bansal | Rebecca Passonneau
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Automated Content Analysis: A Case Study of Computer Science Student Summaries
Yanjun Gao | Patricia M. Davies | Rebecca J. Passonneau
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

Technology is transforming Higher Education learning and teaching. This paper reports on a project to examine how and why automated content analysis could be used to assess precis writing by university students. We examine the case of one hundred and twenty-two summaries written by computer science freshmen. The texts, which had been hand scored using a teacher-designed rubric, were autoscored using the Natural Language Processing software, PyrEval. Pearson’s correlation coefficient and Spearman rank correlation were used to analyze the relationship between the teacher score and the PyrEval score for each summary. Three content models automatically constructed by PyrEval from different sets of human reference summaries led to consistent correlations, showing that the approach is reliable. Also observed was that, in cases where the focus of student assessment centers on formative feedback, categorizing the PyrEval scores by examining the average and standard deviations could lead to novel interpretations of their relationships. It is suggested that this project has implications for the ways in which automated content analysis could be used to help university students improve their summarization skills.

2015

Estimation of Discourse Segmentation Labels from Crowd Data
Ziheng Huang | Jialu Zhong | Rebecca J. Passonneau
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Abstractive Multi-Document Summarization via Phrase Selection and Merging
Lidong Bing | Piji Li | Yi Liao | Wai Lam | Weiwei Guo | Rebecca Passonneau
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

Biber Redux: Reconsidering Dimensions of Variation in American English
Rebecca J. Passonneau | Nancy Ide | Songqiao Su | Jesse Stuart
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Annotating the MASC Corpus with BabelNet
Andrea Moro | Roberto Navigli | Francesco Maria Tucci | Rebecca J. Passonneau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we tackle the problem of automatically annotating, with both word senses and named entities, the MASC 3.0 corpus, a large English corpus covering a wide range of genres of written and spoken text. We use BabelNet 2.0, a multilingual semantic network which integrates both lexicographic and encyclopedic knowledge, as our sense/entity inventory together with its semantic structure, to perform the aforementioned annotation task. Word sense annotated corpora have been around for more than twenty years, helping the development of Word Sense Disambiguation algorithms by providing both training and testing grounds. More recently Entity Linking has followed the same path, with the creation of huge resources containing annotated named entities. However, to date, there has been no resource that contains both kinds of annotation. In this paper we present an automatic approach for performing this annotation, together with its output on the MASC corpus. We use this corpus because its goal of integrating different types of annotations goes exactly in our same direction. Our overall aim is to stimulate research on the joint exploitation and disambiguation of word senses and named entities. Finally, we estimate the quality of our annotations using both manually-tagged named entities and word senses, obtaining an accuracy of roughly 70% for both named entities and word sense annotations.

The Benefits of a Model of Annotation
Rebecca J. Passonneau | Bob Carpenter
Transactions of the Association for Computational Linguistics, Volume 2

Standard agreement measures for interannotator reliability are neither necessary nor sufficient to ensure a high quality corpus. In a case study of word sense annotation, conventional methods for evaluating labels from trained annotators are contrasted with a probabilistic annotation model applied to crowdsourced data. The annotation model provides far more information, including a certainty measure for each gold standard label; the crowdsourced data was collected at less than half the cost of the conventional approach.

Aspectual Properties of Conversational Activities
Rebecca J. Passonneau | Boxuan Guan | Cho Ho Yeung | Yuan Du | Emma Conner
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

2013

Open Dialogue Management for Relational Databases
Ben Hixon | Rebecca J. Passonneau
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Semantic Frames to Predict Stock Price Movement
Boyi Xie | Rebecca J. Passonneau | Leon Wu | Germán G. Creamer
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automated Pyramid Scoring of Summaries using Distributional Semantics
Rebecca J. Passonneau | Emily Chen | Weiwei Guo | Dolores Perin
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The Benefits of a Model of Annotation
Rebecca J. Passonneau | Bob Carpenter
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

The MASC Word Sense Corpus
Rebecca J. Passonneau | Collin F. Baker | Christiane Fellbaum | Nancy Ide
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The MASC project has produced a multi-genre corpus with multiple layers of linguistic annotation, together with a sentence corpus containing WordNet 3.1 sense tags for 1000 occurrences of each of 100 words produced by multiple annotators, accompanied by in-depth inter-annotator agreement data. Here we give an overview of the contents of MASC and then focus on the word sense sentence corpus, describing the characteristics that differentiate it from other word sense corpora and detailing the inter-annotator agreement studies that have been performed on the annotations. Finally, we discuss the potential to grow the word sense sentence corpus through crowdsourcing and the plan to enhance the content and annotations of MASC through a community-based collaborative effort.

Empirical Comparisons of MASC Word Sense Annotations
Gerard de Melo | Collin F. Baker | Nancy Ide | Rebecca J. Passonneau | Christiane Fellbaum
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We analyze how different conceptions of lexical semantics affect sense annotations and how multiple sense inventories can be compared empirically, based on annotated text. Our study focuses on the MASC project, where data has been annotated using WordNet sense identifiers on the one hand, and FrameNet lexical units on the other. This allows us to compare the sense inventories of these lexical resources empirically rather than just theoretically, based on their glosses, leading to new insights. In particular, we compute contingency matrices and develop a novel measure, the Expected Jaccard Index, that quantifies the agreement between annotations of the same data based on two different resources even when they have different sets of categories.

Semantic Specificity in Spoken Dialogue Requests
Ben Hixon | Rebecca J. Passonneau | Susan L. Epstein
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2011

Sentiment Analysis of Twitter Data
Apoorv Agarwal | Boyi Xie | Ilia Vovsha | Owen Rambow | Rebecca Passonneau
Proceedings of the Workshop on Language in Social Media (LSM 2011)

Proceedings of the SIGDIAL 2011 Conference
Joyce Y. Chai | Johanna D. Moore | Rebecca J. Passonneau | David R. Traum
Proceedings of the SIGDIAL 2011 Conference

Embedded Wizardry
Rebecca J. Passonneau | Susan L. Epstein | Tiziana Ligorio | Joshua Gordon
Proceedings of the SIGDIAL 2011 Conference

Learning to Balance Grounding Rationales for Dialogue Systems
Joshua Gordon | Rebecca J. Passonneau | Susan L. Epstein
Proceedings of the SIGDIAL 2011 Conference

PARADISE-style Evaluation of a Human-Human Library Corpus
Rebecca J. Passonneau | Irene Alvarado | Phil Crone | Simon Jerome
Proceedings of the SIGDIAL 2011 Conference

2010

Word Sense Annotation of Polysemous Words by Multiple Annotators
Rebecca J. Passonneau | Ansaf Salleb-Aouissi | Vikas Bhardwaj | Nancy Ide
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe results of a word sense annotation task using WordNet, involving half a dozen well-trained annotators on ten polysemous words for three parts of speech. One hundred sentences for each word were annotated. Annotators had the same level of training and experience, but interannotator agreement (IA) varied across words. There was some effect of part of speech, with higher agreement on nouns and adjectives, but within the words for each part of speech there was wide variation. This variation in IA does not correlate with number of senses in the inventory, or the number of senses actually selected by annotators. In fact, IA was sometimes quite high for words with many senses. We claim that the IA variation is due to the word meanings, contexts of use, and individual differences among annotators. We find some correlation of IA with sense confusability as measured by a sense confusion threshold (CT). Data mining for association rules on a flattened data representation indicating each annotator's sense choices identifies outliers for some words, and systematic differences among pairs of annotators on others.

An Evaluation Framework for Natural Language Understanding in Spoken Dialogue Systems
Joshua B. Gordon | Rebecca J. Passonneau
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present an evaluation framework to enable developers of information-seeking, transaction-based spoken dialogue systems to compare the robustness of natural language understanding (NLU) approaches across varying levels of word error rate and contrasting domains. We develop statistical and semantic parsing based approaches to dialogue act identification and concept retrieval. Voice search is used in each approach to ultimately query the database. Included in the framework is a method for developers to bootstrap a representative pseudo-corpus, which is used to estimate NLU performance in a new domain. We illustrate the relative merits of these NLU techniques by contrasting our statistical NLU approach with a semantic parsing method over two contrasting applications, our CheckItOut library system and the deployed Let's Go Public! system, across four levels of word error rate. We find that with respect to both dialogue act identification and concept retrieval, our statistical NLU approach is more likely to robustly accommodate the freer form, less constrained utterances of CheckItOut at higher word error rates than is possible with semantic parsing.

Learning about Voice Search for Spoken Dialogue Systems
Rebecca Passonneau | Susan L. Epstein | Tiziana Ligorio | Joshua B. Gordon | Pravin Bhutada
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

The Manually Annotated Sub-Corpus: A Community Resource for and by the People
Nancy Ide | Collin Baker | Christiane Fellbaum | Rebecca Passonneau
Proceedings of the ACL 2010 Conference Short Papers

Annotation Scheme for Social Network Extraction from Text
Apoorv Agarwal | Owen C. Rambow | Rebecca J. Passonneau
Proceedings of the Fourth Linguistic Annotation Workshop

Anveshan: A Framework for Analysis of Multiple Annotators’ Labeling Behavior
Vikas Bhardwaj | Rebecca Passonneau | Ansaf Salleb-Aouissi | Nancy Ide
Proceedings of the Fourth Linguistic Annotation Workshop

2009

Making Sense of Word Sense Variation
Rebecca Passonneau | Ansaf Salleb-Aouissi | Nancy Ide
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

Contrasting the Interaction Structure of an Email and a Telephone Corpus: A Machine Learning Approach to Annotation of Dialogue Function Units
Jun Hu | Rebecca Passonneau | Owen Rambow
Proceedings of the SIGDIAL 2009 Conference

2008

MASC: the Manually Annotated Sub-Corpus of American English
Nancy Ide | Collin Baker | Christiane Fellbaum | Charles Fillmore | Rebecca Passonneau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing maximum accessibility for researchers from around the globe.

pdf bib abs
Relation between Agreement Measures on Human Labeling and Machine Learning Performance: Results from an Art History Domain
Rebecca Passonneau | Tom Lippincott | Tae Yano | Judith Klavans
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We discuss factors that affect human agreement on a semantic labeling task in the art history domain, based on the results of four experiments where we varied the number of labels annotators could assign, the number of annotators, the type and amount of training they received, and the size of the text span being labeled. Using the labelings from one experiment involving seven annotators, we investigate the relation between interannotator agreement and machine learning performance. We construct binary classifiers and vary the training and test data by swapping the labelings from the seven annotators. First, we find performance is often quite good despite lower than recommended interannotator agreement. Second, we find that on average, learning performance for a given functional semantic category correlates with the overall agreement among the seven annotators for that category. Third, we find that learning performance on the data from a given annotator does not correlate with the quality of that annotator's labeling. We offer recommendations for the use of labeled data in machine learning, and argue that learners should attempt to accommodate human variation. We also note implications for large scale corpus annotation projects that deal with similarly subjective phenomena.

2007

2006

pdf bib abs
Inter-annotator Agreement on a Multilingual Semantic Annotation Task
Rebecca Passonneau | Nizar Habash | Owen Rambow
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Six sites participated in the Interlingual Annotation of Multilingual Text Corpora (IAMTC) project (Dorr et al., 2004; Farwell et al., 2004; Mitamura et al., 2004). Parsed versions of English translations of news articles in Arabic, French, Hindi, Japanese, Korean and Spanish were annotated by up to ten annotators. Their task was to match open-class lexical items (nouns, verbs, adjectives, adverbs) to one or more concepts taken from the Omega ontology (Philpot et al., 2003), and to identify theta roles for verb arguments. The annotated corpus is intended to be a resource for meaning-based approaches to machine translation. Here we discuss inter-annotator agreement for the corpus. The annotation task is characterized by annotators' freedom to select multiple concepts or roles per lexical item. As a result, the annotation categories are sets, the number of which is bounded only by the number of distinct annotator-lexical item pairs. We use a reliability metric designed to handle partial agreement between sets. The best results pertain to the part of the ontology derived from WordNet. We examine change over the course of the project, differences among annotators, and differences across parts of speech. Our results suggest a strong learning effect early in the project.

pdf bib abs
Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation
Rebecca Passonneau
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Annotation projects dealing with complex semantic or pragmatic phenomena face the dilemma of creating annotation schemes that oversimplify the phenomena, or that capture distinctions conventional reliability metrics cannot measure adequately. The solution to the dilemma is to develop metrics that quantify the decisions that annotators are asked to make. This paper discusses MASI, a distance metric for comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs) generate models referred to as pyramids which can be used to evaluate unseen human summaries or machine summaries. The paper presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC). The annotators worked independently of each other. Differences between application of MASI to pyramid annotation and its previous application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena. In effect, what counts as sufficiently reliable inter-annotator agreement depends on the use the annotated data will be put to.

pdf bib abs
CLiMB ToolKit: A Case Study of Iterative Evaluation in a Multidisciplinary Project
Rebecca Passonneau | Roberta Blitz | David Elson | Angela Giral | Judith Klavans
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Digital image collections in libraries and other curatorial institutions grow too rapidly to create new descriptive metadata for subject matter search or browsing. CLiMB (Computational Linguistics for Metadata Building) was a project designed to address this dilemma that involved computer scientists, linguists, librarians, and art librarians. The CLiMB project followed an iterative evaluation model: each next phase of the project emerged from the results of an evaluation. After assembling a suite of text processing tools to be used in extracting metadata, we conducted a formative evaluation with thirteen participants, using a survey in which we varied the order and type of four conditions under which respondents would propose or select image search terms. Results of the formative evaluation led us to conclude that a CLiMB ToolKit would work best if its main function was to propose terms for users to review. After implementing a prototype ToolKit using a browser interface, we conducted an evaluation with ten experts. Users found the ToolKit very habitable, remained consistently satisfied throughout a lengthy evaluation, and selected a large number of terms per image.