2018
Learning from Measurements in Crowdsourcing Models: Inferring Ground Truth from Diverse Annotation Types
Paul Felt
|
Eric Ringger
|
Jordan Boyd-Graber
|
Kevin Seppi
Proceedings of the 27th International Conference on Computational Linguistics
Annotated corpora enable supervised machine learning and data analysis. To reduce the cost of manual annotation, tasks are often assigned to internet workers whose judgments are reconciled by crowdsourcing models. We approach the problem of crowdsourcing using a framework for learning from rich prior knowledge, and we identify a family of crowdsourcing models with the novel ability to combine annotations with differing structures: e.g., document labels and word labels. Annotator judgments are given in the form of the predicted expected value of measurement functions computed over annotations and the data, unifying annotation models. Our model, a specific instance of this framework, compares favorably with previous work. Furthermore, it enables active sample selection, jointly selecting annotator, data item, and annotation structure to reduce annotation effort.
2016
Semantic Annotation Aggregation with Conditional Crowdsourcing Models and Word Embeddings
Paul Felt
|
Eric Ringger
|
Kevin Seppi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
In modern text annotation projects, crowdsourced annotations are often aggregated using item response models or by majority vote. Recently, item response models enhanced with generative data models have been shown to yield substantial benefits over those with conditional or no data models. However, suitable generative data models do not exist for many tasks, such as semantic labeling tasks. When no generative data model exists, we demonstrate that similar benefits may be derived by conditionally modeling documents that have been previously embedded in a semantic space using recent work in vector space models. We use this approach to show state-of-the-art results on a variety of semantic annotation aggregation tasks.
Fast Inference for Interactive Models of Text
Jeffrey Lund
|
Paul Felt
|
Kevin Seppi
|
Eric Ringger
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Probabilistic models are a useful means for analyzing large text corpora. Integrating such models with human interaction enables many new use cases. However, adding human interaction to probabilistic models requires inference algorithms which are both fast and accurate. We explore the use of Iterated Conditional Modes as a fast alternative to Gibbs sampling or variational EM. We demonstrate superior performance both in run time and model quality on three different models of text including a DP Mixture of Multinomials for web search result clustering, the Interactive Topic Model, and MomResp, a multinomial crowdsourcing model.
2015
An Analytic and Empirical Evaluation of Return-on-Investment-Based Active Learning
Robbie Haertel
|
Eric Ringger
|
Kevin Seppi
|
Paul Felt
Proceedings of the 9th Linguistic Annotation Workshop
Is Your Anchor Going Up or Down? Fast and Accurate Supervised Topic Models
Thang Nguyen
|
Jordan Boyd-Graber
|
Jeffrey Lund
|
Kevin Seppi
|
Eric Ringger
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Early Gains Matter: A Case for Preferring Generative over Discriminative Crowdsourcing Models
Paul Felt
|
Kevin Black
|
Eric Ringger
|
Kevin Seppi
|
Robbie Haertel
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Making the Most of Crowdsourced Document Annotations: Confused Supervised LDA
Paul Felt
|
Eric Ringger
|
Jordan Boyd-Graber
|
Kevin Seppi
Proceedings of the Nineteenth Conference on Computational Natural Language Learning
2014
MomResp: A Bayesian Model for Multi-Annotator Document Labeling
Paul Felt
|
Robbie Haertel
|
Eric Ringger
|
Kevin Seppi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Data annotation in modern practice often involves multiple, imperfect human annotators. Multiple annotations can be used to infer estimates of the ground-truth labels and to estimate individual annotator error characteristics (or reliability). We introduce MomResp, a model that incorporates information from both natural data clusters and annotations from multiple annotators to infer ground-truth labels and annotator reliability for the document classification task. We implement this model and show dramatic improvements over majority vote in situations where annotations are scarce and annotation quality is low, as well as in situations where annotators disagree consistently. Because MomResp predictions are subject to label switching, we introduce a solution that finds nearly optimal predicted class reassignments in a variety of settings using only information available to the model at inference time. Although MomResp does not perform well in annotation-rich situations, we show evidence suggesting how this shortcoming may be overcome in future work.
Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage
Kevin Black
|
Eric Ringger
|
Paul Felt
|
Kevin Seppi
|
Kristian Heal
|
Deryle Lonsdale
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The task of corpus-dictionary linkage (CDL) is to annotate each word in a corpus with a link to an appropriate dictionary entry that documents the sense and usage of the word. Corpus-dictionary linked resources include concordances, dictionaries with word usage examples, and corpora annotated with lemmas or word-senses. Such CDL resources are essential in learning a language and in linguistic research, translation, and philology. Lemmatization is a common approximation to automating corpus-dictionary linkage, where lemmas are treated as dictionary entry headwords. We intend to use data-driven lemmatization models to provide machine assistance to human annotators in the form of pre-annotations, and thereby reduce the costs of CDL annotation. In this work we adapt the discriminative string transducer DirecTL+ to perform lemmatization for classical Syriac, a low-resource language. We compare the accuracy of DirecTL+ with the Morfette discriminative lemmatizer. DirecTL+ achieves 96.92% overall accuracy, a margin of only 0.86% over Morfette, at the cost of a longer training time. Error analysis on the models provides guidance on how to apply these models in a machine assistance setting for corpus-dictionary linkage.
Using Transfer Learning to Assist Exploratory Corpus Annotation
Paul Felt
|
Eric Ringger
|
Kevin Seppi
|
Kristian Heal
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We describe an under-studied problem in language resource management: that of providing automatic assistance to annotators working in exploratory settings. When no satisfactory tagset already exists, such as in under-resourced or undocumented languages, it must be developed iteratively while annotating data. This process naturally gives rise to a sequence of datasets, each annotated differently. We argue that this problem is best regarded as a transfer learning problem with multiple source tasks. Using part-of-speech tagging data with simulated exploratory tagsets, we demonstrate that even simple transfer learning techniques can significantly improve the quality of pre-annotations in an exploratory annotation.
2013
Proceedings of the 2013 NAACL HLT Student Research Workshop
Annie Louis
|
Richard Socher
|
Julia Hockenmaier
|
Eric K. Ringger
Proceedings of the 2013 NAACL HLT Student Research Workshop
2012
First Results in a Study Evaluating Pre-annotation and Correction Propagation for Machine-Assisted Syriac Morphological Analysis
Paul Felt
|
Eric Ringger
|
Kevin Seppi
|
Kristian Heal
|
Robbie Haertel
|
Deryle Lonsdale
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Manual annotation of large textual corpora can be cost-prohibitive, especially for rare and under-resourced languages. One potential solution is pre-annotation: asking human annotators to correct sentences that have already been annotated, usually by a machine. Another potential solution is correction propagation: using annotator corrections to bad pre-annotations to dynamically improve the remaining pre-annotations within the current sentence. The research presented in this paper employs a controlled user study to discover under what conditions these two machine-assisted annotation techniques are effective in increasing annotator speed and accuracy and thereby reducing the cost for the task of morphologically annotating texts written in classical Syriac. A preliminary analysis of the data indicates that pre-annotations improve annotator accuracy when they are at least 60% accurate, and annotator speed when they are at least 80% accurate. This research constitutes the first systematic evaluation of pre-annotation and correction propagation together in a controlled user study.
2010
Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
Daniel Walker
|
William B. Lund
|
Eric K. Ringger
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
A Probabilistic Morphological Analyzer for Syriac
Peter McClanahan
|
George Busby
|
Robbie Haertel
|
Kristian Heal
|
Deryle Lonsdale
|
Kevin Seppi
|
Eric Ringger
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Parallel Active Learning: Eliminating Wait Time with Minimal Staleness
Robbie Haertel
|
Paul Felt
|
Eric K. Ringger
|
Kevin Seppi
Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing
Automatic Diacritization for Low-Resource Languages Using a Hybrid Word and Consonant CMM
Robbie Haertel
|
Peter McClanahan
|
Eric K. Ringger
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
CCASH: A Web Application Framework for Efficient, Distributed Language Resource Development
Paul Felt
|
Owen Merkling
|
Marc Carmen
|
Eric Ringger
|
Warren Lemmon
|
Kevin Seppi
|
Robbie Haertel
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We introduce CCASH (Cost-Conscious Annotation Supervised by Humans), an extensible web application framework for cost-efficient annotation. CCASH provides a framework in which cost-efficient annotation methods such as Active Learning (AL) can be explored via user studies and afterwards applied to large annotation projects. CCASH's architecture is described, as well as the technologies that it is built on. CCASH allows custom annotation tasks to be built from a growing set of useful annotation widgets. It also allows annotation methods (such as AL) to be implemented in any language. Being a web application framework, CCASH offers secure centralized data and annotation storage and facilitates collaboration among multiple annotators. By default it records timing information about each annotation and provides facilities for recording custom statistics. The CCASH framework has been used to evaluate a novel annotation strategy presented in a concurrently published paper, and will be used in the future to annotate a large Syriac corpus.
Tag Dictionaries Accelerate Manual Annotation
Marc Carmen
|
Paul Felt
|
Robbie Haertel
|
Deryle Lonsdale
|
Peter McClanahan
|
Owen Merkling
|
Eric Ringger
|
Kevin Seppi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Expert human input can contribute in various ways to facilitate automatic annotation of natural language text. For example, a part-of-speech tagger can be trained on labeled input provided offline by experts. In addition, expert input can be solicited by way of active learning to make the most of annotator expertise. However, hiring individuals to perform manual annotation is costly both in terms of money and time. This paper reports on a user study that was performed to determine the degree of effect that a part-of-speech dictionary has on a group of subjects performing the annotation task. The user study was conducted using a modular, web-based interface created specifically for text annotation tasks. The user study found that for both native and non-native English speakers a dictionary with greater than 60% coverage was effective at reducing annotation time and increasing annotator accuracy. On the basis of this study, we predict that using a part-of-speech tag dictionary with coverage greater than 60% can reduce the cost of annotation in terms of both time and money.
2009
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
Eric Ringger
|
Robbie Haertel
|
Katrin Tomanek
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
2008
Assessing the Costs of Machine-Assisted Corpus Annotation through a User Study
Eric Ringger
|
Marc Carmen
|
Robbie Haertel
|
Kevin Seppi
|
Deryle Lonsdale
|
Peter McClanahan
|
James Carroll
|
Noel Ellison
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Fixed, limited budgets often constrain the amount of expert annotation that can go into the construction of annotated corpora. Estimating the cost of annotation is the first step toward using annotation resources wisely. We present here a study of the cost of annotation. This study includes the participation of annotators at various skill levels and with varying backgrounds. Conducted over the web, the study consists of tests that simulate machine-assisted pre-annotation, requiring correction by the annotator rather than annotation from scratch. The study also includes tests representative of an annotation scenario involving Active Learning as it progresses from a naïve model to a knowledgeable model; in particular, annotators encounter pre-annotation of varying degrees of accuracy. The annotation interface lists tags considered likely by the annotation model in preference to other tags. We present the experimental parameters of the study and report both descriptive and inferential statistics on the results of the study. We conclude with a model for estimating the hourly cost of annotation for annotators of various skill levels. We also present models for two granularities of annotation: sentence at a time and word at a time.
Assessing the Costs of Sampling Methods in Active Learning for Annotation
Robbie Haertel
|
Eric Ringger
|
Kevin Seppi
|
James Carroll
|
Peter McClanahan
Proceedings of ACL-08: HLT, Short Papers
2007
Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation
Eric Ringger
|
Peter McClanahan
|
Robbie Haertel
|
George Busby
|
Marc Carmen
|
James Carroll
|
Kevin Seppi
|
Deryle Lonsdale
Proceedings of the Linguistic Annotation Workshop
2006
Multilingual Dependency Parsing using Bayes Point Machines
Simon Corston-Oliver
|
Anthony Aue
|
Kevin Duh
|
Eric Ringger
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference
2005
Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing
Eric Ringger
Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing
2004
Using the Penn Treebank to Evaluate Non-Treebank Parsers
Eric K. Ringger
|
Robert C. Moore
|
Eugene Charniak
|
Lucy Vanderwende
|
Hisami Suzuki
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Converting Treebank Annotations to Language Neutral Syntax
Richard Campbell
|
Eric Ringger
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Task-Focused Summarization of Email
Simon Corston-Oliver
|
Eric Ringger
|
Michael Gamon
|
Richard Campbell
Text Summarization Branches Out
Statistical machine translation using labeled semantic dependency graphs
Anthony Aue
|
Arul Menezes
|
Bob Moore
|
Chris Quirk
|
Eric Ringger
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages
Linguistically Informed Statistical Models of Constituent Structure for Ordering in Sentence Realization
Eric Ringger
|
Michael Gamon
|
Robert C. Moore
|
David Rojas
|
Martine Smets
|
Simon Corston-Oliver
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics
2003
French Amalgam: a quick adaptation of a sentence realization system to French
Martine Smets
|
Michael Gamon
|
Simon Corston-Oliver
|
Eric Ringger
10th Conference of the European Chapter of the Association for Computational Linguistics
French Amalgam: A machine-learned sentence realization system
Martine Smets
|
Michael Gamon
|
Simon Corston-Oliver
|
Eric Ringger
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
This paper presents the French implementation of Amalgam, a machine-learned sentence realization system. It presents in some detail two of the machine-learned models employed in Amalgam and shows how linguistic intuition and knowledge can be combined with statistical techniques to improve the performance of the models.
2002
Extraposition: A Case Study in German Sentence Realization
Michael Gamon
|
Eric Ringger
|
Zhu Zhang
|
Robert Moore
|
Simon Corston-Oliver
COLING 2002: The 19th International Conference on Computational Linguistics
An Overview of Amalgam: A Machine-learned Generation Module
Simon Corston-Oliver
|
Michael Gamon
|
Eric Ringger
|
Robert Moore
Proceedings of the International Natural Language Generation Conference
Machine-learned contexts for linguistic operations in German sentence realization
Michael Gamon
|
Eric Ringger
|
Simon Corston-Oliver
|
Robert Moore
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics
1996
A Robust System for Natural Spoken Dialogue
James F. Allen
|
Bradford W. Miller
|
Eric K. Ringger
|
Teresa Sikorski
34th Annual Meeting of the Association for Computational Linguistics