Hannah Copperman
2010
Evaluating Complex Semantic Artifacts
Christopher R Walker
|
Hannah Copperman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Evaluating complex Natural Language Processing (NLP) systems can prove extremely difficult. In many cases, the best one can do is to evaluate these systems indirectly, by looking at the impact they have on the performance of the downstream use case. For complex end-to-end systems, these metrics are not always enlightening, especially from the perspective of NLP failure analysis, as component interaction can obscure issues specific to the NLP technology. We present an evaluation program for complex NLP systems designed to produce meaningful aggregate accuracy metrics with sufficient granularity to support active development by NLP specialists. Our goals were threefold: to produce reliable metrics, to produce useful metrics and to produce actionable data. Our use case is a graph-based Wikipedia search index. Since the evaluation of a complex graph structure is beyond the conceptual grasp of a single human judge, the problem needs to be broken down. Slices of complex data reflective of coherent Decision Points provide a good framework for evaluation using human judges (Medero et al., 2006). For NL semantics, there really is no substitute. Leveraging Decision Points allows complex semantic artifacts to be tracked with judge-driven evaluations that are accurate, timely and actionable.
Fred’s Reusable Evaluation Device: Providing Support for Quick and Reliable Linguistic Annotation
Hannah Copperman
|
Christopher R. Walker
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes an interface that was developed for processing large amounts of human judgments of linguistically annotated data. Fred’s Reusable Evaluation Device (“Fred”) provides administrators with a tool to submit linguistic evaluation tasks to judges. Each evaluation task is then presented to exactly two judges, who can submit their judgments at their leisure. Fred then provides several metrics to administrators. The most important metric is precision, which is provided for each evaluation task and each annotator. Administrators can look at precision for a given data set over time, as well as by evaluation type, data set, or annotator. Inter-annotator agreement is also reported, and that can be tracked over time as well. The interface was developed to provide a tool for evaluating semantically marked-up text. The types of evaluations Fred has been used for so far include correctness of subject-relation identification and correctness of temporal relations. However, Fred’s full versatility has not yet been exploited.
A Hybrid Model for Annotating Named Entity Training Corpora
Robert Voyer
|
Valerie Nygaard
|
Will Fitzgerald
|
Hannah Copperman
Proceedings of the Fourth Linguistic Annotation Workshop