Şenay Kafkas
2012
Centroids: Gold standards with distributional variation
Ian Lewin | Şenay Kafkas | Dietrich Rebholz-Schuhmann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Motivation: Gold Standards for named entities are, ironically, not standard themselves. Some specify the one perfect annotation. Others specify perfectly good alternatives. The concept of a Silver Standard is relatively new; its objective is consensus rather than perfection. How should the two concepts best be represented and related? Approach: We examine several biomedical Gold Standards and motivate a new representational format, centroids, which simply and effectively represents name distributions. We define an algorithm for finding centroids, given a set of alternative input annotations, and we test the outputs quantitatively and qualitatively. We also define a metric of relative acceptability on top of the centroid standard. Results: Precision, recall and F-scores of over 0.99 are achieved for the simple sanity check of giving the algorithm Gold Standard inputs. Qualitative analysis of the differences very often reveals errors and incompleteness in the original Gold Standard. Given automatically generated annotations, the centroids effectively represent the range of those contributions, and the quality of the centroid annotations is highly competitive with the best of the contributors. Conclusion: Centroids cleanly represent alternative name variations for Silver and Gold Standards. A centroid Silver Standard is derived just like a Gold Standard, only from imperfect inputs.
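The precision, recall and F-score figures quoted in the abstract can be illustrated with a minimal sketch. It assumes exact-match scoring over `(start, end, type)` spans; the abstract does not state the matching criterion the authors used, and the `Span` type, the function name and the example spans are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def precision_recall_f1(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Score predicted spans against gold spans using exact matching (an assumption)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up example data, for illustration only.
gold = {(0, 5, "PROTEIN"), (10, 18, "DISEASE"), (25, 30, "PROTEIN")}
predicted = {(0, 5, "PROTEIN"), (10, 18, "DISEASE"), (40, 44, "SPECIES")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the three predicted spans match the gold set, so precision, recall and F-score all come out at two thirds. Scoring that tolerates boundary variation, which is closer to what centroids aim to capture, would need a different matching rule.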
CALBC: Releasing the Final Corpora
Şenay Kafkas | Ian Lewin | David Milward | Erik van Mulligen | Jan Kors | Udo Hahn | Dietrich Rebholz-Schuhmann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
A number of gold standard corpora for named entity recognition are available to the public. However, the existing gold standard corpora are limited in size and in the semantic entity types they cover. These limitations usually lead to trained solutions that (1) handle only a limited number of semantic entity types and (2) lack generalization capability. To overcome these problems, the CALBC project has aimed to automatically generate large-scale corpora annotated with multiple semantic entity types in a community-wide manner, based on the consensus of different named entity solutions. The generated corpus is called the silver standard corpus since the corpus generation process does not involve any manual curation. In this publication, we announce the release of the final CALBC corpora, which include the silver standard corpus in different versions and several gold standard corpora, for further use by the biomedical text mining community. The gold standard corpora are used to benchmark the methods applied in the silver standard corpora generation process. All the corpora are released in a shared format and are accessible at www.calbc.eu.
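The idea of building a corpus from the consensus of several named entity solutions can be sketched as simple voting. This is only an illustration under stated assumptions: exact-match spans, one vote per system, and a hypothetical `min_votes` threshold. The abstract does not describe the harmonisation method CALBC actually used, and the names and example data below are invented.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Hypothetical span representation: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def consensus_annotations(systems: Iterable[Set[Span]], min_votes: int) -> Set[Span]:
    """Keep every span proposed by at least `min_votes` systems (exact match assumed)."""
    votes: Counter = Counter()
    for annotations in systems:
        # Each system's output is a set, so it contributes one vote per span.
        votes.update(annotations)
    return {span for span, count in votes.items() if count >= min_votes}


# Made-up outputs from three hypothetical annotation systems.
system_a = {(0, 5, "PROTEIN"), (10, 18, "DISEASE")}
system_b = {(0, 5, "PROTEIN"), (25, 30, "SPECIES")}
system_c = {(0, 5, "PROTEIN"), (10, 18, "DISEASE"), (40, 44, "CHEMICAL")}

silver = consensus_annotations([system_a, system_b, system_c], min_votes=2)
print(sorted(silver))
```

With a threshold of two votes, only the spans shared by at least two systems survive. Raising `min_votes` trades recall for precision in the resulting silver annotations.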
Co-authors
- Ian Lewin 2
- Dietrich Rebholz-Schuhmann 2
- Udo Hahn 1
- Jan Kors 1
- David Milward 1
Venues
- lrec 2