Corpus Annotation as a Scientific Task

Donia Scott, Rossano Barone, Rob Koeling


Abstract
Annotation studies in CL are generally unscientific: they are mostly not reproducible, make use of too few (and often non-independent) annotators, and rely on guidelines that are often something of a moving target. Additionally, the notion of 'expert annotators' invariably means only that the annotators have linguistic training. While this can be acceptable in some special contexts, it is often far from ideal. This is particularly the case when subtle judgements are required or when, as is increasingly common, one is making use of corpora originating from technical texts that have been produced by, and are intended to be consumed by, an audience of technical experts in the field. We outline a more rigorous approach to collecting human annotations, using as our example a study designed to capture judgements on the meaning of hedge words in medical records.
Anthology ID:
L12-1322
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1481–1485
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/569_Paper.pdf
Cite (ACL):
Donia Scott, Rossano Barone, and Rob Koeling. 2012. Corpus Annotation as a Scientific Task. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1481–1485, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Corpus Annotation as a Scientific Task (Scott et al., LREC 2012)
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/569_Paper.pdf