Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing

Noushin Rezapour Asheghi; Serge Sharoff; Katja Markert

Abstract

Research in Natural Language Processing often relies on large collections of manually annotated documents. However, there is currently no reliable genre-annotated corpus of web pages for use in Automatic Genre Identification (AGI). In AGI, documents are classified by genre rather than by topic or subject. The major shortcoming of available web genre collections is their relatively low inter-coder agreement. The reliability of annotated data is essential to the reliability of research results. In this paper, we present the first reliably annotated web genre corpus. We developed precise and consistent annotation guidelines built on well-defined and well-recognized categories. To annotate the corpus, we used crowd-sourcing, which is a novel approach in genre annotation. We computed chance-corrected inter-annotator agreement both overall and for individual categories. The results show that the corpus has been annotated reliably.

Anthology ID: L14-1398
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1339–1346
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/470_Paper.pdf
Cite (ACL): Noushin Rezapour Asheghi, Serge Sharoff, and Katja Markert. 2014. Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1339–1346, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing (Asheghi et al., LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/470_Paper.pdf
