Noushin Rezapour Asheghi
2014
Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing
Noushin Rezapour Asheghi | Serge Sharoff | Katja Markert
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Research in Natural Language Processing often relies on a large collection of manually annotated documents. However, there is currently no reliable genre-annotated corpus of web pages that can be employed in Automatic Genre Identification (AGI). In AGI, documents are classified by genre rather than by topic or subject. The major shortcoming of available web genre collections is their relatively low inter-coder agreement. The reliability of annotated data is essential to the reliability of research results. In this paper, we present the first reliably annotated web genre corpus. We developed precise and consistent annotation guidelines that consist of well-defined and well-recognized categories. To annotate the corpus, we used crowd-sourcing, which is a novel approach in genre annotation. We computed the chance-corrected inter-annotator agreement both overall and for the individual categories. The results show that the corpus has been annotated reliably.
Semi-supervised Graph-based Genre Classification for Web Pages
Noushin Rezapour Asheghi | Katja Markert | Serge Sharoff
Proceedings of TextGraphs-9: the workshop on Graph-based Methods for Natural Language Processing