Evaluating Web-as-corpus Topical Document Retrieval with an Index of the OpenDirectory

Clément de Groc, Xavier Tannier


Abstract
This article introduces a novel protocol and resource for evaluating Web-as-corpus topical document retrieval. Unlike previous work, our goal is to provide an automatic, reproducible and robust evaluation for this task. We rely on the OpenDirectory (DMOZ) as a source of topically annotated webpages and index them in a search engine. With this OpenDirectory search engine, we can then easily evaluate the impact of various parameters, such as the number of seed terms, queries or documents, or the usefulness of various term selection algorithms. A first fully automatic evaluation is described and provides baseline performance figures for this task. The article concludes with practical information regarding the availability of the index and resource files.
Anthology ID:
L14-1736
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/980_Paper.pdf
Cite (ACL):
Clément de Groc and Xavier Tannier. 2014. Evaluating Web-as-corpus Topical Document Retrieval with an Index of the OpenDirectory. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Evaluating Web-as-corpus Topical Document Retrieval with an Index of the OpenDirectory (de Groc & Tannier, LREC 2014)