WIKIR: A Python Toolkit for Building a Large-scale Wikipedia-based English Information Retrieval Dataset

Jibril Frej, Didier Schwab, Jean-Pierre Chevallet


Abstract
In recent years, deep learning methods have achieved new state-of-the-art results in ad-hoc information retrieval. However, such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g., Robust04, ClueWeb09) have at most 250 annotated queries, recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g., DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines that is not publicly available for academic research, which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR59k: a large-scale publicly available dataset that contains 59,252 queries and 2,617,003 (query, relevant documents) pairs.
Anthology ID:
2020.lrec-1.237
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1926–1933
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.237
Cite (ACL):
Jibril Frej, Didier Schwab, and Jean-Pierre Chevallet. 2020. WIKIR: A Python Toolkit for Building a Large-scale Wikipedia-based English Information Retrieval Dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1926–1933, Marseille, France. European Language Resources Association.
Cite (Informal):
WIKIR: A Python Toolkit for Building a Large-scale Wikipedia-based English Information Retrieval Dataset (Frej et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.237.pdf
Code:
getalp/wikIR