NLP Scholar: A Dataset for Examining the State of NLP Research

Saif M. Mohammad


Abstract
Google Scholar is the largest web search engine for academic literature that also provides access to rich metadata associated with the papers. The ACL Anthology (AA) is the largest repository of articles on Natural Language Processing (NLP). We extracted information from AA for about 44 thousand NLP papers and identified authors who published at least three papers there. We then extracted citation information from Google Scholar for all their papers (not just their AA papers). This resulted in a dataset of 1.1 million papers and associated Google Scholar information. We aligned the information in the AA and Google Scholar datasets to create the NLP Scholar Dataset – a single unified source of information (from both AA and Google Scholar) for tens of thousands of NLP papers. It can be used to identify broad trends in productivity, focus, and impact of NLP research. We present here initial work on analyzing the volume of research in NLP over the years and identifying the most cited papers in NLP. We also list a number of additional potential applications.
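The pipeline outlined in the abstract (keep authors with at least three AA papers, then align AA entries with Google Scholar records) can be illustrated with a minimal sketch. The record layout (`title`, `authors`, `citations` keys), the function names, and the choice to match on a normalized title are all assumptions made for illustration; the abstract does not say how the alignment was actually performed.

```python
import re
from collections import Counter


def prolific_authors(aa_papers, min_papers=3):
    """Return authors who appear on at least `min_papers` AA papers."""
    # set() guards against an author being listed twice on one paper.
    counts = Counter(a for p in aa_papers for a in set(p["authors"]))
    return {a for a, n in counts.items() if n >= min_papers}


def normalize_title(title):
    """Lowercase a title and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def align(aa_papers, gs_records):
    """Attach citation counts to AA papers whose normalized title matches.

    Matching on title alone is an assumption of this sketch, not the
    method described in the paper.
    """
    gs_by_title = {normalize_title(r["title"]): r for r in gs_records}
    merged = []
    for paper in aa_papers:
        match = gs_by_title.get(normalize_title(paper["title"]))
        if match is not None:
            merged.append({**paper, "citations": match["citations"]})
    return merged
```

Exact matching on a normalized title is the simplest possible join. Real metadata would likely need fuzzy matching and disambiguation of authors who share a name.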
Anthology ID:
2020.lrec-1.109
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
868–877
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.109
Cite (ACL):
Saif M. Mohammad. 2020. NLP Scholar: A Dataset for Examining the State of NLP Research. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 868–877, Marseille, France. European Language Resources Association.
Cite (Informal):
NLP Scholar: A Dataset for Examining the State of NLP Research (Mohammad, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.109.pdf