On the Use of Web Search to Improve Scientific Collections

Krutarth Patel, Cornelia Caragea, Sujatha Das Gollapalli


Abstract
Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publicly available research paper titles and author names are used as queries to a Web search engine. We were able to obtain ~267,000 unique research papers through our fully automated framework using ~76,000 queries, resulting in almost 200,000 more papers than the number of queries. Moreover, through a combination of title and author name search, we were able to recover 78% of the original searched titles.
Anthology ID:
2020.sdp-1.20
Volume:
Proceedings of the First Workshop on Scholarly Document Processing
Month:
November
Year:
2020
Address:
Online
Editors:
Muthu Kumar Chandrasekaran, Anita de Waard, Guy Feigenblat, Dayne Freitag, Tirthankar Ghosal, Eduard Hovy, Petr Knoth, David Konopnicki, Philipp Mayr, Robert M. Patton, Michal Shmueli-Scheuer
Venue:
sdp
Publisher:
Association for Computational Linguistics
Pages:
174–183
URL:
https://aclanthology.org/2020.sdp-1.20
DOI:
10.18653/v1/2020.sdp-1.20
Cite (ACL):
Krutarth Patel, Cornelia Caragea, and Sujatha Das Gollapalli. 2020. On the Use of Web Search to Improve Scientific Collections. In Proceedings of the First Workshop on Scholarly Document Processing, pages 174–183, Online. Association for Computational Linguistics.
Cite (Informal):
On the Use of Web Search to Improve Scientific Collections (Patel et al., sdp 2020)
PDF:
https://aclanthology.org/2020.sdp-1.20.pdf
Video:
https://slideslive.com/38940728
Data:
DBLP