SChuBERT: Scholarly Document Chunks with BERT-encoding boost Citation Count Prediction.

Thomas van Dongen, Gideon Maillette de Buy Wenniger, Lambert Schomaker


Abstract
Predicting the number of citations of scholarly documents is an emerging task in scholarly document processing. Besides its intrinsic merit, this information also has wider use as an imperfect proxy for quality, with the advantage of being cheaply available for large volumes of scholarly documents. Previous work on citation count prediction has used relatively small training sets, or larger datasets with only short, incomplete input text. In this work we leverage the open-access ACL Anthology collection in combination with the Semantic Scholar bibliometric database to create a large corpus of scholarly documents with associated citation information, and we propose a new citation prediction model called SChuBERT. In our experiments we compare SChuBERT with several state-of-the-art citation prediction models and show that it outperforms previous methods by a large margin. We also show the merit of using more training data and longer input text for citation count prediction.
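As the title suggests, the model encodes a document as a sequence of chunks, each short enough to be processed by BERT. A minimal sketch of that chunking step is below; the chunk size of 512 tokens (BERT's maximum input length) and the absence of overlap are illustrative assumptions, not the paper's exact preprocessing parameters.

```python
def chunk_tokens(tokens, chunk_size=512, stride=512):
    """Split a token sequence into fixed-size chunks.

    Each chunk can then be encoded independently (e.g. with BERT,
    which accepts at most 512 tokens) and the resulting chunk
    embeddings fed to a downstream citation-count regressor.
    Setting stride < chunk_size would produce overlapping chunks.
    """
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    chunks = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a 1200-token document yields three chunks.
doc = list(range(1200))
print([len(c) for c in chunk_tokens(doc)])  # [512, 512, 176]
```

This keeps the full document text available to the model, in contrast to approaches that truncate input to a short prefix such as the title and abstract.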
Anthology ID:
2020.sdp-1.17
Volume:
Proceedings of the First Workshop on Scholarly Document Processing
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | sdp
Publisher:
Association for Computational Linguistics
Pages:
148–157
URL:
https://aclanthology.org/2020.sdp-1.17
DOI:
10.18653/v1/2020.sdp-1.17
Cite (ACL):
Thomas van Dongen, Gideon Maillette de Buy Wenniger, and Lambert Schomaker. 2020. SChuBERT: Scholarly Document Chunks with BERT-encoding boost Citation Count Prediction.. In Proceedings of the First Workshop on Scholarly Document Processing, pages 148–157, Online. Association for Computational Linguistics.
Cite (Informal):
SChuBERT: Scholarly Document Chunks with BERT-encoding boost Citation Count Prediction. (van Dongen et al., sdp 2020)
PDF:
https://aclanthology.org/2020.sdp-1.17.pdf
Video:
https://slideslive.com/38940730
Data:
PeerRead | S2ORC