Finding the Paper Behind the Data: Automatic Identification of Research Articles related to Data Publications

Barbara McGillivray; Kaveh Aryan; Viola Harperath; Marton Ribary; Mandy Wigdorowitz

Abstract

Data papers are scholarly publications that describe datasets in detail, including their structure, collection methods, and potential for reuse, typically without presenting new analyses. As data sharing becomes increasingly central to research workflows, linking data papers to relevant research papers is essential for improving transparency, reproducibility, and scholarly credit. However, these links are rarely made explicit in metadata and are often difficult to identify manually at scale. In this study, we present a comprehensive approach to automating the linking process using natural language processing (NLP) techniques. We evaluate both set-based and vector-based methods, including Jaccard similarity, TF-IDF, SBERT, and reranking with large language models. Our experiments on a curated benchmark dataset reveal that no single method consistently outperforms others across all metrics, in line with the multifaceted nature of the task. Set-based methods using frequent words (N=50) achieve the highest top-10% accuracy, closely followed by TF-IDF, which also leads in MRR and top-1% and top-5% accuracy. SBERT-based reranking with LLMs yields the best results in top-N accuracy. This dispersion suggests that different approaches capture complementary aspects of similarity (lexical, semantic, and contextual), showing the value of hybrid strategies for robust matching between data papers and research articles. For several methods, we find no statistically significant difference between using abstracts and full texts, suggesting that abstracts may be sufficient for effective matching. Our findings demonstrate the feasibility of scalable, automated linking between data papers and research articles, enabling more accurate bibliometric analyses, improved tracking of data reuse, and fairer credit assignment for data sharing. This contributes to a more transparent, interconnected, and accessible research ecosystem.
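To make the set-based approach and the MRR metric mentioned in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a simple lowercase regex tokeniser, represents each text by the set of its N most frequent words, ranks candidate research articles by Jaccard similarity to the data paper, and scores the rankings with mean reciprocal rank. All function names (`top_words`, `jaccard`, `rank_candidates`, `mean_reciprocal_rank`) are hypothetical, and details such as stop-word handling are left out because the abstract does not specify them.

```python
import re
from collections import Counter


def top_words(text, n=50):
    """Return the set of the n most frequent lowercase word tokens in text."""
    # Assumption: a plain regex tokeniser; the paper's preprocessing may differ.
    tokens = re.findall(r"[a-z]+", text.lower())
    return {word for word, _ in Counter(tokens).most_common(n)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(data_paper, candidates, n=50):
    """Return candidate indices ordered from most to least similar to data_paper."""
    query = top_words(data_paper, n)
    scores = [jaccard(query, top_words(c, n)) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the gold candidate over all queries (ranks start at 1)."""
    return sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold)) / len(gold)


if __name__ == "__main__":
    # Toy texts invented for illustration only; they are not from the benchmark.
    data_paper = "corpus of annotated latin texts corpus annotation"
    candidates = [
        "weather model forecasts rain",
        "we analyse a latin corpus with annotated texts",
    ]
    ranking = rank_candidates(data_paper, candidates)
    print(ranking)
    print(mean_reciprocal_rank([ranking], [1]))
```

In this sketch a top-k% accuracy figure could be derived from the same rankings by checking whether the gold article falls within the first k% of positions; how the paper defines that cut-off exactly is not stated in the abstract.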

Anthology ID: 2025.wasp-main.5
Volume: Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
Month: December
Year: 2025
Address: Mumbai, India and virtual
Editors: Alberto Accomazzi, Tirthankar Ghosal, Felix Grezes, Kelly Lockhart
Venues: WASP | WS
Publisher: Association for Computational Linguistics
Pages: 34–43
URL: https://aclanthology.org/2025.wasp-main.5/
Cite (ACL): Barbara McGillivray, Kaveh Aryan, Viola Harperath, Marton Ribary, and Mandy Wigdorowitz. 2025. Finding the Paper Behind the Data: Automatic Identification of Research Articles related to Data Publications. In Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications, pages 34–43, Mumbai, India and virtual. Association for Computational Linguistics.
Cite (Informal): Finding the Paper Behind the Data: Automatic Identification of Research Articles related to Data Publications (McGillivray et al., WASP 2025)
PDF: https://aclanthology.org/2025.wasp-main.5.pdf
