AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers

Yuan-Sen Ting; Alberto Accomazzi; Tirthankar Ghosal; Tuan Dung Nguyen; Rui Pan; Zechang Sun; Tijmen de Haan

Abstract

We present a dataset of 408,590 astrophysics papers from arXiv (astro-ph), spanning 1992 through July 2025. Each paper has been processed through a multi-stage pipeline to produce: (1) structured summaries organized into six semantic sections (Background, Motivation, Methodology, Results, Interpretation, Implication), and (2) concept extraction yielding 9,999 unique concepts with detailed descriptions. The dataset contains 3.8 million paper-concept associations and includes semantic embeddings for all concepts. Comparison with traditional ADS keywords reveals that the concepts provide denser coverage and more uniform distribution, while analysis of embedding space structure demonstrates that concepts are semantically dispersed within papers—enabling discovery through multiple diverse entry points. Concept vocabulary and embeddings are publicly released at https://github.com/tingyuansen/astro-ph_knowledge_graph.
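As a rough illustration of the record layout the abstract describes, the sketch below models one paper with its six-section summary and its associated concepts, then computes a simple dispersion proxy over concept embeddings. Only the six section names are taken from the abstract. The class name `PaperRecord`, the field names, the identifiers, the toy two-dimensional vectors, and the choice of mean pairwise cosine similarity are all assumptions made for illustration; they are not the released file format or the authors' analysis method.

```python
from dataclasses import dataclass, field
from math import sqrt

# Section names as listed in the abstract. Everything else in this
# sketch (names, fields, vectors) is hypothetical.
SECTIONS = ("Background", "Motivation", "Methodology",
            "Results", "Interpretation", "Implication")


@dataclass
class PaperRecord:
    """Hypothetical container for one paper's summary and concepts."""
    arxiv_id: str
    summary: dict                      # section name -> summary text
    concept_ids: list = field(default_factory=list)

    def __post_init__(self):
        # Require all six sections so downstream code can rely on them.
        missing = [s for s in SECTIONS if s not in self.summary]
        if missing:
            raise ValueError(f"summary is missing sections: {missing}")


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pairwise_similarity(concept_ids, embeddings):
    """Average cosine similarity over all pairs of a paper's concepts.

    Lower values mean the concepts are more spread out in embedding
    space. This is an assumed proxy, not the paper's own metric.
    """
    pairs = [(a, b) for i, a in enumerate(concept_ids)
             for b in concept_ids[i + 1:]]
    if not pairs:
        raise ValueError("need at least two concepts")
    total = sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)


# Toy data: three made-up concepts with made-up 2-D embeddings.
toy_embeddings = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
paper = PaperRecord(
    arxiv_id="hypothetical-id",
    summary={name: "..." for name in SECTIONS},
    concept_ids=["c1", "c2", "c3"],
)
dispersion_proxy = mean_pairwise_similarity(paper.concept_ids, toy_embeddings)
```

With real data, the embeddings mapping would be loaded from the released concept files rather than typed in by hand, and the vectors would have far more than two dimensions.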

Anthology ID: 2025.wasp-main.19
Volume: Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
Month: December
Year: 2025
Address: Mumbai, India and virtual
Editors: Alberto Accomazzi, Tirthankar Ghosal, Felix Grezes, Kelly Lockhart
Venues: WASP | WS
Publisher: Association for Computational Linguistics
Pages: 170–185
URL: https://aclanthology.org/2025.wasp-main.19/
Cite (ACL): Yuan-Sen Ting, Alberto Accomazzi, Tirthankar Ghosal, Tuan Dung Nguyen, Rui Pan, Zechang Sun, and Tijmen de Haan. 2025. AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers. In Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications, pages 170–185, Mumbai, India and virtual. Association for Computational Linguistics.
Cite (Informal): AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers (Ting et al., WASP 2025)
PDF: https://aclanthology.org/2025.wasp-main.19.pdf
