Can AI Validate Science? Benchmarking LLMs on Claim →Evidence Reasoning in AI Papers

Shashidhar Reddy Javaji; Yupeng Cao; Haohang Li; Yangyang Yu; Nikhil Muralidhar; Zining Zhu

Can AI Validate Science? Benchmarking LLMs on Claim →Evidence Reasoning in AI Papers

Shashidhar Reddy Javaji, Yupeng Cao, Haohang Li, Yangyang Yu, Nikhil Muralidhar, Zining Zhu

Abstract

Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs’ capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches which are inspired by divide and conquer approaches, across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs’ ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs’ abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.

Anthology ID:: 2025.ijcnlp-long.127
Volume:: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:: December
Year:: 2025
Address:: Mumbai, India
Editors:: Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venues:: IJCNLP | AACL
SIG:
Publisher:: The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Note:
Pages:: 2355–2379
Language:
URL:: https://aclanthology.org/2025.ijcnlp-long.127/
DOI:
Bibkey:
Cite (ACL):: Shashidhar Reddy Javaji, Yupeng Cao, Haohang Li, Yangyang Yu, Nikhil Muralidhar, and Zining Zhu. 2025. Can AI Validate Science? Benchmarking LLMs on Claim →Evidence Reasoning in AI Papers. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 2355–2379, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):: Can AI Validate Science? Benchmarking LLMs on Claim →Evidence Reasoning in AI Papers (Javaji et al., IJCNLP-AACL 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.ijcnlp-long.127.pdf

PDF Cite Search Fix data