MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Zaid Alyafeai; Maged S. Al-shaibani; Bernard Ghanem

doi:10.18653/v1/2025.findings-emnlp.655

Abstract

Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (CITATION) laid the groundwork for extracting a wide range of metadata attributes from scholarly articles on Arabic NLP datasets, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, while highlighting the need for further improvements to ensure consistent and reliable performance.

Anthology ID: 2025.findings-emnlp.655
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 12236–12264
URL: https://aclanthology.org/2025.findings-emnlp.655/
DOI: 10.18653/v1/2025.findings-emnlp.655
Cite (ACL): Zaid Alyafeai, Maged S. Al-shaibani, and Bernard Ghanem. 2025. MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 12236–12264, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs (Alyafeai et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.655.pdf
Checklist: 2025.findings-emnlp.655.checklist.pdf
