Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Dataset

Satanu Ghosh, Neal Brodnik, Carolina Frey, Collin Holgate, Tresa Pollock, Samantha Daly, Samuel Carton


Abstract
We explore the ability of GPT-4 to perform ad-hoc schema-based information extraction from scientific literature. Specifically, we assess whether it can, with a basic one-shot prompting approach over the full text of the included manuscripts, replicate two existing materials science datasets, one pertaining to multi-principal element alloys (MPEAs) and one to silicate diffusion. We collaborate with materials scientists to perform a detailed manual error analysis to assess where and why the model struggles to faithfully extract the desired information, and draw on their insights to suggest research directions to address this broadly important task.
Anthology ID:
2024.findings-acl.897
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15109–15123
URL:
https://aclanthology.org/2024.findings-acl.897
Cite (ACL):
Satanu Ghosh, Neal Brodnik, Carolina Frey, Collin Holgate, Tresa Pollock, Samantha Daly, and Samuel Carton. 2024. Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Dataset. In Findings of the Association for Computational Linguistics ACL 2024, pages 15109–15123, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Dataset (Ghosh et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.897.pdf