Metadata Enhancement Using Large Language Models

Hyunju Song, Steven Bethard, Andrea Thomer


Abstract
In the natural sciences, a common form of scholarly document is a physical sample record, which provides categorical and textual metadata for specimens collected and analyzed for scientific research. Physical sample archives like museums and repositories publish these records in data repositories to support reproducible science and enable the discovery of physical samples. However, the success of resource discovery in such interfaces depends on the completeness of the sample records. We investigate approaches for automatically completing the scientific metadata fields of sample records. We apply large language models in zero-shot and few-shot settings and incorporate the hierarchical structure of the taxonomy. We show that a combination of record summarization, bottom-up taxonomy traversal, and few-shot prompting yields F1 as high as 0.928 on metadata completion in the Earth science domain.
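To make the ingredients named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy child-to-parent taxonomy, a hypothetical prompt layout, and micro-averaged F1 over label sets; the names `PARENT`, `ancestors_bottom_up`, `build_few_shot_prompt`, and `micro_f1` are all illustrative and do not come from the paper.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT: Dict[str, str] = {
    "Basalt": "Igneous",
    "Granite": "Igneous",
    "Igneous": "Rock",
    "Sandstone": "Sedimentary",
    "Sedimentary": "Rock",
}


def ancestors_bottom_up(label: str, parent: Dict[str, str]) -> List[str]:
    """Return the label followed by its ancestors, ordered leaf to root."""
    path = [label]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    record_summary: str,
    candidates: List[str],
) -> str:
    """Assemble a few-shot prompt (assumed layout) for one metadata field."""
    lines = ["Choose one label from: " + ", ".join(candidates), ""]
    for summary, label in examples:
        lines += [f"Record: {summary}", f"Label: {label}", ""]
    # The unlabeled record goes last so the model completes its label.
    lines += [f"Record: {record_summary}", "Label:"]
    return "\n".join(lines)


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-record label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A predicted leaf label is expanded into its full path in the hierarchy.
    print(ancestors_bottom_up("Basalt", PARENT))
    prompt = build_few_shot_prompt(
        [("Fine-grained dark volcanic rock from a lava flow.", "Basalt")],
        "Coarse-grained rock with quartz and feldspar.",
        ["Basalt", "Granite", "Sandstone"],
    )
    print(prompt)
```

In this sketch a leaf prediction is expanded upward so that coarser metadata fields are filled consistently with the finest one; whether the paper traverses the taxonomy in exactly this way is not stated in the abstract.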
Anthology ID:
2024.sdp-1.14
Volume:
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Tirthankar Ghosal, Amanpreet Singh, Anita Waard, Philipp Mayr, Aakanksha Naik, Orion Weller, Yoonjoo Lee, Shannon Shen, Yanxia Qin
Venues:
sdp | WS
Publisher:
Association for Computational Linguistics
Pages:
145–154
URL:
https://aclanthology.org/2024.sdp-1.14
Cite (ACL):
Hyunju Song, Steven Bethard, and Andrea Thomer. 2024. Metadata Enhancement Using Large Language Models. In Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024), pages 145–154, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Metadata Enhancement Using Large Language Models (Song et al., sdp-WS 2024)
PDF:
https://aclanthology.org/2024.sdp-1.14.pdf