Knowledge Graph Extraction from Total Synthesis Documents

Andres M Bran, Zlatko Jončev, Philippe Schwaller


Abstract
Knowledge graphs (KGs) have emerged as a powerful tool for organizing and integrating complex information, making it a suitable format for scientific knowledge. However, translating scientific knowledge into KGs is challenging as a wide variety of styles and elements to present data and ideas is used. Although efforts for KG extraction (KGE) from scientific documents exist, evaluation remains challenging and field-dependent; and existing benchmarks do not focuse on scientific information. Furthermore, establishing a general benchmark for this task is challenging as not all scientific knowledge has a ground-truth KG representation, making any benchmark prone to ambiguity. Here we propose Graph of Organic Synthesis Benchmark (GOSyBench), a benchmark for KG extraction from scientific documents in chemistry, that leverages the native KG-like structure of synthetic routes in organic chemistry. We develop KG-extraction algorithms based on LLMs (GPT-4, Claude, Mistral) and VLMs (GPT-4o), the best of which reaches 73% recovery accuracy and 59% precision, leaving a lot of room for improvement. We expect GOSyBench can serve as a valuable resource for evaluating and advancing KGE methods in the scientific domain, ultimately facilitating better organization, integration, and discovery of scientific knowledge.
Anthology ID:
2024.langmol-1.9
Volume:
Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Carl Edwards, Qingyun Wang, Manling Li, Lawrence Zhao, Tom Hope, Heng Ji
Venues:
LangMol | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
74–84
Language:
URL:
https://aclanthology.org/2024.langmol-1.9
DOI:
Bibkey:
Cite (ACL):
Andres M Bran, Zlatko Jončev, and Philippe Schwaller. 2024. Knowledge Graph Extraction from Total Synthesis Documents. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), pages 74–84, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Knowledge Graph Extraction from Total Synthesis Documents (M Bran et al., LangMol-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.langmol-1.9.pdf