Zach Jensen


2021

MS-Mentions: Consistently Annotating Entity Mentions in Materials Science Procedural Text
Tim O’Gorman | Zach Jensen | Sheshera Mysore | Kevin Huang | Rubayyat Mahbub | Elsa Olivetti | Andrew McCallum
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Materials science synthesis procedures are a promising domain for scientific NLP, as proper modeling of these recipes could provide insight into new ways of creating materials. However, a fundamental challenge in building information extraction models for materials science synthesis procedures is getting accurate labels for the materials, operations, and other entities of those procedures. We present a new corpus of entity mention annotations over 595 materials science synthesis procedural texts (157,488 tokens), which greatly expands the training data available for the Named Entity Recognition task. We outline a new label inventory designed to provide consistent annotations and a new annotation approach intended to maximize the consistency and annotation speed of domain experts. Inter-annotator agreement studies and baseline models trained on the data suggest that the corpus provides high-quality annotations of these mention types. This corpus helps lay a foundation for future high-quality modeling of synthesis procedures.