The WEAVE Corpus: Annotating Synthetic Chemical Procedures in Patents with Chemical Named Entities

Ravindra Nittala, Manish Shrivastava


Abstract
The modern pharmaceutical industry depends on the iterative design of novel synthetic routes for drugs without infringing on existing intellectual property rights. This design process calls for analyzing many existing synthetic chemical reactions and planning the synthesis of novel chemicals. Historically, these procedures have been available as unstructured raw text in publications and patents. To facilitate the automated analysis of synthetic chemical reactions and the design of novel synthetic reactions using Natural Language Processing (NLP) methods, we introduce a Named Entity Recognition (NER) dataset covering the Examples sections of 180 full-text patent documents, with 5,188 synthetic procedures annotated by domain experts. All chemical entities that are part of the synthetic discourse were annotated with suitable class labels. We present the second-largest chemical NER corpus, with 100,129 annotations, and the highest inter-annotator agreement (IAA) value of 98.73% (F-measure), measured on a 45-document subset. We discuss this new resource in detail and highlight some specific challenges in annotating synthetic chemical procedures with chemical named entities. We make the corpus available to the community to promote further research and the development of downstream NLP applications. We also provide baseline NER results for the community to improve on.
Anthology ID:
2020.icon-main.1
Volume:
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2020
Address:
Indian Institute of Technology Patna, Patna, India
Editors:
Pushpak Bhattacharyya, Dipti Misra Sharma, Rajeev Sangal
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
1–9
URL:
https://aclanthology.org/2020.icon-main.1
Cite (ACL):
Ravindra Nittala and Manish Shrivastava. 2020. The WEAVE Corpus: Annotating Synthetic Chemical Procedures in Patents with Chemical Named Entities. In Proceedings of the 17th International Conference on Natural Language Processing (ICON), pages 1–9, Indian Institute of Technology Patna, Patna, India. NLP Association of India (NLPAI).
Cite (Informal):
The WEAVE Corpus: Annotating Synthetic Chemical Procedures in Patents with Chemical Named Entities (Nittala & Shrivastava, ICON 2020)
PDF:
https://aclanthology.org/2020.icon-main.1.pdf
Code:
nv-ravindra/the-weave-corpus
Data:
CoNLL 2003