SC-CoMIcs: A Superconductivity Corpus for Materials Informatics

Kyosuke Yamaguchi, Ryoji Asahi, Yutaka Sasaki


Abstract
This paper describes a novel corpus tailored for the text mining of superconducting materials in Materials Informatics (MI), named SuperConductivity Corpus for Materials Informatics (SC-CoMIcs). Unlike biomedical informatics, Materials Science and Engineering (MSE) has very few dedicated corpora. In particular, there is no sizable corpus that can be used to assist the search for superconducting materials. A team of materials scientists and natural language processing experts jointly designed the annotation and constructed a corpus consisting of 1,000 manually annotated MSE abstracts related to superconductivity. We conducted experiments on the corpus with a neural Named Entity Recognition (NER) tool. The experimental results show that NER performance over the corpus is around 77% in terms of micro-F1, which is comparable to human annotator agreement rates. Using the trained NER model, we automatically annotated 9,000 abstracts and created a term retrieval tool based on term similarity. This tool can find superconductivity terms relevant to a query term within a specified Named Entity category, which demonstrates the power of SC-CoMIcs to efficiently provide knowledge for Materials Informatics applications from rapidly expanding publications.
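For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how micro-F1 is commonly computed for NER: true positives, false positives, and false negatives are pooled over all documents and categories before precision and recall are taken. It assumes exact span-and-label matching; the function name `micro_f1`, the span representation, and the category labels are hypothetical and are not taken from the paper, whose exact evaluation protocol is not stated in the abstract.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over entity spans.

    gold, pred: lists with one set per document; each set holds
    (start, end, label) tuples. A predicted span counts as correct
    only if the same tuple appears in the gold set (an assumption).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans found in both gold and prediction
        fp += len(p - g)   # predicted spans not in gold
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data with hypothetical labels: two documents, one mislabeled span.
gold = [{(0, 2, "Material"), (5, 6, "Property")}, {(1, 3, "Element")}]
pred = [{(0, 2, "Material"), (5, 6, "Value")}, {(1, 3, "Element")}]
score = micro_f1(gold, pred)
```

In this toy case two spans match, one prediction is spurious, and one gold span is missed, so precision and recall are both 2/3 and so is the micro-F1.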
Anthology ID:
2020.lrec-1.834
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6753–6760
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.834
Cite (ACL):
Kyosuke Yamaguchi, Ryoji Asahi, and Yutaka Sasaki. 2020. SC-CoMIcs: A Superconductivity Corpus for Materials Informatics. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6753–6760, Marseille, France. European Language Resources Association.
Cite (Informal):
SC-CoMIcs: A Superconductivity Corpus for Materials Informatics (Yamaguchi et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.834.pdf