PcMSP: A Dataset for Scientific Action Graphs Extraction from Polycrystalline Materials Synthesis Procedure Text

Extracting scientific action graphs from materials synthesis procedures is important for reproducible research, machine automation, and materials prediction, but the lack of annotated data has hindered progress in this field. We present an effort to annotate Polycrystalline Materials Synthesis Procedures (PcMSP) from 305 open-access scientific articles for the construction of synthesis action graphs. This is a new dataset for materials science information extraction that simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations. A two-step human annotation and an inter-annotator agreement study guarantee the high quality of the PcMSP corpus. We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations. Comprehensive experiments validate the effectiveness of several state-of-the-art models on these tasks while leaving large room for improvement. We also perform error analysis and point out some unique challenges that require further investigation. We will release our annotation scheme, the corpus, and the code to the research community to alleviate the scarcity of labeled data in this domain.

The goal of information extraction from procedures is to construct action graphs, in which all the steps of a synthesis make up a Directed Acyclic Graph (DAG) (Mysore et al., 2019; Kulkarni et al., 2018), as illustrated by the example in Figure 1. This can be further broken down into three tasks: sentence classification, named entity recognition (NER), and relation extraction (RE). Previous research (Mysore et al., 2017, 2019) either annotates whole synthesis paragraphs in the general inorganic domain, ignoring non-synthesis sentences and subdomain discrepancy, or focuses only on entity mentions (Friedrich et al., 2020; O'Gorman et al., 2021).
To fill this gap, we focus on one important category of polycrystalline materials and simultaneously include all three tasks. The annotation guidelines are designed by materials experts after comprehensive discussion, and the new dataset is subsequently labeled with a two-round annotation.

Figure 1: A synthesis action graph constructed from Table 1.
The key contributions of this paper include:
• We contribute a new large-scale dataset, as well as a high-quality annotation scheme, for information extraction in materials science.
• We conduct comprehensive experiments on four tasks (sentence classification, named entity recognition, relation extraction, and joint extraction) to provide baselines.
• We perform error analysis and point out unique challenges and potential use of this dataset for future research.

Materials procedures information extraction
In the area of annotating materials synthesis procedures, Mysore et al. (2019) annotate 230 general materials synthesis paragraphs for NER and RE tasks. Similar work is undertaken by Friedrich et al. (2020), in which 45 open-access scholarly articles are labeled for experiment-describing sentence classification, NER, and slot filling. However, in contrast to our work, their annotation scheme covers the full text rather than only the experimental section. Kuniyoshi et al. (2020) annotate the synthesis process of all-solid-state batteries from the scientific literature, but their corpus is not publicly available. Walker et al. (2021) release MatBERT, trained on 50 million materials science paragraphs, to explore the impact of domain-specific pre-training on NER. Also of interest, O'Gorman et al. (2021) recently create the largest corpus for entity mention extraction in both the general domain and subdomains of materials synthesis text, but the relations between entities are still missing.

Named entity recognition and relation extraction
Many neural network-based models have been proposed for named entity recognition, for example by Huang et al. (2015), Lample et al. (2016), and Panchendrarajan and Amaresan (2018). The core idea is to use one encoding layer (e.g., Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or BERT) for representation and one additional conditional random field (CRF) (Lafferty et al., 2001) layer for sequence labeling. Relations are then predicted based on either gold or predicted entities. PURE (Zhong and Chen, 2021) designs two separate encoders for the joint extraction of entities and relations; we adopt their model for our tasks due to its superior performance.
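The CRF decoding step described above can be sketched with a toy Viterbi implementation. The label set, scores, and function name below are purely illustrative, not taken from any of the cited systems:

```python
def viterbi_decode(emissions, transitions, labels):
    """Best label sequence under a linear-chain CRF via Viterbi.

    emissions: per-token dicts {label: score} (e.g., from an LSTM or BERT head);
    transitions: {(prev_label, cur_label): score}. A toy sketch of the
    encoder-plus-CRF idea, not the implementation of the cited papers.
    """
    # Initialize with the first token's emission scores.
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for em in emissions[1:]:
        new_best = {}
        for cur in labels:
            # Pick the best previous label to transition from.
            score, path = max(
                ((best[prev][0] + transitions.get((prev, cur), 0.0) + em[cur],
                  best[prev][1]) for prev in labels),
                key=lambda t: t[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]
```

In a real tagger the emission scores come from the encoder and the transition matrix is learned jointly; here the transition penalty (e.g., O followed by I) is what keeps the predicted BIO sequence well-formed.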

The Selection of Our Dataset
Here we discuss the importance of our selection and how it differs from other materials procedural text corpora.
Why do we choose inorganic polycrystalline materials? There are a number of sub-categories within solid-state inorganic materials. For example, materials can be divided based on function and properties, such as battery or thermoelectric materials. Synthesis within both categories largely falls within the broader category of solid-state synthesis, and even then, there is a high degree of overlap with other function categories, such as quantum and magnetic materials. More importantly, those materials are usually polycrystalline in form. Other subcategories relate to form factors; for instance, single-crystal synthesis often starts with a polycrystalline synthesis and therefore has a high degree of overlap with solid-state synthesis.
Inorganic polycrystalline compounds span combinations of the entire periodic table and different chemical bonding schemes, such that their synthesis typically takes place under extreme conditions, such as high temperature and pressure. Reaction pathways are therefore difficult to characterize without specialized equipment and are not well established for any given material. In particular, solid-state reactions, the main techniques for synthesizing inorganic polycrystalline materials, are particularly similar to a "black box", where materials scientists can only make educated guesses about the procedure or stability of a new reaction. This presents a prime opportunity (Mysore et al., 2017, 2019) for compiling published inorganic synthesis data in order to demystify the black box of solid-state inorganic materials synthesis and create datasets for future text mining endeavors. While there have been efforts within general solid-state materials (Mysore et al., 2017, 2019; O'Gorman et al., 2021) and the battery materials subcategory (Friedrich et al., 2020), this work aims to extend the subcategory of inorganic solid-state synthesis methods in order to address the frequent overlap and "borrowing" of materials between subdisciplines of materials science.
Why do we discard characterization sentences? Inorganic reactions typically involve relatively few reactions from a set of precursors, and there are very few purification pathways for solid materials compared to organic materials or liquids. Therefore, characterizations of solid-state inorganic reactions are seldom reported in the literature unless they proceed to complete purity within standard measurement fidelity. This is in contrast to organic materials, where there are a number of important characterization metrics for a compound, such as molecular weight in polymers or reaction yield. These standard characterization measurements therefore do not add valuable information for a researcher attempting to recreate the reported synthesis method, and we decide to discard such characterization sentences.
Why do we annotate sentences, entities, and relations simultaneously? A full action graph consists of both entities and relations extracted from experiment-describing sentences. However, most previous research ignores the annotation of either sentence or relation information, making it incomplete for action graph construction. To fill this gap, we aim to annotate all pertinent information jointly.

Selection of synthesis procedures for annotation
We begin by harvesting polycrystalline materials synthesis-related open-access publications from the main journal publishers by searching keywords (e.g., 'polycrystalline+synthesis'). The journals that we use include Physical Review journals, Nature journals, Science journals, Journal of the American Chemical Society, Advanced Materials, Journal of Physics: Condensed Matter, Chemistry of Materials, and arXiv. After the collection of 305 publications, each portable document format (PDF) document is converted into a plain text file by pdfminer. The experimental paragraphs usually appear in the experimental section within an article and are selected by one materials expert.
To improve data quality, the selected paragraphs are double-checked by another annotator to ensure their correctness, and sentences lost during the conversion process are added back. Finally, the collected paragraphs are prepared for the next step of annotation.

Sentence annotation
Based on the paragraphs selected in the aforementioned step, each document is annotated on the semantic annotation platform INCEpTION (Klie et al., 2018), and sentence segmentation is carried out automatically. Each line represents all tokens of one sentence, and the annotation is done at the token level. In practice, only the synthesis-related sentences are annotated for NER and RC.
The remaining unlabeled sentences automatically obtain non-synthesis labels. This process results in 1497 synthesis-related sentences and 971 non-related sentences. It is worthwhile to point out that several selected paragraphs also contain single crystal synthesis (this occurs in < 1% of cases), but we do not take those as synthesis-related sentences so as to focus purely on polycrystalline synthesis. In general, most non-synthesis sentences are relevant to the characterization of materials, description of devices, etc., while synthesis sentences typically describe the synthesis actions conducted in the experiments. For example, in Table 1, the first two sentences are synthesis-related while the remaining sentences are not.

Entity type annotation
We define 13 entity types, decided by the materials experts, to cover the most useful entity mentions.
Property-temperature: a temperature condition associated with an operation, which is usually composed of numerical values and temperature units.
Property-rate: a rate condition associated with an operation, which is usually composed of numerical values and rate units. The rates can be rotation speeds, cooling rates, heating rates, etc.
Property-pressure: a pressure condition associated with an operation, which is not only in the form of a value and units but can also be a certain condition like vacuum, helium, or air.
Value: numerical values and their corresponding units. In addition, we include specifications like "around", "over", "more than", or "between" in the annotation span (e.g., "around 250 g" and "over 20 mol"). We do not include time, temperature, pressure, or rate in this category, as they are already covered by the properties.
Device: mentions of the type of device used in the corresponding operation, which can contain the device name and serial number.
Brand: the brand name or source laboratory associated with the equipment or material.
Descriptor: a description of an operation, a material, or a value that does not apply to the properties but is necessarily included for a clear description.

Relation type annotation
The previous two steps provide us with the labeled entity mentions within each sentence. We then connect each entity pair by a relation type wherever the annotators judge a connection necessary, according to the agreement-study guidelines. The full descriptions of the relation labels are listed in the following.
Participant-material: materials that are involved in one operation process; we also mark the target material and its synthesis action with this label.
Device-of-operation: a device used in an operation.
Condition-of: indicates the conditions (such as the temperature, time, and pressure) for performing an operation.
Value-of: expresses the relationship between a participating material and its weight, mass, volume, or purity, and also represents the relationship between a device and its serial number.
Next-operation: represents the order of an operation sequence, i.e., that one operation happens following the previous operation. Note that we assume the linear sequence of synthesis operations happens sentence by sentence, which is true in most cases.
Brand-of: expresses the relationship between a raw material or device and its manufacturer name or source laboratory.
Descriptor-of: a descriptor for a material, device, or operation that cannot be covered by the other labels.
Coreference: represents the same material or operation in the same sentence.
According to the largest document-level relation extraction dataset (Yao et al., 2019), around 40% of relations span multiple sentences. However, cross-sentence relations are out of scope for the current work, and we leave them for future investigation.

Inter-annotator Agreement Study
We perform a two-round agreement study to ensure that our corpus has a high annotation quality. Before undertaking the formal annotation, all four annotators participate in a discussion of the formulation rules and the necessary entity and relation types. In a warm-up exercise, all annotators annotate the same documents individually and then compare and discuss the results together to reach better agreement on the annotation. After the agreements are formulated, in the first-round annotation the four annotators are randomly assigned different documents to work on. On average, it takes an annotator around twenty to thirty minutes to annotate one document. When all of the annotations are finished, two of the four annotators select several typical examples for analysis and eventually set more rules for annotating the most debatable parts. In the second round of annotation, the two lead annotators individually re-annotate half of the documents, guaranteeing that there are no significant differences or mistakes. In total, it takes our materials expert team around 500 hours to create this corpus and guarantee its high quality.
We use Fleiss' kappa to measure the agreement between our four annotators. The results are shown in Table 3, with substantially high agreement scores. We can see obvious improvements in all aspects from the first- to the second-round annotation, demonstrating the effectiveness of our annotation pipeline. We use five metrics to measure the agreement: Sen. refers to sentence agreement, En1. means span boundaries and type are both correct, En2. means matched type on the same spans, Re1. represents a complete relation triple with correct entities, and Re2. stands for the correct relation type on the same entities. More details are discussed in Appendix D.
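Fleiss' kappa itself is straightforward to compute. The following is a minimal sketch of the statistic (our own illustration, not the script used to produce Table 3):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each item being the list of
    category labels assigned by the raters (one label per rater).

    A minimal sketch; assumes every item is rated by the same number of
    raters and that agreement is not already perfect by chance (P_e < 1).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Per-item category counts.
    counts = [Counter(item) for item in ratings]
    # Observed agreement: mean pairwise agreement per item.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    totals = Counter(lab for item in ratings for lab in item)
    p_e = sum((v / (n_items * n_raters)) ** 2 for v in totals.values())
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1, agreement at chance level yields 0, and systematic disagreement yields negative values.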

Statistics of Corpus and Problem Formulation
In this section, we describe the statistics of this new dataset, the comparison with previous corpora, and the formulated tasks.

PcMSP corpus
We outline the main materials science corpora in Table 4. We provide a train/validation/test split for potential future use.

Task definition
The PcMSP corpus labels every sentence with entity mentions and relations among entity pairs. Formally, given a sentence of n words s = {w_1, ..., w_n} with a labeled sentence type, entity set E, and relation set R, four information extraction tasks are introduced: 1) SC: classification of the sentence as an experimental procedure sentence or an irrelevant sentence; 2) NER: recognition of all named entity mentions in E; 3) RE: identification of the entity pair relations in R; and 4) Joint: joint extraction of all entities and relations.
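Under this formulation, a single annotated sentence can be pictured as a small record. The field names and the example sentence below are purely illustrative and do not reflect the released file format:

```python
# One labeled PcMSP-style sentence, following the task definition above.
# Field names and content are hypothetical, not the released data format.
record = {
    "tokens": ["BaTiO3", "was", "sintered", "at", "1400", "C", "."],
    "sentence_label": "synthesis",   # SC target: synthesis vs. non-synthesis
    "entities": [                    # NER targets: (start, end, type), end exclusive
        (0, 1, "Material-target"),
        (2, 3, "Operation"),
        (4, 6, "Property-temperature"),
    ],
    "relations": [                   # RE targets: (head entity idx, tail entity idx, type)
        (0, 2, "Participant-material"),
        (2, 1, "Condition-of"),      # hmm: indices refer into "entities" above
    ],
}
```

The SC task reads `sentence_label`, NER predicts `entities` from `tokens`, RE predicts `relations` given gold entities, and the joint task predicts both `entities` and `relations` from `tokens` alone.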

Results and Analysis
We present the main experimental results in this section; more modeling details are included in Appendix B. PURE refers to the advanced joint extraction model by Zhong and Chen (2021). For all the experiments, we use bert-base-uncased (Devlin et al., 2019) and matbert-base-uncased (Walker et al., 2021) as encoders. Generally, BERT with domain-specific pretraining considerably improves the performance.

Sentence classification
We summarize the results for experiment-describing sentence detection in Table 5. For this binary classification task, we fine-tune the BERT, SciBERT, and MatBERT (Walker et al., 2021) models, resulting in F1 scores of 87.20%, 88.85%, and 90.16%, respectively. The best result is achieved by MatBERT, demonstrating the usefulness of domain-specific pretraining. The close-to-human performance on sentence classification stems from the obvious difference in expression between synthesis-describing sentences and others. Generally, synthesis-describing sentences contain 1) the materials' chemical formulas, 2) the operations (usually certain verbs), and 3) the experimental conditions. In contrast, other sentences often describe characterization approaches, which are totally different. In conclusion, synthesis sentence detection is the foundation for the other downstream tasks, and the high detection accuracy guarantees the success of our workflow for them.

Named entity recognition
In Table 6, we present the NER results obtained from different models. Based on the synthesis procedure sentences detected earlier, we train the models only on the experiment-describing sentences, ignoring irrelevant sentences. The SciBERT model is trained with one CRF layer for sequence labeling, and MatBERT is stacked with one additional feed-forward layer for span-based tagging. The MatBERT model with PURE achieves the best F1 result of 79.46%, although a large gap of 10 points still exists compared with the human agreement score. Looking at the per-label performance in Table 7, recognizing labels such as Property-rate, Property-time, and Operation achieves good scores of 92.31%, 84.38%, and 83.39%, respectively. On the contrary, recognition is still difficult for labels like Material-others and Material-intermedium. One possible reason might be that those mentions require cross-sentence reasoning, while the current model is only trained on single sentences. We also report SciBERT results on the other previously mentioned materials procedural datasets, and the overall sentence-level results are very consistent. Thus, a promising direction for improving the results is to include paragraph-level context or to use cross-domain transfer learning, which we leave for future work.

Relation classification
In this section, the modeling is performed on gold entities to investigate the individual modeling capability. The relation classification results are provided in Table 8. For entity pairs without any relation, an 'NA' label is given for modeling. Here, the human agreement score is calculated by treating one annotation as gold and the other as predictions. Among all of the relation modeling results in Table 8, we can see that the F1 score is almost always above 80%, demonstrating promising prediction results at all label levels. In particular, the Condition-of and Brand-of relation predictions achieve high F1 scores of 89.21% and 88.46%, respectively. However, Coreference prediction is more difficult, achieving only 71.74 points.
Overall, the RE modeling achieves results comparable to those of human annotators, although leaving more than 10 points for improvement. Similarly, we believe cross-sentence information can further improve the results and leave it for further investigation.

Joint entity and relation extraction
Previous sections consider entity and relation extraction separately, but the practical scenario involves joint extraction of entities and relations.
Here we use the well-performing joint extraction model PURE (Zhong and Chen, 2021) to evaluate the joint extraction performance. The PURE model first produces all the possible entities and then uses these predicted entities for relation extraction. Following their work, the evaluation is conducted with three metrics: (1) Ent: a predicted entity is correct only if the predicted span boundaries and entity type are both correct; (2) Rel: a predicted relation type is correct given the correct boundaries of the two spans; (3) Rel+: in addition to the boundary requirements, the predicted entities must also have the correct types.
As can be seen from Table 9, the joint model achieves a 79.46% F1 score on entity prediction, while much lower F1 scores are observed for Rel and Rel+, at 66.69% and 62.53% respectively. This is not unexpected, since RE relies on the preceding entity prediction and its errors inevitably propagate. Compared with the previous individual extraction, the joint extraction achieves lower results and leaves a large margin for improvement. Considering that the goal of action graph extraction from procedures is the joint extraction of all entities and relations, we encourage more research towards better modeling. Also of note, the current joint evaluation is on single sentences, while more realistic end-to-end extraction is conducted on whole paragraphs. Cross-sentence relations would also be present in such a scenario, but this is out of the scope of this work.
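The difference between the Rel and Rel+ criteria can be sketched as a small matching function. This is our own illustration of the two relation metrics described above, not the official PURE evaluation script:

```python
def relation_matches(pred_rels, gold_rels):
    """Count Rel and Rel+ matches between predicted and gold relations.

    Each relation is (head, tail, rel_type), where head and tail are
    (start, end, entity_type) spans. A sketch of the matching criteria,
    not the official evaluation code.
    """
    def boundaries_only(rels):
        # Rel: span boundaries and relation type must match;
        # entity types are ignored.
        return {((h[0], h[1]), (t[0], t[1]), r) for h, t, r in rels}

    rel = len(boundaries_only(pred_rels) & boundaries_only(gold_rels))
    # Rel+: the entity types must also be correct (full tuple match).
    rel_plus = len(set(pred_rels) & set(gold_rels))
    return rel, rel_plus
```

A relation whose spans and type are right but whose entity type is wrong counts toward Rel but not Rel+, which is exactly how entity errors propagate into the stricter metric.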

Conclusion
In summary, we contribute a new dataset, PcMSP, collected from 305 open-access scholarly publications for action graph construction from materials synthesis procedures. The two-round human expert annotation guarantees the high quality of the dataset, as evidenced by the agreement study. Based on this new dataset, we perform sentence classification, named entity recognition, and relation extraction tasks. We also experiment with the joint extraction of entities and relations. Several well-performing neural models are utilized to provide competitive baselines, although a big gap remains compared with the human upper bound. To alleviate the data scarcity in this domain, we will make our dataset publicly available.
Some future directions would be to investigate incorporating cross-sentence context, improving the joint extraction results, performing paragraph-level end-to-end extraction, as well as using our PcMSP to investigate domain adaptation. For example, pre-training with distant supervision in the materials domain might also help improve the results. Considering the high labeling cost, how to efficiently transfer knowledge to other domains to reduce human annotation is also of great importance.

Limitations
Even though we try our best to guarantee high annotation quality, inaccurate labels may still exist. We are not responsible for any products derived from our dataset. Also, real-world end-to-end action graph construction involves the whole pipeline and will inevitably face the error propagation problem.

Ethics Statement
We note that our data source comes from open-access publications and we make our dataset publicly available, but further use might still fall under potential limitations imposed by certain journals. In addition, in our annotation process, all the annotators are paid as research assistants following the campus policy.

A Background on Polycrystalline Materials
Polycrystalline materials are solids composed of small, randomly oriented crystallites, also called grains, with sizes varying from a few nanometers to several millimeters. Most of the inorganic solid materials available in macroscopic quantities are in fact polycrystals, including common metals, ceramics, and rocks. They provide versatility in numerous applications such as superconductors, batteries, photovoltaic cells, and shape memory alloys (Husain et al., 2018; Peng et al., 2017, 2018; Biswal and Mohanta, 2021). The structure of a single crystal or monocrystal (Figure 3a) is continuous and highly ordered, while an amorphous phase (non-crystal) (Figure 3b) such as glass does not display any structure, as the constituent atoms are not arranged in an ordered manner. In between these two extremes, a polycrystal (Figure 3c) exists, which is made up of many crystallites, also referred to as grains. During the solidification of polycrystalline materials, small nuclei first form at different spots of the liquid sample and subsequently absorb atoms from the surrounding liquid to grow into larger grains. These grains vary in size from nanometers to millimeters and are randomly oriented with no preferred direction in the structure. Therefore, a large enough volume of polycrystalline material can be approximately considered isotropic. Compared to single crystals, polycrystalline materials also require less sophisticated techniques to make, significantly lowering the cost of production. As most real-world solids are polycrystalline materials, it is critical to synthesize and understand them. A substantial number of studies have been done by researchers across the world to discover new materials. This work extracts knowledge from those synthesis processes and aims to guide synthesis efforts toward the unexplored space.

B Modeling
We mainly use PURE (Zhong and Chen, 2021) as the backbone for our tasks.

B.1 Sentence classification
Sentence classification is a binary text classification problem. We build one additional layer on top of BERT and fine-tune it for another 10 epochs.

B.2 Named entity recognition
For the SciBERT model, we stack a conditional random field (CRF) (Tseng et al., 2005) layer on top of SciBERT for sequence labeling following the traditional BIO notation. For the MatBERT result, we follow the span-based approach of Zhong and Chen (2021) to obtain a contextualized representation for every span and feed it into another feed-forward layer to predict the entity type.
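In the span-based approach, every candidate span up to a maximum width is scored for an entity type (or "no entity"). Candidate generation can be sketched as follows; the `max_width` value is an illustrative assumption, not the setting used in our experiments:

```python
def enumerate_spans(tokens, max_width=8):
    """All candidate spans (start, end_exclusive) up to max_width tokens,
    as used in span-based tagging. Each span is later assigned an entity
    type or a 'no entity' label by the classifier head."""
    return [
        (i, j)
        for i in range(len(tokens))
        for j in range(i + 1, min(i + max_width, len(tokens)) + 1)
    ]
```

Unlike BIO sequence labeling, this formulation classifies spans directly, so the span representation (e.g., the boundary token embeddings plus a width feature) determines the entity type in a single step.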

B.3 Relation classification
We utilize the span representations of entity mentions for relation prediction with typed entity markers, as proposed by the relation model of Zhong and Chen (2021).

B.4 Joint extraction
Following Zhong and Chen (2021), the predicted entities are fed into another encoder for relation prediction, and we adopt two different encoders for the joint extraction of entities and relations.

C Experimental settings
We select the best combination of hyperparameters on the development set by random search. Three random seeds are used for all models, and we report results based on the median performance. The standard macro-average precision (P), recall (R), and F1 scores are calculated.
The Adam optimizer (Kingma and Ba, 2015) is used for all models. Other parameters are selected within a range of values; for example, the learning rate is chosen from {1e-4, 5e-5, 1e-6} and the batch size is 8 or 16. The models are implemented in PyTorch, and a Tesla P40 with 24GB RAM is used for all experiments. Training takes around half an hour, one hour, and three hours for the sentence, entity, and relation tasks, respectively, over 10 to 50 epochs.
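Random search over such a small grid can be sketched as below. The helper name and seed handling are illustrative; only the learning-rate and batch-size values mirror the ranges stated above:

```python
import random

def sample_config(seed):
    """Randomly sample one hyperparameter configuration from the grid in the
    text (learning rate and batch size). A sketch of the random-search setup,
    not our actual experiment driver."""
    rng = random.Random(seed)  # seeded for reproducible draws
    return {
        "lr": rng.choice([1e-4, 5e-5, 1e-6]),
        "batch_size": rng.choice([8, 16]),
    }
```

In practice one would draw several such configurations, train on the training split, and keep the configuration with the best development-set score.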

C.1 Data preprocessing
Each plain text document containing the synthesis paragraphs is imported into the INCEpTION platform, which also performs sentence segmentation and word tokenization with its built-in algorithm.
After tokenization, each sentence is mapped to its corresponding entity mentions and relations, which include the named entity type, position, and token information, as well as the relation type and its left and right position information.

D Inter-annotator Agreement Study
In addition to the Fleiss' kappa agreement scores in Table 3, we describe more details in this section.

D.1 Sentences annotation
Given a paragraph selected from a scientific publication, we first examine the synthesis-related sentences. In practice, the annotators only label synthesis-related sentences with entity and relation information. All other sentences, left unlabeled, are considered non-synthesis sentences.
To compare the model's performance with human annotation, 32 documents are labeled individually by the two main annotators in the second round. One annotation is then regarded as the ground truth and the other is treated as a prediction. A micro-average F1 score of 90.62% is obtained between the two annotators. Additional details about the precision, recall, and F1 scores are shown in Table 10. In general, the main annotator labels 153 of the 256 sentences as synthesis-related, while the second annotator labels 163 as target sentences. The overall result demonstrates high-quality annotation and can serve as a human agreement score for further baselines.
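Treating one annotation as gold and the other as predictions, the synthesis-class precision, recall, and F1 can be computed as in the following sketch (our own illustration, not the exact script behind Table 10):

```python
def synthesis_prf(gold, pred, positive="synthesis"):
    """Precision, recall, and F1 on the positive (synthesis) class, treating
    one annotator's labels as gold and the other's as predictions.
    gold and pred are equal-length lists of sentence labels."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Swapping which annotation is treated as gold exchanges precision and recall but leaves F1 unchanged, which is why a single F1 value suffices as the agreement score.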

D.2 Named entity annotation
Following the previous step, all of the entity mention boundaries are first recognized by the annotators, and then one entity label is chosen from the predefined entity labels to represent the entity type.

E Document Distribution Among Journals
Table 13 shows how our collected documents are distributed among different journals. Considering that the writing style and publication requirements of different journals vary a lot, we aim to include documents from a range of sources to make the dataset more diverse.

F Annotation Examples and Statistics
Common examples of entity mentions and relation triples are shown in Table 14 and Table 15, respectively. A relation triple has the form r_i: (e_i, e_j), where r_i is a relation label, and e_i and e_j denote entity mentions within one sentence.

Figure 4: Confusion matrix over relations between the two lead annotators.

Table 2: Corpus statistics of our PcMSP and previous datasets for materials science.

Each span of continuous words is labeled with a certain entity type. There are four general categories of labels, namely Material (Material-target, Material-recipe, Material-intermedium, and Material-others), Property (Property-time, Property-temperature, Property-rate, and Property-pressure), Operation, and Item (Value, Brand, Device, and Descriptor). Every general coarse-grained category can be further divided into one or several fine-grained types. The full definitions of these labels can be found in the following.
Material-target: the final material (or product) of the materials synthesis process; this usually refers to only one target in a typical procedural paragraph, but multiple target materials can appear (this occurs in less than 1% of cases).
Material-recipe: a raw material used to synthesize the final product, which can be a fundamental element (like Si), a compound (like SrO2), or a precursor of other polycrystalline materials.
Operation: verbs or a particular overall synthesis method, like Solid-state-reaction.
Property-time: a time condition associated with an operation, which is usually composed of numerical values and time units.

As shown in Table 2, the SC-CoMIcs and MS-MENTIONS corpora only contain entity mentions, without any sentence or relation labels. In addition, the SOFC-exp corpus focuses on the full text rather than only the experimental section.

Table 6 :
Named entity recognition results in terms of F1 score on the PcMSP test set.

Table 7 :
NER per label performance on the PcMSP test set by SciBERT.

Table 8 :
RE per label performance on the PcMSP test set.

Table 9 :
Joint entity and relation extraction results on the test set.

Table 13 :
Document distribution among the main journals. ACS: American Chemical Society, APS: American Physical Society; Others refers to journals not listed here.