MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain

Keeping track of all relevant recent publications and experimental results for a research area is a challenging task. Prior work has demonstrated the efficacy of information extraction models in various scientific areas. Recently, several datasets have been released for the yet understudied materials science domain. However, these datasets focus on sub-problems such as parsing synthesis procedures or on sub-domains, e.g., solid oxide fuel cells. In this resource paper, we present MuLMS, a new dataset of 50 open-access articles, spanning seven sub-domains of materials science. The corpus has been annotated by domain experts with several layers, ranging from named entity mentions and relations to frame structures. We present competitive neural models for all tasks and demonstrate that multi-task training with existing related resources leads to benefits.


Introduction
Designing meaningful experiments in empirical sciences requires maintaining a detailed overview of the huge amounts of literature published every year. Applying natural language processing (NLP) in this context has become an active research area (Chandrasekaran et al., 2020; Beltagy et al., 2021; Cohan et al., 2022). Besides the biomedical field, which has been studied extensively in the past decades (e.g., Collier et al., 2004; Cohen et al., 2012; Demner-Fushman et al., 2022), the less-studied materials science domain has recently received more attention (Mysore et al., 2019; Friedrich et al., 2020; O'Gorman et al., 2021).
Materials science research aims to design and discover new materials. Papers in this field are hence often partly dedicated to synthesis procedures, the "recipes" for creating a material. Their extraction from papers has been covered by Mysore et al. (2019) and O'Gorman et al. (2021). Much materials science research develops materials in the context of creating a particular device, e.g., batteries or photovoltaic panels. The device is tested under various conditions, and the literature needs to be analysed to identify promising set-ups. Friedrich et al. (2020) address this for solid oxide fuel cells.
In this paper, we introduce MuLMS (the Multi-Layer Materials Science corpus), a new dataset of scientific publications annotated by domain experts with named entity (NE) mentions, relations, and frame structures corresponding to a broad notion of measurements (see Figure 1). In contrast to prior works, we include papers from a variety of materials science sub-domains. To the best of our knowledge, the existing datasets only annotate particular paragraphs or subsets of the sentences with NE mentions. Our dataset is the first to exhaustively annotate a large-scale collection of materials science articles with NEs and facilitates novel semantic search applications, e.g., answering search queries such as "find a passage within a paper reporting a measurement using material X, condition Y, and obtaining a value of at least Z." The design of MuLMS' annotation scheme results from a collaboration of NLP and materials science experts. Our inter-annotator agreement study shows good agreement for most categories and decisions. We propose several machine learning tasks on the annotated data and present strong neural baselines for all tasks, which signals a high level of consistency across the annotations in our dataset. We cast detecting sentences describing measurements as a sentence classification task and provide a robust tagger for recognizing NEs. We propose to treat relation and semantic role extraction on MuLMS in a single step using a dependency parser that predicts relations between the NEs in a sentence. According to our multi-task experiments with related datasets, training jointly with MuLMS is beneficial for performance on those datasets.
Our contributions are as follows.
(1) We publicly release a dataset of 50 open-access scientific publications exhaustively annotated by a domain expert with NE mentions, relations, and measurement frames. 1 (2) We define a set of NLP tasks on MuLMS and provide strong transformer-based baselines. Our code will be published.
(3) We formulate the relation and frame-argument extraction as a single dependency parsing task, which extracts all relations in a sentence in one processing step.
(4) We perform an extensive set of multi-task learning experiments with related corpora, showing that MuLMS is a useful auxiliary task for two other materials science NLP datasets.

Related Work
Several materials science NLP datasets have recently been released, e.g., targeting NE recognition (Yamaguchi et al., 2020; O'Gorman et al., 2021). The Materials Science Procedural Text (MSPT) corpus (Mysore et al., 2019) consists of paragraphs describing synthesis procedures annotated with graph structures capturing relations and typed arguments. SOFC-Exp (Friedrich et al., 2020) marks similar graph structures describing experiments.
In this paper, we compare two state-of-the-art approaches to Named Entity Recognition (NER). Huang et al. (2015) use a CRF layer (Lafferty et al., 2001) on top of a neural language model (in their case a BiLSTM) for sequence-tagging related tasks. Yu et al. (2020) treat NER as a graph-based dependency parsing task by representing NEs as spans between the first and last token of an entity. In the materials science domain, Friedrich et al. (2020) test a variety of embedding combinations in a CRF-based tagger. O'Gorman et al. (2021) compare different pre-trained transformers for token classification. Both studies find SciBERT (Beltagy et al., 2019), a BERT-style model pre-trained on scientific documents, to be very effective.
Relation and Event Extraction. Friedrich et al. (2020) treat their slot filling task in the SOFC sub-domain as a sequence tagging task, assuming that each sentence represents at most one experiment.

1 https://github.com/boschresearch/mulms-wiesp2023
To predict a possible relation between two entities, Swarup et al. (2020) retrieve a set of sentences similar to the input sentence, and learn to copy relations from these sentences. Mysore et al. (2017) experiment with unsupervised methods for extracting action graphs for synthesis procedures.
An exhaustive overview of the literature on biomedical relation extraction is out of the scope of this paper. Recent works have used graph neural networks (Huang et al., 2020) or convolutional neural networks (Ramponi et al., 2020). Sarrouti et al. (2022) compare various pre-trained transformer models. Semantic parsing of frame structures (Fillmore and Baker, 2001) has been addressed using graph-convolutional networks (Marcheggiani and Titov, 2020), BiLSTMs (He et al., 2018), and recently by generating structured output using encoder-decoder models (Hsu et al., 2022; Lu et al., 2021). Tackling semantic dependency parsing with a biaffine classifier architecture was first proposed by Dozat and Manning (2018).

MuLMS Corpus
In this section, we present our new corpus.

Source of Texts and Preprocessing
We select 50 scientific articles licensed under CC BY from seven popular sub-areas of materials science: electrolysis, graphene, polymer electrolyte fuel cell (PEMFC), solid oxide fuel cell (SOFC), polymers, semiconductors, and steel. The four SOFC papers were selected from the SOFC-Exp corpus (Friedrich et al., 2020). 11 papers were selected from the OA-STM corpus 2 and classified into the above subject areas by a domain expert. The majority of the papers were found via PubMed 3 and DOAJ 4 using queries prepared by a domain expert. For the OA-STM data, we use the sentence segmentation provided with the corpus, which has been created using GENIA tools (Tsuruoka and Tsujii, 2005). For the remaining texts, we rely on the sentence segmentation provided by INCEpTION (Klie et al., 2018) with some manual fixes. As shown in Table 1, documents are rather long, with a tendency towards long sentences (but with high variation due in part to short headings).

Annotation Scheme
We annotate various layers: NEs, relations, and frame structures representing measurements.

Named Entities
We annotate the following materials-science-specific NE mentions and assign NE types to these mentions. MAT: mentions of materials, described by their chemical formula (WO3) or their chemical name (indium tin oxide). SAMPLE: the concrete specimen under study, which may be referred to by its position, its batch name (Aq-825), by the whole component (MEA-Pt/C), or by part of the material's structure (ionomer patches). In simulation papers, the SAMPLE may also be the computational model under study (RBF-ANN).

Relations and Measurement Frame
We treat measurement annotation in a frame-like fashion (Fillmore and Baker, 2001), using the span type MEASUREMENT to mark the triggers (e.g., was measured, is plotted) that introduce the Measurement frame to the discourse. About 88% of the triggers are verbs. The remaining 12% occur in figure captions without verb phrases and are annotated either on nouns (Comparison) or, in the absence of more suitable phrases, on figure labels such as Figure 17. The trigger annotations of these sentences serve as the root of the tree/graph annotations, as illustrated in Figure 1. There are also cases in which the Measurement frame is evoked, but there are no technical details or results that we can extract about the measurement. We mark the triggers of these sentences with the tag QUAL_MEAS (qualitative mention of a measurement). An example of such a sentence is "We compare a critical volume to be detached from the different nanostructures."

Measurement-related Relations. We annotate several relations that start at a MEASUREMENT tag and that end at the annotations of the corresponding slot fillers within the sentence. Consider the following sentence: "To characterize the ORR activity of the catalyst, linear scan voltammetry (LSV) was tested from 0 to 1.2 V on an RDE with a scan rate of 50 mV/s in O2-saturated HClO4 [...]." In most cases, a conditionProperty or a measuresProperty relation connects the MEASUREMENT annotation to a PROPERTY node, at which a propertyValue relation starts that ends at the respective VALUE. However, in some cases, the condition or measured property is not mentioned explicitly. In this case, we link the VALUE directly to the MEASUREMENT node via a conditionPropertyValue or a measuresPropertyValue link. For consistency reasons, we also add these links in cases that mention the property explicitly, turning the trees into graph structures. Out of the added conditionPropertyValue links, 967 are for such explicit cases, while the other 206 describe implicit cases. In the case of measuresPropertyValue, 722 links are for explicit cases and 36 for implicit cases.
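As a rough illustration (not the authors' annotation tooling), the consistency rule above, which adds a direct MEASUREMENT-to-VALUE shortcut link whenever a property is mentioned explicitly and thereby turns the per-sentence trees into graphs, can be sketched as follows; relation names follow the paper, while the function name is our own:

```python
# Sketch: add the direct MEASUREMENT -> VALUE "shortcut" links that MuLMS
# annotates for consistency, turning tree annotations into graphs.
def add_consistency_links(relations):
    """relations: list of (head, label, dependent) triples."""
    extra = []
    for kind in ("measuresProperty", "conditionProperty"):
        shortcut = kind + "Value"  # e.g. measuresPropertyValue
        for meas, label, prop in relations:
            if label != kind:
                continue
            # follow the propertyValue edge from the PROPERTY node to the VALUE
            for head, lab2, value in relations:
                if head == prop and lab2 == "propertyValue":
                    triple = (meas, shortcut, value)
                    if triple not in relations and triple not in extra:
                        extra.append(triple)
    return relations + extra
```

For the example sentence above, the explicit chain MEASUREMENT -> PROPERTY -> VALUE would thus receive an additional measuresPropertyValue edge from the trigger directly to the value span.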
Further Relations. In the following, we explain relations that can appear independently of measurements. Examples are shown in Table 2.
hasForm: connects a mention of MATERIAL with the corresponding FORM annotation.
usedIn: connects a MATERIAL and the DEVICE it is used in. In Table 2, MOSFET stands for Metal Oxide Semiconductor Field-Effect Transistor.
usedAs: links a specific MATERIAL mention with a more generic one such as catalyst, a material class defined by its function.
dopedBy: indicates dopants (e.g., chlorine), i.e., impurities added to a main material (e.g., SiC).
usedTogether: connects two MATERIALs if they are used together in an experiment, i.e., if the materials are part of an assembly or a mixture.

Corpus Statistics
We now analyze our corpus and provide detailed corpus statistics. In total, there are 46,351 NE annotations. Table 3 shows the counts by NE label. There are roughly 1.5 MAT annotations per sentence, as these are nested and occurrences of composite materials often result in many combined MAT tags. Table 4 reports the counts of annotated relations (16,794 in total), with hasForm being the most frequent relation (2,910 instances) and dopedBy the least frequent (only 65 instances).

Out of all 10,186 sentences, 2,111 (20.7%) describe a measurement (i.e., they contain at least one MEASUREMENT annotation). On average, each document contains 43.4 MEASUREMENT annotations. In addition, there are 1,476 sentences (14.5%) marked as containing a QUAL_MEAS, with 40 of these sentences also containing a MEASUREMENT annotation.

Inter-Annotator Agreement (IAA)
Our entire dataset has been annotated by a graduate student of materials science, who was also involved in the design of the annotation scheme. We perform two agreement studies, comparing with the annotations of a second annotator with a PhD degree in environmental engineering and several years of experience in materials science research.
Agreement on identifying Measurement sentences. In this agreement study, we estimate the degree of agreement on whether a sentence expresses a MEASUREMENT, a QUAL_MEAS, or whether it does not express a measurement at all. We sample 60 sentences marked with MEASUREMENT, 60 sentences marked with QUAL_MEAS, and 120 sentences not marked as either by the first annotator. Table 5 shows the confusion matrix for the 239 sentences for which both annotators provided a label. One automatically selected sentence was not labeled by one of the annotators due to incomprehensibility. In terms of Cohen's κ (Cohen, 1960), agreement amounts to 59.2, indicating moderate to substantial agreement (Landis and Koch, 1977). When collapsing MEASUREMENT and QUAL_MEAS, κ is 63.4 (substantial).
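For readers unfamiliar with the metric, Cohen's κ is computed from the confusion matrix of the two annotators' labels. A minimal sketch, with a made-up 3x3 matrix (not the actual Table 5 counts):

```python
# Cohen's kappa from a confusion matrix (rows: annotator 1, columns: annotator 2).
def cohens_kappa(matrix):
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / n   # observed agreement
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical counts over MEASUREMENT / QUAL_MEAS / NONE (illustration only):
demo = [[50, 8, 2], [10, 45, 5], [4, 6, 110]]
```

Perfect agreement yields κ = 1, and agreement purely by chance yields κ = 0, which is why κ is preferred over raw accuracy for skewed label distributions like ours.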
Agreement on named entities. We next compute agreement for NE and relation annotations. IAA on "easy" types such as MAT, NUM, UNIT, VALUE and RANGE has been shown to be very high in prior work (Friedrich et al., 2020). Hence, as our resources are limited, we provide annotations of these types to the second annotator for correction. We sample 134 sentences such that each entity type occurs at least 25 times in the annotations of the first annotator and have the second annotator correct or add entity annotations. We then compare the annotated sets of NE mentions using precision and recall (for a justification of this choice of agreement metrics, see Appendix C). Results using relaxed matching (containment) are shown in Table 6 (detailed counts in Appendix C). For most types, scores are in the expected range of difficult semantic annotation tasks. Agreement on identifying Measurement sentences is good; the decision of where exactly to place the MEASUREMENT annotation differs between annotators.

Agreement on relations. We sample 178 sentences in which each relation occurs at least 25 times according to the first annotator. We keep the NE annotations and ask the second annotator to add relations. Table 7 shows the results in terms of precision, recall, and κ per relation type. The latter has been computed by treating all pairs of NE annotations as potential relations, using NO_REL if no relation has been annotated. Overall κ on relations is 0.61 (substantial). For each relation label, we map all other relation types to OTHER and compute agreement for the binary decision whether the label is present or not (analysis suggested by Krippendorff (1989)). κ aims to quantify the degree of agreement above chance. Interpreting our κ scores according to the scale of Landis and Koch (1977), we reach fair agreement for conditionPropertyValue, usedTogether, conditionSampleFeatures, dopedBy, and usedAs. We reach moderate agreement for usedIn, takenFrom, and conditionInstrument. For the practically important relations propertyValue, measuresProperty, and usesTechnique, we even reach almost perfect agreement.
For the non-easily identifiable types, a post-hoc discussion with the second annotator (who did not receive extensive training on the task) revealed that it was not always clear to them how to decide between related labels (e.g., conditionProperty and conditionEnvironment). Yet, these labels can be learned with good or acceptable accuracy (see Appendix E), indicating that the primary annotator used the labels consistently.
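The relaxed (containment) matching used to score NE agreement can be sketched as follows; this is our own illustration of the idea, not the evaluation script: a span of one annotator counts as matched if some same-labeled span of the other annotator contains it or is contained in it.

```python
# Relaxed (containment) precision/recall between two annotators' span sets.
def contains(a, b):
    """True if span a (start, end) fully contains span b."""
    return a[0] <= b[0] and b[1] <= a[1]

def relaxed_precision_recall(spans_a, spans_b):
    """spans_*: sets of (start, end, label); returns (precision of A, recall of A)."""
    def matched(x, other):
        return any(x[2] == y[2] and (contains(x[:2], y[:2]) or contains(y[:2], x[:2]))
                   for y in other)
    prec = sum(matched(x, spans_b) for x in spans_a) / len(spans_a)
    rec = sum(matched(y, spans_a) for y in spans_b) / len(spans_b)
    return prec, rec
```

Containment matching deliberately forgives boundary disagreements (e.g., whether a determiner belongs to the span), which are common in NE annotation.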
Tasks and Models

In this section, we define several NLP tasks for MuLMS and describe our computational models.

Pre-trained Models
We use BERT (Devlin et al., 2019) as the underlying text encoder for all of our models. We also use variants of BERT, namely SciBERT (Beltagy et al., 2019), which has been pre-trained on articles in the scientific domain, and MatSciBERT (Gupta et al., 2022), a version of SciBERT further pre-trained on materials science articles. We use the uncased, 768-dimensional variant of each model, which we fine-tune.

Detecting Measurements
We model the task of classifying whether a sentence contains a MEASUREMENT or a QUAL_MEAS annotation as a ternary sentence classification task, i.e., it is also possible that a sentence does not refer to any measurement. As we are primarily interested in detecting MEASUREMENT, we map the few multi-label cases carrying both positive labels to MEASUREMENT. We use a linear layer plus softmax with the CLS token embedding as input. For training, we downsample the amount of non-measurement sentences.
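The downsampling step can be sketched as follows; this is a minimal illustration under our own naming (the function and rate are illustrative, not the released code): all MEASUREMENT and QUAL_MEAS sentences are kept, while only a tuned fraction of the OTHER sentences enters training.

```python
import random

# Keep every measurement-related sentence, but only a fraction of the
# majority-class OTHER sentences (the keep rate is a tuned hyperparameter).
def downsample(sentences, labels, keep_rate, seed=42):
    rng = random.Random(seed)
    pos = [(s, l) for s, l in zip(sentences, labels) if l != "OTHER"]
    neg = [(s, l) for s, l in zip(sentences, labels) if l == "OTHER"]
    neg = rng.sample(neg, int(len(neg) * keep_rate))
    data = pos + neg
    rng.shuffle(data)
    return data
```

This rebalances the roughly 4:1 skew towards non-measurement sentences reported in the corpus statistics before fine-tuning the classifier.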

Named Entity Recognition (NER)
We compare two state-of-the-art models for NER, (a) a sequence tagger and (b) a dependency parser. For the sequence tagger, we encode the NE labels using the nested BILOU scheme (Alex et al., 2007), which leverages a label set of combined types constructed from the training set for nested NEs. As there are only very few cases (about 0.65% of all NE annotations) where a token receives more than three stacked NE labels, in order to avoid sparsity issues, we consider only the "bottom" three layers of stacked entities. We feed the contextualized embeddings of the last transformer layer of the respective first wordpiece token of each "real" token into a linear layer and then use a CRF (Lafferty et al., 2001) to optimize predictions for the entire sequence.
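A minimal sketch of the nested BILOU encoding, under the simplifying assumption (ours, for illustration) that a span's layer is its nesting depth: each entity is assigned a layer, and the per-layer B/I/L/U tags of a token are joined into one combined label.

```python
# Encode nested NE spans into combined BILOU labels, capped at three layers.
def nested_bilou(n_tokens, spans):
    """spans: list of (start, end, label), end inclusive."""
    def depth(s):
        # number of spans strictly containing s determines its layer
        return sum(1 for t in spans if t != s and t[0] <= s[0] and s[1] <= t[1])
    layers = {}
    for s in spans:
        layers.setdefault(min(depth(s), 2), []).append(s)  # keep "bottom" 3 layers
    tags = []
    for i in range(n_tokens):
        parts = []
        for d in sorted(layers):
            for start, end, label in layers[d]:
                if start <= i <= end:
                    if start == end:
                        parts.append("U-" + label)
                    elif i == start:
                        parts.append("B-" + label)
                    elif i == end:
                        parts.append("L-" + label)
                    else:
                        parts.append("I-" + label)
        tags.append("+".join(parts) if parts else "O")
    return tags
```

For instance, a VALUE span covering a NUM token and a UNIT token yields the combined labels B-VALUE+U-NUM and L-VALUE+U-UNIT, which is exactly the kind of combined label set the tagger learns.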
Modeling NER as a dependency parsing task (Yu et al., 2020) can easily account for nested NEs. The main idea is to predict edges reaching from the end token of an NE to its start token, as depicted in Figure 2. We adapt the STEPS parsing pipeline (Grünewald et al., 2021a) to the task. There are three combinations of tags in our dataset that occasionally cover the exact same span and that occur more than 20 times: VALUE+NUM, VALUE+RANGE and MAT+FORM. We hence introduce these combined labels. For any other infrequent conflicting labels, we do not add extra tags, i.e., the model cannot capture these cases. We accept this slight restriction of the model's capabilities in order to avoid sparsity issues. In the evaluation, we do not filter for these cases but of course use all nested NEs as annotated.
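The conversion underlying this formulation can be sketched as follows (our own illustration, with illustrative function names): each entity becomes a labeled edge from its end token to its start token, so nested and overlapping entities need no special encoding, and same-span entities are handled via combined labels.

```python
# Cast NE spans as labeled dependency edges (end token -> start token).
def spans_to_edges(spans):
    """(start, end, label) -> (head=end, dependent=start, label) edges."""
    return [(end, start, label) for start, end, label in spans]

def edges_to_spans(edges):
    return [(dep, head, label) for head, dep, label in edges]

# Merge entities covering the exact same span into one combined label,
# e.g. VALUE and NUM on the same tokens become "VALUE+NUM".
def merge_same_span(spans):
    merged = {}
    for start, end, label in spans:
        merged.setdefault((start, end), []).append(label)
    return [(s, e, "+".join(labels)) for (s, e), labels in merged.items()]
```

Since every span maps to a unique edge and back, nested structures such as a MAT inside a longer MAT are represented without any stacked tagging scheme.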

Relation Extraction
Given an input sentence along with all named entities within it, as well as their types (either gold or predicted, depending on the experimental setting), we predict which relation (if any) is present between them. We treat all relations in a single model and predict all relations of a sentence simultaneously by modeling relation extraction as a graph parsing task. Following Toshniwal et al. (2020), we first create an embedding e_i for the i-th NE in the sentence by concatenating the token embeddings of its first and last token (e_i,START, e_i,END). We also concatenate a learned embedding for the NE's label (e_i,LABEL): e_i = e_i,START ⊕ e_i,END ⊕ e_i,LABEL. Considering NEs as nodes in a graph, we use a biaffine classifier architecture (Dozat and Manning, 2017), using the implementation of Grünewald et al. (2021a,b), to predict the relation between each pair. The non-existence of a relation is encoded as simply another label (∅). For details on the parser architecture, see Appendix B.

Experiments
We now detail our experimental results.
Experimental Settings. We split our corpus into train, dev, and test sets on a per-document basis. Within the train set, we provide five distinct tune splits (train1 to train5). For all experiments and for hyperparameter tuning, we always train five models. Similar to cross-validation, we train on four folds and use the fifth "training fold" for model selection (cf. van der Goot (2021) for details). Hyperparameters are chosen based on the best dev results, and we finally report results for the test set.
The splits are the same across all tasks. Because the training data varies across the five runs for which we report results, standard deviations are usually larger than when using the same training data. For hyperparameter settings, see Appendix A.
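The five-run scheme can be sketched as follows (an illustration under our own naming, not the released training code): each run trains on four tune splits and uses the held-out fifth for model selection.

```python
# Build the five train/model-selection rotations over the tune splits.
def tune_split_runs(folds):
    runs = []
    for i, held_out in enumerate(folds):
        train = [f for j, f in enumerate(folds) if j != i]  # four training folds
        runs.append((train, held_out))                      # fifth fold selects the model
    return runs
```

Reporting mean and standard deviation over these five runs captures the variance introduced by the changing training data, which is why the reported standard deviations tend to be larger than under a fixed split.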

Identifying Measurement Sentences
Table 8 reports the results for identifying sentences that contain a MEASUREMENT or a QUAL_MEAS annotation. In each experiment, we tune the downsampling rate for the majority class OTHER and the learning rate (using grid search from 1e-4 to 1e-7). The random baseline assigns labels according to the percentage of instances in the (full) training set carrying a particular label. The average overall accuracy of the MatSciBERT classifier is 78.2%.
SciBERT and MatSciBERT perform similarly, with MatSciBERT having a small edge. Identification of MEASUREMENT is comparable to our estimate of human agreement. For identifying QUAL_MEAS, there is headroom.

Named Entity Recognition Results

Table 9 reports named entity recognition results on the test set. The two models pre-trained on scientific text outperform BERT by a considerable margin. For detailed per-label statistics, see Appendix E. Precision and recall are approximately balanced for all labels. An exception is SAMPLE, which is both infrequent in the dataset and hard to identify for humans. Both models suffer from low recall for this tag.

Relation Extraction Results
Table 10 shows the results for relation extraction on gold entities. A predicted relation is counted as correct if and only if there is a relation with the same start span, end span, and relation label in the set of gold relations for the sentence. The majority baseline assigns to each pair of entities the relation that is most common in the training set for the respective entity types of the governing and dependent spans (see Appendix E).
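This strict matching criterion can be sketched as follows (our own illustration of the standard metric, not the evaluation script): a predicted triple counts as a true positive only if start span, end span, and label all match a gold triple exactly.

```python
# Exact-match precision/recall/F1 over (start span, end span, label) triples.
def relation_prf(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact triple matches
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Note that under this criterion, a relation with the correct spans but the wrong label (e.g., conditionProperty instead of measuresProperty) counts as both a false positive and a false negative.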
The results demonstrate that a biaffine dependency parsing approach achieves robust performance overall and outperforms the baseline by a substantial margin. The two models trained on scientific text outperform BERT. Their results are similar, with MatSciBERT having a slight edge.
Analysis of per-label scores for MatSciBERT (see Appendix E) shows that the highest scores are achieved for conditionInstrument (92.2 F1), usesTechnique (91.0 F1), and takenFrom (84.7 F1). This is somewhat surprising, especially for conditionInstrument and takenFrom, as these are among the rarest relation types in the corpus (see Table 4). However, our majority baseline achieves high accuracies on these relation types as well (>90 F1 for conditionInstrument and usesTechnique), i.e., they are easily inferable from entity types. The worst performance is observed on the relation types usedTogether (4.0 F1), dopedBy (22.7 F1), and usedIn (37.9 F1). These relations occur relatively rarely and also cannot be inferred from entity types.
Relation extraction on predicted entities. Finally, we also run our relation extraction module on predicted named entities using the respective best-performing models (both based on MatSciBERT). Models are evaluated as above, with the additional requirement that the predicted entity spans and labels are correct.

Multi-task experiments. Results of multi-task learning (MTL) for relations are shown in Table 12. We observe that adding MuLMS to the training data of both SOFC-Exp and MSPT results in improvements. Incorporating SOFC-Exp instances in the training does not meaningfully increase prediction accuracy on MuLMS, whereas incorporating instances from MSPT leads to modest improvements. Intuitively, this makes sense: relations in SOFC-Exp focus on a specific type of experiment, while MuLMS covers a broader range of measurements. Similarly, some MuLMS relations bear resemblance to MSPT relations (e.g., those dealing with instruments or apparatus), which explains why training jointly is beneficial.

Conclusion and Outlook
In this resource paper, we have presented a new large-scale dataset of 50 scientific articles in the domain of materials science, exhaustively annotated with named entity mentions, relations, and measurement-related frames. Our inter-annotator agreement study shows good agreement for most decisions. Our experiments with state-of-the-art neural models highlight that most distinctions can be learned with good accuracy, and that synergies can be achieved by training jointly with existing, more specific materials-science NLP datasets.
Future work is needed to improve on end-to-end or joint models of NER and relation extraction, as our experiments showed that a pipeline-based setting suffers from error propagation. A potential next step is to adapt sequence-to-sequence models to the structure induction tasks of MuLMS, following ideas of Hsu et al. (2022) and Lu et al. (2021). Finally, employing data augmentation techniques, in particular for the less frequent relation types, is a viable path for future work.

Limitations
As discussed in Sec. 3.4, we expect our inter-annotator agreement scores to underestimate the reproducibility of the task. It is, unfortunately, not trivial to find annotators with the required background knowledge. Hence, the scores reflect agreement after only an initial, very brief training phase, but nevertheless (in our opinion) give useful insights on the relative difficulty of the labeling decisions.
In our relation extraction experiments, we use label embeddings based on either gold or predicted entity labels (depending on the experimental setup) as an input to our system. Providing gold entity label information in particular constitutes a setting that is considerably easier for a relation classifier than providing no label information. Using predicted entity mentions and labels was shown to suffer from error propagation. In future work, it may be interesting to evaluate the performance of a relation extraction system that is not given label information, or that predicts entity labels jointly with relations.

Ethical Considerations
The annotators participating in our project were fully aware of the goal of the annotations and even helped design the annotation scheme. They gave explicit consent to the publication of their annotations. The main annotator was paid considerably above our country's minimum wage.

George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3):296-298.

Table 13: Hyperparameters: learning rates for NER models.

to perform best in preliminary experiments. We employ early stopping with a patience of 15 epochs for all experiments.
Our models are trained with Nvidia A100 and V100 GPUs using the PyTorch framework.

B Details on Biaffine Parser Architecture
We here describe the biaffine parser architecture used to predict relations between named entities. Taking as input the NE embeddings described in Sec. 4.4, head and dependent representations for the i-th NE are computed via two single-layer feed-forward neural networks:

h_i = FFN_head(e_i),    d_i = FFN_dep(e_i)

These representations are then fed to a biaffine classifier that maps head-dependent pairs onto logit vectors s_{i,j} whose dimensionality corresponds to the inventory of relation labels:

s_{i,j} = h_i^T U d_j + W (h_i ⊕ d_j) + b

where U is a (label x dim x dim) tensor, W a weight matrix, and b a bias vector. Using the softmax operation, these scores are transformed into probability distributions P(y_{i,j}) over relation labels:

P(y_{i,j}) = softmax(s_{i,j})

The predicted relation for a pair of named entities is the one receiving the highest probability (which may be ∅, i.e., no relation).
Token embeddings.The token embeddings e i,START and e i,END , which form part of the NE embeddings e i , are computed as a learned scalar mixture of BERT layers as described by Kondratyuk and Straka (2019).
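The scoring computation can be sketched in NumPy as follows. This is a shape-level illustration with random placeholder weights and toy dimensions (all names and sizes are ours), not the trained parser:

```python
import numpy as np

# Toy sizes: 4 NEs, embedding dim 16, hidden dim 8, 5 relation labels
# (label 0 playing the role of the no-relation label ∅).
rng = np.random.default_rng(0)
n_ents, d, h, n_labels = 4, 16, 8, 5

E = rng.normal(size=(n_ents, d))        # NE embeddings e_i
W_head = rng.normal(size=(d, h))        # single-layer FFN weights (head)
W_dep = rng.normal(size=(d, h))         # single-layer FFN weights (dependent)
U = rng.normal(size=(n_labels, h, h))   # bilinear tensor
W = rng.normal(size=(n_labels, 2 * h))  # linear term over h_i ⊕ d_j
b = rng.normal(size=n_labels)           # bias

H = np.tanh(E @ W_head)                 # head representations h_i
D = np.tanh(E @ W_dep)                  # dependent representations d_j

# s[i, j, l]: score of label l for the edge from NE i to NE j
concat = np.concatenate(
    [np.broadcast_to(H[:, None], (n_ents, n_ents, h)),
     np.broadcast_to(D[None, :], (n_ents, n_ents, h))], axis=-1)
s = (np.einsum("ih,lhk,jk->ijl", H, U, D)
     + np.einsum("lk,ijk->ijl", W, concat)
     + b)

P = np.exp(s - s.max(-1, keepdims=True))
P /= P.sum(-1, keepdims=True)           # softmax over relation labels
pred = P.argmax(-1)                     # predicted label per NE pair (may be ∅)
```

Scoring all pairs at once is what lets the parser extract every relation in a sentence in a single processing step.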

C Detailed Corpus Statistics
Table 14 shows the NE counts in MuLMS by data split.
Table 15 and Table 16 show the detailed counts for our inter-annotator agreement study.
Choice of agreement metrics for evaluating agreement on named entity annotations. The task of identifying and labeling NE mentions is a sequence labeling task; hence, κ is not applicable. Brandsen et al. (2020) provide a good explanation of why this is the case in their Section 5.1. Using the unitizing α_U is an option, but there is no standard implementation or interpretation for NE annotations in the NLP community, and it does not work for overlapping annotations (which we have in our dataset). We opted for precision and recall, which are intuitively interpretable (how many of the instances of one type marked by one annotator have also been marked by the respective other annotator?). Hripcsak and Rothschild (2005) convincingly argue (with a very simple proof) that for sequence labeling tasks such as NER, F1 actually approaches κ.

D SOFC-Exp and MSPT Corpora
In Sec. 5.4, we perform several multi-task learning (MTL) experiments with MuLMS and two additional NLP datasets in the materials science domain, SOFC-Exp (Friedrich et al., 2020) and MSPT (Mysore et al., 2019). We here describe them briefly.
There are four named entity types in the SOFC-Exp corpus: MATERIAL, which refers to mentions of materials or chemical formulas; VALUE, which denotes numerical values and their corresponding physical unit; DEVICE, which marks device types used in an experiment; and EXPERIMENT, which indicates frame-evoking words. Furthermore, there are 16 distinct slots that are modeled as relations between the experiment frame-evoking word and the corresponding slot fillers.

E Detailed Experimental Results
This appendix provides further details on our experimental results.

Figure 1: Multi-Layer Materials Science Corpus: named entity, relation and semantic role annotations.
FORM: mentions of the form or morphology of the material, e.g., thin film, gas, liquid, cubic. INSTRUMENT: mentions of devices used to perform a materials-science-related measurement, e.g., Olympus BX52 microscope.

Table 2: Measurement-independent relations annotated in MuLMS. MAT is short for MATERIAL.

Table 3: Corpus counts for named entity annotations.

Table 4: Corpus counts for measurement relations.

Table 5: Inter-annotator agreement for identifying measurement sentences: confusion matrix.


Table 8: Ternary sentence classification results for identifying measurement sentences on the test set. "Sampling" indicates the amount of OTHER sentences used for training. *Estimated on a subset of the data.

Table 9: Named entity recognition results on the test set.

Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 484-490, Melbourne, Australia. Association for Computational Linguistics.

Table 14: Label counts for named entities in MuLMS.

Table 15: Inter-annotator agreement: named entities. Precision and recall computed from relaxed matches.

Table 17: SOFC-Exp relation counts in our setup.

Table 18 lists the counts of the 14 relations of the MSPT dataset that we use in our MTL experiments.

Table 18: MSPT relation counts in our setup.

Table 23: Per-label named entity recognition results for MSPT on test in terms of F1 using single-task (ST) and multi-task MatSciBERT taggers.

Sentence classification. Table 26 depicts the results for identifying sentences containing MEASUREMENT or QUAL_MEAS annotations. NER. Table 19 and Table 20 report F1 for NER on MuLMS per label. Table 21 gives per-label scores for NER in our MTL experiments. Table 22 and Table 23 provide per-label scores for the SOFC-Exp corpus and the MSPT corpus in a single-task setting as well as in a multi-task setting with MuLMS added to the training. Relation extraction. Table 27 lists per-relation scores when using gold NEs or when using predicted NEs for relation extraction, as well as per-relation scores for the majority baseline. Table 25 shows relation extraction scores per label for both dev and test. Table 24 shows overall results for predicted entities on dev and test.

Table 24: Relation extraction results in terms of F1, predicted named entities (including standard deviation over five folds).

Table 26: Ternary sentence classification results for identifying sentences containing MEASUREMENT or QUAL_MEAS annotations vs. NONE. Human agreement is only suitable for a rough comparison because it is estimated on a subset of the data.

Table 27: Per-label scores (MuLMS test set) for relation extraction using MatSciBERT. The majority baseline is computed on gold entities.

Table 28: Per-label scores (MuLMS test set, gold entities) for multi-task relation extraction.