Extracting Fine-Grained Knowledge Graphs of Scientific Claims: Dataset and Transformer-Based Results

Recent transformer-based approaches demonstrate promising results on relational scientific information extraction. Existing datasets focus on high-level description of how research is carried out. Instead we focus on the subtleties of how experimental associations are presented by building SciClaim, a dataset of scientific claims drawn from Social and Behavior Science (SBS), PubMed, and CORD-19 papers. Our novel graph annotation schema incorporates not only coarse-grained entity spans as nodes and relations as edges between them, but also fine-grained attributes that modify entities and their relations, for a total of 12,738 labels in the corpus. By including more label types and more than twice the label density of previous datasets, SciClaim captures causal, comparative, predictive, statistical, and proportional associations over experimental variables along with their qualifications, subtypes, and evidence. We extend work in transformer-based joint entity and relation extraction to effectively infer our schema, showing the promise of fine-grained knowledge graphs in scientific claims and beyond.


Introduction
Using relations as edges to connect nodes consisting of extracted entity mention spans produces expressive and unambiguous knowledge graphs from unstructured text. This approach has been applied to diverse domains from moral reasoning in social media (Friedman et al., 2021b) to qualitative structure in ethnographic texts (Friedman et al., 2021a), and is particularly useful for reasoning about scientific claims, where several experimental variables in a sentence may have differing relations. Scientific information extraction datasets such as SciERC (Luan et al., 2018) use relations for labeling general scientific language. Utilizing the advances of SciBERT (Beltagy et al., 2019) in scientific language modeling, SpERT (Eberts and Ulges, 2020), a transformer-based joint entity and relation extraction model, advanced the state of the art on SciERC.

[Figure 1: Claim graph for the input sentence "Levels of social support for medical staff were significantly associated with self-efficacy and sleep quality and negatively associated with the degree of anxiety and stress."]
To extend relational scientific information extraction to specifically target scientific claims, we annotate SciClaim, 1 a dataset of 12,738 annotations on 901 sentences from expert identified claims in Social and Behavior Science (SBS) papers (Alipourfard et al., 2021), detected causal language in PubMed papers (Yu et al., 2019), and claims and causal language heuristically identified from CORD-19 abstracts (Wang et al., 2020).
For annotation, we developed a novel graph schema that reifies claimed associations as entity spans with fine-grained attributes and extracts factors as additional entities connected with relations to one or more associations in which they are involved. In Figure 1, two association entities relate two pairs of dependent factors to an independent factor, while attributes and additional relations delimit the scope and qualitative proportionalities of the claim. Inspired by semantic role labeling, attributes modify associations and the roles of their arguments, allowing us to represent claims of causal, comparative, predictive, statistical, and proportional associations along with their qualifications, subtypes, and evidence.

[Figure 2: For the input "We predicted that the subliminal prime would, under specifiable conditions, increase the accessibility of the pertinent negative outcome and thereby increase its perceived likelihood of occurrence," this SciClaim graph captures the chaining together of associations and uncovers a mediating factor in the qualitative proportionality (q+) between the "subliminal prime" and "perceived likelihood of occurrence."]
We adapt SpERT to model this additional multilabel attribute task and demonstrate that extraction of our highly expressive knowledge graphs is within reach of present methods.

Related Work
Many previous datasets for relational scientific information extraction, such as SemEval 2017 task 10 and 2018 task 7, SciERC, and SciREX (Augenstein et al., 2017; Gábor et al., 2018; Luan et al., 2018; Jain et al., 2020), have annotated corpora from NLP, computer science, or similar engineering-oriented fields. As such, their annotation schemas have emphasized the description of how research was carried out, extracting categories of entities such as methods, tasks, metrics, and datasets as well as relations mostly describing their intrinsic properties such as uses, composition, and hyponymy. Two of these datasets (Luan et al., 2018; Gábor et al., 2018) contain associative relations that directly link entities being compared or producing a result. Our work extends further in this direction by examining not only which entities are associated, but also how the presentation of the associations is nuanced by the assertion of fine-grained attributes such as causality or proportionality.
SciClaim provides the largest number of fine-grained label types among comparable datasets, and Table 1 shows that it has the highest label density per word. SciClaim also contains 81.88% as many total labels as SciERC and more total labels than SemEval 2017 task 10 and 2018 task 7. On the other hand, SciREX utilizes distant supervision from an existing knowledge base and noisy automatic labeling trained on SciERC to provide an order of magnitude more labels and annotate complete documents. This is one example of how smaller yet more densely and directly labeled datasets like SciERC and SciClaim can enable and complement larger, higher-level corpora.
Meanwhile, our dataset also focuses on scientific claims. Some previous work identifies claims within scientific texts (Wadden et al., 2020;Gelman et al., 2021), but does not extract the relations and factors within the claims themselves. Other recent symbolic semantic NLP systems do model relational representations of scientific claims (e.g., Friedman et al., 2017), but these approaches rely on rule-based engines with hand tuning, which require NLP experts to maintain and adapt to new domains. Instead we modify SpERT (Eberts and Ulges, 2020), a transformer-based method that has been shown to effectively extract relational scientific information on SciERC (Luan et al., 2018). We extend this model to accommodate our additional multi-label attributes and apply it to our claim graph extraction task.

Problem Definitions
SciClaim defines the multi-attribute knowledge graph extraction task as follows: for a sentence S of n tokens s_1, ..., s_n, and sets of entity types T_e, attribute types T_a, and relation types T_r, predict the set of entities E of triples (s_j, s_k, t), each ranging from token s_j to s_k with 1 ≤ j ≤ k ≤ n and t ∈ T_e; the set of relations R of triples (e_head, e_tail, t) over entities e_head, e_tail ∈ E with e_head ≠ e_tail and t ∈ T_r; and the set of attributes A of pairs (e, t) with e ∈ E and t ∈ T_a. This defines a directed multi-graph without self-cycles, where each unique span can be represented by at most one entity node with zero to |T_a| attributes.

Table 1: Our SciClaim dataset contains the highest label densities per word and comparable label counts to other scientific information extraction datasets except SciREX, which uses distant supervision and noisy automatic labeling. Our dataset contains fine-grained attributes as additional labels, while SciERC contains coreference links.
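The task output above can be sketched as simple data structures with the structural constraints made explicit. This is an illustrative sketch only; the names here are not from the released SciClaim code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    start: int   # token index s_j (inclusive, 1-indexed)
    end: int     # token index s_k (inclusive)
    etype: str   # one of T_e, e.g. "factor" or "association"

@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    rtype: str   # one of T_r, e.g. "arg0" or "q+"

def valid_graph(entities, relations, attributes, n):
    """Check the constraints from the task definition: span bounds,
    no self-cycles, unique spans, and attributes only on known entities."""
    spans_ok = all(1 <= e.start <= e.end <= n for e in entities)
    # directed multi-graph without self-cycles over declared entities
    rels_ok = all(r.head != r.tail and r.head in entities and r.tail in entities
                  for r in relations)
    # each unique span is represented by at most one entity node
    unique_spans = len({(e.start, e.end) for e in entities}) == len(entities)
    attrs_ok = all(e in entities for e, _ in attributes)
    return spans_ok and rels_ok and unique_spans and attrs_ok
```

A well-formed graph passes the check, while a self-cycle (a relation whose head and tail are the same entity) fails it.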

Dataset Construction
To create SciClaim, two NLP researchers annotated 901 sentences from several sources: 336 from papers in Social and Behavior Science (SBS) with expert identified claims (Alipourfard et al., 2021), 411 filtered for causal language in PubMed papers (Yu et al., 2019), 135 containing claims and causal language identified from CORD-19 abstracts (Wang et al., 2020) with heuristic keywords, and 19 manual perturbations included only in training data.
To aid in the labeling of these densely annotated sentences, we iteratively trained on already collected data and utilized the predictions of the partially trained model on new training examples as suggestions in our labeling interface. We disabled these model suggestions on our 100 example test set to ensure that this did not bias our evaluation.
Due to the dense and potentially overlapping span annotations, small decisions about which tokens to include in a span frequently influence the span boundaries of several other entities in a sentence. However, most of these decisions have negligible impact on the meaningfulness of the annotation (e.g., the decision to include a determiner in a span), rendering exact-match agreement ineffective. Instead, to promote consistency and domain relevance, we employed iterative schema design sessions in consultation with a subject matter expert in reproducibility of SBS experiments and a process of consensus, schema re-development, and re-annotation on 250 examples where annotators overlapped. Table 1 contrasts SciClaim's label counts and density with the other relational scientific information extraction datasets discussed in Section 2, and precise counts for each label type are provided in Table 3. Further details are in Appendix A.

Graph Schema
The SciClaim graph schema is designed to capture associations between factors (e.g., causation, comparison, prediction, statistics, proportionality), monotonicity constraints across factors, epistemic status, subtypes, and high-level qualifiers.
[Figure 3: Claim graph for the input sentence "Compared to control group, the isolated species from T2DM group had higher proteinase activity."]

Entities are labeled text spans. The same exact span cannot correspond to more than one entity type, but two entity spans can overlap. Entities comprise the nodes of SciClaim graphs upon which attributes and relations are asserted. Our schema includes six entity types: Factors are variables that are tested or asserted within a claim (e.g., "sleep quality" in Figure 1). Associations are explicit phrases associating one or more factors (e.g., "higher" in Figure 3). Magnitudes are modifiers of an association indicating its likelihood, strength, or direction (e.g., "significantly" and "negatively" in Figure 1). Evidence is an explicit mention of a study, theory, or methodology supporting an association (e.g., "our SEIR model"). Epistemics express the belief status of an association, often indicating whether something is hypothesized, assumed, or observed (e.g., "predicted" in Figure 2). Qualifiers constrain the applicability or scope of an assertion (e.g., "for medical staff" in Figure 1).
Attributes are multi-label fine-grained annotations (visualized in parentheses), where zero or more may apply to any given entity. Our schema includes the following attributes, all of which apply solely to Association entities: Causation expresses cause-and-effect over its constituent factors (e.g., both "increase" spans in Figure 2). Correlation expresses interdependence over its constituent factors (e.g., both "associated with" spans in Figure 1). Comparison expresses an association with a frame of reference (as in the "higher" statement of Figure 3). Sign+ and Sign- express high/low or increased/decreased factor value (e.g., "correlates more closely with" or "shortened", respectively). Test expresses statistical measurements (e.g., "ANOVA"). Indicates expresses a predictive relationship (e.g., "prognostic factors for").
Relations are directed edges between labeled entities in SciClaim graphs. They are critical for expressing what-goes-with-what over the set of entities. Note that in Figures 1 and 2 the unlabeled arrows are all modifier relations, left blank to avoid clutter. We encode six relations: arg0 relates an association to its cause, antecedent, subject, or independent variable (e.g., "levels of social support" in Figure 1). arg1 relates an association to its result or dependent variable (e.g., "self-efficacy" and "stress" in Figure 1). comp_to is an explicit frame of reference in a comparative association (e.g., "control group" in Figure 3). subtype relates a head entity to a subtype tail (e.g., "stillbirth" as a subtype of "pregnancy outcome"). modifier relates associations to qualifiers, magnitudes, epistemics, and evidence (e.g., all unlabeled arrows in Figure 1 and Figure 2). q+ and q- indicate positive and negative qualitative proportionality, where increasing the head factor increases or decreases the tail factor, respectively (e.g., "levels of social support" is q+ to "sleep quality" and q- to "stress" in Figure 1).
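Putting the entity, attribute, and relation types together, the Figure 3 comparison could be serialized roughly as follows. The field names and the exact attribute assignments are hypothetical, illustrating the schema rather than reproducing the released annotation format.

```python
# Hypothetical serialization of the Figure 3 claim graph for:
# "Compared to control group, the isolated species from T2DM group
#  had higher proteinase activity."
graph = {
    "entities": [
        {"id": "e0", "span": "control group", "type": "factor"},
        {"id": "e1", "span": "the isolated species from T2DM group",
         "type": "factor"},
        {"id": "e2", "span": "higher", "type": "association",
         "attributes": ["comparison", "sign+"]},  # illustrative labels
        {"id": "e3", "span": "proteinase activity", "type": "factor"},
    ],
    "relations": [
        {"head": "e2", "tail": "e1", "type": "arg0"},     # subject
        {"head": "e2", "tail": "e3", "type": "arg1"},     # dependent variable
        {"head": "e2", "tail": "e0", "type": "comp_to"},  # frame of reference
    ],
}

# Every relation endpoint must name a declared entity.
entity_ids = {e["id"] for e in graph["entities"]}
assert all(r["head"] in entity_ids and r["tail"] in entity_ids
           for r in graph["relations"])
```

The reified association span "higher" is the hub of the graph: the factors attach to it via arg0/arg1/comp_to rather than linking to each other directly.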

Model Architecture
In order to model the additional multi-label task in SciClaim, we extend SpERT (Eberts and Ulges, 2020) with an attribute classifier. SpERT provides components (Figure 4 a-c) for joint entity and relation extraction and permits the overlapping spans in our data. These classifiers utilize a span representation that combines the SciBERT token embeddings h_t of a candidate span via learned attention weights α_t. These attention weights are used to make a span representation ĥ_i with the following weighted sum:

ĥ_i = Σ_{t=j..k} α_t · h_t

We use the same cascaded inference strategy and input the span representations of identified entities x_a to an attribute classifier (Figure 4 d) with weights W_a and bias b_a. A pointwise sigmoid σ yields separate confidence scores ŷ_a for each attribute:

ŷ_a = σ(W_a · x_a + b_a)

We train the attribute classifier with a binary cross-entropy loss L_a summed with the SpERT entity and relation losses, L_e and L_r, for a joint loss:

L = L_e + L_r + L_a
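The attribute head is a single linear layer with pointwise sigmoids and a confidence threshold. A minimal NumPy sketch of that scoring and loss (the actual model is PyTorch; function names here are illustrative, and the 0.55 threshold is the attribute filter threshold from the hyperparameters in Appendix B.2):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attribute_scores(x_a, W_a, b_a):
    """Pointwise-sigmoid confidence scores y_a, one per attribute type."""
    return sigmoid(W_a @ x_a + b_a)

def predict_attributes(x_a, W_a, b_a, attr_names, threshold=0.55):
    """Multi-label prediction: keep every attribute above the filter
    threshold; zero or more attributes may fire for one entity span."""
    scores = attribute_scores(x_a, W_a, b_a)
    return [name for name, s in zip(attr_names, scores) if s > threshold]

def bce_loss(scores, targets):
    """Binary cross-entropy summed over attribute types (the L_a term)."""
    eps = 1e-9
    return float(-np.sum(targets * np.log(scores + eps)
                         + (1.0 - targets) * np.log(1.0 - scores + eps)))
```

Because each attribute gets an independent sigmoid rather than a shared softmax, a span can carry, say, both Comparison and Sign+ at once.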

Evaluation
In Table 2 we report micro performance metrics on the SciClaim test set averaged over 5 runs.
In addition to the modified SpERT (detailed in Section 3.4), we also test a variant, attrs-as-ents, where all attribute labels on an entity span are collapsed into a single combined annotation, allowing unmodified SpERT to process attributes. Precisely, we collapse the T_e entity types with all combinations of T_a attribute types into the multi-class entity labels {T_e × (T_a choose k) : 0 ≤ k ≤ |T_a|}. We hypothesized that the combinatorially larger number of labels required by attrs-as-ents would lower performance on rarely occurring combinations. Surprisingly, the variants get almost identical results, suggesting that, at least for our data, a single-layer classifier can infer the attributes of a span simultaneously just as well as doing so independently. We tested other model variants that also produced changes of ∼1% F1 and are thus relegated to Appendix B.
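The label explosion under attrs-as-ents can be counted directly. A sketch, assuming attributes combine freely: crossing one attribute-bearing entity type with every subset of the seven association attributes yields 2^7 = 128 collapsed labels, versus one entity label plus seven independent binary decisions in the multi-label formulation.

```python
from itertools import combinations

def collapsed_labels(entity_types, attribute_types):
    """Cross each entity type with every subset (size 0..|T_a|) of
    attribute types, as in the attrs-as-ents variant."""
    labels = []
    for etype in entity_types:
        for k in range(len(attribute_types) + 1):
            for combo in combinations(sorted(attribute_types), k):
                labels.append((etype, combo))
    return labels

# 1 entity type x 2^7 attribute subsets = 128 collapsed entity labels
labels = collapsed_labels(
    ["association"],
    ["causation", "correlation", "comparison",
     "sign+", "sign-", "test", "indicates"])
```

Most of these 128 combinations occur rarely or never in a corpus of this size, which motivated the hypothesis that the collapsed variant would underperform.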
To our knowledge, no previous model exists that can run directly on all three tasks in our dataset, due to the presence of both overlapped entity spans and multi-label attributes. For comparison we include SpERT's state-of-the-art performance on SciERC, the dataset closest to ours in terms of label density. The high performance of our adapted SpERT on SciClaim demonstrates the practicality of extracting our novel graph schema with present methods despite its fine-grained approach.
The per-class evaluations for our main model are reported in Table 3. With few exceptions performance is good, and generally follows support for the label in the dataset. The Causation attribute metrics may be influenced by noise from anomalously low representation in the test set (only 5 instances compared to 59 instances of Correlation). Likewise the Test attribute unfortunately does not appear in the test set at all, but receives validation F1 of 95.95% despite only appearing 25 times in the corpus. Another outlier, the subtype relation, is particularly challenging, especially with its low rate of occurrence, due to it being one of the few relation types occurring directly between factors rather than mediated through a reified association span. The q+/q- relations are likewise expressed as direct links between factors. Although these require complex reasoning about the qualitative proportionalities of factors (e.g., Figure 2), they nonetheless receive promising results. The attributes Sign+/Sign- serve a similar role and provide partial redundancy for q+/q- labels, allowing downstream reasoning to back off to these less precise, more robust attributes when the qualitative proportionalities are not extracted.

Conclusion
Previous scientific information extraction crafts useful high-level representations of papers, going as far as document-level relations spanning thousands of words in Jain et al. (2020). Complementary to these efforts, we propose fine-grained and densely annotated scientific information extraction that captures not just what is said but how it is presented and argued. SciClaim applies this approach to associative claims and demonstrates that existing models such as SpERT (Eberts and Ulges, 2020) can be modified to accurately extract fine-grained knowledge graphs ripe for downstream reasoning.

A Dataset Details

CORD-19 (Wang et al., 2020) is sampled with the following keywords as a heuristic identification of claims and causal language similar to our expert-identified data from PubMed and Social and Behavioral Science (SBS) papers: associated with, reduce, increase, leads to, led to, our result, greater, less, more, cause, demonstrate, show, improve.
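This keyword heuristic amounts to a simple substring filter. A sketch using the keyword list above (the exact matching rules used to sample CORD-19, e.g. tokenization or case handling, are not specified, so this is an assumption):

```python
# Keywords from the heuristic claim/causal-language filter.
CLAIM_KEYWORDS = [
    "associated with", "reduce", "increase", "leads to", "led to",
    "our result", "greater", "less", "more", "cause",
    "demonstrate", "show", "improve",
]

def looks_like_claim(sentence: str) -> bool:
    """Heuristically flag sentences that contain claim/causal keywords."""
    lowered = sentence.lower()
    return any(keyword in lowered for keyword in CLAIM_KEYWORDS)
```

Substring matching like this is deliberately high-recall: "reduced" matches "reduce", for instance, at the cost of occasional false positives.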
200 of our sentences (100 from PubMed and 100 from SBS) were selected to intentionally minimize the likelihood of claims and causal language. This includes sentences that discuss factors and other entities present in our schema but either do not contain associations or frame associations in unusual ways, such as rhetorical questions. We intend for this data to encourage robustness that maintains correct labels for partial graph extractions rather than simply hallucinating associations in all sentences. We employ the following heuristics to identify this data: We sample 100 PubMed sentences from Yu et al. (2019) that are identified as having low causal content. We sample 50 titles from SBS papers present in Alipourfard et al. (2021), as titles contain factors but rarely contain explicit associations and may appear in input data automatically extracted from PDFs. Finally we sample 50 first lines of SBS papers from Alipourfard et al. (2021), as these lines frequently introduce topics or rhetorical questions which either lack associations or present highly hypothetical associations unlike those in our main corpus.
Each filtered data source was sampled chronologically. We utilized the following procedure for labeling: The annotators undertook extensive, iterative schema design sessions in consultation with a subject matter expert in reproducibility of SBS experiments. After the schema was settled on pilot examples, a lead annotator established the annotation standards on several hundred examples through a process of relabeling and retraining the suggestion model. Once the suggestion model became effective, the lead annotator and model suggestions guided the other annotator in adopting the annotation standards. The lead annotator reviewed and corrected the 250 overlapping examples in a consensus process with the other annotator.

B.1 Variants
We experiment with several variants, none of which substantially outperformed the others. SpERT-modified-maxpool contains our modifications but simply uses SpERT's original max-pooling span representation instead of the attention-based representations inspired by Lee et al. (2017). SpERT-modified-unfiltered forgoes cascaded inference and instead classifies all possible spans for attributes. Full test and averaged validation results are presented in Table 4.

B.2 Hyperparameters
The following hyperparameters and settings were selected using manual tuning of 10-fold cross-validation on the training set and optimizing for average micro-F1 performance over all 3 tasks: language model SciBERT uncased, mini batch size 8, epochs 40, optimizer AdamW, linear scheduling, warm up 0.05, learning rate 5e-5, learning rate warm up 0.1, weight decay 0.01, max grad norm 1.0, size embedding dimension 25, dropout probability 0.1, maximum span size 20, attribute filter threshold 0.55, relation filter threshold 0.4.
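For reference, the settings above can be collected into a single config mapping. The key names below are illustrative (SpERT's actual config files use their own keys); the values are taken verbatim from the list above, which gives two warm-up figures, so both are reproduced.

```python
# Hyperparameters as reported; key names are illustrative, not SpERT's own.
CONFIG = {
    "language_model": "scibert-uncased",
    "batch_size": 8,
    "epochs": 40,
    "optimizer": "AdamW",
    "scheduler": "linear",
    "warmup": 0.05,       # "warm up 0.05" in the text
    "lr": 5e-5,
    "lr_warmup": 0.1,     # "learning rate warm up 0.1" in the text
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "size_embedding_dim": 25,
    "dropout": 0.1,
    "max_span_size": 20,
    "attribute_filter_threshold": 0.55,
    "relation_filter_threshold": 0.4,
}
```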