Counts@IITK at SemEval-2021 Task 8: SciBERT Based Entity And Semantic Relation Extraction For Scientific Data

This paper presents our system for SemEval 2021 Task 8 (MeasEval). MeasEval is a novel span extraction, classification, and relation extraction task focused on finding quantities, attributes of these quantities, and additional information, including the related measured entities, properties, and measurement contexts. Our submitted system, which placed fifth (team rank) on the leaderboard, consisted of SciBERT with [CLS] token embedding and a CRF layer on top. We were also placed first in the Quantity (tied) and Unit subtasks, second in the MeasuredEntity, Modifier, and Qualifies subtasks, and third in the Qualifier subtask.


Introduction
SemEval 2021 Task 8 (Harper et al. 2021) is a task for extracting entities, and the semantic relations between them, from a corpus of scientific articles spanning several domains. Instead of just identifying quantities, the task places more weight on parsing and extracting important semantic relations among the extracted entities. This is challenging because the texts are ambiguous and inconsistent, and extraction relies heavily on implicit knowledge. The results of this task can also be used for extractive scientific data summarization.
Given a scientific text, the task is to identify the spans of quantities, units, and other attributes of those quantities, as well as the related measured entities, properties, and qualifiers, if any. The organizers divided the task into five subtasks, and submissions are evaluated against all five. We treat subtask 1 as an entity extraction task, and subtasks 3, 4, and 5 as relation extraction tasks. After extracting the quantities, the other attributes (MeasuredEntity, Property, and Qualifier) related to those quantities need to be predicted. The directed graph in Figure 1 gives an overview of our proposed approach. The set of incoming edges into each node represents the input to the trained model (represented by the node), and the label at each node represents the prediction made by that model. The task data is extracted from CC-BY ScienceDirect articles and made available by Elsevier Labs via the OA-STM-Corpus. This motivated the use of the SciBERT (Beltagy et al. 2019) model for the various subtasks. "SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks". Our final submitted system consisted of SciBERT with [CLS] token embedding and a CRF layer on top, and it achieved an overall F1-overlap score of 0.432. We were ranked fifth on the global leaderboard. The top performance on the leaderboard achieved an overall F1-overlap of 0.519. The implementation of our system is available via GitHub.
The rest of this paper is arranged as follows. Section 2 introduces previous work in this field and describes the organizers' dataset. Section 3 explains our overall approach. Section 4 contains the experimental setup for training the models. We conclude with an analysis of our model's performance in Section 5 and concluding remarks in Section 6.
Background

Related work

Magge et al. 2018 attempted to recognize clinical entities using an LSTM-CRF-based architecture. The authors used word- and character-level embeddings obtained from word2vec (Mikolov et al. 2013). For relation extraction between these entities, they built a binary classifier using a random forest. This approach has high time complexity, as it checks every possible relationship that could exist and classifies each of them. More recent work in entity extraction is by Lee et al. 2019, who fine-tuned the BERT model on biomedical data and showed state-of-the-art performance. Other work in entity extraction includes Taher et al. 2020, who fine-tuned BERT followed by a fully connected layer and a CRF layer.
The work by Wu and He 2019 on relation extraction uses BERT to identify the types of relations between pairs of entities in a given text. The system does not automatically recognize the entities between which a relation exists; rather, the entities of interest must be specified manually.

Task setup
The scientific articles in the training and test corpora are from the following sub-domains: Astronomy, Engineering, Medicine, Materials Science, Biology, Chemistry, Agriculture, Earth Science, and Computer Science. These articles were manually annotated. Inter-annotator agreement was calculated using Krippendorff's alpha IAA score (Table 1). The training dataset comprised 298 paragraphs containing 1164 quantities, 1148 measured entities, 742 measured properties, and 309 qualifiers. The evaluation set included 135 paragraphs.

System overview

Pre-processing
Since we use the SciBERT model, at most 512 tokens can be passed as input to the model. Therefore, we used SciSpaCy (Neumann et al. 2019) to split each paragraph into sentences, and these sentences were passed individually as input to the SciBERT model.

Subtask 1 (Quantity Extraction)
Input sentences were tokenized using the SciBERT tokenizer from the HuggingFace (Wolf et al. 2020) implementation. The Quantity spans were transformed into BIO/IOB format (Ramshaw and Marcus 1995) and used as the true labels for training the model.
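The span-to-BIO conversion can be sketched as follows (a minimal illustration using whitespace tokenization; the actual system operates on SciBERT subword tokens, and the helper name is ours):

```python
def spans_to_bio(tokens, span_ranges):
    """Convert character-level entity spans into per-token BIO labels.

    tokens: list of (text, start, end) triples;
    span_ranges: list of (start, end) character offsets of Quantity spans.
    """
    labels = ["O"] * len(tokens)
    for s, e in span_ranges:
        inside = False
        for i, (_, ts, te) in enumerate(tokens):
            if ts >= s and te <= e:          # token lies inside the span
                labels[i] = "I" if inside else "B"
                inside = True
    return labels

sent = "The sample was heated to 300 K for one hour ."
toks, pos = [], 0
for w in sent.split():
    start = sent.index(w, pos)
    toks.append((w, start, start + len(w)))
    pos = start + len(w)

# "300 K" is the Quantity span
print(spans_to_bio(toks, [(sent.index("300"), sent.index("K") + 1)]))
# ['O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O']
```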
The tokenized sentence is passed through SciBERT, and the tanh activation function is applied over its final hidden states, i.e.,

H_i = tanh(h_i), 1 <= i <= len

Here h_i is the final hidden state corresponding to token i and len is the maximum length of the tokenized sentence. The [CLS] token is processed similarly to obtain H_cls.
Finally, we obtain the representation for each token by concatenating H_cls and H_i, and this is used for prediction via softmax:

p_i = softmax(W [H_cls ; H_i] + b)

The matrix W has dimension R^(t x 2d), where d is the hidden state size of BERT and t is the number of tags (t = 3 in our case, since we use BIO encoding). A CRF (Conditional Random Field) (Lafferty et al. 2001) is a probabilistic model that captures structural dependencies among the BIO tags. The tag probability vectors p are passed through the CRF layer to generate the most probable output sequence. We trained the model using the CRF loss and the Adam optimizer. The overall architecture of the model is shown in Figure 2. The tuned hyper-parameters are reported in appendix A.1.
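The per-token tag probabilities described above can be sketched numerically as follows (a minimal numpy illustration with toy dimensions and random placeholder inputs; variable names are ours, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, L = 8, 3, 5                      # hidden size, number of BIO tags, sentence length
W = rng.standard_normal((t, 2 * d))    # W in R^(t x 2d)
b = rng.standard_normal(t)

h = rng.standard_normal((L, d))        # toy stand-ins for SciBERT token hidden states
h_cls = rng.standard_normal(d)         # toy stand-in for the [CLS] hidden state

H = np.tanh(h)                         # H_i = tanh(h_i)
H_cls = np.tanh(h_cls)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# p_i = softmax(W [H_cls ; H_i] + b): concatenate [CLS] with each token state
feats = np.concatenate([np.tile(H_cls, (L, 1)), H], axis=1)  # shape (L, 2d)
p = softmax(feats @ W.T + b)                                  # shape (L, t)
print(p.shape)  # (5, 3)
```

In the full model these per-token distributions are not used directly; they serve as emission scores for the CRF layer, which decodes the most probable tag sequence.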

Subtask 2 (Unit Detection)
The Quantity phrases are tokenized using the Spacy (Honnibal et al. 2020) character-based tokenizer. The true label for training is a binary vector with ones at the indices of the characters inside the unit's span of the Quantity phrase.
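The character-level label format can be illustrated as follows (a minimal sketch; the helper name is ours):

```python
def unit_label_vector(quantity_phrase, unit_span):
    """Binary vector with 1 at the indices of characters inside the unit span."""
    start, end = unit_span
    return [1 if start <= i < end else 0 for i in range(len(quantity_phrase))]

phrase = "300 K"
vec = unit_label_vector(phrase, (4, 5))   # "K" is the unit
print(vec)  # [0, 0, 0, 0, 1]
```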
We trained a character-based Bi-LSTM (Hochreiter and Schmidhuber 1997) model with trainable embeddings using BCE (Binary Cross Entropy) loss and the Adam optimizer. The model architecture and tuned hyper-parameters are reported in appendix A.2.

Subtask 2 (Modifier Classification)
We formulated this subtask as a multi-label classification problem with 12 labels (HasTolerance, IsApproximate, IsCount, IsList, IsMean, IsMeanHasSD, IsMeanHasTolerance, IsMeanIsRange, IsMedian, IsRange, IsRangeHasTolerance, None). To enable the BERT module to capture the location of a quantity, we insert the special symbol "$" at the beginning and end of the Quantity span. If there are multiple Quantities in a sentence, multiple copies of the same sentence are generated, with "$" at different positions. Suppose H_i to H_j are the final hidden state vectors for the Quantity span. Then the average operation is applied to obtain the vector representation of the Quantity. The averaged output is passed through a fully connected layer followed by a softmax activation.
The matrix W has dimension R^(l x d), where l represents the number of classification labels (l = 12 in our case) and d is the hidden state size of BERT.
The above model was trained using BCE (Binary Cross Entropy) loss and the Adam optimizer. The threshold value for prediction was determined using cross-validation. The model architecture and tuned hyper-parameters are reported in appendix A.3.
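The "$"-marking step described above can be sketched as follows (an illustrative helper under our naming, not the exact implementation):

```python
def mark_span(sentence, span, marker="$"):
    """Enclose a character span in marker symbols so BERT can locate it."""
    s, e = span
    return sentence[:s] + marker + sentence[s:e] + marker + sentence[e:]

sent = "The sample was heated to 300 K."
q = (sent.index("300"), sent.index("K") + 1)
print(mark_span(sent, q))  # The sample was heated to $300 K$.
```

When a sentence contains several Quantities, the same helper is applied once per Quantity, producing one marked copy of the sentence for each.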

Subtasks 3 and 5 (MeasuredEntity and HasQuantity Extraction)
As in the previous subtask, to capture the location we insert the special symbol "$" at the beginning and end of the Quantity span. The modified sentences are tokenized using the SciBERT tokenizer. The span of the MeasuredEntity related to the Quantity enclosed in "$" is transformed into BIO/IOB format and used as the true label for training the model. The formatted data is used to train a model similar to the Quantity Extraction model (SciBERT + CRF). This model extracts the MeasuredEntity associated with the Quantity enclosed in "$"; thus, it predicts both the MeasuredEntity and the HasQuantity relationship of the predicted MeasuredEntity.

Subtasks 3 and 5 (MeasuredProperty and HasProperty Extraction)
To extract the MeasuredProperty and the HasProperty relationship, we used an approach similar to that for MeasuredEntity and HasQuantity. We enclosed the Quantity span in "$" symbols and the MeasuredEntity span in "#" symbols. The modified sentences are passed through the SciBERT tokenizer. If the model predicts a MeasuredProperty's span, then the HasQuantity relation is updated to point to the MeasuredProperty, and the HasProperty relation is added to the MeasuredEntity.

Subtasks 4 and 5 (Qualifier and Qualifies Extraction)
To extract the Qualifier span and the Qualifies relation, two separate models similar to the Quantity Extraction model (SciBERT + CRF) were trained. While training the first model, we inserted "$" at the beginning and end of the Quantity span, under the assumption that the Qualifier qualifies the Quantity. While training the second model, we enclosed the MeasuredProperty span in "$", under the assumption that the Qualifier qualifies the MeasuredProperty.

Post Processing
Once the predictions from all the models are available, we need to transform the predicted BIO/IOB output back into entity spans. We first map each token to its character span in the tokenized sentence and use this mapping to determine the predicted entity's span. If the model predicts multiple MeasuredEntity, MeasuredProperty, or Qualifier spans, we keep the one closest to the Quantity span. Finally, we convert each extracted entity's sentence-level span to a paragraph-level span.
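The BIO-to-span decoding and nearest-entity selection can be sketched as follows (a minimal illustration assuming per-token character offsets are available; helper names and the distance heuristic shown are ours):

```python
def bio_to_spans(tokens, labels):
    """Collect (start, end) character spans from per-token BIO labels.

    tokens: list of (text, start, end) triples; labels: parallel B/I/O tags.
    """
    spans, cur = [], None
    for (_, ts, te), lab in zip(tokens, labels):
        if lab == "B":                     # a new entity starts here
            if cur is not None:
                spans.append(tuple(cur))
            cur = [ts, te]
        elif lab == "I" and cur is not None:
            cur[1] = te                    # extend the current entity
        else:                              # "O" (or a stray "I") closes any entity
            if cur is not None:
                spans.append(tuple(cur))
            cur = None
    if cur is not None:
        spans.append(tuple(cur))
    return spans

def closest_to_quantity(spans, q_span):
    """Keep the predicted entity whose midpoint is nearest the Quantity span."""
    q_mid = (q_span[0] + q_span[1]) / 2
    return min(spans, key=lambda s: abs((s[0] + s[1]) / 2 - q_mid))

# "temp and pressure rose to 300K": two candidate entities, one Quantity
toks = [("temp", 0, 4), ("and", 5, 8), ("pressure", 9, 17),
        ("rose", 18, 22), ("to", 23, 25), ("300K", 26, 30)]
spans = bio_to_spans(toks, ["B", "O", "B", "O", "O", "O"])
print(closest_to_quantity(spans, (26, 30)))  # (9, 17), i.e. "pressure"
```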

Experimental Setup
The dataset is split into two parts, a train set and a dev set, in a 90:10 ratio. The models were trained on the train set and validated on the dev set. The environment and packages used for training and pre-processing are listed in appendix B.

Evaluation Metrics
The official metrics used by the SemEval organizers are F1-measure, F1-overlap, and Exact Match. Exact Match is a binary value of 0 or 1, while F1-measure is a token-level overlap ratio of the submission to the true spans, where tokenization is done using simple whitespace delimiters. F1-overlap is a SQuAD-style (Rajpurkar et al. 2016) overlap score based on F1-measure, which penalizes negative submissions more strictly. The final evaluation is based on a global F1-overlap score averaged across all subtasks.
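The token-level F1-measure between a predicted and a gold span can be sketched as follows (whitespace tokenization as described above; this is our own simplified helper, not the official scorer):

```python
def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and a gold span text."""
    p, g = pred.split(), gold.split()
    g_left = list(g)
    common = 0
    for tok in p:                 # count overlapping tokens (with multiplicity)
        if tok in g_left:
            g_left.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the measured temperature", "measured temperature"))  # 0.8
```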

Model Variants Used
We tried various models, such as BERT-Base, BERT-Medium (Devlin et al. 2018), SciBERT, and BioBERT (Lee et al. 2019). We could not try BERT-Large due to computational limitations. The results for the top two models are shown in Table 2.
We also experimented with Bi-LSTM layers on top of BERT, but the model overfit due to its high complexity. Consequently, it was not included in the final model.

Results on evaluation set
The results achieved on the evaluation set for each subtask are shown in Table 2, and the overall results are shown in Table 3. Figure 3 presents the results across the various sub-domains. The small difference between the Exact Match score and the F1-overlap score shows that, whenever the spans predicted by our model matched the gold data, they were almost always exact matches.
We achieved an overall fifth rank (among 19 participating teams) in the competition. We were also placed first in the Quantity (tied) and Unit subtasks, second in the MeasuredEntity, Modifier, and Qualifies subtasks, and third in the Qualifier subtask.

Error Analysis
The relation extraction subtask was challenging because associating entities with the quantities they relate to is context-dependent and based on one's understanding. This is also evident from the IAA scores reported for the training data, which show that even human annotators achieve only moderate agreement.
Some of the aspects where our model did not work well are:
1. Our model looks for relations only within a sentence, which causes problems when a relation crosses sentence boundaries.
2. There is some loss in reconstructing the TSV files from the extracted entities, because neighboring data may or may not be part of the same entity group.
3. Our model did not work as well on MeasuredProperty and Qualifier as on the other subtasks, which is evident from the fact that we achieved only 0.53 and 0.35 F1-overlap on the training data for these two subtasks.

Conclusion
This paper proposed the SciBERT + CRF Model (SciBERT with [CLS] token embedding and a CRF layer on top) for span extraction, classification, and semantic relation extraction. Our model shows a significant improvement in performance over the baseline model and works equally well across all the scientific sub-domains. In the future, we plan to explore various other pre-trained contextual models for our approach.

Appendix A.2

In this section we provide the model architecture (Figure 4) and hyper-parameter values (Table 5) we used for training our final unit extraction model, to facilitate reproducibility of our results.

Appendix A.3

In this section we provide the model architecture (Figure 5) and hyper-parameter values (