KGP at SemEval-2021 Task 8: Leveraging Multi-Staged Language Models for Extracting Measurements, their Attributes and Relations

SemEval-2021 Task 8: MeasEval aims at improving the machine understanding of measurements in scientific texts through a set of entity and semantic relation extraction sub-tasks for identifying quantity spans along with various attributes and relationships. This paper describes our system, consisting of a three-stage pipeline, that leverages pre-trained language models to extract the quantity spans in the text, followed by intelligent templates to identify units and modifiers. Finally, it identifies the quantity attributes and their relations using language models boosted with a feature re-using hierarchical architecture and multi-task learning. Our submission significantly outperforms the baseline, with the best model from the post-evaluation phase delivering a more than 100% increase in F1 (Overall) over the baseline.


Introduction
Most scientific experiments are accompanied by relevant measurements, which help researchers quantify their observations and qualitative arguments. Measurements also play a pivotal role in summarizing large experiments and provide a brief idea of the results obtained. It is customary for scientists to present their research in the form of scientific papers. Nowadays, with thousands of papers being published digitally every year, it is extremely difficult to go through every single paper in order to get the desired data. The most popular electronic open-access repository of e-prints, arXiv, currently has 1,867,929 articles 1 . The sheer vastness of this number suggests just how important it is for us to automate the task of extracting measurement-related information from research papers (Singh et al., 2016).
A thorough understanding of the measurements not only requires the numerals, but also the context in which the quantities occur. Moreover, the entities and the properties measured along with the qualifiers that condition the measurements are crucial for understanding the measurement. MeasEval (Harper et al., 2021) is a semantic relation extraction task focused on obtaining 9 different entities pertaining to counts, measurements and qualifying attributes of these quantities in a collection of excerpts from research papers in English. Figure  1 shows an example of a quantity along with its attributes and relations from this dataset.
We propose a three-stage pipeline to address this task. The first stage uses a pre-trained BERT model (Devlin et al., 2019) to detect quantity spans from sentences. Receiving the detected spans as inputs, the second stage obtains the units and modifiers using extracted units and modifier keywords. Finally, the third stage receives the quantity spans from the first stage and uses another pre-trained language model over each quantity-span-conditioned sentence to obtain quantity-span-aware contextualized representations for each sub-token in the sentence. These representations are then used to detect the measured entity corresponding to each quantity (if any). The predictions from the measured entity task are then fused with the individual representations for each sub-token. These representations are used to detect the measured property and the qualifiers in a multi-task learning setting (Ruder, 2017).
Our submission surpassed the baseline by a significant margin and ranked 3rd for the Unit task. Our current best model delivers 516.7%, 436.8%, and 296.4% F1 (Overlap) (Mei and Radev, 1979) gains for the Measured Entity, Measured Property and Qualifier tasks respectively, over the baseline.
Figure 1: An annotated example from the dataset: "We showed that co-deposition of blended mixtures leads to 60% higher photocurrents than in thickness-optimized Pth/C60 heterojunction counterparts [37]."

Related Works
Understanding and extracting information from scientific documents has been receiving increasing interest (Tsai et al., 2006; Nadeau and Sekine, 2007). Extracting units of measurement from scientific documents was previously studied via regular expressions and supervised classifiers (Berrahou et al., 2013; Sevenster et al., 2015).
In an orthogonal direction, there has been rapid progress in understanding natural language using deep pre-trained language models (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2020), which has led to general improvements across multiple tasks. Sequence labelling (Lample et al., 2016; Panchendrarajan and Amaresan, 2018) and span prediction (Luo et al., 2020; Pang et al., 2019) tasks for natural language have also received great interest recently. We build upon these systems.

Problem Statement
We are given a set of documents, where each document $d_i$ contains a set of Quantity (Q) spans $Q_i$. Every quantity $q_i^j \in Q_i$ can have a Unit of measurement (e.g. cm, ml) associated with it. Also, every $q_i^j \in Q_i$ is associated with some (or no) Modifiers (Mod), which provide information about the type of the Q (e.g. whether it is a range of values, whether it denotes the Median of a set of values, etc.; see the annotation guidelines: https://github.com/harperco/MeasEval/tree/main/annotationGuidelines). For every $q_i^j \in Q_i$, there can exist a corresponding Measured Entity (ME) $e_i^j$. Some Qs do not have any ME, e.g. in '3413 women', the measurement is 3413 and 'women' is the 'unit' of 3413 and not its ME (according to "S0006322312001096-1177.tsv"). Similarly, in 'three occasions', the measurement is 'three' and 'occasions' is its 'unit' and not its ME (according to "S0165587612003680-1078.tsv"). If a $q_i^j$ has a corresponding ME $e_i^j$, it can also have an associated Measured Property (MP) $p_i^j$. Finally, the Qs, MEs and MPs can have a number of Qualifiers (Qual) $qual_i^j$ providing additional information about them.
The relations between these spans are defined as follows:
• HQ(y, q) = 1 ⇐⇒ the Q, q, is related to element y, where y is an ME or MP (HasQuantity).
• HP(e, p) = 1 ⇐⇒ the ME, e, has the MP, p (HasProperty).
• QS(qual, y) = 1 ⇐⇒ the Qual, qual, qualifies element y, where y is a Q, ME or MP (Qualifies).
The problem statement consists of 5 sub-tasks. We deal with identifying all Q spans in the documents in sub-task 1, followed by detecting the Units and Mods for each identified Q in sub-task 2. In sub-tasks 3 and 4, we identify the ME, MP, and Qual spans, corresponding to the extracted Qs. Finally in sub-task 5, we identify the relationships HQ, HP, and QS between the detected Q, ME, MP, and Qual spans. Figure 1 shows the annotation procedure to be followed (Stenetorp et al., 2012).

System Overview
We model all the previously described sub-tasks as supervised learning problems. First, we perform minimal pre-processing of sentence segmentation and number normalization on the documents. Then, Stage 1 handles sub-task 1, Stage 2 handles sub-task 2, and the remaining sub-tasks are handled by Stage 3 of our pipeline.
Before proceeding to describe our approach, we describe the baseline model provided by the task organizers. The baseline treats the detection of Q, ME, MP and Qual spans all as sequence labeling problems. It uses the spaCy Entity Tagger model (Honnibal et al., 2020) to extract all these four spans. The Units for these Qs are obtained by matching the largest Units in these predicted spans with those from the train dataset.
Figure 2: Overview of our Pipeline

Stage 1
Similar to the baseline, we treat the Q span detection problem as a sequence labelling problem. This is an intuitive choice, as it can detect multiple spans within the same text segment while being significantly cheaper in terms of computation cost. Specifically, for a given sentence $s$, the input to our model is [CLS] $s$ [SEP]. It is sub-word tokenized (Wu et al., 2016) to get the one-hot sub-token sequence $w_0, w_1, \dots, w_n$. These sub-tokens are then fed to BERT to obtain the contextualized representations $x_0, x_1, \dots, x_n$, as follows.
First, the word vectors are obtained using the token embedding $E$ and the positional embedding $E_{pos}$:
$$h_j^{0} = E(w_j) + E_{pos}(j)$$
These vectors are then passed through $L$ layers of transformer encoder (Vaswani et al., 2017) to obtain the contextualized representations. Each transformer encoder layer $l$ receives the output vectors $\{h_j^{l-1}\}$ from the previous layer $l-1$ and computes
$$\tilde{h}_j^{l} = LN\big(h_j^{l-1} + MSA(h^{l-1})_j\big), \qquad h_j^{l} = LN\big(\tilde{h}_j^{l} + W_2^{l}\, f(W_1^{l} \tilde{h}_j^{l} + b_1^{l}) + b_2^{l}\big)$$
Here $MSA$ is multi-headed self-attention and $LN$ denotes layer normalization. $W_1^{l}, W_2^{l}, b_1^{l}, b_2^{l}$ are trainable parameters and $f$ is the activation function.
The final contextualized representations $\{x_0, x_1, \dots, x_n\}$ are the outputs of the $L$-th transformer layer. Finally, each representation $x_j$ (excluding $j = 0$ and $j = n$ for the [CLS] and [SEP] tokens) is classified into a binary label:
$$(y_j^{NQ}, y_j^{Q}) = W_c\, x_j + b_c$$
Here $W_c$ and $b_c$ are learnable parameters and $(y_j^{NQ}, y_j^{Q})$ are the logits for the non-quantity and quantity classes. This formulation of our problem can also be treated as the popular BIO tagging scheme excluding the 'B' beginning tag. The predicted labels are then used to greedily match the largest contiguous span of sub-tokens with positive labels.
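Below is a minimal PyTorch sketch of this stage, assuming HuggingFace's transformers library; the class and function names are illustrative and do not come from our released codebase.

```python
import torch
from transformers import AutoModel

# Illustrative Stage 1 tagger: BERT encoder + binary (non-quantity / quantity)
# classification head over every sub-token.
class QuantitySpanTagger(torch.nn.Module):
    def __init__(self, model_name="dmis-lab/biobert-v1.1"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = torch.nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)            # (batch, seq_len, 2) logits

def greedy_spans(labels):
    """Greedily merge contiguous positive labels into (start, end) sub-token spans."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i
        elif lab != 1 and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(labels) - 1))
    return spans
```

At inference, the argmax over the two logits of each sub-token (ignoring [CLS] and [SEP]) yields the label sequence that is passed to greedy_spans to recover the predicted Q spans.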

Stage 2
This stage receives the Q span predictions from Stage 1 as input and uses a method similar to the baseline to obtain the Units. We extracted the set of Units occurring in the annotated Qs from the documents in the dataset. However, in scientific documents, combinations of units are often present (e.g. Kgms$^{-2}$ is a combination of 'Kg', 'm' and 's'). Our future work includes extending our approach to exhaustively handle such complex combinations of units.
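Below is a minimal sketch of this template-based Unit matching; the unit vocabulary shown is illustrative rather than the full set extracted from the training annotations.

```python
import re

# Illustrative subset of the unit vocabulary mined from the annotated training Qs.
TRAIN_UNITS = {"cm", "ml", "kg", "mm", "K", "%", "s"}

def match_unit(quantity_span: str):
    """Return the longest known unit occurring inside a predicted Q span, if any."""
    found = [u for u in TRAIN_UNITS
             if re.search(r"(?<![A-Za-z])" + re.escape(u) + r"(?![A-Za-z])",
                          quantity_span)]
    return max(found, key=len) if found else None

print(match_unit("60 cm"))   # -> "cm"
print(match_unit("3413"))    # -> None
```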
To obtain the keywords for modifiers, given a Q span, we extracted the set of tokens occurring inside the span as well as in a neighboring window of 10 characters on either side of the actual span. We discarded stopwords, punctuation marks and numbers. Then, we calculated the rate of co-occurrence between the remaining set of tokens and the Mods in the train dataset. This helped us obtain keywords acting as significant cues for the respective Mod classes. Examples include "approximately" for IsApproximate, "greater than" for IsRange, etc. Another challenge with this sub-task is the presence of similar sets of keywords corresponding to multiple Mod types. For example, the Mods 'IsMean' and 'IsMeanHasTolerance' are very similar, with the slight difference that keywords corresponding to the Mod 'IsMeanHasTolerance' contain the additional symbol '±'. We adopted a hierarchical approach in order to detect such minute differences and correctly identify the type of Mod for every Q span; e.g. IsMeanHasTolerance is true when IsMean and HasTolerance are both true. We started by detecting a general Mod class, and gradually used extra cues to classify the span into more specific Mod classes such as {IsMeanHasSD, IsMeanHasTolerance, IsRangeHasTolerance, IsList}.
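The following sketch illustrates this hierarchical, keyword-based Mod detection; the cue lists are small illustrative examples, not the complete keyword sets mined from the co-occurrence statistics.

```python
# Illustrative cue keywords per general Mod class (assumed, not the mined lists).
MOD_CUES = {
    "IsApproximate": ["approximately", "about", "roughly"],
    "IsRange": ["greater than", "less than", "between"],
    "IsMean": ["mean", "average"],
    "IsMedian": ["median"],
}

def detect_mods(context: str):
    """Detect general Mod classes first, then refine using extra cues such as '±'."""
    text = context.lower()
    mods = {mod for mod, cues in MOD_CUES.items() if any(c in text for c in cues)}
    if "±" in context:                       # refine into the more specific classes
        if "IsMean" in mods:
            mods.discard("IsMean")
            mods.add("IsMeanHasTolerance")
        elif "IsRange" in mods:
            mods.discard("IsRange")
            mods.add("IsRangeHasTolerance")
        else:
            mods.add("HasTolerance")
    return mods
```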

Stage 3
The input to this stage is the sentence-quantity tuple 〈s, q〉 and our objective is to detect the spans for ME, MP and Qual. There could be multiple Qs in a single sentence. We treat detecting ME, MP, and Qual as three sequence labeling sub-tasks in a multi-task learning setting.
We create a modified sentence $s'$ where the Q span $q$ inside the sentence is enclosed within a special start marker 〈E〉 and a special end marker 〈/E〉 (Baldini Soares et al., 2019; Kaushal and Vaidhya, 2020; Zong et al., 2020). We additionally use a special segment embedding for the Quantity ($q$) portion of the quantity-context encoded sentence $s'$, different from the remainder of the sentence. We input $s'$ and the corresponding segment embeddings to BERT and obtain quantity-aware contextualized vectors $\{v_1, v_2, \dots, v_n\}$ for each of the $n$ sub-tokens in $s'$. We then obtain the ME task logits $e_i$ for each sub-token vector $v_i$:
$$e_i = W_e^{T} v_i + b_e$$
Here $W_e$ and $b_e$ are learnable parameters. Now, as per the annotation rules of the task, a Q will have an associated MP only if an ME related to the given Q exists. Hence, for predicting the MP, we extract features from the ME task logits and concatenate them with each sub-token vector $v_i$ as follows:
$$p_i = W_p^{T}\, [\, v_i \,;\, \max_k(e_k) \,;\, \mathrm{mean}_k(e_k) \,] + b_p$$
Here $W_p$ and $b_p$ are learnable parameters, $\max$ and $\mathrm{mean}$ are element-wise operations over the ME logits of all sub-tokens, $;$ denotes concatenation, and $p_i$ is the logit of the $i$-th sub-token for the MP sub-task. Similarly, we obtain the logits $qu_i$ corresponding to the Qual task for every sub-token vector $v_i$ of the sentence $s'$:
$$qu_i = W_{qu}^{T}\, [\, v_i \,;\, \max_k(e_k) \,;\, \mathrm{mean}_k(e_k) \,] + b_{qu}$$
Here $W_{qu}$ and $b_{qu}$ are learnable parameters. The model is trained with the following combined multi-task learning objective:
$$\mathcal{L}_{total} = \sum_{i} \Big[ \mathcal{L}(e_i, \hat{e}_i) + \mathcal{L}(p_i, \hat{p}_i) + \mathcal{L}(qu_i, \hat{qu}_i) \Big]$$
Here $\hat{e}_i$, $\hat{p}_i$ and $\hat{qu}_i$ are the ground truths for each sub-token for the ME, MP and Qual sub-tasks respectively; $\mathcal{L}$ is the softmax cross-entropy loss (Dunne and Campbell, 1997).
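A simplified PyTorch sketch of these multi-task heads with ME-logit feature re-use is given below; the encoder, segment embeddings and training loop are omitted, and the names and dimensions are illustrative.

```python
import torch

# Illustrative Stage 3 heads: the MP and Qual heads consume each sub-token vector
# concatenated with max- and mean-pooled ME logits (2 + 2 extra dimensions).
class MultiTaskHeads(torch.nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.me_head = torch.nn.Linear(hidden_size, 2)
        self.mp_head = torch.nn.Linear(hidden_size + 4, 2)
        self.qual_head = torch.nn.Linear(hidden_size + 4, 2)

    def forward(self, v):                              # v: (batch, seq_len, hidden)
        e = self.me_head(v)                            # ME logits per sub-token
        pooled = torch.cat([e.max(dim=1).values, e.mean(dim=1)], dim=-1)  # (batch, 4)
        pooled = pooled.unsqueeze(1).expand(-1, v.size(1), -1)
        fused = torch.cat([v, pooled], dim=-1)
        return e, self.mp_head(fused), self.qual_head(fused)
```

The three per-sub-token logit sequences are each trained with softmax cross-entropy and summed into the combined objective described above.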
Similar to Stage 1, we greedily match the longest contiguous positive-labeled spans for each of the three sub-tasks and obtain the ME span e, MP span p and Qual span qu corresponding to the input Q span q for the sentence s. Here (q, e, p, qu) forms an annotation set, which is then post-processed to generate the relations HP, HQ and QS on this annotation set as per their definitions in §3, as sketched below.
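One possible post-processing sketch follows; it assumes that when both an ME and an MP are present, the HQ relation attaches the MP (rather than the ME) to the Q, and it uses a simple heuristic for the Qualifier target, which simplifies the actual annotation rules.

```python
# Illustrative relation generation from one annotation set (q, e, p, qu);
# e, p, qu may be None when the corresponding span was not predicted.
def build_relations(q, e=None, p=None, qu=None):
    relations = []
    if e is not None and p is not None:
        relations.append(("HasProperty", e, p))      # ME -> MP
        relations.append(("HasQuantity", p, q))      # MP -> Q  (assumed convention)
    elif e is not None:
        relations.append(("HasQuantity", e, q))      # ME -> Q
    if qu is not None:
        target = p if p is not None else (e if e is not None else q)
        relations.append(("Qualifies", qu, target))  # heuristic choice of target
    return relations
```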

Experiments and Discussion
All experiments were performed using PyTorch (Paszke et al., 2019) and HuggingFace's transformers (Wolf et al., 2019). Optimization was done using Adam (Kingma and Ba, 2014). We include the complete set of experimental parameters in §D.

Development Phase
After dividing the 5 sub-tasks into 3 stages, we worked on each stage individually. We trained the models exclusively on the train dataset and used the trial dataset for validation and hyperparameter tuning. We used the F1, Precision and Recall metrics for each token in the sequence labeling sub-tasks for evaluating individual components over the validation set during the development phase. Table 1 shows the performance of various BERT models in Stage 1. We observe that BioBERT delivers the best F1 score, followed by BERT-base and RoBERTa-BioMed. Much to our surprise, BERT-Large and SciBERT performed worse than BERT-base despite their large size (Li et al., 2020) and domain specificity.
In order to understand the role of each component of our model in Stage 3, we perform various ablation studies, as shown in Table 2. First, we experiment with various combinations of multi-task learning with the three tasks: ME, MP and Qual. We observe that multi-task learning can lead to significant gains on all three tasks; only the multi-task combination of ME and Qual led to a performance reduction. Multi-task training of all three tasks together gives nearly the best performance on all three metrics. We attribute this gain in performance to the inter-related nature of the three sub-tasks.
Second, we study the importance of segmentation and feature concatenation. We create BERT X, which does not add separate segment embeddings for the Q span, and BERT Y, which does not concatenate the ME logit features for predicting the MP and Qual spans. From Table 2, we observe that BERT X suffers a significant reduction in performance on all three sub-tasks, as its input does not have a clear demarcation between the Q span portion and the non-Q span portion of the sentence. We also observe a reduction in performance for MP and Qual for BERT Y, showing the importance of fusing the ME logits for these two sub-tasks.
Similar to Stage 1, we experiment with various BERT models, as shown in Table 3. Here we observe that RoBERTa-BioMed, BioBERT and BERT-Large perform the best for ME, MP and Qual respectively, while BERT-Base performs the worst for all of them. All the models except BioBERT have a significantly lower F1 for Qual than BERT-Large. Every model achieves an F1 greater than 0.5 for ME.

Post-Evaluation Phase
The evaluation was done using the official script 3 . The classification and relation extraction sub-tasks were both evaluated by a binary match score, and the span identification tasks by a SQuAD-style F1 (Overlap) score (Rajpurkar et al., 2016). For our official submission, we selected BioBERT as it achieved the best F1 score in Stage 1 and near-best performance for the tasks in Stage 3. Minor discrepancies in the submission format involving the annot-id reference, quotes, whitespace sensitivity and utf-8 encoding, which were not detected by the evaluation script, were fixed in the post-evaluation phase. Table 4 shows the final performance of our models. After proper conversion to the desired format during the post-evaluation phase, we also evaluated various other BERT models along with our best model, BioBERT. BioBERT delivers the best performance of 0.456 F1 (Overall), followed by RoBERTa-BioMed and SciBERT. BioBERT also performs best on 7 of the 9 individual tasks.

Future Work
Stage 3 of our pipeline operates at a sentence-level, so for a given Q span, it does not capture the ME, MP, and Qual spans occurring across sentences. However, our approach can be easily extended to consider the nearby sentences or even the entire document (at the cost of computation speed).
The identification of exact word boundaries for the span identification tasks is crucial. Treating these tasks as sequence labeling problems and greedily matching for spans can lead to a few problems. For example, if a sub-token occurring within a long span is mislabeled, then the span is split into two components. In the future, we can explore leveraging contrastive learning (Chen et al., 2020) to improve the predictions for exact word boundary matches. We can also add transition-based labeling layers such as Conditional Random Fields (CRFs) (Wallach, 2004) on top of the more popular BIO/BIOES sequence tagging schemes (Yang et al., 2018).
Lastly, while the multi-staged approach is fairly interpretable at the intermediate outputs of Q spans, it also leads to a few issues. The predictions for MP, ME and Qual spans in Stage 3 are heavily dependent on the Q spans from Stage 1, and there does not exist any mechanism to rectify errors in Stage 1 later, in our approach. There is also an exposure bias (Schmidt, 2019;Galloway et al., 2019) as the model is trained on the ground truth, while tested on the predicted Q spans. Moreover, we believe that having common weights between the BERT models of Stage 1 and Stage 3 will not only make our approach faster and lighter, but also more performant through multi-task learning.

Conclusion
In this paper, we present our system for SemEval-2021 Task 8: MeasEval, which is aimed at extracting entities and semantic relations pertaining to counts and measurements. We use a multi-staged approach where we first identify the quantity spans using BERT, then the units and modifiers for these predicted quantity spans using intelligent templates that leverage extracted units and modifier keywords. Finally, we input the quantity-aware sentences to another BERT model to predict ME, MP, and Qual spans in a multi-task learning setting with feature re-use. Our submission achieved the second runner-up position on the leaderboard for the Unit identification sub-task, and it showed the highest improvement in the post-evaluation phase, with an F1 (Overall) score only 0.063 lower than the highest score across both phases.

A Appendices
Following is the overview of the appendix.
• §B - We provide implementation details: codebases, trained models and dependency details.
• §C - We provide details of the dataset used in the shared task, its statistics and the annotation set for the task.
• §D - We detail the experimental settings and hyperparameters.

B Code and Dependencies
We will make our code public 4 with instructions to replicate our systems. We also release our pre-trained models for our submissions 5 . All experiments were performed using the PyTorch (Paszke et al., 2019) and HuggingFace transformers (Wolf et al., 2019) libraries. The optimization was done using the Adam optimizer (Kingma and Ba, 2014). We used git for the reproducibility setup. In Table 7 we list all the dependencies used in our codebase. We include a step-by-step guide to set up and run the codebase, along with details to set up our environment, in our README file within the code.
4 https://github.com/Ayushk4/SE-T8

C Dataset Details
We experiment on the dataset provided by the task organizers, consisting of gold annotations (Harper et al., 2021) for the set of scientific documents in English which are released here 6 . These scientific documents are a subset of the Elsevier Labs OA-STM-Corpus available publicly 7 .
Basic Annotation Set: The basic annotation set consists of 4 types of spans and 3 types of relations between them. The span types are Quantity (counts and measurements), Measured Entity (the item whose measurement/count is provided by the Quantity spans), Measured Property (the property of the Measured Entity, whose measurement is provided by the Quantity spans) and Qualifier (special circumstances which affect a particular measurement). These spans are related using three types of Relations -HasQuantity (relates a Measured Entity or a Measured Property to a Quantity), HasProperty (relates a Measured Entity to a Measured Property) and Qualifies (relates a Qualifier to any Measured Entity, Measured Property, or Quantity).

Table 9: HuggingFace model names corresponding to the BERT models we used.
Model | HuggingFace Model API
BERT-base | bert-base-cased
BERT-large | bert-large-cased
RoBERTa-BioMed | allenai/biomed_roberta_base
SciBERT | allenai/scibert_scivocab_cased
BioBERT | dmis-lab/biobert-v1.1

Statistics: The complete dataset is divided into three parts: train, trial and eval. We train on the train set. Trial is used for validation, and Eval is the held-out test dataset on which the final performance of the models is evaluated. In Table 5, we list the dataset statistics for the spans of each type. In Table 6, we list the dataset statistics related to the various relations (HP, HQ, QS).

D Experimental Settings
Preprocessing: We sentence tokenize every document using the NLTK sentence tokenizer. We observed that phrases such as "Fig. 1", "Table. 2" and "et al.", along with a few others, caused sentences to be split at the wrong positions (due to the presence of "."). We detected and re-joined the instances of such phrases.
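A minimal sketch of this segmentation step, with an illustrative abbreviation list, is given below.

```python
import nltk   # requires the NLTK 'punkt' sentence tokenizer models

# Illustrative list of phrases that wrongly trigger sentence breaks.
ABBREVIATIONS = ("Fig.", "Table.", "et al.")

def segment(document: str):
    """Sentence-tokenize a document and re-join sentences split after abbreviations."""
    merged = []
    for sent in nltk.sent_tokenize(document):
        if merged and merged[-1].rstrip().endswith(ABBREVIATIONS):
            merged[-1] = merged[-1] + " " + sent    # re-join wrongly split sentence
        else:
            merged.append(sent)
    return merged
```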
Normalization: We normalized the dataset by replacing all numerals with the same digit, 0. This helped our model identify the Q spans better. We observed that without normalization, the F1 (Overlap) score for Q spans decreased considerably (from 0.844 to 0.790).
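This normalization amounts to a simple digit substitution, sketched below.

```python
import re

def normalize_numbers(text: str) -> str:
    """Replace every digit with '0' so that all numerals share one surface form."""
    return re.sub(r"\d", "0", text)

print(normalize_numbers("60% higher at 25 K"))   # -> "00% higher at 00 K"
```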
Training and Hyperparameters: The models take ≈ 20 seconds per epoch on a Tesla P100. The number of parameters is the same as in BERT. Table 9 lists the HuggingFace model names corresponding to the BERT models we used. We validated our models using F1 metrics for Stage 1 and Stage 3 over the trial dataset. In Table 10 we share the sets of hyperparameters that we explored, whereas in Table 8 we mention the best set of hyperparameters that we obtained.