UPB at SemEval-2020 Task 6: Pretrained Language Models for Definition Extraction

This work presents our contribution in the context of the 6th task of SemEval-2020: Extracting Definitions from Free Text in Textbooks (DeftEval). This competition consists of three subtasks with different levels of granularity: (1) classification of sentences as definitional or non-definitional, (2) labeling of definitional sentences, and (3) relation classification. We use various pretrained language models (i.e., BERT, XLNet, RoBERTa, SciBERT, and ALBERT) to solve each of the three subtasks of the competition. Specifically, for each language model variant, we experiment with both freezing its weights and fine-tuning them. We also explore a multi-task architecture that was trained to jointly predict the outputs for the second and the third subtasks. Our best performing model, evaluated on the DeftEval dataset, ranks 32nd for the first subtask and 37th for the second subtask. The code is available for further research at: https://github.com/avramandrei/DeftEval


Introduction
Definition extraction from text is a challenging research task, addressed by numerous researchers in the area of natural language processing (NLP). Factual question-answering systems are one possible application that can benefit from the results of this task (Zhang and Jiang, 2009). As a response to this challenge, Spala et al. (2019) introduced the Definition Extraction from Texts (DEFT) corpus, a human-annotated English dataset that contains multi-domain (e.g., biology, sociology, physics) term-definition pairs from two types of documents, free text (i.e., Textbooks) and semi-structured text (i.e., Contracts), as opposed to the domain-specific WCL dataset (Navigli and Velardi, 2010). In addition, a shared task was proposed at SemEval-2020, aimed at evaluating the performance of each participating system on three subtasks defined over the DEFT corpus. The three subtasks are the following: Subtask 1: Given a labeled dataset of sentences, the goal is to build a classifier that distinguishes sentences containing both a definition and the defined term from sentences that do not. We provide two examples of sentences that contain a definition, from the Biology and Physics domains, respectively: • The metabolome is the complete set of metabolites that are related to the genetic makeup of an organism.
• Polarization is the separation of charges in an object that remains neutral.
Subtask 2: Given a dataset of tokenized sentences, we aim to label each token with one of the following classes: Term, Alias-Term, Referential-Term, Definition, Referential-Definition, or Qualifier. The meaning of each token can be found in the corpus description paper (Spala et al., 2019).
Subtask 3: Given a dataset of tokenized sentences and the tag id of each token, our goal is to predict, for each token, the type of relationship it has with another token, as well as the tag id of the token it is related to. This is a classical relation extraction task (Zeng et al., 2018), and the relations that must be extracted are: Direct-defines, Indirect-defines, Refers-to, AKA, and Supplements.
Previous approaches to recognizing definitional sentences (Klavans and Muresan, 2001; Fahmi and Bouma, 2006; Zhang and Jiang, 2009) mainly focused on the use of linguistic clues (e.g., "is", "means", "are", "a", or "()"). However, these approaches fail on definitional sentences in which such clues are absent. In recent years, neural network-oriented solutions have emerged as another line of research for capturing definitions in text. Anke and Schockaert (2018) proposed an architecture that relies on two models, a convolutional neural network (Fukushima and Miyake, 1982) and a bidirectional long short-term memory network (Hochreiter and Schmidhuber, 1997). Recently, Veyseh et al. (2019) combined more advanced deep learning techniques by leveraging graph convolutional neural networks with both syntactic and semantic information. However, the existing methods for definition extraction do not benefit from pretrained language models (Lan et al., 2019). Motivated by their recent performance across many NLP tasks, we employ these models for solving the previously mentioned subtasks.
For the third subtask, we did not take into consideration the fact that we were given the set of correct tags from the second subtask (which helps tremendously in relation classification). As a result, our approaches obtained poor results on both the development and test datasets, and this subtask will not be discussed in the rest of the paper.
Our main contributions can be summarized as follows: • We explore various pretrained language models and we depict their results for the first and the second subtasks, respectively.
• As has been shown on other corpora (Peters et al., 2019), we find that fine-tuning the weights of the pretrained language models on the DEFT corpus gives a boost in performance, as opposed to freezing them.
• Additionally, we investigate a RoBERTa model (Liu et al., 2019) within a multi-task architecture that jointly learns the outputs of the second and the third subtask. However, we report only its performance for the second subtask for the reason presented above.

Pretrained Language Models
All pretrained language models presented below use the Transformer encoder (Vaswani et al., 2017) to produce contextualized embeddings. The Transformer is a self-attention mechanism that can capture long-distance dependencies between its inputs. For each language model, we use the implementation that is publicly available in the HuggingFace repository 1 .

BERT
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) was bidirectionally trained using two strategies: (1) Masked Language Modeling (MLM), and (2) Next Sentence Prediction (NSP), on both the BooksCorpus (Zhu et al., 2015), containing 800M words, and the English Wikipedia, containing 2,500M words. It improved the existing results on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) by 7%.

XLNet
XLNet trains with a new objective, called Permutation Language Modeling (PLM): instead of predicting the tokens in sequential order, as a traditional autoregressive model does, it predicts the tokens in a random order. Moreover, aside from using PLM, XLNet relies on Transformer XL, a variation of the Transformer architecture that can capture a longer context through recurrence, as its base architecture. XLNet surpassed BERT on a series of NLP tasks, and set a new state-of-the-art result on the GLUE benchmark with an average F1-score of 88.4%.

RoBERTa
Later on, the Robustly optimized BERT pretraining approach (RoBERTa) (Liu et al., 2019) adopted the MLM strategy of BERT, but removed the NSP objective. Moreover, the model was trained with a much larger batch size and learning rate, on a much larger dataset, showing that the training procedure alone can significantly improve the performance of BERT on a variety of NLP tasks. When it was released, RoBERTa reached the top of the GLUE leaderboard with an F1-score of 88.5%, improving on BERT by 6.4% and outperforming the previous leader, XLNet, by 0.1%.

SciBERT
Science BERT (SciBERT) (Beltagy et al., 2019) is a pretrained language model based on BERT. As opposed to BERT, the novelty here is that SciBERT was trained on large multi-domain corpora of scientific publications to improve performance on domain-aware NLP tasks. Experimental results show that SciBERT significantly surpassed BERT on biomedical, computer science, and other scientific domains.

ALBERT
A Lite BERT (ALBERT) (Lan et al., 2019) is a model that has fewer parameters than the classical BERT, but still maintains high performance. The contributions of ALBERT consist of two key parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. In addition, ALBERT replaces NSP with a sentence-order prediction objective during training. ALBERT obtained an average F1-score of 89.4% on the GLUE benchmark, pushing the state of the art by 0.6%.
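The effect of factorized embedding parameterization can be illustrated with a quick parameter count. The sizes below are hypothetical (a BERT-like vocabulary of 30,000, hidden size 768, and an assumed embedding size of 128):

```python
# Hypothetical sizes: vocabulary V, hidden size H, embedding size E.
# BERT ties the embedding size to the hidden size; ALBERT factorizes
# the V x H matrix into a V x E lookup followed by an E x H projection.
V, H, E = 30_000, 768, 128

direct_params = V * H              # single V x H embedding matrix (BERT-style)
factorized_params = V * E + E * H  # V x E lookup + E x H projection (ALBERT-style)
```

With these sizes, the factorized embedding uses roughly a sixth of the parameters of the tied version, which is where much of ALBERT's parameter saving comes from.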

Conditional Random Fields
The most common method for treating a sequence labeling task is the Conditional Random Field (CRF) model (Lafferty et al., 2001). As mentioned by Alzaidy et al. (2019), for a sequence x = (x_1, ..., x_n) of input words and a sequence y = (y_1, ..., y_n) of output tags, the CRF constructs a conditional probability distribution in the following manner:

p(y|x) = \frac{1}{Z(x)} \prod_{i=1}^{n} \exp(W_{y_{i-1},y_i} \cdot x_i + b_{y_{i-1},y_i})

where Z(x) is the normalization factor, and the parameters W_{y_{i-1},y_i} and b_{y_{i-1},y_i} are the weight matrix and the bias, respectively. To estimate the parameters W and b, we perform a maximization of the log-likelihood function:

(\hat{W}, \hat{b}) = \arg\max_{W,b} \sum_{k} \log p(y^{(k)} | x^{(k)}; W, b)

Once the CRF is trained, we use the Viterbi algorithm (Forney, 1973) to find the most probable sequence among all possible tag sequences.
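The Viterbi decoding step can be sketched as follows. This is a generic max-score dynamic program over emission and transition scores, not the exact code used in our system:

```python
import math

def viterbi(emissions, transitions, num_tags):
    """Find the most probable tag sequence under a linear-chain CRF.

    emissions:   per-token score lists, shape (seq_len, num_tags)
    transitions: transitions[i][j] = score of moving from tag i to tag j
    """
    seq_len = len(emissions)
    # score[t][j]: best score of any path ending in tag j at position t
    score = [[0.0] * num_tags for _ in range(seq_len)]
    backptr = [[0] * num_tags for _ in range(seq_len)]
    score[0] = list(emissions[0])
    for t in range(1, seq_len):
        for j in range(num_tags):
            best_prev, best_val = 0, -math.inf
            for i in range(num_tags):
                val = score[t - 1][i] + transitions[i][j]
                if val > best_val:
                    best_prev, best_val = i, val
            score[t][j] = best_val + emissions[t][j]
            backptr[t][j] = best_prev
    # Backtrack from the best-scoring final tag
    best_last = max(range(num_tags), key=lambda j: score[-1][j])
    path = [best_last]
    for t in range(seq_len - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))
```

In practice, a library implementation of the CRF layer would be used; the sketch only makes the max-product recursion explicit.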

Approaches based on Pretrained Language Models
For the first subtask, we add a two-layered feed-forward neural network with 512 neurons in each layer on top of the [CLS] contextualized embedding (as proposed in the original BERT paper for single-sentence classification tasks (Devlin et al., 2019)), mapping this embedding to a scalar. By applying a sigmoid function to this scalar, we obtain the probability that a sentence contains a definition.
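A minimal NumPy sketch of this classification head is shown below, with randomly initialized weights. The 768-dimensional [CLS] embedding and the ReLU activation are assumptions (base Transformer models typically use 768 dimensions; the activation is an implementation detail not fixed by the description above):

```python
import numpy as np

rng = np.random.default_rng(0)

EMB = 768     # assumed [CLS] embedding size for a base model
HIDDEN = 512  # hidden size of each feed-forward layer, as in the paper

W1 = rng.normal(scale=0.02, size=(EMB, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.02, size=(HIDDEN, HIDDEN))
b2 = np.zeros(HIDDEN)
w_out = rng.normal(scale=0.02, size=HIDDEN)
b_out = 0.0

def definition_probability(cls_embedding):
    """Map the [CLS] embedding to P(sentence contains a definition)."""
    h = np.maximum(0.0, cls_embedding @ W1 + b1)  # first layer + assumed ReLU
    h = np.maximum(0.0, h @ W2 + b2)              # second layer + assumed ReLU
    logit = h @ w_out + b_out                     # project to a scalar
    return 1.0 / (1.0 + np.exp(-logit))           # sigmoid -> probability

p = definition_probability(rng.normal(size=EMB))
```

In the actual system this head sits on top of the language model and is trained end to end; the sketch only shows the forward pass.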
For the second subtask, we also map the contextualized embeddings generated by each pretrained language model into a lower-dimensional space using a two-layered feed-forward neural network. Then, we use these mappings to train a CRF model that learns to predict the most probable sequence of labels for a given input. The main problem that we encountered in this subtask was that training the language models required a special tokenization, namely Byte Pair Encoding (BPE), which differed from the tokenization used to create the corpus. To mitigate this issue, we reconstructed each sentence from its tokens and split it again into BPE subtokens. Then, to map the subtokens back to the original tokens, we employed a character matching algorithm similar to the one used by spacy-transformers 2 .
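A simplified version of this character-matching alignment can be sketched as follows. The marker stripping is an assumption (different tokenizers use different subword markers), and whitespace handling is glossed over:

```python
def align_subtokens(tokens, subtokens):
    """Map each BPE subtoken to the index of the original token it came from,
    by matching characters left to right."""
    # Strip common BPE/WordPiece markers; a real implementation would be
    # tokenizer-specific (this list is an assumption for illustration).
    clean = [s.lstrip("\u0120\u2581").replace("##", "") for s in subtokens]

    token_ends, pos = [], 0
    for tok in tokens:
        pos += len(tok)
        token_ends.append(pos)  # cumulative character offset of each token end

    alignment, char_pos, tok_idx = [], 0, 0
    for sub in clean:
        # Advance to the original token covering the current character position
        while tok_idx < len(tokens) - 1 and char_pos >= token_ends[tok_idx]:
            tok_idx += 1
        alignment.append(tok_idx)
        char_pos += len(sub)
    return alignment
```

For example, the subtokens of "extrapolate" all map back to the single original token, so its CRF labels can later be merged into one word-level label.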
The BPE tokenization also introduced a problem at inference because the predicted labels for the subtokens of a word might not match, so a label for the whole word could not be inferred directly. To solve this problem, we took the label of the majority or, if the labels were equally distributed, we selected the label of the first subtoken. Figure 1 depicts the proposed solution for the problem of label mismatch for the word "extrapolate".
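The label resolution rule described above can be sketched as:

```python
from collections import Counter

def word_label(subtoken_labels):
    """Resolve one label per word from its subtoken labels: take the majority
    label; if the labels are equally distributed, keep the first subtoken's."""
    top = Counter(subtoken_labels).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        # Tie between the most frequent labels: fall back to the first subtoken
        return subtoken_labels[0]
    return top[0][0]
```

For a word like "extrapolate" split into three subtokens, two matching predictions out of three are enough to fix the word-level label.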

Multi-task Learning Approach
In this work, we also experimented with a language model that jointly learned to predict the tags, the tag ids, and the relations. Figure 2 depicts the architecture of the multi-task learning setting. To create this framework, we projected the contextualized embeddings generated by RoBERTa into three output vectors: one for the tags (subtask 2), and two for the tag ids and the relations (subtask 3).
The approach to the tag prediction subtask in the multi-task context is identical to its single-subtask counterpart. To predict the tag ids, we assume that the maximum number of possible tags in a paragraph is 10 and that the id of a tag is given by its position in this context. Once we identify all tag ids, we predict the corresponding relations.
The learning objective of the multi-task method is to predict the outputs for the three targets by minimizing the following multi-task loss function:

L(y_{tag}, y_{id}, y_{rel}, \hat{y}_{tag}, \hat{y}_{id}, \hat{y}_{rel}) = \lambda_1 L_1(y_{tag}, \hat{y}_{tag}) + \lambda_2 L_2(y_{id}, \hat{y}_{id}) + \lambda_3 L_3(y_{rel}, \hat{y}_{rel})

where y_{tag}, y_{id}, y_{rel} are the true labels for tags, tag ids, and relations, respectively, while \hat{y}_{tag}, \hat{y}_{id}, \hat{y}_{rel} are the corresponding predictions. Also, \lambda_1, \lambda_2, \lambda_3 and L_1, L_2, L_3 are the weights and the individual loss functions, respectively.
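In code, this objective reduces to a weighted sum of the per-subtask losses. The sketch below assumes cross-entropy for each head (the loss we use for the multi-task architecture) and the equal weights of 0.33 from our experiments:

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class under a predicted distribution."""
    return -math.log(probs[gold_index])

def multitask_loss(tag_loss, id_loss, rel_loss, weights=(0.33, 0.33, 0.33)):
    """L = lambda_1*L_1 + lambda_2*L_2 + lambda_3*L_3 over the three heads."""
    l1, l2, l3 = weights
    return l1 * tag_loss + l2 * id_loss + l3 * rel_loss
```

Each head's loss is computed on its own output vector; the gradients of the shared RoBERTa encoder then accumulate contributions from all three.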
Performance Evaluation

Dataset and Preprocessing
The DeftEval dataset contains imbalanced classes for all the subtasks considered. For example, the first subtask has 11,090 sentences that do not contain a definition and 5,569 that do. To handle this issue, we balance the classes by doubling the number of sentences that contain a definition, obtaining a new total of 11,138 positive samples. Moreover, the second subtask has highly imbalanced classes, ranging from 93,204 occurrences of the Definition tag to 256 of the Referential-Term tag. To balance the classes in this case, we oversample each sentence containing an under-represented tag by a factor inversely proportional to that tag's frequency. The initial and final number of entries for each tag (i.e., before and after applying the oversampling technique), along with the corresponding multiplication factor, are depicted in Table 1. As one can note, the final number of tags is not equal to the initial number multiplied by its factor; this is because other tags are also affected when a sentence is oversampled for a certain tag with a particular multiplication factor.

The preprocessing step consists of removing the artifacts that could interfere with the language model representation of the sentences. For the first subtask, we replace URLs and equations with two special tokens, <url> and <equation>, respectively, during the fine-tuning process, and remove them when freezing the language model weights. We also eliminate the artifacts that come from text formatting, as well as the spaces before punctuation. For the second subtask, we discard the characters that are not recognized by the language models and that could break the tag-subtoken matching process, such as the mis-encoded "Â°" character or Greek letters. We also replace accented characters with their corresponding unaccented versions.
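Parts of this preprocessing pipeline can be sketched in a few lines. The regular expressions below are assumptions, since the exact patterns are not spelled out above:

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")  # assumed URL pattern

def preprocess_sentence(text):
    """Replace URLs with a special token, strip accents, and remove spaces
    before punctuation, following the preprocessing described in the paper."""
    text = URL_RE.sub("<url>", text)
    # Replace accented characters with their unaccented versions
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove spaces before punctuation
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text
```

Equation replacement with <equation> would require a corpus-specific pattern and is omitted from the sketch.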

Experimental Settings
For training purposes, we employ the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2e-5 for both the frozen and fine-tuned versions. We train each language model for 100 epochs with a batch size of 16, and we save only the checkpoint that obtains the highest performance on the development dataset. We decided not to use early stopping because we observed that the models can still improve significantly even after a long period of stagnation.
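This checkpoint-selection rule (keep only the best development score, no early stopping) reduces to a running maximum over epochs:

```python
def select_best(dev_scores):
    """Return the 0-based epoch index with the best development score.
    Training always runs for all epochs; no early stopping is applied."""
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
    return best_epoch
```

A model whose dev score stagnates for many epochs is therefore still kept in training, and a late improvement simply replaces the saved checkpoint.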
The feed-forward layer, which is placed on top of each language model to project the contextualized embeddings into the output space, has a hidden layer of size 512, allowing the system to learn a more complex family of functions. We regularize the hidden layer with a high 80% dropout for the fine-tuned versions, in order to make their learning slower and more robust, and with a 20% dropout for the frozen versions, to avoid underfitting.

Table 2: Comparison of performance on both the development and test datasets for the first subtask (top) and the second subtask (bottom), respectively. The frozen versions denote that the weights of the respective language model were not updated during the training phase, while the fine-tuned versions denote that their weights were updated during the training process.
Due to computational constraints, we adopt only the base version of each language model tested in the current work. Furthermore, we use the cased version of each language model where available. Finally, we select the cross-entropy loss function for the multi-task architecture and set all the weights λ_1, λ_2, λ_3 equal to 0.33.

Results and Analysis
As mentioned above, we conducted experiments with a total of five pretrained language models, including BERT, XLNet, RoBERTa, SciBERT, and ALBERT. More specifically, we use each language model by freezing and fine-tuning its weights. We also experiment with a multi-task architecture that is a joint learning technique to predict the outputs for both the second and the third subtask. Table 2 reports the evaluation metrics, Macro-Precision, Recall, and F1-scores, respectively, on both development and test datasets.
It can be observed from the two tables that fine-tuning the weights of the language models offers a large boost in performance: the results show an improvement of up to 11.4% on the development dataset for the first subtask (in the case of XLNet) and up to 23.4% for the second subtask (in the case of RoBERTa). Moreover, the results on the development dataset show that fine-tuned RoBERTa is the best performing language model for both subtasks. Thereby, this was the only model submitted for evaluation; it obtained an F1-score of 0.777 for the first subtask and a macro F1-score of 0.439 for the second subtask, ranking 32nd and 37th on the leaderboard, respectively.

Figure 3 depicts the detailed confusion matrices of the submitted models for the first and the second subtasks on the evaluation dataset. In the case of the first subtask, we can observe that the model is slightly biased towards predicting positive labels, resulting in more false positives, which is somewhat expected given the balancing method we applied. To improve the visualization of the results, the confusion matrix for the second subtask was normalized along the true labels of the test set. The dominant tags, Definition and Term, as well as Ref-Term, the tag that was most heavily oversampled, were the least confused by the model, with an accuracy of 64%, almost double that obtained for the rest of the tags. Moreover, most of the tags are misclassified as the O tag, with the exception of the Ref-Term tag, which is misclassified as the Definition tag 36% of the time.

Conclusions and Future Developments
In this paper, we have presented our solution for the SemEval-2020 Task 6: Extracting Definitions from Free Text in Textbooks (DeftEval). We evaluated different state-of-the-art language models both by freezing and fine-tuning their weights. We observed that the performance of all the selected models can be significantly improved by fine-tuning their weights. Through a series of experiments conducted on the development dataset, we showed that RoBERTa significantly outperforms other language models for the two subtasks. According to the official leaderboard, we obtained the 32nd place out of 56 submissions for the first subtask and the 37th place out of 51 submissions for the second subtask. One possible direction for future work is to evaluate the multi-task scenario using other language models. We also consider that a larger annotated dataset together with the large variants of the pretrained language models could drastically improve definition extraction performance. Moreover, we believe that by using class weights instead of oversampling for the first subtask, one can mitigate the problems observed in its confusion matrix.