Gorynych Transformer at SemEval-2020 Task 6: Multi-task Learning for Definition Extraction

This paper describes our approach to the “DeftEval: Extracting Definitions from Free Text in Textbooks” competition held as part of SemEval 2020. The task was devoted to finding and labeling definitions in texts. DeftEval was split into three subtasks: sentence classification, sequence labeling and relation classification. Our solution ranked 5th in the first subtask and 23rd and 21st in the second and third subtasks respectively. We applied simultaneous multi-task learning with Transformer-based models for subtasks 1 and 3 and a single BERT-based model for named entity recognition.


Introduction
This work is devoted to the DeftEval challenge (Spala et al., 2020) held as part of SemEval 2020. The challenge addressed the problem of definition extraction, which has recently become a popular topic. However, few annotated datasets existed, and they were often small (Jin et al., 2013) or limited to cases where a term and its definition occur in the same sentence (Navigli et al., 2010).
DeftEval is one of the first attempts to provide a structured multi-task dataset that can be used for various tasks connected with definition extraction and labeling (Spala et al., 2019) at text level (in contrast to sentence level). All the provided data was in English.
Our system employs a Transformer-based model trained jointly for all three tasks. For each task, we add a linear layer with dropout on top of the Transformer output. This allows the system to use information about sentence classes, the entities they contain, and the relations between them at the same time during training. Joint training improves the results for Subtasks 1 and 3. However, a single-task model performs better for Subtask 2.
Our system achieved an F1 score of 0.844 on the first subtask, approximately 0.03 points behind the first place. On the second subtask, our final score was 0.52, 0.32 points behind the first place. On the third subtask, our F1 score was 0.61, while the winning system achieved a perfect score of 1.0. Although named entity information was provided for the third subtask, we did not use it and included only span information in the model. The main contribution of our work is a detailed analysis of multi-task system performance for definition extraction, classification and named entity recognition. The results show that our approach to multi-task training may be beneficial for the sequence classification task, but needs to be reconsidered for sequence labeling. Our code is publicly available.
• Subtask 2: Sequence Labeling. Label each token with BIO tags according to the corpus' tag specification.
• Subtask 3: Relation Classification. Given the tag sequence labels, label the relations between each tag according to the corpus' relation specification.
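To make the BIO scheme of Subtask 2 concrete, the following minimal sketch groups token-level BIO tags into entity spans. The function name `bio_to_spans` is hypothetical, and the labels `Term` and `Definition` are used only for illustration; the sketch is not part of the task tooling or of our system.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) spans, with end exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # A span continues only on an I- tag carrying the label of the open span.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new span; a stray I- is treated as an opening tag as well.
        if tag.startswith(("B-", "I-")):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tags = ["B-Term", "I-Term", "O", "B-Definition", "I-Definition", "I-Definition"]
spans = bio_to_spans(tags)
```

Here `spans` holds one `Term` span covering the first two tokens and one `Definition` span covering the last three.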
The dataset contained 215 files: 80 for training, 68 for validation, and 67 for testing. These files contained 7001 text extracts with 26552 sentences and 513219 word tokens, of which 29011 were unique. The data was provided in a CoNLL 2003-like format (see Fig. 1). It came from several distinct domains: biology, economics, government, history, physics, psychology and sociology.
DeftEval was held in two phases. First, the data for the first two subtasks was released; it did not contain named entity information. Then the data for the third subtask, with named entity spans and types, was revealed.
There are many ways to extract information from text. This task is often solved by extracting named entities and classifying relations between them. Currently, the best results are achieved with Transformer-based models (Vaswani et al., 2017). The most advanced models (according to paperswithcode) use extra training data or additional knowledge bases. For example, the authors of the state-of-the-art system use Wikipedia data (Baldini Soares et al., 2019). However, such data cannot be obtained for domain-specific relations.
Among the systems that do not use encyclopedias or other labeled data, the best results were achieved by Joshi et al. They pre-trained a BERT-like system, but instead of predicting individual masked tokens, they trained the model to infer contiguous random spans. The model was also trained to predict each token in the masked span using the output representations of only the span boundary tokens. This significantly improved the results of their model in comparison with vanilla BERT.
Our system applies a sequence labeling approach to both named entity recognition and relation extraction. A similar approach was proposed by Veyseh et al. (2020), who built a joint system for definition extraction that combines sentence classification and sequence labeling in a single BiLSTM model with a graph convolutional layer on top of it. This approach looked promising, so we adapted it to use a single BERT-based model. Thus, we adopt a multi-task approach and predict sentence classes, named entities and the relations between these entities in one go. We compare the multi-task model results with those of its single-task counterparts.

Multi-task learning
To solve all three subtasks of the competition, we decided to use the joint training method. We propose a model that simultaneously predicts an input example class, a tag sequence of entity labels and semantic relations between entities. To do so, we consider relation extraction as a sequence labeling problem (similar to how named entity recognition is usually solved). In each example, we have one marked main entity (which may contain several tokens) and we predict all named entity tags and all relations between the main entity and all other tokens in the sentence (see Fig. 2). The architecture of our main model is depicted in Figure 3.
The dataset contained texts separated into small windows of 3-5 sentences each. Windows were split with respect to their description ids. According to the organizers, no relations span across windows. Thus, all our training and inference were done with respect to these windows.
In each training example, we highlight the boundaries of the analyzed sentence with special tokens.
We also mark the boundaries of the entity for which we are going to predict all relations in the text extract. So for each named entity from the dataset we generate a training example containing the boundaries for the considered entity and the sentence.
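The example generation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the marker strings `[S]`, `[/S]`, `[E]`, `[/E]` and the function names are hypothetical, since the actual special tokens are not named in the text.

```python
from typing import List, Tuple

# Hypothetical marker strings; the actual special tokens are not specified.
SENT_OPEN, SENT_CLOSE = "[S]", "[/S]"
ENT_OPEN, ENT_CLOSE = "[E]", "[/E]"

Span = Tuple[int, int]  # (start, end) token offsets in the window, end exclusive


def mark_example(tokens: List[str], sentence: Span, entity: Span) -> List[str]:
    """Insert boundary markers around one sentence and one entity.

    When boundaries coincide, sentence markers are placed outside entity markers.
    """
    s_start, s_end = sentence
    e_start, e_end = entity
    out = []
    for i, tok in enumerate(tokens):
        if i == s_start:
            out.append(SENT_OPEN)
        if i == e_start:
            out.append(ENT_OPEN)
        out.append(tok)
        if i == e_end - 1:
            out.append(ENT_CLOSE)
        if i == s_end - 1:
            out.append(SENT_CLOSE)
    return out


def make_examples(tokens: List[str], sentence: Span,
                  entities: List[Span]) -> List[List[str]]:
    """Generate one marked example per entity."""
    return [mark_example(tokens, sentence, ent) for ent in entities]


window = ["A", "cell", "is", "the", "basic", "unit", "of", "life", ".",
          "It", "divides", "."]
examples = make_examples(window, sentence=(0, 9), entities=[(0, 2), (3, 8)])
```

Each entity yields its own copy of the window, so a sentence with two entities produces two examples that differ only in the position of the entity markers.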
During training, the weighted sum of the cross-entropies of the subtasks was optimized. Thus, the proposed model relies solely on the input text and the knowledge of the boundaries of entities and sentences, without using information about entity types. The learning rate was set to 1e-5; the weight decay and the dropout were both set to 0.1. To obtain information about entity boundaries, we trained an independent entity extraction model based on BERT (Devlin et al., 2018). We use it to extract entities and generate examples for our main model. For our competition submission to the third subtask, we used the annotated named entity information provided by the organizers instead of the BERT model predictions.
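The weighted sum of cross-entropies mentioned above can be illustrated with a small numeric sketch. The probabilities and weights below are made up for illustration and are not the values used in our experiments; in a real model the entity and relation terms would additionally be aggregated over tokens.

```python
import math
from typing import Dict, List


def cross_entropy(probs: List[float], gold: int) -> float:
    """Negative log-probability assigned to the gold class."""
    return -math.log(probs[gold])


def multitask_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-subtask cross-entropies."""
    return sum(weights[name] * loss for name, loss in losses.items())


# Illustrative numbers only.
losses = {
    "sentence": cross_entropy([0.2, 0.8], gold=1),
    "entities": cross_entropy([0.1, 0.7, 0.2], gold=1),
    "relations": cross_entropy([0.5, 0.25, 0.25], gold=0),
}
weights = {"sentence": 1.0, "entities": 0.5, "relations": 0.5}
total = multitask_loss(losses, weights)
```

Setting a subtask weight to 0 removes that subtask from the objective, which is how a single-task model can be obtained from the same setup.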
BERT and XLNet tokenizers split tokens into several subtokens, so we had to create an aggregation scheme to merge subtoken outputs back together for entity and relation inference. For each token, the output was taken from its first subtoken.
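A minimal sketch of the first-subtoken scheme is given below. The tokenizer `toy_tokenize` is a hypothetical stand-in that only mimics the shape of subword output; it is not the BERT or XLNet tokenizer, and the function names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_subtoken_index(words: Sequence[str],
                         tokenize: Callable[[str], List[str]]
                         ) -> Tuple[List[str], List[int]]:
    """Return the flat subtoken list and, per word, the index of its first subtoken."""
    subtokens: List[str] = []
    first: List[int] = []
    for word in words:
        first.append(len(subtokens))
        subtokens.extend(tokenize(word))  # assumed to return at least one piece
    return subtokens, first


def word_level_outputs(subtoken_outputs: Sequence[T], first: List[int]) -> List[T]:
    """Keep only the output of the first subtoken of every word."""
    return [subtoken_outputs[i] for i in first]


def toy_tokenize(word: str) -> List[str]:
    """Toy stand-in for a subword tokenizer: split words longer than 4 characters."""
    if len(word) <= 4:
        return [word]
    return [word[:4], "##" + word[4:]]


words = ["Mitosis", "is", "cell", "division"]
subtokens, first = first_subtoken_index(words, toy_tokenize)
# One (made-up) predicted label per subtoken; "X" marks continuation pieces.
subtoken_labels = ["B-Term", "X", "O", "B-Definition", "I-Definition", "X"]
word_labels = word_level_outputs(subtoken_labels, first)
```

The result has exactly one label per original word, so it can be aligned with the token-level annotation of the dataset.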
For each example in the training dataset we generated several samples. At prediction time we therefore had to combine the outputs of several samples into a single answer. For the first subtask, we selected the answer with the maximum score. For the second subtask, for each word we took the answer from the sample in which the score for this word was maximal. For the third subtask, we grouped samples into non-overlapping sets according to the predicted relation type and chose the sample with the maximum average score in each set; the final prediction merges the answers of these sets. Besides the three-task setup described above, we also trained models jointly on the first and second subtasks only, and we tried single-task models for the first and second subtasks. The sections on Subtasks 1 and 2 below describe the response construction process for each subtask.
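The selection rules for the first two subtasks can be sketched as follows. This is an illustration under assumptions: each sample is represented as a hypothetical `(label, score)` pair, the label names are made up, and the rule for the third subtask is omitted because its grouping step depends on details not given here.

```python
from typing import List, Tuple

Prediction = Tuple[str, float]  # (label, score)


def pick_sentence_label(preds: List[Prediction]) -> str:
    """Subtask 1: keep the label of the highest-scoring generated sample."""
    return max(preds, key=lambda p: p[1])[0]


def pick_token_labels(preds_per_sample: List[List[Prediction]]) -> List[str]:
    """Subtask 2: per word, keep the label from the sample that scored it highest."""
    # zip(*...) turns the per-sample rows into per-word columns.
    return [max(column, key=lambda p: p[1])[0]
            for column in zip(*preds_per_sample)]


sentence_preds = [("definition", 0.62), ("no-definition", 0.91), ("definition", 0.55)]
sentence_label = pick_sentence_label(sentence_preds)

token_preds = [
    [("B-Term", 0.9), ("O", 0.6), ("O", 0.8)],          # sample 1
    [("O", 0.4), ("B-Definition", 0.7), ("O", 0.5)],    # sample 2
]
token_labels = pick_token_labels(token_preds)
```

Note that the per-word rule can mix labels from different samples within one sentence, as the example shows for the first two words.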

Subtask 1. Sentence classification
In this experiment, sentences were classified into two classes according to whether they contain a definition or not. As a single-task model we fine-tuned a RoBERTa model (roberta.large; Liu et al., 2019). Following the RoBERTa instructions, the training and validation samples were binarized to the required format. Because of heavy computational costs, we tuned only the weight decay and dropout coefficients. The learning rate was set to 1e-5. All models were trained for 20 epochs, with validation at the end of each epoch. RoBERTa models were trained at the sentence level, without using the other sentences from the window.

Subtask 2. Named entity recognition
The second subtask was named entity recognition in the definition domain. Entity labels were selected among various definitions and term types. Entities could span across several words. In the experiment, the model was trained at the window level.
We relied on the code by Kamal Raj. The BERT-large-uncased model was used. For each token in the example, we took the BERT embedding of its first subtoken and passed it through a dropout layer followed by a linear layer. Cross-entropy was used as the loss function. All non-entity tokens were ignored in the loss calculation. The labels '[CLS]' and '[SEP]' were used to mark the beginning and the end of each example. We also optimized the learning rate, dropout rate, and weight decay coefficients using the validation dataset.
We ranked 5th in sentence classification and 23rd and 21st in named entity recognition and relation classification respectively. Our system achieved an F1 score of 0.844 on the first subtask, approximately 0.03 points behind the first place. On the second subtask, our final score was 0.52, 0.32 points behind the first place. On the third subtask, our F1 score was 0.61.

Model
In Table 1 we provide the post-evaluation results of our models for all three subtasks. Entity spans for subtask 1 models (denoted by ♠) were inferred from predictions of our best single-task model for subtask 2 on the development dataset.
For the first subtask we tried a RoBERTa-based model and BERT- and XLNet-based multi-task learning models. The multi-task approach outperforms the RoBERTa model on this subtask. This holds not only for the best model but for all multi-task models trained on all three subtasks in which the sentence weight is not set to 0. However, we could not improve on the single-task results for named entity recognition. This might be due to insufficient training time, because the task itself is more difficult than binary classification. XLNet and BERT results turned out to be close to each other. Their exact results may depend on many factors, such as the random seed, which are not covered in this paper. Multi-task learning results with different weighting schemes can be seen in the Appendix.
In Figure 4 we show our classification of the main error types made by our best-performing relation extraction models on the first subtask. We manually labelled misclassified examples according to their error type. Most errors come from our model being too sensitive to words typical of definitions, e.g. various conjunctions (which, that). Another major weakness is the mishandling of named entities. This suggests that named entity information might be helpful for telling whether a sentence contains a definition.

Error analysis (Subtask 1)
After the shared task we also studied the influence of context on model results for the first subtask (see Figure 5). The texts in the dataset were split at the sentence level by the organizers, so we decided to see how full-text inputs influenced the results. Four context settings were studied: no context, left, right and full. The context setting determines which text is kept around the analyzed sentence. Left context means that we make predictions for the sentence and all words to its left. Full context means that we preserve the sentence and all words to its left and right. We filter out examples whose main entity lies outside the preserved words of the window. We did early stopping by the best F1 score for the positive class on the first subtask. As can be seen from Figure 5, keeping only the left context improves model results. The fact that full context performs poorly relative to the other variants may be attributed to the filtering of examples whose main entity is out of context.
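The four context settings can be sketched as a simple slicing operation over a window of sentences. The function name and the setting names `"none"`, `"left"`, `"right"`, `"full"` are illustrative assumptions, and the filtering of examples by main entity position is left out.

```python
from typing import List


def apply_context(window: List[List[str]], target: int, context: str) -> List[str]:
    """Return the tokens preserved for the sentence at index `target`.

    context is one of "none", "left", "right", "full".
    """
    if context not in {"none", "left", "right", "full"}:
        raise ValueError(f"unknown context setting: {context}")
    # Left and full context keep everything before the analyzed sentence.
    start = 0 if context in {"left", "full"} else target
    # Right and full context keep everything after the analyzed sentence.
    end = len(window) if context in {"right", "full"} else target + 1
    return [tok for sent in window[start:end] for tok in sent]


window = [
    ["Cells", "divide", "."],
    ["Mitosis", "is", "cell", "division", "."],
    ["It", "has", "phases", "."],
]
kept = {name: apply_context(window, 1, name)
        for name in ("none", "left", "right", "full")}
```

For the middle sentence, `kept["left"]` contains the first two sentences and `kept["right"]` the last two, while `kept["none"]` contains the analyzed sentence alone.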

Conclusion
This work presents our results in the DeftEval challenge, which was devoted to finding and classifying definitions in texts. A single Transformer-based model was adopted for both tasks simultaneously. We ranked 5th in sentence classification and 23rd and 21st in named entity recognition and relation classification respectively. In this paper we describe our system and analyze its errors.
A Multi-task learning on subtasks 1, 2
Figure 6 shows the results of the models trained jointly on the first and the second subtasks. We set the weight of the target subtask to 1.0 while changing the weight of the other subtask. For subtask 1, multi-task learning seems beneficial on the dev set, which was used for early stopping. The test set performance shows that this improvement is comparable with the variance of the scores. For subtask 2, multi-task learning evidently hurts.

B Multi-task learning on all three subtasks
Figure 7 shows the results of the models trained jointly on all three subtasks. For the first subtask we used entity spans predicted by our best single-task sequence labeling model, while for the third subtask we used gold entity spans. For subtask 1 we also did early stopping by the F1 score of the positive class on the subtask 1 development set. Figure 8 shows the results for subtask 1 when gold entity spans are used instead of predicted ones. It seems that our results for subtask 1 could be considerably better if our model for subtask 2 were better at predicting entity spans.
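The early-stopping criterion used for subtask 1 can be sketched as follows. This is a simplified illustration, not our training code: it computes the F1 score of the positive class and picks the epoch with the best development score after the fact. The function names and the per-epoch scores are made up.

```python
from typing import List, Sequence


def positive_f1(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """F1 score of the positive class for binary sentence labels."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_epoch(dev_scores: List[float]) -> int:
    """Index of the epoch with the highest dev score (the earliest one on ties)."""
    return max(range(len(dev_scores)), key=dev_scores.__getitem__)


gold = [1, 0, 1, 1, 0]
pred = [1, 1, 0, 1, 0]
f1 = positive_f1(gold, pred)

# Made-up per-epoch development scores.
epoch = best_epoch([0.70, 0.78, 0.81, 0.79])
```

In this toy case precision and recall are both 2/3, so the positive-class F1 is 2/3, and the checkpoint from the third epoch (index 2) would be kept.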