Defx at SemEval-2020 Task 6: Joint Extraction of Concepts and Relations for Definition Extraction

Definition Extraction systems are a valuable knowledge source for both humans and algorithms. In this paper we describe our submissions to the DeftEval shared task (SemEval-2020 Task 6), which is evaluated on an English textbook corpus. We provide a detailed explanation of our system for the joint extraction of definition concepts and the relations among them. Furthermore, we provide an ablation study of our model variations and describe the results of an error analysis.


Introduction
Definition extraction (DE) is a subfield of information extraction that deals with the automated extraction of terms and their descriptions from natural language text. We believe that for humans, definitions are one of the most important sources of knowledge to clarify unknown words in a new language or domain. Thus, an important application is the automated construction of a glossary, which is a laborious task if done manually. Another important application is the use of definitions as background knowledge for machine learning algorithms. Recent results have disclosed difficulties of state-of-the-art approaches when confronted with factual knowledge, and have led to a surge of research in this area (Logeswaran et al., 2019; Zhang et al., 2019). The advantages of definitions as a knowledge source are two-fold: Firstly, definitions are easier to extract and annotate than other forms of knowledge sources, e.g., Knowledge Graphs (KG). Secondly, definitions can be processed using the same text-based algorithms, whereas the use of a KG requires other forms of algorithms, e.g., graph embeddings. While research using definitions exists, it is limited by the scale of available annotations, which restricts large-scale applications such as self-supervised learning setups. This paper describes our submission to Subtask 2 of the DeftEval challenge. Our approach is based on a multi-task learning strategy that jointly extracts concepts and the relations between them, and achieves a score of 49.68 macro F1, ranking our best system 27th of 51 participants. The major challenge turned out to be the strong label imbalance, which made an evaluation using the official macro-averaged F1 score difficult. The source code of our approach is publicly available. 1

DeftEval Challenge
The DeftEval challenge 2 (Spala et al., 2020) includes three subtasks: (1) classification of definition sentences, (2) sequence labeling of definition concepts and (3) relation extraction between concepts. The DEFT corpus (Spala et al., 2019) is used for evaluation, which consists of English textbooks scraped from an online-learning website. 3 Compared to the previous manually annotated datasets WCL (Navigli et al., 2010) and W00 (Jin et al., 2013), the DEFT corpus is significantly larger and its examples are more diverse than simple is-a patterns. In contrast to WCL and W00, the DEFT corpus examples are not limited to definition sentences. Instead, they consist of windows of up to three sentences around a highlighted word in the source texts. Examples do not necessarily include a term-definition pair, but may include other concepts and relations, such as aliases or supplementary information. Additionally, definition relations do not have to be stated directly within one sentence, but can also be expressed indirectly through a coreference relation to a preceding sentence. An example is depicted in Figure 1. The official metrics for Subtasks 2 and 3 are macro-averaged F1 scores evaluated on the labels in Table 1.
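To illustrate why the macro-averaged metric is sensitive to the label imbalance discussed later, here is a minimal sketch of macro F1 with hypothetical gold and predicted labels (not the official scorer): each label's F1 contributes equally to the average, so a rare label that is never predicted pulls the score down sharply.

```python
def macro_f1(gold, pred, labels):
    """Average of per-label F1 scores; rare labels count as much as frequent ones."""
    per_label = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_label.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(per_label) / len(per_label)

# A frequent label predicted almost perfectly, a rare label missed entirely:
gold = ["Term"] * 9 + ["Alias-Term"]
pred = ["Term"] * 10
score = macro_f1(gold, pred, ["Term", "Alias-Term"])  # the missed rare label halves the score
```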

Model
We regard the DE task as the joint task of Named Entity Recognition (NER) and Relation Extraction (RE), and propose a system that closely follows the joint learning approach of Bekoulis et al. (2018). The NER part of their model is based on a BiLSTM CRF tagger architecture (Ma and Hovy, 2016; Lample et al., 2016). The RE part projects each encoded token into a head and a tail space and classifies relations between all combinations of heads and tails, allowing each token to have multiple relation heads.
CRF Tagger In our model each input example is first passed through a BERT model. The BERT layers are averaged for each token using a vector of learned weights, resulting in a token embedding. This token embedding is then concatenated with the auxiliary input features and passed through n BiLSTM layers to learn a task-specific contextualized encoding h. The encoding is projected into logits using a fully-connected layer, and a CRF layer is trained on top to extract the Subtask 2 labels. This subcomponent forms the CRF Tagger baseline; the Simple Tagger baseline is the same model without the CRF layer.
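The learned weighted average over BERT layers can be sketched as follows (a numpy sketch under the assumption that the weights are softmax-normalized into a convex combination; in the actual model the weight vector is a trainable parameter):

```python
import numpy as np

def weighted_layer_average(layer_outputs, layer_weights):
    """Collapse per-layer BERT activations into one embedding per token.

    layer_outputs: (num_layers, seq_len, hidden) stacked layer activations
    layer_weights: (num_layers,) scalar weights, softmax-normalized below
    """
    w = np.exp(layer_weights - layer_weights.max())
    w /= w.sum()  # softmax so the layers form a convex combination
    return np.tensordot(w, layer_outputs, axes=1)  # -> (seq_len, hidden)

layers = np.random.rand(12, 5, 768)  # e.g. 12 layers, 5 tokens, hidden size 768
weights = np.zeros(12)               # uniform weights before any training
emb = weighted_layer_average(layers, weights)
```

With all-zero (i.e. uniform) weights this reduces to the plain mean over layers; training shifts the weights towards the most task-relevant layers.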
Relation Extraction RE is used as an auxiliary objective. As the challenge provided files in the DEFT format, a tab-separated format with a single relation tail per token, we modeled the constraint that every token can point to at most one relation tail. For each token t_i, the goal is to predict an output index c_i. The special case c_i = 0 corresponds to the negative class label r_0, meaning that no relation exists between t_i and any of the potential relation tails t_j. In all other cases, the index c_i is decoded to the k-th relation r_k between token t_i and token t_j. The RE subcomponent uses only the hidden representation h as input. 4 The hidden representation h is passed through two projection layers U and W to extract representations for head and tail, respectively. For each token combination (i, j), the sum of the two representations is passed through a ReLU activation function in order to create a tensor of feature activations: f_{i,j} = relu(U h_i + W h_j). The class logits are computed by projecting the feature activations with a label-specific weight vector v_k: s_{i,z} = v_k^T f_{i,j}, where the index z is mapped to j = floor((z − 1)/K^+) and k = ((z − 1) mod K^+) + 1, with K^+ being the number of non-negative relation labels. Predictions are obtained via a softmax over the class logits: p(c_i = z) = softmax(s_i)_z. The joint model is then trained on the sum of the losses for the two subtasks. We evaluated differently weighted losses, but found an equally-weighted sum to work best with respect to the Subtask 2 performance.
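The flat index mapping between z and the pair (j, k) can be made concrete with a short sketch (hypothetical helper names; K_plus is the number of non-negative relation labels):

```python
def decode_index(z, K_plus):
    """Map a flat class index z > 0 to (tail position j, relation label k).

    z = 0 is reserved for the negative class r_0 (no relation).
    """
    if z == 0:
        return None                # no relation for this token
    j = (z - 1) // K_plus          # index of the tail token t_j
    k = (z - 1) % K_plus + 1       # 1-based relation label r_k
    return j, k

def encode_index(j, k, K_plus):
    """Inverse mapping: (tail j, relation k) back to the flat index z."""
    return j * K_plus + (k - 1) + 1

# With K_plus = 3 relation labels, index 5 points to tail token 1, relation 2:
assert decode_index(5, 3) == (1, 2)
assert encode_index(1, 2, 3) == 5
assert decode_index(0, 3) is None
```

Each token therefore classifies over 1 + n * K_plus indices for a sequence of length n, which directly encodes the single-tail constraint of the DEFT format.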
Auxiliary Input Features Additional input information augments the raw token input. String labels such as part-of-speech (POS) tags and NER labels are converted into vectors using a randomly-initialized learned embedding matrix. For tokens that participate in a coreference cluster, a binary indicator variable is added. Another set of binary variables indicates matches of rule-based patterns, where each matching pattern activates its own indicator variable. Every binary variable is represented as one dimension of the additional input.
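The feature construction above can be sketched per token as follows (a minimal numpy sketch; the tag set, embedding size, and helper name are illustrative assumptions, and the embedding matrix is trained jointly in the real model rather than fixed):

```python
import numpy as np

rng = np.random.default_rng(0)
POS_TAGS = ["NOUN", "VERB", "DET", "ADP"]            # hypothetical tag set
pos_embedding = rng.normal(size=(len(POS_TAGS), 8))  # learned matrix in the real model

def auxiliary_features(pos_tag, in_coref_cluster, pattern_matches):
    """Concatenate a learned POS embedding with binary indicator dimensions."""
    pos_vec = pos_embedding[POS_TAGS.index(pos_tag)]
    coref_vec = np.array([1.0 if in_coref_cluster else 0.0])
    pattern_vec = np.array([1.0 if m else 0.0 for m in pattern_matches])
    return np.concatenate([pos_vec, coref_vec, pattern_vec])

# A noun inside a coreference cluster, matching the second of two patterns:
feats = auxiliary_features("NOUN", True, [False, True])
```

The resulting vector is what gets concatenated with the BERT token embedding before the BiLSTM layers.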

Experiments
The challenge provided labeled files split into training and development sets, while keeping the final test set hidden from the participants. We treated the official development set as our test set and created a new development set by randomly selecting 8 files from the training set. The resulting dataset sizes are listed in Table 2. There were several corpus changes shortly before and after the evaluation window of Subtask 2; we used the latest git commit before the evaluation period. 5 In a second step we converted the tab-separated DEFT files into a jsonl format and extended the data with additional information. spaCy v2.1.8 with its large English model 6 was used for POS tagging and rule-based pattern matching. In the case of coreference relations, these patterns targeted sentences starting with demonstrative determiners, e.g., "this" or "these", which are commonly used to refer to concepts in a previous sentence. We predicted coreference clusters using the AllenNLP v0.9.0 (Gardner et al., 2018) implementation of the model proposed by Lee et al. (2017).
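The demonstrative-determiner rule can be sketched as a simple token check (a minimal pure-Python stand-in for the spaCy pattern used in our pipeline; extending the list beyond "this" and "these" is an assumption):

```python
DEMONSTRATIVES = {"this", "these", "that", "those"}  # assumed determiner list

def starts_with_demonstrative(sentence_tokens):
    """Flag sentences opening with a demonstrative determiner, which often
    refer back to a concept defined in the preceding sentence."""
    return bool(sentence_tokens) and sentence_tokens[0].lower() in DEMONSTRATIVES

assert starts_with_demonstrative(["These", "molecules", "are", "called", "enzymes"])
assert not starts_with_demonstrative(["Enzymes", "are", "proteins"])
```

A match activates one of the binary pattern indicators described in the model's auxiliary input features.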
Hyperparameter tuning was performed on the dev set using the Allentune implementation (Dodge et al., 2019). The source code includes config files for the conducted experiments. Table 3 lists the results of the submitted runs. A majority-vote ensemble improved the score of the submitted joint model by ∼1 point to 49.68 macro F1, ranking our submission in 27th place of 51 participants. A larger mix of models in an ensemble did not improve the performance over the standalone joint model. Precision and recall are balanced for the single-run model; when multiple runs were combined in an ensemble, precision improved whereas recall degraded. Table 4 shows the results of an ablation study over a subset of the tested model combinations. Each of the models was evaluated on both datasets using the macro-averaged F1 score with 10 repetitions.

5 https://github.com/adobe-research/deft_corpus/tree/ab1fb8951d0950a177e96

6 spaCy is available at https://spacy.io/ and the used model is en_core_web_lg

Table 4: Model ablations on the two splits with macro-averaged metrics on 10 repeated runs.


Results

Due to the strong label imbalance and the small sample sizes, all results show a high standard deviation. On the dev set the best F1 score was achieved by our submitted model. The best precision was achieved by a joint model and the best recall by the CRF tagger. All of the joint models provide a higher precision than the baseline methods, while in recall they are worse than the CRF tagger but better than the Simple tagger. The F1 score is slightly higher for all models that employ a CRF layer. On the test set the CRF tagger performs best in terms of all metrics. The performance of the joint models is slightly worse than the CRF tagger but better than the Simple tagger. Overall, no single system is significantly better than the rest. An analysis of the confusion matrices revealed that the most common misclassification is the confusion of Term and Alias-Term. A manual inspection of the training examples showed that many of these examples are very complex, and some were annotated incorrectly. A strong data bias towards pairs of Term and Definition resulted in a high rate of false positives for these types. Simple over- or undersampling did not help to combat these errors in our experiments.
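The token-level majority vote used for ensembling can be sketched as follows (a minimal sketch with hypothetical run outputs; note that a plain per-token vote may produce BIO-inconsistent sequences, which a real implementation would need to repair):

```python
from collections import Counter

def majority_vote(tag_sequences):
    """Combine per-token predictions from multiple runs by majority vote.

    tag_sequences: list of equal-length tag lists, one per model run.
    """
    ensembled = []
    for token_tags in zip(*tag_sequences):
        # most_common breaks ties by first occurrence across the runs
        ensembled.append(Counter(token_tags).most_common(1)[0][0])
    return ensembled

runs = [
    ["B-Term", "I-Term", "O"],
    ["B-Term", "O",      "O"],
    ["B-Term", "I-Term", "B-Definition"],
]
voted = majority_vote(runs)  # agreement on a tag requires a majority of runs
```

Requiring agreement across runs suppresses spurious single-run predictions, which is consistent with the observed gain in precision at the cost of recall.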
We also observed that some possible annotations were missing. This was most notable for sentences starting with a demonstrative determiner, where the previous sentence included a definition concept. Furthermore, the annotation schema does not allow nested annotations, although we observed several examples where this would have been required. 7 In retrospect, Subtask 3 was likely too trivial to provide enough valuable information to improve upon Subtask 2. Alternative improvements could have been data augmentation for the minority-class examples and an improved use of the coreference and rule-based preprocessing.

Conclusion
In this paper we described our system for the joint extraction of definition concepts and relations among them, and reported results on the DEFT corpus, a definition extraction dataset of English textbooks.
Despite the marginal improvements of our joint approach, we also provide a robust setup of baseline methods that we believe to be helpful to the community for experimentation and subsequent application in downstream tasks.