Cardiff University at SemEval-2020 Task 6: Fine-tuning BERT for Domain-Specific Definition Classification

We describe the system submitted to SemEval-2020 Task 6, Subtask 1. The aim of this subtask is to predict whether a given sentence contains a definition or not. Unsurprisingly, we found that strong results can be achieved by fine-tuning a pre-trained BERT language model. In this paper, we analyze the performance of this strategy. Among other things, we show that results can be improved by using a two-step fine-tuning process, in which the BERT model is first fine-tuned on the full training set, and then further specialized towards a target domain.


Introduction
Definitions are central to the way in which humans convey knowledge about the meaning of concepts. Accordingly, a large number of general and domain-specific dictionaries have been created. As new concepts emerge, and the meaning of existing concepts shifts, these dictionaries need to be updated. This continual process is traditionally carried out by linguists or domain experts, meaning that dictionaries are never fully up-to-date. In rapidly evolving scientific domains, among others, this is a clear limitation. An appealing alternative is to automatically identify and extract definitions expressed in free text. This task of extracting term-definition pairs from text corpora is known as Definition Extraction (DE).
Early attempts to solve this task relied on rule-based methods (Klavans and Muresan, 2001; Cui et al., 2005). However, such methods are typically only able to detect explicit, direct and structured definitions, which usually contain definitor verb phrases such as means, is, or is defined as. Later, a large number of supervised and semi-supervised machine learning models for DE were proposed (Westerhout, 2009; Reiplinger et al., 2012; Jin et al., 2013). While able to identify a wider range of definitions, these approaches cannot be adapted to new domains efficiently, as they crucially rely on carefully designed features, which might not be available, or be less effective, in the new domain. More recently, the focus has shifted to neural network based models (Espinosa-Anke and Schockaert, 2018; Veyseh et al., 2019).
The method we analyze in this paper is based on fine-tuning a pre-trained BERT language model (Devlin et al., 2018). This strategy has recently proven successful across a wide range of Natural Language Processing (NLP) tasks. In particular, we focus on SemEval-2020 Task 6: DeftEval: Extracting term-definition pairs in free text (Spala et al., 2020). We participated in Subtask 1: Sentence Classification, which required participants to predict whether a given sentence contains a definition. The associated dataset contains documents from seven different domains, including biology, history and economics. In our analysis, we focus on comparing two different strategies for fine-tuning the pre-trained BERT model: (1) fine-tuning a single BERT model on all the available training data; (2) fine-tuning a separate BERT model for each of the seven domains, each time relying only on the training data available for that domain.
The first strategy has the advantage that all training data can be exploited. However, our hypothesis is that this strategy may struggle to optimally capture the different definition styles that are used in different domains. The second strategy avoids confusing the classifier with different definition styles, but it implies that only a limited amount of training data is available for each domain. We also experiment with a third approach, which is aimed at combining the best of both worlds: (3) fine-tuning a domain-specific BERT model in two steps, by first fine-tuning the model on all training data, and subsequently specializing it to the target domain in an additional fine-tuning step.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.

Data
We used the DEFT corpus (Spala et al., 2019), which was made available as part of the SemEval-2020 Task 6 competition.

[Figure 1: Impact of padding/truncating sentences to different lengths, for 'BERT - Fine-tune all domains'. The reported F1 score is for the positive class.]

Methods
Given the success of BERT (Devlin et al., 2018) across a wide range of NLP tasks, we decided to focus on analyzing its performance in the context of definition extraction. We considered the following variants.
Fine-tuning strategies. We experimented with the BERT-base model, using the PyTorch Hugging Face implementation BertForSequenceClassification. Essentially, this method corresponds to adding a classification layer on top of the pre-trained BERT model, and fine-tuning the BERT model while training the classification layer. We specifically compare the following fine-tuning strategies:
• BERT-all: We fine-tune the model on all the training data (i.e. from all domains). This is our official submission to the competition, which ranked 16th out of 56 submissions.
• BERT-target: We fine-tune the model only on training data for a given target domain. In other words, for each of the 7 domains, we train a separate model.
• BERT-double: We fine-tune the pre-trained BERT model twice. For the first fine-tuning step, we used the training data from all domains. Subsequently, we fine-tune the resulting model, based on the training data from the considered target domain.
As a baseline strategy, we also explored the following variant:
• BERT-name: In this case, we used all the available training data, but we add the domain name as an additional token at the start of the input sentence, to condition the model on the domain.
LSTM-based strategies. Apart from the standard strategy of fine-tuning a BERT model, we also experimented with using LSTMs (Hochreiter and Schmidhuber, 1997), using contextualised word vectors from BERT as input. We again compare several strategies:
• LSTM-base: We used BertTokenizer to tokenize the sentence. For this baseline model, we then trained the LSTM, including the corresponding token embeddings, from scratch. We used 300-dimensional word embeddings and two hidden layers of 256 dimensions.
• LSTM-BERT-pre: We used the same LSTM architecture as before, but instead of learning the token embeddings, we used the last hidden states of the pre-trained 'bert-base-uncased' model.
• LSTM-BERT-ft: In this case, we used the final layer of the fine-tuned BERT-all model as the word embedding layer of the LSTM model.
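The three fine-tuning regimes can be sketched as orchestration logic over the per-domain training data. The sketch below abstracts the actual training step into a generic `fine_tune(model, examples)` callable, which is an assumption standing in for the BertForSequenceClassification training loop used in the paper.

```python
def bert_all(pretrained, data_by_domain, fine_tune):
    """BERT-all: one model fine-tuned on the pooled training data of all domains."""
    pooled = [ex for examples in data_by_domain.values() for ex in examples]
    return fine_tune(pretrained, pooled)

def bert_target(pretrained, data_by_domain, fine_tune):
    """BERT-target: a separate model per domain, trained on that domain only."""
    return {dom: fine_tune(pretrained, examples)
            for dom, examples in data_by_domain.items()}

def bert_double(pretrained, data_by_domain, target, fine_tune):
    """BERT-double: first fine-tune on all domains, then specialize to `target`."""
    general = bert_all(pretrained, data_by_domain, fine_tune)
    return fine_tune(general, data_by_domain[target])
```

The point of the abstraction is that BERT-double simply composes the other two regimes: it reuses the BERT-all checkpoint as the starting point for a second, domain-specific fine-tuning pass.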

Results and Discussion
For all experiments, we used a free Google Colab GPU to train 'bert-base-uncased' models for 4 epochs. For the BERT-double method, we used 4 epochs for each of the two fine-tuning steps. We set the batch size to 16 and padded/truncated the sentences to a sequence length of 256, which gave the best performance for the BERT-all model on the development set, as shown in Figure 1. We used the Adam optimizer, with learning rates of 2 · 10^-5 and 10^-3 for fine-tuning BERT models and for LSTM based models, respectively.

The results of the considered methods are summarized in Table 2 in terms of precision, recall and F1 score. We show the performance of each model for predicting the positive (1) and negative (0) classes, as well as their macro and weighted averages. The official score in the competition was the F1 score for the positive class. The results show that fine-tuning BERT outperforms the LSTM based strategies. When comparing the different fine-tuning strategies, we found that specifying the domain name as an additional token (BERT-name) failed to outperform the standard fine-tuning strategy. On average, the standard strategy also performed better than domain-specific fine-tuning. However, the double fine-tuning strategy led to the best results overall.

A more detailed analysis of the main fine-tuning strategies is presented in Table 3, which shows the results for each of the 7 domains separately. One surprising finding is that the relative performance of the domain-specific fine-tuning strategy does not seem directly related to the amount of training data. In particular, this strategy outperforms the 'all domains' strategy on the History domain, despite the fact that far less training data is available for this domain than for most of the others. Conversely, despite the fact that Government is one of the largest domains in terms of available training data, the domain-specific fine-tuning strategy performs comparatively poorly on it.
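The padding/truncation step whose length we tuned in Figure 1 amounts to forcing every tokenized sentence to a fixed number of token ids. A minimal sketch, assuming token-id lists as input and using pad id 0 (the [PAD] id of 'bert-base-uncased'):

```python
def pad_or_truncate(token_ids, max_len=256, pad_id=0):
    """Force a token-id sequence to exactly max_len ids:
    longer sequences are cut, shorter ones are right-padded with pad_id."""
    if len(token_ids) >= max_len:
        return token_ids[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))
```

In practice the tokenizer also produces an attention mask so that the padded positions are ignored by the model; the sketch only shows the length normalization itself.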
Table 4 lists randomly selected examples of incorrectly classified sentences from the Government domain. Looking at these sentences, we can see that some gold definitions are either incorrectly labelled or very difficult to classify, even for a human. For instance, the first sentence contains a definition (of "ideology"), but the sentence as a whole is not a definition. Surprisingly, we found that some of these sentences also appear in the training set, but with labels opposite to those in the test set.
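The train/test labelling inconsistency noted above can be detected mechanically. A hypothetical sketch (the function name and the `(sentence, label)` pair representation are our own, not part of the DEFT tooling):

```python
def conflicting_labels(train, test):
    """Return test sentences that also occur in the training set
    but carry a different label there.
    train, test: iterables of (sentence, label) pairs."""
    train_labels = {sentence: label for sentence, label in train}
    return [sentence for sentence, label in test
            if sentence in train_labels and train_labels[sentence] != label]
```

Running such a check before training would surface the contradictory examples, which could then be reviewed or removed.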

Conclusions
We have described our participation in SemEval-2020 Task 6 on Extracting Definitions from Free Text in Textbooks. In particular, we participated in Subtask 1, where the aim was to classify a given sentence as definitional or not. We evaluated the use of LSTMs and compared different strategies for fine-tuning a pre-trained BERT language model. We found the latter to be more effective, especially when the model is fine-tuned twice: first on the full training set, and then further towards the target domain.

Sentence | Label
5847. While some Americans disapprove of partisanship in general, others are put off by the ideology - established beliefs and ideals that help shape political policy - of one of the major parties. | 1
6246. The current relationship between the U.S. government and Native American tribes was established by the Indian Self-Determination and Education Assistance Act of 1975. |