MULTISEM at SemEval-2020 Task 3: Fine-tuning BERT for Lexical Meaning

We present the MULTISEM systems submitted to SemEval 2020 Task 3: Graded Word Similarity in Context (GWSC). We experiment with injecting semantic knowledge into pre-trained BERT models through fine-tuning on lexical semantic tasks related to GWSC. We use existing semantically annotated datasets, and propose to approximate similarity in context through automatically generated lexical substitutes. We participate in both GWSC subtasks and address two languages, English and Finnish. Our best English models occupy the third and fourth positions in the rankings of the two subtasks. Performance is lower for the Finnish models, which are mid-ranked in the respective subtasks, highlighting the important role of data availability for fine-tuning.


Introduction
The meaning of words is strongly tied to the context in which they occur: different contexts might point to different senses or indicate subtler meaning nuances. SemEval 2020 Task 3 "Graded Word Similarity in Context" (GWSC) (Armendariz and Purver, 2020) explores the effect of context on meaning, and proposes to predict the similarity of word instances in a continuous, or graded, fashion. GWSC is based on the CoSimLex dataset and consists of two subtasks where models have to predict (1) the shift in meaning similarity for a pair of words (w_a, w_b) occurring in different contexts, and (2) the similarity of two word instances in the same context. This is illustrated by sentences c_1 and c_2, two contexts where dinner and breakfast co-occur:

c_1: (...) After Mickey rings the dinner bell, Goofy foolishly leaves the driver's seat for breakfast.

c_2: Residence Inns typically feature a complimentary small hot breakfast in the morning and a free light dinner or snack reception on weekday evenings (...)

A change in meaning similarity occurs between the highlighted words in the two sentences: they are less similar in context c_1, where dinner is part of a noun compound (dinner bell), than in context c_2, where they describe different kinds of meals. The shift in meaning is reflected in the gold similarity scores assigned to these instance pairs in the GWSC dataset (4.39 vs. 5.35). We build models for these two subtasks by fine-tuning BERT on existing lexical similarity datasets. Additionally, we propose to approximate the similarity of words in context through automatically generated lexical substitutes. We build and evaluate models in two languages, English and Finnish. In Subtask 1, our English and Finnish models ranked third and sixth out of nine participants. In Subtask 2, both ranked fourth among ten participants. Our code will be made available at https://github.com/ainagari/semeval2020-task3-multisem.

Background
Our methodology draws inspiration from recent work on injecting semantic information into pre-trained language models (LMs). This can be done at two stages: during model pre-training or during fine-tuning. Lauscher et al. (2019) opt for the former, adding a lexical task to BERT's two training objectives (language modelling and next sentence prediction) (Devlin et al., 2019). The semantic knowledge used in this additional task comes from pre-defined lexicographic resources (such as WordNet (Miller, 1995)), and is shown to be beneficial on almost all tasks in the GLUE benchmark (Wang et al., 2018). Arase and Tsujii (2019) inject semantic knowledge into BERT by fine-tuning the pre-trained model on paraphrase data. They subsequently fine-tune the model again for the related tasks of paraphrase identification and semantic equivalence assessment, and report improved performance over a model that has not been exposed to paraphrase data. Since no training data is available for GWSC, we follow their approach and fine-tune BERT models for English and Finnish on a set of semantic tasks that are closely related to GWSC.
One of our tasks is inspired by the retrofitting approach of Shi et al. (2019), which consists in gathering sentence pairs from the Microsoft Research Paraphrase Corpus (MRPC) (Dolan et al., 2004) that share a word and are paraphrases of each other (T) or not (F). Shi et al. propose an orthogonal transformation for ELMo (Peters et al., 2018) that is trained to bring representations of word instances closer when they appear in meaning-equivalent contexts. They show that this retrofitting approach improves ELMo's performance on a wide range of sentence-level semantic tasks (sentiment analysis, inference and sentence relatedness). We follow their data collection method to obtain word instances for fine-tuning BERT, replacing the MRPC with the Opusparcus resource (Creutz, 2018) since it covers two of the languages addressed in GWSC, English and Finnish.

Datasets
We fine-tune pre-trained BERT models on semantic tasks that are related to GWSC. We specifically select tasks that address the similarity of word meaning in context, and use the corresponding datasets to make BERT more sensitive to this specific aspect of meaning. Table 1 contains annotated instances from each dataset used in our experiments.
Usim The Usim dataset contains 10 sentences for each of 56 words of different parts of speech, manually annotated with pairwise usage similarity scores (Erk et al., 2009; Erk et al., 2013). As in GWSC, similarity scores are graded and range from 1 (completely different) to 5 (same meaning). The Usim sentences come from the SemEval 2007 Lexical Substitution task dataset (McCarthy and Navigli, 2007). To binarize the usage similarity scores and use them for fine-tuning, we consider sentence pairs annotated with low similarity scores (score < 2) as instances denoting a different meaning (F), and highly similar sentence pairs (score > 4) as instances of the same sense (T). In total, we use 1,399 Usim sentence pairs for fine-tuning.
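To make the binarization concrete, the following minimal sketch applies the score thresholds described above. The tuple layout is hypothetical and does not reflect the released data format.

```python
# A minimal sketch of the Usim binarization (hypothetical tuple layout,
# not the released data format).

def binarize_usim(pairs, low=2.0, high=4.0):
    """Map graded usage similarity scores (1-5) to binary labels.

    pairs: iterable of (sentence_1, sentence_2, score) tuples.
    Yields (sentence_1, sentence_2, label) tuples, where the label is
    "T" (same meaning) or "F" (different meaning).
    """
    for s1, s2, score in pairs:
        if score < low:
            yield s1, s2, "F"   # clearly different usages
        elif score > high:
            yield s1, s2, "T"   # clearly the same usage
        # pairs with intermediate scores are not used for fine-tuning
```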

Concepts in Context (CoInCo)
The CoInCo corpus (Kremer et al., 2014) contains manually selected substitutes for all content words in a sentence. Substitute overlap between different word instances reflects their semantic similarity: instance pairs with similar meaning share a higher number of substitutes. We binarize the data as in Garí Soler et al. (2019) by assigning instance pairs to a class describing the same (T) or different (F) meaning depending on their shared substitutes. The data sample used by Garí Soler et al. contains instances with at least four substitutes: T pairs involve instances that have at least 75% of substitutes in common, and F pairs correspond to instances with no substitute overlap. We gather additional data from CoInCo by relaxing the class inclusion constraints. Specifically, we retain all instances regardless of the number of available substitutes. We consider as T examples instance pairs that have at least 50% of substitutes in common, and as F examples pairs that share at most one substitute.
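A minimal sketch of this labelling scheme is given below. The function name is ours, and normalizing the overlap by the smaller substitute set is our assumption; the thresholds are those described above.

```python
# A sketch of the relaxed CoInCo binarization. subs_1 and subs_2 are the
# substitute sets of two instances of the same lemma; normalizing the
# overlap by the smaller set is our assumption.

def coinco_label(subs_1, subs_2, t_threshold=0.5, f_max_shared=1):
    """Return "T" for same-meaning pairs, "F" for different-meaning pairs,
    or None for pairs that fit neither class."""
    shared = subs_1 & subs_2
    overlap = len(shared) / min(len(subs_1), len(subs_2))
    if overlap >= t_threshold:          # at least 50% of substitutes shared
        return "T"
    if len(shared) <= f_max_shared:     # at most one shared substitute
        return "F"
    return None
```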
We retain up to 500 instance pairs per CoInCo lemma, when available. We balance the two classes (T and F) and merge the obtained instances with Garí Soler et al. (2019)'s dataset (5,023 pairs), removing duplicates. In total, we have 22,226 CoInCo instance pairs for fine-tuning. We use these instances in combination with the Usim data.

WiC The Word-in-Context (WiC) dataset (Pilehvar and Camacho-Collados, 2019) contains pairs of sentences where a target word is used with the same (T) or a different (F) meaning. The sentences were compiled from WordNet, VerbNet (Schuler, 2006) and Wiktionary examples, and were automatically annotated based on information provided in these resources. We use the training set (5,428 sentence pairs) with its labels (T or F) as data for fine-tuning.
ukWaC-subs The GWSC task addresses pairs of different words that can have similar meanings in some contexts and not in others (e.g., room and cell). Given that no training data is available, we automatically create one more dataset for fine-tuning, called ukWaC-subs, which approximates this task. ukWaC-subs contains pairs of sentences (p_1, p_2) that differ in one word only. We create data by substituting a word w in p_1 by either (a) a correct substitute; (b) a word that is a good synonym of w and could have been a correct substitute in another context, but not in this one; or (c) a random word of the same part of speech as w. This is illustrated by the three ukWaC-subs sentences in Table 1. With (a), we expect BERT to learn that clear is being used in its unambiguous sense in this context. In (b), we tell BERT that despite the (out-of-context) similarity between present and moment, the latter is not adequate in this context. With (c), we want BERT to learn to distinguish date from a completely unrelated word (heritage). We use this data for a 3-way classification task.
We create this dataset by gathering sentences from the ukWaC corpus (Baroni et al., 2009) and automatically annotating them with lexical substitutes. We identify the content words in a sentence and use as their candidate substitutes their paraphrases in the Paraphrase Database (PPDB) lexical XXL package (Ganitkevitch et al., 2013; Pavlick et al., 2015). The PPDB resource was automatically constructed using a bilingual pivoting method. Every paraphrase pair has a PPDB 2.0 score indicating its quality; we only consider as substitution candidates paraphrases with a score above 2. We then use the context2vec lexical substitution model (Melamud et al., 2016) to rank the candidates according to how well they fit in a context. context2vec is a biLSTM model that jointly learns static representations of words and dynamic context representations. We rank candidate substitutes using the following formula:

score(s) = (cos(s, t) + 1)/2 × (cos(s, C) + 1)/2

where s is the static representation of the candidate substitute, C is the context embedding of the sentence, and t is the static embedding of the word instance i we want to replace. Using this formula, we obtain an ordered ranking R of substitutes for an instance i in context C.

The highest-ranked substitute is viewed as correct and serves to create instances of type (a). A random word of the same part of speech found in the corpus makes an instance of class (c). To obtain instances of class (b), we could in principle take the last substitute in the ranking. However, due to the noise that exists in PPDB, these are often not correct paraphrases of the target word. We therefore apply a filtering strategy proposed by Garí Soler et al. (2019), which checks whether substitutes in adjacent positions (s_i, s_{i+1}) in the ranking R form a paraphrase pair in PPDB. If this is not the case for a specific pair, we stop checking and retain s_{i+1} as a substitute that represents a different meaning of the target word. Once the substitutes have been collected, 40% of the instances are assigned to class (a), 30% to class (b) and 30% to class (c). One sentence may contain more than one training instance if a substitute ranking is available for different words in it. A training instance is created by replacing the word with the substitute required by the class it has been assigned to. We create 100,000 instances that we use to fine-tune BERT.
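The ranking and filtering steps can be sketched as follows. We assume pre-computed context2vec-style embeddings and a set of PPDB paraphrase pairs; the helper names are ours.

```python
import numpy as np

# A sketch of the substitute ranking and filtering steps, assuming
# context2vec-style embeddings as numpy arrays and a set `ppdb_pairs`
# of known PPDB paraphrase pairs (helper names are ours).

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_substitutes(candidates, word_vecs, t, C):
    """Rank candidate substitutes for a target instance.

    candidates: list of candidate substitute strings.
    word_vecs: dict mapping words to their static embeddings (s).
    t: static embedding of the target word instance.
    C: context embedding of the sentence.
    """
    def score(cand):
        s = word_vecs[cand]
        return (cos(s, t) + 1) / 2 * (cos(s, C) + 1) / 2
    return sorted(candidates, key=score, reverse=True)

def different_meaning_substitute(ranking, ppdb_pairs):
    """Walk down the ranking until two adjacent substitutes are NOT a
    PPDB paraphrase pair; the lower one is taken to represent a
    different meaning of the target (class (b) in the text)."""
    for s_i, s_next in zip(ranking, ranking[1:]):
        if (s_i, s_next) not in ppdb_pairs and (s_next, s_i) not in ppdb_pairs:
            return s_next
    return None  # all adjacent substitutes are paraphrases of each other
```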
Opusparcus Shi et al. (2019) show that retrofitting ELMo with paraphrases improves its performance on lexical semantic tasks. We follow a similar approach and use paraphrases to fine-tune BERT before applying it to GWSC. We use paraphrases from the Open Subtitles Paraphrase Corpus (Opusparcus) (Creutz, 2018) instead of the Microsoft Research Paraphrase Corpus (Dolan et al., 2004) used by Shi et al. (2019), because it contains paraphrase pairs for six European languages, including English and Finnish, the two languages addressed in GWSC.
Paraphrase pairs in Opusparcus were extracted from movie and TV show subtitles, and are ranked by quality. We use paraphrases from the Opusparcus training set with a quality score higher than 15, and create our own training instances following the procedure of Shi et al. (2019). Every pair of paraphrases that share a content word constitutes a positive example (T). For every T, we create a negative example (F) by selecting a pair of sentences from the resource that share the same word but are not paraphrases of each other. To avoid creating examples for target words that are highly frequent and have fuzzy semantics, we omit instances of the 200 most frequent words in the Google Books Ngram corpus (Michel et al., 2011) (e.g., make, get, good). In total, we use 100,000 sentence pairs for fine-tuning the English model and 60,520 for Finnish.
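A sketch of this example creation is given below. Content-word extraction and the sentence index are assumed helpers, and the negative sampling shown here is a simplification of the procedure described above.

```python
import random

# A sketch of the Opusparcus T/F example creation (our own helper names;
# `content_words` and `sents_by_word` are assumed to exist, and the
# negative sampling is simplified).

def build_examples(paraphrases, sents_by_word, top_frequent, content_words):
    """paraphrases: set of (sent_1, sent_2) paraphrase pairs above the
    quality threshold.
    sents_by_word: dict mapping each content word to sentences containing it.
    top_frequent: set of the 200 most frequent words, excluded as targets.
    content_words: function returning the set of content words in a sentence.
    """
    examples = []
    for s1, s2 in paraphrases:
        shared = (content_words(s1) & content_words(s2)) - top_frequent
        for w in shared:
            examples.append((w, s1, s2, "T"))
            # Negative example: two sentences that share w but are
            # not paraphrases of each other.
            if len(sents_by_word[w]) >= 2:
                n1, n2 = random.sample(sents_by_word[w], 2)
                if (n1, n2) not in paraphrases and (n2, n1) not in paraphrases:
                    examples.append((w, n1, n2, "F"))
    return examples
```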

Models
We use these five datasets to fine-tune pre-trained BERT models for English and Finnish. All tasks require comparing the meaning of word instances in two different sentences. We form an input sequence (sentence pair) for BERT by joining the two sentences with the separator token ([SEP]) in between. Since the task is at the word level, we do not build our classifier on top of the [CLS] token, which is an aggregation of the whole input sequence. Instead, our classifier receives as input the BERT representations of the target word instances at the last layer. BERT uses wordpiece tokenization (Wu et al., 2016), which means that a target word may be split into several tokens. For words that have been split, we average the representations of their wordpieces. We use two kinds of heads for fine-tuning.
• Classification head: The representations of the two target tokens are concatenated and fed to a linear classifier which outputs probabilities for each class. We use a cross entropy loss for training. We call this head CLASSIF.
• Cosine Distance head: We apply the Cosine Embedding Loss (PyTorch (Paszke et al., 2019)) to the representations of the two target tokens at the last layer. This loss increases the cosine distance of two tokens if they do not have the same meaning, and decreases it otherwise. We refer to this head as COSDIST.
Note that the ukWaC-subs dataset is only compatible with the CLASSIF head, since it involves three classes.
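The two heads can be sketched in PyTorch as follows. This is our reconstruction under the assumption of BERT-base dimensions, not the released code.

```python
import torch
import torch.nn as nn

# Minimal sketches of the two fine-tuning heads (dimensions assume
# BERT-base; our reconstruction, not the released code).

class ClassifHead(nn.Module):
    """CLASSIF: concatenate the two target-token vectors and classify."""
    def __init__(self, hidden_size=768, num_classes=2):
        super().__init__()
        self.linear = nn.Linear(2 * hidden_size, num_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, tok_a, tok_b, labels):
        logits = self.linear(torch.cat([tok_a, tok_b], dim=-1))
        return self.loss_fn(logits, labels), logits

class CosDistHead(nn.Module):
    """COSDIST: pull same-meaning tokens together, push different-meaning
    tokens apart via the cosine embedding loss."""
    def __init__(self):
        super().__init__()
        self.loss_fn = nn.CosineEmbeddingLoss()

    def forward(self, tok_a, tok_b, labels):
        # labels: +1 for same meaning (T), -1 for different meaning (F)
        return self.loss_fn(tok_a, tok_b, labels)
```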
To predict the similarity of two target tokens in the GWSC task, we extract their representations from the different layers of a fine-tuned model and use cosine similarity (cossim) as our similarity metric. In Subtask 2, which consists in predicting the similarity score for a pair of words (w_a, w_b) in the same context, we simply calculate the cosine similarity of their representations at a specific layer. In Subtask 1, we need to predict the change in similarity between two words w_a and w_b across two contexts (c_1, c_2). We estimate the change in similarity (∆Sim) with a simple subtraction:

∆Sim = cossim(w_a^{c_2}, w_b^{c_2}) − cossim(w_a^{c_1}, w_b^{c_1})

where w_a^{c_2} is the representation of word w_a in context c_2.
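A sketch of prediction at test time, putting together the wordpiece averaging from above and the ∆Sim formula. Illustrative only: we load the base model here, whereas a fine-tuned checkpoint would be used in practice, and the subword matching is naive.

```python
import torch
from torch.nn.functional import cosine_similarity
from transformers import BertModel, BertTokenizer

# Illustrative only: a fine-tuned checkpoint directory would be loaded
# in practice, and the subword matching below is naive.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def word_vector(sentence, word, layer=-1):
    """Average the wordpiece vectors of `word` at the given layer."""
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    pieces = tokenizer.tokenize(word)
    for i in range(len(tokens) - len(pieces) + 1):
        if tokens[i:i + len(pieces)] == pieces:  # first occurrence of the word
            with torch.no_grad():
                hidden = model(**enc).hidden_states[layer]
            return hidden[0, i:i + len(pieces)].mean(dim=0)
    raise ValueError(f"'{word}' not found in sentence")

def delta_sim(w_a, w_b, c1, c2, layer=-1):
    """Subtask 1: cossim of the pair in c2 minus cossim of the pair in c1."""
    sim_c2 = cosine_similarity(word_vector(c2, w_a, layer),
                               word_vector(c2, w_b, layer), dim=0)
    sim_c1 = cosine_similarity(word_vector(c1, w_a, layer),
                               word_vector(c1, w_b, layer), dim=0)
    return (sim_c2 - sim_c1).item()
```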

Experimental Setup
We participate in Subtasks 1 and 2 for English and Finnish. For English, we fine-tune the bert-base-uncased model. For Finnish, we use the uncased Finnish BERT-base model (finnish) (Virtanen et al., 2019) and the uncased multilingual BERT-base model (multilingual). For faster fine-tuning, we set the maximum sequence length to 128 wordpieces and omit examples where a target word occurs after this position.

As a development set for English, we use the officially released GWSC trial data (10 sentence pairs) and an earlier release of trial data (8 sentence pairs), both distinct from the test set. We use these data to select the best models and hyperparameters for our official submissions to GWSC. The English test set consists of 340 context pairs for Subtask 1 and 680 unique contexts for Subtask 2. We fine-tune bert-base-uncased separately on each of our English datasets, experimenting with the two heads {CLASSIF, COSDIST} and with different learning rates {5e-5, 1e-6, 1e-7} for up to 15 epochs. These hyperparameters, along with the layer the word representations are extracted from, are set on the GWSC trial data. Our submitted models were fine-tuned on WiC, Opusparcus and CoInCo-Usim with a learning rate of 5e-5 and 0.1 dropout for 4, 3 and 2 epochs, respectively. The ukWaC-subs model was fine-tuned for 11 epochs with a learning rate of 1e-6 and 0.2 dropout. Dropout was determined based on results on 2,000 held-out ukWaC-subs instances.

Since no trial dataset was released for Finnish, we fixed the hyperparameters for our models to those that worked best for the English Opusparcus data. Our submitted predictions are from the higher layers of the models fine-tuned with the CLASSIF head. The test set for Finnish consists of 24 context pairs for Subtask 1 (48 unique contexts for Subtask 2). The metrics used to evaluate model predictions are the uncentered Pearson correlation (ρ) in Subtask 1, and the harmonic mean of the Pearson and Spearman correlations (ρ) in Subtask 2.
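For reference, the two metrics can be implemented as follows. This is our reading of the task description, not the official scorer; the uncentered Pearson is computed without mean-centering the score vectors.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Sketches of the two evaluation metrics as we understand them from the
# task description (our implementation, not the official scorer).

def uncentered_pearson(x, y):
    """Subtask 1 metric: Pearson correlation without mean-centering."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def harmonic_pearson_spearman(x, y):
    """Subtask 2 metric: harmonic mean of Pearson and Spearman correlations."""
    p = pearsonr(x, y)[0]
    s = spearmanr(x, y)[0]
    return 2 * p * s / (p + s)
```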

Results
Results for the two English and Finnish subtasks are presented in Table 2. We report results of the two best systems submitted to each subtask (marked with †) along with results calculated during the post-evaluation phase for comparison. These include baseline predictions made by BERT models without fine-tuning.
Although the two subtasks are highly related, different models perform best in each one. For English, the best result in Subtask 1 (among our official submissions) is obtained by the model fine-tuned on WiC data with the COSDIST head (ρ = 0.760), which occupies the third position in the final ranking. It is closely followed by the model fine-tuned on paraphrase data with the CLASSIF head. The best performing model in Subtask 2 is the one fine-tuned on the ukWaC-subs data (ρ = 0.718), which ranked fourth. The second best model uses the COSDIST head and is trained on the CoInCo and Usim data together. All English models outperform the BERT-based baseline without fine-tuning (ρ = 0.715 and ρ = 0.661 in Subtasks 1 and 2). This demonstrates the higher quality of the lexical semantic knowledge in our fine-tuned models.
Best results for the Finnish Subtasks 1 and 2 are also produced by different models. The multilingual model performs better on Subtask 1 and the finnish model on Subtask 2. We observe that the multilingual model tends to assign very high similarities to all word instance pairs, which explains its low performance in Subtask 2. At the same time, however, it does well on Subtask 1 because it captures the magnitude of the difference in similarity between two pairs. Given that no trial data (development set) are available for Finnish and that the maximum number of submissions was nine, we could only try at most five layers per model at submission time. The models were ranked sixth and fourth in Subtasks 1 and 2.
During the post-evaluation phase, we were able to test all layers of the models. The sixth layer of the multilingual model fine-tuned on Finnish Opusparcus data outperforms the multilingual baseline on Subtask 1 (ρ = 0.718 vs. ρ = 0.677), but the other fine-tuned models do not improve over their respective baselines. Surprisingly, the finnish baseline model in Subtask 2 (ρ = 0.671) outperforms the top-ranked Finnish model among all teams that participated in the task (ρ = 0.645).

Conclusion
We have participated in the SemEval task Graded Word Similarity in Context for English and Finnish, with models integrating different notions of word similarity. We have specifically investigated the effect of fine-tuning pre-trained BERT models on existing datasets that address word meaning similarity in context. Furthermore, we have proposed a new fine-tuning task where in-context lexical similarity is approximated through automatic substitute annotations.
Our English models are ranked at the third and fourth position in the two GWSC subtasks, and outperform a BERT-based baseline without fine-tuning. This demonstrates the benefit of fine-tuning BERT on a task that is closely related to the end task, even when the data used for fine-tuning are automatically obtained. Due to the scarcity of resources for Finnish, we could only fine-tune models with paraphrases. The Finnish models are mid-ranked among all participating systems.