Distilling Word Meaning in Context from Pre-trained Language Models

In this study, we propose a self-supervised learning method that distils representations of word meaning in context from a pre-trained masked language model. Word representations are the basis for context-aware lexical semantics and unsupervised semantic textual similarity (STS) estimation. A previous study transforms contextualised representations employing static word embeddings to weaken excessive effects of contextual information. In contrast, the proposed method derives representations of word meaning in context while preserving useful context information intact. Specifically, our method learns to combine the outputs of different hidden layers using self-attention through self-supervised learning with an automatically generated training corpus. To evaluate the performance of the proposed approach, we performed comparative experiments using a range of benchmark tasks. The results confirm that our representations are competitive with those of the state-of-the-art method that transforms contextualised representations on context-aware lexical semantic tasks, and outperform them on STS estimation.


Introduction
Word representations are the basis for various natural language processing tasks. In particular, they are crucial as a component in context-aware lexical semantics and in the estimation of unsupervised semantic textual similarity (STS) (Arora et al., 2017; Ethayarajh, 2018; Yokoi et al., 2020). To improve these downstream tasks, word representations should capture word meaning in context. Large-scale masked language models pre-trained on massive corpora, e.g., bi-directional encoder representations from transformers (BERT) (Devlin et al., 2019), embed both the context and meaning of a word; thus, word-level representations generated by such masked language models are called contextualised word representations. Previous studies (Ethayarajh, 2019; Vulić et al., 2020) have revealed that lexical information and context-specific information are captured in different layers of masked language models, and argued that a sophisticated mechanism is required to derive representations of word meaning in context from them. Although contextualised word representations have shown considerable promise, how best to compose the outputs of different layers of masked language models to effectively represent word meaning in context remains an open question.

A previous study improved contextualised word representations by transforming their space towards static word embeddings, e.g., fastText (Bojanowski et al., 2017). Although this transformation is computationally efficient, the process is monotonic and weakens the effect of context in the representations. As an orthogonal approach, pre-trained masked language models could be fine-tuned with supervision to generate representations of word meaning in context. However, annotating word meanings in context is non-trivial, and no such resource is abundantly available.
To address these challenges, we propose a method that distils representations of word meaning in context from masked language models via self-supervised learning. Specifically, our model combines the outputs of different hidden layers using a self-attention mechanism (Vaswani et al., 2017). The distillation model is self-supervised as an autoencoder that learns to reconstruct the original representations, using an automatically generated training corpus. In contrast to the transformation-based approach, our representations preserve useful context information intact.
Experimental results on a range of benchmark tasks show that our representations exhibited a performance competitive with that of the state-of-the-art method that transforms contextualised representations for context-aware lexical semantics. Furthermore, the results confirm that our representations are more effective for composing sentence representations, which contributes to unsupervised STS estimation.


Related Work

A line of previous studies pools contextualised representations into static word representations, analogous to word2vec (Mikolov et al., 2013), for context-free lexical semantic tasks such as word similarity and analogy prediction.
Transformation has also been used to adjust excessive effects of context that dominate representations. Shi et al. (2019) added a transformation matrix on top of the embedding layer of ELMo (Peters et al., 2018). Their approach derives the matrix such that the final representations of the same words in paraphrased sentences become similar, whereas those in non-paraphrases become distant. The study most relevant to the present work transforms the space of word representations towards the rotated space of static word embeddings using a cross-lingual alignment technique (Doval et al., 2018) for context-aware lexical semantic tasks. In principle, these previous studies aim to make contextualised representations less sensitive to contexts through transformation and thereby prevent contexts from dominating the representations. We adopt an orthogonal approach that derives word-in-context representations by combining different layers of a pre-trained model while preserving useful context information intact.

Representation Disentanglement
Disentanglement techniques, which generate specialised representations dedicated to a specific aspect, are relevant to our approach. Previous studies typically employed autoencoders, with the encoder learning to disentangle representations and the decoder learning to reconstruct the original representations. For example, Wieting et al. (2020) disentangled language-dependent styles and sentence meanings for STS estimation. The removal of specific attributes from representations is also relevant. Previous studies have proposed methods for removing predetermined attributes, instead of disentangling them, for multi-linguality (Chen et al., 2018; Lample et al., 2018) and debiasing (Zemel et al., 2013; Barrett et al., 2019).
These previous studies assume that the disentangled attributes are distinctive, e.g., language-dependent styles and meanings are supposed to be independent of one another. Similarly, studies on attribute removal assume that the removed attributes are independent of the information remaining in the output representations. In contrast, the distillation of word meaning in context requires a subtle balance in the extent to which context information is present in the meaning representations. In this study, we design a self-supervision framework to achieve this challenging goal.

Distilling Word Meaning in Context
Inspired by the representation disentanglement approach (Section 2.2), we model the distillation of representations of word meaning in context using an autoencoder framework, as shown in Figure 1. Vulić et al. (2020) probed pre-trained language models for lexical semantic tasks, revealing that lexical information is scattered across lower layers, whereas context-specific information is embedded in higher layers. Hence, we aim to distil the outputs of different hidden layers using a transformer layer. Although we adopted BERT as the masked language model in this study, the proposed method is directly applicable to other pre-trained models.

Figure 1 shows the model architecture. First, we obtain the outputs of all hidden layers of a masked language model, MLM(·), with frozen parameters: H = MLM(S) ∈ R^{|S|×(ℓ+1)×d}, where S is an input sentence of length |S| containing the target word w_t ∈ S, ℓ is the number of hidden layers in the masked language model (0 corresponding to its embedding layer), and d is the hidden dimension of the masked language model. We then extract the outputs of the hidden layers corresponding to the target word w_t from H, denoted as H_{w_t} ∈ R^{(ℓ+1)×d}. When w_t is segmented into a set of m sub-words ω_1, ω_2, ..., ω_m by the tokeniser of the masked language model, we compute the layer-wise averages of the hidden outputs of all sub-words (Bommasani et al., 2020). That is, h_i ∈ H_{w_t} becomes h_i = Pool(h_i^{ω_1}, ..., h_i^{ω_m}), where h_i^{ω_j} is the ith hidden output of the sub-word ω_j and the Pool(·) function conducts mean-pooling.
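For illustration, this extraction step can be sketched with the Transformers library as follows. This is a minimal sketch under our reading of the paper, not the authors' released implementation; in particular, the simplified target-word lookup is an assumption.

```python
# A minimal sketch of extracting the frozen hidden states H = MLM(S) and
# mean-pooling the sub-word pieces of a target word (Bommasani et al., 2020).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
mlm = AutoModel.from_pretrained("bert-large-cased", output_hidden_states=True)
mlm.eval()  # parameters stay frozen throughout

def target_hidden_states(sentence: str, target: str) -> torch.Tensor:
    """Return H_{w_t} of shape (num_layers + 1, d) for the target word."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = mlm(**enc)
    # hidden_states: tuple of (l + 1) tensors, each (1, |S|, d);
    # index 0 corresponds to the embedding layer.
    H = torch.stack(outputs.hidden_states, dim=0).squeeze(1)  # (l+1, |S|, d)
    # Locate the sub-word positions of the target word (simplified lookup;
    # assumes the word tokenises identically in isolation and in context).
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for start in range(len(ids) - len(target_ids) + 1):
        if ids[start:start + len(target_ids)] == target_ids:
            # Layer-wise mean over the m sub-words.
            return H[:, start:start + len(target_ids), :].mean(dim=1)
    raise ValueError("target word not found in sentence")
```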
We then input these hidden outputs into a meaning distillation model to derive a representation of word meaning in context. We also input the hidden outputs to another distillation model that derives information other than word meaning in context; for convenience, hereinafter we refer to this information as the context and to the model as the context distillation model. Each distillation model consists of a transformer layer followed by a mean-pooling function to obtain the meaning and context representations, h_m ∈ R^d and h_c ∈ R^d, respectively:
h_m = Pool(TransF(h_k, ..., h_ℓ)),

where k ∈ [0, ℓ] determines the bottom layer to consider and TransF(·) represents a transformer layer. We distil the context representation h_c in the same manner.
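A minimal sketch of one distiller follows, assuming a single standard transformer encoder layer; the defaults match the hyper-parameters reported in the Implementation section, while the feed-forward width and other unstated details are assumptions.

```python
# Sketch of a distiller: one transformer layer over the top-half hidden
# outputs of the target word, followed by mean-pooling into a single vector.
import torch
from torch import nn

class Distiller(nn.Module):
    def __init__(self, d: int = 1024, nhead: int = 8, dropout: float = 0.1):
        super().__init__()
        self.transf = nn.TransformerEncoderLayer(
            d_model=d, nhead=nhead, dropout=dropout, batch_first=True)

    def forward(self, H_wt: torch.Tensor, k: int) -> torch.Tensor:
        # H_wt: (batch, l + 1, d); keep layers k..l, contextualise them with
        # self-attention, then mean-pool into a d-dimensional representation.
        return self.transf(H_wt[:, k:, :]).mean(dim=1)

meaning_distiller = Distiller()   # produces h_m
context_distiller = Distiller()   # produces h_c
```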
Finally, we reconstruct the original representation from h_m and h_c. Although there are different approaches to reconstruction, such as using a neural-network-based decoder, a sophisticated decoder may learn to mimic the masked language model outputs by itself. Hence, we adopt mean-pooling as the simplest mechanism for reconstruction.
The reconstruction target y ∈ R^d is the mean-pooled hidden layers of the original masked language model:

y = Pool(h_k, ..., h_ℓ). (1)

We minimise the reconstruction loss between y and the mean-pooled meaning and context representations:

L_r = ‖y − Pool(h_m, h_c)‖². (2)

For inference, we use h_m as the representation of word meaning in context. Averaging the outputs of the layers in the top half of masked language models consistently performs well for context-aware lexical semantic tasks (Vulić et al., 2020). Thus, we set k = ℓ/2 + 1 to use the top-half layers for distillation.

John et al. (2019) reported that a variational autoencoder (Kingma and Welling, 2014) outperformed the simpler autoencoder on representation disentanglement. However, this was not the case in this study, wherein the autoencoder consistently outperformed the variational version. We intend to further investigate auto-encoding architectures in future work.
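The objective can be sketched as follows. Note that pairing the mean-pooling decoder with a mean squared error is our reading of Equations (1) and (2); the source extraction does not fix the distance function.

```python
# Sketch of the reconstruction objective as we read Equations (1) and (2);
# the mean-squared-error formulation is an assumption.
import torch
import torch.nn.functional as F

def reconstruction_loss(H_wt: torch.Tensor, h_m: torch.Tensor,
                        h_c: torch.Tensor, k: int) -> torch.Tensor:
    # Eq. (1): reconstruction target y, the mean-pooled top-half layers.
    y = H_wt[:, k:, :].mean(dim=1)
    # Mean-pooling "decoder": average the meaning and context representations.
    y_hat = (h_m + h_c) / 2
    # Eq. (2): reconstruction loss L_r.
    return F.mse_loss(y_hat, y)
```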

Self-supervised Learning
The meaning and context distillation models described in Section 3 require constraints to ensure that the desired attributes are distilled; otherwise, these distillation models obtain a degenerate solution that simply copies the original representations. We design a self-supervision framework ensuring that word meaning in context is distilled using an automatically generated training corpus.

Cross Reconstruction
Suppose we have two sentences, S_p and S_n. S_p is a sentence that contains a word with the same meaning as w_t in S, while S_n contains a word with a different meaning from w_t while the context is the same as in S. More concretely, S_p is a sentence containing w_p, which is equivalent to w_t or a lexical paraphrase of w_t, such that w_p has the same meaning as w_t in S. In contrast, S_n replaces w_t with a non-paraphrasal word w_n that is suitable for the context, i.e., S_n = {w_n} ∪ {w_i | w_i ∈ S \ w_t}.

Table 1: Examples of positive and negative samples
original: They promised him a nice amount of coins, if the work would be successful.
positive: They assured him a good amount of coins if the work was successful.
negative: They left him a nice amount of coins, if the work would be successful.
We refer to S_p and S_n as the positive and negative samples, respectively. Table 1 shows examples of such positive and negative samples. From the hidden outputs of w_p and w_n, we distil the meaning and context representations p_m and p_c, and n_m and n_c, respectively. The meaning representation of w_t, h_m, should satisfy the following two conditions.
• h_m can be combined with p_c to reconstruct the original representation derived for w_p (denoted as p), and
• h_m can be combined with n_c to reconstruct the original representation, y.
Similarly, the context representation h_c should satisfy the following two conditions.
• h_c can be combined with p_m to reconstruct the original representation, y, and
• h_c can be combined with n_m to reconstruct the original representation derived for w_n (denoted as n).
We use these properties of meaning and context representations as constraints.
Specifically, we train the model to achieve cross reconstruction of meaning and context representations, as depicted in Figure 2.
Our self-supervised learning minimises the following cross-reconstruction loss:

L_c = ‖p − Pool(h_m, p_c)‖² + ‖y − Pool(h_m, n_c)‖² + ‖y − Pool(p_m, h_c)‖² + ‖n − Pool(n_m, h_c)‖², (3)

where p and n are computed in the same manner as Equation (1) for the positive and negative samples, respectively. The overall loss function is the summation of the reconstruction and cross-reconstruction losses in Equations (2) and (3), where L_r is expanded to sum the reconstruction losses of the positive and negative samples.
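Assembling the four conditions above, the cross-reconstruction loss can be sketched as follows; as before, the mean-squared-error formulation is an assumption.

```python
# Sketch of the cross-reconstruction loss (Eq. (3)), one term per condition.
import torch
import torch.nn.functional as F

def pool(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return (a + b) / 2  # mean-pooling "decoder"

def cross_reconstruction_loss(y, p, n, h_m, h_c, p_m, p_c, n_m, n_c):
    # y, p, n: reconstruction targets (Eq. (1)) for the original,
    # positive, and negative samples, respectively.
    return (F.mse_loss(pool(h_m, p_c), p)    # h_m + positive context -> p
            + F.mse_loss(pool(h_m, n_c), y)  # h_m + negative context -> y
            + F.mse_loss(pool(p_m, h_c), y)  # positive meaning + h_c -> y
            + F.mse_loss(pool(n_m, h_c), n)) # negative meaning + h_c -> n
```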

Training Corpus Creation
In this section, we describe the generation of a training corpus for self-supervision using techniques of round-trip translation and masked token prediction.

Round-trip Translation
The positive samples in this study require that w_p has the same meaning as w_t in another context, S_p. We assume that common words in a paraphrased sentence pair meet this requirement (Shi et al., 2019). To expand the applicability of our method to various languages, we automatically generate paraphrases using round-trip translation, which translates a source sentence into a target language and then back into the source language. Previous work has shown that pairs of source and back-translated sentences are useful paraphrases for style transfer research. Hence, we obtain S_p by round-trip translation of S.

We then need to align w_t and w_p in S and S_p. The two rounds of translation make it non-trivial to trace which word in S_p corresponds to w_t. Following the trend in monolingual alignment of using static word embeddings (Yoshinaka et al., 2020), we designed an alignment method based on a simple heuristic using cosine similarities between the embeddings of words in S and S_p, as depicted in Algorithm 4.1. Specifically, we first identify an alignment between words w_i ∈ S \ w_t and w_j ∈ S_p if and only if they have the highest cosine similarity to each other (line 5). We then determine w_p as the word that has the highest cosine similarity to w_t, provided that the similarity is greater than or equal to a pre-determined threshold λ and that the word has not been aligned to others (line 9).
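The alignment heuristic can be sketched as follows; the `embed` lookup (a fastText-style static embedding) and the tie-breaking behaviour are simplifications of Algorithm 4.1.

```python
# Sketch of the alignment heuristic under our reading of Algorithm 4.1.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_wp(S, wt, Sp, embed, lam=0.6):
    """Return the word in Sp aligned to wt, or None if alignment fails."""
    aligned = set()
    # Mutual-best alignment for the non-target words of S (line 5).
    for wi in S:
        if wi == wt:
            continue
        best = max(Sp, key=lambda wj: cosine(embed(wi), embed(wj)))
        back = max(S, key=lambda w: cosine(embed(best), embed(w)))
        if back == wi:
            aligned.add(best)
    # w_t aligns to the most similar unaligned word above threshold (line 9).
    candidates = [w for w in Sp if w not in aligned]
    if not candidates:
        return None
    wp = max(candidates, key=lambda w: cosine(embed(wt), embed(w)))
    return wp if cosine(embed(wt), embed(wp)) >= lam else None
```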

Masked Token Prediction
In contrast, negative samples replace w_t with an arbitrary word w_n that fits the context of S. We generate candidates for replacement words using masked token prediction, the primary task used to train the masked language model. Specifically, we input the original sentence, with the target word masked by the [MASK] token, to the masked language model and obtain predictions T = {t_1, ..., t_|V|} with probabilities Q = {q_1, ..., q_|V|}, where |V| is the size of the vocabulary of the masked language model. To avoid selecting a possible paraphrase of w_t as w_n, we again use the static word-embedding model, following Qiang et al. (2020). We sort T in descending order of Q and identify w_n as a word whose embedding has a cosine similarity to that of w_t lower than λ and whose prediction probability q_n is higher than a pre-determined threshold δ.
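A sketch of this selection procedure follows; the single-[MASK] assumption and the `embed`/`cosine` helpers (the static-embedding lookup from the previous sketch) are simplifications.

```python
# Sketch of negative-sample generation by masked token prediction.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-large-cased")
mlm.eval()

def find_wn(masked_sentence, wt, embed, cosine, lam=0.6, delta=0.003):
    """masked_sentence contains exactly one [MASK] in place of w_t."""
    enc = tokenizer(masked_sentence, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)
    # Traverse predictions in descending order of probability.
    for q, idx in zip(*probs.sort(descending=True)):
        if q < delta:
            break  # no sufficiently probable candidate left
        token = tokenizer.convert_ids_to_tokens(idx.item())
        # Reject likely paraphrases of w_t (cosine similarity >= lambda).
        if cosine(embed(token), embed(wt)) < lam:
            return token
    return None
```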
We apply the same technique to enhance w_p when it is identical or similar to w_t based on character-level edit distance. Where possible, we replace w_p with a word in T whose embedding has a cosine similarity to that of w_t greater than or equal to λ and whose prediction probability is higher than δ in masked token prediction.
We also investigated a word substitution approach to creating the self-supervision corpus (Garí Soler and Apidianaki, 2020), i.e., replacing only w_t with w_p using masked token prediction. This method is computationally faster than round-trip translation but showed inferior performance compared to the proposed approach. We presume this is because round-trip translation provides more diverse lexical paraphrases than those already learned by the masked language model, and because paraphrasing the context also enhances the robustness of the meaning and context distillers.

Experimental Setup
We empirically evaluated whether our method distils representations of word meaning in context from a masked language model using context-aware lexical semantic tasks and STS estimation tasks. All the experiments were conducted on an NVIDIA Tesla V100 GPU.
We compared our method to the state-of-the-art in the family of methods that transform contextualised representations. Recall that this method adopts an approach orthogonal to ours: it transforms word representations from the masked language model using static word embeddings. Specifically, we used fastText as the static embeddings, which performed most robustly across models and tasks. As a baseline, we also show the performance of BERT. Based on previous studies (Vulić et al., 2020), we used the average of the outputs of the top-half layers, i.e., Equation (1), which consistently performed well in lexical semantic tasks.

Context-aware Lexical Semantic Tasks
We followed the experimental settings used in previous work for a fair comparison, which categorised context-aware lexical semantic tasks into Within-word and Inter-word tasks. The former evaluates the diversity of word representations for different meanings of the same word associated with different contexts. In contrast, the latter evaluates the similarity of word representations for different words when they have the same meaning. The left-side columns of Table 2 show the number of word pairs in the evaluation corpora.
Within-word Tasks The within-word evaluation was divided into three tasks. The first is based on the Usage Similarity (Usim) corpus (Erk et al., 2013), which provides graded similarity between the meanings of the same word in a pair of different contexts. The second task uses the Word in Context (WiC) corpus (Pilehvar and Camacho-Collados, 2019), which provides binary judgements as to whether the meaning of a given word varies in different contexts. Following the standard setting recommended in the original work, we tuned the threshold on cosine similarity between word representations to make binary judgements. Specifically, we searched for the threshold in the range [0, 1.0] with 0.01 intervals to maximise the accuracy on the development set. The performance on the test set was measured on the CodaLab server. The third task is subtask-1 of CoSimlex (Armendariz et al., 2020) (denoted as CoSimlex-I). CoSimlex provides a pair of contexts consisting of a few sentences for each word pair extracted from SimLex-999 (Hill et al., 2015) and annotates the graded similarity in each context. CoSimlex-I requires the estimation of the change in similarities between the same word pair in different contexts. Hence, it evaluates whether representations change for different word meanings according to context.
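The WiC threshold search described above can be sketched as follows.

```python
# Sketch of the WiC threshold search: scan [0, 1] in steps of 0.01 and keep
# the cut-off maximising development-set accuracy.
import numpy as np

def tune_threshold(similarities, labels):
    """similarities: cosine similarities; labels: 1 = same meaning, 0 = not."""
    sims, labels = np.asarray(similarities), np.asarray(labels)
    best_t, best_acc = 0.0, -1.0
    for t in np.arange(0.0, 1.01, 0.01):
        acc = ((sims >= t).astype(int) == labels).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```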

Inter-word Tasks
The inter-word evaluation consisted of two tasks. The first was subtask-2 of CoSimlex (denoted as CoSimlex-II), which requires estimating the similarity between different word pairs in the same context. The second task used the Stanford Contextual Word Similarity (SCWS) corpus (Huang et al., 2012), which provides graded similarity between word pairs in a pair of different contexts. The contexts in CoSimlex and SCWS consist of several sentences; we input all the sentences as a single context.

Evaluation Metrics
We estimated the similarity between words using the cosine similarity between their representations. We used the evaluation metric determined by each corpus: we evaluated WiC using accuracy, CoSimlex-I using Pearson's r, and the other tasks using Spearman's ρ.

STS Tasks
We also evaluated the proposed method on STS tasks. Cosine similarity is commonly used to estimate the similarity between two text representations. In this experiment, we also used cosine similarity because such a primitive measure is sensitive to the characteristics of different representations. We generated a sentence representation by simply averaging the representations of sub-words in a sentence, excluding the representations of special tokens reserved in BERT, i.e., [CLS] and [SEP]. We then computed cosine similarities between sentence representations. We evaluated on the 2012-to-2016 SemEval STS shared tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016), where the goal is to predict human scores that indicate the degree of semantic similarity between two sentences. Pearson's r between model predictions and human scores was used as the evaluation metric. Each STS corpus is divided by data source; hence, the corpus-level score is the average of Pearson's r over the sub-corpora. We downloaded and pre-processed the STS 2012 to 2016 corpora using the SentEval toolkit (Conneau and Kiela, 2018). The right-side columns of Table 2 show the number of sentence pairs in these corpora.
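This protocol can be sketched as follows; `word_representations` is a hypothetical hook returning our sub-word meaning representations for a sentence, already excluding the special tokens.

```python
# Sketch of the unsupervised STS protocol: average sub-word representations,
# take cosine similarity, and score with Pearson's r.
import numpy as np
from scipy.stats import pearsonr

def sentence_embedding(sentence, word_representations):
    # word_representations(sentence): (num_subwords, d) matrix, with the
    # special [CLS] and [SEP] tokens already removed.
    return word_representations(sentence).mean(axis=0)

def sts_score(sentence_pairs, gold_scores, word_representations):
    predictions = []
    for s1, s2 in sentence_pairs:
        e1 = sentence_embedding(s1, word_representations)
        e2 = sentence_embedding(s2, word_representations)
        predictions.append(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
    return pearsonr(predictions, gold_scores)[0]
```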

Training Corpus Preparation
To prepare a training corpus for self-supervised learning as described in Section 4.2, we used English Wikipedia dumps distributed for the WMT20 competition, the texts of which were extracted using WikiExtractor. As a pre-processing step, we first identified the language of each text using the langdetect toolkit and discarded all non-English texts. We then conducted sentence segmentation and tokenisation using Stanza (Qi et al., 2020) and extracted sentences of 15 to 50 words.
As candidate target words, we extracted the top-50k most frequent words, following previous work. We then sampled 1M sentences containing these words from the pre-processed Wikipedia corpus. Using these 1M sentences, we generated positive and negative samples via round-trip translation and masked token prediction. For round-trip translation, we trained translators using exactly the same settings as in previous work. For convenience, we used fastText as the static word embedding model in Algorithm 4.1; however, other word embeddings or paraphrase lexicons, e.g., PPDB (Ganitkevitch et al., 2013), can also be used. We set λ to 0.6 based on the distribution of cosine similarities of fastText embeddings on a large text corpus. We set δ to 0.003 based on observations of masked token predictions on several samples randomly extracted from the training corpus, such that we could obtain more than 10 predictions of reasonable quality.
Round-trip translation does not always produce an alignable w_p, and our simple word alignment heuristic may fail to identify w_p. Hence, the final number of sentences in our training corpus was reduced to 929,265, in which 44,614 unique words remained as targets. Among them, 242,643 sentences had a w_p whose surface form differed from w_t by more than 3 in character-level edit distance; these were expected to be lexical paraphrases. We used these 929k triples of original, positive, and negative samples for self-supervised learning. We randomly sampled and excluded 10k sentences as a validation set and used the remainder for training.

Implementation
We implemented our method using PyTorch and PyTorch Lightning. As the masked language model, we used the BERT-Large, cased model through the Transformers library (Wolf et al., 2020). BERT-Large has 24 layers of 1,024 hidden dimensions with 16 attention heads. Recall that the parameters of BERT were frozen and never fine-tuned.
The meaning and context distillers each comprise a transformer layer with 1,024 hidden dimensions and eight attention heads. We applied 10% dropout to the transformer layer. The batch size was 128. We used AdamW (Loshchilov and Hutter, 2019) as the optimiser, with the learning rate tuned to 4.0e−5 following Smith (2017). For stable training, we applied a warm-up, where the initial learning rate was linearly increased over the first 1k steps to reach the predetermined value. Training was stopped early with a patience of 15 and a minimum delta of 1.0e−4 based on the validation loss, measured every 0.1 epoch.
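This configuration can be sketched as follows; everything beyond the stated hyper-parameters (AdamW, learning rate 4.0e−5, 1k-step linear warm-up, patience 15, minimum delta 1.0e−4) is an assumption, including the constant learning rate after warm-up and the placeholder model.

```python
# Sketch of the training configuration described above.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder for the two distillers; see the Distiller sketch above.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=8, dropout=0.1)
optimizer = AdamW(model.parameters(), lr=4.0e-5)
# Linear warm-up over the first 1,000 steps, then a constant learning rate.
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / 1000))

best, patience_left = float("inf"), 15
def early_stop(val_loss: float) -> bool:
    """True when validation loss stops improving by at least 1.0e-4."""
    global best, patience_left
    if val_loss < best - 1.0e-4:
        best, patience_left = val_loss, 15
        return False
    patience_left -= 1
    return patience_left <= 0
```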
For the comparison method, we replicated the model using the implementation and training corpus published by its authors. Note that their training corpus was also drawn from English Wikipedia. The performance was then measured on the same evaluation corpora and in the same computational environment as our method.

Results and Discussions
Below, we discuss the experimental results and the results of in-depth analyses conducted to identify the characteristics of the meaning representations generated by our method.

Table 3 shows the results on context-aware lexical semantic tasks. The superior performance of our meaning representations over the context representations confirms that distillation performed as designed. Our meaning representations achieved performance competitive with the transformation method. While the transformation method was stronger on Within-word tasks, our method outperformed it on Inter-word tasks. This is because the transformation method makes representations of the same word in different contexts closer to the same static embedding but does not explicitly model relations across words. In contrast, our negative samples provide supervision that makes representations of words with different meanings distinctive. While the performances of the two methods are competitive, these different properties are reflected in the representations.

This difference is more pronounced in the results of the unsupervised STS tasks shown in Table 4, where our meaning representations outperformed the transformed representations in four out of five tasks. The transformation makes contextualised representations less sensitive to contexts to prevent contexts from dominating the representations. This effect is preferred in context-aware lexical semantic tasks, which strictly require representations of word meaning, but it simultaneously sacrifices context information valuable for other tasks. In contrast, our method does not waste the context information useful for composing sentence representations.

Analysis
For a deeper understanding of the context information preserved in representations by the transformation method and our method, we conducted an experiment using the corpus of Paraphrase Adversaries from Word Scrambling (PAWS) (Zhang et al., 2019). PAWS is a paraphrase corpus dedicated to evaluating the sensitivity of recognition models to syntax in paraphrases. It provides paraphrase and non-paraphrase pairs generated by controlled word swapping and back-translation with manual screening. Because pairs in PAWS have relatively high word overlap rates, models insensitive to context cannot exceed the chance rate for paraphrase recognition.
We generated representations of the sentences in the PAWS-Wiki Labeled (Final) section in the same manner as in the STS tasks and computed cosine similarities between them. We then determined a threshold for regarding a pair as a paraphrase using the development set. Table 5 shows the results. BERT-Large and the transformation method had accuracy equal to or lower than the chance rate of 55.80% (always outputting the majority label of non-paraphrase). In contrast, our method improved the accuracy even on this challenging task. This is achieved by our property of distilling word meaning in context without sacrificing useful context information.

Ablation Study
We also evaluated our method trained without the negative samples on the tasks of Table 3, where the meaning representations were no longer useful while the context representations showed only a performance comparable to that of BERT-Large. Interestingly, these context representations still outperformed the representations of BERT-Large on the unsupervised STS tasks. We conducted an intrinsic evaluation, again using the PAWS-Wiki Labeled (Final) section, to investigate the characteristics of the meaning and context representations and to reveal possible mechanisms behind this gain.

Table 6 shows the average cosine similarities between meaning and context representations, computed separately for common and different words in paraphrases and non-paraphrases. Representations for word meaning in context are expected to have (a) higher similarity for words with the same surface form than for different words, and (b) higher similarity for words appearing in paraphrases than for words in non-paraphrases, by reflecting the context. In particular, appropriate representations should have higher similarity for common words in paraphrases than for those in non-paraphrases because the former are more likely to have the same meaning.
Table 6: Average cosine similarities between words in PAWS-Wiki, where "w/o NS" denotes our method without negative samples ("P" stands for paraphrases and "N" for non-paraphrases)

The meaning and context representations trained with negative samples, as well as the context representations trained without negative samples, preserve these characteristics; in other words, they show a noticeable distinction between common and different words and between words in paraphrases and non-paraphrases. In contrast, the meaning representations generated without negative samples have high cosine similarities among all words, regardless of word and paraphrase relations. This result implies that these meaning representations without negative samples performed as a noise filter removing non-useful information from the context representations, and only the corresponding context representations benefited from the self-supervision.
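The aggregation behind Table 6 can be sketched as follows; the input format is an assumption.

```python
# Sketch of the Table 6 analysis: average cosine similarities grouped by word
# overlap (common vs. different surfaces) and pair label (paraphrase vs. not).
# Assumed input: tuples (vector_1, vector_2, is_common_word, is_paraphrase).
import numpy as np
from collections import defaultdict

def grouped_similarities(pairs):
    groups = defaultdict(list)
    for v1, v2, common, paraphrase in pairs:
        sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        groups[("common" if common else "different",
                "P" if paraphrase else "N")].append(sim)
    return {key: float(np.mean(vals)) for key, vals in groups.items()}
```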

Summary and Future Work
We have proposed a method that improves contextualised word representations. The proposed approach distils a representation of word meaning in context while retaining useful context information encoded by a masked language model. Experimental results confirmed that our method exhibited performance competitive with the state-of-the-art method that transforms contextualised representations to alleviate excessive effects of contexts, as demonstrated on context-aware lexical semantic tasks. Our method further outperformed it on STS tasks. In future work, we plan to investigate what the context representations correspond to. We had assumed that these representations preserve sentence-level meaning; however, the STS results indicated that this assumption was incorrect. Another possibility is that the context representations retain syntactic information; we intend to conduct in-depth investigations using syntactic tasks. Moreover, we will extend our method to multilingual masked language models to contribute to cross-lingual processing, e.g., cross-lingual word-in-context disambiguation (Camacho-Collados et al., 2017), word alignment (Nagata et al., 2020), and quality estimation and post-editing for machine translation (Fomicheva et al., 2020).