Hitachi at SemEval-2020 Task 3: Exploring the Representation Spaces of Transformers for Human Sense Word Similarity

In this paper, we present our system for SemEval-2020 task 3, Predicting the (Graded) Effect of Context in Word Similarity. Due to the unsupervised nature of the task, we focused on investigating the similarity measures induced by different layers of different pre-trained Transformer-based language models, which can be good approximations of the human sense of word similarity. Interestingly, our experiments reveal a language-independent characteristic: the middle to upper layers of Transformer-based language models can induce good approximate similarity measures. Finally, our system was ranked 1st on the Slovenian part of Subtask 1 and 2nd on the Croatian part of both Subtask 1 and Subtask 2.


Introduction
In this paper, we describe our participation in SemEval-2020 task 3: Predicting the (Graded) Effect of Context in Word Similarity (Armendariz et al., 2020). The goal of the task is to understand the effect of contexts on word similarity. The task is composed of two subtasks sharing inputs: we are given a word pair (i.e., two words) and two text snippets (hereafter "contexts"), both including the word pair. For convenience of explanation, we describe Subtask 2 first. In Subtask 2, we predict two similarity scores between the word pair, one for each of the two given contexts. In Subtask 1, we predict the difference between these two similarity scores. More detailed descriptions of the subtasks are given in Section 3.
In both subtasks, only a small amount of labeled data is available for model development. Thus, participants are required to build models in an unsupervised manner.
We formulated the tasks as the exploration of similarity measures induced in the hidden-layer word representation spaces of pre-trained Transformer-based (Vaswani et al., 2017) language models. Our expectation is that the contextualized representations of Transformers induce context-dependent similarity measures that approximate human perception.
Experimental results show that better approximations of word sense similarity can be induced in the middle to upper layers of Transformers for most languages. As a result, our system was ranked 1st on the Slovenian part of Subtask 1 and 2nd on the Croatian part of both Subtask 1 and Subtask 2.

Background
Capturing the similarity between words is considered one of the fundamental tasks in natural language processing research because it is strongly related to many research fields such as text search, entailment recognition, and information extraction. Recent work has aimed at predicting the similarity of a given word pair (Camacho-Collados et al., 2017; Mikolov et al., 2013), but it considers the similarity of word meanings without the effects of their contexts. Word-Sense Disambiguation (WSD) (Miller et al., 2012; Raganato et al., 2017), which is a lexicographical approach to the representation of word senses, aims at selecting an appropriate sense for a given word from word-specific sense candidates. Alongside these two lines of work, Armendariz et al. (2020) extended similarity-based word-sense detection in SemEval-2020 task 3. This task focuses on the effect of contexts on word similarity. More concretely, the task aims at predicting the similarity of a given word pair in different contexts.

Table 1: Example of data used in the task. In this case, $w_1$ = "cell" and $w_2$ = "room." In $c_1$, both $w_1$ and $w_2$ refer to different kinds of rooms, so the similarity score $s_1$ is rather high. In $c_2$, $w_1$ is used as a biological term, while $w_2$ denotes an abstract concept, so the similarity score $s_2$ is low. Note that this sample is taken from the official competition examples, and the scores are hypothetical since we do not know the gold ones.

context | text | sim score
$c_1$ | Her prison cell was almost an improvement over her room at the last hostel. | $s_1 = 7$
$c_2$ | His job as a biologist didn't leave much room for a personal life. He knew much more about human cells than about human feelings. | $s_2 = 2$
Recently proposed contextual word vectors, especially those of Transformer-based language models, are considered to be able to capture context-dependent word meanings (Ethayarajh, 2019). We utilize these vectors for the task.

Task Formalization
The task includes two subtasks, both of which aim at capturing the effect of context on the similarity of word pairs. Each subtask has four "sub-subtasks", one for each of the four languages, namely, English, Croatian, Finnish, and Slovenian.
As shown in Table 1, let $w = (w_1, w_2)$ denote the given word pair, $c = (c_1, c_2)$ the two different contexts, and $s = (s_1, s_2)$ the human-annotated similarity scores of $w$ for each context $(c_1, c_2)$. $s_i$ is annotated as an integer number in the range of $[0, 10]$. The higher the value is, the more similar $w_1$ and $w_2$ are.
Given $w$ and $c$, Subtask 1 aims at predicting $d = s_2 - s_1$, which expresses the change in similarity scores caused by contexts. The metric of Subtask 1 is the Pearson correlation coefficient between gold labels and predictions. Due to the translational and scale invariance of Pearson correlation, we can use the $[-1, +1]$ range instead of $[0, 10]$. Using the same input as Subtask 1, Subtask 2 aims at predicting $s_1$ and $s_2$ directly. The evaluation metric is the uncentered Pearson correlation coefficient; thus, we can use any range in $\mathbb{R}$ as well.
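The invariance argument for Subtask 1 can be checked numerically. The following is a minimal sketch in plain Python; the gold values and predictions are made-up numbers for illustration, not task data, and the helper `pearson` is our own illustrative implementation of the centered Pearson coefficient.

```python
import math

def pearson(xs, ys):
    """Centered Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical gold changes d = s2 - s1 (annotation scale) and
# hypothetical predictions on a cosine-like scale.
gold = [5.0, -2.0, 0.0, 3.0, -4.0]
pred = [0.30, -0.05, 0.02, 0.10, -0.25]

# Apply a positive rescaling and a shift to the predictions.
rescaled = [10.0 * p + 3.0 for p in pred]

r_raw = pearson(gold, pred)
r_rescaled = pearson(gold, rescaled)
```

Both coefficients agree up to floating-point error, which is why predictions on the cosine scale can be scored against gold labels on the annotation scale without any calibration.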
We take a two-stage approach: (i) we first solve Subtask 2 by predicting $s_1$ and $s_2$ directly, and (ii) we then solve Subtask 1 by calculating $d$ from the predicted $s_1$ and $s_2$.

Model
As mentioned above, we exhaustively explore similarity measures in the representation spaces of Transformers to find those that represent the human sense of word similarity well. We introduce a cosine similarity-based measure and then take a "layer-wise" exploration strategy.

Transformer Similarity
Because the input contexts are plain text, we apply two-level tokenization (i.e., word-level tokenization and subword-level tokenization) to each context $c_1$ and $c_2$ and then feed the subword-level tokens to a Transformer-based language model to get contextual word vectors $e^{(\tau,\lambda)}(w_i, c_j)$, where $e^{(\tau,\lambda)}(w, c)$ represents the representation vector of word $w$ in context $c$, taken from the $\lambda$-th layer of the given Transformer-based language model $\tau$. To get a word-level token representation, we take an average of all the representation vectors of the corresponding subword-level tokens.
To calculate the similarity between two words, we take the cosine similarity (written as "sim") between the corresponding word vectors. The cosine-similarity scores between the contextualized vectors are computed as follows.
$$s_j^{(\tau,\lambda)} = \mathrm{sim}\left(e^{(\tau,\lambda)}(w_1, c_j),\ e^{(\tau,\lambda)}(w_2, c_j)\right), \quad j \in \{1, 2\}$$
$$d^{(\tau,\lambda)} = s_2^{(\tau,\lambda)} - s_1^{(\tau,\lambda)}$$
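The computation described in this subsection (average the subword vectors of each word, then take the cosine similarity within each context) can be sketched as follows. This is an illustrative sketch only: the three-dimensional vectors stand in for the hidden states of one layer, and the names `mean_pool` and `cosine_sim` are hypothetical helpers, not part of our released code or of any library.

```python
import math

def mean_pool(subword_vectors):
    """Average a word's subword vectors into a single word vector."""
    n = len(subword_vectors)
    dim = len(subword_vectors[0])
    return [sum(vec[k] for vec in subword_vectors) / n for k in range(dim)]

def cosine_sim(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for the layer outputs of one (model, layer) pair.
# In context c1, w1 is split into two subword tokens; the rest are single tokens.
w1_in_c1 = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]
w2_in_c1 = [[2.0, 2.0, 2.0]]
w1_in_c2 = [[1.0, 0.0, 0.0]]
w2_in_c2 = [[0.0, 1.0, 0.0]]

s1 = cosine_sim(mean_pool(w1_in_c1), mean_pool(w2_in_c1))  # Subtask 2 score for c1
s2 = cosine_sim(mean_pool(w1_in_c2), mean_pool(w2_in_c2))  # Subtask 2 score for c2
d = s2 - s1                                                # Subtask 1 prediction
```

In this toy case the two words point in the same direction in the first context and are orthogonal in the second, so the predicted change in similarity is negative.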

Exploration on Representation Space
As described above, each combination of $(\tau, \lambda)$ induces a similarity measure. Therefore, we define the exploration space $\Theta = \{(\tau, \lambda) \mid \tau \in \mathrm{Transformers},\ \lambda \in \mathrm{Layers}(\tau)\}$, where $\mathrm{Transformers}$ represents a set of Transformer-based language model types, and $\mathrm{Layers}(\tau)$ represents the set of layer indices that the Transformer-based language model $\tau$ contains. We investigate $\Theta$ to find the element that best approximates the gold similarity. Details on $\mathrm{Transformers}$ and $\mathrm{Layers}$ are given in Section 5.
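The search over the exploration space amounts to scoring every (model, layer) pair on the development data and ranking the pairs. A minimal sketch, assuming the per-pair predictions have already been computed; the model names, layer indices, and all numbers below are placeholders rather than our actual configurations or results.

```python
import math

def pearson(xs, ys):
    """Centered Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder development-set gold scores and predictions keyed by (model, layer).
dev_gold = [7.0, 2.0, 5.0, 1.0]
dev_predictions = {
    ("model_a", 4): [0.2, 0.6, 0.3, 0.5],
    ("model_a", 8): [0.9, 0.3, 0.7, 0.2],
    ("model_b", 6): [0.5, 0.4, 0.6, 0.3],
}

# Score every element of the exploration space against the gold labels.
scores = {theta: pearson(dev_gold, preds) for theta, preds in dev_predictions.items()}

# Rank the (model, layer) pairs from best to worst and keep the best one.
ranking = sorted(scores, key=scores.get, reverse=True)
best = ranking[0]
```

The same ranking is what a rank-based ensemble would consume, since it orders the similarity measures by their development-set performance.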

Rank-Weighted Voting for English
A relatively large amount of annotated data is available for the English task. This enables us to tune a slightly more complicated system for better predicting gold labels. Therefore, we decided to build a special system for the English task that utilizes the multiple predictions made from different similarity measures in $\Theta$ for more robust predictions.
First, we sort the predictions of different similarity measures in $\Theta$ in the order of the overall performance on the development data, that is, in descending order of the Pearson coefficients between the predictions and gold labels. Let $y_r$ denote the prediction on a given sample made by the $r$-ranked similarity measure. Concretely, $y_r$ corresponds to $d^{(\tau_r,\lambda_r)}$ in Subtask 1, and to $s_1^{(\tau_r,\lambda_r)}$ and $s_2^{(\tau_r,\lambda_r)}$ in Subtask 2. We calculate the rank-decayed weighted average of the predictions:
$$y = \sum_r \omega(r) F(y_r), \quad \omega(r) = \exp(-r/R)$$
where $R$ is a tunable hyperparameter representing the rank decay rate, and $F$ is a non-linear transformation function. Note that the non-linearity of $F$ is a significant property. Using a linear transformation function is equivalent to only taking weighted n-best predictions, which have a smaller Pearson coefficient (see footnote 1).
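The rank-weighted vote can be sketched in a few lines. The sketch below makes several assumptions that the text does not fix: ranks start at 1, the sum is left unnormalized, and the inputs are positive so that a logarithm can be used as the transformation. The function name `rank_weighted_vote` and the example values are hypothetical.

```python
import math

def rank_weighted_vote(ranked_predictions, R, F=math.log):
    """Combine predictions ordered from best-ranked to worst-ranked.

    ranked_predictions[0] is assumed to come from the rank-1 measure.
    Each prediction is transformed by F and weighted by exp(-r / R).
    """
    return sum(
        math.exp(-r / R) * F(y_r)
        for r, y_r in enumerate(ranked_predictions, start=1)
    )

# Hypothetical positive predictions for one sample from the top three measures.
sample_predictions = [0.8, 0.6, 0.9]

# A small R concentrates the weight on the best-ranked measures;
# a large R spreads it more evenly over the ranking.
y_sharp = rank_weighted_vote(sample_predictions, R=0.5)
y_flat = rank_weighted_vote(sample_predictions, R=50.0)
```

Because the Pearson coefficient is unaffected by a positive rescaling of the predictions, leaving the weights unnormalized does not change the evaluation score under this sketch.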

Experimental Setup
We employed six types of Transformer-based language models, as shown in Table 2. For the non-English languages, we used multilingual models, namely, multilingual BERT (

1 Let $y$ denote a linear combination of uncorrelated stochastic variables: $y[\omega] = \sum_r \omega(r) y_r$. Let $l$ denote another stochastic variable. In our case, $y_r$ is the prediction of the $r$-th ranked similarity measure and $l$ is the gold label. By simple calculation, we can show that the Pearson coefficient $\mathrm{Pearson}(y[\omega], l)$ takes its maximum when $\omega(r)$ is taken as follows.
$\mathrm{Pearson}(y_{r_0}, l) \geq \mathrm{Pearson}(y_r, l)$
Although, in our case, $y_r$ does have correlations if taken from different layers of the same Transformer, the correlations may originate from rather trivial degeneracy, which we do not want the system to rely on.
We employed the log function as the scaling function F for English, which performed best on the development data.
All of the experimental code was implemented with PyTorch (Paszke et al., 2019) and jiant (Pruksachatkun et al., 2020). jiant is a recently developed transfer learning framework, which in turn utilizes Hugging Face's library (Wolf et al., 2019) for Transformer-based language models and their tokenizers.

Table 4 and Table 5 present the official ranking of the subtasks. We submitted the similarity measures described in Table 3, which performed the best on the test data among the fixed number of trials (see footnote 2).

2 The similarity measures that performed the best on the development data were selected for the trials.
Interestingly, for all the non-English language tasks for which we employed both multilingual-BERT and XLM-RoBERTa, multilingual-BERT outperformed XLM-RoBERTa. Furthermore, for Subtask 1, the 8th layer outperformed the other layers submitted for the trials.

Which Layer Approximates the Human Sense of Similarity Better?
Figure 1 shows heatmaps of Pearson coefficients between the gold labels and the predictions made by two of the Transformer-based language models (i.e., multilingual-BERT and XLM-RoBERTa), calculated on the development data. We also show the layer-wise averages over the languages. Note that the Finnish results are not shown because no development data was distributed.
We can see from Figure 1 that better approximations of word sense similarity can be induced in the middle to upper layers of Transformers for most of the languages. This is also consistent with the intended design of the multi-layered self-attention mechanism, which aims to obtain more contextualized word representations on the upper layers.
In more detail, multilingual-BERT and XLM-RoBERTa show different characteristics. For multilingual-BERT, performance appears to increase with layer depth. For XLM-RoBERTa, the middle layers tend to perform better than the other layers. This implies that different Transformer language models capture word similarity differently.

Conclusion
In this paper, we proposed a model for the task of capturing the effects of context on word similarity. We employed similarity measures induced by the hidden layer representation vectors of pre-trained Transformer-based language models. We explored all the layers of the models to find the one that matches human perception the best.
Our experimental results show that the multi-layered self-attention mechanism of Transformer-based language models successfully captures the human sense of context-dependent word similarity. The results also revealed a language-independent characteristic: for all the Transformer-based language models, the middle to upper layers perform better on the task than the others.