A Cluster-based Approach for Improving Isotropy in Contextual Embedding Space

The representation degeneration problem in Contextual Word Representations (CWRs) hurts the expressiveness of the embedding space by forming an anisotropic cone in which even unrelated words have excessively positive correlations. Existing techniques for tackling this issue require a learning process to re-train models with additional objectives and mostly employ a global assessment to study isotropy. Our quantitative analysis of isotropy shows that a local assessment can be more accurate due to the clustered structure of CWRs. Based on this observation, we propose a local, cluster-based method to address the degeneration issue in contextual embedding spaces. We show that in clusters containing punctuation marks and stop words, the local dominant directions encode structural information, and removing them can improve the performance of CWRs on semantic tasks. Moreover, we find that tense information in verb representations dominates sense-level semantics. We show that removing the dominant directions of verb representations can transform the space to better suit semantic applications. Our experiments demonstrate that the proposed cluster-based method can mitigate the degeneration problem on multiple tasks.


Introduction
Despite their outstanding performance, CWRs are known to suffer from the so-called representation degeneration problem that makes the embedding space anisotropic (Gao et al., 2019). In an anisotropic embedding space, word vectors are distributed in a narrow cone, in which even unrelated words are deemed to have high cosine similarities. This undesirable property hampers the representativeness of the embedding space and limits the diversity of encoded knowledge (Ethayarajh, 2019).
To better understand the representation degeneration problem in pre-trained models, we analyzed the embedding space of GPT-2 (Radford et al., 2019), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019). We found that, despite being extremely anisotropic in all non-input layers from a global perspective, the embedding space is significantly more isotropic from a local point of view (when embeddings are clustered and each cluster is made zero-mean). Motivated by this observation, and based on previous studies that highlight the clustered structure of CWRs (Reif et al., 2019; Michael et al., 2020), we extend the technique of Mu and Viswanath (2018) with a further clustering step. In our proposal, we cluster embeddings and apply PCA to individual clusters to find the corresponding principal components (PCs), which indicate the dominant directions for each specific cluster. Nulling out these PCs for each cluster renders a more isotropic space. We evaluated our cluster-based method on several tasks, including Semantic Textual Similarity (STS) and Word-in-Context (WiC). Experimental results indicate that our cluster-based method is effective in enhancing the isotropy of different CWRs, as reflected by significant performance improvements on multiple evaluation benchmarks.
In addition, we provide an analysis of the reasons behind the effectiveness of our cluster-based technique. The empirical results show that most clusters contain punctuation tokens, such as periods and commas. The PCs of these clusters encode structural information about the context, such as sentence style; hence, removing them can improve the performance of CWRs on semantic tasks. A similar structure exists in other clusters containing stop words. The other important observation concerns the distribution of verbs in the contextual embedding space. Our experiments reveal that verb representations are separated along the tense dimension into distinct sub-spaces. This brings about an unwanted peculiarity in the semantic space: representations for different senses of a verb tend to be closer to each other than representations for the same sense that are associated with different tenses of the same verb. Indeed, removing such PCs improves the model's ability in downstream tasks with a dominant semantic flavor.

Isotropy in CWRs
Isotropy is a desirable property of word embedding spaces and arguably of any other vector representation of data in general (Huang et al., 2018; Cogswell et al., 2016). From the geometric point of view, a space is called isotropic if the vectors within that space are uniformly distributed in all directions. A lack of isotropy in the embedding space affects not only the optimization procedure (e.g., the model's accuracy and convergence time) but also the expressiveness of the embedding space; hence, improving the isotropy of the embedding space can lead to performance improvements (Ioffe and Szegedy, 2015).
We measure the isotropy of the embedding space using the partition function of Arora et al. (2016):

Z(c) = \sum_{i=1}^{N} \exp(c^\top w_i),

where c is a unit vector, w_i is the embedding for the i-th word in the embedding matrix W ∈ IR^{N×D}, N is the number of words in the vocabulary, and D is the embedding size. Arora et al. (2016) showed that Z(c) can be approximated by a constant for isotropic embedding spaces. Therefore, for the set C of eigenvectors of W^\top W, the following measure would be close to one for a perfectly isotropic space (Mu and Viswanath, 2018):

I(W) = \frac{\min_{c \in C} Z(c)}{\max_{c \in C} Z(c)}.
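As a minimal sketch, I(W) as defined above can be computed in a few lines of NumPy (an illustrative reimplementation, not the paper's own code; eigenvector signs follow NumPy's conventions):

```python
import numpy as np

def isotropy(W):
    """Approximate isotropy I(W) of an (N x D) embedding matrix W,
    following Mu and Viswanath (2018): the ratio of the minimum to the
    maximum of the partition function Z(c) over eigenvectors c of W^T W."""
    # Columns of V are unit eigenvectors of the (D x D) Gram matrix W^T W.
    _, V = np.linalg.eigh(W.T @ W)
    # Z(c) = sum_i exp(c^T w_i), computed for every eigenvector at once.
    Z = np.exp(W @ V).sum(axis=0)
    return Z.min() / Z.max()
```

For a perfectly isotropic space the ratio approaches 1; a shared dominant direction (e.g., a large common mean vector) drives it toward 0.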
2.1 Analyzing Isotropy in pre-trained CWRs
Using the above metric, we analyzed the representation degeneration problem both globally and locally.
Global assessment. We quantified isotropy in all layers of GPT-2, BERT, and RoBERTa on the development set of STS-Benchmark (Cer et al., 2017). Figure 1 shows the trend of isotropy across layers based on I(W). Clearly, all CWRs are extremely anisotropic in all non-input layers. While the isotropy of GPT-2 decreases consistently in the upper layers, that of RoBERTa has a semi-convex form in which the last layer (excluding the input layer) has the highest isotropy. Also, interestingly, the input layer of GPT-2 is more isotropic than those of the other two models. This observation contradicts what has been previously reported by Ethayarajh (2019).
Local assessment. In light of the clustered structure of the embedding space in CWRs (Reif et al., 2019), we carried out a local investigation of isotropy. To this end, we clustered the space using k-means and measured isotropy after making each cluster zero-mean (Mu and Viswanath, 2018). Table 1 shows the results for different numbers of clusters (each being the average of five runs). When the embedding space is viewed closely, the distribution of CWRs is notably more isotropic. Clustering significantly enhances isotropy for BERT and RoBERTa, making their embedding spaces almost isotropic. However, GPT-2 is still far from being isotropic. This contradicts the observation of Cai et al. (2021). A possible explanation for these contradictions is the metric used by Ethayarajh (2019) and Cai et al. (2021) for measuring isotropy: cosine similarity. Randomly sampled words in an anisotropic embedding space should have high cosine similarities (a near-zero similarity denotes isotropy). However, there are exceptional cases where this does not hold: an anisotropic embedding space in which sampled words nevertheless have near-zero cosine similarities. In Figure 2, we illustrate the GPT-2 embedding space as an example of such a case. Making individual clusters zero-mean (bottom) improves isotropy over the baseline (top). However, the embeddings are still far from being uniformly distributed in all directions; instead, they are distributed around a horizontal line. This leads to a near-zero cosine similarity for randomly sampled words even though the embedding space is anisotropic. Hence, cosine similarity might not be a proper metric for computing isotropy.
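The local assessment can be sketched as follows (an illustrative NumPy version; the toy Lloyd's-algorithm k-means below stands in for a standard implementation such as scikit-learn's, and the `isotropy` helper is the I(W) measure defined earlier):

```python
import numpy as np

def isotropy(W):
    # I(W): min/max of the partition function over eigenvectors of W^T W.
    _, V = np.linalg.eigh(W.T @ W)
    Z = np.exp(W @ V).sum(axis=0)
    return Z.min() / Z.max()

def kmeans(X, k, iters=50):
    # Minimal Lloyd's algorithm with a naive deterministic initialization;
    # in practice one would prefer k-means++ (e.g., scikit-learn's KMeans).
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        centers = np.stack([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

def local_isotropy(W, k):
    """Zero-mean each k-means cluster, then measure isotropy of the
    recombined space (the 'local assessment' described above)."""
    labels = kmeans(W, k)
    W = W.copy()
    for j in range(k):
        mask = labels == j
        if mask.any():
            W[mask] = W[mask] - W[mask].mean(0)
    return isotropy(W)
```

On synthetic data with two well-separated Gaussian clusters, the global measure is tiny while the local measure is close to one, mirroring the gap between the global and local assessments reported in Table 1.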

Cluster-based Isotropy Enhancement
The degeneration problem in the embedding space can be attributed to the training procedure of the underlying models, which are often language models trained through likelihood maximization with the weight-tying trick (Gao et al., 2019). Maximizing the likelihood of a specific word embedding (and minimizing that of others) requires pushing it towards the direction of the corresponding hidden state, which results in the accumulation of the learnt word embeddings into a narrow cone. Previous work has shown that nulling out the dominant directions of an anisotropic embedding space can make the space isotropic and improve its expressiveness (Mu and Viswanath, 2018). We refer to this as the global approach. This method was proposed for static embeddings and hence might not be optimal for contextual embeddings, especially given that the latter tend to have a clustered structure. For instance, recent work suggests that word types (e.g., verbs, nouns, punctuation), entities (e.g., personhood, nationalities, and dates), and even word senses (Michael et al., 2020; Loureiro et al., 2021; Reif et al., 2019) form distinct local clusters in the contextual embedding space. Moreover, our local assessment shows that it is not necessarily the case that all clusters share the same dominant directions. Hence, discarding dominant directions that are computed globally is not effective for removing locally degenerated directions. Consequently, it is more logical to drop dominant directions on a per-cluster basis.
Based on these observations, we propose a cluster-based approach for isotropy enhancement. Specifically, instead of determining dominant directions globally, we obtain them separately for different sub-spaces and discard, for each cluster, only the corresponding cluster-specific dominant directions. To this end, we employ Principal Component Analysis (PCA) to compute local dominant directions in clusters. Geometrically, principal components (PCs) represent those directions in which embeddings have the most variance (maximum elongation). In our proposed method, we first cluster word embeddings using a simple k-means algorithm. After making each cluster zero-mean, the top PCs of every cluster are removed separately. Adding a clustering step helps us eliminate the local dominant directions of each cluster. We will show in Section 5 that different linguistic knowledge is encoded in the dominant directions of various clusters. Moreover, numerical results show that, in comparison with the global approach, our method can make the embedding space more isotropic even when fewer PCs are nulled out.

Experimental Setup
For the classification tasks, we limit our experiments to BERT and extract features to train an MLP. Further details on the datasets and system configuration can be found in Appendix B. We benchmark our cluster-based approach against the pre-trained CWRs (baseline) and the global method. As mentioned before, the global method is similar to ours in its elimination of a few top dominant directions, with the difference that these directions are computed globally (in contrast to our local cluster-based computation). The best setting for each model is selected based on performance on the STS-B dev set. The reported results are the average of five runs.
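The per-cluster PC-removal step of the cluster-based method can be sketched as follows (a minimal NumPy version; the `labels` array is assumed to come from any clustering, e.g. k-means, and `d` is the number of PCs to drop per cluster):

```python
import numpy as np

def remove_local_dominant_directions(W, labels, d):
    """For each cluster: zero-mean it, then project out the top-d principal
    components computed on that cluster alone (a sketch of the cluster-based
    method; PCs are obtained via SVD of the centered cluster)."""
    W_iso = W.astype(float).copy()
    for j in np.unique(labels):
        mask = labels == j
        X = W_iso[mask] - W_iso[mask].mean(0)
        # Rows of Vt are the principal components, sorted by variance.
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        pcs = Vt[:d]
        # Null out the cluster-specific dominant directions.
        W_iso[mask] = X - (X @ pcs.T) @ pcs
    return W_iso
```

Because the SVD is recomputed inside each cluster, each sub-region loses only its own elongated directions rather than a single set of globally dominant ones.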

Results
Tables 2 and 3 report the experimental results. As can be seen, globally increasing isotropy yields a significant improvement for all three pre-trained models. However, our cluster-based approach achieves notably higher performance than the global approach. We attribute this improvement to our cluster-specific discarding of dominant directions. Both the global and cluster-based methods null out the optimal number of top dominant directions (tuned separately, cf. Appendix B), but the latter identifies them based on the specific structure of a sub-region in the embedding space (which might not be similar to other sub-regions).

Discussion
In this section, we provide a brief explanation of the reasons behind the effectiveness of the cluster-based approach by investigating the linguistic knowledge encoded in the local dominant directions. We also show that enhancing isotropy reduces convergence time.

Linguistic knowledge
Punctuation and stop words. We observed that the local dominant directions for the clusters of punctuation marks and stop words carry structural and syntactic information about the sentences in which they appear. For example, the two sentences "A man is crying." and "A woman is dancing." from STS-B do not have much in common in terms of semantics but are highly similar with respect to their style. To quantitatively analyze the distribution of this type of token in CWRs, we designed an experiment based on the dataset created by Ravfogel et al. (2020). The dataset consists of groups in which sentences are structurally and syntactically similar but have no semantic similarity. We picked 200 different structural groups, each containing six semantically different sentences. Then, using the k-NN algorithm, we calculated the percentage of the k nearest neighbours that are in the same group, before and after removing local dominant directions. We evaluated this for the period and comma, which are the most frequent punctuation marks, and for "the" and "of", as the most contextualized stop words (Ethayarajh, 2019). The results reported in Figure 3 show that the representations for punctuation marks and stop words are biased toward the structural and syntactic information of sentences; hence, removing their dominant directions reduces the number of same-group nearest neighbours. The improvement from our local isotropy enhancement can be partially attributed to attenuating this type of bias.
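The same-group nearest-neighbour measure can be sketched as follows (an illustrative NumPy version with a brute-force distance computation; the function name is our own choice, not the paper's code):

```python
import numpy as np

def same_group_knn_rate(X, groups, k=5):
    """Fraction of each point's k nearest neighbours (Euclidean) that share
    its group label -- the structural-bias probe described above."""
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k nearest points
    return float((groups[nn] == groups[:, None]).mean())
```

A rate close to 1 before PC removal and a lower rate afterwards would indicate, as in Figure 3, that the removed directions carried the group (structural) signal.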
Verb Tense. Our experiments show that tense is more dominant in verb representations than sense-level semantic information.
To examine this hypothesis precisely, we used SemCor (Miller et al., 1993), a dataset comprising around 37K sense-annotated sentences. We collected representations for polysemous verbs that had at least two senses occurring a minimum of 10 times. Then, for each individual verb, we calculated the Euclidean distance to the contextual representation of the same verb: (1) with the same tense and the same meaning, (2) with the same tense but a different meaning, and (3) with a different tense and the same meaning. The experimental results reported in Table 4 confirm the hypothesis and show the effectiveness of the cluster-based approach in bringing together verb representations that correspond to the same sense, even if they have different tenses.
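The distance comparison above can be sketched as a small helper (a hypothetical snippet; the grouping of verb occurrences by tense and sense is assumed to come from the SemCor annotations):

```python
import numpy as np

def mean_pairwise_dist(A, B):
    """Mean Euclidean distance between all pairs drawn from two sets of
    contextual representations (e.g., same-tense/same-sense occurrences of a
    verb vs. different-tense/same-sense occurrences)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return float(d.mean())
```

Comparing this statistic across the three occurrence groupings reproduces the kind of contrast summarized in Table 4: if tense dominates, condition (2) yields smaller distances than condition (3).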

Convergence time
In the previous experiments, we showed that the contextual embeddings are extremely anisotropic and highly correlated. Such embeddings can slow down the learning process of deep neural networks. Figure 4 shows the trend of convergence for the BoolQ and RTE tasks (dev sets). By decreasing the correlation between embeddings, our method can reduce convergence time.

Conclusions
In this paper, we proposed a cluster-based method to address the representation degeneration problem in CWRs. We empirically analyzed the effect of clustering and showed that, from a local perspective, most clusters are biased toward structural information. Moreover, we found that verb representations are distributed into distinct sub-spaces based on their tense. We evaluated our method on different semantic tasks, demonstrating its effectiveness in removing local dominant directions and improving performance. As future work, we plan to study the effect of fine-tuning on isotropy and on the linguistic knowledge encoded in local regions.
A Isotropy statistics
Table 5 shows isotropy statistics for GPT-2, BERT, and RoBERTa. GPT-2's embedding space is extremely anisotropic in the upper layers. Hence, compared to BERT and RoBERTa, more PCs need to be eliminated to make this embedding space isotropic, both in the cluster-based approach and in the global one (Mu and Viswanath, 2018). Also, in almost all layers, BERT has a higher isotropy than RoBERTa.

B.1 Datasets
STS-B. In the Semantic Textual Similarity task, the provided labels range from 0 to 5 for each sentence pair. We first calculate sentence embeddings by averaging all word representations in each sentence and then compute the cosine similarity between the two sentence representations as the score of semantic relatedness of the pair.
RTE. The Recognizing Textual Entailment dataset is a classification task from the GLUE benchmark. Paired sentences are collected from different textual entailment challenges and labeled as entailment or not-entailment.
CoLA. The Corpus of Linguistic Acceptability (Warstadt et al., 2019) is a binary classification task in which sentences are labeled according to whether they are grammatically acceptable.
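The mean-pooling-and-cosine scoring described above for STS-B can be sketched as (an illustrative snippet; the function and variable names are our own):

```python
import numpy as np

def sts_score(sent1_vecs, sent2_vecs):
    """Semantic-relatedness score for a sentence pair: cosine similarity
    between the mean-pooled word representations of each sentence."""
    e1 = np.asarray(sent1_vecs).mean(axis=0)
    e2 = np.asarray(sent2_vecs).mean(axis=0)
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```

The resulting scores are then correlated with the gold 0-5 labels, as is standard for STS evaluation.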

B.2 Configurations
For the classification tasks, we trained a simple MLP on the features extracted from BERT. The proposed cluster-based approach has two hyperparameters: the number of clusters and the number of PCs to be removed. We selected both from the range [5, 30] and tuned them on the STS-B dev set.
In the cluster-based approach, the optimal numbers of clusters for GPT-2, BERT, and RoBERTa are 10, 27, and 27, respectively. For BERT and RoBERTa, the 12 top dominant directions were removed, while for GPT-2 the number is 30, owing to its extremely anisotropic embedding space. The number of PCs to be eliminated in the global method was tuned in the same way as for the cluster-based approach (on the STS-B dev set): 30, 15, and 25 for GPT-2, BERT, and RoBERTa, respectively.

C Isotropy on STS datasets
In Table 6, we present the isotropy of the contextual embedding spaces calculated using I(W) on the STS benchmark. The results reveal the effectiveness of the proposed method in enhancing the isotropy of the embedding space.

D Word frequency bias in CWRs
CWRs are biased towards frequency information, and words with similar frequency create local regions in the embedding space (Gong et al., 2018; Li et al., 2020). From the semantic point of view, this is certainly undesirable, given that words with similar meanings but different frequencies could be located far from each other in the embedding space. This phenomenon can be seen in Figure 5. The knowledge encoded in the local dominant directions partly corresponds to frequency information.
The embedding space visualization reveals that our approach does a decent job of removing the frequency bias in pre-trained models.