Changing the Basis of Contextual Representations with Explicit Semantics

The application of transformer-based contextual representations has become a de facto solution for solving complex NLP tasks. Despite their successes, such representations are arguably opaque, as their latent dimensions are not directly interpretable. To alleviate this limitation of contextual representations, we devise an algorithm whose output representation expresses human-interpretable information along each dimension. We achieve this by constructing a transformation matrix based on the semantic content of the embedding space and predefined semantic categories using the Hellinger distance. We evaluate our inferred representations on the supersense prediction task. Our experiments reveal that the interpretable nature of transformed contextual representations makes it possible to accurately predict the supersense category of a word by simply identifying its transformed coordinate with the largest coefficient. We quantify the effects of our proposed transformation when applied to traditional dense contextual embeddings. We additionally investigate the integration of sparse contextual word representations into our proposed algorithm and report consistent improvements.


Introduction
In recent years, contextual word representations - such as BERT (Devlin et al., 2019) or GPT-3 (Brown et al., 2020) - have dominated the NLP landscape on leaderboards such as SuperGLUE (Wang et al., 2019) as well as in real-world applications (Alloatti et al., 2019). These models gain their semantics-related capabilities during pre-training and can then be fine-tuned for downstream tasks, including question answering (Raffel et al., 2019; Garg et al., 2019) or text summarization (Savelieva et al., 2020; Yan et al., 2020).
Representations obtained by transformer-based language models carry context-sensitive semantic information. Although the semantic information is present in the embedding space, its interpretation and the exact information it carries are convoluted. Hence, understanding these representations and drawing conclusions from them is a cumbersome process for humans. Here we devise a transformation that explicitly expresses the semantic information in the basis of the embedding space. In particular, we express the captured semantic information as finite sets of linguistic properties, which we call semantic categories. A semantic category can represent any arbitrary concept. In this paper, we define them according to WordNet (Miller, 1995) LexNames (sometimes also referred to as supersenses).
Even though we present our work on the supersense prediction task, our proposed methodology can also be naturally extended to settings that exploit a different inventory of semantic categories. Our results also provide insights into the inner workings of the original embedding space, since we infer the semantic information from embedding spaces in a transparent manner. Therefore, the amplified information can be assigned to the basis of the original embedding space.
Sparse representations convey the encoded semantic information in a more explicit manner, which facilitates the interpretability of such representations (Murphy et al., 2012;Balogh et al., 2020). Feature norming studies also illustrated the sparse nature of human feature descriptions, i.e. humans tend to describe objects and concepts with only a handful of properties (Garrard et al., 2001;McRae et al., 2005). Hence, we also conduct experiments utilizing sparse representations obtained from dense contextualized embeddings.
The transformation that we propose in this paper was inspired by Şenel et al. (2018), but it has been extended in various important aspects, as we
• also utilize sparse representations to amplify semantic information,
• analyze several contextual embedding spaces,
• apply a whitening transformation on the embedding space to decorrelate semantic features, which also serves as the standardization step,
• evaluate the strength of the transformation in a different manner, on the supersense prediction task.

Related Work
Contextual word representations provide a solution for context-aware word vector generation. These deep neural language models - such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) or GPT-3 (Brown et al., 2020) - are pre-trained on unsupervised language modelling tasks and later fine-tuned for downstream NLP tasks. Several variants were proposed to address one or more issues of the BERT model, some of which we exploit in this paper: RoBERTa introduced a better pre-training process, Sanh et al. (2019) reduced the number of parameters, and Conneau et al. (2020) presented a multilingual model. These models form the basis of our approach, since we produce interpretable representations by measuring the semantic content of existing representations. One way to measure the morphological and semantic content of contextual word embeddings is via probing approaches. Their premise is that if the probed information can be identified by a linear classifier, then the information is encoded in the embedding space (Adi et al., 2016; Ettinger et al., 2016; Klafka and Ettinger, 2020). Others explored the capacity of language models by examining the output probabilities of the model in given contexts (Linzen et al., 2016; Wilcox et al., 2018; Marvin and Linzen, 2018; Goldberg, 2019). We echo the premise of these methodologies by introducing a logistic regression baseline model.
Another approach is to incorporate external knowledge into language models. Levine et al. (2020) devised SenseBERT by integrating supersense information into the training of BERT. K M et al. (2018) showed a method where an arbitrary knowledge graph can be incorporated into their LSTM-based model. External knowledge incorporation is becoming a popular approach to improve already existing state-of-the-art solutions in domain- or task-specific environments (Munkhdalai et al., 2015; Weber et al., 2019; Baral et al., 2020; Mondal, 2020; Wise et al., 2020; Murayama et al., 2020). Since we aimed to investigate the effect of incorporated knowledge on the semantic content of the embedding space, SenseBERT serves as a good basis for that. Ethayarajh (2019) investigated the importance of the anisotropic property of contextual embeddings, which is a different kind of investigation than ours, but it still gives good insight into the inner workings of the layers. Şenel et al. (2018) showed a method to measure the interpretability of GloVe embeddings, and later a method to manipulate and improve the interpretability of a given static word representation (Şenel et al., 2020). Our approach resembles Şenel et al. (2018); however, we apply different pre- and post-processing steps and, more importantly, we replaced the Bhattacharyya distance with the closely related Hellinger distance, which operates in a bounded and continuous manner. Our approach also differs from Şenel et al. (2018) in that we deal with contextualized language models instead of static word embeddings, and we also rely on sparse contextualized word vectors.
The intuition behind sparse vectors is related to the way humans describe concepts, which has been extensively studied in various feature norming studies (Garrard et al., 2001; McRae et al., 2005). Additionally, generating sparse features (Kazama and Tsujii, 2003; Friedman et al., 2008; Mairal et al., 2009) has proved to be useful in several areas, including POS tagging (Ganchev et al., 2010), text classification (Yogatama and Smith, 2014) and dependency parsing (Martins et al., 2011). Therefore, several sparse static representations were presented: Murphy et al. (2012) proposed Non-Negative Sparse Embeddings to represent interpretable sparse word vectors, Park et al. (2017) showed a rotation-based method, and Subramanian et al. (2017) suggested an approach using a denoising k-sparse auto-encoder to generate sparse word vectors. Berend (2017) showed that sparse representations can outperform their dense counterparts in certain NLP tasks, such as NER or POS tagging. Additionally, Berend (2020) illustrated how sparse representations can boost the performance of contextual embeddings for Word Sense Disambiguation, which we also exploit.

Our Approach
We first define the necessary notation. We denote the embedding space by E ∈ R^{v×d}, with a superscript indicating whether it is obtained from the training set (t) or the evaluation set (e). We denote the number of input words and their dimensionality by v and d, respectively. Furthermore, we denote the transformation matrix by W ∈ R^{d×s} - where s represents the number of semantic categories - and the final interpretable representation by I ∈ R^{v×s}, which always denotes the interpretable representation of E^(e). Additionally, we denote the set of semantic categories by S.

Interpretable Representation
Our goal is to produce embedding spaces in which semantic features can be identified by their bases. In order to obtain such an embedding space, we construct a transformation matrix W^(t), which amplifies the semantic information of an input representation and can be formulated as I = E_w × W^(t), where E_w represents the whitened embedding space, i.e. the output of a pre-processing step (Section 3.2), and W^(t) is our transformation matrix (Section 3.3).
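In matrix form, the whole transformation amounts to a single product. A minimal sketch with hypothetical placeholder sizes (all values are random; E_w and W stand in for the whitened embeddings and the transformation matrix described in the following sections):

```python
import numpy as np

# Hypothetical sizes: v tokens, d latent dimensions, s semantic categories.
v, d, s = 4, 6, 3
rng = np.random.default_rng(0)

E_w = rng.normal(size=(v, d))   # whitened embedding space (pre-processing output)
W = rng.random(size=(d, s))     # transformation matrix (built in Section 3.3)

I = E_w @ W                     # interpretable representation
assert I.shape == (v, s)        # one coefficient per semantic category
```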

Pre-processing
Pre-processing consists of two steps: first we generate sparse representations of the dense embedding spaces (this step is omitted when we report on dense embedding spaces), then we whiten the embedding space.

Sparse Representation
For obtaining sparse contextualized representations, we follow the methodology proposed in (Berend, 2020). That is, we solve the sparse coding (Mairal et al., 2009) optimization problem

min_{D, α} 1/2 ||E − αD||²_F + λ||α||₁,

where D ∈ R^{k×d} is the dictionary matrix and α ∈ R^{v×k} contains the sparse contextualized representations. The two hyperparameters of the dictionary learning approach are the number of basis vectors to employ (k) and the strength of the regularization (λ). We obtained the sparse contextual representations for the words in the evaluation set by fixing the dictionary matrix D learned on the training set and optimizing solely for the sparse coefficients α^(e). We also report experimental results obtained for different values of the number of basis vectors k and the regularization coefficient λ.
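This two-stage scheme - learn D on the training representations, then solve only for the evaluation coefficients with D fixed - can be sketched with scikit-learn. The data, sizes and solver settings below are illustrative placeholders, not the exact configuration used in the paper:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

rng = np.random.default_rng(0)
E_train = rng.normal(size=(200, 16))   # toy stand-in for E^(t), shape (v, d)
E_eval = rng.normal(size=(50, 16))     # toy stand-in for E^(e)

k, lam = 32, 0.1                       # number of basis vectors, L1 penalty

# Learn the dictionary D and the sparse codes alpha^(t) jointly.
dl = DictionaryLearning(n_components=k, alpha=lam, max_iter=20,
                        transform_algorithm="lasso_lars",
                        transform_alpha=lam, random_state=0)
alpha_train = dl.fit_transform(E_train)            # shape (v, k)

# Fix D and solve solely for the evaluation-set coefficients alpha^(e).
coder = SparseCoder(dictionary=dl.components_,
                    transform_algorithm="lasso_lars", transform_alpha=lam)
alpha_eval = coder.transform(E_eval)               # shape (50, k)
```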
The output of this step is also denoted by E instead of α, since this step is optional. Among our results we mark whether we applied (Sparse) or skipped (Dense) this step.

Whitening
Since we handle dimensions independently, we first apply a whitening transformation to the embedding space. Several whitening transformations are known - such as Cholesky or PCA whitening (e.g. Friedman (1987)) - but we decided to rely on ZCA whitening (also known as Mahalanobis whitening) (Bell and Sejnowski, 1997). One benefit of employing ZCA whitening is that it ensures a higher correlation between the original and whitened features (Kessy et al., 2018). As a consequence, it is a widely utilized approach for obtaining whitened data in NLP (Heyman et al., 2019; Glavaš et al., 2019).
We determine the whitening transformation matrix from the training set (E^(t)), which is then applied to the representations of our training (E^(t)) and evaluation sets (E^(e)). We denote the whitened representations of the training and evaluation sets by E_w^(t) and E_w^(e), respectively.
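A possible sketch of this procedure: ZCA whitening is fitted on the training embeddings and then applied to both sets (the eps term is a hypothetical numerical safeguard against near-zero eigenvalues):

```python
import numpy as np

def zca_whiten(E_train, E_eval, eps=1e-8):
    """Fit ZCA (Mahalanobis) whitening on the training embeddings and
    apply the same transformation to training and evaluation embeddings."""
    mu = E_train.mean(axis=0)
    cov = np.cov(E_train - mu, rowvar=False)
    # Eigendecomposition of the covariance; W_zca = U diag(1/sqrt(l)) U^T.
    eigvals, U = np.linalg.eigh(cov)
    W_zca = U @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ U.T
    return (E_train - mu) @ W_zca, (E_eval - mu) @ W_zca

rng = np.random.default_rng(0)
Et = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))  # correlated toy data
Ee = rng.normal(size=(100, 8))
Et_w, Ee_w = zca_whiten(Et, Ee)
# The whitened training features have (approximately) identity covariance.
assert np.allclose(np.cov(Et_w, rowvar=False), np.eye(8), atol=1e-2)
```

Among the whitening variants, ZCA is the one whose output stays maximally correlated with the input features, which is why per-dimension semantics remain meaningful after the step.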

Transformation
In this section, we discuss the way we measure the semantic information of the embedding space and construct the linear transformation matrix (W).

Semantic Distribution
The coefficients of the contextual embeddings of words that belong to the same (super)sense category are expected to originate from the same distribution. Hence, it is reasonable to quantify the extent to which some semantic category is encoded along some dimension by investigating the distribution of the coefficients of the word vectors along that dimension. For every semantic category, we can partition the words according to whether they pertain to that category. When a dimension encodes a semantic category to a large extent, the distribution of the coefficients of the words belonging to that category is expected to differ substantially from that of the words not pertaining to it.
We can formulate the distributions of our interest via a function L : x → S, which maps each token x to its context-sensitive semantic category (LexName), and a function f : x → E, which returns the context-sensitive representation of x. Thus the devised distributions can be defined as

P_ij = {f(x)_i | L(x) = S_j} and Q_ij = {f(x)_i | L(x) ≠ S_j},

where i represents a dimension and j denotes a semantic category. In other words, P_ij represents the distribution along the ith dimension of those words that belong to the jth semantic category, whereas Q_ij represents the distribution of the coefficients along the same dimension (i) of those words that do not belong to the jth semantic category.
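In code, collecting samples from P_ij and Q_ij is a simple partition of the coefficients along dimension i. The data below are toy placeholders and L(x) is mocked by a label array:

```python
import numpy as np

# Coefficients along a single dimension i for five tokens, plus their
# supersense labels from a hypothetical labelling function L(x).
coeffs_dim_i = np.array([0.9, 1.1, -0.2, 1.0, -0.3])
labels = np.array(["noun.food", "noun.food", "verb.motion",
                   "noun.food", "verb.motion"])

category_j = "noun.food"
P_ij = coeffs_dim_i[labels == category_j]    # tokens in category j
Q_ij = coeffs_dim_i[labels != category_j]    # tokens outside category j

# A large gap between the two samples hints that dimension i encodes j.
assert P_ij.mean() > Q_ij.mean()
```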

Semantic Information and Transformation Matrix
For every dimension (i) and semantic category (j) pair, we can express the presence of semantic information by defining a distance between the distributions P_ij and Q_ij. By the construction of P_ij and Q_ij, the larger the distance between a pair of distributions (P_ij, Q_ij), the more likely that dimension i encodes semantic category j. Based on this observation, we define a transformation matrix W_D as

W_D[i, j] = D(P_ij, Q_ij),

where D is the distance function. We specify the distance function as the Hellinger distance, which can be formulated as

H(P_ij, Q_ij) = sqrt(1 − sqrt(2 σ^p_ij σ^q_ij / ((σ^p_ij)² + (σ^q_ij)²)) exp(−(μ^p_ij − μ^q_ij)² / (4 ((σ^p_ij)² + (σ^q_ij)²)))),

where we assume that P_ij ∼ N(μ^p_ij, σ^p_ij) and Q_ij ∼ N(μ^q_ij, σ^q_ij), i.e. they are samples from normal distributions with expected value μ and standard deviation σ. We decided to rely on the Hellinger distance due to its continuous, symmetric and bounded nature. In contrast to our approach, Şenel et al. (2018) proposed the usage of the Bhattacharyya distance - which is closely related to the Hellinger distance - but it would overestimate the certainty of the semantic information of a dimension in the case of distant distributions. Another concern is that the Bhattacharyya distance is unbounded. We discussed this topic in an earlier work (Ficsor and Berend, 2020) in relation to static word embeddings.
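The closed-form Hellinger distance between two univariate normal distributions, used to populate W_D, can be sketched as a direct transcription of the closed form for Gaussians:

```python
import numpy as np

def hellinger_normal(mu_p, sigma_p, mu_q, sigma_q):
    """Hellinger distance between N(mu_p, sigma_p^2) and N(mu_q, sigma_q^2).
    Bounded in [0, 1]: 0 for identical distributions, 1 in the limit of
    fully separated ones."""
    var_sum = sigma_p ** 2 + sigma_q ** 2
    # Bhattacharyya coefficient for two univariate Gaussians.
    bc = np.sqrt(2 * sigma_p * sigma_q / var_sum) * \
         np.exp(-0.25 * (mu_p - mu_q) ** 2 / var_sum)
    return np.sqrt(1 - bc)

assert hellinger_normal(0.0, 1.0, 0.0, 1.0) == 0.0   # identical -> 0
assert hellinger_normal(0.0, 1.0, 10.0, 1.0) > 0.99  # distant -> close to 1
```

The boundedness is visible in the second assertion: however far apart the means are, the distance saturates near 1 instead of diverging as the Bhattacharyya distance would.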
Bias Reduction. So far, our transformation matrix is biased due to the imbalanced semantic categories. This bias can be reduced by ℓ1-normalizing W_D in such a manner that the vectors representing semantic categories sum up to 1, which we denote as W_ND (Normalized Distance Matrix).
Directional Encoding. As semantic information can be encoded in both positive and negative directions, we modify the entries of W_ND as

W[i, j] = sign(μ^p_ij − μ^q_ij) · W_ND[i, j],

where sign(·) is the signum function. This modification ensures that each semantic category is represented with the highest coefficients in its corresponding basis of the interpretable representation.

Post-processing
The representations transformed in the above manner are still skewed in the sense that they do not reflect the likelihood of each semantic category. In order to alleviate that problem, we measure and normalize the frequency (f_N = f / ||f||₂, f ∈ N^s) of each supersense category in the training set and accumulate that information into the embedding space in the following manner: I_f = I + I ⊙ (1 f_N^T), where ⊙ represents element-wise multiplication and 1 ∈ R^v represents a vector consisting of all ones. Finally, I_f represents our final interpretable representation adjusted with supersense frequencies.
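The frequency adjustment can be sketched as follows (toy representation and hypothetical counts; NumPy broadcasting plays the role of the outer product with the all-ones vector):

```python
import numpy as np

# Toy interpretable representation I (v x s) and raw supersense counts f.
I = np.array([[0.2, 0.5, 0.1],
              [0.4, 0.1, 0.3]])
f = np.array([30.0, 10.0, 5.0])          # category counts from the training set

f_N = f / np.linalg.norm(f)              # f / ||f||_2
# I_f = I + I ⊙ (1 f_N^T): each coefficient is boosted in proportion
# to the training-set frequency of its semantic category.
I_f = I + I * f_N
assert I_f.shape == I.shape
```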

Accuracy Calculation
Representations generated by our approach let us determine the presumed semantic category as the one with the highest coefficient in the word vector. In other words, a word vector should have its highest coefficient in the basis that represents the same semantic category as its annotation in the evaluation set. Our overall accuracy is the fraction of correct predictions over the total number of annotated tokens in the evaluation set.
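Evaluation thus reduces to an argmax per word vector. A minimal sketch with a toy representation and gold labels given as category indices:

```python
import numpy as np

def supersense_accuracy(I_eval, gold_indices):
    """Predict the category with the largest coefficient per word vector
    and compare against the gold supersense annotations."""
    predictions = I_eval.argmax(axis=1)
    return float(np.mean(predictions == gold_indices))

I_eval = np.array([[0.1, 0.8, 0.2],
                   [0.6, 0.3, 0.4],
                   [0.2, 0.1, 0.9]])
gold = np.array([1, 0, 0])   # the third token is misclassified
assert abs(supersense_accuracy(I_eval, gold) - 2 / 3) < 1e-9
```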

Experimental Setting
During our experiments, we relied on the SemCor dataset for training and the unified word sense disambiguation framework introduced in (Raganato et al., 2017a) for evaluation, which consists of 5 sense-annotated corpora: SensEval2 (Edmonds and Cotton, 2001), SensEval3 (Mihalcea et al., 2004), SemEval 2007 Task 17 (Pradhan et al., 2007), SemEval 2013 Task 12 (Navigli et al., 2013), SemEval 2015 Task 13 (Moro and Navigli, 2015), and their concatenation. We refer to the combined dataset as ALL throughout the paper. The individual datasets contain 2282, 1850, 455, 1644 and 1022 sense annotations, respectively. These datasets contain fine-grained sense annotations for a subset of the words, from which the supersense information can be conveniently inferred. We reduced the scope of the fine-grained sense annotations to the lexname level, in order to maintain well-defined semantic categories with high sample sizes. We used the SemEval 2007 data as our development set in accordance with prior work (Raganato et al., 2017b; Kumar et al., 2019; Blevins and Zettlemoyer, 2020; Pasini et al., 2021).
We conducted our experiments on several contextual embedding spaces, where each model serves a different purpose. We can consider BERT (Devlin et al., 2019) as the baseline of the following contextual models. SenseBERT (Levine et al., 2020) incorporates word sense information into its latent representation. DistilBERT (Sanh et al., 2019) is obtained through knowledge distillation and operates with fewer parameters. RoBERTa introduced a better pre-training procedure. Finally, XLM-RoBERTa (Conneau et al., 2020) is a multilingual model with RoBERTa's pre-training procedure. When available, we also conducted experiments using both cased and uncased vocabularies.
Following (Loureiro and Jorge, 2019), we averaged the representations from the last 4 layers of the transformer models to obtain our final contextual embeddings. Furthermore, to determine the hyperparameters for sparse vector generation, we used the accuracy of the BERT Base model with different regularization strengths (λ) and numbers of employed basis vectors (k) on the SemEval 2007 dataset, the results of which can be seen in Table 1.

Baselines
We next introduce the baselines we compared our approach with. Most of these approaches rely on the intact contextual representations E, whose dimensions are not intended to directly encode human-interpretable supersense information about the words they describe.
Logistic Regression Classifier We conducted the experiments by setting the random state to 0 and the maximum number of iterations to 25,000, and turned off the utilization of a bias term. In this case, the vectors used for making predictions about the supersenses of words were of much higher dimensionality and not directly interpretable at all, unlike our representations.
Dimension Reduction (PCA+LogReg) We also experimented with representations that have the same number of dimensions as ours (45). To this end, we applied principal component analysis (PCA) based dimension reduction to the original E embedding space. Additionally, we applied a Logistic Regression classifier to the reduced representations with the same parametrization as the previously described baseline.
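This baseline can be sketched as a scikit-learn pipeline. The data are random placeholders; the 45 components, the 25,000 iteration cap and the disabled bias term follow the settings described above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Toy stand-ins for the contextual embeddings and supersense labels.
E_train, y_train = rng.normal(size=(300, 128)), rng.integers(0, 5, size=300)
E_eval, y_eval = rng.normal(size=(60, 128)), rng.integers(0, 5, size=60)

baseline = make_pipeline(
    PCA(n_components=45),                                  # match our 45 bases
    LogisticRegression(random_state=0, max_iter=25_000,
                       fit_intercept=False),               # no bias term
)
baseline.fit(E_train, y_train)
accuracy = baseline.score(E_eval, y_eval)
assert 0.0 <= accuracy <= 1.0
```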
Sparsity Makes Sense (SMS) The approach proposed by Berend (2020) yields human-interpretable embeddings like ours, since human-interpretable features are bound to the bases of the output representation. Berend (2020) originally presented the devised algorithm on fine-grained word sense disambiguation, which we altered to work similarly to our approach and predict supersense information instead. We utilized normalized positive pointwise mutual information to construct the transformation matrix, because it showed the most prominent scores in the paper.
Table 2: Accuracy of each model on the supersense prediction task using dense and sparse embedding spaces. ALLdev denotes the evaluation on the ALL dataset excluding the development set. All of the sparse representations were generated using λ = 0.1 for the regularization coefficient and k = 3000 basis vectors, based on the experiments reported in Table 1. Our approach and SMS produce interpretable representations; PCA+LogReg merely represents the information in the same number of bases without any connection to the previous two, and Logistic Regression operates on the original embedding spaces. We also include a more detailed table in the Appendix, which breaks down performance for each sub-corpus.

Results
We list the results of our experiments using different contextual encoders on the task of supersense prediction in Table 2. We calculated the accuracy as the fraction of correct predictions over the total number of annotated samples. We selected λ = 0.1 regularization and k = 3000 basis vectors for sparse vector generation in accordance with the results obtained over the development set for different choices of the hyperparameters (see Table 1).

Model Performances
We consider a model's semantic capacity to be the Logistic Regression model's performance, and its interpretability to be the performance of the best interpretable representation. We do not expect to exceed the original model, since we limited its capabilities drastically by reducing the number of utilized dimensions to 45. As expected, the original latent representation expresses the most semantic information, as measured by Logistic Regression. Among all models, SenseBERT dominates, which is due to the additional supersense information signal it relies on during its pre-training. The incorporated supersense information helps SenseBERT to represent that information more explicitly, which becomes even more apparent when we amplify it via sparse representations. Thus, including further objectives during training further separates the information along the bases.

Dense and Sparse Representations
We can see from Table 2 that relying on sparse representations further amplifies the semantic content of the latent representations. Based on the results of our approach, we can conclude that the semantic information can be more easily identified in the case of sparse representations (as indicated by the higher scores in the majority of the cases). SMS follows a similar trend to ours. The relatively small decrease in performance also suggests that the majority of the removed signal corresponds to noise.

Impact of Base and Large Models
In several cases, the Large models underperformed their Base counterparts (except RoBERTa). This can indicate that the Large versions might be under-trained, which has also been hypothesised in prior work. Overall, choosing the Base pre-trained models seems to be a sufficient and often better option for performing supersense prediction.

Case-sensitivity of the Vocabulary
As whether using a cased or an uncased model is more beneficial can vary from task to task, we conducted experiments in that respect. To this end, we compared the performance of BERT and DistilBERT, which are available in both case-sensitive and case-insensitive versions. Usually, the choice highly depends on the task (cased versions being recommended for POS tagging, NER and WSD) and the language (cased vocabularies can be beneficial for certain languages such as German). Overall, we can observe some advantage of using the cased vocabularies. Interestingly, the behavior of DistilBERT and BERT differs radically in that respect for all but the LogReg approach.

Considering Dimensionality
Other than the Logistic Regression model, every approach relies on some kind of condensed representation for supersense prediction. Even though all of the representations were condensed into 45 dimensions - from 768 or 1024 dimensions for dense and 3000 dimensions for sparse representations - the performance did not decrease by a large margin. The PCA-based dimension reduction approach performed the worst among the three approaches, whereas ours performed the best. Note that the interpretable approaches (ours and SMS) not only perform better than a standard dimension reduction, but they also associate human-understandable knowledge with the bases of the embedding space. Thus, our method can be utilized as an explicit semantic compression technique.

Comparing Interpretable Representations
Both our approach and SMS are similar in the sense that we can assign human-interpretable features to the bases of the output embeddings. We hence analysed the similarity of the semantic information of the two embedding spaces. We measured the Spearman rank correlation of the coefficients in each pair of bases generated by our approach and the SMS approach. We included these values in Table 3, which showcases the mean of the absolute correlation coefficients (ignoring the direction of correlation). Except for SenseBERT, we can see weak correlation scores. Higher correlation between the coefficients of these interpretable models along the same dimension would suggest that they represent the same semantic information, possibly to a different level and/or in a different manner. According to the Spearman correlations, our approach and the SMS approach capture different aspects of the encoded semantic content, but since the two embeddings expressed from SenseBERT - with our and the SMS approach - seem to share the most semantic content, we investigated them further. During our evaluation, we rely on the maximum value of each word token, so each dimension represents the semantic information among its highest coefficients. Hence, a higher value ranks a word as more likely to carry the corresponding semantic information. Therefore, we calculated Rank-biased Overlap (RBO) scores (Webber et al., 2010) between the sorted bases, which can be seen in Figure 1. RBO is a weighted, non-conjoint similarity measure that does not rely on correlation. RBO utilizes a parameter p, which controls the emphasis placed on top-ranked items (lower p indicates more emphasis on the top-ranked items). The p = 1 case differs from the p < 1 case in that it returns the unbounded set-intersection overlap calculated according to the proposition of Fagin et al. (2003). On the other hand, p < 1 prioritizes the head of the lists. A higher score indicates higher similarity between two ranked lists, which in our case means that the two models behave more similarly.
Figure 2: Representation of the coefficients of several semantic categories, where the color represents the assigned label according to the corpus and whether the prediction according to the maximum is correct (True) or not (False), and both axes represent the values in the corresponding bases of our representation (SenseBERT, k = 3000, λ = 0.1).
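The base (truncated, non-extrapolated) form of RBO over two equal-length ranked lists can be sketched as follows; p weights the agreement at each depth, and the items are assumed to be distinct within each list:

```python
def rbo(list_a, list_b, p=0.9):
    """Base Rank-Biased Overlap (Webber et al., 2010) for two equal-length
    ranked lists of distinct items: (1 - p) * sum_d p^(d-1) * A_d, where
    A_d is the proportional overlap of the two depth-d prefixes."""
    seen_a, seen_b = set(), set()
    score, overlap = 0.0, 0
    for d, (a, b) in enumerate(zip(list_a, list_b), start=1):
        if a == b:
            overlap += 1            # same item enters both prefixes
        else:
            if a in seen_b:
                overlap += 1        # a was already ranked by list_b
            if b in seen_a:
                overlap += 1        # b was already ranked by list_a
        seen_a.add(a)
        seen_b.add(b)
        score += p ** (d - 1) * overlap / d   # agreement A_d at depth d
    return (1 - p) * score

# Identical lists of length n score 1 - p^n under the truncated form.
assert abs(rbo(["a", "b", "c"], ["a", "b", "c"]) - (1 - 0.9 ** 3)) < 1e-9
assert rbo(["a", "b"], ["x", "y"]) == 0.0     # disjoint lists score 0
```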
Both models perform comparably in general, with slightly better scores for our approach on sparse models. We measured the statistical significance of the improvements with the paired bootstrap test of Berg-Kirkpatrick et al. (2012): if p(δ(X) > δ(x) | H_0) < 0.05, we accept that the improvement of the first model is unlikely to be caused by random factors, where δ(·) represents the improvement of the first model. Furthermore, we used b = 10^6 bootstrap samples, which was sufficient according to the original paper. Between the sparse models we obtained p = 0.0016, which suggests that the improvement is unlikely to be caused by random factors.
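A paired bootstrap test of this kind can be sketched as follows. We use the shifted formulation from Berg-Kirkpatrick et al. (2012), in which resampled improvements are compared against twice the observed one; the per-token correctness indicators below are hypothetical:

```python
import numpy as np

def paired_bootstrap_p(correct_a, correct_b, b=10_000, seed=0):
    """p-value for the observed accuracy gain of system A over system B,
    estimated by resampling the paired per-token outcomes b times."""
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    delta = correct_a.mean() - correct_b.mean()   # observed improvement
    rng = np.random.default_rng(seed)
    n, count = len(correct_a), 0
    for _ in range(b):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        boot_delta = correct_a[idx].mean() - correct_b[idx].mean()
        if boot_delta > 2 * delta:                # shift by the observed delta
            count += 1
    return count / b

a_sys = np.array([1] * 80 + [0] * 20)   # hypothetical correctness of system A
b_sys = np.array([1] * 60 + [0] * 40)   # hypothetical correctness of system B
p_val = paired_bootstrap_p(a_sys, b_sys, b=2000)
assert p_val < 0.05                     # the 20-point gain is significant here
```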

Qualitative Assessments
Clustering We demonstrate the semantic decomposition of 3 pairs of semantic categories in Figure 2. Each marker corresponds to a concrete word occurrence, with its color reflecting the expected supersense. The markers also indicate whether the prediction made according to the highest coordinate is correct (True) or not (False). Furthermore, both axes represent the actual values in the corresponding bases. We can observe in these figures how well the data points are separated with respect to their semantic properties.

Shared Space of Multilingual Domain
The availability of multilingual encoders allows us to use our supersense classifier on languages other than English as well. In order to test the applicability of XLM-RoBERTa in such a scenario, we tested it on sentences in multiple languages, the outcome of which is included in Table 4.
For this experiment, we constructed W_D in the usual manner from the sparse XLM-RoBERTa transformer on the SemCor dataset (which is in English). After that, we generated the context-aware word vectors for the sentences. We then obtained the sparse representations from them by employing the already optimized dictionary matrix from SemCor. We finally utilized the previously constructed distance matrix to obtain the interpretable representation. In Table 4, we marked the expected label above the text in blue, and the top 3 predictions below the text in red.
We included 3 typologically diverse languages: German (DE), Hungarian (HU) and Japanese (JP). Overall, the expected label was within the top 3 predictions irrespective of the language, which suggests that the overlap in semantic distributions between languages is high, but further quantitative experiments are needed to support that statement.

Conclusion
In this paper, we demonstrated our approach to obtain interpretable representations from contextual representations, which represent semantic information in their bases with high coefficients. We demonstrated its capabilities by applying it to the supersense prediction task; however, it can be utilized for other problems as well, such as term expansion and knowledge base completion. We additionally explored the application of sparse representations, which successfully amplified the examined semantic information. We also considered the effect of incorporated prior knowledge by applying SenseBERT embeddings, which showed that its additional objective during pre-training can amplify those features. Furthermore, we explored condensed (DistilBERT) and multilingual (XLM-RoBERTa) spaces, and examined the improvements brought by RoBERTa from a semantic standpoint. Note that our classification decision is currently made by simply finding the coordinate with the largest magnitude.
In conclusion, our experiments showed that it is possible to extract and succinctly represent human-interpretable information about words in transformed spaces with much lower dimensionality than their original representations. Additionally, this allows us to make decisions about word vectors in a more transparent manner, where an explanation is already assigned to the bases of the representation, which can lead to more transparent machine learning models.