Exploring the Role of BERT Token Representations to Explain Sentence Probing Results

Several studies have been carried out on revealing linguistic features captured by BERT. This is usually achieved by training a diagnostic classifier on the representations obtained from different layers of BERT. The subsequent classification accuracy is then interpreted as the ability of the model in encoding the corresponding linguistic property. Despite providing insights, these studies have left out the potential role of token representations. In this paper, we provide a more in-depth analysis on the representation space of BERT in search for distinct and meaningful subspaces that can explain the reasons behind these probing results. Based on a set of probing tasks and with the help of attribution methods we show that BERT tends to encode meaningful knowledge in specific token representations (which are often ignored in standard classification setups), allowing the model to detect syntactic and semantic abnormalities, and to distinctively separate grammatical number and tense subspaces.


Introduction
Recent years have seen a surge of interest in pretrained language models, highlighted by extensive research around BERT (Devlin et al., 2019) and its derivatives. One strand of research has focused on enhancing existing models with the primary objective of improving downstream performance on various NLP tasks (Liu et al., 2019b; Lan et al., 2019; Yang et al., 2019). Another strand analyzes the behaviour of these models with the hope of getting better insights for further developments (Clark et al., 2019; Kovaleva et al., 2019; Jawahar et al., 2019; Tenney et al., 2019; Lin et al., 2019).
Probing is one of the popular analysis methods, often used for investigating the knowledge encoded in language models (Conneau et al., 2018; Tenney et al., 2018). This is typically carried out by training a set of diagnostic classifiers that predict a specific linguistic property based on the representations obtained from different layers. Recent work on probing language models demonstrates that initial layers are responsible for encoding low-level linguistic information, such as part of speech and positional information, whereas intermediate layers are better at syntactic phenomena, such as syntactic tree depth or subject-verb agreement, while in general semantic information is spread across the entire model (Lin et al., 2019; Peters et al., 2018; Liu et al., 2019a; Hewitt and Manning, 2019; Tenney et al., 2019). Despite elucidating the type of knowledge encoded in various layers, these studies do not go further to investigate the reasons behind the layer-wise behavior and the role played by token representations. Analyzing the shortcomings of pre-trained language models requires scrutiny beyond mere performance (e.g., accuracy or F-score) in a given probing task. This is particularly important as recent studies point out that the diagnostic classifier (applied to the model's outputs) might itself play a significant role in learning nuances of the task and hence suggest evaluating probes with alternative criteria (Hewitt and Liang, 2019; Voita and Titov, 2020; Pimentel et al., 2020; Zhu and Rudzicz, 2020).
Authors marked with a star contributed equally.
Footnote 1: Code is available at https://github.com/hmohebbi/explain-probing-results
We extend the layer-wise analysis to the token level in search of distinct and meaningful subspaces in BERT's representation space that can explain the performance trends in various probing tasks. To this end, we leverage attribution methods (Simonyan et al., 2013; Sundararajan et al., 2017; Smilkov et al., 2017), which have recently proven effective for analytical studies in NLP (Li et al., 2016; Bastings and Filippova, 2020; Atanasova et al., 2020; Wu and Ong, 2021; Voita et al., 2021). Our analysis on a set of surface, syntactic, and semantic probing tasks (Conneau et al., 2018) shows that BERT usually encodes the knowledge required for addressing these tasks within specific token representations, particularly at higher layers. For instance, we found that sentence-ending tokens (e.g., "[SEP]" and ".") are mostly responsible for carrying positional information through layers, and that when the input sequence undergoes a re-ordering, the alteration is captured by specific token representations, e.g., by the swapped tokens or the coordinator between swapped clauses. Also, we observed that the ##s token is mainly responsible for encoding noun number and verb tense information, and that BERT clearly distinguishes the two usages of the token in higher layer representations.

Related Work
Probing. Several analytical studies have been conducted to examine the capacities and weaknesses of BERT, often by means of probing layer-wise representations (Lin et al., 2019; Goldberg, 2019; Liu et al., 2019a; Jawahar et al., 2019; Tenney et al., 2019). In particular, Jawahar et al. (2019) leveraged the probing framework of Conneau et al. (2018) to show that BERT carries a hierarchy of linguistic information, with surface, syntactic, and semantic features respectively occupying initial, middle and higher layers. In a similar study, Tenney et al. (2019) employed the edge probing tasks defined by Tenney et al. (2018) to show the hierarchy of encoded knowledge through layers. Moreover, they observed that while most of the syntactic information can be localized in a few layers, semantic knowledge tends to spread across the entire network. Both studies were aimed at discovering the extent of linguistic information encoded across different layers. In contrast, in this paper we explore the role of token representations in the final performance. More recently, Klafka and Ettinger (2020) investigated how much information about the other words in a sentence can be recovered from each word representation. Apart from using different probing tasks and methodologies, they relied solely on the classifier's performance score, whereas we draw conclusions based on the token representations that contribute most.
Representation subspaces. In addition to layerwise representations, subspaces that encode specific linguistic knowledge, such as syntax, have been a popular area of study. By designing a structural probe, Hewitt and Manning (2019) showed that there exists a linear subspace that approximately encodes all syntactic tree distances. In a follow-up study, Chi et al. (2020) showed that similar syntactic subspaces exist for languages other than English in the multilingual BERT and that these subspaces are shared among languages to some extent. This corroborated the finding of Pires et al. (2019) that multilingual BERT has common subspaces across different languages that capture various linguistic knowledge.
As for semantic subspaces, Wiedemann et al. (2019) showed that BERT places the contextualized representations of polysemous words into different regions of the embedding space, thereby capturing sense distinctions. Similarly, Reif et al. (2019) studied BERT's ability to distinguish different word senses in different contexts. Using the probing approach of Hewitt and Manning (2019), they also found that there exists a linear transformation under which distances between word embeddings correspond to their sense-level relationships. Our work extends these studies by revealing other types of surface, syntactic, and high-level semantic subspaces and linguistic features using a pattern-finding approach on different types of probing tasks.
Attribution methods. Recently, there has been a surge of interest in using attribution methods to open up the black box and explain the decision making of pre-trained language models, from developing methods and libraries to visualize inputs' contributions (Ribeiro et al., 2016; Han et al., 2020; Wallace et al., 2019; Tenney et al., 2020) to applying them to models fine-tuned on downstream tasks (Atanasova et al., 2020; Wu and Ong, 2021; Voita et al., 2021). In particular, Voita et al. (2021) adopted a variant of Layer-wise Relevance Propagation (Bach et al., 2015) to evaluate the relative contributions of source and target tokens to the generation process in Neural Machine Translation predictions. To our knowledge, this is the first time that attribution methods are employed for layer-wise probing of pre-trained language models.

Methodology
Our analytical study was mainly carried out on a set of sentence-level probing tasks from SentEval (Conneau and Kiela, 2018). The benchmark consists of several single-sentence evaluation tasks. Each task provides 100k instances for training and 10k for test, all balanced across target classes. We used the test set examples for our evaluation and in-depth analysis. Following the standard procedure for this benchmark, we trained a diagnostic classifier for each task. The classifier takes sentence representations as its input and predicts the specific property intended for the corresponding task.
In the rest of this section, we first describe how sentence representations were computed in our experiments. Then, we discuss our approach for measuring the attribution of individual token representations to the classifier's decision.

Sentence Representation
For computing sentence representations for layer $l$, we opted for a simple unweighted averaging ($h^{l}_{\mathrm{Avg}}$) of all input tokens (except for padding and the [CLS] token). This choice was due to our observation that the mean pooling strategy retains or improves [CLS] performance in most layers in our probing tasks (cf. Appendix A.1 for more details). This corroborates the findings of Reimers and Gurevych (2019) who observed a similar trend on sentence similarity and inference tasks. Moreover, the mean pooling strategy simplifies our measuring of each token's attribution, discussed next.
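The pooling step can be illustrated with a minimal sketch. The function name `mean_pool`, the toy array, and the assumption that [CLS] sits at position 0 are illustrative choices, not taken from the authors' code.

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Average the token vectors of one layer, skipping [CLS] and padding.

    hidden:         (seq_len, dim) token representations for one sentence
    attention_mask: (seq_len,) 1 for real tokens, 0 for padding
    Assumes [CLS] sits at position 0, as in BERT's input format.
    """
    keep = np.asarray(attention_mask, dtype=bool).copy()
    keep[0] = False  # drop [CLS]
    return hidden[keep].mean(axis=0)

# Toy input: [CLS], three tokens, [SEP], and one padding position.
hidden = np.arange(12, dtype=float).reshape(6, 2)
mask = np.array([1, 1, 1, 1, 1, 0])
h_avg = mean_pool(hidden, mask)
```

Because the pooled vector is a plain average, every retained token enters the sentence representation with the same weight, which is what makes per-token attribution straightforward.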
Our evaluations are based on the pre-trained BERT (base-uncased, 12-layer, 768-hidden size, 12-attention head, 110M parameters) obtained from HuggingFace's Transformers library (Wolf et al., 2020). We followed the hyperparameters recommended by Jawahar et al. (2019) to train the diagnostic classifiers for each layer. In addition to BERT, we carried out our evaluations on RoBERTa (Liu et al., 2019b, base, 125M parameters). We observed highly similar patterns for the two models; hence, we only report results for the BERT model.

Gradient-based Attribution Method
We leveraged a gradient-based attribution method in order to enable an in-depth analysis of layer-wise representations with the objective of explaining probing performances. Specifically, we are interested in computing the attribution of each input token to the output labels. This is usually referred to as the saliency score of an input token to the classifier's decision. Note that using attention weights for this purpose can be misleading given that raw attention weights do not necessarily correspond to the importance of individual token representations (Serrano and Smith, 2019; Jain and Wallace, 2019; Abnar and Zuidema, 2020; Kobayashi et al., 2020).
Using gradients for attribution has been a popular option in neural networks, especially for vision (Simonyan et al., 2013; Sundararajan et al., 2017; Smilkov et al., 2017). Images are constructed from pixels; hence, computing their individual attributions to a given class can be interpreted as the spatial support for that class (Simonyan et al., 2013). However, in the context of text processing, input tokens are usually represented by vectors; hence, raw feature values do not necessarily carry any specific information. Li et al. (2016)'s solution to this problem relies on the gradients over the inputs. Let $w_c$ be the derivative of class $c$'s output logit ($y_c$) with respect to the $k$-th dimension of the input embedding ($h[k]$):

$$w_c(h[k]) = \frac{\partial y_c}{\partial h[k]}$$

Although the absolute value of gradients could be employed for understanding and visualizing the contributions of individual words, these values can only express the sensitivity of the class score to small changes without information about the direction of contribution (Yuan et al., 2019). We adopt the method of Yuan et al. (2019) for our setting and compute the saliency score for the $i$-th representation in layer $l$, i.e., $h^{l}_{i}$, as:

$$\mathrm{Score}_c(h^{l}_{i}) = \frac{\partial y^{l}_{c}}{\partial h^{l}_{\mathrm{Avg}}} \cdot h^{l}_{i} \qquad (3)$$

where $y^{l}_{c}$ denotes the probability that the classifier assigns to class $c$ based on the $l$-th layer representations. Given that our aim is to explain the representations (rather than evaluating the classifier), we set $c$ in Equation 3 as the correct label. This way, the scores reflect the contributions of individual input tokens in a sentence to the classification decision.
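A gradient-times-input score of this kind can be sketched as follows. The linear softmax probe (`W`, `b`), the random toy data, and the function names are assumptions made for illustration; the diagnostic classifier used in the paper may have a different architecture, in which case the gradient would come from automatic differentiation rather than the closed form used here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def saliency_scores(tokens, W, b, c):
    """Signed gradient-times-input score of each token for class c.

    tokens: (n, dim) token representations of one layer (no [CLS]/padding)
    W, b:   weights (num_classes, dim) and bias of a linear softmax probe,
            a hypothetical stand-in for the diagnostic classifier
    """
    h_avg = tokens.mean(axis=0)
    y = softmax(W @ h_avg + b)
    # Closed-form gradient of the class-c probability w.r.t. h_avg.
    grad = y[c] * (W[c] - y @ W)
    return tokens @ grad  # one score per token

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
W, b = rng.normal(size=(3, 8)), rng.normal(size=3)
scores = saliency_scores(tokens, W, b, c=1)
```

The sign of each score indicates whether the token pushes the probability of class `c` up or down, which is the directional information that absolute gradients alone do not provide.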
In what follows in the paper, we use the analysis method discussed in this section to find those tokens that play the central role in different surface (Section 4), syntactic (Sections 5 and 6.1) and semantic (Section 6.3) probing tasks. Based on these tokens we then investigate the reasons behind performance variations across layers.

Sentence Length
In this surface-level task we probe the representation of a given sentence in order to estimate its size, i.e., the number of words (not tokens) in it. To this end, we used SentEval's SentLen dataset, but changed the formulation from the original classification objective to a regression one, which allows a better generalization due to its fine-grained setting. The diagnostic classifier receives the average-pooled representation of a sentence (cf. Section 3.1) as input and outputs a continuous number as an estimate for the input length.
Given that the ability to encode the exact length of input sentences is not necessarily a critical feature, we do not focus on layer-wise performance and instead discuss the reason behind the performance variations across layers. To this end, we calculated the absolute saliency scores for each input token in order to find those tokens that played a pivotal role in estimating sentence length.
Rounding the regressed estimates and comparing them with the gold labels in the test set, we can observe a significant performance drop from 0.91 accuracy in the first layer to 0.44 in the last layer (cf. Appendix A.1 for details). This decay is not surprising given that the positional encodings, which are added to the input embeddings in BERT and are deemed to be the main players for such a position-based task, fade through the layers (Voita et al., 2019).
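The accuracy figure for the regression probe amounts to rounding each estimate and checking it against the gold word count. A minimal sketch, with made-up predictions and a hypothetical helper name:

```python
def rounded_accuracy(predictions, gold_lengths):
    """Share of sentences whose rounded length estimate equals the gold length."""
    hits = sum(round(p) == g for p, g in zip(predictions, gold_lengths))
    return hits / len(gold_lengths)

preds = [7.2, 11.6, 4.9, 20.4]  # hypothetical regressor outputs
gold = [7, 12, 6, 20]           # hypothetical gold word counts
acc = rounded_accuracy(preds, gold)  # three of the four estimates round correctly
```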
Sentence ending tokens retain positional information. Figure 1 shows tokens that most contributed to the probing results across different layers according to the attribution analysis. Finalizing tokens (e.g., "[SEP]" and ".") are the main contributors in the higher layers. We further illustrate this by comparing the representations of a finalizing token with those of another frequent non-finalizing token. Clearly, positional information is lost throughout the layers in BERT; however, finalizing tokens partially retain this information, as visible from the distinct pattern in higher layers.

Verb Tense and Noun Number
This analysis inspects BERT representations for grammatical number and tense information. For this experiment we used the Tense and ObjNum tasks: the former checks whether the main-clause verb is labeled as present or past, whereas the latter classifies the object according to its number, i.e., singular or plural (Conneau et al., 2018). On both tasks, BERT preserves a consistently high performance (> 0.82 accuracy) across all layers (cf. Appendix A.1 for more details).
Figure 4: [...] Colors indicate whether the token occurred in a present- or past-labeled sentence in the Tense task (see Section 5). For the sake of comparison, we also include two present verbs without the ##s token (i.e., does and works; see footnote 5) and two irregular plural nouns (i.e., men and children), in rounded boxes. The distinction between the two different usages of the token (noun number as well as tense information) is clearly encoded in higher layer contextualized representations. As plural nouns can appear in both past- and present-labeled examples, the cluster that belongs to the plural form of the ##s token in higher layers may contain both types of examples.
Articles and ending tokens (e.g., ##s and ##ed) are key playmakers. Attribution analysis, illustrated in Figure 3(a), reveals that article words (e.g., "a" and "an") and the ending ##s token, which makes out-of-vocab plural words (or third person present verbs), are among the most attributed tokens in the ObjNum task. This shows that these tokens are mainly responsible for encoding object's number information across layers. As for the Tense task, Figure 3(b) shows a consistently high influence from verb ending tokens (e.g., ##ed and ##s) across layers which is in line with performance trends for this task and highlights the role of these tokens in preserving verb tense information.
##s: Plural or Present? The ##s token proved influential in both tense and number tasks. The token can make a verb into its simple present tense (e.g., read → reads) or transform a singular noun into its plural form (e.g., book → books). We further investigated the representation space to check if BERT can distinguish this nuance. Results are shown in Figure 4: after the initial layers, BERT recognizes and separates these two forms into two distinct clusters (while BERT's tokenizer made no distinction among different usages). Interestingly, we also observed that other present/plural tokens that did not have the ##s token aligned well with these subspaces.
Footnote 5: Tokens that were not split by the tokenizer.

Inversion Abnormalities
For this set of experiments, we opted for SentEval's Bi-gram Shift and Coordination Inversion tasks, which respectively probe the model's ability to detect syntactic and semantic abnormalities. The goal of this analysis was to investigate whether BERT encodes an inversion abnormality in a given sentence into specific token representations.

Word-level inversion
Bi-gram Shift (BShift) checks the ability of a model to identify whether two adjacent words within a given sentence have been inverted (Conneau et al., 2018). Probing results show that the higher half of BERT's layers can properly distinguish this peculiarity (Figure 7). Similarly to the previous experiments, we leveraged the gradient attribution method to identify the tokens that were most effective in detecting the inverted sentences. Given that the dataset does not specify the inverted words, we reconstructed the inverted examples by randomly swapping two consecutive words in the original sentences of the test set, excluding the beginning of the sentences and punctuation marks, as stated by Conneau et al. (2018).
Figure 5: Normalized layer-wise attribution scores for a randomly sampled sentence from the test set (left). The right figure shows how the attribution scores changed when two words ("at" and "the") from the original sentence were inverted.
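The reconstruction of inverted examples can be sketched as below. This is an illustration under stated assumptions: "excluding the beginning of the sentences" is read here as never moving the first word, punctuation is detected with Python's `string.punctuation`, and the example sentence and the helper name `swap_random_bigram` are made up.

```python
import random
import string

def swap_random_bigram(words, rng):
    """Swap two adjacent words, skipping the first word and punctuation.

    Returns the new word list and the index of the first swapped word,
    or (copy of words, None) if no eligible pair exists.
    """
    def is_punct(w):
        return all(ch in string.punctuation for ch in w)

    candidates = [i for i in range(1, len(words) - 1)
                  if not is_punct(words[i]) and not is_punct(words[i + 1])]
    if not candidates:
        return list(words), None
    i = rng.choice(candidates)
    shifted = list(words)
    shifted[i], shifted[i + 1] = shifted[i + 1], shifted[i]
    return shifted, i

sentence = "he looked at the old clock .".split()
shifted, pos = swap_random_bigram(sentence, random.Random(0))
```

Returning the swap position alongside the sentence keeps track of which tokens form the shifted bi-gram, which the later token-level comparisons need.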

Results
Our attribution analysis shows that swapping two consecutive words in a sentence results in a significant boost in the attribution scores of the inverted tokens. As an example, Figure 5 depicts attribution scores of each token in a randomly sampled sentence from the test set across different layers. The classifier distinctively focuses on the token representations for the shifted words (Figure 5, right), while no such pattern exists for the original sentence (Figure 5, left).
To verify if this observation holds true for other instances in the test set, we carried out the following experiment. For each given sequence $X$ of $n$ tokens, we defined a boolean mask $M = [m_1, m_2, \ldots, m_n]$ which denotes the position of the inversion according to the following condition:

$$m_i = \begin{cases} 1, & x_i \in V \\ 0, & \text{otherwise} \end{cases}$$

where $x_i$ is the $i$-th token of $X$ and $V$ is the set of all tokens in the shifted bi-gram ($|V| \geq 2$, given BERT's sub-word tokenization). We then computed the correlation between the attribution scores and $M$. Figure 6 reports mean layer-wise correlation scores. We observe that in altered sentences the correlation significantly grows over the first few layers, which indicates the model's increased sensitivity to the shifted tokens. We hypothesize that BERT implicitly encodes abnormalities in the representation of shifted tokens. To investigate this, we computed the cosine distance of each token to itself in the original and shifted sentences. Figure 7 shows layer-wise statistics for both shifted and non-shifted tokens. Distances between the shifted token representations align well with the performance trend for this probing task (also shown in the figure).
Figure 8: Evaluating the β map for a single example in a specific layer (layer = 3). After computing the map for the original (a) and inverted (b) forms of the sentence, to compute the Δβ map we need to reorder the inverted map. The corresponding columns and rows for the inverted words (orange boxes) are swapped to re-construct the original order (c). The Δβ map (d) is the magnitude of the point-wise difference between the re-ordered and the original maps. The Δβ map for this example clearly shows that most of the changes have occurred within the bi-gram inversion area. All values are min-max normalized.
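The mask-and-correlation check and the token-to-itself distance can be sketched as follows. The attribution values are invented, the helper names are hypothetical, and Pearson correlation via `np.corrcoef` is an assumption: the text above does not state which correlation coefficient was used.

```python
import numpy as np

def inversion_mask(n_tokens, shifted_positions):
    """m_i = 1 for tokens inside the shifted bi-gram, 0 elsewhere."""
    mask = np.zeros(n_tokens)
    mask[list(shifted_positions)] = 1.0
    return mask

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def cosine_distance(u, v):
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Six tokens, with the shifted bi-gram at positions 2 and 3.
mask = inversion_mask(6, [2, 3])
attributions = np.array([0.05, 0.10, 0.40, 0.35, 0.05, 0.05])  # made-up scores
corr = pearson(attributions, mask)

# Distance between a token's representation before and after the shift
# (toy orthogonal vectors, so the distance is 1).
dist = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

A correlation near 1 means the attribution mass is concentrated on the shifted tokens; averaging this value over all test sentences per layer gives a layer-wise curve of the kind reported in Figure 6.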

Attention-norm behavior on bi-gram inversion
Our observation implies that BERT somehow encodes oddities in word order in the representations of the involved tokens. To investigate the root cause of this, we took a step further and analyzed the building blocks of these representations, i.e., the self-attention mechanism. To this end, we made use of the norm-based analysis method of Kobayashi et al. (2020), which incorporates both attention weights and transformed input vectors (the value vectors in the self-attention layer). The latter component enables a better interpretation at the token level. This norm-based metric $\|\alpha f(x)\|$ (for the sake of convenience we call it attention-norm) is computed as the vector-norm of the $i$-th token to the $j$-th token over all attention heads ($H = 12$) in each layer $l$:

$$\beta^{l}_{i,j} = \left\| \sum_{\mathit{head}=1}^{H} \alpha^{\mathit{head},l}_{i,j} \, f^{\mathit{head},l}(x_j) \right\|$$

where $\alpha_{i,j}$ is the attention weight between the two tokens and $f^{\mathit{head},l}(x)$ is a combination of the value transformation in layer $l$ of the head and the matrix which combines all heads together (see Kobayashi et al. (2020) for more details).
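A minimal sketch of the attention-norm computation, assuming the per-head transformation takes the form f(x) = (x W_V + b_V) W_O with the output projection split into per-head slices. All array names, shapes, and the random toy inputs are illustrative and not taken from the authors' code.

```python
import numpy as np

def attention_norm(alpha, x, W_v, b_v, W_o):
    """beta[i, j] = || sum_h alpha[h, i, j] * f_h(x_j) ||.

    alpha: (H, n, n) attention weights of one layer
    x:     (n, d) inputs to the layer
    W_v:   (H, d, d_head) value projections; b_v: (H, d_head) value biases
    W_o:   (H, d_head, d) per-head slices of the output projection
    """
    # f_h(x_j) = (x_j W_v^h + b_v^h) W_o^h, giving shape (H, n, d).
    values = np.einsum("nd,hde->hne", x, W_v) + b_v[:, None, :]
    f = np.einsum("hne,hed->hnd", values, W_o)
    # Weight each transformed vector by its attention and sum over heads.
    summed = np.einsum("hij,hjd->ijd", alpha, f)
    return np.linalg.norm(summed, axis=-1)  # (n, n) map

rng = np.random.default_rng(0)
H, n, d, e = 2, 4, 6, 3
alpha = rng.random((H, n, n))
alpha /= alpha.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax output
x = rng.normal(size=(n, d))
W_v = rng.normal(size=(H, d, e))
b_v = rng.normal(size=(H, e))
W_o = rng.normal(size=(H, e, d))
beta = attention_norm(alpha, x, W_v, b_v, W_o)
```

Row `i` of `beta` describes how strongly token `i` draws on every token in the sentence, so a change in that row translates into a change in token `i`'s own representation.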
We computed the attention-norm map in all layers, for both the original and shifted sentence. To be able to compare these two maps, we re-ordered the shifted sentence norms to match the original order. The magnitude of the difference between the original and the re-ordered map, $\Delta\beta^{l}$, shows the amount of change in each token's attention-norm to each token. Figure 8 illustrates this procedure for a sample instance. Given that bi-gram locations differ across instances, to compute an overall $\Delta\beta^{l}$ we centered each map based on the position of the inversion. As a result of this procedure, we obtained a $\Delta\beta^{l}$ map for each layer and for all examples. Centering and averaging all these maps across layers produced Figure 9. Figure 9 indicates that after inverting a bi-gram, both words' attention-norms to their neighboring tokens change and this mostly affects their own representations rather than others. This observation suggests that the distinction formed between the representations of the original and shifted tokens, as was seen in Figure 7, can be traced back to the changes in attention heads' patterns.
Figure 9: A cumulative view of the attention-norm changes ($\Delta\beta^{l}$) centered around the bi-gram position (the approximate bi-gram position is marked on each figure). Each plot indicates the cumulative layer-wise changes until a specific layer. Each row indicates the corresponding token's attention-norms to every token in the sentence (including itself). Although the changes slightly spread out to the other tokens as we move up to higher layers, they mostly occur in the bi-gram area. Given BERT's contextualization mechanism, variations in attention-norms in each row directly result in a change in the corresponding token's representation. Therefore, the tokens in the bi-gram undergo most changes in their representations.
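The re-ordering and differencing step can be sketched as below. For simplicity this assumes each swapped word is a single token, so the swap is a permutation of two positions; with sub-word splits the permutation would cover more indices. The function names and toy maps are hypothetical.

```python
import numpy as np

def delta_beta(beta_orig, beta_inv, i, j):
    """|reordered(beta_inv) - beta_orig| for a swap of positions i and j."""
    perm = np.arange(beta_orig.shape[0])
    perm[[i, j]] = perm[[j, i]]
    # Swap rows and columns of the inverted map back to the original order.
    reordered = beta_inv[np.ix_(perm, perm)]
    return np.abs(reordered - beta_orig)

def min_max(a):
    """Scale a map to [0, 1]; assumes the map is not constant."""
    return (a - a.min()) / (a.max() - a.min())

rng = np.random.default_rng(0)
beta_orig = rng.random((5, 5))
swap = [0, 2, 1, 3, 4]  # positions 1 and 2 exchanged
beta_inv = beta_orig[np.ix_(swap, swap)].copy()
beta_inv[1, 1] += 0.5   # simulate a change on one shifted token
delta = delta_beta(beta_orig, beta_inv, 1, 2)
```

In this toy case the only non-zero entry of `delta` is at the original position of the perturbed token, which mirrors how the changes in Figure 8 concentrate in the bi-gram area.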

Phrasal-level inversion
The Coordination Inversion (CoordInv) task is a binary classification task that contains sentences with two coordinated clausal conjoints (and only one coordinating conjunction). In half of the sentences the clauses' order is inverted and the goal is to detect malformed sentences at the phrasal level (Conneau et al., 2018). Since the phrasal-level inversion does not alter the syntactic structure of the sentence, the task could be considered a semantic one (Conneau et al., 2018). For example:
the glass broke and i cut myself . → Original
i cut myself and the glass broke . → Inverted
While both sentences are syntactically correct, we must rely on the meaning of the sequence of events in order to detect the abnormality in the second sentence.
BERT's performance on this task increases through layers and then slightly decreases in the last three layers. We observed the attribution scores for the "but" and "and" coordinators to be among the highest (see Appendix A.2) and found that these scores notably increase through layers. We hypothesize that BERT might implicitly encode phrasal-level abnormalities in specific token representations.
Odd Coordinator Representation. To verify our hypothesis, we filtered the test set to ensure all sentences contain either a "but" or an "and" coordinator. We reconstructed the original examples by inverting the order of the two clauses in the inverted instances, since no sentence appears with both labels in the dataset. Feeding these to BERT, we extracted token representations and computed the cosine distance between the representations of each token in the original and inverted sentences. Figure 10 shows these distances, as well as the normalized saliency score for coordinators (averaged over all examples in each layer), and layer-wise performance for the CoordInv probing task. Surprisingly, all these curves exhibit a similar trend. As we can see, when the order of the clauses is inverted, the representations of the coordinators "but" or "and" play a pivotal role in making sentence representations distinct from one another, while there is nearly no change in the representation of other words. This observation implies that BERT somehow encodes oddity in the coordinator representations (corroborating part of the findings of our previous analysis of the BShift task in Section 6.1).
Figure 10: Averaged cosine distances between coordinators in the original and inverted sentences. We also show the normalized saliency scores for the coordinators across layers, which correlate with the performance scores of the task. The distance curve for other tokens is a baseline to highlight that the representations of coordinators change significantly after inversion.
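Reconstructing the counterpart of a CoordInv sentence amounts to swapping the two clauses around the coordinator. A minimal sketch, assuming whitespace-tokenized input with exactly one "and" or "but" and an optional final period; the helper name is hypothetical.

```python
def invert_clauses(words, coordinators=("and", "but")):
    """Swap the two clauses around the single coordinator of a sentence.

    Assumes exactly one coordinator and an optional sentence-final period.
    """
    end = [words[-1]] if words[-1] == "." else []
    body = words[:len(words) - len(end)]
    positions = [i for i, w in enumerate(body) if w in coordinators]
    if len(positions) != 1:
        raise ValueError("expected exactly one coordinator")
    k = positions[0]
    return body[k + 1:] + [body[k]] + body[:k] + end

inverted = "i cut myself and the glass broke .".split()
original = invert_clauses(inverted)
```

Applying the function twice returns the input, so the same helper maps inverted instances to their originals and back.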

Control Experiments
The main motivation behind designing a control task in probing studies is to check whether it is the representations that encode linguistic knowledge or the diagnostic classifier itself which plays a significant role in learning nuances of the task (Hewitt and Liang, 2019). In this regard, most of our experiments throughout the paper (similarity curves, t-SNE plots, and attention-norm analysis) rely on fixed representations and do not need any classifier or training; hence, they serve as control experiments or sanity checks. For example, in our attention-norm analysis (which requires no training and comes from a different perspective) we arrive at the same results as our attribution analysis. Computation of attribution scores based on trained diagnostic classifiers is the only part of our experiments which involves a training procedure. Hence, we carried out a control study inspired by Talmor et al. (2020) to check the consistency of attribution patterns. The intuition behind this is in line with Voita and Titov (2020), who stated that if there is a strong regularity in the representations with respect to the labels, this can be revealed even with fewer training data points.
To this end, we used only 10% of the training data to train the diagnostic classifiers and computed the attribution scores for each task. Then, we computed the correlation between the attribution scores for each sentence obtained by these classifiers and those obtained from the original classifiers (trained on the full training data). After averaging the correlations over all examples, we report the mean and maximum statistics among all layers in Table 1. The strong correlations imply that a similar pattern exists in the attribution scores even when fewer training instances are used. This highlights the fact that task-specific knowledge is well encoded and regularized in the representations, nullifying the possibility of the classifier playing a major role.
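The consistency check can be sketched as a per-sentence correlation that is averaged within each layer and then summarized across layers. The scores below are simulated (the reduced-data probe is mimicked by adding small noise), and Pearson correlation is an assumption, since the correlation coefficient is not specified in the text.

```python
import numpy as np

def layerwise_agreement(full_scores, small_scores):
    """Mean per-sentence correlation of attribution scores, one value per layer.

    full_scores, small_scores: lists over layers; each layer is a list of
    1-D arrays holding the token scores of one sentence.
    """
    per_layer = []
    for full_layer, small_layer in zip(full_scores, small_scores):
        corrs = [np.corrcoef(f, s)[0, 1] for f, s in zip(full_layer, small_layer)]
        per_layer.append(float(np.mean(corrs)))
    return per_layer

rng = np.random.default_rng(0)
# Three layers, three sentences of different lengths per layer.
full = [[rng.normal(size=n) for n in (6, 9, 12)] for _ in range(3)]
# Simulated scores from a probe trained on less data: same scores plus noise.
small = [[s + 0.1 * rng.normal(size=s.size) for s in layer] for layer in full]
per_layer = layerwise_agreement(full, small)
mean_corr, max_corr = np.mean(per_layer), np.max(per_layer)
```

The pair `mean_corr` and `max_corr` corresponds to the kind of mean and maximum statistics over layers that Table 1 reports.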

Conclusions
In this paper we carried out an extensive gradient-based attribution analysis to investigate the nature of BERT token representations. To our knowledge, this is the first effort to explain probing performance results from the viewpoint of token representations. We found that, while most of the positional information is diminished through layers, sentence-ending tokens are partially responsible for carrying this knowledge to higher layers in the model. Furthermore, we analyzed the grammatical number and tense information throughout the model. Specifically, we observed that BERT tends to encode verb tense and noun number information in the ##s token and that it can clearly distinguish the two usages of the token by separating them into distinct subspaces in the higher layers. Also, we found that abnormalities can be captured by specific token representations, e.g., in two consecutive swapped tokens or a coordinator between two swapped clauses.
Footnote 8: Results are averaged over three runs.
Our approach of using a simple diagnostic classifier and incorporating attribution methods provides a novel way of extracting qualitative results in probing studies. This can be seamlessly applied to various deep pre-trained models, providing a wide range of options in sentence-level tasks and from the fine-grained viewpoint of tokens. We hope this will spur future probing studies in other evaluation scenarios. Future work might investigate how subspaces evolve or are transformed during fine-tuning and whether they are beneficial at inference time to various downstream tasks (e.g., syntactic abnormalities, grammatical number and tense subspaces in grammar-based tasks like CoLA; Warstadt et al., 2019), or check whether these behaviors are affected by different training objectives. Furthermore, our token-level analysis can provide insights for enhancing model efficiency based on token importance, something we plan to pursue in future work.

[...] (2019). Specifically, we show layer-wise performance differences of the two representations, with the green color indicating improvements of our strategy. The results clearly highlight that average representations are more suited to the task, providing improvements across many layers in most tasks.

A.2 Full 12-layer Figures
In this section we provide the full 12-layer version of the previous summarized layer-wise figures.

A.3 SubjNum Mislabelling
The SubjNum probing data suffers from numerous incorrect labels, which are most obvious in samples that start with a name ending in "s" and are labelled as plural. We show five examples with this issue in Table A
Lois had stopped in briefly to visit , but didn 't stay very long .
NNS  Tomas sank back on the seat , wonder on his face .
NNS  Justus was an unusual man .