Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

The Transformer architecture has become ubiquitous in the field of natural language processing. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not composed of the multi-head attention alone; other components can also contribute to Transformers' impressive performance. In this study, we extended the scope of the analysis of Transformers from solely the attention patterns to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect the performance. The code for our experiments is publicly available.


Introduction
The Transformer architecture (Vaswani et al., 2017) has advanced the state of the art in a wide range of natural language processing (NLP) tasks (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020). Along with this, Transformers have become a major subject of research from the viewpoints of engineering (Rogers et al., 2020) and scientific studies (Merkx and Frank, 2021; Manning et al., 2020).
Figure 1: Visualizations of the token-by-token interactions in each layer when a sentence pair is fed into the pre-trained BERT-base. (a) Existing analysis of the multi-head attention alone. (b) Proposed method incorporating the whole attention block (i.e., multi-head attention, residual connection, and layer normalization) into the analysis. The diagonal elements correspond to the effect of preserving the original input information. The contrast between Figures 1a and 1b demonstrates that the contextual information contributed less to the computation of the output representations than previously expected.
While attention patterns have been shown to capture linguistic information such as semantic and syntactic relations, some reports question the importance of attention. For example, several studies in fields ranging from NLP (Michel et al., 2019; Kovaleva et al., 2019) to neuroscience (Toneva and Wehbe, 2019) empirically found that discarding learned attention patterns from Transformers retains or even improves their performance in downstream tasks and their ability to simulate human brain activity. These observations imply that Transformers do not heavily rely on the multi-head attention alone, and that the other components also contribute to their performance.
In this study, we broaden the scope of the analysis from the multi-head attention to the whole attention block, i.e., the multi-head attention, residual connection, and layer normalization. Our analysis of the Transformer-based masked language models (Devlin et al., 2019; Liu et al., 2019) revealed that the newly incorporated components have a larger impact than expected in previous studies (Abnar and Zuidema, 2020; Kobayashi et al., 2020) (Figure 1).
More concretely, we introduce an exact decomposition of the operations in the whole attention block by exploiting the norm-based analysis (Kobayashi et al., 2020). Our analysis quantifies the impact of the two contrasting effects of the attention block: (i) "mixing" the input representations via attention and (ii) "preserving" the original input, mainly via the residual connection (Section 3). Our analysis reveals that the preserving effect is more dominant in each attention block than previously estimated (Abnar and Zuidema, 2020; Kobayashi et al., 2020). The results also reveal the detailed mechanism of each component in the attention block. The residual connections pass through much larger vectors than the vectors produced by the multi-head attention. The layer normalization also reduces the effect of the operation via attention.
Our finding of the relatively small impact of the multi-head attention provides new intuitive interpretations for some existing reports, for example, that discarding the learned attention patterns does not adversely affect performance. Our analysis also provides a new intuitive perspective on the behaviors of Transformer-based masked language models. For example, BERT (Devlin et al., 2019) highlights low-frequency (informative) words when encoding texts, which is consistent with the existing methods for effectively computing text representations (Luhn, 1958; Arora et al., 2017).
The contributions of this study are as follows:
• We expanded the scope of Transformer analysis from the multi-head attention to the attention block (i.e., multi-head attention, residual connection, and layer normalization).
• Our analysis revealed that the operations via the residual connection and layer normalization contribute more to the internal representations than expected in previous studies (Abnar and Zuidema, 2020; Kobayashi et al., 2020).
• We detailed the functioning of BERT: (i) BERT tends to mix a relatively large amount of contextual information into [MASK] in the middle and later layers; and (ii) the contribution of contextual information in the attention block is related to word frequency.

2 Background: Transformer architecture

The Transformer architecture consists of a stack of layers. Each layer has an attention block, which is responsible for capturing the interactions between input tokens. The attention block can be further divided into three components: multi-head attention (ATTN), residual connection (RES), and layer normalization (LN) (Figure 2). This block can be written as the following composite function:

x̃_i = LN(ATTN(x_i, X) + x_i),    (1)

where x_i ∈ R^d is the i-th input representation, X := [x_1, ..., x_n] ∈ R^{n×d} is the sequence of input representations, and x̃_i ∈ R^d is the output representation corresponding to x_i. Boldface letters such as x denote row vectors. In the following, we review the computations in the ATTN, RES, and LN components.

Multi-head attention (ATTN):
The ATTN takes the role of mixing contextual information into the output representations. Formally, given the input representations X, the H-head ATTN computes the output ATTN(x_i, X) ∈ R^d for each input x_i:

ATTN(x_i, X) = Σ_{h=1}^{H} ATTN^h(x_i, X) + b_O,    (2)

where ATTN^h(x_i, X) ∈ R^d denotes the output vector from each attention head h. ATTN^h(x_i, X) is computed by each attention head h as follows:

ATTN^h(x_i, X) = Σ_{j=1}^{n} α^h_{i,j} (x_j W^h_V + b^h_V) W^h_O,

α^h_{i,j} := softmax_{x_j ∈ X} ( (x_i W^h_Q)(x_j W^h_K)^⊤ / √(d/H) ),

where W^h_Q, W^h_K, W^h_V ∈ R^{d×(d/H)}, W^h_O ∈ R^{(d/H)×d}, b^h_V ∈ R^{d/H}, and b_O ∈ R^d are learnable parameters, and the attention weight α^h_{i,j} has been assumed to represent the contribution of the input x_j to computing x̃_i.

Residual connection (RES):
In RES, the original input vector for the multi-head attention (x_i) is added to its output vector:

RES(x_i, X) = ATTN(x_i, X) + x_i.    (7)

Layer normalization (LN):

LN first normalizes the input vector and then applies a transformation with the learnable parameters γ ∈ R^d and β ∈ R^d:

LN(y) = ((y − m(y)) / s(y)) ⊙ γ + β,    (8)

where m(y) ∈ R and s(y) ∈ R denote the element-wise mean and standard deviation 2 , respectively. Here, subtraction and division are also performed on an element-wise basis. The normalized vector, (y − m(y))/s(y), is then transformed with γ and β; here, ⊙ denotes the element-wise multiplication.
Note that analyzing the feed-forward networks in each layer is beyond the scope of this study and will be carried out as future work.
2 Precisely, s(y) := √( (1/d) Σ_k (y^(k) − m(y))² + ε ), where y^(k) denotes the k-th element of the vector y and ε ∈ R is a small constant to stabilize the value.
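The block reviewed above (ATTN, then RES, then LN) can be sketched numerically. The following is a minimal NumPy illustration with random stand-in weights rather than trained parameters, omitting the attention biases for brevity; the shapes and parameter names (Wq, Wk, Wv, Wo) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H, n = 8, 2, 5                      # hidden size, number of heads, tokens
dh = d // H                            # per-head dimension

X = rng.normal(size=(n, d))            # input representations (row vectors)
Wq = rng.normal(size=(H, d, dh))
Wk = rng.normal(size=(H, d, dh))
Wv = rng.normal(size=(H, d, dh))
Wo = rng.normal(size=(H, dh, d))
gamma = np.ones(d); beta = np.zeros(d) # LN parameters
eps = 1e-12

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn(X):
    out = np.zeros((n, d))
    for h in range(H):                 # sum per-head contributions
        A = softmax((X @ Wq[h]) @ (X @ Wk[h]).T / np.sqrt(dh))
        out += A @ (X @ Wv[h]) @ Wo[h]
    return out

def layer_norm(Y):
    m = Y.mean(-1, keepdims=True)
    s = np.sqrt(((Y - m) ** 2).mean(-1, keepdims=True) + eps)
    return (Y - m) / s * gamma + beta

X_out = layer_norm(attn(X) + X)        # whole block: LN(ATTN(x) + x)
```

With gamma fixed to ones and beta to zeros, each output row has zero mean by construction of LN.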

Proposal: Analyzing attention blocks
For analyzing Transformers, solely observing the attention weights has been a common method (Clark et al., 2019; Kovaleva et al., 2019, etc.). We extend the scope of analysis to the whole attention block (ATTN, RES, and LN).

Strategy: Norm-based analysis
Kobayashi et al. (2020) introduced the norm-based analysis to extend the scope of analysis from the attention weights to the whole multi-head attention.
We follow this norm-based analysis and extend its scope to the whole attention block. The norm-based analysis first attempts to decompose the output vector x̃_i into the sum of transformed input vectors {x_j}:

x̃_i = Σ_{j=1}^{n} F_i(x_j),

where F_i is an appropriate vector-valued function. Then, the contribution of x_j to x̃_i can be expressed by the norm ‖F_i(x_j)‖. In the next subsection, we show that this norm-based method can be applied to analyzing the whole attention block. In other words, the output of the attention block can also be decomposed into the sum of transformed input vectors without any approximation.

Decomposing output into a sum of inputs
The output x̃_i is decomposed into a sum of terms associated with each input x_j. First, ATTN (Equation 2) can be decomposed into a sum of vectors (Kobayashi et al., 2020):

ATTN(x_i, X) = Σ_{j=1}^{n} Σ_{h=1}^{H} α^h_{i,j} f^h(x_j) + b_O,    (10)

f^h(x) := (x W^h_V + b^h_V) W^h_O.    (11)

Second, in RES, no interaction between the subscripts i and j occurs, and the form is already additively decomposed. Third, by exploiting the linearity of m(·), we can derive the "distributive law" of LN and decompose it. Let y = Σ_j y_j be the input to LN. Then,

LN(y) = Σ_j g_y(y_j) + β,    (12)

g_y(y_j) := ((y_j − m(y_j)) / s(y)) ⊙ γ.    (13)

See Appendix A for the derivation.
With these decompositions of ATTN and LN, the output of the whole attention block can be written as the sum of vector-valued functions with each input vector in X as an argument:

x̃_i = Σ_{j≠i} g_y( Σ_h α^h_{i,j} f^h(x_j) ) + g_y( Σ_h α^h_{i,i} f^h(x_i) + x_i ) + g_y(b_O) + β,    (16)

where y := ATTN(x_i, X) + x_i denotes the input to LN.
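The "distributive law" of LN above can be checked numerically. The following is a small NumPy sanity check on random vectors (stand-ins for real activations), not the authors' code: it verifies that LN applied to a sum equals the sum of the per-term functions g plus beta.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4
ys = rng.normal(size=(k, d))           # decomposed inputs y_1 .. y_k
gamma = rng.normal(size=d)
beta = rng.normal(size=d)
eps = 1e-12

y = ys.sum(0)                          # y = sum_j y_j
m = y.mean()
s = np.sqrt(((y - m) ** 2).mean() + eps)

lhs = (y - m) / s * gamma + beta       # LN(y)
# m(.) is linear and s(y) is shared, so each term keeps its own mean:
rhs = sum((yj - yj.mean()) / s * gamma for yj in ys) + beta

ok = np.allclose(lhs, rhs)
```

The equality is exact up to floating-point error because m(Σ_j y_j) = Σ_j m(y_j) while s(y) is computed once from the full sum.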

Measuring the contribution of context
Regarding the success of contextualized representations in NLP, an interesting issue is where and how strongly context mixing is performed in the model. Based on this issue, we investigate the attention block by categorizing the terms in Equation 16 into the following two effects: 3

1. Mixing contextual information into the output representation by the ATTN: We measure the magnitude of this context-mixing effect by the norm ‖x̃_{i←context}‖, where x̃_{i←context} denotes the sum of the terms in Equation 16 originating from the context tokens x_j (j ≠ i). This strength refers to the amount of information from the surrounding contexts {x_1, ..., x_n} \ {x_i} used in calculating x̃_i.
2. Preserving the original information via ATTN and RES: We measure the magnitude of the preserving effect by the norm ‖x̃_{i←i}‖, where x̃_{i←i} denotes the term in Equation 16 originating from the original input x_i. This strength refers to the amount of information from the original vector x_i used in calculating x̃_i. In the attention block, information from the input vector x_i can flow through two routes: (i) attention to the original input and (ii) the residual connection.
To summarize the relative strength of the context-mixing effect, the context-mixing ratio is defined as follows:

r_i := ‖x̃_{i←context}‖ / ( ‖x̃_{i←context}‖ + ‖x̃_{i←i}‖ ).    (18)

A higher mixing ratio indicates that the mixing effect is more dominant than the preserving effect in the computation of x̃_i. Note that Abnar and Zuidema (2020) assumed that the multi-head attention and residual connection always contribute equally to the output, i.e., r ≈ 0.5, in their analysis of Transformers. However, our experiments reveal that, in practical masked language models, the mixing ratio is considerably below 0.5.
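Given decomposed terms for one output position, the ratio above is a one-liner. A minimal sketch follows, using random stand-ins for the decomposed terms F_i(x_j) rather than a real model's decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, i = 6, 8, 2
F = rng.normal(size=(n, d))            # F[j] stands in for F_i(x_j)

x_ctx = F[np.arange(n) != i].sum(0)    # mixing term: sum over j != i
x_own = F[i]                           # preserving term

r_i = np.linalg.norm(x_ctx) / (np.linalg.norm(x_ctx) + np.linalg.norm(x_own))
```

By construction r_i lies in [0, 1]; values near 0 mean the preserving effect dominates, values near 1 mean the mixing effect dominates.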

Experiments: Analysis of mixing ratio
The context-mixing ratio of the attention blocks in pre-trained masked language models was analyzed using the proposed norm-based analysis. The obtained results were different from those of the existing methods that analyze only some of the components in the attention block.

General setup
Model: We investigated 32 variants of masked language models (BERT with five different sizes, BERT-base trained with 25 different seeds, and RoBERTa with two different sizes). In Section 4, the results for BERT-base and RoBERTa-base are demonstrated. The results for the other models are provided in Appendices B and C. Note that most of our findings reported in this section generalize across these model variants. Exceptions are discussed in the relevant section (Section 4.4).

Data: Each input sequence, drawn from the Wikipedia dataset, consists of paired consecutive paragraphs. Each sequence is fed into the models with masking applied to 15% of the tokens 80% of the time. 4

Analysis methods: We compared the context-mixing ratio computed with the following five analysis methods:

• ATTN-W: Analyzing ATTN via the attention weights, as applied in many existing studies (Clark et al., 2019; Kovaleva et al., 2019; Mareček and Rosa, 2019, etc.). The attention weight α^h_{i,i} assigned to the original input vector corresponds to the preserving effect, and the other weights correspond to the mixing effect. The ratio (averaged over heads) is calculated as

r_i = 1 − (1/H) Σ_h α^h_{i,i}.

• ATTN-N: Analyzing ATTN via the vector norm (Kobayashi et al., 2020). The mixing ratio is calculated as

r_i = ‖Σ_{j≠i} Σ_h α^h_{i,j} f^h(x_j)‖ / ( ‖Σ_{j≠i} Σ_h α^h_{i,j} f^h(x_j)‖ + ‖Σ_h α^h_{i,i} f^h(x_i)‖ ).

• ATTNRES-W: Analyzing ATTN and RES via the attention weights, as Abnar and Zuidema (2020) did. They assumed that the residual-aware attention matrix is constructed as 0.5A + 0.5I, where A is the actual attention matrix and I is the identity matrix considered as the effect of the residual connection. The mixing ratio is calculated as

r_i = 1 − ( 0.5 (1/H) Σ_h α^h_{i,i} + 0.5 ).

• ATTNRES-N: Analyzing ATTN and RES via the vector norm. The mixing ratio is defined analogously to Equation 18, with the scope limited to ATTN and RES (i.e., without LN).

• ATTNRESLN-N (proposed): Analyzing ATTN, RES, and LN via the vector norm, the method proposed in Section 3. This corresponds to the r_i in Equation 18.
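The two weight-based ratios above (ATTN-W and ATTNRES-W) can be illustrated on a toy head-averaged attention matrix; the matrix A, size n, and position i below are random stand-ins, not model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, i = 5, 1
A = rng.random((n, n))
A /= A.sum(-1, keepdims=True)          # rows sum to 1, like attention weights

r_attn_w = 1.0 - A[i, i]               # ATTN-W: context weight

A_res = 0.5 * A + 0.5 * np.eye(n)      # residual-aware matrix (Abnar and Zuidema, 2020)
r_attnres_w = 1.0 - A_res[i, i]        # ATTNRES-W: context weight
```

Note that by construction r_attnres_w = 0.5 * r_attn_w, i.e., this scheme halves the context weight uniformly, which is exactly the equal-contribution assumption the norm-based methods drop.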

Results
We computed the mixing ratio of each token in each layer (each attention block) of the models with the five analysis methods (Section 4.1). The average, maximum, and minimum mixing ratios are shown in Table 2. Each row corresponds to a different analysis method.
Lower mixing ratio than in existing methods: Table 2 shows that the mixing ratios obtained from the proposed ATTNRES-N and ATTNRESLN-N largely differ from those obtained from the existing methods. Whereas the attention analyses (ATTN-W and ATTN-N) yield mixing ratios of 84–97% and ATTNRES-W yields 48–49%, our proposed methods (ATTNRES-N and ATTNRESLN-N) yield only about 19% and 16% on average, respectively. The visualizations of the token-by-token interactions in the common attention-map style become almost diagonal patterns (Figure 1). These results demonstrate that each layer's context mixing is lower than previously expected, and that RES and LN largely cancel the mixing by ATTN. Observing only the ATTN and drawing inferences about the whole Transformer layer may therefore be misleading. Note that Srivastava et al. (2015) reported a similar trend: stacked feed-forward networks tend to prioritize the "preserving" effect in skip connections.

Consistent trends across model sizes:
Our method consistently shows the lowest mixing ratio among the compared methods for BERT and RoBERTa models of various sizes (BERT-large, medium, small, tiny, and RoBERTa-large) (Appendix B). Interestingly, the context-mixing ratio is higher in the models with fewer layers (37% in BERT-tiny vs. 15% in BERT-large).

Connections to previous studies
Our finding of a lower mixing ratio than previously thought provides explanations for previous results and is consistent with the pre-training strategy.
Token identifiability: The low context-mixing ratio is consistent with Brunner et al. (2020)'s reports on what they called "token identifiability." They showed that input tokens can be well predicted only from the corresponding internal representations within BERT, especially in shallower layers, suggesting that context mixing is performed little by little. Our analysis results of the whole attention block were consistent with this finding.
Masked language modeling objective: Regarding the masked token prediction task 5 during pre-training, BERT and RoBERTa learn to conduct the following operations for a given input sequence: (i) infilling the [MASK] with plausible words, (ii) replacing the normal (non-special) tokens that might not fit their context (i.e., randomly replaced tokens) with plausible ones, and (iii) reconstructing the original input tokens that fit their context. In our experiments and in common practical use, most tokens in the input sequence are not masked and fit their context. Thus, BERT is assumed to reconstruct the inputs for these tokens (i.e., to behave as an auto-encoder). From this point of view, the superiority of the preserving effect is an intuitive behavior of the masked language models.
Low impact of discarding learned attention patterns: Several studies have reported the low impact of discarding the learned attention patterns in Transformers. Michel et al. (2019) and Kovaleva et al. (2019) reported that the attention patterns of many attention heads in Transformers can be removed or replaced with uniform patterns with almost no change in performance, and that this even brought about improvements in some cases. Voita et al. (2019) also reported the same phenomenon using a pruning method with additional training. In addition, Toneva and Wehbe (2019) reported that using uniform attention in the early layers of BERT instead of the learned attention patterns leads to a better ability to simulate human brain activity.
Our analysis shows that most of the attention signal is reduced by the immediately following modules, RES and LN. This fact may explain the above observations that discarding the learned attention patterns of many attention heads does not cause a severe difference.

Mechanism
Why is the mixing effect performed by the multi-head attention largely suppressed in the whole attention block? We discuss the roles of ATTN and LN in suppressing the mixing ratio.
ATTN reduces context-mixing ratio: RES is a mechanism that equally adds together the output of ATTN and the input in a one-to-one fashion (Equation 7). Considering this, the mixing ratio in the scope of ATTN and RES is expected to be about 50%, while the mixing ratio was actually substantially below 50% (19-22% in ATTNRES-N) (Section 4.2). This suggests that the output of ATTN is much smaller than the input; in other words, ATTN seems to have the effect of largely shrinking inputs to compute the output. How is this achieved?
Recall that the output of ATTN is a weighted sum of the affine-transformed vectors f^h(x_j) with the attention weights α^h_{i,j} (Equation 10). We describe and empirically show that (i) the affine transformation in ATTN has the effect of shrinking the inputs, and (ii) the attention weights and the norms of the affine-transformed vectors cancel each other out on specific vectors. We describe the brief idea here and provide the detailed derivation of each equation in Appendix C.
First, under a coarse assumption, the multiple affine transformations performed in the multi-head attention can be integrated into a single affine transformation f: R^d → R^d (Appendix C.1). Assume that the input vector x is a sample from the standard normal distribution, x ∼ N(0, I_d). Then we can estimate its magnitude by E‖x‖ ≈ √d and the magnitude after the affine transformation by E‖f(x)‖ ≈ √(Σ_k σ_k²), where σ_k denotes the singular values of f. Thus, the expansion rate of f is approximately estimated by

E‖f(x)‖ / E‖x‖ ≈ √( (1/d) Σ_k σ_k² ).

If the ratio is lower than one, f has a tendency of shrinking the input. For the commonly used large models, the results stably demonstrated the shrinking tendency (the layer mean of the expansion rate was 0.88 < 1.0 for BERT-base and 0.80 < 1.0 for BERT-large). Note that, for the smaller models, the results demonstrated an expanding tendency (layer means of 1.24 for BERT-mini and 1.86 for BERT-tiny). This is consistent with the result that the latter models tended to have a higher mixing ratio than the former (Section 4.2). Detailed results are shown in Appendix C.3. To summarize, ATTN's shrinking effect is probably achieved by (i) the shrinking by f alone and (ii) further shrinking through the cancellation of α and ‖f(x)‖. Through these mechanisms, ATTN contributes to decreasing the mixing ratio.
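The singular-value estimate of the expansion rate can be checked against a Monte Carlo estimate on a random linear map (bias omitted for brevity); the matrix W below is a random stand-in, not BERT's weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
sigma = np.linalg.svd(W, compute_uv=False)     # singular values of the map

# Closed-form estimate: sqrt(mean of sigma_k^2) ~ E||xW|| / E||x||
rate_svd = np.sqrt((sigma ** 2).mean())

# Monte Carlo estimate on x ~ N(0, I_d)
X = rng.normal(size=(10000, d))
rate_mc = (np.linalg.norm(X @ W, axis=1).mean()
           / np.linalg.norm(X, axis=1).mean())
```

For d this large the two estimates agree to within a few percent, illustrating why the singular values alone suffice to diagnose a shrinking (rate < 1) or expanding (rate > 1) transformation.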
LN reduces the context-mixing ratio: LN contains not only the vector normalization but also the affine transformation with the learnable parameters (Equation 8). Although the validity and usage of LN have been investigated in terms of the stability and speed of training (Parisotto et al., 2020; Liu et al., 2020), the effects of its affine transformation have rarely been explored. By comparing the mixing ratios obtained from ATTNRES-N and ATTNRESLN-N (Table 2), we discovered that LN reduces the context-mixing ratio. This suggests that the scaling (by γ) in the affine transformation shrinks the vector from ATTN and emphasizes RES over ATTN.

Detailed analysis
We further analyzed the mixing ratio of the masked language models in detail from the perspectives of both the layer and word attributes. In this section, we inherit the experimental setup (Section 4.1) from the previous section and demonstrate the results for BERT-base with the Wikipedia dataset. The results for the other experimental settings are shown in Appendices B and D. Note that only the finding reported in Section 5.2 did not generalize across model variants, and we discuss this point in the body as an exception.

Figure 3 shows the mixing ratio in each layer of the BERT model (results for the other models are shown in Appendix B). Each subfigure corresponds to a different analysis method, each row represents a layer, and each column represents a token type. The averaged results of the following token categories and their overall average ("overall") are reported: (i) non-special tokens ("normal"), (ii) [MASK], (iii) [CLS], and (iv) [SEP].

Table 3: Spearman's ρ between the frequency rank and the mixing ratio for each method, computed over all tokens and without the special tokens.

Results and discussion: Our proposed method showed that the mixing ratio is higher in the earlier layers than in the later ones (see the "overall" trend in Figure 3e). 6 This trend mirrors the tendency that a deep neural network with "gates" similar to residual connections passes through the input more in the later layers (Srivastava et al., 2015). Furthermore, our method showed a distinctive trend for the [MASK] tokens: in the middle and deep layers, the mixing ratio for [MASK] becomes higher (19–30%) than the overall trend (15–20%). Note that this trend becomes clearer when RES and LN are considered. It implies that, in the middle and deep layers, BERT refers to contextual information for predicting the masked words. The trends of the other masked language models are shown in Appendix B.

Word frequency and mixing ratio
In this section, we discuss a property of BERT related to word frequency. 7

Results: Table 3 lists the Spearman's rank correlation ρ between the frequency rank (e.g., rank("the") = 1, rank("and") = 6, etc.) and the mixing ratio across tokens in the text data.

6 The Spearman's ρ between the "overall" mixing ratio and the layer depth is −0.67 for BERT-base and −0.98 for RoBERTa-base.
7 Following Kobayashi et al. (2020), we counted the frequency of each word type by reproducing the training data of BERT.
The results obtained from ATTNRES-N and ATTNRESLN-N indicate a surprisingly stronger negative correlation than the results obtained by the existing methods (Figure 4). This indicates that BERT discounts the information of high-frequency words compared with low-frequency ones. 8

Discussion: Discounting high-frequency words is a common practice for making the semantic representation of a sentence or a text from word representations; examples are Luhn's heuristic in classical text summarization (Luhn, 1958) and the smooth inverse frequency (SIF) weighting in sentence vector generation (Arora et al., 2017). Our frequency-based results reveal that the attention blocks in BERT achieve this desirable property.
Our observation may also explain the phenomenon that adding up BERT's internal or output representations does not produce a good sentence vector (Reimers and Gurevych, 2019). In contrast, in static word embeddings (e.g., word2vec (Mikolov et al., 2013)), the norm encodes the word importance derived from its frequency (Schakel and Wilson, 2015); we can generate a good sentence vector by simply adding these static word vectors (Yokoi et al., 2020). Our finding suggests that BERT encodes the token's importance through the context-mixing ratio rather than the norm. 9 In this sense, it is plausible that additive composition using BERT's internal or output representations does not perform well.
Generalizability: Contrary to the other experimental results, only the relationship between word frequency and mixing ratio (Figure 4) was not consistent across different model sizes. For the larger variant (BERT-large), a stronger negative correlation between them was indicated than for BERT-base, while for the smaller variants (BERT-medium, BERT-small, BERT-mini, and BERT-tiny), a positive correlation or no correlation was indicated (see Appendix D). Generally, the larger BERT models (BERT-base and BERT-large) achieve better performance on downstream tasks. The different results across model sizes suggest that this desirable property can be learned when the representational power is sufficient.

8 Kobayashi et al. (2020) reported that ATTN in BERT tends to discount frequent words when mixing contexts. We found even stronger trends after broadening the scope of the analysis.
9 In BERT, it may be difficult for the norm to encode the token importance, because the norm is fixed at each layer normalization.
6 Related work

6.1 Probing Transformers

As current neural-based models have an end-to-end, black-box nature, existing studies have adopted several strategies to interpret their inner workings (Carvalho et al., 2019; Rogers et al., 2020; Braşoveanu and Andonie, 2020). In analyzing Transformers, previous studies have mainly employed approaches such as observing the vanilla attention weights and the norm-based analysis. We adopted the norm-based analysis because this method can be naturally extended to the analysis of the whole attention block and because it has some advantages (Kobayashi et al., 2020) that will also be discussed in the following paragraph.
As for broadening the scope of the analysis, Abnar and Zuidema (2020) modified the attention matrix to incorporate the residual connections into the analysis. However, they assumed that the multi-head attention and residual connection contribute equally to the computation of the output representations, without any justification (Section 4.1). Brunner et al. (2020) employed a gradient-based approach for analyzing the interaction of input representations; however, the gradient ignores the impact of the input vector (i.e., only observing ∂x̃_i/∂x_j neglects the impact of x_j itself), as described in Section 6.2 of Kobayashi et al. (2020). Note that our norm-based analysis can include the magnitude of the impact of the inputs in the analysis. They demonstrated that the output of self-attention networks without residual connections quickly converges to a rank-1 matrix as the layer depth increases. In addition, as a component similar to RES, Srivastava et al. (2015) proposed "gates" that adjust the amount of routing of the input information. Their experiments using stacked feed-forward networks for image classification also show trends consistent with ours: the effect of preserving the original input is dominant, especially in the later layers. Inspired by this observation, Liu et al. (2020) modified the Transformer architecture to enhance the original input in the residual connections and demonstrated that this extension leads to better performance and convergence. Note that several variants of the Transformer-based architecture with different arrangements of RES and LN have also been proposed (Klein et al., 2018; Xiong et al., 2020; Parisotto et al., 2020), and analyzing these models is left for future work.

Conclusions
In this paper, we extended the norm-based analysis to broaden the scope of analyzing Transformers from the multi-head attention alone to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of the masked language models revealed that the context-mixing ratio in each block is much lower than expected in previous studies, demonstrating that RES and LN largely cancel the mixing by ATTN. This observation can provide new explanations for some unexpected results reported on Transformers in fields ranging from NLP to neuroscience (e.g., that discarding the learned attention patterns did not adversely affect performance). Our detailed analysis further suggested that BERT discounts highly frequent, low-informative tokens.
Although our method is applicable to analyzing other variants of Transformers, our experiments were limited to the Transformer-based masked language models. In addition, the Transformer is not composed of only the attention block; feed-forward and embedding layers also exist. We plan to extend this work in both directions.

Acknowledgements
We would like to thank the members of the Tohoku NLP Lab for their insightful comments, particularly Benjamin Heinzerling for his valuable suggestions on content and wording. This work was supported by JST CREST Grant Number JP-MJCR20D2, Japan; JST ACT-X Grant Number JPMJAX200S, Japan; and JSPS KAKENHI Grant Number JP20J22697.

Ethical considerations
One recent issue in the NLP community is that neural-network-based models acquire unintended biases (e.g., gender bias) during the training process. This paper presents a method for interpreting the inner workings of real-world machine learning models, which may help us understand such biased behaviors of the models in the future.

A Derivation of "distributive law" of LN
In Section 3.2, we introduced the "distributive law" of LN (layer normalization) in Equations 12 and 13. Here, we show its derivation. Let z = Σ_j z_j be the input to LN. Then, Equations 12 and 13 are derived as follows:

LN(z) = ((z − m(z)) / s(z)) ⊙ γ + β
      = ((Σ_j z_j − Σ_j m(z_j)) / s(z)) ⊙ γ + β      (by the linearity of m(·))
      = Σ_j ( ((z_j − m(z_j)) / s(z)) ⊙ γ ) + β
      = Σ_j g_z(z_j) + β.

B Detailed results

Table 4 shows the architecture hyperparameters of each model. Table 5 shows the statistics of the mixing ratio for each model. Figures 5 to 11 show the mixing ratio at each layer (each attention block) of each model. We also conducted the analysis with the other three datasets. Table 6 shows the statistics of the mixing ratio for BERT-base on each dataset. Figures 12 to 14 show the mixing ratio at each layer of BERT-base on each dataset. Furthermore, we conducted the analysis with the 25 BERT-base models trained with different seeds by Sellam et al. (2021). Table 7 shows the statistics of the mixing ratio for these models on the Wikipedia dataset. Figures 15 to 17 show the mixing ratio at each layer of three of these models (trained with the 0th, 5th, and 20th seeds).
In Section 5.1, we showed the distinctive trend for the [MASK] tokens in BERT-base with the Wikipedia dataset. In the other models and with the other datasets as well, the mixing ratio for the masked tokens was relatively high in the middle and deep layers (Figures 5 to 14).
Contrary to the results for the masked tokens, the trend for the beginning-of-sequence token ([CLS] or <s>) differed across the models (Figures 5 to 11). For BERT-large, RoBERTa-large, and RoBERTa-base, the layer with the highest mixing ratio for [CLS] was the first layer, while for the other models, it was the final or penultimate layer. The different trends between BERT and RoBERTa can be naturally explained by the fact that RoBERTa is pre-trained without the next sentence prediction task. Although we cannot interpret the differences in trends across the BERT models of various sizes, they were consistent in that the later layers mix contextual information into [CLS] with a relatively high mixing ratio. This implies that, in the later layers, BERT conducts some operations specialized to the next sentence prediction task. Solving such a discourse-level task in the later layers is consistent with the previous report that BERT makes lower-level decisions (e.g., part-of-speech tagging) in the earlier layers and that the later layers hold higher-level information (e.g., knowledge of co-reference) (Tenney et al., 2019).

C Details on the investigation of the mechanism of ATTN's shrinking
We describe the details of Section 4.4.

C.1 Integration of each head's affine transformation
To consider the scaling effect of the affine transformations in ATTN, we integrate each head's affine transformation f^h into one affine transformation f: R^d → R^d, under a coarse assumption. First, for simplicity, we assume that all heads in an ATTN assign the same attention weights:

α^h_{i,j} = α_{i,j} for all h.

Then, the computation of ATTN (Equation 10) can be rewritten as follows:

ATTN(x_i, X) = Σ_j α_{i,j} Σ_h f^h(x_j) + b_O = Σ_j α_{i,j} f(x_j),    (29)

where f(x) := Σ_h f^h(x) + b_O (using Σ_j α_{i,j} = 1).

Concrete computation of f: From Equation 11, the affine transformation f is

f(x) = Σ_h (x W^h_V + b^h_V) W^h_O + b_O.

Following the Transformer implementation, in which the per-head matrices are stored as the concatenated parameters W_V, W_O ∈ R^{d×d} and b_V ∈ R^d, it can be further simplified as follows:

f(x) = x W_V W_O + b_V W_O + b_O.

On the difference in the arguments of ATTN and f: In Section 4.4, we considered the scaling effect of ATTN using the affine transformation f. One may wonder about the difference between the argument of ATTN (i.e., x_i) and the arguments of f (i.e., x_j) in Equation 29. We can give two kinds of justification for this. In the estimation of the expansion rate, we consider the expected value; from the symmetry of x_i and x_j, the expected value for x_i is obtained once the expected value for x_j is obtained. Moreover, in the actual BERT model, it has been empirically confirmed that two token vectors x_i, x_j ∈ X contained in the same context X lie in fairly close positions (x_i ≈ x_j). First, Ethayarajh (2019) found that the cosine similarity between the intra-sentence representations in BERT is much larger than 0. Second, the norms of the input vectors have just been unified by the layer normalization in the previous layer. Thus, for our target models, x_i ≈ x_j is not a strong assumption.
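The head-integration step relies on the identity that concatenating head outputs and applying the output projection equals summing per-head contributions through the corresponding row slices of W_O. A toy check with random arrays (illustrative shapes, not model weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n, H, dh = 4, 3, 5
d = H * dh
heads = rng.normal(size=(H, n, dh))            # per-head value outputs
Wo = rng.normal(size=(d, d))                   # output projection

# Standard multi-head form: concatenate heads, then project.
concat = np.concatenate(list(heads), axis=-1)  # shape (n, H * dh)
out_concat = concat @ Wo

# Per-head form: each head goes through its own row slice of Wo.
out_sum = sum(heads[h] @ Wo[h * dh:(h + 1) * dh] for h in range(H))

ok = np.allclose(out_concat, out_sum)
```

This block-matrix identity is what allows the per-head maps f^h to be summed into the single map f above.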

Affine transformation as linear transformation
The affine transformation f: R^d → R^d in ATTN can be viewed as a linear transformation f′: R^{d+1} → R^{d+1}. Given x′ := [x, 1] ∈ R^{d+1}, where 1 is concatenated to the end of each input vector x ∈ R^d, and writing f(x) = xW + b, the affine transformation f can be viewed as:

f′(x′) = x′ M,  M := [[W, 0], [b, 1]] ∈ R^{(d+1)×(d+1)},

so that f′(x′) = [f(x), 1]. The "affine transformation" mentioned in Section 4.4 represents this linear transformation f′, and we measured the singular values of f′.
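This affine-to-linear view can be checked numerically; the sketch below drops the carried 1 in the output for brevity (stacking b under W gives a (d+1)×d linear form), and uses random stand-in parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
x = rng.normal(size=d)

M = np.vstack([W, b])                  # (d + 1, d): b stacked as the last row
x1 = np.append(x, 1.0)                 # x' = [x, 1]

ok = np.allclose(x1 @ M, x @ W + b)    # linear form reproduces the affine map
```

The appended 1 simply selects the bias row, so the affine map's scaling behavior can be read off the singular values of the augmented matrix.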

C.2 Expected expansion rate for a random vector
In the following, we introduce the derivation of the expansion rate of the affine transformation f, that is,

E‖f(x)‖ / E‖x‖.

We assume that the input vector x is a sample from the standard normal distribution, x ∼ N(0, I_d). First, the expected value of ‖x‖² is as follows (Vershynin, 2018):

E‖x‖² = d.

Then, we have E‖x‖ ≈ √d. Next, let the singular value decomposition of the linear transformation f be f = UΣV, where Σ = diag(σ_1, ..., σ_d) ∈ R^{d×d} is the diagonal matrix of the singular values of f. As the matrix V is orthogonal, the following random vector y also follows the standard normal distribution, as does x:

y = (y_1, ..., y_d) := V x ∼ N(0, I_d).    (42)

Because the orthogonal transformation U does not change the norm, we only need to estimate E‖Σy‖ in order to estimate E‖f(x)‖ = E‖UΣV x‖. Then,

E‖Σy‖² = Σ_k σ_k² E[y_k²] = Σ_k σ_k²,

and thus E‖f(x)‖ ≈ √(Σ_k σ_k²), which yields the expansion rate √((1/d) Σ_k σ_k²).

C.3 Results for other models

Table 8 shows the expected expansion rate of f for each model.

D Relationship between word frequency and mixing ratio in other settings
We also conducted the experiment shown in Section 5.2 with the pre-trained BERT-large, BERT-medium, BERT-small, BERT-mini, and BERT-tiny models. However, we did not do so for RoBERTa-large and RoBERTa-base because of the difficulty of reproducing the pre-training dataset to count the word frequency. Table 9 lists the Spearman's rank correlation ρ between the frequency rank and the mixing ratio for each model. We discussed the inconsistency of the results across different model sizes in Section 5.2. We also conducted the experiment with the other three datasets. Table 10 lists the Spearman's rank correlation ρ between the frequency rank and the mixing ratio for each dataset.
Furthermore, we conducted the experiment with the 25 BERT-base models trained with different seeds. Table 11 lists the Spearman's rank correlation ρ between the frequency rank and the mixing ratio for the models on the Wikipedia dataset.

Table 7: Mean, maximum, and minimum values of the mixing ratio in each method for the 25 BERT-base models trained with different random seeds by Sellam et al. (2021). The mean value is the average of the values from the 25 models, and the standard deviation (SD) is also listed. The maximum and minimum values are the maximum and minimum of these values from the 25 models, respectively.

Table 11: Spearman's ρ between the frequency rank and the mixing ratio calculated by each method for the 25 BERT-base models trained with different random seeds. In the "w/o special tokens" setting, it was calculated without [CLS] and [SEP]. Both values are means over the 25 models, and the standard deviation (SD) is also listed.