How does Attention Affect the Model?

The attention layer has become a prevalent component for improving the effectiveness of neural network models on NLP tasks. Why attention is effective, and whether it is interpretable, have been widely debated. Current studies mostly investigate the effect of the attention mechanism based on the attention distribution it generates within a single neural network structure. However, they do not consider the changes in the semantic capability of different components of the model caused by the attention mechanism, which can vary across network structures. In this paper, we propose a comprehensive analytical framework that exploits a convex hull representation of sequence semantics in an n-dimensional Semantic Euclidean Space and defines a series of indicators to capture the impact of attention on sequence semantics. Through a series of experiments on various NLP tasks and three representative recurrent units, we analyze why and how attention benefits the semantic capacity of different types of recurrent neural networks, based on the indicators defined in the proposed framework.


Introduction and Motivation
The first appearance of the attention mechanism in natural language processing (NLP) can be traced back to its successful application in Neural Machine Translation (NMT). Bahdanau et al. (2014a) proposed an attention mechanism in an Encoder-Decoder model, which achieved great success, and showed that the attention weights produced by this mechanism improved the interpretability of the model by providing a way of aligning the source and target languages through a simple quantitative analysis. Subsequently, the assumption that attention could improve the interpretability and transparency of a model was accepted by many later works, such as AEN (Song et al., 2019) (applied to targeted sentiment classification), ATAE-LSTM (Wang et al., 2016) (applied to aspect-level sentiment classification), and CMLA (Wang et al., 2017) (applied to semantic sentiment analysis).
* Corresponding Author: Dawei Song
More recently, this hypothetical premise has aroused controversy. For example, Serrano and Smith (2019), among others, used an erasure-based approach and argued that attention weights do not necessarily correspond to importance. Wiegreffe and Pinter (2019) and Vashishth et al. (2019) considered attention to be interpretable, using a more model-driven approach and manual verification. These investigations focus on whether the attention distribution is unique and on the correlation between attention weights and model predictions, based on a similar neural network setting that consists of an embedding layer, a specific Recurrent Neural Network (RNN) and an attention component. Complex components such as the encoder-decoder structure were removed from the network, as they may bias the analysis of the effect of attention weights. However, these works fail to address two critical issues:
(1) Neglecting the changes in the rest of the model before and after introducing attention, especially in the word embedding layer and the RNN's hidden layer. Figure 1 shows the transition before and after the introduction of attention. When a model does not use attention, it generates an embedding sequence $E = \{e_1, e_2, \ldots, e_n\}$ from the original one-hot word representation. Subsequently, the sequence $E$ is processed by a specific RNN and converted into a hidden layer sequence $H = \{h_1, h_2, \ldots, h_n\}$, which is used to produce the output. Introducing attention changes the gradients during back-propagation in the training phase, which in turn causes the embedding and hidden sequences to shift. Existing works have focused on whether the attention distribution is unique or reasonable (if it is not unique). However, they ignore the extent of the semantic changes in the sequences caused by the attention mechanism, including the changes in the embedding sequence ($E \to E_{attn}$), in the hidden layer sequence ($H \to H_{attn}$), and even in the emerging attention sequence ($A_{attn}$). We posit that overlooking these changes leads to an unfair and biased analysis of attention.
(2) Lacking a systematic study of the attention effect on different types of RNNs. The attention layer is compatible with various types of RNNs, regardless of which recurrent unit is used (Vanilla-RNN (Mikolov et al., 2010), LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014)) and whether the structure is uni-directional or bi-directional. Although existing works have experimented on many datasets, they focus on only a single type of RNN preceding the attention layer at a time. We argue that a comprehensive comparison of the changes before and after introducing attention to the different types of RNNs mentioned above will better reveal the intrinsic interpretability of attention.
To address these two issues, we propose to explore the effect of attention from a new perspective: we conduct a systematic investigation of the semantic changes across the different sequences of an RNN model with and without attention, and compare how these changes differ across mainstream recurrent units. Based on the analysis results, we expect to better understand what happens before and after the introduction of attention into the model.
The proposed analysis requires a comprehensive framework with reasonable metrics to evaluate the quality of sequence semantics based on their vector representations. For this purpose, we adopt the concept of the convex hull in the n-dimensional Semantic Euclidean Space ($SR^n$) (Zhang et al., 2020) to represent the semantics of a sequence. Since an attention mechanism always produces a point in the convex hull of its preceding hidden units, we can establish suitable metrics based on the convex hull formed by a sequence of vectors in $SR^n$ to facilitate the analysis of the attention effect. Section 2 briefly introduces the Semantic Euclidean Space and the convex hull representation of sequence semantics. In Section 3, we analyze the attention mechanism and establish the relationship between the attention weights and the sequence meaning (as a convex hull in $SR^n$). Section 4 formulates a set of indicators to analyze the semantic changes before and after attention. With the proposed framework, we conduct comparative experiments on various datasets concerning text classification and sentiment analysis tasks in Section 5. Based on the experimental results, we conduct an in-depth analysis of why and how the attention mechanism benefits the semantic capacity of different recurrent units of RNNs.

Background
Zhang et al. (2020) proposed an n-dimensional Semantic Euclidean Space ($SR^n$), a semantic extension of $R^n$ that defines the mapping between points in a Euclidean space ($R^n$) and their semantics.
In $SR^n$, the points are divided into specific semantic points and abstract semantic points. A specific semantic point has a word corresponding to it, and can therefore also be regarded as a word embedding. An abstract semantic point does not have a specific word corresponding to it; an example is a point generated by the hidden layer of an RNN. Zhang et al. (2020) then proposed to use the convex hull and the centroid of points in $SR^n$ to measure the meaning and the central idea of a sequence of words. This provides a theoretical basis for exploring the semantic changes that occur before and after introducing attention into a model.

Meaning of a Sequence
Definition. The meaning of a sequence composed of semantic points is represented by the convex hull composed of these points.
Given a sequence $X$ composed of semantic points, its meaning, denoted as $\mathrm{ME}(X)$, is formulated as:

$$\mathrm{ME}(X) = \mathrm{Conv}(X) \tag{2}$$

$\mathrm{Conv}(X)$ denotes the convex hull of a finite point set $X$, i.e., the set of all convex combinations of its points (Faux and Pratt, 1979). In a convex combination, each point $x_i$ in $X$ is assigned a weight or coefficient $\alpha_i$ such that the coefficients are all non-negative and sum to one. These weights are used to produce a weighted average of the points. It is formulated as:

$$\mathrm{Conv}(X) = \left\{ \sum_{i=1}^{|X|} \alpha_i x_i \;\middle|\; \alpha_i \ge 0,\ \sum_{i=1}^{|X|} \alpha_i = 1 \right\} \tag{3}$$

The mapping between the definition of the convex hull and the meaning of a sequence is intuitive. A sentence (sequence) consists of words (semantic points). In addition to the semantics expressed by the individual words, a sentence should also include the implicit semantics (abstract semantic points) produced by all possible combinations of words.
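To make the convex hull view of sequence meaning concrete, the following is a minimal sketch in pure Python. It assumes 2-dimensional semantic points (as obtained after dimensionality reduction); the helper names `convex_hull`, `convex_combination` and `in_hull`, and the toy coordinates, are illustrative assumptions and not part of the original framework.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive when o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_combination(points: Sequence[Point], weights: Sequence[float]) -> Point:
    """Weighted average with non-negative weights that sum to one."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    x = sum(w * p[0] for w, p in zip(weights, points))
    y = sum(w * p[1] for w, p in zip(weights, points))
    return (x, y)


def in_hull(hull: Sequence[Point], q: Point, eps: float = 1e-9) -> bool:
    """True if q lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], q) >= -eps for i in range(n))


# Toy "sentence": five word embeddings (hypothetical coordinates).
words = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
hull = convex_hull(words)                       # boundary of ME(X)
mix = convex_combination(words, [0.2] * 5)      # an abstract semantic point
print(hull, mix, in_hull(hull, mix))
```

Any point produced by `convex_combination` falls inside the hull, which is the sense in which the hull collects the implicit semantics generated by combining the words.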

Central Idea of a Sequence
Definition. The central idea of a sequence composed of semantic points is represented by the centroid of the sequence's meaning.
The central idea of a sequence $X$ of semantic points is denoted as $\mathrm{CI}(X)$ and formulated as:

$$\mathrm{CI}(X) = \mathrm{Centroid}(\mathrm{ME}(X)) = \mathrm{Centroid}(\mathrm{Conv}(X)) \tag{4}$$

The centroid of a subset $X$ of $R^n$ is the mean position of all its points in all coordinate directions. It is computed as:

$$\mathrm{Centroid}(X) = \frac{\int x \, g(x) \, dx}{\int g(x) \, dx} \tag{5}$$

where the integrals are taken over the whole space $R^n$, and $g$ is the characteristic function, which is 1 if a point is inside $X$ and 0 otherwise (Protter and Morrey, 1977). The central idea of a sequence needs to be calculated as $\mathrm{Centroid}(\mathrm{Conv}(X))$, instead of $\mathrm{Centroid}(X)$ directly, to guarantee that it lies within the convex hull (meaning) of the sequence. The geometric centroid of a convex object always lies within the object, whereas a non-convex object may have a centroid that lies outside it, which is undesirable. As introduced above, ME scopes the meaning of a sequence as an area in $SR^n$, while the central idea of the sequence should be at the centre of the ME area.
The central idea of a sequence can be considered a "summary" of the sentence's meaning. Operationally, it is the centroid of the convex hull representation of the sentence meaning within $SR^n$. Take the phrase "The Association for Computational Linguistics" as an example: its central idea can be summarized as a semantic point that corresponds to the abbreviation "ACL". For another phrase, "good enough but not excellent", the central idea can also be summarized as a semantic point, but for the time being there is no word that corresponds to it. Perhaps, as natural language develops, people will create a word to describe this semantic point. These two cases correspond to the specific semantic point and the abstract semantic point defined in $SR^n$. More explanations about $SR^n$ can be found in Zhang et al. (2020).
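As an illustration of why the central idea is taken as the centroid of the convex hull rather than the plain mean of the points, here is a small sketch assuming 2-dimensional points. The function names and the coordinates are hypothetical; the polygon centroid uses the standard shoelace-based formula.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def mean_point(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def polygon_centroid(poly: Sequence[Point]) -> Point:
    """Area centroid of a simple polygon; falls back to the mean if degenerate."""
    if len(poly) < 3:
        return mean_point(poly)
    a2 = cx = cy = 0.0  # a2 accumulates twice the signed area
    for i in range(len(poly)):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % len(poly)]
        c = x0 * y1 - x1 * y0
        a2 += c
        cx += (x0 + x1) * c
        cy += (y0 + y1) * c
    return (cx / (3.0 * a2), cy / (3.0 * a2))


def central_idea(points: Sequence[Point]) -> Point:
    """CI(X) = Centroid(Conv(X)) for 2-D points."""
    return polygon_centroid(convex_hull(points))


# Four corner words plus two words clustered near one corner (toy values).
X = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (1.0, 1.0), (1.0, 0.5)]
ci = central_idea(X)       # centre of the area covered by the meaning
avg = mean_point(X)        # pulled towards the cluster of interior words
print(ci, avg)
```

In this toy case the hull centroid stays at the centre of the rectangle, while the plain mean drifts towards the clustered points.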

Attention With Convex Hull
Motivated by the ability of $SR^n$ to measure the meaning and central idea of sequences, we theoretically analyze the role of attention from the perspective of semantic change.
Take the well-known Scaled Dot-Product attention (Vaswani et al., 2017) as an example (abbreviated as dot-attn hereafter). A sequence $X = \{x_1, x_2, \ldots, x_n\}$ passes through a dot-attn layer as a matrix $\mathbf{X} \in R^{n \times m}$ of the word vectors corresponding to the input sequence, where $m$ is the dimensionality of the word vectors. Essentially, this process can be described as the following two steps:
1. Construct an attention distribution $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ from the input sequence $X$ and the softmax function.
2. Use the attention weights and $X$ to generate a new sequence $Y = \{y_1, y_2, \ldots, y_n\}$, which is called an attention sequence.
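For reference, the two steps can be written out as below. This is a sketch under an assumption: it takes the self-attention form in which queries, keys and values are all the input matrix itself, without learned projections, which may differ from the exact formulation used in the experiments. The symbol $\boldsymbol{\alpha}$ for the $n \times n$ weight matrix is introduced here for illustration.

```latex
% Assumed self-attention form: Q = K = V = \mathbf{X} (no learned projections).
\[
  \boldsymbol{\alpha} = \operatorname{softmax}\!\left(
      \frac{\mathbf{X}\mathbf{X}^{\top}}{\sqrt{m}} \right)
  \in R^{n \times n},
  \qquad
  \mathbf{Y} = \boldsymbol{\alpha}\,\mathbf{X} \in R^{n \times m},
\]
% where the softmax is applied row-wise, so row i of \boldsymbol{\alpha}
% is the distribution \alpha_i used to build y_i.
```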
Each $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{in})$ is a probability distribution generated by the softmax function, which ensures that every component lies in the interval (0, 1) and that the components add up to 1. Focusing on a specific vector $y_i$ in $Y$, it can be expressed as follows:

$$y_i = \sum_{j=1}^{n} \alpha_{ij} x_j, \qquad \alpha_{ij} > 0, \qquad \sum_{j=1}^{n} \alpha_{ij} = 1 \tag{8}$$

Comparing Formula 8 with Formula 3, we find that under the action of dot-attn, converting $X$ to $Y$ is indeed a process of repeatedly selecting new semantic points from the convex hull of $X$ (i.e., the meaning of $X$, $\mathrm{ME}(X)$) to form a new semantic sequence $Y$. Therefore, the new sequence $Y$ is, to some extent, a semantic transformation of the original sequence $X$. An example of the conversion $X \xrightarrow{\text{dot-attn}} Y$ is shown in Figure 2. Furthermore, although the convex hull of $X$ can be used to express the meaning of a sentence, the model usually uses $X$ to construct a vector $c = \frac{1}{n} \sum_{i=1}^{n} x_i$ as the representation of the sentence, followed by a dense layer and an activation function to generate the prediction result. The form of the sentence representation $c$ is consistent with the definition of the central idea in $SR^n$. Therefore, from the perspective of the sentence representation, attention transforms the central idea expressed by the original sequence ($\mathrm{CI}(X)$) into a new semantic point $\mathrm{CI}(Y)$ (the central idea of $Y$). The offset from the yellow diamond to the red triangle in Figure 2 represents the conversion from $\mathrm{CI}(X)$ to $\mathrm{CI}(Y)$.
Figure 2: Use attention to convert sequence X to sequence Y. The meaning of X is represented by the yellow shaded part, and the meaning of Y is represented by the red part.
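The claim that every attention output is a convex combination of the inputs can be checked numerically. The sketch below assumes self-attention without learned projections (queries, keys and values are all the input vectors) and 2-dimensional toy inputs; `dot_attn` and `inside_convex_polygon` are illustrative names rather than functions from any library.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, ...]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; outputs are positive and sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attn(X: Sequence[Vec]) -> Tuple[List[List[float]], List[Vec]]:
    """Scaled dot-product self-attention with Q = K = V = X (an assumption)."""
    dim = len(X[0])
    weights: List[List[float]] = []
    Y: List[Vec] = []
    for xi in X:
        scores = [sum(a * b for a, b in zip(xi, xj)) / math.sqrt(dim) for xj in X]
        alpha = softmax(scores)                       # attention distribution for x_i
        weights.append(alpha)
        Y.append(tuple(sum(a * xj[k] for a, xj in zip(alpha, X)) for k in range(dim)))
    return weights, Y


def cross(o: Vec, a: Vec, b: Vec) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def inside_convex_polygon(poly: Sequence[Vec], q: Vec, eps: float = 1e-9) -> bool:
    """True if q lies inside or on a counter-clockwise convex polygon."""
    n = len(poly)
    return all(cross(poly[i], poly[(i + 1) % n], q) >= -eps for i in range(n))


# Toy input: the four corners of a square, listed counter-clockwise, so the
# input sequence is already its own convex hull.
X = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
weights, Y = dot_attn(X)
all_inside = all(inside_convex_polygon(X, y) for y in Y)
print(Y, all_inside)
```

Because each row of `weights` is strictly positive and sums to one, every point of the attention sequence falls inside the hull of the input, so the meaning of the new sequence cannot extend beyond the meaning of the original one.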
In summary, the model has undergone the following changes after the introduction of attention: 1. Attention adjusts the meaning expressed by the original sequence by adjusting each semantic point in the sequence.
2. Attention changes the central idea (an instance representation) of the original sequence.
Through the above analysis, we gain a deeper theoretical understanding of how attention transforms the original input from the perspective of the convex hull. It is important to note that the attention analysis framework proposed above is also applicable to other forms of attention, such as tanh attention (Zhou et al., 2016). Whether the popular dot-attn or the traditional tanh attention is used, both can be regarded as first adjusting the sequence X to the sequence Y, and then averaging the result to make predictions. The subtle difference between them lies in the dimensionality of the attention distribution: for an input sequence, dot-attn generates a 2-dimensional distribution, while tanh attention generates a 1-dimensional distribution.
However, the above analysis alone is not sufficient. During training, the introduction of attention changes the gradients in back-propagation, which causes the original sequence $X$ to be converted into a new sequence $X_{attn}$; this has also been ignored in previous studies. To this end, we first construct relevant evaluation indicators, and then present our complete analysis framework based on the theoretical analysis of attention in this section.

Assessing the Effect of Attention
Following the typical settings in this area, we use an RNN model as the basis to systematically analyze the changes between different sequences in the process of introducing attention to the model. We specify multiple indicators to measure these changes and accordingly present our analysis framework.

Various Sequences Before and After the Introduction of Attention
As shown in Figure 3, the model takes the one-hot representation of the word sequence as the initial input. When attention is not used, the model contains an embedding sequence $E = \{e_1, e_2, \ldots, e_n\}$ inferred from a dense layer and a hidden sequence $H = \{h_1, h_2, \ldots, h_n\}$ produced by an RNN. When an attention mechanism is introduced, the model weights change due to the gradient changes during the training process.
Hence, $E$ and $H$ are converted to new sequences $E_{attn} = \{e_1^{attn}, e_2^{attn}, \ldots, e_n^{attn}\}$ and $H_{attn} = \{h_1^{attn}, h_2^{attn}, \ldots, h_n^{attn}\}$, and the attention layer produces an additional attention sequence $A_{attn} = \{a_1^{attn}, a_2^{attn}, \ldots, a_n^{attn}\}$. For the above five sequences, there is the following progressive relationship:

$$E \to H, \qquad E_{attn} \to H_{attn} \to A_{attn} \tag{9}$$

The differences between the above two links of sequences should be carefully examined to explore the impact of attention on the model. In this work, we first define a series of indicators to measure the semantic expression ability of an individual sequence and the semantic relationship between two sequences that belong to the same link. We then propose a framework to systematically compare the differences between the two links in order to assess the effect of attention.

Degree of Semantic Unsaturation
In $SR^n$, the meaning of a sequence $X$ of length $|X|$ is calculated by ME. On this basis, we define the degree of semantic unsaturation of a sequence, denoted $\mathrm{DSU}(X)$. Normally, the smaller the semantic space contained in the meaning of a sequence, the more precise the semantics expressed. Specifically, for two sequences $X$ and $Y$ of the same length, if the meaning expressed by $X$ is more precise than that expressed by $Y$, i.e., $\mathrm{ME}(X) < \mathrm{ME}(Y)$, then $\mathrm{DSU}(X)$ is less than $\mathrm{DSU}(Y)$, which means that $X$ is less unsaturated than $Y$. For this reason, the smaller $\mathrm{DSU}(X)$, the better.
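The following sketch shows one way such an indicator could be computed. It rests on a loudly stated assumption: DSU is taken here as the area of the 2-D convex hull divided by the sequence length, which is only consistent with the properties described above and is not necessarily the exact definition used in the framework. All function names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly: Sequence[Point]) -> float:
    """Shoelace formula; zero for fewer than three vertices."""
    if len(poly) < 3:
        return 0.0
    twice = sum(
        poly[i][0] * poly[(i + 1) % len(poly)][1]
        - poly[(i + 1) % len(poly)][0] * poly[i][1]
        for i in range(len(poly))
    )
    return abs(twice) / 2.0


def dsu(points: Sequence[Point]) -> float:
    """ASSUMED form: area of ME(X) normalized by the sequence length |X|."""
    return polygon_area(convex_hull(points)) / len(points)


# Two toy sequences of the same length: a compact one and a spread-out one.
tight = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
loose = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(dsu(tight), dsu(loose))
```

Under this assumed form, the compact sequence receives the lower score, matching the reading that a smaller value indicates more precise semantics.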

Semantic Coverage
Given two sequences $X$ and $Y$, where $Y$ is a semantic transformation of the preceding sequence $X$ (this transformation may be a synonymous transformation or even a semantic extraction), we use semantic coverage (SC) (Zhang et al., 2020) to indicate the overlap between the meanings of the two sequences. Since $X$ is the original sequence and $Y$ is the converted sequence, three indicators, Semantic Coverage Precision (SCP), Semantic Coverage Recall (SCR), and Semantic Coverage F-Measure (SCF), can be naturally defined to observe the changes between the two sequences.
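A possible computation of these coverage indicators is sketched below. The exact formulas are assumptions made by analogy with precision and recall: SC is taken as the area of the intersection of the two hulls, SCP normalizes it by the area of the converted sequence's hull, SCR by the area of the original sequence's hull, and SCF is their harmonic mean. The intersection uses Sutherland-Hodgman clipping, which is valid here because both polygons are convex.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly: Sequence[Point]) -> float:
    if len(poly) < 3:
        return 0.0
    n = len(poly)
    twice = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
                for i in range(n))
    return abs(twice) / 2.0


def clip_convex(subject: Sequence[Point], clip: Sequence[Point]) -> List[Point]:
    """Intersection of two counter-clockwise convex polygons (Sutherland-Hodgman)."""
    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        inp, output = output, []
        for j in range(len(inp)):
            p, q = inp[j - 1], inp[j]
            dp, dq = cross(a, b, p), cross(a, b, q)   # >= 0 means inside this edge
            if (dp >= 0) != (dq >= 0):                # edge p -> q crosses the clip line
                t = dp / (dp - dq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if dq >= 0:
                output.append(q)
    return output


def semantic_coverage(X: Sequence[Point], Y: Sequence[Point]) -> Tuple[float, float, float]:
    """ASSUMED forms of (SCP, SCR, SCF); X is the original, Y the converted sequence."""
    hx, hy = convex_hull(X), convex_hull(Y)
    overlap = polygon_area(clip_convex(hy, hx))
    scp = overlap / polygon_area(hy)
    scr = overlap / polygon_area(hx)
    scf = 0.0 if scp + scr == 0 else 2 * scp * scr / (scp + scr)
    return scp, scr, scf


X = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]   # area 16
Y = [(2.0, 2.0), (6.0, 2.0), (6.0, 4.0), (2.0, 4.0)]   # area 8, overlap area 4
scp, scr, scf = semantic_coverage(X, Y)
print(scp, scr, scf)
```

This sketch assumes both hulls have non-zero area; degenerate sequences (fewer than three distinct non-collinear points) would need separate handling.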

Central Idea Offset
In addition to the difference in meaning between two sequences, it is crucial to check the deviation between the central ideas of the two sequences $X$ and $Y$. The offset distance between the central idea of $Y$ and that of the original sequence $X$ is called the Central Idea Offset (CIO).
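As a minimal sketch, the offset can be computed as the distance between the two hull centroids. The use of the Euclidean distance is an assumption made here for illustration, as are the function names and the toy coordinates.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def central_idea(points: Sequence[Point]) -> Point:
    """CI(X): area centroid of the convex hull (mean of points if degenerate)."""
    poly = convex_hull(points)
    if len(poly) < 3:
        return (sum(p[0] for p in poly) / len(poly), sum(p[1] for p in poly) / len(poly))
    a2 = cx = cy = 0.0  # a2 accumulates twice the signed area
    for i in range(len(poly)):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % len(poly)]
        c = x0 * y1 - x1 * y0
        a2 += c
        cx += (x0 + x1) * c
        cy += (y0 + y1) * c
    return (cx / (3.0 * a2), cy / (3.0 * a2))


def cio(X: Sequence[Point], Y: Sequence[Point]) -> float:
    """ASSUMED form: Euclidean distance between CI(X) and CI(Y)."""
    (x0, y0), (x1, y1) = central_idea(X), central_idea(Y)
    return math.hypot(x1 - x0, y1 - y0)


X = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]   # central idea at (2, 2)
Y = [(2.0, 2.0), (6.0, 2.0), (6.0, 4.0), (2.0, 4.0)]   # central idea at (4, 3)
offset = cio(X, Y)
print(offset)
```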

Analysis Framework
Based on the definitions of the above five indicators, we propose a framework to analyze the impact of attention on a given model. Recalling the two links in Eq. 9, since the number of sequences contained in each link is different, we propose two perspectives for comparison: the corresponding perspective and the shift perspective.
Figure 3: The two models, before and after attention is introduced, spliced in a symmetrical manner. The word vector sequences and hidden layer sequences linked by the red arrows represent observation from the corresponding perspective. The sequences linked at both ends of the blue dashed arrows represent observation from the shift perspective.

The Corresponding Perspective
The introduction of attention to a model causes changes in its embedding sequence ($E \to E_{attn}$) and its hidden layer sequence ($H \to H_{attn}$). Observing the changes in the corresponding layers of the model is called the corresponding perspective. Using $\Delta$ to represent the difference, $\Delta(\rho(E), \rho(E_{attn}))$ reflects the influence of the introduction of attention on the embedding layer from the corresponding perspective, and similarly for $H$ and $H_{attn}$. In addition to the changes in a single sequence, the difference between the links between adjacent sequences ($E \to H$, $E_{attn} \to H_{attn}$) can also be used to observe the impact of attention. For example, $\Delta(\mathrm{SCP}(E, H), \mathrm{SCP}(E_{attn}, H_{attn}))$ compares the semantic coverage precision between the embedding layer and the hidden layer before and after the introduction of attention. This difference can also be computed for SCR, SCF and CIO.

The Shift Perspective
According to the analysis in Section 3, before attention is added, the generated hidden sequence $H$ is actually a conversion of the embedding sequence $E$. In the presence of attention, the embedding sequence $E_{attn}$ is converted to the sequence $H_{attn}$, which is further transformed to $A_{attn}$ by the attention mechanism. Therefore, the difference between $\mathrm{CIO}(E, H)$ and $\mathrm{CIO}(E_{attn}, A_{attn})$ can be used to reflect the influence of attention on the overall semantic shift along the link. Meanwhile, an input sentence is finally converted into $H$ or $A_{attn}$ to express the meaning of the sentence, so we can alternatively express this change by $\Delta(\mathrm{DSU}(H), \mathrm{DSU}(A_{attn}))$.

Exploring the Attention
We first introduce the datasets and models used in the experiments and then explore the impact of attention using the analysis framework above.

Experimental Setup
To keep our analysis concise, our experiments focus on two tasks: sentiment analysis (Stanford Sentiment Treebank (SST) (Socher et al., 2013)) and text classification (AGNews). In the future, we will extend our work to more datasets, especially machine translation datasets with the encoder-decoder model.
Since the AGNews dataset does not have a predefined validation set, its training set is split into a training set and a validation set at a ratio of 8:2. The statistics of the datasets are shown in Table 1. For each dataset, the base model used for training is shown in Figure 1. It has an embedding layer that converts the one-hot representation into a distributed representation, a specific RNN layer (the recurrent unit can be Vanilla-RNN, LSTM or GRU, and the overall structure can be uni-directional or bi-directional, resulting in 6 different combinations), with or without a dot-attn layer, followed by an additive layer and softmax prediction. The accuracy of these models on the validation set is shown in Table 2. We calculate and analyze our indicators on the test set (the distribution of sentence length in the test set is shown in Appendix A). Because computing the convex hull of high-dimensional vectors directly is not yet feasible, we first use t-SNE to reduce the collected vectors to 2 dimensions (as in the work of Zhang et al.), then calculate the convex hull of each sequence, and use its area to represent the semantic size covered by the convex hull (reproducibility details are given in Appendix B).
On both datasets, the trends of the experimental results are similar regardless of whether a uni-directional or bi-directional RNN structure is used. Therefore, we only show the results obtained with bi-directional RNNs on the SST dataset, comparing different recurrent units before and after the introduction of attention. More experimental results can be found in Appendix C, such as uni-directional RNNs on SST and uni/bi-directional RNNs on AGNews.

Analysis from the Corresponding Perspective
Figure 4 shows the difference in semantic density from the corresponding perspective. The embedding sequences ($E$, $E_{attn}$) learned by the model are basically the same for the different recurrent types, with or without attention. For the hidden layer sequences ($H$, $H_{attn}$), however, a difference emerges: no matter what type of recurrent unit is used, the value for $H_{attn}$ is lower than that for $H$. This shows that the hidden layer sequence of an RNN with attention extracts or abstracts the original text within a smaller semantic range, so the semantics expressed are more accurate. With attention, the improved accuracy of semantic expression in the hidden layer sequence (i.e., the decreased degree of semantic unsaturation) is accompanied by an improvement in model performance, as shown in Table 2. The introduction of the attention mechanism also shortens the central idea offset between the embedding sequences and the hidden sequences.
From the semantic coverage perspective, the introduction of attention generally improves the semantic conversion between the embedding layer and the hidden layer on all three indicators: semantic coverage recall, semantic coverage precision, and semantic coverage F-Measure. This improvement indicates that the semantic closeness between the embedding and hidden layers and the degree of unsaturation of the hidden layer greatly influence the model results. The introduction of attention improves the accuracy of the semantics expressed by the hidden layer sequence and makes the semantic conversion between the embedding sequence and the hidden layer sequence more natural. Comparing the different types of recurrent units, Vanilla-RNN is significantly worse than LSTM and GRU on most indicators, which shows that the lower prediction accuracy of Vanilla-RNN relative to LSTM and GRU is faithfully reflected in our indicators.

Analysis from the Shift Perspective
The results from the shift perspective are shown in Figure 5: the top picture reflects $\Delta(\mathrm{CIO}(E, H), \mathrm{CIO}(E_{attn}, A_{attn}))$, and the bottom picture reflects $\Delta(\mathrm{DSU}(H), \mathrm{DSU}(A_{attn}))$. After the introduction of attention, the sequence $A_{attn}$ used for semantic expression has a smaller degree of semantic unsaturation than $H$, which is used for semantic expression before attention is introduced. At the same time, the distance between the central idea of the embedding sequence ($\mathrm{CI}(E)$, $\mathrm{CI}(E_{attn})$) and the final vector used as an instance representation ($\mathrm{CI}(H)$, $\mathrm{CI}(A_{attn})$) is also shortened. The improvement of these indicators is also reflected in the accuracy of the model.
It is worth mentioning that, when the changes are observed across the different recurrent types, the number of gate structures in the recurrent unit is positively correlated with the degree of semantic unsaturation of the embedding sequences and attention sequences. (There are three gate structures in LSTM, two in GRU and none in Vanilla-RNN.)
Table 1 and Figure 6 in Appendix A show that, in terms of dataset size, vocabulary size and sentence length, the AGNews dataset is larger than SST. The experimental results show that the changes in many indicators for Vanilla-RNN after adding attention are the most obvious on SST. On the other hand, without attention, LSTM and GRU are significantly better than Vanilla-RNN in terms of semantic expression ability. However, this advantage is largely offset once attention is introduced, since attention can recognize the central idea well and effectively condense the semantics of a sequence. Therefore, for LSTM and GRU, the changes in performance caused by adding attention are relatively smaller than for Vanilla-RNN. This also explains why the accuracy of any RNN variant can be greatly improved after the introduction of attention.

Analysis from the Holistic Perspective
From Figure 4 and Figure 7 in the Appendix, we can see that the introduction of attention leads to substantial improvements in all indicators on the SST dataset, while the improvements on the AGNews dataset are significantly smaller. Furthermore, for the CIO indicators, the Vanilla-RNN results are similar to those of LSTM and GRU after the introduction of attention. The CIO indicator measures the offset between central ideas and is directly related to the vector that the model finally uses to make predictions (see Formula 8 and Formula 15). Therefore, the high performance of Vanilla-RNN with dot-attn on the SST validation set is explainable; in particular, a uni-directional RNN model with dot-attn reached 0.948 on the SST validation set.
All of these results verify that the analysis framework in our paper can objectively reflect the attention mechanism's effectiveness on the semantic expression ability.

Related Work
Guidotti et al. (2018) categorized the problems of black-box models in detail; the interpretability problem of attention discussed in this paper belongs to the model explanation problem.
RNNs can be regarded as the ancestral model into which the attention mechanism was introduced for NLP tasks. Karpathy et al. (2015) established a mapping between the neurons of hidden layers and the content they represent in order to explore RNNs. Du et al.

Conclusions
In this paper, we have proposed a novel framework, based on a convex hull representation of sequence semantics over a Semantic Euclidean Space, to analyze the effect of attention on the semantic capacity of an RNN model and how this effect differs across network structures. Extensive experiments on two NLP tasks provide in-depth insights into how and why attention impacts the model. From the corresponding perspective, the introduction of attention directly leads to (1) a reduction of semantic unsaturation in the hidden layer of RNNs, that is, an increase in the accuracy of the original semantic expression, (2) a narrower central idea distance between the hidden layer sequence and the embedding layer sequence, and (3) improved semantic coverage between the embedding layer sequence and the hidden layer sequence. These are critical impacts of attention on the model, and they improve the capabilities of different types of RNNs. From the shift perspective, the attention layer sequence further reduces the degree of semantic unsaturation and lies closer to the embedding layer sequence in terms of the central idea. This is a critical factor in improving the model's accuracy. Our method illustrates how attention affects the model from the perspective of semantic transformation and addresses the limitation of previous studies, which used only a single model to analyze attention.
We believe that the method proposed in this paper will help carry out more in-depth analysis of the role of attention and provide a brand-new perspective for semantic visualization in NLP tasks.

A Datasets Detail
The statistics of sentence length in the test sets of SST and AGNews are shown in Figure 6. The distributions show that sentence lengths in the AGNews test set are mainly concentrated in (0, 80), while those in the SST test set are mainly concentrated in (0, 40).

Figure 6: Sentence length statistics (panels: AGNews, SST). The abscissa indicates the length of the sentence, and the ordinate indicates the count.

B Reproducibility
Our experiments use the public datasets SST and AGNews. To make the experimental results easier to reproduce, we store the scores of each set of sequences in the dataset on the defined indicators in a pickle binary file, which can be conveniently loaded and viewed with Pandas. We have uploaded all the pickle files saved under the different models and datasets together with the code, and we provide our plotting code for viewing the experimental results reported in our paper. The model code and training code will be released after some sorting.
Figure 7: Impact of the introduction of attention observed from the corresponding perspective. Each set of pictures shows the experimental results under a different experimental setting, determined by the dataset used in the experiment (SST or AGNews) and the directionality of the RNN (uni-directional or bi-directional). Different types of recurrent units are compared on the ordinate of each picture.
Figure 8: Impact of the introduction of attention observed from the shift perspective. Each set of pictures shows the experimental results under a different experimental setting, determined by the dataset used in the experiment (SST or AGNews) and the directionality of the RNN (uni-directional or bi-directional). Different types of recurrent units are compared on the ordinate of each picture.