More Identifiable yet Equally Performant Transformers for Text Classification

Interpretability is an important aspect of the trustworthiness of a model’s predictions. A Transformer’s predictions are widely explained by its attention weights, i.e., the probability distribution generated at a self-attention unit (head). Current empirical studies provide evidence that attention weights are not explanations by showing that they are not unique. A recent study gave a theoretical justification for this observation by proving the non-identifiability of attention weights. For a given input to a head and its output, if the attention weights generated in it are unique, we call the weights identifiable. In this work, we provide a deeper theoretical analysis and empirical observations on the identifiability of attention weights. By uncovering the role of the key vector, which previous work ignored, we find that attention weights are more identifiable than currently perceived. However, the weights can still be non-unique, which makes them unfit for interpretation. To tackle this issue, we provide a variant of the encoder layer that decouples the relationship between the key and value vectors and provides identifiable weights up to a desired input length. We demonstrate the applicability of this variant with empirical results on varied text classification tasks. The implementations are available at https://github.com/declare-lab/identifiable-transformers.


Introduction
The widely adopted Transformer architecture (Vaswani et al., 2017) has obviated the need for the sequential processing of the input that is enforced in traditional Recurrent Neural Networks (RNN). As a result, compared to a single-layered LSTM or RNN model, a single-layered Transformer model is computationally more efficient, which is reflected in a relatively shorter training time (Vaswani et al., 2017). This advantage encourages the training of deep Transformer-based language models on large-scale datasets. Such models, trained on large corpora, have attained state-of-the-art (SOTA) performance in many downstream Natural Language Processing (NLP) tasks. A large number of SOTA machine learning systems, even beyond NLP (Lu et al., 2019), are inspired by the building block of the Transformer, namely multi-head self-attention (Radford et al., 2018; Devlin et al., 2018).
A model employing an attention-based mechanism generates a probability distribution $a = \{a_1, \ldots, a_n\}$ over the $n$ input units $z = \{z_1, \ldots, z_n\}$. The idea is to perform a weighted sum of the inputs, $\sum_{i=1}^{n} a_i z_i$, to produce a more context-involved output. The entries of the attention vector $a$ are commonly interpreted as scores signifying the relative importance of the input units. However, counter-intuitively, it has recently been observed that the weights generated in the model do not provide meaningful explanations (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019).
Attention weights are (structurally) identifiable if we can uniquely determine them from the output of the attention unit (Brunner et al., 2019). Identifiability of the attention weights is critical for the model's predictions to be interpretable and replicable. If the weights are not unique, explanatory insights drawn from them might be misleading.
Self-attention transforms an input sequence of vectors $z = \{z_1, \ldots, z_n\}$ into a contextualized output sequence $y = \{y_1, \ldots, y_n\}$, where $y_k = \sum_{i=1}^{n} a_{(k,i)} z_i$. The scalar $a_{(k,i)}$ captures how much the $i$-th token contributes to the contextualization of the $k$-th token. A Transformer layer consists of multiple heads, each of which performs self-attention computations. We break the head computations into two phases:
• Phase 1: Calculation of the attention weights $a_{(k,i)}$. It involves mapping input tokens to key and query vectors. The dot product of the $k$-th query vector and the $i$-th key vector gives $a_{(k,i)}$.
• Phase 2: Calculation of a contextualized representation for each token. It involves mapping input tokens to value vectors. The contextualized representation of the $k$-th token is computed as the weighted average of the value vectors, where the weight of the $i$-th token is the $a_{(k,i)}$ computed in the first phase.
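The two phases can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the projection matrices `W_q`, `W_k`, `W_v`, the toy sizes, and the scaling of the dot products by the square root of the key size are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Z, W_q, W_k, W_v):
    # Phase 1: map tokens to queries and keys, then form attention weights.
    Q, K = Z @ W_q, Z @ W_k
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)  # A[k, i] = a_(k,i)
    # Phase 2: map tokens to values and average them with the weights.
    V = Z @ W_v
    Y = A @ V
    return A, Y

rng = np.random.default_rng(0)
n, d_e, d_k, d_v = 5, 16, 4, 4  # toy sizes (hypothetical)
Z = rng.normal(size=(n, d_e))
W_q, W_k, W_v = (rng.normal(size=(d_e, d)) for d in (d_k, d_k, d_v))
A, Y = attention_head(Z, W_q, W_k, W_v)
```

Each row of `A` is a probability distribution over the input tokens, and row `k` of `Y` is the contextualized representation of the `k`-th token.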
Identifiability in the Transformer has recently been studied by Brunner et al. (2019), who provide theoretical claims that, under mild conditions on the input length, attention weights are not unique to the head's output. Their proof is dedicated to the analysis of the computations in the second phase, i.e., token contextualization, and ignores the crucial first phase where the attention weights are generated. Implicit in their analysis is the assumption that attention identifiability can be studied through the second phase alone. However, even if we find another set of weights from the second phase, whether those weights can be generated by the key-query multiplication depends on the first phase.
In this work, we probe the identifiability of attention weights in the Transformer from a perspective that was ignored in Brunner et al. (2019). We explore the previously overlooked first phase of self-attention for its contribution to identifiability. During our analysis of the first phase, we uncover the critical constraint imposed by the size of the key vector $d_k$ (the key and query vectors are expected to have the same size because of the subsequent dot-product operation). The flow of the analysis is as follows:
• We first show that the attention weights are identifiable when the input sequence length $d_s$ (the number of tokens at the input) is no longer than the size of the value vector $d_v$ (§3.1) (Brunner et al., 2019).
• For the case $d_s > d_v$, we analyse the attention weights as the raw dot product (logits) and as the softmaxed dot product (probability simplex), independently. An important theoretical finding is that both versions are prone to be unidentifiable.
• In the case of attention weights as logits (§3.2.1), we analytically construct another set of attention weights to claim unidentifiability. In the case of attention weights as softmaxed logits (§3.2.2), we find the attention identifiability to be highly dependent on $d_k$. Thus, the size of the key vector plays an important role in the identifiability of the self-attention head. This evidence suggests that the analysis in Brunner et al. (2019) ignored crucial constraints from the first phase.
To resolve the unidentifiability problem, we propose two simple solutions (§4). For the regular setting of the Transformer encoder, where $d_v$ depends on the number of attention heads and the token embedding dimension, we propose to reduce $d_k$. This may lead to more identifiable attention weights. Alternatively, as a more concrete solution, we propose to set $d_v$ equal to the token embedding dimension while adding the head outputs, as opposed to the regular approach of concatenation (Vaswani et al., 2017). The embedding dimension can be tuned according to the sequence length up to which identifiability is desired. We evaluate the performance of the proposed variants on varied text classification tasks comprising ten datasets (§5).
In this paper, our goal is to provide a concrete theoretical analysis, experimental observations, and possible simple solutions to the identifiability of attention weights in the Transformer. The idea behind the identifiable variants of the Transformer is that the harder it is to obtain alternative attention weights, the likelier they are to be identifiable, which is a desirable property of the architecture. Our contributions are as follows:
• We provide a concrete theoretical analysis of the identifiability of attention weights, which was missing in the previous work by Brunner et al. (2019).
• We provide Transformer variants that are identifiable and validate them empirically by analysing the numerical rank of the attention matrix generated in the self-attention head of the Transformer encoder. The variants have strong mathematical support and are simple to adopt in the standard Transformer settings.
• We provide empirical evaluations on varied text classification tasks that show higher identifiability does not compromise task performance.

Identifiability
A general trend in machine learning research is to mathematically model the input-output relationship from a dataset. This is carried out by quantitatively estimating the set of model parameters that best fit the data. The approach warrants a prior (to fitting) examination of the following aspects:
• The sufficiency of the informative data to estimate the model parameters, i.e., practical identifiability. Here, the limitation comes from the dataset quality or quantity and may lead to ambiguous data interpretations (Raue et al., 2009).
• The possibility that the structure of the model allows its parameters to be uniquely estimated, irrespective of the quality or quantity of the available data. This aspect is called structural identifiability. A model is said to be structurally unidentifiable if a different set of parameters yields the same outcome.
In this work, we focus on structural identifiability (Bellman and Åström, 1970). It is noteworthy that the goodness of fit of a model on the data does not dictate its structural identifiability. Similar to Brunner et al. (2019), we focus our analysis on the identifiability of attention weights, which are not model parameters, yet demand meaningful interpretation and are crucial to the stability of the representations learned by the model.

Transformer Encoder Layer
We base our analysis on the building block of the Transformer, i.e., the encoder layer (Vaswani et al., 2017). The layer has two sub-layers. The first sub-layer performs multi-head self-attention, and the second is a feed-forward network. Given a sequence of tokens $\{x_1, \ldots, x_{d_s}\}$, an embedding layer transforms it into a set of vectors $\{z_1, \ldots, z_{d_s}\} \in \mathbb{R}^{d_e}$, where $d_e$ denotes the token embedding dimension. To this set, we add vectors encoding the positional information of the tokens, $\{p_1, \ldots, p_{d_s}\} \in \mathbb{R}^{d_e}$.
Multi-head Attention. The input to a head of the multi-head self-attention module is $W \in \mathbb{R}^{d_s \times d_e}$, i.e., a sequence of $d_s$ tokens lying in a $d_e$-dimensional embedding space. Tokens are projected to $d_q$-size query, $d_k$-size key, and $d_v$-size value vectors using linear layers, resulting in the respective matrices: Query $Q \in \mathbb{R}^{d_s \times d_q}$, Key $K \in \mathbb{R}^{d_s \times d_k}$, and Value $V \in \mathbb{R}^{d_s \times d_v}$. The attention weights $A \in \mathbb{R}^{d_s \times d_s}$ can be computed by
$$A = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right). \quad (1)$$
The $(i, j)$-th element of $A$ shows how much the $i$-th token is influenced by the $j$-th token. The output of a head $H \in \mathbb{R}^{d_s \times d_e}$ is given by
$$H = A\, V\, D = A\, T, \quad (2)$$
where $D \in \mathbb{R}^{d_v \times d_e}$ is a linear layer and the matrix $T \in \mathbb{R}^{d_s \times d_e}$ denotes the product $V D$. The $\mathbb{R}^{d_s \times d_e}$ output of multi-head attention can be expressed as a summation over the $H$ obtained for each head. The $i$-th row of the multi-head output matrix corresponds to the $d_e$-dimensional contextualized representation of the $i$-th input token. In the original work, Vaswani et al. (2017), the multi-head operation is described as the concatenation of the $A V$ obtained from each head, followed by a linear transformation $D \in \mathbb{R}^{d_e \times d_e}$. Both explanations are associated with the same sequence of matrix operations, as shown in fig. 1. In the regular Transformer setting, a token vector $t_i$ is $d_e = 512$ dimensional, the number of heads is $h = 8$, and $d_k = d_q = d_v = d_e / h = 64$.
Feed-Forward Network. This sub-layer performs the following transformations on each token representation at the output of a head:
$$y_1 = \mathrm{Linear}_1(\mathrm{Norm}(t_i + \text{head output for } t_i)),$$
$$y_2 = \mathrm{Norm}(t_i + \mathrm{ReLU}(\mathrm{Linear}_2(y_1))).$$
$\mathrm{Linear}_1$ and $\mathrm{Linear}_2$ are linear layers with 2048 and 512 nodes, respectively. Norm denotes mini-batch layer normalization.
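The equivalence between concatenating the head outputs before a linear layer and summing per-head outputs, noted in the multi-head attention paragraph above, can be checked numerically. The sketch below uses random matrices in place of trained weights; the sizes and variable names are illustrative assumptions, and each per-head linear layer is taken to be the corresponding block of rows of the full matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_e, h = 6, 16, 4        # toy sizes (hypothetical)
d_v = d_e // h                # regular setting: d_v = d_e / h

# Per-head attention matrices (rows sum to 1) and value matrices.
A = rng.random(size=(h, d_s, d_s))
A /= A.sum(axis=-1, keepdims=True)
V = rng.normal(size=(h, d_s, d_v))

# Linear layer applied after concatenation, shape (h * d_v, d_e).
D_full = rng.normal(size=(h * d_v, d_e))

# View 1: concatenate A V from every head, then apply the linear layer.
concat = np.concatenate([A[i] @ V[i] for i in range(h)], axis=1)
out_concat = concat @ D_full

# View 2: give head i its own block of rows of D_full and sum the outputs.
out_sum = sum(A[i] @ V[i] @ D_full[i * d_v:(i + 1) * d_v] for i in range(h))

same = np.allclose(out_concat, out_sum)
```

Both views give the same output because a product with a block-row-partitioned matrix is the sum of the per-block products.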

Identifiability of Attention
The output of an attention head $H$ is the product of $A$ and $T$ (eq. (2)). Formally, we define identifiability of attention in a head:
Definition 3.1. For an attention head's output $H$, attention weights $A$ are identifiable if there exists a unique solution of $A\, T = H$.
The above definition can be reformulated as
Definition 3.2. $A$ is unidentifiable if there exists an $\tilde{A}$ ($\tilde{A} \neq 0$) such that $(A + \tilde{A})$ is obtainable from phase 1 of the head computations and satisfies
$$(A + \tilde{A})\, T = A\, T. \quad \text{(constraint-R1)}$$
Under this constraint, we get $\tilde{a}_i T = 0$, where $\tilde{a}_i$ is the $i$-th row of $\tilde{A}$. The set of vectors that are mapped to zero when multiplied with $T$ describes the left null space of $T$, denoted by $LN(T)$. The dimension of the left null space of $T$ can be obtained by taking the difference between the total number of rows ($d_s$) and the number of linearly independent rows, i.e., the rank of the matrix $T$, denoted by $\mathrm{rank}(T)$. Let $\dim(\cdot)$ denote the dimension of a vector space; then
$$\dim(LN(T)) = d_s - \mathrm{rank}(T). \quad (4)$$
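As a hedged illustration of the definitions above, the following sketch computes an orthonormal basis of the left null space of a low-rank $T$ with a singular value decomposition and confirms that adding rows from that space to $A$ leaves the head output unchanged. The helper `left_null_space`, its tolerance, and the toy sizes are assumptions for the example, not part of the original work.

```python
import numpy as np

def left_null_space(T, rel_tol=1e-10):
    # Rows of the result form an orthonormal basis of {v : v @ T = 0}.
    U, s, _ = np.linalg.svd(T, full_matrices=True)
    rank = int(np.sum(s > rel_tol * s[0])) if s[0] > 0 else 0
    return U[:, rank:].T

rng = np.random.default_rng(0)
d_s, d_v, d_e = 8, 3, 10                  # toy sizes with d_s > d_v
T = rng.normal(size=(d_s, d_v)) @ rng.normal(size=(d_v, d_e))  # rank <= d_v

N = left_null_space(T)                    # shape (d_s - rank(T), d_s)

A = rng.random(size=(d_s, d_s))
A /= A.sum(axis=1, keepdims=True)         # stand-in attention weights

# Every row of A_tilde is a random combination of left-null-space vectors.
A_tilde = rng.normal(size=(d_s, N.shape[0])) @ N

unchanged = np.allclose((A + A_tilde) @ T, A @ T)
```

Here `N.shape[0]` equals `d_s - rank(T)`, matching eq. (4). The sketch only addresses the second phase: it does not check whether `A + A_tilde` could be produced by the first phase.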

"A" is Identifiable for d s ≤ d v
If $\dim(LN(T)) = 0$, then $LN(T) = \{0\}$, and the only solution of constraint-R1 is $\tilde{A} = 0$. Therefore, the unidentifiability condition does not hold. We now show that this situation arises when the number of tokens is not more than the size of the value vector.
We utilize the fact that the rank of the product of two matrices $P$ and $Q$ is upper bounded by the minimum of $\mathrm{rank}(P)$ and $\mathrm{rank}(Q)$, i.e., $\mathrm{rank}(P Q) \le \min(\mathrm{rank}(P), \mathrm{rank}(Q))$. Thus, the upper bound on $\mathrm{rank}(T)$ in eq. (4) can be determined by
$$\mathrm{rank}(T) \le \min(\mathrm{rank}(V), \mathrm{rank}(D)) \le \min(\min(d_s, d_v), \min(d_v, d_e)) = \min(d_s, d_v) = \min(d_s, 64),$$
where the second-to-last step uses $d_v \le d_e$ and the last step is obtained for a head in the regular Transformer, for which $d_v = 64$.
Numerical rank. To substantiate the bounds on $\mathrm{rank}(T)$ derived above, we set up a model with a single encoder layer (§6). The model is trained to predict the sentiment of IMDB reviews (§5). We feed the review tokens to the model and store the values generated in $T$ of the first head. A standard technique for calculating the rank of a matrix with floating-point values and computations is to use singular value decomposition. The rank of the matrix is computed as the number of singular values larger than a predefined threshold. Fig. 2 illustrates how the rank changes with the sequence length $d_s$. The numerical rank provides experimental support for the theoretical analysis. Thus,
$$\dim(LN(T)) = d_s - \mathrm{rank}(T) \ge d_s - \min(d_s, d_v) = \max(d_s - d_v, 0). \quad (7)$$
For the identifiability study, since we focus on a model's capability of learning unique attention weights, we will assume $T$ has the maximum obtainable rank set by its upper bound, so that eq. (7) holds with equality.
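The numerical-rank procedure described above can be sketched as follows. Random Gaussian matrices stand in for the trained value projections and linear layer, and the absolute threshold `tol=1e-6` is an assumption chosen for this example (the threshold used in the original experiments is not stated here).

```python
import numpy as np

def numerical_rank(M, tol=1e-6):
    # Count the singular values above a fixed threshold (assumed value).
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol))

rng = np.random.default_rng(0)
d_v, d_e = 64, 512                       # regular Transformer head sizes
D = rng.normal(size=(d_v, d_e))          # stand-in for the linear layer D

ranks = {}
for d_s in (16, 64, 100, 128):
    V = rng.normal(size=(d_s, d_v))      # stand-in for the value matrix
    T = V @ D
    ranks[d_s] = numerical_rank(T)

# Dimension of the left null space for each sequence length.
null_dims = {d_s: d_s - r for d_s, r in ranks.items()}
```

With generic (random) matrices the rank follows min(d_s, d_v), so the left null space stays zero-dimensional up to d_s = 64 and grows linearly afterwards. A trained model may give lower ranks than this idealized case.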

Identifiability when $d_s > d_v$
In this case, from eq. (7), we obtain a non-zero value of $\dim(LN(T))$. It allows us to find infinitely many $\tilde{A}$'s satisfying $(A + \tilde{A})\, T = A\, T$. However, constraint-R1 demands that $\tilde{A}$ be obtainable from the first phase of self-attention. As a first step, we focus our analysis on the attention matrix without applying the softmax, i.e., on the logits. This analysis is crucial for identifying the constraints coming from the first phase of self-attention in the Transformer that impact identifiability. Insights from it will help us analyse the softmax version of $A$.

Attention Weights as Logits
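Since the logits matrix $A$ is obtained from the product of $Q$ and $K^\top$, we can assert that
$$\mathrm{rank}(A) \le \min(\mathrm{rank}(Q), \mathrm{rank}(K^\top)) \le \min(d_s, d_k) \le d_k.$$
Therefore, the rank of an attention matrix producible by the head in the first phase of self-attention can at most be equal to the size of the key vectors $d_k$. On this basis, the head can produce only those $A + \tilde{A}$ satisfying
$$\mathrm{rank}(A + \tilde{A}) \le d_k. \quad \text{(constraint-R2)}$$
Proposition 3.3. There exists a non-trivial $\tilde{A}$ that satisfies $(A + \tilde{A})\, T = A\, T$ and constraint-R2. Hence, $A$ is unidentifiable.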
Proof. Let $a_1, \ldots, a_{d_s}$ and $\tilde{a}_1, \ldots, \tilde{a}_{d_s}$ denote the rows of $A$ and $\tilde{A}$, respectively. Without loss of generality, let $a_1, \ldots, a_{d_k}$ be linearly independent rows. For all $j > d_k$, $a_j$ can be represented as a linear combination $\sum_{i=1}^{d_k} \lambda_i^j a_i$, where $\lambda_i^j$ is a scalar. Next, we independently choose the first $d_k$ rows of $\tilde{A}$, i.e., $\{\tilde{a}_1, \ldots, \tilde{a}_{d_k}\}$, from $LN(T)$. From the same set of coefficients $\lambda_i^j$ for $i \in \{1, \ldots, d_k\}$ and $j \in \{d_k + 1, \ldots, d_s\}$, we can construct the $j$-th row of $\tilde{A}$ as $\tilde{a}_j = \sum_{i=1}^{d_k} \lambda_i^j \tilde{a}_i$. Now, since the $j$-th row of $(A + \tilde{A})$ is the linear combination $\sum_{i=1}^{d_k} \lambda_i^j (a_i + \tilde{a}_i)$ of its first $d_k$ rows, the rank of $(A + \tilde{A})$ is not more than $d_k$. For a set of vectors lying in a linear space, a vector formed by their linear combination also lies in the same space. Thus, the artificially constructed rows of $\tilde{A}$ belong to $LN(T)$. Therefore, there exists an $\tilde{A}$ that establishes the proposition, which claims the unidentifiability of $A$.
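The construction in the proof can be traced numerically. The sketch below is an illustration under assumed toy sizes, with random matrices in place of trained projections: it recovers the combination coefficients with a pseudo-inverse, draws the first $d_k$ rows of $\tilde{A}$ from the left null space of $T$, and builds the remaining rows from the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_k, d_v, d_e = 8, 2, 3, 10          # toy sizes with d_s > d_v

# Logit attention matrix of rank d_k, and T = V D of rank d_v.
A = rng.normal(size=(d_s, d_k)) @ rng.normal(size=(d_k, d_s))
T = rng.normal(size=(d_s, d_v)) @ rng.normal(size=(d_v, d_e))

# Orthonormal basis of the left null space of T (rows of N).
U, s, _ = np.linalg.svd(T)
r = int(np.sum(s > 1e-10 * s[0]))
N = U[:, r:].T

# Coefficients lambda expressing rows j > d_k of A through its first d_k rows.
Lam = A[d_k:] @ np.linalg.pinv(A[:d_k])

# First d_k rows come from LN(T); the rest reuse the same coefficients.
A_tilde = np.empty_like(A)
A_tilde[:d_k] = rng.normal(size=(d_k, N.shape[0])) @ N
A_tilde[d_k:] = Lam @ A_tilde[:d_k]

rank_new = np.linalg.matrix_rank(A + A_tilde, tol=1e-8)
same_output = np.allclose((A + A_tilde) @ T, A @ T)
```

The alternative logits `A + A_tilde` differ from `A`, keep the head output unchanged, and have rank at most `d_k`, so they satisfy both constraints used in Proposition 3.3.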

Attention Weights as Softmaxed Logits
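The softmax over the attention logits generates attention weights in which each row of $A$ (i.e., each $a_i$) is constrained to be a probability distribution. Hence, we can define the constraints over $\tilde{A}$ as
(P1) $(A + \tilde{A}) \ge 0$,
(P2) $\tilde{A}\, T = 0$,
(P3) $\tilde{A}\, \mathbf{1} = 0$,
(P4) $(A + \tilde{A}) = \mathrm{softmax}(A_l)$ for a logit matrix $A_l$ with $\mathrm{rank}(A_l) \le d_k$.
P1 is the non-negativity constraint on $(A + \tilde{A})$, as it is supposed to be the output of a softmax; P2 denotes that the rows of $\tilde{A}$ lie in $LN(T)$; P3 can be derived from the fact that $(A + \tilde{A})\mathbf{1} = \mathbf{1} \implies A\mathbf{1} + \tilde{A}\mathbf{1} = \mathbf{1} \implies \tilde{A}\mathbf{1} = 0$ (as $A\mathbf{1} = \mathbf{1}$), where $\mathbf{1} \in \mathbb{R}^{d_s}$ is the vector of ones. The constraints P2 and P3 can be combined and reformulated as $\tilde{A}\,[T, \mathbf{1}] = 0$. Following an analysis similar to eq. (7), and assuming $[T, \mathbf{1}]$ attains its maximum rank, we obtain
$$\dim(LN([T, \mathbf{1}])) = \max(d_s - (d_v + 1), 0).$$
The constraint P4 checks whether there exists a logit matrix $A_l$ that can generate $(A + \tilde{A})$, given that constraints P1, P2, and P3 are satisfied. The existence of such an $A_l$ would provide sufficient evidence that $A$ is unidentifiable. Next, we investigate how the existence of $\tilde{A}$ is impacted by the size of the key vector $d_k$ (the query and key vector sizes are the same, i.e., $d_q = d_k$). Let $(A + \tilde{A})(i, k)$ denote the $(i, k)$-th element of the matrix. We can retrieve the set of matrices $A_l$ such that $\mathrm{softmax}(A_l) = A + \tilde{A}$, where
$$A_l(i, k) = c_i + \log\big((A + \tilde{A})(i, k)\big) \quad (9)$$
for some arbitrary $c_i \in \mathbb{R}$; log denotes the natural logarithm. As shown in fig. 3, the column vectors of $A_l$ can be written as $c + \hat{a}_1, \ldots, c + \hat{a}_{d_s}$, where $c = (c_1, \ldots, c_{d_s})^\top$ and $\hat{a}_k$ is the $k$-th column of the element-wise logarithm of $(A + \tilde{A})$.
For an arbitrarily picked $\tilde{A}$ satisfying constraints P1, P2, and P3, the dimension of the affine span $S$ of $\{\hat{a}_1, \ldots, \hat{a}_{d_s}\}$ could be as high as $d_s - 1$ (fig. 4). In such cases, the best one could do is to choose a $c_a \in S$ such that the dimension of the linear span of $\{\hat{a}_1 - c_a, \ldots, \hat{a}_{d_s} - c_a\}$, i.e., $\mathrm{rank}(A_l)$, is $d_s - 1$. Hence, to satisfy P4, $d_s - 1 \le d_k \implies d_s \le d_k + 1$. Thus, the set of $(A + \tilde{A})$ satisfying constraints P1, P2, and P3 is not always obtainable from the attention head for $d_s > d_k + 1$. We postulate that, although it is easy to construct an $\tilde{A}$ satisfying constraints P1, P2, and P3, it is hard to construct an $\tilde{A}$ that also satisfies the constraint P4 on the rank of the logit matrix $A_l$. Therefore, $A$ becomes more identifiable as the size of the key vector decreases.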
Figure 4: A simplified illustration for the case $d_s = 3$. The affine space (translated linear subspace) is spanned by the vectors $\hat{a}_1$, $\hat{a}_2$, and $\hat{a}_3$; $c_a$ can be any vector in the affine space. By putting $c = -c_a$, we obtain a linear subspace whose dimension is equal to the dimension of the affine subspace.
Experimental evidence. We conduct an experiment to validate the minimum possible numerical rank of $A_l$ by constructing $\tilde{A}$. For $\tilde{A}$ to be obtainable from phase 1, the minimum possible rank of $A_l$ should not be higher than $d_k$. First, we train a Transformer encoder-based IMDB review sentiment classifier (§6). From the IMDB dataset (§5), we randomly sample a set of reviews with token sequence length $d_s$ ranging from 66 to 128. For each review, we construct 1000 $\tilde{A}$'s satisfying constraints P1, P2, and P3 as follows. We obtain an orthonormal basis for the left null space of $[T, \mathbf{1}]$ using singular value decomposition. To form an $\tilde{A}$, we generate $d_s$ random linear combinations of the basis vectors (one for each of its rows). Each set of linear combination coefficients is sampled uniformly from $[-10, 10]$. All the rows are then scaled to satisfy the constraint P1, as mentioned in Brunner et al. (2019). Using eq. (9), we obtain minimum-rank matrices $A_l$ by putting $c = -\hat{a}_1$. Figure 5 depicts the obtained numerical rank of $A_l$. We observed that all the $A_l$ obtained from $(A + \tilde{A})$ (using eq. (9)) are full-row rank matrices. However, from the first phase of self-attention, the maximum obtainable rank of $A_l$ is $d_k = 64$. Thus, the experimentally constructed $A_l$'s do not establish the unidentifiability of $A$, as they fail to satisfy the constraint P4, whereas under Brunner et al. (2019) they fall in the solution set used to prove unidentifiability, as they meet constraints P1, P2, and P3.
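A small-scale version of this experiment can be sketched with random matrices in place of a trained classifier. The toy sizes are assumptions, and the row-scaling rule (shrinking each row half-way to the boundary of the simplex) is a simple stand-in rather than the exact scaling of Brunner et al. (2019).

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
d_s, d_k, d_v, d_e = 12, 4, 4, 16         # toy sizes (hypothetical)

# Attention weights from rank-d_k logits, and T = V D of rank d_v.
A = softmax(rng.normal(size=(d_s, d_k)) @ rng.normal(size=(d_k, d_s)))
T = rng.normal(size=(d_s, d_v)) @ rng.normal(size=(d_v, d_e))

# Orthonormal basis of the left null space of [T, 1] via SVD (P2 and P3).
M = np.hstack([T, np.ones((d_s, 1))])
U, s, _ = np.linalg.svd(M)
r = int(np.sum(s > 1e-10 * s[0]))
N = U[:, r:].T

# One random combination per row, coefficients uniform in [-10, 10].
A_tilde = rng.uniform(-10, 10, size=(d_s, N.shape[0])) @ N

# Shrink each row so that A + A_tilde stays strictly positive (P1).
ratio = np.where(A_tilde < 0, A / np.abs(A_tilde), np.inf)
A_tilde *= 0.5 * ratio.min(axis=1, keepdims=True)
A_new = A + A_tilde

# Logits A_l(i, k) = c_i + log(A_new)(i, k) with c = -(first column of the log).
L = np.log(A_new)
A_l = L - L[:, [0]]
rank_A_l = np.linalg.matrix_rank(A_l, tol=1e-8)
```

In this toy run the alternative weights `A_new` satisfy P1 to P3, yet the rank of `A_l` exceeds `d_k`, so they could not be produced by a key-query product of size `d_k`.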

Solutions to Identifiability
Based on the identifiability analysis in §3, we propose basic solutions to make the Transformer's attention weights identifiable.
Decoupling $d_k$. Contrary to the regular Transformer setting, where $d_k = d_v$, a simple approach is to decrease the value of $d_k$, i.e., the size of the key and query vectors. It reduces the possible solutions for $\tilde{A}$ by putting harder constraints on the rank of the attention logits, i.e., $A_l$ in eq. (9). However, theoretically, $d_k$ decides the upper bound on the dimension of the space to which token embeddings are projected before the dot product. The higher the upper bound, the more freedom there is to choose the subspace dimension compared to the lower-$d_k$ variants. Thus, there is a plausible trade-off between $d_k$-induced identifiability and the upper bound on the dimension of the projected space.
Head Addition. To resolve the unidentifiability issue when the sequence length exceeds the size of the value vector, we propose to keep the value vector size and the token embedding dimension greater than (or equal to) the maximum allowed number of input tokens, i.e., $d_v \ge d_{s\text{-max}}$. In Vaswani et al. (2017), $d_v$ was bound to be equal to $d_e / h$, where $d_e$ is the token embedding dimension and $h$ is the number of heads. This constraint on $d_v$ arises from the concatenation of $h$ self-attention heads to produce a $d_e$-sized output at the first sub-layer of the encoder. Thus, to decouple $d_v$ from this constraint, we keep $d_v = d_e$ and add each head's output (footnote 8).
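The effect of the head-addition setting on the left null space of $T$ can be illustrated with a toy comparison. Random, untrained projections are used, and the helper `head_T` and the sizes are assumptions for this sketch.

```python
import numpy as np

def head_T(Z, d_v, rng):
    # T = V D for one head with random (untrained) value and output layers.
    d_e = Z.shape[1]
    W_v = rng.normal(size=(d_e, d_v))
    D = rng.normal(size=(d_v, d_e))
    return Z @ W_v @ D

rng = np.random.default_rng(0)
d_s, d_e, h = 20, 32, 4                    # toy sizes with d_e / h < d_s <= d_e
Z = rng.normal(size=(d_s, d_e))            # stand-in token representations

T_con = head_T(Z, d_e // h, rng)           # regular setting: d_v = d_e / h
T_add = head_T(Z, d_e, rng)                # head addition:   d_v = d_e

null_dim_con = d_s - np.linalg.matrix_rank(T_con, tol=1e-8)
null_dim_add = d_s - np.linalg.matrix_rank(T_add, tol=1e-8)
```

With these sizes the regular setting leaves a 12-dimensional left null space, whereas the head-addition setting leaves none, so no non-zero $\tilde{A}$ can satisfy $\tilde{A}\,T = 0$ there. Because every head then already outputs $d_e$-sized vectors, the head outputs are summed rather than concatenated.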

Classification Tasks
For the empirical analysis of our proposed solutions as mentioned in §4, we conduct our experiments on the following varied text classification tasks:

Small Scale Datasets
IMDB (Maas et al., 2011). The dataset for the task of sentiment classification consists of IMDB movie reviews labeled with positive or negative sentiment. Each of the train and test sets contains 25,000 data samples equally distributed over both sentiment polarities.
TREC (Voorhees and Tice, 2000). We use the 6-class version of the dataset for the task of question classification, consisting of open-domain, facet-based questions. There are 5,452 and 500 samples for training and testing, respectively.
SST (Socher et al., 2013). The Stanford sentiment analysis dataset consists of 11,855 sentences obtained from movie reviews. We use the 3-class version of the dataset for the task of sentiment classification. Each review is labeled as positive, neutral, or negative. The provided train/test/valid split is 8,544/2,210/1,101.
Footnote 8: $d_{s\text{-max}} < d_e$, as in the regular Transformer setting.

Large Scale Datasets
SNLI (Bowman et al., 2015). The dataset contains 549,367 samples in the training set, 9,842 samples in the validation set, and 9,824 samples in the test set. For the task of recognizing textual entailment, each sample consists of a premise-hypothesis sentence pair and a label indicating whether the premise entails the hypothesis, contradicts it, or is neutral to it.
Please refer to Zhang et al. (2015) for more details about the following datasets:
Yelp. We use the large-scale Yelp review dataset for the task of binary sentiment classification. There are 560,000 samples for training and 38,000 samples for testing, equally split into positive and negative polarities.
DBPedia. The Ontology dataset for topic classification consists of 14 non-overlapping classes, each with 40,000 samples for training and 5,000 samples for testing.
Sogou News. The dataset for news article classification consists of 450,000 samples for training and 60,000 for testing. Each article is labeled with one of 5 news categories. The dataset is perfectly balanced.
AG News. The dataset for news article classification is partitioned into four categories. The balanced train and test sets consist of 120,000 and 7,600 samples, respectively.
Yahoo! Answers. The balanced dataset for 10-class topic classification contains 1,400,000 samples for training and 50,000 samples for testing.
Amazon Reviews. For the task of sentiment classification, the dataset contains 3,600,000 samples for training and 400,000 samples for testing. The samples are equally divided into positive and negative sentiment labels.
Except for SST and SNLI, where the validation split is already provided, we hold out 30% of the train set as the validation set and use the remaining 70% for model parameter learning.

Experimental Setup
Setting up the encoder. We normalize the text by lower casing, removing special characters, etc. For each task, we construct a separate 1-gram vocabulary ($U$) and initialize a trainable token embedding ($U \times d_e$) randomly sampled from $\mathcal{N}(0, 1)$. Similarly, we randomly initialize a ($d_{s\text{-max}} \times d_e$) positional embedding.
The encoder (§2.2) takes as input a sequence of token vectors ($d_s \times d_e$) with added positional vectors. The input is then projected to key and query vectors of size $d_k \in \{1, 2, 4, 8, 16, 32, 64, 128, 256\}$. For the regular Transformer setting, we fix the number of heads $h$ to 8 and the size of the value vector to $d_v = d_e / h$, that is, 64. For each token at the input, the outputs of the attention heads are concatenated to generate a $d_e$-sized vector. For the identifiable variant of the Transformer encoder, $d_v = d_e = 512$; this is equal to $d_{s\text{-max}}$, which keeps the encoder identifiable up to the maximum permissible number of tokens. The outputs of all the heads are then added. Each token's contextualized representation (the added head outputs) is then passed through the feed-forward network (§2.2). For classification, we use the encoder layer's output for the first token and pass it through a linear classification layer. In datasets with more than two classes, the classifier output is softmaxed. In the case of SNLI, we use a shared encoder for both premise and hypothesis; the outputs for their first tokens are then concatenated just before the final classification layer. We use the Adam optimizer, with a learning rate of 0.001, to minimize the cross-entropy loss between the target and predicted labels. For all the experiments, we keep the batch size at 256 and train for 20 epochs. We report the test accuracy obtained at the epoch with the best validation accuracy.
Numerical rank. To generate the numerical rank plot on the IMDB dataset shown in fig. 2, we train a separate Transformer encoder-based classifier. For a particular $d_s$ value, we sample 100 reviews from the dataset with token length $\ge d_s$ and clip each review to the maximum length $d_s$. The clipping ensures that the number of tokens is $d_s$ before the review is fed to the encoder. The numerical rank is calculated for the $T$'s obtained from the first head of the encoder.

Results and Discussion
For the identifiable variant, similar to §3.1, we plot the numerical rank of $T$ against the input sequence length, as shown in fig. 6. Unlike fig. 2, where $\dim(LN(T))$ increases linearly after $d_s = 64$, we find that the dimension remains zero up to a much larger $d_s$ (∼380). The zero-dimensional (left) null space of $T$ confirms that there exists no non-trivial solution to constraint-R1, i.e., $\tilde{A} = 0$. Thus, the attention weights $A$ are identifiable for a larger range of input sequence lengths. It is important that the identifiability of attention weights does not come at the cost of reduced model performance. To investigate this issue, we compare the performance of the identifiable Transformer encoder against its regular settings (§6) on varied text classification tasks.
For the regular setting, as discussed in §4 as one of the solutions, the Transformer can be made more identifiable by decreasing the size of the key vector $d_k$. The rows of Table 1 corresponding to Con denote the regular Transformer setting with varying size of the key vector. We observe that the classification accuracy at lower $d_k$ is comparable to or higher than that at large $d_k$ values; thus, the enhanced identifiability does not compromise the model's classification accuracy. However, we notice a general performance decline with an increase in the size of the key vector. We speculate that for simple classification tasks, a lower-dimensional projection for the key and query vectors works well. However, as the task becomes more involved, a higher dimension for the projected subspace could be essential. Nonetheless, as we do not have strong theoretical findings, we leave this observation for future work.
Another solution to identifiability is to increase $d_v$ to $d_e$ and add the heads' outputs. This setting corresponds to the Add rows in Table 1, to be compared with the regular settings. For $d_k \ge 8$, as a general observation, we find that the performance of Add does not drop as drastically as that of Con with an increase in $d_k$. This could be because the larger value vector gives Add more parameters, which compensate for the otherwise significant reduction in the model's accuracy.
On the large-scale datasets, we observe that Add performs slightly better than Con. Intuitively, as shown in fig. 1, we can increase the size of the value vector to increase the dimension of the space onto which each token is projected. A higher-dimensional subspace can contain more semantic information for performing the specific task.
Even though the theoretical analysis shows the possibility of a full-row rank $T$ and identifiable attention weights, the rows of the $T$ obtained from a trained model might not all be linearly independent as $d_s$ increases. We can explain this by the semantic similarities between words co-occurring together (Harris, 1954). The similarity is captured as a semantic relationship, such as the dot product, between vectors in a linear space. As the number of tokens in a sentence, i.e., $d_s$, increases, it becomes more likely that a token vector can be obtained from a linear combination of other tokens.

Conclusion
This work probed the Transformer for identifiability of self-attention, i.e., whether the attention weights can be uniquely identified from the head's output. With theoretical analysis and supporting empirical evidence, we were able to identify the limitations of the existing study by Brunner et al. (2019). We found that the study largely ignored the constraint coming from the first phase of self-attention in the encoder, i.e., the size of the key vector. We then showed how $d_k$ can be used to make the attention weights more identifiable. To give a more concrete solution, we proposed encoder variants that are more identifiable, theoretically as well as experimentally, for a large range of input sequence lengths. The identifiable variants do not show any performance drop in experiments on varied text classification tasks. Future work may analyse the impact of identifiability on the explainability and interpretability of the Transformer.

A Background on Matrices
A.1 Span, Column space and Row space
Given a set of vectors $\mathcal{V} := \{v_1, v_2, \ldots, v_n\}$, the span of $\mathcal{V}$, $\mathrm{span}(\mathcal{V})$, is defined as the set obtained from all possible linear combinations of the vectors in $\mathcal{V}$, i.e.,
$$\mathrm{span}(\mathcal{V}) := \Big\{ \sum_{i=1}^{n} \lambda_i v_i \;\Big|\; \lambda_i \in \mathbb{R} \Big\}.$$
The $\mathrm{span}(\mathcal{V})$ can also be seen as the smallest vector space that contains the set $\mathcal{V}$.
Given a matrix $A \in \mathbb{R}^{m \times n}$, the column space of $A$, $Cs(A)$, is defined as the space spanned by its column vectors. Similarly, the row space of $A$, $Rs(A)$, is the space spanned by the row vectors of $A$. $Cs(A)$ and $Rs(A)$ are subspaces of the real spaces $\mathbb{R}^m$ and $\mathbb{R}^n$, respectively. If the row vectors of $A$ are linearly independent, then $\mathrm{rank}(A) = m$ and $Cs(A)$ spans $\mathbb{R}^m$. Similarly, if the column vectors of $A$ are linearly independent, $Rs(A)$ spans $\mathbb{R}^n$.

A.2 Matrix Rank
The rank of a matrix $P$ (denoted as $\mathrm{rank}(P)$) gives the dimension of the space spanned by its row vectors or column vectors. It can also be seen as the number of linearly independent rows or columns. The following properties hold:
$$\mathrm{rank}(P) \le \min(m_p, n_p),$$
$$\mathrm{rank}(P Q) \le \min(\mathrm{rank}(P), \mathrm{rank}(Q)),$$
where $P$ and $Q$ are $m_p \times n_p$ and $m_q \times n_q$ dimensional matrices, respectively.

A.3 Null Space
The left null space of an $m_p \times n_p$ matrix $P$ can be defined as the set of vectors $v$ such that
$$LN(P) = \{ v \in \mathbb{R}^{m_p} \mid v^\top P = 0 \}. \quad (10)$$
If the rows of $P$ are linearly independent ($P$ is full-row rank), the left null space of $P$ is zero-dimensional. The only solution to the system of equations $v^\top P = 0$ is then trivial, i.e., $v = 0$. The dimension of the left null space of $P$, known as its nullity, can be calculated as
$$\dim(LN(P)) = m_p - \mathrm{rank}(P).$$
The nullity of $P$ sets the dimension of the space in which $v$ lies. In §3, we utilize our knowledge of appendix A.2 and appendix A.3 to analyse identifiability in a Transformer.