Overcoming a Theoretical Limitation of Self-Attention

Although transformers are remarkably effective for many tasks, there are some surprisingly easy-looking regular languages that they struggle with. Hahn shows that for languages where acceptance depends on a single input symbol, a transformer’s classification decisions get closer and closer to random guessing (that is, a cross-entropy of 1) as input strings get longer and longer. We examine this limitation using two languages: PARITY, the language of bit strings with an odd number of 1s, and FIRST, the language of bit strings starting with a 1. We demonstrate three ways of overcoming the limitation implied by Hahn’s lemma. First, we settle an open question by constructing a transformer that recognizes PARITY with perfect accuracy, and similarly for FIRST. Second, we use layer normalization to bring the cross-entropy of both models arbitrarily close to zero. Third, when transformers need to focus on a single position, as for FIRST, we find that they can fail to generalize to longer strings; we offer a simple remedy to this problem that also improves length generalization in machine translation.
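For concreteness, membership in the two languages can be sketched directly; this is a minimal illustration (the function names are ours, not from the paper):

```python
def parity(s: str) -> bool:
    """PARITY: bit strings with an odd number of 1s."""
    return s.count("1") % 2 == 1

def first(s: str) -> bool:
    """FIRST: bit strings whose first symbol is 1."""
    return s.startswith("1")

# e.g. parity("1101") is True (three 1s); first("0110") is False
```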


The second layer adds up the odd elements and

The word and position embeddings are:

Since we are numbering positions starting from 0, dimension 4 ranges from 0 to 1 − 1/n, and dimension 5 is +1 for even positions and −1 for odd positions.

We argue that dimension 5, being a cosine wave, is a fairly standard choice, although its period (2) is shorter than the shortest period in standard sinusoidal encodings (2π). Dimension 4 is admittedly not standard; however, we argue that it is a reasonable encoding, and extremely easy to compute.
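A quick numeric sketch of these two coordinates, assuming (as described above) that dimension 4 is i/n and dimension 5 is cos(iπ):

```python
import math

def pos_dims(i: int, n: int):
    """Dimensions 4 and 5 of the position encoding at position i (0-based)."""
    dim4 = i / n                   # increases linearly from 0 to 1 - 1/n
    dim5 = math.cos(i * math.pi)   # +1 at even positions, -1 at odd (period 2)
    return dim4, dim5
```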

Thus, the encoding of word w_i is:

The first self-attention layer has two heads. Because attention always averages, the first head must compute the "average" number of 1s, that is, k/n, and store it in dimension 6. It also stores 1 in dimension 7, which we will need later. The second head doesn't do anything (W^{1,2,V} = 0; the queries and keys can be anything). After the residual connection, we have:
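Because uniform attention (all logits equal) averages its values, a head whose value is the indicator I[w_j = 1] outputs k/n, where k is the number of 1s and n the string length. A minimal numpy sketch of this averaging effect (not the paper's exact weight matrices):

```python
import numpy as np

def uniform_head_average(bits: str) -> float:
    """One attention head with all logits equal: the softmax is uniform,
    so the head outputs the mean of its values, here I[w_j = 1]."""
    values = np.array([1.0 if b == "1" else 0.0 for b in bits])
    logits = np.zeros(len(bits))                     # all scores equal
    weights = np.exp(logits) / np.exp(logits).sum()  # uniform: 1/n each
    return float(weights @ values)                   # = k/n
```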

The first FFNN has two layers. The first is:

This gives:

The second layer linearly combines these three values to get I[i = k], as desired.

After the residual connection, we have:

The second self-attention layer tests whether position k is even or odd. It does this using two heads, one which attends more strongly to the odd positions, and one which attends more strongly to the even positions; both average dimension 8. The attention logits are scaled by c > 0, which can be any constant. If n is even, the difference between the two heads is positive if k is odd and negative if k is even. As predicted by Hahn, it is in O(1/n). If n is odd, the expression is somewhat more complicated, but it is still positive iff k is odd, and it is still in O(1/n).
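The effect of the two heads can be checked numerically. This sketch assumes dimension 8 is 1 at position k and 0 elsewhere, and uses attention logits of ±c·cos(jπ) to favor odd or even key positions (the variable names are ours); the difference between the two heads' averages is positive iff k is odd, with magnitude shrinking like 1/n:

```python
import numpy as np

def head_diff(k: int, n: int, c: float = 1.0) -> float:
    """Difference between a head favoring odd positions and one favoring
    even positions, both averaging an indicator of position k."""
    j = np.arange(n)
    value = (j == k).astype(float)           # stand-in for dimension 8
    odd_w = np.exp(-c * np.cos(j * np.pi))   # larger weight on odd j
    even_w = np.exp(c * np.cos(j * np.pi))   # larger weight on even j
    odd_avg = (odd_w / odd_w.sum()) @ value
    even_avg = (even_w / even_w.sum()) @ value
    return float(odd_avg - even_avg)         # sign reveals the parity of k
```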

Finally, the output layer is a sigmoid layer that just looks at dimension 9:

So the output is greater than 1/2 iff k is odd.
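Because the dimension-9 logit is in O(1/n), the sigmoid output drifts toward 1/2 on longer strings, and the cross-entropy (in bits) of a correct prediction approaches 1, which is Hahn's limitation in action. A small numeric illustration (the 1/n logit values below are ours, chosen for illustration):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_bits(logit: float) -> float:
    """Cross-entropy, in bits, of a correct prediction with this logit."""
    return -math.log2(sigmoid(logit))

# as n grows, a logit of order 1/n drives the cross-entropy toward 1 bit
ce_short, ce_long = cross_entropy_bits(1 / 10), cross_entropy_bits(1 / 10000)
```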

(We have chosen W^{2,1} in a slightly unusual way in order to avoid using the bias term, in anticipation of §3, when we will add layer normalization.)

The second self-attention layer has a single head, which makes CLS focus on position 1. The final output layer just selects component 6:

So the output probability is greater than 1/2 iff w_1 = 1. However, it will get closer to 1/2 as n increases.
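The degradation with length can be seen directly: if CLS puts some fixed logit c on position 1 and 0 on every other position, the softmax weight on position 1 is e^c / (e^c + n − 1), which tends to 0 as n grows, pulling the output probability toward 1/2. A sketch with an assumed constant c:

```python
import math

def attention_on_position_1(n: int, c: float = 5.0) -> float:
    """Softmax weight CLS gives position 1 when its logit is c and
    each of the other n - 1 positions scores 0."""
    return math.exp(c) / (math.exp(c) + (n - 1))

# the weight decays toward 0 as the string length n grows
```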

We implemented both of the above constructions

Likewise for the self-attention parameters:

Likewise for the position-wise FFNN parameters:

Then each layer of activations is (before layer normalization):

We add a new layer whose self-attention doesn't do anything (W^{ℓ,h,V} = 0) and whose FFNN is:

This causes the residual connection to zero out all dimensions except two, so that if s was the original output logit, the output of this new layer (before layer normalization) is:

Now, if ε = 0, layer normalization scales this vector to have unit variance exactly, so it becomes:

The new output layer simply selects the first dimension.
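The point of zeroing out all dimensions except two is that layer normalization with ε = 0 then rescales the surviving logit to a fixed magnitude, no matter how small it was: for a vector (s, −s, 0, …, 0) of width d, the mean is 0 and the normalized first coordinate is ±sqrt(d/2), depending only on the sign of s. A sketch of that computation (our own minimal layer-norm, without learned gain or bias):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 0.0) -> np.ndarray:
    """Layer normalization: subtract the mean, divide by the std (plus eps)."""
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return (x - mu) / np.sqrt(var + eps)

d = 10
for s in (1e-6, 0.5):            # even a tiny logit...
    v = np.zeros(d)
    v[0], v[1] = s, -s           # (s, -s, 0, ..., 0)
    y = layer_norm(v)            # ...is rescaled to magnitude sqrt(d/2)
```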

We tested our exact solutions, modified as described above to use layer normalization. Figure 2 shows that layer normalization with ε > 0 improves the cross-entropy, but it still grows with n and approaches 1. With ε = 0, the cross-entropy is independent of n and, as argued above (Proposition 2), can be made as low as desired.

In this section, we turn to the question of learnability, which will lead to a third way of overcoming the limitation.

Figure 3: The cross-entropy and accuracy of our solution to PARITY are both extremely sensitive to the parameter [W^{1,1,V}]_{6,2}, which is responsible for computing k/n.

We tried training transformers on both PARITY and FIRST. We used transformers with the same num-

With layer normalization, we can apply the mod-

For all training lengths n, the model is able to learn with perfect test cross-entropy and accuracy.

We see a similar effect on low-resource English-to-Vietnamese machine translation (Table 1). Even when the training and test length distributions are the same, scaling