On Orthogonality Constraints for Transformers

Orthogonality constraints encourage matrices to be orthogonal for numerical stability. These plug-and-play constraints, which can be conveniently incorporated into model training, have been studied for popular architectures in natural language processing, such as convolutional neural networks and recurrent neural networks. However, a dedicated study on such constraints for transformers has been absent. To fill this gap, this paper studies orthogonality constraints for transformers, demonstrating their effectiveness with empirical evidence from ten machine translation tasks and two dialogue generation tasks. For example, on the large-scale WMT'16 En→De benchmark, simply plugging-and-playing orthogonality constraints on the original transformer model (Vaswani et al., 2017) increases the BLEU from 28.4 to 29.6, coming close to the 29.7 BLEU achieved by the very competitive dynamic convolution (Wu et al., 2019).


Introduction
Transformers (Vaswani et al., 2017) are a class of neural architectures that have made a tremendous transformative impact on modern natural language processing research and applications. Transformers have not only served as a powerful inductive bias for general-purpose sequence transduction (Ott et al., 2018) but also formed the core of large pretrained language models (Devlin et al., 2018; Radford et al., 2018; Dai et al., 2019). That said, the study of more effective training for this class of models is still an open research question, bearing great potential to impact a myriad of applications and domains.
To improve numerical stability during training, the trick of enforcing orthogonality constraints has surfaced recently. In the analysis of numerical stability, enforcing orthogonality constraints can upper-bound the Lipschitz constant of linear transformations. The Lipschitz constant is a measure that approximates the rate of change (variation) of representations. Theoretically, controlling the Lipschitz constant, which may be achieved via orthogonality constraints, yields representations that are robust and less sensitive to perturbations. In view of this, orthogonality constraints have been studied for convolutional neural networks (CNNs) (Bansal et al., 2018; Huang et al., 2018) and recurrent neural networks (RNNs) (Arjovsky et al., 2016; Vorontsov et al., 2017; Rodríguez et al., 2016). Such plug-and-play constraints can be incorporated into model training without additional hassle. For example, CNN-based models incorporating orthogonality constraints have demonstrated empirical effectiveness for tasks such as person re-identification (Han et al., 2019) and keyword spotting (Lee et al., 2019), while RNN-based models that enforce such constraints have shown promising empirical results for response generation (Tao et al., 2018) and text classification (Wei et al., 2020; Krishnan et al., 2020). However, a dedicated study on orthogonality constraints for transformers has been absent so far.

* Equal contribution.
† Work was done at NTU.
To fill this research gap, we study orthogonality constraints for transformers, which are imposed on (i) linear transformations in self-attention and position-wise feed-forward networks and (ii) the affinity matrix in self-attention. Mathematically, orthogonality constraints on the weights of these linear transformations can be motivated by bounded Lipschitz constants. We also formally analyze the self-attention mechanism by bounding perturbations to the affinity matrix in the face of input changes.
Furthermore, we conduct extensive experiments on ten neural machine translation (both subword-level and character-level) tasks and two dialogue generation tasks. Our experimental results are promising, demonstrating that the performance of transformers can be consistently boosted with orthogonality constraints. For example, on the large-scale WMT'16 En→De benchmark, simply plugging-and-playing orthogonality constraints on the original transformer model (Vaswani et al., 2017) increases the BLEU from 28.4 to 29.6, coming close to the 29.7 BLEU achieved by the very competitive dynamic convolution (Wu et al., 2019).
Notation For any vector $x$ and any matrix $X$, $\|x\|$ and $\|X\|$ denote their $L_2$-norm and spectral norm, respectively.

Orthogonality Constraints for Transformers
Recall that in the transformer architecture, keys, queries, and values all come from the same place in the self-attention module. They are linearly transformed for computing multiple attention heads, where all the heads are aggregated by another linear transformation. The position-wise feed-forward network is also built on two linear transformations with activations. In the following, we will describe orthogonality constraints for (i) linear transformations in self-attention and position-wise feed-forward networks and (ii) the affinity matrix in self-attention.

For Linear Transformations in Self-Attention and Position-wise Feed-Forward Networks
Note that linear transformations in self-attention and position-wise feed-forward networks are in the form $y = Wx + b$, where $y$ is the output, $x$ is an input, $W$ is a linear transformation weight matrix, and $b$ is an optional bias term. This form provides us with convenient tools for motivating the application of orthogonality constraints to the weights of such linear transformations. Specifically, as described in Section 1, robustness of linear transformations to small perturbations can be measured by Lipschitz constants. Thus, we begin with motivating orthogonality constraints from the perspective of bounding Lipschitz constants of linear transformations.
Formally, the linear transformation (layer) of the aforementioned form $y = Wx + b$ has a Lipschitz constant equal to the largest singular value of the weight matrix $W$. The linear layer is Lipschitz continuous with the constant $L$ if for all $x$ and $x'$, it holds that
$$\|(Wx + b) - (Wx' + b)\| \le L \|x - x'\|,$$
which can be re-written as
$$\|W(x - x')\| \le L \|x - x'\|.$$
For numerical stability, our goal is to force the Lipschitz constant to be no greater than one at every linear transformation so that their multiplication throughout compositions of transformations is also upper bounded by one. Mathematically, we need to constrain the Lipschitz constant (the largest singular value) of $W$ to be no greater than one, which the following orthogonality constraint guarantees:
$$W^\top W = I.$$
Back in the context of multi-head self-attention of transformers, denote by $P$ the concatenation of the linear transformation weights for the query, key, value, and the multi-head aggregation. To impose the orthogonality constraint for these linear transformations, we add the following loss to the transformer model for every layer:
$$\lambda \, \|P^\top P - I\|_F^2.$$
Likewise, for the position-wise feed-forward network with two linear transformation weight matrices $M_1$ and $M_2$, the orthogonality constraint can be imposed with another additional loss:
$$\lambda \left( \|M_1^\top M_1 - I\|_F^2 + \|M_2^\top M_2 - I\|_F^2 \right).$$
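As a concrete sketch, the orthogonality loss on a linear-transformation weight matrix can be written in a few lines of NumPy; this is an illustrative version assuming the common soft-orthogonality form $\lambda \|W^\top W - I\|_F^2$ (the function name `weight_orthogonality_loss` and the default `lam` are ours, not the paper's implementation):

```python
import numpy as np

def weight_orthogonality_loss(W, lam=1e-8):
    """Soft orthogonality penalty lam * ||W^T W - I||_F^2.

    Driving this to zero pushes the singular values of W toward 1,
    which caps the Lipschitz constant of x -> Wx + b at one.
    """
    k = W.shape[1]
    gram = W.T @ W                      # k x k Gram matrix of the columns
    return lam * np.sum((gram - np.eye(k)) ** 2)
```

For an orthogonal matrix the penalty is exactly zero; for a matrix with singular values away from one it grows quadratically, so it acts as a soft constraint added to the task loss.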

For the Affinity Matrix in Self-Attention
In transformers, given the query matrix $Q$ and the key matrix $K$ in the self-attention module, the affinity matrix is computed as
$$A = \mathrm{softmax}(\alpha Q K^\top), \quad (1)$$
where $\alpha$ is typically $\frac{1}{\sqrt{d}}$ ($d$ is the dimension of the key and the query). Given the value matrix $V$, the self-attention computes representations via the matrix multiplication $AV$.
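For reference, the affinity matrix in (1) can be sketched in NumPy (the helper names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def affinity_matrix(Q, K):
    """A = softmax(alpha * Q K^T) with alpha = 1 / sqrt(d)."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d))
```

Each row of the resulting matrix is a probability distribution over the tokens in the sequence, i.e., the rows are non-negative and sum to one.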
Within the context of sequence transduction, when an input word token is aligned with another semantically similar token, we would expect a small change in the behavior of the self-attention mechanism, rather than a huge change in the output. In the affinity matrix A as defined in (1), let A i, * be the row vector indexed by i. Essentially, each A i, * is a probability distribution over the tokens in the sequence that directs the alignment-based pooling operation. Intuitively, for a robust self-attention mechanism, noisy perturbations should have a limited effect on the affinity scores of the tokens.
More formally, let us analyze the self-attention mechanism by bounding perturbations to the affinity matrix in the face of input changes. Mathematically, changes to the affinity scores are bounded, as the following theorem shows.

Theorem 2.1 (Bounded Perturbations to the Affinity Matrix). Let $A_{i,*}$ be the $i$-th row of the affinity matrix $A$ as defined in (1) and $Q_{i,*}$ be the $i$-th row of the query matrix $Q$. Then the perturbation to the affinity matrix is bounded as such:
$$\|A_{i,*} - A'_{i,*}\| \le \alpha \, \|K\| \, \|Q_{i,*} - Q'_{i,*}\|,$$
where $Q'$ denotes the perturbed query matrix and $A'$ the resulting affinity matrix.

The detailed proof of Theorem 2.1 is provided in the appendix. In standard training, the spectral norm of the key matrix $K$ or the noise from the query matrix may be large, and as a result the changes to affinity scores may become "unbounded". We speculate that this may hurt the generalization of the self-attention mechanism.
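The bound can be checked numerically: since softmax is 1-Lipschitz, a change to one query row moves the corresponding affinity row by at most $\alpha \|K\|$ times the size of the change. A small NumPy sanity check (the variable names and the particular perturbation are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 6, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
alpha = 1.0 / np.sqrt(d)

delta = 1e-2 * rng.normal(size=d)              # perturbation of one query row
row  = softmax(alpha * Q[0] @ K.T)             # original affinity row
row2 = softmax(alpha * (Q[0] + delta) @ K.T)   # perturbed affinity row

lhs = np.linalg.norm(row - row2)
rhs = alpha * np.linalg.norm(K, 2) * np.linalg.norm(delta)  # alpha * ||K|| * ||dQ||
assert lhs <= rhs   # the perturbation bound holds
```

Note that `np.linalg.norm(K, 2)` computes the spectral norm (largest singular value) of the key matrix, matching the norm used in the bound.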
We impose orthogonality constraints on the affinity matrix $A$. More concretely, we obtain an additional loss term for every layer of the transformer model using the Frobenius norm $\|\cdot\|_F$:
$$\lambda \, \|A A^\top - I\|_F^2,$$
where $I$ is the identity matrix and $\lambda$ is a scaling factor that controls the ratio to the original task loss. With orthogonally constrained affinity scores, each row vector of $A$ is now orthonormal to all the other row vectors. Given that each row vector is a probability distribution over the tokens in the sequence that directs the alignment-based pooling operation, a more diverse form of the self-attention mechanism is thereby encouraged. This can be viewed as an additional quality of orthogonality-constrained transformers.
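A corresponding NumPy sketch for the affinity-matrix loss, assuming the form $\lambda \|A A^\top - I\|_F^2$ implied by the row-orthonormality discussion (the function name and `lam` are ours):

```python
import numpy as np

def affinity_orthogonality_loss(A, lam=1e-8):
    """Penalty lam * ||A A^T - I||_F^2 encouraging the rows of the
    affinity matrix A (attention distributions) to be mutually
    orthonormal, i.e., diverse across query positions."""
    n = A.shape[0]
    return lam * np.sum((A @ A.T - np.eye(n)) ** 2)
```

Since the rows of $A$ are probability distributions, a penalty of exactly zero would force each row to be a distinct one-hot distribution (hard, non-overlapping alignments); with a small $\lambda$, the term only nudges the attention rows toward diversity rather than enforcing this exactly.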

Experiments
We evaluate the effectiveness of orthogonality constrained transformers (OC-transformers for brevity) on ten neural machine translation tasks and two dialogue generation tasks. Specifically, we assess three variants, largely pertaining to where orthogonality constraints are applied, i.e., (i) AM (for the affinity matrix in self-attention), (ii) LA (for the linear transformations in self-attention), and (iii) LF (for the linear transformations in position-wise feed-forward networks). We evaluate them in an incremental fashion with three main model labels: VAR-I (AM only), VAR-II (AM + LA), and VAR-III (AM + LA + LF). The scaling factor $\lambda$ is tuned amongst $\{10^{-6}, 10^{-8}, 10^{-10}\}$.

Neural Machine Translation
For neural machine translation (NMT), we evaluate on both the subword-level and character-level tasks.
All the models are trained with the transformer-base setting. Owing to the smaller size, we use the transformer-small setting for the IWSLT'14 datasets. For the WMT'16 En→De dataset, we train both the transformer-base and transformer-big settings on 4× GPUs with gradient accumulation of 2× to emulate 8× GPU training. Based on the improvement of approximate BLEU scores on the validation set, we train models for 2M steps for the transformer-base setting and 800K steps for the transformer-big setting. Note that between the standard transformer and the OC-transformer, we keep all the other hyperparameters identical to make the comparisons as fair as possible. For character-level NMT, we evaluate on three language pairs, namely En→Fr, Ro→En, and De→En.

Table 2: Experimental results on the large-scale WMT'16 En→De dataset.
Model                                                          BLEU
MoE                                                            26.0
Transformer-base (Vaswani et al., 2017)                        27.3
Transformer-big (Vaswani et al., 2017)                         28.4
Transformer-ott-big (Ott et al., 2018)                         29.3
Dynamic convolution (Wu et al., 2019)                          29.7
OC-transformer-base based on (Vaswani et al., 2017) (VAR-III)  28.5
OC-transformer-big based on (Vaswani et al., 2017)             29.6

Experimental Results Table 1 reports experimental results on subword-level NMT datasets. Overall, we note that the performance of transformers is consistently boosted by orthogonality constraints, ascertaining the effectiveness of adopting such plug-and-play tricks. More specifically, they achieve a +1.0% to +7.3% relative gain over the standard transformer. Notably, all the variants (VAR-I, VAR-II, and VAR-III) boost the performance of transformers, demonstrating that orthogonality constraints are indeed useful. Moreover, orthogonality constraints on the self-attention affinity matrix are beneficial in general even if the rest of the model is not fully enforced with orthogonality constraints. Table 2 reports the results on the large-scale WMT'16 En→De dataset. Orthogonality constraints boost the performance of the transformer-big setting based on (Vaswani et al., 2017), increasing the BLEU from 28.4 to 29.6.
This result outperforms the more advanced transformer-ott-big proposed in (Ott et al., 2018) and comes close to the 29.7 BLEU achieved by the very competitive dynamic convolution model (Wu et al., 2019). Likewise, orthogonality constraints also boost the performance of the transformer-base setting, with the BLEU increased from 27.3 to 28.5. Table 3 reports the results on character-level NMT. We observe that orthogonality constraints consistently boost the performance of standard transformers on all three language pairs: En→Fr (+3.5%), Ro→En (+2.6%), and De→En (+1.6%).

Sequence-to-Sequence Dialogue Generation
We conduct experiments on the sequence-to-sequence dialogue generation task, where the goal is to generate the reply in a two-way conversation.
Experimental Setup We use two datasets: PersonaChat (Zhang et al., 2018) and DailyDialog (Li et al., 2017). We implement our task in Tensor2Tensor using the transformer-small setting in a sequence-to-sequence fashion (Sutskever et al., 2014). We train all the models for 20K steps, which we find sufficient for model convergence.
Beam search with beam size 4 and length penalty 0.6 is adopted for decoding the output sequence. We evaluate all the models with the language generation evaluation suite of (Sharma et al., 2017). Table 4 reports our results on the PersonaChat and DailyDialog datasets.

Experimental Results
The key observation is that all the variants of enforcing orthogonality constraints boost the performance of standard transformers. The best results of OC-transformers make a substantial improvement in all the nine evaluation measures. Notably, on both datasets, the best variants are either VAR-II or VAR-III. VAR-I performs decently and boosts the performance of standard transformers on both tasks, signifying that the orthogonality-constrained affinity matrix in self-attention is sufficiently effective on its own. This mirrors the results on neural machine translation and is consistent across the findings. The relative gain of applying orthogonality constraints is also promising, peaking at +23.5% on BLEU-1 scores and +2.8% to +6.3% on ROUGE.

Table 3: Experimental results on character-level NMT (BLEU).
Model                                            En→Fr   Ro→En   De→En
Transformer (Vaswani et al., 2017)               18.74   22.04   27.59
OC-transformer based on (Vaswani et al., 2017)   +3.5%   +2.6%   +1.6%

Table 4: Experimental results on the PersonaChat dataset (Zhang et al., 2018) and the DailyDialog dataset (Li et al., 2017) on nine evaluation measures (Sharma et al., 2017). SkipT stands for SkipThought cosine similarity, EmbA stands for embedding average, VecE stands for vector extrema, and GreedyM stands for greedy matching.

Conclusion
We studied orthogonality constraints for transformers, which are imposed on (i) linear transformations in self-attention and position-wise feed-forward networks and (ii) the affinity matrix in self-attention. We showed that such plug-and-play constraints, which can be conveniently incorporated, consistently boost performance of transformers on ten different machine translation tasks and two dialogue generation tasks. For example, on the large-scale WMT'16 En→De benchmark, simply plugging-and-playing orthogonality constraints on the original transformer model (Vaswani et al., 2017) increases the BLEU from 28.4 to 29.6, coming close to the 29.7 BLEU achieved by the very competitive dynamic convolution (Wu et al., 2019).
Broader Impact Given the widespread adoption of transformer models, the proposed plug-and-play orthogonality constraints could also be useful to computer vision, automatic speech recognition, time series analysis, and biological sequence analysis.
A Proof of Theorem 2.1

Proof. Let $x = Q_{i,*}$, $g(x) = xK^\top$, and $f(y) = \mathrm{softmax}(y)$. Expressing each row in $A$ as $A_{i,*} = \mathrm{softmax}(\alpha Q_{i,*} K^\top)$, we have $A_{i,*} = f(\alpha\, g(x))$. We first consider bounding $g(x)$ with respect to $\|x - x'\|$. Recalling the definition of the spectral norm,
$$\|A\| = \max_{x \in \mathbb{R}^l \setminus \{0\}} \frac{\|xA\|}{\|x\|},$$
we obtain
$$\|g(x) - g(x')\| = \|(x - x')K^\top\| \le \|K\| \, \|x - x'\|.$$
Here, we can observe that the Lipschitz constant for $g$ is $\|K\|$.
Next, we bound $f(y) = \mathrm{softmax}(y)$ with respect to $\|y - y'\|$. Since $f$ is a differentiable function, it holds that
$$\|f(y) - f(y')\| \le J^* \|y - y'\|,$$
where $J$ is the Jacobian matrix of $f(y)$ with respect to $y$, i.e., $J_{i,j} = \frac{\partial f(y)_i}{\partial y_j}$, and $J^* = \max_y \|J\|$. Since $f(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$, for the diagonal entries of $J$ we have
$$J_{i,i} = f_i - f_i^2,$$
where $f_i = f(y)_i$ for brevity. For the non-diagonal entries of $J$ where $i \ne j$, we have
$$J_{i,j} = -f_i f_j.$$
With this, we can express the Jacobian $J$ as
$$J = \begin{pmatrix} f_1 - f_1^2 & \cdots & -f_1 f_n \\ \vdots & \ddots & \vdots \\ -f_n f_1 & \cdots & f_n - f_n^2 \end{pmatrix} = \mathrm{diag}(f) - f f^\top.$$
Since for any vector $v$ we have $v^\top J v = \sum_i f_i v_i^2 - \left(\sum_i f_i v_i\right)^2 \le \sum_i f_i v_i^2 \le \|v\|^2$, it follows that $J^* \le 1$. Combining this with the Lipschitz constant of $g$ yields the bound on the perturbation to the affinity matrix in Theorem 2.1. $\square$
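The Jacobian structure derived above, $J = \mathrm{diag}(f) - f f^\top$, can be verified against finite differences; a quick NumPy check (`softmax_jacobian` is our own helper name):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def softmax_jacobian(y):
    # J[i, i] = f_i - f_i^2 on the diagonal, J[i, j] = -f_i f_j off it
    f = softmax(y)
    return np.diag(f) - np.outer(f, f)

rng = np.random.default_rng(1)
y = rng.normal(size=5)
J = softmax_jacobian(y)

# finite-difference approximation of the Jacobian, column by column
eps = 1e-6
J_fd = np.column_stack([(softmax(y + eps * e_j) - softmax(y)) / eps
                        for e_j in np.eye(5)])
assert np.allclose(J, J_fd, atol=1e-4)

# J is symmetric with spectral norm at most 1, consistent with J* <= 1
assert np.linalg.norm(J, 2) <= 1.0
```

Each column of the Jacobian sums to zero, reflecting the fact that the softmax outputs always sum to one regardless of the input.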