ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation

Residual networks are an Euler discretization of solutions to Ordinary Differential Equations (ODE). This paper explores a deeper relationship between Transformer and numerical ODE methods. We first show that a residual block of layers in Transformer can be described as a higher-order solution to ODE. Inspired by this, we design a new architecture, ODE Transformer, which is analogous to the Runge-Kutta method that is well motivated in ODE. As a natural extension to Transformer, ODE Transformer is easy to implement and efficient to use. Experimental results on the large-scale machine translation, abstractive summarization, and grammar error correction tasks demonstrate the high genericity of ODE Transformer. It can gain large improvements in model performance over strong baselines (e.g., 30.77 and 44.11 BLEU scores on the WMT’14 English-German and English-French benchmarks) at a slight cost in inference efficiency.


Introduction
Residual networks have been used with great success as a standard method of easing information flow in multi-layer neural models (He et al., 2016; Vaswani et al., 2017). Given an input y_t, models of this kind define the output of layer t to be:

y_{t+1} = y_t + F(y_t, θ_t)    (1)

where F(·, ·) is the function of the layer and θ_t is its parameter. Interestingly, recent work in machine learning (Weinan, 2017; Lu et al., 2018; Haber et al., 2018; Chang et al., 2018; Ruthotto and Haber, 2019) points out that Eq. (1) is an Euler discretization of the Ordinary Differential Equation (ODE)

dy(t)/dt = F(y(t), θ(t))    (2)
where y(t) and θ(t) are continuous with respect to t. In this way, we can call Eq. (1) an ODE block. This finding offers a new way of explaining residual networks from the viewpoint of numerical algorithms. Then, one can think of a multi-layer network as applying the Euler method (i.e., Eq. (1)) for several steps to solve Eq. (2).
However, the Euler method yields a solution with a sufficiently low error bound (call it a stable solution) only if θ(t) changes slowly along t (Haber and Ruthotto, 2017; Chen et al., 2018). But this assumption does not always hold for state-of-the-art natural language processing (NLP) systems, in which models are non-linear and over-parameterized. For example, language modeling and machine translation systems learn quite different parameters for different layers, especially when the layers are close to the model input (Vaswani et al., 2017; Dai et al., 2019). Also, truncation errors are non-negligible for the Euler method because it is a first-order approximation to the true solution (He et al., 2019). These problems become worse when more layers are stacked and errors propagate through the network. This might explain why recent machine translation (MT) systems cannot benefit from extremely deep models (Liu et al., 2020a).

This paper continues the line of research on ODE-inspired methods. The basic idea is to use a high-order method for more accurate numerical solutions to the ODE. This leads to a larger ODE block that generates a sequence of intermediate approximations to the solution. We find that the larger ODE block is sufficient to take the role of several ODE blocks with first-order solutions. The benefit is obvious: the use of fewer ODE blocks lowers the risk of introducing errors in block switching, and the high-order method reduces the approximation error in each ODE block. See Figure 1 for a comparison of different models.
Our method is parameter-efficient because θ(t) is re-used within the same ODE block. As another "bonus", the model can be improved by learning coefficients of different intermediate approximations in a block. We evaluate our method in strong Transformer systems, covering both the wide (and big) model and the deep model. For machine translation tasks, ODE Transformer achieves 30.77 and 44.11 BLEU scores on the WMT'14 En-De and En-Fr test sets, setting a new state-of-the-art on the WMT'14 En-Fr task. It also significantly outperforms baselines on abstractive summarization and grammar error correction tasks.

Transformer and ODEs
We start with a description of Transformer, followed by its relationship with ODEs. We choose Transformer for our discussion and experiments because it is one of the state-of-the-art models in recent sentence generation tasks.

Transformer
Transformer is an example of the encoder-decoder paradigm (Vaswani et al., 2017). The encoder is a stack of identical layers. Each layer consists of a self-attention block and a feed-forward network (FFN) block. Both of them are equipped with a residual connection and a layer normalization unit. Note that the term "block" is used in many different ways. In this paper, the term refers to any neural network that is enhanced by a residual connection (we occasionally call it a residual block). Following the Pre-norm architecture, we define a block as

y_{t+1} = y_t + G(LN(y_t), θ_t)    (3)

where LN(·) is the layer normalization function, and G(·) is either the self-attention or the feed-forward network. The decoder shares a similar architecture, with an additional encoder-decoder attention block sandwiched between the self-attention and FFN blocks.
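To make Eq. (3) concrete, here is a minimal sketch of a Pre-norm residual block in PyTorch; it is our own illustration (module names and sizes are hypothetical), not the paper's released code.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: y_{t+1} = y_t + G(LN(y_t), theta_t)."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # LN(.)
        self.sublayer = sublayer            # G(., theta): self-attention or FFN

    def forward(self, y):
        return y + self.sublayer(self.norm(y))

# Usage with an FFN sublayer as G (hypothetical sizes).
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
block = PreNormBlock(d_model, ffn)
y = torch.randn(10, d_model)                # (sequence length, model dimension)
print(block(y).shape)                       # torch.Size([10, 512])
```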

Ordinary Differential Equations
An ordinary differential equation is an equation involving a function y(t) of a variable t and its derivatives. A simple form of ODE is an equation that defines the first-order derivative of y(t):

dy(t)/dt = f(y(t), t)    (4)

where f(y(t), t) defines a time-dependent vector field if we know its value at all points of y and all instants of time t. Eq. (4) covers a broad range of problems, in that the change of a variable is determined by its current value and a time variable t. This formulation also works with Pre-norm Transformer blocks. For notational simplicity, we redefine G(LN(y_t), θ_t) as a new function F(y_t, θ_t):

F(y_t, θ_t) = G(LN(y_t), θ_t)    (5)

We then relax y_t and θ_t to continuous functions y(t) and θ(t), and rewrite Eq. (3) as
y(t + ∆t) = y(t) + ∆t · F(y(t), θ(t))    (6)

where ∆t is the change of t and is generally called the step size. Obviously, we have ∆t = 1 in Transformer. But we can let the step size ∆t shrink towards zero and take the limit:

lim_{∆t→0} (y(t + ∆t) − y(t)) / ∆t = F(y(t), θ(t))    (7)

Given the fact that lim_{∆t→0} (y(t + ∆t) − y(t)) / ∆t = dy(t)/dt, Eq. (7) is an instance of Eq. (4). The only difference lies in that we introduce θ(t) into the right-hand side of Eq. (4). Then, we say that a Pre-norm Transformer block describes an ODE. It has been found that Eq. (3) shares the same form as the Euler method of solving the ODE described in Eq. (7) (Haber and Ruthotto, 2017). This establishes a relationship between Transformer and ODEs, in that, given F(·, ·) and learned parameters {θ_t}, the forward pass of a multi-block Transformer is a process of running the Euler method for several steps.

The ODE Transformer
In numerical methods for ODEs, we want precise solutions to the ODEs in a minimum number of computation steps. But the Euler method is not "precise" because it is a first-order method and naturally comes with local truncation errors. The global error can become large if we run it for a number of steps. 2 This is obviously the case for Transformer, especially when the multi-layer neural network raises a higher risk of instability in solving the ODEs (Haber and Ruthotto, 2017).
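As a toy illustration of these truncation errors (our own example, not from the paper), the sketch below solves dy/dt = y with y(0) = 1 on [0, 1] using the Euler method and the classical fourth-order Runge-Kutta method (introduced in the next subsection), and compares both against the exact solution e:

```python
import math

def f(y):
    return y  # dy/dt = y, with exact solution y(t) = e^t for y(0) = 1

def euler(y, h, steps):
    # First-order method: one function evaluation per step.
    for _ in range(steps):
        y = y + h * f(y)
    return y

def rk4(y, h, steps):
    # Classical fourth-order Runge-Kutta: four evaluations per step.
    for _ in range(steps):
        F1 = h * f(y)
        F2 = h * f(y + F1 / 2)
        F3 = h * f(y + F2 / 2)
        F4 = h * f(y + F3)
        y = y + (F1 + 2 * F2 + 2 * F3 + F4) / 6
    return y

exact = math.e  # y(1) = e
for steps in (4, 8, 16):
    h = 1.0 / steps
    print(f"steps={steps:2d}  Euler error={abs(euler(1.0, h, steps) - exact):.2e}  "
          f"RK4 error={abs(rk4(1.0, h, steps) - exact):.2e}")
# Euler's error shrinks roughly like O(h); RK4's like O(h^4).
```

Even at the same step size, the higher-order method is several orders of magnitude more accurate, which is the property the ODE Transformer block exploits.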

High-Order ODE Solvers
Here we use the Runge-Kutta methods for higher-order solutions to ODEs (Runge, 1895; Kutta, 1901; Butcher, 1996; Ascher and Petzold, 1998). They are a classic family of iterative methods with different orders of precision. 3 More formally, an explicit Runge-Kutta method with n steps is defined to be:

y_{t+1} = y_t + Σ_{i=1}^{n} γ_i F_i    (8)
F_1 = h · f(y_t, t)    (9)
F_i = h · f(y_t + Σ_{j=1}^{i-1} β_{ij} F_j, t + α_i h)    (10)

where h is the step size and can simply be 1 in most cases. F_i is an intermediate approximation to the solution at step t + α_i h. α, β and γ are coefficients which can be determined by the Taylor series of y_{t+1} (Butcher, 1963). Eq. (10) describes a sequence of solution approximations {F_1, ..., F_n} over n steps {t + α_1 h, ..., t + α_n h}. These approximations are then interpolated to form the final solution, as in Eq. (8). The Runge-Kutta methods are straightforwardly applicable to the design of a Transformer block. All we need is to replace the function f (see Eq. (10)) with the function F (see Eq. (5)). The advantage is that the function F is re-used in a block. Also, the model parameter θ_t can be shared within the block. 4
2 The global error is what we would ordinarily call the error: the difference between y(t) and the true solution. The local error is the error introduced in a single step: the difference between y(t) and the solution obtained by assuming that y(t−1) is the true solution.
3 A p-th-order numerical method means that the global truncation error is proportional to the p-th power of the step size.
4 Although we could distinguish the parameters at different steps in a block, we found that this did not help and made the model difficult to learn.
In this way, one can omit t + α_i h in Eq. (10) and compute F_i by

F_i = F(y_t + Σ_{j=1}^{i-1} β_{ij} F_j, θ_t)    (11)

This makes the system more parameter-efficient. As will be shown in our experiments, the high-order Runge-Kutta methods can learn strong NMT systems with significantly smaller models. The Runge-Kutta methods are general. For example, the Euler method is a first-order instance of them. For a second-order Runge-Kutta (RK2) block, we have

y_{t+1} = y_t + 1/2 · (F_1 + F_2)    (12)
F_1 = F(y_t, θ_t)    (13)
F_2 = F(y_t + F_1, θ_t)    (14)

This is also known as the improved Euler method. Likewise, we can define a fourth-order Runge-Kutta (RK4) block to be:

y_{t+1} = y_t + 1/6 · (F_1 + 2 F_2 + 2 F_3 + F_4)    (15)
F_1 = F(y_t, θ_t)    (16)
F_2 = F(y_t + 1/2 · F_1, θ_t)    (17)
F_3 = F(y_t + 1/2 · F_2, θ_t)    (18)
F_4 = F(y_t + F_3, θ_t)    (19)

See Figure 2 for a comparison of different Runge-Kutta blocks. It should be noted that the method presented here can be interpreted from the perspective of representation refinement (Greff et al., 2017). It provides a way for a function to update the function itself. For example, Universal Transformer refines the representation of the input sequence using the same function and the same parameters in a block-wise manner (Dehghani et al., 2019). Here we show that inner-block refinements can be modeled with good theoretical support.
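As a hedged sketch of how Eqs. (12)-(19) translate into a layer, the PyTorch code below wraps a shared sublayer F into an RK2 or RK4 ODE block; the class and argument names are our own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ODEBlock(nn.Module):
    """Runge-Kutta style Transformer block.

    `F` is the shared residual function F(y, theta) = G(LN(y), theta),
    e.g., a Pre-norm self-attention or FFN sublayer. The same F (and thus
    the same parameters) is reused for all intermediate approximations.
    Illustrative sketch under our own naming assumptions.
    """
    def __init__(self, F: nn.Module, order: str = "rk2"):
        super().__init__()
        self.F = F
        self.order = order

    def forward(self, y):
        if self.order == "rk2":                 # Eqs. (12)-(14)
            F1 = self.F(y)
            F2 = self.F(y + F1)
            return y + 0.5 * (F1 + F2)
        elif self.order == "rk4":               # Eqs. (15)-(19)
            F1 = self.F(y)
            F2 = self.F(y + 0.5 * F1)
            F3 = self.F(y + 0.5 * F2)
            F4 = self.F(y + F3)
            return y + (F1 + 2 * F2 + 2 * F3 + F4) / 6.0
        else:                                   # Euler: a first-order block
            return y + self.F(y)

# Usage with a Pre-norm FFN sublayer as F (hypothetical sizes).
d_model = 512
F = nn.Sequential(nn.LayerNorm(d_model),
                  nn.Linear(d_model, 2048), nn.ReLU(),
                  nn.Linear(2048, d_model))
block = ODEBlock(F, order="rk4")
print(block(torch.randn(10, d_model)).shape)    # torch.Size([10, 512])
```

Note how the same F (and hence the same θ_t) is evaluated two or four times per block, which is exactly the implicit parameter sharing discussed below.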

Coefficient Learning
In our preliminary experiments, the RK2 and RK4 methods yielded promising BLEU improvements when the model was shallow. But it was found that the improvements did not persist for deeper models. To figure out why this happened, let us review the Runge-Kutta methods from the angle of training. Take the RK2 method as an example. We rewrite Eq. (12) by substituting F_1 and F_2, as follows:

y_{t+1} = y_t + 1/2 · F(y_t, θ_t) + 1/2 · F(y_t + F(y_t, θ_t), θ_t)    (20)

Let E be the loss of training, L be the number of blocks of the model, and y_L be the model output.
The gradient of E at y_t is

∂E/∂y_t = ∂E/∂y_L · ∏_{k=t}^{L-1} (1 + 1/2 · ∂F(y_k, θ_k)/∂y_k + 1/2 · ∂F(y_k + F(y_k, θ_k), θ_k)/∂y_k)    (21)

Seen from Eq. (21), ∂E/∂y_t is proportional to the factor (1/2)^{L−t}. This leads to a higher risk of gradient vanishing when L is larger.
The problem can be partly attributed to the small coefficients of F_i, that is, γ_1 = γ_2 = 1/2. A natural idea is to empirically set γ_i = 1 to eliminate the product factors of less than 1 in gradient computation, although this is not theoretically grounded in standard Runge-Kutta methods. We rewrite Eq. (20) with the new coefficients, as follows:

y_{t+1} = y_t + F(y_t, θ_t) + F(y_t + F(y_t, θ_t), θ_t)    (22)

Then, we have the gradient

∂E/∂y_t = ∂E/∂y_L · ∏_{k=t}^{L-1} (1 + ∂F(y_k, θ_k)/∂y_k + ∂F(y_k + F(y_k, θ_k), θ_k)/∂y_k)    (23)

This model is easy to optimize because ∂E/∂y_L can be passed to lower-level blocks without scaling. Note that the methods here are instances of parameter sharing (Dehghani et al., 2019; Lan et al., 2020). For example, in each ODE block, we use the same function F with the same parameter θ_t for all intermediate steps. Setting γ_i = 1 is a further step towards this because F_i is passed to the following computations with the same scale. Here we call it implicit parameter sharing.
Another way of scaling F_i to further improve the ODE functions is to learn the coefficients automatically on the training data. The simplest method is to initialize γ_i = 1 and optimize each scale independently. This helps the system learn how F_i should flow within a block. Building on this, scaling F_i by a weighted gate mechanism (Srivastava et al., 2015) empirically achieves the best performance (see Section 4). Take the RK2-block as an instance: the concatenation of F_1 and F_2 is transformed into a scalar g ∈ (0, 1) through a sigmoid gate, and the block output y_{t+1} is

g = σ([F_1, F_2] W + b)
y_{t+1} = y_t + g · F_1 + (1 − g) · F_2

where [·, ·] denotes the concatenation operation and W, b are learnable parameters. We call it RK2-block (learnable γ_i), and the architecture is shown in Figure 2 (d). This formulation offers a more flexible way to decide which part contributes more, and it is also easy to optimize. Moreover, we summarize the comparison of various scaling functions in Appendix C.
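A hedged sketch of the learnable-coefficient variant follows; we read the description above as a single sigmoid gate g that weights F_1 and F_2 by g and 1 − g, so the gate form and all names here are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

class RK2GateBlock(nn.Module):
    """RK2 block with a learned gate over the two intermediate approximations.

    Sketch of the 'learnable gamma_i' variant as we read it: a sigmoid gate
    computed from [F1, F2] weights the two approximations by g and 1 - g.
    """
    def __init__(self, F: nn.Module, d_model: int):
        super().__init__()
        self.F = F                                # shared sublayer F(y, theta)
        self.gate = nn.Linear(2 * d_model, 1)     # W, b of the sigmoid gate

    def forward(self, y):
        F1 = self.F(y)
        F2 = self.F(y + F1)
        g = torch.sigmoid(self.gate(torch.cat([F1, F2], dim=-1)))  # scalar in (0, 1)
        return y + g * F1 + (1.0 - g) * F2

# Usage (hypothetical sizes), with a Pre-norm FFN sublayer as F.
d_model = 512
F = nn.Sequential(nn.LayerNorm(d_model),
                  nn.Linear(d_model, 2048), nn.ReLU(),
                  nn.Linear(2048, d_model))
block = RK2GateBlock(F, d_model)
print(block(torch.randn(10, d_model)).shape)      # torch.Size([10, 512])
```

The sigmoid keeps the learned weights in (0, 1), which matches the observation in Appendix C that a Tanh gate with the larger range [−1, 1] is harder to optimize, while the identity path y still passes gradients unscaled.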

Efficiency Discussion
ODE Transformer is efficient to use. As we only apply the ODE design schema to the encoder side, it has only a minor impact on inference speed, since autoregressive decoding dominates the inference cost. Another concern is memory consumption. ODE Transformer consumes more memory than a baseline of the same depth, since we need to store the intermediate approximations in the forward pass. But the additional consumption is less than that of a baseline with the same computation cost, which is acceptable for most scenarios. We give a quantitative analysis in Section 5.

Experiments
We evaluated the ODE Transformer on three sequence generation tasks: machine translation, abstractive summarization and grammar error correction. The datasets we used are described in the following subsection, and more details of the experimental setups can be found in Appendices A and B.

Datasets
Machine Translation We report results on three WMT benchmarks. For the WMT'14 English-German (En-De) task, the training data consisted of approximately 4.5M tokenized sentence pairs, as in Vaswani et al. (2017). All sentences were segmented into sequences of sub-word units (Sennrich et al., 2016) with 32K merge operations using a shared vocabulary. We selected newstest2013 as the validation data and newstest2014 as the test data. For the WMT'14 English-French (En-Fr) task, we used the dataset provided within Fairseq, i.e., 36M training sentence pairs from WMT'14. newstest2012+newstest2013 was the validation data and newstest2014 was the test data. For the WMT'16 English-Romanian (En-Ro) task, we replicated the setup of Mehta et al. (2020), which used 600K/2K/2K sentence pairs for training, validation and test, respectively.
Abstractive Summarization We also tested the models' ability to process long sequences on the CNN-DailyMail summarization task (Nallapati et al., 2016; Hermann et al., 2015). The preprocessing followed (2018), and we used the provided preprocessing script. The word-level dropout technique was also applied to prevent overfitting.

Language Modeling
The truncation error analysis is conducted on the Penn Treebank (Mikolov et al., 2011), a widely used language modeling dataset. It contains 88K, 3,370 and 3,761 sentences for training, validation and test, respectively. The vocabulary size was 10K. We set the layer depth of the language model to 1 or 2 to make a fair comparison. When the depth is 1, the loss between the block output and the ground truth can be regarded as the truncation error, which alleviates the influence of error accumulation across different layers.

Results of Summarization and Correction
We also evaluated the ODE Transformer on the other two sequence generation tasks. On both tasks, ODE Transformer outperforms the baselines by a margin. Similarly, RK4-block is superior to RK2-block when the model is shallow.
More results and case studies could be found in Appendix C.
Quantification of the Truncation Error In fact, we cannot obtain the "true" solution of each block output in NMT, because we mainly experimented on the encoder side. Instead, we tested our system on the language modeling task, where the perplexity between the single-layer model output and the ground truth can be regarded as the truncation error with no error propagation. Table 5 shows the perplexities on the Penn Treebank dataset (Mikolov et al., 2011). All ODE Transformer variants reduce the errors significantly. RK4-block achieves the lowest PPL in both settings. In addition, RK2-block can even obtain a lower PPL than a 2-layer residual block. The observation here again verifies that larger ODE blocks behave better than the standard residual block.

Inference Speed and Memory Consumption
Table 6 shows the comparison of inference speed and memory consumption discussed in Section 3.3. The results demonstrate that the proposed ODE design schema yields acceptable inference speed. The memory comparison between the baseline and the RK variants in both base and big configurations also shows that it is memory-friendly.
BLEU against Encoder Depth Figure 3 (left) plots BLEU against the encoder depth. ODE Transformer achieves better performance over all depths, especially when the model becomes deeper. Interestingly, Figure 3 confirms again that ODE Transformer is parameter-efficient, e.g., a 6-layer RK2-block is comparable with the 18-layer baseline system. Another finding is that RK4-block performs well on shallow models, but it is inferior to RK2-block when the model becomes deeper. This is because the original coefficients may cause optimization problems in backward propagation for deep models (see Section 3.2). Also, Figure 3 (right) plots BLEU as a function of the model size when the hidden size is 256. The RK2 method significantly surpasses the baseline with far fewer parameters.
Ablation Study on Different F (·, ·) As stated in Section 3, the F (·, ·) function can either be SAN, FFN or both of them (SAN+FFN). As shown in Figure 4, high-order ODE works better with FFN than SAN. An explanation might be that the FFN component has more parameters than the SAN component. 5 The model that treats FFN and SAN as a single ODE block behaves the best.
Training and Validation Perplexity Figure 5 plots the training and validation PPL curves of RK blocks and the baseline enhanced by RPR (Shaw et al., 2018). RK2-block obtains lower training and validation PPLs in both configurations (base and wide models).

Visualization of the Gradient Norm
We also collected the gradient information of several well-trained systems during training. Figure 6 plots the gradient norm of RK2-block-v2, RK4-block and the standard residual block (baseline). As we can see, the Pre-norm residual block is able to make the training stable. Both RK2-block-v2 and RK4-block provide richer gradient signals due to the implicit parameter sharing among intermediate approximations. The two learning curves appear to be nearly the same, which is consistent with the results in Table 1.

Comparison of Different ODE Design Schemas
Then, we present a comprehensive analysis of several ODE design schemas. The design schemas discussed in Lu et al. (2018) are summarized in Table 7. We re-implemented these methods using the same codebase for fair comparisons. We conducted experiments following the base configuration on the En-De task. At time t, multistep Euler methods require previous states, e.g., y_{t−1}, to generate the current approximation, instead of performing iterative refinements based on the current-time state. So these methods are heavier than ODE Transformer. Note that DLCL can also be regarded as a multistep Euler method, which is more competitive in deep Transformer, but it yields just a modest improvement upon the shallow baseline. Theoretically, the Backward Euler method is slightly better than the Forward Euler method in numerical analysis, but the improvement is marginal. Note that our ODE Transformer achieves consistent BLEU improvements over the aforementioned methods. The reason is that such iterative refinements provide more efficient and effective parameter learning.

A Datasets

Table 8 summarizes the details of our datasets. We present both the number of sentences and the number of tokens for each task. For the En-De and En-Fr tasks, the datasets used in this work could be found in Fairseq. 6 For the En-Ro task, we used the preprocessed dataset provided by DeLight. 7 Note that we only shared the target embedding and the softmax embedding instead of a shared vocabulary between the source side and the target side. The CNN/DailyMail dataset consists of CNN stories 8 and Daily Mail stories. 9 For the grammar error correction task (GEC), we conducted experiments on the CONLL dataset. 10

B Training and Evaluation
Training As suggested in Li et al. (2020)'s work, we adopted the relative positional representation (RPR) (Shaw et al., 2018) for stronger baselines. Dense connections among layers are also applied for stable learning since the model is optimized with FP16 training. All experiments were run on 8 GPUs with 4,096 tokens on each GPU. For the En-De and the En-Fr tasks, we employed the gradient accumulation strategy with a step of 2 and 8, respectively. We used the Adam optimizer (Kingma and Ba, 2015) whose hyperparameters were set to (0.9, 0.997). The hyperparameters, including the learning rate, the warmup steps and the total training steps of the three tasks, could be found in Table 8. We averaged the last 5/10 checkpoints for fair comparisons with previous work. The details of the Base/Deep/Wide configurations are as follows:
• Base/Deep Model. The hidden size of self-attention was 512, and the dimension of the inner layer in the FFN was 2,048. We used 8 heads for attention. For training, we set all dropout rates to 0.1 by default, including residual dropout, attention dropout and ReLU dropout. Label smoothing ls = 0.1 was applied to enhance the generation ability of the model. For deep models, we only enlarged the encoder depth considering the inference speed.
• Wide (or Big) Model. We used the same architecture as Transformer-Base but with a larger hidden size (1,024), more attention heads (16), and a larger feed-forward inner layer (4,096 dimensions). The residual dropout was set to 0.3 for the En-De task and 0.1 for the En-Fr task.
For the language modeling task, the hidden size was 512, and the filter size of the FFN was 2,048. We set all dropout rates to 0.1, including the residual dropout, attention dropout and ReLU dropout. Each model was trained for up to 20 epochs, and most models achieved the lowest PPL on the validation set at around epoch 10. Then the validation PPL began to increase, although the training PPL was still declining. The warmup step was 2,000 and the batch size was 4,096. The max learning rate was set to 0.0007.
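For concreteness, the schedule implied by these numbers might look like the minimal sketch below, assuming the common inverse-square-root decay with linear warmup (the exact scheduler is not specified here, so this is an assumption for illustration):

```python
def inverse_sqrt_lr(step, max_lr=0.0007, warmup_steps=2000):
    """Linear warmup to max_lr, then inverse-square-root decay."""
    step = max(step, 1)
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # linear warmup
    return max_lr * (warmup_steps / step) ** 0.5     # inverse-sqrt decay

# Example: the learning rate right after warmup and much later in training.
print(inverse_sqrt_lr(2000))    # 0.0007
print(inverse_sqrt_lr(50000))   # 0.00014
```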
Evaluation For machine translation, we measured performance in terms of BLEU. Both tokenized BLEU and SacreBLEU 11 scores were reported on the En-De and En-Fr tasks. Also, we reported tokenized BLEU scores on the En-Ro task. In addition, we measured ROUGE-1, ROUGE-2 and ROUGE-L for CNN/DailyMail, and precision, recall and F_0.5 for CONLL. The beam size and length penalty of each task are summarized in Table 8.
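As a small usage example (our own, with made-up sentences), a detokenized SacreBLEU score can be computed with the sacrebleu Python package as follows:

```python
import sacrebleu  # assumes the sacrebleu package is installed

hypotheses = ["The cat sat on the mat."]
references = [["The cat sat on the mat."]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 100.0 for an exact match
```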

C Additional Results and Analyses
Comparison on the CNN/DailyMail Dataset We summarize the previous results on the CNN/DailyMail dataset (see Table 9). The performance was evaluated by ROUGE-1, ROUGE-2 and ROUGE-L. Intuitively, high-order ODE functions yield significant improvements over the Euler method as well as over several strong existing models. 12 Again, RK4-block beats the baseline and RK2-block by up to 1.36 and 0.25 ROUGE-1 points, respectively.

Comparison of Various Scaling Methods
We have emphasized the importance of automatic coefficient learning in Section 3.2. The forward pass of RK2-block can be described as y_{t+1} = y_t + γ_1 · F_1 + γ_2 · F_2, where γ_1 and γ_2 are coefficients that can be numerically suggested or learned. Here we exhibit the comparison of various scaling methods on the WMT'14 En-De dataset, and the results are listed in Table 10. We can see that RK2-block (learnable γ_i), equipped with a single sigmoid gate (line 5 in Table 10), achieves the best performance, especially when the model is deep. A possible explanation is that Tanh produces a larger output range ([−1, 1]), which is more difficult to optimize than the sigmoid function.
Case Study on the GEC Task Table 11 summarizes several cases from the GEC task. Here, we make a comparison between the baseline and RK4-block due to its superiority on the GEC task. We can clearly see that the proposed RK4-block delivers more accurate corrections than the baseline when handling subject-verb agreement (Case2), collocation (Case1, Case3), spelling (Case4) and other issues. More specifically, Figure 7 illustrates the statistics of different error types annotated by ERRANT (Bryant et al., 2017), a grammatical ERRor ANnotation Toolkit designed to automatically annotate parallel error correction data. For more details please refer to Bryant et al.
(2017)'s work. With the help of ERRANT, we can carry out a detailed error-type analysis. As shown in Figure 7, RK4-block corrects the input in a way more similar to the reference, though there is still a large gap between them. Limited by model capacity, the baseline sometimes cannot even generate the right corrections, e.g., for the R:PUNCT and M:OTHER cases.

D Comparison with Related Work
As we aforementioned, the ODE design schema somehow shares a similar merit with the weight sharing, especially when the coefficients are set to 1. This is because we reuse the same function F to compute the intermediate approximation at each timestep, and it is also an effective way to apply the higher-order ODE into the Transformer architecture. Compared with weight sharing (line 1 in Table 10), ODE Transformer variants can deliver better performance within the same computation cost, demonstrating the effectiveness of ODE design schema. Next, we make a detailed comparison between the proposed ODE Transformer and previous studies (Baier-Reinio and De Sterck, 2020;Zhu and Fu, 2018; to avoid the potential misunderstandings. Compared with RKNet RKNet (Zhu and Fu, 2018) is mainly designed to improve the ResNet using implicit Runge-Kutta methods for vision tasks. There are some differences between ours and RKNet. (i) We mainly conduct experiments on sequence generation tasks, e.g. machine translation, abstract summarization, and grammar error correction tasks. They focused on the image clas-

Table 11: Cases from the GEC task.

Case1
Source: What 's more , various of cultures can be shown to us through social medias .
Reference: What 's more , various cultures can be shown to us through social media .
Baseline: What 's more , various cultures can be shown to us through social medias .
RK4: What 's more , various cultures can be shown to us through social media .

Case2
Source: Social media sites such as Facebook has allow us to share our pictures or even chat online with our parents while we are overseas .
Reference: Social media sites such as Facebook have allowed us to share our pictures or even chat online with our parents while we are overseas .
Baseline: Social media sites such as Facebook allow us to share our pictures or even chat online with our parents while we are overseas .
RK4: Social media sites such as Facebook have allowed us to share our pictures or even chat online with our parents while we are overseas .

Case3
Source: On one side , it is obvioualy that many advantages have been brought to our lives .
Reference: On the one hand , it is obvious that many advantages have been brought to our lives .
Baseline: On one hand , it is obvious that many advantages have been brought to our lives .
RK4: On the one hand , it is obvious that many advantages have been brought to our lives .

Case4
Source: Other than that , I believe that the stong bond we have with our family is the biggest pillar of support to the carrier .
Reference: Other than that , I believe that the strong bond we have with our family is the biggest pillar of support to the carrier .
Baseline: Other than that , I believe that the stong bond we have with our family is the biggest pillar of support to the carrier .
RK4: Other than that , I believe that the strong bond we have with our family is the biggest pillar of support to the carrier .
Compared with N-ODE As discussed in the related work, our work is complementary to Baier-Reinio and De Sterck (2020)'s work. We empirically demonstrate the effectiveness of integrating the ODE design schema into Transformer on several sequence generation tasks. This work may shed light on the design of effective Transformer architectures from a numerical perspective and provides stronger baselines for the literature.
Compared with CSAODE The differences between the two works are summarized below: (i) As emphasized above, the benchmarks we experimented on are quite different. They mainly validated the proposed CSAODE on text classification and QA tasks. (ii) The proposed CSAODE is an extension of neural ODE (Chen et al., 2018), and the motivation is quite different. They aim to efficiently compute the continuous states of hidden features with only one layer of parameters and propose a self-attention solver for this purpose, whereas our motivation is to employ higher-order ODE solutions to reduce the truncation errors produced by each layer. Moreover, CSAODE is a single-layer model, while ours is a multi-layer sequence-to-sequence model. We also compare different components based on higher-order ODE solutions (see Figure 4). (iii) A single-layer model is not strong enough to solve complicated tasks, e.g., machine translation. However, when stacking several layers, we need to reconsider the error accumulation among layers, where each layer is an individual ODE solver. How to mitigate this error accumulation is the main goal of this work, and it is not discussed in their work.

E Derivations of the Equation
Let E be the loss of training, L be the number of blocks of the model, and y_L be the model output.