EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation

We introduce EdgeFormer, a parameter-efficient Transformer for on-device seq2seq generation under strict computation and memory constraints. Compared with previous parameter-efficient Transformers, EdgeFormer applies two novel principles for cost-effective parameterization, allowing it to perform better given the same parameter budget; moreover, EdgeFormer is further enhanced by a layer adaptation innovation proposed for improving networks with shared layers. Extensive experiments show that EdgeFormer effectively outperforms previous parameter-efficient Transformer baselines and achieves competitive results under both computation and memory constraints. Given the promising results, we release EdgeLM, the pretrained version of EdgeFormer, which is the first publicly available pretrained on-device seq2seq model that can be easily fine-tuned for seq2seq tasks with strong results, facilitating on-device seq2seq generation in practice.


Introduction
On-device modeling draws increasing attention for its unique advantages (Dhar et al., 2019). On the other hand, strict resource constraints prevent many neural networks from performing well in the on-device setting. In Natural Language Processing (NLP), on-device sequence-to-sequence (seq2seq) generation remains challenging, especially for the Transformer (Vaswani et al., 2017) under strict resource constraints on both computation and memory.
To customize the Transformer for seq2seq tasks in the on-device setting, we propose EDGEFORMER, a novel parameter-efficient Transformer of the encoder-decoder architecture. EDGEFORMER is structurally similar to the standard Transformer with a deep encoder and shallow decoder, but with one modification: it uses an interleaved decoder with shared lightweight feed-forward networks, as shown in Figure 1. The modified decoder architecture allows EDGEFORMER to apply two novel principles that we propose for cost-effective parameterization: 1) encoder-favored parameterization, which suggests parameterizing the encoder with as many parameters as possible; 2) load-balanced parameterization, which suggests balancing the load of model parameters to avoid them being either underused or overused in a network with shared parameterization.
In addition to cost-effective parameterization, EDGEFORMER proposes and applies layer adaptation to further improve the model with tied layers, as Figure 2 shows. Inspired by parameter-efficient task transfer, we investigate three efficient layer adaptation approaches for improving performance at negligible cost. We show that EDGEFORMER (with fewer than 10 million model parameters) largely outperforms the strong UNIVERSAL TRANSFORMER baselines in the on-device setting with competitive results, and that the int8-quantized EDGEFORMER can perform high-quality on-device seq2seq generation within around 100ms latency (for sequences of 20-30 tokens on average) using two mid-to-high-end CPU cores and less than 50MB of RAM.
The contributions of this work are three-fold:
• This paper is one of the earliest works that formally study on-device seq2seq generation, discussing its challenges and defining a practical setting with appropriate resource constraints for evaluation.
• We propose EDGEFORMER, a parameter-efficient Transformer with novel cost-effective parameterization and layer adaptation, achieving state-of-the-art results in the on-device seq2seq generation setting under strict computation and memory resource constraints.
• We introduce and release EDGELM (the pretrained EDGEFORMER), the first publicly available pretrained on-device seq2seq model that can be easily fine-tuned for seq2seq tasks with strong results, largely reducing the effort needed to deliver a powerful on-device seq2seq model in practice.

Architecture
The Transformer follows the encoder-decoder architecture. The Transformer encoder consists of a stack of encoder layers, each of which has a self-attention module parameterized by projection matrices W_Q, W_K, W_V, W_O for the query, key, value and output, Attn(x) = softmax(xW_Q (xW_K)^T / \sqrt{d}) (xW_V) W_O, followed by a feed-forward network (FFN) parameterized by W_1 and W_2: FFN(x) = max(0, xW_1 + b_1) W_2 + b_2. The Transformer decoder consists of a stack of decoder layers whose architecture is similar to an encoder layer's, except for an additional cross-attention module between the self-attention and the FFN.
(For simplicity, we omit the layer normalization and residual connections, which are not related to this work.)
In summary, the main parameters of an encoder layer i are its self-attention projections {W_Q^(i), W_K^(i), W_V^(i), W_O^(i)} and its FFN weights {W_1^(i), W_2^(i)}. For a decoder layer j, its main parameters are {W_Q^(j), W_K^(j), W_V^(j), W_O^(j), U_Q^(j), U_K^(j), U_V^(j), U_O^(j), W_1^(j), W_2^(j)}, where U_* parameterizes the cross-attention module.
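As a concrete tally, the per-layer parameter counts can be sketched as follows (an illustrative bookkeeping script, not the paper's code; it uses the single-matrix notation above and omits biases):

```python
# Illustrative parameter bookkeeping for Transformer layers (single-matrix view,
# biases omitted). Shapes follow the W_Q/W_K/W_V/W_O, U_*, and W_1/W_2 notation above.
def attn_params(d):
    # Query, key, value and output projections are all d x d
    return {"W_Q": (d, d), "W_K": (d, d), "W_V": (d, d), "W_O": (d, d)}

def ffn_params(d, d_ffn):
    return {"W_1": (d, d_ffn), "W_2": (d_ffn, d)}

def encoder_layer(d, d_ffn):
    return {**attn_params(d), **ffn_params(d, d_ffn)}

def decoder_layer(d, d_ffn):
    # Self-attention + cross-attention (U_*) + FFN
    return {**{"self_" + k: v for k, v in attn_params(d).items()},
            **{"U_" + k[2:]: v for k, v in attn_params(d).items()},
            **ffn_params(d, d_ffn)}

def n_params(layer):
    return sum(rows * cols for rows, cols in layer.values())

# Transformer-base shapes: d = 512, d_ffn = 4d
# -> 12*d^2 per encoder layer, 16*d^2 per decoder layer
assert n_params(encoder_layer(512, 2048)) == 12 * 512 ** 2
assert n_params(decoder_layer(512, 2048)) == 16 * 512 ** 2
```

This arithmetic is what makes decoder FFNs (8d^2 of the 16d^2 per decoder layer) the dominant cost that later sections target.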

Parameterization: Full vs Shared
Full Parameterization Full parameterization is the common parameterization approach for the Transformer: each model parameter (excluding embeddings) is independent and not shared by multiple modules in the network, so in a forward pass each parameter is used only once. Full parameterization allows parameters the flexibility to fit their roles well during model training.

Shared Parameterization Shared parameterization ties one group of parameters to multiple modules or layers, so each shared parameter is used multiple times in a forward pass. It greatly reduces the number of distinct model parameters, at the cost of less specialized modules.

Constraints for On-device Seq2seq

Computation On-device computer vision (CV) models tend to use 1G FLOPS (0.5G MACs) as a constraint, which is directly followed by previous work on on-device translation (Wu et al., 2020). In our work, however, we propose to relax the FLOPS constraint for typical seq2seq tasks to 2G FLOPS (1G MACs), because the latency requirement for on-device seq2seq generation is not as rigid as for CV tasks, and it is uncommon for an on-device seq2seq model to handle many concurrent requests in practice. The relaxed constraint allows better prediction quality, which strongly correlates with user experience for seq2seq tasks, while still ensuring that the CPU on edge devices can process tens of sentences per second, which is more than sufficient for an on-device seq2seq model. In addition to FLOPS, a theoretical hardware-independent measurement of computational cost, we also require the runtime latency for an input sentence (typically 20-30 tokens on average) to be within around 100ms using two mid-to-high-end CPU cores.
Memory In contrast to deploying a model on a cloud server, where memory cost matters little, an on-device model faces a very strict memory constraint in practice, because a user's edge device (e.g., a PC) is not dedicated to model hosting; it usually runs many other (background) apps and programs at the same time. To ensure moderate memory cost, we limit the number of model parameters (excluding the word embedding lookup table) to 10 million, following previous work (Wu et al., 2020), and require the runtime memory footprint to be less than 50MB.
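The constraints above can be summarized as a simple budget check (a sketch with illustrative numbers; `runtime_mb` and `flops` for any real model would of course have to be measured, not assumed):

```python
# Hedged sketch of the on-device budget described above: <= 10M non-embedding
# parameters, < 50MB runtime RAM, <= 2G FLOPS (1G MACs) per forward pass.
PARAM_BUDGET = 10_000_000        # non-embedding parameters
RAM_BUDGET_MB = 50               # runtime memory footprint
FLOPS_BUDGET = 2_000_000_000     # 2G FLOPS

def fits_budget(n_params, runtime_mb, flops):
    return (n_params <= PARAM_BUDGET
            and runtime_mb < RAM_BUDGET_MB
            and flops <= FLOPS_BUDGET)

# A hypothetical ~9.4M-parameter model: int8 weights alone take ~9.4MB of RAM,
# plus an assumed ~20MB of activations and runtime overhead.
assert fits_budget(9_400_000, runtime_mb=9.4 + 20, flops=1.8e9)
# Transformer-base (~45M non-embedding parameters) blows the parameter budget.
assert not fits_budget(45_000_000, runtime_mb=45 + 20, flops=1.8e9)
```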

Architecture
The biggest challenge for an on-device seq2seq model concerns model size and memory cost. As shown in Table 1, the number of parameters of a standard Transformer-base model (d=512) is about 45 million (excluding embedding parameters), far beyond the parameterization budget (10 million), and it unavoidably leads to massive memory cost despite acceptable FLOPS. EDGEFORMER is proposed to address this challenge. Instead of disruptive architectural changes as in previous research (Wu et al., 2020; Mehta et al., 2020; Panahi et al., 2021), EDGEFORMER's architecture basically follows the standard Transformer, consisting of a 12-layer encoder and a 2-layer decoder, which is efficient in decoding. We mainly discuss the model with d=512 in this paper, since it can achieve good performance in the on-device setting as long as it is appropriately parameterized. The minor architectural modification we propose for EDGEFORMER is an interleaved decoder in which attention modules are interleaved with shared lightweight FFNs (d_decffn < d; in this work, d_decffn = d/4) in each decoder layer (shown in Figure 1). The modification is helpful for cost-effective parameterization (Section 4.2):
• The interleaved structure makes the architecture of encoder and decoder layers consistent (Ma et al., 2021), facilitating shared parameterization of attention modules throughout the encoder and decoder.
• As shown in Table 1, the lightweight FFNs that interleave with attention modules in the decoder reduce FLOPS and save a large number of parameters that would otherwise go to the very uneconomical parameterization of decoder FFNs.
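The decoder-FFN saving can be checked with quick arithmetic (illustrative, for d = 512, using the d_ffn settings stated in Table 1):

```python
# Decoder-FFN parameter counts: a vanilla decoder FFN uses d_ffn = 4d, while the
# interleaved decoder's shared lightweight FFN uses d_ffn = d/4.
d = 512
vanilla_ffn = 2 * d * (4 * d)    # W_1 and W_2 with d_ffn = 4d, per decoder layer
light_ffn = 2 * d * (d // 4)     # one lightweight FFN group with d_ffn = d/4

assert vanilla_ffn == 8 * d * d       # ~2.1M parameters per decoder layer
assert light_ffn == d * d // 2        # d^2/2, shared across the whole decoder
assert vanilla_ffn // light_ffn == 16 # a 16x reduction for the FFN parameters
```

The d^2/2 figure matches the "small number of parameters for all lightweight FFNs in the decoder" cited in Section 4.2.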

Cost-effective Parameterization
Due to the tight parameterization budget (i.e., 10 million), EDGEFORMER cannot be fully parameterized in the standard way; instead, it has to adopt shared parameterization. As a strong baseline for shared parameterization, UNIVERSAL TRANSFORMER lets all its M encoder layers share 1 group of encoder layer parameters and all its N decoder layers share 1 group of decoder layer parameters: θ_enc^(1) = ... = θ_enc^(M) and θ_dec^(1) = ... = θ_dec^(N). (As observed by Kasai et al. (2020), reducing d_decffn does not hurt the result much, as shown in Table 8 in Appendix A.)
Although UNIVERSAL TRANSFORMER is a popular solution for shared parameterization, it is still not cost-effective, for two reasons. First, UNIVERSAL TRANSFORMER uses (over) half of its total parameters to parameterize the decoder, which is uneconomical. As shown in Figure 3a, given a fixed architecture (a 6+6 Transformer, d = 512), densely parameterizing the decoder yields much less performance gain than parameterizing the encoder. This suggests using as many parameters as possible to parameterize the encoder.
Second, UNIVERSAL TRANSFORMER does not consider the load balance of model parameters, a problem rarely discussed until the recent emergence of Mixture-of-Experts models (Fedus et al., 2021). For Transformers with a deep encoder and shallow decoder, UNIVERSAL TRANSFORMER's parameterization overburdens parameters in the encoder but underutilizes parameters in the decoder. For example, in a 12+2 UNIVERSAL TRANSFORMER, a parameter in the encoder is used 12 times in a forward pass, while a parameter in the decoder is used only twice. As shown in Figure 3b, moderately reusing parameters (e.g., when x ≤ 4) helps utilize them better, resulting in significant performance gains without increasing parameters. However, as the shared parameters become overused (when x > 6), the performance gain becomes marginal, which is intuitive because a parameter's capacity is limited. This suggests balancing the load of parameters to avoid them being either overused or underused.
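The load imbalance above can be made concrete with a small counting sketch (illustrative code, not the paper's):

```python
# In a 12+2 UNIVERSAL TRANSFORMER, all encoder layers share one parameter group
# and all decoder layers share another, so per-parameter usage is 12 vs. 2.
def parameter_loads(layer_to_group):
    """Map each parameter group to how many layers (uses per forward pass) share it."""
    loads = {}
    for group in layer_to_group.values():
        loads[group] = loads.get(group, 0) + 1
    return loads

ut_12_2 = {f"enc{i}": "enc_group" for i in range(12)}
ut_12_2.update({f"dec{j}": "dec_group" for j in range(2)})

assert parameter_loads(ut_12_2) == {"enc_group": 12, "dec_group": 2}
```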
Based on the above insights, we parameterize EDGEFORMER following two novel principles for cost-effective parameterization:

Encoder-favored Parameterization For EDGEFORMER, we parameterize the encoder with as many parameters as possible: except for a small number of parameters (d^2/2) for all the lightweight FFNs in the decoder, we use almost all parameters in our budget to parameterize the encoder. We let the attention modules in the decoder reuse (i.e., share) parameters with the attention modules in the encoder, since attention modules in the encoder and decoder work by the same mechanism and can be effectively shared (Dong et al., 2019). Thanks to the interleaved decoder architecture that makes the structure of encoder and decoder layers consistent, we let the self-attention module of a decoder layer share parameters with its corresponding odd encoder layer, and its cross-attention module share with the corresponding even encoder layer, inspired by Ma et al. (2021).

Load-balanced Parameterization We parameterize EDGEFORMER with a balanced load for each model parameter, so that each parameter is exploited as equally as possible in a forward pass. Given the parameterization budget and the load-balance principle, we create 2 groups of encoder FFN parameters equally shared by all encoder layers, 1 group of decoder FFN parameters shared by the lightweight FFNs in the decoder, and 4 groups of attention parameters shared throughout the encoder and decoder. Except for the parameters in the encoder FFNs, which are used 6 times, all other parameters are used 4 times in a forward pass, resulting in a load-balanced parameterization.
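The resulting sharing scheme for a 12+2 model can be sketched as follows. The exact module-to-group assignment below (e.g., the `i % 4` cycling of attention groups) is our illustrative guess, chosen only to be consistent with the group counts and loads stated above, not necessarily the paper's exact mapping:

```python
# Sketch of EDGEFORMER's load-balanced sharing for a 12+2 model: 4 attention
# groups shared across encoder and decoder, 2 encoder-FFN groups, and 1 group
# for the interleaved lightweight decoder FFNs.
from collections import Counter

modules = {}
for i in range(12):                               # encoder layers 0..11
    modules[f"enc{i}.attn"] = f"attn_g{i % 4}"    # 4 attention parameter groups
    modules[f"enc{i}.ffn"] = f"enc_ffn_g{i % 2}"  # 2 encoder FFN groups
for j in range(2):                                # decoder layers 0..1
    # self-attn ties to an odd encoder layer's group (1-indexed), cross-attn to an even one
    modules[f"dec{j}.self_attn"] = modules[f"enc{2 * j}.attn"]
    modules[f"dec{j}.cross_attn"] = modules[f"enc{2 * j + 1}.attn"]
    modules[f"dec{j}.ffn1"] = "dec_ffn_g"         # two interleaved light FFNs per layer
    modules[f"dec{j}.ffn2"] = "dec_ffn_g"

loads = Counter(modules.values())
# Attention and decoder-FFN groups are each used 4 times; encoder-FFN groups 6 times.
assert all(loads[g] == 4 for g in ("attn_g0", "attn_g1", "attn_g2", "attn_g3"))
assert loads["enc_ffn_g0"] == loads["enc_ffn_g1"] == 6
assert loads["dec_ffn_g"] == 4
```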

Layer Adaptation
Shared parameterization causes layers with tied weights to become less specialized, as discussed in Section 1. To allow tied layers to better adapt to their corresponding roles, we propose layer adaptation to further enhance EDGEFORMER. Inspired by parameter-efficient task transfer methods, we investigate three efficient layer adaptation approaches:

Bias-based Layer Adaptation (Bias-LA) Inspired by BitFit (Ben Zaken et al., 2021), which fine-tunes only bias terms, we untie all bias terms of each layer and use them to specialize the layers with tied weights, as shown in Figure 2(b). Like BitFit, bias-based layer adaptation introduces very few additional parameters and no inference overhead.
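A minimal sketch of the Bias-LA idea (illustrative toy code: one tied weight matrix, one free bias per layer; names and dimensions are our own):

```python
# Bias-LA sketch: layers share one weight matrix but each keeps its own bias
# vector, so tied layers can still behave differently.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_shared = rng.standard_normal((d, d))      # one tied weight for all layers
biases = [np.zeros(d) for _ in range(3)]    # one free (untied) bias per layer

def layer_forward(x, layer_idx):
    return x @ W_shared + biases[layer_idx]

x = rng.standard_normal(d)
biases[1] += 1.0                            # only layer 1's bias is adapted
out0, out1 = layer_forward(x, 0), layer_forward(x, 1)
assert not np.allclose(out0, out1)          # same weight, different behavior
assert np.allclose(out1 - out0, 1.0)        # the difference is exactly the bias
```

Only d extra parameters per layer are introduced, which is why Bias-LA is the cheapest of the three approaches.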
Adapter-based Layer Adaptation (Adapter-LA) Adapter-based approaches (Houlsby et al., 2019) introduce adapter modules for NLP task transfer without full fine-tuning. We borrow this idea for layer adaptation by introducing an independent adapter module for each layer. Specifically, we adopt the recently proposed LoRA (Hu et al., 2021) as our layer adapter, as Figure 2(c) shows. In our experiments, we apply the layer adapter to W_Q and W_V, as the original LoRA paper suggests.
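A hedged sketch of a LoRA-style layer adapter (following the general recipe of Hu et al., 2021; the class and initialization details below are our simplification, not the paper's implementation):

```python
# LoRA-style layer adapter: the tied weight W is shared across layers, while
# each layer adds its own low-rank update B @ A of rank r << d.
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
W_tied = rng.standard_normal((d, d))             # one weight shared by all tied layers

class LoRALayerAdapter:
    def __init__(self, d, r):
        self.A = rng.standard_normal((r, d)) * 0.01  # per-layer, trainable
        self.B = np.zeros((d, r))                    # zero init: no change at start
    def __call__(self, x):
        # x @ (W + B A), computed without materializing the full d x d delta
        return x @ W_tied + (x @ self.B) @ self.A

adapter = LoRALayerAdapter(d, r)
x = rng.standard_normal(d)
assert np.allclose(adapter(x), x @ W_tied)       # starts equivalent to the tied layer
adapter.B = rng.standard_normal((d, r))          # stand-in for training updates
assert not np.allclose(adapter(x), x @ W_tied)   # now layer-specific behavior
assert np.linalg.matrix_rank(adapter.B @ adapter.A) <= r
```

Each adapted projection adds only 2dr parameters per layer, which is how the rank r trades adaptation capacity against the parameterization budget (see the r ≥ 64 discussion later).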
Prefix-based Layer Adaptation (Prefix-LA) Inspired by recent work (Li & Liang, 2021; Lester et al., 2021) using a prefix/prompt for task transfer, we introduce L tokens with learnable parameters as a layer-specific prefix to adapt layers with tied weights, as shown in Figure 2(d). The prefixes are only used for the keys and values in attention modules, which introduces little inference overhead as long as L is set moderately.
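The mechanics of Prefix-LA can be sketched as follows (an illustrative single-head attention with our own toy dimensions; only keys and values receive the prefix, matching the description above):

```python
# Prefix-LA sketch: each layer prepends L learnable key/value vectors to the
# attention keys and values; queries (and hence the output length) are untouched.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_with_prefix(Q, K, V, prefix_k, prefix_v):
    K2 = np.concatenate([prefix_k, K], axis=0)   # (L+T, d)
    V2 = np.concatenate([prefix_v, V], axis=0)   # (L+T, d)
    scores = Q @ K2.T / np.sqrt(Q.shape[-1])     # (T, L+T)
    return softmax(scores) @ V2                  # (T, d)

rng = np.random.default_rng(0)
d, T, L = 8, 5, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
pk, pv = rng.standard_normal((L, d)), rng.standard_normal((L, d))

out = attn_with_prefix(Q, K, V, pk, pv)
assert out.shape == (T, d)   # output length unchanged by the prefix
```

The extra cost per attention module is only the L additional key/value rows, which is why the overhead stays small for moderate L.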
Following the encoder-favored principle in Section 4.2, we only apply LA to encoder layers.

Experimental Setting
We mainly evaluate our approach on Machine Translation (MT). We select the most popular MT benchmark, the WMT14 English-German (En-De) translation task, which is also a touchstone for seq2seq evaluation, as our main test bed. To compare with previous work, we also evaluate on WMT14 English-French (En-Fr) translation. We follow the standard way to train and evaluate on WMT14 En-De and En-Fr. Following Ott et al. (2018), we use a joint source-target dictionary of 32K Byte Pair Encoding (BPE) tokens for En-De, and 40K for En-Fr. We mainly use sacreBLEU (Post, 2018) for evaluation.
We select UNIVERSAL TRANSFORMER, the most popular and a strong parameter-efficient Transformer baseline, for fair comparison. By default, we apply Seq-KD (Kim & Rush, 2016) to train the models and use the fully parameterized 6+6 Transformer-big (d = 1,024) model (Vaswani et al., 2017; Ott et al., 2018) as the teacher. For each experiment, we train 5 models with different initializations and report their average evaluation results in Tables 2, 3 and 6 with significance tests. For inference, we use beam search (beam=5).

Offline Evaluation
We evaluate EDGEFORMER and compare it with UNIVERSAL TRANSFORMER (UT) on WMT14 En-De. According to Table 2, EDGEFORMER without layer adaptation (LA) already largely outperforms the UTs. Among the LA approaches, both Adapter-LA and Prefix-LA clearly benefit the results at marginal computational or parameterization cost, while Bias-LA does not show a significant performance gain even though it is the cheapest.
As discussed in Section 4.2, the advantage of EDGEFORMER over UT comes from its cost-effective parameterization. The encoder-favored principle is again supported by comparing the 6+6 Transformers' results in Table 2, consistent with the observation on the dev set in Figure 3a. To further understand the effectiveness of the load-balanced parameterization principle, we conduct an ablation study by adjusting the encoder FFNs in EDGEFORMER. Table 3 shows the results of EDGEFORMER with various FFN parameterizations. As we reduce d_ffn (e.g., to 1536 or 1024), we can increase the number of encoder FFN parameter groups and reduce their load given a fixed parameterization budget. However, this strategy leads to a clear degradation of sacreBLEU. One reason is that FFN parameters with a reduced load (3-4 uses) are not as fully utilized as in the baseline (6 uses), besides other factors such as differences in network shape (e.g., d_ffn). To minimize the effects of other factors, we compare the first group, with a balanced parameter load (i.e., 6-6), and the last group, with an imbalanced parameter load (1-11 or 11-1), showing that load-balanced parameterization is consistently better than its imbalanced counterparts.
After discussing parameterization, we analyze the effects of layer adaptation on the results, mainly focusing on Adapter-LA and Prefix-LA, which both show performance gains. Figure 4 shows the effects of the rank r in Adapter-LA and the prefix length L in Prefix-LA. As r increases, model performance gradually improves. However, when r becomes large (e.g., r ≥ 64), it exceeds our parameterization budget and the gain becomes meaningless. The prefix length L in Prefix-LA differs from r in that increasing it does not keep improving the results: the gain can hardly be observed beyond some length (e.g., L = 8), similar to the observation in prefix-tuning (Li & Liang, 2021). Therefore, we use r = 32 and L = 8 as the default settings to report the results of Adapter-LA and Prefix-LA.
Finally, we compare EDGEFORMER with recent work on parameter-efficient Transformer modeling. To keep the training and evaluation protocols consistent with previous work, we here forgo Seq-KD when training the models, and report BLEU (Papineni et al., 2002) for comparison. Specifically, we compare with DeLighT (Mehta et al., 2020), Shapeshifter (Panahi et al., 2021) and Lite Transformer (Wu et al., 2020), and show the results in Table 4. Note, however, that the results are not strictly comparable, because the previous studies have their own focuses and settings, which differ from ours. For example, DeLighT and Lite Transformer focus much more on FLOPS than on model size, and thus do not achieve a desirable tradeoff between model quality and size, while Shapeshifter's goal is minimizing model size despite an additional 10%~20% inference overhead. Regardless of these factors, EDGEFORMER compares favorably with these models, as Table 4 shows.

Runtime Evaluation
We conduct experiments on the WMT14 En-De translation and CoNLL-14 Grammatical Error Correction (GEC) benchmarks for runtime latency and memory evaluation, using onnxruntime, which supports efficient seq2seq decoding. We apply int8 quantization to EDGEFORMER and test latency on 2 devices: a 2-core Intel® Xeon® E-2288G CPU (in a PC), and a 2-core Qualcomm SM8150 Snapdragon 855 CPU (in a Pixel 4), both current mid-to-high-end CPUs launched 2-3 years ago.
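As a rough illustration of what int8 quantization does to weight storage, a generic symmetric per-tensor scheme looks like the following (a sketch of the general technique, not necessarily the exact recipe used for EDGEFORMER):

```python
# Symmetric per-tensor int8 quantization: map float weights into [-127, 127]
# with a single scale, cutting storage 4x versus float32.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_int8(w)

assert q.dtype == np.int8 and q.nbytes == w.nbytes // 4   # 4x smaller weights
err = np.abs(dequantize(q, s) - w).max()
assert err <= s / 2 + 1e-6                                # bounded rounding error
```

This 4x weight shrinkage (plus the smaller vocabulary mentioned below) is what brings the runtime memory footprint under the 50MB budget.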
Table 5 shows the runtime evaluation results. With int8 quantization and a smaller vocabulary, EDGEFORMER not only meets the on-device seq2seq requirements but also maintains its good performance, demonstrating its practical value.
We evaluate EDGELM on the benchmarks of three popular seq2seq tasks: CoNLL-14 for Grammatical Error Correction (GEC), XSum (Narayan et al., 2018) for abstractive summarization, and SQuAD-NQG (Du et al., 2017) for question generation. According to Table 6, EDGELM achieves significantly better performance than the pretrained UT models as well as the Transformer-base model trained from scratch. We believe that EDGELM, as the first publicly released on-device seq2seq pretrained model, can largely facilitate on-device seq2seq generation in practice.

Related Work
On-device seq2seq generation in NLP is a research area that has been less explored than on-device CV and NLU (Tambe et al., 2021). Besides general techniques like pruning, compression, quantization and knowledge distillation (Fan et al., 2019; Xu et al., 2020; Li et al., 2022), which are orthogonal to our effort, parameter-efficient Transformer-based seq2seq modeling is the research branch most related to ours. In this branch, UNIVERSAL TRANSFORMER (Dehghani et al., 2018) uses cross-layer sharing, the most popular solution for parameter efficiency. Takase & Kiyono (2021) extend UNIVERSAL TRANSFORMER by studying different ways of layer sharing, and Reid et al. (2021) propose to free the first and last encoder layers and widen the intermediate layers for better performance. However, both approaches consider parameter efficiency only, without accounting for worsened latency.
In addition to work improving parameter efficiency by weight sharing, there is research on lightweight model architectures for seq2seq learning, where early work (Gehring et al., 2017; Wu et al., 2019) mainly focused on CNNs, while recent efforts have tended to switch to attention-based models such as Mehta et al. (2020). Also, low-rank factorization has been studied intensively to make models tiny (Zhang et al., 2021; Panahi et al., 2021), and hardware-aware network architecture search with elastic modeling (Wang et al., 2020) has been proposed recently to facilitate deployment of seq2seq models on various devices. Among previous studies, the work of Wu et al. (2020), which studies seq2seq generation in an on-device setting, is the most related to ours. However, it sets the computational constraint for on-device seq2seq to be the same as for CV tasks, which is too strict and unnecessary, as discussed in Section 3. As a result, their models focus much more on FLOPS optimization than on memory, leading to an undesirable tradeoff between quality and model size for the practical on-device seq2seq setting, which should care about memory much more than latency. In contrast, our work carefully evaluates the bottleneck constraints and proposes an appropriate model with parameterization and layer adaptation innovations, largely improving the results for practical on-device seq2seq generation.

Conclusion and Future Work
We formally study on-device seq2seq generation, defining its practical resource-constraint setting and proposing an appropriate modeling technology, EDGEFORMER. The cost-effective parameterization and layer adaptation innovations in EDGEFORMER both prove effective at improving the results with negligible computation and memory cost, achieving state-of-the-art results in the on-device seq2seq generation setting. Our released pretrained EDGEFORMER, EDGELM, can be easily fine-tuned for downstream seq2seq tasks, largely facilitating on-device seq2seq generation in practice.
For future work, we plan to further study load-balanced parameterization for parameter-efficient models, which is an interesting, new, but seemingly profound machine learning research problem: instead of naively assuming that all parameters are equal, as in this preliminary study, we suspect that parameters in different modules (e.g., in self-attention vs. the FFN, or in different layers) should carry different amounts of load. We look forward to in-depth research on this problem, which may help deepen our understanding of neural networks.

Limitations
EDGEFORMER is a preliminary model for the on-device seq2seq generation setting and still has much room for improvement. For example, as mentioned in Section 8, the current load-balance mechanism naively assumes that a parameter's load equals the number of times it is used in a forward pass, which may not always hold, because parameters in different modules differ: some parameters may be effectively used more than others, which requires a deeper understanding of neural networks and the Transformer.

Figure 1: (a) Vanilla Transformer decoder layer, in which d_ffn > d; (b) Interleaved Transformer decoder layer with shared lightweight FFNs, in which d_ffn < d.

Figure 2: (a) Encoder layers with shared weights (the same color) without layer adaptation: the tied weights undermine the specialization of encoder layers to process their specific inputs; (b) Bias-based Layer Adaptation (Bias-LA) employs free bias terms to adapt layers with tied weights to fit their specific roles; (c) Adapter-LA uses a layer-specific LoRA adaptation block with rank r < d for layer adaptation; (d) Prefix-LA uses L layer-specific tokens (i.e., learnable parameters) as the prefix (dotted square) to adapt the m-th layer.

Figure 3: (a) Performance of the 6+6 Transformer (d = 512) on the newstest2013 English-German (En-De) dev set: densely parameterizing the decoder is uneconomical and much less beneficial than parameterizing the encoder; (b) Comparison of x+2 Transformers with fully/shared-parameterized x encoder layers on newstest2013 En-De: when x > 6, the performance of the Transformer with shared parameterization improves only marginally even as x continues to increase.

Figure 4: The effects of (a) the rank r in Adapter-LA, and (b) the prefix length L in Prefix-LA on performance in WMT14 En-De. Note that r = 64 exceeds our parameterization budget despite better performance.

Table 1: Top: #params and FLOPS for Transformer layers. For the encoder and vanilla decoder layers, d_ffn = 4d; for the interleaved decoder layer, d_ffn = d/4. Bottom: #params and FLOPS of whole models, where #params excludes the embedding lookup, and FLOPS is measured on a sample with src/tgt length of 30 and a 32K vocabulary.

Table 2: WMT14 En-De results. To fairly compare with UNIVERSAL TRANSFORMER (UT), which is originally smaller than EDGEFORMER, we also test UT with larger FFNs to make its model size comparable to EDGEFORMER's. †(i) denotes p < 0.05 in a significance test compared with the model marked with i.

Table 3: Performance of EDGEFORMER with various encoder FFN parameterizations on WMT14 En-De. Load 6-6 means the 2 groups of FFN parameters are used 6 times each, while Load 1-11 means one group is used once and the other 11 times.

Table 4: Comparison with previous parameter-efficient Transformers that have fewer parameters than the baseline Transformer (around 45M parameters). "-" means the metric is unavailable or not comparable in the original paper. Underlines denote metrics that cannot meet the on-device requirement. Note that none of the models in this table apply Seq-KD.

Table 5: Runtime results for int8-quantized EDGEFORMER, where Latency1 and Latency2 denote the average latency per sentence measured on the Intel® Xeon® E-2288G CPU and the Qualcomm SM8150 Snapdragon 855 CPU, respectively. We run through the test set with batch size 1 and use greedy decoding instead of beam search.