Lessons on Parameter Sharing across Layers in Transformers

We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), in order to improve computational efficiency. We propose three strategies, SEQUENCE, CYCLE, and CYCLE (REV), to assign parameters to each layer. Experimental results show that the proposed strategies are efficient in terms of parameter size and computational time. Moreover, we show that the proposed strategies are also effective in configurations with a large amount of training data, such as the recent WMT competition setting.


Introduction
Transformer-based methods have achieved notable performance on various NLP tasks (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020). In particular, Brown et al. (2020) indicated that the larger the parameter size, the better the performance the model achieves. However, a model composed of many parameters occupies a large part of the GPU memory capacity. Thus, it is important to explore a parameter-efficient approach that achieves better performance than a basic model with the same parameter size.
Parameter sharing is a widely used parameter-efficient technique (Dehghani et al., 2019; Dabre and Fujita, 2019). Dehghani et al. (2019) proposed the Universal Transformer, which consists of parameters for only one layer of a Transformer-based encoder-decoder and uses these parameters N times for an N-layered encoder-decoder. Dabre and Fujita (2019) and other studies also used such parameter sharing across layers in their Transformers. Dehghani et al. (2019) reported that the Universal Transformer achieved better performance than the vanilla Transformer in machine translation when the parameter sizes of both models are (almost) equal. However, when we prepare the same number of parameters for the Universal Transformer and the vanilla Transformer, the Universal Transformer requires much more computational time because the weight matrices of each layer are much larger. For example, the Universal Transformer requires twice as much training time as the vanilla Transformer on WMT English-to-German, a widely used machine translation dataset (see Table 1).
In this paper, we propose a new parameter sharing method that is faster than using the same parameters for all layers as in Universal Transformers. Universal Transformers raise their expressive power by increasing the size of the weight matrices of each layer. On the other hand, stacking more layers is another promising approach to raise the expressive power of neural methods. Thus, the most straightforward way to make Universal Transformers faster is to stack layers with smaller weight matrices for each layer. However, using the same parameters for all layers limits the improvement obtained by stacking layers (Dabre and Fujita, 2019). Therefore, instead of preparing parameters for only one layer, we prepare parameters for M layers to construct an N-layered encoder-decoder, where 1 ≤ M ≤ N. In other words, the proposed method relaxes the parameter sharing strategy of previous studies (Dehghani et al., 2019; Dabre and Fujita, 2019). Because this relaxation addresses the above limitation on the improvement from stacking layers, the proposed method can be fast by stacking layers with small weight matrices for each layer. For the parameter assignment to each layer, we provide several strategies (Figure 1) and compare them empirically.
We mainly conduct experiments on machine translation datasets. Experimental results show that the proposed method achieves scores comparable to those of the method assigning the parameters of one layer to all layers, with less computational time. In addition, we show that the proposed method slightly outperforms the previous parameter sharing method (Dehghani et al., 2019) when we spend almost the same training time. Moreover, we conduct experiments on automatic speech recognition (Section 5) and language modeling (Appendix A) tasks. Experimental results on these tasks also indicate that the proposed method is efficient in terms of parameter size in other situations.

Proposed Method
As described in Section 1, we use parameters for M layers in the construction of an N-layered Transformer-based encoder-decoder. We provide three strategies for the parameter assignment: SEQUENCE, CYCLE, and CYCLE (REV), and describe them in this section. Figure 1 shows examples of the three parameter assignment strategies for the encoder side when we set M = 3 and N = 6. Let enc_i be the i-th layer of an encoder. Figure 2 describes the algorithm that assigns each parameter to each layer of the encoder. For the decoder side, we assign each parameter in the same manner.
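To make the sharing mechanism concrete, the following PyTorch-style sketch (our own illustration with hypothetical names such as `SharedLayerEncoder` and `layer_factory`, not the actual implementation) builds an N-layered encoder from only M unique layer modules; because reused positions refer to the same module objects, their parameters are shared automatically:

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Encoder with N layers built from only M unique parameter sets.

    `assignment` is a list of length N whose entries index the M unique layers,
    e.g. [0, 1, 2, 0, 1, 2] reuses the parameters of the first three layers
    for layers 4-6.
    """

    def __init__(self, layer_factory, assignment):
        super().__init__()
        num_unique = max(assignment) + 1
        # Only these M modules hold parameters; reused positions point to the same objects.
        self.unique_layers = nn.ModuleList([layer_factory() for _ in range(num_unique)])
        self.assignment = assignment

    def forward(self, x):
        # Apply each of the N layer positions using its assigned parameter set.
        for idx in self.assignment:
            x = self.unique_layers[idx](x)
        return x
```

The decoder side can be built analogously with its own assignment list.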

SEQUENCE
The simplest strategy is to assign the same parameters to N/M consecutive layers. We name this strategy SEQUENCE. For example, when we set M = 3 and N = 6, every two consecutive layers share the same parameters, as illustrated in Figure 1.

CYCLE
In CYCLE, we stack M layers whose parameters are independent of each other. Then, we repeatedly stack the M layers in the same order as the first M layers until the total number of layers reaches N. When we set M = 3 and N = 6, we stack the 3 layers twice, as illustrated in Figure 1.

CYCLE (REV)
Liu et al. (2020) reported that higher decoder layers obtain larger gradient norms under the post layer normalization setting, which was originally used in Vaswani et al. (2017) and is widely used in machine translation. Their report implies that higher layers require more degrees of freedom than lower layers for their expressiveness. In other words, lower layers probably have redundant parameters compared to higher layers. Thus, we propose the CYCLE (REV) strategy, which reuses the parameters of lower layers in higher layers.
In this strategy, we repeat stacking M layers in the same manner as CYCLE for the first M * (N/M - 1) layers. For the remaining layers, we stack the M layers in the reverse order. When we set M = 3 and N = 6, we stack 3 layers and then stack the same 3 layers in the reverse order, as illustrated in Figure 1. Thus, the lowest layer and the highest layer share their parameters.
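The three strategies differ only in how the N layer positions are mapped to the M parameter sets. The following sketch (our own illustration; Figure 2 in the paper gives the actual pseudocode) computes the assignment lists used in the sketch above:

```python
def assign_indices(strategy, m, n):
    """Return a length-n list mapping each layer position to one of m parameter sets."""
    ceil_div = -(-n // m)  # ceil(n / m)
    if strategy == "sequence":
        # SEQUENCE: the same parameters are used for ceil(n / m) consecutive layers.
        return [min(i // ceil_div, m - 1) for i in range(n)]
    if strategy == "cycle":
        # CYCLE: repeat the M independent layers in the same order until n layers are stacked.
        return [i % m for i in range(n)]
    if strategy == "cycle_rev":
        # CYCLE (REV): same as CYCLE for the first m * (ceil(n / m) - 1) layers,
        # then stack the M layers in reverse order for the remaining layers.
        body = m * (ceil_div - 1)
        return [i % m for i in range(body)] + [m - 1 - i for i in range(n - body)]
    raise ValueError(f"unknown strategy: {strategy}")

# For M = 3 and N = 6, matching Figure 1:
# assign_indices("sequence", 3, 6)  -> [0, 0, 1, 1, 2, 2]
# assign_indices("cycle", 3, 6)     -> [0, 1, 2, 0, 1, 2]
# assign_indices("cycle_rev", 3, 6) -> [0, 1, 2, 2, 1, 0]
```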

Stable Training of Deep Transformers
In the proposed method, we stack many layers to raise the expressive power of Transformers. Recent studies demonstrated that the training of a deep Transformer is often unstable (Nguyen and Salazar, 2019; Xiong et al., 2020). This section briefly describes layer normalizations (LNs) in Transformers because LNs are an important technique for the stable training of deep Transformers.
Most recent studies used the pre layer normalization setting (Pre-LN) when they stacked many layers (Wang et al., 2019; Brown et al., 2020) because Pre-LN makes the training process more stable than the post layer normalization setting (Post-LN) (Nguyen and Salazar, 2019; Xiong et al., 2020). However, Transformers with Post-LN achieve better performance when training succeeds (Nguyen and Salazar, 2019). To stabilize the training of Transformers with Post-LN, Liu et al. (2020) proposed Admin, which smooths the impact of each parameter in the early stage of training. In this study, we also use Admin to ensure stable training of the proposed strategies in machine translation, and use Pre-LN in the other experiments in accordance with the baselines.
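For reference, the two settings differ only in where the layer normalization is applied relative to the residual connection. A schematic sketch of a single sublayer (dropout and other details omitted; `sublayer` stands for self-attention or the feed-forward network):

```python
import torch.nn as nn

def post_ln(x, sublayer, norm: nn.LayerNorm):
    # Post-LN (Vaswani et al., 2017): normalize after the residual addition.
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm: nn.LayerNorm):
    # Pre-LN: normalize the sublayer input and keep the residual path untouched.
    return x + sublayer(norm(x))
```

Admin keeps the Post-LN form and smooths the impact of each parameter in the early stage of training.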

Experiments on Machine Translation
We investigate the efficiency of the proposed parameter sharing strategies. In this section, we conduct experiments on machine translation datasets. First, we focus on the English-to-German translation task because it is widely used in previous studies (Vaswani et al., 2017; Dehghani et al., 2019; Kiyono et al., 2020). We conduct comparisons from the following aspects: (i) comparison with previous methods in terms of efficiency, (ii) comparison among the proposed parameter sharing strategies, and (iii) comparison with models without restrictions on parameters to investigate the gap from the upper-bound performance. In addition to the widely used training data, we conduct experiments with a much larger training dataset for the English-to-German translation task. Then, we investigate whether our findings carry over to the other language direction (i.e., German-to-English) and to another language pair (i.e., English-to-French and French-to-English). We describe details in the following subsections.

Datasets
We used the WMT 2016 training dataset, which is widely used in previous studies (Vaswani et al., 2017; Takase and Kiyono, 2021). This dataset contains 4.5M English-German sentence pairs. Following previous studies, we constructed a vocabulary set with BPE (Sennrich et al., 2016b) in the same manner: we set the number of BPE merge operations to 32K and shared the vocabulary between the source and target languages. We measured case-sensitive detokenized BLEU with SacreBLEU (Post, 2018).

Methods
For the proposed parameter assignment strategies, we fixed M = 6 and set N = 12 or 18, based on the Transformer (base) setting in Vaswani et al. (2017). We compare the proposed strategies with the following baselines. Vanilla: the original Transformer (base) setting in Vaswani et al. (2017). Admin: we applied Admin (Liu et al., 2020) to the Transformer (base) setting. Universal: the parameter sharing strategy of previous studies such as Universal Transformers (Dehghani et al., 2019), i.e., M = 1. In this setting, we increased the dimensions of each layer for a fair comparison in terms of the number of parameters; this configuration corresponds to the Universal Transformer base setting in Dehghani et al. (2019). In addition, we prepare two models with a large number of parameters for reference. Vanilla (big): the original Transformer (big) setting in Vaswani et al. (2017). Admin (deep): we stacked layers up to N = 18 in the Transformer (base) setting and applied Admin for stable training. Table 1 shows BLEU scores on newstest2010-2016 for each method. We trained three models with different random seeds and report the averaged scores. Table 1 also shows the total number of parameters and the computational speed; we regard the number of processed tokens per second during training as the computational speed, reported relative to Vanilla.

Results
(i) Comparison with Universal in terms of efficiency. In the comparison between Universal and Vanilla, Universal achieved better scores although their parameter sizes are almost the same. This result is consistent with the report in Dehghani et al. (2019). However, the training time of Universal is more than twice that of Vanilla. In addition, Universal (deep) did not improve the performance over Universal, although its negative log-likelihood on the validation set was slightly better than that of Universal. Thus, stacking many layers has a small effect on BLEU scores when the model shares the parameters of one layer across all layers. In contrast, the proposed strategies (SEQUENCE, CYCLE, and CYCLE (REV)) were faster and achieved slightly better scores than Universal when we set M = 6 and N = 12. Since Admin did not have a positive influence on BLEU scores in Table 1, our strategies were responsible for the improvements. Thus, our proposed parameter sharing strategies are more efficient than Universal in terms of parameter size and computational time.
In fact, when we used the same procedure as Vaswani et al. (2017), the best model in Table 2 achieved 35.14 in the averaged BLEU score on newstest2014. Figure 3 illustrates the negative log-likelihood (NLL) values on newstest2013 against training time. In this figure, we used M = 6 and N = 12 for our proposed strategies. The figure shows that the proposed strategies achieved better NLL values than Universal throughout training, which also indicates that the proposed strategies are more time-efficient than Universal. Moreover, Figure 3 shows that the proposed strategies outperformed Vanilla in the early phase of training. Since Vanilla has already converged, it would be hard for Vanilla to outperform the proposed strategies on NLL even if we spent twice the training time on it. Therefore, this figure indicates that our proposed parameter sharing strategies are efficient.
(ii) Comparison among the proposed parameter sharing strategies. Table 1 shows that all proposed strategies achieved almost the same scores for M = 6 and N = 12. In contrast, the scores of SEQUENCE were lower than those of the other two strategies for M = 6 and N = 18. This result indicates that CYCLE and CYCLE (REV) are better strategies when we construct a deep Transformer with a small M, namely, when saving parameter size. For M = 6 and N = 18, CYCLE (REV) improved the averaged BLEU score by 0.41 over Universal even though their computational speeds were almost the same. Therefore, CYCLE and CYCLE (REV) are superior parameter-efficient strategies.
(iii) Comparison with models without restrictions on parameters. The lowest part of Table 1 shows results when we prepared more parameters. We trained these models to investigate the performance of models without any restriction on parameters, that is, to estimate upper bounds of the performance. However, their scores were lower than those of our proposed strategies. This result implies that our parameter sharing strategies are also better, in terms of BLEU, than models without any restriction on parameters. In fact, previous studies on language modeling demonstrated that parameter sharing achieved better performance (Melis et al., 2018; Merity et al., 2018; Takase et al., 2018). If we applied techniques to improve the performance of a deep model (Li et al., 2020) or of a model with many parameters (Takase and Kiyono, 2021), we might raise the BLEU scores in the lowest part of Table 1. However, since our purpose is not to achieve the top score, we trained each model with the conventional training procedure.

Datasets
In the high resource setting, we constructed a training dataset of 44.2M translation sentence pairs following the procedure of Kiyono et al. (2020), which achieved the best result in the WMT 2020 news translation task. In addition, we augmented the training data with the back-translation technique (Sennrich et al., 2016a) in the same manner as Kiyono et al. (2020), obtaining 284.3M pairs of synthetic training data. For evaluation, we add newstest2018 and 2019 to the test sets used in Section 4.1 because Kiyono et al. (2020) used these two test sets. As in Section 4.1, we measured case-sensitive detokenized BLEU with SacreBLEU.

Methods
We used the original Transformer (big) setting (Vaswani et al., 2017) as our baseline when using only the genuine training data. We call this setting Vanilla in this experiment. Moreover, we also prepared Universal, which shares its parameters across all layers, namely, M = 1 and N = 6. We increased the dimensions of each layer in Universal to make its parameter size almost the same as the others. For the proposed strategies, we used M = 6 and N = 12.
When using both the genuine and synthetic (back-translated) datasets, we applied CYCLE (REV) to the BASE setting in Kiyono et al. (2020) because CYCLE (REV) achieved the best BLEU scores on most test sets in Table 1. We also used M = 6 and N = 12 in this configuration. We compare with the reported scores of the best model described in Kiyono et al. (2020). Their model is composed of 9 layers (i.e., M = 9 and N = 9); thus, it contains considerably more parameters than ours. Table 2 shows BLEU scores of each method on each test set. Similar to the experiments in Section 4.1, we report the averaged scores of three models trained with different random seeds. Table 2 also shows the total number of parameters (the parameter sizes of Vanilla (big) in Table 1 and Vanilla in Table 2 differ because, following Kiyono et al. (2020), we did not share embeddings in the high resource setting). Table 2 shows that the proposed strategies achieved better BLEU scores than Vanilla and Universal when we prepared almost the same number of parameters. This result indicates that the proposed strategies are also parameter efficient in the high resource setting. In addition, since we used M = 6 and N = 12 for the proposed strategies, they are also more efficient than Universal in terms of computational time (see Table 1).

Results
When we used the additional synthetic data for training in the same manner as Kiyono et al. (2020), CYCLE (REV) achieved BLEU scores comparable to the best system of Kiyono et al. (2020), except for newstest2019 (where the synthetic data might harm the quality), even though the parameter size of CYCLE (REV) was smaller than theirs. This result indicates that CYCLE (REV) is also efficient in the construction of models for recent competitive tasks. In addition, this result implies that our proposed strategies can be used in configurations where we train many parameters with a tremendous amount of data, such as recent pre-trained language models, e.g., the GPT series (Brown et al., 2020). We investigate the effect of the proposed strategies on language models in Appendix A.

Datasets
We conduct experiments on the other translation direction and on another language pair. For the German-to-English training dataset, we used the same data as in Section 4.1.
For English-to-French and French-to-English, we used the WMT 2014 training dataset. We applied the same pre-processing as in previous work and used 35.8M English-French sentence pairs. In each configuration, we used newstest2013 and newstest2014 as the validation and test sets, respectively. We also measured case-sensitive detokenized BLEU with SacreBLEU in these experiments.

Methods
We compare our proposed strategies with the baselines used in Section 4.1. We used the Transformer (base) setting as Vanilla and prepared Universal with M = 1 and N = 6. For the proposed strategies, we used M = 6 and N = 18. In these configurations, the training time of the proposed strategies is almost the same as that of Universal, as described in Table 1.

Table 4: The parameter sizes, computational speeds based on the T-Md with 6 layers (Vanilla), and word error rates of each method. Scores in bold denote the best results for each set.

Results
The proposed parameter sharing strategies (SEQUENCE, CYCLE, and CYCLE (REV)) achieved better scores than Universal on all datasets. These results are consistent with the results in Table 1. They also indicate that the proposed strategies are more efficient than Universal, which shares the parameters of one layer across all layers, because they achieved better performance with almost the same parameter size and computational time.
In the comparison among the proposed strategies, CYCLE and CYCLE (REV) outperformed SEQUENCE on German-to-English, but it is difficult to conclude that CYCLE and CYCLE (REV) are superior to SEQUENCE on English-to-French and French-to-English. This result implies that the best strategy might depend on the language pair. However, it is reasonable to use CYCLE or CYCLE (REV) as a first choice because they were effective in constructing deep models on English-German and achieved scores comparable to SEQUENCE on English-French.

Datasets
To investigate the effect of our proposed strategies on another modality, we conduct comparisons on the automatic speech recognition (ASR) task. We used the de-facto standard English ASR benchmark dataset, LibriSpeech (Panayotov et al., 2015), which contains 1,000 hours of English speech from audiobooks. We used the standard splits of LibriSpeech: all available training data for training, and the two configurations (clean and other) of the development and test sets for evaluation. We applied the same pre-processing as in previous work. We measured the word error rate on each set.

Methods
We also compare our proposed strategies with the baselines in Section 4. As the base architecture, we used the Transformer-based speech-to-text model (T-Md) described in previous work. In contrast to the architectures in Section 4, the Transformer in T-Md uses pre layer normalization. We prepared 6 layers for the encoder and decoder in Vanilla and Universal. For the proposed strategies, we stacked more layers on the encoder side in the same manner as in previous work: we used N = 16 and M = 8 for the encoder side, and N = 8 and M = 4 for the decoder side. Table 4 shows word error rates of each method on each dataset. The table indicates that Universal outperformed Vanilla on all sets. The proposed parameter sharing strategies (SEQUENCE, CYCLE, and CYCLE (REV)) achieved better scores than Universal on all sets even though they are faster than Universal. These results are consistent with the machine translation results in Section 4. Thus, the proposed strategies are also more efficient in the ASR task.

Results
In contrast to the machine translation experiments, SEQUENCE outperformed CYCLE and CYCLE (REV) in the ASR task. We consider that this result is caused by the difference in the position of the layer normalization in the Transformer architecture. We used the Post-LN based method (Admin) in the machine translation experiments, but a Pre-LN based method in this ASR task. Previous work demonstrated that the position of the layer normalization has a strong effect on the properties of Transformers. The experimental results in language modeling (Appendix A) also imply that SEQUENCE is more appropriate when we use a Pre-LN based Transformer. The main focus of this study is an empirical comparison with the widely used parameter sharing strategy, Universal (Dehghani et al., 2019); we leave theoretical analyses of the relation between parameter sharing strategies and Transformer architectures for future work.
Related Work

In the past decade, various studies reported that a large amount of training data improves the performance on NLP tasks (Suzuki and Isozaki, 2008; Brants et al., 2007; Mikolov et al., 2013; Sennrich et al., 2016a). Moreover, recent studies indicated that the larger the parameter size, the better the performance the model achieves when we have a large amount of training data (Devlin et al., 2019; Brown et al., 2020). In fact, the best system in the WMT 2020 news translation task is composed of about 10 times as many parameters as the widely used Transformer (base) setting (Kiyono et al., 2020). However, due to the limitation of GPU memory capacity, we have to explore a parameter-efficient approach that achieves better performance while saving the parameter size.
Parameter sharing is a widely used parameter-efficient technique (Dehghani et al., 2019; Dabre and Fujita, 2019; Xia et al., 2019). Dehghani et al. (2019) proposed the Universal Transformer. Their method requires parameters for only one layer (i.e., M = 1) of a Transformer-based encoder-decoder and shares these parameters across N layers. Dabre and Fujita (2019) investigated the effectiveness of a Transformer sharing the parameters of one layer across all layers on various translation datasets. Other studies used this parameter sharing strategy to construct parameter-efficient models. As reported in these studies, the Transformer sharing the parameters of one layer across all layers achieves better performance than the original Transformer with the same parameter size. However, this strategy requires much more computational time, as described in Table 1, because the weight matrices of each layer are much larger. To solve this problem, we propose new parameter sharing strategies that prepare parameters for M layers and assign them to N layers, where 1 ≤ M ≤ N. Experimental results show that our proposed strategies are more efficient than the method sharing the parameters of one layer across all layers (Dehghani et al., 2019; Dabre and Fujita, 2019). Xia et al. (2019) proposed an encoder-decoder that shares parameters between the encoder and the decoder. Other work proposed sharing attention weights to make the computation of Transformers fast. These techniques are orthogonal to our proposed method; thus, we can combine them to further improve the efficiency in parameters and computational time.
In this study, we explore a parameter-efficient method. On the other hand, recent studies proposed methods to accelerate training. Li et al. (2020) proposed a training strategy for a deep Transformer: their strategy trains a shallow model and then stacks layers to construct a deeper model, repeating this procedure until the model reaches the desired depth. They indicated that their strategy was faster than training all parameters of a deep Transformer from the beginning. Takase and Kiyono (2021) compared regularization methods in terms of training time. Their experimental results show that simple regularizations such as word dropout are more efficient than complex ones such as adversarial perturbations. We can use these findings to accelerate the training of the proposed method.

Conclusion
We proposed three parameter sharing strategies for the internal layers of Transformers: SEQUENCE, CYCLE, and CYCLE (REV). In contrast to the previous strategy, which prepares parameters for only one layer and shares them across all layers as in Universal Transformers (Dehghani et al., 2019), the proposed strategies prepare parameters for M layers to construct N layers. In the proposed strategies, we stack layers whose weight matrices are smaller than those of Universal Transformers to raise the expressive power while limiting the increase in computational time.
Experimental results in the standard machine translation setting show that the proposed strategies achieved BLEU scores comparable to those of Universal with less computational time when we prepared almost the same number of parameters for each method. In addition, the proposed strategies slightly outperformed Universal when we spent almost the same training time. Thus, the proposed strategies are efficient in terms of parameter size and computational time. Through further experiments, we showed that the proposed strategies are more efficient than Universal in the high resource setting, in other language pairs, and in another modality (speech-to-text).

Table 5: The parameter sizes and perplexities of each method. The lower part indicates scores reported in Baevski and Auli (2019) and the score of SEQUENCE with more parameters. Scores in bold denote the best results for each set. † represents our re-run of Baevski and Auli (2019).

A.1 Dataset
The experiments in the previous sections focused on Transformer-based encoder-decoders. However, recent studies often employ only the decoder side as a pre-trained model. Thus, we conduct experiments on the language modeling task to investigate the efficiency of our proposed strategies when only the decoder side is used. We used WikiText-103 (Merity et al., 2017), which contains a large amount of training data. We measured the perplexity on the validation and test sets.

A.2 Methods
We used the Transformer with adaptive inputs (Baevski and Auli, 2019) as the base architecture. As in Baevski and Auli (2019), the Transformer for language modeling uses pre layer normalization. We set N = 6 for Vanilla and Universal. For the proposed strategies, we set N = 12 and M = 6. Table 5 shows the perplexities of each method. The table indicates that Vanilla achieved better performance than Universal. Thus, sharing the parameters of one layer across all layers might not be suitable for a large-scale language modeling task. In contrast, the proposed strategies outperformed Vanilla. This result indicates that our proposed strategies are also more efficient than Universal in the language modeling task. In the comparison among the proposed strategies, SEQUENCE achieved the best perplexity. As described in Section 5, SEQUENCE might be more appropriate for the Transformer with the Pre-LN configuration. To explore the reason, we believe a theoretical analysis of the Transformer during training is necessary; we leave this for future work.

A.3 Results
The lower part of Table 5 shows the score reported in Baevski and Auli (2019), our reproduced score, and SEQUENCE with more parameters. This part indicates that SEQUENCE achieved better perplexities than the others even though its parameter size is smaller. Therefore, SEQUENCE is also efficient when we prepare a large number of parameters for a language model.