Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models

Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages; however, training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show it under-performs a Masked Language Modeling (MLM) encoder, particularly on sequence labeling tasks. Variations of masking during seq2seq training, reducing the decoder size, and continuing with a small amount of MLM training do not close the gap. (2) Conversely, using an encoder to warm-start seq2seq training, we show that by unfreezing the encoder partway through training, we can match the task performance of a from-scratch seq2seq model. Overall, this two-stage approach is an efficient recipe to obtain both a multilingual encoder and a seq2seq model, matching the performance of training each model from scratch while reducing the total compute cost by 27%.


Introduction and Related Work
Transformer-based Pre-trained Language Models (PLMs) have become the main building blocks when creating models for most Natural Language Processing (NLP) tasks. PLMs come in three main architectures: decoder-only (e.g. GPT), sequence-to-sequence (seq2seq, e.g. BART, T5), and encoder-only (e.g. BERT). Multilingual models such as XLM-RoBERTa (encoder-only) and mBART/mT5 (seq2seq) are also common. Raffel et al. (2020b) showed that seq2seq models can perform many NLP tasks on par with similarly sized encoder-only models trained via Masked Language Modeling (MLM) by framing tasks such as sentence classification or sequence labeling as text generation. However, encoder models remain more efficient at inference for sequence labeling tasks like Named Entity Recognition (NER) and Part-of-Speech tagging (POS): an encoder can label all words in the sequence with a single forward pass, while a seq2seq model must generate each word's label autoregressively.
Motivated by the need for both an encoder model for efficient sequence labeling and a seq2seq model for generative tasks like semantic parsing and summarization, we explore recipes to pre-train both models. Compared to training each model from scratch, we propose two sequential training recipes which reduce the total compute cost (Section 2.1.6).
The first recipe is to extract the encoder of a seq2seq model, as proposed in Ni et al. (2022). Although it performs well on classification tasks, we show that the encoder from a seq2seq model under-performs a from-scratch encoder on sequence labeling tasks. Variations of masking during seq2seq training and reducing the decoder size do not provide a consistent benefit to the encoder. We also explore continuing to train the extracted encoder on MLM for a small number of updates; however, we show this cannot consistently close the performance gap across different datasets.
The second recipe is to warm-start seq2seq pre-training with an encoder pre-trained via MLM (Figure 1). Rothe et al. (2020) proposed a similar idea for fine-tuning. AlexaTM 20B and AlexaTM 5B applied this recipe for pre-training, by warm-starting with Alexa Teacher Model encoders (Soltan et al., 2022; Rosenbaum et al., 2022b; FitzGerald et al., 2022). We add the novelty of comparing to a seq2seq model pre-trained from scratch with the same data and codebase. First, we observe that if the encoder is frozen the whole time, the model under-performs a from-scratch seq2seq model on semantic parsing and summarization tasks. While cross-attention fusion across different layers of the encoder reduces the performance gap, we find that we can match the performance of a from-scratch model by using standard cross-attention and unfreezing the encoder partway through training.
Overall, the second recipe demonstrates a viable approach for efficient pre-training of both a multilingual encoder and a multilingual seq2seq model, matching the performance of training each model from scratch, while using 27% less total compute.
See Appendix A for additional related work.

Pre-Training Setup
We describe our pre-training objectives, models, datasets, two recipes for initializing one model type from the other, and compare compute costs.

Models
We pre-train ten models (Table 1): one from-scratch encoder, five from-scratch seq2seq models, one encoder from a seq2seq model with continued MLM training, and three two-stage seq2seq models warm-started with the from-scratch encoder. We report the pre-training Compute Cost for each, where "TU" (Training Units) is defined as 100k update steps for 12 model layers with hidden dimension 1024 and batch size 1M tokens (Appendix D, E).

Encoder Model From Scratch
We train an encoder model ("roberta-12e" in Table 1) following a recipe similar to XLM-RoBERTa (Conneau et al., 2020a), using the MLM objective (Figure 2a) of randomly masking 15% of subword tokens, as introduced in BERT (Devlin et al., 2019). We use a batch size of 1M tokens and train for 500k update steps. Notably, these settings match our seq2seq models. We use "PreLayerNorm" (Xiong et al., 2020), moving the layer norms inside the residual blocks to improve training stability.
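For concreteness, a minimal PyTorch sketch of this corruption step follows (assuming the standard 80/10/10 replacement split of Devlin et al. (2019); the function and argument names are illustrative, not our training code):

```python
import torch

def mlm_corrupt(input_ids, special_tokens_mask, mask_token_id, vocab_size, mask_prob=0.15):
    """Illustrative BERT-style MLM corruption: select ~15% of non-special
    subword tokens as targets; 80% become the mask token, 10% a random
    token, and 10% keep the original token."""
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, mask_prob)
    probs.masked_fill_(special_tokens_mask.bool(), 0.0)
    is_target = torch.bernoulli(probs).bool()
    labels[~is_target] = -100  # positions ignored by the cross-entropy loss

    corrupted = input_ids.clone()
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & is_target
    corrupted[to_mask] = mask_token_id
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & is_target & ~to_mask
    corrupted[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]
    # The remaining ~10% of targets keep their original token.
    return corrupted, labels
```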

Seq2Seq Objectives
Our seq2seq training follows the architecture and de-noising task of BART and mBART (Lewis et al., 2020; Liu et al., 2020); the only architecture change we make is to again use PreLayerNorm. The de-noising objective selects 15% of the tokens in the input (spans of length ∼ Poisson(3)) and either (i) simply drops them, or (ii) replaces each selected span with a single mask token. The model is trained to reconstruct the original input entirely. See Figures 2b and 2c, respectively. We add the suffix "-mask" to the names of models that use masking instead of dropping the tokens. Intuitively, adding an explicit mask token for de-noising makes the reconstruction task easier, as the decoder knows exactly where the missing tokens belong.
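A minimal sketch of the two de-noising variants (illustrative only; span sampling details such as the handling of zero-length spans differ slightly in BART's reference implementation):

```python
import numpy as np

def denoise_corrupt(tokens, use_mask, corrupt_ratio=0.15, span_lambda=3.0,
                    mask_token="<mask>", seed=0):
    """Corrupt ~15% of tokens with spans of length ~ Poisson(3); each span is
    either dropped entirely (use_mask=False) or replaced by a single mask
    token (use_mask=True). The training target is always the original input."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    budget = int(round(corrupt_ratio * len(tokens)))
    while budget > 0 and len(tokens) > 1:
        length = min(max(1, int(rng.poisson(span_lambda))), budget, len(tokens) - 1)
        start = int(rng.integers(0, len(tokens) - length + 1))
        tokens[start:start + length] = [mask_token] if use_mask else []
        budget -= length
    return tokens

source = "the quick brown fox jumps over the lazy dog".split()
print(denoise_corrupt(source, use_mask=True))   # spans collapsed to "<mask>"
print(denoise_corrupt(source, use_mask=False))  # spans dropped entirely
```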

Seq2Seq Models From Scratch
All of our seq2seq models use 12 encoder layers ("12e"). The first five models are trained from scratch, starting from randomly initialized weights. The models "bart-12e12d" and "bart-12e12d-mask" use 12-layer decoders (the same number as encoder layers) and the seq2seq de-noising objective without masking and with masking, respectively. The remaining three models use a smaller decoder of either 2 layers ("bart-12e2d" without masking, "bart-12e2d-mask" with masking) or 1 layer ("bart-12e1d-mask", with masking). We hypothesize that reducing the size of the decoder may strengthen the encoder when it is extracted and used on its own.

Recipe 1: Encoder of Seq2Seq + MLM
We extract the encoder from the seq2seq model "bart-12e12d" and continue training it via MLM for 100k updates ("bart-12e12d+mlm"). We initialize the MLM head from the input embedding matrix and then untie the two.
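A minimal sketch of this extraction step (assuming the seq2seq model exposes its encoder and token embeddings as `encoder` and `embed_tokens`; these attribute names are hypothetical, not our exact codebase):

```python
import copy
import torch.nn as nn

def extract_encoder_for_mlm(seq2seq_model, vocab_size, hidden_dim=1024):
    """Recipe 1 setup (illustrative): keep only the encoder of a trained
    seq2seq model and attach an MLM head initialized from the input
    embedding matrix but no longer tied to it."""
    encoder = copy.deepcopy(seq2seq_model.encoder)  # the decoder is discarded

    mlm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
    mlm_head.weight.data.copy_(encoder.embed_tokens.weight.data)
    # From here on, encoder + mlm_head are trained further on the MLM objective.
    return encoder, mlm_head
```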

Recipe 2: Two-Stage Seq2Seq Models
Finally, we train three seq2seq models following the two-stage setup (Figure 1). We initialize the encoder weights of the seq2seq model with the MLM encoder "roberta-12e" (Section 2.1.1) and train via seq2seq de-noising without masking. The first two models train for 500k updates with the encoder always frozen: "2stage-bart-12e12d" uses standard cross-attention, where the decoder attends to only the final encoder layer, and "2stage-bart-12e12d-attn-f" uses a novel application of attention fusion (Cao et al., 2022) during cross-attention, where the decoder attends to all encoder layers. The last model, "2stage-bart-12e12d-unfrz", uses standard cross-attention and unfreezes the encoder partway through training, applying 200k update steps with the encoder frozen, then 150k update steps with the encoder unfrozen.
In all cases, we initialize the decoder embeddings from the encoder embeddings, tie the two, and keep them frozen for as long as the encoder is frozen. The LM head is also initialized from the encoder embeddings, but it is untied from them and unfrozen from the beginning of training.
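A sketch of this initialization (again assuming `encoder`, `decoder`, `embed_tokens`, and `lm_head` attribute names; illustrative rather than our exact training code):

```python
import torch.nn as nn

def set_encoder_frozen(model, frozen):
    """Freeze/unfreeze the encoder (and, via tying, the shared embeddings).
    For "2stage-bart-12e12d-unfrz" this is flipped after 200k updates."""
    for p in model.encoder.parameters():
        p.requires_grad = not frozen

def init_two_stage(seq2seq_model, mlm_encoder, hidden_dim, vocab_size):
    """Recipe 2 initialization (illustrative): copy the MLM encoder weights,
    tie the decoder embeddings to the encoder embeddings, and give the LM
    head an untied, trainable copy of the embeddings."""
    seq2seq_model.encoder.load_state_dict(mlm_encoder.state_dict())
    # Decoder embeddings are shared with the encoder embeddings, so they
    # stay frozen exactly as long as the encoder does.
    seq2seq_model.decoder.embed_tokens = seq2seq_model.encoder.embed_tokens
    # LM head: initialized from the embeddings, untied, trainable from step 0.
    seq2seq_model.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
    seq2seq_model.lm_head.weight.data.copy_(seq2seq_model.encoder.embed_tokens.weight.data)
    set_encoder_frozen(seq2seq_model, frozen=True)
    return seq2seq_model
```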

Compute Cost Comparison
The baseline of training both models from scratch has a compute cost of 15.0 TU: 5.0 TU for "roberta-12e" plus 10.0 TU for "bart-12e12d". Our proposed recipes reduce the total compute cost either by 17% (to 12.5 TU) or by 27% (to 11.0 TU).

Pretraining Dataset
We pre-train on a combination of Wikipedia and mC4 (Xue et al., 2021) data in 12 languages: Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu. We pack sequences of tokens to produce sequences of approximately 512 subword units. We allow unrelated content to be packed together in the same sequence, separated by a special symbol "[DOC]". Maintaining a relatively constant number of subwords per sequence reduces padding and makes the compute efficient. We up-sample data for different languages following Conneau et al. (2020a).
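The packing step can be sketched as follows (illustrative; our tokenizer and exact chunking logic may differ):

```python
def pack_documents(tokenized_docs, max_len=512, doc_sep="[DOC]"):
    """Concatenate (possibly unrelated) tokenized documents, separated by a
    special "[DOC]" symbol, into chunks of ~max_len subwords so that very
    little compute is spent on padding."""
    packed, buffer = [], []
    for doc in tokenized_docs:
        buffer.extend(doc + [doc_sep])
        while len(buffer) >= max_len:
            packed.append(buffer[:max_len])
            buffer = buffer[max_len:]
    if buffer:
        packed.append(buffer)  # final, shorter chunk (padded or dropped)
    return packed
```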

Fine-Tuning Results
We present the results of fine-tuning our pre-trained models. All runs are averaged over three random seeds and reported as mean ± standard deviation. See Appendix C for hyperparameters.
Encoder Model Results

With a 12-layer decoder, the explicit mask token during seq2seq pre-training does not seem to improve the encoder. However, when the decoder has only 2 layers, the mask token is crucial: "bart-12e2d-mask" out-performs "bart-12e2d" by a wide margin across tasks. We hypothesize that the mask token makes de-noising easier by signaling where tokens should be filled in; without this signal, the task is too challenging for a seq2seq model with just a 2-layer decoder. Reducing the decoder further to only 1 layer does not benefit the encoder.
Continuing to train the seq2seq-extracted encoder on MLM for 100k updates does not close the gap to the from-scratch encoder across datasets: some tasks improve, while others degrade.

Seq2Seq Model Results
We evaluate the generation quality of our seq2seq models on two datasets: mTOP (Li et al., 2021) for cross-lingual zero-shot semantic parsing, and XSUM (Narayan et al., 2018) for English summarization. For mTOP, following CLASP (Rosenbaum et al., 2022a), we use space-joined tokens as input, word sentinels, and the SCIEM (Space- and Case-Insensitive Exact Match) metric. For both datasets, we generate outputs using beam search with k=3.
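For reference, a plausible implementation of the SCIEM comparison (the exact normalization used by Rosenbaum et al. (2022a) may differ in its details):

```python
def sciem_match(prediction, reference):
    """Space- and Case-Insensitive Exact Match: the prediction counts as
    correct if it equals the reference after removing all whitespace and
    lower-casing."""
    normalize = lambda s: "".join(s.split()).lower()
    return normalize(prediction) == normalize(reference)

print(sciem_match("( get weather NEW YORK )", "(Get Weather New York)"))  # True
```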
As shown in Table 3, the two-stage model with the encoder unfrozen partway through training is on par with the from-scratch seq2seq model: compared to "bart-12e12d", "2stage-bart-12e12d-unfrz" is only 0.1 points behind on mTOP en (83.3 vs. 83.4) yet 2.5 points ahead on cross-lingual zero-shot (48.2 vs. 45.7). On XSUM, the two-stage model is on par with or slightly better than the from-scratch seq2seq models.
Masking during seq2seq pre-training does not greatly impact generation quality. When the encoder is frozen ("2stage-bart-12e12d"), the results are slightly behind; attention fusion ("2stage-bart-12e12d-attn-f") does not provide a clear benefit. Overall, our proposed two-stage seq2seq pre-training recipe provides both a multilingual encoder and a seq2seq model on par with the two models trained from scratch, while reducing compute cost by 27% (from 15.0 to 11.0 TU).

Conclusion and Future Work
In this work, we studied recipes to efficiently pre-train both a multilingual encoder and a seq2seq model by re-using the weights from one model for the other. We found that the most effective recipe is to start training a seq2seq model from a pre-trained encoder and unfreeze the encoder partway through training. Future work can explore even more efficient pre-training strategies, such as jointly training on MLM and sequence-level de-noising objectives, and probe further why encoders trained as part of a seq2seq model do not do well on sequence labeling tasks.

Limitations
Our proposed two-stage training recipe is beneficial under the assumption that a pre-trained model is needed for generative as well as sequence labeling tasks. We believe that is typically the case, as one tries to offset the pre-training investment by using the model for as many tasks as possible, but this assumption might not apply in all cases. While we assess the effect of randomness on fine-tuning results by using multiple seeds, we have not done so for the pre-training itself; even at our medium-size scale, it is already prohibitively expensive to do so. The evidence for the effectiveness of the two-stage approach is also limited by the number of tasks evaluated (2 sequence classification tasks, 2 sequence labeling tasks, 2 generation tasks), but we believe this is a reasonable trade-off between robust results and compute investment.
A Additional Related Work

Commonly, encoder transformer models are pre-trained using the MLM objective (Devlin et al., 2019). Decoders are pre-trained using a next-token left-to-right (causal) language modeling objective (Radford and Narasimhan, 2018) or some version of autoregressive de-noising (Lewis et al., 2020). Seq2seq models often combine these objectives (Lewis et al., 2020; Bao et al., 2020).
We follow the multilingual approach of models such as XLM-RoBERTa (Conneau et al., 2020b) (encoder-only) and mT5/mBART (Xue et al., 2021;Liu et al., 2020) (seq2seq), where the model is pre-trained on data from multiple languages.This enables cross-lingual zero-shot fine-tuning, where the model is fine-tuned on task data only from a single language (usually English), then evaluated on multiple languages.
Previous literature has explored using a pre-trained encoder to initialize a larger encoder (Chen et al., 2022) or a seq2seq model (Rothe et al., 2020). The latter was applied to large-scale models, e.g. AlexaTM 20B and AlexaTM 5B (Soltan et al., 2022; Rosenbaum et al., 2022b; FitzGerald et al., 2022). Our work provides the first direct comparison of warm-starting vs. from-scratch seq2seq pre-training using the same data and codebase.
Recently, Sentence-T5 (Ni et al., 2022) studied the opposite direction, showing that the encoder extracted from T5 (Raffel et al., 2020a) can out-perform BERT on several sentence-level tasks. We also explore extracting the encoder from a seq2seq model, adding the novelty of the first explicit comparison with MLM encoders using the same pre-training data and codebase. Furthermore, whereas Sentence-T5 studies only sentence-level tasks in English, we study both sentence-level and token-level (e.g. sequence labeling) multilingual tasks. We show that the encoder extracted from a seq2seq model under-performs on token-level tasks, motivating our proposed sequential pre-training recipes.
EncT5 (Liu et al., 2022) proposes an alternative method to fine-tune the encoder from a seq2seq model for classification and sequence labeling tasks by attaching a randomly initialized one-layer decoder with cross-attention. They report substantial improvements on UDPOS (Part-of-Speech tagging, a sequence labeling task) compared to an mBERT (MLM encoder) model of similar encoder size; however, the comparison is between models pre-trained on different data and codebases. For a cleaner comparison, we would need to implement and evaluate the EncT5 framework with our models, which is challenging since no reference implementation is available and since Liu et al. (2022) provide only the average UDPOS score across languages rather than per-language results. We therefore defer a more thorough study of EncT5 vs. standard feed-forward classification heads to future work.

C Fine-tuning Hyperparameters
Table 10 shows the hyperparameters for fine-tuning the pre-trained models. For encoders, we first performed a single run with learning rates among 1e-6, 3e-6, 1e-5, 3e-5, and 1e-4 for each task and model, and found that the best learning rate was nearly always 1e-5 or 3e-5, with only small differences between those two options by model. For consistency, we then fixed the learning rate for each task and ran each model on each task with three random seeds. We use Adam (Kingma and Ba, 2015) optimization. We freeze the embedding layer, which we find generally improves the cross-lingual zero-shot results slightly.
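A sketch of the corresponding optimizer setup (the `embed_tokens` attribute name is hypothetical):

```python
import torch

def build_finetuning_optimizer(model, lr):
    """Freeze the subword embedding layer and optimize the remaining
    parameters with Adam, as described above."""
    for p in model.embed_tokens.parameters():
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```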
For XNLI, we follow the standard practice established in BERT (Devlin et al., 2019) and attach the classification head to the first token ("<s>" for all of our models). We also explored max pooling across all tokens and did not observe a significant difference in performance.
For mATIS++, following Chen et al. (2019), we use two separate classification heads, one for Intent Classification (IC) attached to the encoder output of the first subword token of the sequence, and the second for Slot Labeling (SL) attached to the first subword of each whole word in the sequence.
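A minimal sketch of these two heads (illustrative; the gather of first-subword positions assumes a precomputed `word_start_index` tensor, which is not part of our described setup):

```python
import torch
import torch.nn as nn

class IntentSlotHeads(nn.Module):
    """Intent logits from the encoder output of the first token; slot logits
    from the encoder output of the first subword of each whole word."""
    def __init__(self, hidden_dim, num_intents, num_slots):
        super().__init__()
        self.intent_head = nn.Linear(hidden_dim, num_intents)
        self.slot_head = nn.Linear(hidden_dim, num_slots)

    def forward(self, encoder_out, word_start_index):
        # encoder_out: (batch, seq_len, hidden)
        # word_start_index: (batch, num_words), long tensor of first-subword positions
        intent_logits = self.intent_head(encoder_out[:, 0])
        idx = word_start_index.unsqueeze(-1).expand(-1, -1, encoder_out.size(-1))
        word_states = torch.gather(encoder_out, 1, idx)
        slot_logits = self.slot_head(word_states)
        return intent_logits, slot_logits
```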
Similarly, for WikiANN NER and UDPOS, we use a single classification head attached to the first subword of each whole word in the sequence. When computing the f1 score for sequence labeling tasks (mATIS++ SL and WikiANN NER), we ignore the "O" ("Outside") tag, using the seqeval (Nakayama, 2018) implementation, which takes into account the BIO tags present in WikiANN.
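For example, with seqeval the entity-level f1 is computed directly from BIO tag sequences, and "O" positions never form entities:

```python
from seqeval.metrics import f1_score

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
# One of two gold entities is recovered with no spurious predictions,
# so precision = 1.0, recall = 0.5, and f1 ≈ 0.67.
print(f1_score(gold, pred))
```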

D Details on Compute Cost
We provide details on the compute cost reported in Table 1. The unit "TU" (Training Units) is defined as the compute cost for 100k updates (forward and backward pass) of 12 model layers with hidden dimension 1024 and batch size 1M tokens. The encoder-only MLM model trains for 500k updates, for a Compute Cost of 5.0 TU. The Seq2Seq Models From Scratch have more layers, and therefore a larger Compute Cost for 500k updates. For example, "bart-12e12d" has 12 layers each for the encoder and decoder, resulting in a compute cost of 10.0 TU for 500k updates. As a baseline, training both the MLM encoder and the seq2seq model from scratch would incur a compute cost of 5.0 + 10.0 = 15.0 TU.

Table 7: Encoder model results by language on WikiANN Named Entity Recognition (NER) test sets, f1 score.
Recipe 1 (Encoder of Seq2Seq + MLM) first pays a compute cost of 10.0 TU for the seq2seq training, then 1.0 TU for 100k MLM updates on the extracted encoder, for a total of 11.0 TU.
For Recipe 2 (Two-Stage Seq2Seq Models warm-started with the MLM encoder), we first pay a compute cost of 5.0 TU from MLM pre-training of the encoder, then add the compute cost of the second-stage seq2seq pre-training. When the encoder is frozen, we only need to compute its forward pass, not the backward pass. We estimate that a frozen encoder's forward pass uses 1/2 the compute that a forward and backward pass would use. (In reality, the ratio is likely lower, as we also save memory by not needing to store optimizer states.) Therefore, when the encoder is frozen for "2stage-bart-12e12d" and "2stage-bart-12e12d-attn-f", the 500k updates incur a compute cost of 2.5 TU on the encoder side and 5.0 TU on the decoder side. Adding this to the 5.0 TU for MLM initialization gives a total compute cost of 5.0 + 7.5 = 12.5 TU.
For "2stage-bart-12e12d-unfrz", the 200k updates with frozen encoder incur a compute cost of 1.0 TU on the encoder side and 2.0 TU on the decoder size for a total of 3.0 TU.During the final 150k updates, the encoder is unfrozen, so the compute cost is 3.0.Adding the 5.0 compute cost for MLM Encoder initialization, the total compute cost for this model is 5.0 + 3.0 + 3.0 = 11.0TU.

E Pre-Training Details
We show an example sentence for each of our pre-training objectives in Figure 2.
Models were pre-trained (8 or 16 machines) and fine-tuned (1 machine) on AWS p3dn.24xlarge instances. For MLM pre-training, we use a peak learning rate of 1.5e-4 (1e-4 for the second stage of Recipe 1), warmed up over 5k update steps (1k for the second stage of Recipe 1) and decayed linearly down to 5e-6 over the total number of updates (500k or 100k, respectively). For seq2seq pre-training, we use the same learning rate schedule as MLM pre-training: a peak of 1.5e-4, warmed up over 5k updates, and linearly decayed down to 5e-6 for the duration of pre-training. For all pre-training runs, we use a dropout of 0.1.
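A sketch of this schedule (assuming warmup starts from zero):

```python
def learning_rate(step, peak=1.5e-4, floor=5e-6, warmup_steps=5_000, total_steps=500_000):
    """Linear warmup from 0 to the peak over warmup_steps, then linear decay
    down to the floor at total_steps."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak + progress * (floor - peak)
```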

F Dataset Sources
We show in Table 11 the source locations of the datasets we use for fine-tuning evaluation.

Figure 1: Two-stage seq2seq pre-training. First (left), we train the encoder via Masked Language Modeling (MLM). Second (right), we attach a randomly initialized decoder to the pre-trained MLM encoder and train on the same data with the de-noising objective. The encoder may remain frozen for part or all of the second stage.

Table 1: Model architecture details. All models use a batch size of 1M tokens, hidden dimension of 1024, feed-forward dimension of 4096, and 16 attention heads.

Table 5: Encoder model results by language on mATIS++ test sets, Intent Classification (IC) accuracy.

Table 6: Encoder model results by language on mATIS++ test sets, Slot Labeling (SL) f1 score.

Table 9: Seq2Seq model results by language on mTOP semantic parsing test sets, SCIEM.