MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators

Prompting has recently been shown to be a promising approach for applying pre-trained language models to downstream tasks. We present Multi-Stage Prompting (MSP), a simple and automatic approach for adapting pre-trained language models to translation tasks. To better mitigate the discrepancy between pre-training and translation, MSP divides the translation process with pre-trained language models into three separate stages: the encoding stage, the re-encoding stage, and the decoding stage. During each stage, we independently apply different continuous prompts to allow pre-trained language models to better shift to translation tasks. We conduct extensive experiments on three translation tasks. Experiments show that our method can significantly improve the translation performance of pre-trained language models.


Introduction
Recent years have witnessed the rapid development of pre-trained language models (Devlin et al., 2019; Brown et al., 2020), with GPT-3 (Brown et al., 2020) as the most representative model. By using prompts and a few examples, GPT-3 can perform various NLP tasks without finetuning (Brown et al., 2020), including translation, question answering, and cloze tasks. This opens up the possibility of using a single pre-trained language model to perform all NLP tasks.
Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) is the current de facto paradigm for machine translation. With the breakthrough of pre-trained language models (Devlin et al., 2019; Brown et al., 2020), efforts have been devoted to utilizing pre-trained language models for translation tasks (Weng et al., 2020; Zhu et al., 2020; Guo et al., 2020; Stickland et al., 2021; Sun et al., 2021b). Previous studies can be roughly divided into three categories: (1) finetuning pre-trained language models (Weng et al., 2020); (2) integrating pre-trained language models into neural machine translation models (Zhu et al., 2020); (3) adapting pre-trained language models to translation tasks with adapters (Guo et al., 2020; Stickland et al., 2021; Sun et al., 2021b). Despite these advances, prior studies either treat pre-trained language models as a component of an NMT model or make substantial changes to pre-trained language models.
Recent studies (Brown et al., 2020; Zhang et al., 2021; Wei et al., 2021) have shed some light on using only pre-trained language models as translators. Via prompting (Brown et al., 2020; Li and Liang, 2021), pre-trained language models can perform translation tasks without modifying their network structures or parameters (Brown et al., 2020; Zhang et al., 2021; Wei et al., 2021), which provides an efficient and elegant alternative approach for translation tasks. Compared with training separate neural models for translation tasks, we highlight two benefits of directly using pre-trained language models as translators:

1. Preserving the ability to perform multiple tasks simultaneously. Using pre-trained language models as translators preserves the ability to perform multiple tasks in a single batch by simply using different prompts (Li and Liang, 2021).

2. Exploiting massive unlabeled data (Devlin et al., 2019; Brown et al., 2020).

Figure 1: Overview of using prompts for adapting a multilingual GPT (mGPT) model to machine translation tasks. Note that we reset the position ids during each stage in multi-stage prompting.
However, naive prompting may not be sufficient to fully exploit the potential of pre-trained language models on translation tasks. Therefore, we believe it is worthwhile to further investigate how to use pre-trained language models as translators.
In this paper, we present Multi-Stage Prompting (MSP), a simple and efficient approach for adapting GPT-style pre-trained language models to translation tasks. Inspired by neural machine translation models (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), which use separate networks for encoding and decoding, we explicitly divide the translation process with pre-trained language models into three different stages: the encoding, the re-encoding, and the decoding stages. By using different prompts during each stage, the pre-trained language model first learns to encode the source sentence in the encoding stage. The model then produces more expressive source representations by re-encoding the source sentence using the previously encoded activations. Finally, the model generates the translation from the re-encoded activations during the decoding stage. Following prefix-tuning (Li and Liang, 2021) and prompt tuning, we use trainable continuous prompts in different stages, which are learned through backpropagation. With MSP, we expect pre-trained language models to play different roles during different stages, making them better translators. Figure 1 gives a comparison between the previous approach and our proposed method.
We conduct experiments using a multilingual GPT (mGPT) model on low-, medium-, and high-resource translation tasks. Experiments verify that MSP can significantly improve the translation performance of pre-trained language models. Our method improves over prefix-tuning by at least 1.2 BLEU points. Our method also outperforms a strong multilingual NMT model based on the Transformer architecture by 1.8 BLEU points, showing the potential of using pre-trained language models as translators.

Background
Prompting is a promising way of using pre-trained language models (PLMs) for downstream tasks (Brown et al., 2020; Li and Liang, 2021; Gao et al., 2020). For example, we can use a template "English: x German: y" and fill in a source sentence x to use a PLM to perform an English→German translation task. Prompts can be either discrete sequences (Brown et al., 2020; Gao et al., 2020) or continuous vectors (Li and Liang, 2021), constructed manually (Brown et al., 2020) or by automatic search (Gao et al., 2020; Li and Liang, 2021).

Let z = [z_1, ..., z_n] be a sequence of tokens, and let P(z) denote the probability of the sequence z. In this paper, we assume that P(z) is modeled using an N-layered autoregressive Transformer network (Vaswani et al., 2017) f_LM(z, H; θ), where z is a word embedding, H is a sequence of past activations, and θ denotes the parameters of the Transformer network. We use d to denote the hidden size of the Transformer network and h_t ∈ R^{2Nd} to denote an activation at time step t, which is a concatenation of the key-value pairs {⟨k^(i), v^(i)⟩ | i = 1 ... N} across the N layers of the Transformer network. Given an input z_t ∈ R^d and a sequence of past activations H_{t-1} = [h_1, ..., h_{t-1}], the conditional probability P(z_t | z_{<t}) is modeled as

    P(z_t | z_{<t}) = exp(e_{z_t} · g_{t-1}) / Σ_{i=1}^{V} exp(e_i · g_{t-1}),    (1)

where V is the vocabulary size, e_{z_i} is the word embedding of z_i, "·" denotes the dot product, and g_{t-1} is the output of the Transformer network at the previous time step, computed as

    g_t, h_t = f_LM(z_t, H_{t-1}; θ).    (2)

Instead of using discrete prompts, we use continuous prompts and define a prompt P as a set of L vectors {p_1, ..., p_L}, where p_i is a trainable continuous vector with the same dimension as h_i. Li and Liang (2021) propose to prepend the prompt P to the past activations H. Formally, the computation in Eq. (2) becomes

    g_t, h_t = f_LM(z_t, [P; H_{t-1}]; θ).    (3)

To simplify notation, we use the following equation to denote the repeated application of f_LM over a sequence z_{i:j} = [z_i, ..., z_j] given past activations H:

    G_{i:j}, H_{i:j} = f_LM(z_{i:j}, H; θ).    (4)
G_{i:j} and H_{i:j} have similar definitions, i.e., G_{i:j} = [g_i, ..., g_j] and H_{i:j} = [h_i, ..., h_j]. By prepending the prompt P and optimizing the vectors p_i on task-specific training data with gradient descent, the pre-trained LM can achieve performance on downstream tasks comparable to finetuning while keeping θ frozen (Li and Liang, 2021).
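As a concrete sketch of how a continuous prompt enters the computation, the toy single-head attention below prepends prompt key/value vectors to the past activations, in the spirit of Eq. (3). All names and sizes are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, past_k, past_v, prompt_k, prompt_v):
    """Single-head attention where trainable prompt key/value vectors are
    prepended to the past key/value activations, as in prefix-tuning."""
    k = np.concatenate([prompt_k, past_k], axis=0)  # (L + t-1, d)
    v = np.concatenate([prompt_v, past_v], axis=0)  # (L + t-1, d)
    scores = k @ query / np.sqrt(query.shape[-1])   # (L + t-1,)
    return softmax(scores) @ v                      # (d,)

# Toy sizes: hidden size d = 4, prompt length L = 2, two past positions.
rng = np.random.default_rng(0)
d, L, t = 4, 2, 2
out = attend(rng.normal(size=d),
             rng.normal(size=(t, d)), rng.normal(size=(t, d)),
             rng.normal(size=(L, d)), rng.normal(size=(L, d)))
```

Because the prompt enters only through the attended keys and values, the frozen Transformer weights θ are untouched; only the prompt vectors would receive gradients.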

Proposed Method
We propose multi-stage prompting, a simple and lightweight method for adapting pre-trained LMs to translation tasks. We first describe MSP in section 3.1. Then we describe the reparameterization of continuous prompts in section 3.2. Finally, we describe the training objective for learning prompts in section 3.3.

Multi-Stage Prompting
Brown et al. (2020) treat translation with the GPT-3 model as a generation task conditioned on a few examples and a prompt. However, we believe this approach has two potential weaknesses:

• Lack of separation between encoding and decoding. Unlike neural machine translation models, which use two networks to model encoding and decoding, simply treating translation as a conditional generation task may not be optimal for making PLMs translators.
• Limited expressive power of source representations. An autoregressive LM is unidirectional and therefore incapable of directly producing a bidirectional representation of the source sentence.
To overcome the above weaknesses, we propose dividing the procedure of using PLMs as translators into three separate stages: the encoding, the re-encoding, and the decoding stages. By providing different prompts at different stages, we believe the PLM can behave differently during each stage and is more capable of generating translations.
Given a source sentence x = [x_1, ..., x_S] and a target sentence y = [y_1, ..., y_T], the three stages are described as follows:

The Encoding Stage. In the encoding stage, the PLM encodes the source sentence x into a sequence of activations H^e_{1:S} using an encoding stage prompt P^e. This procedure is the same as naive prompting. Formally, it can be described as

    H^e_{1:S} = f_LM(X_{1:S}, P^e).    (5)

The Re-encoding Stage. In the re-encoding stage, the PLM produces fine-grained representations of the source sentence by re-encoding x given the past activations H^e_{1:S} and a re-encoding stage prompt P^r, which allows each representation to condition on all words in x:

    H^r_{1:S} = f_LM(X_{1:S}, [P^r; H^e_{1:S}]),    (6)

where [P^r; H^e_{1:S}] denotes the concatenation of the two sequences P^r and H^e_{1:S}.

The Decoding Stage. Finally, we obtain the hidden vectors G_{1:T} for predicting the probability of the target sentence y in the decoding stage, given the refined source representations H^r_{1:S} and a decoding stage prompt P^d:

    G_{1:T} = f_LM(Y_{1:T}, [P^d; H^r_{1:S}]).    (7)

Figure 2 gives a detailed illustration of MSP. By dividing the translation process into multiple stages and applying different prompts, we expect the PLM to generate better translations.
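The control flow of the three stages can be sketched against a generic causal-LM interface. This is a rough illustration only: `dummy_lm`, `translate_msp`, and all toy values are hypothetical stand-ins for the real mGPT forward pass (which returns key-value tensors, not integers), and decoding here is greedy although the paper decodes with beam search:

```python
from typing import List, Tuple

def dummy_lm(tokens: List[int], past: List[int]) -> Tuple[List[int], List[int]]:
    """Stand-in for f_LM: returns one fake 'hidden state' per input token
    and the updated activation sequence."""
    hidden = [(t + sum(past)) % 7 for t in tokens]
    return hidden, past + tokens

def translate_msp(lm, x, p_enc, p_re, p_dec, bos=0, eos=6, max_len=10):
    # Stage 1: encoding -- the source x with the encoding-stage prompt.
    _, h_enc = lm(x, p_enc)
    # Stage 2: re-encoding -- x again, conditioned on [P^r; H^e].
    _, h_re = lm(x, p_re + h_enc)
    # Stage 3: decoding -- generate y given [P^d; H^r] (greedy here).
    y, past = [bos], p_dec + h_re
    while len(y) < max_len:
        g, past = lm([y[-1]], past)
        nxt = g[-1]  # pretend the hidden state is already the argmax token
        if nxt == eos:
            break
        y.append(nxt)
    return y[1:]

hyp = translate_msp(dummy_lm, [1, 2, 3], [1], [2], [3])
```

The essential point the sketch captures is that the same frozen LM is called three times, each time with a different stage prompt prepended to the (growing) activation cache.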

Reparameterization
Learning good prompts for adapting pre-trained language models to translation tasks is challenging. Previous studies (Li and Liang, 2021; Liu et al., 2021b) suggest that reparameterizing continuous prompts with neural networks can bring significant improvements. We adopt the same architecture as Li and Liang (2021). Formally, we reparameterize P^e, P^r, and P^d using the following network:

    P^s = tanh(P^s_φ W_1) W_2,  s ∈ {e, r, d},    (8)

where W_1 ∈ R^{d×d}, W_2 ∈ R^{d×2Nd}, P^e_φ ∈ R^{L×d}, P^r_φ ∈ R^{L×d}, and P^d_φ ∈ R^{L×d} are trainable parameters. Once training is done, we pre-compute P^e, P^r, and P^d using the above network; the network and its trainable parameters are then discarded.
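The reparameterization can be sketched as below. This is a minimal sketch under assumptions: the tanh nonlinearity mirrors prefix-tuning-style MLPs, the sizes are toy values, and all names are illustrative:

```python
import numpy as np

# Toy dimensions: hidden size d, prompt length L, N layers -> prompts in R^{2Nd}.
d, L, N = 8, 4, 3
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d, d)) * 0.1          # trainable, shared projection
W2 = rng.normal(size=(d, 2 * N * d)) * 0.1  # trainable, shared projection
P_phi = rng.normal(size=(L, d))             # trainable embedding, one per stage

def reparameterize(P_phi, W1, W2):
    """Map an (L, d) prompt embedding to (L, 2Nd) key-value prompt vectors,
    one per prompt position, matching the dimension of an activation h_t."""
    return np.tanh(P_phi @ W1) @ W2

P = reparameterize(P_phi, W1, W2)  # pre-computed once after training
```

After training, only the pre-computed `P` (one per stage) needs to be stored; `W1`, `W2`, and `P_phi` can be discarded, which keeps the deployed footprint small.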

Training Objective
We use the cross-entropy loss for learning prompts. Given G_{1:T} = [g_1, ..., g_T] in Eq. (7), the training objective is formally described as

    L(φ) = -Σ_{t=1}^{T} log P(y_t | y_{<t}, x),    (9)

where φ denotes the trainable prompt parameters.

Experiments

Setup

Datasets. We conduct experiments on three translation tasks:

• Low-Resource Translation: We used the TedTalks dataset (Qi et al., 2018) for both training and testing.
• Medium-Resource Translation: We used the WMT14 English-German (En-De) dataset as the training corpus for the medium-resource translation task, which consists of 4.5M sentence pairs. The test set is newstest2014.
• High-Resource Translation: We used the WMT20 English-Chinese (En-Zh) dataset as the training corpus for the high-resource translation task, which consists of 28M sentence pairs. The test set is newstest2020.
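For concreteness, the cross-entropy objective used to learn the prompts can be sketched as follows. This is a toy sketch: the per-step logits (the scores e_i · g from the model) are assumed to be precomputed, and all names are illustrative:

```python
import numpy as np

def prompt_loss(logits, targets):
    """Cross-entropy over the target sentence: -sum_t log P(y_t | y_<t, x).
    logits: (T, V) unnormalized scores; targets: (T,) gold token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

# Two steps over a vocabulary of size 2, with gold-token probabilities 0.7 and 0.5.
loss = prompt_loss(np.log(np.array([[0.7, 0.3], [0.5, 0.5]])), np.array([0, 0]))
```

In the full method, only the prompt parameters φ receive gradients from this loss; the PLM parameters θ stay frozen.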
We used case-sensitive BLEU (Papineni et al., 2002) as the evaluation metric. The BLEU score is calculated using the SACREBLEU toolkit (Post, 2018).

Baselines. We compare our method with the following baselines:

• Transformer (Vaswani et al., 2017). A state-of-the-art neural machine translation model.
• Prefix-Tuning (Li and Liang, 2021). We use prefix-tuning to adapt the mGPT model to translation tasks.
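For reference, single-pair BLEU (Papineni et al., 2002) can be sketched in a few lines. This sketch is illustrative only: the reported scores come from the SACREBLEU toolkit, which additionally handles tokenization, smoothing options, and corpus-level aggregation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Single-pair BLEU: clipped n-gram precisions (n = 1..max_n) with
    uniform weights, times the brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # no smoothing in this toy version
        log_precisions.append(math.log(overlap / total))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

A perfect hypothesis scores 100, and any hypothesis with no 4-gram overlap scores 0 here, which is why real evaluations use smoothed, corpus-level BLEU.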
Hyper-Parameters. All our models are trained on a machine with 8 RTX 3090Ti GPUs. For Transformer models, we used the transformer-big setting and the same tokenization and vocabulary as mGPT. All other settings are the same as Vaswani et al. (2017). For prefix-tuning and MSP, we set the prompt length to 128. We use Adam (Kingma and Ba, 2014) (β_1 = 0.9, β_2 = 0.98, and ε = 1 × 10^{-9}) as the optimizer. Each mini-batch contains about 32k tokens. We train prompts for a total of 80k steps. We used the beam search algorithm to obtain translations from the mGPT model, with the beam size set to 4. We implement our models with the open-source toolkits THUMT (Tan et al., 2020) and Transformers (Wolf et al., 2020).

Results on the TedTalks Dataset

Tables 1 and 2 show the results on X→En and En→X translation tasks, respectively. Our method achieves an average of 31.9 BLEU points on X→En translation tasks and an average of 27.9 BLEU points on En→X translation tasks, outperforming the prefix-tuning baseline by 2.6 and 1.2 BLEU points, respectively. Our method also outperforms a strong multilingual Transformer model by 2.4 and 1.8 BLEU points, respectively. These results indicate that pre-trained language models can effectively exploit large unlabeled raw data, and that using pre-trained language models as translators can achieve superior performance to NMT models in low-resource translation scenarios.

Results on the WMT14 En-De Dataset
Table 3 shows the results for the WMT14 En-De translation task. With MSP, the mGPT model improves the translation performance by 3.7 BLEU points compared with the prefix-tuning baseline. However, the translation performance of mGPT with MSP is behind the NMT model by a large margin. We conjecture there are two reasons:

1. Limited capacity of the mGPT model. Our mGPT model is relatively small and trained on massive multilingual data. As a result, the capacity of the mGPT model may limit its performance on translation tasks.
2. Difficulty of adapting PLMs to machine translation tasks. As translation is quite different from language modeling, it is generally difficult to adapt PLMs to translation tasks.
We conduct further experiments to validate our conjecture. We train a separate Transformer encoder to directly map a source sentence to a continuous prompt, leaving the mGPT model serving only as a decoder. With this approach, the gap in translation performance between the mGPT model and the NMT model narrows to 1.1 BLEU points. The result supports our conjecture that both the limited capacity of mGPT and the difficulty of adapting PLMs to translation tasks contribute to the gap.

Results on the WMT20 En-Zh Dataset

Table 4 shows the results on the WMT20 En-Zh translation task. We also compare our method with previous work. Our method outperforms the results of the mT5-XXL, CPM-2, and Ernie 3.0 models on this task, albeit using a much smaller pre-trained model. Using prompt tuning to adapt mGPT to the En-Zh translation task performs much worse than using prefix-tuning. Prompt tuning introduces fewer trainable parameters than prefix-tuning, which may be insufficient for adapting a relatively small pre-trained LM to translation tasks. Our approach outperforms the prefix-tuning baseline by 6.2 BLEU points. The results indicate that on high-resource and complex translation directions, multi-stage prompting is more effective than prefix-tuning in adapting PLMs.

Ablation Study

Table 5 shows the ablation study on the WMT14 En-De translation task. Using a single prompt for all three stages drops the translation performance of the mGPT model to 19.8 BLEU points (row 2 vs. row 1), which coincides with our intuition that using different prompts in different stages helps PLMs adapt to translation tasks. Using a doubled source template with prefix-tuning performs worse than multi-stage prompting (row 3 vs. row 2), which indicates the necessity of differentiating stages. Repeating the source two times improves the translation performance (row 3 vs. row 4), which confirms that re-encoding is effective in improving the translation performance of PLMs.

Related Work
Prompting. Brown et al. (2020) propose to use a few examples and prompts to adapt the GPT-3 model to downstream tasks, which is referred to as in-context learning. Their prompts are manually designed. Gao et al. (2020) present LM-BFF for automatic prompt generation, using the T5 model (Raffel et al., 2019) to generate templates for prompting PLMs. Li and Liang (2021) propose prefix-tuning, which uses continuous vectors as prompts; these prompts are trained on task-specific data and optimized through gradient descent. Prompt tuning is similar to prefix-tuning but has fewer trainable parameters. Zhang et al. (2021) investigate using prompt tuning to adapt the CPM-2 model to the WMT20 English-Chinese translation task. Our method is also based on prompting: we use continuous prompts to adapt PLMs to translation tasks, but unlike Li and Liang (2021), we divide the translation process into multiple stages with stage-specific prompts.

Table 4: Results on the WMT20 En-Zh translation task. "#Params." indicates the number of parameters of pre-trained models.

Table 5: Ablation study on the WMT14 En-De translation task.

#  Method                                        BLEU
1  Multi-Stage Prompting                         21.2
2  Single prompt for all stages                  19.8
3  Prefix-Tuning (template: "x <S1> x <S2> y")   18.8
4  Prefix-Tuning (template: "x <S> y")           17.5

Another line of work proposes grafting a source BERT model and a target GPT model for translation tasks. Compared with these approaches, our method only uses one multilingual GPT model; moreover, we do not add trainable adapter networks into PLMs. Stickland et al. (2021) investigate using BART and mBART models for machine translation tasks; their approach relies on adapter networks and finetuning parts of the PLMs, whereas our approach is based on prompting and uses only prompts to adapt the PLMs to translation tasks. Furthermore, their approach applies to encoder-decoder PLMs while ours applies to decoder-only PLMs. Wang et al. (2021) investigate using a decoder-only architecture for translation tasks. Our method also uses a decoder-only architecture; however, our model is pre-trained on monolingual data and we only use bilingual data to learn prompts, while Wang et al. (2021) use parallel data to train the whole model.

Conclusion
We have presented multi-stage prompting, a method for making pre-trained language models better translators. Experiments show that with multi-stage prompting, pre-trained language models can generate translations that even surpass those of neural machine translation models in low-resource scenarios, showing the potential of using pre-trained language models for translation tasks.
In future work, we plan to extend our method to pre-trained language models with encoder-decoder architectures.