Transfer Learning for Sequence Generation: from Single-source to Multi-source

Multi-source sequence generation (MSG) is an important class of sequence generation tasks that take multiple sources, including automatic post-editing, multi-source translation, and multi-document summarization. As MSG tasks suffer from data scarcity and recent pretrained models have proven effective for low-resource downstream tasks, transferring pretrained sequence-to-sequence models to MSG tasks is essential. Although directly finetuning pretrained models on MSG tasks, with multiple sources concatenated into a single long sequence, is a simple way to transfer them, we conjecture that such direct finetuning leads to catastrophic forgetting and that relying solely on pretrained self-attention layers to capture cross-source information is insufficient. Therefore, we propose a two-stage finetuning method to alleviate the pretrain-finetune discrepancy and introduce a novel MSG model with a fine encoder to learn better representations for MSG tasks. Experiments show that our approach achieves new state-of-the-art results on the WMT17 APE task and the multi-source translation task using the WMT14 test set. When adapted to document-level translation, our framework outperforms strong baselines significantly.


Introduction
Thanks to the continuous representations widely used across text, speech, and images, neural networks that accept multiple sources as input have gained increasing attention in the community (Ive et al., 2019; Dupont and Luettin, 2000). For example, complementary multi-modal inputs have proven to be helpful for many sequence generation tasks such as question answering (Antol et al., 2015), machine translation (Huang et al., 2016), and speech recognition (Dupont and Luettin, 2000). In natural language processing, multiple textual inputs have also been shown to be valuable for sequence generation tasks such as multi-source translation (Zoph and Knight, 2016), automatic post-editing (Chatterjee et al., 2017), multi-document summarization (Haghighi and Vanderwende, 2009), system combination for NMT (Huang et al., 2020), and document-level machine translation (Wang et al., 2017). We refer to this kind of task as multi-source sequence generation (MSG).

Table 1: Comparison of various approaches to transferring pretrained models to single-source and multi-source sequence generation tasks. Different from prior studies, this work aims at transferring pretrained sequence-to-sequence models to multi-source sequence generation tasks.
Unfortunately, MSG tasks face a severe challenge: there are not sufficient data to train MSG models. For example, multi-source translation requires parallel corpora involving multiple languages, which are usually restricted in quantity and coverage. Recently, as pretrained language models that take advantage of massive unlabeled data have proven to improve natural language understanding (NLU) and generation tasks substantially (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020), a number of researchers have proposed to leverage pretrained language models to enhance MSG tasks (Correia and Martins, 2019; Lee et al., 2020; Lee, 2020). For example, Correia and Martins (2019) show that pretrained autoencoding (AE) models can enhance automatic post-editing.
As most recent pretrained sequence-to-sequence (Seq2Seq) models (Song et al., 2019; Lewis et al., 2020; Liu et al., 2020) have demonstrated their effectiveness in improving single-source sequence generation (SSG) tasks, we believe that pretrained Seq2Seq models can potentially bring more benefits to MSG than pretrained AE models. Although it is easy to transfer Seq2Seq models to SSG tasks, transferring them to MSG tasks is challenging because MSG takes multiple sources as input, leading to severe pretrain-finetune discrepancies in terms of both architectures and objectives.
A straightforward solution is to concatenate the representations of multiple sources, as suggested by Correia and Martins (2019). However, we believe this approach suffers from two major drawbacks. First, due to the discrepancy between pretraining and MSG, directly transferring pretrained models to MSG tasks might lead to catastrophic forgetting (McCloskey and Cohen, 1989; Kirkpatrick et al., 2017), which results in reduced performance. Second, the pretrained self-attention layers might not fully learn the representations of the concatenation of multiple sources because they do not make full use of the cross-source information.
Inspired by the use of intermediate tasks for NLU (Pruksachatkun et al., 2020; Vu et al., 2020), we conjecture that inserting a proper intermediate task between pretraining and MSG can alleviate the discrepancy. In this paper, we propose a two-stage finetuning method named gradual finetuning. Different from prior studies, our work aims to transfer pretrained Seq2Seq models to MSG (see Table 1). Our approach first transfers from pretrained models to SSG and then from SSG to MSG (see Figure 1). Furthermore, we propose a novel MSG model with coarse and fine encoders to differentiate sources and learn better representations. On top of a coarse encoder (i.e., the pretrained encoder), a fine encoder equipped with cross-attention layers (Vaswani et al., 2017) is added. We refer to our approach as TRICE (a task-agnostic Transferring fRamework for multI-sourCe sEquence generation), which achieves new state-of-the-art results on the WMT17 APE task and the multi-source translation task using the WMT14 test set. When adapted to document-level translation, our framework outperforms strong baselines significantly.

Approach
Figure 1 shows an overview of our framework. First, the problem statement is described in Section 2.1. Second, we propose the gradual finetuning method (Section 2.2) to reduce the pretrain-finetune discrepancy. Third, we introduce our MSG model, which consists of the coarse encoder (Section 2.3), the fine encoder (Section 2.4), and the decoder (Section 2.5).

Problem Statement
As shown in Figure 1, there are three kinds of datasets: (1) the unlabeled multilingual dataset D_p containing monolingual corpora in various languages, (2) the single-source parallel dataset D_s involving multiple language pairs, and (3) the multi-source parallel dataset D_m. The general objective is to leverage these three kinds of datasets to improve multi-source sequence generation tasks.
Formally, let x_{1:K} = x_1 ... x_K be the K source sentences, where x_k is the k-th sentence. We use x_{k,i} to denote the i-th word in the k-th source sentence and y = y_1 ... y_J to denote the target sentence with J words. The MSG model is given by

P_m(y \mid x_{1:K}; \theta) = \prod_{j=1}^{J} P(y_j \mid x_{1:K}, y_{<j}; \theta),   (1)

where y_j is the j-th word in the target, y_{<j} = y_1 ... y_{j-1} is a partial target sentence, P(y_j \mid x_{1:K}, y_{<j}; \theta) is a word-level generation probability, and \theta are the parameters of the MSG model.
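Eq. (1) is the standard autoregressive chain rule over target words, applied with multiple sources in the conditioning context. A minimal sketch of this factorization in log space, with a stand-in callable in place of the neural model (the interface is hypothetical, purely for illustration):

```python
import math

def msg_log_prob(step_prob, sources, target):
    """Log-probability of a target under the chain-rule factorization of
    Eq. (1): log P(y | x_{1:K}) = sum_j log P(y_j | x_{1:K}, y_{<j}).

    `step_prob(sources, prefix, token)` is any callable returning the
    word-level probability P(y_j | x_{1:K}, y_{<j}); here it stands in
    for the neural MSG model.
    """
    total = 0.0
    for j, token in enumerate(target):
        total += math.log(step_prob(sources, target[:j], token))
    return total

# Toy "model": a uniform distribution over a 10-word vocabulary.
uniform = lambda sources, prefix, token: 0.1

sources = ["ein Satz", "une phrase"]  # K = 2 source sentences
target = ["a", "sentence", "</s>"]    # J = 3 target words
lp = msg_log_prob(uniform, sources, target)  # 3 * log(0.1)
```

Working in log space avoids underflow for long targets; the product in Eq. (1) becomes a sum of per-word log-probabilities.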

Gradual Finetuning
As training neural models on large-scale unlabeled datasets is time-consuming, it is common practice to utilize pretrained models to improve downstream tasks via transfer learning (Devlin et al., 2019). Therefore, we focus on leveraging single-source and multi-source parallel datasets to transfer pretrained Seq2Seq models to MSG tasks.
Curriculum learning (Bengio et al., 2009) aims to learn from examples organized in an easy-to-hard order, and intermediate tasks (Pruksachatkun et al., 2020; Vu et al., 2020) have been introduced to alleviate the pretrain-finetune discrepancy for NLU. Inspired by these studies, we expect that gradually changing the training objective from pretraining to MSG can reduce the difficulty of transferring pretrained models to MSG tasks. Therefore, we propose a two-stage finetuning method named gradual finetuning. The transfer process is divided into two stages (see Figure 1). In the first stage, the SSG model is transferred from denoising autoencoding to the single-source sequence generation task, and the model architecture is kept unchanged. In the second stage, an additional fine encoder (see Section 2.4) is introduced to transform the SSG model into the MSG model, and the MSG model is optimized on the multi-source parallel corpus.
Formally, we use \phi_p to denote the parameters of the SSG model. Without loss of generality, the pretraining process can be described as follows:

\hat{\phi}_p = \mathop{\arg\max}_{\phi_p} \sum_{z \in D_p} \log P_s(z \mid \hat{z}; \phi_p),

where z is a sentence that could be in one of many languages, \hat{z} is the corrupted sentence obtained from z, P_s is the probability modeled by the SSG model, and \hat{\phi}_p are the learned parameters. In this way, a powerful multilingual model is obtained by pretraining on the unlabeled multilingual dataset D_p. Then, in the first finetuning stage, let \phi_s be the parameters of the SSG model, which are initialized by \hat{\phi}_p. As the single-source parallel dataset D_s is not always available, we can build it from the K-source parallel dataset D_m: given a training example ⟨x_{1:K}, y⟩ in D_m, a training example ⟨x, y⟩ in D_s can be constructed by sampling one source from the K-source example, each with probability 1/K. The first finetuning process is given by

\hat{\phi}_s = \mathop{\arg\max}_{\phi_s} \sum_{⟨x, y⟩ \in D_s} \log P_s(y \mid x; \phi_s),

where \hat{\phi}_s are the learned parameters. The learned SSG model is capable of taking inputs in multiple languages.
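The construction of D_s from D_m described above can be sketched in a few lines; the data layout (a list of ⟨sources, target⟩ pairs) is an assumption for illustration, not the paper's actual data format:

```python
import random

def build_single_source(dm, seed=0):
    """Construct the single-source dataset D_s from the K-source dataset D_m.

    For each K-source example (sources, target), one source is sampled
    uniformly (probability 1/K), yielding a single-source pair used in the
    first finetuning stage.
    """
    rng = random.Random(seed)
    ds = []
    for sources, target in dm:
        x = rng.choice(sources)  # pick one of the K sources with prob 1/K
        ds.append((x, target))
    return ds

dm = [(["src-de-1", "src-fr-1"], "tgt-en-1"),
      (["src-de-2", "src-fr-2"], "tgt-en-2")]
ds = build_single_source(dm)
```

Because each pair keeps its original target, the resulting D_s mixes source languages while covering every target sentence, which matches the claim that the learned SSG model can take inputs in multiple languages.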
In the second finetuning stage, \phi_m, the parameters of the coarse encoder, the decoder, and the embeddings, are initialized by \hat{\phi}_s, while \gamma denotes the randomly initialized parameters of the fine encoder. Thus, \theta = \phi_m \cup \gamma are the parameters of the MSG model. The second finetuning process can be described as

\hat{\theta} = \mathop{\arg\max}_{\theta} \sum_{⟨x_{1:K}, y⟩ \in D_m} \log P_m(y \mid x_{1:K}; \theta),

where P_m is given by Eq. (1). As a result, the model is expected to learn from abundant unlabeled data and perform well on the MSG task. In the following subsections, we describe the MSG model architecture (see Figure 2) applied in the second finetuning stage.

Input Representation and the Coarse Encoder
In general, pretrained encoders are considered strong feature extractors that learn meaningful representations (Zhu et al., 2019). Since MSG tasks have multiple sources involving different languages and pretrained multilingual Seq2Seq models like mBART (Liu et al., 2020) usually rely on special tokens (e.g., <en>) to differentiate languages, concatenating multiple sources into a single long sentence will confuse the model about the language of the concatenated sentence (see Table 6). Therefore, we propose to add an additional segment embedding to differentiate sentences in different languages and to encode the source sentences jointly with a single pretrained multilingual encoder.
Formally, the input representation can be denoted by

X_{k,i} = E_{tok}[x_{k,i}] + E_{pos}[i] + E_{seg}[k],

where X_{k,i} is the input representation of the i-th word in the k-th source sentence, and E_{tok}, E_{pos}, and E_{seg} are the token, position, and segment/language embedding matrices, respectively. E_{tok} and E_{pos} are initialized by the pretrained embedding matrices. E_{seg} is implemented as constant sinusoidal embeddings (Vaswani et al., 2017). If the pretrained model already contains a segment/language embedding matrix, the pretrained one is used.
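A minimal sketch of this input representation, assuming toy lookup results for E_tok and E_pos and the standard sinusoidal formula of Vaswani et al. (2017) for the constant segment embedding (the exact sinusoid the paper uses is not spelled out here, so this is an assumption):

```python
import math

def sinusoidal(k, d):
    """Constant sinusoidal embedding for segment index k:
    even dims use sin(k / 10000^(i/d)), odd dims use cos(...)."""
    emb = []
    for i in range(0, d, 2):
        angle = k / (10000 ** (i / d))
        emb.append(math.sin(angle))
        emb.append(math.cos(angle))
    return emb[:d]

def input_representation(tok_emb, pos_emb, k, d):
    """X_{k,i} = E_tok[x_{k,i}] + E_pos[i] + E_seg[k].
    `tok_emb` and `pos_emb` are precomputed lookup results (toy vectors)."""
    seg = sinusoidal(k, d)
    return [t + p + s for t, p, s in zip(tok_emb, pos_emb, seg)]

d = 4
x = input_representation([0.1] * d, [0.2] * d, k=1, d=d)
```

Because the segment embedding is a fixed sinusoid rather than a learned matrix, no new parameters are introduced at the input layer, which keeps the pretrained embedding space intact.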
Then, the pretrained encoder is utilized to encode the multiple sources:

R^{(i)}_{1:K} = FFN(SelfAtt(R^{(i-1)}_{1:K})),

where SelfAtt(·) and FFN(·) are the self-attention and feed-forward networks, respectively. R^{(i)}_{1:K} is the representation output by the i-th encoder layer, and R^{(0)}_{1:K} refers to X_1 ... X_K, where X_k is equivalent to X_{k,1} ... X_{k,I_k} and I_k is the number of tokens in the k-th source sentence.
However, we conjecture that indiscriminately modeling dependencies between words with the pretrained self-attention layers cannot capture cross-source information adequately. To this end, we regard the pretrained encoder as the coarse encoder and introduce a novel fine encoder to learn better multi-source representations.

The Fine Encoder
To alleviate the pretrain-finetune discrepancy, we adopt the gradual finetuning method to better transfer from single-source to multi-source generation. In the first finetuning stage, the coarse encoder is used to encode different sources individually. As the multiple sources are concatenated into a single sequence in which words interact only through pretrained self-attention, we conjecture that cross-source information cannot be fully captured. Hence, we propose to add a randomly initialized fine encoder, which consists of self-attention, cross-attention, and feed-forward sublayers, on top of the pretrained coarse encoder to learn meaningful multi-source representations. Specifically, the cross-attention sublayer is an essential part of the fine encoder because it performs fine-grained interaction between sources (see Table 5).
Formally, the architecture of the fine encoder can be described as follows. First, the representations of the multiple sources output by the coarse encoder are divided according to the boundaries of the sources:

S^{(0)}_1, ..., S^{(0)}_K = Split(R^{(N_c)}_{1:K}),

where N_c is the number of coarse encoder layers and Split(·) is the split operation. Second, in each fine encoder layer, the representations are fed into a self-attention sublayer:

O^{(i)}_k = SelfAtt(S^{(i-1)}_k),

where S^{(i-1)}_k is the representation corresponding to the k-th source sentence output by the (i-1)-th layer of the fine encoder and O^{(i)}_k is the representation output by the self-attention sublayer of the i-th layer. Third, the representations of the source sentences interact through a cross-attention sublayer:

O^{(i)}_{\k} = Concat(O^{(i)}_1, ..., O^{(i)}_{k-1}, O^{(i)}_{k+1}, ..., O^{(i)}_K),
T^{(i)}_k = CrossAtt(O^{(i)}_k, O^{(i)}_{\k}),

where Concat(·) is the concatenation operation, O^{(i)}_{\k} is the concatenated representation of all sources except the k-th, and T^{(i)}_k is the representation output by the cross-attention sublayer of the i-th layer. Finally, the last sublayer is a feed-forward network:

S^{(i)}_k = FFN(T^{(i)}_k).

After the N_f-layer fine encoder, the representations corresponding to the multiple sources are given to the decoder.
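The cross-source interaction performed by the cross-attention sublayer can be sketched as below, in a single-head, projection-free form (the real sublayer also has learned query/key/value projections, residual connections, and layer normalization, omitted here for clarity):

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention (single head) on plain Python lists."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum((wi / z) * v[j] for wi, v in zip(w, values))
                    for j in range(d)])
    return out

def fine_cross_attention(sources):
    """One fine-encoder cross-attention step: each source attends to the
    concatenation of all *other* sources, realizing O_{\\k} above."""
    result = []
    for k, src in enumerate(sources):
        others = [tok for j, s in enumerate(sources) if j != k for tok in s]
        result.append(attend(src, others, others))
    return result

src_a = [[1.0, 0.0], [0.0, 1.0]]  # toy token vectors for source 1
src_b = [[0.5, 0.5]]              # toy token vectors for source 2
out = fine_cross_attention([src_a, src_b])
```

Excluding a source's own tokens from its key/value set forces the sublayer to gather information across sources, which is exactly the fine-grained interaction the self-attention of the coarse encoder is conjectured to miss.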

The Decoder
Given that the representations of multiple sources are different from that of a single source, to better leverage them, we let the cross-attention sublayer in the decoder take each source's representation as key/value separately and then combine the outputs by mean pooling. Formally, the differences between our decoder and the traditional Transformer decoder are described below. First, the input representations of the i-th decoder layer are fed into the self-attention sublayer to obtain G^{(i)}_j. Second, a separated cross-attention sublayer is adopted by our framework to replace the traditional cross-attention sublayer:

P^{(i)}_{j,k} = CrossAtt(G^{(i)}_j, S^{(N_f)}_k),
H^{(i)}_j = (1/K) \sum_{k=1}^{K} P^{(i)}_{j,k},

where S^{(N_f)}_k is the representation of the k-th source output by the fine encoder, P^{(i)}_{j,k} is the representation corresponding to the k-th source, H^{(i)}_j is the combined result of the separated cross-attention sublayer, and the parameters of the separated cross-attentions over the different sources are shared. Finally, a feed-forward network is the last sublayer of each decoder layer. In this way, the decoder in our framework can better handle the representations of multiple sources.
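The separated cross-attention with mean pooling can be sketched as follows, again in a single-head, projection-free form (the shared learned projections of the real sublayer are omitted, so this shows only the structure of the computation):

```python
import math

def attend_one(query, keys, values):
    """Single-query scaled dot-product attention on plain Python lists."""
    d = len(query)
    scores = [sum(qi * ki for qi, ki in zip(query, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum((wi / z) * v[j] for wi, v in zip(w, values)) for j in range(d)]

def separated_cross_attention(g, source_reps):
    """Decoder-side separated cross-attention: the target-side query g
    attends to each source's representations independently, and the K
    per-source outputs are combined by mean pooling."""
    outs = [attend_one(g, src, src) for src in source_reps]
    d = len(g)
    return [sum(o[j] for o in outs) / len(outs) for j in range(d)]

g = [1.0, 0.0]                        # toy decoder query G
srcs = [[[2.0, 0.0]], [[0.0, 4.0]]]   # K = 2 sources, one vector each
h = separated_cross_attention(g, srcs)
```

Attending to each source separately prevents tokens from a long source from drowning out a short one in a single softmax, and mean pooling keeps the combination parameter-free.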
For the APE task, following Correia and Martins (2019), we used the data from the WMT17 APE task (English-German SMT) (Chatterjee et al., 2019). The dataset contains 23K dual-source examples (e.g., ⟨English source sentence, German translation, German post-edit⟩) for training, an extremely low-resource setting. We also followed Correia and Martins (2019) in adopting pseudo data (Junczys-Dowmunt and Grundkiewicz, 2016; Negri et al., 2018), which contains about 8M pseudo training examples, to evaluate our framework in a high-resource setting. We adopted dev16 for development and used test16 and test17 for testing.
For the multi-source translation task, following Zoph and Knight (2016), we used a subset of the WMT14 news dataset (Bojar et al., 2014), which contains 2.4M dual-source examples (e.g., ⟨German source sentence, French source sentence, English translation⟩) for training, 3,000 sentences from test13 for development, and 1,503 sentences from test14 for testing. It can be seen as a medium-resource setting.
For the document-level translation task, we used the datasets provided by Maruf et al. (2019) from IWSLT 2017 (TED) and News Commentary (News), both including about 200K English-German training examples, which can be seen as low-resource settings. For IWSLT 2017, test16 and test17 were combined as the test set, and the rest served as the development set. For News Commentary, test15 and test16 from WMT16 were used for development and testing, respectively. We took the nearest preceding sentence as the context and then constructed dual-source examples of the form ⟨German context, German current sentence, English translation⟩.

Hyper-parameters
We adopted mBART (Liu et al., 2020) as the pretrained Seq2Seq model. We set both N_c and N_d to 12 and N_f to 1. The model dimension, the filter size, and the number of heads are the same as in mBART. We adopted the vocabulary of mBART, which contains 250K tokens. We used minibatch sizes of 256, 1,024, 4,096, and 16,384 tokens for the extremely low-, low-, medium-, and high-resource settings, respectively. We used the development set to tune the hyper-parameters and select the best model. In inference, the beam size was set to 4. Please refer to Appendix A.1 for more details.
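The hyper-parameters above can be summarized as a configuration sketch; the dictionary keys and structure are illustrative, not the actual training configuration of the paper:

```python
# Hyper-parameter summary for TRICE experiments (values from the setup above).
CONFIG = {
    "pretrained_model": "mBART",
    "coarse_encoder_layers": 12,   # N_c (the pretrained mBART encoder)
    "decoder_layers": 12,          # N_d
    "fine_encoder_layers": 1,      # N_f (randomly initialized)
    "vocab_size": 250_000,         # mBART vocabulary
    "beam_size": 4,
    # Minibatch size in tokens per resource setting.
    "batch_tokens": {
        "extremely_low": 256,   # WMT17 APE, 23K examples
        "low": 1_024,           # document-level translation, ~200K examples
        "medium": 4_096,        # multi-source translation, 2.4M examples
        "high": 16_384,         # APE with ~8M pseudo examples
    },
}
```

Scaling the token batch size with the corpus size is a common way to trade off update noise against overfitting in low-resource finetuning.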

Evaluation Metrics
We used case-sensitive BLEU (multi-bleu.perl) and TER for automatic post-editing. For multi-source translation and document-level translation, SacreBLEU (Post, 2018) and METEOR were adopted for evaluation. We used paired bootstrap resampling (Koehn, 2004) for statistical significance tests.

Main Results
Table 2 shows the results on the automatic post-editing task. Our framework outperforms previous methods without pretraining (i.e., FORCEDATT, DUALTRANS, and L2COPY) by a large margin and significantly surpasses strong baselines with pretraining (i.e., DUALBERT and DUALBART), which concatenate multiple sources into a single source, in both the extremely low- and high-resource settings. Notably, the performance of our framework in the extremely low-resource setting is comparable to that of strong baselines without pretraining in the high-resource setting, and we achieve new state-of-the-art results on this benchmark.
Table 3 shows the results on the multi-source translation task. Our framework substantially outperforms both baselines without pretraining (i.e., MULTIRNN and DUALTRANS) and with pretraining (i.e., the single-source model MBART-TRANS and the dual-source model DUALBART). Surprisingly, the single-source models with pretraining are inferior to the multi-source model without pretraining, which indicates that multiple sources play an important role in the translation task.
Table 4 shows the results on the document-level translation task. Our framework achieves significant improvements over all strong baselines. Notably, the previous method for handling multiple sources (i.e., DUALBART) fails to consistently outperform the simple sentence- and document-level Transformers (i.e., MBART-TRANS and MBART-DOCTRANS), while our framework outperforms these strong baselines significantly.
In general, our framework shows strong generalizability across three different MSG tasks and four different data scales, which indicates that it is useful to alleviate the pretrain-finetune discrepancy via gradual finetuning and to learn multi-source representations by fully capturing cross-source information.

Analyses
In this subsection, we further conduct studies on the variants of the fine encoder, ablations of the other proposed components, and the effect of freezing parameters. Experiments are conducted on the APE task in the extremely low-resource setting. BLEU scores calculated on the development set are adopted as the evaluation metric.
Comparisons with the variants of the fine encoder. Table 5 compares variants of the fine encoder. We find that the fine encoder (see Section 2.4) is effective (compared to "None"), that the cross-attention sublayer is important (compared to the variant without cross-attention), and that our approach outperforms the "FFN adapter" proposed by Zhu et al. (2019), which incorporates BERT into sequence generation tasks by inserting FFNs into each encoder layer. We also find that stacking more fine encoder layers even harms performance (see the last three rows in Table 5), which rules out the possibility that the improvements merely come from the increase in parameters.
Ablations on the other proposed components.
Table 6 shows the results of the ablation study. We find that the gradual finetuning method (see Section 2.2) is significantly beneficial. The lines "-segment embedding" and "-concatenated encoding" show that concatenating multiple sources into a long sequence and adding sinusoidal segment embeddings for the coarse encoder are helpful (see Section 2.3). The line "-separated cross-attention" reveals that taking each source's representation as key/value separately and then combining the outputs is better than concatenating all the representations and performing cross-attention jointly (see Section 2.5).

Effect of freezing parameters. Distinguishing parameters initialized by pretrained models from parameters initialized randomly is essential for achieving good performance on MSG tasks.

Adversarial Evaluation
We adopt an adversarial evaluation similar to Libovický et al. (2018), in which one source is replaced with a randomly selected sentence. As shown in Table 8, both sources play important parts, and the French source is more important than the German source (Randomized Fr vs. Randomized De).

Case Study
An example from the multi-source translation task is shown in Table 9. The four outputs at the bottom of the table are generated by the last four models in Table 3. We find that the single-source models make different errors (e.g., "each hospitals" and "travelling clinics") and that the multi-source models fix some of these errors by taking both sources into account. Additionally, DUALBART still outputs the erroneous "weekly", while TRICE successfully outputs "weekend". We believe TRICE is better than the baselines because multiple sources are complementary and the fine encoder can capture finer cross-source information, which helps correct translation errors.
Related Work

Multi-source Sequence Generation
Multi-source sequence generation includes multi-source translation (Zoph and Knight, 2016), automatic post-editing (Chatterjee et al., 2017), multi-document summarization (Haghighi and Vanderwende, 2009), system combination for NMT (Huang et al., 2020), and document-level machine translation (Wang et al., 2017), among others. For these tasks, researchers usually leverage multi-encoder architectures to achieve better performance (Zoph and Knight, 2016; Zhang et al., 2018; Huang et al., 2019). To address the data scarcity problem in MSG, some researchers generate pseudo corpora (Negri et al., 2018; Nishimura et al., 2020) to augment the corpus size, while others make use of pretrained autoencoding models (e.g., BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020)) to enhance specific MSG tasks (Correia and Martins, 2019; Lee et al., 2020; Lee, 2020). Different from these works, we propose a task-agnostic framework to transfer pretrained Seq2Seq models to multi-source sequence generation tasks and demonstrate the generalizability of our framework.

Table 9: Example of multi-source translation. Some erroneous parts are highlighted by underlines. MBART-TRANS (De/Fr) takes a single source (De/Fr) as input, while DUALBART and TRICE take both sources as input. We believe that multiple sources are complementary and that TRICE can correct errors by capturing finer cross-source information.
Reference-En: Each of these weekend clinics provides a variety of medical care.
MBART-TRANS (De): Each weekend hospitals offers medical care in a number of areas.
MBART-TRANS (Fr): This travelling clinics provides a variety of healthcare services.
DUALBART: Each of these weekly hospitals provides healthcare in a variety of areas.
TRICE: Each of these weekend clinics offers a variety of health care.

Conclusion
We propose a novel task-agnostic framework, TRICE, to conduct transfer learning from single-source sequence generation, including self-supervised pretraining and supervised generation, to multi-source sequence generation. With the help of the proposed gradual finetuning method and the novel MSG model equipped with coarse and fine encoders, our framework outperforms all baselines on three different MSG tasks across four different data scales, which shows its effectiveness and generalizability.
Figure 1: Overview of our framework. "A", "B", and "C" denote sentences in different languages. After being pretrained on unlabeled data, the single-source sequence generation (SSG) model is finetuned on single-source labeled data. Then, the SSG model is extended to the MSG model by adding a fine encoder on top of the pretrained encoder (i.e., the coarse encoder). Finally, the MSG model is finetuned on the multi-source data. The proposed framework aims to reduce the pretrain-finetune discrepancy and learn better multi-source representations.

Figure 2: The architecture of our framework. Multiple sources are first concatenated and encoded by the coarse encoder and then encoded by the fine encoder to capture fine-grained cross-source information. Finally, the representations are utilized by the decoder to generate the target sentence. For simplicity, this figure only illustrates the situation where the input contains two sources (K = 2).

Table 2: Results on the automatic post-editing task (extremely low- and high-resource). "DUALBART": a method to leverage pretrained Seq2Seq models, adapted from "DUALBERT". Please refer to Appendix A.3 for detailed descriptions of the baselines (the same below).

Table 6: Ablation study. The case-sensitive BLEU scores are calculated on the development set of the APE task for all analysis experiments. Note that we remove only one component at a time.

Table 8: Adversarial evaluation on the multi-source translation task. "Randomized Fr/De" denotes that the Fr/De source is replaced with a randomly selected sentence.