GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator

Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the abilities of language understanding and generation in a single model. Our model, named GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution and the discriminator predicts whether each target token sampled from this distribution is incorrect. Tokens in the target sentence are then replaced with misclassified sampled tokens to construct a noisy previous context, from which the gold sentence is generated. Together, the two tasks improve both language understanding and generation by selectively using the denoising data. Extensive experiments on language generation benchmarks show that GanLM, with its strong language understanding capability, outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance.


Introduction
The pre-training-then-fine-tuning paradigm has proven successful in many natural language processing tasks (Devlin et al., 2019; Schick and Schütze, 2021). While there are various pre-training approaches for encoder-only architectures (Clark et al., 2020; Conneau et al., 2020), encoder-decoder pre-training, which is essential for natural language generation, remains underexplored.

Figure 1: A pre-training sample of our method, where replaced token detection (discriminator) and replaced token denoising (generator) are used for pre-training. The discriminator classifies each generated token as REPLACED or ORIGINAL, where REPLACED denotes that the predicted token differs from the gold token. Red font denotes incorrect predictions.

To pre-train the entire encoder-decoder model, BART (Lewis et al., 2020) proposes a denoising language modeling objective and T5 (Raffel et al., 2020) pre-trains the model with a span corruption objective. Furthermore, mBART (Liu et al., 2020) and mT5 (Xue et al., 2021) extend them to multilingual pre-trained language models.
Unlike most encoder-decoder pre-training methods, which simply apply sequence-to-sequence tasks on a single encoder-decoder architecture, we explore pre-training the model in a GAN-style manner with an auxiliary discriminator. GAN (Goodfellow et al., 2014) performs well on both text and image generation tasks by combining a generator and a discriminator. It aims to improve the ability of the generator to produce high-quality samples, which is important for encoder-decoder pre-training when transferring to downstream generation tasks. Similarly, MaskGAN (Fedus et al., 2018) shows that GAN-like training can improve the quality of an autoregressive language model. Therefore, it is intuitive to leverage GAN to empower encoder-decoder pre-training by unifying language understanding and generation.
In this work, we propose the pre-training framework GANLM, which uses GAN-style learning to improve the transferability of pre-trained language models for natural language generation. Specifically, the encoder reads the masked source sentence and the generator produces the target distribution. The discriminator then distinguishes whether each token sampled from the target distribution matches the gold target sentence (replaced token detection). Tokens misclassified by the discriminator are regarded as hard tokens for the generator to predict accurately. We replace the original tokens in the target sentence with the misclassified sampled ones to construct a noisy previous context for predicting the target sentence (replaced token denoising). In Figure 1, the generator predicts the masked words "guardian watered", where the incorrect token "guardian" and the correct token "watered" are both misclassified by the discriminator (as ORIGINAL and REPLACED, respectively). Next, we resample a different token "watering" from the generated distribution. Consequently, the target tokens "gardener watered" are replaced with the sampled tokens "guardian watering" to construct the noisy sample, and the generator predicts each next word conditioned on the previous noisy tokens (replaced token denoising). By combining the two tasks, GANLM strengthens generation performance with the enhanced language understanding capability gained from the replaced token detection task.
Our method is effective for text generation and can be extended to natural language understanding tasks. We pre-train GANLM on large-scale monolingual corpora and evaluate the performance of our English pre-trained model GANLM and multilingual model GANLM-m on various downstream tasks, including text summarization, machine translation, and data-to-text generation. Experimental results demonstrate that our method substantially outperforms previous pre-trained encoder and sequence-to-sequence models on generation tasks. Our method is further tested on GLUE and XNLI (Conneau et al., 2018) to validate the transferability of the pre-trained model. Analytic experiments emphasize the importance of the discriminator in both the pre-training and fine-tuning stages, leading to better performance.

Model Overview
Our GAN-style pre-trained model comprises a generator (G) and a discriminator (D), which are both encoder-decoder frameworks conditioned on the same encoder (Enc). In Figure 2, the encoder reads the masked sentence and the generator decoder produces the target distribution. The discriminator decoder then distinguishes whether each token in the sampled target sentence matches the gold reference (replaced token detection). Tokens in the gold target sentence are replaced with those misclassified by the discriminator to construct the noisy sample, which is fed into the generator decoder to predict the target sentence (replaced token denoising).
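To make the layout concrete, here is a minimal PyTorch sketch of the shared-encoder design, under assumed, illustrative hyperparameters and module names (this is not the released implementation): one encoder feeds both a 12-layer generator decoder with causal self-attention and a smaller bidirectional discriminator decoder.

```python
import torch
import torch.nn as nn

class GanLMSketch(nn.Module):
    """Illustrative layout: shared encoder, generator decoder (causal),
    tiny discriminator decoder (bidirectional). Sizes follow the base setting."""

    def __init__(self, vocab_size=32000, d_model=768, nhead=12, ffn=3072):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, ffn, batch_first=True), num_layers=12)
        self.gen_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, ffn, batch_first=True), num_layers=12)
        self.disc_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, ffn, batch_first=True), num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight      # tied input/output embeddings
        self.detect_head = nn.Linear(d_model, 2)     # REPLACED vs. ORIGINAL logits

    def forward(self, src_ids, trg_in_ids, sampled_ids):
        memory = self.encoder(self.embed(src_ids))   # shared encoder over the masked source
        t = trg_in_ids.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=trg_in_ids.device), diagonal=1)
        gen_h = self.gen_decoder(self.embed(trg_in_ids), memory, tgt_mask=causal)  # autoregressive
        disc_h = self.disc_decoder(self.embed(sampled_ids), memory)                # no mask: bidirectional
        return self.lm_head(gen_h), self.detect_head(disc_h)
```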

Masked Sequence Generator
Given a monolingual sentence $x = (x_1, \ldots, x_n)$ with $n$ words from the dataset $D_k$ of language $L_k \in L_{all} = \{L_1, \ldots, L_K\}$ ($|L_{all}| = K$), some random spans of contiguous tokens in $x$ are corrupted to form the source sentence $x^{src} = (x_1, \ldots, x_{\setminus u:v}, \ldots, x_n)$, where $x_{\setminus u:v}$ is a masked span of $x_{u:v}$ in which the fragment from position $u$ to $v$ is corrupted by [MASK]. Given $x^{src}$, the generator predicts the original identities of the masked tokens $x^{trg} = (x_{\setminus 1}, \ldots, x_{u:v}, \ldots, x_{\setminus n})$ autoregressively:

$$x^{trg}_t = \text{Enc-Dec}(x^{src}, x^{trg}_{<t}; \theta_E, \theta_G)$$

where $\theta_E$ and $\theta_G$ denote the encoder and decoder parameters of the generator, and Enc-Dec denotes an encoder-decoder model. The generator predicts the token $x^{trg}_t$ at position $t$ based on the previous tokens. The training objective of sequence-to-sequence masked language modeling (S2S-MLM) on the dataset $D_k$ of language $L_k$ is defined as:

$$\mathcal{L}^k_{S2S\text{-}MLM} = \mathbb{E}_{x \sim D_k}\left[-\log P(x^{trg} \mid x^{src}; \theta_E, \theta_G)\right]$$

where $x^{src}$ and $x^{trg}$ are derived from $x$.
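As an illustration of the S2S-MLM input construction, the toy sketch below (assumed tokenization and span sampling, not the paper's exact corruption procedure) masks random contiguous spans in the source while the gold sentence is kept as the target:

```python
import random

MASK = "[MASK]"

def corrupt_spans(tokens, mask_ratio=0.15, avg_span=3, seed=None):
    """Toy span corruption: mask ~15% of tokens in contiguous spans of
    average length 3, returning (masked source, gold target)."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, int(n * mask_ratio))
    masked = set()
    while len(masked) < budget:
        span = max(1, round(rng.expovariate(1.0 / avg_span)))   # span length, mean ~3
        start = rng.randrange(n)
        masked.update(range(start, min(n, start + span)))
    src = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    return src, list(tokens)

src, trg = corrupt_spans("the gardener watered the plants in the garden".split(), seed=0)
# src contains [MASK] spans; trg is the original sentence to be recovered
```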

Replaced Token Detection
The generator outputs a distribution for each target token, and we create a sampled sentence $\tilde{x}^{trg}$ by randomly sampling tokens from this distribution. The discriminator distinguishes whether each token in $\tilde{x}^{trg}$ is replaced compared to $x^{trg}$. Given the target distribution $P_G(x^{trg}_t \mid x^{src})$ ($x^{trg}_t \in x^{trg}$) from the generator, we construct $\tilde{x}^{trg}$ for the discriminator:

$$\tilde{x}^{trg} = \text{REPLACE}(x^{trg}, \tilde{x}_t, t), \quad \tilde{x}_t \sim P_G(x^{trg}_t \mid x^{src})$$

where REPLACE(·) replaces the unmasked token at target position $t$ in $x^{trg}$ with the token $\tilde{x}_t$ sampled from the generated distribution $P_G(x^{trg}_t \mid x^{src})$. Given the source sentence $x^{src}$ and the encoder $\theta_E$, the decoder of the discriminator $\theta_D$ obtains a sequence of hidden representations $H_d = (h_1, \ldots, h_n)$ by feeding the sampled sentence $\tilde{x}^{trg}$ to the discriminator decoder:

$$H_d = \text{Enc-Dec}(x^{src}, \tilde{x}^{trg}; \theta_E, \theta_D)$$

where $\theta_E$ and $\theta_D$ denote the encoder and decoder parameters of the discriminator. The discriminator decoder $\theta_D$ adopts a bidirectional language model to classify each input token by exploiting both past and future representations. Given the representations $H_d$, the discriminator classifies each sampled token in $\tilde{x}^{trg}$ into the REPLACED or ORIGINAL label with a sigmoid function $\sigma$:

$$P_D(\tilde{x}^{trg}_t \mid x^{src}) = \sigma(h_t W_d)$$

where $W_d \in \mathbb{R}^{d_e \times 2}$ is the matrix that projects the token representations onto the two categories (REPLACED or ORIGINAL) and $d_e$ is the model hidden size. The training objective of the replaced token detection task for the discriminator is:

$$\mathcal{L}^k_{D} = \mathbb{E}_{x \sim D_k}\left[-\sum_{t=1}^{n} \log P_D\big(\mathbb{1}(\tilde{x}^{trg}_t = x^{trg}_t) \mid \tilde{x}^{trg}, x^{src}\big)\right]$$

where $\mathbb{1}(\cdot)$ is the indicator function.

Figure 2: Overview of GANLM, including (a) replaced token detection and (b) replaced token denoising. The encoder reads the source sentence and the generator produces the target distribution, where the generator and discriminator are supervised by the gold labels in (a). The discriminator distinguishes whether the sampled tokens "guardian watered" are replaced (both tokens are misclassified in this example). For the correctly predicted token "watered", we obtain a different token "watering" by resampling. The target tokens are replaced with the misclassified tokens to construct the noisy input, which is used to predict the gold sentence "gardener watered [EOS]" in (b).
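A compact sketch of the detection step follows (illustrative tensor names; the two-way classification head and label convention are assumptions consistent with the description above): sample one token per target position from the generator's distribution, label each position ORIGINAL (1) if the sample matches the gold token and REPLACED (0) otherwise, and train the discriminator with a classification loss.

```python
import torch
import torch.nn.functional as F

def detection_loss(gen_logits, gold_ids, disc_logits):
    """gen_logits: (B, T, V) generator logits; gold_ids: (B, T) gold tokens;
    disc_logits: (B, T, 2) discriminator logits (index 1 = ORIGINAL, 0 = REPLACED)."""
    with torch.no_grad():
        sampled_ids = torch.multinomial(
            F.softmax(gen_logits, dim=-1).view(-1, gen_logits.size(-1)), 1
        ).view_as(gold_ids)                              # sample x~ from P_G
    labels = (sampled_ids == gold_ids).long()            # 1 = ORIGINAL, 0 = REPLACED
    loss = F.cross_entropy(disc_logits.view(-1, 2), labels.view(-1))
    return loss, sampled_ids, labels
```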

Replaced Token Denoising
Although our model structure is similar to GAN, the generator is trained with maximum likelihood rather than the standard GAN objective due to the difficulty of GAN training in NLP. We replace tokens in $x^{trg}$ with tokens misclassified by the discriminator to construct the noisy previous context $x^{trg}_f$. If the misclassified sampled token $\tilde{x}^{trg}_t = x^{trg}_t$ is labeled ORIGINAL, we resample a different token $\hat{x}_t \neq x^{trg}_t$ from the generated distribution as the replacement; if the misclassified sampled token $\tilde{x}^{trg}_t \neq x^{trg}_t$ is labeled REPLACED, it directly replaces $x^{trg}_t$ in the target sentence. Given the target sentence $x^{trg}$ and the generated probabilities $P_G$, we replace tokens in $x^{trg}$ with sampled tokens to form the previous noisy context:

$$x^{trg}_f = \text{REPLACE}(x^{trg}, \tilde{x}_v, v)$$

where $v = \{v_1, \ldots, v_p\}$ ($|v| = p$) denotes the positions in $x^{trg}$ of the misclassified tokens. The training objective of the replaced token denoising (DG) task based on the source sentence $x^{src}$ and the target noisy context $x^{trg}_f$ is:

$$\mathcal{L}^k_{DG} = \mathbb{E}_{x \sim D_k}\left[-\log P(x^{trg} \mid x^{trg}_f, x^{src}; \theta_E, \theta_G)\right]$$

where $x^{trg}$ is predicted from the previous noisy tokens $x^{trg}_f$ instead of the previous gold context.
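The following sketch (hypothetical tensor names) mirrors the replacement rule described above: at positions the discriminator misclassifies, the gold token is swapped for the sampled token; where the sample happens to equal the gold token, a different token is re-drawn from the generator's distribution.

```python
import torch

def build_noisy_context(gold_ids, sampled_ids, gen_probs, disc_pred, labels):
    """gold_ids, sampled_ids, disc_pred, labels: (B, T); gen_probs: (B, T, V).
    Returns the noisy previous context used for replaced token denoising."""
    noisy = gold_ids.clone()
    miscls = disc_pred != labels                           # positions the discriminator got wrong
    same = miscls & (sampled_ids == gold_ids)              # sample equals gold: resample a new token
    if same.any():
        probs = gen_probs[same].clone()                    # (N, V)
        probs.scatter_(-1, gold_ids[same].unsqueeze(-1), 0.0)   # forbid the gold token
        noisy[same] = torch.multinomial(probs, 1).squeeze(-1)   # draw a different token
    diff = miscls & (sampled_ids != gold_ids)              # sample differs from gold: use it directly
    noisy[diff] = sampled_ids[diff]
    return noisy
```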

Multi-task Learning
Given multilingual corpora $D_{all} = \{D_1, \ldots, D_K\}$ of $K$ languages, the pre-trained model with parameters $\{\theta_E, \theta_G, \theta_D\}$ is jointly trained over the $K$ languages to optimize the combined self-supervised objective:

$$\mathcal{L} = \sum_{k=1}^{K}\left(\mathcal{L}^k_{S2S\text{-}MLM} + \mathcal{L}^k_{DG} + \lambda \mathcal{L}^k_{D}\right)$$

where $\lambda = 10.0$ is the discriminator weight and $L_{all} = \{L_1, \ldots, L_K\}$. To improve model efficiency, a tiny discriminator decoder (4 layers) is adopted to assist the generator decoder (12 layers).
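Expressed as code, the per-language combination is simply the two generator losses plus the weighted discriminator loss (a trivial sketch with the loss names as defined above):

```python
def combined_objective(loss_s2s_mlm, loss_denoise, loss_detect, lam=10.0):
    """Combined self-supervised objective for one language:
    generator losses plus the lambda-weighted discriminator loss."""
    return loss_s2s_mlm + loss_denoise + lam * loss_detect
```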

Discriminator-enhanced Fine-tuning
To fully utilize the pre-trained parameters, we keep the auxiliary discriminator in downstream generation tasks (discriminator-enhanced fine-tuning) to enhance the generator, where both the pre-trained generator and discriminator are recycled. Given the annotated corpus $D_s$, the pre-trained model is jointly fine-tuned with the generation and replaced token detection objectives, where $x$ and $y$ are a parallel pair from $D_s$. The objective in the fine-tuning stage uses the original pair $x$ and $y$ without S2S-MLM. Only the generator $\{\theta_E, \theta_G\}$ is kept for inference, discarding the discriminator decoder $\theta_D$. Alternatively, the discriminator can also be fine-tuned separately on the downstream task.
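A rough sketch of one discriminator-enhanced fine-tuning step is given below; the model interface, returned values, and weighting are assumptions for illustration. The generator is trained with sequence-to-sequence cross-entropy on the parallel pair while the discriminator continues to detect replaced tokens in samples drawn from the generator.

```python
import torch.nn.functional as F

def finetune_step(model, optimizer, src_ids, trg_ids, disc_weight=1.0):
    """One hypothetical fine-tuning step on a parallel pair (src_ids, trg_ids).
    `model` is assumed to return generator logits, discriminator logits for
    tokens sampled from the generator, and the ORIGINAL/REPLACED labels."""
    gen_logits, disc_logits, labels = model(src_ids, trg_ids)
    gen_loss = F.cross_entropy(gen_logits.view(-1, gen_logits.size(-1)), trg_ids.view(-1))
    det_loss = F.cross_entropy(disc_logits.view(-1, 2), labels.view(-1))
    loss = gen_loss + disc_weight * det_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```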

Pre-training Details
Model Configuration In the experiments, we adopt a sequence-to-sequence base-setting Transformer architecture with a hidden size of 768, an FFN (feed-forward network) dimension of 3072, 12 attention heads, and 12 encoder/decoder layers. The maximum sequence length of the learned position embeddings in the encoder/decoder is set to 1024. All token embedding matrices and the output projection matrix are shared for model efficiency.
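For reference, the configuration above can be summarized as a small record (an illustrative snippet, not a loader for any released checkpoint):

```python
from dataclasses import dataclass

@dataclass
class GanLMBaseConfig:
    hidden_size: int = 768
    ffn_dim: int = 3072
    attention_heads: int = 12
    encoder_layers: int = 12
    generator_decoder_layers: int = 12
    discriminator_decoder_layers: int = 4   # tiny discriminator decoder
    max_positions: int = 1024               # learned position embeddings
    tie_embeddings: bool = True             # shared token embeddings / output projection
```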
Dataset Following previous work, our English pre-trained model GANLM is trained on 160GB of English monolingual data from BookCorpus, CC-News, OpenWebText, and CC-Stories. In addition, we pre-train GANLM-m on 6TB of multilingual data, as in the pioneering work (Ma et al., 2021), which is a combination of CC100, CC-Net, and Wikipedia covering 100 languages. All texts are tokenized by SentencePiece (Kudo and Richardson, 2018) and encoded with the dictionary from XLM-R (Conneau et al., 2020).
Optimization For S2S-MLM, we randomly mask 15% of the words in each instance with an average span length of 3 (Raffel et al., 2020). For replaced token detection, we set the discriminator weight λ = 10.0. We adopt Adam (Kingma and Ba, 2015) as the optimizer.

Multilingual Translation IWSLT-17 (5 languages) and WMT-10 (11 languages) are used for multilingual translation. For IWSLT-17, the English (En), German (De), Italian (It), Dutch (Nl), and Romanian (Ro) corpora are downloaded from the IWSLT-2017 benchmark. We use dev2010 for validation and tst2017 for testing. For WMT-10, we use the parallel data of 11 languages from the WMT benchmark for evaluation.
Data-to-Text Generation Data-to-text generation takes a set of triplets as input and produces a textual description. We use the WebNLG dataset (Gardent et al., 2017) for evaluation.

Fine-tuning Details
Abstractive Summarization During fine-tuning, we use the Adam (Kingma and Ba, 2015) optimizer with an initial learning rate of 1e-4, and the batch size is set to 2048 tokens on 8 V100 GPUs. The models are trained with label-smoothed cross-entropy with a smoothing ratio of 0.1.
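A minimal sketch of label-smoothed cross-entropy with a smoothing ratio of 0.1 is shown below; the exact normalization may differ from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, target, eps=0.1):
    """logits: (N, V); target: (N,). Mixes the gold-token loss with a
    uniform smoothing term weighted by eps."""
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)   # gold-token loss
    smooth = -logp.mean(dim=-1)                                # uniform-prior loss
    return ((1.0 - eps) * nll + eps * smooth).mean()

# e.g. optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```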

Neural Machine Translation
For the large-scale multilingual dataset WMT-10, our pre-trained model is fine-tuned on 32 V100 GPUs with a learning rate of 3e-4. For all bilingual translation tasks and the IWSLT-2017 benchmark, we adopt Adam with a learning rate of 1e-4 and set the batch size to 2048 tokens on 8 V100 GPUs.
Data-to-text Generation We use Adam with a learning rate of {8e-5, 1e-4} and set the batch size to 16 sentences on the WebNLG dataset.

Comparing Pre-training Objectives
To verify the potential of our pre-training task under a fair comparison, we re-implement previous pre-training tasks and pre-train the baselines on the same corpora for 500K steps, including BERT/mBERT (Devlin et al., 2019), ELECTRA (Clark et al., 2020), BART (Lewis et al., 2020)/mBART, and T5 (Raffel et al., 2020)/mT5. Table 1 reports the ROUGE and BLEU points on the summarization dataset XSum and the multilingual translation dataset IWSLT-17. All models have 12 encoder and 12 decoder layers with a hidden size of 768. We observe that the encoder-decoder pre-trained models (T5/mT5) outperform the pre-trained encoders (ELECTRA, BERT/mBERT), which corroborates that encoder-decoder pre-training is more beneficial for downstream generation tasks. Further experiments show the importance of the discriminator and replaced token denoising, and demonstrate that the replaced token detection task alone still brings improvement by strengthening the encoder shared by the generator and discriminator. Besides, the replaced token detection task is also helpful for downstream language understanding tasks by producing a more powerful encoder. Lastly, the results verify that fine-tuning with the help of the pre-trained auxiliary discriminator further improves performance.

Results of GANLM
The English pre-trained model GANLM is evaluated on the abstractive text summarization task using ROUGE (Lin, 2004) scores.
XSum As shown in Table 2, the pre-training methods achieve significant improvements over the strong PTRNET baseline without pre-training. Sequence-to-sequence pre-trained models such as UniLMv2 + s2s-ft, where the pseudo-masked technique is applied in the fine-tuning stage, outperform the other pre-training baselines. Our method beats all pre-training baselines by a large margin with the discriminator-enhanced fine-tuning strategy, which emphasizes the importance of the fine-tuning strategy for downstream performance.

CNN / DailyMail
Our method is also evaluated on the CNN / DailyMail dataset in Table 2. The comparison further indicates that our method obtains strong generation performance by leveraging the discriminator.

Results of GANLM-m
To evaluate the multilingual pre-trained model GANLM-m, we report BLEU (Papineni et al., 2002) scores for machine translation and ROUGE (Lin, 2004) scores for text summarization and data-to-text generation.
WikiLingua Table 3 reports the average ROUGE scores over 18 WikiLingua languages. The large improvement over other pre-training methods demonstrates the summarization ability of GANLM-m.

WMT14 En-De
The results on bilingual translation are presented in Table 4. We observe that the proposed GANLM outperforms all previous works in the high-resource machine translation scenario (> 4M sentence pairs).

WMT14 En-Fr
We further conduct experiments on the WMT14 En-Fr bilingual translation task. Table 4 shows that GANLM-m still brings significant improvement to the downstream task with large-scale machine translation fine-tuning data (> 40M sentence pairs).

WMT16 En-Ro
For the low-resource setting (< 1M sentence pairs), there is an average gain of +4 BLEU points over the Transformer baseline in Table 5. With the same back-translation data, GANLM-m further improves model performance and still beats the other baselines.
WebNLG Table 8 presents the performance on the data-to-text generation task, showing that GANLM outperforms the multilingual sequence-to-sequence pre-training baselines mBART and mT5 by +2 ROUGE-L points on both languages.

Analysis
Ablation Study To analyze the effect of the proposed pre-training and fine-tuning strategies, we conduct an ablation study of each component of our method in Table 9. The ablations verify the merits of replaced token detection and replaced token denoising. Furthermore, our model with the replaced token denoising task obtains the best performance by jointly fine-tuning the generator (G) and discriminator (D).
Low-resource Setting To further analyze the performance of GANLM-m given different sizes of downstream parallel data, we randomly extract K percent of the sentence pairs from the full WMT-16 En→Ro training data as the fine-tuning parallel data. We set K = {10%, 20%, . . . , 100%} and compare our method with the Transformer baseline. Figure 3 shows the BLEU points of our pre-trained multilingual model and the baseline. When the parallel data size is small, the baseline without a pre-trained model produces unsatisfactory results. In Figure 3(a), GANLM fine-tuned on nearly half of the data (purple line, 50%) beats the baseline trained on all pairs (green line, 100%), exemplifying the effectiveness of our method in low-resource scenarios.

Table 6: En→X evaluation results for bilingual (1→1), one-to-many (1→N), and many-to-many (N→N) models on WMT-10. The languages are ordered from high-resource languages (left) to low-resource languages (right).

Table 9: Ablation study of our method on the test set of the abstractive summarization benchmark XSum, where GANLM is fine-tuned on the downstream task with different pre-training and fine-tuning strategies.
Discussion on Discriminator The weight λ and the number of layers of the discriminator are key factors for our pre-training task. As shown in Figure 4(a), we vary the discriminator weight to find a balance between the generator and discriminator objectives, studying the performance of GANLM with λ ranging from 5.0 to 100.0. The pre-training tasks are balanced when the discriminator weight is 10.0. Moreover, we find it more efficient to use a tiny discriminator (3 ∼ 6 layers), as shown in Figure 4(b).

Figure 4: Effect of (a) the discriminator weight and (b) the number of discriminator layers on the WMT14 En→De task.

Multilingual Representations
We select sentences of different languages in WMT-10 and visualize their representations (Maaten and Hinton, 2008) from the last two encoder layers in Figure 5, using our multilingual model fine-tuned on WMT-10 and the multilingual baseline. The first hidden state of the encoder is adopted as the sentence representation. Compared to Figures 5(a) and 5(b) of the baseline, different languages become closer and tend to overlap with each other in Figures 5(c) and 5(d) of our method, demonstrating that our method effectively aligns representations of different languages in a shared space.
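The visualization can be reproduced along these lines (a sketch with assumed, precomputed inputs: one encoder first-state vector per sentence and a parallel list of language tags):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_language_space(sentence_reps: np.ndarray, lang_ids: list):
    """Project sentence representations to 2D with t-SNE and color by language."""
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(sentence_reps)
    for lang in sorted(set(lang_ids)):
        idx = [i for i, l in enumerate(lang_ids) if l == lang]
        plt.scatter(points[idx, 0], points[idx, 1], s=4, label=lang)
    plt.legend()
    plt.show()
```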

Massively Multilingual Translation
The compared massively multilingual translation models adopt a large hidden size of 1024. Our pre-trained model is fine-tuned on the same training data as DeltaLM + Zcode (Yang et al., 2021). Compared to M2M-124 large, GANLM-m, with less training data and only 430M parameters, relies more on the transferability of the cross-lingual pre-trained model. In Table 10, our method outperforms DeltaLM + Zcode in the zero-shot translation directions (Avg X→Y) by +1.5 BLEU points, benefiting from the cross-lingual zero-shot transfer of our pre-trained model.
Comparison of Pre-training Cost Our English pre-trained model GANLM is trained for nearly 2 weeks on 128 A100 GPUs (40GB), with 500K training steps and a batch size of 8K sequences. Compared to the re-implemented T5 (Raffel et al., 2020), our method is only about 0.5 times slower with the same number of training steps, but it yields significant improvements on machine translation, text summarization, and data-to-text generation.
Language Understanding Our method can be easily extended to various downstream language understanding tasks. We use the GLUE benchmark to evaluate the English pre-trained model GANLM and the XNLI dataset (Conneau et al., 2018) to evaluate multilingual language understanding capability. Our method is tested on each language separately by fine-tuning the generator (G) or discriminator (D) on the XNLI dataset. In Table 11, our English pre-trained model performs better than RoBERTa. Additionally, our pre-trained model outperforms the previous cross-lingual pre-trained encoder XLM and the pre-trained encoder-decoder model mT5 in Table 12 when fine-tuning on the English training set (cross-lingual zero-shot transfer).

Related Work
Language models based on large-scale data and self-supervised objectives have been widely used for NLP tasks. Pre-training a Transformer encoder (Vaswani et al., 2017; Devlin et al., 2019; Cui et al., 2020) with masked language modeling (MLM), or a decoder (Radford et al., 2018, 2019; Schick and Schütze, 2021), brings significant improvements for downstream natural language understanding (NLU) tasks. Many variants (Sun et al., 2019; Clark et al., 2020; Chi et al., 2022) have been proposed to enhance pre-trained models. There have also been numerous attempts to pre-train a sequence-to-sequence model with generative objectives, such as T5 (Raffel et al., 2020).
Recent works (Conneau et al., 2020; Chi et al., 2021b) aim to learn cross-lingual representations in multiple languages. mBART (Liu et al., 2020) pre-trains a sequence-to-sequence Transformer model with a denoising objective. mT5 (Xue et al., 2021) extends the span corruption task to multilingual training, and mT6 (Chi et al., 2021a) amplifies the generation task by introducing a partially non-autoregressive objective. More multilingual pre-trained models (Ma et al., 2020; Chi et al., 2020) have been proposed to solve cross-lingual generation tasks.
Conclusion
In this work, we introduce GANLM, a state-of-the-art pre-trained encoder-decoder framework for both language generation and understanding tasks, trained on large-scale corpora. Our GAN-style models are pre-trained with replaced token detection and replaced token denoising by introducing an auxiliary discriminator. Extensive experiments prove the effectiveness of GANLM on various language generation and translation benchmarks. We further verify the capability of the pre-trained model on multiple downstream understanding tasks.