ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation

Pre-training techniques are now ubiquitous in the field of natural language processing. ProphetNet is a pre-training-based natural language generation method that shows powerful performance on English text summarization and question generation tasks. In this paper, we extend ProphetNet to other domains and languages, and present the ProphetNet family of pre-trained models, named ProphetNet-X, where X can be English, Chinese, multi-lingual, and so on. We pre-train a cross-lingual generation model ProphetNet-Multi, a Chinese generation model ProphetNet-Zh, and two open-domain dialog generation models ProphetNet-Dialog-En and ProphetNet-Dialog-Zh. We also provide a PLG (Programming Language Generation) model, ProphetNet-Code, to demonstrate generation performance beyond NLG (Natural Language Generation) tasks. In our experiments, ProphetNet-X models achieve new state-of-the-art performance on 10 benchmarks. All the models of ProphetNet-X share the same model structure, which allows users to easily switch between different models. We make the code and models publicly available, and we will keep releasing more pre-trained models and fine-tuning scripts.


Introduction
In recent years, quite a few natural language generation pre-training models have been proposed (Song et al., 2019; Brown et al., 2020). Downstream generation tasks benefit greatly from these large-scale pre-trained models in terms of fluency and accuracy. Researchers also extend these general pre-training works to specific domains: DialoGPT (Zhang et al., 2019) is extended from GPT-2 (Radford et al., 2019) for dialog systems, mBART is extended from BART for multi-lingual generation, CodeBERT is extended from BERT (Devlin et al., 2018) for programming language modeling, etc.
Although there are pre-trained models for some specific domains, it is not convenient for users to find and set them up. Besides, even for models in the same pre-training family with the same model structure and pre-training tasks, the code and details vary a lot because of different implementations and backend choices.
The main contributions of ProphetNet-X can be described as follows:
• We provide a family of pre-trained models named ProphetNet-X, with six models covering English and Chinese open-domain natural language generation, English and Chinese dialog generation, multi-lingual generation, and code generation.
• All the pre-trained ProphetNet-X models share the same model structure. Users only need to modify a single model file to use it for different language or domain tasks, as illustrated in the sketch after this list.
• We conduct extensive experiments; the results show that ProphetNet-X models achieve new state-of-the-art performance on 10 publicly available benchmarks.
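As a minimal sketch of this shared interface, the snippet below loads two family members through the Hugging Face transformers library. The class names and the two checkpoint identifiers are those of the released ProphetNet and xProphetNet models; whether the remaining ProphetNet-X checkpoints can be loaded in exactly this way is an assumption, and the snippet is illustrative rather than official usage.

```python
# A minimal sketch, assuming the ProphetNet-X checkpoints are available in a
# Hugging Face transformers-compatible format; only the checkpoint name changes
# between family members, while the usage pattern stays the same.
from transformers import (
    ProphetNetForConditionalGeneration,
    ProphetNetTokenizer,
    XLMProphetNetForConditionalGeneration,
    XLMProphetNetTokenizer,
)

# ProphetNet-En: English open-domain generation.
en_tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased")
en_model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/prophetnet-large-uncased")

# ProphetNet-Multi: the same architecture with a multi-lingual vocabulary and weights.
multi_tokenizer = XLMProphetNetTokenizer.from_pretrained("microsoft/xprophetnet-large-wiki100-cased")
multi_model = XLMProphetNetForConditionalGeneration.from_pretrained("microsoft/xprophetnet-large-wiki100-cased")
```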

Architecture
We train the different ProphetNet-X models based on ProphetNet. ProphetNet is an encoder-decoder natural language generation model with future n-gram prediction. It leverages stacked Transformer encoder layers and stacked multi-stream self-attention Transformer decoder layers. ProphetNet aims to prevent overfitting on strong local correlations such as 2-gram combinations, and deploys future token prediction to enhance the autoregressive generation ability. Given the input sequence x = (x_1, ..., x_M) and output sequence y = (y_1, ..., y_T), ProphetNet-X replaces the auto-regressive prediction dependency p(y_t | y_{<t}, x) with p(y_{t:t+n-1} | y_{<t}, x). First, ProphetNet-X obtains the encoded hidden states with the stacked Transformer encoder layers: H_enc = Encoder(x_1, ..., x_M). Then, the decoder with n-stream self-attention predicts the next n tokens at each time step: p(y_t | y_{<t}, x), ..., p(y_{t+n-1} | y_{<t}, x) = Decoder(y_{<t}, H_enc).
The optimization target of ProphetNet-X can be described as:

$$\mathcal{L} = -\sum_{j=0}^{n-1} \alpha_j \cdot \underbrace{\left( \sum_{t=1}^{T-j} \log p_\theta(y_{t+j} \mid y_{<t}, x) \right)}_{\text{future } n\text{-gram loss}}$$

where $\alpha_j$ is the weight assigned to the $j$-th future-token prediction stream. The details of ProphetNet and multi-stream self-attention can be found in the original ProphetNet paper.
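To make the objective concrete, the following is a minimal sketch (not the released training code) of the future n-gram loss, under the assumption that the decoder exposes one logits tensor per prediction stream; the function name and tensor layout are chosen here for illustration.

```python
import torch
import torch.nn.functional as F


def future_ngram_loss(stream_logits, targets, alphas, pad_id=0):
    """Weighted sum of per-stream losses; stream j predicts the token j steps ahead.

    stream_logits: list of n tensors of shape (batch, T, vocab), where stream j
                   at position t scores the token y_{t+j}.
    targets:       (batch, T) tensor of gold output token ids.
    alphas:        list of n weights, one per prediction stream.
    """
    total = 0.0
    for j, (logits, alpha) in enumerate(zip(stream_logits, alphas)):
        # Shift the gold sequence left by j so that position t is aligned with y_{t+j};
        # trailing positions with no future token are padded and ignored by the loss.
        shifted = targets.new_full(targets.shape, pad_id)
        if j == 0:
            shifted = targets.clone()
        else:
            shifted[:, :-j] = targets[:, j:]
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            shifted.reshape(-1),
            ignore_index=pad_id,
        )
        total = total + alpha * loss
    return total


# Toy usage with n = 2 streams: stream 0 is the usual language-modeling loss,
# stream 1 supervises the prediction of the token one step further ahead.
stream_logits = [torch.randn(2, 5, 100), torch.randn(2, 5, 100)]
targets = torch.randint(1, 100, (2, 5))
print(future_ngram_loss(stream_logits, targets, alphas=[1.0, 1.0]))
```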

Pre-training Corpus
In this section, we introduce the pre-training corpora for ProphetNet-X. For ProphetNet-Zh, we collect Chinese Wikipedia, CLUE (Xu et al., 2020b), and Chinese Common Crawl data, reaching 160GB in total. For traditional Chinese data, we first use OpenCC to convert it to simplified Chinese. The pre-training corpus includes common web pages, online forums, comment websites, Q&A websites, Chinese Wikipedia, and other encyclopedia websites. We build a simplified Chinese character vocabulary with a size of 9,360.
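As an illustration of the traditional-to-simplified conversion step, the snippet below uses the OpenCC Python bindings; the package name `opencc` and the `t2s` conversion profile are assumptions about the installed OpenCC build, and any equivalent traditional-to-simplified profile works the same way.

```python
# A minimal sketch of traditional-to-simplified Chinese conversion with OpenCC,
# assuming the `opencc` Python package and its "t2s" (traditional-to-simplified) profile.
import opencc

converter = opencc.OpenCC("t2s")
# "Chinese natural language processing" written in traditional characters:
print(converter.convert("漢語自然語言處理"))  # -> 汉语自然语言处理
```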
For ProphetNet-Multi, besides the Wiki-100 corpus, we select 52 common languages and collect and clean multi-lingual data from Common Crawl. After cleaning and tokenization, the Common Crawl corpus sizes we use are described in Table 1. The ProphetNet-Multi vocabulary is the same as the 250k SentencePiece model of XLM-R (Conneau et al., 2019).
For ProphetNet-Dialog-En, we utilize the Reddit comments dataset (Zhou et al., 2018). We first load the weights of ProphetNet-En and then use 60 million cleaned sessions for pre-training. For ProphetNet-Dialog-Zh, we use the pre-training corpus from prior work, and we additionally crawled 18.2 million dyadic dialogues (conversations between two persons) of at least 2 turns (one turn denotes one utterance from one person) from the Douban group, a popular social networking service in China. The pre-training corpus size comparison between that prior corpus and ours is shown in Table 2. We also initialize from the pre-trained ProphetNet-Zh model before dialog pre-training, so the model already contains external knowledge from the open-domain Chinese corpus.
For ProphetNet-Code, we conduct pre-training on both PLs (Programming Languages) and their describing NL (Natural Language). We use the pre-training corpus of CodeSearchNet (Husain et al., 2019). It covers 6 programming languages: Go, Java, JavaScript, PHP, Python, and Ruby. We employ the same SentencePiece tokenizer as CodeBERT. The tokenizer is used for both PL and NL, with a vocabulary size of 50,365.
For ProphetNet-En, we directly take the model pre-trained in the original ProphetNet work. It is pre-trained with 160GB of English raw text, including Wikipedia, books, stories, news, and web text. The vocabulary of ProphetNet-En is the same as the BERT subword vocabulary, which is based on BPE subwords with a maximum-length matching algorithm. Its vocabulary size is 30,522.

Pre-training Settings
We carry out pre-training with ProphetNet models that have a 12-layer encoder and a 12-layer decoder. The hidden size is 1,024, the feed-forward size is 4,096, and the future tokens' prediction length is 2. The max sequence lengths of both the input and the output are set to 512.
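For reference, these hyper-parameters roughly correspond to the configuration sketched below using the Hugging Face ProphetNetConfig; the parameter names are those of the library, not of the original training scripts, so treat the mapping as an assumption.

```python
# A rough sketch of the pre-training configuration described above, expressed with
# Hugging Face's ProphetNetConfig; the original training code may differ in detail.
from transformers import ProphetNetConfig

config = ProphetNetConfig(
    num_encoder_layers=12,        # 12-layer encoder
    num_decoder_layers=12,        # 12-layer decoder
    hidden_size=1024,             # hidden size
    encoder_ffn_dim=4096,         # feed-forward size
    decoder_ffn_dim=4096,
    ngram=2,                      # future tokens' prediction length
    max_position_embeddings=512,  # max input/output sequence length
)
```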

Finetuning Benchmarks
For the different ProphetNet-X models, we select different benchmarks to evaluate each of them.
For ProphetNet-Multi, we follow Unicoder_FNP and evaluate on XGLUE (Liang et al., 2020) for cross-lingual zero-shot generation tasks. The pre-trained multi-lingual model is fine-tuned with English supervised data and performs inference on English as well as unseen-language data. The benchmark includes NTG (News Title Generation) and QG (Question Generation) tasks.
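As a usage sketch of this zero-shot setting, the snippet below generates a news title for a non-English article with a multi-lingual checkpoint fine-tuned only on English XGLUE-NTG data. The checkpoint identifier follows the publicly released xProphetNet XGLUE models and should be treated as an assumption, and the German input sentence ("The researchers released a new pre-trained model for text generation.") is a toy example.

```python
# A minimal sketch of zero-shot cross-lingual news title generation, assuming the
# publicly released XGLUE-NTG fine-tuned xProphetNet checkpoint on the Hugging Face Hub.
from transformers import XLMProphetNetForConditionalGeneration, XLMProphetNetTokenizer

name = "microsoft/xprophetnet-large-wiki100-cased-xglue-ntg"  # assumed checkpoint id
tokenizer = XLMProphetNetTokenizer.from_pretrained(name)
model = XLMProphetNetForConditionalGeneration.from_pretrained(name)

# Fine-tuned only on English article-title pairs, applied here to a German article body.
article = "Die Forscher veröffentlichten ein neues vortrainiertes Modell für die Textgenerierung."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
title_ids = model.generate(**inputs, num_beams=4, max_length=24, early_stopping=True)
print(tokenizer.decode(title_ids[0], skip_special_tokens=True))
```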
For ProphetNet-Dialog-Zh, we use the STC (Shang et al., 2015) single-turn open-domain dialog dataset (in its cleaned version) and a real-world Xiaoice Chinese dialog dataset for evaluation.
For ProphetNet-Code, we evaluate the performance on the code summarization task from CodeXGLUE (Lu et al., 2021).

Results
For ProphetNet-Zh, we see significant improvements in Table 3. TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004) are extractive baselines, and the others are abstractive baselines. MTF-S2S-single (Xu et al., 2020a) and MTF-S2S-multi denote single-task and multi-task fine-tuning on the MATINF dataset. We see consistent gains on both the Chinese question answering task and the summarization tasks. For ProphetNet-Multi, we show the results in Table 4. Unicoder_DAE and Unicoder_FNP are pre-trained on Wiki-100 with the denoising auto-encoder task and the ProphetNet objective, respectively. Comparing the results of Unicoder_FNP and ProphetNet-Multi, we see that a larger pre-training corpus improves both the supervised English results and the zero-shot performance on the other languages. Compared with the other baseline methods, ProphetNet-Multi achieves new state-of-the-art results on both the NTG and QG tasks.
For English open-domain dialog generation, we show the results in Table 5 and Table 6. Compared with the strong, recently proposed PLATO (Bao et al., 2020), ProphetNet-Dialog-En achieves performance improvements.

Setting               Win   Lose   Tie   Kappa
Ours-C vs Xiaoice-C   68%   26%    6%    0.73
Ours-C vs Xiaoice-S   76%   24%    0%    0.65
Ours-S vs Xiaoice-S   81%   19%    0%    0.67

Table 8: Human evaluation results for ProphetNet-Dialog-Zh on the real-world Xiaoice dataset. Here, Ours denotes ProphetNet-Dialog-Zh and Xiaoice denotes the old retrieval-based Xiaoice dialog system. -S (single-turn) means only the last turn is fed to our model or to Xiaoice's traditional single-turn retrieval model; -C (context) means the dialog history is fed to our model or to Xiaoice's traditional multi-turn retrieval model.

For ProphetNet-Dialog-Zh, the model can generate fluent and meaningful responses but obtains lower BLEU scores because of the writing style difference. Thus, we conduct a human evaluation as in (Zhao et al., 2020). We randomly collect 500 single-turn and 500 multi-turn context-response pairs from the online logs of the real-world dialog system Xiaoice. Then, we recruit 3 native speakers as human annotators. The annotators judge which response is better, based on the informativeness, consistency, and fluency of the responses.

Related Work
ProphetNet is the most closely related work, since we carry out pre-training based on it. Other related works involve pre-training in different domains. For English generation pre-training, MASS (Song et al., 2019) proposes an unsupervised pre-training task with span masking and recovery. BART feeds corrupted sentences into the encoder and reconstructs the original sentences. GPT (Radford et al., 2019) models perform language-modeling pre-training with a Transformer decoder. For multi-lingual pre-training, mBART introduces language labels to adapt the BART denoising pre-training to the multi-lingual setting. Based on GPT (Radford et al., 2019), DialoGPT (Zhang et al., 2019) and CDial-GPT adopt language-model pre-training with English and Chinese dialog corpora, respectively. CodeBERT and GraphCodeBERT are two pre-training models for programming languages. PLBART (Ahmad et al., 2021), similar to multi-lingual BART, uses language tags to perform denoising pre-training on programming languages.

Conclusion
In this paper, we pre-train ProphetNet-X models on various languages and domains, including open-domain text (English, Chinese, and multi-lingual), dialog (English and Chinese), and programming languages (Ruby, JavaScript, Go, Python, Java, and PHP). All the models share the same model structure and are easy to use. Extensive experiments show that ProphetNet-X achieves new state-of-the-art performance on 10 benchmarks. In the future, we will extend ProphetNet-X to support more domains such as biomedical text and protein pre-training.