Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization

This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state-of-the-art encoder-decoder model using three techniques. First, we use a two-phase pre-training process to improve the model's performance on low-resource summarization tasks. The model is first pre-trained using text corpora for language understanding, and then continually pre-trained on summarization corpora for grounded text generation. Second, we replace the self-attention layers in the encoder with disentangled attention layers, where each word is represented using two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method of encoding long sequences in a hierarchical manner. Z-Code++ creates a new state of the art on 9 of 13 text summarization tasks across 5 languages. Our model is parameter-efficient in that it outperforms the 600x larger PaLM 540B on XSum and the fine-tuned 200x larger GPT3 175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms the competing models.


INTRODUCTION
Text summarization aims at producing a concise and fluent summary while preserving the salient content and overall meaning of the source documents. It has been applied in a wide range of real-world applications, e.g., summarizing Web search results for interactive information retrieval and generating medical summaries from doctor-patient conversation transcripts.
While the extractive approach is the dominant approach in commercial systems due to its simplicity and effectiveness (Allahyari et al., 2017), the abstractive approach is getting more attention in the research community as neural language models are used (e.g., Rush et al., 2015; Nallapati et al., 2016; Chopra et al., 2016; Liu & Lapata, 2019b;a; Pasunuru et al., 2021). Compared to the extractive approach, where a summary is constructed from extracted sentences, abstractive summarizers paraphrase the ideas of the source documents in a new form, and have the potential to generate more concise and coherent summaries.
However, good abstractive summarizers are harder to develop since we have to deal with problems such as semantic representation, inference, and low-resource text generation, which are more challenging than sentence extraction. Recently, large-scale pre-trained language models (PLMs) such as PEGASUS (Zhang et al., 2020), GPT (Radford et al., 2019; Brown et al., 2020), and T5 (Raffel et al., 2020) have been applied to abstractive summarization. While these models can produce surprisingly fluent text, the generated summaries often contain factual inconsistencies, caused by distorted or fabricated facts about the source documents, which is known as the hallucination problem (Kryściński et al., 2019; Celikyilmaz et al., 2020; Ji et al., 2022). In addition, since the amount of text in the source documents can be very large, it is expensive to train an end-to-end abstractive model (e.g., an encoder-decoder Transformer model) given the memory constraints of current hardware and the latency constraints of applications such as online document summarization for interactive information retrieval. Therefore, a two-stage approach is widely used, where a subset of document sentences is coarsely selected using an extractive summarizer, and an abstractive summarizer generates the summary conditioned on the extraction (Liu & Lapata, 2019b). This approach is sub-optimal in that salient information might be missed in the extraction.
In this paper, we propose a new encoder-decoder PLM optimized for abstractive summarization, Z-Code++, which significantly extends Z-Code (Wang et al., 2020), a state-of-the-art PLM developed for machine translation, as follows.
First, Z-Code++ is pre-trained on web text using two tasks, replaced token detection (RTD) and corrupted span prediction (CSP). RTD uses a generator to generate ambiguous corruptions and a discriminator to distinguish the ambiguous tokens from the original inputs. RTD has been shown to be more sample-efficient than the classic masked language modeling (MLM) task in learning text representations for language understanding (Bajaj et al., 2022). In CSP, consecutive segments of tokens are corrupted and the model is trained to predict the corrupted spans using all the uncorrupted tokens in the original input (Raffel et al., 2020; Joshi et al., 2020). CSP can be viewed as a generalized form of gap sentences generation (GSG), a pre-training task tailored to abstractive summarization (Zhang et al., 2020), where the corrupted spans are entire sentences. CSP outperforms GSG in our experiments.
In the second phase of grounded pre-training (Peng et al., 2022), the model is continually trained on summarization corpora of document-summary pairs to better support low-resource fine-tuning on downstream summarization tasks that require the model to produce summaries grounded in source documents. We find in our experiments that grounded pre-training significantly boosts the results on downstream tasks in low-resource settings.
To handle large input documents, we use fusion-in-encoder (FiE), a simple yet effective method of encoding long sequences in a hierarchical manner. It works by first splitting the input sequence into small chunks, applying attention locally on each chunk to obtain chunk representations, and then applying attention globally on the concatenated chunk representations to obtain the representation of the original input.
In addition, we replace the self-attention layers in the encoder with disentangled attention (DA) layers (He et al., 2020; 2021), where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. DA is motivated by the observation that the attention weight of a word pair depends not only on their contents but also on their relative positions. For example, the dependency between the words "deep" and "learning" is much stronger when they occur next to each other than when they occur in different sentences. We show in our experiments that DA leads to a more effective abstractive summarizer. For evaluation, we have pre-trained two Z-Code++ models on English data and multilingual data, respectively. The English model is trained using 160G of English text data and the vocabulary of DeBERTaV2 (He et al., 2020). The multilingual model is trained on the mC4 corpus, the same corpus used for mT5.
These models are evaluated on 13 text summarization tasks across 5 languages, and create a new state of the art on 9 tasks. As of May 6th, 2022, Z-Code++ sits atop the XSum leaderboard, surpassing UL2 20B, T5 11B and PEGASUS. It is worth noting that our models are very parameter-efficient. For example, Z-Code++ outperforms PaLM 540B, which is 600x larger in model parameters, on XSum, and outperforms a fine-tuned, 200x larger GPT3 175B on SAMSum. In zero-shot and few-shot settings, our models outperform the competing models even more substantially.

Z-CODE++
This section describes three modeling techniques we have exploited to optimize Z-Code++ for abstractive summarization, including two-phase pre-training, disentangled attention, and long sequence encoding.

TWO-PHASE PRE-TRAINING
The two-phase pre-training, which includes the language model pre-training and grounded pre-training phases, is inspired by the GODEL recipe (Peng et al., 2022) that has been proposed to pre-train language models for grounded text generation tasks, such as dialog response generation and abstractive question-answering.

Figure 1: The two pre-training tasks, replaced token detection (RTD) and corrupted span prediction (CSP), used in the language model pre-training phase of Z-Code++. The RTD task optimizes the encoder, and CSP optimizes the encoder-decoder. Encoders shown in the same color share parameters during training.

In the language model pre-training phase, Z-Code++ is pre-trained using two language modeling tasks, replaced token detection (RTD) and corrupted span prediction (CSP) (Raffel et al., 2020; Joshi et al., 2020). As illustrated in Figure 1 (Left), RTD uses a generator trained with MLM to generate ambiguous tokens to replace tokens in the original input $X$, and a discriminator to determine whether a token is from $X$ or generated by the generator. Let $\theta_G$ and $\theta_D$ be the parameters of the generator and the discriminator, respectively. The MLM loss of the generator is written as

$$\mathcal{L}_{\text{MLM}} = \mathbb{E}\Big(-\sum_{i \in \mathcal{C}} \log p^{\theta_G}\big(\tilde{x}_i = x_i \mid \tilde{X}_G\big)\Big), \qquad (1)$$

where $\tilde{X}_G$ is the input to the generator, constructed by randomly masking 15% of the tokens in the original input $X$, and $\mathcal{C}$ is the index set of the masked tokens. The input sequence of the discriminator is constructed by replacing the masked tokens $x_i, i \in \mathcal{C}$, with the tokens $\tilde{x}_i$ sampled by the generator:

$$\tilde{x}_{D,i} = \begin{cases} \tilde{x}_i \sim p^{\theta_G}\big(\tilde{x}_i = x_i \mid \tilde{X}_G\big), & i \in \mathcal{C}, \\ x_i, & i \notin \mathcal{C}. \end{cases} \qquad (2)$$

Then the discriminator is trained using the loss

$$\mathcal{L}_{\text{RTD}} = \mathbb{E}\Big(-\sum_{i} \log p^{\theta_D}\big(\mathbb{1}(\tilde{x}_{D,i} = x_i) \mid \tilde{X}_D, i\big)\Big), \qquad (3)$$

where $\mathbb{1}(\cdot)$ is the indicator function and $\tilde{X}_D$ is the input to the discriminator constructed via Equation 2. In ELECTRA, the discriminator and generator share token embeddings and their parameters are optimized jointly via MLM and RTD as $\mathcal{L} = \mathcal{L}_{\text{MLM}} + \lambda \mathcal{L}_{\text{RTD}}$. However, as pointed out in He et al. (2021), such embedding sharing makes training highly inefficient since MLM and RTD pull the token embeddings in very different directions, creating "tug-of-war" dynamics. MLM tries to map tokens that are semantically similar to embedding vectors that are close to each other. RTD, on the other hand, tries to discriminate semantically similar tokens, pulling their embeddings as far apart as possible to optimize classification accuracy. Thus, we use the method of gradient-disentangled embedding sharing (He et al., 2021), which re-parameterizes the token embeddings of the discriminator as

$$E_D = \text{sg}(E_G) + E_\Delta, \qquad (4)$$

where $E_D$ and $E_G$ are the embedding parameters of the discriminator and generator, respectively, and $\text{sg}$ is the stop-gradient operator, which only allows gradients to propagate through $E_\Delta$. $E_\Delta$ is initialized as a zero matrix. In each training pass, we first run a forward pass of the generator to generate inputs for the discriminator, and then a backward pass to update $E_G$ with respect to the MLM loss. After that, we run a forward pass of the discriminator using the inputs produced by the generator and a backward pass with respect to the RTD loss, updating $E_D$ by propagating gradients only through $E_\Delta$. After model training, $E_\Delta$ is added to $E_G$ and the sum is saved as $E_D$ in the discriminator, as in Equation 4.
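To make the gradient-disentangled embedding sharing concrete, below is a minimal PyTorch sketch of the re-parameterization in Equation 4; the module and method names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    """Discriminator embeddings E_D = sg(E_G) + E_Delta (Equation 4)."""

    def __init__(self, generator_embedding: nn.Embedding):
        super().__init__()
        self.generator_embedding = generator_embedding  # E_G, updated only by the MLM loss
        # E_Delta, initialized as a zero matrix
        self.delta = nn.Parameter(torch.zeros_like(generator_embedding.weight))

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        # Stop-gradient on E_G: the RTD loss updates only E_Delta.
        weight = self.generator_embedding.weight.detach() + self.delta
        return nn.functional.embedding(input_ids, weight)

    def export_weight(self) -> torch.Tensor:
        # After training, E_G + E_Delta is saved as the discriminator embedding E_D.
        return (self.generator_embedding.weight + self.delta).detach()
```

In a training step, the generator's backward pass on the MLM loss updates E_G as usual, while the discriminator embeds its inputs through this module, so the RTD gradients flow only into `delta`.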
CSP is widely used to optimize encoder-decoder PLMs such as T5 (Raffel et al., 2020). As illustrated in Figure 1 (Right), given an input string $X$, we select a contiguous span $Y_i$ by randomly choosing a start position in $X$ and a span length that averages 3 tokens. Then we replace the selected span $Y_i$ with a sentinel token $[M_i]$. We repeat the process until the replaced tokens amount to 15% of all tokens in $X$. Then, we feed the corrupted input $\tilde{X}_{\text{CSP}}$ to the encoder. The encoder-decoder model is trained to recover each $Y_i$ from the context. The CSP loss is written as

$$\mathcal{L}_{\text{CSP}} = \mathbb{E}\Big(-\sum_{i} \log p\big(Y_i \mid \tilde{X}_{\text{CSP}}\big)\Big). \qquad (5)$$

If we restrict the corrupted spans $Y_i$ to complete sentences, CSP is equivalent to the GSG task, which simulates the process of extractive summarization and is shown to be effective for training abstractive summarizers (Zhang et al., 2020). In this study, we find that CSP, as a more general form of GSG, works better across many natural language understanding and generation tasks, including summarization, as discussed in Section 3.
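The following simplified sketch illustrates the span-corruption input construction described above; the sentinel format and sampling details follow T5-style span corruption and are assumptions for illustration, not the exact preprocessing used for Z-Code++.

```python
import random

def corrupt_spans(tokens, corrupt_ratio=0.15, mean_span_len=3):
    """Replace ~15% of tokens with sentinel markers [M0], [M1], ...;
    return (encoder_input, decoder_target)."""
    tokens = list(tokens)
    budget = max(1, int(len(tokens) * corrupt_ratio))
    encoder_input, target = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        if budget > 0 and random.random() < corrupt_ratio:
            # Span length averaging mean_span_len, capped by the remaining budget.
            span_len = min(max(1, round(random.expovariate(1 / mean_span_len))),
                           budget, len(tokens) - i)
            sentinel = f"[M{sentinel_id}]"
            encoder_input.append(sentinel)                      # span replaced by one sentinel
            target.extend([sentinel] + tokens[i:i + span_len])  # decoder recovers span after its sentinel
            budget -= span_len
            i += span_len
            sentinel_id += 1
        else:
            encoder_input.append(tokens[i])
            i += 1
    return encoder_input, target

# Example: corrupt_spans("officers searched properties in the city on wednesday".split())
```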
Combining the pre-training tasks of MLM, RTD and CSP, in the language model pre-training phase Z-Code++ is optimized using the joint loss $\mathcal{L} = \lambda_1 \mathcal{L}_{\text{MLM}} + \lambda_2 \mathcal{L}_{\text{RTD}} + \lambda_3 \mathcal{L}_{\text{CSP}}$, where we set $\lambda_1 = 1$, $\lambda_2 = 30$, and $\lambda_3 = 1$ in our experiments.
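A condensed sketch of one pre-training step under this joint objective is shown below, following the update order described above; `losses.mlm`, `losses.rtd`, and `losses.csp` are placeholder callables standing in for the generator, discriminator, and encoder-decoder forward passes, not the released code.

```python
def pretraining_step(batch, losses, optimizer, lambdas=(1.0, 30.0, 1.0)):
    """One pre-training pass with lambda_1=1, lambda_2=30, lambda_3=1."""
    optimizer.zero_grad()

    # 1) Generator forward pass produces replaced inputs for the discriminator;
    #    backward on the MLM loss updates the generator (and E_G).
    mlm_loss, replaced_batch = losses.mlm(batch)
    (lambdas[0] * mlm_loss).backward()

    # 2) Discriminator forward/backward on the RTD loss; with GDES, embedding
    #    gradients flow only through E_Delta.
    rtd_loss = losses.rtd(replaced_batch)
    (lambdas[1] * rtd_loss).backward()

    # 3) Encoder-decoder corrupted span prediction (CSP).
    csp_loss = losses.csp(batch)
    (lambdas[2] * csp_loss).backward()

    optimizer.step()
    return mlm_loss, rtd_loss, csp_loss
```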
In the second phase of grounded pre-training, Z-Code++ is continually pre-trained on a collection of summarization datasets, as shown in Table 1, which consist of document-summary pairs $(X, Y)$, to better support low-resource fine-tuning for downstream summarization tasks that require the model to generate target summaries $Y$ grounded in source documents $X$, as

$$p(Y \mid X) = \prod_{n=1}^{N} p(y_n \mid y_1, \cdots, y_{n-1}, X).$$

Following T0 (Sanh et al., 2022), FLAN (Wei et al., 2021), and GODEL (Peng et al., 2022), we add to each training pair $(X, Y)$ a natural language instruction describing the summarization task, as illustrated in the example below and in Table 1. In our experiments, we only apply grounded pre-training for low-resource summarization. Unless otherwise specified, we use the phase-one Z-Code++ model for downstream task adaptation.
Instruction: Summarize the following news article into a one sentence summary. Source: Officers searched properties in the Waterfront Park and Colonsay View areas of the city on Wednesday. Detectives said three firearms, ammunition and a five-figure sum of money were recovered. A 26-year-old man who was arrested and charged appeared at Edinburgh Sheriff Court on Thursday.
Target: A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh
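A minimal sketch of how such instruction-prefixed training pairs could be assembled for grounded pre-training is given below; the template and field names are illustrative assumptions rather than the exact preprocessing format.

```python
def build_grounded_example(instruction: str, source: str, target: str) -> dict:
    """Prefix the source document with a natural language instruction, as in the example above."""
    return {
        "input": f"Instruction: {instruction} Source: {source}",
        "target": target,
    }

example = build_grounded_example(
    instruction="Summarize the following news article into a one sentence summary.",
    source="Officers searched properties in the Waterfront Park and Colonsay View areas ...",
    target="A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh",
)
```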

DISENTANGLED ATTENTION
Disentangled attention (DA) was first used in DeBERTa (He et al., 2020; 2021). DA is an extension of the classic self-attention (SA) mechanism in that DA represents each input word using two separate vectors: one for the content and the other for the position. Meanwhile, its attention weights among words are computed via disentangled matrices on both their contents and relative positions. The DeBERTa experiments show that DA is more effective than SA at encoding positional dependencies in Transformer models. Z-Code++ adopts DA in its encoder. Our experiments show that DA leads to a more effective abstractive summarizer.
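For illustration, below is a simplified single-head PyTorch sketch of the disentangled attention score computation (content-to-content, content-to-position, and position-to-content terms) following the DeBERTa formulation; the projection names, relative-position clamping, and lack of multi-head handling are simplifications, not the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledAttention(nn.Module):
    """Single-head sketch: attention scores combine content and relative-position vectors."""

    def __init__(self, d_model: int, max_rel_dist: int = 128):
        super().__init__()
        self.d, self.k = d_model, max_rel_dist
        # Content projections.
        self.q_c = nn.Linear(d_model, d_model)
        self.k_c = nn.Linear(d_model, d_model)
        self.v_c = nn.Linear(d_model, d_model)
        # Projections applied to relative-position embeddings.
        self.q_r = nn.Linear(d_model, d_model)
        self.k_r = nn.Linear(d_model, d_model)
        self.rel_emb = nn.Embedding(2 * max_rel_dist, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, seq_len, d_model)
        qc, kc, vc = self.q_c(h), self.k_c(h), self.v_c(h)

        # Relative distances delta(i, j), clamped and shifted into [0, 2k).
        pos = torch.arange(h.size(1), device=h.device)
        delta = (pos[:, None] - pos[None, :]).clamp(-self.k, self.k - 1) + self.k

        c2c = torch.einsum("bid,bjd->bij", qc, kc)                                 # content-to-content
        c2p = torch.einsum("bid,ijd->bij", qc, self.k_r(self.rel_emb(delta)))      # content-to-position
        p2c = torch.einsum("bjd,ijd->bij", kc, self.q_r(self.rel_emb(delta.t())))  # position-to-content

        scores = (c2c + c2p + p2c) / (3 * self.d) ** 0.5  # scale by sqrt(3d) as in DeBERTa
        return torch.einsum("bij,bjd->bid", F.softmax(scores, dim=-1), vc)
```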

LONG SEQUENCE ENCODING
It is challenging to encode long sequences given the $O(N^2)$ memory and computation complexity of self-attention and DA. Various sparse attention mechanisms have been proposed to alleviate the problem. However, sparse attention often hurts performance on short sequences due to the decrease in attention precision. Inspired by fusion-in-decoder (Izacard & Grave, 2020) and the hierarchical transformer (Liu & Lapata, 2019a), we propose fusion-in-encoder (FiE), a simple but effective mechanism to encode long sequences while retaining high attention precision on short sequences. FiE works by separating the $L$ encoder layers of Z-Code++ into $m$ local layers and $n$ global layers.
In each local layer, the hidden states of the input sequence are split into small chunks of size $l$ (e.g., 256 or 512), and self-attention (or DA) is applied only within those small chunks locally, with a complexity of $O(l^2)$. After the local layers, the hidden states of the small chunks are concatenated to form the representation of the long sequence. The global layers are the same as the original self-attention (or DA) layers of the encoder and fuse the local states of the small chunks. With FiE, the complexity of the encoder is reduced from $O(LN^2)$ to $O(mNl + nN^2)$. Both the local layers and the fusion layers are initialized with the corresponding encoder-layer weights of Z-Code++. See Appendix A.3 for a graphic illustration of FiE. In our experiments, we show that Z-Code++ achieves similar or better performance on long document summarization tasks compared with LongT5 (Guo et al., 2021), which applies sparse attention specifically optimized for summarization.
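The following schematic sketch illustrates the FiE idea with a generic stack of Transformer encoder layers; the chunking and padding logic here is an illustrative assumption, not the released implementation.

```python
import torch
import torch.nn as nn

class FusionInEncoder(nn.Module):
    """Apply m layers chunk-locally, then n global layers over the full sequence."""

    def __init__(self, local_layers: nn.ModuleList, global_layers: nn.ModuleList,
                 chunk_size: int = 256):
        super().__init__()
        self.local_layers = local_layers    # first m encoder layers: attention within each chunk, O(l^2)
        self.global_layers = global_layers  # last n encoder layers: attention over the whole sequence
        self.chunk_size = chunk_size

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:  # hidden: (batch, seq_len, d_model)
        b, n, d = hidden.shape
        l = self.chunk_size
        pad = (l - n % l) % l
        if pad:  # pad so the sequence splits evenly into chunks
            hidden = torch.cat([hidden, hidden.new_zeros(b, pad, d)], dim=1)

        # Local phase: run the local layers on each chunk independently.
        chunks = hidden.reshape(b * ((n + pad) // l), l, d)
        for layer in self.local_layers:
            chunks = layer(chunks)

        # Fusion phase: concatenate chunk states and run full attention over them.
        hidden = chunks.reshape(b, n + pad, d)[:, :n]
        for layer in self.global_layers:
            hidden = layer(hidden)
        return hidden
```

Here each element of `local_layers` and `global_layers` can be any module mapping (batch, length, d_model) to the same shape, e.g., `nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)`; both stacks would be initialized from the corresponding pre-trained encoder layers as described above.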

EXPERIMENT SETUPS
Datasets We validate the effectiveness of Z-Code++ on 11 representative summarization tasks, which are detailed in Table 2. Among these datasets, XSum (Narayan et al., 2018), CNNDM (See et al., 2017), NewsRoom (Grusky et al., 2018), and MultiNews (Fabbri et al., 2019) are news article summarization tasks, while SAMSum (Gliwa et al., 2019), MediaSum, and Reddit TIFU are conversation-like summarization tasks. Following LongT5, we use MultiNews, MediaSum, arXiv (Cohan et al., 2018), and PubMed (Cohan et al., 2018) to assess long document summarization capability. In addition, WikiLingua (Ladhak et al., 2020) and MLSum (Scialom et al., 2020) are used to evaluate the capacity of Z-Code++ for multilingual summarization.

Implementation Details Following He et al. (2021), a 6-layer generator with the same structure as the encoder is employed during the pre-training stage. Z-Code++ LARGE is trained on 160G of data with a vocabulary of size 128k. We pre-train Z-Code++ LARGE for 1M steps with a batch size of 2048. AdamW is used as the optimizer in all experiments. For tasks with an input length of more than 10k words, i.e., arXiv and PubMed, fusion-in-encoder is used to encode the document as described in Section 2.3. For the other standard summarization tasks with moderate input length (i.e., less than 4k words), we directly feed the input document to the encoder.
For multilingual summarization, we have built a multilingual Z-Code++ LARGE with the same architecture but different training data and vocabulary. Specifically, it is trained with mC4 data and a vocabulary of size 250k, the same as mT5 (Xue et al., 2021). Following XLM (Lample & Conneau, 2019), CCMatrix (Schwenk et al., 2019), and CCAligned (El-Kishky et al., 2019), parallel data is used to enhance the cross-lingual summarization capability of Z-Code++ LARGE. Due to limited computational resources, the multilingual Z-Code++ LARGE is trained on only 500B tokens instead of the 1T tokens used for mT5 training.
We use grid search to choose the grounded pre-training and fine-tuning hyper-parameters based on the validation set; the parameter search ranges are listed in Appendix A.1.

RESULTS ON STANDARD ENGLISH SUMMARIZATION TASKS
We first conduct experiments to compare the performance of Z-Code++ LARGE with the SOTA models and PEGASUS LARGE on 7 representative standard public English summarization datasets with moderate document length, including AESLC, SAMSum, XSUM, WikiHow, NewsRoom, CNN/DailyMail (CNNDM), and Reddit TIFU. Following Chowdhery et al. (2022), for each dataset we report the F-measure of the ROUGE-2 score. Detailed ROUGE-1/ROUGE-2/ROUGE-L F-measure scores can be found in the Appendix (Table 10). The results, along with the previous SOTA (Pang et al., 2022), are listed in Table 3.

We also compare Z-Code++ to PEGASUS and LongT5, which is optimized for long document summarization. Results in Table 4 show that Z-Code++ LARGE exceeds all the strong competitors on all long document summarization datasets and lifts the SOTA by 0.35 points on average. For FiE, which is used to generate summaries for arXiv and PubMed, we choose the chunk size $l = 256$ and use the last encoder layer as the fusion layer based on the experimental results. Notably, Z-Code++ LARGE outperforms LongT5 3B with less than 1/3 of its parameters. These results demonstrate both the effectiveness and flexibility of Z-Code++, which uses disentangled attention to encode word dependencies.

HUMAN EVALUATION
As human evaluation is the most reliable measurement of the quality of natural language generation models, we submit the test results on XSum to the leaderboard (Khashabi et al., 2021), which asks human raters to compare the generated summaries side by side with human-written references. We compare with other submissions, including UL2 (Tay et al., 2022), on the leaderboard in terms of the human-overall score. As the human evaluation score is an average of side-by-side preference comparison scores, a score of 0.51 indicates that the annotators prefer the output of Z-Code++ to the human-written references. Furthermore, while hallucination is one of the most critical problems for abstractive summarization, Z-Code++ does not suffer much from it, with a score of 0.55 on the leaderboard. The human evaluation results validate that Z-Code++ produces higher quality summaries than other models.

CROSS-LINGUAL SUMMARIZATION
Following the GEM benchmark (Gehrmann et al., 2021), we evaluate the performance of Z-Code++ LARGE on multilingual summarization with WikiLingua and MLSum. We compare Z-Code++ LARGE with mT5 LARGE and mT5 XLARGE. The results of PaLM 540B, a state-of-the-art PLM, are also listed in Table 6. Compared with mT5 XLARGE, Z-Code++ LARGE achieves substantially better performance across all the tasks with only 1/3 of the parameters and half the training data. In addition, we observe a significant performance gap between Z-Code++ LARGE and PaLM 540B on WikiLingua, which is not surprising due to the sharp difference in model size and capacity. However, Z-Code++ LARGE surpasses PaLM 540B on MLSum by a large margin, i.e., 3.7% on MLSum(de) and 2.8% on MLSum(es), although Z-Code++ LARGE has less than 1/500 of the parameters. We believe that by scaling up Z-Code++ to a moderate size (e.g., 10B), the performance gap on WikiLingua would be reduced. We leave this to future work.

RESULTS ON LOW-RESOURCE SUMMARIZATION
We explore how well the knowledge learned in different pre-training stages generalizes to low-resource summarization scenarios, i.e., zero-shot and few-shot evaluation. For the grounded pre-training phase, we include the MediaSum, MultiNews, NewsRoom, and WikiHow datasets. The corresponding instructions are listed in Table 1. We expect that incorporating more diverse datasets and instructions would be beneficial, which we leave to future work. For the fine-tuning stage, following the setting in Zhang et al. (2020), we randomly sample 0, 10, 100, and 1000 training examples from XSUM, CNNDM, and SAMSum, and then fine-tune Z-Code++ until no significant improvement on the validation set is observed. Note that 0 denotes zero-shot evaluation.

Table 7: ROUGE-2 scores on different summarization datasets. Results are shown on the full test sets using 10, 100, and 1000 training examples; 0 denotes zero-shot results. Results marked with ˚ mean that the unfine-tuned checkpoint performs best, i.e., zero-shot performance is better than fine-tuned performance. Z-Code++:LARGE refers to fine-tuning from the phase-1 pre-trained model; Z-Code++ LARGE is fine-tuned from the two-phase pre-trained model.

CONCLUSIONS
We present Z-Code++, an efficient and effective pre-trained language model optimized for abstractive text summarization. The model extends the encoder-decoder model using three techniques. The first is a two-phase pre-training process, where the model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation. The second is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively. The third is the fusion-in-encoder method for encoding long sequence inputs. We present a comprehensive empirical study to validate the effectiveness of Z-Code++. The model creates a new state of the art on 9 out of 13 text summarization tasks across 5 languages. In addition, we show that our model is parameter-efficient in that it outperforms the 600x larger PaLM 540B on XSum and the fine-tuned 200x larger GPT3 175B on SAMSum. Z-Code++ also generalizes well to low-resource downstream tasks; for example, in zero-shot and few-shot settings, our model outperforms the competing models even more substantially.

ACKNOWLEDGMENTS
This effort is part of Microsoft's AI at Scale initiative and Project Alexander. We thank all who contributed to this project.

Table 11: ROUGE-1/ROUGE-2/ROUGE-L scores on different summarization datasets. Results are shown on the full test sets using 10, 100, and 1000 training examples; 0 denotes zero-shot results. Results marked with ˚ mean that the unfine-tuned checkpoint performs best, i.e., zero-shot performance is better than fine-tuned performance. Z-Code++:LARGE refers to fine-tuning from the phase-1 pre-trained model; Z-Code++ LARGE is fine-tuned from the two-phase pre-trained model.