GLM: General Language Model Pretraining with Autoregressive Blank Infilling

There have been various types of pretraining architectures, including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of the pretraining frameworks performs best across all tasks in the three main categories: natural language understanding (NLU), unconditional generation, and conditional generation. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional generation, and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25× the parameters of BERT Large, demonstrating its generalizability to different downstream tasks.


Introduction
Language models pretrained on unlabeled texts have substantially advanced the state of the art in various NLP tasks, ranging from natural language understanding (NLU) to text generation (Radford et al., 2018a; Devlin et al., 2019; Yang et al., 2019; Radford et al., 2018b; Raffel et al., 2020; Lewis et al., 2019; Brown et al., 2020). Downstream task performance as well as the scale of the parameters have also constantly increased in the past few years.
In general, existing pretraining frameworks can be categorized into three families: autoregressive, autoencoding, and encoder-decoder models. Autoregressive models, such as GPT (Radford et al., 2018a), learn left-to-right language models. While they succeed in long-text generation and show few-shot learning ability when scaled to billions of parameters (Radford et al., 2018b; Brown et al., 2020), the inherent disadvantage is the unidirectional attention mechanism, which cannot fully capture the dependencies between the context words in NLU tasks. Autoencoding models, such as BERT (Devlin et al., 2019), learn bidirectional context encoders via denoising objectives, e.g., Masked Language Model (MLM). The encoders produce contextualized representations that suit natural language understanding tasks, but cannot be directly applied to text generation. Encoder-decoder models adopt bidirectional attention for the encoder, unidirectional attention for the decoder, and cross attention between them (Song et al., 2019; Bi et al., 2020; Lewis et al., 2019). They are typically deployed in conditional generation tasks, such as text summarization and response generation. T5 (Raffel et al., 2020) unifies NLU and conditional generation via encoder-decoder models but requires more parameters to match the performance of BERT-based models such as RoBERTa (Liu et al., 2019) and DeBERTa (He et al., 2021).
None of these pretraining frameworks is flexible enough to perform competitively across all NLP tasks. Previous works have tried to unify different frameworks by combining their objectives via multi-task learning (Dong et al., 2019; Bao et al., 2020). However, since the autoencoding and autoregressive objectives differ by nature, a simple unification cannot fully inherit the advantages of both frameworks.
In this paper, we propose a pretraining framework named GLM (General Language Model), based on autoregressive blank infilling. We randomly blank out continuous spans of tokens from the input text, following the idea of autoencoding, and train the model to sequentially reconstruct the spans, following the idea of autoregressive pretraining (see Figure 1). While blank filling has been used in T5 (Raffel et al., 2020) for text-to-text pretraining, we propose two improvements, namely span shuffling and 2D positional encoding. Empirically, we show that with the same number of parameters and computational cost, GLM outperforms BERT on the SuperGLUE benchmark by a large margin of 4.6%-5.0% and outperforms RoBERTa and BART when pretrained on a corpus of similar size (158GB). GLM also significantly outperforms T5 on NLU and generation tasks with fewer parameters and data.
Inspired by Pattern-Exploiting Training (PET) (Schick and Schütze, 2020a), we reformulate NLU tasks as manually-crafted cloze questions that mimic human language. Different from the BERT-based models used by PET, GLM can naturally handle multi-token answers to the cloze question via autoregressive blank filling.
Furthermore, we show that by varying the number and lengths of missing spans, the autoregressive blank filling objective can pretrain language models for conditional and unconditional generation. Through multi-task learning of different pretraining objectives, a single GLM can excel in both NLU and (conditional and unconditional) text generation. Empirically, compared with standalone baselines, GLM with multi-task pretraining achieves improvements in NLU, conditional text generation, and language modeling tasks altogether by sharing the parameters.

GLM Pretraining Framework
We propose a general pretraining framework GLM based on a novel autoregressive blank infilling objective. GLM formulates NLU tasks as cloze questions that contain task descriptions, which can be answered by autoregressive generation.

Autoregressive Blank Infilling
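GLM is trained by optimizing an autoregressive blank infilling objective. Given an input text $x = [x_1, \dots, x_n]$, multiple text spans $\{s_1, \dots, s_m\}$ are sampled, where each span $s_i$ corresponds to a series of consecutive tokens $[s_{i,1}, \dots, s_{i,l_i}]$ in $x$. Each span is replaced with a single [MASK] token, forming a corrupted text $x_{\text{corrupt}}$. The model predicts the missing tokens in the spans from the corrupted text in an autoregressive manner, which means that when predicting the missing tokens in a span, the model has access to the corrupted text and the previously predicted spans. To fully capture the interdependencies between different spans, we randomly permute the order of the spans, similar to the permutation language model (Yang et al., 2019). Formally, let $Z_m$ be the set of all possible permutations of the length-$m$ index sequence $[1, 2, \dots, m]$, and let $s_{z_{<i}}$ be $[s_{z_1}, \dots, s_{z_{i-1}}]$. We define the pretraining objective as

$$\max_{\theta} \; \mathbb{E}_{z \sim Z_m} \left[ \sum_{i=1}^{m} \log p_{\theta}\left(s_{z_i} \mid x_{\text{corrupt}}, s_{z_{<i}}\right) \right] \tag{1}$$

We always generate the tokens in each blank following a left-to-right order, i.e., the probability of generating the span $s_i$ is factorized as

$$p_{\theta}\left(s_i \mid x_{\text{corrupt}}, s_{z_{<i}}\right) = \prod_{j=1}^{l_i} p\left(s_{i,j} \mid x_{\text{corrupt}}, s_{z_{<i}}, s_{i,<j}\right) \tag{2}$$

We implement the autoregressive blank infilling objective with the following techniques.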
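The corruption step described in this section can be illustrated with a minimal Python sketch. The function name `corrupt`, the half-open `(start, end)` span format, and the toy token names are assumptions made for illustration; how spans are sampled is left out, and this is not the authors' implementation.

```python
import random

MASK = "[MASK]"

def corrupt(tokens, spans, rng):
    """Replace each (start, end) span with one [MASK] and shuffle the span order.

    `spans` are non-overlapping half-open index ranges (assumed to be given).
    Returns the corrupted text and the span contents in a random prediction order.
    """
    spans = sorted(spans)
    corrupted, cursor = [], 0
    for start, end in spans:
        corrupted.extend(tokens[cursor:start])
        corrupted.append(MASK)  # a single [MASK] per span, whatever its length
        cursor = end
    corrupted.extend(tokens[cursor:])
    targets = [tokens[s:e] for s, e in spans]
    rng.shuffle(targets)  # random permutation of the spans
    return corrupted, targets

tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
corrupted, targets = corrupt(tokens, [(2, 3), (4, 6)], random.Random(0))
print(corrupted)  # ['x1', 'x2', '[MASK]', 'x4', '[MASK]']
print(targets)    # the two spans, in a shuffled order
```

Because every span collapses to one [MASK], the corrupted text carries no information about span lengths.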

Multi-Task Pretraining
In the previous section, GLM masks short spans and is suited for NLU tasks. However, we are interested in pretraining a single model that can handle both NLU and text generation. We then study a multi-task pretraining setup, in which a second objective of generating longer text is jointly optimized with the blank infilling objective. We consider the following two objectives:
• Document-level. We sample a single span whose length is sampled from a uniform distribution over 50%-100% of the original length. The objective aims for long text generation.
• Sentence-level. We restrict that the masked spans must be full sentences. Multiple spans (sentences) are sampled to cover 15% of the original tokens. This objective aims for seq2seq tasks whose predictions are often complete sentences or paragraphs.
Both new objectives are defined in the same way as the original objective, i.e., Eq. 1. The only difference is the number of spans and the span lengths.
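The two sampling rules can be sketched in Python as follows. The function names, the uniform choice of the span start, and the stopping rule for sentence selection are assumptions for illustration; the text specifies only the length distribution and the 15% coverage target.

```python
import random

def sample_document_span(n_tokens, rng):
    """One span whose length is uniform over 50%-100% of the text length."""
    length = rng.randint((n_tokens + 1) // 2, n_tokens)
    start = rng.randint(0, n_tokens - length)  # assumed: start chosen uniformly
    return start, start + length               # half-open [start, end)

def sample_sentence_spans(sentence_lengths, rng, ratio=0.15):
    """Pick whole sentences at random until about `ratio` of the tokens are covered."""
    budget = ratio * sum(sentence_lengths)
    order = list(range(len(sentence_lengths)))
    rng.shuffle(order)
    chosen, covered = [], 0
    for idx in order:
        if covered >= budget:
            break
        chosen.append(idx)
        covered += sentence_lengths[idx]
    return sorted(chosen)

rng = random.Random(0)
print(sample_document_span(100, rng))        # one long span, at least 50 tokens
print(sample_sentence_spans([10] * 10, rng)) # indices of the masked sentences
```

Both functions only decide which tokens are blanked out; the training objective itself stays the same as for short spans.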

Model Architecture
GLM uses a single Transformer with several modifications to the architecture: (1) we rearrange the order of layer normalization and the residual connection, which has been shown critical for large-scale language models to avoid numerical errors (Shoeybi et al., 2019); (2) we use a single linear layer for the output token prediction; (3) we replace ReLU activation functions with GeLUs (Hendrycks and Gimpel, 2016).

2D Positional Encoding
One of the challenges of the autoregressive blank infilling task is how to encode the positional information. Transformers rely on positional encodings to inject the absolute and relative positions of the tokens. We propose 2D positional encodings to address the challenge. Specifically, each token is encoded with two positional ids. The first positional id represents the position in the corrupted text $x_{\text{corrupt}}$. For the masked spans, it is the position of the corresponding [MASK] token. The second positional id represents the intra-span position. For tokens in Part A, their second positional ids are 0. For tokens in Part B, they range from 1 to the length of the span. The two positional ids are projected into two vectors via learnable embedding tables, which are both added to the input token embeddings.
Our encoding method ensures that the model is not aware of the length of a masked span when reconstructing it. This is an important difference compared to other models. For example, XLNet (Yang et al., 2019) encodes the original position so that it can perceive the number of missing tokens, and SpanBERT (Joshi et al., 2020) replaces the span with multiple [MASK] tokens and keeps the length unchanged. Our design fits downstream tasks, as the length of the generated text is usually unknown beforehand.
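A small Python sketch of how the two positional ids could be assigned is given below. It assumes 0-based first ids, takes Part A to be the corrupted text and Part B the masked spans in generation order, and omits any special start or end tokens; the function name and input format are hypothetical.

```python
MASK = "[MASK]"

def two_d_positions(corrupted, spans_in_order):
    """Return (first_ids, second_ids) for Part A followed by Part B.

    `corrupted` is Part A. `spans_in_order` lists (mask_index, span_tokens)
    pairs in generation order, where mask_index is the position of the
    span's [MASK] token inside `corrupted`.
    """
    first = list(range(len(corrupted)))  # Part A: position in the corrupted text
    second = [0] * len(corrupted)        # Part A: intra-span id is always 0
    for mask_index, span in spans_in_order:
        first += [mask_index] * len(span)        # span tokens share their [MASK] position
        second += list(range(1, len(span) + 1))  # 1 .. length of the span
    return first, second

corrupted = ["x1", "x2", MASK, "x4", MASK]
spans = [(4, ["x5", "x6"]), (2, ["x3"])]  # shuffled order: second blank first
first, second = two_d_positions(corrupted, spans)
print(first)   # [0, 1, 2, 3, 4, 4, 4, 2]
print(second)  # [0, 0, 0, 0, 0, 1, 2, 1]
```

The first ids of a span never grow with its length, which is why the model cannot read the span length from the positions of Part A.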

Finetuning GLM
Typically, for downstream NLU tasks, a linear classifier takes the representations of sequences or tokens produced by pretrained models as input and predicts the correct labels. These practices differ from the generative pretraining task, leading to inconsistency between pretraining and finetuning.
Instead, we reformulate NLU classification tasks as generation tasks of blank infilling, following PET (Schick and Schütze, 2020a). Specifically, given a labeled example (x, y), we convert the input text x to a cloze question c(x) via a pattern containing a single mask token. The pattern is written in natural language to represent the semantics of the task. For example, a sentiment classification task can be formulated as "{SENTENCE}. It's really [MASK]". The candidate labels y ∈ Y are also mapped to answers to the cloze, called verbalizer v(y). In sentiment classification, the labels "positive" and "negative" are mapped to the words "good" and "bad". The conditional probability of predicting y given x is

$$p(y \mid x) = \frac{p(v(y) \mid c(x))}{\sum_{y' \in Y} p(v(y') \mid c(x))} \tag{3}$$

where Y is the label set. Therefore the probability of the sentence being positive or negative is proportional to the probability of predicting "good" or "bad" in the blank.
Then we finetune GLM with a cross-entropy loss (see Figure 3).
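The normalization over the label set and the resulting loss can be sketched in a few lines of Python. The log-probabilities below are made-up numbers standing in for model outputs, and the function names are hypothetical.

```python
import math

def label_probabilities(verbalizer_logprob):
    """Normalize p(v(y) | c(x)) over the label set Y.

    `verbalizer_logprob` maps each label y to log p(v(y) | c(x)),
    assumed to be produced by the model for the cloze question.
    """
    top = max(verbalizer_logprob.values())  # subtract the max for numerical stability
    exp = {y: math.exp(lp - top) for y, lp in verbalizer_logprob.items()}
    total = sum(exp.values())
    return {y: e / total for y, e in exp.items()}

def cross_entropy(probs, gold):
    """Negative log-probability of the gold label."""
    return -math.log(probs[gold])

# Hypothetical model outputs for "good" (positive) and "bad" (negative).
scores = {"positive": math.log(0.06), "negative": math.log(0.02)}
probs = label_probabilities(scores)
print(probs["positive"])               # 0.06 / (0.06 + 0.02), i.e. about 0.75
print(cross_entropy(probs, "positive"))
```

Only the verbalizer words compete in the denominator, so probability mass that the model puts on unrelated vocabulary items does not affect the prediction.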
For text generation tasks, the given context constitutes the Part A of the input, with a mask token appended at the end. The model generates the text of Part B autoregressively. We can directly apply the pretrained GLM for unconditional generation, or finetune it on downstream conditional generation tasks.

Discussion and Analysis
In this section, we discuss the differences between GLM and other pretraining models. We are mainly concerned with how they can be adapted to downstream blank infilling tasks.
Comparison with BERT (Devlin et al., 2019). As pointed out by Yang et al. (2019), BERT fails to capture the interdependencies of masked tokens due to the independence assumption of MLM. Another disadvantage of BERT is that it cannot fill in the blanks of multiple tokens properly. To infer the probability of an answer of length l, BERT needs to perform l consecutive predictions. If the length l is unknown, we may need to enumerate all possible lengths, since BERT needs to change the number of [MASK] tokens according to the length.
Comparison with XLNet (Yang et al., 2019). Both GLM and XLNet are pretrained with autoregressive objectives, but there are two differences between them. First, XLNet uses the original position encodings before corruption. During inference, we need to either know or enumerate the length of the answer, the same problem as BERT. Second, XLNet uses a two-stream self-attention mechanism, instead of the right-shift, to avoid the information leak within Transformer. It doubles the time cost of pretraining.
Comparison with T5 (Raffel et al., 2020). T5 proposes a similar blank infilling objective to pretrain an encoder-decoder Transformer. T5 uses independent positional encodings for the encoder and decoder, and relies on multiple sentinel tokens to differentiate the masked spans. In downstream tasks, only one of the sentinel tokens is used, leading to a waste of model capacity and inconsistency between pretraining and finetuning. Moreover, T5 always predicts spans in a fixed left-to-right order. As a result, GLM can significantly outperform T5 on NLU and seq2seq tasks with fewer parameters and data, as stated in Sections 3.2 and 3.3.
Comparison with UniLM (Dong et al., 2019). UniLM combines different pretraining objectives under the autoencoding framework by changing the attention mask among bidirectional, unidirectional, and cross attention. However, UniLM always replaces masked spans with [MASK] tokens, which limits its ability to model the dependencies between the masked spans and their context. GLM feeds in the previous token and autoregressively generates the next token. Finetuning UniLM on downstream generation tasks also relies on masked language modeling, which is less efficient. UniLMv2 (Bao et al., 2020) adopts partially autoregressive modeling for generation tasks, along with the autoencoding objective for NLU tasks. Instead, GLM unifies NLU and generation tasks with autoregressive pretraining.

Experiments
We now describe our pretraining setup and the evaluation of downstream tasks.

Pretraining Setup
For a fair comparison with BERT (Devlin et al., 2019), we use BooksCorpus (Zhu et al., 2015) and English Wikipedia as our pretraining data. We use the uncased wordpiece tokenizer of BERT with 30k vocabulary. We train GLM Base and GLM Large with the same architectures as BERT Base and BERT Large, containing 110M and 340M parameters respectively.
For multi-task pretraining, we train two Large-sized models with a mixture of the blank infilling objective and the document-level or sentence-level objective, denoted as GLM Doc and GLM Sent. Additionally, we train two larger GLM models of 410M (30 layers, hidden size 1024, and 16 attention heads) and 515M (30 layers, hidden size 1152, and 18 attention heads) parameters with document-level multi-task pretraining, denoted as GLM 410M and GLM 515M.
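As a rough sanity check on the two larger configurations, the parameter counts can be approximated with the usual Transformer rule of thumb. The sketch assumes a feed-forward inner size of four times the hidden size (not stated in the text), a 30k-entry token embedding table, and it ignores biases, layer normalization, and positional embeddings, so the totals are only approximate.

```python
def approx_transformer_params(layers, hidden, vocab=30_000):
    """Rough count: 12 * hidden^2 weights per layer plus the token embedding table."""
    attention = 4 * hidden * hidden     # query, key, value, and output projections
    feed_forward = 8 * hidden * hidden  # two matrices with an assumed 4x inner size
    return layers * (attention + feed_forward) + vocab * hidden

for name, layers, hidden in [("GLM 410M", 30, 1024), ("GLM 515M", 30, 1152)]:
    total = approx_transformer_params(layers, hidden)
    print(f"{name}: about {total / 1e6:.0f}M parameters")
# GLM 410M: about 408M parameters
# GLM 515M: about 512M parameters
```

Both estimates land within about 1% of the reported sizes, which suggests that the stated layer counts and hidden sizes account for almost all of the parameters.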
To compare with SOTA models, we also train a Large-sized model with the same data, tokenization, and hyperparameters as RoBERTa (Liu et al., 2019), denoted as GLM RoBERTa. Due to resource limitations, we only pretrain the model for 250,000 steps, which is half of RoBERTa's and BART's training steps and close to T5 in the number of trained tokens. More experiment details can be found in Appendix A.

SuperGLUE
To evaluate our pretrained GLM models, we conduct experiments on the SuperGLUE benchmark (Wang et al., 2019) and report the standard metrics. SuperGLUE consists of 8 challenging NLU tasks. We reformulate the classification tasks as blank infilling with human-crafted cloze questions, following PET (Schick and Schütze, 2020b). Then we finetune the pretrained GLM models on each task as described in Section 2.3. The cloze questions and other details can be found in Appendix B.1.
For a fair comparison with GLM Base and GLM Large, we choose BERT Base and BERT Large as our baselines, which are pretrained on the same corpus and for a similar amount of time. We report the performance of standard finetuning (i.e., classification on the [CLS] token representation). The performance of BERT with cloze questions is reported in Section 3.4. To compare with GLM RoBERTa, we choose T5, BART Large, and RoBERTa Large as our baselines. T5 has no direct match in the number of parameters for BERT Large, so we present the results of both T5 Base (220M parameters) and T5 Large (770M parameters). All the other baselines are of similar size to BERT Large.
Table 1 shows the results. With the same amount of training data, GLM consistently outperforms BERT on most tasks with either base or large architecture. The only exception is WiC (word sense disambiguation). On average, GLM Base scores 4.6% higher than BERT Base, and GLM Large scores 5.0% higher than BERT Large. It clearly demonstrates the advantage of our method in NLU tasks. In the setting of RoBERTa Large, GLM RoBERTa can still achieve improvements over the baselines, but with a smaller margin. Specifically, GLM RoBERTa outperforms T5 Large but is only half its size. We also find that BART does not perform well on the challenging SuperGLUE benchmark. We conjecture this can be attributed to the low parameter efficiency of the encoder-decoder architecture and the denoising sequence-to-sequence objective.

Multi-Task Pretraining
We then evaluate GLM's performance in a multi-task setting (Section 2.1). Within one training batch, we sample short spans and longer spans (document-level or sentence-level) with equal chances. We evaluate the multi-task model for NLU, seq2seq, blank infilling, and zero-shot language modeling.
SuperGLUE. For NLU tasks, we evaluate models on the SuperGLUE benchmark. The results are also shown in Table 1. We observe that with multi-task pretraining, GLM Doc and GLM Sent perform slightly worse than GLM Large, but still outperform BERT Large and UniLM Large. Among multi-task models, GLM Sent outperforms GLM Doc by 1.1% on average. Increasing GLM Doc's parameters to 410M (1.25× BERT Large) leads to better performance than GLM Large. GLM with 515M parameters (1.5× BERT Large) can perform even better.
Sequence-to-Sequence. Considering the available baseline results, we use the Gigaword dataset (Rush et al., 2015) for abstractive summarization and the SQuAD 1.1 dataset (Rajpurkar et al., 2016) for question generation (Du et al., 2017) as the benchmarks for models pretrained on BookCorpus and Wikipedia. Additionally, we use the CNN/DailyMail (See et al., 2017) and XSum (Narayan et al., 2018) datasets for abstractive summarization as the benchmarks for models pretrained on larger corpora.
The results for models trained on BookCorpus and Wikipedia are shown in Tables 3 and 4. We observe that GLM Large can achieve performance matching the other pretraining models on the two generation tasks. GLM Sent can perform better than GLM Large, while GLM Doc performs slightly worse than GLM Large. This indicates that the document-level objective, which teaches the model to extend the given contexts, is less helpful to conditional generation, which aims to extract useful information from the context. Increasing GLM Doc's parameters to 410M leads to the best performance on both tasks. The results for models trained on larger corpora are shown in Table 2. GLM RoBERTa can achieve performance matching the seq2seq BART model, and outperform T5 and UniLMv2.
Text Infilling. Text infilling is the task of predicting missing spans of text which are consistent with the surrounding context (Zhu et al., 2019; Donahue et al., 2020; Shen et al., 2020). GLM is trained with an autoregressive blank infilling objective and can thus straightforwardly solve this task. We evaluate GLM on the Yahoo Answers dataset (Yang et al., 2017) and compare it with Blank Language Model (BLM) (Shen et al., 2020), which is a specifically designed model for text infilling. From the results in Table 5, GLM outperforms previous methods by large margins (1.3 to 3.9 BLEU) and achieves the state-of-the-art result on this dataset. We notice that GLM Doc slightly underperforms GLM Large, which is consistent with our observations in the seq2seq experiments.
Language Modeling. Most language modeling datasets such as WikiText103 are constructed from Wikipedia documents, which our pretraining dataset already contains. Therefore, we evaluate the language modeling perplexity on a held-out test set of our pretraining dataset, which contains about 20M tokens, denoted as BookWiki. We also evaluate GLM on the LAMBADA dataset (Paperno et al., 2016). The task is to predict the final word of a passage. As the baseline, we train a GPT Large model (Radford et al., 2018b; Brown et al., 2020) with the same data and tokenization as GLM Large.
The results are shown in Figure 4. All the models are evaluated in the zero-shot setting. Since GLM learns the bidirectional attention, we also evaluate GLM under the setting in which the contexts are encoded with bidirectional attention. Without a generative objective during pretraining, GLM Large cannot complete the language modeling tasks, with perplexity larger than 100. With the same number of parameters, GLM Doc performs worse than GPT Large. This is expected since GLM Doc also optimizes the blank infilling objective. Increasing the model's parameters to 410M (1.25× of GPT Large) leads to a performance close to GPT Large. GLM 515M (1.5× of GPT Large) can further outperform GPT Large. With the same number of parameters, encoding the context with bidirectional attention can improve the performance of language modeling. Under this setting, GLM 410M outperforms GPT Large. This is the advantage of GLM over unidirectional GPT. We also study the contribution of 2D positional encoding to long text generation. We find that removing the 2D positional encoding leads to lower accuracy and higher perplexity in language modeling.
Summary. Above all, we conclude that GLM effectively shares model parameters across natural language understanding and generation tasks, achieving better performance than a standalone BERT, encoder-decoder, or GPT model.

Ablation Study
Table 6 shows our ablation analysis for GLM. First, to provide an apples-to-apples comparison with BERT, we train a BERT Large model with our implementation, data, and hyperparameters (row 2). The performance is slightly worse than the official BERT Large and significantly worse than GLM Large. It confirms the superiority of GLM over Masked LM pretraining on NLU tasks. Second, we show the SuperGLUE performance of GLM finetuned as sequence classifiers (row 5) and BERT with cloze-style finetuning (row 3). Compared to BERT with cloze-style finetuning, GLM benefits from the autoregressive pretraining. Especially on ReCoRD and WSC, where the verbalizer consists of multiple tokens, GLM consistently outperforms BERT. This demonstrates GLM's advantage in handling variable-length blanks. Another observation is that the cloze formulation is critical for GLM's performance on NLU tasks. For the large model, cloze-style finetuning can improve the performance by 7 points. Finally, we compare GLM variants with different pretraining designs to understand their importance. Row 6 shows that removing the span shuffling (always predicting the masked spans from left to right) leads to a severe performance drop on SuperGLUE. Row 7 uses different sentinel tokens instead of a single [MASK] token to represent different masked spans. The model performs worse than the standard GLM. We hypothesize that it wastes some modeling capacity to learn the different sentinel tokens, which are not used in downstream tasks with only one blank. In Figure 4, we show that removing the second dimension of 2D positional encoding hurts the performance of long text generation.
We note that T5 is pretrained with a similar blank infilling objective.GLM differs in three aspects: (1) GLM consists of a single encoder, (2) GLM shuffles the masked spans, and (3) GLM uses a single [MASK] instead of multiple sentinel tokens.While we cannot directly compare GLM with T5 due to the differences in training data and the number of parameters, the results in Tables 1 and 6 have demonstrated the advantage of GLM.
Among encoder-decoder models, BART (Lewis et al., 2019) conducts NLU tasks by feeding the same input into the encoder and decoder, and taking the final hidden states of the decoder. Instead, T5 (Raffel et al., 2020) formulates most language tasks in the text-to-text framework. However, both models require more parameters to outperform autoencoding models such as RoBERTa (Liu et al., 2019). UniLM (Dong et al., 2019; Bao et al., 2020) unifies three pretraining models under the masked language modeling objective with different attention masks.
NLU as Generation. Previously, pretrained language models complete classification tasks for NLU with linear classifiers on the learned representations. GPT-2 (Radford et al., 2018b) and GPT-3 (Brown et al., 2020) show that generative language models can complete NLU tasks such as question answering by directly predicting the correct answers without finetuning, given task instructions or a few labeled examples. However, generative models require many more parameters to work due to the limit of unidirectional attention. Recently, PET (Schick and Schütze, 2020a,b) proposes to reformulate input examples as cloze questions with patterns similar to the pretraining corpus in the few-shot setting. It has been shown that, combined with gradient-based finetuning, PET can achieve better performance in the few-shot setting than GPT-3 while requiring only 0.1% of its parameters. Similarly, Athiwaratkun et al. ( 2020 2020) also study blanking infilling models. Different from their work, we pretrain language models with blank infilling objectives and evaluate their performance in downstream NLU and generation tasks.

Conclusions
GLM is a general pretraining framework for natural language understanding and generation. We show that the NLU tasks can be formulated as conditional generation tasks, and are therefore solvable by autoregressive models. GLM unifies the pretraining objectives for different tasks as autoregressive blank infilling, with mixed attention masks and the novel 2D position encodings. Empirically, we show that GLM outperforms previous methods for NLU tasks and can effectively share parameters for different tasks.
The hyperparameters for all the pretraining settings are summarized in Table 7.

A.3 Implementation
Our pretraining implementation is based on Megatron-LM (Shoeybi et al., 2019) and DeepSpeed (Rasley et al., 2020). We include our code in the supplementary material. Due to the size limit of supplementary material, we cannot include the pretrained models, but will make them publicly available in the future.

B Downstream Tasks

B.1 SuperGLUE
The SuperGLUE benchmark consists of 8 NLU tasks. We formulate them as blank infilling tasks, following Schick and Schütze (2020b). Table 8 shows the cloze questions and verbalizers we used in our experiments. For 3 tasks (ReCoRD, COPA, and WSC), the answer may consist of multiple tokens, and for the other 5 tasks, the answer is always a single token.
When finetuning GLM on the SuperGLUE tasks, we construct the input using the cloze questions in Table 8 and replace the blank with a [MASK] token. Then we compute the score of generating each answer candidate. For the 5 single-token tasks, the score is defined to be the logit of the verbalizer token. For the 3 multi-token tasks, we use the sum of the log-probabilities of the verbalizer tokens. Thanks to the autoregressive blank infilling mechanism we proposed, we can obtain all the log-probabilities in one pass. Then we compute the cross entropy loss using the ground-truth label and update the model parameters.
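For the multi-token case, the scoring rule can be sketched as below. Each candidate is represented by the per-token log-probabilities of its verbalizer, which are assumed to come from one forward pass of the model. Taking a softmax over the candidate scores before the cross entropy is also an assumption, since the text does not spell out how the scores are normalized.

```python
import math

def candidate_score(token_logprobs):
    """Score of one answer candidate: the sum of its verbalizer-token log-probabilities."""
    return sum(token_logprobs)

def candidate_loss(candidates, gold_index):
    """Cross entropy over candidates, using a softmax of the summed log-probabilities."""
    scores = [candidate_score(c) for c in candidates]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return -(scores[gold_index] - log_norm)

# Hypothetical per-token log-probabilities for two candidates of different length.
candidates = [
    [math.log(0.5), math.log(0.4)],  # two-token answer, joint probability 0.2
    [math.log(0.1)],                 # one-token answer, probability 0.1
]
loss = candidate_loss(candidates, gold_index=0)
print(loss)  # -log(0.2 / 0.3), about 0.405
```

Summing log-probabilities scores each candidate by its joint probability, so answers with different numbers of tokens can be compared directly.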
For the baseline classifiers, we follow the standard practice to concatenate the input parts of each task (such as the premise and hypothesis for textual entailment, or the passage, question and answer for ReCoRD and MultiRC) and add a classification layer on top of the [CLS] token representation. We also implemented cloze-style finetuning for the other pretrained models, but the performance was usually similar to the standard classifier, as we show in the ablation study. Models with blank-infilling objectives, such as T5 and our GLM, benefit more from converting the NLU tasks into cloze questions. Thus for T5 and GLM, we report the performance after such conversion in our main results.

B.2 Sequence-to-Sequence
For the text summarization task, we use the Gigaword dataset (Rush et al., 2015) for model finetuning and evaluation. We finetune GLM Large on the training set for 4 epochs with the AdamW optimizer. The learning rate has a peak value of 3e-5, with warm-up over the first 6% of training steps followed by a linear decay. We also use label smoothing with rate 0.1 (Pereyra et al., 2017). The maximum document length is 192 and the maximum summary length is 32. During decoding, we use beam search with a beam size of 5 and remove repeated trigrams. We tune the value of the length penalty on the development set. The evaluation metrics are the F1 scores of Rouge-1, Rouge-2, and Rouge-L (Lin, 2004) on the test set.
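The learning-rate schedule described here can be written as a small Python function. The function name is hypothetical, and decaying all the way to zero at the last step is an assumption; the text states only the peak value, the 6% warm-up, and that the decay is linear.

```python
def learning_rate(step, total_steps, peak=3e-5, warmup_ratio=0.06):
    """Linear warm-up to `peak`, then linear decay (assumed to end at zero)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak * max(0.0, (total_steps - step) / remaining)

total = 1000
for step in (0, 30, 60, 530, 1000):
    print(step, learning_rate(step, total))
# The rate rises from 0 to the 3e-5 peak at step 60, then falls back to 0 at step 1000.
```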
For the question generation task, we use the SQuAD 1.1 dataset (Rajpurkar et al., 2016) and follow the dataset split of Du et al. (2017). The optimizer hyperparameters are the same as those of abstractive summarization. The maximum passage length is 464 and the maximum question length is 48. During decoding, we use beam search with beam size 5 and tune the value of the length penalty on the development set. The evaluation metrics are the scores of BLEU-1, BLEU-2, BLEU-3, BLEU-4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and Rouge-L (Lin, 2004).
Results of T5 Large on XSum are obtained by running the summarization script provided by Huggingface transformers. All the other results of baselines on seq2seq tasks are obtained from the corresponding papers.

B.3 Text Infilling
We follow Shen et al. (2020) and evaluate text infilling performance on the Yahoo Answers dataset (Yang et al., 2017), which contains 100K/10K/10K documents for train/valid/test respectively. The average document length is 78 words. To construct the text infilling task, we randomly mask a given ratio r ∈ {10%, ..., 50%} of each document's tokens, and the contiguous masked tokens are collapsed into a single blank. We finetune GLM Large on the training set for 5 epochs with dynamic masking, i.e., the blanks are randomly generated at training time. Similar to the sequence-to-sequence experiments, we use an AdamW optimizer with a peak learning rate of 1e-5 and a linear scheduler with 6% warm-up.
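The construction of an infilling example can be sketched as follows. The function name, the use of "[MASK]" as the blank symbol, and rounding the number of masked tokens to the nearest integer are assumptions for illustration.

```python
import random

def make_infilling_example(tokens, ratio, rng, blank="[MASK]"):
    """Mask `ratio` of the tokens at random and collapse contiguous masks into one blank."""
    n_mask = round(len(tokens) * ratio)
    masked = set(rng.sample(range(len(tokens)), n_mask))
    corrupted, answers = [], []
    for i, tok in enumerate(tokens):
        if i in masked:
            if i > 0 and (i - 1) in masked:
                answers[-1].append(tok)  # extend the blank that is already open
            else:
                corrupted.append(blank)  # open a new blank
                answers.append([tok])
        else:
            corrupted.append(tok)
    return corrupted, answers

tokens = "the quick brown fox jumps over the lazy dog today".split()
corrupted, answers = make_infilling_example(tokens, 0.3, random.Random(0))
print(corrupted)
print(answers)
```

With dynamic masking, a function like this would be called again with fresh randomness each time a document is visited, so the blanks differ from epoch to epoch.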
For comparison with previous work, we use the same test set constructed by Shen et al. (2020). The evaluation metric is the BLEU score of the infilled text against the original document. We compare with two baselines: (1) BERT, which learns a left-to-right language model to generate the masked tokens on top of the blank representation, and (2) BLM proposed by Shen et al. (2020), which can fill in the blank with arbitrary trajectories.

B.4 Language Modeling
We evaluate the model's language modeling ability with perplexity on BookWiki and accuracy on the LAMBADA dataset (Paperno et al., 2016).
Perplexity is an evaluation criterion that has been well studied for language modeling. Perplexity is the exponentiation of the average cross entropy of a corpus:

$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})\right)$$

where $T$ is the number of tokens in the corpus and $x_{<t}$ denotes the tokens preceding $x_t$.
LAMBADA is a cloze-style dataset to test the ability of long-range dependency modeling. Each example is a passage consisting of 4-5 sentences with the last word missing, and the model is required to predict the last word of the passage. Since we use WordPiece tokenization, a word can be split into several subword units. We use teacher forcing and consider the prediction correct only when all the predicted tokens are correct.
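Both metrics are simple to compute once per-token log-probabilities and predictions are available. The sketch below uses made-up numbers and hypothetical function names; it illustrates the definitions only and does not reproduce the evaluation pipeline.

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def last_word_correct(predicted_pieces, gold_pieces):
    """A word counts as correct only if every one of its subword units matches."""
    return predicted_pieces == gold_pieces

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 8))

# One wrong subword unit makes the whole word wrong.
print(last_word_correct(["play", "##ing"], ["play", "##ing"]))  # True
print(last_word_correct(["play", "##ed"], ["play", "##ing"]))   # False
```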

C Results on Other NLU Benchmarks
GLUE (Wang et al., 2018) is another widely-used NLU benchmark, including single sentence tasks (e.g., sentiment analysis (Socher et al., 2013)) and sentence pair tasks (e.g., text similarity (Cer et al., 2017) and natural language inference (Williams et al., 2018; Dagan et al., 2005)). The benchmark is usually considered less challenging than SuperGLUE. SQuAD (Rajpurkar et al., 2016, 2018) is an extractive question answering benchmark. We further compare GLM with BERT on the two benchmarks.
The results on GLUE and SQuAD are shown in Tables 9 and 10. On the two benchmarks, GLM can still outperform BERT with the same number of parameters, but with a smaller margin.

D Text Generation Samples
We show texts generated by GLM Doc given unseen contexts randomly sampled from the test set. We use top-k random sampling with k = 40 for generation and set the maximum sequence length to 512. Some of the texts are cut short.
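For readers unfamiliar with top-k random sampling, the following is a minimal sketch of a single decoding step using only the standard library. The function name `top_k_sample` and the list-of-logits interface are assumptions; this is not the decoding code used to produce the samples below.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept entries (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [0.1, 5.0, 4.0, -2.0, 3.0]
print([top_k_sample(toy_logits, 2, rng) for _ in range(5)])
```

With k = 40, as in the setting above, only the 40 most likely next tokens can be drawn at each step, which removes the low-probability tail while keeping the output stochastic.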
Example D.1. The Wyoming State Legislature is the legislative branch of the U.S. State of Wyoming.
GLM: It consists of a state senate and a house of representatives. As of 2019, state senators are Steve Kish, Joe L. Williams, and Linda S. Bollens. Members of the Wyoming State Legislature are elected from single-member districts representing the majority of the state. The current state senate members are: In recent years, there have been four changes to the senate. The most recent is the creation of a six-seat district that includes all or part of the following: In the 2009 elections, the state senate members were elected to six-year terms. The current state house members are: The Wyoming Constitution assigns certain powers to the governor. Most notably, the governor is president of the senate and governor. However, if the governor desires to appoint a member to the Wyoming state senate, a law authorizes the governor to do so. The governor of Wyoming holds no legislative power but has the power to veto lawmakers, which is not limited to the veto of laws. Under the wyoming state constitution, the governor can veto the actions of the other members of the wyoming house of representatives. The governor can also appoint members of the wyoming senate. In addition, the governor can appoint members of the Wyoming house of representatives. Wyoming's constitution provides that the governor can appoint a member of the wyoming state senate to the wyoming supreme court, and the chairman of the wyoming senate.
Example D.2. Jonathan Terry is a television and film actor.
GLM: He was a voice actor for the "X-Men" cartoon series. He was also a voice actor for "the Simpsons" as well as "the marvelous misadventures of superman. He also has voiced characters in "the legend of korra" He has appeared on several television series, including "the simpsons", "the x-files", "heroes" and "the simpsons movie" as the character captain billy higgledy-pig, and the web series "krusty mysteries" as the character Colonel Trungus. He wrote for "the Guardian" newspaper. Jonathan Terry was born in London. Terry grew up in Surrey, England and attended the University of Sussex in the United Kingdom, graduating with a degree in english literature. He was a guest lecturer at King's College London, and then took two years of acting courses at the brit school of acting to prepare for his future career in the entertainment industry. Terry first appeared in the TV series "the Simpsons" as the character captain Billy Higgledypig, but his character was only a one-time recurring character in the series' first six seasons. He later appeared as a regular for the show's final six seasons, and has been a frequent guest in the show since.
He appeared in the first few episodes of "" as the character major Jack Ryan. He has also appeared as part of the supporting cast of several episodes of "the secret life of pets". He has also worked on "the simpsons" TV show since "the simpsons movie", most notably playing the roles of Captain Skeletor and the ghost of the same name. He plays characters in several films, including "", "", "" and "". He has appeared in music videos for the killers in 1993, the pretenders in 1995, and in the TV shows "the royal" and "the bill".
Example D.3. Corona was a station along the port Washington branch of the long island rail road in the Corona section of queens, New York City. It was one of two stations built by the flushing railroad in Corona, this one having been at Grand Avenue (later called National Avenue, now National Street) and 45th Avenue.
GLM: It had a brick entrance building next to the tracks, and one platform with 2 side platforms. The station burned down and was rebuilt in 1908. The station closed on september 15, 1927, with the train service transferred from Grand Avenue to 45th Avenue. There were plans to build a subway extension to Corona, but it was never built. In the 1980s, two stations were constructed on the line, Corona Road and Corona Park. The new Corona station opened in 1988, and the original Corona station was demolished.
Example D.4. Robert Lee Smith (born July 5, 1938) is a former American football cornerback in the national football league. He played for the Los Angeles Rams (1962–1965) and the Detroit Lions (1965–1966). He was also a kick and punt returner.
GLM: In his four-year NFL career, he played in 33 games and started 14, registering 62 career interceptions. He completed his NFL career with five interceptions. Smith was born in La Canada Flintridge, Michigan, in 1938. He attended Bishop O'Dowd high school in Flintridge. Smith's father was a pharmacist. He graduated from Michigan State University in 1958 with a degree in business administration. Smith played for the Michigan Wolverines football team from 1959 to 1963. In his two years as a reserve cornerback, he led the conference in interceptions with five. In 1962, he set the Wolverines' all-time interception record with 13, and was second overall in the 1962 season's Heisman Trophy voting. Smith also won the Dick Butkus award as the nation's outstanding linebacker. In 1961, the "Los Angeles Times" wrote that Smith "is an outstanding pass rusher, with an average of almost 100 yards per punt return." Smith was inducted into the university of Michigan athletic hall of honor in 1989 and the national football foundation hall of fame in 1991. He was elected to the Michigan sports hall of fame in 1995. Smith earned the honor because of his accomplishments prior to his NFL career. He was one of four Michigan players honored as first-overall selections in the 1964 NFL draft. The others were Joe Namath, Bill Nelsen, and Jerry Kramer. In 1966, the NFL gave players $300,000 a season to play football. After his rookie season, he was not selected to play in the 1966 pro bowl. On January 13, 1966, the Rams traded smith to the Detroit Lions for Paul Hornung, and later that year he was traded to the Lions for Ray "the Lion" Jones in exchange for Linebacker Jim "the Hawk" Johnson. On September 10, 1968, he was traded back to Los Angeles for a second round pick in the 1970 draft. He was also traded to the St. Louis Cardinals for a second round pick in the 1970 draft. On June 2, 1970 he was cut by the Cardinals. On November 15, 1970, the Los Angeles Rams acquired Smith from the Lions in exchange for Linebacker Tony Harris. The Rams waived Smith during the September 1, 1972 offseason. Smith's number at Michigan State was # 7 in 1969.

Figure 1: Illustration of GLM. We blank out text spans (green part) and generate them autoregressively. (Some attention edges are omitted; cf. Figure 2.)

Figure 2: GLM pretraining. (a) The original text is $[x_1, x_2, x_3, x_4, x_5, x_6]$. Two spans $[x_3]$ and $[x_5, x_6]$ are sampled. (b) Replace the sampled spans with [M] in Part A, and shuffle the spans in Part B. (c) GLM autoregressively generates Part B. Each span is prepended with [S] as input and appended with [E] as output. 2D positional encoding represents inter- and intra-span positions. (d) Self-attention mask. Grey areas are masked out. Part A tokens can attend to themselves (blue frame) but not B. Part B tokens can attend to A and their antecedents in B (yellow and green frames correspond to the two spans). [M] := [MASK], [S] := [START], and [E] := [END].
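The construction described in the Figure 2 caption can be sketched in a few lines. This is an illustrative reading of the caption, not the released implementation: the function name `build_glm_inputs`, the half-open span representation, the explicit `order` argument standing in for the shuffle, and the choice to let each Part B token attend to itself are all assumptions.

```python
def build_glm_inputs(tokens, spans, order):
    """Build Part A + Part B inputs, 2D position ids, and the attention mask.

    tokens: list of token strings.
    spans:  sorted, non-overlapping (start, end) index pairs, end exclusive.
    order:  permutation of span indices giving the (shuffled) Part B order.
    """
    starts = {s: (idx, e) for idx, (s, e) in enumerate(spans)}
    part_a, mask_pos = [], {}
    i = 0
    while i < len(tokens):
        if i in starts:
            idx, end = starts[i]
            part_a.append("[M]")
            mask_pos[idx] = len(part_a)     # 1-based position of this [M] in Part A
            i = end
        else:
            part_a.append(tokens[i])
            i += 1

    inputs = list(part_a)
    pos1 = list(range(1, len(part_a) + 1))  # inter-span position
    pos2 = [0] * len(part_a)                # intra-span position (0 in Part A)
    for idx in order:
        s, e = spans[idx]
        span_in = ["[S]"] + tokens[s:e]     # each span is prepended with [S] as input
        inputs += span_in
        pos1 += [mask_pos[idx]] * len(span_in)
        pos2 += list(range(1, len(span_in) + 1))

    n_a, n = len(part_a), len(inputs)
    # mask[q][k] is True when query position q may attend to key position k:
    # every token sees all of Part A; Part B tokens also see B up to themselves.
    mask = [[k < n_a or (q >= n_a and k <= q) for k in range(n)] for q in range(n)]
    return inputs, pos1, pos2, mask


tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
inputs, pos1, pos2, mask = build_glm_inputs(tokens, [(2, 3), (4, 6)], order=[1, 0])
print(inputs)
print(pos1)
print(pos2)
```

In this sketch the first position id ties every Part B token to the [M] it fills, and the second counts positions within the span, so the model is never told the length of a span in advance.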

Figure 3: Formulation of the sentiment classification task as blank infilling with GLM. (Example input shown in the figure: "Coronet has the best lines of all day cruisers."; label: Positive.)
) and Paolini et al. (2020) convert structured prediction tasks, such as sequence tagging and relation extraction, to sequence generation tasks. Blank Language Modeling. Donahue et al. (2020) and Shen et al. (

Table 1: Results on the SuperGLUE dev set.

Table 2: Results of abstractive summarization on the CNN/DailyMail and XSum test sets.

Table 4: Results on SQuAD question generation.

Table 5: BLEU scores on Yahoo text infilling.

Table 8: Cloze questions and verbalizers for the 8 SuperGLUE tasks used in our experiments. * denotes that the answer contains multiple tokens.

Since transformers can only operate on a window of fixed input size $w$, we cannot fully calculate $p(x_t \mid x_{<t})$ and can only calculate $p(x_t \mid x_{t-w:t-1})$. Even calculating this value for each token is prohibitively expensive, since we need to conduct $T$ evaluations of $w$-size contexts. To improve evaluation efficiency, we adopt overlapping evaluation, where we advance the sliding window by some overlap $o$ each time and only compute the cross entropy loss for the last $o$ tokens of the window. In our experiments we set $o = 256$ for all the models.
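The overlapping evaluation described above can be sketched as follows. The function name `overlapping_perplexity` and the `log_prob(context, token)` callback are hypothetical stand-ins for a real model; an actual implementation would score all $o$ tokens of a window in one forward pass rather than one call per token.

```python
import math


def overlapping_perplexity(tokens, log_prob, w, o):
    """Perplexity with a sliding window of size w advanced by o tokens per step.

    log_prob(context, token) returns the natural-log probability of `token`
    given `context`; it is a hypothetical model interface. Only the last o
    tokens of each window contribute to the loss.
    """
    assert 0 < o <= w
    total = 0.0
    for start in range(0, len(tokens), o):
        stop = min(start + o, len(tokens))   # tokens[start:stop] are scored
        win_start = max(0, stop - w)         # the window covers at most w tokens
        for t in range(start, stop):
            total += log_prob(tokens[win_start:t], tokens[t])
    return math.exp(-total / len(tokens))


# A toy "model" that assigns probability 1/4 to every token has perplexity 4.
def uniform(context, token):
    return math.log(0.25)


print(overlapping_perplexity(list(range(10)), uniform, w=4, o=2))
```

Each window is evaluated once and yields $o$ scored tokens, so the number of window evaluations drops from $T$ to roughly $T/o$, at the cost of giving each scored token between $w - o$ and $w - 1$ tokens of context once a full window is available.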

Table 9: Results on the GLUE dev set.