Exposure Bias versus Self-Recovery: Are Distortions Really Incremental for Autoregressive Text Generation?

Exposure bias has been regarded as a central problem for autoregressive language models (LMs). It claims that teacher forcing causes test-time generation to be incrementally distorted due to the training-generation discrepancy. Although many algorithms have been proposed to avoid teacher forcing and thereby alleviate exposure bias, there is little work showing how serious the exposure bias problem actually is. In this work, we focus on the task of open-ended language generation and propose metrics to quantify the impact of exposure bias along the aspects of quality, diversity, and consistency. Our key intuition is that if we feed ground-truth data prefixes (instead of prefixes generated by the model itself) into the model and ask it to continue the generation, the performance should become much better because the training-generation discrepancy in the prefix is removed. Both automatic and human evaluations are conducted in our experiments. Contrary to the popular belief in exposure bias, we find that the distortion induced by the prefix discrepancy is limited, and does not seem to be incremental during the generation. Moreover, our analysis reveals an interesting self-recovery ability of the LM, which we hypothesize to be countering the harmful effects from exposure bias.


Introduction
Language models (LMs) have been a central module for natural language generation (NLG) tasks (Young et al., 2017) such as open-ended language generation (Radford et al., 2018; Brown et al., 2020), machine translation (Wu et al., 2017), dialogue response generation, image captioning (Lin et al., 2014), etc. For decades, maximum likelihood estimation (MLE) has been the most widely used objective for LM training. However, there is a popular belief in the natural language processing (NLP) community that standard MLE training suffers from the exposure bias problem, which leads to an incremental performance degradation during test-time generation.

Figure 1: An illustration of the training-generation discrepancy and our key intuition for verifying exposure bias (in blue): During generation, if we feed ground-truth data prefixes (from P_D) instead of the model's own samples (P_M), the generation performance should be better because the prefix discrepancy is removed.
The claim of the exposure bias problem (Bengio et al., 2015; Ranzato et al., 2016) originates from the following discrepancy between MLE training and test-time generation for auto-regressive language models: during training, the model is trained to predict the next word conditioned on prefix (or history) words sampled from the ground-truth data distribution, while during generation, the model generates words conditioned on prefix sequences sampled from the model itself. Due to the exposure to real data during training, the language model could potentially be biased to perform well only with data prefixes. Therefore, it is claimed (and widely believed among researchers) that during generation the errors accumulate along the generated sequence, and the distribution generated by the model becomes incrementally distorted. The forced exposure to ground-truth data during training is also referred to as teacher forcing.
Despite the huge research effort devoted to alleviating exposure bias, interestingly, the existence or significance of exposure bias itself is much less studied. Despite the criticism, MLE (teacher forcing) has remained the dominant objective for LM training (Radford et al., 2018; Keskar et al., 2019; Brown et al., 2020). On the other hand, non-MLE training methods are still struggling to outperform the MLE baseline (Caccia et al., 2018; de Masson d'Autume et al., 2019). These developments lead us to question: is exposure bias truly a serious problem for MLE training?
In this work we seek a direct answer to the above question. Here we briefly summarize our contributions: we design and experiment with various metrics to quantify the impact of exposure bias. Contrary to our expectation, our measurements consistently show that removing the prefix discrepancy brings only limited gain, and the claimed incremental distortion is not observed. Moreover, our analysis reveals an interesting self-recovery ability of the LM, which we hypothesize to be countering the harmful effects from exposure bias.

Related Works
We first clarify that "Is exposure bias serious for MLE training?" and "Do new non-MLE algorithms improve generation performance?" are two related but different questions, and our work clearly focuses on the first. Despite the large body of work (listed in Section 1) devoted to alleviating exposure bias, its actual impact has rarely been systematically studied or validated in a direct and principled way. One prior work attempts to measure the gain from alleviating exposure bias by counting the ground-truth words whose probabilities in the predicted distributions produced by the proposed model are greater than those from the baseline model; however, it is unclear how this experiment can be linked to exposure bias. Schmidt (2019) provides valuable discussions around the claim of exposure bias, but does not propose an operational definition or metric. Xu et al. (2019) attempt to measure exposure bias by comparing the model's performance on seen (training) data and unseen (test) data; however, this methodology is more about the generalization gap than exposure bias. Finally, Wang and Sennrich (2020) discuss the potential link between exposure bias and hallucination in the context of machine translation.
In a direction relevant to the second question, several recent works evaluate whether non-MLE training methods can really deliver NLG performance superior to standard MLE training for open-ended text generation. Caccia et al. (2018) tune a "temperature" parameter in the softmax output and evaluate models over the whole quality-diversity spectrum. Semeniuta et al. (2018) propose to use the "Reverse Language Model score" or "Frechet InferSent Distance" to evaluate the model's generation performance. Tevet et al. (2018) propose a method for approximating a distribution over tokens from GANs, and then evaluate models with standard LM metrics.
These works arrive at a similar conclusion: the generation performance of text GANs is not convincingly better than, and is sometimes even worse than, that of standard MLE training. These negative results motivate us to reassess the exposure bias problem, which serves as a major motivation for text GANs. In the next section, we introduce notations and background.

Background and Notations
Our experiments will be focused on the task of open-ended language generation, which is arguably a good test bed for exposure bias due to the following reasons: (1) The generation length is long.
(2) Different from typical seq2seq tasks such as machine translation, the generation space is only weakly constrained and the topics can be very diverse, which means the training-generation discrepancy could be large.
Notations Auto-regressive language models are trained to learn the probability distribution of the (l+1)-th word (or token) W_{l+1} in a sentence W, conditioned on the prefix W_{1:l} := (W_1, ..., W_l) and a prompt C. We use W_i ∈ V to denote a discrete random variable distributed over the vocabulary V. For simplicity, we assume all sentences are of length L in the formulations. Denoting the ground-truth data distribution as P_D, standard MLE (also referred to as teacher forcing) training aims to minimize the negative log-likelihood (NLL) below:

NLL(θ) = - E_{(C, W) ~ P_D} [ Σ_{l=0}^{L-1} log P_θ(W_{l+1} | C, W_{1:l}) ],   (1)

where P_θ(· | C, W_{1:l}) denotes the conditional distribution of W_{l+1} under P_θ given a prompt C and a prefix W_{1:l}, and θ stands for the set of parameters to be trained. Note that the concept of a "sentence" (W) can be naturally generalized to paragraphs or even articles, depending on the target task. We denote the distribution of an MLE-trained LM as P_M, which is the major subject of this study. We will experiment with two popular model architectures: the LSTM LM (Hochreiter and Schmidhuber, 1997; Sundermeyer et al., 2012) and the transformer LM (Vaswani et al., 2017; Dai et al., 2019). For generation, we mainly focus on classical ancestral sampling, for the following reasons: (1) Truncation-based sampling algorithms (e.g., top-k sampling) are known to trade diversity for quality (Caccia et al., 2018; Zhang et al., 2020; Nadeem et al., 2020), so invoking them could "hide" the exposure bias problem because the prefixes from the model would be of higher quality (thus narrowing the discrepancy). (2) These sampling algorithms require tuning of hyper-parameters, which would complicate the comparison. That said, in our experiments (including the human evaluation) we have also tested top-k sampling (Fan et al., 2018), with consistent observations.
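To make the teacher-forcing objective concrete, the following toy sketch computes the NLL of one sentence: every next-word probability is conditioned on the ground-truth prefix, never on the model's own samples. Here `model_prob` is a hypothetical callback standing in for P_θ, not an interface from the paper.

```python
import math

def teacher_forcing_nll(model_prob, prompt, sentence):
    """Teacher-forcing NLL for one (prompt, sentence) pair: each next-word
    prediction is conditioned on the ground-truth prefix."""
    nll, context = 0.0, list(prompt)
    for word in sentence:
        nll -= math.log(model_prob(tuple(context), word))
        context.append(word)  # feed the ground-truth word back in
    return nll

# Toy "model": uniform over a 4-word vocabulary, so every step costs log 4.
uniform = lambda ctx, w: 0.25
loss = teacher_forcing_nll(uniform, ["<bos>"], ["a", "b", "c"])
```

Averaging this quantity over pairs sampled from P_D recovers the expectation in Equation (1).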
In addition to popular metrics in natural language generation (NLG) such as BLEU (Papineni et al., 2002) or METEOR (Denkowski and Lavie, 2014), our quantification approaches also rely on measuring the divergence between two distributions. Let P denote the set of probability distributions on the vocabulary V, and let f_div : P × P → R_{≥0} be a divergence function between two distributions. We will adopt two popular probability divergences: the total variation distance (denoted d_TV) and the Jensen-Shannon divergence (d_JS). Definitions of d_TV and d_JS are provided in Appendix A.
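For concreteness, the two divergences can be computed as follows over vocabulary distributions represented as dictionaries; this is a minimal sketch, not the paper's implementation.

```python
import math

def d_tv(p, q):
    """Total variation distance between two distributions on the same vocabulary."""
    return 0.5 * sum(abs(p[w] - q[w]) for w in p)

def d_kl(p, q):
    """Kullback-Leibler divergence (terms with p[w] == 0 contribute nothing)."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

def d_js(p, q):
    """Jensen-Shannon divergence: symmetrized KL to the mixture M = (P + Q) / 2."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(q, m)

p = {"a": 0.5, "b": 0.5, "c": 0.0}
q = {"a": 0.25, "b": 0.25, "c": 0.5}
tv, js = d_tv(p, q), d_js(p, q)  # tv == 0.5
```

Both quantities are bounded (d_TV by 1, d_JS by log 2 in nats), which makes them convenient for averaging over many prefixes.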

A Qualitative Attempt
We begin with a qualitative attempt to verify the seriousness of exposure bias by designing a prefix-switching experiment as follows: we feed an MLE-trained transformer LM on the wiki-103 dataset with four types of prefixes of the same length: (1) test-data samples, (2) the model's own samples, (3) test-data samples shuffled at the word level, or (4) samples from a uniformly random distribution on V. We then let the model continue the generation given these prefixes and compare the quality of the samples in a qualitative manner. Details of the model and dataset are deferred to Section 6.
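The two artificial prefix types, (3) and (4), can be constructed as in the sketch below; `make_artificial_prefixes` is our own illustrative helper, not part of the paper's released code.

```python
import random

def make_artificial_prefixes(data_prefix, vocab, seed=0):
    """Build prefix types (3) and (4) of the switching experiment: a
    word-level shuffle of a data prefix, and a uniformly random token
    sequence of the same length."""
    rng = random.Random(seed)
    shuffled = list(data_prefix)
    rng.shuffle(shuffled)  # type (3): same words, scrambled order
    rand = [rng.choice(vocab) for _ in data_prefix]  # type (4): uniform tokens
    return shuffled, rand

prefix = "the cat sat on the mat".split()
shuf, rand = make_artificial_prefixes(prefix, vocab=["a", "b", "c"])
```

The shuffled prefix preserves the word distribution of the data while destroying its syntax, which is what makes it a useful stress test for incremental distortion.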
The intuition behind the prefix-switching experiment follows immediately from the original claim of exposure bias: during generation, if we set the prefix distribution to be the ground-truth data distribution instead of the model's own, the discrepancy between training and generation in the prefix will be removed, and hence the model's generation quality should be much better. We illustrate this idea in Figure 1. In the extreme case of shuffled or random prefixes, given the incremental-distortion claim of exposure bias, we expect the model to also generate badly distorted sequences.
The samples with different types of prefixes are shown in Table 1. To make the generations from data and model prefixes more comparable, we force the same prompt at the beginning to constrain the topic. Moreover, we intentionally use long prefixes of length 100, in the hope that the incremental distortion of generation (as claimed by exposure bias) would become observable. We also include examples from an LSTM LM in Table 6 (Appendix B), which give similar observations.
Contrary to our expectation, we do not observe a noticeable difference in sample quality between samples from model prefixes and data prefixes.
More surprisingly, the model is still able to generate fairly good samples from shuffled prefixes. Even in the extreme case of random prefixes, we still observe basic language structures in the sample.
This experiment suggests that MLE-trained LMs have a self-recovery ability, i.e., the model is able to recover from artificially distorted history input and generate samples of reasonable quality. This observation casts doubt on exposure bias, which claims that the distortions in the generation should, on the contrary, be incremental.
This qualitative attempt suggests the impact of exposure bias could be more subtle than we expected. In particular, it would be difficult to judge whether the distortions are incremental via qualitative examination. Motivated by this, we now turn to more rigorous quantification methods to measure the impact of exposure bias.

Quantification Methods
We now introduce the definitions of our proposed metrics. We attempt to quantify the impact of exposure bias on three key aspects of open-ended language generation: quality, diversity, and consistency. We first introduce EB-M, which covers the quality and diversity aspects, and then EB-C, which covers the consistency aspect. Following the intuition of the prefix switching experiment, we design our quantification metrics to be a simple ratio reflecting the relative performance gain when data prefixes are fed to the model as opposed to the original model prefixes.

Definition of EB-M
In this section, we propose the EB-M metric. Since the key idea is to compare the generation quality with different types of prefixes, denoting the prefix distribution as P H ∈ {P M , P D } (model or data prefixes), we first formalize the following 3-step generation process: (1) Sample a fixed-length prompt C from P D .
(2) Given a prefix length l and a prefix distribution P H , we sample a prefix W 1:l from P H ( · | C).
(3) Conditioned on the prompt and prefix, we sample W l+1:l+lgen from P M ( · | C, W 1:l ), where l gen is the length of generation.
We denote the marginal distribution of W_{l+1:l+l_gen} under the above generation process as P_{M|H}^{W_{l+1:l+l_gen}}. From the claim of exposure bias, we expect the quality or diversity of W_{l+1:l+l_gen} to be better when P_D is used as P_H than when P_M is.
With these ingredients in hand, we now propose the EB-M quantification metric for exposure bias ("M" stands for "marginal"). It reflects the relative performance gain when the length-l prefix is from P_D instead of from P_M, and is formulated as below:

EB-M(M|D(l), f_score) = f_score(P_{M|D}^{W_{l+1:l+l_gen}}) / f_score(P_{M|M}^{W_{l+1:l+l_gen}}),   (2)

where f_score is a pre-defined scoring function of the generation samples, and we assume a higher value of f_score indicates that the generation is of higher quality or diversity. In our experiments, we will use popular NLG metrics including BLEU (Papineni et al., 2002) / Nist (Doddington, 2002) / METEOR (Denkowski and Lavie, 2014), which mainly capture the quality aspect, and backward-BLEU (Shi et al., 2018) / n-gram entropy, which capture the diversity aspect.
EB-M has several potential weaknesses. First, it does not reflect whether the generation is consistent with the given prefix W_{1:l}, because it only focuses on the marginal distribution of W_{l+1:l+l_gen}. Second, even in the P_{M|D} case, exposure bias still affects the generation of W_{l+1:l+l_gen} itself. To address these shortcomings, in the next section we propose another quantification method, named EB-C, which focuses on the model's word-level conditional distribution of W_{l+1} given the prefix W_{1:l}. Finally, standard NLG metrics (such as BLEU) have recently been criticized for correlating poorly with human judgements (Sellam et al., 2020); therefore, in Section 6.3 we conduct a human evaluation for completeness.
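Operationally, EB-M is just a ratio of two corpus-level scores. The sketch below makes this explicit, with a toy distinct-bigram score standing in for the real NLG metrics (it is a stand-in for f_score, not one of the metrics used in the paper).

```python
def eb_m(score_fn, gens_from_data_prefix, gens_from_model_prefix):
    """EB-M as a ratio of corpus-level scores (a sketch of Eq. 2):
    a value above 1 means generations from data prefixes scored higher."""
    return score_fn(gens_from_data_prefix) / score_fn(gens_from_model_prefix)

def distinct_bigrams(samples):
    """Toy stand-in for f_score (a diversity proxy): fraction of unique bigrams."""
    grams = [tuple(s[i:i + 2]) for s in samples for i in range(len(s) - 1)]
    return len(set(grams)) / max(len(grams), 1)

same = [["a", "b", "c"], ["a", "b", "d"]]
ratio = eb_m(distinct_bigrams, same, same)  # identical sample sets -> exactly 1.0
```

In the paper's real-data setting, `score_fn` would be BLEU / Nist / METEOR computed against a large pool of held-out references.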

Definition of EB-C
We propose EB-C as a conditional counterpart of EB-M. Again, let P_H ∈ {P_M, P_D} denote the prefix distribution. With a given prefix length l, we first define the conditional generation deviation (CGD) as the expected divergence between P_M and P_D for W_{l+1}, conditioned on prefix samples from P_H:

CGD(M|H(l), f_div) = E_{C ~ P_D, W_{1:l} ~ P_H(·|C)} [ f_div(P_M(·|C, W_{1:l}), P_D(·|C, W_{1:l})) ].   (3)

For the choice of f_div, we use the standard d_TV and d_JS divergences introduced in Section 3. A smaller CGD value indicates a better-modeled conditional distribution for W_{l+1} under prefix distribution P_H, which captures the consistency aspect of text generation. Also note that since here we focus on the generation of a single word W_{l+1}, the distortion from exposure bias should be completely removed when a data prefix is fed (i.e., in the CGD(M|D) case).
Exposure bias should induce a meaningful gap between CGD(M|M(l), f_div) and CGD(M|D(l), f_div). We now define the EB-C quantification metric for exposure bias at prefix length l with divergence f_div as:

EB-C(l, f_div) = CGD(M|M(l), f_div) / CGD(M|D(l), f_div).   (4)

EB-C reflects the relative gain in the CGD value when the prefix distribution P_M is replaced by P_D. Since the computation of CGD requires access to the ground-truth data distribution, in our experiments we first consider a synthetic setting, where an existing model is used as P_D.
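In the synthetic setting, where P_D is itself a model whose next-word distribution can be queried, EB-C can be estimated by straightforward Monte Carlo. A toy sketch, assuming both conditionals return dictionary distributions over the vocabulary (the function names are ours, for illustration only):

```python
def cgd(f_div, p_m_cond, p_d_cond, prefixes):
    """Monte-Carlo estimate of CGD (Eq. 3): average divergence between the
    model's and the synthetic data model's next-word distributions."""
    return sum(f_div(p_m_cond(x), p_d_cond(x)) for x in prefixes) / len(prefixes)

def eb_c(f_div, p_m_cond, p_d_cond, model_prefixes, data_prefixes):
    """EB-C (Eq. 4): CGD under model prefixes over CGD under data prefixes."""
    return (cgd(f_div, p_m_cond, p_d_cond, model_prefixes)
            / cgd(f_div, p_m_cond, p_d_cond, data_prefixes))

# Toy conditionals over a 2-word vocabulary; the "model" is slightly off.
d_tv = lambda p, q: 0.5 * sum(abs(p[w] - q[w]) for w in p)
p_d = lambda prefix: {"a": 0.5, "b": 0.5}
p_m = lambda prefix: {"a": 0.6, "b": 0.4}
val = eb_c(d_tv, p_m, p_d, [("x",)], [("y",)])  # same deviation everywhere -> 1.0
```

In the paper's experiments the prefix sets would be 10k samples drawn from P_M and P_D respectively.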

Experiment Results
In this section we present our quantification results for exposure bias. In addition, we propose variants of the EB-M / EB-C metrics to analyze the self-recovery ability. For a systematic assessment, we decompose the claim of exposure bias into two factors: (1) The discrepancy in the prefix distribution hurts the generation performance in general.
(2) Moreover, the distortion should be incremental along the generation. The first factor can be reflected by the average magnitude of the measurements (the values are expected to be larger than 1 by a meaningful margin), and the second factor can be reflected by whether the measurements have an increasing trend along the prefix length. Below we begin by describing the experiment setting.

Experiment Setting
Most of our experiments are conducted on the wiki-103 dataset. It has around 1.8m sentences / 101m words for training, and 4k sentences / 241k words for testing. We favour the wiki-103 dataset because it has long and complex paragraphs (from Wikipedia), which is useful for the measurements of exposure bias. It is also among the most popular datasets for LM benchmarking.
Real-data Setting for EB-M To prepare an MLE-trained P_M, we use the code from Transformer-XL (Dai et al., 2019) to train a transformer LM on the wiki-103 dataset. The model is a 16-layer Transformer-XL model with a hidden dimension of 410 and an inner dimension of 2100. Since the computation of BLEU / METEOR / Nist scores requires large amounts of unseen real-data samples as references, we use half of the wiki-103 training data (around 900k sentences and 50m words) to train the model P_M, and save the other half as samples from P_D (used as references for BLEU / METEOR / Nist). Other training configurations of Transformer-XL (learning rate, batch size, etc.) are unchanged. The resulting model P_M has a test-set perplexity (PPL) of 27.81 (if trained on the full training data, the PPL is 24.02). In addition, we also train a 3-layer LSTM LM (Sundermeyer et al., 2012) with a hidden dimension of 600, which has a test-set PPL of 34.80.
Synthetic Setting for EB-C Recall that the estimation of EB-C (Equation 4) requires access to the data distribution; we therefore first consider a synthetic setting where we treat the 16-layer Transformer-XL model trained on the full wiki-103 training data as P_D. We then construct a pseudo training set of roughly the same size as the original training set by sampling from P_D. Next, a randomly initialized 4-layer Transformer-XL model is used as P_M and trained on the pseudo training set with the same hyper-parameters. The resulting P_M model has a perplexity of 84 on the wiki-103 test set (while P_D has a perplexity of 24), indicating that the pseudo P_D model is far from fully recovered by the training process. Finally, EB-C is estimated using 10k samples from P_M and P_D.
In our experiments for both EB-M and EB-C, we fix the length of the prompt C and l_gen (for EB-M) to 20, and vary the prefix length l. Full measurements are given in Table 8 and Table 9 in Appendix B. Both the EB-M and EB-C measurements indicate that removing the prefix discrepancy gives around 1% or 2% of relative performance gain.

Quantification of Exposure Bias
We show the EB-M (real-data setting) and EB-C (synthetic setting) measurements with different prefix lengths l in the upper and middle parts of Table 2. We observe that the EB-M measurements with various NLG metrics are around or below 1.01, for both the LSTM and the transformer model. Likewise, the EB-C measurements are around or below 1.02. While this suggests that the prefix discrepancy does hurt the generation performance to some degree (1% or 2% relative degradation), its impact seems limited. More importantly, the performance gap does not become much larger as the prefix length grows (even with a long prefix length of 100), which contradicts the incremental-distortion claim of exposure bias.
In addition, we test several natural variants of the EB-M experiments: (1) Set l gen to be shorter (10) or longer (30); (2) Use a smaller number of references (e.g., 1k); (3) Use top-k sampling with k = 40 instead of sampling from the whole vocabulary. In all these cases the observations are very similar (the measurements are less than or around 1.01), and we omit these results here.
We then check whether "worse" prefixes would induce a larger performance loss. We inject two types of noise into the data prefix: (1) Similar to the prefix-switching experiment (Table 1), we feed the transformer model with word-level shuffled data prefixes, and then compute the quality score for the generations, abbreviated as f_score(M|D_shuf). The ratio between f_score(M|D) and f_score(M|D_shuf) is denoted as EB-M(M|D_shuf, f_score).
(2) With a given corrupt rate, we replace tokens in the data prefix with tokens uniformly sampled from V (if the rate is set to 0.1, a random 10% of the tokens in the prefix are corrupted). Note that similar techniques have been used in Khandelwal et al. (2018) to study how LMs utilize context. We denote the resulting EB-M ratio as EB-M(M|D_corrupt, f_score). We include the results with shuffled prefixes in the lower part of Table 2, and results with corrupted prefixes are shown in Figure 2. To save space, we only show results with BLEU or d_JS (for EB-C); the observations from other metrics are similar.
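The corruption step above can be sketched as follows; `corrupt_prefix` is our own illustrative helper (the paper does not publish this routine), replacing a random fraction of positions with tokens drawn uniformly from the vocabulary.

```python
import random

def corrupt_prefix(prefix, vocab, rate, seed=0):
    """Replace a random `rate` fraction of tokens with tokens sampled
    uniformly from the vocabulary (the D_corrupt perturbation, sketched)."""
    rng = random.Random(seed)
    out = list(prefix)
    n = round(rate * len(prefix))
    for i in rng.sample(range(len(prefix)), n):  # n distinct positions
        out[i] = rng.choice(vocab)
    return out

tokens = list("abcdefghij")
corrupted = corrupt_prefix(tokens, vocab=["X"], rate=0.3)  # 3 of 10 positions hit
```

Sampling positions without replacement guarantees the corrupt rate is hit exactly, which keeps the noise level comparable across prefixes.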
The measurements of EB-M(M|D_shuf) and EB-M(M|D_corrupt) are much larger than those of EB-M(M|D). For example, with a corrupt rate of 0.3, EB-M reports a relative performance loss of around 10%. The observations from EB-C are similar, but with larger measurements. We suspect the reason is that EB-C is a word-level metric and is not affected by self-recovery (to be further discussed in Section 6.4).
These results match our expectation that a large-enough discrepancy would indeed induce significant degradation in the model's generation. In comparison, the training-generation discrepancy claimed by exposure bias seems to remain within the model's "comfort zone", with limited impact.

Human Evaluation
To verify our observations from the EB-M and EB-C experiments, we conduct a human evaluation with Amazon Mechanical Turk (AMT). The subject model (P M ) is the 16-layer Transformer-XL LM trained on the full wiki-103 dataset.
Our goal is to compare the human ratings of generations from the model given data or model prefixes. We follow the standard evaluation protocol for open-ended language generation: the turkers are shown a context and a corresponding generation, and are asked to rate the quality (how grammatical / informative / logical the generation is) and the consistency (how related the generation is to the context) of the generation. They use a score from 0 (invalid) to 5 (completely meets the expectation of natural language); scores such as 0.5 or 4.5 are also allowed.
The context consists of a length-20 prompt (from data) and a prefix of varying length, drawn either from P_D or P_M. The prompts are taken from the beginnings (including the title) of articles in the wiki-103 validation and test sets. The length of the generation is fixed to 20. In each assignment, the turker is asked to rate 10 context-generation pairs in shuffled order; 5 of them use data prefixes and the other 5 use model prefixes. Note that this evaluation is slightly "unfair" in that the consistency rating of the generation is inevitably affected by the quality of the prefix, putting generations from model prefixes at a disadvantage. To prevent the quality rating from also being affected by errors in the prefix, we first show turkers the generation and ask them to rate its quality before showing them the context. We also explicitly ask turkers to judge the quality of the generation disregarding any errors in the context.
For every configuration, we collect scores for 250 context-generation pairs from a pool of 130 turkers. To reduce variance, for each context-generation pair we collect five replicas of scores from 5 independent turkers and compute the average. We report the mean and standard deviation (as error bars) of the average scores from five independent runs (each run consists of 50 pairs). The ratings have an inter-annotator agreement of around 65%.
The results are shown in Table 3. We observe that for the quality aspect, both the absolute gap and the relative improvement (around 1%) between generations from the two types of prefixes are small. The gap in the consistency aspect is larger but still limited (around 3%); this is, however, expected, because the model prefix is of lower quality compared with the golden data prefix. More importantly, for both quality and consistency, we do not observe a strong increasing trend in the performance gap as the prefix length grows. In addition, we repeat the human evaluation with top-k sampling (k set to 40); the results are included in Table 5 (Appendix B), with similar observations. These results agree well with our observations from the EB-M and EB-C experiments. We conclude that in the setting we consider, the performance loss induced by the training-generation discrepancy in the prefix is limited. Moreover, the incremental distortion claimed by exposure bias is not observed.

Quantification of Self-Recovery
In this section, we propose variants of the EB-M / EB-C metrics to quantify the effect of self-recovery. In particular, we aim to check, if we let the model continue the generation from artificially distorted (e.g., shuffled or corrupted) prefixes, whether it is able to self-recover and generate sequences of decent quality. Taking shuffled prefixes as an example, we introduce a gap length l_gap and define EB-M_gap to be

EB-M_gap(M(l_gap)|D_shuf(l), f_score) = f_score(P_{M|D}^{W_{l'+1:l'+l_gen}}) / f_score(P_{M|D_shuf}^{W_{l'+1:l'+l_gen}}),

where l' = l + l_gap, and in both cases the model itself generates the length-l_gap gap span before the scored segment. We use the notation M(l_gap)|D_shuf(l) to emphasize that the shuffled prefix is still of length l, and the model's generation spans positions l+1 to l+l_gap+l_gen. EB-M_gap(M(l_gap)|D_corrupt(l)) is defined in a symmetric fashion. The definition of EB-C_gap is similar but more involved, and we defer it to Appendix A. There are multiple possible outcomes from the introduction of the gap span: the model could either recover from the errors in the prefix or, on the contrary, aggravate the distortion.

Figure 3: EB-M_gap (with BLEU) and EB-C_gap (with d_JS) measurements with different shuffled/corrupted prefix lengths and gap lengths. The corrupt rate is set to 0.3. The model is shown to self-recover from the artificial errors in the prefix.
We show EB-M_gap and EB-C_gap measurements with different shuffled / corrupted prefix lengths and gap lengths in Figure 3. In all cases, a clear decreasing trend is observed as l_gap grows. This is consistent with the qualitative observations in Section 4 (Table 1), showing that the LM has a self-recovery ability.
From this set of analyses, we suspect that the self-recovery ability is countering the harmful effects of exposure bias. We summarize this into the following hypothesis: Hypothesis 1. The mismatch between P_M and P_D as prefix distributions exists, and indeed leads to some level of distortion along the generation. However, the LM's self-recovery ability could be countering the harmful effects of exposure bias, preventing an incremental performance degradation along the model's generation.

Caveats in Measuring Exposure Bias
In this section we point out that one needs to be cautious when using our methodology of comparing generations from data and model prefixes to measure exposure bias. In particular, in the EB-M experiments, we compute the NLG metrics with a large number of unseen test-data samples as references. This is common practice for open-ended generation. However, for classical seq2seq applications such as dialogue or machine translation, one might be tempted to use only the single given reference. This could lead to seriously misleading results.
In Table 4, we provide examples to illustrate this caveat. In some cases, if we give the model the prefix of the reference answer, it becomes much easier for the model to guess the remaining words. On the other hand, even when the model's own generation is valid, it receives a much lower BLEU score. In other words, giving the model a partial reference amounts to "cheating" when the generation space is already constrained by the context.
For a fair comparison, what we really need is a reference answer conditioned on the prefix sampled from the model (our design of EB-C follows this spirit). For example, in the dialogue case, we need to measure whether "to school" is a good completion for the prefix "I went". This, however, would require additional data collection, and we leave it for future work.

Discussion and Limitations
Due to space constraints, we defer a discussion of teacher forcing as an objective function to Appendix D. We devote this section to discussing the limitations of this work. First, the proposed quantification approaches focus only on exposure bias, and do not reflect generation performance in general. For example, a uni-gram LM, which generates words independently of previous context, has no exposure bias problem and can pass our test easily. More importantly, since the original claim of exposure bias is not rigorously defined, our approaches can only act as reasonable proxies to measure its seriousness, and we humbly acknowledge their limitations.
We also note that this work is focused on open-ended language generation (in Appendix E, we provide a preliminary study of the self-recovery ability in machine translation). Our results do not rule out the possibility that exposure bias could be more serious in other NLG applications of a different nature, which we leave for future work.
Finally, the results of this work should not discourage researchers from exploring non-MLE training algorithms for LMs (including text GANs). As shown by recent studies, there exist important problems other than exposure bias for current NLG models, such as the likelihood trap (Holtzman et al., 2020), factuality (Massarelli et al., 2019; He et al., 2021), and robustness (Cheng et al., 2018; Kassner and Schütze, 2019). Therefore, it is entirely possible that a non-MLE training objective can lead to better generation performance (Lu et al., 2018; Huszár, 2015; Welleck et al., 2020; He and Glass, 2020; Gu et al., 2017).

Conclusion
In this work, we design and experiment with two metrics (EB-M and EB-C) as proxies to quantify the significance of exposure bias. The measurements from our experiments, including a human evaluation, consistently show that the distortion induced by the prefix discrepancy is limited, and does not seem to be incremental during the generation. Moreover, our analysis reveals an interesting self-recovery ability of the LM, which we hypothesize to be countering the harmful effects from exposure bias.

A Divergence Definitions and EB-C_gap

In this section, we formally define the different probability divergences used in the paper. We first give the definition of the total variation distance d_TV between two distributions P and Q on vocabulary V:

d_TV(P, Q) = (1/2) Σ_{w∈V} |P(w) - Q(w)|.

For d_JS, we first define the Kullback-Leibler divergence d_KL:

d_KL(P || Q) = Σ_{w∈V} P(w) log(P(w) / Q(w)).

We can now define the Jensen-Shannon divergence d_JS:

d_JS(P, Q) = (1/2) d_KL(P || M) + (1/2) d_KL(Q || M),

where M = (1/2)(P + Q).

Definition of EB-C_gap EB-C_gap is, in spirit, similar to EB-M_gap. In the following we assume shuffled prefixes are considered. We first introduce l_gap and define

CGD_gap(M(l_gap)|D_shuf(l), f_div) = E [ f_div(P_M(·|W^shuf_{1:l}, W_{l+1:l+l_gap}), P_D(·|W^shuf_{1:l}, W_{l+1:l+l_gap})) ],

where the gap span W_{l+1:l+l_gap} is sampled from P_M, we omit the prompt C to save space, and W^shuf_{1:l} refers to the token-level shuffled data prefixes.
Next, EB-C gap is defined as follows:
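For concreteness, the divergences defined in this appendix can be computed directly for a toy vocabulary. The snippet below is a minimal sketch of our own (not the paper's codebase), representing each distribution as a dict over V:

```python
import math

def d_tv(p, q):
    """Total variation distance: half the L1 distance between P and Q."""
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in vocab)

def d_kl(p, q):
    """Kullback-Leibler divergence d_KL(P || Q); terms with P(w) = 0 contribute 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def d_js(p, q):
    """Jensen-Shannon divergence with mixture M = (P + Q) / 2."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(q, m)

P = {"A": 0.9, "B": 0.1}
Q = {"A": 0.5, "B": 0.5}
print(d_tv(P, Q))  # 0.4
```

Note that d_JS is always bounded by log 2, whereas d_KL can be unbounded; this is one reason both are reported in the paper.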

B Auxiliary Results and Plots
We show more samples from the prefix-switching experiment for models trained on the wiki-103 dataset in Table 6 (LSTM) and Table 10 (transformer).
We show EB-M measurements on the wiki-103 dataset for the LSTM model with different NLG metrics, in Table 7.
In Table 5, we repeat the human evaluation (Table 3) with top-k sampling (k set to 40). The observations are similar. Note that the gap for consistency is smaller than the numbers in Table 3. We suspect the reason is that top-k sampling improves the quality of the prefix.
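For reference, top-k sampling truncates the next-token distribution to the k most probable tokens and renormalizes before sampling. The following is a minimal sketch of our own (the `probs` list is a hypothetical model output, not the paper's implementation):

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample a token id from the k highest-probability entries of `probs`.

    Entries outside the top k are discarded and the remaining mass is
    renormalized (implicitly, by sampling within the truncated total).
    """
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    r = rng.random() * total
    for i in top:
        r -= probs[i]
        if r <= 0:
            return i
    return top[-1]  # numerical safety for floating-point round-off

# With k = 1 this reduces to greedy decoding:
print(top_k_sample([0.1, 0.6, 0.3], k=1))  # 1
```

Larger k (e.g. the k = 40 used above) keeps more of the tail, trading some quality for diversity.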

C Experiments with text GANs
In this section, we apply EB-C to text GAN models to compare the behavior of different training objectives. We compare MLE baseline against ScratchGAN (de Masson d'Autume et al., 2019) and RankGAN (Lin et al., 2017) in the synthetic setting. 7 For both ScratchGAN and RankGAN, we use the released code.
Since the released code for both ScratchGAN and RankGAN is based on LSTMs, we focus on LSTM LMs for this set of experiments. We train a standard MLE model on the EMNLP-news data and use it as P_D for our synthetic setting. EMNLP-news refers to the EMNLP 2017 WMT News section, which has around 268k sentences / 7.5M words for training and 10k sentences / 277k words for testing; it has been widely used in the text GAN literature (Yu et al., 2016; Lu et al., 2018). The P_D model is a one-layer LSTM LM with a hidden dimension of 512. We randomly initialize another one-layer LSTM LM with a hidden dimension of 32 as P_M, and then train it on samples from P_D either with MLE or with the GAN objective.
The results are shown in Figure 4. We find that RankGAN and ScratchGAN give lower EB-C measurements than MLE, which is expected, as these methods avoid teacher forcing. Most EB-C values in the ScratchGAN case are less than 1, which matches our intuition that GAN models should behave better when fed with model prefixes than data prefixes. On the other hand, EB-C in the RankGAN case is still slightly larger than 1. We believe the reason is that RankGAN still relies on MLE pre-training.
7 The synthetic setting we consider has been popularly adopted in the text GAN literature (Yu et al., 2016).
To the best of our knowledge, this is the first direct empirical evidence that non-MLE training can indeed avoid the exposure bias problem, in the sense that the model behaves better with model prefixes than with data prefixes. It also suggests that EB-C correctly captures how the training-generation discrepancy affects generation. Note that a lower EB-C value does not mean the generation performance is better (the authors of ScratchGAN acknowledge that its performance is still inferior to the MLE baseline).

D Discussions
We discuss the critical question "Is teacher forcing really a biased objective?" from the perspective of objective functions. Note that the teacher forcing (MLE) objective (1) can be re-written as:

$$\theta^{*} = \arg\max_{\theta} \mathbb{E}_{W \sim P_D}[\log P_{\theta}(W)] = \arg\min_{\theta} D_{\mathrm{KL}}(P_D \,\|\, P_{\theta}),$$

where D_KL denotes the Kullback-Leibler divergence, and θ denotes the trainable parameters in P_M. Therefore, teacher forcing (MLE) training minimizes the divergence of P_θ, which is exactly the model's sampling distribution, from P_D. While it is true that training is "exposed" only to data samples as prefixes, we should not simply deduce that the objective is "biased".
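The equivalence rests on the identity D_KL(P_D || P_θ) = CE(P_D, P_θ) − H(P_D): the entropy term does not depend on θ, so maximizing likelihood (minimizing cross-entropy) is the same as minimizing the KL divergence. A small numerical check on toy distributions of our own choosing:

```python
import math

def cross_entropy(p, q):
    """CE(P, Q) = -sum_w P(w) log Q(w)."""
    return -sum(pw * math.log(q[w]) for w, pw in p.items() if pw > 0)

def entropy(p):
    """H(P) = -sum_w P(w) log P(w)."""
    return -sum(pw * math.log(pw) for pw in p.values() if pw > 0)

def kl(p, q):
    """D_KL(P || Q) = sum_w P(w) log(P(w) / Q(w))."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

P_D = {"A": 0.5, "B": 0.5}
P_theta = {"A": 0.8, "B": 0.2}

# D_KL(P_D || P_theta) = CE(P_D, P_theta) - H(P_D), term by term.
assert abs(kl(P_D, P_theta) - (cross_entropy(P_D, P_theta) - entropy(P_D))) < 1e-12
```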
A Concrete Toy Example for EB-C

What kind of model has a large EB-C value? Here we discuss a concrete toy LM which is hand-crafted to have a large EB-C value. However, we will argue that such a model is unlikely to be a product of MLE training.

Table 8: The corresponding f-score values for the EB-M(M_TF) measurements in Table 2.

Table 9: CGD measurements for the transformer synthetic setting (EB-C values are shown in Table 2).

Example 1: Let the vocabulary be V = {A, B} and the sequence length be 2. P_D is uniform: P_D(W_1 = A) = 0.5, and P_D(W_2 = A | W_1) = 0.5 for either value of W_1. The model sets P_M(W_1 = A) = 0.9, P_M(W_2 = A | W_1 = A) = 0.9, and P_M(W_2 = A | W_1 = B) = 0.5. Note that the model behaves worse when W_1 = A, which is of high probability during sampling.
For Example 1, using d_TV we can easily get CGD(M|D(1)) = 0.2 and CGD(M|M(1)) = 0.36, which gives EB-C(M, 1) = 1.8. However, this crafted model is unlikely to be an outcome of MLE training. The fact that P_M(· | W_1 = B) is modeled better suggests that the training data contain more sentences beginning with W_1 = B than with W_1 = A. MLE training should therefore assign more probability to P_M(W_1 = B) than to P_M(W_1 = A), not the other way around. From this perspective, the claim of exposure bias seems to conflict with the MLE principle.
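These numbers can be verified directly. The sketch below computes CGD(M|D(1)) and CGD(M|M(1)) under d_TV for the toy model, assuming (as in the example) a uniform P_D; CGD here is the expected conditional d_TV, with the first token drawn from either the data or the model distribution:

```python
def d_tv(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(p[w] - q[w]) for w in p)

# Toy Example 1: vocabulary {A, B}, sequences of length 2.
# P_D is uniform; P_M over-samples W1 = A and mis-models P(. | W1 = A).
pd_first = {"A": 0.5, "B": 0.5}
pm_first = {"A": 0.9, "B": 0.1}
pd_cond = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
pm_cond = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.5, "B": 0.5}}

# CGD(M|D(1)): first token from the data distribution.
cgd_d = sum(pd_first[w] * d_tv(pd_cond[w], pm_cond[w]) for w in pd_first)
# CGD(M|M(1)): first token from the model distribution.
cgd_m = sum(pm_first[w] * d_tv(pd_cond[w], pm_cond[w]) for w in pm_first)

print(cgd_d, cgd_m, cgd_m / cgd_d)  # approximately 0.2, 0.36, 1.8
```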

E A Preliminary Study for Machine Translation
In this section, we conduct a preliminary prefix-switching experiment in a standard neural machine translation (NMT) setting. We follow the example code from Fairseq (https://github.com/pytorch/fairseq/tree/master/examples/translation) to train a 6-layer encoder-decoder transformer model with a hidden dimension of 512 and an inner dimension of 1024 on the IWSLT14 German-to-English dataset (http://workshop2014.iwslt.org/). It has around 160k sentences / 3.7M words for training, and 6.7k sentences / 150k words for validation or testing (in English). For decoding, we use beam search with a beam size of 20.
We feed the trained model with different types of prefixes during decoding, representing different levels of training-generation discrepancy. Samples are shown in Table 11. Note that the source input is kept intact.
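The prefix-switching decode itself is simple: the first target tokens are forced to the chosen prefix (data, model, unrelated, or random), after which the decoder continues on its own. Below is a minimal greedy-decoding sketch of our own; `next_token_logprobs` is a hypothetical stand-in for the model, not Fairseq's API:

```python
def decode_with_forced_prefix(next_token_logprobs, prefix, max_len, eos):
    """Force `prefix` as the first target tokens, then continue greedily.

    `next_token_logprobs(tokens)` returns a dict mapping each candidate
    next token to its log-probability given the tokens so far.
    """
    tokens = list(prefix)                      # forced prefix (any source)
    while len(tokens) < max_len:
        dist = next_token_logprobs(tokens)
        nxt = max(dist, key=dist.get)          # greedy continuation
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens

# Toy model: repeats the last token once, then emits <eos>.
def toy_model(tokens):
    if tokens.count(tokens[-1]) < 2:
        return {tokens[-1]: -0.1, "<eos>": -2.0}
    return {"<eos>": -0.1, tokens[-1]: -2.0}

print(decode_with_forced_prefix(toy_model, ["a", "b"], 6, "<eos>"))
# ['a', 'b', 'b', '<eos>']
```

The experiments above use beam search instead of greedy decoding, but the prefix-forcing step is the same: only the continuation after the prefix is searched.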
Note that we should not directly compare the generation under data or model prefixes with the corresponding reference: in a constrained task such as MT, giving the model part of the data reference amounts to cheating. Still, we observe that data prefixes do not greatly improve the generation. However, this could also be due to the short generation length, as errors from exposure bias may not have had a chance to build up.
More interestingly, in the extreme case of an unrelated or random prefix, the model still generates a fairly good partial translation. This suggests that the self-recovery ability is not limited to LMs trained for open-ended generation. Finally, we emphasize that the existence of self-recovery does not rule out the possibility that exposure bias is still serious for machine translation. A comprehensive and principled study (preferably with datasets of longer sequences) is needed, and we leave that as future work.