Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation

Current language generation models suffer from issues such as repetition, incoherence, and hallucinations. An often-repeated hypothesis is that this brittleness of generation models is caused by a mismatch between the training and generation procedures, also referred to as exposure bias. In this paper, we verify this hypothesis by analyzing exposure bias from an imitation learning perspective. We show that exposure bias leads to an accumulation of errors, analyze why perplexity fails to capture this accumulation, and empirically show that this accumulation results in poor generation quality. Source code to reproduce these experiments is available at https://github.com/kushalarora/quantifying_exposure_bias

However, despite the successes achieved by these models on several conditional generation tasks, they continue to suffer from degenerate behaviors such as repetition, a lack of diversity, dullness, and incoherence, especially in open-ended generation settings such as text completion and dialog modeling (Holtzman et al., 2019). This degenerate behavior is often attributed to a mismatch between maximum likelihood training and the generation procedure.
* A part of this work was done when the author was an intern at Borealis AI.
† During a part of this work, the author was an employee at Borealis AI.
‡ During a part of this work, the author was an Academic Advisor at Borealis AI.
Maximum likelihood training, also referred to as teacher forcing (Williams and Zipser, 1989), factorizes the language model as a linear chain and maximizes the log-likelihood of this factorized language model on a training corpus. During maximum likelihood training, the model learns the distribution of the next token conditioned on contexts drawn from the ground-truth training data.
A concern with MLE-based training is that ground-truth contexts from the training corpus are not available during generation. Rather, the conditioning contexts during this phase comprise tokens previously generated by the model itself. The distribution of contexts seen during generation might therefore be very different from the one encountered during training. This mismatch is referred to as exposure bias (Ranzato et al., 2016). A side effect of exposure bias is that an error at any step during generation can have a cascading effect, as the next context will incorporate the erroneous prediction, deviating further from the ground-truth context distribution and leading to more errors. Several authors (Welleck et al., 2019; Choi et al., 2020; Li et al., 2016) have speculated that these errors might result in sequences that degenerate over the sequence length, producing incoherent text, a lack of vocabulary diversity, hallucinations, and word- and phrase-level repetition.
There is an active debate in the language generation community on the impact of exposure bias in language generation. Authors have both validated (Xu et al., 2019; Zhang et al., 2019b) and questioned (He et al., 2019) the impact of exposure bias on language generation. Previous works have also linked exposure bias to out-of-distribution generalization (Schmidt, 2019) and to out-of-domain generalization and hallucinations (Wang and Sennrich, 2020), but these claims remain weak in the absence of a clear and principled formalization of the exposure bias issue. Finally, several approaches have been proposed to mitigate exposure bias (Ranzato et al., 2016; Shen et al., 2016; Bahdanau et al., 2017; Leblond et al., 2018; Welleck et al., 2019). Though these approaches improve performance on downstream tasks, the authors neither formalized exposure bias nor provided empirical evidence that the downstream improvements are directly linked to the mitigation of the exposure bias issue.
In this paper, we attempt to clarify this confusion by formalizing exposure bias in terms of an accumulation of errors and by analyzing its impact on generation quality. We provide a theoretically grounded understanding of the exposure bias issue by analyzing it from an imitation learning perspective. We use this perspective to show that behavior cloning, an imitation learning algorithm, is equivalent to teacher forcing under a particular choice of loss function. We then exploit this equivalence by borrowing the bound on error accumulation caused by behavior cloning and use it to formalize exposure bias and analyze error accumulation in language generation.
Finally, we use this quantifiable definition of exposure bias to demonstrate that models trained using teacher forcing do suffer from an accumulation of errors. We also show, both analytically and empirically, why perplexity fails to capture this error accumulation, and how a lower exposure bias correlates with better generation quality.

Language Generation Formulation
Given a finite-sized vocabulary set $V$, language generation is posed as the problem of generating a variable-length sequence $w_0^n \in V^n$ from a language model $p_\theta$, either unconditionally or conditioned on a source $x$, using a decoding algorithm $\mathcal{F}$. Language modeling is the problem of learning the parameterized model $p_\theta$ that approximates an oracle model $o$.
Maximum likelihood-based training factorizes the probability distribution model $p_\theta(w_0^n)$ into a linear chain, i.e.,

$$p_\theta(w_0^n) = \prod_{i=0}^{n} p_\theta(w_i \mid w_0^{i-1}),$$

where $w_i$ is the token to be generated at step $i$ and $w_0^{i-1}$ is the context at step $i$, i.e., all the tokens seen from step $0$ to step $i-1$. During maximum likelihood training, the language model is trained by minimizing the negative log-likelihood on the corpus $D$, i.e.,

$$L_{TF}(p_\theta) = -\frac{1}{|D|} \sum_{w_0^n \in D} \sum_{i=0}^{n} \log p_\theta(w_i \mid w_0^{i-1}),$$

where $|D|$ is the number of tokens in the corpus. Given a trained language model $p_\theta$, the simplest strategy for generating a target sequence is to greedily sample the model; i.e., at each step $i$, pick the most probable token $w_i = \arg\max p_\theta(\cdot \mid w_0^{i-1}; x)$ as the prediction. For the next step $i+1$, we use $w_i$ to form the context $w_0^i = w_0^{i-1} w_i$ and use it to predict the next token. This continues either until the maximum sequence length $T$ is reached or a special end-of-sequence (EOS) token is generated.
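The greedy decoding loop described above can be sketched with a toy model. The bigram "language model" and vocabulary below are invented for illustration (a real model conditions on the whole context, not just the last token):

```python
# Toy sketch of greedy decoding from a factorized language model.
BOS, EOS = "<bos>", "<eos>"

# Hypothetical next-token distributions p_theta(. | previous token).
P_THETA = {
    BOS: {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, EOS: 0.2},
    "a": {"dog": 0.7, EOS: 0.3},
    "cat": {"sat": 0.6, EOS: 0.4},
    "dog": {EOS: 1.0},
    "sat": {EOS: 1.0},
}

def greedy_decode(p_theta, max_len=10):
    """At each step i, pick w_i = argmax p_theta(. | context)."""
    context = [BOS]
    while len(context) < max_len:
        next_dist = p_theta[context[-1]]
        w_i = max(next_dist, key=next_dist.get)  # greedy choice
        if w_i == EOS:
            break
        context.append(w_i)  # concatenate: w_0^i = w_0^{i-1} w_i
    return context[1:]

print(greedy_decode(P_THETA))  # ['the', 'cat', 'sat']
```

Note how each prediction is fed back into the context: this feedback loop is exactly what lets early errors influence all later steps.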

An Imitation Learning Perspective of Language Generation
In this section, we present an imitation learning perspective of language generation. This framing allows us to borrow theoretical machinery from the imitation learning literature to formalize the exposure bias issue and analyze it in terms of the accumulation of errors due to the procedural mismatch between MLE-based training and generation. We start by posing language generation as a sequential decision-making problem and language modeling as an instance of imitation learning. We exploit these parallels to show that behavior cloning, an imitation learning algorithm, is equivalent to teacher forcing under a particular choice of loss function. We then exploit this equivalence to quantify the error accumulation due to exposure bias.
Language Generation is a Sequential Decision-Making Problem: A sequential decision-making problem can be formalized as learning a policy $\pi(a_t \mid s_t)$ over a space of actions $a_t \in A$ and states $s_t \in S$, where the next state $s_{t+1}$ is conditioned on the current state-action pair and is determined by the transition distribution $P(s_{t+1} \mid s_t, a_t)$. We can use this framework to pose language generation as an instance of a sequential decision-making problem with the language model $p_\theta$ as the policy, contexts $w_0^{t-1} \in V^*$ as states, next-token predictions $w_t \in V$ as actions, and concatenation as the transition function.
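A minimal sketch of this framing, with states as token tuples and the transition distribution collapsed to deterministic concatenation:

```python
# States are contexts (token tuples), actions are next tokens, and
# P(s_{t+1} | s_t, a_t) is a point mass on the concatenation.
def transition(state, action):
    """Deterministic transition: append the chosen token to the context."""
    return state + (action,)

s0 = ("<bos>",)
s1 = transition(s0, "the")
s2 = transition(s1, "cat")
print(s2)  # ('<bos>', 'the', 'cat')
```

Because the transition is deterministic, the entire randomness of a rollout comes from the policy's token choices, which is why the induced state-visitation distribution depends only on the model and the decoding algorithm.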
This perspective allows us to appreciate the fact that, during generation, predictions at previous steps affect the next predictions, and errors over time can cascade resulting in incoherent sequences.
Language Modeling is Imitation Learning: Imitation learning is a class of methods for solving a sequential decision-making problem while having access to an oracle policy $o$ or to data generated by the oracle. In imitation learning, an agent learns a model policy $\pi$ that reproduces the expert policy $o$ on the state-visitation distribution $d_\pi^t$ induced by the model policy $\pi$ itself, i.e.,

$$\min_\pi \sum_{t=1}^{T} \mathbb{E}_{s \sim d_\pi^t}\left[l(\pi, s; o)\right],$$

where $l(\pi, s; o)$ is the expected per-step cost of the model $\pi$ mimicking the oracle $o$ in state $s$, and $d_\pi^t$ is the state-visitation distribution at step $t$ induced by following policy $\pi$ from step $1$ to $t-1$.
The sequential decision-making perspective of language generation allows us to pose language modeling as an instance of imitation learning: learning a model for a sequential decision-making problem with the help of an expert oracle (in RL-based methods) or using the data generated by the oracle (in MLE-based methods).
Teacher Forcing is Behavior Cloning: The assumption of access to an oracle is unrealistic in many scenarios. Behavior cloning is an approach to solving an imitation learning problem using only the training data generated by an oracle. In this setup, the state-action pairs in the training data are assumed to be identically and independently distributed. This is equivalent to reducing a sequential decision-making problem to a supervised multi-class classification learning problem.
Concretely, this learning problem can be seen as minimizing the expected per-step loss under the state distribution induced by the oracle:

$$L_{BC}(\pi) = \sum_{t=1}^{T} \mathbb{E}_{s \sim d_o^t}\left[l(\pi, s; o)\right]. \quad (4)$$

Here, $L_{BC}(\pi)$ is the behavior cloning loss and $l(\pi, s; o)$ is the per-step loss. Similarly, in practical scenarios, language models are also trained on a finite training corpus $D$ that is assumed to be generated by the oracle, i.e., $w_0^n \sim o$. The maximum likelihood training loss from Equation 3 can be reformulated as learning the distribution over the next token conditioned on training contexts generated by the oracle, $w_0^{t-1} \sim d_o^t$. The behavior cloning loss (Equation 4) is thus equivalent to the language modeling loss. For our analysis, we define the per-step loss for language modeling, $l(p_\theta, w_0^{t-1}; o)$, as the KL divergence between the oracle and model next-token distributions:

$$l(p_\theta, w_0^{t-1}; o) = D_{\mathrm{KL}}\left(o(\cdot \mid w_0^{t-1}) \,\|\, p_\theta(\cdot \mid w_0^{t-1})\right). \quad (8)$$

This definition ensures that the per-step loss for the oracle is zero, i.e., $l(o, w_0^{t-1}; o) = 0$. It also ensures that the behavior cloning loss $L_{BC}(p_\theta)$ is equivalent to the teacher forcing loss $L_{TF}(p_\theta)$ up to a constant term. This equivalence of $L_{BC}(p_\theta)$ and $L_{TF}(p_\theta)$ ensures that the model learned by minimizing either of the two losses will be identical.
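The zero-for-oracle property of the per-step loss can be checked numerically. The sketch below assumes the per-step loss is the KL divergence between the oracle's and the model's next-token distributions; the two toy distributions are invented for illustration:

```python
import math

def per_step_loss(p_model, p_oracle):
    """KL(o(.|context) || p_theta(.|context)) over a shared support.

    Zero when the model matches the oracle, positive otherwise.
    """
    return sum(
        po * math.log(po / p_model[w])
        for w, po in p_oracle.items()
        if po > 0.0
    )

oracle = {"the": 0.7, "a": 0.3}   # hypothetical oracle next-token dist
model = {"the": 0.5, "a": 0.5}    # hypothetical imperfect model

assert per_step_loss(oracle, oracle) == 0.0  # l(o, context; o) = 0
assert per_step_loss(model, oracle) > 0.0    # imperfect model pays a cost
```

Since KL equals the cross-entropy minus the oracle's entropy, minimizing this loss over oracle contexts differs from the teacher forcing loss only by a constant, which is what makes the two training objectives interchangeable.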
Language Generation is Policy Rollouts: During policy rollouts, an agent in state $s_t$ executes the action $a_t$, sampled from policy $\pi$, and ends up in state $s_{t+1}$. The agent's next state depends on its own actions. This state evolution can be formulated as sampling from the state-visitation distribution induced by the policy $\pi$, i.e., $s_{t+1} \sim d_\pi^{t+1}$. The performance of policy $\pi$ during rollouts can be measured using the loss (cost) of executing the policy:

$$L_I(\pi) = \sum_{t=1}^{T} \mathbb{E}_{s \sim d_\pi^t}\left[l(\pi, s; o)\right].$$

We can also formulate language generation in terms of policy rollouts from imitation learning. Mathematically, we can express generation as sampling contexts from the model's context distribution, i.e., $w_0^{j-1} \sim d_{p_\theta,\mathcal{F}}^j$, and generating the next token $w_j$ conditioned on $w_0^{j-1}$ using the decoding algorithm $\mathcal{F}$. We can now define the inference-time loss $L_I(p_\theta)$ as the accumulated $T$-step loss of the model $p_\theta$ imitating the oracle $o$ on the context distribution induced by the model:

$$L_I(p_\theta) = \sum_{t=1}^{T} \mathbb{E}_{w_0^{t-1} \sim d_{p_\theta,\mathcal{F}}^t}\left[l(p_\theta, w_0^{t-1}; o)\right],$$

where $d_{p_\theta,\mathcal{F}}^t$ is the context distribution at step $t$ induced by the use of the model $p_\theta$ and the decoding algorithm $\mathcal{F}$ from step $1$ to $t-1$.

Exposure Bias and Error Accumulation
Ranzato et al. (2016) defined exposure bias as a behavioral mismatch between maximum likelihood-based training and the generation procedure at inference time. During maximum likelihood-based training, the next-token distribution is conditioned on ground-truth data whereas, during generation, it has to rely on the model's own previously generated tokens. They also postulated that this mismatch between the training and generation context distributions might result in an accumulation of errors during generation.
Intuitively, when the model produces a token w i that makes the resulting context w i 0 unfamiliar, it might not be able to continue the generation adequately and is likely to produce another token which will further make the context flawed. This phenomenon reinforces itself as the context drifts further from what the oracle would produce, leading to an accumulation of errors.
In the imitation learning literature, the accumulation of errors while rolling out a policy trained using behavior cloning is analyzed in terms of the inference-time regret of the behavior cloning policy $\pi_{BC}$ with respect to the oracle policy $o$ (Ross and Bagnell, 2010; Ross et al., 2011), i.e.,

$$R(\pi_{BC}) = L_I(\pi_{BC}) - L_I(o).$$

Let $\epsilon_t$ be the expected error of executing policy $\pi$ at step $t$ on the state-visitation distribution induced by the oracle $o$, i.e.,

$$\epsilon_t = \mathbb{E}_{s \sim d_o^t}\left[l(\pi, s; o)\right],$$

and let $\epsilon$ be the average expected error of executing policy $\pi$ over $T$ steps, i.e., $\epsilon = \frac{1}{T}\sum_{t=1}^{T} \epsilon_t$. Assuming $l(\pi, s; o)$ is an upper bound on the $[0, 1]$ loss, we can bound the regret for a policy $\pi_{BC}$ as

$$T\epsilon \;\le\; R(\pi_{BC}) \;\le\; T^2\epsilon. \quad (14)$$

The lower bound in Equation 14 assumes no accumulation of errors, hence an expected error of $\epsilon$ at each step, whereas the upper bound assumes the worst-case scenario, resulting in linear growth in error at each step and overall quadratic growth w.r.t. the maximum sequence length $T$. Relying on the imitation learning perspective of language generation presented in the previous section, we can borrow this regret-based analysis to similarly bound the regret of a language generation model as

$$T\epsilon \;\le\; R(p_\theta, \mathcal{F}) \;\le\; T^2\epsilon,$$

where $p_\theta$ is the model being used for generation, $\mathcal{F}$ is the decoding method, $\epsilon = \frac{1}{T}\sum_{t=1}^{T} \epsilon_t$, and

$$\epsilon_t = \mathbb{E}_{w_0^{t-1} \sim d_o^t}\left[l(p_\theta, w_0^{t-1}; o)\right].$$

We will now use these bounds on the regret to analyze and quantify the error accumulation due to exposure bias in language generation.
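The gap between the two bounds of Equation 14 can be illustrated with a Monte-Carlo sketch. The error process below is a stylized assumption, not a language model: on-distribution, the policy errs with probability eps per step; in the worst case, a single error takes the rollout off-distribution and every later step also errs.

```python
import random

def simulate_regret(T, eps, trials=20000, worst_case=True, seed=0):
    """Monte-Carlo sketch of error accumulation during T-step rollouts.

    worst_case=True: after the first error, all later steps err
    (compounding, as in the T^2 upper bound of Ross & Bagnell, 2010).
    worst_case=False: steps stay independent (no accumulation).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        off_distribution = False
        for _ in range(T):
            if off_distribution and worst_case:
                total += 1          # compounding: every later step errs
            elif rng.random() < eps:
                total += 1          # fresh error on-distribution
                off_distribution = True
    return total / trials           # estimated expected T-step regret

T, eps = 50, 0.02
regret = simulate_regret(T, eps)
assert T * eps <= regret <= T * T * eps   # Equation 14 bounds hold
no_accum = simulate_regret(T, eps, worst_case=False)
assert regret > 2 * no_accum              # accumulation inflates errors
```

With these assumed numbers, the no-accumulation process makes about one error per rollout while the compounding process makes roughly twenty, despite both having the same per-step error on the oracle distribution.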

Quantifying Error Accumulation due to Exposure Bias
In our analysis, we use two metrics, $\mathrm{AccErr}_\le(l)$ and $\%\mathrm{ExAccErr}_\le(l)$, to measure the impact of error accumulation due to exposure bias. We define accumulated errors up to length $l$, $\mathrm{AccErr}_\le(l)$, as

$$\mathrm{AccErr}_\le(l) = \frac{R_{\le l}(p_\theta, \mathcal{F})}{\epsilon_{\le l}}.$$

Here, $R_{\le l}(p_\theta, \mathcal{F})$ is the regret due to the use of the language model $p_\theta$ and the decoding method $\mathcal{F}$ up to sequence length $l$, and $\epsilon_{\le l} = \frac{1}{l}\sum_{t=1}^{l} \epsilon_t$ is the expected per-step error up to length $l$.
This metric captures the growth of errors w.r.t. the sequence length $l$. If exposure bias does indeed lead to error accumulation, $\mathrm{AccErr}_\le(l)$ should grow super-linearly w.r.t. $l$.
We define our second metric, $\%\mathrm{ExAccErr}_\le(l)$, as the percentage of excess errors committed by the model that can be attributed to exposure bias, i.e.,

$$\%\mathrm{ExAccErr}_\le(l) = \frac{R_{\le l}(p_\theta, \mathcal{F}) - l\,\epsilon_{\le l}}{l\,\epsilon_{\le l}} \times 100.$$

Here, $l\,\epsilon_{\le l}$ is the lower bound on the regret and is the minimum number of errors ($\epsilon$ per step) a model would make if there were no accumulation of errors. $\%\mathrm{ExAccErr}_\le(l)$ allows us to compare models, training algorithms, and decoding strategies on the extra errors that might be caused or mitigated by their use. A model, training algorithm, or decoding strategy that perfectly mitigates exposure bias will result in zero excess accumulated errors.
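A sketch of the two metrics, under the assumption that accumulated error is the regret normalized by the average per-step error (so that no accumulation lands exactly on the $y = x$ line) and that excess error compares the regret to the no-accumulation baseline $l \cdot \epsilon$. All numbers below are hypothetical:

```python
def acc_err(regret, avg_eps):
    """Accumulated error: regret normalized by average per-step error.

    With no accumulation (regret = l * eps), this reduces to l, so any
    growth above the y = x line signals error accumulation.
    """
    return regret / avg_eps

def pct_ex_acc_err(regret, avg_eps, l):
    """Percentage of excess errors over the no-accumulation baseline."""
    baseline = l * avg_eps          # lower bound on the regret
    return 100.0 * (regret - baseline) / baseline

# Hypothetical numbers: per-step error 0.25 and a regret of 64 after
# l = 128 steps, i.e. twice the no-accumulation level of 32.
l, eps, regret = 128, 0.25, 64.0
assert acc_err(regret, eps) == 256.0            # 2x the y = x line
assert pct_ex_acc_err(regret, eps, l) == 100.0  # 100% excess errors
```

A model that perfectly mitigated exposure bias would give `acc_err == l` and `pct_ex_acc_err == 0.0` at every length.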
In the rest of the paper, we use these definitions to show: 1) error accumulation in language generation is real, 2) perplexity fails to capture this error accumulation, 3) lower exposure bias correlates with a higher quality generation that is more coherent, uses more diverse vocabulary, and is less repetitive.
Study Setup: Open-ended Generation

Text Completion Setup: Text completion is the standard experimental setup to measure the quality of generation in open-ended language generation (Holtzman et al., 2019; Welleck et al., 2019). It is also a generalization of numerous practical language generation applications such as story generation (Fan et al., 2018), contextual text completion (Radford et al., 2019), and dialog modeling (Zhang et al., 2018).
Text completion models take a text passage or prefix $w_0^j \sim o$ as input and generate a coherent continuation of the prefix, $w_{j+1}^n$, using the language model $p_\theta$ and the decoding algorithm $\mathcal{F}$, i.e., $w_{j+1}^n = \mathcal{F}(p_\theta, w_0^j)$. In this paper, we use this text-completion setup to analyze the error accumulation due to exposure bias and its correlation with language generation quality.
Language Model and Dataset: We conduct our analysis using the GPT-2 language model (Radford et al., 2019). We use the GPT-2 117M model as our evaluation language model and use the train split of WikiText-103 (Merity et al., 2016) for prompts.
We rely on a GPT-2 model fine-tuned on WikiText-103 as our approximate oracle. We tokenize the WikiText-103 dataset using GPT-2's tokenization scheme and chunk its train split into sequences of length 512. Of these, we use the first 50 tokens as prompts for our generation experiments and generate completions up to a maximum length of 512 or until the end-of-sequence token. We use a total of 20k prompts for our evaluation.
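The prompt construction step above can be sketched as follows. The integer "token ids" stand in for GPT-2 tokenizer output; the function name is ours:

```python
def make_prompts(token_ids, chunk_len=512, prompt_len=50):
    """Chunk a tokenized corpus into length-512 sequences and keep the
    first 50 tokens of each chunk as a generation prompt."""
    chunks = [
        token_ids[i:i + chunk_len]
        for i in range(0, len(token_ids) - chunk_len + 1, chunk_len)
    ]
    return [chunk[:prompt_len] for chunk in chunks]

corpus = list(range(2048))        # stand-in for GPT-2 token ids
prompts = make_prompts(corpus)
assert len(prompts) == 4          # 2048 tokens -> 4 chunks of 512
assert len(prompts[0]) == 50      # each prompt is 50 tokens
assert prompts[1][0] == 512       # each prompt starts a new chunk
```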

Error Accumulation in Language Generation is Real!

Figure 1a plots $\mathrm{AccErr}_\le(l)$ w.r.t. the sequence length $l$. The support (dotted, orange) line $y = x$ captures linear growth. The figure shows that $\mathrm{AccErr}_\le(l)$ grows near-quadratically w.r.t. sequence length, empirically validating the theory that exposure bias leads to an accumulation of errors. Figure 1b further strengthens this claim by demonstrating near-linear growth in excess errors w.r.t. the sequence length.
We hypothesize that these excess errors would manifest in the form of language degeneration, especially in the latter part of the sequence, and would cause issues such as hallucinations, limited vocabulary, and word-and phrase-level repetitions.

Perplexity is Not Enough
Perplexity is a standard measure used to evaluate the quality of a language model and is often used as a proxy for its text generation quality. In this section, we argue that perplexity paints an incomplete picture of a model's ability to generate high-quality, coherent text. It only captures the average per-step generalization gap (or lack of it) but fails to account for the error accumulation due to exposure bias. These accumulated errors, as seen in the previous section, can grow near-quadratically and can prove to be a major concern for any generation model that generates sequences longer than a few words.
Perplexity can be seen as the scaled, exponentiated average per-step error, $\epsilon$, computed over a held-out test set $D_h$:

$$\mathrm{ppl}(p_\theta; D_h) = \exp\left(H(p_\theta; D_h)\right),$$

where $H(p_\theta; D_h)$ is the entropy rate (log perplexity) of the model $p_\theta$ on the held-out test set $D_h$. As the entropy rate is a linear function of the average per-step error, we hypothesize that it will only be able to measure the per-step generalization gap of the model and will fail to capture the error accumulation caused by reducing a sequential decision-making problem to a supervised learning problem.

Figure 1: (a) Accumulated error up to length $l$, $\mathrm{AccErr}_\le(l)$, plotted w.r.t. $l$, showing the near-quadratic growth of accumulated errors w.r.t. sequence length predicted by the theory. (b) Percentage of excess errors due to error accumulation, $\%\mathrm{ExAccErr}_\le(l)$, caused by exposure bias. Extra errors due to exposure bias grow near-linearly with sequence length, and decoding with greedy search results in over 70% more errors.

Table 1: Impact of error accumulation on generation quality. We observe that stochastic decoding methods not only lead to more diverse language generation but also have lower exposure bias than the deterministic methods.
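The blindness of perplexity to the position of errors can be demonstrated directly. The two per-token log-probability profiles below are invented: one spreads its errors uniformly, the other concentrates them late in the sequence (as accumulation would), yet their perplexities are identical:

```python
import math

def entropy_rate(log_probs):
    """H(p_theta; D_h): average negative log-likelihood per token."""
    return -sum(log_probs) / len(log_probs)

def perplexity(log_probs):
    """exp of the entropy rate: a per-step average that cannot see
    *where* in the sequence the errors fall."""
    return math.exp(entropy_rate(log_probs))

uniform_errors = [math.log(0.5)] * 8                       # even errors
late_errors = [math.log(1.0)] * 4 + [math.log(0.25)] * 4   # late errors

assert abs(perplexity(uniform_errors) - 2.0) < 1e-9
assert abs(perplexity(uniform_errors) - perplexity(late_errors)) < 1e-9
```

Both profiles average to $\log 2$ nats per token, so perplexity reports 2.0 for each, even though the second profile is the signature of a rollout that degenerates over its length.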
In Figure 2, we plot the entropy rate up to length $l$, $H(p_\theta; D_h)_{\le l}$, w.r.t. the average per-step error $\epsilon_{\le l}$ and the length-normalized regret up to length $l$, $R_{\le l}(p_\theta, \mathcal{F})/l$. We observe a strong correlation between the entropy rate and the average per-step error ($\rho = 0.9997$), validating our theoretical observation that perplexity can capture the per-step generalization gap of the language model $p_\theta$. On the other hand, the length-normalized regret exhibits a poor correlation with the entropy rate ($\rho = 0.4003$), indicating perplexity's failure to capture the error accumulation due to exposure bias.

Figure 2: Analyzing (log) perplexity $H_{\le l}$ w.r.t. the average per-step error $\epsilon_{\le l}$ and the length-normalized exposure bias regret $R_{\le l}(p_\theta, \mathcal{F})/l$. We observe that perplexity strongly correlates with the average per-step error ($\rho = 0.9997$) but has a weaker correlation with the length-normalized regret ($\rho = 0.4003$).
A case in point of perplexity's inability to capture error accumulation is the degenerate behavior of GPT-2 (Radford et al., 2019) while generating moderately long sequences under greedy or beam search. This happens despite GPT-2 having a low zero-shot perplexity on the held-out set of the WikiText-103 dataset (perplexity: 37.50). We hypothesize that the degenerate behaviors of large pre-trained language models, such as repetition, low vocabulary usage, and a lack of coherent generation, are partly the result of this accumulation of errors. An example of this behavior is presented in Table 2, where we observe GPT-2 generating repetitive and incoherent text completions for a WikiText-103 prompt under deterministic decoding schemes such as greedy and beam decoding.

Error Accumulation Impacts Generation Quality
Finally, we examine the hypothesis that the poor text generation capabilities of pre-trained large language models under greedy decoding might be due to the error accumulation caused by the procedural mismatch between generation and maximum likelihood training (Vijayakumar et al., 2016; Welleck et al., 2019; Holtzman et al., 2019). The regret-based definition of error accumulation allows us to analyze exposure bias along two axes of variation: the trained language model, $p_\theta$, and the decoding algorithm, $\mathcal{F}$. In this set of experiments, we explore the impact of various decoding schemes on error accumulation due to exposure bias and the quality of the completed text.

Context: ... He made one attempt to break through the blockade on 23 December, but it proved impossible to do as the Union controlled every channel from Savannah, aided ...

Greedy: ... Savannah, aided by the British, to the North. The blockade was broken on 23 December by the British and the Americans. The blockade was broken on 23 December by the British and the Americans. The blockade was broken on 23 December by the British and the Americans. The blockade was broken on 23 December by the British and the ...

BeamSearch (k=5): ... Savannah, aided by the North Carolina Railroad. In the spring of 1788, in the midst of the blockade, he made a move to join the Confederacy, but it was too late. On 23 December 1788, he marched to the battle of Custer's Creek in the South Carolina River. In the following weeks, he returned to Savannah on 23 December, but it was too late. On 24 December 1788, he made another attempt to break through the blockade on 23 December, but it was too late. ...

Top-k Sampling (k=100): ... Savannah, aided by a little naval warplane bombardment which had crippled many parts of the ship. As an early American merchant vessel he was carried forward as a heavy merchant, while at some time it was necessary to take up residence in Savannah to be used as a regular transport vessel. In the meantime, on the 20th of January 1816, Jefferson died. He was buried in an unmarked grave in the chapel inside Savannah in the family home at Glenbrook, where the rest of the family resides. ...

Top-p Sampling (p=0.94): ... Savannah, aided in the capture of his land and his supplies. He made two attempts to stop a blockade on 28 December, but failed to make it. There were, however, six people in Savannah who were convinced that Captain Walker was the cause of the blockade. He then made a trip to North Carolina where he gave up hope. ...

Gold: ... He made one attempt to break through the blockade on 23 December, but it proved impossible to do as the Union controlled every channel from Savannah, aided by their occupation of Tybee Island at the mouth of the Savannah River. Bulloch reported to Mallory in late January 1862 that breaking out was hopeless so Mallory ordered him to turn the ship over to another officer and to return to Europe some other way. ...

Table 2: Examples of completions using various decoding methods. We observe that the deterministic decoding schemes produce less diverse, incoherent, and more repetitive (highlighted in red) text.
For a quantitative analysis of the impact of various decoding algorithms on the quality of language generation, we measure completion quality using the same metrics as Welleck et al. (2019): 1) rep/128 measures whether the predicted token at step t occurs in the previous 128 steps; 2) wrep/128 counts the prediction's repetition at step t only if the predicted token is not the ground-truth token at that position; 3) seq-rep-4 measures repetition at the 4-gram level; and 4) uniq measures vocabulary diversity by counting the number of unique tokens generated by the model. Table 1 shows that the sampling-based decoding algorithms result in more diverse and coherent language generation and a lower percentage of excess errors. Sampling with temperature (temp=1.2) leads to the least repetition (both at the token and the n-gram level), the second-highest vocabulary diversity, and the least excess error due to exposure bias. This also bears out in our qualitative analysis in Table 2, where sampling with temperature produces the most coherent text. Greedy and beam search decoding schemes, in contrast, fare poorly on both exposure bias reduction and the language generation quality metrics, producing repetitive and incoherent text. These quantitative and qualitative experiments offer evidence that reducing exposure bias does lead to more coherent text generation.
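Two of the repetition metrics above can be sketched directly (this is our reading of the metrics from Welleck et al., 2019; the example token lists echo the degenerate greedy output in Table 2 and are illustrative only):

```python
def seq_rep_4(tokens):
    """seq-rep-4: fraction of duplicate 4-grams in a sequence."""
    ngrams = [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def rep_128(tokens):
    """rep/128: fraction of tokens that already occurred in the
    previous 128 steps."""
    hits = sum(
        1 for t, tok in enumerate(tokens)
        if tok in tokens[max(0, t - 128):t]
    )
    return hits / len(tokens)

degenerate = ["the", "blockade", "was", "broken"] * 5  # looped phrase
varied = ["he", "made", "one", "attempt", "to", "break", "through"]

assert seq_rep_4(degenerate) > 0.7       # heavy 4-gram repetition
assert seq_rep_4(varied) == 0.0          # all 4-grams unique
assert rep_128(degenerate) > rep_128(varied)
```

On the looped phrase, only 4 of its 17 4-grams are distinct, so seq-rep-4 is high, matching the qualitative impression of the greedy completions.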
We hypothesize that the reasonable amount of randomness introduced by stochastic sampling helps the model avoid sampling the most likely token at each time step, thus avoiding divergent contexts that might have resulted in a degenerate completion later on. We conjecture that this timely intervention prevents the generation context distribution from diverging too far from the training context distribution, helping the model avoid compounding errors. This is also borne out by the qualitative analysis, as a reasonable amount of stochasticity does result in text that looks more coherent and oracle-like. A broader analysis of this behavior, though, is beyond the scope of this work and is left for future work.
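A sketch of sampling with temperature, the best-performing decoding scheme in Table 1. The toy next-token distribution is invented; the point is that, unlike greedy decoding, the sampler occasionally escapes the mode:

```python
import math
import random

def sample_with_temperature(dist, temperature, rng):
    """Rescale log-probabilities by 1/temperature, renormalize, sample.

    temperature > 1 flattens the distribution, injecting the mild
    stochasticity discussed above; temperature -> 0 recovers greedy.
    """
    tokens = list(dist)
    logits = [math.log(dist[t]) / temperature for t in tokens]
    z = sum(math.exp(x) for x in logits)
    probs = [math.exp(x) / z for x in logits]
    return rng.choices(tokens, weights=probs, k=1)[0]

rng = random.Random(0)
dist = {"the": 0.9, "a": 0.05, "an": 0.05}    # hypothetical p_theta
samples = [sample_with_temperature(dist, 1.2, rng) for _ in range(5000)]

assert samples.count("the") < 5000   # sampler sometimes leaves the mode
assert set(samples) == {"the", "a", "an"}
```

Greedy decoding on the same distribution would emit "the" at every step; the sampled trajectory instead explores lower-probability continuations, which is the proposed mechanism for keeping the generation context distribution close to the training one.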
Considering that the choice of decoding algorithm does not impact the average per-step error, $\epsilon$, this rules out the role of modeling and model training in the observed differences in language degeneration. Hence, it is reasonable to conclude that both the qualitative and quantitative improvements in language quality observed in this experiment are strongly linked to the reduction in error accumulation due to exposure bias.

Related Work
Non-MLE Training Methods: Several approaches have been proposed to mitigate the exposure bias issue, including RL-based optimization objectives (Ranzato et al., 2016; Shen et al., 2016; Bahdanau et al., 2017), learning to search (Leblond et al., 2018), energy-based models (Deng et al., 2020), imitation learning (Du and Ji, 2019), generative adversarial networks (Yu et al., 2017), and knowledge distillation. Although these methods motivate their approaches as intending to reduce exposure bias, they neither formally analyze exposure bias nor provide empirical evidence that they mitigate its effect. In this paper, we analyze exposure bias from a principled imitation learning perspective in terms of the accumulation of errors. This definition can be adapted to evaluate various novel training and modeling approaches on their ability to reduce exposure bias.
Smarter Decoding Methods: Large language models achieve unusually low test perplexities, yet they falter at coherent and diverse language generation, particularly in open-ended tasks and while using deterministic decoding schemes. Several authors (Vijayakumar et al., 2016; Welleck et al., 2019; Holtzman et al., 2019) have hypothesized that the training and inference mismatch due to MLE-based training is responsible for this degenerate behavior, and have proposed smarter decoding schemes to mitigate the side effects of exposure bias, resulting in better generation quality. Despite this being an active area of research, this often-repeated hypothesis for degenerate generation behavior has not received serious treatment until now. In this paper, we take a step toward explaining this discrepancy: we show that error accumulation due to exposure bias might be the reason for this degenerate behavior and explain why perplexity is handicapped in capturing this compounding of errors.
Analyzing Exposure Bias: Schmidt (2019) and Wang and Sennrich (2020) link exposure bias to a generalization gap due to distribution and domain shift respectively. Performance degradation under domain and distribution shift is a major issue with language generation, and direct evidence supporting this hypothesis will provide insights into building more robust language generation models. Unfortunately, neither of the papers formally analyzes the exposure bias issue or empirically links the generalization gap to exposure bias directly.
Three recent papers, Xu et al. (2019), Zhang et al. (2019b), and He et al. (2019), have tried to empirically evaluate the impact of exposure bias on language generation. The first two validate the existence of exposure bias, whereas He et al. (2019) show that language models have a self-recovering ability that negates its impact. All three analyses are based on an empirical definition of exposure bias which, in turn, is based on the informal formulation by Ranzato et al. (2016).
In this paper, we provide a principled and theoretically grounded approach to analyze exposure bias in language generation and show that it is indeed a problem and that it might explain the degeneration issue with large language models on open-ended tasks under deterministic decoding.

Discussion
In this paper, we analyze language generation from an imitation learning perspective. We use this analysis to arrive at a theoretical bound on error accumulation due to exposure bias. This bound predicts a super-linear growth in error accumulation during generation. In our experiments, we validate this bound and show that error accumulation due to exposure bias indeed results in a super-linear growth of errors.
We then show, both analytically and empirically, why perplexity is not enough to capture this accumulation of errors, and hypothesize that this accumulation is responsible for degenerate language generation. Finally, we provide evidence for this hypothesis by evaluating the impact of various decoding schemes on error accumulation and generation quality. We show that techniques that improve generation quality do result in lower error accumulation, indicating that excess error accumulation due to exposure bias might be a factor affecting language generation quality.
Our analysis provides a principled and theoretically grounded way to understand exposure bias. We believe it can pave the way for developing smarter training and decoding algorithms that address this error accumulation, resulting in more robust language generation models.