MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P – that is, minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE). We have observed that models trained in this way may “over-generalize”, in the sense that they produce non-human-like text. Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P relative to Q, is a better reflection of how a human would evaluate text generated by a model. Hence, we propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies. We evaluate models trained with this objective on synthetic data settings (where P is known) and real data, and show that the resulting models yield better generated text without complex decoding strategies.


Introduction
Rapid advances in pre-trained large-scale autoregressive language models (LMs) have dramatically improved performance on a variety of tasks (Radford et al., 2019; Brown et al., 2020; Zhang et al., 2022; Chowdhery et al., 2022). However, these systems still struggle in many open-ended generation settings, where they are asked to produce a long text following a short prompt. In these cases, we seek systems that generate text that is sensical, coherent, fluent, and engaging: in short, human-like text (Pillutla et al., 2022).
Different decoding strategies to generate such text from pretrained LMs suffer from different degeneration problems. Unbiased sampling (i.e., sampling directly from the model distribution, without truncation or re-weighting) usually results in incoherent and nonsensical text, while greedy and beam search often get stuck in repetition loops (Holtzman et al., 2020). These observations suggest that the learned LM distribution Q_θ still differs substantially from the human LM distribution P. A possible reason is that the autoregressive modeling of Q_θ gives a non-zero probability to every possible sequence of tokens, while many sequences are impossible under P. Nevertheless, we still hope that Q_θ(x) is as small as possible when P(x) = 0. To this end, maximum likelihood estimation (MLE), i.e., minimizing the forward cross-entropy, is the most widely used objective to train Q_θ using sequences sampled from P. In an idealized setting, with unlimited training data and model capacity, as well as a perfect optimizer, fitting Q_θ with MLE will learn a distribution as close to P as we like. However, in practice, we only have finite and noisy data.

Figure 1: MIXCE combines two complementary driving forces: reverse CE helps narrow the model distribution Q_θ down when it is broader than the data distribution P, while forward CE helps broaden Q_θ out when it is narrower than P.
We argue that the MLE objective only weakly penalizes generations x from Q_θ that are "bad", in the sense that P(x) = 0. When Q_θ puts a small amount of probability mass onto the P(x) = 0 space, MLE cannot sufficiently discourage this behavior (see Figure 3 in Appendix C). Moreover, minimizing the forward CE, −E_{x∼P}[log Q_θ(x)], is equivalent to minimizing the forward KL divergence between P and Q_θ, KL(P||Q_θ) = E_{x∼P}[log(P(x)/Q_θ(x))], since the two differ only by the entropy of P, which is constant in θ. Forward KL has a zero-avoiding property: it avoids Q_θ(x) = 0 when P(x) ≠ 0 (Murphy, 2012). Therefore, if there is noise in the data, Q_θ will try to cover the noise as well, which leads the model to over-generalize, in the sense of putting non-trivial probability mass on P(x) = 0 generations (Huszár, 2015; Theis et al., 2016; Ott et al., 2018; Kang and Hashimoto, 2020). As a result, we observe samples from the model deviating from human-like text. A common strategy is to modify the decoding method, e.g., top-k, top-p, typical, or contrastive sampling (Fan et al., 2018; Holtzman et al., 2020; Meister et al., 2022; Li et al., 2022), tailoring the model distribution Q_θ in a post-hoc manner to avoid unwanted generations. Our approach differs: we ask how we can obtain a better Q_θ in the first place, obviating the need for these sampling strategies.
We propose a novel training objective for autoregressive LMs: MIXCE, which Mixes the forward and reverse Cross-Entropies, combining −E_{x∼P}[log Q_θ(x)] and −E_{x∼Q_θ}[log P(x)] with a mixing ratio η. MIXCE can be understood in two ways. First, we want model generations to be high-quality as well as diverse. Reverse cross-entropy reflects how we conduct human evaluations: sampling from the model Q_θ and evaluating the samples against the human P, where the focus is text quality. Forward cross-entropy emphasizes the diversity of model generations (Hashimoto et al., 2019). Second, MIXCE works similarly to a mixture of the forward and reverse KL divergences. The reverse KL divergence, KL(Q_θ||P), is zero-forcing (it forces Q_θ(x) = 0 when P(x) = 0) and thus penalizes non-human-like samples more strongly than MLE. Overall, MIXCE combines two complementary driving forces to better fit Q_θ to P (Figure 1). We elaborate on these interpretations in § 3.1.
Unfortunately, optimizing the reverse cross-entropy is intractable because we do not know P. Hence, we propose an approximation of the reverse cross-entropy (see § 3.2), which ends up being a self-reinforced loss function that encourages the model to produce generations in which it is already confident. This loss function has the same computational complexity as the forward cross-entropy, making MIXCE easy to implement and as fast as MLE.
We demonstrate the effectiveness of MIXCE in both a synthetic setting, where the "human" distribution P is known, and a real setting. For the synthetic case, we evaluate six learning objectives: MIXCE, MIXCE* (MIXCE without approximation), forward KL (= MLE), reverse KL, the mixture of the two KL divergences, and the Jensen-Shannon (JS) divergence. We show that MIXCE* works slightly worse than the mixture of KLs while outperforming the other objectives, and MIXCE works worse than MIXCE* but generally outperforms MLE. In real settings, we finetune GPT-2 (Radford et al., 2019) models of different sizes on three English text domains using MIXCE or MLE. Our results show that, compared to MLE, unbiased sampling from MIXCE-finetuned models produces text that has diversity (Meister et al., 2022) closer to that of human text, higher Coherence (Su et al., 2022), higher Mauve (Pillutla et al., 2021), and is preferred by humans. When using top-p sampling (Holtzman et al., 2020) and carefully tuning p, generations from MLE-finetuned models are similar to those from MIXCE-finetuned models. Nonetheless, MIXCE models have tuned p values closer to 1, implying a less noisy model distribution. In addition, we modify the original Mauve to make it more robust to spurious features (e.g., text length), under which MIXCE still improves over MLE when using unbiased sampling.

Autoregressive Language Modeling
Language generation is mostly based on the autoregressive language modeling methodology: the generation of one token is conditioned on previously generated tokens, Q_θ(x_t|x_<t), and the final probability of the sequence x is the product of the per-step probabilities, Q_θ(x) = Π_{t=1}^{T} Q_θ(x_t|x_<t). A concrete scoring sketch is given at the end of this section. Early works build n-gram neural LMs (Bengio et al., 2000) and then RNN-based LMs (Mikolov et al., 2010), and now Transformers (Vaswani et al., 2017) have become the dominant architecture. Language generation models have either a decoder-only (Mikolov et al., 2010) or an encoder-decoder architecture (Sutskever et al., 2014; Bahdanau et al., 2015). In this work, we focus on decoder-only LMs. In recent years, many large-scale pre-trained decoder-only LMs have been introduced (Radford et al., 2019; Brown et al., 2020; Zhang et al., 2022; Chowdhery et al., 2022). They can be finetuned for downstream tasks and even perform surprisingly well in a zero-shot or few-shot manner. Despite the impressive performance, language degeneration is one of the key issues that remain to be solved.
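For concreteness, the following is a minimal sketch (using Hugging Face Transformers; the model size, input text, and variable names are illustrative) of computing the per-token and sequence-level log-probabilities under this factorization:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
# Position t predicts token t+1, so shift logits against the targets.
logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
token_logprobs = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
seq_logprob = token_logprobs.sum()  # log Q_θ(x) = Σ_t log Q_θ(x_t | x_<t)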

Language Degeneration
According to Holtzman et al. (2020), language degeneration refers to output text that is bland, incoherent, or gets stuck in repetitive loops. It is widely observed in open-ended generation from pretrained LMs. Two commonly observed patterns of degeneration are incoherent text from unbiased sampling and repetitive text from greedy or beam search. Degeneration also appears in sequence-to-sequence generation tasks, but in a slightly different form (Stahlberg and Byrne, 2019).
There is no agreement on what causes degeneration. Ott et al. (2018) attribute it to data noise and the smooth class of model functions: it is inherent in the model's structure to have support everywhere, in particular because all probabilities are produced by softmax, which is strictly positive. Relatedly, Hewitt et al. (2022) assume that an LM distribution is the true data distribution plus a uniform-like smoothing distribution. Based on the observation that human-like text has a large but not too large likelihood under the learned LM distribution (Zhang et al., 2021), many works propose empirically useful decoding methods beyond unbiased sampling and greedy/beam search (Fan et al., 2018; Holtzman et al., 2020; Eikema and Aziz, 2020; Basu et al., 2021; Meister et al., 2022; Li et al., 2022; Hewitt et al., 2022; Su et al., 2022; Krishna et al., 2022). One of these approaches is the canonical top-p (or nucleus) sampling method (Holtzman et al., 2020), which, at each decoding step, samples from the top tokens that take up p proportion (e.g., 95%) of the probability mass; a minimal sketch is given below. Even though these decoding methods work impressively well, they are post-hoc fixes rather than learning the LM accurately in the first place. Therefore, some other works criticize the MLE training objective and propose alternative loss functions.
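For reference, here is a minimal single-step sketch of nucleus (top-p) filtering over a 1-D vector of next-token logits; this is our own illustration, not the cited authors' implementation:

import torch

def top_p_filter(logits, p=0.95):
    # Keep the smallest set of top tokens whose cumulative probability
    # reaches p; mask everything else to -inf for this decoding step.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > p
    remove[1:] = remove[:-1].clone()  # shift right: keep the boundary token
    remove[0] = False                 # always keep the most likely token
    sorted_logits[remove] = float("-inf")
    return torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)

# One sampling step:
# next_token = torch.multinomial(torch.softmax(top_p_filter(logits), dim=-1), 1)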

Objectives Beyond MLE
Unlikelihood training (Welleck et al., 2020; Li et al., 2020) was proposed to explicitly penalize repetition (or any undesirable phenomenon) during training. The idea is to minimize the likelihood of a set of negative tokens at each generation step. The selection of negative tokens is pre-defined, e.g., tokens that appear often in the previous context. MIXCE shares the same goal as unlikelihood training, matching the human LM distribution, but provides a more general approach without targeting any specific problem.
Similar to our motivation, Kang and Hashimoto (2020) think that the zero-avoiding property of MLE makes the model sensitive to dataset noise: to cover noisy examples, the model has to put non-trivial probability mass on the P(x) = 0 area. To combat this problem, they propose a loss truncation method that drops high-loss (low-likelihood) examples during training.
Pang and He (2021) aim to address the mismatch between the learning objective and human evaluation (likelihood vs. quality) and introduce the GOLD algorithm to approximate reverse cross-entropy. Our approximation is similar to theirs but has a different derivation (see § 3.2). Moreover, GOLD is evaluated on controlled generation tasks (e.g., summarization and translation), where the goal is to generate one high-quality text for each input and diversity is less important. In contrast, if we train the LM only with reverse CE till convergence, the model will deterministically produce the most likely text for each prompt, which is undesirable for an LM. Therefore, mixing forward and reverse CEs is necessary.
The idea of MIXCE is also relevant to GANs (Goodfellow et al., 2014). GANs optimize the Jensen-Shannon (JS) divergence between the model and data distributions. Essentially, JS divergence also balances the two driving forces of the forward and reverse KL divergences (Huszár, 2015), and it has been successfully used for evaluating LM-generated text (Pillutla et al., 2021). However, probably due to the discrete nature of text, GANs have not been successfully applied to LM training. Caccia et al. (2020) show that previous language GANs often give up diversity for quality.
Another related work is Popov and Kudinov (2018), who finetune LMs with the sum of the forward cross-entropy loss and the reverse KL divergence. They train a discriminator to estimate the reverse KL, similar to a GAN. In contrast, we directly approximate the reverse cross-entropy in our objective function, without training an additional discriminator.
Concurrently, and with the same motivation as ours, Ji et al. (2023) propose to replace MLE with minimization of the total variation distance (TVD) (Van Handel, 2014) between the data and model distributions. Notably, their final approximation of TVD, which they call TaiLr, is equivalent to forward cross-entropy when the hyperparameter γ = 0 and equals our approximated reverse cross-entropy when γ = 1.

MIXCE
Our MIXCE learning objective for training LMs is the combination of the forward and reverse cross-entropies, written as

L(θ) = −η·E_{x∼P}[log Q_θ(x)] − (1 − η)·E_{x∼Q_θ}[log P(x)],   (1)

where η is the mixing ratio. When η = 1, it becomes the normal MLE objective; when η = 0, it is the reverse cross-entropy only.
The MIXCE loss can be understood in two ways. First, reverse and forward cross-entropy (CE) emphasize quality and diversity, respectively. The reverse CE, −E_{x∼Q_θ}[log P(x)], focuses on quality because it resembles how we conduct human evaluations: sampling from the model Q_θ and evaluating the samples against the human P. In human evaluations, the focus is more on the quality of the model-generated text, so it is possible for a model to always generate the same few high-quality texts and still get high human evaluation scores. This is similar to the mode collapse problem of GANs. The forward CE, −E_{x∼P}[log Q_θ(x)], instead focuses more on diversity because it requires any sample from P to have a non-trivial probability under Q_θ (Hashimoto et al., 2019). Note that this does not mean forward CE has zero effect on quality; rather, the model likelihood Q_θ(x) only loosely correlates with the human-perceived quality of x (Zhang et al., 2021).
Second, we hypothesize that MIXCE works similarly to a mixture of forward and reverse KL divergences, which we will show empirically in our synthetic experiments (§ 4.1). On the one hand, minimizing forward KL is equivalent to optimizing forward CE. On the other hand, the reverse KL divergence, KL(Q_θ||P) = E_{x∼Q_θ}[log(Q_θ(x)/P(x))], has two parts: the reverse CE, −E_{x∼Q_θ}[log P(x)], and the negative entropy of the model, E_{x∼Q_θ}[log Q_θ(x)]. Reverse CE is minimized when the model deterministically outputs the most likely example, i.e., Q_θ(x) = δ(the most likely x under P). Instead, minimizing the negative entropy (maximizing the entropy) of the model encourages it to be as uncertain as possible, i.e., to have large support and a near-uniform distribution. This entropy term counteracts the narrowing-down effect of reverse CE. As discussed above, forward CE pushes the Q_θ distribution to fully cover the support of P. In this case, forward CE can also help counteract the narrowing-down effect of reverse CE, i.e., the entropy-maximizing term becomes less important when forward CE is present. Hence, we think it is reasonable to drop it from reverse KL.
Overall, MIXCE combines two complementary training signals, as shown in Figure 1. Reverse CE prevents the model distribution from being broader than the data distribution, while forward CE is more helpful for preventing the model distribution from being narrower than the data distribution. Although forward CE also yields a non-zero loss when the model distribution is too broad, its loss magnitude is much smaller than what reverse CE provides (see Appendix C for more discussion). When the data is clean, the two CEs work jointly to help learn the data distribution better. When the data is noisy, the mixing ratio η allows us to trade off between emphasizing good coverage of the data and putting more weight on the truly high-quality sequences.

Optimization of Reverse CE
Optimizing MIXCE is non-trivial. The obstacle is minimizing the reverse CE, −E_{x∼Q_θ}[log P(x)], with respect to θ. To do so, we need to know P and to have a differentiable sampling operation from Q_θ. In our synthetic experiments (§ 4.1), we use a distribution P of our own construction and use Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) to make the sampling operation differentiable.
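A minimal sketch of the differentiable sampling step, using PyTorch's built-in Gumbel-Softmax relaxation (the temperature tau below is an unspecified hyperparameter):

import torch
import torch.nn.functional as F

logits = torch.randn(1, 50, requires_grad=True)  # e.g., next-token logits over V = 50
# hard=True returns a discrete one-hot sample in the forward pass while the
# backward pass uses the soft relaxation, so gradients still reach the logits.
sample = F.gumbel_softmax(logits, tau=1.0, hard=True)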
However, in a real setting, we do not know P, so we cannot evaluate the reverse CE directly. We therefore approximate the gradient of the (negated) reverse CE via the following chain of steps:

∇_θ E_{x∼Q_θ}[log P(x)]   (2)
≈ ∇_θ E_{x∼Q_θ}[P(x)]   (3)
= Σ_x P(x)·∇_θ Q_θ(x)   (4)
= E_{x∼Q_θ}[P(x)·∇_θ log Q_θ(x)]   (5)
= Σ_x Q_θ(x)·P(x)·∇_θ log Q_θ(x)   (6)
= E_{x∼P}[Q_θ(x)·∇_θ log Q_θ(x)]   (7)
= E_{x∼P}[(Π_{t=1}^{T} Q_θ(x_t|x_<t))·(Σ_{t=1}^{T} ∇_θ log Q_θ(x_t|x_<t))]   (8)
≈ E_{x∼P}[Σ_{t=1}^{T} Q_θ(x_t|x_<t)·∇_θ log Q_θ(x_t|x_<t)]   (9)

First, from (2) to (3), we substitute the expected log-likelihood with expected accuracy. Irsoy (2019) shows that expected accuracy is a comparable or better alternative loss function to cross-entropy for classification tasks. Then, following the Policy Gradient theorem (Williams, 1992; Sutton et al., 1999), we get (4) and (5), where we view the model Q_θ as the policy and P(x) as the reward we want to optimize for the whole sequence. Next, we switch from the expectation over Q_θ to the expectation over P (from (5) to (6) and (7)), so that we can use offline samples from P (data samples in the training set) instead of online sampling from Q_θ. We unfold Q_θ(x), which results in (8). Up to this point, theoretically, we are already able to optimize the model using Equation (8) without knowing P. However, the product of Q_θ(x_t|x_<t) has a very high variance, and in practice, it underflows when T is large. Therefore, we apply a final rough approximation that leads to (9).
Equations (8) and (9) are clearly not equivalent to each other. Nonetheless, they have similar effects. Intuitively, in (8), we weigh the gradients of each sequence differently based on their sequence-level probabilities, Q_θ(x); in other words, it promotes high-likelihood sequences. Similarly, (9) weighs gradients at each step by Q_θ(x_t|x_<t), i.e., promoting high-likelihood tokens at each step. So essentially, both encourage the model to produce generations in which it is already confident. We call this a self-reinforced objective. To further illustrate why self-reinforcement makes sense, we conduct an analysis using GPT-2 (Radford et al., 2019); please refer to Appendix B for a detailed discussion. In short, we show that MLE-pretrained GPT-2 on average assigns a higher probability to human text than to text sampled from the model. Therefore, when we promote high-probability sequences or tokens, it is like "pushing" the model distribution toward the human distribution. But we need to avoid overly "pushing" it to the extremely high-probability region where repetitive greedy-search outputs lie.
Note that our approximation of the reverse cross-entropy is related to the method proposed by Pang and He (2021), though we have a different derivation process from theirs. Please see Appendix A for a detailed comparison.
Finally, combining forward CE and Equation (9), our approximated MIXCE objective is to maximize

E_{x∼P}[Σ_{t=1}^{T} (η + (1 − η)·Q_θ(x_t|x_<t))·log Q_θ(x_t|x_<t)],

where, to match Equation (9), the token-probability weight Q_θ(x_t|x_<t) is treated as a constant (no gradient flows through it). This loss function has the same computational complexity as forward CE (MLE). Since Q_θ(x_t|x_<t) is strictly lower than 1 (it is around 0.017 to 0.13 when using GPT-2), the gradient from the approximated reverse CE is smaller than that from forward CE. Therefore, it is important to tune η to balance the effects of the two CEs. A sketch of this objective is given below.
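The following is a minimal PyTorch sketch of this objective; the function name and masking conventions are our assumptions (ignore_index follows the Hugging Face convention), and we detach the token-probability weight so that the gradient of the second term matches Equation (9):

import torch
import torch.nn.functional as F

def mixce_loss(logits, targets, eta, ignore_index=-100):
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    logprobs = F.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    mask = (targets != ignore_index).float()
    # Forward CE term: standard MLE.
    forward_ce = -(token_logprobs * mask).sum() / mask.sum()
    # Approximated reverse CE (Eq. 9): weight each token's log-likelihood by its
    # detached probability, promoting tokens the model is already confident in.
    weights = token_logprobs.exp().detach()
    reverse_ce = -(weights * token_logprobs * mask).sum() / mask.sum()
    return eta * forward_ce + (1 - eta) * reverse_ce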

Synthetic Experiments
We first conduct experiments in a synthetic, idealized setting, where we know P, to show the effectiveness of mixing the two cross-entropies with and without approximation. Moreover, during evaluation, we can directly compare the learned model parameters against the ground-truth parameters of P.
Define the "human" LM P. We start by defining P as a bi-gram LM. Bi-gram means that the prediction of the next token only depends on the immediately previous token, i.e., P(x_t|x_{t−1}). Therefore, P is determined by a transition matrix among words, M ∈ R^{V×V} (V = vocabulary size), and a start-token probability distribution π ∈ R^V, i.e., a stochastic finite-state automaton. The last token in the vocabulary is the end-of-sequence (EOS) token. For simplicity, we initialize π as a uniform distribution. To initialize M, we use two methods. The first is random initialization: we sample categorical distributions from a Dirichlet(α = 0.5) prior to initialize each row of M. However, one remaining problem is that P then has support everywhere. To have P = 0 areas, we randomly assign 0s to a certain percentage of values in each row of M and then re-normalize each row to sum to 1. We test 3 percentages: 10%, 50%, and 90%. The second is initialization using real data: we sample 5000 pieces of text from WebText (Radford et al., 2019), count the occurrences of bigrams, and then use these counts to initialize M. In this case, there are naturally 0s in M, and the larger the vocabulary size is, the sparser M is. No matter which initialization is used, we reserve the last row of M for EOS and set it to all 0s, i.e., EOS does not transition to any token. We set the vocabulary size V to 20, 50, 100, 500, or 1000. Our defined bi-gram LMs are always tight, i.e., they do not "leak" probability mass onto infinite sequences, because we make sure that all accessible tokens also have non-zero paths to other tokens; please refer to Du et al. (2022) for the proof. A construction sketch is given below.

Learn an LM Q_θ. We implement the model Q_θ as a simple neural bigram LM. Given the word embedding e_{i−1} of the previous token x_{i−1}, the next token is predicted via a simple neural network f applied to e_{i−1}. After training this model, the learned transition matrix can be obtained by M′ = f(E), where E is the word embedding matrix.

Synthetic data. We sample sequences from P. We set the max sequence length as 500. We sample 50K and 5K sequences as the training and validation sets, respectively. There is no test set because we directly compare the learned transition matrix M′ to the gold M during evaluation.
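Below is a sketch of the random initialization of P (numpy; the function and parameter names are ours, and the tightness check mentioned above is omitted for brevity):

import numpy as np

def make_bigram_lm(V=50, zero_frac=0.5, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Each row of M is a next-token distribution drawn from a Dirichlet prior.
    M = rng.dirichlet([alpha] * V, size=V)
    # Zero out a fraction of each row, then renormalize, so that P assigns
    # exactly zero probability to some bigrams.
    for i in range(V):
        zeros = rng.choice(V, size=int(zero_frac * V), replace=False)
        M[i, zeros] = 0.0
        M[i] /= M[i].sum()
    M[-1, :] = 0.0            # last row is EOS: it never transitions onward
    pi = np.full(V, 1.0 / V)  # uniform start-token distribution
    return M, pi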

Metrics.
(1) avg.js: we compute the JS divergence between each row (except the last row) of M′ and the corresponding row of M, and then average across rows. This metric evaluates the overall divergence of M′ from M, and equals 0 iff M′ = M. (2) avg.0s: we collect the probabilities in M′ at positions where the corresponding gold probabilities in M are 0, and take their average. If M′ = M, then avg.0s = 0, but the converse is not true.
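Concretely, the two metrics can be computed as follows (a sketch; note that scipy's jensenshannon returns the square root of the JS divergence, hence the squaring, and the log base is our assumption):

import numpy as np
from scipy.spatial.distance import jensenshannon

def avg_js(M_gold, M_learned):
    # Mean JS divergence between corresponding rows, excluding the EOS row.
    return np.mean([jensenshannon(p, q) ** 2
                    for p, q in zip(M_gold[:-1], M_learned[:-1])])

def avg_0s(M_gold, M_learned):
    # Mean learned probability at positions where the gold probability is 0.
    mask = (M_gold[:-1] == 0)
    return M_learned[:-1][mask].mean()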
Objectives. We compare six objectives: (1) forward KL divergence, which is equivalent to forward CE and thus MLE; (2) reverse KL divergence; (3) the mixture of forward and reverse KL divergences; (4) JS divergence, using the general definition of JS divergence from Huszár (2015) (when η = 0.5, this is the same as the objective of a GAN (Goodfellow et al., 2014), but instead of using GAN's min-max loss, we directly optimize JS because we know P); (5) the oracle mixture of cross-entropies (MIXCE*), where we use the known P; and (6) the approximated mixture of cross-entropies (MIXCE), where we assume P is unknown. The objectives that sample from Q_θ (reverse KL, the KL mixture, JS, and MIXCE*) require gradients to pass through this sampling operation; to this end, we use Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) to make sampling differentiable. For objectives with η, we choose the best η based on the avg.js result on the validation set; the search space of η is [0.99, 0.9, 0.5, 0.1, 0.01], and the selected best ηs are reported in Table 11 in the Appendix. We report a 5-seed average for each experiment.

Results. Table 1 (and Table 6 in the Appendix) shows the results of our synthetic experiments. Across 4 kinds of initialization of M and 5 vocabulary sizes, we observe some common patterns. First, the mixture of two KLs often gets the best avg.js compared to other objectives, and MIXCE* usually comes second. This supports our expectation that the mixture of two cross-entropies approximates the mixture of two KLs (§ 3.1), and demonstrates that combining two KLs or CEs can help learn the data distribution more accurately compared to MLE. Second, the approximated MIXCE usually under-performs MIXCE* but outperforms forward KL (MLE). Third, reverse KL generally works best for the avg.0s metric, due to its property of zero-forcing: forcing Q_θ(x) = 0 when P(x) = 0. Lastly, JS divergence oftentimes works similarly to reverse KL, which is consistent with the observation made by Caccia et al. (2020) that language GANs trade off diversity for quality.

GPT-2 Experiments

In the real-data setting, we finetune GPT-2 models of different sizes with MLE or MIXCE on three English text domains (WikiText, WebText, and WritingPrompts).

Metrics. (1) Perplexity (ppl) is defined as ppl = exp(−(1/(NT)) Σ_{i=1}^{N} Σ_{t=1}^{T} log Q_θ(x^{(i)}_t|x^{(i)}_{<t})), where N is the number of examples and T is the sequence length. Perplexity is not necessarily correlated with human-perceived quality (Zhang et al., 2021). (2) Diversity (div): following Meister et al. (2022), we define n-gram diversity as the average fraction of unique vs. total n-grams for n ∈ {1, 2, 3, 4} in each piece of text (sketched below). (3) Mauve (Pillutla et al., 2021) compares model-generated text against human text via a KL divergence curve and is the state-of-the-art metric for open-ended text generation. We use Mauve as our primary metric. (4) Coherence (coh) (Su et al., 2022) computes the cosine similarity between the embedding of the prompt and the embedding of the continuation, where the embeddings are from SimCSE (Gao et al., 2021). For all metrics, the closer to the human scores the better.
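A sketch of the diversity metric for a single tokenized text (corpus-level aggregation, i.e., averaging the per-text scores, is our assumption):

def ngram_diversity(tokens, max_n=4):
    # Average over n in 1..max_n of (#unique n-grams) / (#total n-grams).
    fracs = []
    for n in range(1, max_n + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:
            fracs.append(len(set(ngrams)) / len(ngrams))
    return sum(fracs) / len(fracs) if fracs else 0.0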
Objectives. Since we have no access to P, we can only implement two of the six objectives we test in the synthetic setting: (1) MLE, which is equal to forward CE or forward KL; (2) MIXCE, the approximated mixture of cross-entropies.
Decoding. We use unbiased sampling as our primary decoding method, as it allows us to explore the learned distribution in an unbiased way (Eikema and Aziz, 2020). Additionally, we test top-p sampling (Holtzman et al., 2020) to check whether MIXCE is complementary to advanced decoding methods, and we carefully tune p on the development set. For each text, we take the first 50 tokens (by the GPT-2 tokenizer) as the prompt and set the max generation length as 512.
Model selection. We finetune the model for 5 epochs on the training set and save the checkpoint with the lowest dev loss. We select the best mixing ratio η and the best p based on the Mauve score on the dev set. The search space of η is [0.99, 0.9, 0.7, 0.5, 0.3, 0.1, 0.01, 0.0] and that of p is [0.85, 0.87, 0.89, 0.91, 0.93, 0.95, 0.97, 0.99]. The selected best ηs are reported in Table 12 in the Appendix; the best ps are reported in Table 3. Metric scores are reported on the test set and are 3-run averages because sampling is stochastic.
Results. Table 2 shows unbiased sampling results of models of different sizes finetuned with different objectives on three datasets. As you can see, MIXCE-finetuned models usually get worse perplexity but consistently better diversity, mauve, and coherence, compared to MLE-finetuned models. Table 3 shows top-p sampling results from the same models as Table 2. Since perplexity does not change with the decoding method, we instead report the selected best p in this table. It can be seen that after carefully applying top-p sampling, MIXCE-finetuned models work on par with MLE-finetuned models for diversity, mauve, and coherence. Nonetheless, the best p for MIXCE models is always 0.99, while MLE models have smaller and more varied ps. This indicates that MIXCE leads to a less noisy model distribution.

Robustness & Analysis
Varying training data sizes. We test 3 other training data sizes: 10K, 25K, and 100K, using GPT-2 small. Table 5 in the Appendix contains the results, which show the same trend as Table 2: MIXCE-finetuned models get worse perplexity but generally work better than MLE-finetuned models for diversity, mauve, and coherence.
Varying η and max generation length. To examine how the mixing ratio η and the max generation length affect performance, we show the mauve score curves on the dev set in Figure 4. The x-axis is the mixing ratio η from 0 to 1 (MIXCE = MLE when η = 1), and the y-axis is the mauve score with different max generation lengths (128, 320, and 512). First, reasonable performance is usually observed when η ≥ 0.1, and training the models only with approximated reverse CE (i.e., η = 0) leads to degeneration. Second, the advantage of MIXCE is more prominent when the max generation length is longer.
Controlled Mauve. The max generation length is not the actual text length, because when sampling from the model, EOS can be generated at any step. We find that the actual text length can affect the mauve computation. Even if we truncate all texts to the same length, the incompleteness caused by truncation can be another confounding factor. Both text length and text completeness are irrelevant to text quality but can be used by mauve to distinguish model generations from human texts. Therefore, to eliminate the influence of these confounding factors, we propose a controlled mauve (or c-mauve) computation approach. Concretely, for human texts and model generations, we randomly sample 10K L-length text fragments (L is the number of tokens) from each of these two sets, and then compute the mauve between these two sets of fragments. Table 8 shows the results. As you can see, c-mauve scores are in general very high (≥ 0.90), which may indicate that, after controlling the confounding factors, the ability of mauve to distinguish model text from human text has been weakened. MIXCE still gets better performance than MLE in most cases. Besides, we also compute controlled coherence in the same fashion, and MIXCE retains its advantage. Please refer to Appendix D.4 for more details about controlled Mauve and Coherence.

Conclusion
We propose a novel training objective, MIXCE, for autoregressive language modeling. MIXCE combines the forward and reverse cross-entropies, which can be viewed as combining two complementary driving forces for better fitting the model distribution to the data distribution. We demonstrate the superiority of MIXCE over MLE in both synthetic and real settings via both automatic and human evaluations. In the future, MIXCE can potentially be used for pretraining language models.

Limitations
One apparent disadvantage of MIXCE is the need to choose the mixing ratio η. As shown in Table 12 and Figure 4, the best η changes as the experimental setting changes. This may be because we use mauve as the model selection criterion, or because different datasets have different noise levels. In general, we do not have a definitive answer for which η should be used. The ideal solution is to select η based on performance on the development set, as we did. However, in pretraining settings, it is too expensive to search over multiple ηs. Therefore, how to find a universal η, or how to determine η automatically, is an important problem to resolve before MIXCE can be reliably used for pretraining.
As we mentioned in § 1, language degeneration in open-ended generation shows two distinct patterns: nonsensical text from unbiased sampling and repetition loops from greedy search. Though MIXCE helps improve the performance of sampling, we still see repetition loops when using greedy search.

Ethical Considerations
As the OpenAI team pointed out, GPT-2 does not distinguish fact from fiction, so it cannot support use cases that require the generated text to be true. Additionally, GPT-2 reflects the biases inherent to the data it was trained on, so it cannot be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use case. Though our MIXCE-finetuned GPT-2 gets improved performance with respect to the metrics we used, the above statement still holds. At this point, we are not sure whether MIXCE can help improve factuality or lead to less biased generations, but we are sure that the generations still contain non-factual content and biases.

A Comparison with GOLD

Pang and He (2021) also propose to approximate reverse CE, and the resulting GOLD algorithm is similar to our Equation (9). Here, we would like to clarify the difference and connection.
The following is the starting policy-gradient equation used by Pang and He (2021):

∇_θ J(θ) = E_{τ∼π_θ}[Σ_t Q̂(s_t, a_t)·∇_θ log π_θ(a_t|s_t)].

They use different notations from ours: π_θ is the same as our Q_θ, i.e., π_θ(a_t|s_t) is the same as our Q_θ(x_t|x_<t), and Q̂ is the accumulated future reward from timestep t, Q̂ = Σ_{t′=t}^{T} γ^{t′−t}·r_{t′}, where γ is the decay factor and r_{t′} is the reward for each step. We will discuss Q̂ in detail later.
Then, they apply importance sampling to sample from a different behavioral policy π_b. Since they also use examples from the training set, their π_b is the same as our human (or data) distribution P.
Here, w_t is the importance weight. They use a per-action approximation, w_t ≈ π_θ(a_t|s_t)/π_b(a_t|s_t), which is similar to how we get Equation (9) from Equation (8). Since π_b is unknown, they assume a uniform distribution: π_b ≈ 1/N (N is the number of training examples). Hence, in their final approximated gradient, each step's policy gradient ∇_θ log π_θ(a_t|s_t) is weighted by π_θ(a_t|s_t)·Q̂. They define r_{t′} and Q̂ in three ways. The first is called δ-reward, i.e., Q̂ = 1. In this case, their final gradient is exactly the same as our Equation (9). However, as you can see, we take a different path of derivation. Instead of using this δ-reward, our Q̂ is the sequence-level reward P(x). The reward P(x) nicely helps us to switch from the expectation over Q_θ to the expectation over P (from Equation (5) to Equation (7)). Therefore, without assuming a uniform distribution for π_b, our π_b is just P.
When using the other two rewards, they also need to know P. To address this, they use an MLE-pretrained model as a proxy of P.
Overall, we introduce a different derivation approach for approximating reverse CE. Moreover, as we mentioned in § 2.3, Pang and He (2021) focus on improving controlled generation tasks where the focus is on the quality of the text, while we focus on open-ended generation where quality and diversity are both important. Therefore, we mix reverse CE with forward CE to form our MIXCE learning objective.

B Intuition behind the Self-reinforced Objective
To further illustrate why this self-reinforced objective (Equation (8) or (9)) makes sense, as well as its shortcomings, we conduct an analysis using GPT-2 large (Radford et al., 2019). We first sample 5000 pieces of text from WikiText, WebText, and WritingPrompts, respectively, and we call them human texts. Then, using the first 50 tokens of each human text as a prompt, we obtain 5000 sampled and 5000 greedy-search generations from pretrained GPT-2 large (max generation length = 512). Next, we use the same model to score human texts and model generations, and we get the sequence-level and token-level negative log-likelihoods. Figure 2 shows the histograms of these negative log-likelihoods.
In Figure 2, we take the human text histogram (in blue) as a proxy of the human distribution and the sampled text histogram (in red) as a proxy of the model distribution. As you can see, the support of the model distribution usually contains the support of the human distribution. This supports our previous claim that MLE-trained models tend to over-generalize. Meanwhile, at both the sequence and the token levels, the model on average assigns a higher probability to human text than to text sampled from the model. Therefore, when we promote high-probability sequences or tokens, it is equivalent to pushing the model distribution toward the human distribution. However, we need to avoid overly pushing it into the extremely high-probability region where greedy-search outputs lie (in yellow), because these are known to be poor-quality and repetitive. Also, as shown in the figure, when promoting high-probability sequences, even if we overdo it, we will still be within the support of the human distribution. In contrast, when promoting high-probability tokens, we can go outside the support of the human distribution, which is the drawback of Equation (9) compared to Equation (8).
Lastly, if we train the model only with the self-reinforced objective till convergence, it will inevitably end up as a model that can only output greedy-search generations. Hence, we need to combine it with the forward cross-entropy.

C Loss Magnitude
As shown in Figure 1, we use reverse cross-entropy (CE) to provide a driving force for narrowing the model distribution down when it is broader than the data distribution, and forward CE to broaden the model distribution out when it is narrower. This does not mean forward CE lacks the opposite driving force, because forward CE is minimized if and only if Q_θ(x) = P(x). However, as shown in Figure 3, when the model distribution is too broad, the loss magnitude from forward CE is much smaller than the loss magnitude we get from reverse CE.

D.1 Additional synthetic experiments
Table 6 shows the results of additional synthetic experiments beyond Table 1 in the main paper. Here, the gold transition matrix M is randomly initialized with 10% and 90% zero probabilities.
As the magnitudes of both avg.js and avg.0s are fairly small, we examine the 95% confidence intervals under one synthetic experimental setting: initializing the transition matrix M by the bigram occurrence in the WebText data and setting the vocabulary size to 1000. Table 7 contains the results. We can see that the 95% confidence intervals are small enough to preserve the trend of the results.

D.2 Varying training data sizes
Table 5 shows the results of using different training data sizes in the real-data setting.
D.4 Controlled Mauve and Coherence

Mauve is sensitive to spurious, quality-irrelevant features. For example, if we compute the mauve between a set of human texts and the same set with an extra newline token appended to each text (or the same set with the last k tokens truncated), the score will be lower than 0.01. Though you may think truncating all texts to the same length can resolve this problem, we find that the incompleteness caused by truncation can also be a confounding factor. For instance, keeping human texts intact, we truncate the texts generated by two systems to their shorter lengths (i.e., for each example, we truncate text1 and text2 by min_length(text1, text2)). Then, the system whose texts are truncated less will get a greatly larger mauve score than the other system. Therefore, to eliminate the influence of these two confounding factors, we propose a controlled mauve computation approach. Concretely, for the set of human texts T_h and the set of model-generated texts T_m, we randomly sample 10K L-length text fragments from each of these two sets, where L is the number of tokens in each text fragment. After that, we compute the mauve between these two sets of 10K text fragments. We denote this controlled mauve as c-mauve_L.
To sample each fragment, we first randomly sample a text t_i from the set, and then randomly select a start token s (as long as there are at least L tokens from s to the end of t_i); the fragment is then t_i[s : s + L] (see the sketch below). Table 8 shows the results. We set L = 100, 200, and 300, except that we could not get 10K 200-token fragments from WikiText because its texts are shorter. The Coherence score (Su et al., 2022) computes the cosine similarity between the prompt and the continuation, and we suspect that the length of the continuation may affect the score. Therefore, following the same idea as controlled mauve, we also sample 10K fragments of the same length from the set of texts under evaluation and compute coherence on the fragments; for each fragment, we take the first 50 tokens as the prompt and the rest as the continuation. Table 9 shows the results. As you can observe, under this controlled setting, MIXCE-finetuned models generally achieve better coherence than MLE-finetuned models.
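A sketch of this fragment sampling procedure (the tokenize argument is a stand-in for, e.g., the GPT-2 tokenizer's encode method):

import random

def sample_fragments(texts, tokenize, L=100, k=10_000, seed=0):
    rng = random.Random(seed)
    # Keep only texts long enough to yield an L-token fragment.
    pool = [tokenize(t) for t in texts]
    pool = [toks for toks in pool if len(toks) >= L]
    frags = []
    while len(frags) < k:
        toks = rng.choice(pool)
        s = rng.randrange(len(toks) - L + 1)  # valid start positions
        frags.append(toks[s:s + L])
    return frags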

D.5 Text length of model generations
Though by default we set the max generation length as 512, the actual text length can vary, as the EOS token can be sampled at any time step. Therefore, we list the average text lengths of the human text and the GPT2-large generations in Table 10. We observe that model generations are always shorter than human text. Compared to MLE, our MIXCE-finetuned model produces shorter text on WikiText while producing longer text on the other two datasets. We suspect that the shorter length of MIXCE on WikiText is due to the small mixing ratio (0.1) chosen based on mauve (see Table 12). However, we do not think shorter text length leads to better mauve, as shown by the other two datasets and discussed in D.4.

F Human Evaluation Details
We conduct A/B testing (or pairwise comparison) to compare generations from two models. As shown in Figure 5, in each job, we give the evaluator two text paragraphs (in random order) that share the same beginning (the prompt) but have different continuations. They then need to choose which one they think is better (or non-distinguishable). To avoid random selections, they are also asked to provide a justification for their choice. We find this justification not only gives us additional explanation of their choices but also helps us easily identify bad workers, because bad workers tend to reuse one single justification or several repeated justifications. We instruct them by defining a good text paragraph as being:

• Fluent: Should have no obviously ungrammatical sentences, missing components, etc. that make the text difficult to read.
• Coherent: Should stay on topic with the prompt and build from sentence to sentence to a coherent body of information.
• Informative: Should have diverse and interesting content.
Since short text has little information and long text is difficult to read, we only use paragraphs with 5 to 8 sentences for evaluation. If a paragraph has more than 8 sentences, we truncate it to 8 sentences, and we remove paragraphs with fewer than 400 or more than 2000 characters. Besides, to eliminate the influence of length differences, we do not select examples where the length difference between the two paragraphs is more than 1 sentence or more than 200 characters. We conduct this evaluation on Amazon Mechanical Turk. We only allow workers who are located in the US, have a Masters Qualification (see https://www.mturk.com/worker/help), have an approval rate greater than 97%, and have more than 10000 HITs approved to do our tasks. In addition, we first ran a testing batch, then manually checked the results, and selected 44 qualified workers to continue doing the rest of our tasks.
For each of the 3 datasets, we sampled 105 examples and collected 3 responses per example. In total, we received 945 human evaluations. We pay workers $1 per response, and it takes around 5 minutes to finish one response, i.e., the hourly rate is around $12.
Table 13 shows the inter-annotator agreements.

G Reproducibility
In our GPT-2 experiments, we use English text data from 3 domains: (1) WikiText (Merity et al., 2017): text from Wikipedia; we use wikitext-103-raw-v1 from Hugging Face, whose license is the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0). (2) WebText (Radford et al., 2019): text from the Web, which was used for pretraining GPT-2. The full WebText is not available, but a subset has been released.

Figure 2: The histograms of sequence-level and token-level negative log-likelihoods of human texts and model generations from GPT-2 large.

Figure 3: Forward CE only weakly penalizes the model Q_θ when it puts a small amount of probability mass onto the P(x) = 0 space, and the loss magnitude is much smaller than what we get from reverse CE.

Figure 4: The mauve scores obtained by MIXCE-finetuned GPT-2 models on development sets with different max generation lengths and different η. Note that when η = 1, MIXCE is equivalent to MLE. The x-axis is the mixing ratio η, and the y-axis shows mauve scores for different max generation lengths. The 3 lines in each subplot show the results of GPT-2 models of different sizes, and the 3 subplots in each row are the results on the 3 datasets, respectively. Unbiased sampling is used as the decoding method. Each dot is the average of 3 runs of sampling, and the error bar shows the standard deviation over the 3 runs.


Figures 6-11 are 6 randomly sampled examples from the human evaluation results, 2 examples per dataset.

Table 2: Unbiased sampling results of models finetuned by MLE or MIXCE on three datasets. For all metrics, the closer to the human scores the better. Bold numbers are the ones that are closer to human scores in each setting. Each number is a 3-run average.
Table 1: Synthetic experimental results. Random (50%) randomly initializes M and sets 50% of the probabilities to 0. WebText means initializing M by the bigram occurrence in the WebText data. Gold refers to the results when M′ = M. avg.js is our main metric, representing the average JS divergence between M and M′ (please see the definition of avg.0s in the text). Each number is a 5-seed average, and Table 7 shows the 95% confidence intervals of some experiments.

Table 3: Top-p sampling results of the same models as Table 2. Since changing the decoding method will not affect perplexity, we report the selected best p instead.


Table 5: Unbiased sampling results of GPT-2 small models finetuned by MLE or MIXCE on three datasets of different training data sizes. For all metrics, the closer to the human scores the better. Bold numbers are the ones that are closer to human scores in each setting.

Table 7: Synthetic experimental results with 95% confidence intervals. WebText means initializing M by the bigram occurrence in the WebText data.


Table 8: Controlled mauve results. Unbiased sampling is used as the decoding method, i.e., using the same model generations as Table 2. Human scores are not 1 because sampling 10K fragments twice results in two different sets. Each number is a 3-run average.

Table 9: Controlled coherence results. Unbiased sampling is used as the decoding method, i.e., using the same model generations as Table 2. Each number is a 3-run average.

Table 10: Unbiased sampling text lengths of models finetuned by MLE or MIXCE on three datasets. Length is computed by simply splitting the text on whitespace.
Table 11 has the best ηs for synthetic experiments.Table 12 contains the best ηs selected for GPT-2 experiments.

Table 11: The selected best η of the synthetic experiments reported in Table 1 and Table 6. Model selection is based on avg.js.

Table 12: The selected best η of the GPT-2 experiments reported in Table 2. Model selection is based on mauve (max length = 512) on the dev set.