Mitigating the Learning Bias towards Repetition by Self-Contrastive Training for Open-Ended Generation

Despite the huge progress in myriad generation tasks, pretrained language models (LMs) such as GPT2 still tend to generate repetitive texts with maximization-based decoding algorithms for open-ended generation. We attribute their overestimation of token-level repetition probabilities to the learning bias: LMs capture simple repetitive patterns faster with the MLE loss. We propose self-contrastive training to penalize the output of a premature checkpoint of the same model when it incorrectly predicts repetition, which is shown to mitigate repetition effectively while maintaining fluency on two datasets. Furthermore, we find that LMs use longer-range dependencies to predict repetitive tokens than non-repetitive ones, which may be the cause of sentence-level repetition loops.


Introduction
Existing LMs prefer to generate repetitive texts in open-ended generation with greedy decoding or beam search (Welleck et al., 2020a). Even large-scale pretrained LMs such as GPT3 (Brown et al., 2020) still generate redundant sentences (Dou et al., 2022). Despite many solutions proposed from the perspectives of both training (Welleck et al., 2020b) and decoding (Holtzman et al., 2020), the cause of this preference for repetition remains unclear.
By analyzing the training dynamics of LMs regarding (non-)repetitive tokens, we reveal the learning bias towards repetition: LMs capture simple repetitive patterns first, which dominate the output distribution throughout the input space, and only later learn more non-repetitive patterns during training. We show that the repetition problem can be mitigated simply by training for more steps (i.e., allowing over-fitting), although coherence with the input then suffers. Conversely, when trained insufficiently, LMs overestimate repetition probabilities even for golden prefixes. We propose self-contrastive training (SELFCONT), which exploits the contrast with a premature checkpoint of the same model by penalizing its output when it incorrectly predicts repetition. Experiments on two datasets show that SELFCONT effectively alleviates repetition while maintaining fluency by factoring out the undesired repetition behaviors highlighted by the premature checkpoint.
Besides the above analysis of overestimated token-level repetition probabilities during training, we also find that LMs use longer-range dependencies to predict repetitive tokens than non-repetitive ones. This may explain why LMs tend to fall into repetition loops (Xu et al., 2022). The problem may be solved by improving the modeling of long-range dependencies (e.g., by increasing model sizes), which we leave to future work.

Related Work
Regarding the cause of the repetition problem, Fu et al. (2021) theoretically derived bounds on the repetition probabilities of a first-order Markov LM, although it is difficult to extend these bounds to general LMs. Another line of work attributes repetition to error accumulation during generation (Welleck et al., 2020b; Arora et al., 2022), yet LMs still prefer repetition even given golden prefixes.
We divide recent works that alleviate repetition into training- and decoding-based methods. (1) Training-based methods. Welleck et al. (2020b) proposed unlikelihood training (UL) to reduce the probabilities of repetitive generations. Lin et al. (2021) and Xu et al. (2022) further extended the framework at the token and sequence level, respectively. SELFCONT focuses on token-level modeling, which is orthogonal to sequence-level methods. Xi et al. (2021) adopted additional modules to learn repetition patterns and control repetition explicitly. (2) Decoding-based methods.
One straightforward solution to repetition is blocking repetitive n-gram generation (Paulus et al., 2018) or penalizing the probabilities of repetitive candidates (Keskar et al., 2019). Li et al. (2022) selected candidates that maximize the probability difference between different-sized models. Sampling-based decoding methods are also shown to be effective in avoiding repetition, such as temperature sampling (Ficler and Goldberg, 2017), top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), and typical sampling (Meister et al., 2022). Although these methods reduce superficial repetition, it is unclear whether they utilize the underlying long-range dependencies to maintain coherence.
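To make the decoding-side remedies above concrete, the sketch below applies n-gram blocking and a CTRL-style repetition penalty to a vector of next-token logits. The function names, the window n=3, and the penalty value 1.2 are illustrative assumptions rather than the exact settings of the cited works.

```python
import torch

def block_repeated_ngrams(logits, prefix_ids, n=3):
    """Forbid any token that would complete an n-gram already present in the prefix.

    `logits` is a 1-D tensor over the vocabulary; `prefix_ids` is the list of
    already generated token ids. A sketch of n-gram blocking (Paulus et al., 2018).
    """
    if len(prefix_ids) < n - 1:
        return logits
    logits = logits.clone()
    last = tuple(prefix_ids[-(n - 1):])                      # the (n-1)-gram we are about to extend
    for i in range(len(prefix_ids) - n + 1):
        if tuple(prefix_ids[i:i + n - 1]) == last:
            logits[prefix_ids[i + n - 1]] = float("-inf")    # ban the token that would complete a seen n-gram
    return logits

def penalize_repetition(logits, prefix_ids, penalty=1.2):
    """Shrink the logits of tokens that already occur in the prefix
    (a CTRL-style penalty in the spirit of Keskar et al., 2019)."""
    logits = logits.clone()
    for tok in set(prefix_ids):
        logits[tok] = logits[tok] / penalty if logits[tok] > 0 else logits[tok] * penalty
    return logits
```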

Empirical Analysis
Neural networks (NNs) are expressive enough to approximate arbitrary input-output mappings. Using Fourier analysis, Rahaman et al. (2019) showed the spectral bias of NNs: they learn low-frequency components faster during training, i.e., components that are less complex and vary globally without local fluctuation. Our key hypothesis is that simple repetitive patterns may be such low-frequency components and are learned by LMs early. In this section, we first formulate LMs (§3.1), and then investigate the training dynamics (§3.2) and the ability of LMs to model long-range dependencies (§3.3).

Language Models
LMs aim to fit the mapping x_t = f(x_{1:t-1}) defined by a training corpus, where x_{1:t} is a sequence from the corpus. To this end, they are usually trained by minimizing the cross-entropy loss
$\mathcal{L}_{\mathrm{MLE}} = -\sum_{t} \boldsymbol{x}_t^{\top} \log \mathrm{softmax}\big(f_\theta(x_{1:t-1})\big)$,
where $\boldsymbol{x}_t \in \{0,1\}^{|\mathcal{V}|}$ is the one-hot representation of x_t indicating its index in the vocabulary V, and $f_\theta(x_{1:t-1}) \in \mathbb{R}^{|\mathcal{V}|}$ are the output logits of the LM parameterized by θ. Predictably, with more training steps, argmax(f_θ) is closer to the target function f.

Training Dynamics

Since NNs prioritize learning low-complexity components, early stopping may result in unexpected generations. We are thus motivated to investigate whether simple repetitive patterns in human-written texts are learned first and dominate the generations. We randomly sample 1k sequences of 512 tokens from the Wikitext-103 dataset (Merity et al., 2016) and train GPT2 base from scratch for 100 epochs. Given a golden prefix x_{1:t-1}, we regard the model prediction x̂_t = argmax f_θ(x_{1:t-1}) as correct if x̂_t = x_t. We call x_t or x̂_t repetitive if it is included in x_{1:t-1}, and non-repetitive otherwise.

Figure 1: Top: Ratios of positions where x_t or x̂_t is repetitive or not, given golden prefixes of the test set. Bottom: Ratios of tokens that appear within the previous l tokens in model-generated texts with greedy decoding.

Figure 1 plots the training curves, revealing the learning bias of the LM: (1) The initially learned components prefer to copy input tokens throughout the input space, as indicated by predicting repetitive tokens at ∼90% of positions for both golden and generated prefixes. (2) With golden prefixes, at positions where x_t is repetitive, the LM predicts repetition almost constantly during training. When x_t is non-repetitive, the LM predicts more non-repetitive tokens with more training steps. The repetition ratio also gradually decreases in model-generated texts. (3) The token prediction accuracy improves faster when x_t is repetitive, indicating that the LM learns repetitive patterns more easily. Moreover, we notice that the validation loss rises at the 1,500th step, where the LM predicts many more repetitive tokens than the ground truth. At the end of training, the generation has a token repetition ratio closer to the ground truth, but manual inspection finds that coherence with the input is poor due to over-fitting. Appendix A.1 shows several generation cases.
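The sketch below illustrates the bookkeeping described above: classifying each gold token and each greedy prediction as repetitive or non-repetitive with respect to its prefix, and tallying prediction accuracy per group. It assumes a causal LM whose forward pass returns next-token logits (e.g., a Hugging Face GPT2LMHeadModel); it is our own illustration, not the authors' released code.

```python
import torch

def repetition_statistics(model, input_ids):
    """Tally how often gold tokens and greedy predictions are repetitive.

    A token is "repetitive" if it already occurs somewhere in its prefix.
    `input_ids` has shape (batch, length); `model` is assumed to be a causal LM
    whose output has a `.logits` field of shape (batch, length, vocab).
    """
    with torch.no_grad():
        logits = model(input_ids).logits
    preds = logits.argmax(dim=-1)                       # preds[:, t-1] is the greedy guess for position t
    stats = {"gold_rep": 0, "pred_rep": 0, "positions": 0,
             "correct_rep": 0, "correct_nonrep": 0, "n_rep": 0, "n_nonrep": 0}
    for seq, pred in zip(input_ids.tolist(), preds.tolist()):
        for t in range(1, len(seq)):
            prefix = set(seq[:t])
            gold, guess = seq[t], pred[t - 1]
            gold_rep = gold in prefix
            stats["positions"] += 1
            stats["gold_rep"] += int(gold_rep)
            stats["pred_rep"] += int(guess in prefix)
            stats["n_rep" if gold_rep else "n_nonrep"] += 1
            if guess == gold:
                stats["correct_rep" if gold_rep else "correct_nonrep"] += 1
    return stats
```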

Modeling Long-Range Dependencies
Figure 2 indicates that: (1) The LM only learns dependencies within ∼100 tokens overall. When the prefix length is larger than 100, the perplexity on golden tokens no longer drops significantly (p ⩾ 0.05). (2) The LM learns and utilizes longer-range dependencies to predict repetitive tokens than non-repetitive ones. The perplexity on golden repetitive/non-repetitive tokens plateaus when the prefix length exceeds 160/50 tokens, respectively. The case is similar for generated texts.
(3) The LM uses short-range contexts to predict non-repetitive tokens regardless of the decoding algorithm. Contexts beyond 100 tokens hardly help predict non-repetitive tokens, implying that sampling-based decoding reduces repetition through randomness instead of using long-range dependencies.
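A minimal sketch of how the prefix-length analysis behind Figure 2 could be reproduced: for each position, the visible context is truncated to at most L tokens and the negative log-likelihood of the gold token is averaged separately over repetitive and non-repetitive positions. The window sizes and the per-position re-encoding are our assumptions about the setup, not the authors' evaluation script.

```python
import torch
import torch.nn.functional as F

def nll_by_prefix_length(model, ids, lengths=(25, 50, 100, 160, 320)):
    """Average NLL of repetitive vs. non-repetitive gold tokens under truncated context.

    `ids` is a 1-D LongTensor of token ids; `model` is a causal LM returning
    `.logits` of shape (batch, length, vocab). Re-encoding the truncated context
    position by position is slow but keeps the sketch simple.
    """
    results = {}
    for L in lengths:
        buckets = {"rep": [], "nonrep": []}
        for t in range(1, len(ids)):
            ctx = ids[max(0, t - L):t].unsqueeze(0)            # keep at most L context tokens
            with torch.no_grad():
                next_logits = model(ctx).logits[0, -1]         # distribution for position t
            logp = F.log_softmax(next_logits, dim=-1)[ids[t]]
            key = "rep" if ids[t].item() in set(ids[:t].tolist()) else "nonrep"
            buckets[key].append(-logp.item())
        results[L] = {k: sum(v) / max(len(v), 1) for k, v in buckets.items()}
    return results
```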
Based on the above observations, we conjecture that LMs keep repeating the same sentence under maximization-based decoding (Xu et al., 2022) because they learn longer-range repetitive patterns than non-repetitive ones. To verify this, we construct three training sets: (1) D original, whose examples are drawn directly from the original corpus; (2) D random, where each example contains 30 randomly sampled sentences; and (3) D norept, where each example also contains 30 random sentences, but with at most one token overlapping between any adjacent 5 sentences (generally the period "."). Each dataset consists of 20k examples. We then generate texts using greedy decoding conditioned on the first 50 tokens of the original test set and compute the ratio of texts that fall into loops (Holtzman et al., 2020). As shown in Table 1, compared to D original, the LM trained on D random has higher repetition ratios because it learns shorter-range non-repetitive patterns only within one sentence. Besides, although the sentences in each D random example are unrelated, they can still contain repetitive tokens, making the LM learn spurious long-range repetitive patterns and fall into repetition loops. In contrast, the LM trained on D norept rarely gets into loops since it learns both repetitive and non-repetitive patterns almost entirely within one sentence. Specifically, any adjacent five sentences in each D norept example are unrelated and hardly share tokens. These findings empirically support our hypothesis. Appendix A.2 shows more details.
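As an illustration of the D norept construction described above, the sketch below greedily samples sentences so that any window of five adjacent sentences shares at most one token. The sentence pool, the pairwise-overlap interpretation of the constraint, and the greedy rejection loop are our assumptions rather than the authors' exact procedure.

```python
import random

def build_norept_example(sentence_pool, num_sents=30, window=5, max_overlap=1):
    """Assemble one D_norept-style example: any `window` adjacent sentences share
    at most `max_overlap` tokens (typically just the period).

    `sentence_pool` is a list of tokenized sentences (lists of tokens). Assumes
    the pool is large enough for the constraint to be satisfiable.
    """
    example = []
    while len(example) < num_sents:
        cand = random.choice(sentence_pool)
        recent = example[-(window - 1):]      # sentences that must stay near-disjoint from the candidate
        if all(len(set(cand) & set(prev)) <= max_overlap for prev in recent):
            example.append(cand)
    return example
```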

Self-Contrastive Training
We denote the premature checkpoint as f_θ0, which frequently predicts repetitive tokens. Formally, SELFCONT augments the MLE objective with a penalty on the output of f_θ0 whenever it incorrectly predicts repetition, where sg(·) denotes stopping the back-propagation of gradients, λ is a tunable hyper-parameter controlling the extent of the repetition penalty, and 1 is the indicator function. The target LM f_θ1 is initialized from f_θ0, and we optimize f_θ1 with this objective until the validation loss converges to its minimum. The gradient ∇_u L for each token u ∈ V changes accordingly: when the penalty is inactive (w = 0), the weights w_{v,u} are all 1 and ∇_u L degenerates to that of the vanilla LM. If w is not 0 and u is not x_t, tokens with high logits under f_θ0 receive larger gradients than under the vanilla LM, since w_{v,u} is mostly smaller than 1 for different v. As for u = x_t (w ≠ 0), it may also be penalized with a positive gradient if f_θ0|_u is large enough, which usually indicates a dull token. By penalizing the components that excessively prefer the repetitive or dull tokens highlighted by f_θ0, f_θ1 can exploit the more complex patterns learned later to generate texts.
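The sketch below shows one plausible instantiation of SELFCONT that follows only the verbal description above: wherever the frozen premature checkpoint f_θ0 greedily predicts a repetitive token that is not the gold token, λ times its (stop-gradient) output distribution is subtracted from the target model's logits before the usual cross-entropy. The exact functional form used in the paper may differ, so this should be read as an assumption-laden illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def selfcont_loss(target_model, premature_model, input_ids, lam=4.0):
    """A possible SELFCONT objective, built from the ingredients named in the text:
    stop-gradient on the premature checkpoint, an indicator of incorrect repetition
    prediction, and a penalty weight lam (4.0 in the paper's setting)."""
    logits = target_model(input_ids).logits[:, :-1]            # predicts tokens 1..T-1
    with torch.no_grad():                                      # sg(.): no gradient to f_theta0
        prem_logits = premature_model(input_ids).logits[:, :-1]
        prem_probs = prem_logits.softmax(dim=-1)
    targets = input_ids[:, 1:]
    prem_pred = prem_logits.argmax(dim=-1)

    # w[b, t] = 1 iff f_theta0 predicts a token already in the prefix and that token is wrong.
    w = torch.zeros(targets.shape, device=logits.device)
    for b in range(targets.size(0)):
        seen = {input_ids[b, 0].item()}
        for t in range(targets.size(1)):
            guess = prem_pred[b, t].item()
            if guess in seen and guess != targets[b, t].item():
                w[b, t] = 1.0
            seen.add(targets[b, t].item())

    penalized = logits - lam * w.unsqueeze(-1) * prem_probs
    return F.cross_entropy(penalized.reshape(-1, penalized.size(-1)), targets.reshape(-1))
```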

Experiments
Datasets We conduct experiments on Wikitext-103 (Merity et al., 2016) and WritingPrompts (Fan et al., 2018). The prompt and story in each WritingPrompts example are concatenated into a single sequence. Table 3 presents the detailed statistics.
We set the maximum sequence length to 512 and take the first 50 tokens as input to generate the rest. Baselines We compare SELFCONT with three baselines: MLE, token-level UL (Welleck et al., 2020b), and ScaleGrad (Lin et al., 2021). Since SELFCONT focuses on token-level modeling, we do not compare it with sentence-level methods that directly penalize repetition loops, e.g., DITTO (Xu et al., 2022).
Implementation All baselines are implemented based on GPT2 base. We set the batch size to 16, the learning rate to 1e-4, and the repetition-penalty weight λ of SELFCONT to 4.0.
For SELFCONT, we fine-tune GPT2 base for one epoch using MLE and take that checkpoint as f_θ0 for both datasets. We use different p for different models based on their performance on the validation set. Appendix B shows more details.
Metrics We use perplexity (PPL) under GPT2 xl to evaluate fluency, MAUVE (Pillutla et al., 2021) to measure the similarity between the golden and generated distributions, the token repetition ratio (R-l) to measure the ratio of tokens that appear within the previous l tokens (Welleck et al., 2020b), and distinct-n (D-n) (Li et al., 2016) to evaluate n-gram diversity.
For all metrics, scores closer to the ground truth indicate better quality.
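For reference, a minimal sketch of the two repetition-oriented metrics: R-l counts tokens that reappear within the previous l tokens, and distinct-n is the ratio of unique n-grams to all n-grams. The normalization details are our assumptions and may differ slightly from the original implementations (Welleck et al., 2020b; Li et al., 2016).

```python
def token_repetition_ratio(ids, l=64):
    """R-l: fraction of tokens that already occur within the previous l tokens."""
    hits = sum(ids[t] in ids[max(0, t - l):t] for t in range(1, len(ids)))
    return hits / max(len(ids) - 1, 1)

def distinct_n(ids, n=2):
    """D-n: number of unique n-grams divided by the total number of n-grams."""
    ngrams = [tuple(ids[i:i + n]) for i in range(len(ids) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```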
Results As shown in Table 2, SELFCONT outperforms the baselines in all metrics with greedy decoding. However, the high R-128 score shows that it can still generate repetition loops due to the inability of small-scale LMs to model long-range dependencies. With nucleus decoding, different baselines can achieve repetition ratios and diversity similar to the ground truth by tuning p, while SELFCONT achieves better fluency and higher MAUVE scores.

Conclusion
We present empirical studies on LMs' preference for repetition by analyzing their training dynamics, which highlights the learning bias towards simple repetitive patterns. We propose penalizing the outputs of a premature checkpoint during training, which effectively mitigates repetition while maintaining fluency. We also provide insight into why LMs easily fall into repetition loops by showing their inability to model long-range dependencies. Sampling-based decoding reduces repetition through randomness rather than by utilizing long-range dependencies. We believe that maximization-based decoding can also generate coherent texts without repetition if the modeling of long-range dependencies is improved, which we leave to future work.

Limitations
The limitations of this paper mainly lie in the following aspects: (1) We do not provide a theoretical analysis of the correlation between long-range dependencies and repetition loops. (2) We do not experiment with RNN-based models, which are also shown to prefer repetition (Elman, 1990). (3) We do not perform a manual evaluation to compare SELFCONT with the baselines, since we focus on repetition in this paper, which can be automatically evaluated reliably; perplexity and MAUVE scores are also shown to correlate highly with manual evaluation for fluency and overall quality, respectively.

A.1 Training Dynamics
Table 4 shows several cases generated by the LM with greedy decoding at different training steps. We summarize the findings as follows: (1) At the beginning of training, the LM keeps repeating the high-frequency token "<eos>", indicating that it has not yet captured phrase-level dependencies.
(2) At the 1,500th step, the LM first generates a few fluent sentences and then gets stuck repeating "the building", showing that it learns long-range dependencies conditioned on the golden prefix while repetitive patterns dominate the probability distributions conditioned on the generated prefix. This case suggests a global tendency towards repetition for out-of-distribution inputs.
(3) At the 6,000th step, the LM can generate long, fluent texts without repetition. However, it is difficult for the LM to maintain coherence with the input due to over-fitting. For example, in the first generated sentence, "she had begun in 1962", "she" conflicts with "he" in the input.

A.2 Long-Range Dependencies
Observation For the experiment in Figure 2, we generate texts with three decoding algorithms conditioned on the first 50 tokens on the test set.Ancestral decoding means directly sampling tokens from the original probability distribution.For nucleus decoding, we set p to 0.9. Figure 3 shows the performance of GPT2 large , which shows similar results with GPT2 base in Figure 2. Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .It was a new project in November 18th century , after a new new building to the building , was created by the new new building .It was built in 18th century .The building of the new building , was built in 1966 , which had been created a new building .The building , was built in the new building .In the building of the building , which was built in the building by the building by new building by the building .The building , the new building by new building by new building , the building , and , and work , in the building .<eos> In 2009 , the building , the building , the building , the building .The building , the building , the building .The building by the building , the building , the building , the building by the building , and new building , the building , the building , which included by the building .<eos> <eos> <eos> = = = = = = <eos> <eos> The building , the building , the building .The building is the building , the building , the building , the building , the building and building .The building and building , the building , the building , in the building , the building , the building and building and building , the building , the building .The building and building , the building , the building , the building , the building , the building , the building .In the building , the building , the building and building , the building .In the building , the building , the building , the building , the building and building and building and building , the building , the building , the building , the building and building , the building , the building , the building , the building .<eos> <eos> <eos> <eos> The building , the building , the building , the building , the building , the building , the building , the building , the building and building , the building , the building , the building , the building , the building , the building and building , the building , the building and building , the building , the building , the building .<eos> The building and the building and building , the building , and , the building and building , the building , the building , the building , 6000 <eos> = Robert Boulter = <eos> <eos> Robert Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .She had begun in 1962 , was built a series of New York Times in 1962 -1938 after producing the Laidlay , and described as well as well as The Lastunk> 's assistant ... ] named " .She later explained : " [ 500 people who did not be turned off for me to me to me .She decided to his own song " As Rocham R. 
Nails , saying " .She knew it 's script was going to make me to live film , and I mean it was through me to get it 't know what we 't know what we want to make me .<eos> On 17 November 1953 , and it was nominated for me .He had done what he wanted the BBC at a period at her following year because it appearance at the mid @-@ selling artist with her singing .She was nominated for the background scene in 1962 .She spent four days after getting part of the public appearances at the war appearance , California , and ] base ." <eos> = = <eos> <eos> Aftermath Meodie Glam artist of the summer May 1967 , New York Times , and the war as the Star Wars franchise .He began to use the National Association ( from his staff ; it was included Star ) , including the Lyds house east , and I was the West Virginia Tech back to the war and Mennon ; there .She developed by <eos> William Peninsular League <eos> = = <eos> The script , 2004 .The script was named after the North America for the LAM passed .The script was commissioned to the American co @-@ person to produce producer ( present , taking place of the mid @-@ old , and the mid @-@ old @-@ old film , The Next Generation .The New York Times , having won the 4th birthday of the 4 , in the 4 million viewers .This was announced that it was cut of the media .The Elder Scrolls IV of the production , in East Coaster and The company entitled The Next Generation .<eos> For example , including the war , having performed on 6 , having released in East Coast Division .<eos> Upon its crew became a series of the produce  Verification For the experiment in Table 1, we use the same approach to construct the corresponding validation sets of 480 examples for D original , D random and D norept , and train three LMs until the best validation performance.Table 5 shows several generation cases with greedy decoding.The LMs trained on D original and D random fall into repetition loops.Although the LM trained on D norept also generates sentences that have previously appeared, it does not get stuck into loops.We further investigate whether the three LMs show the selfreinforcement effect: the more times a sentence is repeated in the context, the higher the probability of continuing to generate that sentence (Holtzman et al., 2020;Xu et al., 2022).Figure 4 indicates that the LMs trained on D original and D random show the above effect, while the LM trained on D norept does not.The results suggest that longer-range repetitive patterns biased LMs to fall into repetition loops through the self-reinforcement effect whether such patterns are true or spurious.The LM trained on D norept always generate sentences in a limited set due to greedy decoding which aims to find the global maxima of probability distributions, instead of the preference for repetition loops.
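A sketch of how the self-reinforcement effect mentioned above can be measured: the average per-token log-probability of a sentence is recomputed after the same sentence has already been repeated k times in the context. The averaging choice and the assumption of a non-empty context are ours; the effect itself follows Holtzman et al. (2020) and Xu et al. (2022).

```python
import torch
import torch.nn.functional as F

def sentence_logprob_after_repeats(model, context_ids, sentence_ids, max_repeats=10):
    """Average per-token log-probability of `sentence_ids` after it has already
    been repeated k times in the context (k = 0 .. max_repeats).

    `context_ids` (assumed non-empty) and `sentence_ids` are lists of token ids;
    `model` is a causal LM returning `.logits` of shape (batch, length, vocab).
    A rising curve over k indicates the self-reinforcement effect.
    """
    curve = []
    for k in range(max_repeats + 1):
        prefix = context_ids + sentence_ids * k
        ids = torch.tensor([prefix + sentence_ids])
        with torch.no_grad():
            logps = F.log_softmax(model(ids).logits[0], dim=-1)
        start = len(prefix)
        # score each token of the final copy of the sentence given everything before it
        per_token = [logps[pos - 1, ids[0, pos]].item() for pos in range(start, ids.size(1))]
        curve.append(sum(per_token) / len(per_token))
    return curve
```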

B Hyper-Parameters
We choose the hyper-parameters λ (the repetition-penalty weight of SELFCONT) and p (for nucleus sampling) by searching for the value that makes the R-64 score of the generated texts closest to the ground truth on the validation set. We search λ in the range {1.0, 2.0, 3.0, 4.0, 5.0, 6.0}, and p in the range {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Table 6 shows the settings of p for different models.
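A small sketch of this search, assuming a generate_fn(p) callback that returns validation generations (as token-id lists) sampled with a given p; the same loop applies to λ.

```python
def r_l(ids, l=64):
    """Fraction of tokens that already occur within the previous l tokens."""
    return sum(ids[t] in ids[max(0, t - l):t] for t in range(1, len(ids))) / max(len(ids) - 1, 1)

def choose_p(generate_fn, reference_r64,
             candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the nucleus-sampling p whose validation generations have an R-64
    closest to the ground-truth value `reference_r64`."""
    def avg_r64(texts):
        return sum(r_l(t, 64) for t in texts) / len(texts)
    return min(candidates, key=lambda p: abs(avg_r64(generate_fn(p)) - reference_r64))
```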

Training Set
Cases D original <eos> = Robert Boulter = <eos> <eos> Robert Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .He has appeared in several films , including the television series The Bill , The Bill Goes to Washington , and The Bill Goes to Washington .He has also appeared in several films , including The Bill Goes to Washington , The Bill Goes to Washington , and The Bill Goes to Washington .He has also appeared in several films , including The Bill Goes to Washington , The Bill Goes to Washington , and The Bill Goes to Washington .<eos> Boulter was born in London , England , on 23 May 1986 .He is the third child of actor and actress Robert Boulter and his wife , Susan .He is the third of five children born to his wife Susan and their three children , Robert , Roberta , and Roberta .Robert Boulter 's father , Robert Boulter , was a film director and producer .He was the first actor to be cast in a film role , and the first to be cast in a television series .He was also the first actor to be cast in a television series .<eos> Boulter 's father , Robert Boulter , was a film director and producer .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series .He was the first actor to be cast in a television series D random <eos> = Robert Boulter = <eos> <eos> Robert Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , 
rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a small , rectangular structure that was built in the late 19th century .The first of these was the <unk> , a smal D norept <eos> = Robert Boulter = <eos> <eos> Robert Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .The first two were built by the British Royal Navy .It was also released on the iTunes Store on September 28 , 2010 .It is also possible that he was a member of the royal family .He also said that he would not be returning to the team .@ 5 m ) wide and 2 feet ( 0 @.The song was written by producer and songwriter David Gilmour .It was also released on the iTunes Store on September 28 , 2010 .It was also released on the iTunes Store on September 28 , 2010 .It was also released on the iTunes Store on September 28 , 2010 .@ 5 million ( US $ 2 @,@ 000 ) .The song was written by producer and songwriter David Gilmour .He also said that he would not be returning to the team .It was also released on the iTunes Store on September 28 , 2010 .It is also possible that he was a member of the royal family .@ 5 m ) wide and 2 feet ( 0 @.The two ships were to be joined by two smaller ships .It was also released on the iTunes Store on September 28 , 2010 .He also said that he would not be returning to the team .It was also released on the iTunes Store on September 28 , 2010 .@ 5 million ( US $ 2 @,@ 000 ) worldwide .The song was written by David Gilmour and directed by David Gilmour .It was also released on the iTunes Store on September 28 , 2010 .It is also possible that he was a member of the royal family .He also said that he would not be returning to the team .@ 5 m ) wide and 2 feet ( 0 @.The two ships were protected by armour plates of 100 millimeters ( 3 @.It was also released on the iTunes Store on September 28 , 2010 .It was also released on the iTunes Store on September 28 , 2010 .As for baselines, we follow the original papers to set α to 1.0 for UL and γ to 0.2 for ScaleGrad.
As for the choice of f_θ0, we empirically choose the checkpoint after training for one epoch, which allows enough training steps for self-contrastive training. We use the premature checkpoint of the same model instead of other models since different models may have different biases. It costs about 24 hours to train SELFCONT on Wikitext-103 (∼10 epochs) or CNN News (∼6 epochs). The results are based on one NVIDIA Tesla V100 (32GB memory) with a single random run.

C Modeling Token-Level Repetition
We compare SELFCONT with baselines in terms of the performance for modeling token-level repetition.As shown in Table 7, SELFCONT achieves higher overall accuracy, higher F1 score on nonrepetitive tokens, and comparable F1 score on repetitive tokens.SELFCONT <eos> = Robert Boulter = <eos> <eos> Robert Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .Boulter appeared in two episodes of the British comedy @-@ drama Oz in 2001 , playing an American businessman who helps an Italian woman ( Joanna Cassidy ) escape from her abusive husband ( Michael Sheen ) .He also played a character in the 2004 film Heat , which was directed by Michael Medvedev .Boulter appeared in two episodes of the British comedy @-@ drama Body of Proof in 2005 , playing a retired police officer who is accused of murdering his wife ( Christine Fay ) .He appeared in two episodes of the British television drama Body of Proof in 2006 , playing a retired judge who defends his client , Simon Ashdown ( Stephen Fry ) .Boulter also appeared in the Australian television series The X Factor in 2007 , playing a retired judge who defends his client , Kiernan Straker ( Simon Ashdown ) .<eos> Boulter was nominated for a Golden Globe Award for Best Actor -Motion Picture Musical or Comedy , but lost to Hugh Grant for the role of Seamus Heaney .In 2010 , he appeared in the British comedy @-@ drama The X Factor , playing a retired judge who is accused of murdering his wife ( Christine Fay ) .He also appeared in the Australian television series The X Factor in 2010 , playing a retired judge who defends his client , Simon Ashdown .In 2012 , he appeared in the British comedy @-@ drama Body of Proof , playing a retired judge who is accused of murdering his wife ( Joanna Cassidy ) .He also appeared in the American television series The X Factor in 2013 , playing a retired judge who is accused of murdering his wife ( Christine Fay ) .<eos> Boulter has appeared in a number of films , including the 2005 film Heat , and the 2008 comedy Body of Proof .He appeared in the British television series Heat 2 , and the 2009 comedy Body of Proof .In 2011 , he appeared in the Australian television series The X Factor .In 2012 , he appeared in the British television series The X Factor .

D Case Study
In 2013 , he appeared in the British television series The X Factor . In 2014 , he appeared in the Australian television series The X Factor . In 2015 , he appeared in the British television series The X Factor . In 2016 , he appeared in the Australian television series The X Factor .

Table 2: Automatic evaluation results with greedy and nucleus decoding on Wikitext-103 and WritingPrompts.

Table 3: Statistics of the datasets.

Table 4: Generation cases with greedy decoding at different training steps, used to investigate the training dynamics. The inputs are highlighted in bold.

Table 5: Cases generated by three LMs trained on different training sets with greedy decoding. The inputs are highlighted in bold.

Figure 4: Per-token perplexity scores of texts generated by the LMs trained on D original, D random, and D norept with greedy decoding. We compute the respective perplexity scores using the corresponding LMs.

Table 6: Settings of p for nucleus sampling.

Table 8 and
Table9show the cases generated by different models on Wikitext-103 with greedy decoding and nucleus decoding, respectively.We see that SELFCONT can still get stuck into loops with greedy decoding since it hardly learns longer-range dependencies than standard LMs.Although sam-Robert Boulter = <eos> <eos> Robert Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .He has also appeared in the television series The Bill , The Bill 2 , The Bill and The Bill 3 .<eos>BoulterwasborninLondon,England,andgrewupintheUnitedStates.He studied at the University of California , Berkeley , and the University of California , Berkeley .He was a member of the British Academy of Film and Television Arts ( BAFTA ) in 2001 .He has appeared in several television series , including The Bill , The Bill 2 , The Bill 3 , The Bill 4 , The Bill 5 , The Bill 6 , The Bill 7 , The Bill 8 , The Bill 9 , The Bill 10 , The Bill 11 , The Bill 12 , The Bill 13 , The Bill 14 , The Bill 15 , The Bill 16 , The Bill 17 , The Bill 18 , The Bill 19 , The Bill 20 , The Bill 21 , The Bill 22 , The Bill 23 , The Bill 24 , The Bill 25 , The Bill 26 , The Bill 27 , The Bill 28 , The Bill 29 , The Bill 30 , The Bill 31 , The Bill 32 , The Bill 33 , The Bill 34 , The Bill 34 , The Bill 35 , The Bill 36 , The Bill 37 , The Bill 39 , The Bill 40 , The Bill 41 , The Bill 42 , The Bill 43 , The Bill 44 , The Bill 45 , The Bill 46 , The Bill 47 , The Bill 48 , The Bill 49 , The Bill 50 , The Bill 51 , The Bill 52 , The Bill 53 , The Bill 54 , The Bill 55 , The Bill 56 , The Bill 57 , The Bill 58 , The Bill 59 , The Bill 60 , The Bill 61 , The Bill 62 , The Bill 63 , The Bill 64 , The Bill 65 , The Bill 66 , The Bill 67 , The Bill 68 , The Bill 69 , The Bill 70 , The Bill 71 , The Bill 72 , The Bill 73 , The Bill 74 , The Bill 75 , The Bill 76 , The Bill 77 , The Bill 78 , The Bill 79 , The Bill 80 , The Bill 81 , The Bill 82 , The Bill 83 , The Bill 84 , The Bill 85 , The Bill 86 , The Bill 87 , The Bill 88 , The Bill 89 , The Bill 90 , The Bill 91 , The Bill 92 , The Bill 93UL <eos> = Robert Boulter = <eos> <eos> Robert Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .He has also appeared in several films , including the comedy @-@ drama The Man Who Would Be King (2002) , the science fiction film The Man Who Would Be Super (2004) , the science fiction film The Man Who Would Be Super (2006) , the science fiction film The Man Who Would Be Super (2008) , the science fiction film The Man Who Would Be Super ( 2012 ) , the science fiction film The Day After Tomorrow ( 2013 ) , the science fiction film The Day After Tomorrow ( 2014 ) , the science fiction film The Day After Tomorrow ( 2015 ) , the science fiction film The Day After Tomorrow ( 2016 ) , the science fiction film The Day After Tomorrow (2017) , the science fiction film The Day After Tomorrow (2017) , the science fiction film The Day After Tomorrow (2017) , the science fiction film The Day After Tomorrow (2017) , the science fiction film The Day After Tomorrow (2017) , the science fiction film The Day After Tomorrow (2017) , the science fiction film The Day After Tomorrow (2017) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science 
fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ( 2017 ) , the science fiction film The Day After Tomorrow ScaleGrad <eos> = Robert Boulter = <eos> <eos> Robert Boulter is an English film , television and theatre actor .He had a guest @-@ starring role on the television series The Bill in 2000 .In 2002 he appeared as a character in the BBC 's crime drama series The Secret Service .He has also worked as a consultant for several films including The Man Who Would Be King (2004) , The Man Who Would Never Die ( 2007 ) , The Man Who Would Never Be King 2 ( 2009 ) , The Man Who

Table 8: Cases generated by different models with greedy decoding on Wikitext-103. The inputs are highlighted in bold.