Local Temperature Beam Search: Avoiding Neural Text Degeneration via Enhanced Calibration

Previous studies have consistently observed that a language model tends to repeat itself, creating repetitions within an output sequence. To cope with the issue, stochastic decoding schemes have been the de facto approach; such strategies add randomness at inference time and thereby avoid the "self-loop". However, the remedy comes at the cost of output quality due to the randomness involved. In this work, we introduce a deterministic decoding scheme, local temperature beam search. This inference algorithm is an embarrassingly simple variant of beam search, yet it reduces repetition to a level below that of sampling-based decoding algorithms while maintaining the coherence of beam search. Our idea is rooted in the concept of model calibration; we view repetition as a consequence of overconfidence in a model. Accordingly, our work mitigates the miscalibration that arises in the course of inference with a post-calibration approach applied in a beam-specific manner. The proposed inference scheme is validated on text completion tasks, in which the repetition problem is seen most clearly, and is exhaustively compared with existing inference schemes.


Introduction
Neural language models have gained much attention with ground-breaking performances (Vaswani et al., 2017; Lewis et al., 2020), and accordingly, decoding algorithms have been studied extensively along with such models (Holtzman et al., 2020; Kool et al., 2019; Meister et al., 2021; Fan et al., 2018). An inference algorithm aims to find an optimal hypothesis from a search space, where the degree of optimality is commonly approximated by the probabilities a language model assigns. The choice of decoding/search algorithm can result in significant differences in model outputs, such as in diversity and coherence (Ippolito et al., 2019). For this reason, a large body of research has sought an optimal search algorithm (Cho, 2016; Ippolito et al., 2019; Meister et al., 2022; Holtzman et al., 2020; Fan et al., 2018).
In a broad view, there are two branches of inference algorithms: deterministic and stochastic. The deterministic branch includes greedy decoding and beam search, which are maximization-based inference strategies; these methods select tokens that maximize the sequence probability, hence generating coherent, high-quality sequences. However, it has been consistently reported that maximization-based schemes generate highly repetitive outputs (Welleck et al., 2020; Fu et al., 2021). Therefore, stochastic decoding algorithms, such as top-p (Holtzman et al., 2020) and top-k (Fan et al., 2018), have been the de facto options in settings where a language model is likely to repeat itself, such as text completion. However, with the randomness introduced, the stochastic methods are at risk of generating incoherent sequences (Holtzman et al., 2020).
In this work, we view the repetition problem of a language model from the standpoint of model calibration. Our intuition is rooted in two interesting findings: 1) a language model assigns high probabilities to repeating tokens (Holtzman et al., 2020), but 2) human texts hardly contain repetition within a sequence (Paulus et al., 2017). In summary, we hypothesize that a language model is overconfident when repeating itself, assigning spuriously high predictive scores to predictions that are unlikely to appear. The calibration of a language model is of particular importance in beam search (Müller et al., 2019); beam search keeps only a finite number of "likely" beams, whose likelihood is the (log) probability assigned by the model during inference. Therefore, when a probability in a beam is overestimated due to overconfidence, the search is biased towards the overconfident beam, leading to degeneration in text, i.e. repetition.
In this light, we propose local temperature beam search, a deterministic algorithm that mitigates the long-standing repetition problem of deterministic search while fully enjoying the strengths of a maximization-based approach. We mitigate the bias in beams caused by overconfidence by introducing local temperature scaling, in which a temperature value is decided in a context-specific manner. Accordingly, repetition under our decoding scheme is reduced as much as with a sampling-based scheme, whereas the coherence score of generated outputs is as high as that of a maximization-based strategy. We attribute this improvement to its connection to n-gram blocking (Paulus et al., 2017); we illustrate how the proposed beam search is an implicit version of n-gram blocking. In this sense, the proposed idea brings more freedom to the beam search process while reducing repetition as n-gram blocking does. The proposed decoding scheme is thoroughly tested on text completion, in which stochastic decoding schemes have been dominant until now.
The contributions are as follows:
• Our work views the repetition problem of a language model from the standpoint of model calibration.
• We conduct a preliminary experiment that empirically illustrates the link between overconfidence and repetition.
• Our work bridges n-gram blocking to the proposed decoding scheme, and we attribute the success of local temperature beam search to the implicit penalty present in our algorithm.
• The proposed beam search is robust to a wide choice of beam width and temperature.

Neural Sequence Generation
Given a neural language model parameterized with θ, neural sequence generation is the auto-regressive prediction of a sequence of tokens. The inference procedure at time step t is as follows:

ŷ_t = f(P(y | ŷ_<t; θ))    (1)

where P(y | ŷ_<t; θ) is the conditional probability distribution mapped by the model, f is a decoding/search algorithm, and ŷ_<t is the series of tokens preceding time step t. Inference is done in a left-to-right manner, predicting the next token given a context. In return, an output is a sequence of tokens ŷ ∈ V+, where V+ is the Kleene closure, the set of all possible strings over the vocabulary V.
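To make the procedure concrete, the following is a minimal sketch of the left-to-right inference loop. It assumes that model is any callable mapping a batch of token ids to logits of shape (batch, length, |V|); the names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def generate(model, prefix_ids, max_new_tokens, decode_fn):
    """Left-to-right generation: at each step, map the context to
    P(y | y_<t; theta) and let the decoding function f pick the next token."""
    ids = list(prefix_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))          # (1, t, |V|)
        probs = F.softmax(logits[0, -1], dim=-1)     # P(y | y_<t; theta)
        ids.append(int(decode_fn(probs)))            # y_t = f(P(y | y_<t; theta))
    return ids

# Greedy decoding is the simplest choice of f.
greedy = lambda probs: torch.argmax(probs).item()
```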

Maximization Decoding
A maximization decoding algorithm simply takes the most "likely" word in a context; a language model continues a sequence by appending tokens that maximize the sequence probability. Specifically, the log sentence probability, often referred to as the score, is defined as follows:

s(ŷ_t, ŷ_<t) = s(ŷ_<t) + log P(ŷ_t | ŷ_<t; θ)    (2)

The scoring function s takes the inferred prediction ŷ_t and computes the log sentence probability by adding the log probability of the current prediction to the score of the preceding inferences. Continuing a sequence with the most probable candidate is called greedy decoding. When the search space expands to multiple candidates, it becomes beam search.
Beam search is a form of breadth-first search that expands and keeps only B probable beams in the course of inference. The selection of the B beams is based on their scores as follows:

Ŷ_t = arg max_{Ŷ′ ⊆ Ŷ_{t−1} × V, |Ŷ′| = B} Σ_{ŷ_≤t ∈ Ŷ′} s(ŷ_t, ŷ_<t)    (3)
Ŷt indicates a set of hypotheses/beams at time step t and the size of the set is the beam width B.
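As a reference point, a minimal sketch of this selection rule is given below; it assumes the same toy model interface as above (logits of shape (batch, length, |V|)) and ignores end-of-sequence handling for brevity.

```python
import torch
import torch.nn.functional as F

def beam_search(model, prefix_ids, beam_width, max_new_tokens):
    """Keep the B highest-scoring hypotheses, where the score of a hypothesis
    is its cumulative log-probability: s(y_t, y_<t) = s(y_<t) + log P(y_t | y_<t)."""
    beams = [(list(prefix_ids), 0.0)]                      # (tokens, score)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            logits = model(torch.tensor([tokens]))[0, -1]
            log_probs = F.log_softmax(logits, dim=-1)
            top_lp, top_ids = torch.topk(log_probs, beam_width)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        # Equation 3: keep only the B best-scoring expansions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```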
The maximization inference schemes excel at generating coherent text and have been widely applied (Vaswani et al., 2017; Lee et al., 2021). However, this comes at the expense of diversity. Numerous studies have reported that a sequence generated with a likelihood-maximizing algorithm tends to contain repetitive phrases (Fu et al., 2021; Holtzman et al., 2020).

Calibration
Model calibration indicates the trustworthiness of a model's predictions (Guo et al., 2017); a well-calibrated model makes predictions whose predictive scores match their accuracy. Formally,

P(Ŷ = Y | P̂ = p) = p,  ∀p ∈ [0, 1]

where Ŷ and P̂ are the prediction and the probability assigned when making the prediction. For instance, when a calibrated model makes a prediction with probability 0.8, the chance of that prediction being correct is 0.8. When the predictive score is greater than the accuracy, a model is overconfident; in the opposite case, the model is underconfident. There have been constant efforts to enhance the calibration of a model (Lee et al., 2022), temperature scaling (Guo et al., 2017) being one of the early and effective attempts.
Temperature scaling is a post-processing calibration method (Guo et al., 2017) that scales a logit vector with a single global temperature τ:

P(y_i | y_<t; τ, θ) = softmax(z / τ)_i = exp(z_i / τ) / Σ_j exp(z_j / τ)

P(y_i | y_<t; τ, θ) refers to the conditional probability distribution over the output space scaled with a temperature τ, and z is the logit vector. When τ is set to ∞, the resulting probability distribution becomes uniform. On the contrary, when the temperature approaches a value close to 0, the distribution collapses to a one-hot encoding with its whole probability mass assigned to the arg max index. With this scaling mechanism, temperature scaling has been found to mitigate the overconfidence problem in a neural network when the temperature is set greater than 1.0 (Müller et al., 2019; Guo et al., 2017).
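For illustration, a minimal sketch of global temperature scaling applied to a single logit vector (variable names are ours):

```python
import torch
import torch.nn.functional as F

def scale_with_temperature(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """P(y_i | y_<t; tau, theta) = softmax(z / tau)_i: tau > 1 flattens the
    distribution, while tau -> 0 pushes it towards a one-hot at the argmax."""
    return F.softmax(logits / tau, dim=-1)

z = torch.tensor([4.0, 1.0, 0.5])
print(scale_with_temperature(z, 1.0))  # sharp: most mass on the first index
print(scale_with_temperature(z, 2.0))  # smoother, higher-entropy distribution
```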

Related Work
Recent studies have introduced decoding schemes that mitigate the weaknesses of the popular decoding schemes. Diverse beam search (Vijayakumar et al., 2016) adds sequence dissimilarity between beams to the score function, encouraging diversity in the final B outputs. Diverse sibling beam search (Li et al., 2016) penalizes beams that share the same root, so that outputs do not overlap with each other. Delayed beam search (Massarelli et al., 2020) transitions from sampling to maximization: the first j time steps are inferred with a sampling-based scheme and the rest of the generation is done with beam search. Stochastic beam search (Kool et al., 2019) and conditional Poisson stochastic beam search (Meister et al., 2021) bring randomness to beam search by sampling predictions at each time step. Cho (2016) proposes adding noise to a hidden state of a decoder, exploring the data manifold with the noise; this noise-based decoding has been shown to mitigate low diversity in generated text and is referred to as Noisy Parallel Approximate Decoding (NPAD). Keskar et al. (2019) introduce a repetition penalty, in which a repeating token is penalized by discounting the corresponding logit value while leaving other logit values untouched. Lastly, contrastive search (Su et al., 2022) is a recent attempt that utilizes hidden representations in order to avoid repetition. The approach noticeably reduces repetitions, yet its core weaknesses are 1) high computation cost and 2) high dependency on the isotropy of the inference model.
Another popular branch of inference strategies is sampling. Top-k sampling (Fan et al., 2018) considers only the k most probable tokens when sampling, truncating the sampling space to k indexes. Top-p sampling considers the smallest set of indexes whose probabilities sum to at least p. Typical sampling (Meister et al., 2022) samples a token based on the expected information content of a token. For a more detailed explanation, please refer to Appendix A.

Language Models Make Repetitions with Overconfidence
We make a connection between repetition and overconfidence with a preliminary experiment; we find that a language model is largely overconfident when producing a repetition. We draw this observation by comparing the predictive scores assigned when generating n-gram repetitions with the probability of n-gram repetitions appearing in human-written text. Specifically, we compute the average predictive score of a language model when generating a unigram, bigram, trigram, and quadgram repetition in the course of text generation. For the probability of repetitions appearing, we compute the probability of a unigram, bigram, trigram, and quadgram repetition in human-written text. We compare the predictive scores with the probability of such repetitions appearing; the result is depicted in Table 1.
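As a rough illustration of the second quantity, one way to estimate how often a newly completed n-gram repeats an earlier one in human-written text is sketched below; the paper's exact counting protocol may differ, so treat this as an assumption.

```python
def ngram_repetition_rate(token_ids, n):
    """Fraction of positions at which the newly completed n-gram has already
    appeared earlier in the same sequence (a proxy for P(n-gram-rep))."""
    seen, repeats, total = set(), 0, 0
    for t in range(n - 1, len(token_ids)):
        gram = tuple(token_ids[t - n + 1 : t + 1])
        if gram in seen:
            repeats += 1
        seen.add(gram)
        total += 1
    return repeats / max(total, 1)

# Example: "a b a b" repeats the bigram (a, b) once out of three bigrams.
print(ngram_repetition_rate(["a", "b", "a", "b"], 2))  # 0.333...
```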
It is clear that a language model assigns a spuriously high probability mass, around 0.9, when making a repetition, even though there is only a marginal chance of such a repetition appearing in human-written text (Table 1).

Local Temperature Beam Search
We propose an embarrassingly simple variant of beam search that handles the miscalibration issue in the course of inference; we apply a calibration method, namely temperature scaling (Guo et al., 2017), at every time step at which a beam is found to be overconfident, thereby preventing the accumulation of miscalibration and degeneration in text.

Progressive Local Temperature
As witnessed in Table 1, the chance of an n-gram repetition appearing in human-written text decreases exponentially as n grows, whereas the predictive score that a language model assigns to an n-gram repetition increases linearly. Accordingly, we propose to handle this discrepancy in model calibration with a progressive local temperature: a beam is assigned a temperature proportional to its level of overconfidence.
The local temperature τ_{t,b} (Equation 6) is set from the global temperature τ and a temperature-increasing factor γ: whenever appending ŷ^max_{t,b}, the argmax prediction of the distribution mapped by the inference model given the context ŷ_{<t,b}, would create an n-gram repetition, the temperature is raised in proportion to the level of repetition. The temperature τ_{t,b} is thus both progressive and local; the beam-specific and time-step-specific temperature value is computed on-the-fly during inference. It is worth noting that the proposed approach does not increase the temperature for unigram repetitions, since a word is likely to appear multiple times in an utterance. In addition, when a generated text is free of repetition, the probability distributions over the inference remain untouched; in that case, the proposed local temperature beam search reduces to vanilla beam search.
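A sketch of how the beam-specific temperature could be computed is shown below. The repetition check follows the description above, but the exact temperature schedule (here τ + γ·(k − 1) for a repeated k-gram, with no increase for unigrams) is our assumption about Equation 6, not the paper's verbatim definition.

```python
def local_temperature(prev_tokens, candidate, tau, gamma, max_n=4):
    """Return the temperature for one beam: if appending the argmax candidate
    completes a k-gram (2 <= k <= max_n) that already occurs in the beam,
    raise the temperature with the repetition length; otherwise return 1.0
    so the distribution stays untouched (vanilla beam search)."""
    extended = list(prev_tokens) + [candidate]
    longest = 0
    for k in range(2, max_n + 1):
        if len(extended) < k + 1:
            break
        new_gram = tuple(extended[-k:])
        earlier = {tuple(extended[i:i + k]) for i in range(len(extended) - k)}
        if new_gram in earlier:
            longest = k
    if longest == 0:
        return 1.0
    return tau + gamma * (longest - 1)   # assumed schedule: tau + gamma*(k - 1)
```

The returned temperature can then divide the beam's logits before the softmax, as described in the next subsection.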
The core rationale behind the use of n-gram-based checking is to lessen model dependency in the proposed decoding scheme; recent studies, e.g. contrastive search (Su et al., 2022), utilize hidden representations to identify signs of repetition. One major drawback is that such algorithms require a language model to have an isotropic representation space (Su and Collier, 2022). However, some language models, e.g. gpt2-small (Radford et al., 2019), are found to display anisotropic representations (Li et al., 2020; Su et al., 2022). Consequently, decoding schemes that involve hidden representations are not model-agnostic. On the contrary, by utilizing n-gram-based matching, the proposed method alleviates the model dependency, and hence local temperature beam search can be coupled with off-the-shelf language models with fewer prerequisites.

Scaling Probability with Local Temperature
With the local temperature computed for each beam at each time step, each beam's probability distribution is scaled with its local progressive temperature.
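A minimal sketch of this step, assuming a (B, |V|) matrix of per-beam logits and one temperature per beam (names are illustrative):

```python
import torch
import torch.nn.functional as F

def scale_beam_logits(beam_logits: torch.Tensor, beam_temps) -> torch.Tensor:
    """Divide each beam's logits by its own local temperature before the
    (log-)softmax, so an overconfident beam gets a flatter distribution and
    a lower continuation score."""
    taus = torch.tensor(beam_temps, dtype=beam_logits.dtype).unsqueeze(-1)  # (B, 1)
    return F.log_softmax(beam_logits / taus, dim=-1)
```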
We see repetition as overconfidence and set a temperature value that reliably increases the entropy of the distribution; this smooths the distribution and decreases spuriously high predictive scores. Figure 1 contrasts the two search behaviours: vanilla beam search falls into a self-overlap pattern such as "to repeat itself", whereas local temperature beam search penalizes a beam at the first sign of repetition and carries that penalty over to subsequent time steps, yielding output that is free of repetition yet of high quality.
There exist two core differences between our temperature scaling and vanilla temperature scaling (Guo et al., 2017), the first being how a temperature value is chosen. In (Guo et al., 2017), the temperature is a parameter learned on a validation dataset, as a model can be either overconfident or underconfident. We remove this learning process, since repetition indicates overconfidence, thereby leaving τ as a hyperparameter with the condition τ ≥ 1.0. The second is the use of a beam-specific progressive temperature. The proposed local temperature scaling has n temperature options for each beam, {τ, τ+γ, ..., τ+γ(n−1)}, while vanilla temperature scaling maintains a single global temperature τ.

Connection to n-gram Blocking
The local temperature τ_b contributes more than simply smoothing a probability distribution; mitigating the overconfidence of a beam with τ_b is equivalent to penalizing the overall score of the beam. We connect this aspect to n-gram blocking and illustrate how our approach is an implicit version of n-gram blocking.
Let there be a repetitive n-gram at time step t in a beam b′. The score of the beam under n-gram blocking (Equation 8) and under our approach (Equation 9) is:

s(ŷ_{t,b′}, ŷ_{<t,b′}) = −∞    (8)

s(ŷ_{t,b′}, ŷ_{<t,b′}) = s(ŷ_{<t,b′}) + log P(ŷ_{t,b′} | ŷ_{<t,b′}; τ_{t,b′}, θ)    (9)

Under n-gram blocking, the beam with the repeating n-gram is explicitly penalized, as its score is set to −∞; the beam is immediately removed from the search boundary by the arg max operation in Equation 3. On the other hand, local temperature beam search computes a score that is reduced by an amount proportional to τ_b when the beam is found to contain a repetition; the beam is implicitly penalized rather than removed. The core driving force of the proposed approach is not just this implicit penalty, but the accumulation of the penalty in the beam throughout the rest of the inference. As seen in Equation 2, the scoring function is recursive; the score of a previous time step carries over to the following time steps, so a penalty remains present in a beam for the rest of decoding. Therefore, the penalized beam is at risk of dropping out of the beam search boundary. For instance, in Figure 1b, the second occurrence of the word "itself" receives a penalty not only within that time step but also from the previous time step, as it is found to be a bigram repetition. With this penalty, the beam is dropped from the beam search group, and other beams are explored during inference.

Dataset
The efficacy of the proposed decoding scheme is tested on text completion tasks, in which the self-repetition problem has been widely witnessed (Holtzman et al., 2020). Given a prefix, a language model coupled with an inference algorithm generates a sequence of tokens conditioned on the prefix. We conduct experiments on the popular Wikitext-103 dataset (Merity et al., 2016) and WebText (Radford et al., 2019). The test sets are composed of 2.2k sentences for Wikitext-103 and 5k for WebText.

Experiment Details
For the models used in the experiments, we utilize pretrained language models available in the transformers library by Hugging Face (Wolf et al., 2020a). Specifically, we conduct experiments with GPT2-small, GPT2-medium, GPT2-large, and GPT2-small finetuned on Wikitext-103; links to the model checkpoints are listed in Appendix D. Given a prefix of length 32, a language model generates a maximum of 128 tokens following the prefix. For all experiments, we set the global temperature to 1.0. For top-p sampling, p is set to 0.8, and for contrastive search, k and α are set to 4 and 0.6, respectively. For beam search and ours, the beam width is set to 10 unless explicitly mentioned otherwise. For the hyperparameter n in the proposed approach, we set the value to 4, hence checking repetitions from bigram to quadgram. For the temperature-increasing factor γ, we set the value to 0.3 for WebText and 0.5 for Wikitext-103.
For evaluation metrics, we report rep-n, the average ratio of duplicated n-grams in a generated sequence; a high rep-n indicates a sequence filled with n-gram repetitions. In addition to the rep-n score, following (Su et al., 2022), we also report the diversity score. Aside from the repetition-related metrics, we report perplexity, the Sim-CSE score for coherence, and the G-score, which indicates the overall quality of the generated text, considering both coherence and diversity. For more details, please refer to Appendix B.
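For reference, a common formulation of the two repetition metrics is sketched below, following Su et al. (2022); the paper's exact normalization (e.g. token- vs. word-level n-grams) may differ.

```python
def rep_n(tokens, n):
    """rep-n = 1 - (# unique n-grams / # n-grams); higher means more repetition."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def diversity(tokens, ns=(2, 3, 4)):
    """Product of (1 - rep-n) over n; close to 1 for repetition-free text."""
    score = 1.0
    for n in ns:
        score *= 1.0 - rep_n(tokens, n)
    return score
```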

Evaluation Result
From Table 2, it is evident that the proposed strategy has the lowest repetition and the n-gram frequencies closest to those of humans across multiple datasets and different language models. Beam search and greedy decoding clearly suffer from repetition, shown by high rep-n and low diversity scores. The problem is mitigated to a certain level by the sampling-based schemes, such as top-p. However, the most noticeable gain in handling the problem comes from the proposed method. Although our method is a deterministic algorithm, it outperforms the stochastic decoding strategies in minimizing the self-loop. Furthermore, the G-score is higher than that of other decoding strategies across every dataset tested, demonstrating the balance of coherence and diversity in the text generated by the proposed decoding scheme. Lastly, unlike contrastive search, our approach is free from the issues brought by the anisotropic representations of language models. Although contrastive search works well with the GPT2-large model, the decoding scheme makes a trivial difference compared to vanilla beam search and greedy decoding when paired with the GPT2-small and GPT2-medium models. This is in sharp contrast to our results, as the proposed local temperature beam search fulfills its purpose with any language model.

Beam Width and Temperature
It has been consistently reported that an increase in beam width leads to further degeneration of generated text, increasing the overlap within a generated sequence (Holtzman et al., 2020). Our experimental results in Table 3 further support this claim; vanilla beam search suffers significantly from an increased beam width. For instance, we observe an increase in every rep-n metric as the beam width grows, clearly showing that vanilla beam search is vulnerable to the choice of beam width. On the contrary, our strategy is robust to the choice of beam width; in fact, we observe a drop in each repetition metric with increased beam width. Therefore, unlike vanilla beam search, our search algorithm can be equipped with a search boundary of varying size.
Furthermore, increasing the global temperature does not guarantee prevention of repetitions. Beam search with the temperature set to 2.0 achieves a meaningful gain in reducing repetition compared to beam search with temperature 1.0. However, even with this improvement, the diversity score still stays around 29. This implies that simply increasing the global temperature does not mitigate the repetition problem of language models, and that incorporating a local temperature is necessary to prevent text degeneration in terms of repetition.

Progressive Temperature
The proposed method applies a progressive temperature, assigning a higher temperature to beams with longer n-gram repetitions. In this ablation study, we test a non-progressive temperature setting, in which any beam with an n-gram repetition receives the same temperature value regardless of the repetition length. As demonstrated in Table 3, the progressive temperature setting is one of the core aspects of local temperature beam search; when temperatures are non-progressive and shared across the beams with repetitions, we observe a clear drop in diversity scores. We find that outputs with non-progressive temperatures are filled with short repetitions, as demonstrated by the increase in rep-2 scores.

Computation Cost
Identifying and handling overconfident beams inevitably adds computation cost, yet the cost is trivial. The proposed approach performs n-gram matching at every time step. However, the n-gram matching is simply counting the n-grams of a sentence. Therefore, the addition of such counting operations at each time step does not add much computation cost over vanilla beam search.
Beyond text completion, we also evaluate generalization to machine translation. Table 4 depicts how previous repetition-handling strategies fail to generalize in machine translation tasks. Top-p sampling faces a severe drop in BLEU score compared to vanilla beam search, with the drop amounting to up to 8.1 points. The same applies to contrastive search; its BLEU score, averaged across the 4 corpora tested, is 14 points lower than that of beam search. On the contrary, the proposed decoding strategy generates text of the same quality as beam search. This empirical finding widens the scope of potential applications of our inference algorithm and illustrates its superior generalization ability compared to existing methods.

Conclusion & Future Study
In this study, we view the repetition problem of a language model as a calibration issue; a language model repeats itself because the model is overconfident in its predictions. In this light, we propose local temperature scaling, in which the post-calibration method is applied only to the overconfident beams. Our local temperature beam search is a deterministic decoding strategy that excels at reducing repetition while maintaining coherence; we attribute this success to the implicit penalty applied in the course of generation. Lastly, unlike existing inference strategies, the proposed idea is robust to a wide range of temperature and beam width choices.
The objective of this paper is centered on reducing self-overlap in a sequence, and hence the subject of local temperature scaling is chosen accordingly. However, local temperature beam search can be utilized in tasks with different aims, as the local temperature serves to implicitly penalize a beam. For example, a language model can be penalized with the proposed idea when it generates a sequence containing gender bias. We believe that ways to utilize our approach can be explored in future research.

Limitations
Since the proposed method scales the probability distribution rather than shifting it, the proposed idea does not change the output when coupled with greedy decoding. Greedy decoding simply takes the argmax of the probability distribution at each time step, and hence the output with or without local temperature scaling remains the same. Therefore, the proposed idea needs to be used with beam search or a variant of it.

A Sampling Decoding
Different from the deterministic approaches, a sampling generation scheme requires a stochastic process; a prediction is done by sampling a token from a predicted categorical distribution, exploring a search space.
ŷ_t ∼ P(y | ŷ_<t; θ),  y ∈ Ṽ

where Ṽ ⊆ V. When Ṽ = V, this is referred to as pure sampling. This practice, however, introduces high randomness in inference since the sampling space is large (e.g. |V| = 32k in WMT14 (Vaswani et al., 2017)), and thus variants have been introduced in which the sampling space is truncated to a subset. Top-k sampling (Fan et al., 2018) limits the sampling space to the top k probable indexes. However, considering a fixed number of candidates is found to be suboptimal, for instance when dealing with a flat probability distribution (Holtzman et al., 2020). Therefore, top-p (nucleus) sampling (Holtzman et al., 2020) is proposed, where the sampling space is truncated to the smallest set such that the sum of its probabilities is equal to or greater than a pre-defined p. Due to this flexibility, top-p is known to perform well on varying shapes of distributions. With randomness injected, sampling-based methods have been utilized in diversity-promoting settings, such as dialogue (Tian et al., 2020) and story generation (Fan et al., 2018). However, the diversity comes at a price; it has been reported that stochastic decoding algorithms are positively correlated with an increase in hallucination (Dziri et al., 2021). Furthermore, Ippolito et al. (2019) witness a trade-off between diversity and quality in such algorithms.
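A minimal sketch of nucleus (top-p) sampling over a single next-token distribution is given below; it is a generic implementation, not the exact code used in the experiments.

```python
import torch

def top_p_sample(probs: torch.Tensor, p: float) -> int:
    """Sample from the smallest set of tokens whose cumulative probability
    reaches p, after renormalizing the kept probabilities."""
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens up to and including the first position where the
    # cumulative mass reaches p.
    cutoff = int(torch.searchsorted(cumulative, p)) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = int(torch.multinomial(kept, 1))
    return int(sorted_ids[choice])
```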

B Metric
Sim-CSE We utilize the Sim-CSE sentence embedding model (Gao et al., 2021) to measure the coherence between the prefix and the generated text; the prefix and the generated text are both fed to the Sim-CSE model, and cosine similarity is computed between their sentence embeddings.
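A sketch of how this score could be computed with an off-the-shelf SimCSE checkpoint from the transformers library is shown below; the checkpoint name and pooling choice are assumptions, not necessarily the exact setup of the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# One published SimCSE checkpoint; the paper's exact checkpoint may differ.
NAME = "princeton-nlp/sup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(NAME)
encoder = AutoModel.from_pretrained(NAME)

def coherence(prefix: str, continuation: str) -> float:
    """Cosine similarity between the SimCSE embeddings of the prefix and the
    generated continuation."""
    batch = tokenizer([prefix, continuation], padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        emb = encoder(**batch).pooler_output        # (2, hidden)
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
```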

Figure 1: We illustrate inference with (a) vanilla beam search and (b) the proposed local temperature beam search in a text completion task with the prefix "language models are known". A vanilla beam search generates a sequence with a self-overlap pattern, such as "to repeat itself". On the contrary, local temperature beam search avoids degeneration in text as the mechanism penalizes a beam with a sign of repetition, and the penalty is accumulated to subsequent time steps. Therefore, the output inferred with the proposed decoding scheme is free of repetition, yet of high quality.

Table 1: Preliminary experiment result on the WebText corpus. The n-gram repetitions, dubbed n-gram-rep hereafter, and the corresponding predictive scores, P(ŷ; θ), are obtained from predictions made with beam search, with the beam size set to 10. The likelihood of an n-gram repetition appearing, P(n-gram-rep), is computed from the ground truth, i.e. human-written text.

Table 2: Evaluation results, in the setting of (Holtzman et al., 2020), on the WebText and Wikitext-103 test datasets. Following (Holtzman et al., 2020), for repetition-related metrics, bold numbers indicate scores that are the closest to those of the ground truth. For Sim-CSE and G-score, a bold number indicates the best performance.

Table 3: Ablation study on the Wikitext-103 test dataset. ∆ indicates changes in hyperparameters. τ_n indicates a non-progressive local temperature. B indicates the beam width and γ indicates the temperature-increasing factor.