Reflective Decoding: Beyond Unidirectional Generation with Off-the-Shelf Language Models

Publicly available, large pretrained Language Models (LMs) generate text with remarkable quality, but only sequentially from left to right. As a result, they are not immediately applicable to generation tasks that break the unidirectional assumption, such as paraphrasing or text-infilling, necessitating task-specific supervision. In this paper, we present Reflective Decoding, a novel unsupervised algorithm that allows for direct application of unidirectional LMs to non-sequential tasks. Our 2-step approach requires no supervision or even parallel corpora, only two off-the-shelf pretrained LMs in opposite directions: forward and backward. First, in the contextualization step, we use LMs to generate ensembles of past and future contexts which collectively capture the input (e.g. the source sentence for paraphrasing). Second, in the reflection step, we condition on these “context ensembles”, generating outputs that are compatible with them. Comprehensive empirical results demonstrate that Reflective Decoding outperforms strong unsupervised baselines on both paraphrasing and abductive text infilling, significantly narrowing the gap between unsupervised and supervised methods. Reflective Decoding surpasses multiple supervised baselines on various metrics including human evaluation.


Introduction
Language Models (LMs) like GPT-2 (Radford et al., 2019), trained over vast unstructured data, can leverage enhanced generation methods (Holtzman et al., 2020; Martins et al., 2020; Welleck et al., 2019) to give fluent and coherent continuations to given input text, e.g. news articles or stories.

Figure 1: Illustration of REFLECTIVE DECODING applied to paraphrasing and abductive infilling (αNLG; Bhagavatula et al., 2020). Only the right-context is shown, although both right- and left-contexts are used in practice. First, the contextualization step (1) captures aspects of an input by generating many representative contexts for it. Then, in the reflection step (2), we sample generations that can replace the input and fit these representative contexts (←RD).
GPT-3 (Brown et al., 2020) takes this a step further: given a small number of examples and a well-constructed prompt, it shows remarkable performance on tasks where vast quantities of supervised data and finetuning were thought to be necessary. While this demonstrates the potential for LM-decoding in few-shot or even zero-shot out-of-the-box settings, limited access to GPT-3 and immense computational cost keep this from being a widely usable or efficient solution. Yet recent work shows that GPT-2 may hold similar capabilities when it is primed correctly. Li and Liang (2021) achieve supervised-level performance in a few-shot setting using smaller, accessible models like GPT-2: they learn a small number of task-specific vectors as a prefix to the input, without tuning the model itself. Off-the-shelf GPT-2 is thus capable of few-shot learning given the right setup; our work aims to push this concept further, by showing that out-of-the-box LMs can solve complex generation problems simply by using the right decoding algorithm.
We introduce REFLECTIVE DECODING, a novel decoding method that allows LMs to be applied to generation tasks that break the "text continuation" paradigm, such as paraphrasing and text infilling. REFLECTIVE DECODING requires no supervision, only two complementary off-the-shelf LMs: one forward (→LM) and one backward (←LM). That means no per-task finetuning, not even on unstructured text in the target domain.
Inspired by the distributional hypothesis (Firth, 1957), REFLECTIVE DECODING works by generating text that might occupy the same contexts as an input. We use two LMs (→LM and ←LM) to generate contexts for a given input, which implicitly capture aspects of its meaning (the contextualization step). Then, in the reflection step, we condition on this ensemble of contexts, decoding over the input with generations that are distributionally related to the input and can replace it.
Paraphrasing is a natural application: a good paraphrase should intuitively be compatible with the same contexts as the original text. REFLECTIVE DECODING shows strong unsupervised paraphrasing performance: on the Quora question pair dataset, we find one variant of our model (RD30) outperforms unsupervised baselines on all but one metric, and supervised baselines on both the SARI metric and human evaluation. We see the same trends on the Twitter URL corpus (Lan et al., 2017).
REFLECTIVE DECODING can also be applied to tasks that only replace part of the input, or generate within it, like infilling; on αNLG (Bhagavatula et al., 2020), we outperform the best unsupervised baseline on overall quality, effectively halving the gap with supervised methods. In both applications, REFLECTIVE DECODING directly applies off-the-shelf pretrained models, without finetuning on the task or target domain. This provides evidence that off-the-shelf Language Models can excel at surprising applications when paired with decoding algorithms designed to elicit specific kinds of information.

Notation
Arrows indicate the order in which sampling functions condition on and generate tokens: → indicates generating from the left-most token to the right (left-to-right), while ← indicates going right-to-left. For Language Models (LMs), this means →LM is what is often called a "forward LM", while ←LM is a "backward LM". For our sampling function (RD), the arrow also indicates which generated context is being conditioned on, e.g. →RD conditions on left context, extending it to the right to generate output.

Overview
Currently, LM-decoding is limited to a text continuation paradigm. Given an input text s_input, LM(c|s_input) generates contexts c that might come after (forward, i.e. →LM) or before (backward, i.e. ←LM) the input. LM-decoding generates outside of the input by continuing it, but many tasks require us to generate over or within the input: paraphrasing requires reformulating the input, while infilling requires inserting text in the middle of it.
Reflective Decoding addresses this shortcoming by turning conventional LM-decoding around. While LM(c|s_input) generates the kinds of contexts c the input might appear in, RD generates s that might replace s_input in these same contexts. The distributional hypothesis (Firth, 1957) suggests semantically similar texts appear in similar contexts, meaning RD is also likely to sample in the semantic neighborhood of s_input.
Concretely, REFLECTIVE DECODING samples s that fits the same contexts as s_input in 2 simple steps. We first sample many representative contexts c_i that could neighbor the input, e.g. using →LM in Figure 1. This is the contextualization step. Second, in the reflection step, we generate text in the opposite direction (using ←LM in Figure 1), which fits these contexts as well as s_input fits them. To consider all c_i's while decoding, we ensemble the different distributions imposed by conditioning on each c_i:

←RD(s) = (1/Z) ∏_i ←LM(s | c_i)^{w_i}    (1)

where Z normalizes the fraction to a proper probability distribution (see Equation 2). In essence, this is a Product of Experts (Hinton, 2002) framework: we can generate a hypothesis s that fits the full contextual fingerprint. Yet some contexts are more informative than others: probable but generic contexts like "See the appendix for details." are not descriptive of neighboring text. We learn weights w_i to prioritize contexts c_i in the ensemble that are most informative for s_input, by maximizing the probability of s_input under Equation 1 (described in Algorithm 1). In effect, we are learning an on-the-fly autoencoder at inference time, using weighted ensembles of contexts as a representation (see §2.7, §A.1).
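The weighted Product of Experts above can be sketched numerically. This is a minimal illustration, assuming the per-context next-token log-probabilities have already been computed by the backward LM and stacked into an array:

```python
import numpy as np

def ensemble_next_token(log_probs, weights):
    """Weighted Product of Experts over next-token distributions (a
    sketch of Equation 1/2). A weighted sum of log-probabilities equals
    the log of the weighted product of probabilities.

    log_probs: (n_contexts, vocab) array, log LM(token | c_i, prefix)
    weights:   (n_contexts,) learned context weights, summing to 1
    """
    combined = weights @ log_probs              # (vocab,) unnormalized log-probs
    # Normalize over the vocabulary (the Z / denominator of Equation 2).
    combined -= np.logaddexp.reduce(combined)
    return np.exp(combined)
```

With one-hot weights the ensemble reduces to a single expert's distribution; mixing weights interpolates between the constraints the contexts impose.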
To motivate how this method functions, consider the paraphrasing example from Figure 1 with input s_input = "How are circulatory system tissues formed?" Generated contexts reflect different aspects of s_input: c_1 situates s_input as a question ("This is a medical question..."), while c_2 and c_3 explore central concepts ("as with all tissue..."; "about the circulatory system"). Even though each context could follow many sentences, together they form a fingerprint for s_input. A sentence that could be followed by all of c_1, c_2, c_3 will likely be a question (c_1), about tissue formation (c_2), and the circulatory system (c_3), and will generally occupy the same semantic neighborhood as s_input, e.g. "How do circulatory systems form?"
In the case of paraphrasing, our task is to replace all of s_input with something that might appear in the same contexts. Other tasks, however, might require us to replace only part of a sentence (e.g. in-context paraphrasing) or even insert text at a given position (e.g. infilling). REFLECTIVE DECODING makes this easy: simply hold part of s_input static when generating from RD.

REFLECTIVE DECODING
Here we dive into the details of REFLECTIVE DECODING by considering the right-hand context ensemble (←RD), keeping in mind that the process is repeated on the left-hand side as well (→RD). First, in the contextualization step (line 1 of Algorithm 1), we sample many right-hand contexts c_i for s_input, using →LM. These will be used as a representative sample of the contexts s_input appears in. Second, in the reflection step (lines 2 & 3), our goal is to construct a sampling function ←RD that will yield texts similar to s_input. We define ←RD token-wise as:

←RD(s_j | s_{j+1:n}) = ∏_i ←LM(s_j | s_{j+1:n}, c_i)^{w_i} / ∑_{v∈V} ∏_i ←LM(v | s_{j+1:n}, c_i)^{w_i}    (2)

This is equivalent to Equation 1, but gives the exact normalization factor in the denominator.
Equation 2 is a token-wise Product of Experts model that captures the semantic neighborhood of s_input via the combination of contexts c_i and their weights w_i (§2.7). We learn w_i that maximize ←RD(s_input) (the probability of generating s_input under ←RD), thereby up-weighting contexts specific to s_input. We initialize these weights (line 2), then train them (line 3) using the Adam optimizer (Kingma and Ba, 2014). We normalize the weights into a proper probability distribution at every step.
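The weight-learning step can be sketched as follows. This is a simplified illustration, not the paper's implementation: plain gradient ascent stands in for Adam, the per-token normalization (denominator of Equation 2) is ignored for brevity, and a softmax parametrization keeps the weights a proper distribution:

```python
import numpy as np

def learn_context_weights(token_logprobs, steps=100, lr=0.5):
    """Learn ensemble weights w_i that maximize the (unnormalized)
    probability of s_input under the context ensemble (sketch of
    lines 2-3 of Algorithm 1).

    token_logprobs: (n_contexts, n_tokens) array where entry (i, j) is
        log LM(s_input_j | c_i, preceding tokens of s_input).
    """
    n_contexts = token_logprobs.shape[0]
    logits = np.zeros(n_contexts)              # softmax parametrization
    g = token_logprobs.sum(axis=1)             # d(objective)/d(w_i): total
    for _ in range(steps):                     # log-prob under each context
        w = np.exp(logits - logits.max())
        w /= w.sum()
        # Gradient through the softmax: w_i * (g_i - sum_k w_k g_k)
        logits += lr * w * (g - w @ g)
    w = np.exp(logits - logits.max())
    return w / w.sum()
```

Contexts under which s_input is more probable receive higher weight, which is the "up-weighting contexts specific to s_input" effect described above.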
The reverse-direction →RD is learned symmetrically, flipping the roles of →LM and ←LM and sampling left-hand context instead (see §B.1 for details). Finally, we generate from ←RD (and →RD), sampling outputs that would appear in the same contexts as s_input. Depending on the application, we rank and select a final output in different ways, always using →LM and ←LM together to capture bidirectional fit.

Implementation
Weight Learning and Pruning Context weights w_i are learned using the Adam optimizer (Kingma and Ba, 2014). In practice this takes under 100 steps (negligible time compared to LM decoding). While we sample tens of contexts (line 1 of Algorithm 1), many end up with negligible weight under the learned distribution (Equation 2). To sample efficiently from ←RD and →RD, we drop all but the top k_c contexts and renormalize the weights, so that only k_c < n_c contexts are used during the reflection step.
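The pruning step is straightforward; a minimal sketch:

```python
import numpy as np

def prune_contexts(contexts, weights, k_c):
    """Keep only the k_c highest-weight contexts and renormalize their
    weights into a proper distribution (sketch of the pruning step)."""
    order = np.argsort(weights)[::-1][:k_c]    # indices of top-k_c weights
    kept_w = np.asarray(weights, dtype=float)[order]
    kept_w = kept_w / kept_w.sum()             # renormalize to sum to 1
    return [contexts[i] for i in order], kept_w
```

Since the dropped contexts carry negligible weight, the ensemble distribution is nearly unchanged while the reflection step only has to condition on k_c contexts.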
Parameters We sample n_c contexts to describe the source s_input. We use nucleus sampling (Holtzman et al., 2020).

Figure 2: Example generations. For αNLG, given o_1: "Ray hung a tire on a rope to make his daughter a swing." and o_2: "Ray ran to his daughter to make sure she was okay.", RD generates: "He put her on the swing, and while she was on the swing, she fell off and was lying on the ground." Given o_1: "Tom and his family were camping in a yurt." and o_2: "He chased it around until it left the yurt.", RD generates: "He went to the yurt and found a bear that was in the yurt". For paraphrasing, given the source "is it possible to make money as a film critic?", RD30 generates "is there a way to make money as a film critic?" and RD45 generates "is it possible to make a living as a movie critic?"

Application: Paraphrasing
To paraphrase, we begin by generating candidate outputs. Following §2.3, the REFLECTIVE DECODING sampling function is learned in each direction (→RD, ←RD) using the source sentence s_input. Then, n_s generations are sampled from both →RD and ←RD:

s_1, ..., s_{n_s} ∼ →RD,   s_{n_s+1}, ..., s_{2·n_s} ∼ ←RD

This gives a robust set of candidates that are compatible with the same left and right contexts as s_input. Many of these will be semantically related to s_input, but they must be scored and ranked in order to select true paraphrases. REFLECTIVE DECODING is based on the notion that good "fit" with the same contexts is a robust measurement of similarity, yielding a natural "contextual scoring function" (Equation 7 and §2.7). We measure how likely candidate s is to generate the same contexts that s_input did when constructing →RD and ←RD:

score(s) = ∑_{c_i ∈ c_rh} log →LM(c_i | s) + ∑_{c_i ∈ c_lh} log ←LM(c_i | s)

where c_rh are the generated contexts used in ←RD, and c_lh those used in →RD. This explicitly estimates how similar the contexts of s and s_input are on both sides, the underlying objective of REFLECTIVE DECODING.
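The contextual scoring above can be sketched as follows. The `fwd_lm_logprob` and `bwd_lm_logprob` callables are hypothetical interfaces to the forward and backward LMs (the test below substitutes a toy stand-in), and contexts are weighted uniformly for simplicity:

```python
def contextual_score(candidate, ctxs_rh, ctxs_lh, fwd_lm_logprob, bwd_lm_logprob):
    """Rank a paraphrase candidate s by how likely it is to generate the
    same contexts that were sampled for s_input (sketch of the
    contextual scoring function). fwd_lm_logprob(c, s) is an assumed
    interface returning log LM(c | s); bwd_lm_logprob is its backward
    analogue for left-hand contexts."""
    right = sum(fwd_lm_logprob(c, candidate) for c in ctxs_rh)
    left = sum(bwd_lm_logprob(c, candidate) for c in ctxs_lh)
    return right + left
```

Candidates are then sorted by this score, so the selected paraphrase is the one whose contexts most closely match those of the source on both sides.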

Application: Abductive Reasoning
Abductive natural language generation (αNLG; Bhagavatula et al., 2020) is the task of filling in the blank between two observations o_1 and o_2 with a hypothesis h that abductively explains them. The challenge for LM-decoding is making use of context from both sides (o_1 on the left and o_2 on the right). This is particularly challenging for unsupervised decoding methods because unidirectional LMs cannot naturally condition on both sides when generating h.
REFLECTIVE DECODING simplifies this problem by capturing information about both o_1 and o_2 in a single decoding function (←RD or →RD), then holding o_1 and o_2 static at generation time (i.e. teacher forcing). Concretely, we use the concatenation o_1 + o_2 as s_input in Algorithm 1, and construct sampling functions →RD, ←RD informed by both observations. We are interested in sampling in between o_1 and o_2, so when sampling hypotheses h from ←RD we condition on the right-side observation o_2 (and vice-versa for →RD and o_1). This is equivalent to appending the given observation to the sampled contexts:

h ∼ ←RD(· | o_2),   i.e. each right-hand context c_i becomes o_2 + c_i

Note that both →RD and ←RD contain information about both o_1 and o_2, effectively turning a 2-sided contextual constraint into a 1-sided one.
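The observation-conditioning trick amounts to a simple transformation of the context set. A minimal sketch (joining with a single space is an illustrative assumption; real tokenization would differ):

```python
def condition_on_observation(contexts_rh, o2):
    """Turn the 2-sided constraint into a 1-sided one: sampling h from
    ←RD conditioned on o2 is equivalent to attaching o2 to every sampled
    right-hand context, since o2 sits between h and each c_i in the text."""
    return [o2 + " " + c for c in contexts_rh]
```

The hypothesis h is then generated right-to-left as an ordinary ←RD sample against these augmented contexts, with o_2 held static.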
We also use a task-specific scoring function to rank sampled hypotheses. We would like the hypothesis h that best explains both observations, and so use Language Models to measure this:

score(h) = log →LM(o_2 | o_1 + h) + log ←LM(o_1 | h + o_2)

Adding h should help to "explain" each observation given the other, i.e. o_2 should follow from o_1 + h and o_1 from h + o_2. To filter hypotheses that only explain one of the two observations, we remove any that make either observation less probable than the empty hypothesis, imposing:

→LM(o_2 | o_1 + h) ≥ →LM(o_2 | o_1)   and   ←LM(o_1 | h + o_2) ≥ ←LM(o_1 | o_2)
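The filtering rule can be sketched directly. The `fwd_lm_logprob` / `bwd_lm_logprob` callables are hypothetical interfaces returning log →LM(x | prefix) and log ←LM(x | suffix) respectively; space-joining stands in for real tokenization:

```python
def keep_hypothesis(h, o1, o2, fwd_lm_logprob, bwd_lm_logprob):
    """Keep h only if it makes neither observation less probable than
    the empty hypothesis does (sketch of the filtering constraint)."""
    helps_o2 = fwd_lm_logprob(o2, o1 + " " + h) >= fwd_lm_logprob(o2, o1)
    helps_o1 = bwd_lm_logprob(o1, h + " " + o2) >= bwd_lm_logprob(o1, o2)
    return helps_o2 and helps_o1
```

Hypotheses passing this filter are then ranked by the score above, so the selected h must raise the probability of both observations, not just one.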

Intuitions and Theory
Here we discuss the theoretical intuition behind REFLECTIVE DECODING as a way to sample generations that share contextual "fit" with a source text, deriving the sampling function of Equation 2. We start by considering how to relate the meaning of two texts, generation s and input s_input. We follow a distributional intuition (Firth, 1957): meaning can be understood through the contexts in which text appears. Many distributional approaches learn contentful neural representations by predicting context given input text (Mikolov et al., 2013; Kiros et al., 2015), then compare these representations to establish semantic similarity. We can, instead, compare contexts directly, judging the difference in meaning between texts s_input and s by their divergence:

D(s_input, s) = D_KL( →LM(c | s_input) ∥ →LM(c | s) )    (6)

We use →LM to interchangeably denote the theoretical left-to-right distribution of text and the LM estimating it. Thus, →LM(c|s) is the distribution over right contexts c given sentence s, and Equation 6 can be understood as the "contextual information difference" we expect s to have from s_input. Note, we could similarly use left-hand context and ←LM, and do so in practice.
We use finite-sample cross entropy as an effective empirical proxy for D_KL:

Ĥ(s_input, s) = −(1/n_c) ∑_i log →LM(c_i | s),   c_i ∼ →LM(c | s_input)    (7)

where c_i ∼ →LM(c|s_input) indicates sampling contexts for s_input from →LM. Intuitively, we want to minimize this score when generating s: an optimal output has a similar meaning to s_input and so fills approximately the same contextual hole, minimizing the value of this "contextual distance".
In this form, Ĥ compares two complete texts, s and s_input, but we are trying to generate s for which the divergence from s_input is low. We flip the role of "text" and "context" to define a function from which we can sample s:

←RD(s_j | s_{j+1:n}) = ∏_i ←LM(s_j | s_{j+1:n}, c_i)^{w_i} / ∑_{v∈V} ∏_i ←LM(v | s_{j+1:n}, c_i)^{w_i}    (8)

(equivalent to Equation 2, derived in §A.1). Here s_j is the j-th token in s (sampled right-to-left from n to 0), and V is the vocabulary. Weights w_i are learned by maximizing the probability of s_input. Equation 8 estimates the probability of predicting s_input and s from a finite set of contexts c_i generated from s_input. This approximately minimizes Equation 6, as being generated by the same weighted ensemble of contexts strongly correlates with generating the same contexts in the same proportions, i.e. low divergence, due to the sparsity of language. We can thus sample s with low contextual distance from s_input using ←RD. Further, we can use left context to construct →RD by simply reversing the directions of the LMs used.

Task: Paraphrasing

Past work has emphasized the important challenge of generating novel paraphrases (Liu et al., 2010; Chen and Dolan, 2011). We address this in 3 ways. First, we explicitly quantify a simple notion of novelty, Novelty(s) = 100 − BLEU(s, s_input), to quantify the novelty-quality trade-off. Second, we include the SARI metric (Xu et al., 2016), which explicitly balances novelty from the input with reference overlap. Third, we use an overall human quality metric accounting for this.
We have humans evaluate fluency, consistency, and novelty on Amazon Mechanical Turk. The overall score ("Human" in Table 1) is the rate at which examples meet thresholds for all 3: fluent enough to understand, with at most minor differences in meaning and at least minor differences in wording. On Quora, we test 200 examples, with agreement (Fleiss' κ; Fleiss, 1971) of 0.40 (fluency), 0.54 (consistency), 0.77 (novelty) and 0.48 (overall), i.e. moderate to substantial agreement (Landis and Koch, 1977). We also compare against a machine-translation approach (see Sec 6), pivoting through German using Transformer (Vaswani et al., 2017) models trained on WMT19 data (Barrault et al., 2019). MT is included in a separate section in our results as it uses supervised bilingual data (Table 1).
We include supervised baselines: the pointer generator trained by imitation learning (PG-IL) as in Du and Ji (2019), the diversity-promoting DiPS model (Kumar et al., 2019), and a finetuned BART model (Lewis et al., 2019), which uses a more complex pretraining method than our LMs. Note that DiPS generates multiple diverse paraphrases, so we pick one at random. CGMH and REFLECTIVE DECODING both return multiple sampled, ranked paraphrases. We can easily control for Novelty by taking the highest-ranked output that meets a Novelty threshold. For both, we have a version with no threshold (Top), and versions with thresholds such that average Novelty is 30 and 45. Novelty cutoffs do not depend on the reference, only the source, and are equivalent to selecting with BLEU-ori (Novelty is 100 minus BLEU against the original).
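Threshold-based selection is a simple loop over the ranked candidates. A sketch, where `bleu` is an assumed callable returning a 0-100 sentence-BLEU against the source (not a specific library API):

```python
def select_with_threshold(ranked, source, threshold, bleu):
    """Pick the highest-ranked paraphrase whose Novelty meets a
    threshold, with Novelty(s) = 100 - BLEU(s, source) (sketch of the
    RD_Top / RD30 / RD45 selection scheme)."""
    for s in ranked:                    # ranked best-first by contextual score
        if threshold is None or 100.0 - bleu(s, source) >= threshold:
            return s
    return ranked[0]                    # fall back to the top-ranked output
```

With `threshold=None` this reduces to the Top setting; raising the threshold trades contextual rank for wording that departs further from the source.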

Task: Abductive NLG
Task: The Abductive natural language generation task (αNLG), presented in Bhagavatula et al. (2020), requires generating a hypothesis that fits between two given observations, o_1 and o_2.

Results and Analysis
Paraphrasing First, the Quora dataset: on automatic metrics from past work (BLEU, METEOR, TERp), our lowest-Novelty model setting (RD_Top) achieves the highest unsupervised scores, and the highest overall on BLEU. Other high-scoring rows (Source, PG-IL) are similarly low-Novelty. The SARI metric explicitly balances Novelty with similarity to the reference; on SARI, such low-Novelty models perform worse. The best overall model on SARI is our medium-Novelty setting (RD30), which outperforms MT and supervised models.
Our human evaluation measures what fraction of outputs are found to be fluent, consistent, and novel. As with SARI, both our mid- and high-Novelty models perform quite well, again with the medium-Novelty setting outperforming all baselines. Further validating SARI as a proxy for human judgment, the two metrics share the same top-5 models.
Results on the Twitter URL corpus largely mirror those on Quora. REFLECTIVE DECODING achieves the best unsupervised scores on novelty-aware metrics (Table 2), with the best overall SARI, even outperforming the reference on the human metric, although MT achieves the highest overall.
In sum, REFLECTIVE DECODING is able to compete on previously used quality metrics that favor low Novelty, but can also produce more varied outputs preferred by humans. RD45 is among the best models by SARI and Human on Quora despite exceeding the novelty of even the reference.
αNLG Results on αNLG (Table 3) present a strong case that REFLECTIVE DECODING can effectively use bidirectional context. Strong hypotheses use information from both the initial observation o_1 and the future observation o_2. Humans rated the ability of REFLECTIVE DECODING to capture this at 42.4, about 17 points above the next-best unsupervised baseline and only 15 points below the best supervised method tested. We see similar results for the overall evaluation. A likely factor in this is the comparatively high degree of coherence between h and o_2 achieved by REFLECTIVE DECODING. Where other methods seem to pay more attention to observation o_1 (their o_2 column generally has much lower values), REFLECTIVE DECODING has comparably high coherence with the left-hand (o_1) and right-hand (o_2) contexts.
We also include example generations in Figure 2 to demonstrate the ability of REFLECTIVE DECODING to combine o_1 and o_2. For example, h = "He put her on the swing, and while she was on the swing, she fell off and was lying on the ground." incorporates information from both observations. Specifically, it takes into account the swing that Ray is building for his daughter, which is only mentioned in o_1, and hypothesizes about a potential injury due to Ray checking on his daughter in o_2. See the appendix for more generations.
Overall, the strong performance of REFLECTIVE DECODING on αNLG shows that unsupervised generation with context ensembles applies to infilling in addition to paraphrasing.

Discussion
REFLECTIVE DECODING Out-of-the-Box A major advantage of applying REFLECTIVE DECODING is ease-of-use: armed with our pretrained language models, practitioners can immediately begin generating. With general pretrained models and underlying principles that are domain-agnostic, REFLECTIVE DECODING works across a broad range of text styles, no finetuning required, making exploration and adaptation simple. Multiple rounds of generation mean REFLECTIVE DECODING may run slower than other methods at inference time, but it avoids training time. There are clearly settings that favor supervised learning (a narrow, known domain with abundant training data), but REFLECTIVE DECODING is a good option to begin generating and exploring immediately with high-quality generation.
A useful abstraction for understanding REFLECTIVE DECODING relative to current applications is "prompting", i.e., writing a prefix to implicitly or explicitly describe a task for a pretrained model. REFLECTIVE DECODING generates natural contexts that the desired generation would appear in. This breaks from other methods of automatic prompting, which often forego "natural" prompts (Shin et al., 2020; Reynolds and McDonell, 2021), even making them continuous (Li and Liang, 2021; Hambardzumyan et al., 2021; Lester et al., 2021; Qin and Eisner, 2021). REFLECTIVE DECODING also notably creates a set of prompts (contexts) for each example, where other methods attempt to learn an overall task prompt. Still, all of these are connected by the popular intuition that useful behavior in pretrained models can be induced through contextual input.
Future Applications REFLECTIVE DECODING can extend beyond the experiments here. A simple example is in-context paraphrasing, i.e. writing a paraphrase that fits the true context the original sentence appears in. Most existing paraphrasing methods consider only out-of-context sentences, and would require significant changes to treat context as a constraint; for REFLECTIVE DECODING, we can simply combine true and generated contexts within the same algorithm.
Driving REFLECTIVE DECODING is a notion of context as a representation, with clear potential for future work. Pretrained LMs capture rich information about text spans, but accessing it without fine-tuning is nontrivial; within the model it is an uninterpretable mass of parameters and activation weights. Our work observes that unidirectional LMs capture this information only to predict adjacent context (this is the sole learning signal), so all of this information is expressed in the model's context predictions. Thus, we capture some of this rich information to represent spans, by capturing a finite-sample version of this full predictive distribution in generated contexts. In REFLECTIVE DECODING specifically, we use this form of representation to generate back into the source span (paraphrasing or infilling), but the notion can be applied much more generally. In translation, for instance, we might first generate contexts for the source sentence that represent its meaning, noisily translate these contexts, then impose that any translation of the source fit the same contexts under a target-language LM. Constraining translations in this way could add robustness to existing systems by anchoring translations to informative contexts. Beyond explicit generation, we might use a very large LM like GPT-3 to define a strong scoring function or metric as in Equation 7, first generating contexts for some target sentence, then scoring candidates by how well they generate these same contexts. As in our work, such a score would indicate how well the option fills the same contextual role as the target, harnessing the strong reasoning of whatever model is used.

Related Work
Distributional Intuitions A key aspect of REFLECTIVE DECODING is using a distributional intuition to represent the meaning of a text through many contexts. Kiros et al. (2015) and Miao et al. (2019) quantify semantic relationships, and Lin and Pantel (2001) identify paraphrastic relationships, under similar intuitions. A major point of difference between past work and ours is that we sample explicit contexts, allowing unsupervised generation back from these contexts, while past work typically learns a neural representation based on contexts and conditions on this vector-encoded representation.
Unsupervised Paraphrasing Some approaches train neural variational auto-encoders unsupervised to represent source sentences, then decode from these representations to paraphrase (Roy and Grangier, 2019b; Bao et al., 2019). This requires training specialized representations, whereas REFLECTIVE DECODING applies general-purpose LMs. We compare to Roy and Grangier (2019b). Paraphrasing by editing the input (Miao et al., 2019; Liu et al., 2019) has shown promise. Like REFLECTIVE DECODING, these approaches can be applied without training specialized models, but they are necessarily limited by edit-paths and local minima, as edits are often restricted to single-word replacement, insertion, and deletion. Generated paraphrases must follow a continuous local edit path, while REFLECTIVE DECODING can generate new sentences from scratch. REFLECTIVE DECODING and MT-based paraphrasing both pivot through an alternative textual form to paraphrase (context and translation, respectively). But MT paraphrasing systems cycle-translate through a pivot language (Federmann et al., 2019; Wieting and Gimpel, 2018), which requires supervised bilingual translation data, with an implicit notion of interlingual paraphrasing.

Novelty in Paraphrasing Mao and Lee (2019) observe that paraphrases close to the source often win on automatic quality metrics. However, dissimilarity from the source correlates with human notions of paraphrasing (Liu et al., 2010). Kumar et al. (2019) increase novelty through their diversity-promoting sampling method. Alternative metrics that consider novelty alongside quality have been proposed (Sun and Zhou, 2012; Federmann et al., 2019). The SARI metric (Xu et al., 2016), included here, combines these notions.
Abductive Text Infilling αNLG (Bhagavatula et al., 2020) is a text infilling task that specifically measures the ability of models to explain bidirectional context (observations o_1, o_2) with a hypothesis that fits between them. This naturally fits REFLECTIVE DECODING, which fills in contextual gaps. Recent work has directly addressed this task (Qin et al., 2020), while the infilling literature is also quite applicable (Donahue et al., 2020). We compare to both of these methods on abductive infilling, showing superior results.

Conclusions
We present REFLECTIVE DECODING, a novel unsupervised text generation method for tasks that do not fit the text continuation paradigm. It uses just two pretrained Language Models to generate contexts that capture aspects of the input text, then generates back into the input from there. It significantly outperforms unsupervised baselines in quality and novelty for paraphrasing. Further, in abductive natural language generation, it outperforms unsupervised baselines by a significant margin and halves the gap with supervised models. REFLECTIVE DECODING uses the concept of representing meaning with generated contexts, offering new possibilities for unsupervised conditional text generation.

Ethical Considerations
In order to complete our human evaluation we used Amazon Mechanical Turk. We estimated the range of times we expected our task to take, and made sure that at minimum workers would be paid a wage of $15.00 per hour if they were solely completing our task.
As part of this effort, we plan to release our code and model. Our forward and backward language models are the same size as the publicly available GPT-2 (Radford et al., 2019). Training time/energy was likely significantly smaller than the original release; existing code and hyperparameters were available, and we use a smaller dataset. Further, there is no publicly available backward GPT-2 model that we are aware of, so releasing a pair of forward and backward models that were trained on the same data allows for proper comparisons about left-to-right vs. right-to-left processing of English text.
We estimate that the potential dangers of releasing this from a malicious generation perspective are low. Our forward model is similar to already released GPT-2 models. While the backward model adds new generation potential and scientific novelty, it is unlikely to compare to GPT-3 (Brown et al., 2020) which many hobbyists and private companies now have access to. We believe that releasing a pair of forward and backward models will be more useful to researchers who wish to study the symmetries and asymmetries of the linguistic distribution.

A.1 Derivation of Sampling Function
Here we derive the sampling function used for REFLECTIVE DECODING, which allows generation using contextual similarity. This supplements §2.7. P_{c|s} denotes the distribution of contexts c for sentence s. This will be 1-sided context, for instance right-hand context c_rh (i.e. P_{c|s} would be estimated by the left-to-right →LM conditioned on s: →LM(c|s)). The reversed P_{s|c} goes back from context towards text; with right-hand context, this is estimated by ←LM(s|c). In §2.7, we consider the task of comparing a source sentence s_src with another sentence s. For instance, we may want to know if s is a paraphrase of s_src. Following a distributional intuition (Firth, 1957), we define a simple way to compare meaning:

D(s_src, s) = D_KL( P_{c|s_src} ∥ P_{c|s} )

where D_KL is the Kullback-Leibler divergence measuring the difference between the distributions P_{c|s_src} and P_{c|s}. This captures the notion above: we take the amount the contexts of s_src and s differ as a proxy for their difference in meaning.
In paraphrase generation, we want to select for contextual closeness, and thus only need to rank options. We therefore use cross-entropy:

$$CE\big(P_{c|s_{src}}, P_{c|s}\big) = -\sum_{c} P_{c|s_{src}}(c)\,\log P_{c|s}(c)$$

which is equivalent to $D_{KL}$ up to a constant offset (the entropy of $P_{c|s_{src}}$, which does not depend on $s$), and is easier to estimate. Here, the sum over $c$ indicates every possible context $c$, but in practice we use finite samples. From §2.7, this quantifies contextual difference in meaning. For paraphrasing, we want a sentence $s$ that minimizes this, which is equivalent to maximizing the exponent of its negation. Applying Bayes' rule, $P_{c|s}(c) = P_{s|c}(s)P(c)/P(s)$:

$$\exp\big(-CE(P_{c|s_{src}}, P_{c|s})\big) = \prod_{c} P_{c|s}(c)^{P_{c|s_{src}}(c)} = a_0\, P(s)^{-1} \prod_{c} P_{s|c}(s)^{P_{c|s_{src}}(c)}$$

Constant $a_0$ results from the factors of $P(c)$. The result is a Product of Experts (Hinton, 2002). The factor $P(s)^{-1}$ will prioritize more context-specific paraphrases (low probability overall, but likely in context). However, our LMs are not well equipped to handle unlikely text (expressivity is likely spent on likely text). Second, while less likely text can have higher similarity, this may not be the goal of our system. Rather, we want related sentences that are also fluent and reasonable, so we drop $P(s)^{-1}$ (the equivalent of multiplying in $P(s)$), biasing the model towards likely sequences. This yields a product of experts of the form:

$$\prod_{c} P_{s|c}(s)^{w_{c|s}}$$

We must set the weights $w_{c|s}$ in the finite-sample setting. To keep with this format, we enforce that the weights constitute a proper distribution. In the limiting case (unlimited samples), $w_{c|s}$ should be set to $P_{c|s_{src}}(c)$. However, these are likely not efficient estimation weights; further, exponentiating by this estimate will magnify errors. Instead, we learn these weights using a heuristic, discussed later.
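The weighted product of experts above can be made concrete with a small sketch. The toy per-context distributions and uniform weights below are assumptions for illustration only; the actual method uses backward-LM probabilities and heuristically learned weights:

```python
import math

def product_of_experts(dists, weights):
    """Combine per-context distributions P_{s|c_j} by a weighted geometric
    mean (product of experts), renormalizing over the candidate set."""
    n = len(dists[0])
    scores = [
        math.exp(sum(w * math.log(d[i]) for d, w in zip(dists, weights)))
        for i in range(n)
    ]
    total = sum(scores)
    return [s / total for s in scores]

# Two "experts": each row is a toy distribution over three candidate
# sentences under a different sampled context c_j.
expert_1 = [0.6, 0.3, 0.1]
expert_2 = [0.2, 0.6, 0.2]
combined = product_of_experts([expert_1, expert_2], [0.5, 0.5])

assert abs(sum(combined) - 1.0) < 1e-9
# A candidate must be reasonably likely under *every* context to score well:
assert combined[1] == max(combined)
```

Note that candidate 1 wins despite not being the top choice of expert 1: the product form penalizes any candidate that some context finds unlikely.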
Next, we move to the finite-sample setting, replacing distributions with LM estimates. Here we consider right-context (meaning $P_{s|c}$ is estimated by ←LM); the left-context case proceeds symmetrically. Substituting in the LM distribution:

$$\overleftarrow{RD}(s) \propto \prod_{j=1}^{n_c} \overleftarrow{LM}(s \mid c_j)^{w_j}$$

where the product is now over the finite set of sampled contexts $c_1 \dots c_{n_c}$. We convert this to a sampling function by decomposing into the tokens of the generation $s = s_0 \dots s_n$; since $\overleftarrow{LM}$ generates right-to-left, each token is conditioned on the tokens to its right and the context:

$$\overleftarrow{RD}(s_i \mid s_{i+1} \dots s_n) \propto \prod_{j=1}^{n_c} \overleftarrow{LM}(s_i \mid s_{i+1} \dots s_n, c_j)^{w_j}$$

A.2 Entropy Calibration

When sampling generations, generation parameters (e.g. the truncation parameter $p_s$ from nucleus sampling, used in paraphrasing) control how "greedy" or stochastic sampling is. However, the effect of $p_s$ depends on many dynamic (example-wise) factors: setting $p_s$ too low may sample only the most likely option, while setting it too high gives off-topic candidates. The "correct" value of $p_s$ is highly example-dependent. We define entropy calibration to control how much "randomness" is used in sampling in a robust way. Rather than directly setting one $p_s$ for all examples, it specifies the approximate entropy ĥ to sample with for each example. In the greedy case, for instance, the desired entropy ĥ is set to 0 (i.e. picking from a set of 1 possible option).
Although $p_s$ is a token-level parameter, we search in each case for the $p_s$ that is expected to give the desired entropy for the full generation. To estimate this, we take the sampling entropy over the source text $s_0 \dots s_n$ under the nucleus-sampling truncated distribution $P_p$:

$$\hat{h}(p_s) = -\sum_{i=0}^{n} \sum_{v \in V_{p_s}} P_{p_s}(v \mid s_0 \dots s_{i-1}) \log P_{p_s}(v \mid s_0 \dots s_{i-1})$$

where $V_{p_s}$ is the truncated vocabulary (at each position) with parameter $p_s$. We select the $p_s$ that gives a desired entropy, setting this to 4 or 6, which we found effective (App. B.4).
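A minimal sketch of this calibration, under stated assumptions: the toy per-position next-token distributions stand in for the LM's predictions on the source text, `nucleus_truncate` is a hypothetical helper implementing standard nucleus truncation, and bisection is used because the summed truncated entropy is nondecreasing in $p_s$:

```python
import math

def nucleus_truncate(dist, p):
    """Keep the smallest set of highest-probability tokens with total
    mass >= p, then renormalize (standard nucleus truncation)."""
    order = sorted(range(len(dist)), key=lambda v: -dist[v])
    kept, mass = [], 0.0
    for v in order:
        kept.append(v)
        mass += dist[v]
        if mass >= p:
            break
    return {v: dist[v] / mass for v in kept}

def entropy(trunc):
    """Shannon entropy of a truncated, renormalized distribution."""
    return -sum(q * math.log(q) for q in trunc.values())

def calibrate_p(dists, target_h, iters=50):
    """Bisect for the p_s whose summed per-position truncated entropy
    over the source text best matches the target entropy h-hat."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        h = sum(entropy(nucleus_truncate(d, mid)) for d in dists)
        if h < target_h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy next-token distributions at two source positions.
dists = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

# Target entropy 0 recovers near-greedy truncation: one token per step.
p_greedy = calibrate_p(dists, target_h=0.0)
assert sum(entropy(nucleus_truncate(d, p_greedy)) for d in dists) == 0.0
```

Because the truncated vocabulary changes in discrete steps, the target entropy is matched approximately rather than exactly; the bisection converges to the threshold nearest the target.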

B.4 Parameters
Here, we give model settings for our two experimental settings, paraphrasing and αNLG; see Table 4. αNLG requires higher variety (higher h_sample and p_c) and fewer generated contexts (n_c). We experimented with different reasonable values on the dev set for each model, evaluating manually. We use Transformer language models (Mega size) trained with TensorFlow on TPU pods of size 512; these will be made publicly available. For generation we used 2 NVIDIA Titan Xp GPUs.

Example αNLG generations from unsupervised and supervised models (two instances per model):

Model | Example 1 | Example 2
RD (us) | She had problems and needed help. | He put her on the swing, and while she was on the swing, she fell off and was lying on the ground.
GPT-2-fixed | I didn't think to her, this was a normal situation of course, that's what he does, right? |
DeLorean | Sammy was a very sweet girl | She hit the rope and the tire fell on top of her.
ILM | She wanted my daughter to have a new boyfriend | His daughter was flying on the rope.
Supervised:
COMeT-Emb | Sammy was in a car accident | Ray's daughter fell off the swing.
COMeT-Txt | Sammy got into a bad accident and her car broke down | Ray's daughter fell and fell off the swing.

C.2 Human Evaluation
Human evaluation for Quora and Twitter is largely described in §3. We reiterate that thresholds are used for each measure, and "overall" is the rate at which all thresholds are met. Agreement is calculated on these binary combined-threshold categories (following Schouten, 1986). Full human results for paraphrasing are in Table 6. Human evaluation for αNLG is described in §3.

C.3 Twitter Dataset
We include here the full results for paraphrasing on the Twitter URL corpus (Lan et al., 2017), a set of paraphrase pairs created by linking tweets with matching shared URLs. We test unsupervised models CGMH, R-VQVAE (UPSA Twitter model is not available), and the backtranslation MT model. We include supplementary results to the main paper in Table 7.