HeLo: Learning-Free Lookahead Decoding for Conversation Infilling

We propose Heuristic Guided Lookahead Decoding (HeLo), a novel decoding strategy for conversation infilling. Conversation infilling aims to generate a seamless bridge of utterances connecting a given pair of source and target utterances. HeLo does not require fine-tuning or extra models, only the generating model itself. Instead, HeLo leverages a greedy lookahead phase before committing to any token. The HeLo framework is simple and can augment conventional decoding strategies paired with any autoregressive language model. Smooth transitions between utterances are encouraged with an annealing schedule. Our experiments show HeLo outperforms several baselines when evaluated with both automatic and human evaluation metrics, which, we argue, are appropriate for the task.


Introduction
Large pretrained language models are effective solutions to many popular natural language generation tasks such as machine translation and conversational dialogue. Guided content generation, however, is an equally compelling application that has received relatively less attention. In this setting, humans cooperate with language models to produce works of creative writing such as stories (Akoury et al., 2020; Coenen et al., 2021).
In this paper, we explore the cooperative generation of conversations, which we dub conversation infilling. Conversation infilling aims to generate a seamless bridge of utterances connecting a given pair of source and target utterances. Such a task arises in many forms of creative writing, such as playwriting, movie scripts, and video game dialogue. For example, production of video game dialogue is large in scale: game worlds often contain more interactive characters than a writer could ever hope to compose unique dialogues for. We envision conversation infilling as a scalable method to generate conversations where writers control the high-level aspects of conversations while relying on an AI-assisted writing tool to fill in the details. While similar to the task of text infilling (Zhu et al., 2019), conversation infilling explicitly requires the generation of an entire dialogue between two interlocutors rather than arbitrary text.
We propose a heuristic guided lookahead decoding strategy (HeLo, pronounced "hello!") for the task of conversation infilling. HeLo does not require fine-tuning or additional models outside the generating model itself. Instead, before committing to any token, HeLo performs greedy lookaheads to generate potential future conversations and prioritizes tokens that bring the conversation closer to the target utterance with a heuristic scoring function. To encourage smooth transitions between utterances, the magnitude of this heuristic bias depends on the current depth of the conversation.
We compare HeLo against several baselines across five datasets and propose a diverse set of automatic evaluation metrics, which, when taken together, are reasonable for our task. Our experiments demonstrate that HeLo outperforms all baselines on the majority of these metrics, albeit at the cost of generation speed. While speed is vital in real-time settings such as chitchat, we contend that it is fair to perform conversation infilling in an offline setting where speed holds a lower priority. We also perform a small human evaluation study that suggests HeLo is a promising approach to conversation infilling.
To remedy this shortcoming, we propose HeLo, a decoding strategy to approximate Equation 1.

HeLo Decoding
Let p_θ be a parameterized autoregressive language model trained to generate utterances (one token at a time) in response to dialogue histories. Let V be the vocabulary of p_θ. During the decoding process, we wish to bias selection towards tokens that encourage y_tgt to appear in the future. On the other hand, the resulting utterance should also be a sensible response to its own dialogue history (containing y_src). Modifying the distribution learned by p_θ would therefore be at odds with this goal.
To balance these competing objectives, we take inspiration from A* search, a path search algorithm that leverages a heuristic function to find a maximum-score path. HeLo treats decoding as a path search problem where nodes are partial conversations. Traversing to a connected node is analogous to extending a conversation by one token. At time step t, HeLo considers |V| potential tokens to extend the conversation with. The score of a potential token is the HeLo score

f(y_t^(i)) = log p_θ(y_t^(i) | x_t) + λ · h(y_t^(i)),   (2)

where y_t^(i) is the t-th token of the i-th utterance and x_t is the dialogue history of y^(i) together with the tokens of y^(i) generated so far. Our heuristic function h(·) is the log probability of y_tgt given the conversation so far and a possible multi-token continuation of the conversation if y_t^(i) were selected. We denote this continuation as y^+:

h(y_t^(i)) = log p_θ(y_tgt | x_t, y_t^(i), y^+).

Specifically, we greedily generate y^+ by selecting tokens that satisfy

y_k^+ = argmax_{v ∈ V} p_θ(v | x_t, y_t^(i), y^+_{<k})

until p_θ generates a stop token indicating the end of an utterance. In other words, we forecast a possible future state by performing a lookahead. If y_tgt is likely to follow this future state, we assign greater importance to the token that initialized this state.
Similar to past methods that employ A*-like heuristics in beam search (Noda and Sagayama, 1995; Sun et al., 2017; Meister et al., 2020; Lu et al., 2021), HeLo uses f(·) to compute updated scores for all tokens under consideration and otherwise proceeds identically to beam search. That is, instead of maintaining a priority queue of all partial conversations explored so far (as in A* search), we only maintain the top-k partial conversations (i.e., paths) ranked by f(·).
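The scoring and lookahead procedure described above can be sketched in a model-agnostic way. This is a minimal illustration, not the authors' implementation: `logprob_fn`, which maps a token sequence to a dictionary of next-token log-probabilities, and all function names are our own.

```python
def greedy_lookahead(logprob_fn, seq, stop_token, max_len=32):
    """Greedily extend `seq` token by token until the end-of-utterance
    stop token (or a length cap) is reached."""
    out = list(seq)
    for _ in range(max_len):
        log_probs = logprob_fn(out)            # dict: token -> log-prob
        nxt = max(log_probs, key=log_probs.get)
        out.append(nxt)
        if nxt == stop_token:
            break
    return out

def target_logprob(logprob_fn, seq, target):
    """h(.): log p(y_tgt | seq), accumulated token by token."""
    prefix, total = list(seq), 0.0
    for tok in target:
        total += logprob_fn(prefix)[tok]
        prefix.append(tok)
    return total

def helo_score(logprob_fn, context, cand, target, stop_token, lam=1.0):
    """f = log p(cand | context) + lam * h(cand): the candidate's own
    log-probability plus the lambda-weighted lookahead heuristic."""
    base = logprob_fn(context)[cand]
    future = greedy_lookahead(logprob_fn, context + [cand], stop_token)
    return base + lam * target_logprob(logprob_fn, future, target)
```

In a full decoder, `helo_score` would be evaluated for each candidate token at each step, and only the top-k partial conversations ranked by this score would be kept, as in beam search.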

Annealing Schedule
Intuitively, the influence of y_tgt is less critical at the start of a conversation, where the priority is to transition smoothly from y_src. However, the importance of y_tgt peaks when we generate the final utterance y^(L). To encourage HeLo to smoothly transition to the next utterance, we experiment with an exponential annealing function similar to that proposed by Pascual et al. (2021). We update the heuristic score in Equation 2 as λ_i · h(y_t^(i)), where λ_i = λ_0 · e^(c(i−L)). We experiment with various combinations of λ_0 and c. When c = 0, we recover a non-annealed version of HeLo with a fixed amount of influence from y_tgt at every utterance.
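Assuming the exponential schedule takes the form λ_i = λ_0 · e^(c(i−L)) (our reading of the annealing description; the exact functional form may differ), the weight can be sketched as:

```python
import math

def anneal_lambda(i, num_utterances, lam0=1.0, c=1.0):
    """Annealed heuristic weight for the i-th utterance (a sketch under
    our assumed form): small early in the conversation and peaking at
    the final utterance. c = 0 recovers the fixed, non-annealed variant."""
    return lam0 * math.exp(c * (i - num_utterances))
```

With this form the weight grows monotonically in i, so the target utterance exerts its strongest pull on the last generated utterance.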

Baselines
Beam Search autoregressively generates conversations without knowledge of the target utterance y_tgt. The dialogue history x is initialized with the source utterance, y_src. Prefixed Beam Search is the same as beam search, but we prepend y_tgt before y_src to condition the underlying generating model.
CoSim leverages the generating model's own embedding layer to compute (partial) utterance representations. Tokens that result in representations sharing high cosine similarity with y_tgt are prioritized. Specifically, CoSim scores tokens with Equation 2 and sets the heuristic score to

h(y_t^(i)) = cos(E(x_t, y_t^(i)), E(y_tgt)),

where E(·) is a function that retrieves and averages the embeddings of its input tokens.
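A minimal sketch of the CoSim heuristic, assuming a simple token-to-vector embedding table (`emb`, `cosim_heuristic`, and the other names are ours, not the paper's):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avg_embedding(tokens, emb):
    """E(.): retrieve and average the embeddings of the input tokens."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for t in tokens:
        for j, x in enumerate(emb[t]):
            total[j] += x
    return [x / len(tokens) for x in total]

def cosim_heuristic(context, cand, target, emb):
    """CoSim heuristic: cosine similarity between the representation of
    the partial utterance extended with `cand` and that of y_tgt."""
    return cosine(avg_embedding(context + [cand], emb),
                  avg_embedding(target, emb))
```

In practice `emb` would be the generating model's own embedding matrix, as the baseline description specifies.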
Finetuned is our only baseline with no modification to its decoding strategy. Instead, we use KeyBERT (Grootendorst, 2020) to extract keywords from y_tgt and fine-tune a chatbot to conditionally generate intermediate utterances given a dialogue history and keywords.

Experimental Setup
Model Choice. We use Blenderbot (Roller et al., 2020; Wolf et al., 2019) as our backbone language model for all experiments. Specifically, we use the 400M-distill checkpoint.
See Appendix E for details.

Evaluation Metrics
BLEU (Papineni et al., 2002; Post, 2018) measures lexical and phrasal overlap between generated and human conversations. High overlap with human references suggests the usage of a similar transition strategy.
Utterance Perplexity (PPL_x) measures the perplexity of an utterance with respect to its dialogue history. We use Blenderbot 1B-distill throughout our experiments to compute perplexity. Given a conversation, PPL_max is the perplexity of the most perplexing utterance of a conversation. PPL_y(1) and PPL_ytgt are the perplexities of y^(1) and y_tgt, respectively. A low utterance perplexity suggests a sensible and fluent response.
Conversation Perplexity (PPL) is the average utterance perplexity of a conversation.
Relative Standard Deviation (RSD) of utterance perplexities measures the smoothness of a conversation. Specifically, we compute the standard deviation of a conversation's utterance perplexities and divide by its mean perplexity. Since human text is known to produce higher perplexities than generated text, this metric allows for easier comparison.

MAUVE (Pillutla et al., 2021) measures the similarity between two text distributions (rather than between a candidate and its reference). We compare the distribution of our generated conversations with their human-written counterparts. We employ MAUVE to measure text quality degradation.
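The perplexity-based conversation metrics reduce to simple statistics over per-utterance perplexities. A sketch, assuming population standard deviation (the paper does not specify sample vs. population):

```python
import statistics

def conversation_ppl(utterance_ppls):
    """Conversation perplexity: the average utterance perplexity."""
    return statistics.fmean(utterance_ppls)

def rsd(utterance_ppls):
    """Relative standard deviation: std dev of a conversation's
    utterance perplexities divided by their mean. Lower values indicate
    more uniform perplexities, i.e., smoother transitions."""
    return statistics.pstdev(utterance_ppls) / statistics.fmean(utterance_ppls)
```

Dividing by the mean makes the metric scale-free, which is what allows comparison against human references despite their higher absolute perplexities.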

Results
Automatic Evaluation. We compare two variants of HeLo to our baselines: HeLo-fixed and HeLo-anneal. In short, the latter leverages an annealing schedule while the former does not. Both variants of HeLo generally outperform the baselines, with HeLo-anneal yielding the best results. While BLEU scores are low throughout, HeLo-anneal scores the highest among decoding strategies and is competitive with Finetuned, suggesting an increased use of words and phrases that a human may utilize to bridge utterances. Unsurprisingly, all decoding methods produce lower perplexities than human references (Holtzman et al., 2019; Meister et al., 2022). However, note how the RSD values of HeLo-anneal approach those of human references, suggesting smooth transitions between utterances. This point is reinforced by the low PPL_ytgt value of HeLo-anneal, suggesting a successful connection to y_tgt (at a small cost in PPL_y(1)). Finally, stable MAUVE scores suggest HeLo does not degrade text quality compared to other decoding methods. Our results broken out by individual dataset (Appendix A) are generally consistent with our aggregated results.
While these metrics, independently, are not sufficient to measure the quality of the infilled conversations, we contend that together they paint a good approximation in place of human judges. Moreover, these metrics are easily replicated and commonly used^2 by the research community (Celikyilmaz et al., 2020).

Human Evaluation. We randomly sampled 100 pairs of source and target utterances and asked human judges to compare the infilled conversations generated by our baselines and HeLo. We did not include Beam + Prefix due to its similar performance to Beam Search during automatic evaluations. The results are shown in Table 2. The judges rated HeLo generations as more likely to appear between y_src and y_tgt relative to all baselines. On fluency, judges struggled to distinguish between HeLo and the baselines with one exception. While CoSim scored well in automatic metrics, the judges found HeLo generations were more fluent, suggesting that text quality suffers under CoSim. See Appendix F for details. These results suggest HeLo is a viable approach to conversation infilling with a modest cost in fluency.

^2 MAUVE is a relative newcomer but is gaining adoption.

Related Work
While, to the best of our knowledge, we are the first to explore conversation infilling, many have explored the related tasks of text infilling (Zhu et al., 2019; Donahue et al., 2020; Qin et al., 2020) and controllable text generation (Keskar et al., 2019; Yang and Klein, 2021; Mireshghallah et al., 2022).
Closer to conversation infilling, Tang et al. (2019) propose a method to guide conversations towards a target keyword. Wu et al. (2019) explore the task of proactive conversation where a dialogue agent leads a conversation by planning over a knowledge graph. Sevegnani et al. (2021) and Gupta et al. (2022) explore the task of one-turn topic transitions: given a source utterance u_a and a partial utterance u_b, generate text u′_b such that the concatenation of u′_b and u_b is a sensible response to u_a. Conversation infilling, in contrast, requires the generation of an entire conversation that bridges two utterances on behalf of both speakers. Moreover, their proposed methods require fine-tuning and external knowledge bases, while HeLo is a learning-free decoding method. Lu et al. (2021) propose NeuroLogic A*esque (NL), a decoding method that also employs a lookahead phase. The main differences between our methods lie in the heuristic score computation and the tasks explored. NL sets the heuristic score as a) the likelihood of the lookahead continuation itself or b) whether some constraint is satisfied in the lookahead, such as whether specific words appear or not. In HeLo, the lookahead completes a partial utterance to produce a well-formed potential conversation. We then set the heuristic as the likelihood of y_tgt given this potential conversation. The likelihood of the lookahead itself or whether it satisfies certain lexical constraints does not affect our heuristic score. Moreover, Lu et al. (2021) do not explore the task of conversation infilling. Instead, they examine constrained forms of machine translation and commonsense, table-to-text, question, and story generation.

Conclusion
We propose HeLo, a learning-free heuristic guided decoding strategy for the task of conversation infilling. Automatic and human experiments suggest HeLo is a viable strategy compared to several baselines. Future work of interest includes improving the generation speed of HeLo for use in real-time settings and exploring other natural language tasks that may benefit from lookahead heuristics.

Limitations
HeLo is significantly slower than most conventional decoding methods.We show average running times in Appendix D. To fit our computational budget, we restricted the beam width and the number of tokens that initialize a greedy lookahead.While HeLo can be paired with any language model trained for dialogue generation, our experiments were only performed with BlenderBot.Future work to confirm its utility with other language models is needed.

Ethics Statement
We used publicly available datasets and model checkpoints for our experiments. No sensitive data was collected during our human evaluation study. As with most controllable text generation methods, HeLo could be used to steer dialogue generation towards toxic responses. If writers are to use HeLo for scaled conversation generation, care must be taken to ensure the generated conversations do not contain utterances that are unsuitable for their intended audience.
exceptions. For completeness, we show MAUVE for individual datasets, but best practice dictates using thousands of examples. Therefore, interpret MAUVE for individual datasets with caution.

B Example Conversations
We show examples of infilled conversations in Tables 5 and 6. All conversations were generated with the facebook/blenderbot-400M-distill checkpoint from Huggingface, a 360M parameter language model trained to generate dialogue.

C Hyperparameter Choices
We show the hyperparameters used in our experiments in Table 7. We performed hyperparameter sweeps with one random seed to inform our choices. We manually select the hyperparameters that appear to offer the best balance among the metrics. We show the results of these sweeps in Tables 8, 9, and 10. For computational efficiency, HeLo uses beam width 3 and only generates lookaheads for the top 40 tokens. We set beam width to 3 for all decoding strategies. The S-RSD metric is the relative standard deviation of the first discrete difference among the utterance perplexities of a conversation.

D Running Times
We show the average running times of the decoding methods we experimented with in Table 4. Generations were conducted on a single GeForce RTX 2080 Ti GPU.

E Datasets
All conversations were filtered to include at least six utterances and truncated to include no more than eight utterances. We use the first 500 examples of the test splits except for Meena and EmpatheticDialogues. In the case of Meena, we use the human-to-human chatlogs made available in the Meena GitHub repository^3. All emojis were removed from the Meena chatlogs. To gather 293 conversations from EmpatheticDialogues, we needed to use both the validation and test splits because many conversations were only four utterances long and, therefore, too short to meet our criteria.
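The filtering and truncation above can be sketched as follows (the function and parameter names are ours; the paper only states the six-to-eight utterance range):

```python
def prepare_conversations(conversations, min_utts=6, max_utts=8):
    """Keep conversations with at least `min_utts` utterances and
    truncate each kept conversation to at most `max_utts` utterances."""
    return [c[:max_utts] for c in conversations if len(c) >= min_utts]
```

A four-utterance conversation would be dropped entirely, while a nine-utterance conversation would be kept and cut to its first eight utterances.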

F Human Evaluation
Each conversation pair was annotated by a single judge.We recruited 4 human judges in total.
Judges were presented with a source utterance, a target utterance, and two sequences of utterances (option A or option B). One of the sequences was generated by a baseline and the other by HeLo-anneal. The options were randomized such that the baseline and HeLo-anneal could each appear as option A or option B. The judges were asked two questions: 1) "Given the FIRST and LAST utterance of a conversation, which option is more likely to appear between the two given utterances? If you can't tell, select 'Tie' (use sparingly)." and 2) "Ignore the FIRST and LAST utterances. Is one option noticeably more fluent than the other? If so, mark that option. Else, select 'Tie' (more liberal use is fine)."

^3 https://github.com/google-research/google-research/tree/master/meena

(Continuation of the example conversations in Tables 5 and 6:)
..., "What's your favorite cheese?", "I like cheese. The chedd is the second most popular in the USA. My favorite cheese is probably mozzarella. How about you?"
Finetuned: "I love it too, especially the regional diversity of the food. What's your favorite dish?", "I love lasagna and spaghetti and meatballs. What about you?", "Lasagna is one of my favorites too! I also love pizza and pasta.", "Pasta is so versatile. It can be made with rice flour, wheat, or corn.", "That's true, I love all of those ingredients in my pasta! Do you have a favorite type of pasta?", "I really like spagetti. It's one of the most popular foods in the world."
HeLo-fixed: "I do too! I love all the pasta dishes and the flavors of cheese.", "Cheese is one of the most popular foods in the world. I love it too."

Figure 1: Example of the conversation infilling task along with an actual generation by HeLo. The source and target utterances are given. The model generates a bridge of utterances connecting the source and target.

Table 1: Results averaged over 1873 conversations. Columns: BLEU↑, PPL↓, RSD↓, PPL_max↓, PPL_y(1)↓, PPL_ytgt↓, MAUVE↑. Best and second-best decoding methods are bolded and underlined, respectively. Both variants of HeLo generally outperform the baselines. HeLo-anneal achieves the best PPL_ytgt score, suggesting a successful bridge with y_tgt. Stable MAUVE scores suggest HeLo does not degrade the quality of the generated text. HeLo-anneal RSD values approach those of human references, suggesting smooth transitions. Conversations contain 6-8 utterances each.

See Appendix C for hyperparameter details and Appendix B for sample generations. We show our aggregated results (across five datasets) in Table 1. See Appendix A for full results.

Table 2: Results of human evaluation over 100 pairs of infilled conversations (per baseline). Bridge measures which infilled conversation a human judged as more likely to appear between y_src and y_tgt. Fluency measures which infilled conversation (ignoring y_src and y_tgt) was more fluent.

Table 3: Experiment results averaged over conversations, per dataset. Columns: BLEU↑, PPL↓, RSD↓, PPL_max↓, PPL_y(1)↓, PPL_ytgt↓, MAUVE↑. Best and second-best decoding methods are bolded and underlined, respectively. Both variants of HeLo generally outperform the baselines. MAUVE suggests that HeLo does not degrade the quality of the generated text. Note how HeLo-anneal RSD values approach those of human references, suggesting smooth transitions. The number of infilled conversations per dataset is indicated by (n=x). Conversations contain 6-8 utterances each. For completeness, we show MAUVE for individual datasets, but best practice suggests using thousands of examples.

(Example conversation from Tables 5 and 6:)
Target: "Especially cheddar cheese! It's the second most popular cheese in the US!"
Human: "Oh yes! It is the best. It's really changed over the years but it still has good roots.", "What is your favorite Italian dish?", "I really enjoy chicken Parmigiana. What about you?", "I love chicken parmigiana as well, but I think my ultimate favorite is beef lasagna...extra cheese please!", "Chicken with sauce and mozzarella.... Be still my heart!", "Truthfully, anything with cheese is the best"
Beam Search: "I love it too! It's one of the most popular and influential foods in the world.", "Yes, it is. It has been around for a long time. It was developed in Naples, Italy.", "Yes, and it's influenced by a lot of different cultures. It's very diverse.", "Yes, It is. I love how it has so many different ingredients. It can be made with vegetables, meats, and condiments.", "Yes, that's true! It can also be made without cheese, which I find interesting.", "Yes, I love cheese. It is one of my favorite foods. I can't believe it's been around since the 5th century BCE."
Beam + Prefix: "I love it too! It's one of the most popular foods in the world. It's so good!", "I know, it's so tasty! I like to put it on crackers or bread!", "That's a good idea! I like it on toasted bread with some cheese on top.", "That sounds delicious! Do you like to add other ingredients to your cheese like vegetables, meats, and condiments?", "Yes, I love to add vegetables and meats to my cheese. I also like to make it into a sauce.", "That sounds really good! What kind of sauce do you use? I've heard of some people using mayonnaise, but I've never tried it."
CoSim: "I love it too! It's one of the most popular and influential in the world.", "It is! It is the most widely consumed food in the entire world. It has been around for a long time.", "It has! It was the first to be popular in Europe and the United States. It's so good!", "It is. I especially like the cheese that is used in the dish. Cheese is one of my favorite foods.", "It's a cheese! The most popular cheese in the US is cheddar."

Table 7: Hyperparameters used in our experiments.

(Example conversation continued from Tables 5 and 6:)
HeLo-fixed (cont.): "I know! It's so versatile too. It can be made from milk, cream, and other things.", "Yes, and it can be aged for months at a time. It is so good.", "Yes, it can! It can also be made in a variety of ways, such as in a casserole.", "I love cassarole. It's one of my favorite foods to make. It consists of a large amount of cheese and other ingredients."
HeLo-anneal: "Me too! I love all the pasta dishes and the breads and cheese!", "I love pasta and cheese as well. It is one of the most popular foods in the world.", "Cheese is a staple food in Italy. I love it too.", "I love the variety of cheeses. I especially love the ones that are aged in wooden casks.", "Yes, I love those too. I also love the soft, cheddar cheese.", "I love Dairy products in general. So many types of cheese are available in the US."