Fine-grained Conversational Decoding via Isotropic and Proximal Search

General-purpose text decoding approaches are usually adopted for dialogue response generation. Although the quality of the generated responses can be improved with dialogue-specific encoding methods, conversational decoding methods remain under-explored. Inspired by the finding of \citet{wu2023learning} that a good dialogue feature space should follow the rules of locality and isotropy, we present a fine-grained conversational decoding method, termed \textit{isotropic and proximal search (IPS)}. Our method is designed to generate semantically concentrated responses while still maintaining informativeness and discrimination against the context. Experiments show that our approach outperforms existing decoding strategies in the dialogue field across both automatic and human evaluation metrics. More in-depth analyses further confirm the effectiveness of our approach.


Introduction
Dialogue response generation (Li et al., 2017; Wang et al., 2020) aims to generate an utterance that forms a coherent and fluent continuation of a given dialogue context. Generic text decoding strategies (Rieser et al., 2014; Ritter et al., 2011; Chen et al., 2017) are usually adopted to produce grammatical and contextual responses. As an independent technique, a decoding strategy can also enhance the generation quality of large language models.
Existing text decoding methods have been explored in various generic text generation tasks but are not tailored for dialogue generation, e.g., capturing dialogue-specific features and generating an informative and discriminative dialogue response (Su et al., 2021; Wu et al., 2023). Early maximization-based methods, e.g., greedy search (Li et al., 2016b) and beam search (Wiseman et al., 2017), may lead to dullness and degeneration (Fan et al., 2018; Holtzman et al., 2018). Later, sampling-based improvements were proposed to tackle these problems, including top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2018). While alleviating degeneration, these sampling methods introduce critical semantic inconsistency and are not aligned with the human-written prefix (Basu et al., 2021). Specifically, a number of studies (Ethayarajh, 2019; Su and Collier, 2023) have asserted that the problem of anisotropy, i.e., a distribution pattern in the latent space where features occupy a narrow cone, leads to inconsistency and degradation of the generation. Although contrastive search (Su et al., 2022) has been proposed to mitigate this issue, as a general-purpose text decoding strategy it still ignores dialogue-specific features, such as utterance dependencies and conversational structure information. Therefore, research on conversational decoding methods is sorely needed.
In this work, we propose a fine-grained conversational decoding method, namely isotropic and proximal search (IPS). Different from traditional approaches, we consider the previously generated tokens and the context separately, from a granular perspective. Acknowledging that locality and isotropy are two important properties for refining the dialogue feature space, we design IPS following these rules: (i) the generated output should be selected from the most probable candidate set predicted by the dialogue model; (ii) the generated tokens in the same utterance should be proximal to each other to express a concentrated idea; and (iii) the newly generated utterance should be sufficiently discriminative with respect to the context utterances. In this way, our method encourages informativeness and discrimination among different utterances while maintaining a concentrated idea within an utterance. We evaluate our approach on two commonly used dialogue datasets, DailyDialog (Li et al., 2017) in English and LCCC (Wang et al., 2020) in Chinese. Both human and automatic evaluation results, including GPT-3.5-based indicators, consistently show that IPS generates more fluent, coherent, and human-like responses than existing decoding methods.

Preliminary
Dialogue response generation. Given a dialogue context $D = \{u_1, u_2, \ldots, u_N\}$ composed of $N$ utterances, where $u_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,|u_i|}\}$ is a sequence of consecutive words, the task of dialogue response generation is to produce the continuation utterance $u_r = \{w_1, w_2, \ldots, w_{|u_r|}\}$, where $r = N + 1$.
There are generally two key steps to finish the task: context encoding and response decoding. For the first step, we obtain the context representations $H$ from the language model by concatenating the utterances into a single sequence:
$$ H = \mathrm{LM}\big([u_1; \texttt{[EOU]}; u_2; \texttt{[EOU]}; \ldots; u_N; \texttt{[EOU]}]\big), $$
where \texttt{[EOU]} is the special token inserted as the last token of each utterance.
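As a minimal sketch of this concatenation step (the tokenizer and language model are left abstract; `build_context` and the sample utterances are illustrative, not the authors' code):

```python
# Sketch of the context-encoding input: each utterance ends with [EOU].
EOU = "[EOU]"

def build_context(utterances):
    """Concatenate utterances, appending [EOU] as the last token of each."""
    return " ".join(u + " " + EOU for u in utterances)

dialogue = ["Hello, this is Mike, Kara.",
            "Mike! Good to hear from you. How are you?"]
flat = build_context(dialogue)
# `flat` would then be tokenized and fed to the language model to obtain H.
```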
For the decoding step, the response is generally produced in an auto-regressive manner:
$$ p(u_r \mid D) = \prod_{t=1}^{|u_r|} p(w_t \mid w_{<t}, D). $$

Dialogue modeling. Wu et al. (2023) demonstrated that locality and isotropy are two key properties for building a good conversational feature space. Specifically, locality encourages the model to aggregate the representations of tokens within an utterance, while isotropy pushes apart the representations of distinct utterances.
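The standard auto-regressive loop above can be sketched as follows (greedy variant; `next_token_distribution` is a hypothetical callable standing in for the dialogue model's per-step distribution $p(w_t \mid w_{<t}, D)$):

```python
# Minimal sketch of auto-regressive decoding (greedy variant).
def autoregressive_decode(next_token_distribution, context, eos, max_len=64):
    prefix = []
    for _ in range(max_len):
        probs = next_token_distribution(context, prefix)  # p(w_t | w_<t, D)
        w_t = max(probs, key=probs.get)                   # greedy argmax
        prefix.append(w_t)
        if w_t == eos:                                    # stop at end token
            break
    return prefix
```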

Isotropic and Proximal Search
We present a fine-grained conversational decoding method, i.e., isotropic and proximal search (IPS). Specifically, we expect the generated response to satisfy two requirements: 1) representations of the response tokens stay close to each other to convey a concentrated idea (proximity); 2) the response representation is discriminative with respect to the context utterance representations (isotropy). During the decoding stage, for proximal search, we select the candidate token with the shortest average distance to the already generated tokens. For isotropic search, we choose the token that makes the response representation most discriminative against the representations of the context utterances. As the response representation cannot be determined during decoding, we approximate it by averaging the representations of the already generated tokens:
$$ h_{RT} = \frac{1}{T} \sum_{t'=1}^{T} h_{w_{t'}}, $$
where $h_{RT}$ is the response representation, dynamically updated along with the generation process, and $T$ is the number of already generated tokens.
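The dynamic update of the response representation amounts to a running mean over token hidden states; a minimal numpy sketch (the function name is ours, and the 2-D toy vectors stand in for real hidden states):

```python
import numpy as np

# Running mean of token hidden states: the response representation h_RT
# is extended with one new hidden state without re-summing all tokens.
def update_response_rep(h_RT, T, h_new):
    """Return the updated mean over T+1 token representations."""
    return (h_RT * T + h_new) / (T + 1), T + 1

h1, h2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
h_RT, T = h1, 1                                  # after the first token
h_RT, T = update_response_rep(h_RT, T, h2)       # after the second token
```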
A remaining problem is how to generate the first token to start the isotropic and proximal search, since the method heavily depends on the previously generated tokens. To address this, we complete the first $n$ steps of generation with traditional decoding methods, such as beam search, top-$k$ sampling, or nucleus sampling. On the other hand, as IPS is essentially a deterministic decoding strategy, this solution also enables it to produce diverse responses by varying the decoding strategy used in the first $n$ steps. In each step $t$ after this first sampling stage, we calculate the proximal and isotropic values as follows:
$$ \mathrm{p\_value}_t(v) = \frac{1}{T} \sum_{j=1}^{T} s\big(h_v, h_{w_j}\big), \qquad \mathrm{i\_value}_t(v) = \frac{1}{N} \sum_{i=1}^{N} s\big(h_{RT}, h_{u_i}\big), $$
where $s$ is the cosine similarity and $h_{u_i}$ are the utterance representations obtained from the special token \texttt{[EOU]}. The proximal value measures the average similarity between the candidate token and the already generated tokens, while the isotropic value stands for the average similarity between the undergoing response representation and all utterance representations. Next, the selection of the candidate token $w_t$ is formulated as
$$ w_t = \operatorname*{arg\,max}_{v \in V^{(m)}} \Big\{ \alpha \cdot \underbrace{p(v \mid w_{<t}, D)}_{\text{model confidence}} + (1 - \alpha) \cdot \underbrace{\big(\mathrm{p\_value}_t(v) - \mathrm{i\_value}_t(v)\big)}_{\text{isotropic and proximal penalty}} \Big\}, \qquad (5) $$
where $V^{(m)}$ is the set of top-$m$ predictions from the model's probability distribution $p(w_t \mid w_{<t}, D)$, and $m$ is typically set to $4 \sim 8$. In Eq. (5), the first term, model confidence, is the probability of the candidate $v$ predicted by the model. The second term, the isotropic and proximal penalty, aims to maximize the discrimination between the undergoing response and the previous utterances and to minimize the token differences within the response. The hyper-parameter $\alpha$ balances these two components. When $\alpha = 1$, our method degenerates to greedy search.

[Table 1: Automatic evaluation results, where MAUVE follows Pillutla et al. (2021) and GE represents G-Eval (Liu et al., 2023).]

We claim our method is fine-grained because generic auto-regressive generation predicts the next token by jointly considering the already generated tokens $w_{<t}$ and the context $D$, i.e., $p(w_t \mid w_{<t}, D)$, while IPS decouples these two factors. Specifically, the proximal value focuses only on the effects of the already generated tokens, i.e., $\mathrm{p\_value}_t \sim p(w_t \mid w_{<t})$, and the isotropy value pays more attention to the context, i.e., $\mathrm{i\_value}_t \sim p(w_t \mid D, (w_{<t}))$, wherein $w_{<t}$ is only used to obtain the undergoing response representation $h_{RT}$.
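One decoding step of IPS can be sketched as below, assuming hidden states are already available for the top-$m$ candidates, the generated tokens, and the \texttt{[EOU]}-based utterance representations. This is a reading of Eq. (5), not the authors' released code, and all names are illustrative:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ips_step(cand_probs, cand_reps, gen_reps, utt_reps, alpha=0.6):
    """Pick the candidate maximizing model confidence plus the
    isotropic-and-proximal term (Eq. 5); reps are 1-D numpy vectors."""
    best, best_score = None, -np.inf
    for v, p_v in cand_probs.items():          # v ranges over top-m tokens
        h_v = cand_reps[v]
        # proximal value: avg. similarity to already generated tokens
        p_value = np.mean([cos(h_v, h) for h in gen_reps])
        # isotropy value: avg. similarity of the updated response
        # representation to the context utterance representations
        h_RT = np.mean(gen_reps + [h_v], axis=0)
        i_value = np.mean([cos(h_RT, h_u) for h_u in utt_reps])
        score = alpha * p_v + (1 - alpha) * (p_value - i_value)
        if score > best_score:
            best, best_score = v, score
    return best
```

A candidate close to the tokens generated so far and far from the context utterances wins even when the model assigns both candidates the same probability, which is exactly the behavior the penalty term is meant to induce.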

Experiments
Dataset. We evaluate our method on two commonly used datasets: DailyDialog (Li et al., 2017) in English and LCCC (Wang et al., 2020) in Chinese. Both are open-domain multi-turn dialogue datasets collected from social media. For LCCC, owing to limited academic-level computing resources, we follow previous work (Su et al., 2022) and sample a subset of the dataset consisting of 100,000 dialogue examples.
Settings. We fine-tune the models on the DailyDialog and LCCC datasets for 6k and 7k steps, respectively. We use a batch size of 64 and truncate the training samples to a maximum length of 256. The parameters of the models are initialized from HuggingFace checkpoints and updated by the Adam optimizer (Kingma and Ba, 2017) with a learning rate of 3e-5. We adopt the margin values of SimCTG and SimDRC suggested in their original work, i.e., ρ = 0.5 for SimCTG and δ = 0.7, α = 0.3 for SimDRC. We conduct the isotropic and proximal search with the first n = 2 steps adopting top-k sampling (k = 7), and the weight α is 0.6. We run all experiments with five different seeds and report the average scores.
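For reference, the hyper-parameters above collected in one place (a plain summary of the stated settings, not released code; checkpoint names are intentionally omitted):

```python
# Training and decoding configuration as reported in the Settings paragraph.
CONFIG = {
    "steps": {"DailyDialog": 6000, "LCCC": 7000},
    "batch_size": 64,
    "max_length": 256,
    "optimizer": "Adam",
    "learning_rate": 3e-5,
    "simctg_margin_rho": 0.5,
    "simdrc": {"delta": 0.7, "alpha": 0.3},
    "ips": {"first_n": 2, "first_stage": "top-k sampling", "k": 7, "alpha": 0.6},
    "seeds": 5,
}
```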
We also conduct a human evaluation with the help of recruited proficient English/Chinese speakers. We randomly sample 100 dialogue examples from the DailyDialog and LCCC test sets. For each dialogue context, we generate responses using the aforementioned backbone models (BART, BART+SimCTG, BART+SimDRC) with six different inference strategies. Five annotators are hired to rate these samples independently. Annotators are instructed to give a score ranging from 1 to 5 over the following aspects: fluency, informativeness, coherence, and semantic coverage.¹

Results and Discussion
Table 1 lists the automatic evaluation results of the different models with different decoding strategies. Similar results can also be found in the human evaluation, as shown in Table 2. We can see that the models, when collaborating with IPS, produce more semantically consistent (higher BERTScore and MAUVE scores) and human-like (higher G-Eval scores) responses. Although contrastive search can generate more novel and diverse tokens (higher Distinct scores), it usually suffers from the problem of prediction deviation, i.e., the predicted token being weakly related to the main idea of the response. This is also in line with the worse performance of contrastive search on the other metrics, such as BERTScore and G-Eval, indicating that the diverse responses produced by contrastive search are not accurate or human-like enough. Different from contrastive search, IPS tries to concentrate on the core meaning of the response and express it clearly, so a slightly lower Distinct score is acceptable and expected. Note that IPS still achieves better Distinct scores than the other traditional decoding methods since it encourages discrimination and isotropy among utterances.
Although IPS can be directly used with different models and achieve good performance, models trained with SimDRC are the best testbed for IPS. We can see that SimDRC+IPS mostly achieves the best performance across the board on both automatic and human evaluation. This is reasonable because the training objective of SimDRC is highly consistent with the search criterion of IPS: both push away the inter-utterance features and pull close the intra-utterance features.

¹ Details of the human evaluation are in Appendix A.1.
Ablation Study. Figure 1 shows the ablation studies on the different components of the method, including the number of first steps, the sampling strategy for the first n-step decoding, and the weight α. As shown in Figure 1(a), our method consistently outperforms contrastive search regardless of the number of first-stage steps. We find some performance drops as the number of first-stage sampling steps increases. We think this is because more generic tokens are selected by the traditional search methods, thus weakening the proximity and isotropy of the response. For the strategy used in the first n steps, we tried beam search, top-k sampling, and nucleus sampling, and finally selected top-k sampling owing to its better performance in our comparisons. Figure 1(b) shows the results for different k values in top-k sampling. We can see that our method exceeds the baseline by a large margin when k > 5. The effect of the weight α is also studied, as shown in Figure 1(c). Our method consistently outperforms the baseline across different weights, suggesting its robustness.

Conclusion
In this work, we present a fine-grained conversational decoding strategy, namely isotropic and proximal search (IPS), to encourage the generation of isotropic and conversational tokens. Unlike existing decoding methods, IPS decouples the previously generated tokens and the context. Experiments show that our method achieves impressive performance on both automatic and human evaluation.

Limitations
During the experiments, we found that for a single example from the DailyDialog test set, traditional text decoding methods such as beam search and top-k sampling take less than 1 second, contrastive search takes about 5.07 seconds, and our proposed IPS takes about 2.16 seconds. Although our approach is slower than the traditional text decoding methods, it is clearly faster than contrastive search. Further improving the decoding speed remains a direction for future work.

A.1 Human Evaluation Instructions
Please rate the quality of the generated response based on the given dialogue context and the target response over the following aspects: (1) Fluency; (2) Informativeness; (3) Coherence; (4) Semantic Coverage. We provide some instructions for your rating.

A.1.1 Fluency
This measures whether the generated text has no formatting problems, capitalization errors, or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read. The definitions of the different scores are:
• 5: The text is fluent, grammatically correct, and has no errors. It is easy to read.
• 4: The text is grammatically correct but has a few spelling or capitalization errors, which do not affect your understanding.
• 3: The text has minor errors in both grammar and spelling. The errors slightly affect your understanding.
• 2: The text has major errors in both grammar and spelling. The errors make the text hard to read.
• 1: The text does not make sense and it is unreadable.

A.1.2 Informativeness
This measures whether the generated text has diverse, informative, novel, or logically related content. The definitions of the different scores are:
• 5: The text contains very diverse, informative, and novel content. It is enjoyable to read the text.
• 4: The text contains many informative and novel contents. (Choose this score when you hesitate between 3 and 5.)
• 3: The text contains some new information but also contains a few repetitions of the context.
• 2: The text only contains a few informative and new terms. (Choose this score when you hesitate between 1 and 3.)
• 1: The text is dull, repetitive, and has no new information. All contents are from the dialogue context.

A.1.3 Coherence
This measures whether the generated text is semantically and factually consistent with the dialogue context. The definitions of the different scores are:
• 5: The text is semantically, factually, and topically consistent with the dialogue context. All contents of the text are related to the source text or can be inferred from the source.
• 4: The text is very related to the context but has minor inconsistencies or contradictions that do not affect its overall relevance.
• 3: The text is related to the context but has some obvious inconsistencies and contradictions.
• 2: The text is only slightly consistent with the context. Many inconsistencies and contradictions with the context can be found.
• 1: The text is totally inconsistent with the context. It semantically or factually contradicts the context.

A.1.4 Semantic Coverage
This measures how many semantic content units from the target response are covered by the generated text. The definitions of the different scores are:
• 5: All semantic content units of the target text can be found in the generated text. They are semantically consistent.
• 4: Most of the content units of the target text can be found in the generated text, and the few missing units do not affect the overall coverage.
• 3: Some semantic content units can be found in the generated text, but some important units are missing.
• 2: Most of the semantic content units are not covered. Only a few insignificant units can be found in the generated text.
• 1: The text does not have any overlapping semantic content units with the target text.
We recruit five human workers to annotate 3,600 samples. To make sure the workers are fairly paid, we pay 0.1 dollars for each sample; the total amount spent on participant compensation is thus 360 dollars. The annotators took 24 hours in total to finish the task, suggesting an hourly wage of 15 dollars.

A.2.1 Evaluation of the G-Eval Score
The API we used to test G-Eval is gpt-3.5-turbo, and the following is the prompt (Liu et al., 2023): You will be given a conversation between two individuals. You will then be given one potential response for the next turn in the conversation. Your task is to give a final score for the utterance. Please make sure you read and understand these instructions carefully.
The evaluation aspects are:
1. Engagingness: Is the response dull or interesting?
2. Naturalness: This measures whether the generated text has no formatting problems, capitalization errors, or obviously ungrammatical sentences that make the text difficult to read.
3. Informativeness: This measures whether the generated text has diverse, informative, novel, or logically related content.
4. Coherence: This measures whether the generated text is semantically and factually consistent with the dialogue context.
The evaluation steps are:
1. Read the conversation, the corresponding label, and the response carefully.
2. Considering the above evaluation aspects, return a comprehensive final score ranging from 1 to 5 for each conversation.
3. Please only return one overall score, without any extra text descriptions. The return format should be like "Score:1". Now please read the following conversation, and return the score.
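A sketch of the scoring pipeline around this prompt: building the full prompt and parsing the "Score:N" reply. The chat-API call itself is left out (any gpt-3.5-turbo client can be slotted in), and `PROMPT_HEADER` abbreviates the instructions quoted above:

```python
import re

# Abbreviated version of the G-Eval instructions quoted in the text.
PROMPT_HEADER = (
    "You will be given a conversation between two individuals. "
    "You will then be given one potential response for the next turn "
    "in the conversation. Your task is to give a final score for the "
    "utterance. ..."
)

def build_prompt(conversation, response):
    """Assemble the full evaluation prompt for one sample."""
    return f"{PROMPT_HEADER}\n\nConversation:\n{conversation}\n\nResponse:\n{response}"

def parse_score(reply):
    """Extract N from a reply of the form 'Score:N' (1-5); None if absent."""
    m = re.search(r"Score:\s*([1-5])", reply)
    return int(m.group(1)) if m else None
```

Parsing defensively matters here: the prompt asks for exactly `Score:1`-style output, but a model reply may still include stray whitespace or extra text.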

A.2.2 More Experimental Results
Table 2 lists the results of human evaluation.

Table 2: Results of human evaluation on the DailyDialog and LCCC datasets, where SC means semantic coverage and Info. means informativeness.

Table 3: Generated responses on DailyDialog.

Example 1
Context:
A: I didn't receive an unemployment check this week and need to know why.
B: Did you mail us your last Continued Claim Form?
A: I am afraid that I forgot to send it in.
B: We cannot pay you unless you turn that form in every two weeks.
A: Is it too late to mail the Continued Claim Form in now?
greedy: I will mail it in right now.
beam search: It's too late to mail it in now. You will receive your check this week.
top-k sampling: I will mail it in right now if you can't wait.
nucleus search: It's too late to mail it in now. You will receive your check this week.
contrastive search: Yes, you can do it now.
IPS: Yes, it is.

Example 2
Context:
A: Hello, this is Mike, Kara.
B: Mike! Good to hear from you. How are you?
A: Everything is fine, and how are you?
B: Things are going well with me.
A: Kara, I had fun the other night at the movies and was wondering if you would like to go out again this Friday.
greedy: I was thinking of it, but I am not feeling too well. I have to go to the hospital for a couple of days.
beam search: That sounds great. I would love to go out with you.
top-k sampling: I would love to, but I have to work late on Friday.
nucleus search: That sounds great. I would love to go out with you.
contrastive search: Mike, I was thinking of it.
IPS: That sounds great. I would love to.

Example 3
Context:
A: What do you think I should do, doctor?
greedy: I think you should have a good rest for three days.
beam search: I think you'd better have a good rest for three days.
top-k sampling: I think you should go to see the head of the family.
nucleus search: I think you'd better have a good rest for three days.
contrastive search: Well, it's hard to say. You look a little pale, don't you?
IPS: Well, you should have a good rest for three days. Then you'll feel better.

Table 4: Generated responses on LCCC.