Towards Robust Personalized Dialogue Generation via Order-Insensitive Representation Regularization

Generating persona-consistent dialogue responses is important for developing an intelligent conversational agent. Recent works typically fine-tune large-scale pre-trained models on this task by concatenating persona texts and dialogue history as a single input sequence to generate the target response. While simple and effective, our analysis shows that this popular practice is seriously affected by order sensitivity: different input orders of persona sentences significantly impact the quality and consistency of the generated response, resulting in severe performance fluctuations (i.e., 29.4% on GPT2 and 83.2% on BART). To mitigate the order sensitivity problem, we propose a model-agnostic framework, ORder Insensitive Generation (ORIG), which enables dialogue models to learn robust representations under different persona orders and improves the consistency of response generation. Experiments on the Persona-Chat dataset justify the effectiveness and superiority of our method with two dominant pre-trained models (GPT2 and BART).


Introduction
Developing a persona-consistent dialogue model has been one of the key issues in open-domain dialogue systems (Huang et al., 2020). Zhang et al. (2018a) define the problem of personalized dialogue generation, which aims to generate personalized responses based on textually described persona profiles. Many efforts have been made on developing dialogue models that generate responses consistent with the provided persona profile (Song et al., 2019, 2020a,b; Wu et al., 2020a).
The recent development of transformer-based pre-trained models (Vaswani et al., 2017; Devlin et al., 2018; Liu et al., 2019; Chen, 2020) has led to great successes in dialogue systems (Wolf et al., 2019; Wu et al., 2020b; Ham et al., 2020; Kulhánek et al., 2021; Cao et al., 2022; Deng et al., 2022b,c, 2023). Inspired by these successes, previous works incorporate these pre-trained models into persona-based response generation by concatenating the dialogue history and persona as input to generate the response in an auto-regressive manner (Song et al., 2021; Liu et al., 2022). However, a fine-tuned model can generate a high-quality and persona-consistent response under a certain ordering of personas, while varying this order may lead to a generic or even inconsistent response, as illustrated by the example in Figure 1. We empirically show that the worst ordering of personas can lead to a 29.4% decline in BLEU score compared with the best ordering.
Ideally, a well-trained dialogue generation model should be able to generate a persona-consistent response regardless of the ordering of personas in the input. We perform experiments and analyses to identify the cause of the order sensitivity. We find that the ordering of personas in the input leads to different representations of the context and response. We also show that the model can attend to the appropriate persona and generate high-quality responses under some representations but not under others. This leads to instability in response generation.
Motivated by the above findings, we propose ORder Insensitive Generation (ORIG), a simple and effective framework that helps models learn more robust representations across different persona orders. More specifically, we formulate ORIG as a constrained optimization problem, which optimizes a persona response generation objective under the constraint that, given different orderings of personas, the response representations of the model are the same. We then optimize it through a stochastic optimization approach.
Experimental results on the Persona-Chat dataset show that ORIG significantly improves the robustness of pre-trained models (GPT2 (Radford et al., 2019) and BART (Lewis et al., 2020)) under different orderings of input persona, as well as advances their generation performance.
In summary, our contributions are threefold: (1) We identify the order sensitivity problem in persona dialogue generation and conduct an empirical analysis to reveal its underlying reasons. (2) We propose a model-agnostic framework, ORIG, that helps different persona dialogue models learn robust representations while achieving better performance. (3) We perform extensive experiments on the Persona-Chat dataset, showing that ORIG outperforms previous models and is more robust and less sensitive to different persona orderings.

Related Work
Maintaining a consistent persona is essential for building a human-like dialogue system, and most works regard persona as a set of sentences attached to each dialogue (Zhang et al., 2018a; Gu et al., 2019; Song et al., 2019; Wu et al., 2021; Cao et al., 2022; Deng et al., 2022a). Song et al. (2021) disentangle the task of persona-based dialogue generation into two sub-tasks, consistency understanding and dialogue generation, while Cao et al. (2022) aim to alleviate the problem of limited data with data manipulation methods. Despite the satisfactory performance of previous work, the impact of different persona orders remains under-explored, resulting in unstable and inconsistent responses.
Our work is also related to work on order sensitivity in prompt-based few-shot learning (Zhao et al., 2021; Lu et al., 2022). Zhao et al. (2021) found that different orders of training examples in the prompt can cause accuracy to vary from near chance to state-of-the-art in the few-shot classification setting. Similarly, order sensitivity in in-context learning exists regardless of model size and prompt format (Lu et al., 2022). In contrast to these works, we focus on order sensitivity in the language generation task under the fine-tuning setting, especially the impact of persona orderings on generating persona-consistent responses.

Order Sensitivity Problem and Analysis
In this section, we first illustrate the seriousness of the order sensitivity problem by showing the huge performance fluctuation of persona dialogue models when fed the same personas in the best and worst orders. Then we analyse why their performance is volatile to different persona orderings.
To illustrate the problem, we fine-tune PLMs, including GPT2 and BART, on Persona-Chat by concatenating the persona and dialogue context together to predict the target response. After training converges, we test them in two settings: (1) the best case: for each test sample, we feed the models all possible permutations of the persona sentences and keep the maximum score as the final score; (2) the worst case: the same process as (1), but taking the minimum score. Table 1 shows the results for the two models. Surprisingly, we find the ordering of input personas has a big impact on the models' performance: GPT2's worst case is 29.4% lower than its best case, while BART's is 83.2% lower.
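As a sketch of this best/worst-case evaluation protocol, the permutation sweep could look like the following (the `score_fn` scorer and the persona strings are hypothetical stand-ins, not the paper's code):

```python
from itertools import permutations

def best_worst_scores(personas, score_fn):
    """Evaluate one test sample under every persona ordering and return
    the best-case (max) and worst-case (min) scores, mirroring the two
    evaluation settings described above."""
    scores = [score_fn(list(order)) for order in permutations(personas)]
    return max(scores), min(scores)

# Toy usage with a stand-in scorer; in the paper this would be a metric
# such as BLEU computed on the response generated under that ordering.
personas = ["i like dogs.", "i am a nurse.", "i live in texas."]
toy_score = lambda order: len(order[0])  # hypothetical placeholder
best, worst = best_worst_scores(personas, toy_score)
```

Note that sweeping all n! orderings is only feasible because each sample has 5 persona sentences (120 permutations).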
Moreover, we find that the huge fluctuation in models' performance is closely related to the response representation changes caused by different orderings of input persona sentences. Concretely, we measure the similarity of the response representations of the same test sample under different input orders of personas. We show their token-level similarity in Table 2 (persona and context are omitted for brevity), where the bidirectional KL function is employed as the distance function. Ideally, models should have consistent response representations: the KL distance between the same response should be zero. However, their distances are significantly higher than zero. This reveals that the models behave more like a left-to-right language model whose representation is prone to different orderings of the previous input (e.g., persona sentences). That is highly undesirable for a robust personalized dialogue model. Thus, regularization of the representations of response tokens is necessary to help personalized dialogue models capture order-invariant representations.
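The token-level distance described above can be sketched as follows, assuming we can obtain the model's per-token output distributions under two persona orderings (the random tensors here are stand-ins for real model outputs):

```python
import torch
import torch.nn.functional as F

def token_level_kl(logits_a, logits_b):
    """Bidirectional (symmetric) KL distance between the per-token output
    distributions produced by the same model under two persona orderings.
    Both inputs are [seq_len, vocab_size] logits over the response tokens."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="none").sum(-1)
    return 0.5 * (kl_pq + kl_qp)  # one distance per response token

# An order-insensitive model would yield (near-)zero distances:
same = torch.randn(8, 100)
zero_dist = token_level_kl(same, same)
```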

Method
We introduce the proposed framework, ORder Insensitive Generation (ORIG). As shown in Figure 2, we transform the persona order-sensitivity problem into a constrained optimization problem that optimizes a persona dialogue model under the uncertainty of the input persona order.

Problem Formulation
Given the dialogue context C = {u_1, ..., u_m} and a set of persona descriptions P = {p_1, ..., p_n}, the goal is to generate a personalized response r.
Formally, the generation problem can be formulated by the following chain rule:

P(r | C, P; θ) = ∏_{t=1}^{|r|} P(r_t | r_{<t}, C, P; θ),

where θ denotes the parameters of the dialogue model.

ORIG Framework
According to the analysis in Section 3, the observation reveals that varying the order of input personas leads to different representations of the dialogue response, thus resulting in fluctuations in performance.
To learn more robust and consistent representations, we propose the ORIG framework that complements the response generation process with a constraint: given the different orderings of a persona, the model's response representations need to be the same.
The order-insensitive personalized dialogue generation problem is then modelled as the following constrained optimization problem:

min_θ E_{(C,P,r)∈D} [−log P(r | C, P; θ)]
s.t. D[P(r | C, P; θ), P(r | C, Shuffle(P); θ)] = 0,

where P(r | C, P; θ) denotes the model's predictions over the dialogue response, D denotes the dialogue corpus, the distance function D[·,·] is the KL divergence measuring the difference between two distributions, and the Shuffle operator samples a persona ordering uniformly from the full permutations of P.

Optimization
For optimization, we first apply the Lagrange multiplier strategy to convert the constrained problem into an unconstrained one:

L_θ = −log P(r | C, P; θ) + γ · D[P(r | C, P; θ), P(r | C, Shuffle(P); θ)],

where γ is the multiplier corresponding to the equality constraint. Then we can update the parameters θ of the dialogue model by stochastic gradient descent.
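Under these definitions, one training step of this objective could be sketched as follows (the `model` callable, tensor shapes, and the default γ are illustrative assumptions, not the authors' implementation):

```python
import random
import torch
import torch.nn.functional as F

def orig_loss(model, context, personas, response_ids, gamma=1.0):
    """Sketch of the unconstrained ORIG objective: response NLL under the
    original persona order, plus a KL regularizer tying the prediction to
    the one obtained under a uniformly shuffled persona order. `model` is
    a hypothetical callable returning [seq_len, vocab] response logits."""
    shuffled = random.sample(personas, len(personas))  # Shuffle(P)
    logits = model(context, personas)                  # P(r | C, P)
    logits_shuf = model(context, shuffled)             # P(r | C, Shuffle(P))

    nll = F.cross_entropy(logits, response_ids)        # -log P(r | C, P)
    log_p = F.log_softmax(logits, dim=-1)
    log_q = F.log_softmax(logits_shuf, dim=-1)
    # Bidirectional KL as the distance function D
    dist = 0.5 * (F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
                  + F.kl_div(log_p, log_q, log_target=True, reduction="batchmean"))
    return nll + gamma * dist

# Toy check: a model that ignores persona order makes the KL term vanish,
# leaving only the NLL (uniform logits over 10 tokens -> log 10).
toy_model = lambda ctx, pers: torch.zeros(4, 10)
loss = orig_loss(toy_model, "context", ["p1", "p2"], torch.tensor([0, 1, 2, 3]))
```

Because both forward passes share θ, minimizing the KL term pulls the two response distributions toward each other, which is exactly the representation regularization motivated in the analysis.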

Experimental Setups
Datasets We evaluate the models on the Persona-Chat dataset (Zhang et al., 2018a), where each dialogue session has at least 6 turns of interactions.
Each interaction is conditioned on a persona described by 5 profile sentences.

Implementation Details We apply the proposed ORIG to GPT2 and BART as base models. Our implementation is based on HuggingFace's Transformers library (Wolf et al., 2020). During training, the learning rate is set to 2 × 10⁻⁵, and the batch size for GPT2 and BART is set to 64 and 32, respectively. We trained both models for 10 epochs with the Adam (Kingma and Ba, 2015) optimizer until they converged. During decoding, we employ a top-p (p=0.9) (Holtzman et al., 2020) plus top-k (k=50) sampling strategy to avoid sampling from the unreliable tail of the distribution: at each decoding step, only a subset of the vocabulary is considered, composed of the k words with the highest probability, further restricted to the most probable words whose probabilities sum to p.
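The combined top-k/top-p filtering step described above could be sketched like this (a simplified single-step version over a 1-D logits vector, not the paper's decoding code):

```python
import torch

def top_k_top_p_filter(logits, k=50, p=0.9):
    """Keep at most the k highest-probability tokens, then additionally
    drop tokens outside the smallest set whose cumulative probability
    reaches p. Filtered positions are set to -inf so that sampling from
    softmax(logits) can never pick them. `logits` is a 1-D vocab tensor."""
    # Top-k: mask everything strictly below the k-th largest logit.
    kth = torch.topk(logits, min(k, logits.size(-1))).values[-1]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: sort, find the cumulative-probability cutoff, mask the tail.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    sorted_remove = cum_probs > p
    sorted_remove[1:] = sorted_remove[:-1].clone()  # keep first token past p
    sorted_remove[0] = False                        # always keep the argmax
    remove = sorted_remove.scatter(0, sorted_idx, sorted_remove)
    return logits.masked_fill(remove, float("-inf"))

# With one dominant token, the nucleus (p=0.9) prunes everything else.
filtered = top_k_top_p_filter(torch.tensor([0.0, 0.0, 10.0]), k=2, p=0.9)
```

In a real decoder this filter runs before sampling at every generation step.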
The random seed for all experiments is set to 42.

Evaluation Metrics We perform both automatic and human evaluations. (1) Automatic metrics: We adopt BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), Entropy (Zhang et al., 2018b) and CIDEr (Vedantam et al., 2015) for lexical-based measurement. Following previous work, we also adopt the C-score (Madotto et al., 2019) to indicate the consistency between the generated response and the input personas. The C-score is calculated as the entailment score of a RoBERTa model fine-tuned on the DialogueNLI dataset. (2) Human evaluation: We randomly sampled 200 samples from the test set and asked 3 crowdworkers to rate the generated responses on three aspects: response fluency, context coherence and persona consistency. The scores {0, 1, 2} indicate unacceptable, acceptable and excellent, respectively. The degree of agreement during human evaluation is measured by Fleiss' kappa (Fleiss, 1971).
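For reference, Fleiss' kappa over the annotators' {0, 1, 2} ratings can be computed as sketched below (the rating lists are made-up examples, not the paper's annotations):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number
    of annotators; `ratings` is a list of per-item label lists."""
    n = len(ratings[0])                 # annotators per item
    N = len(ratings)                    # number of items
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]
    # Per-item observed agreement and overall category proportions
    P_i = [(sum(c * c for c in cnt.values()) - n) / (n * (n - 1))
           for cnt in counts]
    p_j = [sum(cnt[cat] for cnt in counts) / (n * N) for cat in categories]
    P_bar = sum(P_i) / N                # mean observed agreement
    P_e = sum(p * p for p in p_j)       # expected chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement across items yields kappa = 1.
kappa = fleiss_kappa([[2, 2, 2], [0, 0, 0], [1, 1, 1]])
```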

Experimental Results
Improves performance on the original test set Table 3 shows different models' performance on the original test set without any modifications (for ORIG, "Shuffle" is used during training but is optional during testing; the Table 3 results are obtained in the absence of "Shuffle" during testing, to evaluate whether ORIG performs well in the normal setting). From the automatic metrics, we can see that base models trained with our ORIG framework outperform the baselines. This justifies that our framework can be applied to different models to improve their performance. From the human evaluation results, models with ORIG are superior to others on almost all metrics, especially for GPT2. This is consistent with the results of the automatic metrics. The average kappa value of the annotation is 0.632, indicating good agreement during human evaluation.

Reduces variance and improves mean and worst-case performance Figure 3 shows that aside from reducing the variance, ORIG also improves the mean and worst-case performance (detailed results in Table 4) consistently across the two models, especially for GPT2 (whose worst-case performance is very close to its best case). We reduce the variance on GPT2 and BART by 91.6% and 51.8%, respectively. Meanwhile, we improve worst-case performance by 20.3% and 22.6% on GPT2 and BART, respectively. The only drop is in the best case. This is because our distance function D is symmetric, pulling the two representations in the constraint together indiscriminately, which causes the best case to go down as the worst case goes up. We leave more complicated, directional distance constraints for future studies.

Conclusion
We show that the current practice of applying pre-trained models to the personalized dialogue generation task is volatile across different input orders of personas. Through our analysis, we find that the problem arises from the representation changes induced by the input changes. Motivated by these findings, we propose ORIG, a model-agnostic framework for fine-tuning persona dialogue models such that they obtain persona order-invariant representations.
Experiments on two dominant pre-trained dialogue models show that our framework improves performance and reduces order volatility.

Limitations
In this section, we discuss the limitations of this work. First, on the problem side, it is non-trivial to consider the order of all kinds of grounding knowledge, but we have only explored Persona-Chat. We hope to apply our method to more grounded generation tasks such as knowledge-grounded and document-grounded dialogue in the future.

Figure 1: A dialog extract from Persona-Chat showing that different orderings of the same persona can lead to different and even inconsistent responses.

Figure 3: Our ORIG improves the mean performance while reducing the variance of both models. Statistics are obtained by running each model 100 times on the test set and randomly shuffling the order of the input persona sentences in each run.


Table 2: The token-level representation of the same response can be very different when the ordering of input personas changes. The values denote the KL distances between the representations of the same tokens returned by models fed with two different orderings of personas.

Table 3: Automatic and human evaluation results of applying ORIG to two base models on the original test set without any modifications to input persona orders.

Table 4: Statistical results of BLEU-1. The mean and variance are obtained by running each model 100 times on the test set and randomly shuffling the order of the input persona sentences in each run. The best and worst cases are obtained by feeding the models the best and worst orderings of personas for every test sample.