PK-ICR: Persona-Knowledge Interactive Multi-Context Retrieval for Grounded Dialogue

Identifying the relevant persona or knowledge for conversational systems is critical to grounded dialogue response generation. However, each type of grounding has mostly been researched in isolation, even as more practical multi-context dialogue tasks have been introduced in recent works. We define Persona and Knowledge Dual Context Identification as the task of identifying persona and knowledge jointly for a given dialogue, which could be of elevated importance in complex multi-context dialogue settings. We develop a novel grounding retrieval method that utilizes all contexts of dialogue simultaneously. Our method requires less computational power by utilizing neural QA retrieval models. We further introduce our novel null-positive rank test, which measures ranking performance on semantically dissimilar samples (i.e. hard negatives) produced via data augmentation.


Introduction
Effective conversation agents require external context as grounding information to enhance response generation. There has been much progress on persona-grounded (Majumder et al., 2020; Joshi et al., 2017; Shuster et al., 2018; Wu et al., 2019) and knowledge-grounded (Li et al., 2022; Dinan et al., 2018; Zhao et al., 2020; Liu et al., 2021) dialogue systems respectively. However, the combination of both and more unique contexts has not been studied, with limited interest in industry persona-based QA systems (Byron et al., 2017; Ky and Joshi, 2021). Feng et al. (2020), Dinan et al. (2018), and Moghe et al. (2018) have shown the importance of directly optimizing knowledge extraction in dialogue, while Zhang et al. (2018), Gu et al. (2021), and Liu et al. (2020) have shown the importance of directly optimizing for concrete persona. We further argue that in practical settings, it is more realistic to assume the utility of multiple contexts, an explicit use-case being a travel assistance agent (Jang et al., 2021).

Figure 1: Example interactions between persona ("I like pizza."), knowledge, and dialogue ("Can you recommend me a museum?").

Following the Knowledge Identification task in DIALKI (Wu et al., 2021), we define Persona and Knowledge Dual Context Identification as the task of identifying persona and knowledge jointly for a given dialogue. The task is similar to the persona-based QA task in industry (Byron et al., 2017; Ky and Joshi, 2021) of creating a search engine based on persona, with the exception that our study is set in an interactive dialogue setting. We emphasize the specific interactions (Fig. 1) between persona, knowledge, and dialogue. We aim to formalize the nature of component-wise interactions via this research, resulting in enhanced multi-context retrieval methodology. This separation of grounding retrieval tasks could be a particular benefit for multi-context dialogue, in which we can study complex context-wise interactions first, then apply the identified behavior as a sub-component of end-to-end systems. As a starting point, we re-purpose existing tasks and find that Question Answering (QA) is a good candidate (Adolphs et al., 2021). This provides the benefit of reduced computation and a streamlined architecture by reusing powerful retrieval models previously developed for diverse tasks.
We develop a framework that exploits this relation, an interesting aspect of which is combining persona and dialogue (Persona-augmented Dialogue) as a form of component augmentation. This may be of further utility in complex systems, as each pertains to the attributes and actions of the human respectively. Interestingly, our suggested augmentation method creates positive and hard negative samples, which could be applied to enhance retrieval (Appendix G). We introduce a novel evaluation methodology, the Null-positive Rank Test (NRT), to quantify this trait.
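The augmentation described above can be illustrated with a minimal sketch: each persona candidate is prepended to the dialogue turn, yielding one augmented query per candidate (the helper `augment_dialogue` is hypothetical, not the paper's exact prompt construction).

```python
# Minimal sketch of Persona-augmented Dialogue (hypothetical helper):
# each persona candidate is prepended to the dialogue turn. The true
# persona yields a positive sample; the remaining candidates act as
# hard negatives (Appendix G).
def augment_dialogue(personas, dialogue):
    """Return one persona-augmented query string per persona candidate."""
    return [f"{p} {dialogue}" for p in personas]

queries = augment_dialogue(
    ["I like pizza.", "I love railway."],
    "Can you recommend me a museum?",
)
```

Each augmented query is then scored against knowledge candidates, so the persona is no longer an isolated grounding but part of the retrieval input itself.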
Our contributions are summarized as follows.
1. Persona and knowledge dual context retrieval methodology. We enhance specific interactions between all components to successfully retrieve dialogue contexts. We achieve SOTA performance for both persona and knowledge retrieval.
2. Framework for cross-task adaptation of dialogue context interactions. We introduce a framework to benefit from existing performant retrieval models for complex dialogue grounding retrieval. Our zero-shot inference allows reduced computation (Table C1) and a streamlined architecture.
3. Evaluating the hard-negative trait of Persona-augmented Dialogue. We augment dialogue with persona to form an enhanced retrieval input, in which we observe hard-negative traits. We introduce a novel test to isolate this trait, applicable to scenarios where semantically dissimilar samples are produced via data augmentation.

Related Works
Integrating either persona or knowledge bases with dialogue agents in isolation has been actively studied: (Zhang et al., 2018; Majumder et al., 2020; Xu et al., 2020; Rashkin et al., 2019) for persona, and (Dinan et al., 2018; Zhao et al., 2020; Liu et al., 2021; Li et al., 2020; Ghazvininejad et al., 2017) for knowledge. Persona-only methods prohibit elaboration with detailed knowledge. Conversely, the relevant knowledge might depend on the persona of the user. We address these limitations by studying interactions between all components of dialogue.
The Knowledge Identification task (Wu et al., 2021) has been defined in recent papers stemming from knowledge-grounded dialogue. Our work aligns with the view in Wu et al. (2021) that context identification is a separately important task in an interactive dialogue setting, with similarities to open question answering (Min et al., 2019; Chen et al., 2017) and industry persona-based QA systems (Byron et al., 2017; Ky and Joshi, 2021). Our research expands upon the Knowledge Identification task to specify persona & knowledge as dual contexts to be jointly retrieved from the dialogue.

Methodology
We maximize interactions between all components of a conversation turn for effective retrieval of dialogue groundings. Knowledge retrieval is a top-1 ranking task (Section 3.1), and persona retrieval is a point-wise scoring task with a 1 or 0 true persona label (Section 3.2). We solve knowledge retrieval in a zero-shot manner, while we introduce the null-positive rank test to investigate the hard-negative traits of Persona-augmented Dialogue (Section 3.3).

Knowledge Retrieval
We introduce a novel adaptation of dialogue components as QA prompts (example in Fig. A1). This form is selected to infer relations between all inputs of the grounded dialogue and to replicate short question and descriptive answer pairs.
E is the input to our model. Q_i, A_j, P_i, K_j are the specific QA candidates and persona & knowledge pairs. D is the dialogue for the pairs.
We then find the best knowledge for all pairs of i and j in a permutative manner and record the knowledge of the most aligned pair.
best_i, best_j are the indices of the best-scoring persona / knowledge pair; true_j is the index of the predicted knowledge K, while best_i is discarded. M_q is the QA retrieval model producing the pair likelihood score. n / m are the persona / knowledge counts respectively.
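The permutative search above can be sketched as a double loop over persona and knowledge candidates; `score_fn(question, answer) -> float` is a hypothetical stand-in for the QA retrieval model M_q.

```python
# Sketch of the permutative knowledge search: score every
# (persona, knowledge) pair with a QA relevance model and keep the
# knowledge index of the best-scoring pair; the paired persona index
# (best_i) is discarded.
def retrieve_knowledge(personas, knowledges, dialogue, score_fn):
    best_score, true_j = float("-inf"), None
    for p in personas:                        # n persona candidates
        for j, k in enumerate(knowledges):    # m knowledge candidates
            s = score_fn(f"{p} {dialogue}", k)
            if s > best_score:
                best_score, true_j = s, j     # best_i is discarded
    return true_j
```

A toy lexical `score_fn` (e.g. word overlap) suffices to exercise the search; the paper's experiments use neural QA cross- and bi-encoders in its place.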

Persona Retrieval
Continuing from Section 3.1, we fine-tune the QA retrieval model using augmented persona and predicted true knowledge pairs only. We report that Persona-augmented Dialogue exhibits hard-negative attributes (Section 3.3).
E′ is the input to our model, similar to E, the only difference being the fixed true knowledge. E′_train is data from a separate training set formulated in the same manner as E′, with labeled true knowledge. M_f is the fine-tuned model (Appendix D).
Next, we infer selected data pairs with M_f to obtain the persona likelihood score. We avoid retrieving unrelated personas via a threshold.
p_i is the likelihood score for P_i. p_thres is the likelihood score threshold that removes personas that do not correspond to the dialogue turn. true_i is the index of the predicted true persona.
Finally, the retrieved grounding information is R = (P_true_i, K_true_j), the retrieved true persona & knowledge pair for the given dialogue turn.
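The scoring-plus-threshold step can be sketched as follows. This is a hedged reading of the decision rule, assuming the single best persona is kept only when it clears p_thres; `scores` stand in for the p_i outputs of M_f.

```python
# Persona selection sketch: pick the highest persona likelihood p_i,
# but return no persona when nothing clears the threshold p_thres
# (some dialogue turns have no corresponding persona). A hedged
# reconstruction, not the paper's exact equation.
def retrieve_persona(scores, p_thres=0.5):
    """Return the index of the predicted true persona, or None."""
    best_i = max(range(len(scores)), key=scores.__getitem__)
    return best_i if scores[best_i] > p_thres else None
```

The threshold p_thres = 0.5 matches the fine-tuning setup reported in Appendix D; Fig. E1 ablates this value.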

Null-positive Rank Test
We stress that fine-tuning model M_q with Persona-augmented Dialogue (P_i + D) to create model M_f is a specific choice. This is because the QA setup cannot be utilized without adjustments, due to the model scoring output skewing higher (Fig. E1). To analyze without inflated scores, we first interpret Persona-augmented Dialogue as hard negative sampling (Appendix G), in which the augmentation produces non-trivially hard-to-distinguish samples. To evaluate the above observation, we present a novel methodology, the null-positive rank test, to quantify the inherent difficulty of ranking P_i + D samples. Inspired by ranking metrics such as MRR, MAP, and NDCG, we utilize the rank of a specific sample to compute model performance. This allows us to isolate the discriminative performance of the model on samples of interest, regardless of score output (Fig. 2, example in Table A1).
We designate the null-positive (P_o) sample as a baseline for the model. We measure the following: can the model rank the null-positive sample correctly in relation to non-trivially dissimilar augmented samples? The "non-triviality" metric, which computes the average distance of the null-positive sample's rank from the ideal rank, is as follows (variants of the metric are in Appendix B): ¬T = Σ_r n_r · |r| / Σ_r n_r. ¬T is the non-triviality metric, with lower values meaning the model ranks better. n_r is the number of P_o samples with adjusted rank r (Fig. 2). We report "non-triviality" for each model M_q, M_f.
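Under this reading, ¬T is a count-weighted mean absolute adjusted rank. A minimal sketch, assuming the distance is the absolute value of the adjusted rank:

```python
# Non-triviality sketch: n_r maps each adjusted rank r (0 = ideal,
# directly below the positive persona) to the number of null-positive
# samples observed at that rank; ¬T is the average absolute
# displacement from the ideal rank.
def non_triviality(n_r):
    total = sum(n_r.values())
    return sum(count * abs(r) for r, count in n_r.items()) / total

# e.g. 6 samples ideally ranked, 2 at rank +2, 2 at rank -1
value = non_triviality({0: 6, 2: 2, -1: 2})
```

Because the metric depends only on ranks, it is unaffected by the inflated absolute scores that motivated the test in the first place.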

Experiment Setup
We utilize the Call For Customized Conversation (Jang et al., 2021) dataset and multiple neural QA models trained on the MS MARCO dataset (Nguyen et al., 2016). More details are in Appendix D.

Knowledge Retrieval
We experiment with ablations of dialogue / persona / knowledge interactions and find that permutative evaluation of the eq. 1 form yields the best performance. Table 1 shows a strong performance increase for our prompt input over the dialogue-only model, confirming that all components of dialogue are important.

Persona Retrieval
Table 2 shows that the fine-tuned P_i + D model has the best performance. However, we observe low performance for the non-fine-tuned P_i + D model. This is due to the QA relationship of dialogue to true knowledge affecting the likelihood score (Fig. E1). Thus, fine-tuning the model is a necessity to harness cross-domain adaptation.

Null-positive Rank Test
To verify our observation of the effectiveness of P_i + D, we perform the null-positive rank test (Section 3.3). The performance of the fine-tuned model improves, with smaller non-triviality across its variants (Table 3) and rank movements of delta in the correct directions (Fig. 3).

Conclusion
We introduce the persona-knowledge dual context retrieval method PK-ICR in this paper. We perform QA-informed prompt augmentations of data that successfully exploit the interactions between multiple dialogue components. We perform zero-shot top-1 knowledge retrieval and precise persona scoring. We present a novel evaluation method, the null-positive rank test, to isolate the hard-negative effect of Persona-augmented Dialogue. We obtain SOTA results on both retrieval tasks of the Call For Customized Conversation benchmark and report the alignment of the non-triviality metric with threshold-free performance.
We hope to stimulate readers to model dialogue context as an interactive whole of multiple components. As the NLP community aims to tackle more complex dialogue systems, our methods may be further enhanced by sophisticated grounding contexts and interactions present in datasets. Our cross-task adaptation of dialogue grounding retrieval to the QA task is limited in terms of the target task and our prompt construction. In addition, retrieval models informed by an inductive bias for multi-context scenarios could further improve our methodology. Lastly, future work could also report on modeling downstream generation tasks based on grounding interactions.

A Samples
Table A1: We display the ideal rank order for Persona-augmented Dialogue (P_i + D) along with the null-positive sample P_o (underlined). The rank is adjusted to be 0 at the ideal null-positive rank. This table corresponds to the notations in Fig. 2.

B Null-Positive Rank Test Variants
We introduce variants of the non-triviality metric ¬T (eq. 9). Smaller numbers are better for all variants.
• ¬T + to only observe positive rank displacements.
• ¬T − to only observe negative rank displacements.
• ¬T 2 similar to how Mean Squared Error relates to Mean Absolute Error.
• ¬T weighted to provide constant weights for each rank.
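The variants above might be sketched as follows. These are hedged reconstructions from the bullet descriptions, not the paper's exact equations 10-12; `ranks` is the list of adjusted ranks of null-positive samples (0 = ideal).

```python
# Hedged sketches of the ¬T variants (reconstructed from the
# descriptions above, not the paper's exact equations).
def nt(ranks):            # base metric: mean absolute displacement
    return sum(abs(r) for r in ranks) / len(ranks)

def nt_plus(ranks):       # only positive rank displacements contribute
    return sum(r for r in ranks if r > 0) / len(ranks)

def nt_minus(ranks):      # only negative rank displacements contribute
    return sum(-r for r in ranks if r < 0) / len(ranks)

def nt_squared(ranks):    # squared displacement, as MSE relates to MAE
    return sum(r * r for r in ranks) / len(ranks)

def nt_weighted(ranks, w=1.0):  # constant penalty per misranked sample
    return sum(w for r in ranks if r != 0) / len(ranks)
```

Note that nt_plus and nt_minus decompose nt, so nt(ranks) == nt_plus(ranks) + nt_minus(ranks).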

D Experiment Setup
We utilize the Call For Customized Conversation (Jang et al., 2021) dataset for evaluation and fine-tuning, which has 10 knowledge candidates and 5 persona candidates per dialogue. We format inputs following the QA models' MS MARCO training data, with a dummy short segment of "Title" and treating the Knowledge as the long answer segment. For persona search (eq. 5, 7), we fine-tune for 2 epochs with batch size 32 and a sigmoid activation function, with Binary Cross Entropy (cross-encoder) / Cosine Similarity (bi-encoder) loss and p_thres = 0.5. We list the official evaluation results on the test data. For MobileBERT (25M params) and BERT-base (110M params), we evaluate with the Next Sentence Prediction task. We experiment with DistilRoBERTa (82M params) STS and NLI cross-encoder models. We work with an RTX 3090 NVIDIA GPU.

Table F1: Null-positive rank test results for P_i + D & K_true_j bi-encoder models. Ours model is the fine-tuned variant, and Z.S. is the Zero-Shot model. We report persona retrieval accuracy when p_thres = 0 (0-Acc) and variants of non-triviality (eq. 9, 12, 10, 11). Smaller non-triviality means superior ranking capability.

G Background: Ranking and Hard Negative Sampling
Text ranking is the task of generating an ordered list of texts in response to a query (Lin et al., 2021a). It is a core task of information retrieval (IR), in which one obtains a ranked list of samples ordered by estimated relevance to the query. We introduce the widely accepted neural approaches, the 'cross-encoder' and the 'bi-encoder'. We also describe 'hard negative sampling', a data-centric approach to improving retrieval models.
For the cross-encoder (Nogueira et al., 2019), the query and a single sample are concatenated by the '[SEP]' token as input to the model, resulting in a relevance score (an FFN output of the '[CLS]' token representation) for that specific sample. We note that this setup is similar to the sentence-pair classification settings presented in (Devlin et al., 2018). Then, the samples are ordered by relevance score to produce the final ranked list.
For the bi-encoder (Reimers and Gurevych, 2019), we generate dense vectors (sentence embeddings) for each query and each sample. These are obtained via the '[CLS]' token representation of a specially fine-tuned model, with a single query or sample as input. The representations are then compared as pairs via cosine similarity or dot product to compute relevance scores. While the original bi-encoder setup computes sentence-wise similarity and not relevance scores, we utilize models fine-tuned on QA data (Appendix D).
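The two schemes can be contrasted with a toy bag-of-words representation standing in for the learned encoders (illustration only; the real models are transformer encoders):

```python
import math
from collections import Counter

def embed(text):
    # toy "sentence embedding": bag-of-words counts (stand-in for a
    # learned [CLS] representation)
    return Counter(text.lower().split())

def bi_score(query, passage):
    # bi-encoder style: encode each text independently, then compare
    # the two embeddings via cosine similarity
    u, v = embed(query), embed(passage)
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(x * x for x in u.values()))
    norm *= math.sqrt(sum(x * x for x in v.values()))
    return dot / norm

def cross_input(query, passage):
    # cross-encoder style: one joint sequence, scored by a classifier
    # head over the [CLS] position
    return f"[CLS] {query} [SEP] {passage} [SEP]"
```

The design trade-off this illustrates: the bi-encoder can pre-compute passage embeddings for cheap retrieval, while the cross-encoder sees both texts jointly and typically scores pairs more accurately.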
Hard negative sampling (also known as hard negative mining) is a technique to obtain specific samples (hard negatives) that are difficult to distinguish from positive samples, yet have a different label. The hard negative samples are then incorporated during model fine-tuning to improve model capabilities. For example, in the context of ranking, non-relevant texts that score highly due to keyword matches (Xiong et al., 2020) may be considered hard negatives. (Xiong et al., 2020; Luan et al., 2021; Lin et al., 2021b) have demonstrated that hard negative sampling improves ranking models considerably.
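A minimal mining loop in the keyword-match spirit described above (a sketch with a simple lexical overlap scorer standing in for BM25 or a retriever):

```python
# Hard-negative mining sketch: among candidates with the wrong label,
# keep the top-k that score highest against the query under a cheap
# lexical scorer; these "hard" negatives are then used in fine-tuning.
def mine_hard_negatives(query, candidates, gold_idx, k=2):
    def overlap(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    negatives = [i for i in range(len(candidates)) if i != gold_idx]
    return sorted(negatives, key=lambda i: overlap(candidates[i]),
                  reverse=True)[:k]
```

In the paper's setting no explicit mining is needed: the persona augmentation itself produces such hard-to-distinguish samples as a by-product.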

H Detailed Results
More experiments are listed here. Our bi-encoder and cross-task experiments confirm our findings in Section 5. Explanations of the models are in Appendix G.

Table H2: Persona retrieval results, models are zero-shot unless fine-tuned.We report persona retrieval accuracy per asymmetric QA prompt.We do not fine-tune DPR model due to implementation limitations.D, K, P each refer to dialogue, knowledge and persona.

I Retrieval Output Samples
We list some of the retrieved outputs with our best model in Table I1.
Dialogue D / persona P_true / knowledge K_true triples:

D: I think I've been there before but I don't remember the name of this place.
P: I am fond of Modernist architechure.
K: The Casa de les Punxes or Casa Terradas is a building designed by the Modernista architect Josep Puig I Cadafalch. Located in the intersection between the streets of Rosselló, Bruc and the Avinguda Diagonal in the Barcelona Eixample area.

D: How much this railway line costed in those times?
P: I love railway.
K: Because of the difficult physical conditions of the route and state of technology, the construction was renowned as an international engineering achievement, one that cost US$8 million and the lives of an estimated 5,000 to 10,000 workers.

D: Who built this rail line?
P: I love railway.
K: The line was built by the United States and the principal incentive was the vast increase in passenger and freight traffic from eastern USA to California following the 1849 California Gold Rush.

D: What's the highest point in the Mulanje Massif?
P: I like to climbing up the elevations on my neighborhood to take a look around.
K: Sapitwa Peak, the highest point on the massif at 3,002 m, is the highest point in Malawi.

D: Who was the first explorer to find this mountain?
P: I have fantasies of being a Livingstone type explorer.
K: The first European to report seeing the Massif was David Livingstone in 1859, but archeological investigation reveals evidence of human visits to the Massif from the Stone Age onwards.

D: Now I remember, can you tell me some characteristics of this channel?
P: N/A
K: And may be the oldest canal in England that is still in use. It is usually thought to have been built around AD 120 by the Romans, but there is no consensus among authors.

Figure 2 :
Figure 2: Null-positive Rank Test (NRT). P_o, P_pos, P_neg_i denote the null-positive sample, positive persona, and negative personas respectively. We omit the augmentation +D in the figure for brevity. r_min = −1 and r_max = +3 in this figure. Arrows are possible positions for P_o. Numbers on the rightmost side are the null-positive adjusted rank values, being 0 right below P_pos (example in Table A1).
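The adjusted rank in this caption can be sketched as follows. This is a hedged reading of the figure: 0 means P_o sits directly below P_pos, and negative values mean P_o is ranked above P_pos.

```python
# Adjusted-rank sketch (hedged reconstruction from the Fig. 2 caption):
# the null-positive's position is shifted so that 0 is the ideal slot,
# directly below the positive persona.
def adjusted_rank(ranked, null_id, pos_id):
    n, p = ranked.index(null_id), ranked.index(pos_id)
    # below the positive: subtract one to make "directly below" = 0;
    # above the positive: plain signed offset (so directly above = -1)
    return n - p - 1 if n > p else n - p
```

With one positive and several negatives, this reproduces the caption's range of r_min = −1 up to a positive maximum set by the candidate count.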

Figure 3 :
Figure 3: Analysis of null-positive rank data for the P_i + D & K_true_j cross-encoder model. The delta value is the change between the Zero-shot model and Ours model in terms of sample count (left axis). The ratio value is the delta value divided by the sample count for Zero-shot, in % (right axis). We report movements of delta in the correct directions for ranks −1, 0 and the long-distance ranks 3, 4. Similar results for bi-encoder (Fig. F1).

Figure E1 :
Figure E1: Persona threshold ablation experiments with P i & K truej cross-encoder model.We report persona accuracy.p thres is defined in eq. 7. Dotted line correspond to Zero-shot model, and solid line is our best model.We find visible peak at 0.55 with our best model while Zero-shot model performance keeps increasing > 0.8.

Figure F1 :
Figure F1: Analysis of null-positive rank data for P i + D & K truej bi-encoder model.Delta value is the change between Zero-shot model and Ours model in terms of sample count (left axis).Ratio value is delta value divided by sample count for Zero-shot, in % (right axis).We report large movements of delta in correct directions for rank 0 and ranks with long distance 3, 4.
P_i & K_true_j: 86.75
P_i + D & K_true_j: 83.83
P_i & K_true_j (fine-tuned): 89.12
P_i + D & K_true_j (fine-tuned): 91.57 (+4.71)
Similar results for bi-encoder (Table H2).

Table 3 :
Null-positive rank test results for P_i + D & K_true_j cross-encoder models. Ours model is the fine-tuned variant, and Z.S. is the Zero-Shot model. We report persona retrieval accuracy when p_thres = 0 (0-Acc) and variants of non-triviality (eq. 9, 12, 10, 11). Smaller non-triviality means superior ranking capability. Similar results for bi-encoder (Table F1).

Table C1 :
Computational effort required for the methods.

Table H1 :
Knowledge retrieval results, all models are zero-shot. We report top-1 knowledge retrieval accuracy per asymmetric QA prompt. D, K, P each refer to dialogue, knowledge and persona.

P_i + D & K_true_j (bi-encoder): 77.90
P_i + D & K_true_j (cross-encoder): 83.83
P_i + D & K_true_j (bi-encoder, fine-tuned): 85.55
P_i + D & K_true_j (cross-encoder, fine-tuned): 91.57

Table I1 :
Persona, knowledge, and dialogue examples retrieved by our best model.