Instructive Dialogue Summarization with Query Aggregations

Conventional dialogue summarization methods directly generate summaries and do not consider users' specific interests. This poses challenges in cases where users are more focused on particular topics or aspects. With the advancement of instruction-finetuned language models, we introduce instruction tuning to dialogues to expand the capability set of dialogue summarization models. To overcome the scarcity of instructive dialogue summarization data, we propose a three-step approach to synthesize high-quality query-based summarization triples. This process involves summary-anchored query generation, query filtering, and query-based summary generation. By training a unified model called InstructDS (Instructive Dialogue Summarization) on three summarization datasets with multi-purpose instructive triples, we expand the capability of dialogue summarization models. We evaluate our method on four datasets, covering dialogue summarization and dialogue reading comprehension. Experimental results show that our approach outperforms state-of-the-art models and even larger models. Additionally, our model exhibits higher generalizability and faithfulness, as confirmed by human subjective evaluations.


Introduction
Both verbal and non-verbal conversations play a crucial role in the realm of communication. They serve as channels for humans to exchange information, ideas and emotions (Kester, 2004). In an era of information overload, dialogue summarization has become increasingly essential. The process involves extracting the key dialogue information, enabling people to grasp the essence of their daily interactions.
Conventional dialogue summarization models typically approach the problem as an unconstrained sequence-to-sequence task, treating dialogue-summary pairs as straightforward input-output pairs (Shang et al., 2018; Goo and Chen, 2018; Chen and Yang, 2020). Although fine-tuning pre-trained language models such as BART (Lewis et al., 2020) has shown promising results, these models fail to consider the specific preferences of users, who have distinct backgrounds, objectives, intents, and applications for the summaries they require.
To address this challenge, several methods have been proposed to integrate queries when generating summaries (Dang, 2006; Nema et al., 2017; Su et al., 2021; Zhong et al., 2021; Zhu et al., 2022; He et al., 2022). However, these models primarily concentrate on domains such as news (Dang, 2006; He et al., 2022), Wikipedia (Zhu et al., 2022), and meetings (Zhong et al., 2021); the exploration of query-based summarization for dialogues remains limited. Furthermore, Liu and Chen (2021) propose controllable generation using personal named entity planning, and Wang et al. (2022a) suggest controlling summary conciseness. However, both methods focus on specific aspects of controllability and still lack the flexibility to incorporate user requirements, as shown in Figure 1.
The primary obstacle in instruction-based dialogue summarization is the scarcity of training data. While existing datasets contain dialogue-summary pairs, creating query-based dialogue summarization datasets is challenging due to high annotation costs, limited diversity, and potential quality issues. In this work, inspired by Self-Instruct (Wang et al., 2022c), we propose to synthesize query-dialogue-summary (QDS) triples by leveraging the conditional question generation and answering abilities of general large language models (LLMs) (Wei et al., 2023). The process involves prompting LLMs to generate multiple candidate queries based on the reference summary. A filtering mechanism, employing text-based and semantic-based methods, is then applied to ensure the quality of the collected queries. Finally, the query-based summary is generated by triggering the question-answering ability of LLMs. This approach offers a promising way to generate query-based dialogue summarization triples while reducing human involvement and enhancing data diversity.
The InstructDS framework is shown in Figure 2. Through joint training with QDS triples, InstructDS can cater to user preferences by producing query-based summaries. This mixed training paradigm enhances the model's understanding of dialogue, which improves the factual quality of generated summaries. Our model exhibits superior domain generalizability by incorporating multiple datasets into a unified framework, and the user's conciseness requirement can be fulfilled by our length-aware augmentations.
Our main contributions are summarized as follows:
• We introduce InstructDS, the pioneering instruction-following dialogue summarization model. It is a text generation model designed to summarize dialogues while explicitly considering user instructions.
• We present a straightforward yet effective approach to synthesize query-dialogue-summary triples from dialogue-summary pairs, facilitating query-based dialogue summarization. This method leverages the question generation and answering capabilities of large language models (LLMs). We validate its effectiveness through evaluations conducted by human annotators.
• We conduct an extensive evaluation on three dialogue summarization datasets and one dialogue comprehension dataset. The results demonstrate a substantial improvement over previous models. Additionally, according to human subjective tests, our generated summaries exhibit comparable levels of factuality, fluency, informativeness, and conciseness to human-written ones.
Related Work

Dialogue Summarization
Dialogue summarization is the task of generating a concise and fluent summary of a conversation involving two or more participants. It has gained significant attention due to its broad applications and the availability of relevant datasets (Gliwa et al., 2019; Chen et al., 2021; Zhao et al., 2021). Solutions to dialogue summarization are mainly based on sequence-to-sequence models, including the pointer-generator network (See et al., 2017), T5 (Raffel et al., 2020) and BART (Lewis et al., 2020). However, it remains a challenging task due to the lengthy and unstructured nature of dialogues. Chen and Yang (2020) propose extracting dialogue structures from various perspectives before summarization. Other approaches attempt to incorporate co-reference information (Liu et al., 2021b) and leverage dialogue understanding objectives (Liu et al., 2021a) to enhance factuality and informativeness (Tang et al., 2022; Wang et al., 2022b).
Similar to text summarization, the generation process of dialogue summarization is uncontrollable and poses challenges in incorporating user preferences (Zhong et al., 2021; He et al., 2022). Efforts have been made to enhance the controllability of dialogue summarization. However, these approaches are often limited to personal named entities (Liu and Chen, 2021, 2022; Wang et al., 2022a) and conciseness (Wang et al., 2022a). The primary challenge in instructive dialogue summarization lies in the availability of suitable supervision. While QMSum (Zhong et al., 2021) introduces the first query-based meeting summarization dataset, it focuses on lengthy meetings and consists of only 232 meeting samples. To address this limitation, we propose a methodology for synthesizing query-dialogue-summary triples leveraging summary-anchored techniques, to facilitate instructive dialogue summarization.

Instruction Tuning
Recently, instruction-finetuning of large language models has demonstrated remarkable generalizability to unseen tasks by leveraging task descriptions (Brown et al., 2020; Wang et al., 2022d; Chung et al., 2022). The availability of high-quality and diverse instructions unlocks the emerging capabilities of LLMs. For instance, Flan-series models are tuned on over 1,800 tasks with diverse instructions (Chung et al., 2022). However, dialogue tasks, being a sub-domain, have limited supervised data. This limitation leads to sub-optimal performance on query-based dialogue summarization with existing instruction-finetuned models. To mitigate the reliance on human-annotated instruction data, Self-Instruct (Wang et al., 2022c) uses GPT-3 to generate diverse instructions and input-output pairs. In a similar vein, our study introduces diverse and high-quality augmentations of query-based dialogue summarization data for instructive dialogue summarization.

Synthesize QDS Triples
The process of generating query-dialogue-summary (QDS) triples from dialogue-summary pairs in our pipeline involves three steps: 1) query generation from the complete summary; 2) query filtering to ensure validity and diversity; 3) query-guided summary generation.

Query Generation. To capture a diverse range of potential queries, we deploy the question-generation ability of LLMs to generate multiple candidate queries. Specifically, we use Flan-T5-XL (Chung et al., 2022) (referred to as model X), which has been trained on question generation datasets such as Quoref (Dasigi et al., 2019), MC-TACO (Zhou et al., 2019) and CosmosQA (Huang et al., 2019). We expect its question-generation ability to generalize to other narrative text. For each instance, we generate five candidate queries using the template shown in Table 7.
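The query-generation step can be sketched as follows. This is a minimal illustration rather than the paper's actual code: `generate_fn` is an assumed callable wrapping one sampled call to model X (e.g. Flan-T5-XL through any inference API), and the prompt wording mirrors the style of the Table 7 template:

```python
def generate_candidate_queries(summary: str, generate_fn, n_candidates: int = 5) -> list[str]:
    """Generate up to n_candidates distinct queries anchored on a reference summary.

    generate_fn: assumed callable mapping a prompt string to one sampled model output.
    """
    prompt = (
        "Generate an answerable and specific question based on the "
        f"following context. Context: {summary}"
    )
    queries: list[str] = []
    for _ in range(n_candidates):
        candidate = generate_fn(prompt).strip()
        # Drop empty outputs and exact duplicates; semantic deduplication happens later.
        if candidate and candidate not in queries:
            queries.append(candidate)
    return queries
```

Sampling the same prompt several times (rather than beam search) is what produces the diversity the later filtering stage then prunes.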
Filtering and Postprocessing. To ensure validity and diversity, we present two methods for query filtering. 1) Text-based filtering. Through an analysis of candidate queries, we observe that some queries cannot be answered conclusively without hallucinations. Examples of such queries include 'What will' and 'How would' queries. Therefore, we utilize model X as a text-based binary classifier to determine the answerability of queries, using the template in Table 7. This filtering process eliminates around 45% of generated queries that are likely to be unanswerable. 2) Semantic-based filtering. To avoid redundancy and ensure diversity, we remove similar queries for the same dialogue-summary pair through semantic similarity measurement. For instance, queries such as 'What does Edward think about Bella?' and 'What does Edward think of Bella?' are almost identical in meaning. We keep only the first query if the semantic similarity score, computed using normalized BERTScore (Zhang et al., 2020), is above 0.65. The semantic-based filtering process eliminates an additional 50% of the queries.
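The semantic-based filtering can be sketched as a greedy keep-first deduplication. In this illustrative sketch, `sim_fn` is a placeholder for any pairwise similarity scoring in [0, 1] (normalized BERTScore with the 0.65 threshold in the paper; the toy token-overlap function used below only stands in for it):

```python
def dedup_queries(queries, sim_fn, threshold: float = 0.65):
    """Keep a query only if it is not too similar to any already-kept query.

    sim_fn(a, b) -> similarity in [0, 1]; queries scoring above the
    threshold against an earlier kept query are discarded.
    """
    kept = []
    for query in queries:
        if all(sim_fn(query, prev) <= threshold for prev in kept):
            kept.append(query)
    return kept
```

Because the filter is greedy and order-dependent, the first phrasing of a near-duplicate pair is the one that survives, matching the "keep only the first query" rule above.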
Query-based Summary Generation. Using the query and the complete summary as input, we generate the query-based summary with model X. It is worth noting that generating query-based summaries directly from dialogues is challenging for model X. In contrast, generating them from condensed summaries is comparatively easy, as it allows the model to extract information from a more concise and structured source, which further guarantees quality.
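Putting the three steps together, the synthesis pipeline reduces to the following sketch. All function arguments (`generate_fn`, `answerable_fn`, `dedup_fn`) are hypothetical stand-ins for the LLM calls and filters described above, not the paper's actual interfaces:

```python
def synthesize_qds_triples(dialogue, summary, generate_fn, answerable_fn, dedup_fn):
    """Turn one dialogue-summary pair into zero or more (query, dialogue, summary) triples."""
    # Step 1: summary-anchored query generation (five candidates in the paper).
    prompt = ("Generate an answerable and specific question based on the "
              f"following context. Context: {summary}")
    candidates = [generate_fn(prompt) for _ in range(5)]
    # Step 2: text-based answerability filter, then semantic deduplication.
    answerable = [q for q in candidates if answerable_fn(q, summary)]
    queries = dedup_fn(answerable)
    # Step 3: query-based summary generated from the condensed summary, not the dialogue.
    triples = []
    for query in queries:
        qa_prompt = f"Question: {query} Context: {summary}"
        triples.append((query, dialogue, generate_fn(qa_prompt)))
    return triples
```

Note that the dialogue itself only appears in the output triple: both generation steps condition on the summary, which is what keeps the synthesis tractable for model X.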
Quality Check. Finally, we collect QDS triples for three dialogue summarization datasets and present statistics in Table 1. On average, 1.3 QDS triples are generated for each dialogue-summary pair. To assess quality and diversity, we enlist an expert to annotate 100 triples. Evaluation results in Table 2 demonstrate a significant improvement in the quality of synthesized triples after applying our filtering technique, with the quality score increasing from 45% to 75%. In the process, we incorporate the summary information as it represents a condensed version of the information contained in the corresponding dialogues. The gathered triples tend to have higher quality, with fewer errors and broader utterance coverage, as shown in the comparison in Table 10.

Model Training
We perform instruction tuning with the Flan-T5-XL model as the initial checkpoint. The instructions are three-fold: 1) general dialogue summarization, 2) query-based dialogue summarization and 3) their length-aware augmentations. For query-based dialogue summarization, the query and dialogue are concatenated as the input with the template: "###Instruction: {instruction}.### Input: {dialogue}.", where the output is the summary. To account for length-aware generation, we append "The generated summary should be around {summary length} words long." to the original instruction.
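The input construction, including the length-aware augmentation, amounts to string templating. A minimal sketch, assuming (as the augmentation implies) that the target length is taken from the reference summary's word count; the function name is illustrative:

```python
def build_training_input(dialogue: str, instruction: str, summary: str,
                         length_aware: bool = False) -> str:
    """Format one training example following the paper's instruction template."""
    if length_aware:
        # Target length taken from the reference summary, per the length-aware augmentation.
        target_len = len(summary.split())
        instruction = (f"{instruction}. The generated summary should be "
                       f"around {target_len} words long")
    return f"###Instruction: {instruction}.### Input: {dialogue}."
```

At inference time the same template lets users state an arbitrary desired length instead of the reference length used during training.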
To enhance generalizability across different dialogue types, we combine three dialogue summarization datasets to train a unified dialogue summarization model. From the synthesized QDS triples, we randomly sample 5k triples from each dataset. For length awareness, each sample is augmented once. This results in a total of (14.7k + 12.5k + 7.9k + 5k × 3) × 2 = 100k samples for training.

Experimental Setup

We evaluate and benchmark our method on three dialogue summarization datasets: SAMSum (Gliwa et al., 2019), DialogSum (Chen et al., 2021) and TODSum (Zhao et al., 2021). These datasets are equipped with dialogues and human-written or verified summaries. Additionally, we explore dialogue reading comprehension with the DREAM dataset (Sun et al., 2019). For evaluation, we employ ROUGE (Lin, 2004), BERTScore (Zhang et al., 2020), human evaluation and ChatGPT (GPT-3.5-Turbo-0301) for comprehensive quality assessment.

Main Results
The performance of different models on the SAMSum dataset is presented in Table 3, providing insights into their capabilities for general dialogue summarization. Notably, InstructDS outperforms the others and establishes a new state of the art on both ROUGE and BERTScore metrics. In general, dedicated summarization models show better performance because they are optimized specifically for a single task and dataset. Among general LLMs, Alpaca demonstrates less promising performance because it is optimized on synthesized instruction data with limited involvement of dialogue summarization tasks. In contrast, as depicted in Table 1, FLAN-based models include the SAMSum dataset in their instruction tuning process, resulting in competitive performance. While ChatGPT is renowned for its versatility across various tasks, it is prone to generating lengthy summaries when not constrained by prompts (Qin et al., 2023). Therefore, we further experiment with adding the reference summary length to instructions during summary generation. This significantly improves the performance of ChatGPT, achieving a balance between precision and recall. Meanwhile, InstructDS exhibits a further performance boost, demonstrating its ability to follow length instructions.

We conduct experiments on query-based dialogue summarization using the DREAM dataset; the results are presented in Table 5. InstructDS achieves performance comparable to Flan-T5 while underperforming ChatGPT. It is important to note that InstructDS is only trained with QDS triples synthesized from other datasets, whereas FLAN-based models are directly trained on the DREAM data. This demonstrates the effectiveness and generalizability of our synthesized triples. Further, we explore the impact of incorporating in-domain DREAM training data into InstructDS. This results in a significant performance boost, surpassing ChatGPT by a considerable margin.
To assess the generalizability of InstructDS, we present results on the DialogSum and TODSum datasets in Table 4. InstructDS maintains its outstanding performance over other models. A significant gap exists between the fine-tuned BART model and general-purpose LLMs. In contrast, InstructDS incorporates multiple data sources and augmented QDS triples. This comprehensive dialogue understanding framework contributes to its superior reasoning abilities across diverse dialogue domains and tasks.

Ablation Study
To provide insights into the effectiveness of InstructDS, we conduct ablation studies on InstructDS variants to answer two fundamental questions: 1) How do augmented QDS triples contribute to general and query-based dialogue summarization? 2) What is the effect of length-awareness augmentation on general and length-controllable dialogue summarization?
The ablation results are presented in Table 3. First, on the SAMSum and DialogSum datasets, the full InstructDS model demonstrates the best performance. Incorporating both synthesized QDS triples and length augmentation techniques contributes to an overall performance improvement. We attribute these performance boosts to enhanced dialogue understanding and length-awareness capabilities. On the DREAM dataset, the inclusion of synthesized QDS triples improves query-based dialogue summarization performance from 56.4 to 59.1. Notably, the performance is further enhanced when additional in-domain training data from the DREAM dataset is included. On the TODSum dataset, we observe that augmented QDS triples do not yield better summarization results; however, the length augmentation techniques do improve performance. The effect of the number of length augmentation samples is shown in Figure 5. It reveals that increasing the number of length augmentations enhances the model's ability to control the generated summary length, although a saturation point exists. Meanwhile, the effect on general summarization is mixed and relatively less consistent.

Subjective Quality Evaluation
We conduct multidimensional evaluations to assess the quality of generated summaries. This involves fine-grained Likert scores (scale 1 to 5, the higher the better) from both humans and ChatGPT in four dimensions: Faithfulness, Fluency, Informativeness, and Conciseness (Wang et al., 2022b; Gao et al., 2023a). Evaluations are performed on the SAMSum dataset, with 30 randomly sampled instances for human evaluation and 200 for ChatGPT evaluation. The user interface and prompt template can be found in Figure 6 and Table 7, respectively. Each dialogue is accompanied by one human-written summary and five machine-generated ones. We engaged 12 volunteers, resulting in 792 labeled samples. On average, each dialogue-summary pair receives assessments from 4.4 annotators. The mean and standard deviation of Likert scores are presented in Table 6. With human annotations, fluency is the best-performing dimension: all models except Alpaca demonstrate the ability to generate fluent summaries. Alpaca's relatively poor performance can be attributed to its unsupervised training and limited exposure to dialogue data. For informativeness and conciseness, ChatGPT and Alpaca produce the most informative summaries but receive the lowest conciseness scores. These models tend to generate longer summaries with elaborate details, indicating a limited understanding of the desired compression. Faithfulness emerges as a crucial factor in practical applications, where ChatGPT surpasses human performance. This can be attributed to potential inaccuracies in the annotations of the SAMSum dataset (Wang et al., 2022b; Gao et al., 2023b). ChatGPT's ability to generate detailed content, similar to the concept of Chain-of-Thought (Wei et al., 2022), also contributes to higher faithfulness. Overall, InstructDS achieves comparable performance to human-written summaries in terms of fluency, informativeness, and conciseness. While InstructDS still falls short of human-level faithfulness, it demonstrates noticeable improvements over previous models.
When using ChatGPT as an off-the-shelf evaluator, we observe that InstructDS achieves on-par or better performance compared with human-written summaries on all four dimensions. Especially for faithfulness, InstructDS outperforms all other models except ChatGPT. However, it is worth noting that ChatGPT exhibits biases towards its own outputs, resulting in potentially inflated evaluation scores. Similar patterns have been found in other studies involving ChatGPT evaluation. For example, Zhang et al. (2023) show that ChatGPT consistently assigns higher scores to its own outputs when solving math problems, raising significant concerns about using ChatGPT as an evaluator. Furthermore, in this work, we find that ChatGPT is not effective in evaluating the conciseness of summaries, which introduces a noticeable discrepancy compared to human evaluators in this aspect. We believe this is because ChatGPT is not aware of the desired conciseness of summaries, which also explains why it generates lengthy summaries. Further exploration is necessary for a robust ChatGPT evaluator in dialogue and other domains (Wang et al., 2023b).

Relationship with Query-focused Summarization
Query-focused summarization is closely related to our instructive dialogue summarization concept (Vig et al., 2022), with both similarities and differences. An ideal instructive dialogue summarization model should be capable of handling a wide range of instructions when generating summaries. As illustrated in Figure 2, our current model can accommodate general dialogue summarization, query-based dialogue summarization, and dialogue summarization with length control. We anticipate that the range of instructions will be expanded in future research, encompassing diverse sets of instructions and multi-round dialogue summarization scenarios. In the meantime, as a domain-specific model, we anticipate that instructive dialogue summarization could exhibit emergent capabilities, as shown in general instruction-tuned LLMs.

Long Dialogue (Meeting) Summarization
Summarizing dialogues in meetings, especially long ones, is a challenging task that requires a model capable of processing extended sequences. One promising avenue for research involves extending our current method to summarize lengthy dialogues, which can improve query-based meeting summarization and comprehension. Instead of relying on the entire lengthy dialogue, our approach generates queries from reference summaries. This addresses the difficulties of using pre-trained language models for long dialogue inputs in meeting scenarios. One of the obstacles in meeting summarization is limited data availability, which restricts the model's ability to generalize across different domains. Nevertheless, our method has the potential to alleviate data scarcity issues in the context of summarizing lengthy meetings. Another challenge is that meeting summarization often deals with transcripts containing ASR errors (Jiang et al., 2023), whose effect is under-explored in existing research.

Conclusion
In

Limitations
Dialogue summarization is a label-intensive task that demands substantial supervision and the collection of human-written summaries, which is both challenging and resource-intensive. Moreover, the transferability of annotations across different dialogue domains introduces additional complexity. Therefore, to develop a highly adaptable dialogue summarization model, leveraging unsupervised dialogue data becomes crucial. However, it is worth noting that InstructDS does not incorporate unlabelled dialogue data, leaving room for potential improvement.
Another important aspect to consider in dialogue data is privacy. The sensitive nature of dialogues can hinder the accessibility and public availability of diverse dialogue datasets. Therefore, future enhancements of InstructDS should address privacy concerns and explore advanced learning techniques such as federated learning, which can enable collaborative and privacy-preserving training processes.
Automatically evaluating the quality of dialogue summarization poses significant challenges. Acquiring human annotators for model development is expensive and inefficient. Existing evaluation metrics rely heavily on ROUGE, and ChatGPT has recently been proposed as an evaluation method; as discussed in Section 4.4, it still lacks transparency and robustness. Therefore, there is a pressing need for more effective evaluation methods specifically tailored for dialogue summarization. Multilingual and multicultural evaluation is also crucial, since dialogues are frequently intertwined with local norms, slang, code-switching and cultural nuances (Wang et al., 2023a).

A Datasets and Metrics
In this section, we will provide a detailed explanation of the datasets and metrics that have been utilized in our work.

A.1 ROUGE Implementation
In the context of dialogue summarization, we observed that different studies adopt different versions of the ROUGE metric, leading to varied results. Specifically, we identified three widely used implementations of ROUGE. This section presents the main results using the py-rouge package. The results for the SAMSum dataset are presented in Table 8, while the results for DialogSum and TODSum are in Table 9. These results align with the patterns and conclusions discussed in Section 4.2.
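To make the metric concrete, a minimal ROUGE-1 F1 with clipped unigram counts is sketched below. It is a simplified illustration (whitespace tokenization, no stemming or stopword handling), and those preprocessing details are precisely where published implementations diverge and scores drift apart:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1 with clipped counts (whitespace tokenization for brevity).

    Real packages differ in tokenization, stemming and stopword removal,
    which is why reported scores vary across implementations.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram overlap
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Reporting which package (and which preprocessing flags) produced a score is therefore essential for fair comparison.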

A.2 DialogSum Preprocessing
In the original DialogSum paper (Chen et al., 2021), the authors used #Person1#, #Person2#, and so on to represent speakers because the original dialogues did not contain speaker information. However, this approach leads to inconsistencies, as some names were already present in the original dialogue. To address this issue, we performed additional preprocessing to align the data format with SAMSum. Specifically, we employed the prompt template shown in Table 7 to prompt the FLAN-T5-XL model to predict the name of each person. We then applied rule-based filtering to determine the appropriateness of the predicted names. This filtering process considered factors such as forbidden names labeled by humans, the length of the predicted name, the presence of special symbols, and whether the predicted name appears in the original dialogue. If the name did not meet the criteria according to our rule-based identification method, we used FLAN-T5-XL again with the template from Table 7 to choose from a pool of ten candidate names, consisting of five randomly sampled male names and five randomly sampled female names. The name was correspondingly updated in the reference summary.
Examples of preprocessed dialogues can be found in Table 11 and Table 16.To facilitate future research and development, we will make the name-replaced version available for public access.

A.3 Evaluation on DREAM
The DREAM dataset (Sun et al., 2019) is introduced for dialogue reading comprehension, with evaluation designed as multiple-choice questions. However, in real-world applications where information queries are performed on dialogues, it is unlikely that several candidate answers are included as input. Real-world dialogue reading comprehension scenarios are better represented as unconstrained text generation problems.
Therefore, our evaluation on the DREAM dataset is conducted in an unconstrained manner, without providing candidate choices to the model. To assess accuracy, we utilize the BERTScore package to measure the similarity between the generated output and the answer choices, selecting the highest-scoring option as the final answer. The evaluation process is illustrated in Figure 7 and Figure 8.
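The selection step can be sketched as follows; `sim_fn` is a placeholder for the similarity measure (BERTScore in our setup, while the toy lexical overlap used below is purely illustrative):

```python
def select_answer(generated: str, choices: list[str], sim_fn) -> int:
    """Return the index of the candidate answer most similar to the free-form output.

    sim_fn(generated, choice) -> similarity score (BERTScore in the paper).
    """
    scores = [sim_fn(generated, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```

This maps the unconstrained generation back onto the dataset's multiple-choice labels, so standard accuracy can still be reported.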

B Templates and Examples
We provide more detailed illustrations of the human evaluation interface, instruction templates, synthesized QDS triples and case studies on model outputs.
• Figure 6: the interface used for subjective evaluation by annotators. Each annotator is asked to rate the quality of summaries in four dimensions.
• Table 7: the templates used for several tasks, including query generation, text-based filtering, ChatGPT evaluation and the preprocessing of the DialogSum dataset.
• Tables 10, 11 and 12: synthesized QDS triples, including both kept and removed ones, together with the reason for filtering.

Instruction: Highlight the key takeaways from the dialogue.
Output: Hilary has the keys to the apartment. Benjamin wants to get them and go take a nap. Hilary is having lunch with some French people at La Cantina. Hilary …
Instruction: Where is Hilary having lunch?
Output: La Cantina
Instruction: Summarize the dialogue with about 15 words.
Output: Benjamin, Hilary and Elliot are discussing their plans for the day after getting the apartment keys.

Figure 4: Ablation study on the number of included QDS triples. Performance on SAMSum (left, averaged ROUGE) and DREAM (right, accuracy) datasets is reported.

Figure 6: An illustration of the user interface for human evaluation of summarization quality.
In non-instructive dialogue summarization, given dialogue-summary pairs $\{D_{d,i}, S_{d,i}\}_{i=1}^{p_d}$ from dataset $d$ with $p_d \ge 1$ instances, a model $M$ is expected to generate a summary from the corresponding dialogue: $M(D_{d,i}) = S_{d,i}$. In instructive dialogue summarization, the focus shifts to structured triples: given query-dialogue-summary (QDS) triples $\{Q_{d,i}, D_{d,i}, S_{d,i}\}_{i=1}^{t_d}$, the model generates the query-conditioned summary: $M(Q_{d,i}, D_{d,i}) = S_{d,i}$.

Table 1: Dataset statistics covering several dialogue summarization datasets (SAMSum, DialogSum, and TODSum) as well as the DREAM dataset, which focuses on dialogue reading comprehension and contains natural query-dialogue-summary triples. The right part indicates the direct supervision exposure of the Alpaca, Flan-series, and InstructDS models.

Table 2: Quality review of the generated queries and summaries from synthesized QDS triples. After (before) filtering results are shown. Examples of both kept and filtered QDS triples can be found in Tables 10, 11 and 12.

Table 3: ROUGE scores on the SAMSum test set. The results are divided into two blocks: dedicated dialogue summarization models and general-purpose LLMs. "w/ reference summary length" indicates that reference summary lengths are provided in instructions.
It is important to note that all experimental results are obtained using the unified model without any dataset-specific tuning, which could potentially yield better results but at the cost of reduced generalizability. We employ LoRA (Hu et al., 2022) for parameter-efficient training with a total of 37.7 million trainable parameters.

Table 4: Results on the DialogSum and TODSum datasets. BART results are computed from the outputs released by Chen et al. (2021) and Zhao et al. (2021).

Table 5: Results on the DREAM dataset. InstructDS is not trained with DREAM data. "+In-domain" indicates the inclusion of DREAM training data.

Table 6: Subjective quality evaluation on instances randomly sampled from the SAMSum dataset (30 samples for human annotators, 200 samples for ChatGPT). The mean and standard deviation of evaluation scores are reported.

• Tables 13, 14, 15, 16 and 17: case studies on the generated summaries and query-based summaries on all four datasets, including SAMSum, DREAM, DialogSum, and TODSum.

Table 7: Prompting templates for QDS triple generation, text-based query filtering, ChatGPT evaluation, and DialogSum preprocessing.

Query generation: Generate an answerable and specific question based on the following context. Context: ${Summary}

Text-based filtering: Can we get an answer from the context, yes or no? Question: ${Question} Context: ${Summary} / Is the question fully answerable from the context without any guessing, yes or no? Question: ${Question} Context: ${Summary}

ChatGPT evaluation: Evaluate the quality of the abstractive summary from the dialogue. Please be extremely picky. Rate each summary on four dimensions: Faithfulness: whether the summary is correct according to the dialogue; Fluency: whether the summary is grammatically correct; Informativeness: whether the summary contains all essential information; Conciseness: whether the summary is very concise (not verbose). Output should follow the template: 'Faithfulness': value, 'Fluency': value, 'Informativeness': value, 'Conciseness': value. You should rate on a scale from 1 (worst) to 5 (best). Do not give detailed explanations. Dialogue: ${Dialogue}. Summary: ${Summary}

DialogSum preprocessing: (1) Who is #Person1# in the following dialogue? ${Dialogue} (2) Select one proper name for #Person1# from ${candidate names} in the following dialogue? ${Dialogue}