Frugal Prompting for Dialog Models

The use of large language models (LLMs) in natural language processing (NLP) tasks is rapidly increasing, leading to changes in how researchers approach problems in the field. To fully utilize these models' abilities, a better understanding of their behavior for different input protocols is required. With LLMs, users can directly interact with the models through a text-based interface to define and solve various tasks. Hence, understanding the conversational abilities of these LLMs, which may not have been specifically trained for dialog modeling, is also important. This study examines different approaches for building dialog systems using LLMs by considering various aspects of the prompt. As part of prompt tuning, we experiment with various ways of providing instructions, exemplars, the current query and additional context. The research also analyzes representations of dialog history that have the optimal usable-information density. Based on the findings, the paper suggests more compact ways of providing dialog history information while ensuring good performance and reducing the model's inference-API costs. The research contributes to a better understanding of how LLMs can be effectively used for building interactive systems.


Introduction
Large Language Models (LLMs) have rapidly transformed the landscape of natural language processing (NLP) research through emergent capabilities like prompt-based learning, in-context learning (ICL), and conversational abilities (Wei et al., 2022). While these novel approaches are being applied to various domains and tasks with remarkable speed and effectiveness, many dimensions of LLMs remain unexplored. For instance, since we no longer need to follow a fixed schema for textual inputs (as in standard supervised learning for text), the ways in which input text can be presented, and their impact on task performance, form an essential aspect that needs to be investigated. Additionally, as various LLM-inference APIs become available for a price, the trade-off between performance gain and prompting (inference) cost is another dimension that requires attention.
While efforts have been made to reduce the inference costs of Transformer (Vaswani et al., 2017) models, these contributions have mostly been at the architecture level and require access to the model weights and source code (Tay et al., 2022). As many models like GPT-3 (Brown et al., 2020), CODEX (Chen et al., 2021a), LaMDA (Thoppilan et al., 2022), and PaLM (Chowdhery et al., 2022) are now closed source, it is not possible for the end-user to optimize the models' costs using these approaches.
In recent prompt engineering literature, the focus has been on optimizing the prompt to improve downstream task accuracy (Chung et al., 2022; Wei et al., 2021), with the majority of past efforts targeting single-turn tasks (e.g., classification, reading comprehension, question answering, etc.). However, for longer inputs, another critical factor is the inference-API cost, which has largely been ignored in prior work. This is especially true for interactive or dialog tasks.
This paper explores the trade-off between cost and performance for LLMs in a prompt-based/in-context learning (ICL) setup. We propose the idea of frugal prompting in the context of dialog models, which involves input optimization methods to maintain performance gains while minimizing costs. To compare the effectiveness of input representations for in-context learning based methods while considering both cost and task performance, we introduce a new metric called Usable Information Density (UID). Using this metric, we gain insights into the capabilities of various ICL model families for understanding and accessing information from different input representations.
Overall, we make the following contributions in this paper. (1) We explore the effectiveness of various ICL models and input formats for dialog modeling. (2) We propose a new metric, UID, that captures the trade-off between accuracy and length for various (input format, ICL model) combinations. (3) Extensive experiments on two benchmark dialog datasets (MSC and TC) and four ICL models show that (a) adding more context as part of the input does not necessarily improve UID by similar amounts across all ICL models, and (b) for most ICL models, using the most semantically related utterance from the dialog history is more cost effective than using the full history, a summarized dialog history, or the most recent utterance.

Literature Review
Large language models (LLMs) for dialog modeling: A large number of recent dialog generation models have been based on pretrained LLMs like DialoGPT (Zhang et al., 2019), Plato (Bao et al., 2019), Blenderbot (Roller et al., 2021; Shuster et al., 2022), Meena (Adiwardana et al., 2020), and LaMDA (Thoppilan et al., 2022), which use the transformer architecture. Although large scale pretrained models like the 175B Blenderbot-3 (Shuster et al., 2022) or the 137B LaMDA (Thoppilan et al., 2022) lead to high accuracies across multiple dialog datasets, these approaches can be prohibitively expensive due to the ever-increasing size of models. In-context learning (ICL) (Brown et al., 2020) with prompt-based models helps avoid expensive finetuning. Further, better accuracies have been obtained using instruction finetuning in models like T0 (Sanh et al., 2022), FLAN (Chung et al., 2022), Tk-Instruct (Wang et al., 2022), etc. But the increased inference cost due to large prompt sizes remains an open challenge.

Ways to optimize computation for LLMs: Following the discussion of the environmental impact of the training process of these LLMs (Strubell et al., 2019), multiple studies have pursued two main lines of work on optimizing the costs of LLMs: (1) Model distillation-based methods (Hinton et al., 2015; Sanh et al., 2019; Gou et al., 2021; Gupta and Agrawal, 2022) train a smaller, simplified model to approximate the predictions of a larger, more complex model. (2) Efficient transformer architectures (Kitaev et al., 2020; Wang et al., 2020; Zaheer et al., 2020; Beltagy et al., 2020) aim to reduce the quadratic complexity of the standard transformer architecture by using more efficient self-attention mechanisms. In this paper, we examine the costs associated with the use of prompts in LLMs and suggest a new method for assessing the cost-performance trade-offs involved, as well as strategies for optimizing the inference cost with respect to the inputs. Please refer to Appendix F for a more detailed literature review.

Prompting Methods for Dialog Systems
We first present the necessary ingredients of a prompt for dialog systems. Next, we discuss recipes for manual and algorithmically optimized prompts. Lastly, we present ways of effectively including context information as part of prompts.

Prompt Ingredients for Dialog Systems
To build a prompt-based dialog system using LLMs, the following components or information sources are an important part of the prompt template.
(1) Task Instruction: The instruction explains the task of dialog response generation to the model. Through the instruction, we also assign a system-role for the LLM (also called Person2) to play, for example the role of "an automated chat system".
(2) Dialog Context: As part of the dialog context, several components can be included, like dialog history, persona information and Person1's latest utterance. (a) Dialog history: This refers to the past conversation between Person1 and Person2 that provides the context for the current conversation. (b) Background Information (BI): We also make use of additional information like persona or knowledge sections when available. A persona is a fictional representation of a user consisting of a series of sentences describing their characteristics, events and opinions. It is used to create a personalized experience during a conversation. Knowledge sections are short paragraphs from different data sources (Wikipedia, Reddit, and Washington Post) that are related to the topic of the conversation. We experiment with various combinations of these pieces of information to understand their impact on accuracy versus inference cost.
(3) Person1's latest utterance: This is the most recent statement or question uttered by Person1 in a dialog, which prompts Person2's response. (4) Exemplars: Although most recent LLMs are capable of solving tasks using instructions alone (due to RLHF and instruction-finetuning), providing examples along with the task description may help improve performance. We test our prompt-based models in two configurations with respect to the number of examples: zero-shot and few-shot.
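To make the interplay of these ingredients concrete, the following minimal sketch shows one way such a prompt can be assembled; the function and field names here are illustrative assumptions, and the exact templates used in our experiments are listed in Appendix G.

```python
def build_prompt(instruction, utterance, history=None, background=None,
                 exemplar=None):
    """Assemble a dialog prompt from the four ingredients described above."""
    parts = [instruction]
    if exemplar:                # few-shot: one example, formatted like the input
        parts.append("Example:\n" + exemplar)
    if background:              # persona or knowledge sections (BI)
        parts.append("Background details:\n" + background)
    if history:                 # full, selected (Recent-k/Semantic-k) or summarized
        parts.append("Dialog history:\n" + history)
    parts.append(f"Person1: {utterance}\nPerson2:")
    return "\n\n".join(parts)
```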
An example prompt is shown in Table 1. A full list of all the prompts used in our experiments can be found in Appendix G.

Table 2: Manually engineered prompt template with summary of dialog history, persona and latest Person1 utterance as dialog context and with one exemplar.

Manual versus Perplexity Prompts
We experimented with two ways to design prompt templates: manual prompts, and automatically optimized prompts using a perplexity-based search method.

Manually Designed Prompts: Manual prompts were designed keeping in mind general principles of prompt design (Liu et al., 2023) like role based prompting (Schulhoff and Contributors, 2022), specifically adding requirements like "generate a consistent, diverse response" so as not to get repetitive, dull responses and to maintain consistency with respect to the current utterance and context. Table 2 illustrates one of our manually designed prompt templates, with a summary of the dialog history, persona and current user utterance as dialog context, and with one exemplar.

Perplexity Optimized Prompts: We followed the strategy highlighted in Gonen et al. (2022), which claims that the performance of a prompt is coupled with the extent to which the model is familiar with its language, and that this can be measured by the perplexity of the prompt. Given an LLM, we took the manually engineered prompt template and created candidate prompt variants using GPT-3 and back-translation. Further, we instantiated all such prompt templates using 100 instances (with the full prompt sequence, including the input itself, and without the label), and computed the average perplexity per template using the LLM. The lowest-perplexity template was chosen.
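A minimal sketch of this selection step follows, under the assumption of a causal LM scored via Hugging Face transformers and templates expressed as Python format strings; the checkpoint name and helper functions are illustrative stand-ins, not the exact pipeline used here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # illustrative stand-in LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of one fully instantiated prompt (input only, no label)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss               # mean token-level NLL
    return torch.exp(loss).item()

def select_template(templates, instances):
    """Return the template with the lowest perplexity averaged over instances."""
    avg = {t: sum(perplexity(t.format(**x)) for x in instances) / len(instances)
           for t in templates}
    return min(avg, key=avg.get)
```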

Optimizing the Dialog History Input
Redundancies in conversations: In conversational agents, dialog history plays a crucial role in generating meaningful responses. It provides context and continuity, and enables the agent to remember previous interactions with the user. However, the dialog history can also be redundant, especially when it contains back-channeling, clarification, and mistake correction. While these elements are necessary for a natural and useful conversation, they increase the length of the dialog history without adding any new information. In addition, responses from some dialog models (like InstructGPT (Ouyang et al., 2022) based models such as text-davinci-003) can be elaborate and long.
Shortening Dialog Histories: To reduce the prompt length, we can compress the dialog history by removing redundancies. The goal is to give the agent only the parts that are relevant and informative for generating the next response. Two possible approaches to compress the dialog history into a shorter and more informative representation are selection and summarization.
• Selection: Two possible ways to select parts of dialog history are as follows.
(1) Recent-k: The simplest approach is to use a fixed-length dialog history comprising the most recent utterances. However, this approach may not be optimal, as users may refer back to context beyond the fixed-length window and expect the system to understand.
(2) Semantic-k: In this approach, the k utterances from the dialog history that are most relevant to the current utterance are selected. This method is simple, but its performance depends on the quality of the similarity measure used. We used the average of the similarities obtained using the SimCSE model (Gao et al., 2021) and Sentence Transformers (Reimers and Gurevych, 2019) to measure the overall similarity between utterances; a minimal sketch of this selection appears after this list.
• Summarization: An alternative approach is to use a summary of the full dialog history produced by an abstractive summarization model. Such models (e.g., Pegasus (Zhang et al., 2020)) can be used to shorten the background information as well.
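The following sketch illustrates Semantic-k selection under the assumption that both scorers are loaded through the sentence-transformers library; the checkpoint names are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

simcse = SentenceTransformer("princeton-nlp/sup-simcse-roberta-base")
sbert = SentenceTransformer("all-mpnet-base-v2")

def semantic_k(history, current_utterance, k=4):
    """Return the k history utterances most similar to the current utterance."""
    scores = []
    for model in (simcse, sbert):
        emb_h = model.encode(history, convert_to_tensor=True)
        emb_u = model.encode(current_utterance, convert_to_tensor=True)
        scores.append(util.cos_sim(emb_u, emb_h)[0])
    avg = (scores[0] + scores[1]) / 2            # average of the two similarities
    top = avg.topk(min(k, len(history))).indices.tolist()
    return [history[i] for i in sorted(top)]     # keep original dialog order
```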
Experimental Setup

Datasets
We experiment with two dialog datasets for comparing various methods on accuracy versus inference cost for prompt-based dialog systems: Multi-Session Chat (MSC) (Xu et al., 2022) and Topical Chat (TC) (Gopalakrishnan et al., 2019). We chose these datasets because of their varying characteristics and the length of the dialog history.
The MSC dataset consists of multiple chat sessions whereby the speaking partners learn about each other's interests and discuss the things they have learnt from past sessions. Each user participating in these sessions (or conversations) is asked to play a role (persona) while having the conversation. On the other hand, in the TC dataset, each pair of users is assigned one or more topics along with some facts or knowledge about the topic, and the users are asked to have a conversation about the topic. Users have no persona in the TC dataset, but there are knowledge sections associated with the conversations. The test set contains 16,299 and 7,512 context-response pairs in the MSC and TC datasets, respectively. Full conversations contain, on average, 11.9 and 20.0 utterances in the MSC and TC datasets, respectively.
Since we do not train or finetune any specific models, we do not use the train splits of these datasets. For perplexity-based prompt optimization, we use the validation splits. We discuss detailed preprocessing steps in Appendix A.

Summarization of Dialog and Background Information
We used BART and Pegasus models for summarization. In dialog summarization, the objective is to distill the most important information or key points from a conversation, which can be quite challenging because conversations tend to be more dynamic and context-dependent than normal documents. Unlike traditional summarization, dialog summarization places a greater emphasis on preserving the coherence and context of the conversation. Hence, we used dialog summary datasets like DialogSum, SAMSum and CNN/DailyMail to finetune abstractive summarization models like Pegasus and BART, and picked the best model based on ROUGE scores on dialog summarization data.
We processed the DialogSum and SAMSum datasets to remove all conversation instances having more than two speakers, and normalized the speaker names to Person1 and Person2 so that the model does not hallucinate random names during summary generation.
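A minimal sketch of this filtering and speaker normalization, assuming conversations are represented as (speaker, utterance) pairs; the representation and function name are our own assumptions.

```python
def normalize_conversation(turns):
    """Drop conversations with more than two speakers; rename the rest.

    turns: list of (speaker, utterance) pairs.
    Returns the renamed turns, or None if the conversation is filtered out.
    """
    speakers = list(dict.fromkeys(s for s, _ in turns))  # unique, in order
    if len(speakers) > 2:
        return None                                      # filtered out
    rename = {s: f"Person{i + 1}" for i, s in enumerate(speakers)}
    return [(rename[s], u) for s, u in turns]
```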

Models and Prompt Design
For this study, we used GPT-3 (text-davinci-003), one of the most prominent models for prompt-based learning or ICL. Along with GPT-3, we also included other open-source models that are capable of ICL: FLAN-T5 (google/flan-t5-xl), T0 (bigscience/T0_3B), and Tk-Instruct (allenai/tk-instruct-3b-def for zero shot and allenai/tk-instruct-3b-def-pos for few shot). These open-source models are generally smaller in size compared to GPT-3 (175B) and acquire the capability of ICL through instruction-finetuning based training.
We experiment with several input prompt settings: (1) zero shot versus few shot; (2) manually designed versus perplexity optimized prompts; (3) settings based on usage of dialog history: (a) full history, (b) summarized dialog history (using any of the three summarization models), or (c) Recent-k or Semantic-k selection from history; (4) with and without summarized background information.
In the few shot case, we use only one exemplar since (1) previous work (Madotto et al., 2021) has shown that one exemplar is enough, and (2) we wish to find methods which retain good accuracy with short input lengths. The exemplar is formatted in the same way as the actual input. For example, if the actual input setting is to use persona with few shot, the exemplar also includes persona information. Similarly, if the actual input setting is to use summarized dialog history, the exemplar also includes summarized dialog history.
The exemplar is chosen based on the immediately previous utterances if available; otherwise, it is randomly chosen from the dataset. Thus, the exemplar differs for each instance. For example, consider the Recent-4 few shot setting, and let ABCDEFG be the utterances in a conversation. The instance will have G as the target response, and the input contains F as the current utterance and BCDE as the recent-4 dialog history. The input for this instance will also contain an exemplar whose target response is F, and whose input contains E as the current utterance and ABCD as the recent-4 dialog history.
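The Recent-4 construction above can be sketched as follows; the helper name is our own, and the fallback for short conversations (a random exemplar from the dataset) is only indicated in a comment.

```python
def instance_with_exemplar(conversation, k=4):
    """Build a Recent-k instance and its exemplar from one conversation.

    conversation: list of utterances in order, e.g. list("ABCDEFG").
    Each returned triple is (recent-k history, current utterance, target).
    """
    instance = (conversation[-k - 2:-2], conversation[-2], conversation[-1])
    if len(conversation) >= k + 3:          # enough context to shift back one turn
        exemplar = (conversation[-k - 3:-3], conversation[-3], conversation[-2])
    else:
        exemplar = None                     # fall back to a random exemplar
    return exemplar, instance

# list("ABCDEFG") yields exemplar (['A','B','C','D'], 'E', 'F')
# and instance (['B','C','D','E'], 'F', 'G'), matching the example above.
```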

Metrics
Performance: We evaluate the performance of the models using several popular metrics: METEOR, BLEURT and DEB. METEOR (Banerjee and Lavie, 2005) is widely used for various text-generation tasks (machine translation, dialog generation, etc.). It measures the lexical overlap between n-grams of the predicted and ground-truth responses. BLEURT (Sellam et al., 2020) uses a pre-trained BERT model for evaluating text generation models. DEB (Sai et al., 2020) is a BERT-based dialog metric further pre-trained on dialog data for next-response classification using the next-sentence-prediction (NSP) loss.
Inference Cost: To evaluate the effectiveness of different prompting methods for dialog systems, we need a metric that takes into account both the performance gain and the inference cost reduction. The cost is measured in terms of the length of the overall input, as longer inputs incur more inference-API costs and also slow down inference.
We propose a new benefit-cost based metric to simultaneously consider both model performance and the inference cost incurred: the usable-information-density (UID). UID with respect to metric $M$ is defined as $\mathrm{UID}_M(a) = (M_H)^a / L_H$, where $M_H$ is the average performance of the model as per metric $M$, $L_H$ is the overall combined size of the input and output averaged across all test examples, and $a$ is a metric-importance parameter. In the main paper, we present results using $a=1$, but show the impact of varying $a$ in the Appendix. With $a=1$, UID is simply the ratio of performance to cost, measured in terms of the size of the input and output. UID captures the amount of information, per token, usable by a model (Ethayarajh et al., 2022) for a given input/prompt configuration, and can be used to evaluate the effectiveness of different prompting methods.
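As a minimal illustration, UID reduces to a one-line computation over corpus-level averages (the generalized exponent $a$ is discussed further in Appendix H):

```python
def uid(mean_metric: float, mean_length: float, a: float = 1.0) -> float:
    """Usable Information Density: UID_M(a) = (M_H)^a / L_H.

    mean_metric: average performance M_H under metric M over test examples.
    mean_length: average combined input + output size L_H (in tokens).
    a: metric-importance parameter (a = 1 in the main results).
    """
    return mean_metric ** a / mean_length
```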

Size Comparison across different Input Formats
Fig. 1 shows the variation in the average input prompt size as we vary the prompt constituents, dialog history (DH) and background information (BI), for the few shot setting. We show a similar figure (Fig. 4) for the zero shot case in the Appendix. We plot the variation for manually engineered as well as perplexity optimized prompts for both datasets (MSC and TC). The Y-axis indicates the overall length of the input prompt, which is fed to the large language models (LLMs) without further processing. We observe that the complete dialog history is significantly longer compared to the summarized or selection forms. Since we use one demonstration exemplar in few shot cases, the few shot prompts are typically twice as long as their corresponding zero shot prompts. Perplexity optimized prompts are slightly shorter than manually engineered prompts on average. Pegasus-DS summarized dialog history is almost 3 times shorter; Pegasus-DS summaries are shorter than Pegasus-CD summaries, while BART-D summaries are shorter than Pegasus-DS summaries. The sizes of recent-2 (or semantic-2) inputs are similar to those of the summarized dialog histories in terms of the final length of the input context to the model. However, we expect the summarized dialog history to store more useful information in a compressed form compared to the greedy choice of only the recent-2 or semantic-2 utterances. In the case of Pegasus-DS + BI, we use the BI summarized using the Pegasus-CD model. Note that the summary of background information in TC is much larger compared to that in MSC. For example, Pegasus-DS + BI for TC is as large as the full dialog history.

Performance Results and Analysis
In Figs. 2 and 3, we analyze the absolute performance of various LLM model families using prompts based on various input representations for TC and MSC, respectively. We show results for few shot (FS) as well as zero shot (ZS) cases across three popular metrics: BLEURT, DEB and METEOR. We also show results for manually engineered as well as perplexity optimized prompts averaged across the various models (FLAN-T5, T0, Tk-Instruct and GPT-3). Since we do not have access to logits from the GPT-3 model, we cannot optimize prompts for GPT-3 using perplexity. Detailed model-wise results are in Appendix Figs. 5 to 8. For each of these combinations, we show results for different input prompt configurations: (1) full dialog history, (2) summary of the dialog history using BART-D, Pegasus-DS or Pegasus-CD, (3) Pegasus-DS summary of the dialog history along with a Pegasus-CD summary of the background information (BI), (4) Recent-k selected dialog utterances, and (5) Semantic-k selected dialog utterances, where k is varied as 1, 2, 4, 8, and 10. Note that we did not experiment with full background information, since the background information is very large, especially for the TC dataset.
As shown in Table 3, GPT-3 generally outperforms the other families of LLMs in terms of absolute performance (DEB, BLEURT, and METEOR), and Tk-Instruct performs the worst. Also, for TC, full dialog history generally yields the best results for DEB and METEOR. For MSC, even prompts with summarized history do very well, although they are much shorter. Averaged across metrics, we observe that Semantic-k performs better than Recent-k for all values of k (1, 2, 4, 8, 10) on both datasets. Further, while Semantic-k reaches peak performance at k=4, Recent-k attains the best results at higher values of k (8 or 10). Adding background information (knowledge facts) to Pegasus-DS boosts DEB and METEOR significantly but hurts BLEURT on average for both datasets. In MSC, amongst different configurations of the history signal, Recent-1 performs the worst on average. In TC, BART-D performs the worst on average. Surprisingly, even though zero shot prompts are almost half the size of few shot prompts, zero shot results are better than few shot results, except for perplexity prompts in MSC. Although recent prompt engineering studies motivate using demonstration examples, it turns out that examples are not very useful for dialog modeling.
Perplexity optimized prompts lead to shorter prompt sizes but not better accuracy values, except for DEB in TC. Since we cannot compute perplexity optimized results for GPT-3, we show results for the remaining three models. We observe that T0 is the best for DEB and METEOR, while FLAN-T5 is the best for BLEURT. In both cases, zero shot results are better.

UID Results and Analysis
Of main interest to us is the fact that, in many cases, the summarized dialog-history input attains much of the performance, sometimes even exceeding the full dialog-history setting, which has a much longer input length. Thus, we are interested in comparing various input representation methods in terms of how much information, per token, a particular LLM can access and convert into better performance on the response generation task. Hence, in this section, we discuss the relative importance of different components of the input prompt using UID as a metric that captures the input prompt size versus performance trade-off. We show the UID results averaged across models in Table 4. We also show model-wise results in Tables 5, 6, 7, and 8 for FLAN-T5, T0, Tk-Instruct and GPT-3, respectively. We show results for few shot as well as zero shot cases, and for both datasets (MSC and TC). We also show UID results across all dialog history, prompt-type and exemplar settings on the three different metrics: BLEURT, DEB and METEOR. Comparing manually engineered prompts versus perplexity optimized prompts, we observe that manually engineered prompts are better on average. We believe this is because perplexity and the other metrics (BLEURT, DEB, METEOR) do not show similar correlation with dialog response quality, as shown in (Liu et al., 2016).
Across the dialog history types, we make the following observations, which hold for both datasets: (1) For most metrics across datasets, using one semantically related utterance is the best; the UID decreases as we increase k. (2) In terms of absolute metrics (Figs. 2 and 3), Recent-k typically improves with increasing k while Semantic-k peaks at k=4 and then drops; but in terms of UID, for both Recent-k and Semantic-k, UID reduces as k increases.
(3) Adding background information to Pegasus-DS does not help. (4) Amongst summarization methods, Pegasus-DS and BART-D perform better than Pegasus-CD. This is expected, since Pegasus-DS and BART-D are both trained on dialog datasets. Using summaries of the dialog history provides better UID results than using the full dialog history. This suggests that models can work more efficiently with summarized input.
As observed from Figs. 2 and 3, few-shot accuracy values are worse than zero-shot, although few-shot prompts are almost twice the size of zero-shot prompts. This implies that few-shot UID is much smaller than zero-shot UID, as can be seen in Table 4.
Overall, we find that using the full dialog history, or Semantic-k/Recent-k with large k, is not very useful from a UID perspective. For both datasets, it is clear that Semantic-1 and Recent-1 have very good UID values across all models and metrics, with zero-shot being better than few-shot. This suggests that a smaller but more focused input is recommended for dialog model prompting.

Conclusion
In conclusion, this paper has explored the trade-off between model performance and cost in interactive tasks where dialog history plays a crucial role. Since recent large language models tend to produce longer dialog responses, using this long dialog history as context for next utterance prediction becomes more expensive. However, the experiments conducted in this study demonstrate that compressing the dialog history can retain, and sometimes even improve, model performance while significantly reducing cost. Our findings suggest that the optimal representation of dialog history is one that provides the highest amount of usable information per token. Summaries of the dialog history are better than the full history itself. The most recent utterance and the best semantically similar utterance are both better than summaries, and a single best semantically similar utterance is the best choice from both the accuracy and the usable-information perspectives. Overall, our results highlight the importance of carefully balancing model performance and cost in interactive tasks that rely on dialog history.
We experimented with datasets and models trained on languages with limited morphology, like English. While we hope that these results will generalize to models trained on multi-lingual datasets, empirical validation needs to be done.
While the study examines TC and MSC, the conclusions may only apply to these datasets and to general open-domain chit-chat dialog. There are many more dialog settings than just these two; for example, it remains to be validated whether the conclusions apply to more information-critical dialogs (e.g., task-oriented dialog datasets like MultiWOZ).
For task-oriented dialog systems with well-defined ontologies and belief states, the experimental design would need to be reconsidered, including aspects like prompts, summarization methods, and evaluation metrics. Standard summarization techniques may need to be adapted to better retain key belief-state information in the summary, although we believe that a well-defined ontology could potentially allow further optimization of prompt lengths compared to open-domain dialog. While the lower-level details would differ in applying frugal prompting notions to task-oriented dialogs, we are optimistic that similar beneficial findings around balancing model performance and computational costs could emerge.

Ethics Statement
In this paper, we studied how to efficiently use dialog generation models. Although we did not explicitly train our own dialog models, we would like to make readers aware of potential risks in the usage of such models. Many pretrained language representation models have learned patterns associated with exposure bias. Interpretability associated with the output is rather limited, hence users should treat the outputs carefully. These models generate possible response candidates and do not filter out any "problematic" candidates. Thus, for applications where candidate responses could be problematic (e.g., offensive, hateful, abusive, etc.), users should carefully filter them out before using the output from such models.
All the datasets used in this work are publicly available.We did not collect any new dataset as part of this work.
MSC dataset: The dataset was downloaded from https://parl.ai/projects/msc/. Xu et al. (2022) describe details about the creation of the dataset. Parl.ai makes models and datasets available under the MIT License.
TC dataset: The dataset was downloaded from https://github.com/alexa/Topical-Chat. The dataset is available under the Community Data License Agreement.
We used 4 models in this work: T0, Tk-Instruct, FLAN-T5 and the GPT-3 API. T0, Tk-Instruct and FLAN-T5 are all provided under the Apache 2.0 License on Huggingface. We used the publicly available GPT-3 API by signing up at OpenAI.

A Data Preprocessing
The MSC dataset is divided into multiple sessions, the first of which uses dialogs from the PersonaChat dataset. Each session has metadata information such as the time elapsed since the past conversation and previous dialogs. Examples from Session 1 do not have enough context; hence, examples from sessions 2, 3 and 4 are used, and the results are averaged across the three. As per the dataset construction, a single conversation is conducted across multiple sessions. Hence, as a first step, we aggregate all turns of a conversation across sessions 1, 2, 3 and 4 by concatenating them in temporal order. Further, context-response example pairs for our experiments are created by considering (i) the second utterance of each turn of sessions 2, 3 and 4 as a response, and (ii) the first utterance of the corresponding turn and the entire conversation history as context. We also use the persona information as background information when constructing the input for the various dialog models.
The test split of the TC dataset includes two sections, frequent and rare, based on the frequency of the associated entities as observed in the training set. We combine these splits to create our test set. The conversations begin with a preprocessed reading set retrieved from Wikipedia. Context-response example pairs are created by considering (i) the second utterance of each turn as a response, and (ii) the first utterance of the corresponding turn and the entire conversation history as context.
In both datasets, for each sample, we normalize the utterances by removing trailing whitespace and capitalizing the first word of every sentence.
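A minimal sketch of this normalization follows; the naive period-based sentence splitting is our assumption, as no specific splitter is named above.

```python
def normalize_utterance(utterance: str) -> str:
    """Strip trailing whitespace and capitalize the first word of each sentence."""
    sentences = [s.strip() for s in utterance.rstrip().split(". ") if s.strip()]
    sentences = [s[0].upper() + s[1:] for s in sentences]
    return ". ".join(sentences)
```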

B Hyper-parameters for training dialog summarization models
We used a batch size of 8 and finetuned the models for 10 epochs. We tried various learning rates (1e-5, 5e-5, 1e-4, 5e-4, 1e-3) and finally picked a learning rate of 1e-4, since that gave the best performance on the validation set. During training, we limited the maximum length of the generated summary to 128 tokens and set the number of beams to 5.
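Expressed as Hugging Face Seq2SeqTrainingArguments, the configuration above looks roughly as follows; treating this as our exact training script is an assumption, and the output path is illustrative.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="dialog-summarizer",        # illustrative path
    per_device_train_batch_size=8,
    num_train_epochs=10,
    learning_rate=1e-4,                    # best of {1e-5, 5e-5, 1e-4, 5e-4, 1e-3}
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=5,
)
```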

C Overall input length
Refer to Fig. 4 for a comparison of overall input length for various representations of dialog prompt across the two datasets for the zero shot setting.

D Detailed Model-wise Performance Results
In Figs. 5 to 8, we show detailed model-wise performance results across the three metrics: BLEURT, DEB and METEOR. We show results for manually engineered as well as perplexity optimized prompts for each of the models (FLAN-T5, T0, Tk-Instruct and GPT-3). Since we do not have access to logits from the GPT-3 model, we cannot optimize prompts for GPT-3 using perplexity.

E Detailed Model-wise UID Results
We show the model-wise UID results in Tables 5, 6, 7, and 8 for FLAN-T5, T0, Tk-Instruct and GPT-3, respectively. We show results for few shot as well as zero shot cases, and for both datasets (MSC and TC). We also show UID results across three different metrics: BLEURT, DEB and METEOR. For each of these combinations, we show results for different input prompt configurations: (1) full dialog history, (2) summary of the dialog history using BART-D, Pegasus-DS or Pegasus-CD, (3) Pegasus-DS summary of the dialog history along with a Pegasus-CD summary of the background information (BI), (4) Recent-k selected dialog utterances, and (5) Semantic-k selected dialog utterances, where k is varied as 1, 2, 4, 8, and 10.
F Detailed literature review

F.1 Dialog modeling
The development of open-domain chatbot systems that possess long-term memory, generate engaging and coherent responses, and perform equally well on a variety of dialog tasks has been a long-standing challenge. Several Seq2Seq models (Serban et al., 2017; Shen et al., 2017; Zhao et al., 2017; Bao et al., 2019; Santra et al., 2021) have been proposed to address the specific properties of dialog modeling. Recently, a significant amount of focus has been on pretraining large dialog generation models like DialoGPT (Zhang et al., 2019), Plato (Bao et al., 2019), Blenderbot (Roller et al., 2021), Meena (Adiwardana et al., 2020), Blenderbot-3 (Shuster et al., 2022) and LaMDA (Thoppilan et al., 2022) using the transformer architecture. Retrieval augmented generation (RAG) has been another prominent approach to tackle the dialog generation task in both large and small-scale models (Wu et al., 2019; Gupta et al., 2020; Cai et al., 2021; Komeili et al., 2021; Zhu et al., 2018). Although large scale pretrained models like the 175B Blenderbot-3 (Shuster et al., 2022) or the 137B LaMDA (Thoppilan et al., 2022) lead to high accuracies across multiple dialog datasets, they can be prohibitively expensive.

F.2 Prompt-based dialog modeling

Prompt-based approaches (Madotto et al., 2021) using LLMs like GPT-J (Wang and Komatsuzaki, 2021) or GPT-3 (Brown et al., 2020) have also been investigated, but it is crucial to select the right prompts and context to achieve the best results.

F.3 Compute Intensive LLMs
One of the most critical drawbacks of these LLMs is the training and inference cost, especially for long sequences. Beyond the complexity of a single forward pass, there are other costs involved in training an effective transformer LLM, e.g., the amount of training data and compute needed (FLOPs). Strubell et al. (2019) discuss the environmental impact of the training process of these LLMs in terms of total CO2 emissions. Optimizing the costs of LMs has mainly been explored from the perspective of increasing the efficiency of the inference step of a transformer. Model distillation-based methods (Hinton et al., 2015; Sanh et al., 2019; Gou et al., 2021; Gupta and Agrawal, 2022) train a smaller, simplified model to approximate the predictions of a larger, more complex model. Efficient transformer architectures, such as Reformer (Kitaev et al., 2020), Linformer (Wang et al., 2020), BigBird (Zaheer et al., 2020), and Longformer (Beltagy et al., 2020), aim to reduce the quadratic complexity of the standard transformer architecture by using more efficient self-attention mechanisms.
In this paper, we examine the costs associated with the use of Large Language Models (LLMs) and suggest new metrics for assessing the cost-performance trade-offs involved, as well as strategies for optimizing the inference cost with respect to the inputs.

G Full list of prompts

G.1 Manually engineered prompts
In this section, we provide a full list of manually engineered prompts. Tables 9 to 14 show prompt instances for six different settings: zero-shot versus few-shot, and passing persona versus dialog history summary versus both as context. The generations are from the GPT-3 model. When we use knowledge facts rather than the persona, the prompt templates remain the same. When the dialog context consists of components other than the summary, e.g., full history, recent-k utterances or semantic-k utterances, "summary" in the prompt templates is replaced with "full history", "list of recent-k utterances" or "list of semantic-k utterances", respectively.

G.2 Perplexity optimized prompts
Since we have only API access to the GPT-3 model, we could perform perplexity optimization only for the FLAN-T5-XL, T0 and Tk-Instruct models. Tables 15 to 19 show perplexity optimized prompts (templates as well as instances) for the FLAN-T5-XL, T0 and Tk-Instruct models under various settings like (a) zero shot versus few shot, and (b) persona, summary, knowledge section or combinations thereof as dialog context.

H Impact of varying metric-importance index (a)
We vary a as [0.5, 1, 2, 5, 10]. This generalized formulation of the UID metric, with M_H raised to an exponent a, can be used to capture the importance assigned by the user to the model performance M_H, e.g., when inference cost is less of a bottleneck. We analyzed the accuracy-length trade-off using different values of the parameter a to capture various types of user requirements in terms of the allowed expense of the inference process. The average UID values (for zero-shot manual prompts) across all the models are shown in Tables 20 and 21. These tables show that for both MSC and TC, for DEB and METEOR, as the value of a is increased, summary-based dialog history variants tend to become better in terms of UID, while Recent-k and Semantic-k variants become less impressive.
In terms of BLEURT (UID), however, the ranking favours Semantic-1 or 2 and Recent-1 or 2 throughout the complete range of a that we explored. This might be because BLEURT measures general sentence-level semantic similarity rather than the context-response relevance measured by DEB. To fully understand how the rank of the various history signals varies with the value of the metric-importance parameter a, we plot the rank order of all history signal types vs. the value of a (increased from 0.5 to 10), as shown in Fig. 9. These rank-order dynamics help us clearly understand how the choice of history signal changes as we give more importance to model performance and less to the cost of inference. For example, in terms of the UID (DEB) metric on the MSC dataset, the average trend across models is that Recent-1 and Semantic-1 are the recommended choices for smaller values of a, with summary-based variants becoming preferable as a grows.

Table 2 (template):
Automated Chat System: Learn from the below example on how to generate consistent and diverse responses between Person1 and Person2 given background details along with summary. Example: Here are some background details about Person1: [BI(P1)_E] Here are some background details about Person2: [BI(P2)_E] This is a summary of a dialog exchange between Person1 and Person2: [S_E] Given the background details and the summary of the dialog exchange between Person1 and Person2, give a consistent and diverse response to the following dialog by Person1. Person1: [U_E] Person2: [R_E] Now try it yourself: Here are some background details about Person1: [BI(P1)] Here are some background details about Person2: [BI(P2)] This is a summary of a dialog exchange between Person1 and Person2: [S] Given the summary of the dialog exchange between Person1 and Person2 and their background details, give a consistent and diverse response to the following dialog spoken by Person1. Person1: [U] Person2:

The three summarization models used are: (1) BART-D: a BART model finetuned on dialog summarization data. (2) Pegasus-CD: the google/pegasus-cnn_dailymail model (finetuned on the CNN-DailyMail corpus), with 16 encoder and 16 decoder layers. (3) Pegasus-DS: the google/pegasus-cnn_dailymail model further finetuned on both DialogSum and SAMSum data, with 16 encoder and 16 decoder layers. Training hyper-parameters are in Appendix B.

Figure 1: Comparison of average input length for various representations of dialog prompts across the two datasets for the few shot setting. DH = Dialog History, BI = Background Information.

Figure 3: Model-averaged performance results for the MSC dataset. DH = Dialog History, BI = Background Information.

Figure 4: Comparison of average input length for various representations of dialog prompts across the two datasets for the zero shot setting. DH = Dialog History, BI = Background Information.

Figure 9: Trend in Ranks of History Signal Types for Different Values of the Metric-Importance Index a (for the MSC dataset, DEB metric).

Table 1: Prompt template and an instantiation with summary of dialog history as dialog context and without exemplars or background information. This is perplexity optimized using the FLAN-T5-XL model. More examples are provided in Appendix G.

Table 3: Model comparison based on average performance over history, prompt-type and exemplar settings.

Table 4: UID results across four models. Manual prompts are averaged over all 4 models; perplexity optimized prompts are averaged over all models except GPT-3.

Table 5: UID results for the FLAN-T5 model.

Table 7: UID results for the Tk-Instruct model.

Table 8: UID results for the GPT-3 model. Manual prompt only, since prompts for GPT-3 cannot be perplexity optimized.