InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems

Large language models (LLMs) have been used for diverse tasks in natural language processing (NLP), yet remain under-explored for task-oriented dialogue systems (TODS), especially for end-to-end TODS. We present InstructTODS, a novel off-the-shelf framework for zero-shot end-to-end task-oriented dialogue systems that can adapt to diverse domains without fine-tuning. By leveraging LLMs, InstructTODS generates a proxy belief state that seamlessly translates user intentions into dynamic queries for efficient interaction with any knowledge base (KB). Our extensive experiments demonstrate that InstructTODS achieves performance comparable to fully fine-tuned TODS in guiding dialogues to successful completion without prior knowledge or task-specific data. Furthermore, a rigorous human evaluation of end-to-end TODS shows that InstructTODS produces dialogue responses that notably outperform both the gold responses and state-of-the-art TODS in terms of helpfulness, informativeness, and humanness. Moreover, the effectiveness of LLMs in TODS is further supported by our comprehensive evaluations on TODS subtasks: dialogue state tracking, intent classification, and response generation. Code and implementations can be found at https://github.com/WillyHC22/InstructTODS/


Introduction
LLMs have consistently pushed new frontiers in natural language processing (NLP) across a variety of benchmarks, such as MMLU (Hendrycks et al., 2020), BIG-Bench (Lewkowycz et al., 2022), and HELM (Bommasani et al., 2022), achieving state-of-the-art results in both natural language understanding (NLU) and generation (NLG) tasks (Bang et al., 2023). Various applications of LLMs have also been adopted in industry, most prominently ChatGPT and GPT-4, which can answer a diverse range of questions fluently and coherently.

Figure 1: InstructTODS is the first zero-shot end-to-end task-oriented dialogue system that requires no task-specific annotations or ontology while generating more human-preferred responses.
Among the manifold tasks in NLP, task-oriented dialogue systems (TODS) represent a crucial domain. In general, TODS can be categorized into the pipelined approach (Ham et al., 2020; Hosseini-Asl et al., 2020a; Ohashi and Higashinaka, 2022), which relies on multiple sequential modules and heavy annotations for dialogue states and system actions, and the end-to-end approach (Banerjee and Khapra, 2019; Qin et al., 2020; He et al., 2022), where the system generates responses directly from the user input and the knowledge base (KB). Both approaches lack adaptability to unseen domains: adapting them typically requires domain-specific structures (ontology), and data for TODS is notoriously expensive to collect and annotate (Eric et al., 2020). In this regard, LLMs present great potential thanks to their extensive pre-trained knowledge, which enables them to adapt to contextual information without any parameter updates or additional task-specific data.
However, utilizing LLMs for tasks requiring knowledge grounding, such as TODS, poses a critical challenge that calls for thorough investigation. TODS requires a dialogue system to adeptly complete a specific goal by interacting with a user in natural language according to a bounded set of functions, ontology, and knowledge within the corresponding domain. Nevertheless, naively feeding all of this knowledge to an LLM could lead to the generation of misleading and unfaithful information, i.e., hallucination (Ji et al., 2023; Azamfirei et al., 2023).
In this work, we first investigate the capability of LLMs to perform three key TODS objectives in zero-shot settings: dialogue state tracking (DST), intent classification (IC), and response generation (RG). While LLMs demonstrate impressive capabilities on each of these tasks individually, a closer examination of their shortcomings reveals that the modular approach is not the most suitable way to use LLMs in TODS due to its restrictiveness. Rather than confining interactions to predefined elements such as slots, values, or system actions, it is more advantageous to harness the emergent abilities of LLMs to process unstructured information, which also enables the system to adapt easily to new domains.
From these observations, we propose InstructTODS, a fully off-the-shelf framework for performing end-to-end unified TODS in a zero-shot setting using LLMs. InstructTODS is adaptable to any KB and does not require any ontology or task-specific data. Instead of using predefined slot values, InstructTODS generates an unstructured proxy belief state from the dialogue context. An action thought is then generated to query the KB dynamically in natural language using an LLM, and the retrieved information is used to generate the response.
In summary, our contributions are as follows:

• We provide an extensive evaluation and comprehensive analysis of LLMs' zero-shot performance on several TODS subtasks, notably intent classification, dialogue state tracking, and response generation.
• We introduce InstructTODS, a fully off-the-shelf framework that leverages instruction-tuned LLMs in a zero-shot setting for end-to-end unified task-oriented dialogue, with the benefit of being effectively adaptable to any knowledge base (KB) while alleviating the need for any additional form of task-relevant data, such as intents, belief states, or system actions.
• We provide valuable insights from the TODS experiments on the more general advantages and failure cases of LLMs in performing complex zero-shot NLP tasks.
2 Evaluating LLMs on Zero-Shot Task-Oriented Dialogue Subtasks

As an intermediary step in exploring end-to-end TODS solutions, we first investigate how well state-of-the-art LLMs, i.e., GPT-3.5 and GPT-4, perform on various modular task-oriented objectives in their respective settings (a comparison of different LLMs over multiple tasks is presented in Appendix A).

TODS Subtasks
Let us define a dialogue set $D_n = \{u_1, r_1, u_2, r_2, \ldots, u_n, r_n\}$, where $u_i$ and $r_i$ denote the user utterance and the system reply at turn $i$, respectively.
Intent Classification (IC) For IC, we have a set of labels $C = \{c_1, c_2, \ldots, c_t\}$, from which we build the input for the LLM as $x_i^{ic} = P_{ic}(I_{ic}, \mathrm{Concat}(c_j)_{j=0}^{t}, u_i)$, where $P_{ic}(\cdot)$ is the IC input template, $I_{ic}$ refers to the natural language instruction for IC, and $\mathrm{Concat}(c_j)$ is the concatenation of all labels. We evaluate two generation settings: a single-output setting, where we query the model for the inferred intent, and a multi-output setting, where we query the model for the top-3 intents given the user query by simply changing the instruction $I_{ic}$. As such, we recast the classification task as text generation and compare our results with state-of-the-art IC baselines.
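To make the formulation concrete, the sketch below shows one way the input $x_i^{ic}$ could be assembled and sent to a model. It is a minimal illustration under stated assumptions, not necessarily the exact template $P_{ic}$ used in our experiments: the instruction wording, the toy label set, and the `call_llm` helper (a wrapper around any chat-completion API) are all illustrative.

```python
# Minimal sketch of the IC input construction; the instruction wording and
# the `call_llm` helper (any chat-completion API wrapper) are illustrative.

def build_ic_prompt(instruction: str, labels: list[str], utterance: str) -> str:
    """Assemble x_i^ic = P_ic(I_ic, Concat(c_j), u_i) as plain text."""
    label_block = ", ".join(labels)  # Concat(c_j): concatenation of all labels
    return (
        f"{instruction}\n"
        f"Possible intents: {label_block}\n"
        f"User utterance: {utterance}\n"
        "Intent:"
    )

# The single-output and multi-output settings differ only in the instruction.
I_SINGLE = "Classify the user utterance into exactly one of the intents below."
I_MULTI = "List the top-3 most likely intents for the user utterance below."

prompt = build_ic_prompt(
    I_SINGLE,
    ["card_lost", "transfer_fee", "check_balance"],  # toy label set
    "How much does it cost to send money abroad?",
)
# prediction = call_llm(prompt).strip()  # e.g., "transfer_fee"
```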

Dialogue State Tracking (DST) For DST, we define the total set of slots $S = \{s_{1,D_1}, s_{2,D_1}, \ldots, s_{k,D_l}\}$, where $s_{i,D_j}$ is the $i$-th slot associated with domain $D_j$. We give a single hand-crafted exemplar, distinct from the dataset, to guide the generation format directly as JSON. We build the input $x_i^{dst} = P_{dst}(I_{dst}, f_{dst}(S), D_i)$ by providing the entire dialogue context, where $P_{dst}(\cdot)$ is the DST input template, $I_{dst}$ denotes the instruction for DST, and $f_{dst}(S)$ refers to a textual transformation of the set of slots. We evaluate two settings with different slot transformations: one providing all slots and another providing only the active domain slots.
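A comparable sketch for DST is given below: a single hand-crafted exemplar pins down the JSON output format, and the generated state is parsed with a standard JSON parser. The slot names, the exemplar, and `call_llm` are again illustrative assumptions, not the exact template $P_{dst}$.

```python
import json

# Illustrative DST prompt: one exemplar (distinct from the dataset) fixes the
# JSON output format; f_dst(S) is a simple textual listing of the slots.
EXEMPLAR = (
    'Dialogue: "I want a cheap hotel in the north."\n'
    'State: {"hotel-pricerange": "cheap", "hotel-area": "north"}'
)

def build_dst_prompt(instruction: str, slots: list[str], dialogue: str) -> str:
    slot_list = ", ".join(slots)  # all slots, or only the active-domain slots
    return (
        f"{instruction}\nTrackable slots: {slot_list}\n\n"
        f"{EXEMPLAR}\n\nDialogue: \"{dialogue}\"\nState:"
    )

# raw = call_llm(build_dst_prompt("Track the dialogue state as JSON.",
#                                 ["restaurant-food", "restaurant-area"],
#                                 "Find me a Thai place in the centre."))
# state = json.loads(raw)  # e.g., {"restaurant-food": "thai", ...}
```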
Response Generation (RG) For RG, given a dialogue $D$, we define the set of oracle system actions $A = \{a_{1,1}, a_{1,2}, \ldots, a_{n,m}\}$, where $a_{i,j}$ denotes the $j$-th system action at turn $i$. We construct the input $x_i^{rg} = P_{rg}(I_{rg}, f_{rg}(a_{i,1}, a_{i,2}, \ldots, a_{i,m}), D_i)$, where $P_{rg}(\cdot)$ is the RG input template, $I_{rg}$ denotes the instruction for RG, and $f_{rg}(\cdot)$ refers to a textual transformation of the set of system actions. We evaluate the capability of LLMs to leverage structured system actions while addressing the dialogue context to generate a response to the user.

Experiment Settings
Dataset For dialogue state tracking, we evaluate the LLMs' capability on MultiWOZ 2.1 (MWOZ) (Eric et al., 2020). For intent classification, we evaluate on two datasets: Banking77 (Casanueva et al., 2020), a fine-grained intent dataset in the banking domain, and CLINC150 (Larson et al., 2019), a coarse-grained intent classification dataset covering over 10 different domains. The main challenge of CLINC150 is inferring out-of-scope intents, which is particularly difficult without any model training.

Key Takeaways
The evaluation results for intent classification, DST, and response generation are shown in Table 1, Table 2, and Table 3, respectively. We summarize the key insights as follows:

LLMs outperform most baselines. LLMs show significant improvements on intent classification and DST compared to other zero-shot and few-shot baselines, and perform almost comparably to few-shot models on intent classification.
LLMs offer better generalization and more adaptable solutions to TOD. Unlike fine-tuned models, LLMs approach all tasks in an autoregressive generation manner, allowing greater flexibility and scalability to adapt to other tasks and domains.

3 InstructTODS: An Instruction-Based Zero-shot End-to-End TODS

Leveraging the insights from solving the TODS subtasks in §2.3, we develop the first zero-shot end-to-end framework that operates without any domain information (ontology) and requires no task-specific annotations such as dialogue states, system acts, or intents. This method is not only cost-efficient but also alleviates the ontology constraint of LLMs in the modular DST task and promotes the strength of LLMs in generating better and more human-preferred responses. Let us define a structured knowledge base (KB) as a set of tuples $K = \{(v_1^{a_1}, \ldots, v_1^{a_k}), \ldots, (v_p^{a_1}, \ldots, v_p^{a_k})\}$, where $(a_i)_{i=0}^{k}$ are the attributes of the KB and $(v_j^{a_i})_{j=0}^{p}$ are the values associated with attribute $a_i$.

We first define a naive modular LLM response generation approach that serves as a baseline, denoted $\mathrm{RG}_{naive}$. $\mathrm{RG}_{naive}$ generates the user response by taking the entire KB along with the dialogue context as input. In this approach, we rely on the ability of the LLM to parse the entire KB during inference while processing the dialogue context, performing in-context retrieval and response generation at the same time. We build the input $x_i^{RG} = P_{RG}(I_{RG}, f_{RG}(K), D_i)$, where $P_{RG}(\cdot)$ is the response generation input template, $I_{RG}$ denotes the instruction for response generation, and $f_{RG}(K)$ refers to a textual transformation of the KB in which we filter out unnecessary information and values that are too long, as they are not needed to accomplish the user goal. The bottleneck of this approach resides in the context window limit of the LLM. Unlike other approaches, InstructTODS aims to make the best use of LLM abilities to perform end-to-end tasks in zero-shot settings without the need for additional modular NLU and DST models, allowing zero-cost adaptation to various domains with no parameter update.
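As a concrete illustration of the $\mathrm{RG}_{naive}$ baseline defined above, the sketch below linearizes the full KB into the prompt. The filtering heuristic, the character budget, and the implied `call_llm` helper are illustrative assumptions rather than the actual template, and the hard truncation makes the context-window bottleneck explicit.

```python
def rg_naive_prompt(instruction: str, kb_rows: list[dict], dialogue: str,
                    max_chars: int = 12_000) -> str:
    # f_RG(K): linearize the KB, dropping overly long values that are not
    # needed to accomplish the user goal (e.g., free-text descriptions).
    lines = []
    for row in kb_rows:  # each row maps attribute -> value
        kept = {a: v for a, v in row.items() if len(str(v)) < 80}
        lines.append("; ".join(f"{a}={v}" for a, v in kept.items()))
    kb_text = "\n".join(lines)[:max_chars]  # context-window bottleneck
    return (f"{instruction}\nKnowledge base:\n{kb_text}\n"
            f"Dialogue:\n{dialogue}\nSystem:")
    # response = call_llm(rg_naive_prompt(...))
```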
In general, to process the dialogue history and interact with the KB, InstructTODS introduces two concepts: the proxy belief state and the action thought. The results from the KB and the dialogue history are then fed as context to the LLM to generate the user response.
In the following paragraphs, we describe each component of InstructTODS in more detail.

Proxy Belief State
We generate a proxy belief state $\hat{B}_i = P_{BS}(D_i)$ from the dialogue history, where $P_{BS}(\cdot)$ denotes the prompt template and $D_i$ the dialogue context. $\hat{B}_i$ encapsulates, in natural language, everything that the user is looking for at this point of the dialogue. Note that the proxy belief state needs no prior knowledge about the domain nor any ontology to operate (e.g., domain, trackable slots, values, types of information, etc.). The proxy belief state is directly used to interact with the KB in a multi-turn fashion.
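A minimal sketch of this step is given below. The prompt wording merely stands in for $P_{BS}$ (the actual template is in Appendix D), and `call_llm` is a hypothetical wrapper around any instruction-tuned LLM API.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-tuned LLM API."""
    raise NotImplementedError  # e.g., a chat-completion call

def proxy_belief_state(dialogue: str) -> str:
    # P_BS(D_i): no ontology and no slot list; the proxy belief state is
    # free-form natural language summarizing what the user currently wants.
    prompt = (
        "In one or two sentences, state everything the user is currently "
        "looking for in the dialogue below.\n"
        f"Dialogue:\n{dialogue}\n"
        "The user is looking for:"
    )
    return call_llm(prompt)

# e.g., "A cheap Italian restaurant in the centre, with a table for two
#        on Friday at 18:00."
```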

KB Interaction
To interact with the KB, we generate an action thought $A = P_{act}(\hat{B}_i, (a_i)_{i=0}^{k})$, where $P_{act}(\cdot)$ is the template for action generation and $(a_i)_{i=0}^{k}$ are the attributes of the KB. By providing the existing attributes of the KB at this step, we ground the LLM to accurately translate the belief state into information that can be queried from the KB, while filtering out unnecessary data. The action thought serves as an intermediary for leveraging the code generation ability of the LLM, which generates a query $Q = P_{KB}(A, K)$, where $P_{KB}(\cdot)$ is the template for code generation. The output from the KB is then parsed by the LLM to extract the relevant information, denoted $I$, presented in natural language as a summary of the KB interaction. The LLM also determines whether the current action thought has been fulfilled. If it remains unanswered, a new action thought is generated based on the extracted information, and the process repeats until a stopping criterion is reached indicating that no relevant knowledge is found in the KB.
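The loop below sketches this interaction under the same assumptions as the previous snippet: `call_llm` is the hypothetical LLM wrapper, `execute_query` stands for a sandboxed executor for the generated code (e.g., over a pandas DataFrame), and the prompt wordings and step budget are illustrative, not the actual templates $P_{act}$ and $P_{KB}$.

```python
def execute_query(code: str, kb) -> str:
    """Hypothetical sandboxed executor for LLM-generated query code."""
    raise NotImplementedError

def kb_interaction(belief_state: str, kb, attributes: list[str],
                   max_steps: int = 5) -> str:
    """Iterate action thought -> code query -> parsed information until the
    belief state is answered or a stopping criterion is reached."""
    findings = "none yet"
    for _ in range(max_steps):
        # Action thought A = P_act(B_i, (a_i)): ground the free-form belief
        # state in the KB's actual attributes so that it becomes queryable.
        action = call_llm(
            f"KB attributes: {attributes}. The user wants: {belief_state}. "
            f"Findings so far: {findings}. State the next single lookup, "
            "or answer DONE if nothing more can be retrieved."
        )
        if action.strip().startswith("DONE"):
            break
        # Q = P_KB(A, K): exploit the LLM's code generation ability, e.g.,
        # by asking for a pandas expression over the KB.
        query = call_llm(
            f"Write one pandas expression over a DataFrame `kb` with "
            f"columns {attributes} that retrieves: {action}"
        )
        result = execute_query(query, kb)
        # Parse the raw KB output into natural-language information I.
        findings = call_llm(
            f"Summarize the following query result for '{action}': {result}"
        )
    return findings
```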
Response Generation Once the KB interaction concludes, the final information, together with the original dialogue context, is passed to the model to generate the response $Y = P_{RG}(I, D_i)$, where $P_{RG}$ represents the response generation template and $I$ the final information from the KB interaction. In the case where no knowledge is found in the KB, the LLM prompts the user to provide additional information. We provide the prompt templates in Appendix D. A depiction of how the InstructTODS framework works is presented in Figure 2.
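Under the same assumptions as the earlier sketches, this final step could look as follows; the fallback instruction mirrors the behavior described above, while the actual template appears in Appendix D.

```python
def generate_response(final_info: str, dialogue: str) -> str:
    # Y = P_RG(I, D_i): condition on the distilled information I rather than
    # the full KB, keeping the prompt short and grounded.
    prompt = (
        f"Dialogue so far:\n{dialogue}\n"
        f"Relevant information retrieved from the database: {final_info}\n"
        "If no relevant information was found, ask the user for additional "
        "details; otherwise, reply helpfully as the system:"
    )
    return call_llm(prompt)  # `call_llm` as in the earlier sketches
```

Chaining the three sketches gives the full loop: `proxy_belief_state` produces $\hat{B}_i$, `kb_interaction` turns it into the extracted information $I$, and `generate_response` yields the system turn.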

Experiment Settings
Baselines Our framework is compared to other end-to-end unified TODS approaches that perform end-to-end TODS through a single generalized text-to-text model: SOLOIST, UBAR, AUGPT, GALAXY, and PPTOD. In addition, as described in §3, we add the naive version of the LLM response generation approach that is fed the full KB ($\mathrm{RG}_{naive}$) as an additional baseline to better evaluate the effectiveness of our framework.
Datasets We evaluate the end-to-end zero-shot capability on MultiWOZ 2.1 (MWOZ) (Eric et al., 2020). We split the evaluation into two settings, i.e., single-domain and multi-domain, where we show the capability of LLMs to tackle increasingly complex TODS tasks in zero-shot end-to-end settings.
Automatic Evaluation To evaluate the end-to-end framework, we measure the per-domain Inform and Success rates, as well as BLEU (Papineni et al., 2002) and the overall Inform and Success rates (Eric et al., 2020) across all domains. The metrics are computed on delexicalized responses to avoid favoring models that provide more information than others and to focus solely on the vocabulary used for response generation. Additionally, we incorporate an automatic human-likability score, USL-H (Phy et al., 2020).
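For intuition, a heavily simplified reading of Inform and Success on delexicalized output could look like the sketch below; the official MultiWOZ evaluation script matches offered entities against the database and the user goal, so this is only an approximation of its logic, not the evaluator we use.

```python
def inform_success(delex_turns: list[str], goal_entity_slots: list[str],
                   requested_slots: list[str]) -> tuple[bool, bool]:
    # Simplified intuition only: Inform checks that an entity matching the
    # user's constraints is offered; Success additionally checks that every
    # user-requested slot appears somewhere in the delexicalized responses.
    text = " ".join(delex_turns)
    inform = all(f"[value_{slot}]" in text for slot in goal_entity_slots)
    success = inform and all(f"[value_{slot}]" in text
                             for slot in requested_slots)
    return inform, success

# inform, success = inform_success(
#     ["[value_name] is a cheap restaurant in the centre.",
#      "Its phone number is [value_phone]."],
#     goal_entity_slots=["name"], requested_slots=["phone"])
```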

Human Evaluation
We conduct an extensive human evaluation to measure the capability of LLMs in conducting zero-shot end-to-end unified TOD. Specifically, we conduct two human evaluations, which measure: 1) the informativeness, helpfulness, and humanness of the generated responses, and 2) the information correctness and hallucination rate of InstructTODS. For informativeness, helpfulness, and humanness, we ask 3 annotators to rate the quality of the responses on a 4-point Likert scale (see Appendix B). The system is helpful if it answers the user's request while pushing the conversation towards goal completion, informative if it provides enough related information while answering the user, and human-like if the generated answer is fluent and human-preferred.
The incorrectness and hallucination rates are evaluated by a single TOD expert, who manually checks the ratio of correct, incorrect, and hallucinated entities in the generated responses. We conduct this human evaluation on 50 generated responses from all the models and the gold responses.

Automatic Evaluation
Our automatic evaluation results are shown in Table 4. In general, we find a trend similar to the modular setting: LLMs produce lower BLEU scores (∼4 BLEU against ∼15 BLEU) with competitive Inform and Success rates compared to the other end-to-end unified TODS baselines. Note that, as mentioned in §2.3, LLMs often generate responses that are entirely different from the gold knowledge, hence the low automatic evaluation scores. Nevertheless, these low scores do not sufficiently reflect the capability of InstructTODS. We elaborate further in §5.2, raising the question of whether evaluating TODS quality with only a single gold response is sufficient. Comparative generation samples from the different models can be found in Appendix C.

Informativeness, Helpfulness, and Humanness of InstructTODS
The results of our human evaluation are shown in Figure 3 for InstructTODS in comparison with the naive approach, the gold responses, and the two best-performing baselines in task completion (i.e., GALAXY and PPTOD). The results show that InstructTODS is more informative, helpful, and human-like than the two fine-tuned end-to-end baselines by a noticeable margin. For both helpfulness and humanness, InstructTODS also outperforms $\mathrm{RG}_{naive}$ and the gold responses. In line with these results, the responses generated by our framework also obtain higher humanness scores, as shown in Figure 4, even higher than the gold responses. $\mathrm{RG}_{naive}$ is the most informative, which is expected as the model processes the entire KB; however, the quality of the information differs greatly, as shown in §5.3.

Incorrectness and Hallucination
We show the incorrectness and hallucination results for the LLM-generated responses in Figure 5. While a sample can be incorrect, e.g., if the LLM database interaction fails, the LLM does not necessarily generate unfaithful information. InstructTODS is more robust than naively employing the LLM, improving correctness by 15% and showing a hallucination rate of 11%, half that of $\mathrm{RG}_{naive}$. We observe that some types of information are more prone to hallucination, notably times and addresses. This bias towards temporal and spatial information aligns with our observations of LLM performance on DST (§2.3).

LLMs on Multi-Domain TOD
While it is possible to use InstructTODS in multi-domain settings with a distinct KB per domain, as we see in Figure 6, the performance degrades quickly for Success, and slightly less so for Inform, as the number of domains increases. Fine-tuned end-to-end baselines operate with only one KB at a given turn by tracking the active domain through either state changes (Peng et al., 2021; Yang et al., 2021) or slot names (Kulhánek et al., 2021); in contrast, our zero-shot framework does not assume any external knowledge or ontology information. As such, all KBs are provided at each turn, and because the attributes of different KBs overlap in MWOZ, InstructTODS often queries incompatible pieces of information from the proxy belief state (e.g., "food" and "destination" at the same time) that belong to different KBs. Hence, the multi-domain degradation is largely due to KB interaction failure.
6 Related Work
More recent end-to-end TODS tackle response generation as a single sequence prediction problem (Hosseini-Asl et al., 2020a; Yang et al., 2021; Peng et al., 2021) with an autoregressive model. These approaches still mostly leverage TOD data (belief states, system acts, etc.) during generation. As general pre-trained LMs were shown to be effective for TODS (Mehri et al., 2019; Lubis et al., 2020; Lin et al., 2020), several subsequent works have explored pre-training approaches directly tailored towards TODS (Zhang et al., 2020b; Su et al., 2022; He et al., 2022). To the best of our knowledge, prior works require a structured format of dialogue states, system acts, and/or template responses, whereas InstructTODS alleviates such needs by incorporating an unstructured proxy belief state, which requires no domain-specific knowledge nor ontology to operate, allowing zero-shot adaptation to various TOD domains.
LLMs for TODS Recent works explore the applicability of LLMs to solving modular TOD tasks (Bang et al., 2023; Hudeček and Dušek, 2023) and in a pipelined manner (Hosseini-Asl et al., 2020b; Su et al., 2022; Peng et al., 2021; Yang et al., 2021; Kulhánek et al., 2021; He et al., 2022). Additionally, Bang et al. (2023) inspect ChatGPT's capability for zero-shot end-to-end TODS; however, their evaluation is limited to only ∼1% of the available test set. To the best of our knowledge, our work is therefore the first to comprehensively study the utilization of LLMs for zero-shot end-to-end TODS.

Conclusion
In this paper, we introduce InstructTODS, an off-the-shelf framework for effectively performing end-to-end TODS in zero-shot settings using LLMs. We compare InstructTODS to several state-of-the-art fully fine-tuned end-to-end TODS and show that InstructTODS guides the conversation towards goal completion on MWOZ similarly to the fine-tuned systems, while generating answers that are more informative, helpful, and human-like than previous approaches. Furthermore, we investigate the capability of LLMs to perform various TOD subtasks in zero-shot settings, demonstrating better diversity and human preference in response generation, and state-of-the-art zero-shot results on dialogue state tracking and intent classification.

Limitation
Generalization to Other Datasets In this work, we only assess the effectiveness of InstructTODS on the MultiWOZ 2.1 dataset, whose size is an order of magnitude larger than other TODS datasets (Eric et al., 2020). We conjecture that generalization to other datasets will follow the trend described in §5, where InstructTODS excels in the single-domain setting while still struggling in the multi-domain setting. We expect future work to extend the assessment of InstructTODS to other datasets and domains.
Generalization to Other Languages In recent years, various task-oriented dialogue systems in languages other than English have been introduced, such as CrossWoZ (Zhu et al., 2020), BiTOD (Lin et al., 2021b), GlobalWoZ (Ding et al., 2022), and COD (Majewska et al., 2023). As suggested by prior works evaluating LLMs in low-resource languages (Bang et al., 2023; Asai et al., 2023; Cahyawijaya et al., 2023b,a; Workshop et al., 2023; Kabra et al., 2023; Zhang et al., 2023), we conjecture that performance in other languages will follow the general trend in LLMs, where performance in low-resource languages is lower than in high-resource languages. Future work might explore and further extend methods for improving the generalization of InstructTODS to other languages.
Generalization to Other LLMs In this work, we only explore two proprietary LLMs that display strong performance on various NLP tasks, i.e., GPT-3.5 and GPT-4. Despite the lack of transparency of these models, we expect that once other publicly available LLMs achieve the same level of performance, a similar capability for zero-shot end-to-end TODS will emerge. We expect future work to explore the generalization of InstructTODS and its improvement with other LLMs.

Ethics Statement
Our research endeavors to develop an off-the-shelf framework for zero-shot end-to-end Task-Oriented Dialogue Systems (TODS) using Large Language Models (LLMs). This study does not involve the use of any sensitive data, and the experimental evaluation is conducted on publicly available datasets. To ensure the quality of our results, we employed crowdsourcing for the human evaluation of the generated dialogue responses. While our study does not raise ethical concerns regarding privacy, confidentiality, or bias, we recognize that the use of LLMs in dialogue systems may have ethical implications related to potential biases in the training data and the generated responses. Therefore, we emphasize the importance of ongoing research toward developing ethical guidelines and best practices for the use of LLMs in dialogue systems. In line with our commitment to transparency and reproducibility, we release our code publicly. We believe that this will encourage open and collaborative research towards the development of more ethical and effective dialogue systems.

A Comparison of LLMs over Various NLP Tasks
We show the performance comparison of various LLMs on both NLU and NLG tasks in Figure A1. The data are collected from various prior works focusing on benchmarking the capabilities of LLMs (Bang et al., 2023; Cahyawijaya et al., 2023a; Anonymous, 2023; Asai et al., 2023; OpenAI, 2023; Wu et al., 2023).

B Human Evaluation
We give additional details concerning the human evaluation in this section. The instructions given to the evaluators for each metric are defined as follows:

Informativeness The amount of information that the system provides while answering the user's utterance.
1. The response has no information at all.
2. The response provides at least one piece of information, but clearly not enough.
3. The response provides several pieces of information, but more could be provided.
4. The response gives all the information you would expect in that turn.

Helpfulness The system answers the user's utterance and pushes the conversation towards completion.
1. The response is doing neither.
2. The response is just pushing the conversation towards completion without answering the question.
3. The response is just answering the question but not pushing the conversation towards completion.
4. The response is doing both.

Humanness The system answers the user's utterance in a human-like manner.
1. The response is completely machine-like.
2. The response contains human-like acknowledgement, but still sounds machine-like.
3. The response has human acknowledgement and you could imagine it being said by a human.
4. The response is completely indiscernible from a human answer.

C Generation Samples
We show a few generation samples of InstructTODS as well as of the modular evaluation of TODS subtasks. In Table A1, we show the delexicalized responses of the fine-tuned baselines compared to the generations of InstructTODS. For the modular subtasks, we show a sample of correct predictions as well as different failure cases: the samples for dialogue state tracking are in Table A2, for intent classification in Table A3, and for response generation in Table A4.

D Prompts
In Table A5, we give the templates and samples of the prompts used in $\mathrm{RG}_{naive}$, and, in the end-to-end setting, for the proxy belief state, the initial knowledge base interaction, and the response generation using the extracted information.

Table A1: Samples of different system responses for the proposed E2E framework in comparison with the gold response. InstructTODS's responses are delexicalized after generation, while the fine-tuned systems generate the delexicalized response directly.

Figure 2: Overview of InstructTODS, a framework that utilizes LLMs for zero-shot end-to-end task-oriented dialogue.

Figure 4: InstructTODS obtains higher human preference scores than the gold responses and the baselines.

Figure 5: Human evaluation of correctness, incorrectness, and hallucination for $\mathrm{RG}_{naive}$ and InstructTODS.

Figure 6: End-to-end TODS performance degrades as the number of active domains in the dialogue increases.


Table 1: Performance comparison on intent classification. LLMs outperform most baselines in our benchmark. The best performances in each section are in bold. *Out-of-scope intent of CLINC150.

Table 2: Performance comparison on the zero-shot DST benchmark. LLMs outperform all baselines in our benchmark.

Table 3: Performance comparison on response generation. Although lower in BLEU, responses by the LLM-powered modular TODS are more human-preferred. The reported results for the baselines are taken from their respective works. The best performances in each group are in bold.
Table A1 (excerpt):

… [value_count] [value_price range] [value_type] available. Do you have a preference for the area?
GALAXY: [value_name] meets your criteria. Would you like me to book you a room?
PPTOD: There are [value_choice] [value_price] [value_type] with free parking. Do you have a preference for the area?
InstructTODS: Great, I found [value_name] located at [value_address] [value_postcode]. It has free parking and falls within the [value_pricerange] price range. Would you like me to confirm a booking for you?

USER: Yes, please. I need a table for 8 on Friday at 17:30, please.
SYSTEM: Ok, you are all set for cote on Friday, table for 8 at 17:30. Can I help with anything else?
USER: Can I have the reference number for the reservation please?
SYSTEM: Booking was unsuccessful. Can you try another time slot?
USER: What about 16:30?

Gold: One moment while I try to make the reservation of table for 8, Friday at 16:30.
SOLOIST: I have made that reservation for you. Your reference number is [value_reference]. Is there anything else I can help you with?
UBAR: I was able to book your table for [value_day] at [value_time]. Your reference number is [value_reference].
AUGPT: Yes, the reference number is [value_reference].
GALAXY: I am sorry, but I was unable to book that for you. Would you like to find another restaurant?
PPTOD: Your reference number is [value_reference]. Is there anything else I can help you with?
InstructTODS: Sure, your reference number for the reservation at [value_name] on [value_day] at [value_time] for a table of [value_people] is [value_reference]. Is there anything else I can assist you with?