AnyTOD: A Programmable Task-Oriented Dialog System

We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training. We view TOD as a program executed by a language model (LM), where program logic and ontology are provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach: a neural LM keeps track of events occurring during a conversation, and a symbolic program implementing the dialog policy is executed to recommend next actions AnyTOD should take. This approach drastically reduces data annotation and model training requirements, addressing the enduring challenge of rapidly adapting a TOD system to unseen tasks and domains. We demonstrate state-of-the-art results on the STAR, ABCD, and SGD benchmarks. We also demonstrate strong zero-shot transfer ability in low-resource settings, such as zero-shot transfer onto MultiWOZ. In addition, we release STARv2, an updated version of the STAR dataset with richer annotations, for benchmarking zero-shot end-to-end TOD models.


Introduction
An enduring challenge in building and maintaining task-oriented dialog (TOD) systems is efficiently adapting to a new task or domain. For instance, adding the ability to book flight tickets to an existing system that can only book train tickets requires manual data collection and labeling of new conversations about flight booking, as well as retraining of the natural language understanding (NLU) and policy models. These data efficiency and scaling problems compound for multi-task TOD systems, as each task may have its own bespoke ontology and policy.
To tackle this problem, we propose ANYTOD, an end-to-end TOD system that can be programmed to adapt to unseen tasks or domains without prior training, significantly reducing the data collection and training requirements for enabling new TOD tasks. To the best of our knowledge, ANYTOD is the first end-to-end TOD system capable of zero-shot transfer. To this end, we view TOD as a program that a language model (LM) must execute throughout a conversation, and can rely on for guidance. ANYTOD can be controlled by any predefined task policy implemented as a program, allowing arbitrary business logic to be executed for a specific task. To demonstrate the efficacy of this paradigm, we experiment with the STAR (Mehri and Eskenazi, 2021), ABCD (Chen et al., 2021), SGD (Rastogi et al., 2020), and MultiWOZ (Eric et al., 2019) benchmarks. We show that ANYTOD achieves state-of-the-art results in both full-shot and zero-shot transfer settings.
Overview of ANYTOD To adhere to a given program, ANYTOD adopts a neuro-symbolic approach (Figure 1). A neural LM is trained for dialog state tracking (DST) and action state tracking (AST), abstracting both states and actions into a sequence of symbols. To support zero-shot task adaptation, we follow the schema-guided paradigm advocated by Rastogi et al. (2020), which provides a schema to the LM as contextual information, describing in natural language all parameters and actions that should be tracked. By training on a large corpus of diverse schemas, the LM generalizes to arbitrary and unseen schemas (Lee et al., 2021; Zhao et al., 2022). A schema also provides a symbolic program that declares the task logic; this program is executed to recommend possible next actions the agent can take, conditioned on the current dialog states. These recommendations are then reincorporated into the LM, which selects a single next action prediction (NAP) and generates a response. Note that the symbolic program forces ANYTOD to consider a dialog policy explicitly, driving zero-shot transfer onto unseen policies and allowing arbitrarily complex business logic to be employed. However, the program's recommendations are only guidelines, and it is up to the LM to make the final decision on the NAP.

STARV2
We also introduce STARV2, an improved version of the STAR dataset (Mosig et al., 2020). The original STAR dataset is very valuable for benchmarking zero-shot dialog policy and NAP across a diverse set of tasks and domains, by following a provided policy graph that outlines the intended flow of a conversation. However, the original dataset made following these policy graphs difficult due to its lack of training data for DST and AST. Moreover, we found that the schema entity descriptions provided by the original dataset were not intuitive enough to truly support zero-shot DST and AST. To resolve these limitations, the STARV2 dataset adds new belief state and action state annotations to the STAR dataset, as well as more intuitive natural language descriptions for many schema elements. In Section 4.2, we show that these changes facilitate stronger zero-shot DST and AST. However, the ground truth NAP on each system turn is left untouched, allowing direct comparison to results trained on the original STAR dataset. We hope that STARV2 can serve as a new benchmark for TOD systems and drive further research on zero-shot TOD.

Related Work
TOD and adaptation onto unseen tasks Fueled by the difficulty of adapting existing TOD systems to new tasks/domains, TOD systems capable of zero-shot adaptation onto unseen tasks or domains have recently seen increasing interest. Much of this work has been on DST, with the primary approach being characterizing parameters through names (Wu et al., 2019) or descriptions (Lin et al., 2021; Lee et al., 2021; Zhao et al., 2022). Another approach has been in-context finetuning (Shah et al., 2019; Gupta et al., 2022), in which a labeled exemplar conversation is given. Mi et al. (2021) demonstrated a more comprehensive approach, including task instructions, constraints, and prompts. In general, these results follow the schema-guided paradigm advocated by Rastogi et al. (2020); Mosig et al. (2020).
By contrast, there are fewer results on task adaptation onto unseen dialog policies (AST and NAP). To the best of our knowledge, the only such result is SAM (Mehri and Eskenazi, 2021), which aligns an LM to an unseen dialog policy by following an explicit policy graph. While similar to the policy graph execution we demonstrate in ANYTOD, there are two differences. First, SAM lacks supervised training on DST and AST and relies on ground truth NAP only, forcing user state and action tracking to be inextricably linked with the NAP and hurting its ability to generalize to arbitrary policy graphs. Second, SAM is a classification model limited to NAP and, unlike ANYTOD, cannot support DST or natural language generation (NLG). Indeed, we show that ANYTOD is empirically more powerful than SAM in Section 4.2.
To our knowledge, no prior method combines zero-shot task adaptation for DST, AST, and NAP into an end-to-end TOD system. All existing end-to-end TOD systems (Hosseini-Asl et al., 2020; He et al., 2021; Yang et al., 2020; Peng et al., 2020) are trained and evaluated on the popular MultiWOZ dataset (Eric et al., 2019). As a result, these systems are only aware of the policy for MultiWOZ and are not robust to arbitrary/unseen policies. In contrast, AnyTOD can generalize to arbitrary policies, and we demonstrate strong performance on MultiWOZ without prior training (Section 4.4).
TOD as Programming Historically, most TOD approaches use an explicit plan-based dialog policy module (Rich and Sidner, 1998; Ferguson and Allen, 1998; Bohus and Rudnicky, 2009). However, the NLU models powering these TOD systems are tightly coupled to a specific plan and must be retrained for even slight changes to the plan. In contrast, ANYTOD enables zero-shot dialog policy by training NLU models to be robust to arbitrary programs as policies. Further, ANYTOD uses the program as contextual information to NLU and refines its NAP with respect to the conversation, belief state, and action history instead of simply accepting the plan's dictated next action(s).
Recent work has also focused on discovering structure within conversations, i.e., a latent schema, policy graph, or program (Shi et al., 2019; Yu et al., 2022; Xu et al., 2020). Notably, SMCalFlow (Semantic Machines et al., 2020) constructs "dataflow graphs" from a conversation, parsing semantic intents into executable programs. Cheng et al. (2020); Shin et al. (2021) further explore this setup. However, these aim to manipulate an external API/database instead of controlling the agent's behavior.
Beyond the scope of TOD, there has been some work on general neuro-symbolic programming with LMs, in which an LM is influenced by the results of a symbolic system. Nye et al. (2021) demonstrated a symbolic reasoning module that accepts or rejects the logical consistency of generations from a neural LM. Lu et al. (2020) explored using predicate logic constraints to control lexical aspects of an LM's generation. However, ANYTOD is the first application of such an approach to a practical TOD setting.

The ANYTOD System
An overview of the ANYTOD system is presented in Fig. 1. We decompose ANYTOD into three steps, and describe each step in detail below:
1. Schema and program construction: A chatbot designer constructs an ANYTOD schema describing the ontology of a specific task, and a policy graph that declares the task logic.
2. Schema-guided DST and AST: An LM performs DST and AST with reference to the schema, capable of transfer onto unseen tasks without task-specific training.
3. Program execution and NAP: Given the predicted DST and AST, we execute the schema program, which recommends possible NAPs to the LM. The LM then predicts the final system action(s) conditioned on these recommendations, the conversation history, and the belief states.
Schema Construction A chatbot designer constructs an ANYTOD schema describing the ontology of a specific task, and a policy graph that declares the task logic. This is the only thing ANYTOD requires from the designer. For instance, suppose the designer is creating a flight booking chatbot. They should define the parameters to be tracked (e.g. "flight id", "name of the airline"), and enumerate the possible actions the user and agent can take ("user saying they would like to search for flights", "agent should query flight booking api"). Following the schema-guided paradigm (Rastogi et al., 2020), each element in this schema is characterized by a short natural language description, allowing the LM to understand its meaning and facilitating zero-shot transfer.
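As an illustration, such a schema could be represented as a small data structure. This is a sketch of the idea, not the paper's actual implementation; all class and element names below are our own:

```python
from dataclasses import dataclass

@dataclass
class SchemaElement:
    """A parameter or action, characterized by a natural language description."""
    name: str
    desc: str

@dataclass
class Schema:
    """Hypothetical container for an AnyTOD task schema."""
    params: list
    user_actions: list
    system_actions: list

# Flight-booking example from the text above
flight_schema = Schema(
    params=[SchemaElement("flight_id", "unique id of the flight to book"),
            SchemaElement("airline", "name of the airline")],
    user_actions=[SchemaElement("user_search",
                                "user saying they would like to search for flights")],
    system_actions=[SchemaElement("sys_query",
                                  "agent should query flight booking api")],
)
```

The natural language `desc` fields are what the LM "reads" to generalize to a schema it has never seen during training.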
The schema should also include a program, which can be viewed as a function of the predicted belief states and actions that dictates possible NAPs through explicit symbolic rules. Examples can be seen in Section A.2. At a high level, this program should describe agent actions in response to user behavior (e.g. "if user wants to search for flights, query the flight search api").
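A minimal sketch of such a program for the flight booking example might look as follows. The action names and state format are hypothetical, chosen only to mirror the rules quoted above:

```python
def flight_policy(belief_state, act_hist):
    """Hypothetical policy program: maps the predicted belief state and the
    action history (a list of per-turn active action lists) to a list of
    recommended system action indices."""
    recs = []
    last_user_acts = act_hist[-1] if act_hist else []
    # "if user wants to search for flights, query the flight search api"
    if "user_search" in last_user_acts:
        if "departure" not in belief_state:
            recs.append("ask_departure")   # a required slot is missing: ask for it
        else:
            recs.append("query_flights")   # all required slots known: call the api
    return recs

# user asked to search, but never gave a departure city
recs = flight_policy({"destination": "Dubai", "airline": "Emirates"},
                     [["user_search"]])
```

Because the program only *recommends* actions, arbitrarily complex business logic can live here without retraining the LM.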
Schema-guided DST and AST Adaptation to novel tasks without training data critically hinges on an LM performing zero-shot DST and AST.
For this purpose, we adopt and extend the D3ST approach (Zhao et al., 2022), which we briefly describe here. Let p0, ..., pn be the parameters defined in the schema, and let desc(pi) denote a parameter's natural language description. We construct a parameter context string

[params] p0=desc(p0) ... pn=desc(pn)

Note that the strings p0, ..., pn are used as indices. Similar context strings are generated for actions for AST. These context strings are concatenated with the entire conversation history, forming the input to the LM. This input is contextualized by the schema information, allowing the LM to refer to the schema and enabling zero-shot transfer. The target string contains the conversation belief state and the history of actions at each turn of the conversation, both in a parseable format. Let pi0, ..., pim be the active parameters in the conversation, with corresponding values vi0, ..., vim. The belief state is then represented as

[state] pi0=vi0 ... pim=vim

Note that inactive slots do not appear in the belief state string. While D3ST's original formulation is limited to DST, in principle it supports tracking arbitrary events that occur during a conversation, as long as their descriptions are provided.
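The context and target string construction described above can be sketched as follows; the function names are ours, not from the D3ST codebase:

```python
def build_param_context(descs):
    """Build a D3ST-style input context: [params] p0=<desc0> p1=<desc1> ...
    The indices p0, p1, ... stand in for the schema parameters."""
    return "[params] " + " ".join(f"p{i}={d}" for i, d in enumerate(descs))

def build_state_target(active):
    """Build the target belief-state string, listing only active slots as
    (index, value) pairs: [state] p0=<value> ..."""
    return "[state] " + " ".join(f"p{i}={v}" for i, v in active)

ctx = build_param_context(["city the user flies to", "name of the airline"])
state = build_state_target([(0, "Dubai"), (1, "Emirates")])
```

Because the indices, not the parameter names, appear in the target, the LM must ground its prediction in the descriptions given in the context string, which is what enables transfer to unseen schemas.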
For ANYTOD, this approach is extended to perform schema-guided AST, in which we provide an action context string as contextual input, listing the user and system actions. We also build a target string consisting of a history of the actions that were active at each turn of the conversation. Let uj and sk be the D3ST indices for user and system actions. Then, an action history string may look like

[history] u0 u9; s2; u1; s3; ...

This denotes that, on the first turn, the user was performing user actions u0 and u9. On the second turn, the system was performing system action s2, and so on. Note that the active actions for each turn are separated by a ; character.
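The encoding of the action history can be sketched in a few lines (again, the function name is our own):

```python
def build_action_history(turns):
    """Encode per-turn active actions as a target history string.
    Each turn is a list of action indices such as 'u0' (user) or 's2' (system);
    actions within a turn are space-separated, turns are ';'-separated."""
    return "[history] " + "; ".join(" ".join(turn) for turn in turns)

hist = build_action_history([["u0", "u9"], ["s2"], ["u1"], ["s3"]])
```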
Program Execution The LM's predicted DST and AST are then parsed and passed to the schema program. This program executes the dialog policy and controls ANYTOD by recommending possible NAPs. Section A.2 shows some example programs for STARV2. In the example shown in Figure 1, the current conversation state ("user would like to search for flights to Dubai with Emirates") satisfies multiple dependency rules ("since the user would like to search for flights, query the flight search api" and "since the user has not provided their flight departure location, ask the user for it"). These system actions are then passed back to the LM as a string of system action indices:
[recommend] s0 s2

Finally, given the policy graph's recommended actions as extra conditional information, the LM predicts the NAP with respect to the conversation, the previously predicted belief states, and the action history. A response is also generated following the action prediction:
[selected] s2 [response] hello!

Note that the selected action need not be one of the actions recommended by the program, as actual conversations may not rigorously follow the predefined business logic. Indeed, violations like this are common within the STAR dataset. This step allows ANYTOD to "softly" execute the policy graph, balancing between the model's belief before and after receiving recommendations.
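Putting the pieces together, one system turn could be sketched as below. This is a schematic of the data flow only; `lm_generate` stands in for the seq2seq model's decode call, and the string formats follow the examples above:

```python
def anytod_turn(lm_generate, program, belief_state, act_hist, base_input):
    """One AnyTOD-style system turn: execute the symbolic policy program,
    append its recommendations to the LM input, then parse the LM's final
    action choice and generated response."""
    recs = program(belief_state, act_hist)              # symbolic recommendation
    lm_input = base_input + " [recommend] " + " ".join(recs)
    output = lm_generate(lm_input)
    selected = output.split("[selected]")[1].split("[response]")[0].strip()
    response = output.split("[response]")[1].strip()
    return selected, response

# stub LM that picks s2 regardless of what was recommended ("soft" execution)
fake_lm = lambda _: "[selected] s2 [response] hello!"
action, reply = anytod_turn(fake_lm, lambda bs, ah: ["s0", "s2"], {}, [],
                            "<conversation so far>")
```

The stub illustrates the "soft" execution point above: the LM sees the recommendations in its input but is free to select any action.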
Zero-shot adaptation onto unseen tasks ANYTOD's zero-shot transfer ability is enabled by a combination of two design considerations. The first is the LM's description-driven DST and AST. Since the schema information is provided as context, an LM trained on a corpus of diverse schemas learns to make predictions by "reading" and understanding the schema descriptions. This makes ANYTOD's state and event tracking robust on unseen schemas, as shown by D3ST (Zhao et al., 2022). Second, ANYTOD facilitates zero-shot policy transfer by executing the provided policy graphs as explicit programs, and by similarly training the LM on a large number of diverse policy graphs when considering recommended NAPs.

The STARV2 Dataset
To train ANYTOD, we construct STARV2, an updated version of STAR with new ground truth belief state and action annotations, supporting supervised training on DST and AST. These annotations were generated by few-shot training with D3ST (Zhao et al., 2022): we first train D3ST on the SGD dataset, then continue finetuning on a few hand-labeled conversations from STAR.
SGD (Rastogi et al., 2020): SGD is another schema-guided dataset in which schema elements are provided with natural language descriptions to facilitate task transfer. It contains 45 domains and was generated via simulation; thus, the agent actions and responses follow pre-defined task logic.
MultiWOZ (Budzianowski et al., 2018b): MultiWOZ is the standard dataset for benchmarking TOD models. It contains 7 domains and was generated through Wizard-of-Oz (Kelley, 1984) data collection, leading to natural conversations.
Training Our implementation is based upon the open-source T5X codebase (Roberts et al., 2022), using the public T5 1.1 checkpoints as the LM backend. We augmented the LM to execute a schema program and reincorporate the results before making the final NAP, as described in Section 3.1. We experimented with two T5 sizes: base (250M parameters, trained on 16 TPUv3 chips (Jouppi et al., 2017)) and XXL (11B parameters, trained on 64 TPUv3 chips). We otherwise adopt the default T5X finetuning hyper-parameter settings throughout our experiments.

Results on STAR
Table 1 shows ANYTOD results on the STARV2 dataset in the full-shot and zero-shot task transfer settings, with both "happy" and "unhappy" conversations. In full-shot, models train on 80% of conversations across all tasks and evaluate on the remaining 20%. The zero-shot domain setting is a leave-one-out cross-validation across the STARV2 dataset's 13 domains, evaluating quality on an unseen schema in a completely novel domain. The following metrics are reported: joint goal accuracy (JGA) to measure DST, user action F1 (UaF1) to measure AST, system action F1 (SaF1) to measure NAP, and response BLEU. Each STAR task schema defines the intended dialog policy by providing a policy graph, where nodes describe conversation actions and edges connect subsequent actions. The -SGD results in Table 1 show that at BASE, SGD multitask training improves both DST (61.9 → 66.1 JGA) and AST (72.1 → 74.3 UaF1), and by extension NAP (60.6 → 61.3 SaF1). A similar but smaller improvement is seen on XXL, suggesting that the larger LM may not need more diverse training owing to its better language understanding.
Complex Program Logic STARV2 is also a good testbed for complex zero-shot task adaptation, as it includes some tasks that are more complex than simple policy-graph following, specifically the bank, trivia, and trip domains. For instance, the trivia task requires the agent to ask the user a trivia question and extract their answer. Different system actions must be taken by the agent depending on whether or not the user's answer is correct. This logic is not captured by the provided policy graph alone, requiring more complex logic. ANYTOD is suitable for this problem, as we need only construct a program implementing this logic. These programs are shown in Section A.2.
We report results with these programs in Table 1 under the -PROG name. There is a clear win on zero-shot domain SaF1 when averaged over all domains, with a very high 70.7 SaF1 on -PROG+SGD XXL, narrowing the gap with the full-shot 75.4 SaF1. When examining the complex tasks individually (Table 1c), the win on NAP is even more apparent. The only exception is AT XXL on trivia, which shows little difference with or without the program. In general, however, the guidance provided by this specialized program is necessary for higher-level logic in the dialog policy, since the policy graph does not specify enough information to approach the task zero-shot.

Results on ABCD and SGD
We conduct similar experiments for AST on the ABCD (Chen et al., 2021) dataset and for DST and NAP on the SGD (Rastogi et al., 2020) dataset. ABCD contains 10 flows, each describing the business logic for handling a customer request; the flows are relatively similar to each other. AST on ABCD is measured by joint action accuracy (JAA). We report full-shot results by training and evaluating on all flows, and zero-shot flow transfer results where the model is trained on one randomly sampled flow and evaluated on the other nine flows. The SGD test set consists of 21 services, 15 of which are not seen during training. The dataset is generated via simulation with a generalized policy graph (shared across all services) encoding dialog act transitions. The per-service policy graphs are then constructed by inserting intents and slots and, as a result, end up similar.
Tables 2 and 3 show ANYTOD results on SGD and ABCD, respectively (see Chen et al. (2021) for more details on the JAA metric). For both datasets, in both the full-shot and zero-shot setups, we generally see an improvement on action prediction when using policy guidance, achieving state-of-the-art results for ABCD. However, the gain is not as large as on STARV2, as the task policies are not as diverse.
Even without explicit policy guidance, features from different tasks in ABCD/SGD can transfer to each other. Notably, policy guidance helps more on the one-flow setup for ABCD and unseen services for SGD, further establishing the efficacy of policy guidance on unseen setups, even if related.

Zero-shot Transfer to MultiWOZ
To demonstrate ANYTOD's generalizability and robustness in zero-shot task adaptation, we present cross-dataset transfer results on the end-to-end MultiWOZ 2.2 (Zang et al., 2020) benchmark, a popular dataset for TOD research. We train ANYTOD-XXL on the SGD dataset and evaluate it on MultiWOZ zero-shot with a small policy program (Section A.7). Responses from ANYTOD were constructed using the template utterances from Kale and Rastogi (2020). We compare against SOLOIST (Peng et al., 2020) and Mars (Sun et al., 2022), two end-to-end TOD models directly trained on MultiWOZ with supervision. Results are shown in Table 5, with metrics reported by the MultiWOZ eval script (Nekvinda and Dusek, 2021). Although no training examples from MultiWOZ were used at all, ANYTOD demonstrates strong JGA, Inform, and Success comparable to results that do train on MultiWOZ. Note that since we applied templates for response generation, we do not consider BLEU to be important, as the responses are very different from the ground truth labels.

Impact of Policy Guidance
To see the value of the schema program's recommended NAP, we reevaluate already finetuned ANYTOD models on the STARV2 zero-shot task transfer setting, but with changes to the program recommendations during evaluation. First, to see how dependent ANYTOD is on policy graph guidance, we modify the graph to output no recommendations (denoted 0REC), forcing the model to predict the NAP using only the conversation, belief state, and action history. Second, we modify the graph to output deliberately bad recommendations (denoted BADREC), intended to trick the model into choosing an incorrect system action. This was done by randomly sampling 1-3 system actions other than the ground truth action.
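The BADREC corruption described above can be sketched as follows; this is our own illustration of the sampling procedure, not the paper's code:

```python
import random

def bad_recommendations(system_actions, ground_truth, k_max=3, rng=None):
    """Sample 1-3 system actions other than the ground truth action,
    mimicking the BADREC corruption described above."""
    rng = rng or random.Random(0)
    candidates = [a for a in system_actions if a != ground_truth]
    k = rng.randint(1, min(k_max, len(candidates)))
    return rng.sample(candidates, k)

recs = bad_recommendations(["s0", "s1", "s2", "s3"], ground_truth="s2")
```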
The major drops in SaF1 for both setups, shown in Table 4, confirm that the model, while able to predict actions without it, does weigh the policy guidance heavily. Notably, 75% and 83% of correct predictions for 0REC and BADREC respectively are actions common to all tasks, e.g. hello or query.
We conduct a similar "policy corruption" experiment on ABCD (Table 6), in which the policy graphs for evaluation tasks have a 0%, 40%, or 80% chance of being replaced by graphs from incorrect flows during evaluation. We see a consistent quality drop with increasing probability of corruption for both BASE and XXL.
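The graph-replacement step can be sketched like this (our own illustration of the corruption procedure, with hypothetical names):

```python
import random

def maybe_corrupt_graph(graph, other_graphs, p, rng=None):
    """With probability p, replace an evaluation task's policy graph with a
    graph from a different (incorrect) flow, as in the ABCD corruption study."""
    rng = rng or random.Random(0)
    if other_graphs and rng.random() < p:
        return rng.choice(other_graphs)
    return graph

g0 = maybe_corrupt_graph("correct_flow", ["wrong_a", "wrong_b"], p=0.0)
g1 = maybe_corrupt_graph("correct_flow", ["wrong_a", "wrong_b"], p=1.0)
```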

Error Analysis
We also analyze ANYTOD errors on STARV2. We classify all incorrect NAPs into three possible error categories: (1) System action error: the program recommends the correct system action, but this was not chosen by the LM; (2) Policy graph error: the predicted belief state and action history are correct, but the program's execution of the policy graph does not recommend the expected system action; and (3) State tracking error: the predicted belief states and action history are incorrect, which leads to incorrect recommendations from the policy graph. Results are shown in Figure 2. In general, we see that the benefit of scaling the LM from BASE to XXL comes from improvements to state and action tracking, which aligns with the better DST and AST results on XXL in Table 1.
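The three-way bucketing above amounts to checking the pipeline stage by stage; a small sketch of this decision procedure (our own formulation):

```python
def classify_nap_error(state_tracking_ok, program_recommends_gt, lm_chose_gt):
    """Bucket an incorrect NAP into the three error categories described above,
    checking the pipeline in order: state tracking -> program -> LM choice."""
    if not state_tracking_ok:
        return "state tracking error"   # bad DST/AST led the program astray
    if not program_recommends_gt:
        return "policy graph error"     # states correct, program missed the action
    if not lm_chose_gt:
        return "system action error"    # program was right, the LM overrode it
    return "correct"
```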

Conclusion
We proposed ANYTOD, an end-to-end TOD system that can be programmed to adapt to unseen tasks without domain-specific training. ANYTOD adopts a neuro-symbolic approach, in which an LM performs robust DST and AST with respect to a provided schema, abstracting both into a sequence of symbols. These symbol sequences are then parsed and passed to an explicit symbolic program implementing a task's dialog policy, which is executed to recommend the next agent action(s). Agent designers are free to implement arbitrarily complex business logic in this program, allowing ANYTOD to determine its policy on unseen tasks or domains. To demonstrate the value of this approach, we show state-of-the-art results on zero-shot transfer TOD benchmarks, such as STAR, ABCD, SGD, and MultiWOZ. For further training and benchmarking of zero-shot end-to-end TOD systems, we also release the STARV2 dataset. While generating free-form natural language responses is possible due to supervised training on ground truth system responses, there is no guarantee that these generated responses are robust on unseen schemas. We instead advocate that responses be built with deterministic templates predefined by agent designers.

Ethics Statement
Models, codebases, and datasets used in this paper follow their respective licenses and terms of use. Moreover, the task-oriented dialogue datasets used in this paper do not contain any personally-identifiable information or offensive content. The code for ANYTOD and the STARV2 dataset will be released upon this paper's publication.
One particular risk with language models is the possible generation of factually incorrect or biased content (Lin et al., 2022; Bender et al., 2021). However, we note that this risk does not apply to ANYTOD, as (1) the language model is trained to make structured predictions that must be parseable by the policy program, and (2) we rely on response templates rather than free-form natural language generation.

[Figure 1 :
Figure 1: An overview of the ANYTOD system. An LM conducts zero-shot state and action tracking with respect to a provided schema, abstracting the conversation into a sequence of symbols. A program that executes the dialog policy then recommends which actions to take based on the state sequence; the LM then chooses a single final action and generates a response.

[Figure A.1:
Figure A.1 contains the ANYTOD policy program used when evaluating on MultiWOZ. This policy program was handcrafted, and provides a simplified conversation flow based on our own understanding of the MultiWOZ dialog policies.

[Figure A.1:
Figure A.1: The ANYTOD program implementation for the zero-shot MultiWOZ policy program.
While not the focus of this paper, the labeling of STARV2 demonstrates the use of few-shot D3ST in labeling unlabeled conversations on new tasks/domains.
STAR and STARv2: As described in Section 3.2, we upgrade the original STAR (Mehri and Eskenazi, 2021) dataset to STARv2. The dataset has 24 tasks across 13 domains, with many tasks requiring the model to adhere to a novel policy, providing an important zero-shot AST and NAP benchmark.
ABCD (Chen et al., 2021): The design of the ABCD dataset follows a realistic setup, in which an agent's actions must balance the customer's expressed desires against the constraints set by task policies. It is thus a natural fit for the AnyTOD framework for both training and evaluation.

Table 1 :
Results on STARV2. For compactness we show just UaF1 and SaF1 here; see Section A.3 for a complete table. For clarity, we bold the SaF1 results for ANYTOD BASE/XXL, our key result.
ANYTOD-SGD, which jointly trains with SGD as a multitask training dataset. SGD includes a large number of tasks, each defined by a schema with highly diverse parameters and actions. The -SGD results in Table 1 show the effect of this multitask training at BASE.

Table 5 :
Results on the MultiWOZ end-to-end benchmark. ANYTOD-XXL is trained on SGD and evaluated in zero-shot transfer onto MultiWOZ. Note we applied templates for response generation, yielding low BLEU in comparison with other models.
# iterate through last turn's active user actions, result of AST
for last_useract in act_hist[-1]:
    # some transitions are common to all star graphs, but not explicit
    # if user is performing something out-of-scope, return out_of_scope
    ...
# if all required params are provided, we can query api
query_label = 'query' if 'query' in graph['replies'] else 'query_check'
if all(p.name in belief_state for p in api.params if p.required):
    ...