CESAR: Automatic Induction of Compositional Instructions for Multi-turn Dialogs

Instruction-based multitasking has played a critical role in the success of large language models (LLMs) in multi-turn dialog applications. While publicly available LLMs have shown promising performance, when exposed to complex instructions with multiple constraints they lag behind state-of-the-art models like ChatGPT. In this work, we hypothesize that the availability of large-scale complex demonstrations is crucial in bridging this gap. Focusing on dialog applications, we propose a novel framework, CESAR, that unifies a large number of dialog tasks in the same format and allows programmatic induction of complex instructions without any manual effort. We apply CESAR to InstructDial, a benchmark for instruction-based dialog tasks. We further enhance InstructDial with new datasets and tasks and utilize CESAR to induce complex tasks with compositional instructions. This results in a new benchmark called InstructDial++, which includes 63 datasets with 86 basic tasks and 68 composite tasks. Through rigorous experiments, we demonstrate the scalability of CESAR in providing rich instructions. Models trained on InstructDial++ can follow compositional prompts, such as prompts that ask for multiple stylistic constraints.


Introduction
Instruction tuning is a popular multi-tasking method for fine-tuning large language models (LLMs). In this setup, LLMs are trained over a range of tasks specified by instructions, which lets them generalize to new task descriptions with ease (Wei et al., 2021). Although instruction tuning enables individual task performance at scale, a language model's practical usage often requires high performance on compositions of these tasks. For example, in Fig. 1, the prompt requires the model's response to meet two control dimensions: (i) incorporating three keywords in the response, 'catchy', 'tune', and 'playful', and (ii) following the dialog act ask question. A large language model (LLM) may perform well at these dimensions individually. However, it may struggle to meet both requirements simultaneously, as it has not seen such a composition of constraints during the training process. Prior work has addressed this problem through novel architectures or prompting tricks (Peng et al., 2023; Ramakrishnan et al., 2022; Hu et al., 2022). However, despite the proven effectiveness of scaling up the number of tasks (Chung et al., 2022a), prior efforts, to the best of our knowledge, have yet to focus on scaling up compositional data during training.

Figure 1: An illustrative integration of compound tasks, namely keyword controlled generation and act grounded generation, into a more complex compositional one. These tasks are automatically merged, without human oversight, using CESAR.
One could handle complex instructions by introducing compositional tasks as demonstrations at the training stage. However, getting compositional data at scale is a non-trivial task because the number of compositions grows exponentially with the number of atomic tasks. This introduces significant human labor in adding appropriate instructions for each new composition.
A naive solution to this challenge might be combining individual task prompts' instructions and control sequences, following the controlled generation literature (Yang et al., 2022; Liu et al., 2022). However, for cross-task compositions with multiple constraints, this could result in nonsensical tasks whose composition is infeasible, invalid, or too idiosyncratic to be of any practical use (see Fig. 10 for an example). Thus, reliable and scalable mechanisms to compose tasks (and their instructions) without manual effort are highly desirable.
Contributions. To address the above-mentioned challenge, we make the following contributions: i) First, we propose an instruction-based framework for dialog tasks, named CESAR. CESAR modularizes dialog task prompts based on their input, constraints, and outputs. This enables an automatic combination of specific parts of different task prompts to generate compositional task prompts without any human intervention; e.g., in Fig. 1, CESAR combines two atomic tasks that differ only in the constraint component of the prompt. We describe the complete framework in §4.
ii) We introduce InstructDial++, an update over the original InstructDial benchmark (Gupta et al., 2022). It incorporates more atomic tasks and datasets and introduces composite tasks by utilizing the CESAR framework. Overall, InstructDial++ consists of 86 basic and 68 composite tasks defined on 63 datasets. We detail the InstructDial++ benchmark in §5.
iii) Finally, we perform comprehensive experiments that reveal the multi-faceted benefits of having compositional tasks in the fine-tuning stage (c.f. §6): (a) they improve compositional task performance for both seen and unseen task compositions; (b) they improve atomic (non-compositional) task performance under similar data budgets.
The CESAR framework, along with the InstructDial++ benchmark, enables the automated generation of complex tasks in dialog benchmarks, which we believe is one of the critical ingredients in bridging the gap between publicly available dialog models and proprietary AI assistants.

Related Work
The term instruction tuning was popularized by Wei et al. (2021) and Mishra et al. (2022) and has gained traction among various researchers; for example, Ouyang et al. (2022) optimized outputs based on user preferences by incorporating human feedback. Chung et al. (2022a), on the other hand, demonstrated that scaling tasks and model size improves performance across benchmarks. Finally, InstructDial (Gupta et al., 2022) focused on instruction tuning for downstream dialog tasks but lacked curated compositional tasks. To address this gap, CESAR enables large-scale training of compositional tasks. Please refer to Table 1 for a comparison of CESAR with other recent benchmarks1.
Unified Grounding. The research community has recently emphasized unified grounding as a solution for diverse datasets. UnifiedSKG (Xie et al., 2022) unifies 21 SKG tasks into a text-to-text format, demonstrating improved performance through multitask learning. Convlab3 (Zhu et al., 2022) proposes a shared format for task-oriented dialog datasets. BlenderBot 3 (Shuster et al., 2022) adopts a modular approach, assigning specific tasks to different parts of a prompt. However, none of these frameworks have explored task composition from their unification efforts.
Compositional Generalization via Structure. Incorporating structure into prompts has consistently been shown to enhance a language model's ability to handle compositional tasks. Bursztyn et al. (2022) propose compositional fine-tuning, which involves breaking down the task into its constituent parts and combining these parts using decision templates. Keysers et al. (2019) use logical forms in training data to enable rule-based generation of compound samples for compositional generalization in semantic parsing tasks. Finally, Chen et al. (2022b) design task-specific prompts that provide detailed information about each task to help the model understand commonalities across different tasks.
Dialog Control. Various approaches have been used to control multiple attributes, such as adhesive architectures or prompting techniques. For instance, Chen et al. (2022a) define task-specific prompts and concatenate them at runtime to tackle complex tasks. Alternatively, SHAO et al. (2023) learn a compositional codebook where each task corresponds to a combination of codes, allowing controllability at inference time. Subramanian et al. (2018) embed each target attribute separately and average them as the start-of-sequence symbol during generation. Hu et al. (2022) propose a two-stage decoder that imposes stylistic and word-level constraints separately within the seq2seq framework. However, none of these approaches directly compare to the CESAR framework, which specifically addresses the scalability of compositional tasks in training data.

Motivation
In this section, we first explore the performance disparity between closed-access and open-access models2 on complex compositional tasks. We then investigate the impact of including compositions in the training data on task performance.
Closed- vs. Open-access Models in Complex Dialog Tasks. We begin by examining the disparity between open- and closed-access models. In Fig. 2, we notice that ChatGPT (a closed-access model) can produce satisfactory results for simple composite tasks, whereas the publicly available DIAL-T0 (Gupta et al., 2022) struggles. This demonstrates that open-access models require additional resources to improve on complex dialog tasks.

Can Compositional Demonstrations Improve Performance? Next, we want to verify whether the presence of compositional demonstrations can improve performance on complex dialog tasks. For this, we design a preliminary experiment where we select four dialog tasks: generating dialog responses where we control i) the beginning phrase, ii) the ending phrase, iii) the length (short, medium, and long), and iv) a keyword to be incorporated in the response. We manually create instructions for all possible combinations of these tasks (e.g., combining begins with with ends with generation, amongst others). We fine-tune the public Flan-T5-xl model (Chung et al., 2022b) on two different sets of training data, each of the same size, to create two models: i) Baseline, trained solely on the four atomic tasks, and ii) Compositional, trained on a mixture of atomic and compositional tasks. For a fair comparison, we keep the training steps the same for both models. In Fig. 3, we observe that the Compositional model outperforms the Baseline model, regardless of the training size. This indicates that including compositional tasks in the training data is crucial for better performance. Interestingly, we also find that the presence of compositional tasks in the training data positively affects atomic task performance (Fig. 8). This analysis reveals the need for scalable generation of compositional dialog tasks and instructions, a gap that we fill with CESAR.

CESAR
For any given dialog interaction, we first define the notions of dialog items and dialog components:
• Dialog items are units of information pertaining to a dialog interaction, such as utterances, speaker states, intents, personas, stylistic attributes of utterances, dialog summaries, utterance revisions, and external knowledge snippets, amongst others.
• Dialog Components are logical categories of a dialog, which include: C → context (or dialog context); S → dialog state(s); E → evidence(s); A → action(s); R → the dialog response. Any dialog item λ can be mapped to a dialog component using a mapping function g(), i.e., g(λ) → {C, S, E, A, R}. Table 2 provides an overview of the dialog components along with some sample dialog items that are mapped to them (a minimal illustrative sketch of this mapping follows below).
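To make the mapping above concrete, here is a minimal sketch of how dialog items could be tagged with their CESAR components. The `DialogItem` structure, the item kinds, and the rule table are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Component(Enum):
    C = "context"    # dialog context
    S = "state"      # dialog state(s), e.g. dialog acts, summaries
    E = "evidence"   # external knowledge, personas
    A = "action"     # constraints the upcoming response must follow
    R = "response"   # the dialog response itself

@dataclass
class DialogItem:
    turn: int    # i: turn number within the dialog
    index: int   # j: id of the item within the same component for that turn
    kind: str    # e.g. "utterance", "summary", "keywords" (illustrative labels)
    text: str

def g(item: DialogItem) -> Component:
    """g(): map a dialog item to its dialog component (illustrative rules only)."""
    rules = {
        "utterance": Component.R,
        "summary": Component.S,
        "dialog_act": Component.S,
        "persona": Component.E,
        "knowledge": Component.E,
        "keywords": Component.A,
        "target_emotion": Component.A,
    }
    return rules.get(item.kind, Component.C)
```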

CESAR Framework
CESAR adopts a structured approach to dialog processing, often seen in task-oriented dialogs (Hosseini-Asl et al., 2020). It generalizes the rigid structure of task-oriented dialogs to accommodate natural dialogs. A task in the CESAR framework is essentially a text-to-text task where both the input and the output are a linearized3 representation of multiple dialog components formatted as per a specified structure. A high-level representation of a CESAR task is defined as follows:

ICΛ − ψ    (1)

where the symbol '−' delineates the input from the output, i.e., {input_prompt} − {output}.

Input:
The input prompt in a CESAR task contains three prime components. For any dialog item x_ij, i refers to its turn number in the dialog and j refers to its identification within the same dialog component (S, E, A, or R) for that turn. Fig. 4 provides an example of this setup. Let us provide a concrete example of the CESAR framework. Imagine that a user is interacting with an AI assistant, as shown in Table 3a. This dialog includes different dialog items: for example, evidence item e_41, which is useful to construct the utterance r_4, or state item s_32, which infers information about either utterance r_3 or the complete dialog history r_1, r_2, r_3. Given these dialog items, we can construct several CESAR tasks, some of which are demonstrated in Table 3b. We also illustrate an example dialog interaction, including dialog items with their dialog component mapping, in Fig. 4.
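To illustrate how such a task is linearized into a single text-to-text prompt, below is a hedged sketch of a 1-D ICA-R (action-grounded response generation) input; the field labels and the helper function are our own illustrative choices, not the released preprocessing code.

```python
def linearize_ica_r(instruction, context_utterances, action_items):
    """Build the input side of a 1-D ICA-R task: instruction, dialog context,
    one action constraint, and the (empty) response slot to be filled."""
    context = " ".join(f"[{speaker}] {utt}" for speaker, utt in context_utterances)
    actions = " ".join(action_items)
    return (f"Instruction: {instruction}\n"
            f"Context: {context}\n"
            f"Action: {actions}\n"
            f"Response:")

prompt = linearize_ica_r(
    instruction="Provide the correct value for response field given the "
                "dialog context and action fields.",
    context_utterances=[("S1", "Hey, do you like music?"),
                        ("S2", "Yes, I am a big fan of the Barbie song.")],
    action_items=["The response should have the keywords: like, play, lot."],
)
print(prompt)
```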

CESAR Tasks
We now define an n-D CESAR task.
Definition 1 (n-D Task): For any CESAR task of the form ICΛ − ψ, we call the task an n-D Task if there are n dialog items in Λ, i.e., |Λ| = n.
Table 3b illustrates multiple CESAR tasks that are 0-D, 1-D, and 2-D tasks framed from the dialog items in Table 3a. Note that a 0-D task does not mean the input is empty, since a CESAR task always assumes a task instruction I and a dialog context C (which can be empty if there is no previous interaction). We do not cover instructions where a user could ask to generate multiple outputs, such as generating a dialog summary along with an appropriate response; we defer such a setup to future work.
Atomic vs. Compositional Task. In CESAR, we categorize every task as either an atomic or a compositional task: Atomic Tasks are either 0-D or 1-D tasks; Compositional Tasks are any n-D Task with n ≥ 2.
To create any compositional task in CESAR, we define the following composition operation.
Definition 2 (Task Composition): For two i-D Tasks, ICΛλ_p − ψ and ICΛλ_q − ψ, where |Λ| = i − 1 and i ≥ 1, we combine the two tasks to form an (i + 1)-D Task: ICΛλ_pλ_q − ψ. This composition operation allows the creation of arbitrarily complex compositional tasks, i.e., m-D Tasks with m ≥ 2, subject to the availability of relevant atomic tasks.
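As a reading of this operation in code, the sketch below merges two i-D tasks that share the same dialog context, target component, and common grounding, and differ in one extra grounding item each; the dictionary representation is an assumption for illustration, not the authors' implementation.

```python
def compose(task_a: dict, task_b: dict) -> dict:
    """Merge two i-D tasks into one (i+1)-D task (illustrative sketch).

    Each task is a dict with keys 'instruction', 'context',
    'grounding' (a list of dialog items) and 'target' (one of S/E/A/R).
    """
    # Composition is only defined when both tasks ground the same dialog
    # context and aim at the same target component.
    if task_a["context"] != task_b["context"] or task_a["target"] != task_b["target"]:
        raise ValueError("Tasks are not composable")

    shared = [item for item in task_a["grounding"] if item in task_b["grounding"]]
    extra_a = [item for item in task_a["grounding"] if item not in shared]
    extra_b = [item for item in task_b["grounding"] if item not in shared]

    return {
        # In practice the instruction is regenerated from the new component
        # signature (see "Prompt Design"); concatenation is a stand-in here.
        "instruction": task_a["instruction"] + " " + task_b["instruction"],
        "context": task_a["context"],
        "grounding": shared + extra_a + extra_b,  # one more item than either input
        "target": task_a["target"],
    }
```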
Our proposed CESAR framework can also incorporate dialog items as reasoning elements in a chain of thought (Wei et al., 2022). We present this extended formulation in Appendix E but leave its experimentation to future work.

New Tasks and Datasets
Knowing that scaling data in training benchmarks has a positive impact on performance (Longpre et al., 2023), we expand the InstructDial benchmark by incorporating 15 additional datasets and 42 new atomic (i.e., 0-D & 1-D) tasks (c.f. Table 15). The majority of newly introduced atomic tasks are derived from the inherent structure of the CESAR framework, which allows multiple tasks to be defined and performed on the same dataset. For example, we create the 8 novel tasks (a non-exhaustive list) shown in Table 3b using the dialog depicted in Fig. 4. Following the inclusion of the newly added tasks, the InstructDial++ benchmark consists of 86 tasks across 65 datasets, c.f. Fig. 5 and Table 15. The next step is to explain how we map each InstructDial task to the CESAR format.

Mapping InstructDial++ to CESAR
To map each InstructDial task into the CESAR format, we begin by assigning it a CESAR task based on the input constraints and output type. None of the tasks in InstructDial++ incorporate reasoning or CoT, leading to the simplified CESAR format ICΛ − ψ (Eq. (1)).
Prompt Design. Unlike InstructDial's approach of providing unique instructions for each task, we adopt a more general approach in crafting our instructions, which cover two main aspects: i) identifying the dialog components (c.f. Table 2) to focus on during generation, and ii) specifying which component the generation is aimed at, as in the example instruction: Provide the correct value for action field given the dialog context, state, and evidence fields. Due to the structural form of the instruction, we can programmatically generate instructions for each compositional task.
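Because such instructions only name the grounding components and the target component, they can be produced from a template. The sketch below is one possible rendering under that assumption; the helper and the component-name table are illustrative, not the released code.

```python
COMPONENT_NAMES = {"C": "dialog context", "S": "state", "E": "evidence",
                   "A": "action", "R": "response"}

def build_instruction(grounding: list, target: str) -> str:
    """Generate a CESAR-style instruction from the task signature, e.g.
    build_instruction(["C", "S", "E"], "A") ->
    'Provide the correct value for action field given the dialog context,
     state, and evidence fields.'"""
    names = [COMPONENT_NAMES[c] for c in grounding]
    given = names[0] if len(names) == 1 else ", ".join(names[:-1]) + ", and " + names[-1]
    return (f"Provide the correct value for {COMPONENT_NAMES[target]} field "
            f"given the {given} fields.")

print(build_instruction(["C", "S", "E"], "A"))
```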
Generative and Discriminative Tasks. Despite the generative nature of each CESAR task, it is important to note that our framework enables us to specify discriminative tasks as well. As an illustration, in the emotion_classification task, we provide the candidate emotions by incorporating them within the state component, e.g., 'State: Candidate emotions are sad, happy, and mad. The emotion of the last utterance is:'.
The CESAR framework enables 4 0-D tasks and 12 1-D tasks. InstructDial++ incorporates downstream tasks for each 0-D grounding task and for 10 of the 1-D grounding tasks, as depicted in Fig. 6.
Please find examples of the input-output structure of these tasks in Table 3b.
0-D Tasks. The top of Fig. 6 depicts all 4 0-D tasks. Each of these CESAR tasks clusters downstream tasks with a similar output objective. For instance, IC-S tasks involve categorizing, formatting, or organizing information within the dialog context in a more succinct or structured way, e.g., dialog summarization. IC-E, on the other hand, involves generating external knowledge useful in the dialog's context; both persona generation and knowledge generation are downstream tasks under this category. We believe the persona information of a user is better fed as 'evidence' rather than an 'action' because it is external information and not a strict constraint to follow during generation, c.f. Table 2. IC-A tasks are responsible for generating actions to be followed in the response, such as keyword or intent prediction. Finally, IC-R is the collection of tasks that generate/select a response for a given dialog context.
1-D Tasks. 1-D tasks have the same categorization because their generation is also aimed at one of the S, E, A, or R components, as in 0-D tasks, except they ground on an additional component other than the dialog context. For example, for ICA-R, the response generation task is additionally conditioned on a provided action, e.g., begins-with controlled generation or slot-value grounded generation. We also include edit generation under this category because, unlike an IC-R task, it grounds on both the context and the previous version of the response to be corrected. Another 1-D CESAR task example is ICS-A, which involves generating the action of the upcoming response conditioned on the state of the current dialog context. An illustrative example is length-grounded keyword prediction, where the generated keywords (for the response) are conditioned on the length of the final utterance in the dialog context. As depicted in Fig. 6, InstructDial++ incorporates 7 more 1-D tasks along with the 2 that we borrow from InstructDial.
2-D Tasks."After manually mapping all tasks of 0-D and 1-D nature, the CESAR framework selects and organizes viable 2-D tasks automatically according to a small subset of predefined rules.These rules ensure the combi- nation does not result in an infeasible taskc.f.Fig. 10.One sample rule allows compositions where the generation incorporates 2 actions (e.g.beginswith_controlled_generation and keyword_controlled_generation) creating ICAA-R task.For a comprehensive explanation of these rules please see Appendix C This results in 68 new downstream tasks defined on 7 2-D Cesar tasks -c.f.Table 16 and Fig. 11.Fig. 5 shows each of these compositions as edges between atomic tasks.
Order Invariance of Grounding Items. As dialog items in a prompt must be linearized, we adopt a randomizing process to ensure that our models are invariant to the ordering of the items. We place each section randomly inside the prompt, with two specific rules: i) the instruction is always placed at the beginning, which is a common practice; ii) the target section, referred to as ψ in Eq. (1), is placed at the end. The rest of the sections (C and Λ from Eq. (1)) are randomly positioned within the prompt.
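A minimal sketch of this randomized placement, assuming the prompt is held as named text sections; only the ordering rules come from the paper, the function itself is illustrative.

```python
import random

def linearize_with_random_order(instruction: str, target_section: str,
                                other_sections: list) -> str:
    """Keep the instruction first and the target section (psi) last; the
    remaining sections (C and the grounding items in Lambda) are shuffled."""
    middle = list(other_sections)
    random.shuffle(middle)
    return "\n".join([instruction, *middle, target_section])
```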
Experiments

Setup

Models. Throughout the experiments, we utilize the ChatGPT model gpt-3.5-turbo-16k-0613 and five public models, namely: i) T0-3B (Sanh et al., 2021), which is trained on a mixture of downstream tasks; ii) DIAL-T0 and iii) DIAL-BART0 (Gupta et al., 2022), fine-tuned on the InstructDial dataset and based on T0-3B and BART0 (Lin et al., 2022), respectively. We train another baseline model on the InstructDial dataset using FLAN-xxl (Chung et al., 2022a) and, following the same naming convention as the authors, call it iv) DIAL-FLAN-xxl. Our main model, v) CESAR-FLAN-xxl, is also trained from the FLAN-xxl model but on the InstructDial++ dataset rich in compositional tasks, c.f. Appendix A for training details.

Tasks and Metrics
Atomic Tasks. Throughout the experiments, we test models on eight atomic tasks, either individually or as part of some composition: Begins With Generation (BW): generate a response that starts with a given phrase. Ends With Generation (EW): generate a response that ends with a given phrase. Keyword Controlled Generation (KC): generate a response that incorporates a given set of keywords. Length Controlled Generation (LC): generate a response of a certain length (short/medium/long). Persona Based Generation (PB): generate a response based on a given speaker persona. Knowledge-Based Generation (KB): generate a response based on some external knowledge. Edit Generation (EG): edit a response to make it coherent with the dialog context. Emotion-Grounded Generation (EMG): generate a response that depicts a certain emotion. To maintain standardized evaluation, we utilize the atomic task metrics implemented by Gupta et al. (2022) for all of our atomic tasks.
Compositional Task Metrics. We evaluate compositional task performance on nine tasks, which are binary compositions of the atomic tasks. In InstructDial++, these compositions are available in the test set. For InstructDial, we manually create instructions for each task composition following the same formatting as the InstructDial paper and generate new test sets accordingly.
For each compositional task, we report accuracy scores where possible. When a combined accuracy is not well defined, we provide multiple metrics evaluating each dimension separately. For example, for persona-based + ends-with generation, because it is difficult to quantify how well the persona was used in the final generation, we report both the ends-with accuracy and the Rouge-L metric.
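As a concrete illustration of this per-dimension evaluation, the sketch below pairs an ends-with accuracy check with Rouge-L computed via the rouge_score package; it is a hedged reconstruction of the idea, not the exact script behind the reported numbers.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def evaluate_pb_ew(generated: str, reference: str, end_phrase: str) -> dict:
    """Persona-based + ends-with generation: an ends-with accuracy check for
    the controllable dimension, plus Rouge-L against a reference response."""
    ends_with_ok = generated.strip().rstrip(".!?").endswith(end_phrase.rstrip(".!?"))
    rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure
    return {"ends_with_acc": float(ends_with_ok), "rougeL": rouge_l}
```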

Results
In this section, we present the results of three main experiments. In each experiment, the test-set prompts for each model are formatted according to that model's training data. For the T0-3B and DIAL-FLAN-xxl models, the prompts include natural phrases that explain the task. For DIAL-BART0 and DIAL-T0, we use the special tokens defined by Gupta et al. (2022), and for CESAR-FLAN-xxl, the prompts are automatically generated in the CESAR format by the framework itself. Regardless of the format, each test set is composed of the same data instances from the test splits of the corresponding datasets.
Atomic Task Performance. Table 4 presents the atomic task performance of each model. Even though we do not claim any distinct advantage in atomic task performance, we observe that the atomic task performance of the CESAR model is comparable to, and on many tasks even better than, the baselines. This aligns with the insights from our preliminary experiments, c.f. Fig. 8.
Compositional Task Performance. Table 5 presents the results of the compositional task experiments. Our model outperforms the baselines on every task composition. This indicates that good performance on atomic tasks does not necessarily translate into good performance on compositions of those tasks, as evidenced by the widening gap between CESAR and the baselines from atomic to compositional tasks. We add two qualitative examples depicting atomic and compositional generation by CESAR; interested readers can find them in Fig. 12.
Generalization Experiment. To evaluate the robustness of CESAR, we design a simple training setup where each model trains only on a limited set of tasks using the smaller FLAN-xl model. We then evaluate these models on various task compositions, both seen and unseen.
We employ three models for this experiment: i) Atomic Model, trained solely on four atomic tasks (BW, KC, LC, and EG), representing the lower bound without any compositional training; ii) Naive Composer, trained on the same four atomic tasks as well as three compositional tasks: BW+KC, KC+LC, and BW+LC. These compositional tasks are created by concatenating the instructions and constraints of the individual tasks within the same prompt using the conjunction 'and' (see the sketch below). To avoid generating infeasible tasks due to the lack of structural information inherent to the CESAR framework, we manually select the tasks that the Naive Composer combines (as explained in Appendix G). Finally, we train another model using the iii) CESAR format, incorporating the same four atomic tasks and three compositional tasks as used for the Naive Composer.
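The sketch below shows how the Naive Composer baseline could join two atomic prompts with 'and'; this is our reading of the setup described above, with an illustrative dictionary representation rather than the actual implementation.

```python
def naive_compose(prompt_a: dict, prompt_b: dict) -> dict:
    """Join two atomic task prompts by concatenating their instructions and
    constraints with the conjunction 'and' (Naive Composer baseline sketch)."""
    return {
        "instruction": prompt_a["instruction"].rstrip(".") + " and "
                       + prompt_b["instruction"],
        "constraints": prompt_a["constraints"] + " and " + prompt_b["constraints"],
        "context": prompt_a["context"],  # both prompts share the same dialog context
    }
```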
The results, presented in Table 6, demonstrate that the CESAR structure outperforms the Naive Composer in all seen compositions and in most unseen tasks and compositions. For the unseen task composition6 in the right-most column, we incorporate ChatGPT to classify the generated response's emotion and to evaluate the response's quality, c.f. Appendix F for details on the evaluation prompts used.
It is important to note that the manual selection of tasks to be composed by the Naive Composer overlooks an essential contribution of our framework: the ability to detect viable compositions. Therefore, this experiment only demonstrates how robustness is affected by the CESAR structure and is not a direct comparison between the Naive Composer and our approach.

Conclusion
We propose CESAR to fill the compositional capability gap between public models and the irreproducible state of the art, ChatGPT. CESAR modularizes downstream dialog tasks under one format, allowing the programmatic induction of complex composite instructions. Moreover, we create a new benchmark building on top of InstructDial, adding new tasks and datasets and utilizing CESAR to populate composite tasks. The new InstructDial++ includes 63 datasets with 86 atomic and 68 composite task definitions. Our framework's compositional and atomic task abilities are demonstrated in extensive experiments. These experiments reveal that our framework significantly improves the model's ability to handle complex prompts compared to previous approaches. Notably, we discover that including composite data in the training set enhances compositional performance and also improves atomic performance. Additionally, we conduct a robustness experiment and find that the CESAR structure outperforms the baselines in the majority of compositions as well as on unseen tasks and compositions. (Regarding potential test-set contamination: we examined the FLAN collection and saw that, out of the 23 datasets it incorporates, we only use DailyDialog within the test set. Moreover, we saw that there is minimal intersection between the version of DailyDialog used by FLAN (the NIv2 corpus) and the original version we used. Thus we conclude there is only a minimal chance of contamination, where in the worst case 100 test instances are seen by the model.)

Limitations
Given its large scope, our work has not been able to delve into many promising avenues. We dedicate this section to discussing these to benefit future research: 1) Datasets in InstructDial++ are not comprehensive. Even though we tried to increase the dataset and task scale to the best of our ability, there are certainly more datasets that InstructDial++ would benefit from, such as Contrack (Ruckert et al., 2022), PRESTO (Goel et al., 2023), etc.
2) Multi-tasking with non-dialog data is not done. Due to the large scope of our work, we limited the datasets we focused on to dialog-specific ones. Previous work has shown that adding non-dialog data might help (Zeng and Nie, 2021).
3) We have not explored negative conditions. A true composition should incorporate all logical compositions. The challenging part of the negative condition is its evaluation. 4) We have only experimented using the existing grounding features available in the datasets. This limits the kinds of controls that could be done. 5) Automatic metrics are not necessarily robust. We try to mitigate this by choosing control signals that can be automatically measured. 6) Our action fields are mostly lexical, and some are semantic. A comprehensive set of actions would be better.

A Training Details
In line with Gupta et al. (2022), we create the training data by sampling 5000 instances per atomic task from the InstructDial and InstructDial++ datasets for their respective training runs. For InstructDial++, we additionally sample 1000 instances per compositional task for each dataset generated by the CESAR framework. Input sequences are set to a maximum length of 1024 tokens, and output sequences are set to a maximum length of 128 tokens. Longer sequences are truncated to these lengths, and shorter sequences are padded. We train both the DIAL-FLAN-xxl and CESAR-FLAN-xxl models on eight A100 GPUs, with a batch size of 10 per device and gradient accumulation steps set to 4. Both models are trained for two epochs, with a learning rate of 5e-05, using the AdamW optimizer (Loshchilov and Hutter, 2017).
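Under these hyperparameters, a training run could be configured roughly as follows with the Hugging Face transformers library. This is a hedged sketch rather than the authors' script; in particular, the google/flan-t5-xxl checkpoint name and the data-loading details are assumptions.

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "google/flan-t5-xxl"  # assumed checkpoint for FLAN-xxl
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="cesar-flan-xxl",
    per_device_train_batch_size=10,  # batch size of 10 per device
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=2,
    optim="adamw_torch",             # AdamW optimizer
)

def preprocess(example):
    # Inputs truncated/padded to 1024 tokens, outputs to 128 tokens.
    model_inputs = tokenizer(example["input"], max_length=1024,
                             truncation=True, padding="max_length")
    labels = tokenizer(example["output"], max_length=128,
                       truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# tokenized_train would be the InstructDial++ mixture mapped with preprocess().
# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=tokenized_train)
# trainer.train()
```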

B Compositional Capabilities of Closed Source Models
Fig. 7 shows 3 more examples depicting the compositional task capabilities of ChatGPT (gpt-3.5-turbo-0301), GPT-4, and Claude-v2. Each of these models demonstrates a certain degree of compositional capability, showcasing aptitude in complex language understanding and generation. However, there are certain scenarios where each exhibits inaccuracies or nuances that deviate from the expected outputs. Amongst the three, GPT-4 consistently delivers the most reliable results, whereas Claude v2 exhibits comparable performance, although it occasionally makes minor errors. GPT-3.5, on the other hand, tends to fall slightly behind its successors.

C Composition Rules
As explained in §4, CESAR does not only unify dialog tasks under a common prompt structure; it also utilizes this structure to combine these dialog tasks into compositional tasks automatically. This automation is achieved by defining specific rules, 10 as of the current version, that constrain which CESAR tasks can be composed together. The comprehensive list of these rules can be found in Table 7.
The first rule delineated in the table provides an illustrative example of how the CESAR compositional tasks are defined. This particular rule is formulated to create ICAA-R compositions, implying that it combines two distinct ICA-R tasks, generating an output that encapsulates dual actions. For instance, the combined tasks could involve beginswith_controlled_generation and keyword_controlled_generation.
In the rule structure, the key "common fields" represents modules that recur across each task involved in the composition. For this rule, the common modules include the "dialog context" and "response". These shared elements are essential as they get assimilated into the final compositional task. Furthermore, the "target field" is pivotal as it indicates the expected output from both tasks involved in the composition. Hence, in this rule, the "response" field, which is the expected output of both ICA-R tasks, becomes the output of the resulting ICAA-R task. It is also important to note that, due to the consistent structure shared by all ICA-R tasks, the framework ensures a seamless composition: irrespective of which ICA-R tasks are combined, as long as the rule's constraints are satisfied, the resulting composition is valid.
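To make the rule structure concrete, here is an illustrative encoding of this first rule and a viability check; the field names mirror the description above, but the representation itself is an assumption, not the released code.

```python
ICAA_R_RULE = {
    "task_1": "ICA-R",
    "task_2": "ICA-R",
    "composed_task": "ICAA-R",
    "common_fields": ["dialog context", "response"],  # shared across both tasks
    "target_field": "response",                       # expected output of both tasks
}

def is_composable(task_1: dict, task_2: dict, rule: dict) -> bool:
    """Check whether two concrete tasks can be composed under a given rule."""
    return (task_1["cesar_type"] == rule["task_1"]
            and task_2["cesar_type"] == rule["task_2"]
            and task_1["target"] == rule["target_field"]
            and task_2["target"] == rule["target_field"])
```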

D Benchmarking ChatGPT on Atomic and Compositional Tasks
In this section, we provide additional results from the ChatGPT model complementing our main results tables. We have chosen not to include these results in the main paper for two reasons: (1) ChatGPT is a closed-access model, limiting scientific insights, and (2) replicating ChatGPT results is not guaranteed due to its inaccessible architecture and parameters and the API's lack of consistency across different runs.
To query ChatGPT, we utilize the same naturally formatted prompts as those used for the DIAL-FLAN model. An example of a zero-shot prompt can be found in Table 11. The one-shot experiments follow a similar procedure but include an additional in-context example, maintaining the same prompt structure.
Table 8 and Table 9 present additional results for the ChatGPT model on the atomic and compositional task evaluations, respectively. The results show that ChatGPT has relatively lower performance compared to the DIAL-FLAN-xxl and CESAR-FLAN-xxl models. However, upon analysis we found that these results may be somewhat misleading: because both DIAL-FLAN-xxl and CESAR-FLAN-xxl are trained on the training splits of the tested data, they learned certain spurious traits in the datasets and in the way we preprocess our data.

For example, for the beginswith_generation task, the evaluation checks whether the given initial phrase and the beginning of the response are exactly the same. During tokenization we split punctuation with an additional space, e.g., 'The response should start with: Yes, I love this song ,', but ChatGPT omits the additional space while generating the response (i.e., 'I love this song,'), and thus indirectly 'miss-generates' the correct response. Another simple mistake it makes, for endswith_generation, is that it generates a sentence that actually ends with the given phrase but does not stop the generation and adds another sentence to the response; e.g., for the task 'The response should end with: song.', ChatGPT might generate 'I like this song. Where can I listen to it?'.

To examine the impact of incorporating more examples in the context and employing enhanced proprietary models, we reassessed the outcomes for three tasks with a reduced test set, c.f. Table 10. Generally, GPT-4 performs notably better than GPT-3.5. Additionally, adding more contextual examples considerably boosts performance.
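The sketch below illustrates the tokenization mismatch described above and one way an exact-match begins-with check could be made robust to it; the helper is our own, not part of the evaluation used for the reported numbers.

```python
import re

def split_punctuation(text: str) -> str:
    """Mimic preprocessing that separates punctuation with an extra space."""
    return re.sub(r"\s*([,.!?;:])", r" \1", text)

target_prefix = split_punctuation("Yes, I love this song,")  # 'Yes , I love this song ,'
chatgpt_output = "Yes, I love this song, where did you hear it?"

# The strict exact-match check fails purely because of the spacing mismatch ...
print(chatgpt_output.startswith(target_prefix))                      # False
# ... whereas normalizing both sides the same way would let it pass.
print(split_punctuation(chatgpt_output).startswith(target_prefix))   # True
```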

E Chain of Thought Potential in CESAR
We can extend CESAR to also incorporate reasoning supervision in the dialog tasks. This can be done by including chain-of-thought elements in the dialog tasks. The CESAR task from Eq. (1) is modified as follows:

ICΛ − Λ′ψ

where Λ′ = {g(λ′_1), . . ., g(λ′_n)} is interpreted as the multiset of dialog components that can be used to reason about the primary output ψ. In the traditional CoT framework, the reasoning precedes the main output; thus we represent the output sequence as Λ′ψ.7

Definition 3 (CoT-Generation): Any i-D Task, ICΛ − ψ with |Λ| = i and i ≥ 1, can be converted to its CoT counterpart by shifting a subset of dialog items Λ′ ⊆ Λ from the input to the output, i.e., IC(Λ \ Λ′) − Λ′ψ.

Note that any i-D Task can be converted to any j-D Task where j < i by applying Definition 3 repeatedly. For example, a CESAR Task ICSSEA−R can be converted to ICA−SSER after three iterations of Definition 3, shifting the S, S, E dialog components.
These operations ensure full task coverage of any arbitrary CESAR task as per Eq. (1).
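The following is a minimal sketch of the conversion in Definition 3, under the same dictionary representation of a task used earlier; it is illustrative only.

```python
def to_cot(task: dict, shift: list) -> dict:
    """Convert an i-D task ICΛ−ψ to its CoT counterpart by moving a subset Λ'
    of the grounding items from the input to the output, so that the model
    first produces the reasoning items and then the primary output ψ."""
    assert all(item in task["grounding"] for item in shift)
    return {
        "instruction": task["instruction"],
        "context": task["context"],
        "grounding": [item for item in task["grounding"] if item not in shift],  # Λ \ Λ'
        "output": shift + [task["target"]],                                      # Λ' ψ
    }
```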

F ChatGPT Evaluation Prompts
Earlier studies have shown that ChatGPT does a good job of evaluating the overall quality of text generated by language models (Zheng et al., 2023). We used ChatGPT to evaluate the EW+EMG task in Table 6. Since emotion-grounded generation is hard to evaluate automatically, we utilize ChatGPT to classify the emotions of the responses generated by each model and then calculate each model's accuracy. Moreover, because models may tend to generate the name of the emotion directly rather than infusing the emotion into the response (e.g., for response generation grounded on happiness, the generated response can be 'I am very happy!'), we additionally generate qualitative scores with ChatGPT for each model's responses.
We use in-context examples to set up each of these evaluation prompts. Table 12 and Table 13 depict the template prompts for emotion classification and quality evaluation, respectively.
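A hedged sketch of how such a ChatGPT-based emotion check could be invoked is shown below; it assumes the pre-1.0 openai Python client, and the prompt text and emotion list are placeholders for the actual templates in Tables 12 and 13.

```python
import openai  # assumes the pre-1.0 openai client interface

EMOTIONS = ["happy", "sad", "angry", "surprised", "neutral"]  # placeholder label set

def classify_emotion(response_text: str) -> str:
    """Ask ChatGPT to label the emotion expressed by a generated response."""
    prompt = ("Classify the emotion expressed in the following dialog response. "
              f"Answer with exactly one of: {', '.join(EMOTIONS)}.\n\n"
              f"Response: {response_text}\nEmotion:")
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion["choices"][0]["message"]["content"].strip().lower()
```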

Figure 2 :
Figure 2: Investigating the disparity in instruction-following capability for dialog response generation between closed-access (ChatGPT) and open-access (InstructDial) dialog models. ChatGPT comfortably satisfies two constraints in dialog response generation while showing signs of struggle in >= 3 constraint scenarios (more examples in Appendix B). In contrast, InstructDial lags behind ChatGPT since it is unable to satisfy >= 2 constraint scenarios (further evidence in Fig. 3).

Figure 3 :
Figure 3: Compositional accuracy of both the baseline and compositional models over varying training data sizes. Each datapoint is run across three independently sampled test sets to account for variability; c.f. Fig. 8 for the atomic performance comparison.


Example dialog items with their dialog component mapping (Fig. 4 / Table 3a):
Dialog Context: (r_1) Hey, do you like music? (r_2) Yes, I am a big fan of the Barbie song. The band plays rock. (r_3) Oh, seems like you like rock. What other genres do you like?
(s_31) The dialog act of the final utterance is: question.
(s_32) The summary of the conversation is: The user is interested in knowing the bot's music interests. The bot likes rock music, such as the Barbie song by Aqua.
(a_41) The final turn should have the keywords: like, play, lot.
(a_42) The length of the response should be short.
Figure 5: Chord wheel showing all atomic and compositional tasks in CESAR. Tasks colored in red are newly added compared to the InstructDial dataset. Each edge between a pair of tasks indicates a new compositional task that combines them.
Generate a response in the form of a question for the given dialog context. The response should have the keyword "Everest" and should be a rephrased version of the given original response. Original Response: Okay, lets go to Alps then! Dialog Context: S1: Hey would you like to go on a road trip together? S2: Sure, where do you have in mind? S1: Have not decided yet but I am thinking of somewhere North.

Figure 7 :
Figure 7: Qualitative examples showing the compositional capabilities of some closed-source models.

Figure 8 :
Figure 8: Atomic accuracy of both the baseline and compositional models over varying training data sizes. Each datapoint is run across three independently sampled test sets to account for variability.

Table 1 :
CESAR compared against other recent instruction-tuning benchmarks. Zero and Few stand for zero- and few-shot, respectively, whereas COT stands for chain-of-thought prompting.

Table 2 :
Dialog Components of CESAR with sample Dialog Items.

Table 4 :
Evaluation results on atomic tasks. R-L stands for the Rouge-L metric; best results for each column are in bold.

Table 5 :
Evaluation results on compositional tasks. R-L stands for the Rouge-L metric; best results for each column are in bold.

Table 6 :
Evaluation results on seen/unseen compositions and unseen tasks. R-L stands for the Rouge-L metric, E.C. stands for emotion classification, and R.Q. stands for response quality. Best results for each column are in bold.

Table 7 :
List of compositional rules.

Table 5
R-L stands for Rouge-L metric, best results for each column are bolded.

Table 10 :
Evaluation results on three compositional tasks for a smaller test set using more in-context samples and the GPT-4 API.

Table 11 :
Sample Prompt used for ChatGPT-based benchmarking.

Table 13 :
Prompt template used for ChatGPT to evaluate response quality.

Table 16 :
CESAR downstream compositional 2D tasks (The table starts in the previous pages).