CoF-CoT: Enhancing Large Language Models with Coarse-to-Fine Chain-of-Thought Prompting for Multi-domain NLU Tasks

While Chain-of-Thought prompting is popular in reasoning tasks, its application to Large Language Models (LLMs) in Natural Language Understanding (NLU) is under-explored. Motivated by the multi-step reasoning capability of LLMs, we propose a Coarse-to-Fine Chain-of-Thought (CoF-CoT) approach that breaks down NLU tasks into multiple reasoning steps where LLMs can learn to acquire and leverage essential concepts to solve tasks at different granularities. Moreover, we propose leveraging semantic-based Abstract Meaning Representation (AMR) structured knowledge as an intermediate step to capture the nuances and diverse structures of utterances, and to understand connections between their varying levels of granularity. Our proposed approach proves effective in helping LLMs adapt to multi-grained NLU tasks under both zero-shot and few-shot multi-domain settings.


Introduction
Natural Language Understanding (NLU) in Dialogue systems encompasses tasks of different granularities. Specifically, while intent detection requires understanding of coarse-grained sentence-level semantics, slot filling requires fine-grained token-level understanding. Moreover, Semantic Parsing entails the comprehension of connections between both token-level and sentence-level tasks.
Large Language Models (LLMs) possess logical reasoning capability and have yielded exceptional performance (Zoph et al., 2022; Zhao et al., 2023b). However, they remain mostly restricted to reasoning tasks. On the other hand, multi-step reasoning can take place when solving multiple interconnected tasks in a sequential order. In practical NLU systems, as coarse-grained tasks are less challenging, they can be solved first before proceeding to fine-grained tasks. Therefore, the outcomes of coarse-grained tasks can provide valuable guidance for subsequent fine-grained tasks, allowing for deeper semantic understanding of diverse utterances across different domains within NLU systems (Firdaus et al., 2019; Weld et al., 2022; Nguyen et al., 2023a). For instance, for the utterance "Remind John the meeting time at 8am" in the reminder domain, recognizing the GET_REMINDER_DATE_TIME intent is crucial for correctly identifying the PERSON_REMINDED slot type rather than the CONTACT or ATTENDEE slot type. Chain-of-Thought (CoT) (Wei et al., 2022) provides an intuitive approach to elicit multi-step reasoning from LLMs automatically. However, two major challenges remain with the current CoT approach: (1) LLMs rely entirely on their uncontrollable pre-trained knowledge to generate step-by-step reasoning, which can result in unexpected hallucinations (Yao et al., 2022; Zhao et al., 2023a); (2) additional beneficial structured knowledge cannot be injected into LLMs via the current CoT.
On the other hand, structured representations have demonstrated effectiveness in enhancing the capability of Pre-trained Language Models (PLMs) (Xu et al., 2021; Bai et al., 2021; Shou et al., 2022). In Dialogue systems, the dependencies among different dialogue elements, together with the diversity of utterance structures, necessitate the integration of additional structured representation. For instance, as observed in Figure 1, by leveraging Abstract Meaning Representation (AMR) (Banarescu et al., 2013), it is possible to map multiple semantically similar but structurally different utterances with similar coarse-grained and fine-grained labels into the same structured representation, allowing for effective extraction of intents, slots, and their interconnections within Dialogue systems.
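As a concrete sketch of what such a shared structured representation looks like in code, an AMR-style graph can be held as concept nodes plus typed relation edges. The labels below are illustrative stand-ins for the reminder-domain example, not gold AMR annotations:

```python
# Hypothetical AMR-style graph for "Remind John the meeting time at 8am".
# Node labels and relation names are illustrative, NOT gold AMR annotations.
amr_nodes = {
    "r": "remind-01",  # predicate frameset
    "p": "person",     # entity node (e.g., "John")
    "m": "meeting",    # noun-phrase concept
    "t": "8am",        # time keyword
}
amr_edges = [
    ("r", ":ARG1", "p"),  # who is reminded
    ("r", ":ARG2", "m"),  # what the reminder is about
    ("m", ":time", "t"),  # when the event occurs
]

def concepts(nodes):
    """Collect the concept labels carried by the graph's nodes."""
    return sorted(set(nodes.values()))
```

Two structurally different paraphrases of the same request would map to this one graph, which is what makes the representation useful for extracting intents and slots.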
In our work, we explore the capability of LLMs in NLU tasks of various granularities, namely multi-grained NLU tasks. Motivated by CoT, we propose an adaptation of CoT for solving multi-grained NLU tasks with an integration of structured knowledge from the AMR Graph. Our contributions can be summarized as follows:
• To the best of our knowledge, we conduct the first preliminary study of LLMs' capability in multi-grained NLU tasks of Dialogue systems.
• We propose leveraging a CoT-based approach to solve multi-grained NLU tasks in a coarse-to-fine-grained sequential reasoning order.
• We propose integrating structured knowledge represented via AMR Graph in the multi-step reasoning to capture the shared semantics across diverse utterances within the Dialogue systems.

Related Work
Chain-of-Thought (CoT) CoT (Wei et al., 2022) proposes leveraging intermediate steps to elicit logical reasoning from LLMs and succeeds in various reasoning tasks. Wang et al. (2022) enhance CoT by selecting the most consistent output answers via majority voting. Additionally, Fu et al. (2022) argue that majority consistency voting works best among the most complex outputs; they propose complexity metrics and leverage them to select demonstration samples and decoding outputs. Unlike previous CoT approaches, we leverage CoT to solve multi-grained NLU tasks.
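The self-consistency step from Wang et al. (2022) amounts to a majority vote over sampled decodings. A minimal illustration (the intent labels here are hypothetical):

```python
from collections import Counter

def self_consistency_vote(sampled_outputs):
    """Return the most frequent answer among n sampled chain-of-thought decodings."""
    return Counter(sampled_outputs).most_common(1)[0][0]

# e.g., ten sampled decodings collapsing onto a single final answer
samples = ["GET_EVENT"] * 6 + ["GET_DATE_TIME_EVENT"] * 4
final = self_consistency_vote(samples)
```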
Structured Representation Structured representation has been widely incorporated into language models to further enhance their capability across various NLP tasks (Bugliarello and Okazaki, 2020; Zhang et al., 2020). Structured representation can be either syntax-based (Bai et al., 2021; Xu et al., 2021), such as the Dependency Parsing (DP) Graph and Constituency Parsing (CP) Graph, or semantic-based (Shou et al., 2022), such as the Abstract Meaning Representation (AMR) Graph (Banarescu et al., 2013). Unlike previous works, we leverage structured representation as an intermediate step in the multi-step reasoning approach to extract essential concepts from diverse utterances in multi-domain Dialogue systems.

Proposed Framework
In this section, we introduce our proposed Coarse-to-Fine Chain-of-Thought (CoF-CoT) approach for NLU tasks, as depicted in Figure 2. Specifically, we propose a breakdown of multi-grained NLU tasks into 5 sequential steps from coarse-grained to fine-grained tasks. At each step, LLMs leverage information from the previous steps as guidance for the current predictions. As the domain name can provide guidance for NLU tasks (Xie et al., 2022; Zhou et al., 2023), at each step we condition on the domain name of the given utterance in the input prompt. The model's output is in the format of a Logic Form (Kamath and Das), which encapsulates the coarse-grained intent label, fine-grained slot labels, and slot values. Further details of the Logic Form's structure and its connections with multi-grained NLU tasks are provided in Appendix A.
Our multi-step reasoning is designed in the following sequential order:
1. Generate AMR: Given the input utterance, LLMs generate the AMR structured representation (Banarescu et al., 2013). The representation is preserved in the Neo-Davidsonian format as demonstrated in Figures 1 and 2. Each node in the AMR graph refers to a concept, including entity, noun phrase, pre-defined frameset, or special keyword. Edges connecting two nodes represent the relation types.
2. Generate Intent: In this step, LLMs generate the coarse-grained intent label prediction conditioned on the given input and its corresponding AMR Graph. AMR concepts can provide additional context for ambiguous utterances, leading to an improved ability to recognize the correct intents.
3. Generate Slot Values: In this stage, to generate the fine-grained slot values present in the input utterance, prompts for LLMs are conditioned on the generated AMR structure and the predicted intent label in addition to the utterance itself. As the AMR graph captures the essential concepts in the utterance while abstracting away its syntactic idiosyncrasies, it can help extract the important concepts mentioned in the utterance. To further couple the connections between slots and intents (Zhang et al., 2019; Wu et al., 2020), predicted intents from Step 2 are also concatenated to construct the input prompts for Step 3.
4. Generate Slot Value, Slot Type pairs: After obtaining slot values, LLMs label each identified slot value given the slot vocabulary. Similar to Step 3, we condition the generated output on the predictions from previous steps, including the AMR and intent. Both the AMR and intent provide additional context for the slot type predictions of the given slot values beyond the input utterance.
5. Generate Logic Form: The last step involves aggregating the predicted intents together with sequences of slot type and slot value pairs to construct the final Logic Form predictions.
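The five steps above can be sketched as a chain of prompt calls, each conditioned on the outputs of earlier steps. This is a minimal sketch: `llm` is any text-in, text-out callable, and the prompt wording is heavily abbreviated relative to the actual prompts in Figure 3:

```python
def cof_cot(llm, utterance, domain, intent_vocab, slot_vocab):
    """Run the coarse-to-fine chain: AMR -> intent -> slot values -> pairs -> Logic Form."""
    # Step 1: generate the AMR structured representation of the utterance.
    amr = llm(f"Generate the AMR Graph.\nUtterance: {utterance} Domain: {domain}")
    # Step 2: predict the coarse-grained intent, conditioned on the AMR.
    intent = llm(f"Select an intent from {intent_vocab}.\n"
                 f"Utterance: {utterance} Domain: {domain} AMR Graph: {amr}")
    # Step 3: extract fine-grained slot values, conditioned on AMR and intent.
    values = llm(f"Generate key phrases.\nUtterance: {utterance} Domain: {domain} "
                 f"AMR Graph: {amr} Intent: {intent}").split(", ")
    # Step 4: label each slot value with a type from the slot vocabulary.
    pairs = [(v, llm(f"Pick a slot type from {slot_vocab} for '{v}'.\n"
                     f"Utterance: {utterance} AMR Graph: {amr} Intent: {intent}"))
             for v in values]
    # Step 5: aggregate intent and (value, type) pairs into the final Logic Form.
    slots = " ".join(f"[SL:{t} {v}]" for v, t in pairs)
    return f"[IN:{intent} {slots}]"
```

The key design point is that every call from Step 2 onward carries the earlier outputs in its prompt, so each finer-grained prediction is explicitly conditioned on the coarser ones.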

Datasets & Preprocessing
We evaluate our proposed framework on two multi-domain NLU datasets, namely MTOP (Li et al., 2021) and MASSIVE (Bastianelli et al., 2020; FitzGerald et al., 2022). As the innate capability of language understanding is best represented via robustness across different domains, we evaluate the frameworks under low-resource multi-domain settings, including zero-shot and few-shot. Details of both datasets are provided in Appendix B.
To provide a comprehensive evaluation for coarse-grained and fine-grained NLU tasks, as well as the interactions between the two, we conduct an extensive study on both NLU and Semantic Parsing metrics, including Slot F1-score, Intent Accuracy, Frame Accuracy, and Exact Match. Intent Accuracy assesses the performance on coarse-grained sentence-level tasks, while the Slot F1 metric evaluates the performance on more fine-grained token-level tasks. The computation of Frame Accuracy and Exact Match captures the ability to establish accurate connections between sentence-level and token-level elements. For more details of individual metric computation from the Logic Form, we refer readers to Li et al. (2021).
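For intuition, the frame-level metrics over flat Logic Forms can be sketched as follows. This is a simplified illustration; the official evaluation follows Li et al. (2021) and handles details such as Slot F1 token alignment that are omitted here:

```python
import re

def parse_logic_form(lf):
    """Split a flat Logic Form into its intent and (slot_type, slot_value) pairs."""
    intent = re.match(r"\[IN:(\S+)", lf).group(1)
    slots = re.findall(r"\[SL:(\S+) ([^\]]+)\]", lf)
    return intent, slots

def intent_accuracy(preds, golds):
    """Sentence-level: fraction of samples whose predicted intent matches gold."""
    return sum(parse_logic_form(p)[0] == parse_logic_form(g)[0]
               for p, g in zip(preds, golds)) / len(golds)

def frame_accuracy(preds, golds):
    """Intent plus the multiset of slot types must both match."""
    def frame(lf):
        intent, slots = parse_logic_form(lf)
        return intent, sorted(t for t, _ in slots)
    return sum(frame(p) == frame(g) for p, g in zip(preds, golds)) / len(golds)

def exact_match(preds, golds):
    """The whole predicted Logic Form equals the gold one, slot values included."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

The progression from Intent Accuracy through Frame Accuracy to Exact Match mirrors the coarse-to-fine granularity of the tasks themselves.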
To conduct the evaluation with efficient API calls, we follow (Khattab et al., 2022).

In comparison with MASSIVE, the performance of all methods is significantly lower on MTOP, mainly due to the more complex Logic Form structures in MTOP. Notably, the MASSIVE dataset contains samples with fewer slots on average, leading to significantly better performance on Semantic Parsing tasks (i.e., Frame Accuracy and Exact Match).

Impact of Conditioning
The major advantage of our multi-step reasoning is the ability to explicitly condition on the prior predictions in later steps.
As observed in Table 4, conditioning on prior knowledge in multi-step reasoning improves the overall performance of CoF-CoT across different metrics, with the most significant gain in Intent Accuracy (+4.50%). This observation implies the importance of conditioning on the appropriate information in CoT for improved performance of LLMs under challenging zero-shot multi-domain settings.

Conclusion
In this work, we conduct a preliminary study of LLMs' capability in multi-grained NLU tasks of Dialogue systems. Moreover, motivated by CoT, we propose a novel CoF-CoT approach that breaks down NLU tasks into multiple reasoning steps where (1) LLMs can learn to acquire and leverage concepts from different granularities of NLU tasks, and (2) additional AMR structured representation can be integrated and leveraged throughout the multi-step reasoning. We empirically demonstrate the effectiveness of CoF-CoT in improving LLMs' capability in multi-grained NLU tasks under both zero-shot and few-shot multi-domain settings.

Limitations
Our empirical study is restricted to English NLU data, partially due to the existing English bias of the Abstract Meaning Representation (AMR) structure (Banarescu et al., 2013). We leave the adaptation of CoF-CoT to multilingual settings (Nguyen and Rohrbaugh, 2019; Qin et al., 2022; Nguyen et al., 2023b) as a future direction for our work.
Our work is empirically studied on the Flat Logic Form representation. In other words, the Logic Form only includes one intent followed by a set of slot sequences. There are two major rationales for our empirical scope. Firstly, as this is an early preliminary study on multi-grained NLU tasks unifying both Semantic Parsing and NLU perspectives, we design a small and controllable scope for the experiments. Secondly, as most NLU datasets, including MASSIVE (FitzGerald et al., 2022), are restricted to single-intent utterances, the Flat Logic Form is a viable candidate reconciling traditional NLU and Semantic Parsing evaluations. We leave explorations of the more challenging Nested Logic Form, where utterances might contain multiple intents, for future work.

A Logic Form

The Logic Form not only captures the coarse-grained intent labels and fine-grained slot labels of the utterances but also encapsulates the implicit connections between slots and intents.
As observed in Table 7, the Logic Form is constructed as the flattened representation of the dependency structure between intents and slot sequences. The Semantic Frame, constructed as intent type(s) followed by a sequence of slot types, can be directly extracted from the Logic Form. In addition, via the Logic Form, the coarse-grained intent label CREATE_REMINDER and the fine-grained slot labels TODO and DATE_TIME, together with the respective slot values (message Mike, at 7pm tonight), can all be extracted and converted to the appropriate format (i.e., the BIO format used as the traditional sequence labeling ground truth (Zhang et al., 2019)). Therefore, the Logic Form can be considered a unified label format that bridges the gap between Semantic Parsing (Li et al., 2021; Xie et al., 2022) and the traditional Intent Detection and Slot Filling tasks in NLU systems (Xia et al., 2020; Nguyen et al., 2020; Casanueva et al., 2022).
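The conversion from a flat Logic Form's slot pairs to BIO sequence-labeling tags can be sketched as follows (a simplified illustration assuming whitespace tokenization; real preprocessing must handle tokenization mismatches):

```python
def logic_form_to_bio(utterance, slots):
    """Map (slot_type, slot_value) pairs from a Logic Form to BIO tags.

    Assumes whitespace tokenization and that each slot value appears
    verbatim in the utterance.
    """
    tokens = utterance.split()
    tags = ["O"] * len(tokens)
    for slot_type, value in slots:
        value_toks = value.split()
        # locate the slot value's token span inside the utterance
        for i in range(len(tokens) - len(value_toks) + 1):
            if tokens[i:i + len(value_toks)] == value_toks:
                tags[i] = f"B-{slot_type}"
                for j in range(i + 1, i + len(value_toks)):
                    tags[j] = f"I-{slot_type}"
                break
    return tags
```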

B Dataset Details
We provide the details of the MTOP and MASSIVE datasets in Table 5. Compared to MASSIVE, the MTOP dataset not only contains more slot types and intent types but also tends to cover more slot types per sample in the Logic Form. This challenging characteristic explains the consistently lower performance of all methods on MTOP compared to MASSIVE, as observed in Section 5.

C Implementation Details
As the proposed step-by-step reasoning can be applied to any LLM, our proposed method is LLM-agnostic, which is empirically studied in Appendix F. For simplicity and consistency, in our main empirical study we leverage gpt-3.5-turbo from OpenAI as the base LLM. Following (Wang et al., 2022), we set the decoding temperature T=0.7 and the number of outputs n=10.
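These decoding settings amount to sampling several completions per prompt and aggregating them afterwards. A sketch of the request parameters is below; the chat-completion API call itself and any error handling are elided, and the field names follow the OpenAI chat format:

```python
def decoding_kwargs(prompt, model="gpt-3.5-turbo"):
    """Request parameters following Wang et al. (2022): T=0.7, n=10 samples."""
    return dict(
        model=model,
        temperature=0.7,  # moderate temperature to diversify reasoning paths
        n=10,             # number of sampled decodings per prompt
        messages=[{"role": "user", "content": prompt}],
    )
```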
As domain names provide essential clues for language models in multi-domain settings for multi-grained NLU tasks (Zhou et al., 2023), to safeguard fairness in baseline comparisons, we consistently include the domain name in the input prompts for all baselines unless stated otherwise. Specifically, the only exception is presented in Table 4 for CoF-CoT (CoF order) - Conditioning.
For few-shot (i.e., k-shot) learning settings, we randomly sample k examples and manually prepare the necessary labels for the different baseline variants. We experiment with k=5 in our empirical study.

Domains of Demonstration Samples
To replicate a more realistic scenario where the domains of the k-shot demonstration samples are generally unknown, we assume that the k-shot demonstration samples come from domains different from those of the test samples. Relaxing the assumption of domain similarity between demonstration and test samples allows for broader applications and encourages LLMs to extract true semantic knowledge from the k-shot demonstrations rather than overfitting to any specific domain. For completeness, we also conduct additional empirical studies comparing the FSL performance of CoF-CoT under both scenarios: (1) the k demonstration samples come from the same domain as the test samples; (2) the k demonstration samples are drawn from domains different from those of the test samples. As observed in Table 6, the additional constraint of similar domains between the k-shot demonstration samples and the test samples leads to improvements in evaluation performance across NLU and Semantic Parsing tasks. This is intuitive, since LLMs can extract domain-relevant information from the k domain-similar samples to assist the inference process on test samples.
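The cross-domain demonstration setup can be sketched as follows (a minimal illustration; `pool` is assumed to be a list of labeled examples, each carrying a `domain` field):

```python
import random

def sample_demonstrations(pool, test_domain, k=5, seed=0):
    """Draw k demonstration samples whose domains differ from the test domain."""
    candidates = [ex for ex in pool if ex["domain"] != test_domain]
    return random.Random(seed).sample(candidates, k)

# toy pool spanning several domains (illustrative field names)
pool = [{"domain": d, "utterance": f"example {i}"}
        for i, d in enumerate(["alarm", "music", "weather", "news",
                               "timer", "reminder", "calling"])]
demos = sample_demonstrations(pool, test_domain="reminder", k=5)
```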

D Prompt Design
Prompts for the individual steps of our CoF-CoT are presented in Figure 3. Additional output samples are provided in Figure 4.

E Qualitative Case Study
We present an additional Qualitative Case Study comparing the outputs of different baseline methods and our proposed CoF-CoT in Figure 5.
As observed in Figure 5, our CoF-CoT provides the predictions closest to the ground truth, while other baselines struggle to (1) generate the correct intent type (e.g., the GET_DATE_TIME_EVENT intent type from Direct Prompt versus the GET_EVENT intent in the ground truth), (2) identify the correct slot values (e.g., the everything slot value generated by CoT), and (3) generate the correct slot type for the corresponding slot values (e.g., the EVENT_TYPE slot type for the music festivals slot value from Complex-CoT instead of the CATEGORY_EVENT slot type).

F LLM-Agnostic Capability
Our proposed CoF-CoT is LLM-agnostic since the focus of the work is on the prompt design, which can be applied to any LLM. As most LLMs rely on the high quality of the designed prompts, our proposed CoF-CoT prompt design can be used as input to any LLM for zero-shot and in-context learning settings. This is also similarly observed in CoT (Wei et al., 2022), SC-CoT (Wang et al., 2022), and other comparable CoT methods. For further clarification, we report additional empirical results of our proposed CoF-CoT applied to both the PaLM (Chowdhery et al., 2022) and GPT-3.5 backbone LLMs on the MTOP dataset under both ZSL and FSL settings in Table 8. As observed in Table 8, CoF-CoT prompting consistently outperforms the two backbone LLMs across all NLU and Semantic Parsing tasks, demonstrating both the effectiveness and the LLM-agnostic capability of our proposed CoF-CoT.

Step 1: Given the utterance and its domain, generate a single corresponding Abstract Meaning Representation (AMR) Graph in the Neo-Davidsonian format. The format involves :ARG and :op relations.
Utterance: {utterance} Domain: {domain}

Step 2: Given the utterance, its domain, and its AMR Graph, select one of the following in the Intent Vocabulary as the intent type for the utterance.
Utterance: {utterance} Domain: {domain} AMR Graph: {AMR} Intent Vocabulary: {intent_vocab}

Step 3: Based on the utterance, domain, its AMR Graph and its intent, generate key phrases for the utterance. Key phrases can be made up from multiple AMR concepts. Each word in key phrases must exist in the given utterance. Each word in the utterance appears in only one key phrase. Key phrases need to contain consecutive words in the given utterance. Key phrases do not need to cover all words in the utterance. Return a list of key phrases separated by commas.

Return the list of key phrases and their corresponding slot types in the following format: (key_phrase, slot_type) separated by commas. If none of the slot types in the vocabulary fits, return the slot type as O.

Figure 1: Illustration of the Abstract Meaning Representation (AMR) of two structurally different but semantically similar utterances with the same fine-grained and coarse-grained labels. Each colored node represents an AMR concept matching the colored word or phrase in the corresponding utterances.

Figure 2: Illustration of CoF-CoT and its counterpart Direct Prompt approach. The left side illustrates the proposed CoF-CoT; the right side illustrates the naive Direct Prompt approach. Red and green represent sentence-level and token-level annotations captured in the Logic Form, respectively. For CoF-CoT, the prompt at each step starting from Step 2 is conditioned on the relevant output predicted from the previous step(s).
[IN:___ [SL:___] [SL:___]] where IN: is followed by an intent type, and SL: is followed by a slot type and slot value pair separated by white space. The number of [SL: ] entries is unlimited; the number of [IN: ] entries is limited to 1.

Table 2: Comparison between FT and LLM approaches on the MTOP dataset.

Table 3: Ablation study on the effectiveness of different structured representations on the MTOP dataset under zero-shot settings. CP, DP, and AMR denote Constituency Parsing, Dependency Parsing, and Abstract Meaning Representation, respectively.

CoT variants over the Direct Prompt. It implies that CoT prompting allows the model to reason over multiple steps and learn the connections between different NLU tasks more effectively.
• ZSL: We utilize samples from domains different from the test domains for training.
• FSL: We leverage samples from domains different from the test domains in conjunction with a fixed number of k-shot test-domain samples.

Table 4: Ablation study of step ordering on the MASSIVE dataset. CoF and FoC denote Coarse-to-Fine-grained and Fine-to-Coarse-grained order, respectively.

Table 5: Details of the MTOP and MASSIVE datasets.

Table 7: Sample utterance with its Logic Form under both Semantic Parsing and NLU tasks' metrics. // denotes the separation between tokens of the given utterance.