Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration

Conversational systems based on Large Language Models (LLMs), such as ChatGPT, show exceptional proficiency in context understanding and response generation. However, despite their impressive capabilities, they still possess limitations, such as providing randomly guessed answers to ambiguous queries or failing to refuse problematic user requests, both of which fall under the notion of a conversational agent's proactivity. This raises the question of whether LLM-based conversational systems are equipped to handle proactive dialogue problems. In this work, we conduct a comprehensive analysis of LLM-based conversational systems, focusing on three aspects of proactive dialogue systems: clarification, target-guided, and non-collaborative dialogues. To trigger the proactivity of LLMs, we propose the Proactive Chain-of-Thought prompting scheme, which augments LLMs with the capability of goal planning over descriptive reasoning chains. Empirical findings are discussed to promote future studies on LLM-based proactive dialogue systems.


Introduction
Conversational systems are envisioned to provide social support or functional service to human users via natural language interactions. Most research typically centers on a system's response capabilities, such as understanding the dialogue context (Wu et al., 2020; Chen et al., 2022; Deng et al., 2022b) and generating appropriate responses (Zhang et al., 2020b; Roller et al., 2021). The popularity of conversational systems has grown unprecedentedly with the advent of ChatGPT, which showcases exceptional capabilities of context understanding and response generation with large language models (LLMs). Recent studies observe that, compared with current fine-tuned state-of-the-art (SOTA) methods, ChatGPT can still achieve competitive performance under the zero-shot setting on different dialogue problems, such as knowledge-grounded dialogues (Bang et al., 2023), task-oriented dialogues (Zhang et al., 2023), and emotion-aware dialogues (Zhao et al., 2023).
Despite the strength of ChatGPT, it still has several limitations, such as failing to ask clarification questions in response to ambiguous user queries or to refuse problematic user requests. These kinds of capabilities are typically regarded as the proactivity of a conversational system (Deng et al., 2023b), whereby the system can create or control the conversation to achieve its conversational goals by taking the initiative and anticipating impacts on itself or the human users. This raises the question: are these LLM-based conversational systems equipped to manage proactive dialogue problems?
In this work, we conduct the first comprehensive analysis of LLM-based conversational systems on three common aspects of proactive dialogues: 1) clarification in information-seeking dialogues (Guo et al., 2021; Deng et al., 2022a), where the system is required to proactively ask clarification questions when encountering ambiguity in user queries; 2) target-guided open-domain dialogues (Tang et al., 2019; Wu et al., 2019), where the system is required to proactively lead the conversation towards a designated target; and 3) non-collaborative task-oriented dialogues (Li et al., 2020; Zhou et al., 2020; Deng et al., 2023a), where the system and the user do not share the same conversational goal and the system aims to strategically reach a consensus with the user.
Motivated by the emergent capabilities of LLMs (Wei et al., 2022a,b) in reasoning over texts, some recent studies investigate in-context learning or chain-of-thought prompting schemes for planning (Huang et al., 2022) or taking actions (Yao et al., 2022) in interactive environments. Similarly, strategy learning and goal planning are of great importance in proactive dialogue systems.

Figure 1: Examples of the three prompting schemes (standard, proactive, and ProCoT) for proactive dialogues, illustrated on (1) clarification dialogues (Abg-CoQA) and (2) non-collaborative dialogues (CraigslistBargain). In the non-collaborative example, the system plays the role of the "Buyer", and the sale-to-list (SL) ratio shows the effectiveness of the negotiation, calculated as (listed price − bargain price) / (listed price − buyer target price). A higher ratio means the current bargain price is closer to the buyer's target.
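As a worked instance of this ratio (using the prices from the Figure 1 example, and assuming the ProCoT-proposed price of 8200 becomes the final bargain price):

```latex
\mathrm{SL} = \frac{\text{listed} - \text{bargain}}{\text{listed} - \text{target}}
            = \frac{14500 - 8200}{14500 - 8700}
            = \frac{6300}{5800} \approx 1.09
```

A ratio above 1 indicates a bargain price even below the buyer's target.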
In order to enhance the proactivity of LLM-based conversational systems, we design the proactive chain-of-thought prompting (ProCoT) scheme. As shown in Figure 1, with standard prompting, LLM-based systems directly provide a randomly guessed answer to the ambiguous user question (1a), or generate a generic bargaining response without any negotiation strategy (2a). When providing the system with options to take different dialogue acts (proactive prompting), the generated responses are unaware of the conversational goal, such as under-specified clarification questions (1b) and conservative negotiation responses (2b). To this end, ProCoT first instructs the system to generate descriptive thoughts about the intermediate steps of reasoning and planning for reaching the conversational goal, then decide on the next action to take, and finally generate an appropriate response based on the decided action (1c & 2c). We conduct extensive experiments with two LLM-based conversational systems: ChatGPT and an open-source model, Vicuna (Chiang et al., 2023). With the aforementioned three types of prompting schemes, we compare these LLM-based conversational systems with fine-tuned SOTA dialogue models. The main contributions of this work can be summarized as follows:
• This work presents the first comprehensive evaluation of the proactivity of LLM-based dialogue systems, including the handling of clarification, target-guided, and non-collaborative dialogues.
• We design the proactive chain-of-thought prompting scheme to endow LLM-based dialogue systems with the capability of planning and taking the initiative towards the conversational goal.
• Specifically, the main findings of the evaluation of LLM-based dialogue systems are: 1) They barely ask clarification questions when encountering ambiguous queries; ProCoT largely overcomes this issue, though the performance is still unsatisfactory in domain-specific applications (§4.1). 2) They are proficient at performing topic shifting towards the designated target, but tend to make aggressive topic transitions; ProCoT further improves this capability by planning smoother transitions (§4.2). 3) They fail to make strategic decisions and tend to compromise with the opponent; the key challenge is how to effectively optimize strategy learning (§4.3).

Related Works
Proactive Dialogues. Recent years have witnessed many advances in developing proactive dialogue systems (Liao et al., 2023) for various applications. For example, target-guided dialogues aim to proactively lead the conversation to either a designated target topic (Tang et al., 2019) or a pre-defined knowledge entity (Wu et al., 2019). Existing studies typically adopt keyword transition (Qin et al., 2020; Zhong et al., 2021) or knowledge graph reasoning (Yang et al., 2022; Lei et al., 2022) techniques to proactively plan the topic thread towards the target. Besides, in information-seeking dialogues, proactive dialogue systems can ask clarification questions to resolve ambiguity in the query or question, in both conversational search (Aliannejadi et al., 2021) and question answering (Guo et al., 2021; Deng et al., 2022a). In addition, under the non-collaborative setting, the system and the user have competing goals regarding task completion, and the system aims to proactively reach an agreement favorable to itself (Zhou et al., 2020), such as negotiating a product price (He et al., 2018) or persuading users to make a donation (Wang et al., 2019).
Large Language Models for Dialogues. Previous dialogue systems, such as DialoGPT (Zhang et al., 2020b), Meena (Adiwardana et al., 2020), BlenderBot (Roller et al., 2021), and LaMDA (Thoppilan et al., 2022), typically fine-tune pre-trained language models on public dialogue data. Inspired by the success of ChatGPT, recent practice builds dialogue systems by conducting supervised fine-tuning on open-source large language models, such as LLaMA (Touvron et al., 2023), with either constructed instruction-following examples (e.g., Alpaca (Taori et al., 2023)) or conversation data distilled from ChatGPT (e.g., Vicuna (Chiang et al., 2023)). As all these LLM-based dialogue systems are trained to follow the user's instructions, it remains an open question whether they can take the initiative to handle proactive dialogues.
Prompting in Dialogue Systems. To induce knowledge from LLMs, various prompting methods have been designed for zero-shot or few-shot learning in dialogue applications, such as task-oriented dialogues (Lee et al., 2021; Mi et al., 2022), knowledge-grounded dialogues (Shuster et al., 2022; Liu et al., 2022; Wang et al., 2023c), and open-domain dialogues (Chen et al., 2023b; Lee et al., 2023; Wang et al., 2023a). For example, Chen et al. (2023c) propose to prompt LLMs for controllable response generation in emotional support and persuasion dialogues, conditioned on the ground-truth dialogue strategies. In this work, we aim to prompt LLMs to proactively interact with users.

Prompting LLMs to be Proactive
As presented in Figure 1, we describe three prompting schemes: standard prompting, proactive prompting, and proactive chain-of-thought (ProCoT) prompting.
Standard Prompting. To instruct LLMs to perform specific dialogue tasks, the typical prompting scheme can be formulated as:

p(r | D, C),    (1)

where, given the task background D and the conversation history C, the LLM is instructed to generate the response r. Specifically, the task background can be the grounded document in clarification dialogues or the target description in target-guided dialogues.
Proactive Prompting. Proactive prompting provides alternative options for LLMs to decide what kind of action should be taken in the response, instead of simply responding to the instruction. It can be formulated as:

p(a, r | D, C, A),    (2)

where, given the task background D, the conversation history C, and a set of possible dialogue acts A, the LLM is instructed to select the most appropriate dialogue act a ∈ A and then generate the response r.
For example, the dialogue act can be Ask a Clarification Question or Directly Answer the Question in clarification dialogues, different negotiation strategies in non-collaborative dialogues, or different conversation topics in target-guided dialogues.
Proactive Chain-of-Thought Prompting. To endow LLMs with the capability of planning and taking the initiative towards the ultimate goal, we develop the proactive chain-of-thought prompting scheme (ProCoT), which analyses the next action to take by performing dynamic reasoning and planning towards the conversational goal. ProCoT can be formulated as:

p(t, a, r | D, C, A),    (3)

where t is the thought description of the decision-making process for the next action. For example, in clarification dialogues, t can be the ambiguity analysis of the current user question, as in Figure 1(1c), while in non-collaborative dialogues, t can be the goal-completion analysis of the current negotiation progress, as in Figure 1(2c).
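To make the three schemes concrete, the sketch below shows how the corresponding prompts could be assembled as plain-text templates. The wording here is illustrative only; the exact prompts used in our experiments are listed in Appendix C.

```python
# Minimal sketch of the three prompting schemes (illustrative wording).

def standard_prompt(background: str, history: list[str]) -> str:
    # p(r | D, C): directly instruct the LLM to respond.
    return (f"Task Background: {background}\n"
            f"Conversation History: {history}\n"
            "Given the task background and the conversation history, "
            "please generate the response:")

def proactive_prompt(background: str, history: list[str], acts: list[str]) -> str:
    # p(a, r | D, C, A): first pick a dialogue act from A, then respond.
    return (f"Task Background: {background}\n"
            f"Conversation History: {history}\n"
            f"Available Actions: {acts}\n"
            "Please select the most appropriate action and then "
            "generate the response:")

def procot_prompt(background: str, history: list[str], acts: list[str]) -> str:
    # p(t, a, r | D, C, A): generate a thought t analysing the goal,
    # then decide the action a, then generate the response r.
    return (f"Task Background: {background}\n"
            f"Conversation History: {history}\n"
            f"Available Actions: {acts}\n"
            "First analyse the current progress towards the conversational "
            "goal, then select the most appropriate action, and finally "
            "generate the response:")
```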

Evaluation
We evaluate the proactivity of LLM-based conversational systems from three perspectives: the capability of asking clarification questions (§4.1), guiding the conversation towards the designated target (§4.2), and strategically handling conflicting goals (§4.3).

Clarification Dialogues
Clarification in information-seeking dialogues (Zamani et al., 2022) refers to the process of seeking further information or details to better understand the topic or question at hand.In this context, clarification is an important part of the dialogue as it helps to ensure that the information being shared is accurate and complete.

Problem Definition
Following previous studies (Aliannejadi et al., 2021; Guo et al., 2021; Deng et al., 2022a), the problem of asking clarification questions can be decomposed into two subtasks: 1) Clarification Need Prediction (CNP), identifying the necessity of clarification in the current turn, and 2) Clarification Question Generation (CQG), producing an appropriate clarifying question when needed. Given the grounded document D and the dialogue context C = {q_1, a_1, ..., q_{t-1}, a_{t-1}, q_t}, the dialogue system first predicts a binary ambiguity label y indicating whether the current question q_t needs to be clarified. If so, a corresponding clarification question should be generated as the response a_t to resolve the ambiguity.
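Since the LLM returns free-form text, both subtasks must be read off its output. Below is a minimal parsing sketch, assuming the ProCoT-style output format illustrated in Figure 1(1c), where an ambiguous case states that the question is ambiguous and quotes a clarifying question; the exact output format instructed by our prompts is given in Appendix C.

```python
import re

def parse_clarification_output(output: str) -> tuple[bool, str | None]:
    """Split an LLM output into the CNP label and the CQG question.

    Assumes the ProCoT-style format of Figure 1, where ambiguous cases
    state that "the question is ambiguous" and quote a clarifying question.
    """
    lowered = output.lower()
    is_ambiguous = "ambiguous" in lowered and "not ambiguous" not in lowered
    question = None
    if is_ambiguous:
        match = re.search(r'The clarifying question is[:\s]*"([^"]+)"', output)
        if match:
            question = match.group(1)
    return is_ambiguous, question
```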

Experimental Setups
Datasets. We evaluate the capability of LLM-based dialogue systems to ask clarification questions on two datasets: 1) Abg-CoQA (Guo et al., 2021) in the general domain, and 2) PACIFIC (Deng et al., 2022a) in the finance domain. Details on these datasets can be found in Appendix A.
Evaluation Metrics. Following previous studies (Guo et al., 2021; Deng et al., 2022a), we use the F1 score for the evaluation of CNP, and BLEU-1 and ROUGE-2 (F1) for the evaluation of CQG. In addition, since automatic lexical matching metrics may fail to truly estimate the clarification capability of the generated clarifying questions (Guo et al., 2021), we also conduct human evaluation to score whether the generated question is helpful for clarifying the existing ambiguity (Help.).
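For reference, the sketch below shows one possible way to compute the automatic metrics with off-the-shelf packages (scikit-learn, nltk, rouge-score); the original benchmarks' evaluation scripts may differ in tokenization details, and Help. is judged by human annotators rather than computed.

```python
from sklearn.metrics import f1_score
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

def evaluate_clarification(y_true, y_pred, refs, hyps):
    # CNP: F1 over binary "needs clarification" labels.
    cnp_f1 = f1_score(y_true, y_pred)
    # CQG: BLEU-1 (unigram weights only) and ROUGE-2 F1 per question.
    scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
    bleu1, rouge2 = [], []
    for ref, hyp in zip(refs, hyps):
        bleu1.append(sentence_bleu([ref.split()], hyp.split(),
                                   weights=(1, 0, 0, 0)))
        rouge2.append(scorer.score(ref, hyp)["rouge2"].fmeasure)
    return cnp_f1, sum(bleu1) / len(bleu1), sum(rouge2) / len(rouge2)
```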
Usage of LLMs. To facilitate reproducibility, we adopt a static version of ChatGPT, i.e., gpt-3.5-turbo-0301, and set the temperature to 0 to generate deterministic outputs for identical inputs. In addition, we adopt an open-source LLM, Vicuna-13B-delta-v1.1, for the evaluation. The maximum number of new tokens is set to 128 for generation.
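A minimal sketch of the generation call, using the legacy (pre-1.0) openai Python client that was current for the gpt-3.5-turbo-0301 snapshot; the Vicuna model can be served behind the same interface, e.g., via FastChat's OpenAI-compatible server.

```python
import openai  # legacy (pre-1.0) client

def generate(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",  # static snapshot for reproducibility
        messages=[{"role": "user", "content": prompt}],
        temperature=0,               # deterministic outputs for identical inputs
        max_tokens=128,              # maximum number of new tokens
    )
    return response["choices"][0]["message"]["content"]
```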
Prompting Schemes. We evaluate the three prompting schemes introduced in Section 3: standard, proactive, and ProCoT prompting. In addition, we report results under both zero-shot and few-shot settings. Due to the maximum sequence length of Vicuna (2,048 tokens), we only apply one-shot in-context learning for comparison. The complete prompts adopted for evaluation are presented in Appendix C.

Experimental Results
Table 1 summarizes the evaluation results on the Abg-CoQA and PACIFIC datasets. There are several notable observations:

LLM-based conversational systems fail to ask clarification questions. Under standard prompting, both Vicuna and ChatGPT fail to ask clarification questions when encountering ambiguous queries, according to the human evaluation of the helpfulness (Help.) of the generated responses for clarifying ambiguity. Even with a one-shot demonstration, in-context learning (ICL) still cannot provide them with this ability. Under proactive prompting, given the option of clarification, Vicuna's ability to accurately take this action is still quite limited, with F1 scores close to 0. In contrast, ChatGPT becomes capable of asking clarification questions on Abg-CoQA, as evidenced by the improvement in both F1 and Help. scores.

ProCoT effectively endows LLM-based conversational systems with the capability of asking clarification questions. Zero-shot ProCoT does not work with Vicuna, but one-shot ICL can largely improve its performance. As for Abg-CoQA, ChatGPT with zero-shot ProCoT achieves competitive performance with SOTA fine-tuned methods on the CNP task (F1), but the generated clarification questions are still unsatisfactory (Help.). One-shot ICL further improves the performance of ChatGPT with ProCoT to a great extent. The case study in Appendix D.1 shows that ProCoT also improves the explainability of asking clarification questions.
As for the domain-specific problem, there is still a noticeable gap from the fine-tuned methods. Although ProCoT has already largely enhanced the capability of asking clarification questions, the performance of LLMs on the domain-specific task, i.e., PACIFIC (finance), is still far behind the fine-tuned methods. In fact, with fine-tuning on domain-specific data, the SOTA method achieves remarkable performance on PACIFIC, i.e., 86.9 (F1) for CNP and 80.1 (Help.) for CQG, indicating the importance of domain knowledge.

Error Analysis
To find out why LLM-based dialogue systems with ProCoT prompting fall short in handling domain-specific clarification dialogues, we randomly sample 100 error cases in clarification question generation from each dataset for analysis (all cases are generated by ChatGPT with one-shot ProCoT). We categorize these failure cases into four groups: Wrong Aspect, Under-specified Clarification, Over-specified Clarification, and Generation Error. Details and examples can be found in Appendix B. The statistics of the error analysis are presented in Table 2. It can be observed that the proportion of failure cases attributed to the wrong aspect and under-specified clarification is higher in PACIFIC (finance) than in Abg-CoQA (general). This indicates that ChatGPT may lack certain domain knowledge required for asking precise and specific clarification questions.

Target-guided Dialogues
Instead of making consistent responses to user-oriented topics, the dialogue system for target-guided dialogues is required to proactively lead the conversation topics towards a designated target (Tang et al., 2019). Depending on the application, the target can be topical keywords (Zhong et al., 2021), knowledge entities (Wu et al., 2019), or items to be recommended (Deng et al., 2023c).

Problem Definition
Given a target D that is presented only to the agent and unknown to the user, the dialogue starts from an arbitrary initial topic, and the system needs to produce multiple turns of responses {u_n} that lead the conversation towards the target. The produced responses should satisfy (i) transition smoothness: natural and appropriate content under the given dialogue context, and (ii) target achievement: driving the conversation towards the designated target. The problem is typically decomposed into two subtasks (Tang et al., 2019; Zhong et al., 2021; Yang et al., 2022): next topic selection and transition response generation.

Experimental Setups
Datasets. We first conduct a turn-level evaluation of the target-guided capability on OTTers (Sevegnani et al., 2021), a next-turn target-oriented dataset that requires the dialogue system to proactively bridge the current conversation topic towards the target. Furthermore, we adopt TGConv (Yang et al., 2022) for a dialogue-level evaluation, testing the ability to guide a multi-turn conversation to the target topic. Details can be found in Appendix A.
Automatic Evaluation Metrics. Following previous studies (Sevegnani et al., 2021; Yang et al., 2022), we adopt hits@k (k ∈ {1, 3}) for evaluating next topic prediction. Three text generation metrics, BLEU, ROUGE-L, and METEOR, are used for the evaluation of response generation on the OTTers dataset.
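hits@k simply checks whether a gold next topic appears among the model's top-k predicted topics. A minimal sketch, assuming one gold keyword per turn for simplicity:

```python
def hits_at_k(predictions: list[list[str]], gold: list[str], k: int) -> float:
    """predictions[i] is a ranked list of predicted next topics for turn i;
    gold[i] is the gold next-topic keyword (a single keyword assumed here)."""
    hits = sum(g in preds[:k] for preds, g in zip(predictions, gold))
    return hits / len(gold)
```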
As for the dialogue-level evaluation on the TGConv dataset, we follow existing studies (Yang et al., 2022; Wang et al., 2023b) and simulate multi-turn conversations via self-play (Tang et al., 2019), where the simulated user is unaware of the target topic. Three aspects are evaluated: 1) Succ., the success rate of generating the target word within 8 turns of conversation; 2) Turns, the average number of turns over all dialogues that successfully reach the target word; and 3) Coh., the contextual semantic similarity between the last utterance and the generated response, measured by MiniLM (Wang et al., 2020).
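A sketch of this self-play protocol, assuming hypothetical system_reply and user_reply wrappers around the target-aware system and the target-unaware simulated user, and using the sentence-transformers MiniLM encoder for the coherence score:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # MiniLM sentence encoder

def self_play(system_reply, user_reply, first_user_utt: str, target: str,
              max_turns: int = 8):
    """Simulate one dialogue; return (success, n_turns, avg_coherence)."""
    history, coh = [first_user_utt], []
    for turn in range(1, max_turns + 1):
        response = system_reply(history)        # target-aware system
        emb = encoder.encode([history[-1], response])
        coh.append(util.cos_sim(emb[0], emb[1]).item())  # Coh. per turn
        history.append(response)
        if target.lower() in response.lower():  # target word generated
            return True, turn, sum(coh) / len(coh)
        history.append(user_reply(history))     # target-unaware user
    return False, max_turns, sum(coh) / len(coh)
```

Succ. and Turns are then aggregated over the set of simulated dialogues (100 per method in our human evaluation).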

Human Evaluation Metrics. We also conduct the same human evaluation as Yang et al. (2022), including two dialogue-level metrics, with the following instructions provided for annotators:
• Global-Coherence (G-Coh.): Whether the entire dialogue is logically and topically coherent.
• Effectiveness (Effect.):How efficiently the target is achieved.
A total of 100 dialogues are generated through simulation for each method. Three annotators assign ratings to the generated dialogues on a scale of [0, 1, 2], where higher scores indicate better quality.

Turn-level Evaluation
Table 3 shows the turn-level evaluation results on OTTers. There are several notable observations:

LLM-based dialogue systems are proficient at performing topic shifting towards the designated target. According to the performance of LLMs with standard prompting, we observe that: 1) As for next-topic prediction (hits@k), thanks to their extensive knowledge across various topics, zero-shot LLMs can achieve competitive (Vicuna) or even better (ChatGPT) performance than the fine-tuned methods. 2) As for transition response generation, the automatic evaluation metrics (BLEU, METEOR, R-L) suggest that zero-shot models perform closely to fine-tuned methods in terms of lexical similarity with the reference response. (Note that a transition response can be phrased in many ways other than the reference response; we therefore mainly discuss the topic shifting capability in terms of next topic prediction performance.) 3) One-shot ICL has no positive impact on the performance and may even lead to worse results in next-topic prediction. This indicates that it is difficult for LLMs to enhance their topic shifting capability from limited demonstrations.
Only ProCoT prompting with one-shot demonstrations can improve the topic shifting capability. Without demonstrations, proactive and ProCoT prompts perform even worse than standard prompts, since LLMs may be confused about what kinds of topics are desired. For example, we observe a typical mistake in which LLMs predict the next topic in the form of a question, such as "What kind of food do you like?", leading to an overly narrow topic for the next turn. With one-shot demonstrations, ChatGPT with proactive prompts continues to underperform standard prompts in accurately predicting suitable topics towards the target. However, it is worth noting that only ProCoT prompts consistently improve the next-topic-prediction performance of all LLMs.

Dialogue-level Evaluation
Table 4 shows the dialogue-level evaluation results on TGConv. We draw the following conclusions:

LLM-based dialogue systems tend to make aggressive topic transitions. The results demonstrate the effectiveness of LLMs in steering the conversation towards the designated target, with ChatGPT exhibiting nearly perfect success rates (Succ.). Compared with the baselines, LLMs also excel at generating coherent responses that align with the dialogue context (Coh.), showcasing their impressive abilities in context understanding and response generation. Furthermore, the analysis reveals that ChatGPT typically reaches the target topic within just three turns, suggesting a tendency to generate responses that aggressively introduce the desired topic. Similar observations can be made for Vicuna with standard prompting.
ProCoT prompting enables smoother topic transitions in target-guided dialogues. Under proactive prompting, the response coherency is improved by the topic planning. However, the success rate is negatively affected. As also reflected in the human evaluation in Table 5, the proficiency of controllable generation in LLMs is a double-edged sword for target-guided dialogues: the key challenge is how to guarantee the topical smoothness and coherence of the generated transition responses.

Non-collaborative Dialogues
Unlike collaborative task-oriented dialogue settings (Zhang et al., 2020c), where the user and the system work together to reach a common goal (e.g., booking hotels), in non-collaborative dialogues the user and the system have a conflict of interest but aim to strategically communicate to reach an agreement (e.g., in a negotiation) (Zhan et al., 2022). The system is required to leverage a series of proactive strategies to reach an agreement favorable to itself, instead of passively following the user's intents.

Problem Definition
Given the dialogue history C = {u_1, ..., u_{t-1}} and the dialogue background D, the goal is to generate a response u_t with an appropriate dialogue strategy a_t that can lead to a consensus between the system and the user. A set of dialogue strategies A is pre-defined for prediction. Depending on the application, the dialogue strategies can be coarse-grained dialogue act labels or fine-grained strategy labels. The dialogue background includes the system's goal and the related grounded information, such as the item description in bargain negotiation (He et al., 2018) or the user profile in persuasion dialogues (Wang et al., 2019).

Experimental Setups
Datasets. We use the CraigslistBargain dataset (He et al., 2018) to evaluate the capability of strategically handling non-collaboration in LLM-based dialogue systems. The dataset was created under a bargain negotiation setting where the buyer and the seller negotiate the price of an item on sale. Details can be found in Appendix A.

Automatic Evaluation Metrics. Following previous work (Joshi et al., 2021), we conduct a comprehensive evaluation over three subtasks: negotiation strategy prediction, dialogue act prediction, and response generation. We report F1 and ROC AUC scores for strategy prediction and dialogue act prediction, the former being a multi-label prediction problem. For response generation, we adopt the BLEU score and BERTScore (Zhang et al., 2020a).
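A minimal sketch of the prediction metrics, assuming strategies are binarized into multi-hot vectors over the 21 strategy labels; the macro averaging and the 0.5 threshold here are our assumptions, not necessarily the original evaluation configuration.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def strategy_metrics(y_true: np.ndarray, y_score: np.ndarray,
                     thresh: float = 0.5):
    """y_true, y_score: (n_samples, n_strategies) multi-hot labels / scores."""
    y_pred = (y_score >= thresh).astype(int)
    f1 = f1_score(y_true, y_pred, average="macro")
    auc = roc_auc_score(y_true, y_score, average="macro")
    return f1, auc
```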
Human Evaluation Metrics. Following Joshi et al. (2021), we also conduct human evaluation on 100 randomly sampled dialogues with both subjective and objective human judgements. For the subjective judgement, annotators are asked to score, on a scale of [0, 1, 2], how persuasive, coherent, and natural each generated response is.
We further pair the generated responses from each prompting scheme, including Standard, Proactive, ProCoT, and Ground-truth (GT), with the corresponding responses from each of the other prompting schemes to compute the overall win rates between each pair.
For the objective judgement, we adopt the sale-to-list ratio (SL%) (Joshi et al., 2021; Dutt et al., 2021) as an indicator that explicitly measures the negotiation inclination in the generated responses:

SL% = (bargain price − buyer target price) / (listed price − buyer target price),

where the bargain price is the price at which the seller would be willing to sell the item at the current turn. The lower the SL%, the more compromise the seller has made.
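A minimal sketch matching the definition above, with hypothetical prices:

```python
def sale_to_list_ratio(listed: float, bargain: float, buyer_target: float) -> float:
    """Seller-side sale-to-list ratio: lower means the seller compromised more."""
    return (bargain - buyer_target) / (listed - buyer_target)

# Hypothetical example: item listed at $50, buyer target price $30.
sale_to_list_ratio(50, 45, 30)  # 0.75 -- bargain price close to the listing
sale_to_list_ratio(50, 32, 30)  # 0.10 -- the seller has compromised heavily
```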
To sum up, the instructions provided for annotators are as follows:
• Persuasive: Whether the seller is persuasive in bargaining over the price.
• Coherent: Whether the seller's responses are on topic and in line with the conversation history.
• Natural: Whether the seller is human-like.
• Bargain Price: What is the current bargain price from the seller's side?
• Win: Assume you are the seller. Which dialogue system would you prefer for bargaining over the price with the buyer (Win/Tie/Lose)?
Usage of LLMs & Prompting Schemes. The adopted LLMs and prompting schemes are the same as before, except that the maximum number of new tokens is set to 256, as more information needs to be generated, including the negotiation strategies and dialogue acts.

Experimental Results
Table 6 and Table 7 present the results with automatic and human evaluation metrics, respectively. There are several notable findings:

LLM-based dialogue systems fail to predict appropriate negotiation strategies and dialogue acts. Table 6 shows that failures in strategy learning further result in poor response generation performance. Specifically, ChatGPT generally performs better than Vicuna in strategy learning. Although both proactive and ProCoT prompting can slightly improve the final response generation performance, there is still a large gap from fine-tuned methods according to the automatic evaluation metrics.
The key challenge for LLMs in handling non-collaborative dialogues is how to effectively optimize strategy planning. Table 7 shows that the generated responses conditioned on reference strategies are more favorable (Win Rates).
Specifically, ChatGPT attains a high score on human-like response generation (Natural). With ProCoT, the generated responses are more coherent with the conversation history (Coherent), which can also be observed in the case study in Appendix D.3. However, compared with prompting with reference strategies, all the other prompting schemes fall short of generating persuasive negotiation responses (Persuasive), indicating their shortcomings in strategy learning. This is also validated by the objective judgement on the sale-to-list ratio, which shows that ChatGPT can reach a better deal for itself when conditioned on reference strategies. Similarly, Chen et al. (2023c) empirically show that, given the optimal planned strategy, ChatGPT achieves strong performance on controllable response generation in other strategy-based dialogues.

Analysis of Strategy Learning
Figure 2 presents the analysis of the relationships between the target and predicted dialogue acts of ChatGPT. For standard prompting, we observe two typical mistakes: 1) The system tends to propose the initial bargain price (init-price), instead of greeting (intro) and waiting for the buyer to initiate the bargain. 2) The system often directly accepts the buyer's offer (accept) when it is supposed to offer another price for negotiation (offer). This also explains why the sale-to-list ratio is relatively low when using standard prompting in Table 7. On the other hand, proactive and ProCoT prompting share similar patterns of mistakes, where ChatGPT tends to propose a counter price (counter-price) to negotiate with the buyer. Appendix E presents the analysis of the distribution of strategies selected by ChatGPT. In the reference responses, the seller often shows positive/negative sentiment to negotiate with the buyer. However, ChatGPT is inclined to adopt conservative or concessionary strategies, such as using hedge words, showing gratitude, or proposing a counter price.
Overall, we conclude that ChatGPT tends to compromise with the buyer during the negotiation, rather than strategically taking actions to maximize its own benefit.

Conclusion
In this work, we conduct the first comprehensive evaluation of the capability of LLM-based dialogue systems to handle proactive dialogues, including clarification, target-guided, and non-collaborative dialogues. To enhance the proactivity of LLM-based dialogue systems, we propose the proactive chain-of-thought prompting scheme, which triggers the reasoning and planning capabilities of LLMs. The empirical analysis sheds light on the potential of LLMs for proactive dialogues: 1) ProCoT largely enhances the originally poor performance of LLMs in asking clarification questions, but remains limited in domain-specific applications. 2) LLM-based dialogue systems perform aggressive topic shifting towards the designated target, while ProCoT makes the topic planning smoother. 3) Despite their strength in controllable response generation, strategy learning and planning remain a key challenge for LLMs in handling non-collaborative dialogues.

Acknowledgement
This research is supported by NExT Research Center.

Limitation
In this section, we discuss the limitations of this work from the following perspectives:

Sensitivity of Prompts. Similar to other studies on prompting LLMs for dialogue applications (Lee et al., 2023; Chen et al., 2023c,a), the evaluation results are likely to be sensitive to the choice of prompts. Besides, the designed prompts may not be optimal for the concerned problems. In fact, prompt sensitivity and optimality are themselves valuable research problems in dialogue systems, which can be further investigated in future studies. To facilitate the reproducibility of this work, we will release all the prompts used in the experiments and provide detailed descriptions of the design of each prompting scheme in Appendix C. The code and data will be released via https://github.com/dengyang17/LLM-Proactive.
Financial and Computational Cost of LLMs. It is financially expensive to call the APIs of commercial LLMs for experiments. In our experiments, it cost about $120 in OpenAI API calls to obtain all the experimental results for ChatGPT. On the other hand, it is computationally expensive to conduct experiments with open-source LLMs on local machines. We choose Vicuna-13B as the open-source LLM for evaluation, which fits on an NVIDIA DGX-1 V100 32G for inference. With a larger budget and a better experimental environment, it would be valuable to evaluate how other, larger LLMs perform on the concerned proactive dialogue problems, such as GPT-4 and LLaMA/Vicuna-65B.

Capability of Planning and Decision Making
The proposed ProCoT prompting scheme can be regarded as a preliminary attempt at triggering the planning and decision-making capabilities of LLM-based dialogue systems. Compared with fine-tuned methods, this ability of LLMs is still weak, as we learn from the empirical analysis. Moreover, simply prompting LLMs to be proactive may fall short of handling decision making under dynamic environments in real-world applications. It is worth studying how LLM-based dialogue systems handle proactive dialogue problems in an interactive setting with more diverse user simulation (Lei et al., 2022; Fu et al., 2023).

A Details of Datasets
In the experiments, we adopt the test sets of five datasets for evaluation: Abg-CoQA (Guo et al., 2021), PACIFIC (Deng et al., 2022a), OTTers (Sevegnani et al., 2021), TGConv (Yang et al., 2022), and CraigslistBargain (He et al., 2018). Detailed descriptions of each dataset are as follows:
• Abg-CoQA is constructed based on the CoQA dataset (Reddy et al., 2019) by truncating a partial conversation from the full conversation and selecting ambiguous questions.
• PACIFIC is constructed based on the TAT-QA dataset (Zhu et al., 2021), a question answering dataset in the financial domain whose contexts contain a hybrid of tables and text. Deng et al. (2022a) rewrite the questions to be ambiguous in order to introduce clarification turns into the conversation.
• OTTers is a next-turn target-oriented dialogue dataset, which requires the agent to proactively generate a transition utterance to approach the designated target. We adopt the processed version by Yang et al. (2022) for evaluation. The topic is represented as a set of topical keywords.
• TGConv is constructed based on ConvAI2 (Dinan et al., 2019) and is split into two settings: "easy-to-reach" and "hard-to-reach" targets. The topic is also represented as a set of topical keywords.
• CraigslistBargain was created in a negotiation setting where two crowdsourced workers play the roles of the buyer and the seller to bargain over the price of an item. We adopt the processed version by Joshi et al. (2021) for evaluation, which assigns 10 dialogue acts and 21 negotiation strategies to the utterances.

B Error Analysis Details for Clarification Dialogues
As shown in Table 9, we categorize these failure cases into the following four groups:
• Wrong Aspect: The model generates a question for clarifying a wrong aspect of the user query.
• Under-specified Clarification: The model generates an under-specified clarification question, where the requested information is too general, making it difficult for the user to provide feedback.
• Over-specified Clarification: The model generates an over-specified clarification question, where the requested information is already clear in the user query.
• Generation Error: Although the model identifies the need for clarification, it does not generate the output in the required format, e.g., it produces no clarification question.

C Designs of Prompts
Tables 10, 11, and 12 present example prompts for clarification, target-guided, and non-collaborative dialogues, respectively. For the zero-shot setting, the overall prompt is composed of the task instruction and the test sample. For the few-shot setting, the overall prompt is composed of the task instruction, a number of samples with demonstrations, and the test sample.
In particular, we clarify several questions regarding the prompt designs as follows:

How to construct the task instructions? The task instructions first follow the problem definition of each proactive dialogue problem. Then, similar to other studies applying LLMs to different tasks (Wei et al., 2023; Bang et al., 2023), we further instruct the LLMs to generate the response in the desired output format for evaluation.
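A minimal sketch of how the overall prompts are assembled from these pieces (function and argument names are illustrative):

```python
def compose_prompt(instruction: str, test_sample: str,
                   demonstrations: list[str] | None = None) -> str:
    """Zero-shot: instruction + test sample; few-shot: instruction +
    demonstration(s) + test sample (one-shot in our experiments)."""
    parts = [instruction, *(demonstrations or []), test_sample]
    return "\n\n".join(parts)
```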

Figure 4: Case study from the Abg-CoQA dataset (clarification dialogues), comparing the responses generated under the three prompting schemes against the reference response for an ambiguous question about the grounded document.

How to choose the one-shot sample? Due to the input length limitation of LLMs, we can only adopt one-shot in-context learning (ICL). To test the sensitivity to the choice of the one-shot sample, we report results with three different one-shot samples in Figure 3. Despite the variance among different one-shot examples, as expected when using exemplar-based ICL (Scao and Rush, 2021; Zhao et al., 2021), the observations on adopting ProCoT for different proactive dialogues remain unchanged. Since the variance of one-shot ICL is inevitable, we simply adopt the first dialogue sample in the original training set of each dataset as the one-shot sample to facilitate reproducibility.

How to construct the demonstrations of proactive chain-of-thought? The demonstrations of proactive chain-of-thought are written by human annotators, representing their own chains of thought for the planning or decision making in the proactive dialogues. Following Wei et al. (2022b), we also report results with the demonstrations from three different annotations. As shown in Figure 3, this variance is much smaller than that of the one-shot sample. We adopt the best ProCoT annotation for each dataset in the evaluation.

D Case Study
In order to intuitively compare the three prompting schemes, we conduct case studies on the generated responses regarding three proactive dialogue problems.

D.1 Clarification Dialogues
Figure 4 presents a case study from the Abg-CoQA dataset, in which there are two possible answers to the ambiguous user question. When using standard prompting (a), ChatGPT generates a response based on a random guess, potentially resulting in an answer that does not align with the user's desired outcome. When using proactive prompting (b), although ChatGPT decides to ask a clarification question, the aspect concerned by the clarification is not the focus of the conversation. ProCoT prompting (c) first provides a thorough analysis identifying the underlying reasons for the ambiguity of the user's question. This analysis serves as the foundation for generating an effective clarifying question, which helps to disambiguate the user's intended meaning.

D.2 Target-guided Dialogues
Figure 5 presents a case study from the OTTers dataset, where the target topical keywords are "piano" and "listen", and the system is required to generate a transition response to approach the target topics. It can be observed that the current topic of the user utterance, i.e., "dog", is completely different from the target topics. When using standard prompting, ChatGPT leverages its overwhelming ability of controllable generation to produce a response with an aggressive topic transition. Despite its fluency, it is not a natural utterance with logical or topical coherence. When using proactive prompting, although the predicted next topics are less aggressive than with standard prompting, the generated transition response merely blends the current topics and the next topics together without naturally connecting them. Finally, ChatGPT with ProCoT prompting generates a relatively smoother transition response that bridges the current topic with the target topic by planning the topic-shifting process.
Figure 6 presents a case study from the TGConv dataset, where the hard-to-reach target is "Gaullist", and the system is required to reach this target within 8 turns of conversation under the self-play simulation. Vicuna struggles to lead the conversation towards this hard-to-reach target when using standard and proactive prompting, while ProCoT prompting enables Vicuna to effectively and smoothly drive the conversation towards the designated target. In addition, owing to its powerful capability of controllable text generation, ChatGPT directly responds to the initial user utterance with the target topic. However, the topic transition is relatively aggressive, which might degrade user engagement or experience during the conversation.

D.3 Non-collaborative Dialogues
Figure 7 presents a case study from the CraigslistBargain dataset, where the system plays the seller role to bargain with the buyer. At turn 3, even though the buyer merely inquires about the item information without showing any bargaining intention, ChatGPT with standard prompting tends to initiate the negotiation, which may put the seller in a disadvantageous position. Proactive and ProCoT prompting enable the prediction of the dialogue act and strategy of the next response. In particular, in its analysis of the current negotiation status, ProCoT points out that the negotiation has not yet started.
At turn 9, we observe that the seller has already lowered the bargain price to $40 in a previous turn. Without the reasoning and planning process, ChatGPT with standard and proactive prompting generates responses with a contradictory statement, i.e., proposing a higher counter price ($45), which is unreasonable in negotiation dialogues. With proactive CoTs, ChatGPT effectively summarizes the current negotiation progress and makes a better decision on the next negotiation goal.

E Analysis of Strategy Learning (Cont.)
Figure 8 presents the analysis of the distribution of strategies selected by ChatGPT. In the reference responses, we observe that the seller tends to express positive/negative sentiment and to negotiate in a positive/negative manner. In contrast, ChatGPT with standard and proactive prompting prefers to use hedge words or polite expressions (e.g., please and gratitude), indicating that ChatGPT essentially plays a nice role in the negotiation. ChatGPT with ProCoT prompting more often decides to use assertive words or to trade in, compared with the other distributions. This shows that ProCoT can enable ChatGPT to employ certain negotiation strategies.

Figure 2: Heatmaps of the relationships between target and predicted dialogue acts. As no dialogue act is predicted with standard prompting, a dialogue act classifier is trained to identify the dialogue act of the generated response.

Figure 3: Performance with different prompts, including different one-shot examples and different ProCoT annotations. The reported results are based on Vicuna-13B.

Figure 5: Case study from the OTTers dataset.

Figure 6: Case study from the TGConv dataset.

Figure 7: Case study from the CraigslistBargain dataset.

Figure 8: Distribution of selected negotiation strategies. Similarly, a negotiation strategy classifier is trained to identify the negotiation strategies of the responses generated with standard prompting.

Table 2: Statistics of the error analysis.

Table 3: Turn-level evaluation results on next topic prediction and transition response generation.

Table 4: Dialogue-level evaluation results on target-guided dialogues.

Table 5: Human evaluation on target-guided dialogues. All reported methods are under the one-shot setting.


Table 6: Evaluation results on negotiation strategy prediction, dialogue act prediction, and response generation.

Table 7: Human evaluation on non-collaborative dialogues. All reported methods are based on ChatGPT under the one-shot setting. Gold denotes that we instruct the LLMs to generate responses conditioned on the reference dialogue acts and negotiation strategies.

Table 8: Statistics of datasets.