Prompt-Based Monte-Carlo Tree Search for Goal-Oriented Dialogue Policy Planning

Planning for goal-oriented dialogue often requires simulating future dialogue interactions and estimating task progress. Many approaches thus consider training neural networks to perform look-ahead search algorithms such as A* search and Monte Carlo Tree Search (MCTS). However, this training often requires abundant annotated data, which creates challenges when faced with noisy annotations or low-resource settings. We introduce GDP-Zero, an approach using Open-Loop MCTS to perform goal-oriented dialogue policy planning without any model training. GDP-Zero prompts a large language model to act as a policy prior, value function, user simulator, and system model during the tree search. We evaluate GDP-Zero on the goal-oriented task PersuasionForGood, and find that its responses are preferred over ChatGPT up to 59.32% of the time, and are rated more persuasive than ChatGPT during interactive evaluations.


Introduction
In many goal-oriented conversation tasks, interacting parties must retake initiative (Allen et al., 1999) by executing conversational strategies to lead the conversation toward a desired outcome (e.g., successful negotiation (Lewis et al., 2017) or emotional support (Liu et al., 2021)). As such, it is imperative to have high-quality dialogue policy planners that can prescribe an "optimal" strategy at each turn of the dialogue (Levin et al., 1997; Zhang et al., 2020b; Liu and Lane, 2017; Liu et al., 2018).
Optimal policy planning is a difficult task. While in task-oriented settings (e.g., restaurant booking) there is at least objectivity with respect to successful planning, many goal-oriented tasks like persuasion are often subjective. For instance, in a persuasion task, individual persuaders might adopt different strategies, making it difficult to train or evaluate a policy planner. Moreover, "optimality" in these complex tasks may require expert domain knowledge (e.g., negotiation skills). This also makes collecting and annotating high-quality conversations difficult (Chen et al., 2023b).

Code available at: https://github.com/jasonyux/GDPZero
In this work, we contribute a novel approach to Goal-oriented Dialogue Planning with Zero training (GDP-ZERO). GDP-ZERO prompts a large language model (LLM) to perform planning by simulating future dialogue interactions (Figure 1), making it particularly suitable for tasks which would otherwise require high-quality conversations and annotations. Unlike previous approaches, we treat policy planning as a stochastic game, and use prompting for every stage of an open-loop tree search. We evaluate GDP-ZERO on PersuasionForGood due to its difficult planning task (Wang et al., 2019), and find its responses are preferred over ChatGPT in both static and interactive evaluations.
Related Work

Dialogue Policy Planning Research on dialogue policy planning can be categorized into neural-focused and algorithmic-focused approaches. Neural-focused approaches use annotated dialogues to train dedicated classifiers or value functions to predict the next dialogue acts without explicit look-ahead planning (Zhang et al., 2022a,b; Cao et al., 2020; Peng et al., 2018; Zhang et al., 2023). For many goal-oriented dialogues, however, both annotated strategies and dialogue responses can be suboptimal/noisy, as different people can respond differently even given the same context.
To reduce the reliance on a labeled dataset, much work has also attempted to combine neural networks with search algorithms, such as A* search (Cheng et al., 2022) and tree search (Wang et al., 2020; Yang et al., 2021; Jang et al., 2020; Väth et al., 2023). However, these methods still require model training for dialogue simulation or value function estimation, and are therefore highly dependent on training data quality (Shi et al., 2019). For example, Jang et al. (2020) use MCTS to train an RNN-based policy model, and Wang et al. (2020) train multiple neural networks for user simulation and value function estimation. Consequently, these methods can face difficulties during dialogue simulation due to a) noisy data annotations causing sub-optimally trained generation models, and b) inaccurate responses generated at turn i compounding errors for simulations at turns > i.

Method
In this work, we introduce GDP-ZERO, an algorithm-focused dialogue policy planner for goal-oriented dialogue tasks like persuasion. GDP-ZERO uses zero model training and instead performs Open-Loop MCTS at decision time by prompting an LLM to simulate user and system responses, evaluate current task progress, and predict a prior over next dialogue acts. Building on findings from Chen et al. (2023b), our approach has two main differences from existing policy planning work: we use few-shot prompting to bypass the need for model training on noisy data, and we use Open-Loop MCTS to reduce compounding simulation errors by continuously re-generating system and user responses during the tree search.

Problem Definition
To introduce tree search methods for dialogue policy planning, we first formulate planning as a Markov Decision Process (MDP). A t-turn dialogue between a user and a system can be represented as

(a_0^sys, u_1^sys, u_1^usr, ..., a_{t-1}^sys, u_t^sys, u_t^usr),

where a_i^sys is the system's dialogue act at turn i, u_i^sys is the system's response, and u_i^usr is the user's utterance at turn i. Similar to Yang et al. (2021) and Wang et al. (2020), we define the task of planning the next a^sys as an MDP problem ⟨S, A, R, P, γ⟩. The dialogue act of the system a_i^sys represents an action a_i ∈ A at turn i, and the corresponding dialogue history up to the i-th turn, s_i = (a_0, u_1^sys, u_1^usr, ..., a_{i-1}, u_i^sys, u_i^usr), represents a state s_i ∈ S. A reward function R(s, a) represents the likelihood of a desired conversational outcome, such as persuading a user to donate to a charity. The transition function P : S × A → S represents the probability of transitioning from a dialogue state s_i to state s_{i+1} after executing a_i at a turn. Finally, γ ∈ [0, 1) is the discount factor.
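To make the formulation concrete, the state and transition can be sketched in Python. This is an illustrative encoding (names like `DialogueState` and `transition` are our own), not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Tuple

# One turn = (system dialogue act, system utterance, user utterance),
# mirroring (a_i^sys, u_i^sys, u_i^usr) in the MDP formulation above.
Turn = Tuple[str, str, str]

@dataclass(frozen=True)
class DialogueState:
    history: Tuple[Turn, ...] = ()

    def transition(self, a_sys: str, u_sys: str, u_usr: str) -> "DialogueState":
        # P(s, a): executing dialogue act a_sys, together with the simulated
        # system and user utterances, yields the next state s'.
        return DialogueState(self.history + ((a_sys, u_sys, u_usr),))

s0 = DialogueState()
s1 = s0.transition("greeting",
                   "Hi! Have you heard of Save the Children?",
                   "No, tell me more.")
assert len(s1.history) == 1 and s1.history[0][0] == "greeting"
```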

Dialogue Planning as a Stochastic MDP
In a typical MDP game like Go, much success has been achieved by employing closed-loop MCTS to plan the next move/action a (Silver et al., 2016, 2017). In general, MCTS algorithms improve the actions proposed by an original policy model π_θ by building a search tree that contains simulated outcomes of many potentially high-rewarding actions, and returning the best action according to the simulation/evaluation results. Specifically, MCTS repeatedly 1) explores a new action or exploits a successful action a proposed by π_θ in a state s; 2) simulates the action outcome s′ ← P(s, a); 3) evaluates the action's quality; and 4) updates its estimate Q(s, a) for that action. At the end of the search, MCTS produces probabilities π ∝ N(s, a)^{1/τ} for playing each action using the exponentiated visit count, which usually suggests much stronger moves than using the raw policy π_θ (Howard, 1960; Sutton and Barto, 2018; Silver et al., 2017).
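The exponentiated-visit-count policy π ∝ N(s, a)^{1/τ} can be computed in a few lines; this is a small self-contained sketch, with dictionary bookkeeping as an illustrative choice:

```python
def visit_count_policy(N, tau=1.0):
    """pi(a) proportional to N(s, a)^(1/tau); smaller tau sharpens the
    distribution toward the most-visited action."""
    weights = {a: n ** (1.0 / tau) for a, n in N.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

pi = visit_count_policy({"logical appeal": 8, "emotion appeal": 2}, tau=1.0)
assert abs(pi["logical appeal"] - 0.8) < 1e-9
```

With τ → 0 this recovers the argmax-by-visit-count rule used for final action selection.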
However, when simulating dialogue interactions during tree search, generating a slightly improbable system or user response for state s′ and storing it in the search tree could lead to a large compounding error for the rest of the subtree rooted at s′ (Wang et al., 2020). This is because the state space representing all possible responses is large, and dialogue responses are diverse. This makes it difficult to accurately simulate the most probable next dialogue state s′ given an (s, a) pair from the previous turn. We thus treat dialogue policy planning as a stochastic MDP, where the simulated next state s′ ← P(s, a) is drawn from a large unknown distribution and might not be representative of the most probable s′ (Perez Liebana et al., 2015). Unlike previous usages of (closed-loop) MCTS for dialogue, which assume a deterministic transition (Wang et al., 2020; Yang et al., 2021), this formulation requires potentially different s′ to be returned given the same dialogue context s and system action a.

GDP-ZERO
To solve this stochastic problem, we base our algorithm on Open-Loop MCTS (Weber, 2010; Perez Liebana et al., 2015), a variant of MCTS where each tree node s_i^tr = (a_0, ..., a_i) represents the sequence of dialogue actions taken to reach dialogue turn i. Instead of using system/user utterances to represent a tree node, this design forces the algorithm to (re)generate the corresponding system and user utterances when traversing the tree (see Figure 2). Over time, a tree node s^tr stores statistics derived from executing the sequence of dialogue actions (DAs) without relying on any specific instance of user/system utterances, which could otherwise propagate errors into future simulations. Given a dialogue state s_0, GDP-ZERO searches for the next best action by iteratively performing action selection, search tree expansion, action evaluation, and backpropagation to update tree statistics. After n iterations, GDP-ZERO outputs a prediction for the next best action for s_0. GDP-ZERO is detailed in Figure 2 and Appendix A. Below we describe each stage of the algorithm.
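A node in this open-loop tree can be sketched as follows. The class, the bounded cache policy, and the `simulate_fn` stand-in for LLM-based simulation are illustrative assumptions, not the released implementation:

```python
import random
from collections import defaultdict

class OpenLoopNode:
    """s^tr = (a_0, ..., a_i): identified by dialogue acts only; concrete
    utterances live in a bounded cache and are resampled on traversal."""
    def __init__(self, acts, k=3):
        self.acts = tuple(acts)
        self.k = k
        self.cache = []                  # up to k simulated histories h^tr
        self.N = defaultdict(int)        # visit counts N(s^tr, a)
        self.Q = defaultdict(float)      # value estimates Q(s^tr, a)

    def sample_history(self, simulate_fn, rng=random):
        # Reuse a cached simulation once the cache is full; otherwise
        # generate (and store) a fresh one, as in the Selection step.
        if len(self.cache) >= self.k:
            return rng.choice(self.cache)
        h = simulate_fn(self.acts)
        self.cache.append(h)
        return h

node = OpenLoopNode(["greeting", "logical appeal"], k=2)
node.sample_history(lambda acts: "simulated history 1")
node.sample_history(lambda acts: "simulated history 2")
assert node.sample_history(lambda acts: "never generated") in node.cache
```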
Selection Given a tree state s^tr, the action a* with the highest Predictor Upper Confidence Tree Bound (PUCT) (Silver et al., 2017; Rosin, 2011) is selected to traverse the tree:

a* = argmax_a [ Q(s^tr, a) + c_p · π_θ(a | s^tr) · √(Σ_b N(s^tr, b)) / (1 + N(s^tr, a)) ],

where N records the number of times a (s^tr, a) pair has been visited, and c_p is a hyperparameter controlling exploration. Since future simulations require a specific dialogue history, we either sample from the node's simulation cache if its size has reached k, or generate a new simulation based on the selected dialogue history h^tr by prompting (Appendix B). We repeat this process until s^tr is a leaf node.
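The selection rule can be written compactly as below; this assumes the standard PUCT form from Silver et al. (2017), with dictionary-based bookkeeping as our own illustrative choice:

```python
import math

def puct_select(Q, N, prior, c_p=1.0):
    """Return the action maximizing
    Q(s,a) + c_p * prior(a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    total_visits = sum(N.values())
    def bound(a):
        return Q[a] + c_p * prior[a] * math.sqrt(total_visits) / (1 + N[a])
    return max(Q, key=bound)

# A well-explored, high-value action can lose to an unexplored one:
Q = {"logical appeal": 0.5, "emotion appeal": 0.4}
N = {"logical appeal": 10, "emotion appeal": 0}
prior = {"logical appeal": 0.5, "emotion appeal": 0.5}
assert puct_select(Q, N, prior) == "emotion appeal"
```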
Expansion Once a leaf node is reached, we treat an LLM M_θ as a prior policy by prompting it to generate a distribution over next dialogue acts. This is done by sampling M_θ at temperature τ = 1.0 for m times, and converting the sampled DAs into a distribution (see Appendix A). Finally, each DA is also initialized with Q(s^tr, ·) = Q_0, a hyperparameter controlling exploration.
Evaluation We model the value of a state v(s^tr) as the probability that its dialogue context h^tr can lead to task success. In a persuasion task to convince a user to donate to a charity, this can be achieved by appending the utterance "Would you like to make a donation?" to the context, and prompting an LLM l times to simulate the user's current inclination to donate (Appendix A).
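A sketch of this evaluation step, where `sample_user_reaction` is a hypothetical stand-in for prompting the LLM user simulator, and the score mapping follows the one given in Appendix B:

```python
# Mapping from a coarse user reaction to an inclination score in [-1, 1],
# as defined in Appendix B.
REACTION_SCORE = {
    "no donation": -1.0, "negative reaction": -0.5, "neutral": 0.0,
    "positive reaction": 0.5, "donation": 1.0,
}

def estimate_value(context, sample_user_reaction, l=10):
    """v(s^tr): append a donation probe to the context and average the
    mapped scores of l sampled user reactions."""
    probe = context + ["Persuader: Would you be interested in donating to Save the Children?"]
    scores = [REACTION_SCORE[sample_user_reaction(probe)] for _ in range(l)]
    return sum(scores) / len(scores)

# Deterministic stand-in for the LLM user simulator, for illustration:
v = estimate_value(["Persuadee: I love helping children."],
                   lambda ctx: "positive reaction", l=4)
assert v == 0.5
```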
Backpropagation At the end of each search, we first store any newly simulated histories h^tr in a cache for each s^tr. Then, we update the statistics of all nodes along the search path:

N(s^tr, a) ← N(s^tr, a) + 1,
Q(s^tr, a) ← Q(s^tr, a) + ΔQ(s^tr, a),

where ΔQ(s^tr, a) = (v(s^tr) − Q(s^tr, a)) / N(s^tr, a). We also store a value v_h estimated for each simulated history h^tr that resulted in v(s^tr), by a running average:

v_h ← v_h + (v(s^tr) − v_h) / N_h(h^tr),

where N_h(·) stores the number of times h^tr is visited.
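The update above is an incremental running average, which can be sketched as:

```python
def backpropagate(path, value, N, Q):
    """For each (s^tr, a) on the search path: N <- N + 1, then
    Q <- Q + (v - Q) / N. (Q defaults to 0.0 here for simplicity; the
    paper initializes unvisited actions to Q_0.)"""
    for key in path:
        N[key] = N.get(key, 0) + 1
        q = Q.get(key, 0.0)
        Q[key] = q + (value - q) / N[key]

N, Q = {}, {}
backpropagate([("root", "greeting")], 1.0, N, Q)
backpropagate([("root", "greeting")], 0.0, N, Q)
assert Q[("root", "greeting")] == 0.5  # running average of 1.0 and 0.0
```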
Prediction After all simulations are completed, we select the optimal action a* = argmax_a N(s_0^tr, a) based on the number of times an action has been visited, where s_0^tr is the root node of the tree. To avoid generating a response using M_θ for a* again, we also extract from the cache the best system utterance associated with executing action a* at state s_0^tr. We use the values v_h computed during the simulations, and return the utterance with the highest estimated value:

u* = argmax_{h^tr_{a*}} v_h,

where h^tr_{a*} denotes any history h^tr that played action a* after the root state s_0^tr. We call this process "response selection".
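Prediction and response selection can be sketched together; the data structures here (visit counts, per-action history cache, and v_h table) are illustrative:

```python
def predict(root_visits, cache, v_h):
    """a* = argmax_a N(s_0, a) by visit count, then "response selection":
    among cached histories that played a*, return the one whose simulated
    history has the highest estimated value v_h."""
    a_star = max(root_visits, key=root_visits.get)
    best_history = max(cache[a_star], key=lambda h: v_h[h])
    return a_star, best_history

root_visits = {"logical appeal": 7, "emotion appeal": 3}
cache = {"logical appeal": ["h1", "h2"], "emotion appeal": ["h3"]}
v_h = {"h1": 0.2, "h2": 0.6, "h3": 0.9}
assert predict(root_visits, cache, v_h) == ("logical appeal", "h2")
```

Note that "h3" has the highest v_h overall but is never considered, since response selection only ranks histories under the most-visited action.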

Experiments
We evaluate GDP-ZERO in the context of PersuasionForGood (P4G; Wang et al., 2019), a dataset with 300 annotated dialogues in which a "persuader" attempts to persuade a "persuadee" to donate to a charity called Save the Children (Appendix E). Due to the subjective nature of persuasion, it is not clear whether the annotated conversations reflect optimal persuasion strategies (Chen et al., 2022). This makes training and evaluating a policy planner challenging, since different persuaders likely adopt different strategies (Section 1). Motivated by this challenge, we employ GDP-ZERO to perform decision-time planning based on dialogue simulations, and focus on evaluating our approach's end-to-end performance in achieving the desired task outcome: successfully persuading a user to donate to Save the Children.

Static Evaluation
We use ChatGPT (OpenAI, 2022) as the generation backbone of GDP-ZERO, as it has become accepted as one of the most coherent general-purpose LLMs (e.g., Liu et al. (2023b); Guo et al. (2023)). We take the first 20 dialogues from P4G, and produce 154 turns for evaluation. For each turn, we compare the responses generated with and without GDP-ZERO planning. Then, we prompt ChatGPT to choose which generated response is more persuasive (Appendix F).
In Table 1, we find that generative approaches using ChatGPT were preferred over human ground-truth responses about 90% of the time, with the highest score achieved by GDP-ZERO. In Table 2, we show that responses generated after GDP-ZERO planning were preferred in up to 59.32% of comparisons. We also observe increasing preference for GDP-ZERO as the number of simulations n increases. Finally, we find that changing k and Q_0 (which control simulation diversity and exploration, respectively) can slightly improve performance (Appendix A). However, we consider the possibility that a) ChatGPT is biased towards its own generated dialogues (Liu et al., 2023a), and b) it might not have a robust criterion for what constitutes persuasiveness.
As such, we also conducted interactive evaluation.

Interactive Human Evaluation
We conducted interactive human evaluation using the LegoEval platform (Li et al., 2021) with crowdworkers on Amazon Mechanical Turk. We primarily sought to evaluate GDP-ZERO in an end-to-end chatbot against two competitive baselines. The first is prompting ChatGPT for generation without GDP-ZERO planning. The second follows Chen et al. (2023b) by using ChatGPT with RAP (Chen et al., 2022). To our knowledge, RAP is the current state-of-the-art system on P4G, using trained modules with fixed dialogue policies derived from expert knowledge. Recently, Chen et al. (2023b) showed that using an LLM (e.g., ChatGPT) as the response generation module for RAP further improved performance (see Appendix G for details).
After each conversation, we asked the crowdworkers to evaluate our system based on the criteria in Table 3. We collected 40 survey results for GDP-ZERO, 35 for ChatGPT, and 36 for RAP (Appendix H). Our study revealed that GDP-ZERO achieves the best performance across all metrics related to persuasiveness. We also found that RAP is highly rated for strategy diversity and relevance, indicating the benefit of using expert knowledge in planning. In Appendix D we provide some analysis of the generated dialogues and the resulting dialogue act distributions using different planners.
In Appendix I we include some example dialogues.

Ablation Studies
To study how various components of GDP-ZERO helped improve task success, we conducted ablation studies including 1) using Closed-Loop MCTS instead of the open-loop variant; 2) removing the "response selection" process; and 3) testing with both ChatGPT and Codex (Chen et al., 2021) as a backbone. Our experiments (see Appendix C) show that using Open-Loop MCTS contributed the most to making a response more persuasive, and that using the "response selection" procedure further improves performance. We also find that GDP-ZERO can improve upon both backbone models.

Conclusion
We propose GDP-ZERO, an algorithm that performs look-ahead policy planning with a large language model for goal-oriented dialogues. End-to-end evaluation shows that, without any model training, the policy from GDP-ZERO surpasses the previous state-of-the-art system (RAP) and direct prompting of state-of-the-art LLMs on the task of persuasion. Strong performance in the zero-data regime opens the possibility of future work building dialogue systems for more conversational tasks under data-scarce settings.

Limitations
When is using GDP-ZERO appropriate? In this paper, we present GDP-ZERO, a general approach for closed-domain dialogue policy planning at the turn level. However, we only evaluated GDP-ZERO on P4G. This is because we believe simulation-based planning is most beneficial when the task 1) often requires long-horizon planning to be successful, 2) does not have "optimal" action annotations readily available for supervised learning, and 3) does not require study participants to follow fake hypothetical scenarios to conduct a conversation, as otherwise the outcome may be disingenuous. We thus believe tasks like persuasion are most suitable, where planning ahead is crucial to success and policy optimality in human demonstrations is extremely subjective. Additionally, such a user may or may not want to donate to the charity, and the task of the dialogue system is to try to persuade them. This is in contrast to other goal-oriented contexts like task-oriented dialogue (TOD), where strong policies can be directly learned due to TOD's mostly passive and objective nature (e.g., He et al., 2022), and also to dialogues where a hypothetical scenario is required. For example, to design a fair and controlled study, a user may need to make up fake preferences/objectives about a certain product in CraigslistBargains (He et al., 2018), or pretend that they need emotional support for some fixed set of issues in ESConv (Liu et al., 2021). Additionally, while GDP-ZERO can be adapted to task-oriented contexts like MultiWOZ (Budzianowski et al., 2018), it may not necessarily be appropriate. Such task-oriented contexts often have hierarchical policies (e.g., "[hotel] [recommend] name price" and "[restaurant] [inform] food price area"), and adapting GDP-ZERO would require converting the hierarchy into a multi-label classification, resulting in a massive action space. We believe this could be very inefficient, and approaches such as building multiple search trees to perform high/low-level planning would be useful (Zhang et al., 2020a).
Runtime One important limitation of GDP-ZERO is runtime. The more exhaustive the tree search (e.g., increasing n or k), the more likely the algorithm is to find the optimal dialogue policy (Table 2). However, this comes at the cost of longer simulation time, which may affect the overall user experience, and accordingly, user perceptions of persuasiveness.

Given the OpenAI API's rate limits and LLM inference speed, we restricted GDP-ZERO to plan over 7 dialogue acts in P4G, with n = 10, k = 3, for a simulation time of around 35 seconds during interactive evaluation. We believe methods to parallelize tree search (Chaslot et al., 2008) or to re-use parts of the simulation subtrees could help speed up GDP-ZERO. We expect that as research with LLMs progresses, inference speed will continue to improve. In the short term, one may bypass latency limitations by utilizing multiple accounts to parallelize API calls during simulation.
Simulation Quality GDP-ZERO prompts an LLM (e.g., ChatGPT) to perform dialogue simulation and value estimation. Despite LLMs' strong few-shot performance on many tasks, issues with controllable generation can still create errors during simulation (e.g., generated system utterances might not match the planned dialogue action). GDP-ZERO accounts for such errors by using an open-loop search with k > 1, but this increases simulation runtime. We believe this trade-off between simulation quality and runtime is also an important aspect for future work to consider.

Evaluation Bias Besides the possibility that ChatGPT favors its own generated dialogues (e.g., Liu et al., 2023a), recent work has found that LLMs like ChatGPT may favor responses with a higher number of unique tokens (Wang et al., 2023). As such, we analyzed responses between GDP-ZERO (n = 20, k = 3, Q_0 = 0) and ChatGPT, and found that there is indeed some positive correlation (r = 0.29, p < 0.001) between the number of words in a response and ChatGPT's preference for persuasiveness. However, it is not clear whether this is a limitation in our setting. Evaluating persuasiveness is not quite the same as the typical context of directly evaluating preference; there are many situations where longer responses correlate with persuasiveness (for instance, an aptly used logical appeal). As such, we conducted interactive evaluation with crowdworkers in Section 4.2 and found that GDP-ZERO-generated responses are indeed rated as more persuasive.

Ethical Considerations
Our work describes an algorithm to perform dialogue policy planning for goal-oriented tasks without any model training. It is aimed at making future dialogue systems easier to build, and also better at helping users/systems achieve their tasks/goals.

Potential Abuses Generally, while most algorithms are not designed for unethical usage, there is often potential for abuse in their applications. In our experiments with PersuasionForGood (Wang et al., 2019), we apply GDP-ZERO with the goal of increasing users' intention to donate to a charity. However, because GDP-ZERO is fundamentally goal-agnostic, it is possible to use it for unethical tasks, such as scamming. We do not condone the use of GDP-ZERO for any unlawful or morally unjust purposes.
Interactive Human Evaluation In this study, we conducted interactive human evaluation using crowdworkers on the Amazon Mechanical Turk platform. All crowdworkers were informed that they were speaking with a chatbot. All study participants were paid at a rate of $15 per hour. Our study has received IRB approval.

References
James E. Allen, Curry I. Guinn, and Eric Horvitz. 1999. Mixed-initiative interaction. IEEE Intelligent Systems.

A Additional details on GDP-ZERO
We describe the details of GDP-ZERO in Algorithm 1. Similar to other MCTS algorithms, GDP-ZERO performs simulation based on four stages, selection, expansion, evaluation, and backpropagation, and finally predicts an action based on the simulations. Different from existing implementations, GDP-ZERO performs open-loop search using only a generative LLM M_θ, by prompting it to do dialogue simulation, value function estimation, and prior policy estimation (see Appendix B for prompting details and examples). GDP-ZERO requires a generative LLM M_θ as a backbone model, and takes a dialogue history h_i at turn i as input. Given some fixed dialogue action space A (see Appendix E for P4G), GDP-ZERO builds a search tree after n simulations. For each state, GDP-ZERO keeps a cache of size k storing newly generated user and system utterances. We use c_p = 1.0 and Q_0 ∈ {0.0, 0.25, 0.5} to promote exploration (see Table 2).

B Prompting Details on P4G
For P4G, we used the same one-shot example for all cases, while dynamically changing the representation for each operation.
System response generation. Following Chen et al. (2023b), we include the natural language form of a planned dialogue action (Table A4) in the prompt to perform conditional generation. We present an example in Table A7.

User response generation. We swap the user and the system roles for this task, and prompt the LLM to act as a user simulator. We present an example in Table A8.

Value function estimation. To evaluate the user's inclination to donate at a given state, we first append the turn "Persuader: Would you be interested in donating to Save the Children?" to the dialogue history, and then prompt the LLM at temperature τ = 1.1 to sample the user's response l = 10 times. We define "no donation" = −1.0, "negative reaction" = −0.5, "neutral" = 0.0, "positive reaction" = 0.5, and "donation" = 1.0, and then convert the sampled responses to a score between −1.0 and 1.0. We present an example in Table A9.

Prior policy estimation. We treat the backbone LLM as a prior policy, and prompt it to generate the next dialogue action at temperature τ = 1.0 for 15 times to simulate its policy distribution. Finally, to promote the diversity of the generated dialogue actions during GDP-ZERO tree search, we use add-1 smoothing to convert the generated dialogue actions into a probability distribution. We present an example in Table A10.
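Converting the 15 sampled dialogue acts into an add-1-smoothed distribution takes only a few lines; the sketch below assumes simple counts over a fixed DA space:

```python
from collections import Counter

def smoothed_prior(sampled_das, action_space):
    """Add-1 smoothing over the fixed DA space, so unsampled actions keep
    a non-zero prior probability during tree search."""
    counts = Counter(sampled_das)
    total = len(sampled_das) + len(action_space)
    return {a: (counts[a] + 1) / total for a in action_space}

samples = ["logical appeal"] * 13 + ["emotion appeal"] * 2   # 15 samples
prior = smoothed_prior(samples, ["logical appeal", "emotion appeal", "greeting"])
assert prior["greeting"] == 1 / 18   # unsampled, but still explorable
```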

C Ablation Studies
In Table A1 and Table A2, we conduct ablations to study how each component of GDP-ZERO affects its performance. We use the same 20 dialogues from our static evaluation (Section 4.1), and test GDP-ZERO with ChatGPT (Table A2) and Codex (Table A1) as the generation backbone. As in Section 4.1, we use ChatGPT to evaluate the persuasiveness of the generated responses.

In all runs, we use n = 20, c_p = 1, Q_0 = 0, k = 3 for GDP-ZERO, where applicable. Table A1 and Table A2 show that GDP-ZERO improves the policy proposed by the raw backbone model, and also improves over a Closed-Loop MCTS implementation (denoted "w/o OpenLoop") and over a variant without the "response selection" process (denoted "w/o Resp. Select"). This indicates that the open-loop variant and the proposed "response selection" process both help GDP-ZERO improve dialogue planning.

D Analysis of GDP-ZERO Dialogues
In our experiments, we utilized GDP-ZERO to generate persuasive responses to convince a persuadee to donate to a charity.In this section, we provide an analysis of how and when GDP-ZERO impacts persuasiveness.
Static evaluation. Dialogues in P4G are annotated with the persuadee's dialogue actions. We thus utilize those annotations and investigate when GDP-ZERO produces more persuasive responses. For this analysis, GDP-ZERO used n = 10, k = 3, Q_0 = 0.25, the same configuration as in our interactive evaluation (Section 4.2, Appendix G). We found that GDP-ZERO is rated as more persuasive 70.59% of the time when the user at the previous turn showed negative emotions (i.e., "negative-reaction-to-donation" or "negative-to-inquiry"), and 59.26% of the time when the user showed positive emotions (i.e., "positive-reaction-to-donation" and "positive-to-inquiry"). We believe this implies that GDP-ZERO's simulations are most beneficial for persuading users who have little intent to donate. This is because look-ahead planning may, for instance, help ensure a response has covered all of the user's concerns regarding donation. We present an example in Table A6.
Interactive evaluation. Since each interactive dialogue is unlabeled, here we focus on analyzing the known dialogue actions generated by the different planners during those conversations. In Figure A1 we present the distribution of planned dialogue actions produced by RAP, ChatGPT, and GDP-ZERO. In general, the planned dialogue actions using ChatGPT and GDP-ZERO are unevenly distributed across different stages of the dialogue. Across different stages of the conversation, ChatGPT and GDP-ZERO share the most frequent DA: "greeting" during turns 1-2, "logical appeal" during turns 3-5, and "emotion appeal" during turns 6-10. However, during turns 3-5 GDP-ZERO had a relatively even preference between "credibility appeal", "emotion appeal", and "logical appeal", while ChatGPT strongly preferred "logical appeal". Additionally, we find that throughout all turns ChatGPT prefers "emotion appeal" over other dialogue actions, while GDP-ZERO balances between "emotion appeal" and "logical appeal", and RAP prefers "task related inquiry". It is also worth noting that GDP-ZERO performs "proposition of donation" less frequently at turns 1-5 than ChatGPT, and more frequently at turns 6-10. This indicates that GDP-ZERO potentially adopts a more "conservative" strategy than ChatGPT overall: GDP-ZERO focuses on building up persuasive arguments and is less likely to select "proposition of donation" at early stages of the conversation, avoiding overly eager propositions that could lead to persuasion failure (also see Table A12).

E GDP-ZERO Setup on P4G
PersuasionForGood (P4G) is annotated with 10 persuasion strategies and 7 important non-persuasive strategies (see Table A3). However, since P4G is collected from human-human interactions, with both the "persuader" and the "persuadee" possibly donating to the charity, some of the dialogue actions are unsuitable when the "persuader" is a chatbot (e.g., self-modeling and personal story). We therefore choose a subset of dialogue actions to plan over, picking 4 frequent persuasive strategies suitable for chatbots, and 3 non-persuasive strategies including "other" to enable the chatbot to deal with unaccounted-for situations. We present the chosen dialogue actions and their prompts for the LLM in Table A4.

F Additional details on static evaluation
In our static evaluation, we prompt ChatGPT to choose which generated response is better (e.g., with and without GDP-ZERO planning). Given two responses u_a and u_b, we ask ChatGPT "Which of the following responses can better help the Persuader convince the Persuadee to donate to Save the Children? Why? A: u_a, B: u_b, C: Can't tell." after providing the relevant task context and dialogue history (see Table A5). For every evaluation, we sample the result 5 times and perform a majority vote. Interestingly, we find that ChatGPT skewed towards choosing option A, preferring A 95.45% of the time even when u_a = u_b. We therefore randomly swap options A and B during all of our evaluations.

Table A4: Mapping of persuader dialogue actions to natural language prompts used for prompting LLMs.
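The evaluation protocol (5-vote majority with random A/B swapping to counter the observed position bias) can be sketched as follows; `judge` stands in for the prompted ChatGPT call and is an illustrative assumption:

```python
import random

def judged_preference(u_a, u_b, judge, n_votes=5, seed=0):
    """Majority vote over repeated judge calls, randomly swapping which
    response is shown as option A to counter position bias."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_votes):
        flip = rng.random() < 0.5
        first, second = (u_b, u_a) if flip else (u_a, u_b)
        choice = judge(first, second)  # returns "A", "B", or "C" (can't tell)
        if flip and choice in ("A", "B"):
            # Un-swap so that "A" always refers to u_a.
            choice = "B" if choice == "A" else "A"
        votes.append(choice)
    return max(set(votes), key=votes.count)

# A judge that always prefers a particular text, regardless of position:
prefers_text = lambda first, second: "A" if first == "persuasive" else "B"
assert judged_preference("persuasive", "bland", prefers_text) == "A"
```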

G Additional details on interactive study
In our interactive evaluation, we compare the rulebased planner from RAP, ChatGPT, and GDP-ZERO in an end-to-end chatbot for the persuasion task.
RAP We use the rule-based planner derived from expert knowledge in RAP (Chen et al., 2022), which produces a dialogue action given a dialogue context. We then use the same prompting template as in GDP-ZERO (Appendix B, Table A7), and prompt ChatGPT to produce a system response conditioned on the planned dialogue action.

ChatGPT We first use the same prompting template as in GDP-ZERO (Appendix B, Table A10) to obtain ChatGPT's own prior distribution over next dialogue actions. We then take the most probable action as the planned dialogue action, and use the same template as in GDP-ZERO (Appendix B, Table A7) to prompt ChatGPT again to produce a system response.

GDP-ZERO We use GDP-ZERO with a ChatGPT backbone as the policy planner, and use the "response selection" step to produce both the next dialogue action and the associated system response. We used n = 10, k = 3, Q_0 = 0.25, which had a high win rate against ChatGPT during static analysis while not taking too much time to simulate.

H Additional details on survey results
We require our crowdworkers to be located in the United States and have a HIT acceptance rate of at least 99%. After interacting with each chatbot, each crowdworker was asked to rate their conversational experience. This post-task survey included a validation question asking which charity they talked to the chatbot about. We had a total of 216 respondents. 74 did not reach or complete the survey, and 31 were removed for failing the validation question or responding with fewer than 3 unique sentences. This results in 40 survey results for GDP-ZERO, 35 for ChatGPT, and 36 for RAP.

I Example Interactive Conversations
We present multiple dialogues from the interactive evaluation (Section 4.2), and compare the strengths and weaknesses of the three planners: RAP, ChatGPT, and GDP-ZERO. In Tables A11, A13 and A15 we present conversations where the persuadee "strongly agreed" that the chatbot increased his/her intention to donate, for GDP-ZERO, ChatGPT, and RAP, respectively. In Tables A12, A14 and A16 we present conversations where the persuadee rated "neutral" on whether the chatbot increased his/her intention to donate, for GDP-ZERO, ChatGPT, and RAP, respectively. We observe that in Table A12, while GDP-ZERO did not succeed in convincing the user to donate, it attempted various strategies and did not perform "proposition of donation" prematurely. On the other hand, in Tables A14 and A16, both ChatGPT and RAP perform "proposition of donation" at an arguably unsuitable time.

Table A13: Example conversation with ChatGPT as the policy planner. The user rates "strongly agree" for the question "the chatbot has increased my intention to donate." (Full conversation omitted here.)

Figure 1: Using GDP-ZERO for persuasion with zero model training.

Figure 2: GDP-ZERO with a ChatGPT backbone. During Selection, simulations are either sampled from the cache or newly generated. During Expansion and Evaluation, we prompt ChatGPT for the prior policy π and the value estimate.
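The stages named in Figure 2 (Selection with cached or freshly generated simulations, Expansion and Evaluation via prompting) can be sketched as a minimal Open-Loop MCTS loop. This is only an illustrative sketch: the helper functions `prompt_policy_prior`, `prompt_value`, and `simulate_turn` are hypothetical stand-ins for the paper's ChatGPT prompts, and their stubbed return values are placeholders.

```python
import math
import random

def prompt_policy_prior(history, actions):
    # Stand-in: would prompt ChatGPT for a prior over dialogue acts.
    return {a: 1.0 / len(actions) for a in actions}

def prompt_value(history):
    # Stand-in: would prompt ChatGPT to estimate the probability of task success.
    return random.random()

def simulate_turn(history, action):
    # Stand-in: would prompt ChatGPT for a system response realizing `action`
    # plus a simulated user reply; here the turn is recorded symbolically.
    return history + [("system", action), ("user", "...")]

class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}     # dialogue act -> child Node
        self.cached_sims = []  # open-loop: cached simulated continuations

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct(parent, child, c=1.0):
    # AlphaZero-style PUCT selection score.
    return child.q() + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def search(root_history, actions, num_sims=10):
    root = Node(prior=1.0)
    for _ in range(num_sims):
        node, history, path = root, list(root_history), [root]
        # Selection: descend by PUCT; for each traversed edge either reuse a
        # cached simulation or sample a new one (the "open-loop" part).
        while node.children:
            parent = node
            action, node = max(parent.children.items(),
                               key=lambda kv: puct(parent, kv[1]))
            if node.cached_sims and random.random() < 0.5:
                history = random.choice(node.cached_sims)
            else:
                history = simulate_turn(history, action)
                node.cached_sims.append(history)
            path.append(node)
        # Expansion: create children weighted by the prompted policy prior.
        for a, p in prompt_policy_prior(history, actions).items():
            node.children[a] = Node(prior=p)
        # Evaluation + backpropagation of the prompted value estimate.
        value = prompt_value(history)
        for n in path:
            n.visits += 1
            n.value_sum += value
    # Return the most-visited dialogue act at the root.
    return max(root.children, key=lambda a: root.children[a].visits)
```

Note that the tree is built over dialogue acts rather than concrete responses, and each node caches its sampled continuations so that repeated visits can reuse earlier simulations instead of always regenerating them.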

Figure A1: Distribution of planned dialogue actions for each planner during interactive evaluations.

Table 1: Static evaluation with ChatGPT as backbone and judge (columns: Algorithm, n, k, Q_0, Win Rate vs. Ground Truth). Results are µ ± σ repeated over three runs.

Table 2: Static evaluation with ChatGPT as backbone and judge. The runner-up is colored in gray. Results are µ ± σ repeated over three runs.

Table A3: Annotated dialogue actions in the P4G dataset.

Very well. I'm just up organizing info for my charity called Save the Children. Have you heard of this charity before?

user — Persuadee: No, I have not. Can you tell me more?

The Persuader establishes credibility of Save the Children by citing its impact.
assistant — Persuader: Save the Children is an organization that helps children in developing countries, by promoting children's rights and providing relief. It is an amazing charity that helps kids who are in desperate need. They can help with safety, education and more.

user — Persuadee: That sounds great. I believe in this charity, but still wonder how much of the money I donate actually helps. I am always worried if I donate it will just go to some higher up that is living the high life.

The Persuader uses an emotion appeal to convince the Persuadee.
assistant — Persuader: Every little bit makes a difference. When you have people who are so poor, it's amazing what a tiny amount can do. I usually donate in hopes I can at least feel like I did my part. If I donated and some corrupt person took it, that's the worst karma and even worst scandal imaginable

user — Persuadee: With that all said I do feel like any organization that aims to help the children I am more inclined to donate to them than most. I think helping children is an important thing as they are our future!

The Persuader asks if the Persuadee would like to make a small donation.
assistant — Persuader: I think donating to this cause would def be a step in the right direction to hopefully helping across the world the children that are in despair. I don't want you to donate any more than you want, so if you want to donate how much do you to do?

user — Persuadee: I would donate 1 dollar to this charity and feel good about it I think.

The Persuader asks about the Persuadee's knowledge or opinion related to Save the Children.
assistant — Persuader: I'm doing well, thanks for asking. I wanted to talk to you about an important cause that I am passionate about - Save the Children. Have you heard of them before?

user — Persuadee: No I have not.

The Persuader establishes credibility of Save the Children by citing its impact.
[Generation]
assistant — Save the Children is an international non-governmental organization that is committed to improving the lives of children through education, healthcare, and other support programs. They work in over 100 countries around the world to help children who are living in poverty or affected by conflict.

Table A7: Prompting the LLM to generate a system response conditioned on a dialogue act. "Role" is used to be consistent with ChatGPT's API. Blue text marks prompts characteristic of this task. The one-shot example changes its formatting across different tasks.

User: doing great, thank you. How about you?

ChatGPT [task related inquiry]: I'm doing well, thank you. I wanted to talk to you about a charity called Save the Children. Have you heard of it before?

User: Yes, I have heard of Save the Children before. It is an international charity that works to improve the lives of children around the world by providing health care, education, and other essential services.

ChatGPT [logical appeal]: That's right! Save the Children is an amazing organization that has been working tirelessly to help children in need for over 100 years. They work in over 100 countries around the world, providing life-saving assistance to children affected by poverty, conflict, and natural disasters.

User: Yes, Save the Children is an amazing organization.

ChatGPT [proposition of donation]: Absolutely. Did you know that Save the Children has helped over 157 million children worldwide? They have provided education to over 10 million children, and have helped to save the lives of countless others through their health programs. Your donation can help make a difference in the lives of these children who are in desperate need of assistance.

User
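The prompt structure described in Table A7 (role-tagged dialogue turns followed by a natural-language instruction naming the chosen dialogue act) can be sketched as a small message builder for ChatGPT's chat API. The function name and the instruction templates below are illustrative assumptions, not the paper's verbatim prompts.

```python
# Sketch of building act-conditioned chat messages, in the spirit of Table A7.
# The instruction wordings here are assumptions modeled on the one-shot example.
ACT_INSTRUCTIONS = {
    "credibility appeal": "The Persuader establishes credibility of Save the "
                          "Children by citing its impact.",
    "emotion appeal": "The Persuader uses an emotion appeal to convince the "
                      "Persuadee.",
    "proposition of donation": "The Persuader asks if the Persuadee would "
                               "like to make a small donation.",
}

def build_messages(history, dialogue_act):
    """history: list of (speaker, utterance) pairs, where speaker is
    'Persuader' or 'Persuadee'. Returns ChatGPT-API-style messages."""
    messages = []
    for speaker, utterance in history:
        # Map Persuader turns to the assistant role, Persuadee turns to user,
        # consistent with the "Role" column in Table A7.
        role = "assistant" if speaker == "Persuader" else "user"
        messages.append({"role": role, "content": f"{speaker}: {utterance}"})
    # The final instruction names the dialogue act; the model's next reply
    # is then the system response realizing that act.
    messages.append({"role": "user", "content": ACT_INSTRUCTIONS[dialogue_act]})
    return messages
```

In actual use, `messages` would be sent to the chat completion endpoint and the returned assistant message taken as the act-conditioned system response.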