Zero-Shot Prompting for Implicit Intent Prediction and Recommendation with Commonsense Reasoning

Intelligent virtual assistants are currently designed to perform tasks or services explicitly mentioned by users, so multiple related domains or tasks need to be performed one by one through a long conversation with many explicit intents. Instead, human assistants are capable of reasoning about (multiple) implicit intents based on user utterances via commonsense knowledge, reducing complex interactions and improving practicality. Therefore, this paper proposes a framework of multi-domain dialogue systems, which can automatically infer implicit intents based on user utterances and then perform zero-shot prompting using a large pre-trained language model to trigger suitable single task-oriented bots. The proposed framework is demonstrated to be effective in realizing implicit intents and recommending associated bots in a zero-shot manner.


Introduction
Intelligent assistants have become increasingly popular in recent years, but they require users to explicitly describe their tasks within a single domain. Yet, the exploration of gradually guiding users through individual task-oriented dialogues has been relatively limited (Chiu et al., 2022). This limitation is amplified when tasks extend across multiple domains, compelling users to interact with numerous bots to accomplish their goals (Sun et al., 2016). For instance, planning a trip might involve interacting with one bot for flight booking and another for hotel reservation, each requiring distinct, task-specific intentions like "Book a flight ticket" to activate the corresponding bot, such as an airline bot. In contrast, human assistants can manage high-level intentions spanning multiple domains, utilizing commonsense knowledge. This approach renders conversations more pragmatic and efficient, reducing the user's need to deliberate over each task separately. To overcome this limitation of current intelligent assistants, we present a flexible framework capable of recommending task-oriented bots within a multi-domain dialogue system, leveraging commonsense-inferred implicit intents as depicted in Figure 1.

Sun et al. (2016) pinpointed the challenges associated with a multi-domain dialogue system, such as 1) comprehending single-app and multi-app language descriptions, and 2) conveying task-level functionality to users. They also gathered multi-app data to encourage research in these directions. The HELPR framework (Sun et al., 2017) was the pioneering attempt to grasp users' multi-app intentions and consequently suggest appropriate individual apps. Nevertheless, previous work focused on understanding individual apps based on high-level descriptions exclusively through user behaviors, necessitating a massive accumulation of personalized data. Due to the lack of paired data for training, our work leverages external commonsense knowledge to bridge the gap between high-level utterances and their task-specific bots. This approach enables us to consider a broad range of intents for improved generalizability and scalability.

Multi-Domain Realization
Commonsense Reasoning Commonsense reasoning involves making assumptions about the nature and essence of typical situations humans encounter daily. These assumptions encompass judgments about the attributes of physical objects, taxonomic properties, and individuals' intentions. Existing commonsense knowledge graphs such as ConceptNet (Bosselut et al., 2019), ATOMIC (Sap et al., 2019), and TransOMCS (Zhang et al., 2021) facilitate models to reason over human-annotated commonsense knowledge. This paper utilizes a generative model trained on ATOMIC-2020 (Hwang et al., 2021) to predict potential intents linking given user high-level utterances with corresponding task-oriented bots. The inferred intents can activate the relevant task-oriented bots and also serve as justification for recommendations, thereby enhancing explainability. This work is the first attempt to integrate external commonsense relations with task-oriented dialogue systems.
Zero-Shot Prompting Recent research has revealed that large language models (Radford et al., 2019; Brown et al., 2020) have acquired an astounding ability to perform few-shot tasks by using a natural-language prompt and a handful of task demonstrations as input context (Brown et al., 2020). Guiding the model with interventions via an input can render many downstream tasks remarkably easier if those tasks can be naturally framed as a cloze test problem through language models. As a result, the technique of prompting, which transposes tasks into a language model format, is increasingly being adopted for different tasks (Zhao et al., 2021; Schick and Schütze, 2021). Without available data for prompt engineering (Shin et al., 2020), we exploit the potential of prompting for bot recommendation in a zero-shot manner. This strategy further extends the applicability of our proposed framework and enables it to accommodate a wider variety of user intents and tasks, thus contributing to a more versatile and efficient multi-domain dialogue system.

Framework
Figure 2 illustrates our proposed two-stage framework, which consists of: 1) a commonsense-inferred intent generator, and 2) a zero-shot bot recommender. Given a user's high-level intention utterance, the first component focuses on generating implicit task-oriented intents. The second component then utilizes these task-specific intents to recommend appropriate task-oriented bots, considering the bots' functionality through a large pre-trained language model.

Commonsense-Inferred Intent Generation
The commonsense-inferred implicit intents function not only as prompts for bot recommendation but also as rationales for the suggested bots, thereby establishing a solid connection between the high-level intention and task-oriented bots throughout the conversation. For instance, the multi-domain system shown in Figure 1 recommends not only the AirlineBot but also describes its functionality ("can book a flight ticket") to better convince the user about the recommendation.

Relation Trigger Selection
ATOMIC-2020 is a commonsense knowledge graph featuring commonsense relations across three categories: social-interaction, event-centered, and physical-entity relations, all of which concern situations surrounding a specified event of interest. Following Hwang et al. (2021), we employ a BART model (Lewis et al., 2020) pre-trained on ATOMIC-2020 to generate related entities and events based on the input sentence. However, despite having a total of 23 commonsense relations, not all are suitable for inferring implicit intents in assistant scenarios. We utilize AppDialogue data (Sun et al., 2016) to determine which commonsense relations can better trigger the task-specific intents. Given a high-level intention description u_i and its task-specific sentences s_ij, we calculate the trigger score of each relation r as an indicator of its suitability as a trigger relation:

T(r) = Σ_i Σ_j P_BART([u_i, r, s_ij]),

where P_BART([u_i, r, s_ij]) represents the probability of the sentence beginning with the high-level user description u_i, followed by a relation trigger r, and the corresponding task-specific sentences s_ij. By summing up multiple task-specific sentences over j and all samples over i, a higher T(r) implies that the relation r can better trigger implicit task-oriented intents in assistant scenarios. We identify a total of five relations with the highest T(r) and present their definitions (Sap et al., 2019) in Table 1. These relations are also reasonable from a human perspective to trigger implicit user intents.

Table 1: Selected relations, their definitions, and examples.
Social:
- xIntent: the likely intent or desire of an agent (X) behind the execution of an event; "X gives Y gifts" → X wanted "to be thoughtful"
- xNeed: a precondition for X achieving the event; "X gives Y gifts" → X must first "buy the presents"
- xWant: post-condition desires on the part of X; "X gives Y gifts" → X may also desire "to hug [Y]"
Event:
- isAfter: events that can precede an event; "X is in a hurry to get to work" → "X wakes up late"
- isBefore: events that can follow an event; "X is in a hurry to get to work" → "X drives too fast"
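Relation trigger selection thus reduces to scoring every candidate relation against the AppDialogue pairs and keeping the top five. A minimal sketch of this computation is given below; the function names and the toy stand-in scorer are illustrative only, since the real score would come from the pre-trained BART model's sequence probability:

```python
from collections import defaultdict

def trigger_scores(data, p_bart, relations):
    """Compute T(r) = sum_i sum_j P_BART([u_i, r, s_ij]) for each relation r.

    data: list of (u_i, [s_i1, s_i2, ...]) pairs, i.e. a high-level
          description and its task-specific sentences;
    p_bart: callable scoring one (utterance, relation, sentence) triple,
            a stand-in for the BART sequence probability.
    """
    scores = defaultdict(float)
    for relation in relations:
        for utterance, task_sentences in data:
            for sentence in task_sentences:
                scores[relation] += p_bart(utterance, relation, sentence)
    return dict(scores)

# Toy stand-in scorer that simply favors "xNeed"; a real scorer runs BART.
def toy_scorer(u, r, s):
    return 0.9 if r == "xNeed" else 0.1

data = [("plan a birthday dinner", ["book a table", "invite friends"])]
scores = trigger_scores(data, toy_scorer, ["xIntent", "xNeed", "xWant"])
best = max(scores, key=scores.get)  # -> "xNeed" with the toy scorer
```

In the actual pipeline the five highest-scoring relations are retained as R.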

Commonsense Knowledge Generation
Given the selected relations R = {r_1, r_2, ..., r_5}, where r_i represents the i-th relation from {xIntent, xNeed, xWant, isAfter, isBefore}, we concatenate each relation with a user utterance u to serve as the context input for our pre-trained BART model:

<s> u </s> r_i [GEN],

where <s> and </s> are special tokens in BART, and [GEN] is a unique token employed during the pre-training of BART to initiate the commonsense-related events. BART accepts this input and decodes the commonsense events into implicit task-oriented intents Y = {y^1_{1:k}, y^2_{1:k}, ..., y^5_{1:k}}, where y^i_k denotes the k-th generated commonsense event of the relation r_i.
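The context construction for the five selected relations can be sketched as follows, assuming the token layout with <s>, </s>, and [GEN] described above; the helper names are hypothetical, and the decoding step with the pre-trained BART model itself is omitted:

```python
RELATIONS = ["xIntent", "xNeed", "xWant", "isAfter", "isBefore"]

def build_context(utterance, relation):
    """Format one context input for the ATOMIC-2020 BART generator:
    the utterance wrapped in BART's special tokens, the relation,
    and the [GEN] token that triggers commonsense event generation."""
    return f"<s> {utterance} </s> {relation} [GEN]"

def build_all_contexts(utterance):
    # One context per selected relation; each would be decoded by the
    # model into the top-k commonsense events y^i_{1:k}.
    return [build_context(utterance, r) for r in RELATIONS]

contexts = build_all_contexts(
    "We are planning to celebrate a friend's birthday at a restaurant."
)
```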

Zero-Shot Bot Recommendation
With the inferred intents, the second component aims to recommend appropriate bots capable of executing the anticipated tasks. To pinpoint the task-specific bots based on the required functionality, we leverage the remarkable capacity of a large pre-trained language model, assuming that app descriptions form a part of the pre-trained data.

Pre-trained Language Model
The language model used in this study is GPT-J 6B, a GPT-3-like causal language model trained on the Pile dataset (Radford et al., 2019), a diverse, open-source language modeling dataset that comprises 22 smaller, high-quality datasets combined together. Making the assumption that app descriptions in mobile app stores are incorporated in the pre-training data, we exploit the learned language capability to suggest task-oriented bots based on the given intents.

Prompting for Bot Recommendation
To leverage the pre-trained language capability of GPT-J, we manually design prompts for each relation type. For social-interaction relations, the prompt is formulated as "The user r_i y^i_{1:k} by using a popular app called". For instance, Figure 2 generates a prompt "The user needs to go to the restaurant and make the reservation by using a popular app called". For event-centered relations, we simply concatenate the generated events with the app prompt to trigger the recommended task-oriented apps/bots.
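The prompt templates can be sketched as below. Only the xNeed phrasing ("needs ...") is attested by the example in the paper; the verbalizations for xIntent and xWant are our analogous assumptions, as is the exact event-centered concatenation:

```python
# Hypothetical verbalizers for the social-interaction relations; only the
# "xNeed" -> "needs" mapping is taken from the paper's example.
SOCIAL_VERBALIZER = {"xIntent": "wants", "xNeed": "needs", "xWant": "wants"}

APP_PROMPT = "by using a popular app called"

def social_prompt(relation, intents):
    """Build the recommendation prompt for a social-interaction relation."""
    joined = " and ".join(intents)
    return f"The user {SOCIAL_VERBALIZER[relation]} {joined} {APP_PROMPT}"

def event_prompt(events):
    """For event-centered relations, concatenate the generated events
    with the app prompt."""
    return " ".join(events) + " " + APP_PROMPT

prompt = social_prompt(
    "xNeed", ["to go to the restaurant and make the reservation"]
)
# -> "The user needs to go to the restaurant and make the reservation
#     by using a popular app called"
```

Feeding such a prompt to GPT-J lets it complete the sentence with candidate app names.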

Experiments
To evaluate the zero-shot performance of our proposed framework, we collected a test set specific to our multi-domain scenarios. We recruited six volunteers who were knowledgeable about the target scenarios to gather their high-level intention utterances along with the associated task-oriented bots.
Upon filtering out inadequate data, our test set incorporated a total of 220 task-oriented bots and 92 high-level utterances, each linked with an average of 2.4 bots. The number of bot candidates considered in our experiments is 6,264, highlighting the higher complexity of our tasks. Our primary aim is to connect a high-level intention with its corresponding task-oriented bot recommendation by leveraging external commonsense knowledge. Therefore, we assess the effectiveness of the proposed methodology and compare it with a 1-stage prompting baseline using GPT-J to maintain fairness in comparison. For this baseline, we perform simple prompting on the user's high-level utterance concatenated with a uniform app-based prompt: "so I can use some popular apps called". In response to these context prompts, GPT-J generates the associated (multiple) app names, serving as our baseline results.
To further investigate whether our proposed commonsense-inferred implicit intent generator is suitable for our recommendation scenarios, we introduce another 2-stage prompting baseline for comparison. Taking into account that contemporary large language models exhibit astonishing proficiency in commonsense reasoning, we substitute our first component with the state-of-the-art GPT-3 (Brown et al., 2020) to infer implicit intents, serving as another comparative baseline.

Automatic Evaluation Results
Considering that multiple bots can fulfill the same task (functionality), we represent each app by its category as defined on Google Play, then compute precision, recall, and F1 score at the category level. This evaluation better aligns with our task objective; for instance, both "WhatsApp" and "Line" belong to the same category ("Communication"), as demonstrated in Table 3.
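One plausible implementation of the category-level metrics for a single utterance is sketched below; the paper does not spell out the averaging scheme across utterances or the handling of apps absent from the category map, so both are assumptions here:

```python
def category_prf(predicted, gold, app2category):
    """Category-level precision/recall/F1 for one utterance: apps are
    first mapped to their Google Play categories, then the predicted
    and gold category sets are compared."""
    pred_cats = {app2category[a] for a in predicted}
    gold_cats = {app2category[a] for a in gold}
    tp = len(pred_cats & gold_cats)
    precision = tp / len(pred_cats) if pred_cats else 0.0
    recall = tp / len(gold_cats) if gold_cats else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# "WhatsApp" and "Line" share the category "Communication", so predicting
# one when the other is labeled still counts as a category-level match.
cats = {"WhatsApp": "Communication", "Line": "Communication", "Uber": "Travel"}
p, r, f = category_prf(["WhatsApp", "Uber"], ["Line"], cats)
# -> precision 0.5, recall 1.0
```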
Table 2 shows that the 2-stage methods significantly outperform the 1-stage baseline, suggesting that commonsense knowledge is useful to bridge high-level user utterances with task-oriented bots. Further, our proposed approach, which leverages external commonsense knowledge, achieves superior precision over GPT-3, a quality that is more important in recommendation scenarios. The reason is that GPT-3 may generate hallucinations, inferring more diverse but potentially unsuitable intents.

Human Evaluation Results
Given that our goal can be interpreted as a recommendation task, suggested bots that differ from user labels can still be reasonable and useful to users. Therefore, we recruited crowd workers from Amazon Mechanical Turk (AMT) to evaluate the relevance of each recommended result given its high-level user utterance. Each predicted bot or app is assessed by three workers on a three-point scale: irrelevant (1), acceptable (2), and useful (3). The human-judged scores are reported in the right part of Table 2; our proposed framework achieves an average score of 2.18, implying that most recommended tasks are above acceptable. Compared with the 1-stage baseline, whose score is below 2, this demonstrates that commonsense-inferred implicit intents can more effectively connect to reasonable task-oriented bots. Considering that the score of 2-stage prompting is also good, we report the pairwise comparison in Table 4, where we can see that humans prefer ours to the 2-stage prompting baseline for 57% of the data.
In addition to simply suggesting task-oriented bots, providing the rationale behind their recommendation could help users better judge their utility. Within our proposed framework, the commonsense-inferred implicit intents, which are automatically generated by the first component, can act as the explanations for the recommended task-oriented bots, as illustrated in Table 3. Consequently, we provide these rationales alongside the recommended results using the predicted intents and undergo the same human evaluation process. Table 4 validates that providing these justifications results in improved performance from a human perspective, further suggesting that commonsense-inferred intents are useful not only for prompting task-oriented bots but also for generating human-interpretable recommendations.

Discussion
Table 5 showcases the implicit intents generated by our proposed COMeT generator and GPT-3. It is noteworthy that GPT-3 occasionally produces hallucinations, which can render the recommended bots unsuitable. For instance, given the text prompt "My best friend likes pop music.", GPT-3 infers an intent to "buy a ticket to see Justin Bieber", which may not align accurately with the user's need.
Hence, our experiments reveal that while 2-stage prompting achieves higher recall, its precision is lower. As our objective is to recommend reasonable task-specific bots, higher precision is more advantageous in our scenarios.

Conclusion
This paper introduces a pioneering task centered around recommending task-oriented dialogue systems solely based on high-level user intention utterances. The proposed framework leverages the power of commonsense knowledge to facilitate zero-shot bot recommendation. Experiments show that the recommended bots are reasonable under both automatic and human evaluation, and that the inferred intents provide informative and interpretable rationales to better convince users of the recommendations for practical usage. This approach bridges the gap between high-level user intentions and actionable bot recommendations, paving the way for a more intuitive and user-centric conversational AI landscape.

Limitations
This paper acknowledges three main limitations: 1) the constraints of a zero-shot setting, 2) an uncertain generalization capacity due to limited data in the target task, and 3) the longer inference time required by a large language model. Given the absence of data for our task and the complexity of the target scenarios, collecting a large dataset for supervised or semi-supervised learning presents a significant challenge. As the first approach tackling this task, our framework performs the task in a zero-shot manner, but is applicable to fine-tuning if a substantial dataset becomes available. Consequently, we expect that future research could further train the proposed framework using supervised learning or fine-tuning, thereby enhancing the alignment of inferred implicit intents and recommended bots with training data. This would expand our method to various learning settings and validate its generalization capacity.
Moreover, the GPT-J model used for recommending task-oriented bots is considerably large given academic resources, thereby slowing down inference speed. To mitigate this, our future work intends to develop a lightweight student model that accelerates the prompt inference process. Such a smaller language model could not only expedite the inference process to recommend task-oriented bots but also be conveniently fine-tuned using collected data.
Despite these limitations, this work can be considered the pioneering attempt to leverage commonsense knowledge to link task-oriented intents. The significant potential of this research direction is evidenced within this paper.

Ethics Statement
This work primarily targets the recommendation of task-oriented bots, necessitating a degree of personalization. To enhance recommendation effectiveness, personalized behavior data may be collected for further refinement. Balancing the dynamics between personalized recommendation and privacy is a critical consideration. The data collected may contain subjective annotations, and the present paper does not delve into these issues in depth. Future work should address these ethical considerations, ensuring a balance between personalized recommendations and privacy preservation.

A Implementation Details
In our zero-shot bot recommendation experiments, which are evaluated using Android apps based on RICO data (Deka et al., 2017), we append the phrase "in Android phone" to all prompts. This helps guide the resulting recommendations. Task-oriented prompts are fed into GPT-J to generate token recommendations for bots/apps, such as "OpenTable", an Android app, which aligns better with our evaluation criteria.
In the 2-stage prompting baseline, our prompts for GPT-3, designed to generate commonsense-related intents, are coupled with our selected relations to ensure a fair comparison. These prompts are outlined in Table 6.

B Reproducibility
To enhance reproducibility, we release our data and code. Detailed parameter settings employed in our experiments are as follows.
In commonsense knowledge generation, we apply beam search during generation, setting beam_size=10. In prompting for bot recommendation, a sampling strategy is implemented during recommendation generation, with max_length=50, temperature=0.01, and top_p=0.9.
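These settings map onto Hugging Face `model.generate(...)` keyword arguments roughly as follows; the exact argument names (e.g. beam_size -> num_beams) are our assumption, model loading is omitted, and unstated options such as the number of returned sequences are left out:

```python
# Generation settings from the experiments, expressed as keyword
# arguments for a Hugging Face `model.generate(...)` call.

# Commonsense knowledge generation with the BART generator: beam search.
COMMONSENSE_GENERATION_KWARGS = dict(
    num_beams=10,  # beam_size=10 in the paper
)

# Bot recommendation with GPT-J: sampling with a near-greedy temperature.
RECOMMENDATION_GENERATION_KWARGS = dict(
    do_sample=True,
    max_length=50,
    temperature=0.01,
    top_p=0.9,
)
```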

C Crowdsourcing Interface
Figures 3 and 4 display annotation screenshots for both types of outputs. Workers are presented with a recommendation result from 1) the user-labeled ground truth, 2) the baseline, and 3) our proposed method. Note that results accompanied by reasons originate only from our proposed method.

D Qualitative Analysis
Table 7 features additional examples from our test set, highlighting our method's ability to use commonsense knowledge to recommend more appropriate apps than the baseline, and broaden user choices.
In the first example, our method discerns the user's financial needs and suggests relevant financial apps such as Paypal. Conversely, the baseline method could only associate the user's needs with communication apps like WeChat, possibly influenced by the term "friend" in the high-level description.
In the second example, our method infers potential user intents about checking their bank account and purchasing a new notebook, thus recommending Paypal for bank account management and Amazon for shopping.
In the third example, the user mentions having a tight schedule.Hence, our method suggests Uber to expedite the user's commute to the movie theater or Netflix for instant access to movies.

Figure 2: Our zero-shot framework for triggering task-oriented bots via the commonsense-inferred prompts.

Figure 3: An annotation screenshot of annotating the recommended apps/bots on the Amazon Mechanical Turk, where the results may come from the ground truth, the baseline, or the proposed method.

Figure 4: An annotation screenshot of annotating the recommended apps/bots together with the predicted intents as reasons on the Amazon Mechanical Turk.
User Input (Table 3 example): We are planning to celebrate a friend's birthday at a restaurant.

Table 2 :
Evaluation scores (%).
Reasons (Table 3 example): OpenTable can help book a table at the restaurant and go to the restaurant.

Table 3 :
Generated results for given user high-level descriptions.

Table 6
Reasons: Google Wallet can help check if the money was sent to the right place; WhatsApp can help find out where the money came from and who sent it; Paypal can help give the money to my friend.
User Input: My notebook was broken. I need to get a new one. Check how much money is left in my account.
User-labeled: Shopee (Shopping)
Baseline: Google Play (Google Play)
Proposed: Google Play (Google Play), Amazon (Shopping), Mint (Tools), Paypal (Finance)
Reasons: Google Play can help to buy a new one and to buy a new notebook; Amazon can help to buy a new one and find out how much money is left; Mint can help to buy a new notebook; PayPal can help when my credit card is maxed out and I can't afford a new one.
Reasons: WhatsApp can help to be entertained and to have fun; Netflix can help find a movie to watch; Youtube can help go to the movies; Uber can help when you have a lot of work to do and have to go to work.

Table 7 :
Generated results for given user high-level descriptions.