Saying No is An Art: Contextualized Fallback Responses for Unanswerable Dialogue Queries

Despite the significant progress that end-to-end neural systems have made in the last decade on task-oriented as well as chit-chat dialogue, most dialogue systems rely on hybrid approaches that combine rule-based, retrieval and generative components to produce a set of ranked responses. Such dialogue systems need a fallback mechanism to respond to out-of-domain or novel user queries that are not answerable within the scope of the system. While dialogue systems today rely on static and unnatural responses like “I don’t know the answer to that question” or “I’m not sure about that”, we design a neural approach that generates responses which are contextually aware of the user query while still saying no to the user. Such customized responses provide paraphrasing ability and contextualization, improve the interaction with the user and reduce dialogue monotonicity. Our simple approach makes use of rules over dependency parses and a text-to-text transformer fine-tuned on synthetic data of question-response pairs, generating highly relevant, grammatical as well as diverse responses. We perform automatic and manual evaluations to demonstrate the efficacy of the system.


Introduction
In order to cater to the diversity of questions spanning various domains, dialogue systems generally follow a hybrid architecture wherein an ensemble of individual response subsystems (Kuratov et al.; Harrison et al., 2020) is employed, from which an appropriate response is presented to the user (Serban et al., 2017; Finch et al., 2020; Paranjape et al., 2020). However, it is common for dialogue systems to encounter queries which are not within their scope of knowledge. While increasing the number of such subsystems would be a good strategy to increase coverage, it can be a never-ending process and a default fallback strategy would always be needed. Besides, domain-specific dialogue systems, especially those deployed in professional settings, generally prefer restricting themselves to a fixed set of domains, and purposely refrain from responding to out-of-domain and random or toxic user queries.
Figure 1: Comparison of responses of three flight booking dialogue systems: The first one does not handle unknown responses. The second one has a default fallback response. The third one has a fallback response which is contextualized with the user query.
One approach to acknowledging such queries is to have a fallback mechanism with responses like "I don't know the answer to this question" or "I'm not sure how to answer that." However, such responses are static and unengaging, and they give the impression that the user's query has gone unacknowledged or has not been understood by the system, as shown in Figure 1 above. Yu et al. (2016) have shown that static and predefined responses lead to lower levels of user engagement and decrease users' interest in interacting with the system. Yu et al. (2016) also show that a system which reacts to system breakdowns and to low user engagement achieves better user engagement.
Our fallback approach attempts to address these limitations by generating "don't-know" responses which are engaging and contextually closer to the user query. 1) Since there are no publicly available datasets for generating such contextualised responses, we synthetically generate (query, fallback response) pairs using a set of highly accurate handcrafted dependency patterns. 2) We then train a sequence-to-sequence model over synthetic and natural paraphrases of these queries. 3) Finally, we measure the grammaticality and relevance of our models in a crowd-sourced setting to assess their generation capability. We have released the code and training dataset used in our experiments publicly.

Related Work
Improving coverage to address out-of-domain queries is not a new problem in the design of dialogue systems. The most popular approach has been to present the user with chit-chat responses. Other systems such as Blender (Roller et al., 2020) and Meena (Adiwardana et al., 2020), along with related work (Rashkin et al., 2019), aim to generate social talk responses. While this might seem fitting for chit-chat and social talk dialogue systems, domain-specific systems, which often operate in professional settings, would refrain from friendly or social talk, especially to avoid the randomness of generative models. Also, architectures with multiple subsystems are always exposed to cascading errors and to profane or toxic queries. Hence systems should always have a foolproof mechanism in the form of static templates to reply from. Liang et al. (2020) use an interesting approach for error handling by mapping dialogue acts and intents to templates. Besides, as in Finch et al. (2020), it is always safer to generate fallback responses on encountering queries which might be toxic, biased or profane. Another line of work attempts to handle ambiguous user queries by asking clarification questions in return (Dhole, 2020; Zamani et al., 2020; Yu et al., 2020). While this increases user interaction and coverage to an appreciable extent, it does not eliminate the requirement of a failsafe fallback responder. This paper's contribution is to address this requirement with an enhanced version of a fallback response generator.

Methods
We describe two approaches to generate such contextual don't-know responses.

The Dependency Based Approach (DBA)
Inspired by previous approaches which use parse structures to generate questions (Heilman and Smith, 2009; Mazidi and Tarau, 2016; Dhole and Manning, 2020), we create a rule-based generator by handcrafting dependency templates to cater to a wide variety of question patterns, as shown in Table 1. We perform extensive manual testing to improve the generations from these rules and increase overall coverage. The purpose of these rules is two-fold: i) to create a high-precision fallback response generator as a baseline and ii) to help create (query, don't-know-response) pairs which could be paired with natural paraphrases to serve as seed training data for other deep learning architectures.
To build this baseline generator, we utilize a few dependency templates in the style of SynQG (Dhole and Manning, 2020). We utilize the dependency parser from Andor et al. (2016) to get the Universal Dependencies (Nivre et al., 2016, 2017, 2020) of the user query. We then convert it to a don't-know-response by re-arranging nodes according to a matched template. We further change pronouns, incorporate named entity information, and add rules to handle modals and auxiliaries. Finally, we also add rules that convert an agent-targeted question into a user-targeted response by interchanging pronouns and their supporting verbs, e.g. "you" to "I" and vice versa.
We incorporate a bit of paraphrasing by randomizing various prefixes like "I'm not sure whether", "I don't know if", etc. and randomly using named entities. We describe the high-level algorithm below and in Algorithm 1.
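The following is a minimal sketch of the kind of transformation described above, not the released implementation. It assumes a toy surface template (an auxiliary or modal followed by "you" or "I") in place of real dependency templates, and all names (PREFIXES, AUXILIARIES, SUBJECT_FLIPS, OTHER_FLIPS, AGREEMENT, dont_know_response) are hypothetical. Named entity handling and parsing are omitted.

```python
import random

# The two prefixes quoted in the text; a fuller system would list more.
PREFIXES = ["I'm not sure whether", "I don't know if"]

# Assumed set of sentence-initial auxiliaries and modals covered by the toy template.
AUXILIARIES = {"can", "could", "will", "would", "should", "do", "am", "are"}

# Pronoun flips that turn an agent-targeted question into a user-targeted response.
SUBJECT_FLIPS = {"you": "I", "i": "you"}
OTHER_FLIPS = {"you": "me", "me": "you", "your": "my", "my": "your"}

# Supporting verbs that must change together with the flipped subject.
AGREEMENT = {("I", "are"): "am", ("you", "am"): "are"}


def dont_know_response(query, rng=random):
    """Return a contextualized fallback, or None when the toy template does not match."""
    tokens = query.strip().rstrip("?!.").split()
    if (
        len(tokens) < 3
        or tokens[0].lower() not in AUXILIARIES
        or tokens[1].lower() not in SUBJECT_FLIPS
    ):
        return None
    aux = tokens[0].lower()
    subject = SUBJECT_FLIPS[tokens[1].lower()]
    # Flip the remaining pronouns (treated as non-subject forms in this sketch).
    rest = [OTHER_FLIPS.get(t.lower(), t) for t in tokens[2:]]
    aux = AGREEMENT.get((subject, aux), aux)
    # Move the auxiliary after the subject; do-support is dropped in the declarative.
    clause = [subject] if aux == "do" else [subject, aux]
    return " ".join([rng.choice(PREFIXES)] + clause + rest) + "."
```

Returning None for unmatched queries mirrors the high-precision design goal: the rules either fire with a known template or stay silent, which is also what makes the coverage of the rule-based approach measurable.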

Sequence-to-Sequence Approach
Owing to the expected low coverage and limited scalability of the rule-based approach, we take advantage of pre-trained neural architectures to create a sequence-to-sequence fallback responder. To incorporate noise and keep the model from over-fitting on the handcrafted transformations, we do not train the model directly on the (query, don't-know-response) pairs generated in the previous section. From all the questions of the Quora Question Pairs dataset (QQP), we first select those for which the dependency based rules generate a reply. We then pair these don't-know-responses with paraphrases of the input questions rather than with the input questions themselves; question pairs which have the label "1", i.e. are marked as similar, are used as paraphrases. Besides primarily avoiding over-fitting on the dependency patterns, this also helps generate don't-know-responses which are paraphrastic in nature.
After incorporating paraphrases from QQP, we are able to build a dataset of 100k pairs, which we call the "I Dont Know Dataset" (IDKD). Following the success of text-to-text transformers, we use the pre-trained T5 transformer (Raffel et al., 2020a,b) as our sequence-to-sequence model. We divide IDKD into a train and validation split of 80:20. We use the Transformers code from HuggingFace (Wolf et al., 2020) to fine-tune a T5-base model over IDKD for 2 epochs.
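The data construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the released pipeline: the triple layout (question1, question2, is_duplicate), the use of both pairing directions, the shuffling seed and the names build_idkd, train_validation_split and toy_responder are all hypothetical. The toy responder merely stands in for the dependency based rules.

```python
import random


def toy_responder(question):
    """Stand-in for the dependency based rules: handles only 'Can you ...?'."""
    prefix = "can you "
    if not question.lower().startswith(prefix):
        return None
    return "I don't know if I can " + question[len(prefix):].rstrip("?") + "."


def build_idkd(question_pairs, rule_based_responder):
    """Pair each rule-generated response with the paraphrase of its source question."""
    examples = []
    for q1, q2, is_duplicate in question_pairs:
        if is_duplicate != 1:  # only pairs labelled "1" count as paraphrases
            continue
        # Assumption: either question of a pair may act as the rule input.
        for source, paraphrase in ((q1, q2), (q2, q1)):
            response = rule_based_responder(source)
            if response is not None:
                # The model input is the paraphrase, not the question the rules saw.
                examples.append((paraphrase, response))
    return examples


def train_validation_split(examples, train_fraction=0.8, seed=0):
    """Shuffle deterministically and split, 80:20 by default."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because the input side of each pair is a paraphrase, the sequence-to-sequence model never sees the exact surface form that triggered a rule, which is the source of the noise the section refers to.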

Results
Most prior generation systems are evaluated on a range of automatic metrics like BLEU (Papineni et al., 2002) and ROUGE, which are used in the machine translation literature. However, owing to the drawbacks of these metrics, we perform human evaluation of the generated responses using two metrics, namely "relevance" and "grammaticality", as defined in Dhole and Manning (2020). We evaluate the performance of both approaches in a crowd-sourced setting by requesting English-schooled individuals to rate the responses. Raters were asked to evaluate grammaticality in a binary setting (grammatical/ungrammatical) and relevance on a Likert scale (1 to 5).
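As an illustration of how such ratings can be aggregated, the sketch below reports grammaticality as the percentage of responses judged grammatical and relevance as the mean Likert score. The record layout and the function name summarize_ratings are assumptions, and the aggregation actually used for the reported scores may differ.

```python
from statistics import mean


def summarize_ratings(ratings):
    """Aggregate binary grammaticality and 1-5 relevance ratings for one system."""
    grammatical_share = mean(1.0 if r["grammatical"] else 0.0 for r in ratings)
    return {
        "grammaticality_pct": 100.0 * grammatical_share,
        "mean_relevance": mean(r["relevance"] for r in ratings),
    }
```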
Our human evaluations are shown in Table 2. T5 responses tend to be more grammatical than their dependency counterparts by a large margin of 6%. Relevance scores drop slightly from 3.97 to 3.66. This can be largely attributed to the model's paraphrastic ability to describe words and connected events outside the knowledge of the user's query. E.g., in the second query in Table 4, if the string "MIT" were something other than an institution, the dependency based approach would seem safer than the seq2seq approach.
In addition, T5 responses on average contain at least double the number of novel words compared with their dependency counterparts, as shown in Table 3. Sentence length mostly remains unaffected across the two models. The rule-based model, despite being highly relevant, is only able to reply to 54.5% of random QQP queries.
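The two quantities discussed here can be computed along the following lines. This is a sketch under assumptions: the text does not define "novel words", so they are taken to be distinct response tokens that do not occur in the query, with a simple lowercase regular-expression tokenizer. The names tokenize, novel_words and coverage are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split into simple word tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def novel_words(query, response):
    """Distinct response tokens that never occur in the query, in order of appearance."""
    query_vocab = set(tokenize(query))
    return [w for w in dict.fromkeys(tokenize(response)) if w not in query_vocab]


def coverage(queries, responder):
    """Fraction of queries for which the responder produces any reply."""
    if not queries:
        return 0.0
    return sum(1 for q in queries if responder(q) is not None) / len(queries)
```

Under this definition the fallback prefix itself counts towards novelty for both systems, so the comparison between the two approaches reflects the words each adds beyond it.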
The T5 model helped not only to add paraphrastic variations but also to scale to user queries outside the scope of the dependency templates. More importantly, without losing the original ability to say no, the model was able to generate more natural-sounding don't-know-responses by utilizing its inherent world knowledge acquired during pre-training. Table 4 shows some interesting examples. The phrases highlighted in blue show the benefits of the model's pre-training.

Conclusion and Future work
We describe two simple approaches which enhance user interaction and cater to the necessities of real-life dialogue systems, which are generally a tapestry of multiple solitary subsystems. In order to avoid cascading errors from such systems, as well as to refrain from answering out-of-domain and toxic queries, it is natural to have a fallback approach to say no. We argue that such a fallback approach could be contextualised to generate engaging responses by having multiple ways of saying no rather than a one-string-for-all approach. The appeal of our approach is the ease with which it can fit within any larger dialogue design framework.
Of course, this is not to deny that as we give more paraphrasing power to the fallback system, it tends to drift away from succinctly replying with a no, as is evident from the drop in the relevance scores. Nevertheless, we still believe that both our fallback approaches could serve as effective baselines for future work.