DialGuide: Aligning Dialogue Model Behavior with Developer Guidelines

Dialogue models are able to generate coherent and fluent responses, but they can still be challenging to control and may produce non-engaging or unsafe responses. This unpredictability diminishes user trust and can hinder the use of the models in the real world. To address this, we introduce DialGuide, a novel framework for controlling dialogue model behavior using natural language rules, or guidelines. These guidelines provide information about the context they are applicable to and what should be included in the response, allowing the models to generate responses that are more closely aligned with the developer's expectations and intent. We evaluate DialGuide on three tasks in open-domain dialogue response generation: guideline selection, response generation, and response entailment verification. Our dataset contains 10,737 positive and 15,467 negative dialogue context-response-guideline triplets across two domains: chit-chat and safety. We provide baseline models for the tasks and benchmark their performance. We also demonstrate that DialGuide is effective in the dialogue safety domain, producing safe and engaging responses that follow developer guidelines.


Introduction
Current open-domain dialogue models such as DialoGPT (Zhang et al., 2020), Blenderbot (Roller et al., 2021), and PLATO (Bao et al., 2021) have shown the ability to generate fluent and interesting responses. However, they are generally difficult to control and require large datasets to re-purpose them for a new task or domain. On the other hand, deployed conversational systems generally rely on handcrafted rules and templates (Tesauro et al., 2013; Ralston et al., 2019; Juraska et al., 2021; Konrád et al., 2021; Chi et al., 2022). Such systems allow for more control over responses and produce interesting, high-quality responses, yet they are rigid and have poor coverage due to the difficulty of writing responses for every situation. We propose a new framework, DIALGUIDE, to control dialogue response generation using natural language rules, which we call guidelines. A guideline consists of an "if x" condition part specifying the context it is relevant to, and a "then y" action part that specifies what the response should contain. Figure 1 presents an overview of our framework. We use a retrieve-then-infer process to retrieve guidelines relevant to the context and then use one of them to either generate or verify a response candidate.
Using guidelines in our proposed framework offers several benefits. Guidelines allow developers to drive system actions toward predefined agendas, enhance response engagement, and address common issues in system outputs such as the generation of toxic responses. Guidelines can be added, removed, or edited during deployment, and updating them does not require retraining the model. Guidelines are also more flexible than regex-based rules because they operate at a more abstract level. Our framework merges language models' instruction understanding with developers' intuitive guidelines expressed in natural language. It is important to note that the model's ability to generate responses is not limited solely to the guidelines present. In the absence of relevant guidelines, our models are trained to generate responses directly without conditioning on guidelines.
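To make the "if x, then y" structure concrete, the following is a minimal sketch of how a guideline might be represented in code. The class and field names are illustrative assumptions, not part of the released DialGuide artifacts.

```python
from dataclasses import dataclass

@dataclass
class Guideline:
    """An 'if condition, then action' rule for response control."""
    condition: str  # the dialogue contexts this guideline applies to
    action: str     # what a response should contain when it applies

    def to_prompt(self) -> str:
        # Serialize for an instruction-tuned generator (format assumed).
        return f"If {self.condition}, then {self.action}."

# Hypothetical example: abstract conditions generalize better than
# specific ones, e.g. "learning a musical instrument" vs. "learning piano".
g = Guideline(
    condition="the user talks about learning a musical instrument",
    action="ask a question about their practice routine",
)
print(g.to_prompt())
```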
In the DIALGUIDE framework (code and data will be made available), we benchmark three tasks: 1) Guideline selection, where a model needs to retrieve context-relevant guidelines; 2) Response generation, where a model generates a response that follows a selected guideline; and 3) Response entailment verification, where the model determines whether a response follows or violates a guideline. We augment conversations from existing dialogue datasets - BlendedSkillTalk (Smith et al., 2020) and ProsocialDialog (Kim et al., 2022a) - by collecting annotations of 1) guidelines that are relevant or irrelevant to the conversation context and 2) responses that follow or violate the guideline. To test the models' semantic understanding, we also create adversarial train and test sets. We establish benchmark performance on these tasks and show that models tuned on our data can generate better-controlled and coherent responses. Although the dataset is medium-sized, few-shot models generalize to new guidelines and contexts. We also demonstrate our framework's effectiveness in the dialogue safety domain, generating safe and engaging responses.

Related Work
Controlling Dialogue Systems has been a focus of research to generate engaging responses (Ghazarian et al., 2021), prevent toxic content and biases (Dinan et al., 2020; Xu et al., 2021a), steer the conversation towards specific keywords or topics (Tang et al., 2019; Gupta et al., 2022a), and ground responses in background knowledge such as persona (Song et al., 2019), emotions (Zhong et al., 2019), or documents (Zhao et al., 2020; Li et al., 2022). Many approaches train models on discrete labels or control codes, but this can be inflexible and requires retraining to incorporate new labels. While neural dialogue models are the mainstream in research, chatbots in deployment often still rely on handcrafted rules (Suendermann et al., 2009; Liu and Mei, 2020) and templates (Reiter et al., 2005; McRoy et al., 2003) due to the ease of update and the ability to generate high-quality, controllable responses. There has also been progress in using natural language prompts and instructions to control models (Gupta et al., 2022b; Mi et al., 2022; Chung et al., 2022); our work extends this line by providing fine-grained semantic control over open-domain response generation through guidelines.
Fixing Models through Intervention. Some recent work has explored editing models by computing targeted changes in the model's parameters (Sinitsin et al., 2020; Hase et al., 2021; Mitchell et al., 2021; Meng et al., 2022), while others have explored natural language feedback (Madaan et al., 2022; Scheurer et al., 2022; Zeidler et al., 2022). Our approach differs in that guidelines "patch" a model by controlling its behavior over problematic contexts and steering it toward the desired behavior, rather than modifying the model's parameters.
Dialogue Safety is an important concern for conversational models, as they can generate harmful content, exhibit social biases, and align themselves with offensive statements (Xu et al., 2021b; Baheti et al., 2021; Barikeri et al., 2021; Dinan et al., 2022). Several approaches have been proposed to address these issues, such as filtering unsafe text from training data (Xu et al., 2021b; Ngo et al., 2021), using specialized decoding procedures for safer generation (Liu et al., 2021), and controlling language generation (Keskar et al., 2019; Dathathri et al., 2020). Other approaches include strategies for responding to problematic contexts, such as steering away from toxicity (Baheti et al., 2021; Arora et al., 2022), using apologies (Ung et al., 2022), and non-sequiturs (Xu et al., 2021b). Our work is closely related to ProsocialDialog (Kim et al., 2022a), a dataset where speakers disagree with unethical and toxic contexts using safety labels and social norms. Using guidelines allows for more fine-grained control, since guidelines specify the contexts they are relevant to and can yield more informative responses.

Response Entailment and Selection
Response selection involves selecting a response from a candidate set based on the context of a conversation or background knowledge (Lowe et al., 2017; Yuan et al., 2019; Gu et al., 2020). Response entailment (Welleck et al., 2019; Nie et al., 2021; Gupta et al., 2022c) predicts whether a response entails a given premise. Our task design is similar, as we determine the entailment of a candidate response based on a guideline. This can be applied to response selection when multiple candidates are available and we need to select those that align with a given guideline.
Proposed Task and Data Collection

DIALGUIDE consists of the following tasks:
• Guideline retrieval: Retrieve the most appropriate guidelines relevant to the context.
• Response generation: Generate a response that follows the specified guideline.
• Response entailment verification: Infer whether or not a response follows the guideline.
At test time, a model first retrieves a guideline most relevant to the context. Then, a model either generates a response based on the guideline(s) or checks whether a response follows the guideline(s), as sketched below.
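The test-time flow can be summarized in a short sketch. This is a hypothetical illustration of the retrieve-then-infer process, assuming a relevance scorer and a guideline-conditioned generator passed in as callables; none of the function names come from the released code.

```python
from typing import Callable, List, Optional

def respond(
    context: str,
    guidelines: List[str],
    relevance: Callable[[str, str], float],         # guideline retrieval scorer
    generate: Callable[[str, Optional[str]], str],  # guideline-conditioned generator
    threshold: float = 0.5,
) -> str:
    """Retrieve-then-infer: score all guidelines against the context,
    keep the best one if it clears the threshold, then generate.
    With no sufficiently relevant guideline, generate from the context
    alone (the paper trains its models to support this fallback)."""
    best, best_score = None, float("-inf")
    for g in guidelines:
        score = relevance(context, g)
        if score > best_score:
            best, best_score = g, score
    chosen = best if best_score >= threshold else None
    return generate(context, chosen)
```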
We collected two datasets for DIALGUIDE. For DIALGUIDE-BST, we augment conversations from the BlendedSkillTalk (Smith et al., 2020) (BST) dataset. We use the Amazon Mechanical Turk platform to collect annotations for the three tasks mentioned above. We use Blenderbot (Roller et al., 2021) to generate 3 additional responses for each context, creating a set of four responses including the original response from the dataset, denoted as R_b, which is used in tasks A) and C) below. DIALGUIDE-SAFETY consists of data for the safety domain, where we augment conversations from the ProsocialDialog (Kim et al., 2022a) dataset.

A) Guideline writing task. We collect annotations in the form of triplets (C, g, r_cg), where C is the dialogue context, g is a guideline that describes the context and the content of the responses, and r_cg is a response that is coherent with the context and follows the guideline. The annotations are collected using two mechanisms. In mechanism 1, annotators are shown a dialogue context and a response and are asked to write a guideline such that the provided response could be generated from the guideline. The response shown is selected from R_b (either the original dataset response or one of the automatically generated responses) with equal probability. Figure 2 in Appendix B shows the annotation interface for mechanism 1. In mechanism 2, annotators are shown a dialogue context and are asked to write a guideline and then a response that follows the guideline. To aid the annotators, we provide hints in the form of a small set of possible guideline phrases such as "ask a question about x" and "give a reason for doing x." Workers are provided with multiple good and bad examples and are encouraged to use abstract concepts in the guidelines to generalize to novel contexts. For example, using "learning a musical instrument" instead of "learning piano" in the condition generalizes the guideline to any musical instrument.
While annotators in mechanism 1 do not need to write responses, we notice that the resulting guidelines can be specific to the context and response. Mechanism 2, on the other hand, yields more abstract guidelines due to the use of guideline phrase hints. The set of context-guideline-response instances collected from this task is denoted as G_ann.
B) Guideline relevance annotation task. For a given context C, workers are presented with a set of guidelines G_c = (g_1, g_2, ..., g_k) (only the condition part) and are asked to annotate which guidelines are relevant to the context. The annotation interface is displayed in Figure 3. We collect three annotations per context-guideline pair (inter-annotator agreement of Krippendorff's alpha 0.67), and the majority label is chosen. To generate the guideline candidates, we first train a guideline generation model M_g using the InstructDial model (Gupta et al., 2022b), which is instruction-tuned for the guideline generation task. The model is trained on pairs of contexts and responses using annotations from the guideline writing task. Using M_g, a large set of synthetic guidelines G_BST is generated, conditioned on the contexts and responses from the BST train dataset. For each context C, the set of guidelines G_c is created by retrieving the top 5 highest-scored guidelines from BM25 as well as from DPR (Karpukhin et al., 2020) using context-guideline similarity. The DPR model is trained using the context-guideline pairs from G_BST. The guideline set G_c for context C is thus composed of 10 guidelines, where we replace a randomly selected retrieved guideline with the gold guideline from G_ann written by the human annotators, as sketched below.
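A minimal sketch of this candidate-set construction, assuming pre-built BM25 and DPR retrievers exposed as callables that return scored guideline strings; the helper names are placeholders rather than the authors' code.

```python
import random
from typing import Callable, List

def build_candidate_set(
    context: str,
    bm25_topk: Callable[[str, int], List[str]],  # top-k guidelines by BM25
    dpr_topk: Callable[[str, int], List[str]],   # top-k guidelines by DPR
    gold_guideline: str,
    k: int = 5,
) -> List[str]:
    """Pool the top-5 BM25 and top-5 DPR guidelines (10 candidates),
    then swap one randomly chosen retrieved candidate for the
    human-written gold guideline."""
    candidates = bm25_topk(context, k) + dpr_topk(context, k)
    candidates[random.randrange(len(candidates))] = gold_guideline
    return candidates
```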
C) Response entailment verification task. Given the context C, the guideline (created in task A, the guideline writing task), and the response set R_b, annotators are asked to mark whether each response candidate follows the guideline. Because of the design of the guideline writing task, at least one of the responses in R_b is entailed, since either the guideline was written based on a response (mechanism 1) or the response was written based on the guideline (mechanism 2). The annotation interface is shown in Figure 4. Three annotations are collected per instance (a tuple of dialogue context, guideline, and response), and the majority label is chosen, with an inter-annotator agreement of Krippendorff's alpha 0.68.

D) Adversarial negative response writing. Annotators were provided with a guideline g and a response r that follows the guideline, and were then asked to minimally edit r so that the new response r' violates g. These adversarial responses are designed to test the model's robustness and its ability to handle responses that are semantically and lexically similar to the guideline but still do not follow it. The annotation interface is shown in Figure 5.
Data Statistics and Quality. DIALGUIDE-BST is annotated using tasks A), B), C), and D), and DIALGUIDE-SAFETY is annotated using only tasks A) and B). Tables 1 and 2 show the dataset statistics. "Response generation" is from task A, "Guideline retrieval" is from task B, and "Response entailment verification" is from tasks C and D. Both datasets are augmented using random instances from the original datasets' train, validation, and test sets. We conducted human evaluations to measure dataset quality. For 200 randomly selected context-guideline-response triplets, annotators rated 96% of the guidelines as sensible, 96% of responses as sensible, 97% of guidelines as relevant to the context, and 95% of responses as entailing the guideline.

Experiments and Results
In this section, we discuss the experimental setup and results for the three DIALGUIDE tasks.

Guideline Retrieval

Setup and Baselines
The task is to retrieve the most relevant guidelines for a given context C. G_c, the set of guidelines for a context, contains 10 guidelines with binary annotations indicating their relevance to the context. G_c includes the gold human-written guideline and at least one relevant guideline. Only the condition part of the guidelines is used. The train, dev, and test sets of DIALGUIDE-BST contain 2798, 1004, and 1011 contexts, respectively. We report performance using standard retrieval metrics.
For training data, we use a) human-annotated data, which consists of positive pairs of a context and a relevant guideline, easy negative pairs of a context and a randomly selected guideline, and hard negative pairs of a context and a guideline annotated as irrelevant to it; and b) silver data, the synthetic set G_BST (discussed in Section 3 B) with no human annotations, consisting of 33k pairs of contexts and generated guidelines, where negative pairs are created from randomly selected contexts and guidelines.
We experiment with the following methods:
• BM25: Measures overlap between the guideline and the context.
• DPR (silver) (Karpukhin et al., 2020): The base DPR model is a Bert-base (Devlin et al., 2019) bi-encoder trained on the Natural Questions dataset. We fine-tune it on silver data.
• DPR (silver+ann): DPR model fine-tuned on both silver and human-annotated pairs.
• Rerank-deberta (silver): Deberta-base (He et al., 2020) classification model trained on the silver guideline-context pairs.
• Rerank-deberta (ann): Deberta model trained only on human-annotated guidelines.
• Rerank-deberta (silver+ann): Deberta model trained on both silver and human-annotated pairs.
For training the DPR models, we use a batch size of 16 for 50 epochs. For the Deberta models, we fine-tune the Deberta-base model with a batch size of 60 across 8 GPUs. For our models in all experiments, we report the average scores across 3 runs; the retrieval metrics are computed as in the sketch below.
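For reference, the standard per-query retrieval metrics reported in Table 3 can be computed as in this minimal sketch (pure Python, no dependencies); it assumes each context comes with a ranked list of guidelines and a set of relevant ones.

```python
from typing import List, Set

def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """Per-query MRR contribution: 1/rank of the first relevant item,
    0 if none is retrieved."""
    for i, g in enumerate(ranked, start=1):
        if g in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant guidelines found in the top k."""
    return len(set(ranked[:k]) & relevant) / max(len(relevant), 1)

# Hypothetical example: the gold guideline is ranked second.
ranked = ["g7", "g_gold", "g2", "g9", "g1"]
print(reciprocal_rank(ranked, {"g_gold"}))   # 0.5
print(recall_at_k(ranked, {"g_gold"}, 3))    # 1.0
```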

Results
Table 3 shows that BM25 performs poorly, indicating that simple word-based matching does not work on this task, while DPR and Deberta models trained with human-annotated data perform the best. Models trained on silver data also show reasonable performance. Deberta performs better than DPR and BM25, and the model trained with a combination of human-annotated and silver data performs better than the one trained with only human-annotated guidelines, indicating that data augmentation benefits the task. Our best model has a Recall@3 of 78%, making it suitable for practical use.

Response Entailment Verification

Setup and Baselines
This is a binary classification task to predict whether a response follows the provided guideline. We experiment on the train, dev, and test sets of DIALGUIDE-BST with 14689, 4406, and 4962 context-response pairs, as shown in Table 1. Two settings are used: 1) Normal, where we only use the positive and negative instances, and 2) Adversarial, which additionally includes the adversarial negative responses (described in Section 3). We report the F1 scores per class, macro F1, and accuracy. We explore the following models and baselines:
• Token-overlap: Measures token-level overlap between the guideline and the response after stopword removal. A threshold (tuned on the dev set) is used for classification; a sketch of this baseline follows the list.
• DNLI (Welleck et al., 2019): A Bert model trained on the Dialogue NLI task.
• Roberta-Large: A Roberta (Liu et al., 2019) based classification model.
• DialT0-Zeroshot: An instruction-based model pre-trained on multiple dialogue tasks from Instructdial (Gupta et al., 2022b), tested in a zero-shot setting. It uses the T5 (Raffel et al., 2020) architecture and contains 3 billion parameters.
• BSTGuide-T5XL: A T5-XL model fine-tuned on positive, negative, as well as adversarial negative examples from the train set.
• BSTGuide-NoAdv: DialT0 fine-tuned on the positive and negative examples from the train set.
• BSTGuide: DialT0 fine-tuned on the positive, negative, as well as adversarial negative examples from the train set.
For all Dial* baselines, the guideline is concatenated to the dialogue context and the response candidate, with an instruction to perform entailment.
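One plausible formulation of the Token-overlap baseline is sketched below; the paper does not spell out the exact overlap function, so the Jaccard-style measure, the toy stopword list, and the 0.3 threshold here are all assumptions.

```python
from typing import Set

STOPWORDS: Set[str] = {"the", "a", "an", "is", "to", "of", "and", "you", "i"}  # toy list

def token_overlap(guideline: str, response: str) -> float:
    """Jaccard overlap between guideline and response tokens after
    stopword removal (one plausible formulation)."""
    g = {t for t in guideline.lower().split() if t not in STOPWORDS}
    r = {t for t in response.lower().split() if t not in STOPWORDS}
    return len(g & r) / max(len(g | r), 1)

def entails(guideline: str, response: str, threshold: float = 0.3) -> bool:
    # The paper tunes the threshold on the dev set; 0.3 is arbitrary here.
    return token_overlap(guideline, response) >= threshold
```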

Results
The results are shown in Table 4. The Token-overlap and DNLI models perform poorly on the task, indicating the need for models with capabilities beyond token-level overlap as a semantic similarity measure. DialT0 multi-task pre-trained models also struggle in the zero-shot setting. Our model BSTGuide shows the best performance, with a 90.2 macro F1 score on the Normal test set. Performance drops on the Adversarial test set (87.5 macro F1), confirming the difficulty of the Adversarial test set. However, the drop is smaller than that of BSTGuide-NoAdv, which was fine-tuned without adversarial examples, indicating that training on a few adversarial examples improves robustness. Additionally, BSTGuide (the base DialT0 model fine-tuned on DIALGUIDE) performs better than BSTGuide-T5XL (the base T5 model fine-tuned on DIALGUIDE), indicating that the DialT0 model pre-trained on multiple dialogue tasks serves as a better base for this task.

Response Generation

Setup and Baselines
This task involves generating a response r that follows the provided guideline g and is coherent with the dialogue context C. We experiment on the test set of DIALGUIDE-BST with 1507 context-guideline-response triples. For training, we experiment with both the DIALGUIDE-BST and DIALGUIDE-SAFETY train sets. Most of our baseline models are instruction-tuned, and we feed the following sequence as input to the models: an instruction to generate a response conditioned on the guideline and the context, followed by the guideline and the dialogue context. We consider the following methods and compare them to Ref-responses (the reference or gold responses from the dataset):
• DialBart0-withguidelines: A Bart-large model pre-trained on Instructdial (Gupta et al., 2022b), tested on zero-shot generation with guidelines.
• OPT30B-fewshot: OPT (Zhang et al., 2022) 30B model prompted using 3 in-context examples.
• Bart-guideline-tuned: A Bart-large (Lewis et al., 2020) model fine-tuned on our train set.
• DIALGUIDE-tuned: DialBart0 fine-tuned on context-guideline-responses from our train set.
• BST-only: DialBart0 fine-tuned only on DIALGUIDE-BST and not on DIALGUIDE-SAFETY.
• No-guideline: DialBart0 tuned on conversations without conditioning on guidelines.
• Multistep: A DialBart0-tuned model that first generates a guideline conditioned on the context and then generates the response.
• Ret-generate: Conditions on retrieved guidelines instead of gold guidelines during inference.
• Ret-robust: The above model additionally trained with the same number of instances but with noisy (randomly selected) guidelines for 20% of the data (more details in the next section).

Training and Evaluation Details
The Ret-generate model is trained in the same way as the DIALGUIDE-tuned model, but at test time we retrieve the guidelines in two steps: first, a large set of guidelines is retrieved using BM25 + DPR (100 from each) for faster inference, followed by reranking using the Rerank-deberta (silver+ann) model. The final guideline is selected randomly from the set of guidelines with a score above 98% from the Deberta model, as sketched below. The Ret-robust model is a variation of Ret-generate where, during training, the gold guideline is replaced with a random guideline for 20% of the training data, enhancing its robustness to incorrectly selected guidelines during inference.
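A rough sketch of this two-step retrieval, under the assumption of callable BM25/DPR retrievers and a reranker that returns a relevance probability; the names and the fallback structure are illustrative, not the authors' implementation.

```python
import random
from typing import Callable, List, Optional

def retrieve_guideline(
    context: str,
    bm25_topk: Callable[[str, int], List[str]],
    dpr_topk: Callable[[str, int], List[str]],
    rerank: Callable[[str, str], float],  # Deberta reranker: P(relevant)
    high_conf: float = 0.98,
    min_conf: float = 0.5,
) -> Optional[str]:
    """Stage 1: pull 100 candidates each from BM25 and DPR for speed.
    Stage 2: rerank and sample one guideline scoring above 0.98.
    Returns None when nothing clears 0.5, signaling the generator to
    respond without a guideline (the fallback described in the results)."""
    pool = list(dict.fromkeys(bm25_topk(context, 100) + dpr_topk(context, 100)))
    scored = [(g, rerank(context, g)) for g in pool]
    confident = [g for g, s in scored if s >= high_conf]
    if confident:
        return random.choice(confident)
    best_g, best_s = max(scored, key=lambda x: x[1])
    return best_g if best_s >= min_conf else None
```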
For evaluation, we report Bleu-2,4 and RougeL scores against the references. Diversity is measured by Dist-1,2 (computed as in the sketch after this paragraph). We measure word overlap between the response and the guideline using Bleu-2 and report it as Gd-Bleu-2. RS-entail measures response-guideline entailment using the BSTGuide model. An ideal model would have a high RS-entail and a low Gd-Bleu-2 score, avoiding excessive copying from the guideline. Coherence is measured using a Bert-large model trained on a mix of conversations from the DEB dataset (Sai et al., 2020), BST, and ProsocialDialog; it takes the context and response as input and predicts the coherence of the response. In addition, we conducted human evaluation on the Mturk platform (more details in Appendix B) on 100 randomly selected test instances. Annotators mark whether the response is coherent and sensible (Resp-quality), the guideline's quality (Gd-quality), and whether the response follows the guideline (Entailment).
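As a reference point, distinct-n (Dist-1,2) is the ratio of unique n-grams to total n-grams across generated responses. This is the standard formulation, assumed rather than taken from the authors' evaluation code:

```python
from typing import List

def dist_n(responses: List[str], n: int) -> float:
    """Distinct-n: unique n-grams / total n-grams over all responses."""
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

outs = ["i like music", "i like hiking a lot"]
print(dist_n(outs, 1), dist_n(outs, 2))  # Dist-1 and Dist-2
```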

Results
Tables 5 and 7 show automatic and human evaluation results for the DIALGUIDE-BST test set. The DialBart0-zeroshot model does not follow the guideline and copies its tokens (high Gd-Bleu-2), while the OPT30B-fewshot model underperforms the fine-tuned models. The DIALGUIDE-tuned model, trained on multiple dialogue tasks, performs slightly better than the Bart-guideline-tuned version on most metrics and achieves the best response quality and coherence among all models. It also performs better than BST-only, indicating that models improve with more and more diverse data and guidelines. While the No-guideline model has good coherence, our model offers more control over the generation space. The Multistep model, which first generates a guideline and then the response, suffers on quality but offers an interpretable generation approach.
The retrieval-based models, Ret-generate and Ret-robust, condition on the guideline if the score of the top retrieved guideline is greater than 0.5, and otherwise generate a response directly without a guideline (since their base model DialBart0 is capable of generating responses based on the context alone).

We perform ablation experiments for the Ret-robust model and test its robustness to noise in guideline retrieval. We do this by varying the percentage of noisy guidelines added during training and varying the threshold for the guideline retrieval score during testing; a sketch of the noise augmentation follows. Results are presented in Table 6. 0% noise corresponds to the Ret-generate model, since it does not use noisy data augmentation. The experiment is carried out on the response generation task with DIALGUIDE-BST data. As we increase the noise percentage, response quality and coherence improve, but at the cost of guideline entailment. For both retrieval thresholds, 20% noisy data augmentation leads to the best coherence, with a small trade-off in guideline entailment. Beyond 20%, both coherence and entailment decrease, and hence we select 20% noise for the Ret-robust model in our main experiments.
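The noisy-guideline augmentation behind Ret-robust is simple to reproduce in sketch form: with some probability, the gold guideline paired with a training example is swapped for a guideline sampled from elsewhere in the corpus. A hypothetical illustration:

```python
import random
from typing import List, Tuple

def add_guideline_noise(
    data: List[Tuple[str, str, str]],  # (context, gold_guideline, response)
    noise_rate: float = 0.2,           # 20% performed best in the ablation
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Replace the gold guideline with a randomly drawn one for a fraction
    of examples, teaching the generator to tolerate irrelevant guidelines.
    (A swapped guideline can coincide with the gold one by chance; a
    production version might resample to avoid that.)"""
    rng = random.Random(seed)
    all_guidelines = [g for _, g, _ in data]
    noisy = []
    for context, guideline, response in data:
        if rng.random() < noise_rate:
            guideline = rng.choice(all_guidelines)
        noisy.append((context, guideline, response))
    return noisy
```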

Dialogue Safety

Setup and Baselines
This task involves generating a safe response r based on a guideline g, coherent with the dialogue context C. We experiment on the test set of DIALGUIDE-SAFETY with 379 context-guideline-response triples and use its dev set for model selection. The guidelines considered for testing belong exclusively to the DIALGUIDE-SAFETY data. We consider the following models:
• DialBart0-noguideline: Bart-large model trained on Instructdial (Gupta et al., 2022b) and tested on zero-shot generation without guidelines.
• DialBart0-withguideline: Bart-large model trained on Instructdial (Gupta et al., 2022b) and tested on zero-shot generation with guidelines.
• DialBart-rot: DialBart0 tuned on RoTs (Kim et al., 2022b).
• OPT30B-fewshot: OPT 30B model prompted using 3 in-context examples.
• DIALGUIDE-tuned (Ours): DialBart0 fine-tuned on a mixture of BST and safety guideline data.
• No-guideline: DialBart0 fine-tuned on safety data without guidelines.
• BST-only: DialBart0 fine-tuned on the DIALGUIDE-BST dataset, without using safety data.
• Safety-only: DialBart0 fine-tuned only on safety guideline data.
For the safety domain, we also include a Safety metric that scores whether a response is safe. The safety classifier is a Deberta-large classifier trained on the BAD (Bot Adversarial Dialogue) dataset (Xu et al., 2021a), which consists of dialogue safety data collected through an adversarial framework. We conducted human evaluation on the Mturk platform on 100 randomly selected test instances (more details in Appendix B). Workers annotated whether the response is coherent and sensible (Resp-quality), whether the response follows the guideline (Entailment), and whether the response is safe (Safety).

Results
Tables 8 and 9 show automatic and human evaluation results. DialBart0-noguideline, which performs zero-shot generation without a guideline, performs poorly on safety. DialBart0-withguideline, which conditions on guidelines in a zero-shot setting, improves safety by 5% in automatic and 6% in human evaluation. The OPT30B-fewshot model generates guideline-conditioned responses but performs poorly in terms of safety and coherence compared to the other baselines. The DialBart-rot baseline, which uses RoTs, or rules of thumb (such as "it is bad to be racist"), performs similarly to DIALGUIDE-tuned on safety. However, RoTs do not contain the "if" condition, which makes selecting relevant RoTs harder at test time. In addition, RoTs are often very generic, which leads to poor control, as evidenced by the lower entailment scores. Human evaluation shows that DIALGUIDE-tuned outperforms all other baselines on all three criteria.
We perform ablation experiments with our model. The No-guideline baseline, which is trained on safety data without guidelines or RoTs, can generate safe responses, but it lacks control, whereas DIALGUIDE-tuned can generate safe responses that follow the developers' agenda. Although the Safety-only baseline, trained exclusively on DIALGUIDE-SAFETY, performs better than BST-only, the performance of BST-only is close, which implies that a model that uses guidelines can perform well in cross-domain settings.

Qualitative Analysis

In the top example of Table 10 (Appendix A), DialGuide-tuned's response follows and elaborates on the guideline, while OPT30B-fewshot produces a less interesting response. The Multistep baseline's generated guideline and response focus on the topic of news channels, and the retrieval baselines' responses follow the retrieved guideline and are coherent. In the bottom example, the gold guideline calls for a response related to the speaker's previous friendships. DialGuide-tuned's output follows the gold guideline, similar to the gold response, but the OPT30B-fewshot output is unrelated and instead expresses a desire to have friends. The Multistep baseline generates a guideline and response that focus on parenting, while the Ret-generate response focuses too much on the provided guideline and is somewhat incoherent; Ret-robust is able to incorporate both the context and the guideline.
In Table 11 (Appendix A), we show examples for DIALGUIDE-SAFETY. DialGuide-tuned follows the guideline and generates safe responses, while DialBart0-noguideline generates generic responses. The No-guideline model, which is trained on safety response data without guidelines, generates safe responses that are nevertheless inferior to the DialGuide-tuned responses. The RoT-based responses are more generic and less specific than the DialGuide-tuned responses.

Overall, the model outputs show a range of quality, with some following the gold guideline more closely than others. Although DialGuide-tuned has the best performance in both the results and the qualitative analysis and forms a performance upper bound using the gold guidelines, the retrieval baselines also show good performance and are more practical, as systems need to retrieve relevant guidelines at test time. The Multistep baseline is also useful in scenarios where no good guideline is available, as the model can first generate a guideline for how it is going to respond and then generate the response.

Conclusion
The DialGuide framework and dataset provide a solution for controlling dialogue model behavior using natural language rules, or guidelines. Through the three tasks of guideline selection, response generation, and response entailment verification, DialGuide aims to enable better control of dialogue models and improve their trustworthiness and real-world use. We evaluate DialGuide on two domains, chit-chat and safety, and provide baseline models and benchmark performance for these tasks. Models trained on DialGuide data generate coherent, diverse, and safe responses that generalize well to new guidelines and contexts.

Limitations
Our work explores aligning and controlling dialogue models toward developer-defined natural language guidelines. There is room for improvement in the following aspects. DialGuide may not be able to handle very complex or nuanced guidelines; for example, it may struggle to interpret guidelines that contain multiple conditions or that require a high level of common sense or domain knowledge. The performance of DialGuide may depend on the quality and clarity of the guidelines it is provided with: if the guidelines are poorly written or ambiguous, the system may struggle to interpret them correctly and generate appropriate responses. DialGuide may be less effective in domains where the appropriate response is more subjective or open to interpretation; for example, in a customer service context, it may be difficult to define clear guidelines for handling every possible customer request or complaint. DialGuide may not be suitable for all types of dialogue systems; for example, it may be less effective in systems that require more flexibility or creativity in generating responses. DialGuide may be more resource-intensive than other approaches to dialogue modeling, as it requires the additional step of matching a generated response with a set of guidelines or generating a guideline. Our work is an initial step in controlling dialogue models through guidelines and aligning them with a developer agenda. Future work can explore DialGuide for new applications and domains, such as task-oriented settings. Since response selection and generation can suffer from semantic overlap biases with the guidelines, better pretraining and incorporating commonsense knowledge should help. Future work may also incorporate more complex and logical "if" condition matching.

Ethical Considerations

The system could be misused to generate responses that are misleading, incorrect, manipulative, or harmful to users. For example, it could be used to generate responses that exploit users' vulnerabilities or manipulate their emotions for commercial or political gain. The system may also be used to collect sensitive or personal information about users, which could raise privacy concerns if this information is not handled appropriately. Careful regulation and oversight are needed to mitigate ill use of the system.

A Qualitative Results
In Table 10, we present sample inputs, guidelines, and outputs from models for the response generation experiment on DIALGUIDE-BST. In Table 11, we show sample inputs, guidelines, and outputs from models for the safe response generation experiment on DIALGUIDE-SAFETY. Discussion can be found in the Qualitative Analysis section of the main paper.

B Annotation Details and Interfaces
In Figure 2, we show the interface for the guideline writing task; in Figure 3, the annotation interface for the guideline retrieval annotation task; in Figure 4, the annotation interface for the guideline-based response selection task; and in Figure 5, the annotation interface for the adversarial response writing task. For all annotations, we employed Amazon Mechanical Turk. In each interface, we provided detailed instructions and explanations for the task along with 3 or more example instances and their annotations. The requirements for workers who worked on these tasks were: more than 1000 completed tasks, English as a first language, a HIT approval rate higher than 98 percent, and Master worker qualification. They were paid an average of more than $15 per hour. We collected the data across multiple batches and regularly removed workers who either had poor agreement with other workers or performed poorly based on our manual checks. We removed the annotations of such workers and recollected annotations for those instances.
Annotations for dataset quality. We conducted human evaluations to test the dataset quality (discussed in the last paragraph of Section 3). For 200 randomly selected context-guideline-response triplets, we asked the annotators to provide binary ratings for the following questions: a) Sensible response (yes-no): Is the response sensible? Does it make sense as a follow-up to the conversation? b) Sensible guideline (yes-no): Is the guideline sensible in itself? c) Relevant guideline (yes-no): Is the guideline relevant to the conversation? d) Response follows guideline (yes-no): Does the response follow the guideline? We collected 3 annotations per instance and report the average scores.
Annotations for human evaluation. For the human evaluation of response generation and dialogue safety response generation, we hire annotators from the Amazon Mechanical Turk platform. The selection criteria are the same as described above for data collection. For the DIALGUIDE-BST response generation human evaluation (Section 4.3.3), we collect annotations for 100 randomly selected instances of the test set and evaluate responses from 7 models. We ask the annotators to score model responses and guidelines on the following criteria: a) Response quality (yes-no): Is the response sensible and coherent? Does it make sense as a follow-up to the conversation? b) Relevant guideline (yes-no): Is the guideline relevant to the conversation? c) Entailment (yes-no): Does the response follow or entail the guideline? For the DIALGUIDE-SAFETY response evaluation (Section 4.4.2), we similarly collect annotations for 100 randomly selected test instances, where workers rate response quality, guideline entailment, and response safety.

Figure 1: Task setup. First, for a conversational context, the model selects context-relevant guidelines (Guidelines A and C in the example) in Task 1. Then the model either generates a response using one of the selected guidelines (Guideline A) in Task 2 or checks whether response candidates follow the guideline in Task 3.


Figure 3: Annotation interface for the guideline retrieval annotation task. Workers are shown a context and a set of guidelines (only the condition part) and asked to select whether each guideline is relevant to the context or not.

Figure 4: Annotation interface for the guideline-based response selection task. Annotators are shown a conversation, candidate responses, and a guideline. They are then asked to select one or more responses that follow the guideline.

Figure 5: Annotation interface for the adversarial response writing task. Annotators are shown a conversation, a response, and a guideline. They are then asked to edit the response so that it does not entail the guideline. They are provided sample strategies along with examples (not shown here) to help them with the task.

Table 3: Guideline retrieval results. Re-ranking models perform better than DPR. The model trained on the combined set of silver and human-annotated guidelines performs the best.

Table 4: Guideline-based response entailment verification results. The model trained on the annotated dataset performs well. Training on the adversarial set improves performance on the adversarial test set without reducing performance on the normal test set.

Table 5: Response generation results on DIALGUIDE-BST data. We compare our model DIALGUIDE-tuned with various zero-shot, few-shot, and fine-tuned baselines.

Table 6: Ablation experiments for the Ret-robust model with varying percentages of noisy guidelines added during training and varying thresholds for guideline retrieval during testing. The experiment is carried out on response generation with DIALGUIDE-BST data. 0% noise corresponds to the Ret-generate model, since it does not use noisy data augmentation. For both retrieval thresholds, 20% noisy data augmentation leads to the best coherence, with a small trade-off in guideline entailment.

Table 7: Response generation human evaluation results on DIALGUIDE-BST data. Gold and None denote that the gold guideline or no guideline, respectively, was used by the model.

Table 8: Safe response generation results on DIALGUIDE-SAFETY data. We compare our model DIALGUIDE-tuned with various zero-shot, few-shot, and fine-tuned baselines.

Table 9: Response generation human evaluation results on DIALGUIDE-SAFETY data.
