AARGH! End-to-end Retrieval-Generation for Task-Oriented Dialog

We introduce AARGH, an end-to-end task-oriented dialog system combining retrieval and generative approaches in a single model, aiming at improving dialog management and lexical diversity of outputs. The model features a new response selection method based on an action-aware training objective and a simplified single-encoder retrieval architecture which allow us to build an end-to-end retrieval-enhanced generation model where retrieval and generation share most of the parameters. On the MultiWOZ dataset, we show that our approach produces more diverse outputs while maintaining or improving state tracking and context-to-response generation performance, compared to state-of-the-art baselines.


Introduction
Most current research on task-oriented dialog models focuses on end-to-end modeling, i.e., the whole dialog system is integrated into a single neural network (Wen et al., 2017; Ham et al., 2020). Although recent end-to-end generative approaches based on pre-trained language models produce fluent and natural responses, they suffer from two major problems: (1) hallucinations and lack of grounding (Dziri et al., 2021), which result in faulty dialog management or responses inconsistent with the dialog state or database results, and (2) blandness and low lexical diversity of outputs (Zhang et al., 2020b). On the other hand, retrieval-based dialog systems (Chaudhuri et al., 2018) select the most appropriate response candidate from a human-generated training set, thus producing varied outputs. However, their responses might not fit the context and can lead to disfluent conversations, especially when the set of candidates is sparse. This limits their usage to very large datasets which do not support dialog state tracking or database access (Lowe et al., 2015; Al-Rfou et al., 2016).
Several recent works focus on combining retrieval and generative dialog systems via response selection and subsequent refinement, i.e., retrieval-augmented generation (Pandey et al., 2018; Weston et al., 2018; Cai et al., 2019b; Thulke et al., 2021). These models are used for open-domain conversations or to incorporate external knowledge into task-oriented systems and do not consider an explicit dialog state.
Our work follows the retrieve-and-refine approach, but we adapt it for database-aware task-oriented dialog. We aim at improving the diversity of produced responses while preserving their appropriateness. In other words, we do not retrieve any new information from an external knowledge base; instead, we retrieve relevant training data responses to support the decoder in producing varied outputs. To the best of our knowledge, we are the first to use retrieval-augmented models in this context. Unlike previous works, we merge the retrieval and generative components into a single neural network and train both tasks jointly, instead of using two separately trained models. Our contributions are summarized as follows:
• We propose a single-encoder retrieval model utilizing dialog action annotation during training, and we show its superior retrieval capabilities in the task-oriented setting compared to two-encoder baseline models (Humeau et al., 2020).
• We propose an end-to-end task-oriented generative system with an integrated minimalistic retrieval module. We compare it to strong baselines that model response selection and generation separately.
• On the MultiWOZ benchmark (Budzianowski et al., 2018), our approaches outperform previous methods in terms of lexical diversity and achieve competitive or better results in automatic metrics and human evaluation.

Figure 1: Our retrieval-based generative task-oriented system (AARGH, see Section 3.5). Numbers in module boxes mark the order of processing during inference: (1) inputs are pushed through the shared context encoder and (2) state encoder; (3) the state decoder produces the update to the current dialog state. The new state is used to query the database, whose outputs are discretized, embedded, and (4) used in the retrieval encoder, whose output is reduced to a single vector via average pooling. The context embedding is used to get the best response candidate (hint). Finally, (5) the response decoder, which can attend to the state encoder outputs via cross-attention and is conditioned on the database results and the hint, generates the final system response to be shown to the user.

Related Work
Task-Oriented Response Generation Most current works focus on building multi-domain database-grounded systems. The breeding ground for this research is the large-scale conversational dataset MultiWOZ (Budzianowski et al., 2018; Eric et al., 2020). Recent models often benefit from action annotation. Zhang et al. (2020a) use action-based data augmentation and a three-stage architecture, decoding the dialog state, action, and response. Chen et al. (2019) generate responses without state tracking, exploiting a hierarchical structure of the action annotation. On the other hand, reinforcement learning models (Wang et al., 2021) learn latent actions from data without using annotation.
Response Selection can be viewed as scoring response candidates given a dialog context. A popular approach is the dual-encoder architecture (Lowe et al., 2015; Henderson et al., 2019b) where the response and context encoders model a joint embedding space. The encoders can take various forms: Henderson et al. (2019a) compare encoders based on BERT (Devlin et al., 2019b) and custom encoders pre-trained on Reddit; Wu et al. (2020) pre-train encoders specifically for task-oriented conversations. Humeau et al. (2020) introduce poly-encoders, which produce multiple context encodings and add an attention layer to allow rich interaction with the candidate encoding (cf. Section 3.3).
Retrieval-Augmented Generation To benefit from both retrieval and generative models, Weston et al. (2018) proposed an open-domain dialog system utilizing a retrieval network and a decoder to refine retrieved responses. Roller et al. (2021) further developed this approach, using poly-encoders with a large pre-trained decoder. They found that their decoder tends to ignore the retrieved response hints. To combat this, they propose the α-blending method (replacing retrieval output with ground truth, see Section 3.2). Similarly, Gupta et al. (2021) and Cai et al. (2019a,b) focus on retrieval-augmented open-domain dialog, but to prevent the inflow of erroneous information into the generative part of their models, they use semantic frames or reduced forms of retrieved responses instead of raw response texts. Thulke et al. (2021) aim at knowledge retrieval from external documents for resolution of out-of-domain questions on MultiWOZ (Kim et al., 2020). Shalyminov et al. (2020) present the only work using generation and retrieval in a single model. They finetune GPT-2 (Radford et al., 2019) for response generation in a low-resource task-oriented setup, retrieve alternative responses based on the model's embedding similarity, and choose between generated and retrieved responses on-the-fly. However, their model is not trained for retrieval, cannot alter retrieved responses, and does not take a dialog state or database into account.

Method
We aim at end-to-end modeling of database-aware task-oriented systems, i.e., systems supporting both dialog state tracking and response generation tasks (Young et al., 2013). We combine retrieval and generative models to reduce hallucinations and boost output diversity. We first describe our purely generative baseline (Section 3.1), then explain baseline generation based on retrieved hints (Section 3.2). We then introduce baseline retrieval models (Section 3.3) and our action-aware retrieval (Section 3.4). Finally, we describe AARGH, our single-model retrieval-generative hybrid, in Section 3.5. AARGH is shown in Figure 1; other setups are depicted in Appendix A.

Generative Baseline
Our purely generative baseline model (Gen) follows MinTL (Lin et al., 2020). It is based on an encoder-decoder backbone with a context encoder shared between two decoders: one for modeling the dialog state updates, the other for producing the final system response. Both decoders attend to the encoded input tokens via an attention mechanism.
The encoder input sequence consists of a concatenation of two parts: (1) past dialog utterances prepended with <|system|> or <|user|> tokens, and (2) the initial dialog state converted to a string, e.g., hotel [area: center] restaurant [food: African, pricerange: expensive]. The first decoder is conditioned only on the start-of-sequence token and predicts the dialog state update as a difference between the current state and the initial state. The second decoder is conditioned on the number of database results for each queried domain, e.g. train: 6 if there are six matching results for a train search, and generates the final response.
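To make the serialization concrete, here is a minimal sketch of how such an encoder input could be assembled (the function name and dict-based state representation are our own illustration, not the authors' code):

```python
def build_encoder_input(turns, initial_state):
    """Concatenate tagged dialog history with the flattened dialog state.

    turns: list of (speaker, utterance) pairs, speaker in {"user", "system"}.
    initial_state: dict mapping domain -> {slot: value}.
    """
    history = " ".join(f"<|{spk}|> {utt}" for spk, utt in turns)
    state = " ".join(
        f"{dom} [" + ", ".join(f"{s}: {v}" for s, v in slots.items()) + "]"
        for dom, slots in initial_state.items()
    )
    return f"{history} {state}".strip()
```

For the example state above, a single user turn would serialize to `<|user|> i need a hotel hotel [area: center]`.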
During inference, the input is passed through the encoder, then the state update is predicted, merged with the initial dialog state, and this new state is used to query the database (see Section 4 for details). The final system response is predicted based on the context, state, and database results.

Retrieval-Augmented Response Generation
To combine the retrieval and generative approaches, we follow Weston et al. (2018) and incorporate response hints, i.e., the outputs of a retrieval module (Sections 3.3, 3.4), into the generative module in their original form as raw sub-word tokens. Specifically, we add the retrieved response prepended with <|hint|> to the input of Gen's response decoder (Section 3.1), alongside the database results. Gupta et al. (2021) state that this straightforward token-based retrieve-and-refine setup might lead to generating incoherent responses due to over-copying of contextually irrelevant tokens. However, using more abstract outputs of the retrieval module, e.g., semantic frames or salient words, would go against our goal of reducing blandness and increasing the lexical diversity of responses. To smoothly control the amount of token copying, we follow Roller et al. (2021) and use the so-called α-blending. During training, we replace the retrieved utterance with the ground-truth final response with probability α. This method also ensures that the decoder learns to attend to the retrieval part of its input successfully.
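The α-blending step itself is tiny; a sketch of the training-time substitution (names are ours, not from the paper's code):

```python
import random

def pick_hint(retrieved, ground_truth, alpha, rng=random):
    """alpha-blending: with probability alpha, feed the ground-truth
    response instead of the retrieved one into the hint slot during training."""
    return ground_truth if rng.random() < alpha else retrieved
```

With α = 0 the decoder only ever sees retrieved hints; with α = 1 it is trained on gold responses in the hint slot.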

Baseline Response Selection
We consider two baseline retrieval model variants:

Dual-encoder (DE) follows the very popular retrieval architecture (Lowe et al., 2015; Humeau et al., 2020) which makes use of context and response encoders. Both produce a single vector in a joint embedding space. During training, the context embedding and the corresponding response embedding are pushed towards each other, while other responses in the training batch are used as negative examples, i.e., a cross-entropy loss is used:

\mathcal{L}_{DE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(S_{i,i})}{\sum_{j=1}^{N} \exp(S_{i,j})}

where S ∈ ℝ^{N×N} is the similarity matrix between normalized encoded responses e^r and contexts e^c in the batch, specifically S_{i,j} = w · (e^c_i · e^r_j), where w > 0 is a trainable scaling factor.
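A minimal NumPy sketch of this in-batch loss, consistent with the formulation above (the scaling factor w is fixed here for illustration, whereas the model treats it as a trainable parameter):

```python
import numpy as np

def dual_encoder_loss(e_c, e_r, w=5.0):
    """In-batch cross-entropy: rows of S = w * (E_c E_r^T) are softmaxed
    and the diagonal (matching context-response pairs) is maximized.
    e_c, e_r: (N, d) arrays of L2-normalized embeddings."""
    S = w * (e_c @ e_r.T)                                    # (N, N)
    log_softmax = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_softmax)))
```

When contexts and responses are perfectly aligned, the diagonal dominates and the loss approaches zero.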
Inference-time retrieval is as simple as finding the nearest candidate embedding given a context embedding. The context input is similar to Gen's (see Section 3.1): a concatenation of the current updated dialog state, the number of matching database results, and past user and system utterances. Encoders are followed by average pooling and a fully-connected layer for dimensionality reduction.
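Inference then reduces to a nearest-neighbor lookup; a sketch assuming pre-computed, L2-normalized candidate embeddings (function name is illustrative):

```python
import numpy as np

def retrieve_hint(query_emb, cand_embs, cand_responses):
    """Return the training response whose embedding is nearest to the
    query context embedding; on normalized vectors the dot product
    equals cosine similarity."""
    return cand_responses[int(np.argmax(cand_embs @ query_emb))]
```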
Poly-encoder (PE) is an extension of DE, aiming at richer interaction between the candidate and the context. The candidate encoder is unchanged. In the context encoder, the average pooling is replaced with two levels of dot-product attention (Vaswani et al., 2017; Humeau et al., 2020). The first level summarizes the encoded context tokens into m vectors. The context tokens act as attention keys and values; queries to this attention are m learned embeddings (query codes). The second attention level provides the candidate-context interaction: it takes the m context summary vectors as keys and values, and the candidate encoder output acts as the query. The parameter m provides a trade-off between inference complexity and richness of the context encoding. The loss term remains the same.
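A simplified NumPy sketch of the two attention levels (single-headed, no learned projections; the real poly-encoder wraps these in transformer layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def poly_encoder_score(ctx_tokens, cand_emb, query_codes, w=1.0):
    """ctx_tokens: (T, d) encoded context; cand_emb: (d,) candidate;
    query_codes: (m, d) learned codes.  Returns a scalar match score."""
    # level 1: m learned query codes attend over the T context tokens
    attn1 = softmax(query_codes @ ctx_tokens.T)   # (m, T)
    summaries = attn1 @ ctx_tokens                # (m, d) context summaries
    # level 2: the candidate embedding attends over the m summaries
    attn2 = softmax(cand_emb @ summaries.T)       # (m,)
    ctx_vec = attn2 @ summaries                   # (d,) final context encoding
    return w * float(ctx_vec @ cand_emb)          # scaled dot-product score
```

Because level 2 depends on the candidate, the context encoding must be recomputed per candidate, which is the cost paid for the richer interaction.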

Action-aware Response Selection
We argue that the dual- or poly-encoder models are not practical for task-oriented settings, as their performance depends on the way negative examples are sampled during training (Nugmanova et al., 2019). Choosing appropriate negative examples is difficult in task-oriented datasets as system responses are often very similar to each other (with the conversations being in a narrow domain and following similar patterns). Therefore, we propose a method for candidate selection based on system action annotation, which is usually available in task-oriented datasets. We designed the method to be usable with a single encoder only, but we also include a dual-encoder version for comparison.
Action-aware-encoder (AAE) Using two separate encoders to encode the response and the context might be impractical due to large model size. Some recent works (e.g., Wu et al. (2020); Roller et al. (2021)) use a single shared encoder instead, and Henderson et al. (2020) discuss parameter sharing between the two encoders. In view of that, we propose a single-encoder action-aware retrieval model. We train it to produce embeddings of dialog contexts which are close to each other if the corresponding responses in the training data have similar action annotation. More precisely, we adapt Wan et al. (2018)'s generalized end-to-end loss, originally developed for batch-wise training of speaker classification from audio.

To form training minibatches, we first sample M random dialog actions, and for each of those actions, we sample N examples that include the particular action in their system action annotation. We then encode the dialog contexts corresponding to the sampled examples into normalized embeddings e_{m,n}, where (m, n) ∈ {1, …, M} × {1, …, N} is a set of indices, and compute the similarity matrix as follows:

S_{ji,k} = w \cdot (e_{j,i} \cdot c_k), \quad \text{where} \quad c_k = \frac{1}{N} \sum_{n=1}^{N} e_{k,n}

Same as for DE, w > 0 is a trainable scaling factor of the similarity matrix. In other words, the similarity matrix describes the similarity between embeddings of each example and centroids, i.e., the means of the N embeddings that correspond to the same particular action. For stability reasons and to avoid trivial solutions, we follow Wan et al. (2018) and exclude e_{j,i} from the centroid calculation when computing S_{ji,j}.
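A NumPy sketch of this batch similarity computation, including the leave-one-out centroid correction (shapes follow the M × N batch layout described above; w is fixed here for illustration, whereas the model learns it):

```python
import numpy as np

def ge2e_similarity(e, w=5.0):
    """e: array (M, N, d) of normalized context embeddings, i.e. M sampled
    actions with N examples each (N > 1).  Returns S of shape (M, N, M):
    similarity of every example to every action centroid."""
    M, N, _ = e.shape
    centroids = e.mean(axis=1)                          # (M, d)
    S = w * np.einsum("jnd,kd->jnk", e, centroids)      # S[j, i, k]
    # leave-one-out: drop e[j, i] from its own centroid to avoid
    # the trivial solution (as in Wan et al., 2018)
    for j in range(M):
        for i in range(N):
            c_excl = (centroids[j] * N - e[j, i]) / (N - 1)
            S[j, i, j] = w * e[j, i] @ c_excl
    return S
```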
We then maximize the similarity between the examples and their corresponding centroids while using other centroids as negative examples:

\mathcal{L}_{AAE} = -\sum_{j=1}^{M} \sum_{i=1}^{N} \log \frac{\exp(S_{ji,j})}{\sum_{k=1}^{M} \exp(S_{ji,k})}

During inference, we rank the responses from the training set according to the cosine similarity of their corresponding contexts and the query context. Again, the contexts consist of the current updated dialog state, the number of matching database results, and past utterances.
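Given such a similarity tensor, the softmax variant of the loss can be sketched as follows (our illustration of Wan et al. (2018)'s objective, not the authors' code):

```python
import numpy as np

def ge2e_loss(S):
    """Softmax contrastive loss over a similarity tensor S of shape
    (M, N, M): maximize each example's similarity S[j, i, j] to its own
    action centroid relative to its similarities to all M centroids."""
    M, N, _ = S.shape
    log_softmax = S - np.log(np.exp(S).sum(axis=2, keepdims=True))
    return -float(np.mean([log_softmax[j, i, j]
                           for j in range(M) for i in range(N)]))
```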
Action-aware-dual-encoder (AADE) This setup follows the DE architecture (see Section 3.3), but it is trained in a similar way as AAE, i.e., we form training mini-batches identically and for each of M distinct actions in the batch, we treat all N examples as positive examples.

Hybrid End-to-end Model
To further simplify the retrieval-augmented setup, reduce the number of trainable parameters and gain back computational efficiency, we introduce an endto-end Action-Aware Retrieval-Generative Hybrid model (AARGH), which jointly models both response selection and context-to-response generation (see Figure 1). It is a natural extension of the Gen generative model (Section 3.1), enabled by our new single-encoder action-aware response retrieval (AAE, Section 3.4).
A new retrieval encoder, which produces normalized context embeddings, shares most parameters with the original encoder, which is followed by the two decoders and is partially responsible for state tracking and response generation. To build the retrieval encoder, we fork the last L layers of the original encoder and condition them on the outputs of the shared preceding layers, concatenated with an embedding of the number of current database results. To obtain this embedding, we convert the number of database results into a small set of bins, which are then embedded via a learnt embedding layer of size E. The new retrieval encoder is followed by average pooling and trained using the same objective as AAE (see Section 3.4).
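The binning of database result counts can be sketched as follows (the bin boundaries are our own illustration; the paper only fixes the embedding size E = 4):

```python
def db_count_to_bin(count, boundaries=(0, 1, 2, 3, 5, 10)):
    """Map a raw database result count to a bin index; the index is then
    looked up in a learnt embedding table of size E."""
    for i, upper in enumerate(boundaries):
        if count <= upper:
            return i
    return len(boundaries)   # "many results" bin
```

Binning keeps the embedding table tiny while still signaling "none / few / many" matches to the retrieval encoder.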
During inference, we pass the input through the partially shared context encoder and decode and update the dialog state. The new state is used to query the database. Database results are embedded and added to the output of the last encoder shared layer to form the input to the retrieval encoder, which produces the context embedding and a retrieved response. Based on state, database results, and retrieved response, the response decoder produces the final (delexicalized) response.

Experimental Setup
Models Our models are based on pre-trained models from HuggingFace (Wolf et al., 2020): We implement Gen and the generative parts of our retrieval-based models using T5-base (Kale and Rastogi, 2020). Retrieval encoders in DE, AADE, PE and AAE are implemented as fine-tuned BERT-base (Devlin et al., 2019a). AARGH is built upon T5-base, same as Gen; we fork the last L = 2 out of K = 12 encoder layers. The choice of L is a trade-off between model performance and size. The database embedding has size E = 4. For simplicity, we do not use specialized backbones pre-trained on dialogs such as ToD-BERT (Wu et al., 2020). PE uses m = 16 query codes (see Section 3.3) and single-headed attention mechanisms.

Figure 2: Part of a short conversation from MultiWOZ. It has user and system turns, and annotated slot spans. Both user and system affect the dialog state. Actions are shown below system texts.
Data and database We experiment on the MultiWOZ 2.2 dataset (Budzianowski et al., 2018; Zang et al., 2020), which is a popular dataset with around 10k task-oriented conversations in 7 different domains such as trains, restaurants, or hotels (see Figure 2). A single conversation can touch multiple domains. The dataset has an associated database, dialog state annotation, dialog action annotation of system turns, and slot value span annotation for easy delexicalization (Wen et al., 2015), thus enabling development of realistic end-to-end dialog systems. To query the database using the belief state, we use the fuzzy matching implementation by Nekvinda and Dušek (2021). To filter out inactive domains from database results during inference, we follow previous work and estimate the currently active domain from dialog state updates.
Input and output format We use the same formats for all models. Target responses are delexicalized using MultiWOZ 2.2 span annotation, and we limit the context to 5 utterances. MultiWOZ action labels include domain, action, and slot name, e.g., train-inform-price. We remove domains from the labels to limit data sparsity.
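Stripping the domain prefix is a one-liner; a sketch (assuming the domain-action-slot labels are hyphen-separated as in the example):

```python
def strip_domain(action_label):
    """'train-inform-price' -> 'inform-price' (drop the leading domain)."""
    return action_label.split("-", 1)[1]
```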
Training procedure DE, AADE, PE and AAE are trained in two stages. The retrieval part is trained first and provides response hints to the generative model during the second phase. Modules in AARGH are trained jointly, but we alternate parameter updates of the retrieval and generative components.

α-blending We experiment with two α-blending values: a conservative one (α = 0.05, marked "↓") and a greedy one (α = 0.4, marked "↑"), targeting a mostly generation-focused and a mostly retrieval-focused setting.

Decoding We use greedy decoding for dialog state update generation. For response generation, we report results with greedy decoding in Section 5 and with beam search in Appendix B.

Evaluation and Results
We focus on end-to-end modeling, which includes dialog state tracking and response generation. All reported results are on the MultiWOZ test set with 1000 dialogs, averaged over 8 different random seeds. We generated responses given ground-truth contexts. We follow MinTL and predict the dialog state cumulatively for each conversation turn, which means that state tracking errors may compound. See Appendix C for an example end-to-end conversation without any ground-truth information.

To evaluate response selection, we cannot use standard ranking metrics such as Recall@k, since the target item is part of the search criterion and would always score 100%. Instead, we use the action annotation and measure the intersection over union (IoU), full-match and no-match rates on sets of actions associated with top-1 retrieved and ground-truth responses. We add BLEU (Papineni et al., 2002; Liu et al., 2016a) between ground-truth and retrieved responses and the proportion of distinct retrieval outputs to assess their lexical similarity to references and diversity.

Table 1 shows that AAE and AARGH significantly outperform other setups on all measures except for the no-match rate,[6] where PE has comparable results. This is expected as they use the additional action annotation during training, unlike DE and PE. AADE performs surprisingly badly. According to the unique hints rate, AAE and AARGH retrieve a much wider range of outputs, which could improve the lexical diversity of final responses. The higher BLEU, Action IoU and full-match rates suggest that the models retrieve responses more similar to the ground truth.
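The action-set measures can be sketched as follows (our own illustration of the IoU, full-match and no-match computation described above):

```python
def action_overlap(retrieved_actions, gold_actions):
    """IoU plus full-match / no-match indicators over two sets of
    action labels attached to the retrieved and gold responses."""
    r, g = set(retrieved_actions), set(gold_actions)
    union = r | g
    iou = len(r & g) / len(union) if union else 1.0
    return iou, iou == 1.0, not (r & g)
```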

Response selection
To further compare the approaches to response selection, we computed the Silhouette coefficient (Rousseeuw, 1987) based on the active domain and action annotation (see Table 3).[7] We omit PE because its context embeddings depend on queries, i.e., the candidate embeddings (other models output the same context embedding regardless of candidates). DE has the worst results; other systems perform similarly, but AARGH is the best on action separation while AADE has the best scores for domains. We see that AADE's context encoder is successful in clustering, but it lags behind in terms of correct action selection. Unlike AARGH and AAE, AADE retrieves candidates based on response embeddings. We hypothesize that lower response variability (compared to context variability) leads the model to prefer responses seen more frequently during training. AARGH and AAE are not affected by this as they use purely context-based retrieval. Figure 3 provides a visualisation of the domain clusters projected using t-SNE (van der Maaten and Hinton, 2008). It supports the findings of our evaluation based on the Silhouette coefficient: we see that the visualisations of the AARGH and AADE embedding spaces look similar, whereas DE's clusters appear noisier.

[6] According to a paired t-test with 95% confidence level.
[7] In the case of action-based clustering, we treat each action as a separate cluster; each example can belong to multiple clusters. The clustering measure is calculated for each cluster and averaged over all actions, which are weighted by the size of the corresponding clusters.

Response generation
We evaluate the response generation abilities of our models using automatic metrics and human assessment of delexicalized texts (see Table 4 for examples).
Evaluation with automatic metrics We use the corpus-based evaluator by Nekvinda and Dušek (2021) to measure commonly used metrics on MultiWOZ (Inform & Success rates, BLEU) as well as lexical diversity measures, namely the number of distinct trigrams in the outputs and bigram conditional entropy (Li et al., 2016; Novikova et al., 2019). State tracking joint accuracy is calculated with scripts adapted from TRADE. To better understand the effect of using retrieved hints and to quantify the amount of copying, we calculate BLEU between retrieved hints and final generated responses (Hint-BLEU) and the proportion of generated responses exactly matching the corresponding retrieved hints (Hint-copy).
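The two diversity measures can be sketched as follows (a simplified illustration using whitespace tokenization, not the evaluator's exact implementation):

```python
import math
from collections import Counter

def distinct_trigrams(texts):
    """Number of distinct trigrams across all (whitespace-tokenized) outputs."""
    grams = set()
    for t in texts:
        toks = t.split()
        grams.update(tuple(toks[i:i + 3]) for i in range(len(toks) - 2))
    return len(grams)

def bigram_cond_entropy(texts):
    """Conditional entropy H(w2 | w1) estimated from bigram counts."""
    bigrams, firsts = Counter(), Counter()
    for t in texts:
        toks = t.split()
        firsts.update(toks[:-1])           # counts of first bigram positions
        bigrams.update(zip(toks, toks[1:]))
    total = sum(bigrams.values())
    return -sum((c / total) * math.log2(c / firsts[w1])
                for (w1, _), c in bigrams.items())
```

Higher values of both indicate more varied outputs; a model that always emits the same continuation after a given word has zero conditional entropy.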
We include comparisons with recent strong end-to-end models on MultiWOZ: SOLOIST (Peng et al., 2021a), MTTOD (Lee, 2021), PPTOD (Su et al., 2022), and MinTL (Lin et al., 2020), which has the same architecture as Gen. To show the importance of the generative parts of our models, we also include AAE without the refining decoder.

Table 2 shows scores obtained with greedy decoding (see Appendix B for beam search results). All models have similar state tracking performance. AARGH has slightly lower numbers, which is not surprising as it shares a substantial part of the encoder with its retrieval component. As expected, we notice a huge difference in Hint-BLEU and Hint-copy between versions with different α-blending probabilities (↓ vs. ↑).[8] The performance boost over Gen and retrieval-only AAE is, for ↓ variants, mainly in terms of Success. In ↑, more frequent hint copying reduces BLEU and improves lexical diversity; we also see higher Inform. AAE+Gen and AARGH (both ↓ and ↑) perform better than the corresponding DE+Gen or PE+Gen on Inform and Success rates.[9] Differences between AAE+Gen and AARGH are not statistically significant, and their Success scores are better than MinTL's, competitive with PPTOD and SOLOIST, but lower than MTTOD's. In terms of lexical diversity, all models are better than most generative baselines.[10]

Human evaluation We arranged an in-house human evaluation on the delexicalized outputs of Gen (i.e., MinTL's architecture), DE+Gen ↓, AARGH ↓ and AARGH ↑. We used side-by-side relative ranking evaluation, which has been repeatedly found to increase consistency compared to rating isolated examples (Callison-Burch et al., 2007; Belz and Kow, 2010; Kiritchenko and Mohammad, 2017).

[8] Hint-copy of 15% roughly means one turn per dialog.
[9] According to a paired t-test with 95% confidence level.
[10] The ↓ variants are similar to SOLOIST, which, however, reaches diversity by employing sampling (Holtzman et al., 2020) instead of greedy decoding.
Participants were given the full dialog context and current database results, and we asked them to rank responses of the compared models from the best-fitting to the worst, where multiple responses could be ranked the same (see Appendix D for details). We collected rankings for 346 turns of 50 conversations from 5 linguists with experience in natural language generation. All of them were given a different set of dialogs, and they were instructed to focus on consistency with the context and database results, naturalness, and attractiveness of the responses. See Table 5 for results. Although AARGH ↑ scored the best on automatic metrics, it has worse mean ranks than the other models, which all have similar mean ranks. This confirms previous findings of low correlation between automatic metrics and human assessments (Liu et al., 2016b; Novikova et al., 2017). Upon detailed manual error analysis, we found that AARGH ↑ often copies whole hints including words that do not fit the context, i.e., contradictions to earlier statements or noisy non-delexicalized values from the training set. AARGH ↓ performs slightly better than the baselines and is more often ranked best and least often ranked worst.

Conclusion
We present AARGH, an end-to-end task-oriented dialog system combining retrieval and generative approaches. It uses an embedded single-encoder retrieval component which extends a purely generative model without the need for a large number of new parameters. AARGH features an action-aware response selection training objective. Our experiments on the MultiWOZ dataset show that AARGH outperforms baselines in terms of automatic metrics and human evaluation, and it is competitive with state-of-the-art models such as SOLOIST or MTTOD. We showed that our proposed action-aware retrieval training objective supports retrieval of a larger variety of unique and relevant responses in the task-oriented setting and makes efficient use of the available system action annotation. Further, using the retrieval module improves dialog management in terms of the Success rate. A limitation of our approach is the need for careful hyperparameter setting, coupled with the risk of overuse of retrieved responses that match the dialog state but are not appropriate for the context.
In future work, we would like to confirm our results on more datasets and explore more sophisticated ways of using the retrieved responses, encouraging the model to copy interesting language structures while ignoring inappropriate tokens or relics of faulty delexicalization.

A Model Architectures

Figure 4 shows the architectures of the baseline (Gen), the dual-encoder-based model (DE), and the single-encoder action-aware model (AAE). See Figure 1 for details about AARGH and Section 3 for a description of the models.

B Beam Search Results
See Table 6 for the results of beam search-based response generation evaluation, and compare the results with greedy decoding evaluation (see Section 5.1 and Table 2). For all models, we used beams of size 8 during decoding. In the case of conservative α-blending, beam search decoding results in higher lexical diversity for all retrieval-augmented systems. However, the gains with respect to Inform and Success rates are mostly very small or not present at all in the case of AADE and AARGH. All BLEU scores are slightly lower, which corresponds to the higher output diversity. We notice that the numbers for the baseline without a retrieval component show the opposite trend: beam search decoding causes lower lexical diversity and higher BLEU. We attribute this to the fact that beam search decoding prefers safer responses with a higher overall probability.
When using higher α-blending, the differences become small even in the case of lexical diversity. We hypothesize that all the retrieval-based models are not substantially influenced by the particular response decoding strategy because they strongly rely on the retrieved hints and their copying.
C End-to-end Conversation

Figure 5 shows a multi-domain (restaurant and taxi) end-to-end conversation with our end-to-end retrieval-based model AARGH (see Section 3.5).

D Human Evaluation Interface
We used the graphical user interface depicted in Figure 6 for human evaluation. The full dialog context, i.e., all past utterances corresponding to the particular turn, and the number of database results were shown to participants. We asked participants to rank the provided responses from the best to the worst. They evaluated only two conversations in a single run, and we sampled the conversations from the test set so that all participants received roughly the same number of turns to assess. Evaluated responses were shown side-by-side; each of them had a dedicated discrete scale from 1 to 4, where 1 was labeled as the best and 4 as the worst. Multiple responses could receive the same ranking. Participants could move forward and backward in the conversations, and they could switch to another conversation at any time.