Cascaded Span Extraction and Response Generation for Document-Grounded Dialog

This paper summarizes our entries to both subtasks of the first DialDoc shared task, which focuses on agent response prediction in goal-oriented document-grounded dialogs. The task is split into two subtasks: predicting a span in a document that grounds an agent turn and generating an agent response based on a dialog and grounding document. In the first subtask, we restrict the set of valid spans to the ones defined in the dataset, use a biaffine classifier to model spans, and finally use an ensemble of different models. For the second subtask, we use a cascaded model which grounds the response prediction on the predicted span instead of the full document. With these approaches, we obtain significant improvements in both subtasks compared to the baseline.


Introduction
Unstructured documents contain a vast amount of knowledge that can be useful for responding to users in goal-oriented dialog systems. The shared task at the first DialDoc Workshop focuses on grounding and generating agent responses in such systems. To this end, two subtasks are proposed: given a dialog, extract the relevant information for the next agent turn from a document, and generate a natural language agent response based on the dialog context and grounding document. In this paper, we present our submissions to both subtasks.
In the first subtask, we focus on modeling spans directly using a biaffine classifier and restricting the model's output to valid spans. We notice that replacing BERT with alternative language models results in significant improvements. For the second subtask, we notice that providing a generation model with an entire, possibly long, grounding document often leads to models struggling to generate factually correct output. Hence, we split the task into two subsequent stages, where first a grounding span is selected according to our method for the first subtask, which is then provided for generation. With these approaches, we report strong improvements over the baseline in both subtasks. Additionally, we experimented with marginalizing over all spans in order to be able to account for the uncertainty of the span selection model during generation.

Related Work
Recently, multiple datasets and challenges concerning conversational question answering have been proposed. For example, Saeidi et al. (2018) introduced ShARC, a dataset containing ca. 32k utterances, which includes follow-up questions on user requests that cannot be answered directly based on the given dialog and grounding. Similarly, the CoQA dataset (Reddy et al., 2019) provides 127k questions with answers and grounding obtained from human conversations. More closely related to the DialDoc shared task, the task in the first track of DSTC 9 was to generate agent responses based on relevant knowledge in task-oriented dialog. However, the considered knowledge has the form of FAQ documents, whose snippets are much shorter than the documents considered in this work.
Pre-trained language models such as BART (Lewis et al., 2020a) or RoBERTa (Liu et al., 2019) have recently become a successful tool for different kinds of natural language understanding tasks, such as question answering (QA), where they obtain state-of-the-art results (Liu et al., 2019; Clark et al., 2020). Naturally, they have recently also found their way into task-oriented dialog systems (Lewis et al., 2020a), where they are either used as end-to-end systems (Budzianowski and Vulić, 2019; Ham et al., 2020) or as components for a specific subtask (He et al., 2021).

Task Description
The task of a dialog system is to generate an appropriate system response u_{T+1} to a user turn u_T and the preceding dialog context u_1^{T-1} := u_1, ..., u_{T-1}. In a document-grounded setting, u_{T+1} is based on knowledge from a set of relevant documents D ⊆ 𝒟, where 𝒟 denotes the set of all knowledge documents. Feng et al. (2020) identify three tasks relevant to such systems, namely 1) user utterance understanding; 2) agent response prediction; 3) relevant document identification. The shared task deals with the second task and assumes the result of the third task to be known. They further split this task into agent response grounding prediction and agent response generation. More specifically, one subtask focuses on identifying the grounding of u_{T+1} and the second subtask on generating u_{T+1}. In both subtasks, exactly one document d ∈ D is given. Each document consists of multiple sections, where each section consists of a title and content. In the doc2dial dataset, the latter is split into multiple subspans. In the following, we refer to these given subspans as phrases in order to avoid confusing them with arbitrary spans in the document.

Agent Response Grounding Prediction
The first subtask is to identify a span in a given document that grounds the agent response u_{T+1}. It is formulated as a span selection task where the aim is to return a tuple (a_s, a_e) of start and end positions of the relevant span within the grounding document d based on the dialog history u_1^T. In the context of the challenge, these spans always correspond to one of the given phrases in the documents.

Agent Response Generation
The goal of response generation is to provide the user with a system response u_{T+1} that is based on the dialog context u_1^T and document d and fits naturally into the preceding dialog.

Baselines
Agent Response Grounding Prediction For the first subtask, Feng et al. (2020) fine-tune BERT for question answering as proposed by Devlin et al. (2019). To this end, a start and an end score for each token are calculated by a linear projection from the last hidden states of the model. These scores are normalized using a softmax over all tokens to obtain probabilities for the start and end positions. To obtain the probability of a specific span, the probabilities of its start and end positions are multiplied. If the length of a document exceeds the maximum input length supported by the model, a sliding window with stride is moved over the document and each window is passed to the model. In training, if the correct span is not included in the window, the span consisting only of the beginning-of-sequence token is used as the target. In decoding, the scores of all windows are combined to find the best span.
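As a minimal sketch of this baseline scoring scheme (plain Python with illustrative names, not the actual baseline implementation), the independent start/end softmax and the search for the highest-probability span within one window could look like:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair maximizing p(start) * p(end).

    Start and end positions are treated as independent, as in the
    BERT-for-QA baseline; `max_len` caps the span length.
    """
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best, best_p = None, -1.0
    for s, ps in enumerate(p_start):
        for e in range(s, min(s + max_len, len(p_end))):
            p = ps * p_end[e]
            if p > best_p:
                best, best_p = (s, e), p
    return best, best_p
```

With sliding windows, the same search would simply be repeated per window and the best-scoring span across windows returned.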

Agent Response Generation
The baseline provided for the shared task uses a pre-trained BART model (Lewis et al., 2020a) to generate agent responses. The model is fine-tuned on the task's training data by minimizing the cross-entropy of the reference tokens. As input, it is provided with the dialog context, the title of the document, and the grounding document, separated by special tokens. Inputs longer than the maximum sequence length supported by the model (1,024 tokens for BART) are truncated. Effectively, this means that parts of the document are removed that may include the information relevant to the response. An alternative to truncating the document would be to truncate the dialog context (i.e. removing the oldest turns, which may be less relevant than the document). We did not experiment with this approach in this work and always included the full dialog context in the input. For decoding, beam search with a beam size of 4 is used.

Agent Response Grounding Prediction
Phrase restriction In contrast to standard QA tasks, in this task, the possible start and end positions of spans are restricted to phrases in the document. This motivated us to also restrict the possible outputs of the model to these positions. That is, instead of applying the softmax over all tokens, it is only applied over tokens corresponding to the start or end positions of a phrase, and thus only these positions are considered in training and decoding.
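A small sketch of this restriction (a hypothetical helper, not our actual code): the softmax is computed only over positions known to begin (or end) a phrase, and every other position receives zero probability, both in training and in decoding.

```python
import math

def restricted_softmax(logits, valid_positions):
    """Softmax over only the token positions that start (or end) a phrase.

    All other positions get probability exactly 0, so invalid spans can
    never be predicted and never receive probability mass in training.
    """
    valid = sorted(valid_positions)
    m = max(logits[i] for i in valid)
    exps = {i: math.exp(logits[i] - m) for i in valid}
    z = sum(exps.values())
    return [exps.get(i, 0.0) / z for i in range(len(logits))]
```

In a real model this would typically be implemented by adding a large negative mask to the invalid logits before the softmax; the effect is the same.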
Span-based objective The training objective for QA assumes that the probability of the start and end position are conditionally independent. Previous work (Fajcik et al., 2020) indicates that directly modeling the joint probability of start and end position can improve performance. Hence, to model this joint probability, we use a biaffine classifier as proposed by Dozat and Manning (2017) for dependency parsing.
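To illustrate the difference to the independent start/end model, a toy version of a biaffine span scorer (simplified from Dozat and Manning's parser setup; all names and dimensions here are illustrative) scores each candidate (start, end) pair jointly and normalizes with a single softmax over spans:

```python
import math

def biaffine_score(h_s, h_e, U, w, b):
    """Biaffine span score: h_s^T U h_e + w^T [h_s; h_e] + b."""
    bilinear = sum(h_s[i] * sum(U[i][j] * h_e[j] for j in range(len(h_e)))
                   for i in range(len(h_s)))
    linear = sum(wi * xi for wi, xi in zip(w, h_s + h_e))
    return bilinear + linear + b

def joint_span_distribution(hidden, spans, U, w, b):
    """One softmax over all candidate (start, end) spans.

    Unlike the baseline, start and end are modeled jointly, so the
    probability of a span is not forced to factorize into independent
    start and end probabilities.
    """
    scores = [biaffine_score(hidden[s], hidden[e], U, w, b) for s, e in spans]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return {span: e / z for span, e in zip(spans, exps)}
```

Combined with the phrase restriction above, `spans` would simply be the list of annotated phrase boundaries.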
Ensembling In our submission, we use an ensemble of multiple models for the prediction of spans to capture their uncertainty. More precisely, we use Bayesian Model Averaging (Hoeting et al., 1999), where the probability of a span a = (a_s, a_e) is obtained by marginalizing the joint probability of span and model over all models H as:

p(a | u_1^T, d) = Σ_{h ∈ H} p_h(a | u_1^T, d) · p(h)

The model prior p(h) is obtained by applying a softmax function over the logarithm of the F1 scores obtained on a validation set. Furthermore, we approximate the span posterior distribution p_h(a | u_1^T, d) by an n-best list of size 20.
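A sketch of this combination (illustrative names; the per-model n-best lists are assumed to be given as span-to-probability dicts). Note that a softmax over log F1 scores makes the model prior proportional to the F1 scores themselves:

```python
import math

def bma_span_posterior(nbest_per_model, f1_scores):
    """Combine per-model n-best lists by Bayesian Model Averaging.

    nbest_per_model: one dict {span: p_h(span | dialog, doc)} per model,
    approximating each model's span posterior by its n-best list.
    f1_scores: validation F1 per model; the prior p(h) is a softmax over
    log F1, i.e. proportional to F1 itself.
    """
    logs = [math.log(f) for f in f1_scores]
    m = max(logs)
    exps = [math.exp(x - m) for x in logs]
    z = sum(exps)
    priors = [e / z for e in exps]
    combined = {}
    for nbest, prior in zip(nbest_per_model, priors):
        for span, p in nbest.items():
            combined[span] = combined.get(span, 0.0) + prior * p
    return combined
```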

Agent Response Generation
Cascaded Response Generation One main issue with the baseline approach is that the model appears to be unable to identify the relevant knowledge when provided with long documents. Additionally, due to truncation, the input of the model may not even contain the relevant parts of the document. To solve this issue, we propose to model the problem by cascading span selection and response generation. This way, we only have to provide the comparatively short grounding span to the model instead of the full document. This allows the model to focus on generating an appropriate utterance rather than on identifying the relevant grounding information. Similar to the baseline, we fine-tune BART (Lewis et al., 2020a). In training, we provide the model with the dialog context u_1^T concatenated with the document title and the reference span, each separated by a special token. In decoding, the reference span is not available and we use the span predicted by our span selection model as input.
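The input assembly for the cascade is straightforward; a sketch (the `<sep>` token and function name are illustrative, not the paper's actual special tokens):

```python
def build_generation_input(dialog_turns, title, span_text, sep="<sep>"):
    """Assemble the generation model's input sequence.

    Concatenates the dialog context, the document title, and the
    grounding span (reference span in training, predicted span at test
    time), separated by a special token.
    """
    context = " ".join(dialog_turns)
    return f"{context} {sep} {title} {sep} {span_text}"
```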
Marginalization over Spans Conditioning on only the ground truth span creates a mismatch between training and inference time, since the ground truth span is not available at test time but has to be predicted. This leads to errors in span selection being propagated into response generation. Further, the generation model is unable to take the uncertainty of the span selection model into account. Similar to Lewis et al. (2020b) and Thulke et al. (2021), we propose to marginalize over all spans S. We model the response generation as:

p(u_{T+1} | u_1^T; d) = Σ_{s ∈ S} p(u_{T+1} | u_1^T, s; d) · p(s | u_1^T; d)

where the joint probability is factorized into a span selection model p(s | u_1^T; d) and a generation model p(u_{T+1} | u_1^T, s; d), corresponding to our models for each subtask. For efficiency, we approximate S by the top 5 spans, which we renormalize to maintain a probability distribution. The generation model is then trained with cross-entropy using an n-best list obtained from the separately trained selection model. A potential extension, which we did not yet try, is to train both models jointly.
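The resulting training criterion can be sketched as follows (illustrative names; in practice the generation probabilities would come from the fine-tuned BART model, here they are simply given as numbers):

```python
import math

def marginal_nll(span_probs, gen_probs):
    """Negative log-likelihood of a response when marginalizing over spans.

    span_probs: probabilities of the top-k spans from the selection
    model, renormalized here so they form a proper distribution.
    gen_probs: p(response | dialog, span) for each of those spans.
    Returns -log sum_s p(s) * p(response | s).
    """
    z = sum(span_probs)
    marginal = sum((p / z) * g for p, g in zip(span_probs, gen_probs))
    return -math.log(marginal)
```

Minimizing this loss lets the generator hedge across the selection model's n-best list instead of committing to a single, possibly wrong, span.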

Data
The shared task uses the doc2dial dataset (Feng et al., 2020) which contains 4,793 annotated dialogs based on a total of 487 documents. All documents were obtained from public government service websites and stem from the four domains Social Security Administration (ssa), Department of Motor Vehicles (dmv), United States Department of Veterans Affairs (va), and Federal Student Aid (studentaid). In the shared task, each document is associated with exactly one domain and is annotated with sections and phrases. The latter is described by a start and end index within the document and associated with a specific section that has a title and text. Each dialog is based on one document and contains a set of turns. Turns are taken either by a user or an agent and described by a dialog act and a list of grounding reference phrases in the document.
The training set of the shared task contains 3,474 dialogs with a total of 44,149 turns. In addition to the training set, the shared task organizers provide a validation set with 661 dialogs and a testdev set with 198 dialogs, which includes around 30% of the dialogs from the final test set. The final test set includes an additional domain of unseen documents and comprises a total of 787 dialogs. Documents are rather long, with a median length of 817.5 tokens and an average length of 991 tokens (using the BART subword vocabulary). Thus, in many cases, truncation of the input is required.

Experiments
We base our implementation on the provided baseline code of the shared task. Furthermore, we use the workflow manager Sisyphus (Peter et al., 2018) to organize our experiments.
For the first subtask, we use the base and large variants of RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2020) instead of BERT large uncased. In the second subtask, we use BART base instead of the large variant used in the baseline code, since even after reducing the batch size to one, we were not able to run the baseline with a maximum sequence length of 1,024 on our Nvidia GTX 1080 Ti and RTX 2080 Ti GPUs due to memory constraints. All models are fine-tuned with an initial learning rate of 3e-5. Base variants are trained for 10 epochs and large variants for 5 epochs.
We include agent follow-up turns in our training data, i.e. turns u_t taken by the agent where the preceding turn u_{t-1} was also taken by the agent. Like other agent turns, i.e. those where the preceding turn was taken by the user, these turns are annotated with their grounding span and can be used as additional samples in both tasks. In the baseline implementation, they are excluded from training and evaluation. To maintain comparability, we do not include them in the validation or test data.
For evaluation, we use the same metrics as the baseline. The first subtask is evaluated with exact match (EM), i.e. the percentage of predicted spans that exactly match the reference span (after lowercasing and removing punctuation, articles, and whitespace), and the token-level F1 score. The second subtask is evaluated using SacreBLEU (Post, 2018).
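For reference, a sketch of this SQuAD-style normalization and token-level F1 (we assume the usual definition here; the exact evaluation script of the shared task may differ in details):

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference span."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

EM is then simply the fraction of predictions with `normalize(prediction) == normalize(reference)`.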

Results
Table 1 summarizes our main results and submissions to the shared task. The first line shows the results obtained by reproducing the baseline provided by the organizers (using BART base for Subtask 2). We note that these results differ from the ones reported in Feng et al. (2020) due to slightly different data conditions in the shared task and their paper. The second line shows the results of our best single model. In Subtask 1, we obtained our best results by using RoBERTa large, trained additionally on agent follow-up turns, and by restricting the model to phrases occurring in the document. Using an ensemble of this model, an ELECTRA large model trained with the same approach, and a RoBERTa base model trained with the span-based objective, we achieve our best result. In the second subtask, our cascaded approach using this model and BART base significantly outperforms the baseline by over 10% absolute in BLEU. Using the predictions of the ensemble in Subtask 2 translates to a further significant improvement in BLEU, which indicates a strong influence of the quality of the agent response grounding prediction.

Ablation Analysis
Agent Response Grounding Prediction Table 2 gives an overview of our ablation analysis for the first subtask. In addition to F1 and EM, we report EM@5, which we define as the percentage of turns where an exact match is part of the 5-best list predicted by the model. This metric gives an indication of the quality of the n-best lists produced by the model. Both RoBERTa large and ELECTRA large outperform BERT large in terms of F1 and EM, with RoBERTa large performing best. Removing agent follow-up turns from training consistently degrades the results for both models.
Restricting the predictions of the model to valid phrases during training and evaluation gives consistent improvements in the EM and EM@5 scores. Training RoBERTa base with the span-based objective, we observe degradations in F1 and EM but an improvement in EM@5, which indicates that it better models the distribution across phrases. Due to instabilities during training, we were not able to train a large model with the span-based objective. Additionally, we only experimented with the biaffine classifier discussed in Section 3. It would be interesting to compare the results with other span-based objectives such as the ones proposed by Fajcik et al. (2020).

Agent Response Generation Table 3 shows an ablation study of our results in response generation. The results show that our cascaded approach outperforms the baseline by a large margin. Further experiments with additional context, such as the title of a section or a window of 10 tokens to each side of the span, do not give improvements. This indicates that the selected spans seem to be sufficient to generate suitable responses. Furthermore, marginalizing over multiple spans leads to degradations, which might be because training is based on an n-best list from an uncertain model. We observe our best results when using only the predicted span and a beam size of 6. Furthermore, we add a repetition penalty of 1.2 (Keskar et al., 2019) to discourage repetitions in generated responses. Finally, the last line of the table reports the results of the cascaded method when using ground truth spans instead of the spans predicted by a model. That is, a perfect model for the first subtask would additionally improve the results by 4.7 points absolute in BLEU.

Conclusion
In this paper, we have described our submissions to both subtasks of the first DialDoc shared task. In the first subtask, we have experimented with restricting the set of predictable spans to valid phrases, which yields consistent improvements in terms of EM. Furthermore, we have employed a model that directly hypothesizes entire spans and shown the benefits of combining multiple models using Bayesian Model Averaging. In the second subtask, we have shown how cascading span selection and response generation improves results compared to providing the entire document for generation. We have compared marginalizing over spans to using a single predicted span for generation, the latter of which obtains our best results in the shared task.