CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System

Information-seeking dialogue systems, including knowledge identification and response generation, aim to respond to users with fluent, coherent, and informative responses based on their needs. To tackle this challenge, we utilize data augmentation methods and several training techniques with pre-trained language models to learn a general pattern of the task and thus achieve promising performance. In the DialDoc21 competition, our system achieved a 74.95 F1 score and a 60.74 Exact Match score in subtask 1, and a 37.72 SacreBLEU score in subtask 2. Empirical analysis is provided to explain the effectiveness of our approaches.


Introduction
Recent progress in research has opened up real-life applications of dialogue systems, of which information-seeking dialogue systems are one of the major types. The goal of such systems is to provide fluent and coherent responses with sufficient information to users based on their needs, retrieving the information using the dialogue history. The performance of an information-seeking dialogue system can be evaluated from three aspects: (1) user utterance understanding, (2) relevant knowledge retrieval, and (3) agent response generation (Feng et al., 2020). This paper presents our work on the DialDoc-21 Shared Task, which is to teach a dialogue system to identify the most relevant knowledge in the associated document for generating agent responses in natural language. It is composed of two subtasks: Knowledge Identification (KI), to retrieve the knowledge from the document, and Response Generation (RG), to generate an agent utterance utilizing the retrieved knowledge.

* These two authors contributed equally.
To tackle this problem, we leverage the pre-trained language models from Liu et al. (2019a) and Lewis et al. (2020) and explore data augmentation methods with several training techniques, so as to avoid over-fitting to the DialDoc datasets and to teach the model the general pattern of the task. Ensembling and post-processing are conducted to further improve model performance. Experimental results show that data augmentation is a simple but effective approach for knowledge identification in information-seeking dialogue systems (Madotto et al., 2020a), while bringing improvement to response generation at the same time. In the DialDoc-21 competition, our system achieved a 74.95 F1 score and a 60.74 Exact Match score in subtask 1, and a 37.72 SacreBLEU score (Post, 2018) in subtask 2.

Datasets
Doc2Dial dataset In this shared task, we mainly focus on the Doc2Dial dataset (Feng et al., 2020). Doc2Dial addresses the challenge of modeling different dialogue scenes with documents and providing free-form responses, while allowing follow-up questions from the agent. The shared task evaluation is divided into a testdev phase and a test phase. The main difference between them is that in the test phase, out-of-domain (OOD) data samples are included by selecting documents from a domain unseen during training. The testdev phase only covers 30% of the data samples in the final test phase.
Besides Doc2Dial, several other datasets are leveraged for augmentation, as follows: the MRQA 2019 Shared Task dataset is a collection of multiple reading comprehension datasets for evaluating the generalization ability of QA models. Six datasets are assigned to the training split and are not included in the evaluation. Among them, SearchQA (Dunn et al., 2017) and TriviaQA (Joshi et al., 2017) differ from the others in data source and show the least generalization ability compared to the other four datasets, as reported in Su et al. (2019). In this shared task, we consider two settings when leveraging the MRQA dataset: MRQA and MRQA small, which excludes SearchQA and TriviaQA.

Wizard-of-Wikipedia (WoW) is a commonly-used knowledge-grounded dialogue dataset (Dinan et al., 2018). It aims at providing content-full responses to user utterances based on Wikipedia documents.

Methodology
We utilize a series of data augmentation approaches to enable the model to obtain better representations of both the dialogue context and the document context, and to learn a general pattern of the task with less domain bias. Specifically, we adopt a two-stage training paradigm: the first stage is pre-training (PT) to obtain a better model initialization, and the second stage is fine-tuning (FT) to adapt to the DialDoc task. In each stage, we can apply a multi-task learning (MTL) strategy when multiple datasets are available, by making the dataset formats uniform and treating all samples equally. As reported in Fisch et al. (2019), a model trained on multiple datasets under similar tasks is expected to provide a better initialization for further fine-tuning and is capable of generalizing to data samples from other domains. Thus, we expect a model trained with MTL in the first stage to offer a better initialization and, in the second stage, to reduce domain bias and avoid overfitting.
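The MTL strategy described above amounts to converting every dataset into one uniform extractive-QA format and shuffling the pooled samples together. A minimal sketch of this pooling step, with hypothetical field names chosen for illustration (the paper does not specify its exact schema):

```python
import random

def to_uniform(sample, source):
    # Convert a sample from any extractive-QA-style dataset into a shared
    # (context, question, answer_start, answer_text) format. Field names
    # here are illustrative, not the exact schema used in the paper.
    return {
        "context": sample["context"],
        "question": sample["question"],
        "answer_start": sample["answer_start"],
        "answer_text": sample["answer_text"],
        "source": source,  # kept only for analysis, not used in training
    }

def mix_datasets(datasets, seed=42):
    # datasets: dict mapping dataset name -> list of raw samples.
    # All samples are pooled and shuffled so every sample is treated
    # equally, regardless of which dataset it came from.
    pool = [
        to_uniform(sample, name)
        for name, samples in datasets.items()
        for sample in samples
    ]
    random.Random(seed).shuffle(pool)
    return pool
```

A single model is then trained on the shuffled pool exactly as it would be on one dataset, which is what lets each stage treat heterogeneous sources uniformly.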

Knowledge Identification
In the KI task, we conduct experiments with a large pre-trained model, RoBERTa-large (Liu et al., 2019a), which has shown its effectiveness on many QA datasets (Ju et al., 2019). The MRQA dataset and three conversational QA (CQA) datasets are leveraged for data augmentation. We consider the following combinations of experimental settings, using the CQA datasets to enrich the data source.
RoBERTa cqa is fine-tuned on Doc2Dial and three CQA datasets using MTL method. RoBERTa f(cqa) leverages the pretrained RoBERTa cqa model and is fine-tuned on Doc2Dial dataset for better performance.
We train the RoBERTa model on the MRQA dataset and the MRQA small dataset described in § 2 using MTL, respectively (denoted as RoBERTa mrqa and RoBERTa mrqas ). These models can be further fine-tuned while providing a better initialization (Fisch et al., 2019).
RoBERTa f(mrqa) is to further fine-tune RoBERTa mrqa on Doc2Dial dataset.
The corresponding settings are also applied to RoBERTa f(mrqas) model.
RoBERTa cqa(mrqa) is initialized with RoBERTa mrqa and fine-tuned on Doc2Dial and the three CQA datasets using MTL. RoBERTa cqa(mrqas) follows the same setting as the former model, but uses the RoBERTa mrqas model for initialization instead. RoBERTa f(cqa(mrqas)) further fine-tunes RoBERTa cqa(mrqas) on the Doc2Dial dataset.
RoBERTa all is trained on Doc2Dial, MRQA dataset and CQA datasets using MTL method.
For better readability, we summarize the model settings in Table 1. We also explored further combinations of experimental settings, such as other combinations of the datasets and other pre-trained language models. However, those fail to bring as much improvement as the settings mentioned above.
Post-processing We further conduct post-processing on the model predictions, based on our observation that the ground truths of the data samples are annotated by the document splits provided together with the dataset. We include the whole document split once the prediction covers a fraction λ of it, where λ is set to 0.1. In addition, for better performance in the shared task, we slightly extend the predictions when a "Yes" or "No" appears right in front of the predicted span.
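The split-snapping rule can be sketched as follows, assuming character-offset spans and interpreting λ = 0.1 as a 10% coverage threshold (the function name and offset convention are our own illustration):

```python
def snap_to_split(pred_start, pred_end, splits, lam=0.1):
    # splits: list of (start, end) character offsets of the document
    # splits provided with the dataset.
    # If the predicted span covers at least a fraction `lam` of a split,
    # extend the prediction to include that whole split.
    out_start, out_end = pred_start, pred_end
    for s, e in splits:
        overlap = max(0, min(pred_end, e) - max(pred_start, s))
        if e > s and overlap / (e - s) >= lam:
            out_start = min(out_start, s)
            out_end = max(out_end, e)
    return out_start, out_end
```

Because the ground truths are annotated at the granularity of whole splits, snapping partial predictions to split boundaries aligns the output with how the references were constructed.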
Ensemble To further boost model performance, we build an ensemble of our existing models. We consider one prediction, containing the start and end positions in the document, as a unit, and conduct voting over all the predictions for each data sample. The most frequent one is selected as the final prediction. We denote the ensemble result as RoBERTa ensemble .
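Treating each (start, end) pair as a single voting unit, the ensemble reduces to a majority vote over spans, which can be sketched as:

```python
from collections import Counter

def ensemble_vote(predictions):
    # predictions: list of (start, end) spans, one per model checkpoint.
    # Each (start, end) pair is treated as a single unit; the span that
    # appears most often across checkpoints is selected.
    counts = Counter(predictions)
    span, _ = counts.most_common(1)[0]
    return span
```

Voting on the whole span, rather than on start and end positions independently, guarantees that the final prediction is a span at least one model actually produced.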

Response Generation
To obtain natural and relevant responses, we take advantage of the knowledge evidence for the query identified in § 3.1 and focus on paraphrasing the corresponding knowledge sentences based on the dialogue context. We leverage the large pre-trained model BART large (Lewis et al., 2020). The training and inference process can be summarized in three steps: Pre-training on the WoW dataset. We first pre-train the BART model on the WoW dataset for a better initialization, given its similarity to the RG task. During training, the gold grounded knowledge sentences are concatenated with the dialogue context and fed into the model as input.
Fine-tuning on the Doc2Dial dataset. In the Doc2Dial dataset, the labels of the gold document splits are also provided in the training and validation sets. The model is further fine-tuned on the Doc2Dial dataset using the same input-sequence components as in the first step. The model can be evaluated under two scenarios: (1) Gold mode (BART gold ), leveraging the gold labels of the knowledge evidence in the dataset as the knowledge inputs; (2) Prediction mode (BART pred ), leveraging the predictions of the KI process as the inputs.
Inference with Knowledge Evidence. During the testdev and test phase, we leverage the predictions from the KI process as the knowledge evidence components for the dialogue queries. The model generates responses based on a concatenation of the knowledge evidence and the dialogue context.
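The input construction shared by all three steps above can be sketched as follows. The speaker tokens and function name are our own illustration; the paper does not specify the exact special tokens used:

```python
def build_rg_input(knowledge, history, user_tok="<user>", agent_tok="<agent>"):
    # Concatenate the knowledge evidence with the dialogue context to
    # form the encoder input of a seq2seq model such as BART.
    # history: list of utterances, assumed to alternate user/agent
    # starting from the user.
    turns = []
    for i, utterance in enumerate(history):
        speaker = user_tok if i % 2 == 0 else agent_tok
        turns.append(f"{speaker} {utterance}")
    return knowledge + " " + " ".join(turns)
```

At pre-training and gold-mode fine-tuning time, `knowledge` is the gold grounded sentence or document split; at inference time, it is the span predicted by the KI model.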
Post-processing To avoid serious information loss in the generations compared to the knowledge evidence for the OOD data samples, we compare the lengths of the knowledge evidence and the response (denoted as L kn and L resp ). The generated response is replaced by the raw knowledge evidence as the final output if L resp ≤ αL kn , where α is set to 0.4.
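This length-ratio fallback can be sketched as a single check, here assuming word-level lengths (the paper does not say whether lengths are counted in words, tokens, or characters):

```python
def fallback_to_evidence(response, knowledge, alpha=0.4):
    # If the generated response is much shorter than the knowledge
    # evidence (L_resp <= alpha * L_kn), output the raw evidence instead,
    # to avoid severe information loss on out-of-domain samples.
    l_resp = len(response.split())
    l_kn = len(knowledge.split())
    if l_resp <= alpha * l_kn:
        return knowledge
    return response
```

A very short generation relative to its evidence is a cheap proxy for the model having dropped content it could not paraphrase, which is more likely on unseen domains.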

Training Details
Hyper-parameter Settings We apply different settings to utilize the dialogue history for the two subtasks. For subtask 1, we leverage all previous turns and build the input sequence in reverse order of them. For subtask 2, we leverage one extra last turn in temporal order and differentiate the speakers with special tokens. In Table 2, we list the selected hyper-parameters utilized in the shared task.
Ensemble Settings In subtask 1, we make an ensemble of all the checkpoints of the models listed in Table 1 except RoBERTa mrqa and RoBERTa mrqas . The details of the checkpoints can be found in Table 3.

Metrics and Model Selection
In subtask 1, the Exact Match (EM) and uni-gram F1 score are utilized as the criteria, while in subtask 2, we evaluate the generation by SacreBLEU. We select the models with the best EM and SacreBLEU scores on the validation set respectively, for the two subtasks. Specifically for subtask 2, the model is selected under the gold mode.
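The subtask 1 criteria are the standard extractive-QA metrics; a minimal sketch of Exact Match and uni-gram F1 (with simple lowercasing as the only normalization, which may differ from the official evaluation script):

```python
from collections import Counter

def exact_match(pred, gold):
    # Exact Match: 1 if the normalized strings are identical, else 0.
    return int(pred.strip().lower() == gold.strip().lower())

def unigram_f1(pred, gold):
    # Uni-gram (token-level) F1 between prediction and reference.
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only perfect span recovery, while uni-gram F1 gives partial credit for overlapping tokens, which is why both are reported.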

Results
The results are shown in Table 3 and Table 4. For both subtasks, we observe gaps between the testdev phase and the test phase. For some of the models in subtask 1, multiple random seeds are applied in the training process. The performance gap may result from the domain difference of part of the data samples in the test phase, whose corresponding documents are unseen in the training set. In Table 3, without post-processing on the predictions, the model performance consistently drops to a certain extent, which indicates that post-processing is suitable for the Doc2Dial scenario. Ensembling, a common strategy to improve performance, also shows its effectiveness in this task. For subtask 2, pre-training on the WoW dataset brings a huge improvement to the model. Interestingly, by directly outputting the knowledge evidence predicted by the subtask 1 RoBERTa ensemble model, or the gold knowledge evidence labels, the performance can even exceed that of the generative model in SacreBLEU score, while the responses from BART pred are more fluent and natural. This may be caused by the information loss when paraphrasing the knowledge evidence into dialogue responses.

Discussion
In this task, we explore data augmentation methods and conduct two-stage training as an auxiliary training strategy for improvement. Although resource- and time-consuming, this approach is easy to implement and effective at enabling the model to acquire a more general ability on the task.

Post-Challenge Improvements
In our submitted system, one hyper-parameter, the maximum answer length, was left untuned, which hurts the QA model performance to some degree. With a maximum answer length of 100, the EM and F1 scores on the testdev set improve by 2.53 and 1.08, respectively, while a 64.42 EM and a 77.27 F1 score are achieved on the test set. With the improved predictions from subtask 1, we achieve a 39.88 SacreBLEU score in subtask 2.

Related Work
Conversational QA is a type of reading comprehension task that requires understanding not only the question but also the previous conversation turns. Various datasets have been introduced in recent years; many of them restrict answers to extracted spans from the reference document, while others allow free-form responses (Choi et al., 2018; Reddy et al., 2019; Campos et al., 2020). In addition to work enriching the contents of open-domain conversations through controllable generation (Lin et al., 2020; Madotto et al., 2020b), the knowledge-grounded dialogue task aims to offer more informative conversations by leveraging an external knowledge source (Dinan et al., 2018). Relevant knowledge selection is key to improving the whole system, and very recently, latent variable models have been attracting more attention for this purpose (Lian et al., 2019; Liu et al., 2019b).

Conclusion
In this paper, we utilize data augmentation methods and several training techniques with pretrained language models to tackle the challenge of the information-seeking dialogue task. The results have indicated the effectiveness of our approaches. Moreover, data augmentation methods are easy to implement, which is promising for practical use.