Team JARS: DialDoc Subtask 1 - Improved Knowledge Identification with Supervised Out-of-Domain Pretraining

In this paper, we discuss our submission for DialDoc subtask 1. The subtask requires systems to extract knowledge from FAQ-type documents that is vital to reply to a user’s query in a conversational setting. We experiment with pretraining a BERT-based question-answering model on different QA datasets from MRQA, as well as conversational QA datasets like CoQA and QuAC. Our results show that models pretrained on CoQA and QuAC perform better than their counterparts pretrained on MRQA datasets. Our results also indicate that adding more pretraining data does not necessarily result in improved performance. Our final model, an ensemble of ALBERT-XL models pretrained independently on CoQA and QuAC that selects the answer with the highest average probability score, achieves an F1-Score of 70.9% on the official test-set.


Introduction
Question Answering (QA) involves constructing an answer for a given question in either an extractive or an abstractive manner. QA systems are central to other Natural Language Processing (NLP) applications like search engines and dialogue systems. Recently, QA-based solutions have also been proposed to evaluate the factuality (Wang et al., 2020) and faithfulness (Durmus et al., 2020) of abstractive summarization systems.
In addition to popular QA benchmarks like SQuAD (Rajpurkar et al., 2016) and MRQA-2019 (Fisch et al., 2019), we have seen QA challenges that require reasoning over human dialogue. Notable examples include QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2019). These datasets require the model to attend to the entire dialogue context in the process of retrieving an answer. In this work, we are interested in building a QA system to help with human dialogue. Feng et al. (2020) introduced a new dataset of goal-oriented dialogues (Doc2Dial) that are grounded in associated documents. Each sample in the dataset consists of an information-seeking conversation between a user and an agent, where the agent's responses are grounded in FAQ-like webpages. The DialDoc shared task derives its training data from the Doc2Dial dataset and proposes two subtasks, which require the participants to (1) identify the grounding knowledge, in the form of a document span, for the next agent turn; and (2) generate the next agent response in natural language.
In this paper, we describe our solution to subtask 1. This subtask is formulated as a span selection problem. Therefore, we leverage a transformer-based extractive question-answering model (ALBERT; Lan et al., 2019) to extract the relevant spans from the document. We pretrain our model on different QA datasets like SQuAD, different subsets of the MRQA-2019 training set, and conversational QA datasets like CoQA and QuAC. We find that models pretrained on out-of-domain QA datasets substantially outperform the baseline. Our experiments suggest that conversational QA datasets are more useful than MRQA-2019 data or its subsets. In the following sections, we first present an overview of the DialDoc shared task (§2), followed by our system description (§3) and a detailed account of our experimental results and ablation studies (§4, §5).

DialDoc Shared Task Dataset
The dataset used in the DialDoc shared task is derived from the Doc2Dial dataset (Feng et al., 2020), a new dataset of goal-oriented document-grounded dialogues. It includes a set of documents and conversations between a user and an agent grounded in the associated documents. The authors provide dialogue act annotations for each utterance in the dialogue flow, along with the document span that grounds it.
The dataset shared during the shared task was divided into train/validation/testdev/test splits. The train and validation splits were provided to the participants to facilitate model development. During phase 1, the models were evaluated on the testdev split, whereas the final ranking was based on performance on the test set.
Pre-processing Using the pre-processing scripts provided by the task organizers, we converted the Doc2Dial dataset into SQuAD v2.0 format, with questions containing the latest user utterance as well as all previous turns in the conversation. This is in line with Feng et al. (2020), who showed that including the entire conversational history performs better than considering only the current user utterance. The dialogue context is concatenated with the latest user utterance in reverse time order.
The output of this pre-processing step consisted of 20431 training, 3972 validation, 727 testdev, and 2824 test instances.
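The reverse-time-order concatenation described above can be sketched as follows. This is a minimal illustration of the stated pre-processing step, not the organizers' actual script; the field names and example utterances are hypothetical.

```python
def build_question(turns):
    """Build the QA 'question' string from a dialogue history by
    concatenating turns in reverse time order, so the latest user
    utterance comes first.  `turns` is a list of utterance strings,
    oldest first."""
    return " ".join(reversed(turns))

# Hypothetical three-turn conversation; the latest user utterance
# ("Can I renew online?") leads the resulting question string.
q = build_question([
    "Hello, I need help with my license.",
    "Sure, what would you like to know?",
    "Can I renew online?",
])
```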

System Description
As discussed earlier, subtask 1 of DialDoc shared task is formulated as a span selection problem. Therefore, in order to learn to predict the correct span, we use an extractive question-answering setup.

Question-Answering Model
We pass the pre-processed training data through a QA model that leverages a transformer encoder to contextually represent the question (dialogue history) along with the context (document). Since the grounding document is often longer than the maximum input sequence length for transformers, we follow Feng et al. (2020) and truncate the documents into sliding windows with a stride. The document chunk and the dialogue history are passed through the transformer encoder to create contextual representations for each token in the input. To extract the beginning and ending positions of the answer span within the document, the encoded embeddings are sent to a linear layer that outputs two logits corresponding to the probability of each position being the start and end of the answer span. The training loss is computed using the Cross-Entropy loss function. We use the huggingface transformers toolkit in all of our experiments.
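The sliding-window truncation mentioned above can be sketched in a few lines. This is a simplified stand-in for the chunking done by the huggingface tokenizers (the `max_len` and `stride` values below are illustrative, not the ones used in our experiments):

```python
def sliding_windows(tokens, max_len, stride):
    """Split a long token sequence into overlapping chunks, as done
    when the grounding document exceeds the transformer's maximum
    input length.  Consecutive chunks overlap by (max_len - stride)
    tokens so that no answer span is lost at a chunk boundary."""
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks

# A 10-token "document" split into windows of 4 with stride 2.
chunks = sliding_windows(list(range(10)), max_len=4, stride=2)
```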

Pretraining
Recent work (Gururangan et al., 2020) has shown that multi-phase domain-adaptive pretraining of transformer-based encoders on related datasets (and tasks) benefits the overall performance of the model on the downstream task. Motivated by this, we experimented with further pretraining the QA model on different out-of-domain QA datasets to gauge its benefits on Doc2Dial (Table 1).

Experimental Setup
In this section, we discuss our experimental setup in detail.

Pretraining Datasets
We briefly describe the datasets used for the continual pretraining of our transformer-based QA models: SQuAD, individual subsets of the MRQA-2019 training set, and the conversational QA datasets QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2019). For the conversational datasets, we filter out samples which do not adhere to the SQuAD-like extractive QA setup (e.g., yes/no questions) or have a context length of more than 5000 characters. Table 1 presents the size of the different pretraining datasets after the removal of non-extractive QA samples.

Evaluation Metrics
The shared task relies on Exact Match (EM) and F1 metrics to evaluate the systems on subtask 1. To compute these scores, we use the SQuAD metrics implementation from huggingface.
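For intuition, the two metrics can be sketched as below. This is a simplified version of the SQuAD-style computation (the official implementation additionally normalizes articles and punctuation before comparing):

```python
from collections import Counter

def exact_match(prediction, reference):
    """1.0 if the (lowercased) strings match exactly, else 0.0."""
    return float(prediction.lower() == reference.lower())

def f1_score(prediction, reference):
    """Token-level F1 between predicted and reference answer spans."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```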

Hyperparameters
We use the default parameters set by the subtask baseline provided by the authors. However, we reduce the training per-device batch-size to 2 to accommodate the large models on an Nvidia Geforce GTX 1080 Ti 12GB GPU. We stop the continual out-of-domain supervised pretraining after 2 epochs.

Results
We now present the results for different experimental setups we tried for DialDoc subtask 1.

Pretraining on Different QA Datasets
Our first set of results shows the differential benefits of different out-of-domain QA datasets when used to pretrain the transformer encoder. Experiments with bert-base-uncased on the validation set (Table 2) show that pretraining on different QA datasets is indeed beneficial. Datasets like SQuAD, NewsQA, and NaturalQuestions are more useful than SearchQA and TriviaQA. However, pretraining on the complete MRQA-2019 training set does not outperform the individual datasets, suggesting that merely introducing more pretraining data might not result in improved performance. Furthermore, conversational QA datasets like CoQA and QuAC, which are more similar in their setup to DialDoc, perform substantially better than any of the MRQA-2019 training datasets.
We observe similar trends with larger transformers (Table 3). Models pretrained on QuAC or CoQA outperform those pretrained on SQuAD. However, combining CoQA and QuAC during pretraining does not seem to help with the performance on validation or testdev split.
Analyzing Different Transformer Variants Table 3 also contains the results for experiments where albert-xl is used to encode the question-context pair. We find that albert-xl-based models outperform their bert counterparts on the validation set. However, they do not generalize well to the testdev set, which contains about 30% of the test instances and is much smaller than the validation set (727 samples in testdev vs. 3972 in validation).

Results on test set
We only submitted our best performing models on the official test set due to a constraint on the number of submissions. Contrary to the trends in the testdev phase, albert-xl models trained on conversational QA datasets perform the best. albert-xl + QuAC is the best-performing single model according to the EM metric (EM = 52.60), whereas albert-xl + CoQA performs the best on the F1 metric (F1 = 69.48) on the test set.

Ensembling
We perform ensembling over the outputs of the model variants to obtain a single unified ranked list. For a given question Q, each model produces 20 candidate spans, each with a corresponding probability score ps. We compute the rank score rs for an answer span at rank r as rs = 1 / log2(r + 1). We then aggregate the answer spans across the model variants using one of the following techniques. Frequent: we choose the answer span that is the most frequent across the model variants. Rank Score: we choose the answer span with the highest average rank score. Probability Score: we choose the answer span with the highest average probability score.
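The rank-score formula and the probability-score aggregation can be sketched as follows. This is a minimal illustration of the aggregation logic, not our actual ensembling code; the input format (lists of (span, probability) pairs, one list per model) is an assumption for the sketch.

```python
import math
from collections import defaultdict

def rank_score(r):
    """Rank score rs = 1 / log2(r + 1) for a span at rank r (1-indexed)."""
    return 1.0 / math.log2(r + 1)

def ensemble_prob(candidates_per_model):
    """Probability-score ensembling: each model contributes a ranked list
    of (span, probability) pairs; pick the span with the highest average
    probability across models.  Spans a model did not propose contribute
    probability 0 to that model's share of the average."""
    totals = defaultdict(float)
    for ranked in candidates_per_model:
        for span, prob in ranked:
            totals[span] += prob
    n_models = len(candidates_per_model)
    return max(totals, key=lambda s: totals[s] / n_models)
```

Frequency- and rank-score-based aggregation follow the same pattern, summing a count or `rank_score(r)` instead of the probability.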
We observe empirically that ensembling using the probability score performs the best and hence we report the results of ensembling using the probability score (E) in Table 3.
We observe the highest gains after ensembling the outputs of all 5 model variants on the validation and testdev sets. However, the best performance on the test set was achieved by ensembling over the albert-xl models pretrained independently on CoQA and QuAC (EM = 53.5, F1 = 70.9). This was the final submission for our team.

Informed Data Selection
We investigate the disparate impact of pretraining on different MRQA-2019 datasets on the Doc2Dial shared task. Specifically, we explore factors such as answer length, relative position of the answer in the context, question length, and context length in Table 4. We observe that SQuAD, NewsQA, and NaturalQuestions (NQ) have comparatively longer answers than the other datasets. However, we do not observe a noticeable difference in terms of question length, context length, or relative position of the answer in the context with respect to the other datasets. We also use the dataset of Li and Roth (2002) to train a BERT classifier that predicts the answer type of a question with 97% accuracy. The coarse answer types are DESC (Description), NUM (Numerical), ENT (Entity), HUM (Person), LOC (Location), and ABBR (Abbreviation). We use the classifier to gauge the distribution of answer types on the MRQA datasets and Doc2Dial. We observe from Figure 2 that a majority of questions in Doc2Dial require a descriptive answer. These DESC-type questions are more prevalent in SQuAD, NewsQA, and NQ, which might explain their efficacy.
To ascertain the benefit of intelligent sampling, we pretrain on a much smaller subset of the SQuAD, NewsQA, and NaturalQuestions datasets. We select questions which satisfy at least one of the following criteria: (i) the answer length is ≥ 50, (ii) the question includes a 'how' or 'why' question word, or (iii) the answer type of the question is 'DESC'. Overall, the selected sample is only 20% the size of the original data, yet achieves a higher EM score than the combined dataset, as seen in Table 2. Surprisingly, however, its performance is lower than that of each individual dataset.
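The selection criteria above can be sketched as a simple filter. The field names (`question`, `answer`, `answer_type`) are hypothetical, and we assume here that answer length is measured in characters; `answer_type` would come from the trained BERT classifier:

```python
def keep_sample(example):
    """Informed-sampling filter: keep a QA pair if (i) its answer is
    long (>= 50), (ii) it is a 'how'/'why' question, or (iii) its
    predicted answer type is DESC."""
    question_words = example["question"].lower().split()
    return (
        len(example["answer"]) >= 50            # (i) long answer
        or "how" in question_words              # (ii) how/why question
        or "why" in question_words
        or example["answer_type"] == "DESC"     # (iii) descriptive answer
    )
```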

Conclusion
Our submission to DialDoc subtask 1 performs continual pretraining of a transformer-based encoder on out-of-domain QA datasets. Experiments with different QA datasets suggest that conversational QA datasets like CoQA and QuAC are highly beneficial, as their setup is substantially similar to Doc2Dial, the downstream dataset of interest. Our final submission ensembles two ALBERT-XL models independently pretrained on CoQA and QuAC and achieves an F1-Score of 70.9% and an EM-Score of 53.5% on the competition test-set.

Impact Statement
In this work, we tackle the task of question answering (QA) for English language text. While we believe that the proposed methods can be effective in other languages, we leave this exploration for future work. We also acknowledge that QA systems suffer from bias (Li et al., 2020), which often leads to unintended real-world consequences. For the purpose of the shared task, we focused solely on the modeling techniques, but a study of model bias in our systems is necessary.