Technical Report on Shared Task in DialDoc21

We participate in the DialDoc Shared Task sub-task 1 (Knowledge Identification). The task requires identifying the grounding knowledge, in the form of a document span, for the next dialogue turn. We employ two well-known pre-trained language models (RoBERTa and ELECTRA) to identify candidate document spans and propose a metric-based ensemble method for span selection. Our methods include data augmentation, model pre-training/fine-tuning, post-processing, and ensembling. On the submission page, we rank 2nd based on the average of the normalized F1 and EM scores used for the final evaluation; specifically, we rank 2nd on EM and 3rd on F1.


Introduction
Our team SCIR-DT participates in the DialDoc shared task at the Document-grounded Dialogue and Conversational QA Workshop at ACL-IJCNLP 2021. There are two sub-tasks based on the Doc2Dial dataset (Feng et al., 2020). The dataset contains goal-oriented conversations between a user and an assistive agent. Each dialogue turn is annotated with a dialogue scene, which includes the role, the dialogue act, and the grounding in a document (or irrelevance to the domain documents). The documents come from different domains, such as Social Security and Veterans Affairs. Sub-task 1 is Knowledge Identification, which requires identifying the grounding knowledge, in the form of a document span, for the next agent turn. The input is the dialogue history, the current user utterance, and the associated document; the output should be a text span. The evaluation metrics are Exact Match (EM) and F1 (Rajpurkar et al., 2016). Sub-task 2 is text generation, which requires generating the next agent response in natural language; its input is the dialogue history together with the grounding span identified in sub-task 1.

Related Work

The Document-Grounded Dialogue (DGD) task maintains a dialogue pattern where the external knowledge used in dialogues can be obtained from a given document. Recently, several DGD datasets (Moghe et al., 2018; Dinan et al., 2019) have been released to exploit unstructured document information in open-domain dialogues. The Doc2Dial dataset is also document-grounded dialogue; however, the dialogue in Doc2Dial is goal-oriented, guiding users to access various forms of information according to their needs.
The CQA task (such as CoQA (Reddy et al., 2019), QuAC (Choi et al., 2018), and DoQA (Campos et al., 2020)) is also based on a background document; it aims to understand a text passage and answer a series of interconnected questions that appear in a conversation. The difference between DGD and CQA is that the dialogue in DGD is more diversified (including chit-chat and recommendation) and not limited to QA. The Doc2Dial task is closely related to the CQA tasks: it shares their challenges and additionally introduces dialogue scenes where the agent asks questions when the user query is under-specified or additional verification is required for a definitive solution.

Pre-trained Language Model (PLM)
Traditional word embeddings (Pennington et al., 2014) are fixed and context-independent; they cannot resolve the out-of-vocabulary (OOV) problem or the ambiguity of words in different contexts. To address these problems, Pre-trained Language Models (PLMs) such as BERT (Devlin et al., 2019) were introduced. BERT employs a Masked Language Modeling (MLM) objective that first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the remaining tokens. Concurrently, other research proposed enhanced versions of MLM to further improve on BERT. Instead of static masking, RoBERTa (Liu et al., 2019) improved BERT with dynamic masking and abandoned the Next Sentence Prediction (NSP) loss. Instead of masking the input, ELECTRA (Clark et al., 2020) replaces some input tokens with plausible alternatives sampled from a small generator network and trains a discriminative model to predict whether each token in the corrupted input was replaced by the generator. When used for downstream tasks, these PLMs are first trained on a large corpus and then fine-tuned on specific tasks. Contextualized embeddings have been shown to outperform traditional word embeddings on downstream NLP tasks (Qiu et al., 2020). We adopt BERT, RoBERTa, and ELECTRA in this competition.
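As an illustration of the whole word masking technique we later use for re-training, the sketch below (not our competition code; it assumes WordPiece-style `##` continuation markers) masks every sub-token of a sampled word together, instead of masking sub-tokens independently:

```python
import random

def whole_word_mask(subtokens, mask_rate=0.15, seed=0):
    """Illustrative whole-word masking over WordPiece sub-tokens."""
    # Group sub-token indices into whole words: a token starting with
    # "##" continues the previous word (WordPiece convention).
    words, cur = [], []
    for i, tok in enumerate(subtokens):
        if tok.startswith("##") and cur:
            cur.append(i)
        else:
            if cur:
                words.append(cur)
            cur = [i]
    if cur:
        words.append(cur)
    rng = random.Random(seed)
    n_mask = max(1, round(len(words) * mask_rate))
    masked = set()
    for word in rng.sample(words, n_mask):
        masked.update(word)  # mask every sub-token of the chosen word
    return ["[MASK]" if i in masked else t for i, t in enumerate(subtokens)]
```

The grouping step is what distinguishes this from BERT's original token-level masking: a word is either fully masked or fully visible.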

Our Method
We first use two data augmentation methods to obtain an augmented dataset five times the original size. We use the augmented data to re-train BERT and RoBERTa with the whole word masking technique and fine-tune the BERT, RoBERTa, and ELECTRA models. We test several span post-processing methods and then propose an ensemble method with trainable parameters for final text span selection. The pipeline we used in this competition is illustrated in Figure 1.

Problem Statement
In sub-task 1, we focus on selecting the correct text span as knowledge from a document. For each example, the model is given a conversational context $C$ and an associated document divided into consecutive spans $K = \{K_1, ..., K_M\}$. The model learns to select a document span $K_i$ for the response with probability $P(K_i|K, C; \Theta)$, where $\Theta$ denotes the model's parameters. Specifically, our model adopts the BERT-QA (Chadha and Sood, 2019) method and predicts the start and end positions of a span; if the predicted positions are not the boundaries of an existing span, we use post-processing methods to move them to the nearest $K_i$. The selected span $K_i$ is then used in sub-task 2 to generate a response. The model structure is shown in Figure 2. The input of the model is the sum of the positional/segment/word embeddings of the dialogue and the document; the output is a document span.
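The span decoding step of BERT-QA can be sketched as follows, assuming the model has already produced per-token start and end logits (the exhaustive search and names are illustrative, not our exact implementation):

```python
def best_span(start_logits, end_logits, max_len=90):
    """Pick the (start, end) pair maximising start_logits[s] + end_logits[e],
    subject to s <= e and a span length of at most max_len tokens."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # only consider ends at or after the start, within max_len
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best
```

The predicted pair is then mapped back to character offsets in the document and, if needed, adjusted by the post-processing described below.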

Data augmentation
The statistics of the Doc2Dial dataset are shown in Table 1. The final test set contains an unseen domain that is not included in the training set. Besides the final-test page, the organizers provide a dev-test page that uses a small set for additional testing. We use back-translation and synonym substitution as data augmentation methods. We adopt the Google translation service 1 to translate the English text into other languages (such as Spanish/German/Japanese/French) and then back-translate them into English 2 . In this way, we obtain 5-times document+dialogue data to pre-train the PLMs. We then pair the 5-times dialogue data with the documents translated from the different languages, which yields 25-times data for fine-tuning.
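The augmentation loop can be sketched as below; `translate` stands in for the Google translation service and `synonym_sub` for the WordNet fallback, both hypothetical placeholders rather than real API calls:

```python
def augment(sentence, translate, pivots=("es", "de", "ja", "fr"),
            synonym_sub=None):
    """Back-translate `sentence` through each pivot language.

    translate(text, src, tgt) is a placeholder for the translation
    service; if a round trip returns the sentence unchanged and a
    synonym_sub function is given, fall back to synonym substitution
    to still gain diversity."""
    out = [sentence]  # keep the original
    for lang in pivots:
        back = translate(translate(sentence, "en", lang), lang, "en")
        if back == sentence and synonym_sub is not None:
            back = synonym_sub(sentence)
        out.append(back)
    return out  # original + one variant per pivot = 5x data
```

One original plus four pivot languages gives the 5-times data mentioned above; pairing each dialogue variant with each document variant then yields the 25-times fine-tuning data.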

Pre-training and Fine-tuning
We use the augmented data to pre-train two models: BERT and RoBERTa. We follow the Masked Language Model method with the whole word masking technique. We do not pre-train the ELECTRA model because we hope our ensemble method can leverage the prediction results from both RoBERTa and ELECTRA to achieve good performance on both seen and unseen domains. We pre-train RoBERTa on the augmented data to obtain good performance on the seen domains. Meanwhile, we hope that ELECTRA can give good predictions on the unseen domain: the unseen domain in the final-test set requires the general knowledge packed in the parameters of the pre-trained model, and further pre-training ELECTRA on task data would lose this knowledge. When fine-tuning these models (BERT, RoBERTa, and ELECTRA), the model structure and training objective are the same as the common method used in the span-extraction Reading Comprehension task. The training objective is defined as the sum of the negative log probabilities of the true start and end positions under the predicted distributions, averaged over all $N$ examples:

$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N}\left(\log P_{start}(S_n^{start}) + \log P_{end}(S_n^{end})\right)$$

where $S_n^{start}$ and $S_n^{end}$ are the ground-truth span start and end positions of the $n$-th example.
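For concreteness, the objective can be computed as in the plain-Python sketch below (equivalent to cross-entropy over the start and end position distributions; the actual training uses the framework's built-in loss):

```python
import math

def span_loss(start_logits, end_logits, gold_starts, gold_ends):
    """Average negative log-likelihood of the gold start/end positions.

    start_logits/end_logits: one list of logits per example.
    gold_starts/gold_ends: the ground-truth positions per example."""
    def log_softmax(logits, idx):
        # numerically stable log-softmax at position idx
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        return logits[idx] - lse

    total = 0.0
    for sl, el, gs, ge in zip(start_logits, end_logits, gold_starts, gold_ends):
        total -= log_softmax(sl, gs) + log_softmax(el, ge)
    return total / len(gold_starts)
```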

Post Processing
Since the document is divided into consecutive spans and the task requires identifying a single span, we propose two post-processing methods to fix wrong predictions. The goal of these methods is to turn a predicted incomplete span into a complete one. The first method expands the predicted start/end to the boundaries of one standard span when the predicted positions fall within it. The second moves the predicted start/end to the boundaries of the nearest span when the predicted positions cross two spans.
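The two rules can be sketched as follows; `spans` holds the inclusive position boundaries of the document's standard spans, and we interpret "nearest" as the span with the larger overlap with the prediction (one reasonable reading, an assumption of this sketch):

```python
def snap_to_span(pred_start, pred_end, spans):
    """Map a predicted (start, end) onto one standard span.

    spans: sorted list of (start, end) inclusive boundaries.
    Rule 1: if both endpoints fall inside one span, expand to it.
    Rule 2: if the prediction crosses spans, snap to the span with
            the largest overlap (our reading of "nearest")."""
    def containing(pos):
        for i, (s, e) in enumerate(spans):
            if s <= pos <= e:
                return i
        return None

    i, j = containing(pred_start), containing(pred_end)
    if i is not None and i == j:
        return spans[i]  # rule 1: expand inside one span

    def overlap(span):
        s, e = span
        return max(0, min(e, pred_end) - max(s, pred_start))
    return max(spans, key=overlap)  # rule 2
```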
2 When the back-translated sentence is identical to the original sentence, we employ synonym substitution with WordNet (https://wordnet.princeton.edu/) to increase diversity.

Ensemble Method
We propose a simple but effective ensemble method (Algorithm 1 shows the details) to utilize the advantages of different models. For each example, we compute the top $N$ span candidates from each model and sort them in descending order of model confidence. Each span is given a weight that is the reciprocal of its ranking number plus one. For example, the candidates from RoBERTa are $S^R_j$ $(j = 1, 2, ..., N)$, and the corresponding weight is $W^R_j = \frac{1}{j+1}$; similarly, $S^E_j$ and $W^E_j$ for ELECTRA. We then merge these candidates into a final candidate dictionary $S_i$ $(i = 1, 2, ..., T$, with $N \leq T \leq 2N)$, and the ensemble weight is $W_i = p \cdot W^R_i + (1-p) \cdot W^E_i$, where $p$ is a trainable parameter and $W^R_i$ and $W^E_i$ follow the definitions above. We then use a specific metric, such as F1 or EM, to learn the optimal $p^*$ on all examples in the validation set. At test time, we select one candidate as our final prediction using the learned weight 3 .

Table 2: Experimental results. "DA/FT/PT/PP" means "data augmentation/fine-tuned/pre-trained/post-processing", respectively.
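A sketch of the ensemble and the search for $p^*$ is given below (illustrative only; Algorithm 1 may differ in details such as tie-breaking, and the grid search stands in for whatever optimizer learns $p$):

```python
def ensemble(cands_a, cands_b, p):
    """cands_a/cands_b: span candidates from two models, sorted by
    confidence; rank j (1-based) contributes weight 1/(j+1).
    Final weight of a span S: p*W_a(S) + (1-p)*W_b(S)."""
    scores = {}
    for j, s in enumerate(cands_a, start=1):
        scores[s] = scores.get(s, 0.0) + p * (1.0 / (j + 1))
    for j, s in enumerate(cands_b, start=1):
        scores[s] = scores.get(s, 0.0) + (1.0 - p) * (1.0 / (j + 1))
    return max(scores, key=scores.get)

def fit_p(val_examples, metric, grid=21):
    """Grid-search p* on validation data.

    val_examples: (cands_a, cands_b, gold) triples.
    metric(prediction, gold) -> score, e.g. EM or F1."""
    best_p, best = 0.0, float("-inf")
    for k in range(grid):
        p = k / (grid - 1)
        score = sum(metric(ensemble(a, b, p), g) for a, b, g in val_examples)
        if score > best:
            best, best_p = score, p
    return best_p
```

With $p = 1$ the ensemble reduces to the first model's top-1 prediction, with $p = 0$ to the second model's; intermediate values trade off the two rankings.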

Models
Our implementations of BERT, RoBERTa, and ELECTRA are based on the public PyTorch implementations from Transformers 4 . All models are large-size. During pre-training, we follow the hyper-parameter settings of the original implementations. During fine-tuning, we truncate the dialogue context to 60 tokens and the maximum input length to 512 tokens. The maximum predicted span length is set to 90 words, and the candidate span size N is set to 20. We use EM as the metric in the ensemble method. On a single Tesla V100S GPU with 32GB memory, pre-training takes around 48 hours and fine-tuning around 24 hours for each model.

Experimental Results and Analysis
In this competition, each team has five submission opportunities on the final-test page 5 . Table 2 shows our experimental results on the dev-test set. Pre-training on task data further improves performance, and post-processing helps ELECTRA on both F1 and EM. Employing PT/FT/PP on RoBERTa yields 72.37 F1 and 60.61 EM. Finally, applying our ensemble method to the best-performing RoBERTa and ELECTRA models achieves 74.09 F1 and 63.13 EM on the dev-test set. This configuration also achieves our best F1 and EM on the final-test set, where the ensemble outperforms the best single model (RoBERTa) by more than 4% on both F1 and EM. For EM, the contributions rank, from largest to smallest: Ensemble > Pre-training > Data Augmentation > Post-processing.
The ensemble method combines a PLM (RoBERTa) that is pre-trained with augmented data and a PLM (ELECTRA) that is not. In this way, we can leverage the knowledge packed in the parameters of ELECTRA for the unseen domain of the final-test data. ELECTRA (FT/PP) achieves an EM of 55.65 on the final-test set and RoBERTa (PT/FT/PP) achieves an EM of 59.09, while the ensemble method increases the EM to 63.91. This indicates that the two models differ considerably in their span choices, and our ensemble method leverages this difference to achieve a better result.

Conclusion
We introduced our submission to the Doc2Dial Shared Task. In sub-task 1, our model is based on RoBERTa and ELECTRA. We propose a simple but effective ensemble method for knowledge selection in multi-turn dialogue. Our team SCIR-DT ranks 2nd on the final submission page. Apart from the methods introduced above, other methods could further improve the performance of our model. For example, Feng et al. (2020) showed that dialogue act information is useful for sub-task 1; noisy data such as empty responses in the dialogue data could be filtered out during training; and employing a machine reading comprehension dataset such as SQuAD (Rajpurkar et al., 2016) or a CQA dataset such as CoQA (Reddy et al., 2019) for pre-training and fine-tuning may also be helpful. However, due to time limitations, we did not try all of these methods during the competition. We hope these methods and experiences will be helpful for future contestants.