Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering

Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks, including utterance restoration, utterance insertion, and question discrimination, and jointly train the model to capture consistency and coherence among speech documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations with a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution. In addition, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. By this means, the training schemes can more effectively guide the generation model to predict more proper answers. Experimental results show that our model achieves state-of-the-art results on three SQA benchmarks.


Introduction
Building an intelligent spoken question answering (SQA) system has attracted considerable attention from both academia and industry. In recent years, many significant improvements have been achieved in the speech processing and natural language processing (NLP) communities, such as multi-modal speech emotion recognition (Beard et al., 2018; Sahu et al., 2019; Priyasad et al., 2020; Siriwardhana et al., 2020), spoken language understanding (Mesnil et al., 2014; Chen et al., 2016, 2018; Haghani et al., 2018), and spoken question answering (You et al., 2021a). Among these topics, SQA is an especially challenging task, as it requires machines to fully understand the semantic meaning in both speech and text data, and then provide the correct answer given a question and the corresponding speech documents.

* Equal contribution.
Automatic speech recognition (ASR) and text question answering (TQA) are two key components in building such an SQA system. The former module transforms speech sequences into text form, and the latter module, trained on noisy ASR transcriptions, utilizes NLP techniques to give a concrete answer. However, utilizing existing state-of-the-art SQA systems to retrieve answers still faces formidable challenges, such as ASR recognition errors. This is mainly because ASR systems often fail to recognize the speech correctly, leading to word errors (e.g., "Barcelona" to "bars alone").
To address these issues, most existing SQA methods are either text-based (Chuang et al., 2020) or fusion-based (You et al., 2021a,b). One line of research examines internal vector representations in both the speech and text domains, often using sub-word units for language modeling. Another line of work (You et al., 2021a,b) investigates the transfer learning problem of how to leverage a large amount of speech and text data to improve the performance of SQA. However, some critical challenges remain, such as robustness, generalization, and data efficiency.
Different from previous methods (Su and Fung, 2020; You et al., 2021b), we move beyond leveraging the dual nature of TQA and ASR to mitigate recognition errors. In this paper, we focus not only on extracting cross-modality information for joint spoken and textual understanding, but also on a training procedure that takes the most advantage of the given dataset. Inspired by recent advances in contrastive learning (Chen et al., 2020b; Khosla et al., 2020) and recent breakthroughs in pre-trained language models (Devlin et al., 2018; Liu et al., 2019; Rahman et al., 2020; Chen et al., 2020a) in the context of NLP, we propose a novel training framework for Spoken QA that integrates these two perspectives to improve spoken question answering performance. Our training framework contains two steps: (a) a self-supervised training stage, and (b) a contrastive training stage. During the self-supervised training stage, instead of building a complex spoken question answering model, we propose to learn a spoken question answering system based on pre-trained language models (PLMs) with several auxiliary self-supervised tasks. In particular, we introduce three self-supervised tasks, including utterance restoration, utterance insertion, and question discrimination, and jointly train the model with these auxiliary tasks in a multi-task setting. On the one hand, these auxiliary tasks enable the model to capture the sequential order within a given passage. On the other hand, they effectively learn cross-modality knowledge without any additional data or annotations to generate better representations for answer prediction.

Figure 1: Overall architecture of our model: (a) For the spoken QA part, we use VQ-Wav2Vec and a tokenizer to transfer speech signals and text into discrete tokens. A Temporal-Alignment Attention mechanism is introduced to match each text embedding with the corresponding speech features. Then, we use BERT to learn sequential information of utterances with the proposed self-supervised tasks. We generate the final answer distribution on both domains. At inference time, we use BERT only. (b) We incorporate contrastive learning strategies to train our SQA model in an auxiliary manner to improve model performance.
During the fine-tuning stage, along with the main QA loss, we incorporate the contrastive learning strategy into our framework in an auxiliary manner for the SQA tasks. Specifically, we use multiple augmentation strategies, including span deletion and span substitution, to develop the capability of learning noise-invariant utterance representations. In addition, we propose a novel attention mechanism, termed Temporal-Alignment Attention, to effectively learn cross-modal alignment between the speech and text embedding spaces. By this means, our proposed attention mechanism encourages the training process to pay more attention to semantic relevance, consistency, and coherence between speech and text in their contexts, providing better cross-modality representations for answer prediction. The overview of our framework is shown in Figure 1. We evaluate the proposed approach on three widely-used spoken question answering benchmark datasets: Spoken-SQuAD, Spoken-CoQA, and the 2018 Formosa Grand Challenge (FGC). Experimental results show that our proposed approach outperforms other state-of-the-art models when self-supervised training is applied first. Moreover, the evaluation results indicate that our learning scheme can also consistently bring further improvements to the performance of existing methods with contrastive learning.

Related Work
Spoken Question Answering. Spoken question answering (Su and Fung, 2020; Huang et al., 2021; You et al., 2020a, 2021a,b,c; Chen et al., 2021) is the task of generating meaningful and concrete answers in response to a series of questions from spoken documents. Typical spoken QA systems focus on integrating ASR and TQA in one pipeline. ASR systems are designed to transcribe audio recordings into written transcripts. However, current ASR systems cannot handle every spoken document, and generated ASR transcripts may contain highly noisy data, which severely influences the performance of QA systems on speech documents. A number of works have sought to mitigate this issue. Prior studies introduced sub-word unit strategies to alleviate the effects of speech recognition errors in SQA. SpeechBERT (Chuang et al., 2020) utilized a pre-trained BERT-based language model to effectively learn audio-text features, improving performance on noisy ASR transcripts. However, these works mainly focus on improving performance by exploiting internal information, without considering learning an explicit mapping between human-made transcripts and the corresponding ASR transcriptions, which is crucial to building spoken QA systems. Subsequent work adopted an adversarial learning strategy to alleviate this gap and achieved remarkable performance improvements. In contrast to previous works in SQA, which only consider speech representations or are confined to certain subtasks (e.g., spoken multi-choice question answering and spoken conversational question answering), we not only model the interactions between speech and text data, but also focus on capturing semantic similarity. In parallel, our proposed method is a unified framework that can be easily applied to a variety of downstream speech processing tasks.
Self-supervised Learning. Self-supervised learning (SSL) has become a promising solution for performance improvements by leveraging large amounts of unlabeled audio data, and substantial efforts have recently been dedicated to developing powerful SSL-based approaches in the machine learning community (Oord et al., 2018; You et al., 2018, 2019a,b, 2020b, 2021d; Chung et al., 2019, 2021; Pascual et al., 2019). Oord et al. (2018) designed the Contrastive Predictive Coding (CPC) framework to learn compact latent representations for predictions over future observations by combining autoregressive modeling and noise-contrastive estimation in an unsupervised manner. Later work further applied the learned generic speech representations to improve supervised ASR systems. Chung et al. (2019) and others have taken advantage of state-of-the-art self-supervised pre-trained language models from the NLP community. These methods mainly focus on learning from audio data only, and hardly exploit meaningful and relevant representations across both the speech and text domains. Most recently, Khurana et al. (2020) investigated how to leverage speech-translation retrieval tasks for self-supervised learning. In this study, we explore an effective way to utilize cross-modality information via a self-supervised training scheme for SQA tasks without additional large-scale unlabeled datasets; our proposed method yields strong accuracy without using any extra data or annotations.
Contrastive Representation Learning. In parallel to self-supervised learning, an emerging subfield has explored the prospect of contrastive representation learning in the machine learning community (Kharitonov et al., 2021; Manocha et al., 2021; Oord et al., 2018; He et al., 2020; Chen et al., 2020b; Hjelm et al., 2018; Tian et al., 2019; Henaff, 2020; Khurana et al., 2020). This is often best understood as follows: pull a positive and an anchor together in the embedding space, and push the anchor apart from many negatives. Thus, the choice of negatives can significantly determine the quality of the learned latent representations. Contrastive learning is a framework that learns representations by comparing the similarity between different views of the data. In computer vision, Chen et al. (2020c) demonstrated that an enlarged negative pool significantly enhances unsupervised representation learning. However, there have been few attempts to apply contrastive learning to downstream language processing tasks. Recently, prior work (Kharitonov et al., 2021) incorporated CPC with time-domain data augmentation strategies into a contrastive learning framework for speech recognition tasks. In contrast, we focus on learning interactions between the speech and text modalities for spoken question answering tasks, and also introduce a set of auxiliary tasks on top of the former self-supervised training scheme to improve representation learning.

Methods
In this section, we first formalize the spoken question answering task. We then introduce the key components of our method with self-supervised contrastive representation learning. Next, we describe the design of our proposed Temporal-Alignment Attention mechanism. Lastly, we discuss how to incorporate the contrastive loss into our self-supervised training scheme.

Task Formulation
Given a dataset D = {(Q_i, P_i, A_i)}_{i=1}^{N}, where Q_i denotes a question and P_i denotes a passage with an answer A_i. In this study, similar to the SQA setting in (Kuo et al., 2020), we focus on extraction-based SQA, which can be applied to other types of language tasks. We use Spoken-SQuAD, Spoken-CoQA, and FGC to validate the robustness and generalization of our proposed approach. In Spoken-SQuAD, Q_i and A_i are both single sentences in text form, and P_i consists of multiple sentences in spoken form. In FGC, Q_i, A_i, and P_i are all in spoken form. Different from Spoken-SQuAD and FGC, Spoken-CoQA is in a multi-turn conversational SQA setting, which is more challenging than a single-turn setting; moreover, it adopts Q_i in spoken form. The task is to learn an SQA model G(·, ·) from D so that G(Q_i, P_i) provides the most proper answer A_i to a given question Q_i.

Spoken question answering with PLMs.
Recent PLMs, such as BERT (Devlin et al., 2018) and ALBERT (Lan et al., 2020), learn meaningful language representations from large amounts of unstructured corpora, and have achieved superior performance on a wide range of downstream tasks in the domain of NLP. Following previous work, we consider building the SQA system with PLMs. We adopt BERT as the base model for a fair comparison. Similar to previous work, we concatenate the ASR token sequences of a passage and a question as input to our SQA system. Specifically, given a passage P = {p_1, p_2, ..., p_n} and a question Q = {q_1, q_2, ..., q_m}, we first concatenate all utterance sequences, which can be formulated as X = {[CLS], q_1, q_2, ..., q_m, [SEP], p_1, p_2, ..., p_n, [SEP]}, where [CLS] and [SEP] denote the begin token and the separator token of each concatenated token sequence, respectively. We then utilize the pre-trained BERT to extract hidden state features from the processed token sequences. Finally, we feed these representations to the following module, a feed-forward network followed by a softmax layer, to obtain the probability distribution over answer candidates given a textual passage-question pair. We use the cross-entropy loss as the question answering loss.
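As an illustration of the pipeline above, the following sketch shows the input concatenation and softmax-based span selection over start/end logits. The helper names are our own, and plain Python stands in for the actual BERT encoder; it is a minimal sketch, not the paper's implementation.

```python
import math

def build_input(question_tokens, passage_tokens):
    """Concatenate question and passage in the BERT QA format:
    [CLS] q_1 .. q_m [SEP] p_1 .. p_n [SEP]."""
    return ["[CLS]", *question_tokens, "[SEP]", *passage_tokens, "[SEP]"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_span(start_logits, end_logits):
    """Pick the most probable (start, end) answer span with start <= end,
    scoring each candidate span by p_start[i] * p_end[j]."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j in range(i, len(p_end)):
            if ps * p_end[j] > best_score:
                best, best_score = (i, j), ps * p_end[j]
    return best

x = build_input(["what", "is", "sqa"], ["spoken", "question", "answering"])
```

In the real model, the start/end logits would come from the feed-forward head over BERT's hidden states rather than being given directly.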

Self-supervised Training
We aim for an SQA model that can effectively make use of cross-modality knowledge with a limited amount of training data and produce better contextual representations for answer prediction. To this end, we design three auxiliary self-supervised tasks: utterance restoration, utterance insertion, and question discrimination. The objective of these auxiliary tasks is to capture the semantic relevance, coherence, and consistency between the speech and text domains. Figure 2 illustrates the three auxiliary self-supervised tasks. These tasks are jointly trained with the SQA model in a multi-task manner.
More training examples of self-supervised training can be found in Table 1 and Appendix Table 4.
Utterance Insertion. PLMs often suffer from limitations in capturing latent semantic and logical relationships at the discourse level: Next Sentence Prediction (NSP), the standard training objective of PLM-based approaches, handles semantic topic shift poorly because it does not model coherence. One key reason is that NSP fails to capture sufficient semantic coherence given an incomprehensible passage (Lan et al., 2020), which leads to performance degradation. Thus, learning the natural sequential relationship between consecutive utterances within a passage can significantly help the model understand the meaning of the passage.
To solve the above-mentioned problem, we design a more general self-supervised task in the spoken question answering context, termed utterance insertion. It enables the model to fully leverage the sequential relationships within a passage and improves its ability to calculate the semantic relevance between consecutive utterances. Specifically, we first extract k consecutive utterances from one passage. Then we insert an utterance randomly selected from another, topically unrelated passage. Hence, the resulting k + 1 utterances consist of k utterances from the original passage and one from a different corpus, and the goal is to predict the position of the inserted utterance given the k + 1 utterances. A special token [UI] is positioned before each utterance. The input thus consists of the k + 1 [UI]-prefixed utterances, where u_INS is the inserted utterance.
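A minimal sketch of how an utterance-insertion example might be constructed; the function name and the whitespace-tokenized utterance format are our own assumptions, not the paper's code.

```python
import random

def make_utterance_insertion_example(passage_utts, distractor_utt, rng=random):
    """Insert one utterance from an unrelated passage at a random position
    among k consecutive utterances; the label is the inserted position.
    Every utterance is prefixed with a [UI] marker token."""
    pos = rng.randrange(len(passage_utts) + 1)
    utts = passage_utts[:pos] + [distractor_utt] + passage_utts[pos:]
    tokens = []
    for u in utts:
        tokens.append("[UI]")
        tokens.extend(u.split())
    return tokens, pos
```

During training, the model would classify which of the k + 1 [UI] positions marks the inserted utterance.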
Utterance Restoration. One of the major tasks used to train PLMs is masked token prediction (MTP), which requires the model to recover masked tokens during the training stage. Although recent work (Liu et al., 2019; Lan et al., 2020; Devlin et al., 2018; Joshi et al., 2020) found that utilizing this auxiliary task can improve model performance, it only focuses on learning syntactic and semantic representations of words at the token level. However, spoken question answering is a more challenging task, which requires a deeper understanding of each utterance within a passage.
To explicitly model utterance-level interactions within a passage, we propose an utterance-level masking strategy, termed utterance restoration, to predict the utterance that causes inconsistency. Specifically, suppose that a context c = {u_1, u_2, ..., u_k} includes k consecutive utterances. We first randomly pick an utterance u_t, t ∈ [1, k], and then replace all tokens in u_t with a special token [MASK]. Similarly, a special token [UR] is positioned before each utterance. To adapt the task to BERT, the input to the BERT encoder is the sequence of [UR]-prefixed utterances, where u_MASK consists of only [MASK] tokens and has the same length as u_t.
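The utterance-level masking described above can be sketched as follows; again the helper name and string-based format are illustrative assumptions.

```python
def make_utterance_restoration_example(utterances, t):
    """Replace every token of utterance t with [MASK]; the masked block
    keeps the same length as u_t. [UR] marks the start of each utterance,
    and the label is the index of the masked utterance."""
    tokens = []
    for i, u in enumerate(utterances):
        words = u.split()
        tokens.append("[UR]")
        if i == t:
            tokens.extend(["[MASK]"] * len(words))
        else:
            tokens.extend(words)
    return tokens, t
```

The model is then trained to restore the masked utterance from the surrounding context.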
Audio-Text Input. Inspired by recent success in video question answering (Kim et al., 2020), we leverage cross-modality sequence modeling to generate an audio-text sequence as input for the question discrimination task. In this process, we utilize the BPE tokenizer to convert the ASR documents into a sequence of Text-Question and Text-Passage tokens, similar to PLMs (see Section 3.2). We utilize pre-trained VQ-Wav2Vec, trained on Librispeech-960 (Panayotov et al., 2015), to encode speech signals into a sequence of input tokens for the Speech-Question, since it outperforms conventional RNN/CNN models on sequence modeling.

Question Discrimination. Recent work (Kuo et al., 2020) has shown that learning cross-modality representations is essential for SQA tasks. Hence we design question discrimination to build semantic alignments between speech and text by incorporating cross-modality knowledge into our model. Unlike the original goal of SQA (i.e., finding the answer using a question and contextualized contexts in Section 3.2), we instead train the model to predict the proper text question using audio-text contexts. Specifically, we first randomly select k − 1 questions in textual form from other passages, and then mix them with the corresponding question Q_t. We can reformulate the candidate set as Q̃ = {Q_t^1, ..., Q_t^{k−1}, Q_t}. The goal of this task is to find the correct Text-Question given the Speech-Question and Text-Passage contexts.
The input follows the format [CLS] Speech-Question [SEP] Text-Question option [SEP] Passage [SEP], where Q_s denotes the appropriate question in spoken form.
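Constructing a question-discrimination candidate set might look like the following sketch (helper name and shuffling scheme are our own assumptions):

```python
import random

def make_question_discrimination_example(correct_q, distractor_qs, rng=random):
    """Mix the correct text question with k-1 questions sampled from other
    passages; the model must pick the correct one given the spoken
    question and the text passage. Returns the shuffled candidate list
    and the index of the correct question as the label."""
    candidates = list(distractor_qs) + [correct_q]
    rng.shuffle(candidates)
    return candidates, candidates.index(correct_q)
```

Each candidate would then be paired with the Speech-Question and Text-Passage in the input format above and scored by the model.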

Temporal-Alignment Attention
Our proposed Temporal-Alignment Attention strategy selectively leverages cross-modality knowledge for SQA. Given an ASR token U_i and its corresponding acoustic-level MFCC features F_i, the goal is to enhance the SQA model by learning a semantically meaningful alignment between the speech and text domains. To align speech and text embeddings, we use a simple fully-connected feed-forward layer. The speech embedding features r̄_i are then processed by self-attention to obtain speech-aligned features. In the attention formulation, W_i denotes learnable parameters, * denotes element-wise multiplication, and [·]_j is the j-th column of a matrix; r̄_i and Attention denote the acoustic-level embedding and the self-attention operation, respectively. Note that we set u_i of each special token (e.g., [CLS]) to 0.
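The exact Temporal-Alignment Attention equations are not reproduced here, so the sketch below shows only the generic scaled dot-product attention step it builds on, for a single query vector over a set of key/value vectors; it is an illustration of the building block, not the paper's formulation.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query: score each key,
    softmax the scores, and return the weighted sum of the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
```

In the alignment module, a text-token embedding would play the role of the query and the feed-forward-projected speech features the roles of keys and values.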

Contrastive Learning
Recent work suggests two main arguments: (1) deletion of some unnecessary words in an utterance may not affect its original semantic meaning; (2) if some necessary words (e.g., "not") are mistakenly deleted, the semantic meaning can change drastically. Nevertheless, injecting some noise (e.g., properly deleting some words) can improve the robustness of the model. Thus, in order to learn effective noise-invariant representations at the sentence level, we train our SQA model with a contrastive objective for performance improvement, in which we augment the training data with two sentence-level augmentation strategies, span deletion and span substitution. The augmented input examples are shown in Figure 3. More training examples of contrastive learning can be found in Table 1.
• Span Deletion: we replace deleted consecutive words of an utterance with a single special token [DEL] (e.g., we randomly delete 5 spans, each 5% of the length of the textual input sequence).
• Span Substitution: we randomly sample some words and replace them with synonyms to produce the augmented version (e.g., we randomly select 30% of the spans in the utterances and replace them with tokens that share similar semantic meanings).
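The two augmentation strategies above might be implemented as follows; the function names, the token-level granularity, and the synonym dictionary are illustrative assumptions rather than the paper's exact procedure.

```python
import random

def span_deletion(tokens, n_spans=5, span_ratio=0.05, rng=random):
    """Delete n_spans random spans (each roughly span_ratio of the
    sequence length) and replace each deleted span with one [DEL] token."""
    tokens = list(tokens)
    span_len = max(1, int(len(tokens) * span_ratio))
    for _ in range(n_spans):
        if len(tokens) <= span_len:
            break
        start = rng.randrange(len(tokens) - span_len)
        tokens[start:start + span_len] = ["[DEL]"]
    return tokens

def span_substitution(tokens, synonyms, ratio=0.3, rng=random):
    """Replace roughly `ratio` of the tokens with a sampled synonym,
    when one is available in the synonym dictionary."""
    return [rng.choice(synonyms[t]) if t in synonyms and rng.random() < ratio
            else t
            for t in tokens]
```

Both functions produce augmented views of an utterance that should keep (most of) its meaning, which is what the contrastive objective relies on.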
In this stage, we first extract the [CLS] token representation H ∈ R^{k×d} from the last layer of the PLM, where d = 768 is the dimension of each word vector. We create augmentations of the original utterances with the two sentence-level auxiliary tasks on top of Question Discrimination, and then encode the augmented data using the same PLM used in the SQA section (see Figure 1 (a)) to construct the encoded representation H_anchor ∈ R^{1×d}. Our contrastive learning scheme consists of the following components: (1) we consider the representation corresponding to the correct Q_t as the positive, and the others as negatives; (2) we use a dot-product operation to compute the similarity scores between the joint speech-text representations and the anchor representation; (3) we apply a softmax function to the measured similarity scores. We leverage both speech and text data for contrastive training.

Multi-Task Learning Setup. We optimize our model in two main stages: (1) self-supervised training; (2) contrastive learning. In the self-supervised training stage, we train our SQA model with the three auxiliary tasks to obtain a better local optimum. We use the binary cross-entropy loss in all proposed auxiliary tasks. The loss is computed by summing the SQA answer prediction loss and all three auxiliary SSL task losses with equal weights. In the contrastive learning stage, the loss is defined as a linear combination of the SQA answer prediction loss and the contrastive loss with equal weights.
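The dot-product-plus-softmax objective described above corresponds to an InfoNCE-style loss. A minimal sketch on plain Python lists follows (the actual model operates on PLM representations, and we omit any temperature scaling since the text does not specify one):

```python
import math

def contrastive_loss(anchor, candidates, positive_idx):
    """InfoNCE-style loss: dot-product similarity between the anchor and
    each candidate representation, softmax over the similarities, then
    the negative log-likelihood of the positive candidate."""
    sims = [sum(a * c for a, c in zip(anchor, cand)) for cand in candidates]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    return -math.log(exps[positive_idx] / sum(exps))
```

Minimizing this loss pulls the anchor toward the positive representation and pushes it away from the negatives, as described above.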

Experiments
In this section, we conduct experiments to compare our proposed method with various baselines and state-of-the-art approaches.

Datasets
We evaluate our approach on three benchmark datasets: Spoken-SQuAD, Spoken-CoQA, and FGC.
Spoken-SQuAD. Spoken-SQuAD is a large listening comprehension corpus, where the training and testing sets consist of 37k and 5.4k question-answer pairs, respectively. The word error rate (WER) is around 22.77% on the training set and around 22.73% on the testing set. The documents are in spoken form, while the questions and answers are in text form. The manual transcripts of Spoken-SQuAD are collected from the SQuAD benchmark dataset (Rajpurkar et al., 2016).
Spoken-CoQA. Spoken-CoQA is a large spoken conversational question answering (SCQA) corpus, where the training and testing sets consist of 40k and 3.8k question-answer pairs from 7 domains, respectively. The WER is around 18.7%. The questions and passages are in both text and spoken form, and the answers are in text form. The goal is to generate a time span in the spoken multi-turn dialogues, and then answer questions based on the given passage and conversations.
FGC. FGC is a Chinese spoken multi-choice question answering (MCQA) corpus spanning a variety of domains. The numbers of question-answer pairs in the training and testing sets are 40k and 3.8k, respectively. Each passage-question-choices (PQC) example is composed of 1 passage, 1 question, and 4 candidate answers, of which only one is correct. All passages, questions, and multiple choices are in spoken form.
Following the widely used setting in (Kuo et al., 2020), we apply the Kaldi toolkit to construct the ASR module. The WER is around 20.4%.

Implementation and Evaluation Setup
We use PyTorch to implement our model. We adopt BERT-base as our backbone encoder, which consists of 12 transformer layers. We set the maximum input sequence length and the hidden vector dimension to 512 and 768, respectively. k in Section 3 is set to 9. We train our model on 2x 2080Ti GPUs for 2-3 days with a batch size of 4 per GPU, using the Adam optimizer with an initial learning rate of 3 × 10^−5. For Spoken-CoQA, in order to utilize the conversation history, we concatenate the current question with the previous 2 rounds of questions and ground-truth answers. When training on FGC, we follow the standard multi-choice setting (Kuo et al., 2020), which takes questions, each candidate answer, and passages as inputs. We evaluate our model using Exact Match (EM) and F1 to measure the performance of SQA models on Spoken-CoQA and Spoken-SQuAD, following previous work (Kuo et al., 2020; Su and Fung, 2020). For FGC, we use accuracy to evaluate the model's response quality.

Results
We report quantitative results on the Spoken-SQuAD, Spoken-CoQA, and FGC datasets in Table 2. In our experiments, we consider three aspects to study the effectiveness of the key components of our method: (1) using only self-supervised learning strategies; (2) using only contrastive learning strategies; (3) training the model with Temporal-Alignment Attention. Based on these aspects, we explore how effective each key component is for SQA. We first evaluate whether the model with the three auxiliary tasks can generate a proper answer and how much improvement it can achieve over all evaluated models. For all datasets, our model significantly outperforms all evaluated methods on most of the metrics. Specifically, we observe that sequentially incorporating the three proposed strategies brings superior performance improvements in terms of F1 and EM scores. Table 2 compares the importance of the different auxiliary SSL tasks, showing that QD > UI > UR in terms of response quality. This suggests that the auxiliary tasks effectively aid the SQA model in learning more sequential information and cross-modality representations for answer prediction. We then compare our method with other methods in terms of the contrastive loss on the three datasets. In Table 2, we utilize the proposed contrastive learning with the speech-text input as the auxiliary task, which consistently brings additional performance improvements on all datasets. When we further explore the effectiveness of the two augmentation strategies, we see that the model achieves comparable performance using SD or SS alone, and that combining both enhances the capacity of the model to tackle many unseen sentence pairs. This indicates the importance of noise-invariant representations in boosting performance.
To validate the effectiveness of the proposed T-A Attention, we compare the models with and without it. The model with T-A Attention consistently shows remarkable performance improvements of 60.3%/73.2% (vs. 58.6%/71.1%) and 42.0%/55.6% (vs. 40.6%/54.1%) in terms of EM/F1 scores on Spoken-SQuAD and Spoken-CoQA, and 78.7% (vs. 77.0%) in terms of standard accuracy on FGC. Table 2 shows that our model achieves the best results of 62.5%/75.5% (vs. 58.6%/71.1%), 45.4%/58.3% (vs. 40.6%/54.1%), and 81.3% (vs. 77.0%) across the three datasets. This suggests that, by taking advantage of the proposed training scheme and T-A Attention, our model provides a more fine-grained understanding of spoken content to benefit SQA answer prediction.

Ablation Study
Effects of Word Error Rates. To study how word error rates (WERs) influence model performance, we experiment with BERT, our baseline model, under different WERs. We randomly split the three datasets into small-scale subsets of roughly equal training data size under different WERs for the ablation study. Then we compute the Frame-level F1 score (Chuang et al., 2020) to evaluate the robustness of our proposed method under different WERs in Figure 4. We find that our model consistently achieves better results compared to the evaluated baseline. In addition, we find that a higher WER leads to a consistent drop on all three spoken question answering tasks. This suggests that a lower WER brings gains in all SQA settings.
Effects of Hyperparameter Selection. Self-supervised training enables the SQA model to capture sequential dependencies between utterances along with semantic matching, and to maintain dialog coherence within a context. We explore the effect of different values of k, which determines the number of consecutive utterances in these auxiliary tasks. Figure 5 compares the performance of the model with different k. We find that increasing the value of k clearly improves model performance, but performance does not further increase after k = 9. We hypothesize two potential reasons: (1) if the utterance context is too small, the model cannot capture enough contextual information; (2) if the utterance context is too large, it introduces additional noise and does not benefit model performance. In our final models, we use k = 9 for self-supervised training.

Effects of T-A Attention.
We further evaluate the effectiveness of various attention mechanisms in Table 3. We define BERT as the base model. We observe that the model with the proposed T-A attention strategy achieves state-of-the-art performance on the three datasets. This clearly demonstrates that T-A attention can effectively reduce the discrepancy between the text and speech domains.

Conclusions
Spoken question answering requires fine-grained understanding of both speech and text data. To this end, we propose a novel training scheme for spoken question answering. By carefully designing several auxiliary tasks, we incorporate a self-supervised contrastive learning framework to capture consistency and coherence within speech documents and text corpora without any additional data. We further propose a novel Temporal-Alignment Attention strategy to align audio features and textual concepts by performing mutual attention over the two modalities. Our model achieves state-of-the-art performance on three SQA benchmark datasets. For future work, we will develop more effective auxiliary tasks to enhance the quality of answer prediction.

ASR Question
How does scholars divide the library?

Original ASR Content
The Vatican at the stella clyde prairie, more commonly called the Vatican Library or simply the fact, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical tax. It has 75,000 courtesies from throughout history, as well as 1.1 million printed books, which include some 8,500 king abdullah. The Vatican Library is a research library for history, lot, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications in research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. In March 2014, team the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. The Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. Scholars have traditionally divided the history of the library into five periods, pre ladder and ladder and having yon prevent a cannon vatican. The pre latter in period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, the summer very significant.

Utterance Insertion
The Vatican at the stella clyde prairie, more commonly called the Vatican Library or simply the fact, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical tax. It has 75,000 courtesies from throughout history, as well as 1.1 million printed books, which include some 8,500 king abdullah. The Vatican Library is a research library for history, lot, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications in research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. The highly prized memorabilia which included item spanning the many stages of jackson's courier came for more than thirty fans associates and family members who contacted julian factions to sell their gifts and mementos of the singer. In March 2014, team the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. The Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. Scholars have traditionally divided the history of the library into five periods, pre ladder and ladder and having yon prevent a cannon vatican. The pre latter in period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, the summer very significant.

Utterance Restoration
The Vatican at the stella clyde prairie, more commonly called the Vatican Library or simply the fact, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical tax. It has 75,000 courtesies from throughout history, as well as 1.1 million printed books, which include some 8,500 king abdullah.
[MASK], [MASK], [MASK], . . . , [MASK]. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. In March 2014, team the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. The Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. Scholars have traditionally divided the history of the library into five periods, pre ladder and ladder and having yon prevent a cannon vatican. The pre latter in period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, the summer very significant.