End-to-end Spoken Conversational Question Answering: Task, Dataset and Model

In spoken question answering, systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that humans seek or test their knowledge is via human conversations. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling systems to model complex dialogue flows given speech documents. In this task, our main objective is to build a system that deals with conversational questions based on audio recordings, and to explore the plausibility of providing systems with more cues from different modalities during information gathering. To this end, instead of directly adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, which encourages better alignment between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. The performance of existing state-of-the-art methods degrades significantly on our dataset, demonstrating the necessity of cross-modal information integration. Our experimental results show that our proposed method achieves superior performance in spoken conversational question answering tasks.


Introduction
Conversational question answering (CQA) has been studied extensively over the past few years within the natural language processing (NLP) communities (Zhu et al., 2018; Yang et al., 2019). Different from traditional question answering (QA) tasks, CQA aims to enable models to learn the representation of the context paragraph and multi-turn dialogues. Existing CQA methods (Huang et al., 2018a; Devlin et al., 2018; Xu et al., 2019; Gong et al., 2020) have achieved superior performance on several benchmark datasets, such as QuAC (Choi et al., 2018) and CoQA (Elgohary et al., 2018).

* Equal contribution.
Current CQA research mainly focuses on leveraging written text sources in which the answer can be extracted from a large document collection. However, humans communicate with each other via spontaneous speech (e.g., meetings, lectures, online conversations), which conveys rich information. Considering our multimodal experience, fine-grained representations of both audio recordings and text documents are of paramount importance. Thus, we learn to draw useful relations between modalities (speech and language), which enables us to form fine-grained multimodal representations for end-to-end speech-and-language learning problems in many real-world applications, such as voice assistants and chat robots.
In this paper, we propose a novel and challenging spoken conversational question answering task, SCQA. An overview pipeline of this task is shown in Figure 1. Collecting such a SCQA dataset is a non-trivial task. In contrast to current CQA tasks, we build our SCQA with three main goals: (1) SCQA is a multi-turn conversational spoken question answering task, which is more challenging than a text-only task; (2) existing CQA methods rely on a single modality (text) as the context source, yet plainly leveraging uni-modal information is undesirable for end-to-end speech-and-language learning problems, since the useful connections between speech and text are elusive; thus, employing data from the context of another modality (speech) allows us to form fine-grained multimodal representations for downstream speech-and-language tasks; and (3) since speech features are based on regions and do not correspond to the actual words, the semantic inconsistencies between the two domains constitute a semantic gap, which must be resolved by the downstream systems themselves.

Q2: Who did she live with? A2: with her mommy and 5 sisters R2: with her mommy and 5 other sisters
ASR-Q2: Who did she live with? A2: with her mommy and 5 sisters R2: with her mommy and 5 other sisters
Q3: What color were her sisters? A3: orange and white R3: her sisters were all orange with beautiful white tiger stripes
ASR-Q3: What color were her sisters? A3: orange and white R3: her sisters were all orange with beautiful white tiger stripes
Table 1: An example from Spoken-CoQA. We can observe large misalignment between the manual transcripts and the corresponding ASR transcripts. Note that the misalignment is in bold font and the example is an extreme case. For more dataset information, please see Section 5 and Appendix Section "More Information about Spoken-CoQA".
In order to provide a strong baseline for this challenging multi-modal spoken conversational question answering task, we first present a novel knowledge distillation (KD) method for the proposed SCQA task. Our intuition is that speech utterances and text contents share a dual-nature property, and we can take advantage of this property to learn the correspondences between the two forms. Specifically, we enroll multi-modal knowledge into the teacher model, and then guide the student (trained only on noisy speech documents) to boost network performance. Moreover, considering that the semantics of the speech features and the textual representations are usually inconsistent, we introduce a novel mechanism, termed Dual Attention, to encourage fine-grained alignment between audio and text and close the cross-modal semantic gap between speech and language. One example of the cross-modal gap is shown in Table 1. The experimental results show that our proposed DDNet achieves remarkable performance gains in the SCQA task. To the best of our knowledge, ours is the first work on the spoken conversational question answering task.
Our main contributions are as follows: • We propose the Spoken Conversational Question Answering task (SCQA), and compile the Spoken-CoQA dataset for machine comprehension of spoken question-answering style conversations. To the best of our knowledge, our Spoken-CoQA is the first spoken conversational question answering dataset.
• We develop a novel end-to-end method based on data distillation to learn from both the speech and language domains. Specifically, we use the model trained on clean texts as well as recordings to guide the model trained on noisy speech transcriptions. Moreover, we propose a novel Dual Attention mechanism to align the speech features and textual representations in each domain.
• We demonstrate that, by applying our proposed DDNet to several previous baselines, we can obtain considerable performance gains on our proposed Spoken-CoQA dataset.

Passage (manual transcript): Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters...
Passage (ASR transcript): Once upon a time in a bar near farm house, there lived a little like captain named cotton. How to live tied up in a nice warm place above the bar and we're all of the farmers horses slapped. But happened was not alone in her little home above the bar in now. She shared her hey bed with her mommy and five other sisters...
Figure 2: An illustration of the architecture of DDNet. In the training stage, we adopt the teacher-student paradigm to enable the student model (trained only on speech documents) to achieve good performance. At test time, we use only the student model for inference.

Related Work
Text Question Answering. In recent years, the natural language processing research community has devoted substantial efforts to text question answering tasks (Huang et al., 2018a; Zhu et al., 2018; Xu et al., 2019; Gong et al., 2020; Chen et al., 2021a). Within the growing body of work on machine reading comprehension, an important sub-task of text question answering, two signature attributes have emerged: the availability of large benchmark datasets (Choi et al., 2018; Elgohary et al., 2018; Reddy et al., 2019) and pre-trained language models (Devlin et al., 2018; Lan et al., 2020). However, these existing works typically focus on modeling the complicated context dependency in text form. In contrast, we focus on enabling the machine to build the capability of language recognition and dialogue modeling in both speech and text domains.
Spoken Question Answering. In parallel to the recent works in natural language processing (Huang et al., 2018a; Zhu et al., 2018), these trends have also been pronounced in the speech field (Haghani et al., 2018; Lugosch et al., 2019; Palogiannidi et al., 2020; You et al., 2019a,b,c,d, 2020a; Chen et al., 2021b; Xu et al., 2021; Su et al., 2021), where spoken question answering (SQA), an extended form of QA, has explored the prospect of machine comprehension in spoken form. Previous work on SQA typically includes two separate modules: automatic speech recognition (ASR) and text question answering. It involves transferring spoken content to ASR transcriptions, and then employing NLP techniques to handle speech tasks.
Existing methods (Tseng et al., 2016; Serdyuk et al., 2018; Su and Fung, 2020) focus on optimizing each module in a two-stage manner, where errors in the ASR module result in severe performance loss. Prior work proved that utilizing clean texts can help a model trained on ASR transcriptions boost performance via domain adaptation. Chuang et al. (2019) cascaded BERT-based models into a unified model, and then trained it jointly on audio and text. However, existing SQA methods aim at solving a single question given the related passage, without building and maintaining the connections between different questions in human conversations. In addition, we compare our Spoken-CoQA with existing SQA datasets (see Table 2). Unlike existing SQA datasets, Spoken-CoQA is a multi-turn conversational SQA dataset, which is more challenging than single-turn benchmarks.

Dataset | Conversational | Spoken | Answer Type
TOEFL (Tseng et al., 2016) | × | √ | Multi-choice
S-SQuAD | × | √ | Spans
ODSQA | × | √ | Spans
Table 2: Comparison of Spoken-CoQA with existing SQA datasets.

Knowledge Distillation. Hinton et al. (2015) introduced the idea of Knowledge Distillation (KD) in a teacher-student scenario: the knowledge of one model (a massive or teacher model) can be distilled into another (a small or student model). Previous work has shown that KD can significantly boost prediction accuracy in natural language processing and speech processing (Kim and Rush, 2016; Hu et al., 2018; Huang et al., 2018b; Hahn and Choi, 2019; Liu et al., 2021a,b; Cheng et al., 2016a,b; Cheng and You, 2016; You et al., 2019a, 2020b, 2021e, 2022b; Lyu et al., 2018, 2019; Guha et al., 2020; Yang et al., 2020; Ma et al., 2021a,b), while adopting KD-based methods for SQA tasks has been less explored. In this work, our goal is to handle SCQA tasks. More importantly, we focus on a core property of speech and text: can spoken conversational dialogues further assist the model in boosting performance?
Finally, we incorporate the knowledge distillation framework to distill reliable dialogue flow from the spoken contexts, and utilize the learned predictions to guide the student model to train well on the noisy input data.

Task Definition
In this section, we propose the novel SCQA task and collect a Spoken-CoQA (S-CoQA) dataset, which uses the spoken form of multi-turn dialogues and spoken documents to answer questions in multi-turn conversations. Given a spoken document D^s, we use D^t to denote the clean original text and D^a to denote the ASR-transcribed document. We also have Q^a_{1:L} = {q^a_1, q^a_2, ..., q^a_L}, the collection of L-turn ASR-transcribed spoken questions Q^s_{1:L}, as well as A^t_{1:L} = {a^t_1, a^t_2, ..., a^t_L}, the corresponding answers to the questions in clean texts. The objective of the SCQA task is then to generate the answer a^t_L for question q^a_L, given the document D^a, the multi-turn history questions Q^a_{1:L-1} = {q^a_1, q^a_2, ..., q^a_{L-1}}, and the reference answers A^t_{1:L-1} = {a^t_1, a^t_2, ..., a^t_{L-1}}. In other words, our task in the testing phase can be formulated as

a^t_L = Student(D^a, Q^a_{1:L}, A^t_{1:L-1}).    (1)

Please note that, in order to improve the performance, in the training phase we make use of auxiliary information, namely the clean texts of the document D^t and the dialogue questions Q^t_{1:L} = {q^t_1, q^t_2, ..., q^t_L}, to guide the training of the student model. As a result, the training process can be formulated as

a^t_L = Student(D^a, Q^a_{1:L}, A^t_{1:L-1}; D^t, Q^t_{1:L}).    (2)

However, in the inference stage, this additional information (D^t and Q^t_{1:L}) is not needed.
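The train/test input contract above can be made explicit in code. The following sketch is illustrative only (the class and field names are not from any released codebase); it simply shows which fields are available at training time versus inference time.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SCQAExample:
    """One L-turn SCQA instance (field names are illustrative)."""
    asr_document: str                 # D^a: ASR transcript of the spoken document
    asr_questions: List[str]          # Q^a_{1:L}: ASR transcripts of the spoken questions
    history_answers: List[str]        # A^t_{1:L-1}: clean-text reference answers
    # Auxiliary fields, available only when training the student with a teacher:
    clean_document: Optional[str] = None         # D^t
    clean_questions: Optional[List[str]] = None  # Q^t_{1:L}

def inference_inputs(ex: SCQAExample):
    """At test time, only the ASR-side fields may be used."""
    return ex.asr_document, ex.asr_questions, ex.history_answers
```

The clean-text fields default to None, matching the constraint that D^t and Q^t_{1:L} are unavailable at inference.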

DDNet
In this section, we propose DDNet to deal with the SCQA task, which is illustrated in Figure 2. We first describe the embedding generation process for both audio and text data. Next, we propose Dual Attention to fuse the speech and textual modalities. After that, we present the major components of the DDNet module. Finally, we describe a simple yet effective distillation strategy in the proposed DDNet to learn enriched representations in the speech-text domain comprehensively.

Embedding
Given spoken words S = {s_1, s_2, ..., s_m} and corresponding clean text words T = {t_1, t_2, ..., t_n}, we utilize Speech-BERT and Text-BERT to generate the speech and text feature embeddings, respectively. Concretely, for speech input, we first use vq-wav2vec (Baevski et al., 2019) to transfer speech signals into a series of tokens, which is the standard tokenization procedure in speech-related tasks. Next, we use Speech-BERT (Chuang et al., 2019), a variant of BERT-based models retrained on our Spoken-CoQA dataset, to process the speech sequences for training. The text contents are embedded into a sequence of vectors via our text encoder, Text-BERT, which has the same architecture as BERT-base (Devlin et al., 2018).

Dual Attention
Dual Attention (DA) is proposed to optimize the alignment between the speech and language domains by capturing useful information from both. In particular, we first use cross attention to align speech and text representations in the initial stage. After that, we utilize contextualized attention to further align the cross-modal representations at the contextualized word level. Finally, we employ the self-attention mechanism to form fine-grained audio-text representations.
Cross Attention. Inspired by ViLBERT (Lu et al., 2019), we apply the co-attention transformer layer, a variant of Self-Attention (Vaswani et al., 2017), as the Cross Attention module for fusing the speech and text embeddings. The Cross Attention is implemented by the standard Attention module involving Multi-Head Attention (MHA) and Feed-Forward Network (FFN) (Vaswani et al., 2017) as below:

CrossAtt(F_1, F_2) = FFN(MHA(Q = F_1, K = F_2, V = F_2)),

where Q, K, V denote the query, key, and value matrices, and F_1, F_2 denote features from different modalities, respectively. The co-attention module then uses the Cross Attention function to compute the cross attention-pooled features, by querying one modality using the query vector of the other modality:

Ê^cross_s = CrossAtt(E_s, E_t),  Ê^cross_t = CrossAtt(E_t, E_s),

where Ê^cross_s ∈ R^{n×d}, Ê^cross_t ∈ R^{n×d}, and d is the dimension of the feature vectors.
Contextualized Attention (CA). After obtaining the speech-aware representation Ê^cross_s and the text-aware representation Ê^cross_t, our next goal is to construct more robust contextualized cross-modal representations by integrating features from both modalities. The fused features are computed as follows:

E^CA = W_1 ⊙ Ê^cross_s + W_2 ⊙ Ê^cross_t,

where W_1, W_2 ∈ R^{n×d} are trainable weights and ⊙ denotes element-wise multiplication.

Self-Attention.
To build a robust SCQA system, special attention needs to be paid to the sequential order of the dialogue, since changes in utterance order may cause severely low-quality and incoherent corpora. As a result, to capture long-range dependencies such as co-references for the downstream speech-and-language tasks, similar to (Li et al., 2016; Zhu et al., 2018), we introduce a self-attention layer to obtain the final Dual Attention (DA) representations E^DA.
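The cross-attention step at the heart of Dual Attention can be sketched with a minimal, dependency-free single-head scaled dot-product attention. This is a simplification: the actual module uses multi-head attention with learned projections and feed-forward sub-layers, which are omitted here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention; Q is n x d, K and V are m x d."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][i] for j in range(len(V)))
                    for i in range(len(V[0]))])
    return out

def cross_attention(F1, F2):
    """Query one modality (F1) against the other (F2), as in CrossAtt(F1, F2)."""
    return attention(F1, F2, F2)
```

Dual Attention applies this in both directions, cross_attention(E_s, E_t) and cross_attention(E_t, E_s), before the contextualized-attention and self-attention stages.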

Key Components
The framework of our SCQA module is similar to recent works (Zhu et al., 2018; Huang et al., 2017), and is divided into three key components: Encoding Layer, Attention Layer, and Output Layer.
Encoding Layer. The documents and conversations (questions and answers) are first converted into the corresponding feature embeddings (i.e., character embeddings, word embeddings, and contextual embeddings). The output contextual embeddings are then concatenated with the aligned cross-modal embedding E^DA to form the encoded input features.
Attention Layer. We compute the attention on the context representations of the documents and questions, and extensively exploit the correlations between them. Note that we adopt the default attention layers of the four baseline models.
Output Layer. After obtaining the attention-pooled representations, the Output Layer computes the probability distributions of the start and end indices within the entire document and predicts an answer to the current question:

p^{st} = softmax(W_{st} X),  p^{ed} = softmax(W_{ed} X),

where X denotes the representation of the input document D and question Q_L, and "st", "ed" denote the start and end positions.
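The start/end prediction in the Output Layer can be sketched as follows. The linear scoring vectors w_st and w_ed are hypothetical stand-ins for the baseline-specific learned heads, and the brute-force span search is a simplification of the usual constrained decoding.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def predict_span(token_reprs, w_st, w_ed):
    """Score every token position for start/end and return the best valid span.

    token_reprs: list of d-dim vectors (the attention-pooled representations X);
    w_st, w_ed: d-dim scoring vectors (illustrative stand-ins for learned heads).
    """
    p_st = softmax([sum(w * x for w, x in zip(w_st, tok)) for tok in token_reprs])
    p_ed = softmax([sum(w * x for w, x in zip(w_ed, tok)) for tok in token_reprs])
    best, best_p = (0, 0), -1.0
    for i in range(len(token_reprs)):
        for j in range(i, len(token_reprs)):  # enforce start <= end
            if p_st[i] * p_ed[j] > best_p:
                best, best_p = (i, j), p_st[i] * p_ed[j]
    return best
```

The returned (start, end) pair indexes the answer span within the ASR transcript of the document.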

Knowledge Distillation
In previous speech-language models, the only guidance is the standard training objective measuring the difference between the prediction and the reference answer. Instead, we additionally obtain soft prediction labels from our teacher model, and use them to guide the student model to learn contextual features in our SCQA task. Concretely, we set the model trained on the speech documents and the clean text corpus as the teacher model, and the model trained on the ASR transcripts as the student model, respectively. Thus, the student trained on low-quality data learns to absorb the knowledge that the teacher has discovered. Given z^S and z^T as the prediction vectors of the student and teacher models, the objective is defined as

L = (1 − α) · L_CE(z^S, y) + α · τ^2 · KL(p_τ(z^T) || p_τ(z^S)),

where KL(·) denotes the Kullback-Leibler divergence, L_CE is the cross-entropy loss against the ground truth y, p_τ(·) is the softmax function with temperature τ, and α is a balancing factor.
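The distillation objective can be sketched in plain Python. The τ² scaling on the KL term is the usual correction from Hinton et al. (2015); the defaults mirror the paper's settings (α = 0.9, τ = 2), and all names are illustrative.

```python
import math

def softmax_t(logits, tau=1.0):
    """Softmax with temperature tau (numerically stable)."""
    m = max(logits)
    es = [math.exp((x - m) / tau) for x in logits]
    s = sum(es)
    return [e / s for e in es]

def kd_loss(z_student, z_teacher, gold_index, alpha=0.9, tau=2.0):
    """(1 - alpha) * CE(student, gold) + alpha * tau^2 * KL(teacher || student)."""
    p_s = softmax_t(z_student)            # hard-label term at temperature 1
    ce = -math.log(p_s[gold_index])
    pt = softmax_t(z_teacher, tau)        # softened teacher distribution
    ps = softmax_t(z_student, tau)        # softened student distribution
    kl = sum(t * math.log(t / s) for t, s in zip(pt, ps))
    return (1 - alpha) * ce + alpha * tau ** 2 * kl
```

When the student matches the teacher exactly, the KL term vanishes and only the scaled cross-entropy remains; any disagreement with the teacher adds a positive penalty.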

Experiments and Results
In this section, we first describe the collection and filtering process of our proposed Spoken-CoQA dataset in detail. Next, we introduce several state-of-the-art language models as our baselines, and then evaluate the robustness of these models on our proposed Spoken-CoQA dataset. Finally, we provide a thorough analysis of the different components of our method. Note that we use the default settings in all evaluated methods.
Data Collection. We detail the procedure to build Spoken-CoQA as follows. First, we select the conversational question-answering dataset CoQA (Reddy et al., 2019) as our basis, since it is one of the largest public CQA datasets. CoQA contains around 8k stories (documents) and over 120k questions with answers. The average dialogue length of CoQA is about 15 turns, and the answers are in free-form text. In CoQA, the training set and the development set contain 7,199 and 500 conversations over the given stories, respectively. Since CoQA does not publicly release its test set, we follow the widely used setting in the spoken question answering task and divide the Spoken-CoQA dataset into a train and a test set: we use the CoQA training set as the reference text of our training set and the CoQA development set as our test set.
Next, we employ the Google text-to-speech system to transform the questions and documents in CoQA into spoken form, and adopt CMU Sphinx to transcribe the processed spoken content into ASR transcriptions. In doing so, we collect more than 40G of audio data, with a duration of around 300 hours. The ASR transcription has a kappa score of 0.738 and a Word Error Rate (WER) of 15.9%, which can be considered sufficiently good since it is below the accuracy threshold of 30% WER (Gaur et al., 2016). For the test set, we invite 5 native English speakers to read the sentences of the documents and questions. The sentences of a single document are assigned to a single speaker for consistency, while the questions in one example may have different speakers. All speech files are sampled at 16 kHz, following the common approach in the speech community. We provide an example of our Spoken-CoQA dataset in Table 1 and Fig. 5.
Data Filtering. In our SCQA task, the model predicts the start and end positions of answers in the ASR transcriptions. As a result, during data construction, it is necessary to perform data filtering by eliminating question-answer pairs whose answer spans do not exist in the noisy ASR transcriptions, following conventional settings in prior work. In our approach, an ASR question is removed if the ground-truth answer does not exist in the ASR passage. Contextual questions related to removed ones require handling as well: in the case of coreference resolution, we change the corresponding coreference; in the case of coreference inference, if a question strongly depends on a previous one that has already been discarded, it is also removed. After data filtering, our Spoken-CoQA dataset contains 4k conversations in the training set and 380 conversations in the test set. The dataset covers 5 domains, and we show the domain distribution in Table 3.
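The filtering rules above can be sketched as a single pass over each conversation. This is a simplification: `depends_on_previous` is a hypothetical predicate standing in for the manual coreference-inference check, and real answer matching would operate on normalized token spans rather than raw substrings.

```python
def filter_conversation(qa_pairs, asr_passage, depends_on_previous):
    """Keep only QA pairs whose gold answer occurs in the ASR passage.

    qa_pairs: list of (question, answer) tuples in dialogue order.
    depends_on_previous: predicate(turn_index) -> bool, a stand-in for the
    manual check of whether a question strongly depends on a removed turn.
    """
    kept, prev_removed = [], False
    for i, (q, a) in enumerate(qa_pairs):
        # Rule 1: drop the pair if the answer span is absent from the ASR text.
        if a not in asr_passage:
            prev_removed = True
            continue
        # Rule 2: drop questions that strongly depend on a removed turn.
        if prev_removed and depends_on_previous(i):
            prev_removed = True
            continue
        kept.append((q, a))
        prev_removed = False
    return kept
```

Cascading removals fall out naturally: a dependent question after a removed turn is itself removed, which can in turn trigger removal of its successors.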
Baselines. Our DDNet is able to utilize a variety of backbone networks for SCQA tasks. We choose several state-of-the-art language models (FlowQA (Huang et al., 2018a), SDNet (Zhu et al., 2018), BERT-base (Devlin et al., 2018), ALBERT (Lan et al., 2020)) as our backbone baselines. We also compare our proposed DDNet with several state-of-the-art SQA methods (Serdyuk et al., 2018; Kuo et al., 2020). To use the teacher-student architecture in our models, we first train the baselines on the CoQA training set as teachers, and evaluate them on the CoQA dev set and the Spoken-CoQA test set. We then train the baselines on the Spoken-CoQA training set as students, and evaluate them on the CoQA dev set and the Spoken-CoQA test set. We provide quantitative results in Table 4.
Experiment Settings. We use the official BERT (Devlin et al., 2018) and ALBERT (Lan et al., 2020) as our textual embedding modules. We use BERT-base (Devlin et al., 2018) and ALBERT-base (Lan et al., 2020), which both include 12 transformer encoders with a hidden size of 768 for each word vector. BERT and ALBERT both utilize BPE as the tokenizer, but FlowQA and SDNet use spaCy (Honnibal and Montani, 2017) for tokenization. When a spaCy token corresponds to more than one BPE sub-token, we average the BERT embeddings of these BPE sub-tokens as the final embedding of the token. For fair comparison, we use the standard implementations and hyper-parameters of the four baselines for training. The balancing factor α is set to 0.9, and the temperature τ is set to 2. We train all models on 4 24GB RTX GPUs, with a batch size of 8 on each GPU. For evaluation, we use three metrics: Exact Match (EM), F1 score, and Audio Overlapping Score (AOS), to compare model performance comprehensively. Please note that the metric numbers of the baselines may differ from those on the CoQA leaderboard as we use our own implementations. Note that we only utilize the student network for inference.
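For reference, EM and F1 follow the standard extractive-QA definitions (exact string match and token-level overlap, respectively). A minimal sketch, omitting the usual answer normalization (lowercasing, punctuation and article stripping):

```python
from collections import Counter

def exact_match(prediction, ground_truth):
    """1.0 if the predicted answer string matches the gold answer exactly."""
    return float(prediction == ground_truth)

def f1_score(prediction, ground_truth):
    """Token-level F1 between the predicted and gold answer strings."""
    pred_toks, gold_toks = prediction.split(), ground_truth.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

AOS additionally measures the temporal overlap between the predicted and gold audio segments, and is not shown here.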
Results. We compare several teacher-student pairs on the CoQA and Spoken-CoQA datasets, and the quantitative results are shown in Table 4. We observe that the average F1 score is 77.6% when training on CoQA (text documents) and testing on the CoQA dev set. However, when training on Spoken-CoQA (ASR transcriptions) and testing on the Spoken-CoQA test set, the average F1 score drops significantly to 49.3%. For FlowQA, the performance even drops by 40.4 points in F1. This corroborates the importance of mitigating ASR errors. Table 5 compares our approach DDNet to all previous results. As shown in the table, our distillation models achieve strong performance, and incorporating the DA mechanism further improves the results considerably. Our DDNet with BERT-base as the backbone achieves similar or better results compared to all state-of-the-art methods, and we observe that using the larger encoder ALBERT-base brings further performance gains.

Ablation Study
We conduct ablation studies to show the effectiveness of several components of DDNet in this section and the appendix.
Multi-Modality Fusion Mechanism. To study the effect of different modality fusion mechanisms, we introduce a fusion mechanism, Con Fusion: we directly concatenate the two output embeddings from Speech-BERT and Text-BERT, and then pass the result to the encoding layer of the following SCQA module. In Table 8, we observe that the Dual Attention mechanism outperforms the four baselines with Con Fusion in terms of EM and F1 scores. We further investigate the effect of uni-modal input. Table 8 shows that text-only performs better than speech-only. One possible reason is that using only speech features brings additional noise. Note that speech-only (text-only) means that we only feed the speech (text) embedding from Speech-BERT (Text-BERT) to the encoding layer of the SCQA module.

Conclusions
In this paper, we have presented SCQA, a new spoken conversational question answering task for enabling human-machine communication. We have made the effort to collect a challenging dataset, Spoken-CoQA, including multi-turn conversations and passages in both text and speech form. We show that the performance of existing state-of-the-art models degrades significantly on our collected dataset, demonstrating the necessity of exploiting cross-modal information to achieve strong results. We provide initial solutions via knowledge distillation and the proposed dual attention mechanism, and achieve good results on Spoken-CoQA. Experimental results show that DDNet achieves substantial performance improvements in accuracy. In the future, we will further investigate different mechanisms for integrating speech and text content; our method also opens up possibilities for downstream spoken language tasks.

B Effects of Different Word Error Rates
We study how network performance changes when trained with different word error rates (WER), as shown in Figure 3. Specifically, we first split Spoken-SQuAD and Spoken-CoQA into smaller groups with different WERs. We then utilize the frame-level F1 score (Chuang et al., 2019) to validate the effectiveness of our proposed method on Spoken-CoQA.
In Figure 3, we find that the results for all evaluated networks on the two tasks are remarkably similar: all evaluated models suffer larger performance degradation at higher WER, and adopting the knowledge distillation strategy alleviates such issues. This phenomenon further demonstrates the importance of knowledge distillation in the case of high WER.

C Results on Human Recorded Speech
The results using BERT-base as the baseline are shown in Table 7. We train the model on the Spoken-CoQA training set and evaluate it on both machine-synthesized and human-recorded speech. As shown in Table 7, the average EM/F1/AOS scores using BERT fall from 40.6/54.1/48.0 to 39.4/53.0/46.8, respectively. In addition, similar trends can be observed for our proposed method. We hypothesise that the human-recorded speech introduces additional acoustic noise, which leads to the performance degradation.

D More Information about Spoken-CoQA
To perform qualitative analysis of speech features, we visualize the log-mel spectrogram features and the mel-frequency cepstral coefficient (MFCC) feature embeddings learned by DDNet in Figure 5. We can observe how the spectrogram features respond to different sentence examples. In this example, given the text document (ASR document), the conversation starts with the question Q_1 (ASR-Q_1), and the system is then required to answer Q_1 (ASR-Q_1) with A_1 based on a contiguous text span R_1. Compared to existing benchmark datasets, ASR transcripts (both documents and questions) are much more difficult for the machine to comprehend, reason over, and answer correctly.

E More Comparisons on Spoken-SQuAD
To verify that our proposed DDNet is not biased towards specific settings, we conduct a series of experiments on Spoken-SQuAD by training the teacher model on textual documents and the student model on the ASR transcripts. As shown in Table 5, compared with the performance on Spoken-CoQA, all baseline performances improve by a large margin, indicating that our proposed dataset is a more challenging task for current models. We verify that, in the (KD+DA) setting, the model consistently achieves significant performance boosts on all baselines.

F Broader Impact
In this section, we acknowledge that our work will not bring potential risks to society, considering that the data comes from open sources with no private or sensitive information. We also discuss some limitations of our work. First, we admit that using Google TTS for speech synthesis and CMU Sphinx for ASR may affect the distribution of errors compared with human-recorded speech. Second, we currently cover only the English language, but it would be interesting to see contributions for other languages follow. Finally, as our collection comes with reliable data, it should enable future work on analyzing biases in spoken conversational question answering.