A Framework for Automatic Generation of Spoken Question-Answering Data


Introduction
Spoken question answering (SQA) is the task of finding the answer to a question in a given spoken document. A typical approach to SQA is to use a cascade of automatic speech recognition (ASR) and question answering (QA) systems: the ASR system outputs transcriptions of the spoken documents, and the QA system searches these potentially erroneous transcriptions for the answers to the given questions. Additionally, end-to-end SQA systems that are trained jointly on audio and text have been proposed (Chuang et al., 2019; Lin et al., 2022). Compared to QA on text documents, SQA has been less explored, partly due to the limited number of spoken datasets.
Spoken SQuAD (Li et al., 2018), which was generated from SQuAD (Rajpurkar et al., 2016, 2018) using the Google TTS and CMU Sphinx (Walker et al., 2004) ASR systems, is one of the largest SQA datasets. Another example of a TTS-based spoken dataset is Spoken-CoQA (You et al., 2022), which was generated from CoQA (Reddy et al., 2019). Open-Domain Spoken Question Answering (ODSQA) (Lee et al., 2018) is a large SQA dataset that contains recordings of a machine reading comprehension dataset by native Chinese speakers.
In this paper, we propose a framework to automatically generate SQA data. Our framework contains (i) question generation (QG) to automatically obtain question-answer pairs from given text documents; (ii) text-to-speech (TTS) to convert text into spoken documents; and (iii) ASR to transcribe the spoken documents. For each module in our framework, we use state-of-the-art systems: mT5 for QG (Xue et al., 2021), Google Text-to-Speech for TTS, and XLSR (Conneau et al., 2021) for ASR. Since both mT5 and XLSR are multilingual pre-trained models and Google TTS supports various languages, the proposed framework can be easily applied to different languages to generate spoken QA datasets. Only the pre-trained models need to be fine-tuned with data from the language of interest. Fine-tuning the QG and ASR models requires a limited amount of QA data and TTS-based speech data, respectively. Even though our framework follows a strategy similar to that of Spoken SQuAD in generating SQA data, the textual QA data in our framework is also generated automatically. To the best of our knowledge, our work is the first study on automatic generation of SQA data from scratch.
As a proof of concept, we explored the application of the proposed framework to Turkish, for which there are limited textual (Soygazi et al., 2021) and spoken (Ünlü and Arisoy, 2021; Ünlü et al., 2019) QA datasets. A Turkish Question Answering (TurQuAse) dataset was automatically generated from Wikipedia articles, and QA performance on this dataset was evaluated with state-of-the-art models. Our main contributions can be summarized as (i) an easily extensible framework for automatic generation of an SQA dataset in a language of interest and (ii) the first publicly available Turkish SQA dataset, TurQuAse. We publicly share our code, model, and datasets as open source.
This paper is organized as follows. Recent work is summarized in Section 2. Section 3 presents the proposed framework. Section 4 describes the experimental setups and reports the results on Turkish datasets. Section 5 concludes the paper.

Question Generation
Research in question generation has shifted from RNN- or LSTM-based models (Du et al., 2017; Song et al., 2018; Duan et al., 2017; Du and Cardie, 2018) to transformer encoder-decoders. These encoder-decoders take large pre-trained language models as a starting point and are then fine-tuned on the dataset of interest (Lopez et al., 2020; Dong et al., 2019). With the idea of combining NLP tasks in a single framework, the text-to-text transfer transformer (T5) (Raffel et al., 2020) model was proposed. T5 allows the same architecture to be used for multiple NLP tasks. Its multilingual version, mT5 (Xue et al., 2021), has extended this idea to various languages. In our research, we utilize the pre-trained mT5 model to automatically generate questions from given text documents.

Automatic Speech Recognition
Recently proposed ASR models exploit the idea of large pre-trained models (Schneider et al., 2019; Baevski et al., 2020; Conneau et al., 2021). To generalize speech representations across different languages, the XLSR model (Conneau et al., 2021), which is based on Wav2Vec 2.0, was proposed. In our research, we use XLSR for ASR.

Spoken Question Answering
A typical SQA system relies on a cascade of ASR and textual QA models to find answers to questions in spoken documents (Tseng et al., 2016; Lee et al., 2019; Ünlü and Arisoy, 2021; Li et al., 2018). To improve SQA performance, incorporating additional information from sub-words (Li et al., 2018; Lee et al., 2018), contextualized word representations (Su and Fung, 2020), ASR confusion networks (Ünlü and Arisoy, 2021), and knowledge distillation across the text and speech domains (You et al., 2021a) has been investigated. Recent research on SQA has focused on large pre-trained models in which acoustic and text data are trained jointly (Chuang et al., 2019), or in which self-supervised learning followed by contrastive learning in a multi-task manner is used to learn multi-modal representations (You et al., 2021b). To utilize unlabeled data, an ASR transcript-free model pre-trained with unpaired text and acoustic data was proposed (Lin et al., 2022).
In our research, we evaluate the generated textual data using BERT (Devlin et al., 2019), mT5 (Xue et al., 2021), and Electra (Clark et al., 2020) QA models. We also evaluate the generated spoken data using a BERT QA model on ASR transcriptions.

Framework
In this section, we describe the proposed framework for generating a spoken QA dataset from scratch. Figure 1 shows the framework where the input is text documents and the output is the dataset containing automatically generated question-answer pairs, TTS-based audio files obtained from the input texts and corresponding ASR transcriptions.
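The data flow described above can be sketched as a small pipeline. This is only an illustrative outline: the names `SQAExample`, `build_sqa_dataset`, `generate_qa`, `synthesize`, and `transcribe` are hypothetical, and the three callables merely stand in for the QG, TTS, and ASR modules rather than reproducing the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical placeholders, not the authors' interfaces.
QAPair = Tuple[str, str]  # (question, answer)


@dataclass
class SQAExample:
    paragraph: str          # input text document
    qa_pairs: List[QAPair]  # automatically generated question-answer pairs
    audio: bytes            # TTS-based audio for the paragraph
    transcript: str         # ASR transcription of that audio


def build_sqa_dataset(
    paragraphs: List[str],
    generate_qa: Callable[[str], List[QAPair]],  # stands in for the QG module
    synthesize: Callable[[str], bytes],          # stands in for the TTS module
    transcribe: Callable[[bytes], str],          # stands in for the ASR module
) -> List[SQAExample]:
    """Run QG, TTS, and ASR on each paragraph and collect the outputs."""
    dataset = []
    for paragraph in paragraphs:
        audio = synthesize(paragraph)
        dataset.append(
            SQAExample(
                paragraph=paragraph,
                qa_pairs=generate_qa(paragraph),
                audio=audio,
                transcript=transcribe(audio),
            )
        )
    return dataset
```

Passing the modules in as callables keeps the sketch independent of any particular QG, TTS, or ASR system, which mirrors the language-independence argument made for the framework.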

Question Generation
For question generation, we utilized mT5 (Xue et al., 2021), a multilingual encoder-decoder transformer model. The encoder takes the input text and generates vectors as inputs to the decoder. The outputs of the decoder are generated in an autoregressive manner and passed to a softmax layer.
The mT5 model was fine-tuned in a multi-task manner on the answer extraction, question generation, and question answering tasks. We modified the QA dataset used for fine-tuning the mT5 model to generate training data for all tasks. The answer extraction task takes the context and predicts an answer span. The QG task uses the predicted answer span as input to generate a question. The QA task takes the question and the context as input to predict an answer span from the context.
In our framework, a single paragraph is given to the QG model as the context. The model first extracts possible answer spans and then uses the extracted answer spans together with the given context to generate questions. For fine-tuning the QG model, we used a limited amount of manually generated QA data from the language of interest.
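One way to derive training data for all three tasks from a single QA record is to emit one text-to-text example per task. The sketch below is an assumption about how this could look: the task prefixes (`extract answers:`, `generate question:`, `question: ... context: ...`) and the `<hl>` highlight markers are illustrative conventions, not necessarily the exact format used for fine-tuning, and `make_multitask_examples` is a hypothetical helper.

```python
from typing import Dict, List


def make_multitask_examples(context: str, question: str, answer: str) -> List[Dict[str, str]]:
    """Turn one (context, question, answer) record into three seq2seq examples."""
    if answer not in context:
        raise ValueError("answer must be a span of the context")
    # Mark the first occurrence of the answer span for the QG task.
    # The <hl> markers are an assumed convention.
    highlighted = context.replace(answer, f"<hl> {answer} <hl>", 1)
    return [
        {   # answer extraction: context -> answer span
            "task": "answer_extraction",
            "source": f"extract answers: {context}",
            "target": answer,
        },
        {   # question generation: context with marked answer -> question
            "task": "question_generation",
            "source": f"generate question: {highlighted}",
            "target": question,
        },
        {   # question answering: question + context -> answer span
            "task": "question_answering",
            "source": f"question: {question} context: {context}",
            "target": answer,
        },
    ]
```

At generation time, only the first two tasks would be chained: answer spans are extracted from a paragraph, and each span is then fed back with the context to produce a question.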

Text-to-Speech
We used the Google Text-to-Speech (TTS) framework to generate audio data. The input paragraphs were divided into smaller segments (10-word windows) to allow XLSR to be trained with a large batch size. Although Google TTS has an internal text normalizer, we normalized the text before using it as input to TTS in order to evaluate ASR performance fairly. Normalization involves converting numbers to words and removing punctuation. The quality of the normalized text affects the quality of the synthesized audio, which in turn may improve ASR performance.
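The normalization and segmentation steps can be illustrated with a short sketch. It makes several simplifying assumptions: `verbalize_number` is a hypothetical callable supplied by the caller (a real system would need a language-specific number verbalizer), every run of digits is verbalized independently, and all punctuation is replaced by whitespace, so apostrophes inside words also split them.

```python
import re
from typing import Callable, List


def normalize(text: str, verbalize_number: Callable[[str], str]) -> str:
    """Spell out digit sequences and strip punctuation (simplified)."""
    # Replace each run of digits with its spoken form.
    text = re.sub(r"\d+", lambda m: f" {verbalize_number(m.group())} ", text)
    # Replace anything that is not a word character or whitespace with a space.
    # \w is Unicode-aware for str patterns, so letters such as ç, ğ, ş are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse repeated whitespace.
    return " ".join(text.split())


def segment(text: str, window: int = 10) -> List[str]:
    """Split text into consecutive windows of at most `window` words."""
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]
```

Each segment would then be sent to the TTS system separately, so that the resulting audio clips are short enough for batched ASR training.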

Automatic Speech Recognition
The TTS-based audio files were fed into ASR to generate transcriptions. For ASR, we used the pre-trained multilingual XLSR model (Conneau et al., 2021). The ability to learn speech representations across different languages allows this model to be utilized for ASR in various languages.

Experiments
This section explains how we used the proposed framework to generate the Turkish Question Answering (TurQuAse) dataset, and presents our Turkish QA and SQA experiments and results.

Turkish Text Data
For generating the TurQuAse data, we collected 460K Wikipedia pages using an XLM parser (Vardar et al., 2019). Each page contains a title, a subject, a table, and several paragraphs. The title indicates who or what the page is about. The table contains structured information about the page. For our framework, we used the first paragraph of each page in our Wikipedia dataset, since the first paragraph is usually a summary of the article with more general information. Then, we filtered out (i) the paragraphs containing non-Turkish characters, for better TTS performance; (ii) the paragraphs with a missing subject field, to diversify the data based on subjects; and (iii) the paragraphs containing fewer than 40 words, to provide longer context to the QG module. After filtering, we ended up with 20.4K paragraphs.
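The three filtering criteria can be expressed as a single predicate. The sketch below is illustrative only: the dictionary keys `first_paragraph` and `subject`, the function name `keep_paragraph`, and the exact set of allowed characters (the 29-letter Turkish alphabet, digits, and common punctuation) are assumptions, since the precise definition of a "non-Turkish character" is not given here.

```python
from typing import Dict

# Assumed character inventory: the 29-letter Turkish alphabet in both cases,
# plus digits, whitespace, and a handful of punctuation marks.
TURKISH_LETTERS = set(
    "abcçdefgğhıijklmnoöprsştuüvyz"
    "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
)
ALLOWED_CHARS = TURKISH_LETTERS | set("0123456789 .,;:!?'\"()-")


def keep_paragraph(page: Dict[str, str], min_words: int = 40) -> bool:
    """Return True if the page's first paragraph passes all three filters."""
    text = page.get("first_paragraph", "")
    if not page.get("subject"):          # missing or empty subject field
        return False
    if len(text.split()) < min_words:    # too short to give the QG model context
        return False
    # Reject paragraphs containing any character outside the assumed inventory.
    return all(ch in ALLOWED_CHARS for ch in text)
```

Under this inventory, a paragraph containing letters such as `q`, `w`, or `x` is rejected, because these letters are not part of the Turkish alphabet.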

Question Generation
The QG module was implemented in Python using the HuggingFace library (Wolf et al., 2020). We used the small pre-trained mT5 model with a batch size of 8 and 32 gradient accumulation steps to achieve an effective batch size of 256. The model was fine-tuned for 30 epochs with a learning rate of 1e-4 using two Turkish QA datasets: ThQuAD (Soygazi et al., 2021) and an English-to-Turkish machine-translated version of SQuAD. These datasets contain 15.4K and 64.8K question-answer pairs, respectively. After fine-tuning, the QG model produced 83.6K question-answer pairs on the Turkish Wikipedia data explained in Section 4.1. Each paragraph has on average 4 questions, and the average question and answer lengths are around 7 and 3 words, respectively.
The performance of the QG model was evaluated on the development set of ThQuAD and on XQuAD Turkish (Artetxe et al., 2020) with the BLEU and ROUGE metrics. For the evaluation, we compared the original and generated questions using both lemmatized and surface form representations of words. Table 1 shows that the results on ThQuAD are better than those on XQuAD. The reason why the model gives better results on ThQuAD than on XQuAD may be that the training set of ThQuAD was also used to fine-tune the QG model together with the machine-translated SQuAD. Among the 83.6K question-answer pairs generated from the Turkish Wikipedia articles, we manually evaluated 2.8K paragraphs with 12.3K question-answer pairs. This subset represents about 14% of the total data. The manual evaluation revealed that 55% of the questions were annotated as grammatically correct and sensible, and among these questions 11% had incorrect answer spans. To better understand the generated questions, we analyzed 168 randomly selected incorrect questions generated by the QG module and found the following distribution of errors: 37% factually inaccurate, 18% semantically incomplete, 39% grammatically incomplete, and 6% incorrectly formed entities.
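To make the comparison between original and generated questions concrete, the following is a simplified sketch of a ROUGE-L style score based on the longest common subsequence (LCS) of word tokens. Which ROUGE variant and which tokenization were actually used is not stated here, so ROUGE-L, whitespace tokenization, and the function names are assumptions chosen for illustration. The same function could be applied to either surface forms or lemmatized tokens.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 over LCS-based precision and recall (simplified ROUGE-L)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of generated tokens found in order
    recall = lcs / len(ref)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)
```

For a reference of four tokens and a generated question that reproduces three of them in order, precision is 1.0, recall is 0.75, and the resulting F1 is 6/7.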

TTS and ASR
The TTS and ASR models were implemented in Python using the TTS library and the HuggingFace library (Wolf et al., 2020). Using TTS, we generated the audio files for all 20.4K paragraphs used as input to our framework and ended up with 223 hours of speech data. This data was then decoded using the XLSR model to obtain ASR transcriptions. To fine-tune the XLSR model, we used a small amount of held-out text data from the collected Turkish Wikipedia articles. After generating the audio files with TTS for this subset, we ended up with 8 hours of speech data for fine-tuning the XLSR model and a 2-hour development set for tuning the hyperparameters. The model was fine-tuned with an initial learning rate of 5e-4 for 30 epochs with a batch size of 2. Note that the articles used in QG and in fine-tuning the ASR model were disjoint. The ASR model yielded a 14.8% word error rate (WER) on the paragraphs used as input in QG. By using the QG, TTS, and ASR systems, we generated the TurQuAse dataset. To sum up, TurQuAse contains 83.6K automatically generated question-answer pairs from 20.4K paragraphs, as well as TTS-based audio files and ASR transcriptions corresponding to these paragraphs.
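For reference, WER is the word-level edit distance between the reference text and the ASR hypothesis, divided by the number of reference words. The sketch below shows the standard computation for a single utterance; the function name and whitespace tokenization are assumptions, and a corpus-level WER would sum the edit distances and reference lengths over all utterances before dividing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With a four-word reference and a hypothesis containing one substituted and one missing word, the edit distance is 2 and the WER is 0.5.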

Question Answering
For the QA experiments, we trained three models: BERT, mT5, and Electra. The BERT and Electra models were trained with a batch size of 16 and a learning rate of 2e-5 for 20 epochs. The mT5 model was trained with a batch size of 8, 32 gradient accumulation steps, and a learning rate of 1e-3 for 20 epochs. All models were fine-tuned on ThQuAD, on TurQuAse, and on a combination of these two datasets. The models were evaluated on the ThQuAD test set and the XQuAD Turkish set. The Exact Match (EM) and F1 scores for the QA experiments are given in Table 2. The first column lists the models used in the evaluation (BERTurk (Schweter, 2020), mT5, and Turkish Electra), the second column lists the QA data used in fine-tuning the models (ThQuAD, TurQuAse, and the combination of the two), and the remaining columns show the QA results on the ThQuAD and XQuAD test sets.
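The EM and F1 metrics can be sketched as follows for a single prediction. This is a simplified version of the usual SQuAD-style scoring: it assumes whitespace tokenization and omits the answer normalization (lowercasing, punctuation and article removal) that evaluation scripts typically apply, and the function names are illustrative.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.split() == gold.split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.split(), gold.split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are then the averages of these per-question values, which is why F1 is always at least as high as EM.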
In all experiments, fine-tuning the models with ThQuAD alone leads to better results than fine-tuning the models with TurQuAse alone. This might be due to TurQuAse being a noisy QA dataset: TurQuAse was generated automatically, whereas ThQuAD was generated manually by human annotators. However, the combination of ThQuAD and TurQuAse (Combined in Table 2) improves the results, especially for XQuAD, which is a QA test set built from Wikipedia articles. For XQuAD, the relative EM improvements are 6.4% (from 0.47 to 0.50) with the BERTurk model, 18% (from 0.33 to 0.39) with the mT5 model, and 13% (from 0.46 to 0.52) with the Electra model. Even though fine-tuning with the combined data did not improve F1 for the BERTurk model, we obtained relative F1 improvements of 7.8% (from 0.51 to 0.55) and 3.1% (from 0.64 to 0.66) with the mT5 and Electra models, respectively. Fine-tuning with the combined data did not noticeably improve the performance on ThQuAD. This might be due to the domain mismatch between the ThQuAD and TurQuAse datasets.
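The percentages quoted above are relative improvements, that is, the score difference divided by the baseline score. A small sketch of the calculation, using the BERTurk EM scores on XQuAD from the text (the function name is illustrative):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (improved - baseline) / baseline


# BERTurk EM on XQuAD: ThQuAD-only fine-tuning vs. the combined data.
bert_em_gain = relative_improvement(0.47, 0.50)
print(f"{bert_em_gain:.1f}%")  # prints 6.4%
```

The absolute gain in this case is 0.03 EM, so the relative and absolute figures should not be confused when comparing models with different baselines.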
For further analysis, we evaluated the BERTurk model without any fine-tuning in a zero-shot experiment (second-to-last row in Table 2). The results revealed that the Turkish BERT model without any fine-tuning cannot be utilized for the given QA task. Additionally, we tested a multilingual BERT model fine-tuned on English SQuAD on the Turkish datasets (last row in Table 2). The results of these experiments show that using cross-lingual capabilities in QA models can be a viable research direction.
For the SQA experiments, we only used the BERTurk model after fine-tuning with the ASR transcriptions. To investigate the SQA performance with the ThQuAD and the combined datasets on the ThQuAD and XQuAD test sets, we also applied the TTS and ASR frameworks to ThQuAD and XQuAD and obtained the ASR transcriptions for the paragraphs. Note that, for a fair evaluation, we removed the question-answer pairs from the training and test sets if the ASR system did not correctly transcribe the answer.
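The removal of question-answer pairs whose answers were not transcribed correctly can be sketched as a simple filter. The sketch assumes that "correctly transcribed" means the answer appears verbatim as a contiguous word sequence in the ASR transcript, and that answers and transcripts have already been normalized in the same way; the dictionary keys and the function name are hypothetical.

```python
from typing import Dict, List


def filter_by_transcript(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only examples whose answer occurs word-for-word in the transcript."""
    kept = []
    for ex in examples:
        answer_tokens = ex["answer"].split()
        transcript_tokens = ex["transcript"].split()
        n = len(answer_tokens)
        # Slide a window of the answer's length over the transcript tokens.
        found = n > 0 and any(
            transcript_tokens[i:i + n] == answer_tokens
            for i in range(len(transcript_tokens) - n + 1)
        )
        if found:
            kept.append(ex)
    return kept
```

Matching on whole words rather than substrings avoids keeping an example only because the answer happens to be part of a longer, misrecognized word.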
The SQA results are given in Table 3. The first column of the table lists the QA data used in fine-tuning the BERTurk model, and the remaining columns show the SQA results on the ThQuAD and XQuAD test sets. Similar to the QA experiments reported in Table 2, we did not observe improvements on ThQuAD even with the combined dataset. This might again be due to the noise introduced by the automatically generated TurQuAse data. On XQuAD, however, we obtained improvements over the model fine-tuned with ThQuAD, both by using TurQuAse alone and by combining it with ThQuAD. For XQuAD, the relative EM improvements are 14.3% (from 0.35 to 0.40) and 34.3% (from 0.35 to 0.47), and the relative F1 improvements are 5.5% (from 0.55 to 0.58) and 14.5% (from 0.55 to 0.63), for the models fine-tuned with TurQuAse alone and with the combined data, respectively. The best SQA performance on XQuAD was obtained when ThQuAD was combined with the TurQuAse dataset, which shows the effectiveness of the proposed framework for the SQA task.
Additionally, we performed an experiment to investigate the effect of ASR errors on SQA. The BERTurk model fine-tuned on the reference transcriptions of ThQuAD was evaluated on the reference and ASR transcriptions of the ThQuAD and XQuAD test sets. The results are reported in Table 4. The first row shows the QA results on the reference transcriptions, and the second row shows the QA results on the ASR transcriptions. The EM and F1 scores decreased when ASR transcriptions were used in the test set. This is an expected performance drop due to the ASR errors in the transcribed data. Comparing the first row of Table 3 (both training and test data are ASR transcriptions) with the last row of Table 4 (training data are reference transcriptions and test data are ASR transcriptions) shows that the performance drop can be alleviated to some extent by using ASR transcriptions in both the training and test data.

Conclusion
In this paper, we proposed a framework for generating SQA data from scratch. The framework outputs automatically generated question-answer pairs, audio data, and ASR transcriptions for a given input text. We demonstrated the effectiveness of the proposed framework by creating TurQuAse, the first publicly available SQA dataset for Turkish. Experimental results showed that the TurQuAse dataset improves SQA performance. The framework presented in this paper can be easily extended to other languages. As future work, we plan to improve the quality of the automatically generated question-answer pairs by providing additional information to the QG module. We also plan to collect real speech data for a subset of our Turkish dataset to compare TTS and real speech performance in SQA.

Ethics
The input text data used in this paper comes from publicly available Wikipedia pages. The input data, the automatically generated questions, and the audio files do not contain any personal information. The annotators participated in the manual annotation voluntarily. The Wikipedia pages used to generate the dataset were compiled to cover topics as evenly as possible, to avoid any bias towards a particular topic. We will make the generated Turkish dataset publicly available along with the implementation to ensure reproducibility.

Limitations
The empirical results reported herein should be considered in light of some limitations. The first limitation is in the collection of the speech data. The Google TTS system is free and easy to use, but there is a daily limit on the number of requests that can be submitted. This limit slowed down the audio data collection process considerably. As a result, we could only collect audio data for a subset of the large amount of available textual data. The second limitation is in computational resources. Multilingual state-of-the-art pre-trained models require GPU support and large memory sizes during fine-tuning, even with small data. We used models with a small number of parameters because of our limited computational resources. The third limitation is in working with a limited-resource language for QA. Due to the lack of Turkish QA datasets, we used the same dataset (ThQuAD) to fine-tune both the QG and QA models, which might impose a bias toward this dataset. However, this bias was alleviated to some extent when we used the spoken versions of the datasets (noisy datasets due to ASR errors).

Figure 1: The proposed framework. Collected paragraphs are taken as input, and automatically generated question-answer pairs, TTS-based audio files of the paragraphs, and the corresponding ASR transcriptions are produced as output.

Table 1: QG performance evaluation.

Table 2: Scores of different QA setups on the ThQuAD test set and XQuAD.

Table 3: SQA performance of the BERTurk model.

Table 4: QA performance of the BERTurk model fine-tuned on the reference transcriptions of ThQuAD.
