VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

We introduce VoxPopuli, a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 15 languages and their aligned oral interpretations into 15 target languages totaling 17.3K hours. We provide speech recognition (ASR) baselines and validate the versatility of VoxPopuli unlabeled data in semi-supervised ASR and speech-to-text translation under challenging out-of-domain settings. The corpus is available at https://github.com/facebookresearch/voxpopuli.


Introduction
Recent progress in speech-to-text tasks such as automatic speech recognition (ASR) and speech translation (ST) has been achieved by the development and application of unsupervised speech pre-training methods (Oord et al., 2018; Schneider et al., 2019; Baevski et al., 2020; Conneau et al., 2020; Wu et al., 2020; Nguyen et al., 2020), semi-supervised learning (self-training) (Kahn et al., 2020a; Zhang et al., 2020b) or a combination of both methods. This line of research leverages large amounts of unlabeled English speech data (Kahn et al., 2020b) that enable improvements in English ASR or out-of-English ST. Large amounts of multilingual audio data are needed in order to achieve similar progress for multilingual ASR and ST. Similarly, most ASR and ST research is currently conducted on the LibriSpeech (Panayotov et al., 2015) and MuST-C benchmarks (Cattoni et al., 2020; Di Gangi et al., 2019). As a result, the research community has been mostly focused on speech-to-text tasks with English as input. While multilingual ASR (Pratap et al., 2020; Ardila et al., 2020) and ST datasets (Wang et al., 2020b; Iranzo-Sánchez et al., 2020) have recently been made available, the amount of data available quickly drops beyond the top few high-resource languages.
Simultaneous speech translation (interpretation) has witnessed a resurgence with the applications of end-to-end encoder-decoder models. Most of the recent studies focus on text output and leverage ST corpora that are translated offline in the written form. There are differences, however, between translationese and interpretese (Sridhar et al., 2013; He et al., 2016), where interpreters develop a variety of strategies to improve simultaneity. Models trained on translation corpora are unlikely to learn from these interpretation skills to achieve better quality-latency trade-offs. Finally, there has been little research (Jia et al., 2019; Tjandra et al., 2019; Zhang et al., 2020a) into speech output due to the lack of open data. Existing corpora (Tohyama et al., 2004; Bendazzoli et al., 2005) are either of limited size or no longer publicly available.
We describe our corpus creation methodology in Section 2 and analyze the created corpus in Section 3. We provide ASR baselines and demonstrate the value of our multilingual unlabeled data as well as weakly labeled data on several non-English languages in Section 4.

Data Acquisition
VoxPopuli sources data from 2009-2020 European Parliament (EP) event recordings, which include plenary sessions, committee meetings and other events. In each event, speakers give speeches in turn in different European Union (EU) languages. These speeches are partially transcribed (for plenary sessions only) and interpreted into 24 EU languages. The interpretations are only oral without any transcription. In the following, we refer to the original speech as "source speech" and to the interpreted one as "target speech". We download audio clips for both source and target speeches from the official website. We also crawl the transcript, speaker information and starting/ending timestamps for each speech (for plenary sessions only) from the same website, with which we later align the speech to its transcript and interpretation utterance by utterance. The acquired raw data suffers from missing audio, incomplete transcripts and inaccurate timestamps. We build data processing pipelines to segment speech paragraphs into utterances and filter out the ones with erroneous transcriptions.

Unlabeled Speech
We construct the VoxPopuli unlabeled set from all source and target speeches in 23 EU languages (excluding Irish because of very limited data availability). We segment full-event audios into short clips of 15-30 seconds using an energy-based voice activity detection (VAD) algorithm. Each audio clip has a maximum of 2 seconds of continuous silence, and silent clips are discarded. Around 16% of the data is dropped after silence removal, which leads to a final overall duration of around 400K hours.
Table 1: Statistics for unlabeled ("Unlab.") and transcribed speech data in VoxPopuli: duration in hours ("Hrs"), number of speakers ("Spkrs"), percentage of female speakers ("F%") and number of tokens ("Tkns"). Durations are calculated on segmented audios where leading and trailing silence is trimmed. The LM data is a combination of VoxPopuli transcription and sentences from EuroParl (Koehn, 2005).
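The segmentation step can be illustrated with a minimal sketch. It is not the pipeline used to build the corpus: the function names, the energy threshold and the frame length are assumptions made for illustration, and only the 2-second silence limit and the 30-second upper bound are taken from the description above (the 15-second lower bound is not enforced here).

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (hypothetical helper)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment_by_energy(
    energies: List[float],
    frame_sec: float,
    threshold: float,            # assumed energy threshold, to be tuned per recording
    max_silence_sec: float = 2.0,
    max_clip_sec: float = 30.0,
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) clips with leading/trailing silence trimmed."""
    clips = []
    start = None        # index of the first speech frame of the open clip
    last_speech = None  # index of the most recent speech frame
    for i, energy in enumerate(energies):
        is_speech = energy > threshold
        if start is None:
            # A clip only opens on a speech frame, so purely silent stretches are dropped.
            if is_speech:
                start, last_speech = i, i
            continue
        if is_speech:
            last_speech = i
        silence = (i - last_speech) * frame_sec
        length = (i + 1 - start) * frame_sec
        # Close the clip when the silence run is too long or the clip is long enough.
        if silence > max_silence_sec or length >= max_clip_sec:
            clips.append((start * frame_sec, (last_speech + 1) * frame_sec))
            start = last_speech = None
    if start is not None:
        clips.append((start * frame_sec, (last_speech + 1) * frame_sec))
    return clips
```

For example, with 0.5-second frames, a 2.5-second pause between two speech bursts splits them into two clips, while a pause of 2 seconds or less keeps them together.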

Transcribed Speech
The VoxPopuli transcribed set comes from aligning the full-event source speech audio with the transcripts for plenary sessions. Official timestamps are available for locating speeches by speaker in the full session, but they are frequently inaccurate, resulting in truncation of the speech or mixture of fragments from the preceding or the succeeding speeches. To calibrate the original timestamps, we perform speaker diarization (SD) on the full-session audio using pyannote.audio (Bredin et al., 2020) and adopt the nearest SD timestamps (by L1 distance to the original ones) instead for segmentation. Full-session audios are segmented into speech paragraphs by speaker, each of which has a transcript available. The speech paragraphs have an average duration of 197 seconds, which leads to significant memory usage and prevents efficient parallelism (batching) during model training. We hence further segment these paragraphs into utterances with a maximum duration of 20 seconds. We leverage speech recognition (ASR) systems to force-align speech paragraphs to the given transcripts and cut the utterances by ending punctuation or the longest silence inside the sentence if it exceeds 20 seconds. The ASR systems are TDS models (Hannun et al., 2019) trained with the ASG criterion (Collobert et al., 2016) on audio tracks from in-house de-identified video data. The resulting utterance segments may have incorrect transcriptions due to incomplete raw transcripts or inaccurate ASR force-alignment. We use the predictions from the same ASR systems as references and filter the candidate segments by a maximum threshold of 20% character error rate (CER). We split the filtered utterances into train, development and test sets with disjoint speakers and target duration ratio (18:1:1). To determine the assignments, we group utterances by speaker and sort them by overall duration in ascending order. We assign the sorted groups to the test set in order until it reaches 20 speakers or the target duration (whichever comes later). The same process is repeated on the remaining utterance groups to construct the development set (with minimum 10 speakers instead). Finally, the remaining utterances make up the train set. This approach ensures higher speaker diversity in the test and development sets.
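The timestamp calibration and the speaker-disjoint split described above can be sketched as follows. This is an illustrative reading of the procedure rather than the released code: the function names and data layouts are assumptions, and the 18:1:1 ratio is expressed as a 5% duration target for each of the test and development sets.

```python
from collections import defaultdict


def nearest_sd_segment(original, sd_segments):
    """Pick the diarization (start, end) closest to the official one by L1 distance."""
    s0, e0 = original
    return min(sd_segments, key=lambda seg: abs(seg[0] - s0) + abs(seg[1] - e0))


def split_by_speaker(utterances, test_ratio=0.05, dev_ratio=0.05,
                     min_test_spk=20, min_dev_spk=10):
    """utterances: list of (utt_id, speaker, duration). Returns split name -> utt_ids."""
    by_spk = defaultdict(list)
    for utt_id, spk, dur in utterances:
        by_spk[spk].append((utt_id, dur))
    total = sum(dur for _, _, dur in utterances)
    # Speakers sorted by their overall duration, shortest first.
    speakers = sorted(by_spk, key=lambda s: sum(d for _, d in by_spk[s]))
    splits = {"test": [], "dev": [], "train": []}
    idx = 0
    for name, ratio, min_spk in (("test", test_ratio, min_test_spk),
                                 ("dev", dev_ratio, min_dev_spk)):
        n_spk, dur = 0, 0.0
        # "Whichever comes later": keep adding speakers until both the speaker
        # count and the duration target are reached.
        while idx < len(speakers) and (n_spk < min_spk or dur < ratio * total):
            for utt_id, d in by_spk[speakers[idx]]:
                splits[name].append(utt_id)
                dur += d
            n_spk += 1
            idx += 1
    # Everything that is left goes to the train set.
    for spk in speakers[idx:]:
        splits["train"].extend(utt_id for utt_id, _ in by_spk[spk])
    return splits
```

Because short-duration speakers are assigned first, the test and development sets reach their duration targets with many speakers, which is the source of the higher speaker diversity mentioned above.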

Speech-To-Speech Alignment
Even though every source speech is associated with corresponding simultaneous interpretations in target languages, considerable preprocessing and filtering is necessary to make this dataset usable. Our strategy is to align source and target at the sentence level using ASR.
We first compare the spectrograms of the source and the target speech to remove the identical parts and segment the target speech into paragraphs. These identical parts are due to either the short delay between the time the source speaker and the interpreter started, or the fact that the source language is the same as the target one, and thus no interpretation is needed. For long target paragraphs, we further segment them by silence into audio clips of at most 15 minutes. We use the same ASR model described in Section 2.2.2 and a language model (Section 2.2.4) to decode the segmented target audio. The decoded text is also force-aligned with the target audio, so that we have the timestamps of every decoded word.
For each source segment produced in Section 2.2.2, we locate all decoded words that are within a window of five seconds to its start and end. A set of candidate target segments can be generated from all possible combinations of the starting and ending decoded words. We compute the cosine similarity between the LASER representation (Artetxe and Schwenk, 2019) of the source text and each decoded text in the candidate set to find the best target segment, i.e. the one with the highest score. We first carry out this process for all source segments, respectively, and then fine-tune the boundaries of overlapping target segments for consecutive source segments. Finally, a threshold of 0.75 is applied on the similarity score to filter out low-quality alignments, which can be due to ASR errors.
Original (French): Vous le savez tous, la forêt recule. Toutes les deux secondes dans le monde, c'est l'équivalent d'un terrain de football qui est détruit, c'est en un an l'équivalent du territoire de la Grèce qui est déforesté et c'est evidemment dramatique.
Translation: As you all know, the forest is receding. Every two seconds, across the world, the equivalent of a football pitch is destroyed; within a year, an area the size of Greece is deforested. Clearly, this is a tragic situation.
Interpretation: You all know that we are losing forests every second, the surface the size area of a football field is lost in the forest. This is really tragic.
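The candidate search and similarity filtering in the alignment procedure can be sketched as below. The sketch makes several assumptions: `embed` is a toy bag-of-words stand-in so that the example runs without external models (the paper uses LASER, which is cross-lingual, whereas this stand-in only works when both texts are in the same language), the function names and the word tuple layout are hypothetical, and the boundary refinement between consecutive segments is omitted.

```python
import math
from collections import Counter
from itertools import product


def embed(text):
    """Toy bag-of-words vector; a placeholder for a multilingual sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_target_segment(source_text, source_span, words, window=5.0, threshold=0.75):
    """words: list of (token, start_sec, end_sec) from force-aligned target ASR output.

    Returns (start_sec, end_sec, text, score), or None if no candidate passes the threshold.
    """
    src_start, src_end = source_span
    # Candidate first words start near the source start; candidate last words end near its end.
    starts = [i for i, w in enumerate(words) if abs(w[1] - src_start) <= window]
    ends = [j for j, w in enumerate(words) if abs(w[2] - src_end) <= window]
    src_vec = embed(source_text)
    best = None
    for i, j in product(starts, ends):
        if j < i:
            continue
        text = " ".join(w[0] for w in words[i:j + 1])
        score = cosine(src_vec, embed(text))
        if best is None or score > best[3]:
            best = (words[i][1], words[j][2], text, score)
    if best is None or best[3] < threshold:
        return None  # low-quality alignment, filtered out
    return best
```

The number of candidates grows with the product of the two word sets, which is why restricting both boundaries to a short window around the source timestamps keeps the search tractable.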
In addition to ASR output, we also collect human transcription on 400 hours of English target speech. The human annotators were asked to provide timestamps for each word while transcribing, and thus we can apply the same alignment process described above on human transcription and generate a set of ground truth speech-to-speech alignment data.
As a by-product of this alignment process, source text and target speech are aligned, which provides speech-to-text "translation" data in the reversed direction. This data is weakly labeled: the label (text) may contain more information than the speech data (interpretation is likely to drop unimportant details) and hence is not exact. However, it is still useful for ST model training as an addition to labeled data.

Language Modeling Data
To train language models (LM) for ASR decoding, we combine VoxPopuli transcription in the training set with the EuroParl corpus (Koehn, 2005), which is from the proceedings of the European Parliament from 1996 to 2011. To process the EuroParl data, we first apply the sentence segmentation tool provided with the corpus. We remove all text in parentheses, replace hyphens and slashes with spaces, and remove all other punctuation except apostrophes. All digits are converted into words, and all text is normalized into lowercase. Table 1 shows the statistics of the LM data.
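The normalization rules above can be sketched with a few regular expressions. This is an assumption-laden illustration, not the authors' script: the function name is hypothetical, and digits are spelled out one by one in English only, whereas a real pipeline would need language-specific number verbalization.

```python
import re

# English digit names; a simplification, since the corpus is multilingual.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def normalize(text: str) -> str:
    text = re.sub(r"\([^)]*\)", " ", text)   # drop text in parentheses
    text = re.sub(r"[-/]", " ", text)        # hyphens and slashes become spaces
    # Spell out each digit separately (simplification of "digits into words").
    text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[int(m.group())]} ", text)
    text = re.sub(r"[^\w\s']|_", "", text)   # remove other punctuation, keep apostrophes
    return " ".join(text.lower().split())    # lowercase and collapse whitespace
```

For instance, `normalize("The EU-27 (applause) met on 3/4, didn't they?")` yields `"the eu two seven met on three four didn't they"`.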

Data Analysis
Unlabeled speech As we can see from Table 1, VoxPopuli has a total of 400K hours of unlabeled data well-distributed across 23 EU languages, resulting in 8K-24K hours of data for each language. This ensures adequate data for languages with fewer ASR resources, which are likely to benefit more from semi-supervised learning. It also facilitates multilingual model training since there is not much data imbalance and little need for tuning the data sampling strategy.
Transcribed speech The VoxPopuli transcribed data contains 16 languages totaling 1.8K hours and 4.3K speakers, whose detailed statistics can be found in Table 1, including duration (hours) by language, number of speakers, percentage of female speakers and number of tokens. The data distribution is imbalanced and reflects the natural distribution of the number of native speakers. The remaining 7 languages (Pt, Bg, El, Lv, Mt, Sv and Da) are not covered due to either limited data volume or the lack of available processing pipelines.
Speech-to-speech alignment The statistics of the speech-to-speech alignment between all source languages and 15 target languages are shown in Table 2. Compared with the total amount of data available for each source language ("Transcribed hours" in Table 1), we obtain target alignments for more than 70% of the source sentences in En, De, Fr, Es and It, more than 50% for Pl, Ro, Cs, Nl and Hr, and the rest has at least 40% of source segments aligned. To examine the quality of our ASR system, we align the ASR output with the human transcription we collect on English target speech and see a word error rate (WER) of 31.7. With the human transcription, we can produce ground truth speech-to-speech alignment data that is 1.1 times larger than the size of the alignment data created from using ASR output, indicating that around 12% of the low-quality alignments are filtered due to ASR errors. If we compare the ASR-based and the ground truth alignment data, there is on average a 0.75-second shift in the target segment boundaries.
Interpretation tends to be more general and summarizing, with unimportant details dropped. Human interpreters regularly apply these tactics to make better quality-latency trade-offs. Speech-to-speech translation models may benefit from these tactics if they are trained on interpretation data that VoxPopuli provides.

Experiments & Results
We provide VoxPopuli ASR baselines and validate the versatility of VoxPopuli unlabeled data in unsupervised representation learning and semi-supervised learning for ASR as well as ST. We also evaluate the quality of speech-to-speech alignment indirectly via the weakly labeled ST data it produces.

Experimental Setup
For representation learning, we perform speaker diarization before VAD-based segmentation so that each utterance contains exactly one speaker. We augment the data with time dropout, pitch modification and reverberation (Kharitonov et al., 2020) during model training.
For non-wav2vec models, we extract 80-dimensional log-mel filterbank speech features with a 25 ms window size and a 10 ms shift. We apply per-utterance CMVN (cepstral mean and variance normalization) to the extracted features. For GPU memory efficiency, we remove training samples that have more than 60 seconds of speech or more than 1024 characters.
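Per-utterance CMVN and the length filter can be illustrated with a small pure-Python sketch. The function names and the `eps` stabilizer are assumptions for illustration; in practice this would be done with array libraries on the 80-dimensional filterbank matrix of each utterance.

```python
import math
from typing import List


def cmvn(features: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Normalize each feature dimension to zero mean and unit variance within one utterance.

    features: list of frames, each a list of D filterbank values.
    """
    n = len(features)
    dim = len(features[0])
    mean = [sum(frame[d] for frame in features) / n for d in range(dim)]
    var = [sum((frame[d] - mean[d]) ** 2 for frame in features) / n for d in range(dim)]
    # eps avoids division by zero for constant dimensions (assumed value).
    return [
        [(frame[d] - mean[d]) / math.sqrt(var[d] + eps) for d in range(dim)]
        for frame in features
    ]


def keep_sample(duration_sec: float, transcript: str,
                max_sec: float = 60.0, max_chars: int = 1024) -> bool:
    """Keep a training sample only if it is within both length limits."""
    return duration_sec <= max_sec and len(transcript) <= max_chars
```

Because the statistics are computed per utterance, the normalization removes channel and loudness offsets that differ between recordings without needing corpus-level statistics.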
We train wav2vec 2.0 (Baevski et al., 2020) models with original hyper-parameter settings using fairseq (Ott et al., 2019), except for Table 7 where we use wav2letter (Pratap et al., 2018) and follow Talnikar et al. (2020) for fine-tuning using both supervised CTC (Graves et al., 2006) loss and unsupervised wav2vec 2.0 loss. The largest model ("VP-100K") takes 10 days on 128 V100 GPUs for 1M updates. For non-wav2vec models, we train Transformer (Vaswani et al., 2017) with cross-entropy criterion using fairseq S2T (Wang et al., 2020a). For Section 4.2 and Section 4.4.1, we use phoneme vocabularies for models that we evaluate with PER (phone error rate) and character vocabularies for the others. For Section 4.4.2, we use Unigram (Kudo and Richardson, 2018) vocabularies with 2K subwords for all models. To improve ST model training, we pre-train the encoder on the LibriSpeech (Panayotov et al., 2015) ASR task.

Speech Recognition (ASR) Baselines
We provide monolingual Transformer baselines for the 14 languages that have more than 10 hours of transcribed data (see Table 1). Both development and test WER are reported in Table 4. We see that several low-resource languages (Fi, It, Hr, Sk and Sl) suffer from high recognition errors (>40% WER) due to the lack of training data. Even the highest resource one (En) has a high WER of around 30%.
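For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words; the CER used for filtering in the data pipeline is the same quantity at the character level. The following is a minimal sketch with hypothetical function names, not the scoring tool used for the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

Since insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.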

Unsupervised Representation Learning
We follow the setting in Rivière et al. (2020) to evaluate unsupervised speech representations by phoneme discriminability on 3 languages (English, French and Mandarin), and report ABX discriminability score (Schatz et al., 2013) on the 10s test set from ZeroSpeech 2017 (Dunbar et al., 2017). Standard deviation ("Std.") of the scores across the 3 languages is also reported as a measure for the generality of the representations. As previous studies focus on monolingual representations, we explore multilingual representations and examine their generality across languages. We train CPC-based models (Riviere and Dupoux, 2020) on 500-hour English and 500-hour French unlabeled data from VoxPopuli, respectively. And we combine English and French data with 50% sampling (so that the total duration remains the same) for the multilingual setting. We observe from Table 5 that the multilingual model ("En+Fr-500") performs comparably to the monolingual ones ("En-500" and "Fr-500") on their seen languages and performs better on unseen language ("Zh"). Its scores vary less across languages (lower "Std.") compared to "En-500". The variance of the scores is comparable to "Fr-500" while the average is lower. We conclude that multilingual representations generalize better across languages and are more robust on unseen languages. For quick exploration, we leverage only part of the VoxPopuli unlabeled data and leave the validation on more data to future work.
Table 8: ST and ASR using VoxPopuli data for self-training or weak supervision. Left: test BLEU for ST models. Right: test WER for ASR models. We evaluate in-VoxPopuli-domain performance with EuroParl-ST (EP) and the out-of-domain performance with CoVoST 2 (CV). We combine both corpora to train our baseline and pseudo-label 3K-hour monolingual VoxPopuli unlabeled data for self-training. For ST training with weak supervision, we combine EP, CV and 300h weakly labeled data from VoxPopuli. Both approaches for leveraging VoxPopuli data improve in-domain (EP) and out-of-domain (CV) performance simultaneously. † EP baselines from Iranzo-Sánchez et al. (2020) and CV baselines from Wang et al. (2020b).

Semi-Supervised Learning
We explore two semi-supervised learning settings for the application of VoxPopuli unlabeled data: unsupervised pre-training followed by supervised fine-tuning for ASR and self-training for ASR as well as ST.

ASR with Unsupervised Pre-Training
Self-supervised (unsupervised) pre-training such as wav2vec 2.0 (Baevski et al., 2020) substantially reduces the need for labeled data in ASR. Furthermore, multilingual pre-training (Conneau et al., 2020) allows cross-lingual transfer, which brings extra gains especially to low-resource languages. Pre-training wav2vec 2.0 models is, however, resource-intensive and hence re-training models for each task with different domains is impractical. With the large-scale multilingual data in VoxPopuli, we explore if scaling multilingual pre-training can take us towards the one-model-fits-all paradigm by alleviating the impacts of domain or language mismatch between pre-training and fine-tuning. We train wav2vec 2.0 models (wav2vec 2.0 Base with 95M parameters unless specified otherwise) on 10K-hour, 50K-hour and 100K-hour VoxPopuli data in 23 languages (denoted as "VP-10K", "VP-50K" and "VP-100K", respectively). We also train models with 4.5K-hour monolingual data (denoted as "VP-Mono-5K") for comparison. For quick verification, we use only part of the VoxPopuli unlabeled data for pre-training. We leave training the models on the full 400K-hour data to future work, which we expect to achieve even better performance.
In-domain pre-training We examine the conventional in-domain pre-training setting on the VoxPopuli ASR benchmark. We evaluate the VP-10K model, where the pre-training data is filtered so that it has no overlaps with the transcribed development and test sets. From Table 4, we see that pre-training using unlabeled data brings significant gains to all the languages (average 59% test WER reduction). The gains are most significant on the low-resource languages, where improvements are qualitative (for example, from nearly 100% test WER on Sl down to around 30%).
Out-of-domain pre-training We examine the out-of-domain pre-training setting using the Common Voice (CV) ASR corpus (Ardila et al., 2020). In contrast with the political-domain oral speech in VoxPopuli, CV consists of more fluent read speech of copyright-free sentences (for example, Wikipedia articles). We adopt the few-shot phoneme recognition setup on CV v3 from Rivière et al. (2020), with which domain adaptation is limited during fine-tuning due to the small data volume: it has a 1-hour train set, a 20-minute development set and a 1-hour test set for 10 languages, including 5 VoxPopuli ones. We present the performance of VP-Mono-5K, VP-10K and VP-100K with the m-CPC (Rivière et al., 2020) and XLSR (Conneau et al., 2020) baselines in Table 6, where phone error rate (PER) is reported. The XLSR baselines share the same wav2vec 2.0 architecture as our models but are trained with in-domain CV data. VP-Mono-5K outperforms XLSR-Mono and XLSR-10 on all 5 VoxPopuli languages (except for a tie on Es with XLSR-Mono). VP-100K outperforms XLSR-10 on 8 (9) out of the 10 languages. VP-100K (Large) overall performs competitively with XLSR-53, which leverages 52K-hour out-of-domain data in addition to the in-domain CV data. Notably, it outperforms XLSR-53 on Zh, which is covered by XLSR-53 but remote from the EU languages in VP-100K. This suggests the high generality of the speech representations VP-100K learned.
We also evaluate our multilingual model (VP-50K) under the normal setup (CV v5.1) and report test WER in Table 7. It is compared with supervised baselines from DeepSpeech-Polyglot (https://gitlab.com/Jaco-Assistant/deepspeech-polyglot), which leverage extended CV train sets and several other corpora for training as well as LM for decoding. Our model outperforms the baseline with fine-tuning on the standard CV train set (a subset of the baseline's train set), even when not using LM in decoding.
Out-of-language pre-training In the few-shot phoneme recognition setup (Table 6), VP-100K does not cover 5 of the 10 CV languages (Ky, Ru, Tr, Tt and Zh) in pre-training, but leverages data from 18 additional EU languages. It outperforms the in-domain in-language XLSR baselines on most of the uncovered languages (except Ky, which is a remote Central Asian language). Moreover, it performs more stably across all the 10 languages with a smaller variance (standard deviation) on PER.

Self-Training for ASR and ST
Self-training (Scudder, 1965) is a classical semi-supervised learning approach, where unlabeled data is equipped with pseudo-labels from a supervised model and then combined with labeled data for model training. We use the combination of EuroParl-ST (Iranzo-Sánchez et al., 2020) and CoVoST 2 (Wang et al., 2020b) for both ASR and ST labeled data in 3 languages (directions). The former is created from 2009-2012 EP plenary sessions and hence has the same domain as VoxPopuli. The latter is based on Common Voice v4, which has a different domain from VoxPopuli and dominates the combined train set. We train Transformer Base (Vaswani et al., 2017) supervised baselines and use 0.8K/3K-hour monolingual VoxPopuli unlabeled data (from 2013-2020 sessions only to avoid overlaps with EuroParl-ST) to self-train Transformer Large models. We upsample labeled data in self-training so that it has the same duration as the unlabeled one. We observe from Table 8 that self-training on VoxPopuli improves both in-domain ("EP") and out-of-domain ("CV") performance with similar magnitude most of the time. For ST, self-training helps to narrow the gap between end-to-end models and the cascaded ones (more labeled data available) without the addition of expensive labeled data.
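The upsampling of labeled data can be illustrated with a short sketch. It assumes that upsampling is done by repeating the labeled set a whole number of times, which the text does not specify; the function names and the example durations are hypothetical.

```python
def upsampling_factor(labeled_hours: float, unlabeled_hours: float) -> float:
    """How many times the labeled set must be repeated to match the unlabeled duration."""
    return max(1.0, unlabeled_hours / labeled_hours)


def build_training_mix(labeled, pseudo_labeled, labeled_hours, unlabeled_hours):
    """Concatenate the repeated labeled samples with the pseudo-labeled samples.

    Whole repetitions only (an assumption), so durations match approximately.
    """
    reps = round(upsampling_factor(labeled_hours, unlabeled_hours))
    return labeled * reps + pseudo_labeled
```

Balancing the two sources this way keeps the noisier pseudo-labels from dominating the training signal when the unlabeled set is several times larger than the labeled one.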

Weakly Supervised ST
We evaluate the quality of the weakly labeled ST data from our speech-to-speech alignment on the same benchmark as the self-training experiments. This also provides an indirect evaluation for our alignment pipeline since imprecise alignments hurt the ST label quality. We examine the performance of weakly supervised training as well as joint training using both labeled and weakly labeled data. We see from Table 8 that the former is on par with (or better than) the supervised baseline in the VoxPopuli domain ("EP") with 0.3x-1.8x more training data than the baseline. Joint training brings substantial gains to both in-domain ("EP") and out-of-domain ("CV") performance, and it outperforms self-training. This suggests that our weakly labeled data (0.4K hours) is much more informative and efficient than the pseudo-labeled data (3K hours) when combined with labeled data.

Related Work
Multilingual speech corpora LibriLight (Kahn et al., 2020b) currently represents the largest scale unlabeled speech corpus but it is limited to English. MLS (Pratap et al., 2020) is a recently released large-scale multilingual corpus of read speech in 8 languages, derived from LibriVox. MAILABS is also derived from LibriVox and has about 1000 hours available in 9 languages. While MLS and MAILABS are derived from audiobooks, VoxForge and Common Voice (Ardila et al., 2020) gather data via crowd-sourcing. VoxForge collected data in about 15 different languages with about 300 hours of speech in total; Common Voice currently supports 60 languages for a total of 7327 validated hours available. The CMU Wilderness dataset (Black, 2019) collects readings from the New Testament, with 700 different languages available. The IARPA Babel program collected data for 24 languages, mostly from conversational telephone speech. The dataset is however not released under an open license, and is focused on low-resource languages, with labeled data ranging from 25 to 65 hours per language.
Speech-to-Text and Speech-to-Speech Translation Apart from machine translation (Koehn, 2005), the European Parliament open data has fostered the development of corpora for speech-to-text translation and for simultaneous interpretation. EuroParl-ST (Iranzo-Sánchez et al., 2020) is a multilingual speech-to-text translation corpus with translations between 6 European languages (En, Fr, De, Es, It and Pt). Similarly, EPIC (Bendazzoli et al., 2005) is derived from the European Parliament with simultaneous interpretation speeches in Italian, English and Spanish. CIAIR (Tohyama et al., 2004) and STC (Shimizu et al., 2014) are simultaneous interpretation corpora between English and Japanese with a total of about 180 hours for the former, while the latter is currently unavailable for download. The MaSS dataset (Zanon Boito et al., 2020) also provides speech-to-speech alignments for about 8k utterances across 8 languages, for a total of about 23h of speech.

Conclusion
In this paper, we introduce a large-scale multilingual speech corpus, VoxPopuli, for representation learning, semi-supervised learning and interpretation. VoxPopuli provides the largest open unlabeled speech data to date, which has broad applications including unsupervised pre-training and self-training. VoxPopuli is also the first corpus for large amounts of open speech-to-speech interpretation data. We provide VoxPopuli ASR baselines and validate the versatility of VoxPopuli unlabeled data in semi-supervised learning under challenging out-of-domain settings. The corpus is available at https://github.com/facebookresearch/voxpopuli.