A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures followed by a description of the database specifications. We also share our experience and challenges faced during the database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results imply that the quality of audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable experiment reproducibility and ease the corpus usage, we also released an ESPnet recipe for our speech recognition models.


Introduction
We present an open-source Kazakh speech corpus (KSC) constructed to advance the development of speech and language processing applications for the Kazakh language. Kazakh is an agglutinative language with vowel harmony and belongs to the family of Turkic languages. During the Soviet period, the Kazakh language was overwhelmed by the Russian language, which caused a decline in Kazakh language usage (Dave, 2007). In the 1990s, it was declared an official language of Kazakhstan, and many initiatives were launched to increase the number of Kazakh speakers. Today, it is spoken by over 10 million people in Kazakhstan and by over 3 million people in other countries 1 . By introducing the KSC, we aim to accelerate the penetration of the Kazakh language into the Internet of things (IoT) technologies and to promote research in Kazakh speech processing applications.
Although several Kazakh speech corpora have been presented (Makhambetov et al., 2013; Shi et al., 2017; Mamyrbayev et al., 2019), there is no generally accepted common corpus. Most of them are either publicly unavailable or contain an insufficient amount of data to train reliable models. In particular, these databases are too small for building recent end-to-end models, which are extremely data hungry (Hannun et al., 2014). Consequently, different research groups usually conduct their experiments on internally collected data, which prevents the reproducibility and comparison of different approaches.
To address the aforementioned limitations, we created the KSC, containing around 332 hours of transcribed audio. It was crowdsourced through the Internet, where volunteers were asked to read sentences presented through a web browser. In total, we accepted over 153,000 utterances submitted from over 1,600 unique device IDs. The recordings were first checked manually, and when a sufficient amount of data was collected, were partially checked automatically using a speech recognition system. To the best of our knowledge, the KSC is the largest open-source speech corpus in Kazakh and is available for academic and commercial use upon request at this link 2 under the Creative Commons Attribution 4.0 International License 3 .
We expect that this database will be a valuable resource for research communities in both academia and industry. The primary application domains of the corpus are speech recognition, speech synthesis, and speaker recognition. To demonstrate the reliability of the database, we performed preliminary automatic speech recognition (ASR) experiments, achieving results that are promising and sufficient for practical use. We also provide a practical guide on the development of ASR systems for the Kazakh language by sharing the reproducible recipe and pretrained models 4 . The utilization of the database for speech synthesis and speaker recognition tasks is left for future work.
The rest of the paper is organized as follows. Section 2 provides a review of related works. Section 3 presents the KSC database and explains the database construction procedures in detail. Section 4 presents the speech recognition experiment setup and the obtained results. In Section 5, the obtained results and challenges are discussed. Lastly, Section 6 concludes this paper and mentions potential future work.

Related Works
In the past few years, interest in ASR has surged, driven by its new applications in smart devices, such as voice commands, voice search, message dictation, and virtual assistants (Yu and Deng, 2014). In response to this technological shift, many speech corpora have been introduced for various languages. For example, Du et al. (2018) presented a large-scale Mandarin corpus, and Mamyrbayev et al. (2020) described 126 hours of Kazakh data used to train E2E ASR systems. Khomitsevich et al. (2015) utilized 147 hours of bilingual Kazakh-Russian speech data to build code-switching ASR systems. Shi et al. (2017) released 78 hours of transcribed Kazakh speech data recorded by 96 students from China. The IARPA Babel project has released a Kazakh language pack 5 consisting of around 50 hours of conversational and 14 hours of scripted telephone speech. Unfortunately, the aforementioned databases are either publicly unavailable or of an insufficient size to build robust Kazakh ASR systems. Additionally, some of them are nonrepresentative; that is, they cover speakers from a narrow set of categories, such as the same region or age group. Furthermore, since most of these databases have been collected in optimal lab settings, they might be ineffective for real-world applications.
The emergence of crowdsourcing platforms and the growth in Internet connectivity have motivated researchers to employ crowdsourcing for annotated corpora construction. Unlike expert-based approaches, crowdsourcing tends to be cheaper and faster, though additional measures should be taken to ensure data quality (Snow et al., 2008; Novotney and Callison-Burch, 2010; Eskenazi et al., 2013). Furthermore, crowdsourcing makes it possible to gather a variety of dialects and accents from remote geographical locations and enables the participation of people with disabilities and of an advanced age, which would otherwise be impossible or too costly (Takamichi and Saruwatari, 2018). Inspired by this, we followed the best crowdsourcing practices to construct a large-scale, open-source speech corpus for the Kazakh language, as described in the following sections.

The KSC Construction
The KSC project was conducted with the approval of the Institutional Research Ethics Committee of Nazarbayev University. Each reader participated voluntarily and was informed of the data collection and use protocols through an online consent form.

Text Collection and Cleaning
We first extracted Kazakh textual data from various sources such as electronic books, laws, and websites, including Wikipedia, news portals, and blogs. For each website, we designed a specialized web crawler to improve the quality of the extracted text. The extracted texts were manually filtered to eliminate inappropriate content involving sensitive political issues, user privacy, violence, and so on. Additionally, we filtered out texts entirely consisting of Russian words. Texts consisting of mixed Kazakh-Russian utterances were kept, because there are many borrowed Russian words in Kazakh, and it is common practice among Kazakh speakers to code-switch between Kazakh and Russian (Khomitsevich et al., 2015). Next, we split the texts into sentences and removed sentences consisting of more than 25 words. Lastly, duplicate sentences were removed. The total number of extracted sentences was around 2.3 million.
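The filtering steps described above (dropping Russian-only sentences, capping sentence length at 25 words, and de-duplicating) can be sketched as follows. This is an illustrative reconstruction, not the project's actual cleaning code; in particular, the `is_kazakh` heuristic, which keeps only sentences containing Kazakh-specific Cyrillic letters, is our assumption about one simple way to discard purely Russian text.

```python
import re

def clean_corpus(texts, max_words=25):
    """Illustrative sketch of the corpus cleaning steps described above."""
    # Kazakh Cyrillic includes letters absent from Russian: ә, ғ, қ, ң, ө, ұ, ү, һ, і
    kazakh_only = set("әғқңөұүһі")

    def is_kazakh(sentence):
        # hypothetical heuristic: keep sentences with at least one Kazakh-specific letter
        return any(ch in kazakh_only for ch in sentence.lower())

    seen, kept = set(), []
    for text in texts:
        # naive sentence split on terminal punctuation
        for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
            sent = sent.strip()
            if not sent or len(sent.split()) > max_words:
                continue  # drop empty and overly long sentences
            if not is_kazakh(sent):
                continue  # drop sentences with no Kazakh-specific letters
            if sent not in seen:  # remove duplicates
                seen.add(sent)
                kept.append(sent)
    return kept
```

A real pipeline would also need the manual content filtering described above, which cannot be automated this simply.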

Text Narration and Checking
To narrate the extracted sentences, we developed a web-based speech recording platform capable of running on personal computers and smartphones. The platform randomly samples a sentence from the pool of extracted texts and presents it to a reader (see Figure 1). It also displays the recording status and statistics, such as the elapsed time and the total number of read sentences. Additionally, it has "pause" and "next" buttons to control the recording process. The readers were allowed to quit at any time. We recruited readers by advertising the project in social media, news, and open messaging communities on WhatsApp and Telegram. Only readers who were at least 18 years old were included, so that they could legally agree to participate in data collection. The audio was recorded at 48 kHz and 16 bits, but downsampled to 16 kHz and 16 bits for online publication. Following our experimental protocol, we did not store readers' personal information except for the geolocation coordinates, IP address, and device type.

Several native Kazakh transcribers were hired to check the quality of the recordings. The transcribers logged on to a special transcription checking platform and were provided with an audio segment and the corresponding sentence that a reader had read. The task was to check whether the reader had read the sentence according to the prompt, and to transcribe any deviations or other acoustic events based on a set of transcription instructions. As an additional quality measure, we hired a linguist who was assigned to supervise the transcribers and to randomly check the tasks they completed. To harmonize the transcriptions, the linguist also held "go through errors" sessions with the transcribers.
The transcribers were instructed to reject utterances containing obvious mispronunciations or severe noises, to convert numbers into words 6 , and to trim long silences at the beginning and end of the audio segments. Additionally, they were instructed to enclose partial repetitions and hesitations in parentheses, for example, '(he) hello' and '(ah)', and to indicate other non-verbal sounds produced by readers, such as sneezing and coughing, using a special '[noise]' token. Background noises were not labeled.
When the size of the accepted utterances reached 100 hours, we built an ASR system to automatically check the recordings. The system accepted only recordings perfectly matching the corresponding text prompts, that is, those with a 0% character error rate (CER); the remaining utterances were left to human transcribers.
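The 0% CER acceptance criterion can be expressed as below. This is a minimal sketch of the check, not the actual verification system: the Levenshtein-based CER implementation is generic, and `auto_accept` simply compares an ASR hypothesis against the prompt.

```python
def char_error_rate(reference, hypothesis):
    """Character error rate: Levenshtein edit distance divided by reference length."""
    r, h = reference, hypothesis
    # dynamic-programming edit distance over characters
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        curr = [i] + [0] * len(h)
        for j, hc in enumerate(h, 1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (rc != hc))  # substitution (cost 0 if equal)
        prev = curr
    return prev[-1] / max(len(r), 1)

def auto_accept(prompt, asr_hypothesis):
    """Accept a recording only if the ASR output matches the prompt exactly (0% CER)."""
    return char_error_rate(prompt, asr_hypothesis) == 0.0
```

Recordings failing this strict match would be routed back to the human transcribers, as described above.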

Database Specifications
The KSC database specifications are provided in Table 1. We split the data into three sets of non-overlapping speakers: training, validation, and test. While the training set recordings were collected from anonymous speakers, the validation and test sets were collected from identifiable speakers to ensure that they did not overlap with the training set, represented different age groups and regions, and were gender balanced (see Table 2). In total, around 153,000 utterances were accepted, yielding 332 hours of transcribed speech data. Note that the device IDs could not be used to represent the number of speakers in the training set, as several speakers might have used the same device or the same speaker might have used different devices. Therefore, the total number of speakers in the training set is not shown in Table 1. The whole database creation process took around four months, and the database size is around 38 GB.

The Kazakh writing system differs depending on the region where the language is spoken. For example, the Cyrillic alphabet is used in Kazakhstan and Mongolia, while an Arabic-derived alphabet is used in China. In the KSC, we presented all texts using the Cyrillic alphabet, which consists of 42 letters. The distribution of these letters in the KSC is given in Figure 2.
One of the important features of the KSC database is that it was collected in various environmental conditions (e.g. home, office, café, transport, and street) with diverse background noises, through mobile devices (e.g. phones and tablets) and personal computers, with and without headsets, which resembles realistic use-case scenarios. Consequently, our database enables the development and evaluation of ASR systems designed to operate in real-world voice-enabled applications, such as voice commands, voice search, message dictation, and so on.

The KSC database consists of audio recordings, transcripts, and metadata stored in separate folders. The audio and corresponding transcription filenames are the same, except that the audio recordings are stored as WAV files, whereas the transcriptions are stored as TXT files using the UTF-8 encoding. The metadata contain the data splitting information (training, validation, and test) and the speaker details (gender, age, region, device, and headphones) for the validation and test sets.
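Given the same-name WAV/TXT convention, a loader for the corpus might look like the following. The folder names used here ('Audio', 'Transcripts', and per-split subfolders) are illustrative assumptions, not the documented layout, and should be checked against the released corpus.

```python
from pathlib import Path

def load_split(corpus_root, split="train"):
    """Pair each WAV recording with its same-named UTF-8 TXT transcript.

    Assumes a hypothetical <root>/Audio/<split> and <root>/Transcripts/<split>
    layout; adjust the paths to match the actual corpus structure.
    """
    root = Path(corpus_root)
    pairs = []
    for wav in sorted((root / "Audio" / split).glob("*.wav")):
        txt = root / "Transcripts" / split / (wav.stem + ".txt")
        if txt.exists():  # keep only recordings that have a transcript
            transcript = txt.read_text(encoding="utf-8").strip()
            pairs.append((wav, transcript))
    return pairs
```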

Speech Recognition Experiment
To demonstrate the utility and reliability of the KSC database, we conducted speech recognition experiments using both the traditional deep neural network-hidden Markov model (DNN-HMM) and recently proposed end-to-end (E2E) architectures. We did not compare the architectures in depth or perform thorough architecture searches for either the DNN-HMM or the E2E systems, since this falls outside the scope of this paper.

Experimental Setup
All ASR models were trained with a single V100 GPU running on the NVIDIA DGX-2 server using the training set. All hyper-parameters were tuned using the validation set. The best-performing model was evaluated using the test set. All results are reported without lattice or n-best hypotheses rescoring, and no external data have been used.

The DNN-HMM ASR
The DNN-HMM ASR system was built using the Kaldi framework (Povey et al., 2011). We followed the Wall Street Journal (WSJ) recipe with the "nnet3+chain" setup and other recent Kaldi developments. The acoustic model was constructed using factorized time-delay neural networks (TDNN-F) (Povey et al., 2018) trained with the lattice-free maximum mutual information (LF-MMI) (Povey et al., 2016) criterion. The inputs were Mel-frequency cepstral coefficient (MFCC) features with cepstral mean and variance normalization, extracted every 10 ms over a 25 ms window. In addition, we applied data augmentation using speed perturbation (SpeedPerturb) (Ko et al., 2015) at rates of 0.9, 1.0, and 1.1, and spectral augmentation (SpecAugment) (Park et al., 2019).
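Speed perturbation resamples the waveform so that a rate of 1.1 shortens the utterance and 0.9 lengthens it, effectively tripling the training data at the three rates above. The sketch below uses simple linear interpolation for illustration only; Kaldi's implementation uses a proper anti-aliased polyphase resampler.

```python
import numpy as np

def speed_perturb(wave, rate):
    """Speed perturbation by resampling (linear-interpolation sketch).

    rate > 1 speeds the utterance up (shorter output); rate < 1 slows it down.
    """
    n_out = int(round(len(wave) / rate))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)

# three copies of each training utterance, as in the setup above:
# augmented = [speed_perturb(w, r) for r in (0.9, 1.0, 1.1)]
```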
We employed a graphemic lexicon because of the strong correspondence between word spelling and pronunciation in Kazakh (e.g. "hello → h e l l o"). The graphemic lexicon was constructed by extracting all words in the training set, which resulted in 157,191 unique words. During the decoding stage, we employed a 3-gram language model 7 (LM) with Kneser-Ney smoothing built using the SRILM toolkit (Stolcke, 2002). The 3-gram LM was trained using the transcripts of the training set and the vocabulary covering all the words in the graphemic lexicon.
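Because the lexicon is graphemic, it can be generated mechanically from the training transcripts, mapping each word to its letter sequence. A minimal sketch (the sample words are illustrative, not from the corpus):

```python
def build_graphemic_lexicon(transcripts):
    """Map each training-set word to its space-separated letter sequence,
    mirroring the 'hello -> h e l l o' convention described above."""
    words = sorted({w for line in transcripts for w in line.split()})
    return {w: " ".join(w) for w in words}

lexicon = build_graphemic_lexicon(["сәлем әлем", "сәлем достар"])
# lexicon["сәлем"] == "с ә л е м"
```

Applied to the full training transcripts, this yields one pronunciation entry per unique word, matching the 157,191-entry lexicon described above.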

The E2E ASR
The E2E ASR systems were built using the ESPnet framework (Watanabe et al., 2018). We followed the WSJ recipe to train two different encoder-decoder architectures based on recurrent neural networks (RNN) (Bahdanau et al., 2015) and the Transformer (Vaswani et al., 2017). Both architectures were jointly trained with the connectionist temporal classification (CTC) (Graves et al., 2006) objective function under the multi-task learning framework (Kim et al., 2017). The input speech was represented as 80-dimensional filterbank features with pitch, computed every 10 ms over a 25 ms window. For both E2E architectures, the acoustic features were first processed by a few initial blocks of the VGG network (Simonyan and Zisserman, 2015). Since Kazakh is a morphologically rich language, it is susceptible to severe data sparseness.
To overcome this issue, we employed character-level output units in both architectures. In total, we used 45 distinct output units consisting of the 42 letters of the Kazakh alphabet and three special tokens: <unk>, <space>, and the <blank> used in CTC.
The E2E ASR systems do not require a lexicon when modeling with grapheme-based output units (Sainath et al., 2018). The character-level LM was built using the transcripts of the training set as a two-layer RNN with 650 long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) units per layer. We utilized the LSTM LM during the decoding stage via shallow fusion (Gülçehre et al., 2015) for both E2E architectures. In addition, we augmented the training data using the SpeedPerturb and SpecAugment techniques. For decoding, we set the beam size to 30 and the LSTM LM interpolation weight to 1.
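The masking step of SpecAugment can be sketched as below: random frequency bands and time spans of the input spectrogram are zeroed out during training. The mask counts and widths are illustrative defaults rather than the hyper-parameters used in our experiments, and the time-warping component of the full method is omitted.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=40, rng=None):
    """Zero out random frequency and time bands of a (time, freq) spectrogram.

    Minimal sketch of the masking in SpecAugment (Park et al., 2019);
    time warping is not included.
    """
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()  # leave the caller's features untouched
    n_frames, n_bins = spec.shape
    for _ in range(num_freq_masks):
        w = rng.integers(0, freq_width + 1)       # mask width in bins
        f0 = rng.integers(0, max(1, n_bins - w))  # mask start bin
        spec[:, f0:f0 + w] = 0.0
    for _ in range(num_time_masks):
        w = rng.integers(0, time_width + 1)        # mask width in frames
        t0 = rng.integers(0, max(1, n_frames - w)) # mask start frame
        spec[t0:t0 + w, :] = 0.0
    return spec
```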
E2E-RNN. The encoder module of the RNN-based E2E ASR system consists of three bidirectional LSTM layers with 1,024 units per direction per layer. The decoder module is a single unidirectional LSTM with 1,024 units. We trained the model for 20 epochs using the Adadelta optimizer with the initial learning rate set to 1 and the batch size set to 30. The interpolation weight for the CTC objective was set to 0.5.
E2E-Transformer. The Transformer-based E2E ASR system consists of 12 encoder and 6 decoder blocks. We set the number of heads in the self-attention layer to 4 with 256-dimensional hidden states and the feed-forward network dimensions to 2,048. We set the dropout rate and label smoothing to 0.1. The model was trained for 160 epochs using the Noam optimizer (Vaswani et al., 2017) with an initial learning rate of 10 and 25,000 warm-up steps. The batch size was set to 96. We report results on an average model constructed using the last 10 checkpoints. The interpolation weight for the CTC objective was set to 0.3.
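The Noam schedule referenced above scales the learning rate as factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5), i.e. a linear warm-up followed by inverse square-root decay. The sketch below plugs in the hyper-parameters stated in this section (the "initial learning rate of 10" is the scaling factor in this formulation); it reproduces the schedule shape, not our exact training code.

```python
def noam_lr(step, model_dim=256, warmup_steps=25000, factor=10.0):
    """Noam learning-rate schedule (Vaswani et al., 2017).

    Rises linearly for the first `warmup_steps`, then decays as 1/sqrt(step).
    """
    step = max(step, 1)  # guard against step 0
    return factor * model_dim ** -0.5 * min(step ** -0.5,
                                            step * warmup_steps ** -1.5)
```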

Experimental Results
The experimental results are presented in Table 3 in terms of both the character error rate (CER) and word error rate (WER). All the ASR models achieved competitive results on both the validation and test sets. We found that the validation set is more challenging than the test set. When compared without SpecAugment, the performance of the DNN-HMM model (ID 1) is slightly better than the E2E-RNN (ID 5), but inferior to the E2E-Transformer (ID 9). We could not achieve any improvements on the DNN-HMM model (ID 2) using SpecAugment despite trying different hyper-parameter tuning recommendations (Zhou et al., 2020). Overall, the best CER and WER results are achieved by the E2E-Transformer (ID 10), followed by the E2E-RNN (ID 6) and then the DNN-HMM (ID 1). We observed that the LM fusion significantly improves the performance of both E2E models. For example, 35% and 36% relative WER improvements are achieved on the test set for the RNN (from ID 3 to ID 4) and Transformer (from ID 7 to ID 8) models, respectively. Furthermore, both data augmentation techniques, SpeedPerturb and SpecAugment, are highly effective for the Kazakh E2E ASR, where additional improvements are achieved. For example, when models without and with data augmentations are compared, 36% and 26% relative WER improvements are achieved on the test set for the RNN (from ID 4 to ID 6) and Transformer (from ID 8 to ID 10) models, respectively.
These experimental results successfully demonstrate the utility of the KSC database for the speech recognition task. We leave the exploration of the optimal hyper-parameter settings and detailed comparison of different ASR architectures for future work.

Discussion
1) Data sparsity. Kazakh speech recognition is considered challenging due to the agglutinative nature of the language, where words are formed by adding derivational and inflectional affixes to stems in a specific order. As a result, the vocabulary size can grow considerably, resulting in a data sparsity problem, especially for models operating at the word level, such as our DNN-HMM architecture. A potential solution is to break words down into finer-grained linguistic units, such as characters or subword units (Sennrich et al., 2016). We investigated the impact of other output unit sizes on the performance of the Kazakh E2E-Transformer ASR and did not observe any considerable improvement over the character-level outputs (see Figure 3). The output units were generated using the byte pair encoding (BPE) algorithm implemented in the SentencePiece (Kudo and Richardson, 2018) tokenizer.

Table 4: The top five confusion pairs and insertion, deletion, and substitution errors (transliterations and glosses in parentheses).

1. Confusion pair: жоғарғы ⇒ жоғары (jogargy ⇒ jogary), "upper" or "higher" ⇒ "top" or "high". Insertion: мен (men), "I" or "with". Deletion: де (de), "too" (after a thin-vowelled word). Substitution: да (da), "too" (after a thick-vowelled word).
2. Confusion pair: өзеннiң ⇒ өзенiнiң (uzennin ⇒ uzeninin), "of a river" ⇒ "of the river". Insertion: бұл (bul), "this" or "these". Deletion: ал (al), an imperative of "take" or "whereas". Substitution: де (de), "too" (after a thin-vowelled word).
3. Confusion pair: ас ⇒ ақ (as ⇒ aq), "meal" ⇒ "white". Insertion: бiр (bir), "one". Deletion: да (da), "too" (after a thick-vowelled word). Substitution: мен (men), "I" or "with".
4. Confusion pair: бағы ⇒ баға (bagy ⇒ baga), "the garden" ⇒ "price". Insertion: да (da), "too" (after a thick-vowelled word). Deletion: әр (ar), "every". Substitution: жылы (jyly), "the year" or "warm".
5. Confusion pair: болу ⇒ болуы (bolu ⇒ boluy), "being" ⇒ "the being". Insertion: ақ (aq), "white". Deletion: бiр (bir), "one". Substitution: бiр (bir), "one".

2) Code-switching. Another challenge is the Kazakh-Russian code-switching practice, which is common in daily communication as the majority of Kazakhs are bilingual.
Mostly, inter-sentential and intra-sentential types of code-switching are practiced; however, intra-word code-switching is also possible. For example, one can say "Men magazinge bardym" ("I went to a store"), where the Russian word "magazin" is appended with the Kazakh inflection "-ge", representing the preposition "to". Furthermore, while the spelling of Kazakh words closely matches their pronunciation, this is not the case for Russian words; for example, the letter "o" is sometimes pronounced as /a/, which might confuse an ASR system. We observed that our ASR system is ineffective on code-switched utterances. Therefore, future work should focus on alleviating these errors.
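For reference, the core of the BPE algorithm used for the subword-unit experiments in the data-sparsity discussion repeatedly merges the most frequent adjacent symbol pair in the training vocabulary. The toy sketch below illustrates that merge loop only; the actual experiments used the SentencePiece implementation.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE sketch: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)  # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs across the vocabulary
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # rewrite every word with the chosen pair merged into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```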
3) Data efficiency. To analyze the data efficiency, that is, the increase in performance due to the addition of new training data, we trained our best E2E-Transformer ASR system using different amounts of data. In particular, we first randomly sampled 40 hours of data and kept increasing the amount until the entire training set was covered. The experimental results indicate that the WER performance has not converged yet and further data collection might be effective (see Figure 4). Therefore, we plan to continue the data collection process.

4) ASR output analysis.
We inspected the recognized outputs of the best E2E-Transformer model to identify the most challenging characters and words. Table 4 lists the top five confusion pairs and insertion, deletion, and substitution errors. Most of them are commonly used words, such as conjunctions and numbers. In addition, we also inspected the most confused character pairs (see Figure 5). We observed that the Kazakh ASR system confuses characters with a similar pronunciation, such as "н" (/n/) and "ң" (/N/), "i" (/I/) and "ы" (/@/), and so on.

5) Performance comparison. We cannot directly compare our results to previous works; however, our WER results are appealing. For example, Mamyrbayev et al. (2019) used 76 hours of speech data to build a DNN-HMM ASR system, which achieved 32.7% WER on clean read speech. Similarly, the DNN-HMM ASR system built using 78 hours of data in (Shi et al., 2017) achieved 25.1% WER on read speech. On the other hand, Mamyrbayev et al. (2020) achieved 17.8% WER on clean read speech using an E2E ASR system trained on 126 hours of data. In comparison, our best model achieved 8.7% WER on the test set.

6) Benefit for other Turkic languages. We also envision that the KSC can be utilized in cross-lingual transfer learning techniques (Das and Hasegawa-Johnson, 2015) to improve ASR systems for other Turkic languages, such as Kyrgyz and Tatar.
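The error-type breakdown used in the output analysis (item 4 above) can be obtained by backtracing a word-level Levenshtein alignment between reference and hypothesis. The sketch below is a generic reconstruction of that bookkeeping, not our scoring code; standard toolkits such as Kaldi and ESPnet provide equivalent scoring scripts.

```python
def align_errors(ref_words, hyp_words):
    """Backtrace a word-level Levenshtein alignment into substitution,
    insertion, and deletion lists."""
    R, H = len(ref_words), len(hyp_words)
    # edit-distance DP table
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]))
    # walk back from the corner, recording each error type
    subs, ins, dels = [], [], []
    i, j = R, H
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])):
            if ref_words[i - 1] != hyp_words[j - 1]:
                subs.append((ref_words[i - 1], hyp_words[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins.append(hyp_words[j - 1])
            j -= 1
        else:
            dels.append(ref_words[i - 1])
            i -= 1
    return subs, ins, dels
```

Aggregating these lists over a test set and sorting by frequency yields tables like Table 4.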

Conclusion
In this work, we presented the KSC database containing around 332 hours of transcribed speech data. It was developed to advance Kazakh speech processing applications, such as speech recognition, speech synthesis, and speaker recognition. We described the database construction procedures and discussed challenges that should be addressed in future work. The described methodologies might benefit other researchers planning to build a speech corpus for a low-resource language. The database is freely available for any purpose, including research and commercial use. We also conducted preliminary speech recognition experiments using both the traditional hybrid DNN-HMM and recently proposed E2E ASR architectures. To ease the database usage and ensure the reproducibility of experiments, we split it into three non-overlapping sets (training, validation, and test) and released our ESPnet recipe. The detailed exploration of better ASR settings, as well as the adaptation of the database to other applications, is left for future work.