Language ID Prediction from Speech Using Self-Attentive Pooling

This memo describes the NTR-TSU submission for the SIGTYP 2021 Shared Task on predicting language IDs from speech. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) pipeline. For many low-resource and endangered languages, only single-speaker recordings may be available, which creates a need for domain- and speaker-invariant language ID systems. In this memo, we show that a convolutional neural network with a self-attentive pooling layer shows promising results for the language identification task.


Introduction
Spoken Language Identification (LID) is the process of classifying the language spoken in a speech recording and is an important step in a multilingual Automated Speech Recognition (ASR) pipeline.
Differences between languages exist at all linguistic levels and range from marked, easily identifiable distinctions (such as the use of entirely different words) to more subtle variations that might have been lost or gained through language contact. The latter end of the range is a challenge not only for automatic LID systems but also for linguistics itself.
In this memo, we show that a convolutional neural network with a self-attentive pooling layer shows promising results for the language identification task in a low-resource setting. The system described herein is identical to the one simultaneously submitted to the language identification track of the Low-Resource ASR challenge at the Dialog 2021 conference, although the dataset is completely different.

Previous work
The first works on LID date back at least to the mid-seventies, when Leonard and Doddington (1974) explored the frequencies of occurrence of certain reference sound units in different languages.
Previously developed LID approaches include:
- Purely acoustic LID, which aims to capture the essential differences between languages by modeling distributions over a compact representation of the raw speech signal directly.
- Phonotactic LID, which relies on the relative frequencies of sound units (phonemes/phones) and their sequences in speech.
- Prosodic LID, which uses tone, intonation, and prominence, typically represented as a pitch contour.
- Word-level LID, which uses fully-fledged large-vocabulary continuous speech recognizers (LVCSR) to decode an incoming utterance into strings of words and then applies written language identification.
In the last ten years, intermediate-dimensional vector representations similar to i-vectors (Dehak et al., 2011) have become widely used for language identification.

Model
Similar to the work of Koluguri et al. (2020), the model is based on 1D convolutions, namely the QuartzNet ASR architecture (Kriman et al., 2020), which comprises an encoder and a decoder.

Encoder
The encoder is the QuartzNet BxR model shown in Figure 1: it has B blocks, each with R sub-blocks (Kriman et al., 2020). The first block is fed with an MFSC coefficient vector of length 40. Each sub-block applies the following operations: a 1D convolution, batch norm, ReLU, and dropout. All sub-blocks in a block have the same number of output channels, and the blocks are connected with residual connections (Kriman et al., 2020). A minimal sketch of one such sub-block is given below.
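The following PyTorch sketch is our own illustration of this structure, not the authors' code: it uses plain 1D convolutions, whereas the actual QuartzNet sub-blocks use time-channel separable convolutions, and the kernel size and dropout rate are placeholder values.

```python
import torch
import torch.nn as nn

class SubBlock(nn.Module):
    """One QuartzNet-style sub-block: 1D conv -> batch norm -> ReLU -> dropout."""
    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int = 33, dropout: float = 0.1):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)  # preserve time length
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.drop(self.relu(self.bn(self.conv(x))))

class Block(nn.Module):
    """A block of R sub-blocks with a (simplified) residual connection
    around the whole block; all sub-blocks share the channel count."""
    def __init__(self, channels: int, r: int):
        super().__init__()
        self.subs = nn.Sequential(*[SubBlock(channels, channels) for _ in range(r)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.subs(x) + x  # residual connection
```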
Self-attentive pooling decoder
Similar to Cai et al. (2018) and Chowdhury et al. (2018), we believe that not all frames contribute equally to the utterance-level representation. Thus, we use a self-attentive pooling (SAP) layer introduced by Cai et al. (2018) to pay more attention to the frames that are more important. Given the frame-level feature maps $\{x_1, \dots, x_T\}$, each frame is first fed into a one-layer perceptron to obtain a hidden representation

$$h_t = \tanh(W x_t + b),$$

and the normalized importance of each frame is measured as the similarity of $h_t$ to a learnable context vector $\mu$:

$$w_t = \frac{\exp(h_t^\top \mu)}{\sum_{\tau=1}^{T} \exp(h_\tau^\top \mu)}.$$

After that, the utterance-level representation $e$ can be generated as a weighted sum of the frame-level feature maps based on the learned weights:

$$e = \sum_{t=1}^{T} w_t x_t.$$
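A minimal PyTorch sketch of this pooling layer (our own illustration, assuming frame-level features of shape (batch, channels, time); the attention dimension of 256 matches the setting reported in the optimization section):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Self-attentive pooling (Cai et al., 2018): collapse (batch, C, T)
    frame-level features into a (batch, C) utterance-level embedding."""
    def __init__(self, channels: int, attention_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(channels, attention_dim)            # h_t = tanh(W x_t + b)
        self.context = nn.Parameter(torch.randn(attention_dim))   # context vector mu

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)                    # (batch, T, C)
        h = torch.tanh(self.proj(x))             # (batch, T, attention_dim)
        scores = h @ self.context                # (batch, T)
        w = F.softmax(scores, dim=1)             # attention weights w_t
        return (w.unsqueeze(-1) * x).sum(dim=1)  # e = sum_t w_t * x_t
```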

Loss Function
We used the cross-entropy loss function for this task, as sketched below.
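In PyTorch terms, this amounts to applying nn.CrossEntropyLoss to the logits computed from the pooled utterance embedding (a sketch under assumed dimensions; the 16-way linear classifier head and the embedding size are our illustration):

```python
import torch
import torch.nn as nn

num_languages = 16   # languages in the released training data
emb_dim = 256        # assumed size of the pooled utterance embedding

classifier = nn.Linear(emb_dim, num_languages)   # maps embedding to class logits
criterion = nn.CrossEntropyLoss()

embeddings = torch.randn(8, emb_dim)             # a batch of pooled embeddings
labels = torch.randint(0, num_languages, (8,))   # gold language IDs
loss = criterion(classifier(embeddings), labels)
```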
Experiments

Datasets and tasks
For training models, speech data from the CMU Wilderness Dataset (Black, 2019) were used. The dataset contains read speech from the Bible in 699 languages, usually recorded from a single speaker; the training data were released in the form of derived MFCCs. There are 16 languages in the released training data, with 4,000 utterances per language; Table 1 summarizes the languages in the dataset. The evaluation (validation and test) data come from different sources, in particular the Common Voice project, several OpenSLR corpora (SLR24 (Juan et al., 2014a, 2014b), SLR35 and SLR36 (Kjartansson et al., 2018), SLR64, SLR66, and SLR79 (He et al., 2020)), and the Paradisec collection. Validation and test data consist of 8,000 utterances, 500 for each language.

Optimization and training process
We used an attention vector size of 256. Models were trained until they reached a plateau on the validation set. Training used the Stochastic Gradient Descent optimizer with an initial learning rate of 0.005 and cosine annealing decay to 1e-4; a sketch of this setup follows.
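A minimal PyTorch sketch of this optimization setup (our own illustration; the epoch count and the stand-in model are placeholders, since training actually ran until a validation plateau):

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(256, 16)   # stand-in for the QuartzNet encoder + SAP decoder
num_epochs = 50              # hypothetical; training ran until a validation plateau

optimizer = SGD(model.parameters(), lr=0.005)
# cosine annealing from the initial 0.005 down to 1e-4
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-4)

for epoch in range(num_epochs):
    ...                      # one training epoch over the data
    scheduler.step()
```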

Results and Discussion
We experimented with the SpecAugment augmentation introduced by Park et al. (2019) and ran experiments both with and without it; a sketch of the masking is given below.
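A sketch of how such masking can be applied with torchaudio (our own illustration; SpecAugment also includes time warping, which is omitted here, and the mask parameters are placeholders rather than the settings used in our experiments):

```python
import torch
import torchaudio.transforms as T

# frequency and time masking as in SpecAugment (Park et al., 2019)
spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=8),
    T.TimeMasking(time_mask_param=25),
)

features = torch.randn(1, 40, 300)   # (batch, mel/MFSC bins, frames)
augmented = spec_augment(features)
```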
The system described above achieved the following results on the validation set (see Table 2). Somewhat surprisingly, detection of most languages was better without SpecAugment, with Sundanese, Portuguese, Russian, and Iban being exceptions. Iban was not detected at all without augmentation. We can hypothesize that SpecAugment is more favorable for detecting Indo-European and Austronesian languages than for other language families; this hypothesis requires further research.
Looking at the confusion matrix (Figure 2), we can see that most language samples are classified as English, Kabyle, or Telugu, independent of the language family. This suggests that speech features more prominent than the language itself hinder language identification. Given the nature of the training set, these may be related to the gender of the readers.


Figure 2: The confusion matrix for the validation set

Table 1: Summary of languages in the dataset