WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal Classification Paradigm

Automatic Speech Recognition (ASR) systems are evaluated using Word Error Rate (WER), which is calculated by counting the errors between the ground truth and the ASR system's transcription. This calculation, however, requires manual transcription of the speech signal to obtain the ground truth. Since transcribing audio signals is a costly process, Automatic WER Evaluation (e-WER) methods have been developed which attempt to predict the WER of a speech system by relying only on the transcription and the speech signal features. While WER is a continuous variable, previous works have shown that posing e-WER as a classification problem is more effective than regression. However, when converting to a classification setting, these approaches suffer from heavy class imbalance. In this paper, we propose a new balanced paradigm for e-WER in a classification setting. Within this paradigm, we also propose WER-BERT, a BERT based architecture with speech features for e-WER. Furthermore, we introduce a distance loss function to tackle the ordinal nature of e-WER classification. The proposed approach and paradigm are evaluated on the Librispeech dataset and a commercial (black box) ASR system, Google Cloud's Speech-to-Text API. The results and experiments demonstrate that WER-BERT establishes a new state-of-the-art in automatic WER estimation.


Introduction
ASR systems are now ubiquitous. They are available across applications such as voice assistants, assisted living and hands-free device usage. However, with the widespread usage of ASR systems comes a heavy need for ASR evaluation as well, in order to select, compare or improve alternative ASR systems. WER is widely considered the standard metric for ASR evaluation. A higher WER implies a higher percentage of errors between the ground truth and the transcription from the system. WER is calculated by aligning the two text segments using string alignment in a dynamic programming setting. The formula is as follows:

$$\mathrm{WER} = \frac{\mathrm{ERR}}{N} \tag{1}$$

where ERR is the sum of errors (insertions, deletions or substitutions) between the transcription and the ground truth, and N is the number of words in the ground truth. As evident from this equation, the ground truth is required to count the errors, and hence to compute WER. However, manual transcription of speech at the word level is an expensive and slow process. Hence, automatic ASR evaluation is much needed, but few works have addressed it. Furthermore, to effectively train, evaluate and judge the performance of an ASR system, WER needs to be calculated on an adequate number of hours of data. As the test set grows, a more accurate estimate of WER becomes possible. Since automatic WER evaluation does not have the bottleneck of manual transcription, it can be calculated over large test sets, leading to more accurate estimates.

*The authors contributed equally to the work.

Because of the immense popularity of attention based architectures for text classification (Rao, 2019, 2020a; Rao et al., 2020; Madasu and Rao, 2020b), we propose the transformer (Vaswani et al., 2017) encoder architecture, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), for e-WER. BERT is pretrained on huge amounts of open domain language and is extensively used for its effectiveness in natural language understanding tasks. By pretraining on such data, the model gains knowledge of general domain language structure, which aids in predicting speech errors, since these are typically observed as a deviation from the general syntax and semantics of a sentence.

Figure 1: Distribution of WER classes of the 100hr dataset, using the proposed Balanced approach and using the Elloumi et al. (2018) imbalanced approach.

While previous approaches address e-WER (Ali and Renals, 2018; Elloumi et al., 2018, 2019) in classification settings, their models suffer from gross imbalance in the WER classes. To address this issue, we present a training framework which always trains on equal sized classes, no matter the true WER distribution. Additionally, these previous e-WER classification approaches assume that there is no inherent relative ordering to the classes. However, this is not the case with WER classification, since a misclassification closer to the ground truth will lead to a lower mean absolute error (MAE) in the WER prediction. Such classification tasks are called ordinal classification (Frank and Hall, 2001).
The overall contributions of our paper can be summarized as follows: (i) A new balanced paradigm for training WER estimation as a classification problem, which addresses the label imbalance issue faced by previous works.
(ii) WER-BERT: we find that language model information is helpful in predicting WER and hence propose a BERT based architecture for e-WER.
(iii) A distance loss to address the ordinal nature of WER classification, which penalizes misclassifications differently based on how far the predicted class is from the true class.

Related Work
While the importance of an automatic WER prediction system is immense, few works have addressed it directly. Related works, such as those exploring word-level confidence in ASR predictions, are abundant (Seigel and Woodland, 2011; Huang et al., 2013). There have also been works predicting the errors or error estimates in some form (Hori, 2015, 2017; Seigel and Woodland, 2014). These approaches predict either some of the errors that make up WER or alternative metrics for rating ASR systems, such as accuracy or error type classification. However, they do not compute the complete WER score. TranscRater (Jalalvand et al., 2016) was one of the first works aimed at predicting WER directly. It proposes a neural network in a regression setting trained on various features such as parts of speech, language model, lexicon and signal features. However, more recent approaches (Ali and Renals, 2018; Elloumi et al., 2018, 2019) phrase WER prediction as a classification problem. Ali and Renals (2018) propose two types of models based on the input available: the glassbox model, which uses internal features of the target ASR system such as its confidence in transcribing the audio clip, and the blackbox model, which only uses the transcripts and other features generated from the transcript. Our proposed approach is trained in a blackbox setting, which is a harder task since features such as average log likelihood and confidence give a good indication of how many errors may have occurred during ASR transcription. Furthermore, they train a bag-of-words model along with additional audio related features. Elloumi et al. (2018, 2019) build a CNN based model for WER classification. We built models based on these works as baselines to evaluate WER-BERT's performance. They are further explained in Sections 4 and 6.

ASR errors often make a transcription ungrammatical or semantically unsound. Identifying such constructs is also the focus of the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019b,a). CoLA is a dataset intended to gauge the linguistic competence of models by making them judge the grammatical acceptability of a sentence. CoLA is also part of the popular GLUE benchmark datasets for Natural Language Understanding (Wang et al., 2018). BERT (Devlin et al., 2018) is known for outperforming previous state-of-the-art models on GLUE, including on the CoLA dataset.

Dataset
For our experiments, we have used the Librispeech dataset (Panayotov et al., 2015), which is a diverse collection of audio book data along with the ground text. It has around 1000 hours of audio recordings with different levels of complexity. We pass these audio clips through an ASR system to get its transcripts, and the WER is calculated by comparing them with the ground text. This paper reports findings from experiments run with Google Cloud's Speech-to-Text API. We chose this commercial ASR system, rather than reporting results on an internal ASR system, since it is easily accessible through the google-api and the results are reproducible. For our experiments, we have used the 10 and 100 hour datasets and made a 60:20:20 split into train, dev and test sets for each dataset. As can be seen in Table 3 of Section B of the Appendix, the characteristics of the 100 and 10 hour datasets are quite different. However, within a dataset, the train, dev and test sets have similar distributions of WER and other characteristics. Table 1 lists a few examples, with each row having the ground text, the transcript obtained from Google Cloud's Speech-to-Text API, the True WER and the WER predicted by our proposed model.
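The True WER labels described above come from comparing each transcript with its ground text. A minimal sketch of that computation is given below, assuming whitespace tokenization and a standard word-level Levenshtein alignment; the function name `word_error_rate` is illustrative and is not taken from the paper. The two sentences are the first example listed in Table 1.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return ERR / N, where ERR is the word-level edit distance and N = len(reference)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)  # 0 cost if the words match
            deletion = prev[j] + 1                 # reference word missing in hypothesis
            insertion = curr[j - 1] + 1            # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = ("one historian says that an event was produced by napoleon's power "
             "another that it was produced by alexander's")
hypothesis = ("when is dorian says that an event was produced by napoleon's power "
              "another that it was produced by alexander's")
wer = word_error_rate(reference, hypothesis)  # 3 errors over 18 reference words
```

Here "one historian" becomes "when is dorian" (two substitutions and one insertion), so the sketch yields 3/18, in line with the 16.7 reported as True WER for this example. Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.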

Single Task and Double Task for WER Estimation
As shown in Equation 1, the WER of an utterance is the fraction obtained by dividing two integers: the errors per sentence (ERR), which is the total number of insertions, deletions and substitutions needed to convert an ASR's transcript to the ground text, and the word count of the ground text (N). Since the WER of a sentence is a continuous variable, mostly between 0 and 1, a common way to model it is through a regression model. Elloumi et al. (2018, 2019) instead present a way to turn this into a classification problem for e-WER. They experiment with various combinations of text and audio signal inputs and show that a classification approach outperforms its corresponding regression approach trained on the same inputs. Elloumi et al. (2018)'s approach estimates WER directly with a 6 class classification model (with classes corresponding to 0%, 25%, 50%, 75%, 100% and 150%). Once the model is trained, the predictions are calculated as follows:

$$\mathrm{WER}_{Pred}(s) = P_{softmax}(s) \cdot \mathrm{WER}_{fixed} \tag{2}$$

where $s$ is a sample, $\mathrm{WER}_{Pred}(s)$ is the predicted WER for $s$, $P_{softmax}(s)$ is the softmax probability distribution output by the classification model, $\mathrm{WER}_{fixed} = [0, 0.25, 0.5, 0.75, 1.0, 1.5]$ is the fixed vector and '·' is the dot product operator. We call this approach the Single Task method for e-WER.

Alternatively, Ali and Renals (2018) present another classification approach for e-WER. They argue that the calculation of WER relies on two distinct quantities, ERR and N. Since both of them are discrete integers, they propose two independent classification problems: one to estimate the errors in the sentence, $\mathrm{ERR}_{est}$, and one to estimate the word count of the ground text, $N_{est}$. The predicted WER is then calculated as $\mathrm{ERR}_{est} / N_{est}$. We call this approach the Double Task method for e-WER.

Google Cloud's Speech-to-Text API: https://cloud.google.com/speech-to-text
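The two prediction rules above can be sketched in a few lines. This is only an illustration under stated assumptions: the logits are made-up values, and the helper names (`softmax`, `single_task_wer`, `double_task_wer`) are hypothetical rather than taken from either paper.

```python
import math

# Fixed class values of the Single Task setting (from the text above).
WER_FIXED = [0.0, 0.25, 0.5, 0.75, 1.0, 1.5]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def single_task_wer(probs, wer_fixed=WER_FIXED):
    """Equation 2: dot product of the class probabilities and the fixed WER vector."""
    return sum(p * w for p, w in zip(probs, wer_fixed))


def double_task_wer(err_est, n_est):
    """Double Task: ratio of the estimated error count to the estimated word count."""
    return err_est / n_est


probs = softmax([2.0, 1.0, 0.0, -1.0, -2.0, -3.0])  # hypothetical model logits
pred = single_task_wer(probs)
```

Note that the Single Task prediction is an expectation over classes, so it can fall between the fixed class values even though the model itself is a classifier.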

The Problem of Data Imbalance
While these approaches had good results, both of them suffer from the problem of imbalanced classes. As can be seen from Figure 2, the ERR and the WER classes are highly imbalanced. The nature of WER is such that this task is very unlikely to be balanced for any data and any ASR system. This imbalance leads to poor performance because certain WER or ERR classes have very few samples. With Elloumi et al. (2018), all the true WERs are 'cast' to the nearest multiple of 0.25. It can be seen from Figure 2 that the number of samples belonging to a class varies tremendously in this imbalanced setting. This leads to the model having poor performance, especially in the higher WER ranges. Moreover, different ASR systems will have their own distributions; some may be relatively well balanced in some ranges while others may be much worse. This approach is not scalable, as it provides no general method for creating balanced class distributions irrespective of the ASR system.
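To make the imbalance concrete, the casting step can be sketched as follows. The WER values are hypothetical and merely skewed towards zero in the way the text describes; `cast_to_fixed_class` is an illustrative name, and nearest-value assignment is assumed as the casting rule.

```python
from collections import Counter

WER_FIXED = [0.0, 0.25, 0.5, 0.75, 1.0, 1.5]


def cast_to_fixed_class(wer, wer_fixed=WER_FIXED):
    """Return the index of the fixed WER value closest to `wer`."""
    return min(range(len(wer_fixed)), key=lambda i: abs(wer - wer_fixed[i]))


# Hypothetical per-utterance WERs, skewed towards low values.
wers = [0.0, 0.0, 0.0, 0.05, 0.08, 0.1, 0.12, 0.2, 0.3, 0.9]
counts = Counter(cast_to_fixed_class(w) for w in wers)
```

In this toy sample seven of the ten utterances land in the 0% class, two in the 25% class, one in the 100% class, and the remaining three classes receive nothing at all.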

Proposed Balanced Division
We propose an alternative paradigm for creating WER class distributions. We extend the single task setting for e-WER to a balanced WER class distribution irrespective of the dataset and ASR system. Instead of fixing a list of WER values based on a factor such as 0.25 to represent the classes, we first decide the total number of classes desired, $K$. Let $w_1, w_2, w_3, \ldots, w_D$ be the WERs of individual samples in ascending order, where $D$ is the total number of samples in the corpus. Then the number of samples in each of the $K$ classes will be $n_s = D/K$. A class $C_i$ is defined as the samples with WER in the range $[w_{(i-1) \cdot n_s}, w_{i \cdot n_s}]$, where $i \in \{1, 2, \ldots, K\}$. This is shown in Figure 1, where $K = 15$ classes are made with an equal number of samples in each. The WER value $\mathrm{WER}_{Fixed}^{i}$ associated with each class $C_i$ is defined as the mean WER of that class:

$$\mathrm{WER}_{Fixed}^{i} = \frac{1}{n_s} \sum_{w_j \in C_i} w_j \tag{3}$$

Once $\mathrm{WER}_{Fixed}$ is calculated, we use Equation 2 to compute the predicted WER of a sample using $P_{softmax}$ calculated from a neural network model. Apart from being balanced, this approach also has the benefit of generating classes which fit the true WER curve better than the previous approach, as shown in Figure 1. It is important to understand that while the ERR estimation in the Double Task setting of Ali and Renals (2018) is also imbalanced, it cannot be mapped into arbitrary classes based on ordering. Since WER is a continuous variable, unlike ERR, it is possible to decide the boundary of a WER class arbitrarily to create balanced sets.
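A minimal sketch of the balanced division is shown below. It assumes that samples are ranked by WER and cut into K consecutive groups, with any remainder absorbed by the last class when D is not divisible by K; that remainder handling and the function name `balanced_classes` are assumptions made for the example.

```python
def balanced_classes(wers, k):
    """Split samples into k equal-sized classes by WER rank.

    Returns (labels, wer_fixed): the class index of every sample and the
    mean WER of each class.
    """
    if not 1 <= k <= len(wers):
        raise ValueError("k must be between 1 and the number of samples")
    order = sorted(range(len(wers)), key=lambda idx: wers[idx])  # indices by ascending WER
    n_s = len(wers) // k                                         # samples per class
    labels = [0] * len(wers)
    wer_fixed = []
    for c in range(k):
        start = c * n_s
        end = (c + 1) * n_s if c < k - 1 else len(wers)  # last class takes the remainder
        members = order[start:end]
        for idx in members:
            labels[idx] = c
        wer_fixed.append(sum(wers[idx] for idx in members) / len(members))
    return labels, wer_fixed


wers = [0.0, 0.5, 0.1, 0.0, 1.0, 0.2, 0.05, 0.3, 0.0]  # hypothetical sample WERs
labels, wer_fixed = balanced_classes(wers, k=3)
```

The resulting `wer_fixed` plays the role of the fixed vector in Equation 2. Samples with identical WER, such as the zeros here, are separated only by their rank, so a run of ties longer than one class would be spread arbitrarily across neighbouring classes.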

WER-BERT
In this section we explain our proposed architecture WER-BERT. WER-BERT is primarily made of four sub-networks. Our architecture is shown with details in Figure 3.
Signal Sub-network: Elloumi et al. (2018) use the raw signal of the audio clip to generate features such as MFCC and Mel Spectrogram. These features are commonly used in the design of ASR systems, particularly systems which use an acoustic model, and they aid their model's performance. These signal features are passed through the m18 architecture (Dai et al., 2017), a popular deep convolutional neural network (CNN) used for classification with audio signal features. This CNN model has 17 convolutional + max pooling layers followed by global average pooling. L2 regularization of 1e-4 and Batch Normalization are added after each of the convolutional layers.
Numerical Features Sub-network: The blackbox models of Ali and Renals (2018) had two major components: text input and numerical features. These numerical features are important to the model as they carry information about the number of errors. For instance, ASR systems make errors if a user speaks too fast or too slow, and this is directly reflected in the duration and word count features. The numerical features we have used are Word Count, Grapheme Count and Duration. These features are concatenated and passed through a simple feed forward network which upscales them (from 3 to 32 dimensions).
BERT: Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is a pre-trained unsupervised natural language processing model. It is a masked language model which has been trained on a large corpus including the entire Wikipedia corpus. The transcription samples from the ASR system are passed through the associated tokenizer and then through BERT, which gives a contextual representation for each token. The tokenizer also adds 2 special tokens: the [CLS] token at the beginning and the [SEP] token at the end of the sentence. We have used the BERT-Large Uncased variant, which has 24 stacked transformer (Vaswani et al., 2017) encoders. It gives an output of shape (Sequence Length × 1024), of which only the 1024-dimensional output corresponding to the [CLS] token is used. In WER-BERT, the BERT weights are fine-tuned with the rest of the architecture during training.
Feed Forward Sub-network: This sub-network is a deep fully connected network which is used to concatenate and process the features generated by the sub-networks preceding it (BERT, the numerical sub-network and the signal sub-network). It has 4 hidden layers (512, 256, 128 and 64 neurons) followed by the output softmax layer. Dropout regularization is added to prevent overfitting, considering the large number of parameters. To account for outputs from the different sub-networks having disparate distributions, we further add Layer Normalization (Ba et al., 2016) before concatenation. Normalization is important to prevent the network from learning a bias towards one representation or another.
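The normalise-then-concatenate step can be illustrated with a small sketch. It uses toy vectors rather than the real sizes (1024 for the [CLS] output and 32 for the numerical features), omits the learned gain and bias of Layer Normalization, and the names `layer_norm` and `fuse` are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def fuse(bert_cls, numeric_feats, signal_feats):
    """Layer-normalise each sub-network output separately, then concatenate."""
    fused = []
    for part in (bert_cls, numeric_feats, signal_feats):
        fused.extend(layer_norm(part))
    return fused


# Toy sub-network outputs on very different scales (hypothetical values).
bert_cls = [0.01, -0.02, 0.03, 0.00]
numeric = [12.0, 58.0, 4.5]
signal = [150.0, 90.0, 300.0, 210.0]
fused = fuse(bert_cls, numeric, signal)
```

After normalisation all three parts lie in a comparable range, so the first dense layer does not see one representation dominating the others purely because of its scale.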
Distance Loss for Ordinal Classification: Typical classification problems deal with classes which are mutually exclusive and independent, such as sentiment prediction or deciding whether an image shows a cat or a dog. In such a setting, classification accuracy is the most important metric and there is no relation or relative ordering between the classes. However, e-WER in a classification setting is an ordinal classification problem (Frank and Hall, 2001). Previous approaches which pose WER estimation as a classification task ignore this idea (Elloumi et al., 2018; Ali and Renals, 2018; Elloumi et al., 2019). While classification accuracy is important, it is more important that, when a sample is misclassified, the predicted class is close to the true label. For instance, if the true label corresponds to the WER class of 0.1, a standard classification loss treats a prediction of 0.2 and a prediction of 0.7 the same. Though we have converted the problem from regression to classification, the metric used to compare models is still the Root Mean Squared Error or Mean Absolute Error of the WER, not classification accuracy. Hence, we want the prediction to be as close as possible to the true WER, if not exactly the same. Therefore, we introduce a "distance" loss:

$$L_{custom}(s) = L_{\theta}(s) + \alpha \cdot \gamma \tag{4}$$

where $s$ is a sample, $\alpha$ is a hyperparameter (we have used $\alpha = 50$ in our experiments; refer to Section A.1 of the Appendix for its tuning), $L_{\theta}(s)$ is the cross entropy loss, and $\gamma$ is a distance term computed from $y_{true}(s)$, a one hot vector representing the true WER class of $s$, and $y_{pred}(s)$, the estimated probability distribution of $s$ over all the classes output by the softmax of the classification model.

Table 1: Examples of the ground truth, the transcription from Google Cloud's Speech-to-Text API, the True WER and the WER predicted by the proposed model.

| Ground Truth | Google Cloud's Speech-to-Text Transcription | True WER | Predicted WER |
| --- | --- | --- | --- |
| one historian says that an event was produced by napoleon's power another that it was produced by alexander's | when is dorian says that an event was produced by napoleon's power another that it was produced by alexander's | 16.7 | 16.5 |
| rynch watched dispassionately before he caught the needler jerking it away from the prisoner the man eyed him steadily and his expression did not alter even when rynch swung the off world weapon to center its sights on the late owner | wrench watch dispassionately before he caught a kneeler jerking it away from the prisoner the man i can steadily and his expression did not alter even when wrench swampy off world weapon to center its sights on the late owner | 21 | |
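The effect of adding a distance term to cross entropy can be illustrated with the sketch below. The exact form of γ is not recoverable from the text here, so the code ASSUMES one plausible choice, the absolute gap between the expected predicted WER and the WER of the true class; this assumed form, the class values and all function names are illustrative and should not be read as the paper's definition.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Standard cross entropy between a one hot target and a probability vector."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


def distance_term(y_true, y_pred, wer_fixed):
    """ASSUMED gamma: |expected predicted WER - WER of the true class|."""
    expected = sum(p * w for p, w in zip(y_pred, wer_fixed))
    target = sum(t * w for t, w in zip(y_true, wer_fixed))
    return abs(expected - target)


def custom_loss(y_true, y_pred, wer_fixed, alpha=50.0):
    """Equation 4 with the assumed gamma: cross entropy plus alpha times the distance."""
    return cross_entropy(y_true, y_pred) + alpha * distance_term(y_true, y_pred, wer_fixed)


wer_fixed = [0.1, 0.2, 0.7]   # hypothetical class WER values
y_true = [1.0, 0.0, 0.0]      # true class has WER 0.1
near = [0.2, 0.8, 0.0]        # most mass on the neighbouring 0.2 class
far = [0.2, 0.0, 0.8]         # most mass on the distant 0.7 class
```

Both predictions assign probability 0.2 to the true class, so their cross entropy is identical; only the distance term separates them (0.08 for `near` against 0.48 for `far`), which is the ordinal behaviour the paragraph above asks for.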

Experiments And Baselines
For each of the experiments below, training is repeated for 10 runs and we report the average performance of all the models on the test set. For all the experiments, we use cross entropy as the loss function and the MAE of WER as the evaluation metric.
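The evaluation protocol amounts to computing an error metric per run and averaging over runs. A small sketch follows, with made-up predictions for two runs instead of ten; the function names are illustrative.

```python
import math


def mae(true_wers, pred_wers):
    """Mean absolute error between true and predicted WER."""
    return sum(abs(t - p) for t, p in zip(true_wers, pred_wers)) / len(true_wers)


def rmse(true_wers, pred_wers):
    """Root mean squared error between true and predicted WER."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_wers, pred_wers)) / len(true_wers))


true_wers = [16.7, 0.0, 25.0, 50.0]   # hypothetical test-set WERs (in percent)
runs = [
    [16.5, 2.0, 20.0, 40.0],          # predictions from run 1 (hypothetical)
    [17.0, 1.0, 22.0, 45.0],          # predictions from run 2 (hypothetical)
]
mean_mae = sum(mae(true_wers, r) for r in runs) / len(runs)
mean_rmse = sum(rmse(true_wers, r) for r in runs) / len(runs)
```

RMSE is never smaller than MAE on the same predictions and grows faster with large individual errors, which is why the two metrics are reported side by side.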

BOW
Bag of Words + Num. Feat. (Black Box): Following Ali and Renals (2018), we build a blackbox model for the double task estimation of ERR and word count. The word count estimation task was treated as a 46 class classification problem, with class 1 corresponding to a word count of 2 and class 46 to a word count of 47 (the 90th percentile). Similarly, the ERR estimation task was modelled as a 20 class classification problem, with class 0 corresponding to no errors and class 19 corresponding to 19 errors (the 90th percentile). Both of these tasks are handled by logistic regression models which use Bag of Words features for both the words present in the sentence and the graphemes (monograms and bigrams) of the sentence. These features are concatenated with numerical features such as word count, grapheme count and duration, and then fed to a feedforward network. Dropout is also added after each layer to prevent overfitting.
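The feature extraction and label mapping of this baseline can be sketched as follows. Two points are assumptions made for the example rather than details from the text: grapheme bigrams are taken within words only, and values outside the stated ranges are clamped to the nearest class. All function names are hypothetical.

```python
from collections import Counter


def bow_features(sentence):
    """Return bags of words, grapheme monograms and grapheme bigrams for a transcript."""
    words = sentence.split()
    word_bag = Counter(words)
    monograms = Counter(ch for w in words for ch in w)
    # ASSUMPTION: bigrams do not cross word boundaries.
    bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    return word_bag, monograms, bigrams


def word_count_class(n_words, lo=2, hi=47):
    """Map a word count to classes 1..46 (class 1 = 2 words, class 46 = 47 words)."""
    return min(max(n_words, lo), hi) - lo + 1  # ASSUMPTION: clamp out-of-range counts


def err_class(n_errors, max_errors=19):
    """Map an error count to classes 0..19."""
    return min(n_errors, max_errors)           # ASSUMPTION: clamp counts above 19


word_bag, monograms, bigrams = bow_features("the man the")
```

In a full pipeline these counts would be turned into fixed-length vectors over a vocabulary before being concatenated with the numerical features.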

CNN
Following Elloumi et al. (2018), we build CNN based models that take the text and signal representations described below as inputs, with a CNN for learning features from the textual transcription itself.
Text Input: The text input is padded to T words (where T was taken as 50, the 95th percentile of the sentence length) and transformed into an embedding matrix of size T × M, where M is the embedding size (300). These embeddings were obtained from GloVe (Pennington et al., 2014). The CNN architecture follows Kim (2014).
Signal Inputs: RAWSIG: This input is obtained by sampling the audio clip at 8 kHz with the maximum duration set to 15 s; shorter audio clips were padded while longer ones were clipped. When used with the m18 architecture, it was further down-sampled to 4 kHz due to memory constraints, giving a dimension of 60000 × 1.
MELSPEC: This input was calculated as 96 dimensional vectors, each component of which corresponds to a particular Mel frequency range. They were extracted every 10 ms with an analysis window of 25 ms, giving a dimension of 1501 × 96.
MFCC: This input was calculated by computing 13 MFCCs every 10 ms, giving a dimension of 1501 × 13.
These signal features are used as input to the m18 architecture (Dai et al., 2017) (refer to Section 5: Signal sub-network). For joint use of both text and signal inputs, the outputs of the text and signal sub-networks are each followed by a hidden layer (512 processing units) whose outputs are concatenated (with a dropout regularization of 0.1 applied between the hidden layer and the concatenation layer). This is followed by 4 hidden layers (of 512, 256, 128 and 64 neurons) and the output layer, with Dropout regularization added to prevent overfitting due to the large number of parameters.
CNN-text (balanced): Elloumi et al. (2018) show that CNN-text alone isn't enough to get good results. Nevertheless, this experiment provides a baseline to compare against BERT-large, since that model does not use signal features either. This model is trained in our balanced class setting with 15 classes, explained in Section 4.2.
CNN-text + RAWSIG (Double Task): The best architecture in Elloumi et al. (2018), trained in the double task setting proposed by Ali and Renals (2018). The ERR estimation was modelled as a 20 class task and the word count estimation was modelled as a 46 class task.
CNN-text + RAWSIG (Elloumi et al., 2018): This was the best architecture proposed by Elloumi et al. (2018). It is a single task approach which uses the imbalanced WER class distribution (classes corresponding to WERs of 0, 0.25, ..., 1.5).

CNN-text + RAWSIG (Balanced): Same model as CNN-text + RAWSIG but trained in our balanced class setting with 15 classes, explained in Section 4.2.
CNN-text + MFCC + MELSPEC + RAWSIG (Balanced): Instead of using only the RAWSIG input, we pass all 3 signal features into their respective m18 models. The outputs of the Kim (2014) CNN model (for the text input) and each of these m18 models are concatenated and processed in the same fashion as in WER-BERT, explained in Section 5.
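The signal input dimensions quoted for RAWSIG, MELSPEC and MFCC follow from simple arithmetic, sketched below. Plain decimation is used for the down-sampling and the frame count is taken as duration divided by hop plus one; both are assumptions chosen to reproduce the stated shapes (a real pipeline would typically low-pass filter before decimating), and the function names are hypothetical.

```python
def pad_or_clip(samples, sample_rate=8000, max_seconds=15):
    """Clip long clips and zero-pad short ones to exactly max_seconds of audio."""
    target = sample_rate * max_seconds
    return samples[:target] + [0.0] * max(0, target - len(samples))


def naive_downsample(samples, factor=2):
    """ASSUMPTION: plain decimation (every `factor`-th sample), with no anti-alias filter."""
    return samples[::factor]


def frame_count(duration_s=15, hop_ms=10):
    """ASSUMPTION: one frame per hop plus the frame at t = 0."""
    return duration_s * 1000 // hop_ms + 1


clip = [0.1] * (8000 * 3)          # hypothetical 3 second clip sampled at 8 kHz
fixed = pad_or_clip(clip)          # 15 s at 8 kHz
rawsig = naive_downsample(fixed)   # 15 s at 4 kHz
n_frames = frame_count()           # frames for MELSPEC and MFCC
```

This gives 60000 samples for RAWSIG and 1501 frames for the two spectral inputs, matching the 60000 × 1, 1501 × 96 and 1501 × 13 dimensions listed above.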

BERT architectures
Experiments are carried out with the architecture described in Section 5. We carry out ablation studies to identify the contribution of individual input features. We start with the full architecture shown in Section 5 and subsequently remove sub-networks to compare performance. All the experiments except for the distance loss one are carried out with cross entropy as the loss and the MAE of WER as the evaluation metric, and are trained with the balanced classes paradigm.

Table 2 presents the results of the proposed approach alongside previous approaches, together with the ablation studies. While the WER of the 10 hour dataset is low (≈10) and that of the 100 hour dataset is high (≈20), we see that the proposed approach models both effectively. Figure 4 shows that WER-BERT's estimation of WER closely follows the true WER curve. Furthermore, due to the proposed balanced paradigm, it is able to predict well in the mid-high WER classes even with fewer samples in this region. Figure 5 shows the comparison of the balanced and imbalanced class settings. Comparing Figures 4 and 5, we see that WER-BERT models the lower and mid regions much better than the CNN balanced model.

Effectiveness of WER-BERT and Ablation Studies
The best WER-BERT model, with all numerical features, signal features and distance loss, outperforms the other models in both MAE and RMSE. In particular, we see that a BERT model outperforms the corresponding CNN model with the same inputs: 14.44 v/s 12.04 and 13.08 v/s 11.19 for the CNN-text and BERT-large models on the 100 hour and 10 hour data respectively (with similar results for the CNN-text + signal feat. and BERT-large + signal feat. models). We attribute the effectiveness of BERT to its language model information and its ability to identify improbable word sequences, which often correlate with transcription errors. Models such as CNNs, while effective, lack the backing of a language model and hence fail to do the same. While this is important, it alone fails to beat the earlier approaches which utilise signal features, i.e. making a WER prediction from the transcription alone is not enough. Signal features tell us how an utterance was spoken, along with the background noise. They contain valuable information such as signal noise, which correlates with higher WER. The effectiveness of WER-BERT is particularly evident in Table 1, where we see that it is able to predict a WER which is very close to the True WER. We hypothesize this is due to irregular word sequences such as "..when is dorian says.." or "..the man i can steadily." being identified by BERT's language model. Furthermore, even when our model predicts a WER far from the true value, it does so in a justified manner. The transcriptions "1724" and "st james 7" receive a high True WER, whereas WER-BERT identifies these sequences as correct and probable and gives them low scores. Since WER matches strings exactly, it ends up assigning a 100% error. Note that while we use BERT, the language model itself can be any transformer, since they all share the idea that LM backing helps identify speech errors in ways CNN models can't.

Effectiveness of proposed balancing setting
We use the best architecture reported by Elloumi et al. (2018), CNN-text + RAWSIG, for comparing the three paradigms. Balanced outperforms the Double Task approach of Ali and Renals (2018) and the Single Task approach of Elloumi et al. (2018) (9.65 v/s 11.34 v/s 14.09 for the 100 hour set). The class imbalance in their approaches causes bias towards the largest classes during training, where most of the predictions end up. This especially harms WER prediction in the higher ranges, where the number of samples is small. Figure 5 shows the performance difference between CNN + RAWSIG models in the new balanced paradigm and the imbalanced Elloumi et al. (2018) setting. Since the model encounters heavily imbalanced labels in the latter setting, its predictions reflect the same. There are only two kinds of predictions: one corresponding to the lowest WER class (also the largest in number, as seen in Figure 1) and another in the range of the second largest class. The information from the other classes is mostly ignored because they are present in low numbers. Meanwhile, the balanced paradigm fits the slope of the curve after 40 WER well. The range of >40 WER is tough for the model to predict, as the number of samples available in this region is smaller than the number of samples with <40 WER (almost 70% of the data). Despite this, the balanced paradigm effectively divides this area into an adequate number of classes for good performance. While the performance in the mid region (20-40 WER) is poor for this model, the custom loss and the BERT model in WER-BERT take care of this, as seen in Figure 4.

Effectiveness of Distance loss
Adding the distance loss to WER-BERT improves performance. In Figure 6 we see the confusion matrix of WER classification visualized with and without the custom loss. With the custom loss, the model's predicted class is much closer to the ground truth class, represented by the diagonal of the matrix. This is due to the distance loss's ability to penalize far-off predictions, which is lacking in typical classification loss functions such as cross entropy. We see an improvement of nearly 1 MAE with this.

Conclusion
We propose WER-BERT for automatic WER estimation. While BERT is an effective model, the addition of speech signal features boosts the performance. Phrasing WER classification as an ordinal classification problem, by training with a custom distance loss, encodes information about the relative ordering of the WER classes into the training. Finally, we propose a balanced paradigm for training WER estimation systems. Training in a balanced setting allows the proposed model to predict WER adequately even in regions where samples are scarce. Furthermore, this balanced paradigm is independent of the WER prediction model, the ASR system and the speech dataset, making it efficient and scalable.

A Hyperparameter Tuning Details
In this section we explain the tuning and selection procedure for the hyperparameters of the proposed experiments, namely the distance loss weight α and the number of classes K. We also attempt to explain the effect of these variables on performance in terms of MAE.

A.1 Tuning Distance Loss Hyperparameter α
Figure 7 shows the effect of varying α on performance on the 100hr dataset for the WER-BERT model, alongside WER-BERT without the custom loss. As evident, we get the lowest MAE and RMSE for α = 50. Lower or higher α leads to a drop in performance. Recall Equation 4 from the main paper, $L_{custom}(s) = L_{\theta}(s) + \alpha \cdot \gamma$, where $\gamma$ is the distance term. For low values of α (0.0001 to 0.001), $L_{custom}(s)$ becomes approximately equal to $L_{\theta}(s)$, which is the classification cross entropy loss. Hence, performance at low α tends to that of WER-BERT without the custom loss. On the other hand, higher α values weight the custom loss far more than the cross entropy loss. While the drop in performance isn't very large, the moderate adverse effect still shows that the regular classification cross entropy loss is required for WER-BERT training.
A.2 Tuning Class Hyperparameter K
Figure 8 shows the effect on performance of the CNN + RAWSIG model for varying K. We get the minimum MAE and RMSE at K = 15. Broadly speaking, for lower K, RMSE and MAE suffer because the new classes no longer accurately depict the true WER. For example, in Figure 1 of the main paper, we can see that if K = 3 or 4, WERs in a wide range from 40 to 100 will be clubbed into a single class despite their huge differences. This makes it hard for the model to predict WERs in this range. On the other hand, higher K yields classes that are more accurate to the true WER, especially in the higher ranges (50 to 100), but at the cost of class size. For example, at K = 30, each class will have only about 900 (28000/30) samples. Furthermore, since True WER is a highly imbalanced variable, there exists a long tail in the lower WER regions in Figure 1 of the main paper. This WER region is divided into many classes, despite the WER range being small. For example, for K > 30 the model will be forced to distinguish between nearly four classes, populated with arbitrary samples, all of which have a True WER of 0. Even for K = 15, the model encounters two classes which are both actually 0 WER. This number of redundant classes with arbitrary samples will only increase as K increases.
Furthermore, for a range of values of K, we see that the balanced CNN + RAWSIG model still outperforms the CNN + RAWSIG model of Elloumi et al. (2018). This further reinforces the efficacy of the balanced paradigm even with its sensitivity to a hyperparameter.
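The trade-off described in A.2 can be reproduced on synthetic data. The sketch below assumes a corpus of 28000 samples of which 14% have a True WER of exactly 0; that share is a hypothetical value chosen only because it is consistent with the counts quoted above, and `all_zero_classes` is an illustrative helper.

```python
def all_zero_classes(wers, k):
    """Count balanced classes whose samples all have a WER of exactly 0."""
    ordered = sorted(wers)
    n_s = len(ordered) // k
    count = 0
    for c in range(k):
        end = (c + 1) * n_s if c < k - 1 else len(ordered)  # last class takes the remainder
        if ordered[end - 1] == 0.0:  # the largest WER in the class is 0, so all are 0
            count += 1
    return count


D = 28000        # corpus size mentioned in the text
n_zero = 3920    # ASSUMPTION: 14% of samples have WER 0 (hypothetical)
wers = [0.0] * n_zero + [(i + 1) / (D - n_zero) for i in range(D - n_zero)]

# For each K: (samples per class, number of redundant all-zero classes).
summary = {k: (D // k, all_zero_classes(wers, k)) for k in (3, 15, 30)}
```

Under this assumption K = 15 yields two all-zero classes of 1866 samples each, while K = 30 yields four such classes of 933 samples each, illustrating how larger K shrinks the classes while multiplying the redundant ones.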

B Dataset
Additional dataset details are present in Table 3.