Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensively investigated. This work focuses on improving the robustness of discrete input representations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate how current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn robust discrete speech representation for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show the proposed approach outperforms the evaluated baselines.


Introduction
Self-supervised speech models were shown to learn effective representations for various downstream tasks (Hsu et al., 2021; Chen et al., 2022; Baevski et al., 2020). These models were mainly evaluated on discriminative tasks, such as automatic speech recognition, speaker verification, and intent classification (Yang et al., 2021). Recently, Lakhotia et al. (2021) demonstrated that such self-supervised learning (SSL) representations can be used for Generative Spoken Language Modeling.
Generative Spoken Language Modeling (GSLM) is the task of learning the acoustic and linguistic characteristics of a language from raw audio. In other words, a discrete representation of the audio signal is learned. A common practice is to extract a continuous representation using an SSL model and then apply vector quantization, usually with the k-means algorithm (Lakhotia et al., 2021; Kharitonov et al., 2021a; Borsos et al., 2022). A speech language model is then trained on top of the obtained representation. Finally, a neural vocoder converts the output units to raw audio. As the discrete speech representation often operates over units extracted over relatively short windows (e.g., 20 ms), sequences can be long and contain repetitions, e.g., 10 11 11 11 21 32 32 32 21. Preliminary studies have found that removing sequential repetitions of units improves performance, so this step is applied universally (Lakhotia et al., 2021). For example, the pseudo-text 10 11 11 11 21 32 32 32 21 becomes 10 11 21 32 21. This framework was shown to be effective in modeling multiple levels of the speech utterance, namely prosody and content (Lakhotia et al., 2021; Kharitonov et al., 2021a; Borsos et al., 2022), as well as in speech coding (Polyak et al., 2021), speech emotion conversion (Kreuk et al., 2021), spoken dialogue (Nguyen et al., 2022), and speech-to-speech translation (Lee et al., 2021; Popuri et al., 2022; Lee et al., 2022).
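The repetition-removal step described above can be illustrated with a minimal Python sketch. The function name `deduplicate` is our own choice for illustration and is not taken from any GSLM codebase.

```python
from itertools import groupby


def deduplicate(units):
    """Collapse each run of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


pseudo_text = [10, 11, 11, 11, 21, 32, 32, 32, 21]
print(deduplicate(pseudo_text))  # [10, 11, 21, 32, 21]
```

Note that only consecutive repetitions are collapsed: the unit 21 appears twice in the output because its two occurrences are not adjacent.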
An essential prerequisite for such an audio representation to be used in real-world conditions is robustness to various signal corruptions. Although the aforementioned audio representation models have shown effectiveness in many tasks, they were mainly evaluated on academic benchmarks.
In this work, we evaluate current state-of-the-art self-supervised speech representation models on what are arguably the most basic signal variations, namely time-stretch, pitch-shift, additive noise, and reverberation. Our premise is that while these variations modify the signal, its underlying content remains the same, especially under the unit repetition removal process. Therefore, a robust representation should be affected by such variations only to a minimal extent.
As a first step, we propose a set of metrics for evaluating the model's robustness. Then, we point to the lack of robustness of these models with respect to the aforementioned variations. Next, we design a simple and effective method for learning a robust discrete representation on top of any speech SSL model. We demonstrate how such a method greatly improves robustness. Then, we empirically show that performance improves on several tasks for various SSL models. Specifically, we evaluate the newly proposed speech encoders on zero-shot evaluation tasks covering encoding and modeling, i.e., ABX, sWUGGY, and sBLIMP (Nguyen et al., 2020), together with a high-level downstream task in the form of speech-to-speech translation.

Background
The general Generative Spoken Language Modeling (GSLM) pipeline is comprised of three main modules: (i) Speech-to-unit, (ii) Unit language model, and (iii) Unit-to-speech, where each of these modules is trained separately. Speech resynthesis can be achieved while ignoring the language model and directly feeding the quantized units into the unit-to-speech module (Polyak et al., 2021) (see Figure 1 for a visual description). In the following paragraphs, we give detailed background for each of the three components mentioned above, including the standard evaluation methods.
Formally, denote the domain of audio samples by $\mathcal{X} \subset \mathbb{R}$. The representation of a raw signal is therefore a sequence of samples $x = (x_1, \dots, x_T)$, where $x_t \in \mathcal{X}$ for all $1 \le t \le T$.
Consider an encoder network, $f$, that takes the speech utterance as input and outputs a sequence of spectral representations sampled at a low frequency: $f(x) = (v_1, \dots, v_{T'})$. Note that we do not assume anything about the structure of the encoder network $f$. Lakhotia et al. (2021) evaluated several speech encoders, namely, Mel-spectrogram, Contrastive Predictive Coding (Oord et al., 2018, CPC), wav2vec2 (Baevski et al., 2020), and HuBERT (Hsu et al., 2021).
Since the representations learned by such models are usually continuous, a k-means algorithm is applied over the models' outputs to generate discrete units, denoted as $z = (z_1, \dots, z_{T'})$. Each element $z_i$ in $z$ is a positive integer, $z_i \in \{1, \dots, K\}$ for $1 \le i \le T'$, where $K$ is the number of discrete units. We denote the quantization model by $E$.
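As a rough illustration of the quantization step, the sketch below assigns each continuous frame to its nearest centroid. It assumes the centroids have already been fitted (for example by k-means); the name `quantize` and the toy two-dimensional vectors are illustrative only, and real encoder outputs have far more dimensions.

```python
def quantize(frames, centroids):
    """Map each frame vector to the 1-based index of its nearest centroid."""

    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    units = []
    for frame in frames:
        nearest = min(
            range(len(centroids)),
            key=lambda k: squared_distance(frame, centroids[k]),
        )
        units.append(nearest + 1)  # units are numbered 1..K, as in the text
    return units


centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]  # K = 3 toy centroids
frames = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.7, 0.2]]
print(quantize(frames, centroids))  # [1, 2, 2, 3]
```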
Unit Language Model is trained on the extracted discrete units, $z$. Such a language model learns a probability distribution over the learned unit sequences, which enables direct modeling of speech data without textual supervision.
The language model can be used to generate speech conditionally or unconditionally, replicating what toddlers achieve before learning to read. Moreover, such a modeling framework allows for capturing and modeling prosodic features (Kharitonov et al., 2021a), as well as speaker identity (Borsos et al., 2022), or even natural dialogues (Nguyen et al., 2022). This is in contrast to using textual features, as they do not encode such information.
Unit-to-speech module converts the discrete speech units to a raw waveform. Lakhotia et al. (2021) used a Tacotron2.0 (Shen et al., 2018) based model followed by a WaveGlow (Prenger et al., 2019) vocoder. Later, Polyak et al. (2021) proposed a unit-based vocoder based on the HiFi-GAN architecture to convert units to speech directly. Such a paradigm seems to provide high-quality generations with better efficiency as it uses only one model rather than two. Kreuk et al. (2021) and Lee et al. (2021) additionally improved the unit-based vocoder to include emotional tokens for speech emotion conversion tasks and duration modeling for direct speech-to-speech translation.
Zero-shot Evaluation. Evaluating such a complex pipeline comprised of several components is a challenging task. Lakhotia et al. (2021) proposed a set of zero-shot evaluation tasks targeting each of the modules. Overall, the proposed tasks can be divided into four main groups: (i) acoustic encoding using ABX and bitrate; (ii) language encoding using sWUGGY and sBLIMP (Nguyen et al., 2020; Lakhotia et al., 2021); (iii) resynthesis using Phoneme/Word Error Rate; and (iv) speech generation using VERT (Lakhotia et al., 2021) and Meaningfulness Mean Opinion Score.

Robustness of Speech-to-Unit Models
The first step toward developing an effective spoken language model is to develop a robust representation. The focus of a robust representation should be on the spoken information rather than unrelated signals, such as prosodic features in the form of duration and F0, background noise, or reverberations. In the following section, we propose a metric for quantifying the degree to which augmentations change the resulting encoding.

Unit Edit Distance
A spoken language model is built on top of a discrete representation of a continuous encoder. We examine the robustness of the discrete space to augmentations that do not change the spoken content. Therefore, we are interested in a sequential distance metric between two discrete representations. It is essential to note that augmentations can alter the spatial dimension of the signal. For example, stretching a signal results in more frames, yielding a longer representation sequence. A similar phenomenon occurs when convolving with different room impulse responses to simulate reverberation. Hence, the metric should be able to measure the distance between two sequences of different lengths. Ideally, it will consider the number of deletions, insertions, and substitutions that occur due to augmenting the input data. For this purpose, we find the Levenshtein distance a good fit (Levenshtein, 1966). The Levenshtein distance measures the minimum number of changes needed to turn one sequence into another. It has two essential properties: the first is that the score is non-negative, and when the sequences are equal, the metric equals zero. The second property is that the maximum value it can attain equals the length of the longer of the two sequences. We provide a detailed explanation of the Levenshtein distance in the Appendix material.
We aggregate the distance values over the evaluation set while accounting for the sequence length. This is desirable since we want to normalize scores across sequences of different lengths, and the maximum value of the Levenshtein distance is the length of the original sequence. Another essential property to account for in a sequential metric is repetitions. Consider time stretch as an example: it changes the number of input frames, but one would expect the deduplicated quantized signal to be the same as before the augmentation. Hypothetically, one can maximize the score by stretching the signal infinitely. To eliminate such dependencies, we compute the score on a deduplicated quantized representation. Formally, given a continuous encoder $f$, a quantizer $E : \mathbb{R}^{T'} \to \{1, \dots, K\}^{T'}$, and an input augmentation $g$, our final metric, the unit edit distance (UED) over an evaluation set $\mathcal{D}$, is:
$$\mathrm{UED}_{\mathcal{D}}(E, f, g) = \sum_{x \in \mathcal{D}} \frac{1}{T'_x} \, \mathrm{LEV}\big((E \circ f)(x), (E \circ f \circ g)(x)\big),$$
where $T'_x$ is the number of frames of a sample $x$. Ideally, a perfect spoken language quantizer obtains a zero distance after deduplication. Next, we study state-of-the-art spoken language representations using our proposed metric in different settings.
Figure 2: UED scores for various augmentations and numbers of clusters. We note that the UED is relatively high (the distance is normalized). We also note that the UED monotonically increases with the number of units used. We multiply the scores by a hundred.
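The unit edit distance described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the evaluation code used for the reported numbers: the function names are ours, each score is normalized by the frame count of the clean sequence, and the scores are averaged over the evaluation set (for a fixed set, an average differs from a plain sum only by a constant factor).

```python
from itertools import groupby


def deduplicate(units):
    """Collapse runs of identical consecutive units."""
    return [u for u, _ in groupby(units)]


def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ai in enumerate(a, start=1):
        curr = [i]
        for j, bj in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ai
                curr[j - 1] + 1,           # insert bj
                prev[j - 1] + (ai != bj),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def unit_edit_distance(pairs):
    """pairs: list of (clean_units, augmented_units) frame-level sequences."""
    scores = []
    for clean, augmented in pairs:
        distance = levenshtein(deduplicate(clean), deduplicate(augmented))
        scores.append(distance / len(clean))  # normalize by the frame count
    return sum(scores) / len(scores)          # assumption: average over the set


clean = [10, 11, 11, 11, 21, 32, 32, 32, 21]
stretched = [10, 10, 11, 11, 11, 11, 21, 32, 32, 32, 32, 21]
noisy = [10, 11, 11, 40, 21, 32, 32, 21]
print(unit_edit_distance([(clean, stretched)]))  # 0.0
print(unit_edit_distance([(clean, noisy)]))      # one insertion over nine frames
```

The stretched sequence has more frames but the same deduplicated content, so it scores zero, whereas the spurious unit 40 in the noisy sequence costs one edit.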

Evaluation
In the following, we study current state-of-the-art representations for generative spoken language modeling using the proposed metric. The currently popular quantization technique is a k-means model trained on top of a pre-trained encoder (Lakhotia et al., 2021). In our evaluation setup, we vary the number of clusters and the encoder architecture. Our ablation study includes quantizers with 50, 100, 200, and 500 clusters. We further investigate our metric on top of HuBERT (Hsu et al., 2021), wav2vec2 (Baevski et al., 2020), and WavLM (Chen et al., 2022). For readability, throughout the paper, we report results for the HuBERT model while leaving the rest of the results in the Appendix material.

Augmentations
This work focuses on four simple signal modifications which mimic real-world signal variations:
Time stretch. We use the Phase Vocoder method (Karrer et al., 2006) to stretch or shrink the time domain signal with a rate of τ without changing the pitch. For example, τ = 1.2 speeds up the signal by 20%. In this work, for each sample, we uniformly sample a value in the range [0.8, 1.2].
Pitch shift. We change the original pitch of the speech signal by a given number of semitones using the resampling method over the time-stretched signal (Karrer et al., 2006). In this paper, we shift the pitch by up to four semitones.
Reverberation. We follow a setting similar to that of Chazan et al. (2021), in which an Acoustic Transfer Function (ATF) is simulated using the pyroomacoustics (Scheibler et al., 2018) audio room simulation package. We randomly sample room dimensions, microphone location, and source location, then convolve the ATF with the speech signal.
Noise injection. We mix a given speech signal with non-stationary additive noise, using a randomly sampled Signal-to-Noise Ratio (SNR) in the range of [5, 15]. Background noises are sampled from the Deep Noise Suppression (DNS) challenge (Reddy et al., 2020), which includes a diverse set of noise types from AudioSet (Gemmeke et al., 2017), Freesound, and Demand (Thiemann et al., 2013).
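To make the noise-injection step concrete, the following sketch scales a noise signal so that the mixture reaches a requested SNR. It assumes the SNR is expressed in decibels and that the speech and noise signals have equal length; the function name `mix_at_snr` is hypothetical and the signals are plain Python lists for self-containment.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio is `snr_db`, then add."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [math.sin(0.1 * t) for t in range(1000)]       # toy "speech" signal
noise = [random.uniform(-1.0, 1.0) for _ in range(1000)]  # toy noise signal
snr_db = random.uniform(5, 15)                           # sampled per utterance
mixed = mix_at_snr(speech, noise, snr_db)
```

In practice the noise clip would first be cropped or tiled to match the utterance length, a step omitted here.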

Results
In Figure 2, we use our metric to study the robustness of k-means trained on top of HuBERT with various augmentations and values of K. This evaluation points to the lack of robustness of the current state-of-the-art representation to simple augmentations that do not alter the spoken content. For example, for time stretch augmentation, the UED score is between 39 and 51. Considering that UED is computed on deduplicated signals, those numbers are high. Moreover, this number increases as a function of K. The high numbers and the monotonicity of the UED as a function of K are consistent for all values of K, augmentations, and models we experimented with (HuBERT, wav2vec2, and WavLM). Next, we propose a method that improves the robustness of such representations.

Pseudo-labeling for Robust Discrete Representation
Our findings in Section 3 suggest that current state-of-the-art representations may be too sensitive to augmentations that do not alter spoken information.
Preliminary robustness research focused primarily on noise augmentation. This is convenient since the signal length is not affected by such augmentations.
In practice, real-world augmentations may modify the signal length. In order to work with various types of augmentations, we must align the original and augmented sequences. The following section presents a pseudo-labeling, alignment-based approach to learning a robust quantizer.
Figure 3: Illustration of our method: We forward a clean signal through an encoder followed by a pre-trained quantizer (k-means). Next, we forward an augmented signal through the same encoder, followed by a new quantizer (green). The CTC loss between the deduplicated output of the clean signal and the output of the augmented signal is used to learn the parameters of the new quantizer. In the iterative approach, after the learned quantizer $E_0$ converges, we freeze it and learn a new quantizer $E_1$ that distills information from $E_0$.

Pseudo-labeling
The GSLM encoding framework comprises a raw audio signal forwarded through an encoder, then a quantizer. The quantizer is learned on top of a trained encoder, e.g., k-means trained on each embedding vector extracted from HuBERT.
As discussed above, we do not want to limit the robustness process to a family of augmentations that do not change the signal's length. To align and use augmentations that may modify the signal's length, we use the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006). The CTC operation computes the probability of an alignment based on the predicted and target sequences. Finally, the CTC loss considers the negative log-likelihood produced by the CTC operation.
We forward a clean signal through an encoder $f$ followed by a pre-trained quantizer $E_0$. In parallel, we forward an augmented signal through the same encoder $f$ and train a non-linear multilayer perceptron $E_1$ on top of it. Using the CTC loss, which accounts for the alignment between the outputs, we learn the parameters of $E_1$. Formally, the probability given by the CTC operation, $\ell(E_0, E_1, x, g)$, for a single data point $x$ is
$$\ell(E_0, E_1, x, g) = p\big((E_0 \circ f)(x) \mid (E_1 \circ f \circ g)(x)\big),$$
which can be decomposed into a sum over the set of all alignments $\mathcal{A}_x$:
$$\ell(E_0, E_1, x, g) = \sum_{A \in \mathcal{A}_x} \prod_{t=1}^{T'} p_t\big(a_t \mid (E_1 \circ f \circ g)(x)\big),$$
where $A = (a_1, \dots, a_{T'})$ and $T'$ here denotes the number of output frames for the augmented signal. Finally, for a training set $\mathcal{D}$, a set of augmentations $G$, a pre-trained quantizer $E_0$, and a learned quantizer $E_1$, our loss function is as follows:
$$\mathcal{L}_{\mathcal{D}}(E_0, E_1, G) = \mathbb{E}_{x \sim \mathcal{D},\, g \sim G}\big[-\log \ell(E_0, E_1, x, g)\big].$$
Note that the alignment between the predicted and target sequences is many-to-one. Thus, one or more output units can be aligned to a single target unit. Hence, to work with augmentations that stretch the signal, we are required to deduplicate the target sequence. Intuitively, this process distills quantization knowledge from the pre-trained quantizer into the new quantizer while injecting into $E_1$ knowledge about the contextual similarity between the original and augmented signals.
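To make the alignment sum concrete, the sketch below computes the CTC negative log-likelihood of a deduplicated target sequence with the standard forward algorithm (Graves et al., 2006). It is a didactic, pure-Python illustration rather than the training code: reserving index 0 for the blank symbol is an assumption of this sketch, and it works with raw probabilities, which would underflow on long utterances. Library implementations such as `torch.nn.CTCLoss` operate in log space and would be the practical choice.

```python
import math

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def ctc_negative_log_likelihood(frame_probs, target):
    """Return -log p(target | frame_probs) using the CTC forward algorithm.

    frame_probs[t][k]: probability of symbol k at frame t, e.g. the softmax
        output of the learned quantizer on the augmented signal.
    target: deduplicated unit sequence from the pre-trained quantizer on the
        clean signal, with units numbered from 1.
    """
    # Interleave blanks: [blank, z_1, blank, z_2, ..., z_L, blank].
    extended = [BLANK]
    for unit in target:
        extended += [unit, BLANK]
    size = len(extended)

    alpha = [0.0] * size
    alpha[0] = frame_probs[0][BLANK]
    if size > 1:
        alpha[1] = frame_probs[0][extended[1]]

    for probs in frame_probs[1:]:
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skipping a blank is allowed only between two different units.
            if s >= 2 and extended[s] != BLANK and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[extended[s]]
        alpha = new_alpha

    # Valid alignments end on the last unit or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, one real unit: alignments (1,1), (1,blank), (blank,1) each have
# probability 0.25, so the likelihood is 0.75.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_negative_log_likelihood(uniform, [1]))
```

Because several frames may map to one target unit, the augmented sequence can be longer than the deduplicated target, which is what allows length-changing augmentations such as time stretch.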
A significant advantage of our method is that it is highly efficient. Our method requires training only a relatively small number of parameters. In contrast to previous methods that train HuBERT from scratch, which takes up to seven days on 32 GPUs, our method converges in a few hours on a single GPU. In fact, our experiments show that learning the parameters of the encoder performs worse than freezing them: while the UED improves, the ABX scores are negatively affected. The freezing of the upstream model thus serves as a regularizer.

Iterative Pseudo-labeling
In the previous section, we presented a pseudo-labeling approach that relies on a converged quantizer $E_0$, e.g., k-means on top of HuBERT. This raises the question of whether it is possible to enhance the robustness of the learned quantizer $E_1$ by iteratively replacing the pre-trained quantizer with the converged quantizer and learning another MLP on top of it. It turns out that such a process can further improve the robustness.
The iterative process begins with a pre-trained quantizer $E_0$; then, as in Section 4.1, we learn a robust quantizer $E_1$. Upon convergence of $E_1$, we replace $E_0$ with $E_1$ and use it as the pre-trained quantizer. Then, we learn a new MLP $E_2$ on top of the converged $E_1$. We repeat this process $K$ times. This process requires more careful training. We note that it is essential to replace the quantizers only after convergence.
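The control flow of the iterative scheme can be summarized with the schematic sketch below. The callable `train_until_converged` is hypothetical: it stands in for one full run of the pseudo-labeling procedure of Section 4.1 against a frozen teacher quantizer, and its internals are not specified here.

```python
def iterative_pseudo_labeling(initial_quantizer, train_until_converged, num_iterations):
    """Repeatedly distill a converged quantizer into a freshly trained one.

    initial_quantizer: E_0, e.g. k-means on top of the frozen encoder.
    train_until_converged: hypothetical callable that takes the frozen teacher
        quantizer and returns a new quantizer trained to convergence on the
        teacher's pseudo-labels.
    num_iterations: how many times the teacher is replaced.
    """
    teacher = initial_quantizer
    history = [teacher]
    for _ in range(num_iterations):
        student = train_until_converged(teacher)  # the teacher stays frozen here
        teacher = student                         # swap only after convergence
        history.append(teacher)
    return teacher, history


# Toy usage with integers standing in for quantizers E_0, E_1, E_2, E_3.
final, history = iterative_pseudo_labeling(0, lambda teacher: teacher + 1, 3)
print(final, history)  # 3 [0, 1, 2, 3]
```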

Experiments
In the following, we assess the efficacy of our method using state-of-the-art self-supervised representations and popular discriminative and generative evaluation tasks. It is important to note that a single metric cannot tell the whole story. For example, similarly to perplexity, all representations can be assigned to the same cluster, which achieves a perfect unit edit distance but a poor representation. We first examine our proposed method using the unit edit distance along with other discriminative and generative performance metrics. Then, we show that our method improves downstream tasks. In Section 5.1 we use our proposed metric from Section 3 to analyze the robustness of our method. In Section 5.2 we study the discriminative capabilities of our method using the ABX test (Schatz et al., 2013). Then, we evaluate our methods using generative zero-shot evaluation tasks such as sWUGGY and sBLIMP (Nguyen et al., 2020; Lakhotia et al., 2021). Finally, we demonstrate the effect of using our robust quantizer's units in speech-to-speech translation.
Experimental Setup. We study our method using the base versions of HuBERT, wav2vec2, and WavLM. For readability, we report results for HuBERT in the main paper. The results for wav2vec2 and WavLM are in Appendix C. To match the current k-means training set, we use Librispeech-100h to learn our quantizer (Panayotov et al., 2015). We analyze our metric using the 'clean' and 'other' development sets from Librispeech. A detailed setup is provided in Appendix B.

Analysis
In Section 3, we presented an evaluation metric that assesses the robustness of a quantized speech representation to augmentations. The metric is insensitive to changes in the length of the signal. Using it, we investigated the current state-of-the-art representations. In the following, we study our robust quantization method.
Table 1 presents the unit edit distance metric using our robustness method with and without the iterative approach. Compared with the k-means method, which is currently in use, our non-iterative method consistently performs better by a large margin (a relative improvement of at least 30%). We also note that different augmentations affect the representation differently. Our iterative method provides a slight but consistent improvement over the non-iterative method. Notably, the UED increases (i.e., robustness degrades) with the number of units used.

Zero-shot Evaluation
We evaluate the proposed method using the standard GSLM setup, i.e., ABX, sWUGGY, and sBLIMP. The ABX task examines the discriminative phonetic abilities of the representation. Versteegh et al. (2015) show that the ABX result is a good proxy for signal content (i.e., Phoneme Error Rate). The input to the ABX is a pair of words with a phoneme modification and a reference word containing the same phoneme as one of the pair's words. Next, the ABX measures the distance of the test phoneme representation to both the correct and incorrect representations. Finally, the distance between the test and the correct representation is expected to be lower than the distance to the incorrect representation. The ABX task is conducted in two setups: 'within' and 'across.' 'Within' is evaluated on input data from the same speaker, while 'across' is evaluated on input data from different speakers.
Table 2 shows the ABX results for both Librispeech 'clean' and 'other'. In our experiments, we found that the ABX score consistently and significantly improved on all the setups we tested. In this case, the iterative approach improves more than the non-iterative one, but the improvement is inconsistent. For a small number of units and the 'other' split, the ABX score is lower than the iterative model's score. Note that the 'other' split is challenging as it is characterized by recordings that contain background noise and various accents.
The spot-the-word task (sWUGGY) requires detecting the real word from a pair of short utterances such as 'brick' vs. 'blick.' The detection is done by comparing the probabilities given by a language model for each word. This allows comparing representations by training language models on top of them. In contrast, the acceptability judgment test (sBLIMP) requires detecting the syntactically correct sentence from a pair of sentences, one of which is syntactically correct and the other is wrong. The detection is based on the perplexity of the language model. As presented in Table 2, our method enables improvement in all the investigated setups for both the spot-the-word and acceptability judgment tests. This is especially noticeable for a larger number of units. For instance, when considering 200 or 500 units, the absolute improvement of the sWUGGY score is 4.17 and 3.21, respectively.
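The pairwise decision rule behind the spot-the-word test can be sketched as follows. The scoring callable `log_prob` is a hypothetical stand-in for a unit language model, and the toy scores below are made up purely to exercise the function; they are not results from any model.

```python
def spot_the_word_accuracy(pairs, log_prob):
    """Fraction of pairs where the LM scores the real word above the non-word.

    pairs: list of (real_units, fake_units), each a tuple of discrete units.
    log_prob: hypothetical callable returning the LM log-probability of a
        unit sequence.
    """
    correct = sum(log_prob(real) > log_prob(fake) for real, fake in pairs)
    return correct / len(pairs)


# Made-up log-probabilities for four unit sequences (illustration only).
toy_scores = {(1, 2, 3): -2.0, (1, 2, 4): -5.0, (5, 6): -3.0, (5, 7): -2.5}
pairs = [((1, 2, 3), (1, 2, 4)), ((5, 6), (5, 7))]
print(spot_the_word_accuracy(pairs, lambda units: toy_scores[units]))  # 0.5
```

An analogous rule applies to the acceptability judgment test, with sentence pairs in place of word pairs.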

Speech-to-speech Translation
Lastly, we evaluate the proposed method on the speech-to-speech translation task. To better assess the effectiveness of the proposed robust discrete representation, we follow the same setup as in Lee et al. (2022) while changing only the discrete speech representation. Lee et al. (2022) propose a textless speech-to-speech translation method by forwarding a source speech signal and predicting its target's discrete representation. The authors use a k-means model trained on top of a multilingual HuBERT (mHuBERT) for speech representation. Additionally, the authors show that solving an auxiliary task enhances performance. We investigate the impact of using our robust quantizer as an alternative to the k-means used by Lee et al. (2022). In contrast to their setup, we use HuBERT (instead of mHuBERT). Besides that, we follow the same setup in terms of model, computation resources, and data. To evaluate the quality of the translation, we use the sentence BLEU score (SacreBLEU) (Post, 2018).
Table 3 presents the results for the Spanish-English and French-English setups on the Europarl-ST development and test sets (Iranzo-Sánchez et al., 2020). It also shows the original result from Lee et al. (2022). The proposed method improves over Lee et al. (2022) under all the evaluated setups.
Table 3: Speech-to-Speech Translation results: We report BLEU scores for the proposed method (Robust) and compare it against the k-means used in Lee et al. (2022). We report both development and test set results for Spanish(S)-English(E) and French(F)-English(E).
Note, these results are especially interesting as the proposed method was trained on significantly less data (ours was trained on 1k hours while Lee et al. (2022) was trained on 100k hours).

Related work
This work investigates the robustness of self-supervised representations for language modeling. This is related to the advancements in speech self-supervised learning, their robustness, and modern generative spoken language modeling. In the following, we review all three areas.
Self-supervised Learning. The field of deep learning research has significantly benefited from self-supervised learning. Commonly, it involves encoding the input data and performing a task that forces the representation to learn contextual embeddings. Speech self-supervised learning can be divided into two lines of research. The first is discriminative. Oord et al. (2018) introduced Contrastive Predictive Coding (CPC), which trains a convolutional encoder and a predictor for future embeddings of the encoder using a contrastive loss. On top of it, Kharitonov et al. (2021b) propose to use time domain augmentations to improve the CPC model further. Wav2vec (Schneider et al., 2019) suggests using a contrastive loss that requires distinguishing between true and false future audio samples. Later, wav2vec2 (Baevski et al., 2020) learns quantized units using Gumbel softmax and predicts masked spans of the latent speech representation. HuBERT (Hsu et al., 2021) employs a frame-based masked prediction task. First, it quantizes input frames and then predicts masked frames.
The second line of work is generative. An early generative self-supervised work is Autoregressive Predictive Coding (Chung et al., 2019), which predicts the spectrum of a future frame. Later, Liu et al. (2020) introduced Mockingjay, which learns its representation by predicting non-causal context. TERA (Liu et al., 2021) alters time, frequency, and magnitude, and is then required to reconstruct acoustic frames from the altered versions.
Robustness. A desired property of a spoken language representation is robustness to augmentations that do not change the spoken information. The spoken information should not differ significantly when male and female speakers say the same content. There is an interesting trade-off between training a robust representation and the quality of the input data. It is possible, for example, to use the same speaker for all data points in the training set. The model would not be able to learn any speaker bias, but this constraint prevents scaling.
Recently, the robustness of self-supervised speech representations has gained attention from the community. WavLM (Chen et al., 2022) proposes adopting the well-known HuBERT model (Hsu et al., 2021) and training it with an additional denoising process. The authors apply a noising process to the training data and then predict the clean units from it. ContentVec (Qian et al., 2022) is focused on the disentanglement of a speaker from self-supervised speech representation. The authors propose to use three disentanglement components. First, the student network is disentangled through two transformations. Then the representations are forwarded through a speaker condition component. Finally, voice-converted input data points are used to generate teacher labels.

Conclusions
In this work, we first propose a metric for evaluating the robustness of self-supervised speech representations applied for spoken language modeling tasks. Equipped with the aforementioned metric, we point out the lack of robustness in current state-of-the-art speech encoders with respect to simple signal variations that do not alter the spoken information. We then propose a simple and effective method to boost the robustness of the current approaches and demonstrate it on three state-of-the-art self-supervised speech representation models. We empirically show the efficacy of the proposed approach when considering encoding methods together with textless speech-to-speech translation.

Broader Impact
As for broader impacts, this work is the first (to the best of our knowledge) to analyze self-supervised speech representation models under basic signal variations. We hope that with the provided analysis and evaluation, researchers working on spoken language modeling and self-supervised speech representation learning will consider reporting the proposed metric setup along with the evaluation of downstream tasks.

Limitations
The proposed method has several limitations that should be taken into consideration when employing it. First, the method relies on an existing model, e.g., k-means, which creates a dependency between the performance of the initial and the robust models. Second, the flow is not trained end-to-end, which can also limit its performance, as end-to-end training allows improvement of the robustness of the whole representation. Lastly, to fully assess the effectiveness of the method, multiple metrics need to be examined. This can be a limitation as interpreting the results from multiple metrics may not be straightforward. However, it gives a more complete picture of the model's performance.

A Levenshtein Distance
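Throughout the paper, we use a version of the Levenshtein distance. In this section, we detail the Levenshtein distance between two sequences. Let $x \in \{1, \dots, K\}^{T_x}$ and $y \in \{1, \dots, K\}^{T_y}$ be two discrete vectors, not necessarily of the same size. Let us also denote by $\mathrm{tail}(x)$ the operator that returns a copy of the vector $x$ without its first element. Then, the Levenshtein distance is defined recursively by
$$\mathrm{LEV}(x, y) = \begin{cases} |x|, & \text{if } |y| = 0 \\ |y|, & \text{if } |x| = 0 \\ \mathrm{LEV}(\mathrm{tail}(x), \mathrm{tail}(y)), & \text{if } x_1 = y_1 \\ 1 + \min\big\{\mathrm{LEV}(\mathrm{tail}(x), y),\ \mathrm{LEV}(x, \mathrm{tail}(y)),\ \mathrm{LEV}(\mathrm{tail}(x), \mathrm{tail}(y))\big\}, & \text{otherwise} \end{cases}$$
where $|x|$ and $|y|$ are the lengths of the vectors $x$ and $y$, respectively. Note that in our implementation, we use deduplicated sequences.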
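The recursive definition translates almost literally into code. The sketch below mirrors it with memoization so that repeated sub-problems are solved once; it is meant as an illustration of the definition rather than an efficient implementation, and very long sequences would exceed Python's default recursion limit, in which case an iterative dynamic-programming variant is preferable.

```python
from functools import lru_cache


def levenshtein(x, y):
    """Levenshtein distance following the recursive tail-based definition."""

    @lru_cache(maxsize=None)
    def lev(a, b):
        if len(b) == 0:
            return len(a)
        if len(a) == 0:
            return len(b)
        if a[0] == b[0]:
            return lev(a[1:], b[1:])  # first elements match: no cost
        return 1 + min(
            lev(a[1:], b),      # delete the first element of a
            lev(a, b[1:]),      # insert the first element of b
            lev(a[1:], b[1:]),  # substitute one for the other
        )

    # Tuples are hashable, which lets lru_cache memoize sub-problems.
    return lev(tuple(x), tuple(y))


print(levenshtein([10, 11, 21, 32, 21], [10, 11, 40, 21, 32, 21]))  # 1
print(levenshtein("kitten", "sitting"))                             # 3
```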

B Extended Experimental Setup
Models. We study our method using the base versions of HuBERT, wav2vec2, and WavLM. Similar to prior work, we use the ninth layer for HuBERT and WavLM and the sixth layer for wav2vec2. For readability, we report results for HuBERT in the main paper. The results for wav2vec2 and WavLM are presented in Appendix C. In our quantizer learning process, we use a learning rate of 0.0001, a batch size of 32, and the Adam optimizer (Kingma and Ba, 2014). Our quantizer is composed of three fully connected layers with LeakyReLU activation between them. The dimensions of those layers are determined by the floor division of the difference between the upstream dimension and the number of units. We train our quantizer using a single NVIDIA V100 GPU.
Datasets. To match the current popular k-means training set, we use Librispeech-100h to learn our quantizer (Panayotov et al., 2015). We analyze our metric using the 'clean' and 'other' development sets from Librispeech. The augmentations in all setups include time stretch, pitch shift, reverberation, and noise injection (exact parameters are detailed in Section 3.2.1). For the sWUGGY and sBLIMP evaluations, we use the 'big' transformer language model from Lakhotia et al. (2021). This appendix begins with a detailed explanation of the Levenshtein distance (Section A). Then, in Section C, we present additional results. We report results on two additional state-of-the-art self-supervised speech representations. We show that our method is also effective for those representations, consistent with the results in the main paper.

C Additional Results
In the following, we provide additional results on the state-of-the-art representations "wav2vec2" and "WavLM" (Baevski et al., 2020; Chen et al., 2022).
Tables 4 and 5 present the UED scores for both the wav2vec2 and WavLM models. Using our method, we observe robustness improvements for both of the models. However, it is notable that the WavLM model is more robust than the wav2vec2 model. This is reasonable since WavLM is trained to be a more robust model using noisy training samples.
Tables 6 and 7 present the discriminative and generative metrics for both wav2vec2 and WavLM. We observe a consistent improvement using our robust quantizer, as in the robustness metrics. However, for WavLM, the improvements are sometimes marginal (except for k = 50, where k-means outperforms our method). The WavLM model is trained with a HuBERT architecture, with more data and noisy samples. Interestingly, while presenting better performance on various downstream tasks than HuBERT, its ABX, sWUGGY, and sBLIMP scores are lower.

Figure 1: Generative Spoken Language Modeling is composed of three components: (i) Speech-to-unit, (ii) Unit language model, and (iii) Unit-to-speech. Pre-trained ASR and language models are used for evaluation.

Table 1: Unit edit distance study: Using our metric, we assess the robustness of various quantization methods on top of a HuBERT representation. This study uses four different augmentations: time stretching, pitch shifting, reverberation, and noise injection. The non-iterative (Section 4.1) and iterative (Section 4.2) methods significantly and consistently improve the robustness of k-means. Pseudo-labeling accounts for most of the improvement. By applying our method iteratively, we can improve it further. For readability, we multiply the scores by a hundred.

Table 2: Zero-shot discriminative and generative evaluation tasks: We evaluate the ABX score on the 'clean' and 'other' development sets from Librispeech. Our method improves the scores in all setups.