XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words

Due to the absence of explicit word boundaries in the speech stream, segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage recent self-supervised speech models, which have proved to adapt quickly to new tasks through fine-tuning, even in low-resource conditions. Taking inspiration from semi-supervised learning, we fine-tune an XLS-R model to predict word boundaries that are themselves produced by top-tier speech segmentation systems: DPDP, VG-HuBERT, GradSeg and DP-Parse. Once XLS-R is fine-tuned, it is used to infer new word boundary labels, which are used in turn for another fine-tuning step. Our method consistently improves the performance of each system and sets a new state of the art that is, on average, 130% higher than the previous one, as measured by the F1 score on correctly discovered word tokens over five corpora featuring different languages. Finally, our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.


Introduction
In an attempt to model infants' ability to segment speech into words, researchers have aimed at uncovering word boundaries directly from the speech signal, without prior knowledge of the language and, of course, without relying on textual annotations. Even though the notion of 'word' does not obey a set of strict rules, the goal is to bridge the large existing performance gap between speech-based and text-based segmentation systems (Dunbar et al., 2022), where text segmentation into words is the task of finding word boundaries in a phonemicised text from which spaces between words have been removed. Indeed, after a decade of work since the first publications (Jansen and Van Durme, 2011; Lee and Glass, 2012; Lee et al., 2015), it is only in the last couple of years that speech segmentation systems (Bhati et al., 2021; Peng and Harwath, 2022; Kamper, 2023; Algayres et al., 2022b) have successfully done better than a uniform baseline. This progress has enabled the use of discovered spoken words as inputs to spoken language models, both to learn high-level semantic and syntactic representations (Algayres et al., 2022b) and to generate intelligible and meaningful spoken sentences (Algayres et al., 2023). The authors highlight that discovering boundaries aligned with real word boundaries strongly benefits downstream spoken language models. Our code is available at https://gitlab.cognitiveml.fr/ralgayres/wav2boundaries. We present a speech segmentation method inspired by the phoneme segmentation model of Strgar and Harwath (2022) that leverages pseudo-labelling and recent progress in self-supervised learning (SSL) speech models (Baevski et al., 2020; Hsu et al., 2021; van den Oord et al., 2018; Chen et al., 2021; Babu et al., 2021; Conneau et al., 2020). SSL models are trained on large speech datasets to predict masked parts of the input speech signal. Such pre-training yields speech representations that can be quickly adapted through fine-tuning to a variety of downstream tasks, ranging
from ASR to speaker recognition, keyword spotting, intent classification and emotion recognition (Yang et al., 2021). We exploit the ability of SSL models to learn new tasks quickly and fine-tune a pre-trained XLS-R model (Babu et al., 2021) to predict the word boundaries produced by an off-the-shelf unsupervised speech segmentation system. Our method is inspired by the semi-supervised learning literature (Xie et al., 2019; Yalniz et al., 2019; Scudder, 1965; Hinton et al., 2015; Grill et al., 2020; Chen and He, 2020), which has explored how a model can bootstrap itself by providing its own labels. We applied our method to three state-of-the-art speech segmentation systems, VG-HuBERT, DPDP and DP-Parse (Kamper, 2023; Peng and Harwath, 2022; Algayres et al., 2022b), and consistently improved their segmentation performance. Our method works particularly well with DP-Parse: on average over five corpora featuring different languages, an XLS-R fine-tuned on DP-Parse boundaries improves segmentation performance by 130% compared to the previous state of the art. Finally, our method achieves results above the state of the art even in a zero-shot setting where XLS-R is not fine-tuned, and sometimes not even pre-trained, on the target language.
Related Works

Speech Segmentation
A particularly successful approach to the problem of text segmentation is non-parametric Bayesian modelling (Goldwater et al., 2009; Johnson et al., 2007). This approach has inspired two recent speech segmentation systems: DP-Parse and DPDP (Algayres et al., 2022b; Kamper, 2023). These models segment a spoken sentence by first assigning every speech fragment a probability of being a word. Then, using a dynamic-programming beam search, one of the most probable segmentations of the whole spoken sentence is sampled. DPDP assigns probability scores using the loss value of an RNN auto-encoder trained to reconstruct random speech sequences. DP-Parse computes probabilities by estimating the frequency of speech fragments with density estimation over fragments encoded into Speech Sequence Embeddings (SSEs). SSE models are trained with contrastive learning (Algayres et al., 2022a; Settle and Livescu, 2016) or auto-encoders (Kamper, 2018; Peng et al., 2020) to embed variable-size speech segments into fixed-size vectors. The DP-Parse authors have shown that using better SSEs (which can be obtained with weak textual supervision) leads to higher segmentation performance.
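The dynamic-programming search shared by DPDP and DP-Parse can be sketched as follows. Here `word_logprob` is a hypothetical stand-in for the model-specific fragment score (the auto-encoder loss for DPDP, the SSE-based frequency estimate for DP-Parse), and for simplicity we return the argmax segmentation rather than sampling from the most probable ones:

```python
import math

def best_segmentation(n_frames, word_logprob, max_len=10):
    """Dynamic-programming search for a high-probability segmentation.

    word_logprob(i, j) scores the hypothesis that the fragment spanning
    frames [i, j) is a word. Returns the boundary indices (including 0
    and n_frames) of the best-scoring segmentation.
    """
    best = [-math.inf] * (n_frames + 1)  # best score of a parse ending at frame j
    back = [0] * (n_frames + 1)          # backpointer to the previous boundary
    best[0] = 0.0
    for j in range(1, n_frames + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i] + word_logprob(i, j)
            if score > best[j]:
                best[j], back[j] = score, i
    # Backtrack from the last frame to recover the boundary sequence.
    bounds, j = [n_frames], n_frames
    while j > 0:
        j = back[j]
        bounds.append(j)
    return bounds[::-1]
```

With a toy score that favours three-frame fragments, the search recovers the expected uniform parse; in the real systems the score comes from a trained model and a beam is kept instead of a single best path.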
A second type of model has reached the state of the art in speech segmentation: VG-HuBERT (Peng and Harwath, 2022; Peng et al., 2023). This multimodal model fine-tunes the CLS tokens of a pre-trained HuBERT (Hsu et al., 2021) and a pre-trained ViT (Dosovitskiy et al., 2020) on aligned pairs of English utterances and images.
Lastly, Fuchs and Hoshen (2023) also tackle speech segmentation into words using pseudo-labelling and SSL fine-tuning. Yet, our method fine-tunes the full XLS-R model using various optimization methods (iterative self-labelling, a learning-rate scheduler, data augmentation and loss selection), whereas Fuchs and Hoshen (2023) train a single fully connected layer on top of a frozen Wav2vec2.0 (Baevski et al., 2020) without iterative self-labelling. Also, our method is tested across different languages, whereas their model focuses only on English speech.
Wav2vec2.0 and XLS-R

Speech Self-Supervised Learning (SSL) is a paradigm that enables training deep neural networks directly on the speech stream, typically by predicting masked parts of the input speech signal. Wav2vec2.0 (Baevski et al., 2020) is a particularly performant SSL model composed of a convolutional front-end and a stack of transformer layers. Even though other SSL models have outperformed Wav2vec2.0 (Chen et al., 2021; Hsu et al., 2021; Chung et al., 2021) on downstream tasks, Wav2vec2.0 has recently been trained in multilingual settings as XLSR53 (Conneau et al., 2020; 53 languages) and XLS-R (Babu et al., 2021; 128 languages). These multilingual SSL models are excellent candidates for our work, as we wish to perform speech segmentation into words in different languages. We carry out experiments with XLS-R, one of the latest multilingual Wav2vec2.0 models. We provide in Appendix our experiments with other monolingual and multilingual Wav2vec2.0 models to analyse the effect of the amount of pre-training data.

Method
Let us assume a speech dataset C, a pre-trained speech SSL model W, and an off-the-shelf speech segmentation system S. On top of W, we add a randomly initialised feed-forward layer with one neuron and a sigmoid activation. Our method to train W on the boundaries produced by S is as follows.
First, S is used to infer word boundaries for every spoken sentence in C. In addition, spoken sentences are data-augmented with a random amount of reverb, pitch shift, time stretch and time drop, and then encoded into a series of frames by W. For each output frame, we create a label that is 1 if the frame aligns with a word boundary and 0 otherwise. Because word boundaries are, by nature, not clearly defined in the time domain, we also label as 1 the left and right neighbouring frames of every frame already tagged as 1. W is fine-tuned with back-propagation by minimizing the binary cross-entropy between W's output and the labels. In a sentence, most frames are labelled 0 (for 'not a boundary'), and the loss is particularly low on those frames. To force the model to focus on the harder sections of the input, we only back-propagate the loss on the 50% of frames with the highest loss.
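The frame-labelling and loss-selection steps above can be sketched as follows. This is a minimal illustration, not the actual training code: the 20 ms frame stride and the helper names are assumptions, and in practice the loss is computed on XLS-R output tensors rather than plain lists:

```python
import math

FRAME_RATE = 50  # assumed: one output frame every 20 ms (Wav2vec2-style stride)

def frame_labels(boundary_times, n_frames):
    """0/1 label per frame; neighbours of a boundary frame are also set to 1."""
    labels = [0] * n_frames
    for t in boundary_times:
        k = round(t * FRAME_RATE)
        for j in (k - 1, k, k + 1):  # dilate the boundary by one frame each side
            if 0 <= j < n_frames:
                labels[j] = 1
    return labels

def hard_frame_loss(probs, labels, keep=0.5):
    """Binary cross-entropy averaged over only the hardest `keep` fraction of frames."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, labels)]
    losses.sort(reverse=True)
    k = max(1, int(len(losses) * keep))
    return sum(losses[:k]) / k
```

Sorting the per-frame losses and keeping the top half is what prevents the abundant, easy 'not a boundary' frames from dominating the gradient.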
At inference, W produces, for each frame of a given sentence, the probability of a boundary. To decide which frames should be labelled as boundaries, we apply a peak-detection method that finds local maxima by comparing neighbouring probability values. This function has two hyperparameters: the maximal height of peaks and the minimal distance between two peaks. We fit these hyperparameters on the development set by maximizing the F1 score between the boundaries produced by S and the new boundaries produced by W.
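A minimal pure-Python sketch of such a peak picker. We use a minimal-height threshold and greedy highest-first selection as stand-ins for the exact hyperparameters; in practice an off-the-shelf routine such as `scipy.signal.find_peaks` (with its `height` and `distance` arguments) serves the same purpose:

```python
def detect_boundaries(probs, min_height, min_dist):
    """Pick local maxima of per-frame boundary probabilities.

    A frame is a candidate if it is a local maximum above min_height;
    candidates closer than min_dist frames to a higher, already-selected
    peak are discarded.
    """
    candidates = [i for i in range(1, len(probs) - 1)
                  if probs[i] >= min_height
                  and probs[i] > probs[i - 1] and probs[i] >= probs[i + 1]]
    candidates.sort(key=lambda i: probs[i], reverse=True)  # highest first
    peaks = []
    for i in candidates:
        if all(abs(i - p) >= min_dist for p in peaks):
            peaks.append(i)
    return sorted(peaks)
```

Both hyperparameters would then be grid-searched on the development set against the boundaries produced by S.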
At this stage, W can be used to infer a new set of word boundary labels on the dataset C. W is then reset to its initial, un-fine-tuned state and fine-tuned again on the new word boundaries. This process is iterated until segmentation performance starts to decrease.
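The iterative self-labelling procedure can be sketched as follows; `finetune`, `infer`, and `dev_score` are hypothetical stand-ins for the steps described above, and the stopping criterion (development score no longer improving) is our reading of 'until segmentation performances start to decrease':

```python
def self_label_rounds(initial_labels, finetune, infer, dev_score, max_rounds=5):
    """Iterative pseudo-labelling sketch.

    Each round: reset W to its pre-trained state and fine-tune it on the
    current labels (finetune), re-infer boundary labels (infer), then stop
    once the development score stops improving.
    """
    labels = initial_labels
    best_labels, best = labels, dev_score(labels)
    for _ in range(max_rounds):
        model = finetune(labels)       # W restored to pre-trained weights first
        labels = infer(model, labels)  # new pseudo word-boundary labels
        score = dev_score(labels)
        if score <= best:
            break
        best, best_labels = score, labels
    return best_labels
```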

Datasets, Evaluation and Hyperparameters
The metric we use to evaluate performance is the token-F1 score, i.e. the F1 score on correctly discovered tokens. A token is correctly discovered when both of its boundaries correspond to the boundaries of a word in the time-aligned transcription. This metric was introduced by the ZeroSpeech Challenge 2017 (Dunbar et al., 2017) and is computed with the TDEv2 library. During fine-tuning, we use back-propagation on batches of 12 spoken sentences of at most 20 seconds each. XLS-R is fine-tuned on a single 32 GB GPU for a maximum of 2000 updates, after which we keep the model with the lowest loss on the development set. Optimization is done with the Adam optimizer (Kingma and Ba, 2017); the learning rate is warmed up from 0 to 10^-4 and then decayed back to 0 with cosine annealing (Loshchilov and Hutter, 2016) with period 10^3. We freeze the convolutional front end and use 10% dropout, 15% layer-drop, and 15% masked frames. Data augmentation is done mainly with the WavAugment library. We use the parameter values advised by its authors (Kharitonov et al., 2020): for reverb, we sample the room scale uniformly in [0,100] while keeping the other parameters unchanged, and for pitch, we pick a value uniformly in the range [-300,300]. Time-stretch coefficients are uniformly sampled between 0.8 and 1.
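A simplified sketch of the token-F1 computation described above. The official metric is implemented by the TDEv2 library and includes boundary tolerances that we omit here; boundaries are indices (including the utterance edges) and tokens are consecutive boundary pairs:

```python
def token_f1(gold_bounds, found_bounds):
    """F1 over tokens; a token counts only if BOTH its boundaries match gold."""
    gold = set(zip(gold_bounds, gold_bounds[1:]))
    found = set(zip(found_bounds, found_bounds[1:]))
    hits = len(gold & found)
    if hits == 0:
        return 0.0
    precision = hits / len(found)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Requiring both boundaries to match makes token-F1 stricter than boundary-F1: a single misplaced boundary destroys the two tokens on either side of it.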

F1 scores for different noisy boundaries
In Table 1, we compare the token-F1 scores of different speech segmentation systems with the token-F1 scores obtained after iterative fine-tuning of XLS-R initialised by these systems. Instead of fine-tuning a different XLS-R model on each dataset, we found that we always get equal or higher performance by fine-tuning a single XLS-R on the boundaries of the five datasets at once. We argue that no overfitting is possible, as our method is completely unsupervised and no true word boundary labels are used to train these models. In particular, the German and Wolof datasets have never been used to tweak hyperparameters. In Appendix Table 5, we provide the main scores obtained when XLS-R is fine-tuned on each dataset separately. Even in this setting, our method produces an average token-F1 score that is twice as high as the previous state of the art.
The first two segmentation systems are baseline models. For the first, speech is segmented at the VAD timestamps. For the second, in addition to the VAD timestamps, we add random boundaries so that token durations have the same mean and standard deviation as the true word tokens. As each VAD section provides two true word boundaries, we expected this information to be enough to kick-start our self-training method. The results show that it is not: our method leads to a drop in segmentation scores. We then evaluate the three speech segmentation systems presented in the related works section. On average over the five datasets, our method consistently improves their F1 scores, with a clear advantage for DP-Parse, which reaches an average token-F1 score of 40.7.
Finally, we present topline models. For weak-sup DP-Parse, we segment speech with the help of weakly supervised SSEs instead of unsupervised ones (as explained in Section 1). weak-sup DP-Parse achieves higher performance than DP-Parse, which shows that our fine-tuning method could yield even better results, provided better SSE models. For Gold, we use the true word boundaries; here, XLS-R degrades segmentation performance on average from 100% to 72.2%. (As the datasets do not have a validation set, when we train XLS-R to predict true word boundaries, and only in this case, we keep a held-out development set to compute the F1 scores.)
The last topline, DP-Parse on text, gets 66.6% token-F1. The scores on text data show that our fine-tuning strategy has significantly narrowed the performance gap between speech and text. The scores on Gold being higher than those of DP-Parse on text shows that our strategy has the potential to bridge the gap between speech and text segmentation completely. To complete our analysis, we provide in Appendix Table 4 the Boundary-F1 scores, which are the F1 scores on correctly discovered boundaries instead of correctly discovered tokens.
Overall, the results show great discrepancies between the initial F1 scores of a model and the performance of a fine-tuned XLS-R. We are not yet able to explain precisely why fine-tuning XLS-R on DP-Parse works better than on DPDP and VG-HuBERT. The main reason is certainly that we initially chose hyperparameters to maximise the performance of XLS-R when fine-tuned on DP-Parse. We then tried other hyperparameters (different learning rates and data augmentations) to try to boost the DPDP and VG-HuBERT scores but did not manage to improve over the scores reported in Table 1. Figure 1 gives a visual overview of these performances compared to the speech segmentation systems submitted to the ZeroSpeech challenge (Dunbar et al., 2017) since 2017. The increase in performance obtained by fine-tuning XLS-R appears in deep blue.
To analyse the importance of each of the tricks used in our fine-tuning strategy, we provide in Table 2 the average token-F1 scores over the five datasets, successively ablating each trick. Overall, this table shows that the main gain of our method comes from the simple fine-tuning of XLS-R on noisy boundaries.

Zero-shot performances
We show in Table 3 that our method can segment languages unseen during the fine-tuning stage, and even unseen during the pre-training stage of XLS-R (which is the case for Wolof). For each language in turn, we selected four of the five datasets for fine-tuning and used the remaining dataset for testing. We ran this experiment using both DP-Parse boundaries and the true word boundaries. The zero-shot results are sometimes as high as when all datasets are included in the fine-tuning stage. This result echoes the intuition from Peng et al. (2023) that these models can learn universal segmentation features, which appear in their study to coincide with syllables.

Conclusion
In this work, we propose an unsupervised speech segmentation system that fine-tunes XLS-R on boundaries provided by an external off-the-shelf speech segmentation system. Our method increases word segmentation performance by 130% compared to the previous state of the art. Our method also shows high performance in the zero-shot setting, which suggests that universal segmentation features exist in the speech signal. Regarding interpretation, our results are sometimes hard to explain.
Even though we showed that higher initial F1 scores (obtained with weak supervision) do lead to better performance, more work is needed on the evaluation of speech segmentation models to understand which characteristics are beneficial to XLS-R fine-tuning but not captured by F1 scores.

Limitations
Our method has only been tested on the ZeroSpeech corpora, which come pre-segmented by Voice Activity Detection (VAD). These VAD sections were obtained by first force-aligning audio and transcriptions and then excluding all audio sections aligned to silences or noise. If our model is used to segment other audio files, it is important to properly remove silent sections, as well as any other non-speech sections, beforehand (we advise using Pyannote (Bredin et al., 2019) or Brouhaha (Lavechin et al., 2023) if you cannot rely on force-alignment). Also, the audio from the ZeroSpeech corpora is studio-recorded, which means the level of noise is extremely low. The performance of our model in noisier recording conditions would be much lower than reported in this paper.

Ethics Statement
Our model inherits all the biases of audio models pre-trained on large amounts of data. In particular, languages and accents that were not present in the original pre-training dataset may be less well encoded by XLS-R, which could result in impaired performance. The reader can refer to the list of pre-training languages in Babu et al. (2021).
true boundaries. For a visualization of these results, Figure 2 shows the average token-F1 per model. As expected, pre-training Wav2vec2.0 is strongly beneficial for learning word boundaries. Yet, the amount of speech data available for pre-training does not correlate well with segmentation performance. In spite of being trained on much less data than W2V2-LV and XLSR53, W2V2-LS reaches high segmentation performance. Preliminary results on other SSL models have not been included in this section for lack of time. In particular, HuBERT (Hsu et al., 2021) has slightly lower performance than the Wav2vec2.0 models. Also, a recent Wav2vec2.0 model came to our attention (Pratap et al., 2023), pre-trained on nearly 4000 different languages, but we did not have time to include it in our work.

B Study on the type of input boundaries
We are not yet able to explain precisely why fine-tuning XLS-R on DP-Parse works better than on DPDP and VG-HuBERT. As said in the main paper, the main reason is certainly that the optimisation hyperparameters have been tuned to maximise performance when XLS-R is fine-tuned on DP-Parse. Yet, we think there could be another reason: the difference in tokens-per-type ratios. This ratio is obtained by first transcribing the discovered speech tokens and then dividing the number of discovered tokens by the number of distinct transcriptions (types). Indeed, as XLS-R needs to (at least partially) memorise the different word types it is trained to segment, XLS-R will more easily learn to segment a small number of types than a large one. In Table 7, we show that DP-Parse has a higher tokens-per-type ratio (i.e. fewer types) than DPDP and VG-HuBERT.
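The ratio itself is straightforward to compute; a minimal sketch, assuming each discovered token is represented by its transcription:

```python
def tokens_per_type(transcribed_tokens):
    """Ratio of discovered tokens to distinct transcriptions (types).

    A higher ratio means fewer distinct word types to memorise, which we
    hypothesise makes the pseudo-labels easier for XLS-R to learn.
    """
    return len(transcribed_tokens) / len(set(transcribed_tokens))
```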
For that reason, we think that, compared to its competitors, DP-Parse provides a better kind of input for XLS-R fine-tuning. This higher tokens-per-type ratio could come from DP-Parse's higher tokens per second, as shown in Table 7. More work is needed to know whether XLS-R simply favours segmentation systems that tend to over-segment (and therefore have more tokens per second and a higher tokens-per-type ratio).
For completeness, we also provide in Table 7 the tokens per second and tokens-per-type ratios after fine-tuning XLS-R on unsupervised boundaries. As expected from the token-F1 and boundary-F1 analysis, fine-tuning XLS-R pushes the tokens-per-second and tokens-per-type ratios closer to those obtained with the true segmentation.

Figure 1 :
Figure 1: A general view of the performance of speech segmentation models so far. The figure shows the average token-F1 scores (Mandarin, French, English, Wolof, German) obtained by different systems. In light blue are the original scores and in deep blue the increase in performance after XLS-R fine-tuning. The baseline is a segmentation every 120 ms and the topline is DP-Parse applied to text data. In green is the performance obtained by the weakly supervised DP-Parse model.

Figure 2 :
Figure 2: Average token-F1 scores (Mandarin, French, English, Wolof, German) obtained by different Wav2vec2.0 models pre-trained on different amounts of speech and fine-tuned on either DP-Parse boundaries or Gold (i.e. true word boundaries). The average token-F1 score of DP-Parse on the five datasets is shown as a black line.

Table 1 :
Token-F1 obtained by different segmentation systems ('init' in the table) and after iterative fine-tuning of XLS-R on all datasets at once ('ft' in the table). weak-sup DP-Parse is a topline that uses weakly supervised SSEs instead of unsupervised SSEs. Gold is a supervised topline where XLS-R is fine-tuned with the true word boundaries. DP-Parse on text is obtained by replacing the speech stream with text without spaces between words.

Table 2 :
Ablation table: average token-F1 score of segmentation over the five corpora. Each row is an ablation relative to the row above it.

Table 3 :
Zero-shot performances: for each corpus, we report the token-F1 score after fine-tuning XLS-R to predict the boundaries of all other corpora except itself. Scores are presented for DP-Parse boundaries and true word boundaries (Gold).

Table 6 :
Token-F1 scores obtained by different Wav2vec2.0 models pre-trained on different amounts of speech and fine-tuned on either DP-Parse boundaries or Gold (i.e. true word boundaries).

Table 7 :
Tokens per type and tokens per second obtained by different segmentation systems ('init' in the table) and after iterative fine-tuning of XLS-R on all datasets at once ('ft' in the table). weak-sup DP-Parse is a topline that uses weakly supervised SSEs instead of unsupervised SSEs. Gold is a supervised topline where XLS-R is fine-tuned with the true word boundaries. †: Kamper (2023). ∨: Peng and Harwath (2022). ×: Algayres et al. (2022b).