Self-supervised Rewiring of Pre-trained Speech Encoders: Towards Faster Fine-tuning with Less Labels in Speech Processing

Pre-trained speech Transformers have facilitated great success across various speech processing tasks. However, fine-tuning these encoders for downstream tasks requires sufficiently large training data to converge or to achieve state-of-the-art performance. In the text domain this has been partly attributed to sub-optimality of the representation space in pre-trained Transformers. In this work, we take a sober look into pre-trained speech encoders and rewire their representation space without requiring any task-specific labels. Our method utilises a neutrally synthesised version of audio inputs along with frame masking to construct positive pairs for contrastive self-supervised learning. When used for augmenting the wav2vec 2 encoder, we observe a consistent improvement of isotropy in the representation space. Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement, especially in low-resource settings.


Introduction
Self-supervised pre-trained speech encoders (Hsu et al., 2021a; Baevski et al., 2020) are universal models that are beneficial to a wide range of speech processing tasks and domains (Liu et al., 2022; Tsai et al., 2022). Similar to other modalities such as text, these pre-trained encoders are fine-tuned towards downstream tasks (Wang et al., 2022; Gállego et al., 2021). While the fine-tuning step often benefits substantially from the presence of warm pre-trained encoders, for involved tasks such as Automatic Speech Recognition (ASR), it still requires both sufficiently large training sets and several iterations (Yang et al., 2021) to converge to an acceptable task performance.
Side-stepping the size of the parameter space as a well-studied challenge for fine-tuning Transformer models, a confounding factor contributing to this issue, which has recently been discussed for the text domain (Su et al., 2022; Gao et al., 2021b; Liu et al., 2021; Su et al., 2021), is the sub-optimal utilisation of the representation space (e.g., anisotropy (Ethayarajh, 2019)). This is of paramount importance since speech, unlike text, carries information (e.g., prosodic and para-linguistic) beyond content, which demands a richer utilisation of the representation space (Mohamed et al., 2022). Inevitably, less expressive initial representations translate into longer training and call for more labelled data, even in cases of frozen models. Nonetheless, understanding representation space utilisation in pre-trained speech Transformers is heavily underexplored (Pasad et al., 2021; Hsu et al., 2021b).
We move towards addressing this gap by highlighting the properties of such representation spaces, and proposing a self-supervised learning method that improves their utilisation prior to task fine-tuning. Our contrastive learning framework constructs positive pairs by (i) encouraging invariance to local perturbations both at the input and representation levels, and (ii) enhancing sensitivity to content by using a monotonically synthesised version of speech inputs.
Our experimental findings across 6 diverse speech processing tasks (covering content, speaker and semantics tasks), built on top of the widely used WAV2VEC 2 LARGE (W2V2) (Baevski et al., 2020) encoder, demonstrate that contrastive rewiring brings substantial improvement, both in task performance and fine-tuning speed. Particularly, our approach shines in the low-resource condition, outperforming the W2V2 baseline with substantially fewer fine-tuning updates. For instance, in ASR with 1% training data, our approach achieves 1/4 of the error in 1/5 of the fine-tuning updates. Beyond task performance and convergence speed, both our qualitative and quantitative analyses of the representation space highlight the improvements injected by our rewiring strategy.

Self-Supervised Contrastive Rewiring
Our method builds on top of a pre-trained speech encoder, using a small (less than 7k) set of raw unlabelled audio signals as the self-supervised learning basis for contrastive rewiring. In what follows, we detail how utterance-level speech representations are produced from the underlying encoder, and provide a brief overview of the InfoNCE objective function used for our contrastive rewiring. We finish by explaining how we construct the pairs needed for contrastive learning.
Speech Representation. Most pre-trained speech encoders, including W2V2, do not have an explicit token representing the utterance-level representation (e.g., [CLS] for BERT (Kenton and Toutanova, 2019)). Given a raw audio sequence s of length L, W2V2 emits m vectors, where m ≪ L, at each layer (a total of 24 Transformer layers + 1 feature extractor layer). Similar to Chung et al. (2021), we take the mean of these vectors to construct the utterance-level representation used for contrastive learning.
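As a concrete sketch of this pooling step (the function name and numpy framing are ours for illustration; the actual implementation operates on the encoder's internal tensors):

```python
import numpy as np

def utterance_embedding(layer_frames: np.ndarray) -> np.ndarray:
    """Collapse the m frame-level vectors (shape m x d) emitted by one
    encoder layer into a single utterance-level vector by mean pooling."""
    return layer_frames.mean(axis=0)

# e.g. 50 frames of 1024-dim W2V2 LARGE features -> one 1024-dim vector
frames = np.random.randn(50, 1024)
utterance = utterance_embedding(frames)
assert utterance.shape == (1024,)
```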
InfoNCE. We use the InfoNCE objective (Oord et al., 2018) to rewire speech representations by pulling positive examples, (s_i, s'_i), closer and pushing away the negative pairs, (s_i, s_j). The loss for a batch b is

ℓ = −(1/|b|) Σ_{s_i ∈ b} log [ exp(cos(f(s_i), f(s'_i))/τ) / ( exp(cos(f(s_i), f(s'_i))/τ) + Σ_{s_j ∈ N_i} exp(cos(f(s_i), f(s_j))/τ) ) ]

where f(·) indicates the encoder, τ denotes the temperature hyperparameter, cos(·, ·) denotes the cosine similarity between two representations, and N_i includes all negative examples for s_i. All parameters of the encoder are updated during optimisation.
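A minimal numpy sketch of this batch objective may clarify it (the function name is ours; the real training loop backpropagates through the encoder, which plain numpy does not do):

```python
import numpy as np

def info_nce_loss(z: np.ndarray, z_pos: np.ndarray, tau: float = 0.04) -> float:
    """InfoNCE over a batch of utterance embeddings: pull each anchor z[i]
    towards its positive z_pos[i] and away from every other anchor in the
    batch (the in-batch negative set N_i)."""
    # cosine similarity via L2-normalised dot products
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    b = len(z)
    sim_pos = (z * z_pos).sum(axis=1) / tau      # cos(f(s_i), f(s'_i)) / tau
    sim_neg = (z @ z.T) / tau                    # cos(f(s_i), f(s_j)) / tau
    # drop the diagonal (an anchor is not its own negative)
    negatives = np.exp(sim_neg)[~np.eye(b, dtype=bool)].reshape(b, b - 1)
    denom = np.exp(sim_pos) + negatives.sum(axis=1)
    return float(-(sim_pos - np.log(denom)).mean())
```

Note the small temperature (0.04, the paper's value) sharpens the softmax so that near-duplicate positives dominate the denominator.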

Contrastive Pair Construction
Positive Pairs. We form positive pairs both at the raw and representation levels. For a given audio signal s_i, we deploy the following 3 strategies to construct its corresponding positive pairs, (s_i, s'_i):

Twin. Inspired by Liu et al. (2021); Gao et al. (2021b), given a speech sequence of length L, we first duplicate it. Then we randomly select a starting point for a span, and mask p × L consecutive samples of the audio, replacing them with [MASK]. We use p = 20% in our experiments. This is applied always and only once to each s_i.
Neutral. For a given audio s_i, its monotonic neutral version is created from available transcripts using the Festival Speech Synthesis System. This synthesizer is chosen because it is able to produce non-expressive speech, as demonstrated in previous studies (Lotfian and Busso, 2017). The neutral version is devoid of noise, prosody and para-linguistic features, focusing mostly on content. Figure 1 illustrates the desired effect of Neutral rewiring.

Mixed. While the Twin strategy aims to make the representations invariant to local changes and noise, the Neutral approach tends to rewire the space based on content-level similarity. To leverage the benefits of both, as our main strategy, we uniformly interchange Twin and Neutral in the Mixed setting.
Negative Pairs. In all strategies, given a batch b and a sample s_i ∈ b, the set N_i of negative examples for s_i is N_i = {s_j | s_j ∈ b, j ≠ i}. Further, strategy-specific negative samples are added to N_i to construct the negative pairs, (s_i, s_j); for Mixed, the union of the Twin and Neutral negatives is used.
Similar to Liu et al. (2021) and Gao et al. (2021b), in all our strategies, we apply dropout to perturb both the representations and internal components of the Transformer. Note that Twin and masking are aligned with how a Transformer-based model is trained, so rewiring with this strategy is unlikely to create conflicts. Furthermore, Neutral potentially pushes representations towards eliminating para-linguistic features from speech, as it shifts the focus to content.
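To make the Twin construction concrete, the following is a minimal sketch (the function name and the [MASK] fill value for raw audio are our assumptions; the paper does not specify how masked samples are represented):

```python
import numpy as np

MASK_VALUE = 0.0  # assumed stand-in for [MASK] on raw audio samples

def twin_mask(signal: np.ndarray, p: float = 0.2, rng=None) -> np.ndarray:
    """Duplicate a raw audio signal of length L and mask one contiguous
    span of p * L samples, starting at a random position."""
    rng = rng if rng is not None else np.random.default_rng()
    twin = signal.copy()
    span = int(p * len(signal))
    # the start lies within the first (1 - p) fraction so the span fits
    start = int(rng.integers(0, len(signal) - span + 1))
    twin[start:start + span] = MASK_VALUE
    return twin
```

With p = 0.2 this matches the paper's setting of masking 20% of the signal in one contiguous span, applied exactly once per input.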

Experiments
In this section we describe our experimental settings (§3.1), followed by downstream task results in full and low-resource scenarios (§3.2). We finish by providing an analysis of the quantitative and qualitative properties that were improved via our contrastive rewiring approach (§3.3).

Experimental setups
Rewiring Dataset. For all contrastive rewiring strategies on top of a pre-trained W2V2, we used a small subset (6.6k instances) of LibriSpeech (Panayotov et al., 2015) train-clean-100 where instance lengths are under 180k samples; this subset is also part of the training data for W2V2. LibriSpeech also contains transcripts, which we used for Neutral rewiring to produce neutral speech.

Downstream Tasks. We experimented with 6 diverse downstream speech processing tasks from the SUPERB benchmark (Yang et al., 2021): Automatic Speech Recognition (ASR), Speaker Diarization (SD), Intent Classification (IC), Slot Filling (SF), Keyword Spotting (KS) and Query by Example Spoken Term Detection (QbE). These tasks cover the semantic, speaker, and content aspects of speech. For each task, we simulate various resource conditions by sampling 1%, 5%, and 10% of the entire training set, while using the original dev set to decide the best model. The original test sets were used for evaluation, and each task is evaluated using its specific evaluation metrics. For details on tasks and evaluation metrics, see Appendix A. We detail data statistics in Table 1. We follow the instructions in SUPERB and use the s3prl toolkit to prepare the datasets.
Baseline. While our approach is not dependent on a specific speech Transformer, we use W2V2 as one of the most widely used Transformer-based speech encoders. We follow the SUPERB evaluation pipeline by freezing it as an encoder for downstream tasks, while attaching a benchmark-specified lightweight prediction head for each task. For details on task head architectures, see Appendix A. We follow an identical protocol for fine-tuning and evaluating all models.

Implementation Details. During rewiring, we set the dropout to 0.1, the learning rate to 1e-6 and the temperature for InfoNCE to 0.04. To overcome hardware constraints, we truncated audio signals and encoded them sequentially. To avoid memory issues caused by lengthy audio signals, we read one audio instance at a time and set the audio length threshold to 90k. Our batch size for rewiring was 8, and we rewired W2V2 for 1.7k, 11.6k, and 5k updates for Twin, Neutral and Mixed, respectively.
During downstream task fine-tuning, under the 100% condition, we set the max step to ∼200k for both the baseline and our models. Under the 1%, 5% and 10% resource settings, we fine-tune the models for substantially fewer updates, although the baseline model still requires several steps for convergence (see Table 2). We follow the settings in Yang et al. (2021) for hyper-parameters, and use a weighted sum of the hidden states of each layer from W2V2 as the final representation for downstream tasks. For more details, please refer to Appendix B.

Main Results
Our task fine-tuning results are presented in Table 2. The results highlight the improvement our contrastive rewiring brings across various downstream tasks, both in performance and fine-tuning speed. Notably, in the 1% and 5% training scenarios, our models (at least 1 and in many cases all 3 strategies) outperform the W2V2 baseline on all tasks, while requiring substantially fewer fine-tuning updates. More details per task follow. Overall, our proposed approach improves performance and convergence across various speech processing tasks in all resource conditions, with more remarkable gains in low- to mid-resource conditions.

Analysis and Discussion
Qualitative Analysis. Figure 2 demonstrates the t-SNE (van der Maaten and Hinton, 2008) visualisation of the impact of applying each strategy to W2V2. We clearly observe that better clustering of the representation space, especially after applying Neutral (bottom left), emerges without any task fine-tuning.
Task Type. According to Table 1, we have 3 types of tasks: content, semantic and speaker. As expected, the content and semantic tasks are the biggest gainers from our contrastive rewiring in the low-resource setting (1-5%). This verifies our earlier qualitative analysis and aligns well with the motivation for leveraging neutral speech in the Neutral strategy, which is expected to put more emphasis on content.
Isotropy. We speculate that the benefit of our rewiring strategies is also rooted in reshaping the representation space geometry. Isotropy of the embedding space is a desired property, and we conjecture that the representations of W2V2 are potentially anisotropic: crowded into a narrow slice of the representation space. To test our conjecture, we calculated the isotropy scores of speech representations produced by 4 models on 5 different datasets. We approximate the isotropy score as I(V) = min_{c∈M} Z(c) / max_{c∈M} Z(c), with Z(c) = Σ_i exp(c⊺v_i) (Mu and Viswanath, 2018), where V is the matrix of representations, and M is the set of eigenvectors of V⊺V. The isotropy scores of the W2V2, Twin, Neutral, and Mixed models are on the order of 1e-300, 1e-10, 1e-30, and 1e-10, respectively. This result confirms our conjecture: all three of our models improve the isotropy of the representation space by orders of magnitude compared with W2V2.
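This approximation can be computed directly; below is a small sketch under our reading of the Mu and Viswanath (2018) formulation (the helper name is ours):

```python
import numpy as np

def isotropy_score(V: np.ndarray) -> float:
    """Approximate isotropy: I(V) = min_c Z(c) / max_c Z(c) over the
    eigenvectors c of V^T V, with Z(c) = sum_i exp(c^T v_i).
    A score close to 1 indicates an isotropic representation space."""
    _, eigvecs = np.linalg.eigh(V.T @ V)   # columns are eigenvectors of V^T V
    Z = np.exp(V @ eigvecs).sum(axis=0)    # Z(c) for every eigenvector c at once
    return float(Z.min() / Z.max())
```

A representation cloud concentrated in a narrow cone yields a huge partition value Z along the dominant direction and a tiny one along the others, driving the ratio towards zero, which matches the extremely small scores reported for W2V2.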

Conclusion and Future Work
In this paper, we presented effective and efficient self-supervised contrastive learning methods to rewire the representations of pre-trained speech Transformer models. We demonstrated that lightly rewiring WAV2VEC 2 improves the convergence speed of fine-tuning as well as task performance on 6 downstream tasks. In particular, in low-resource conditions our method performed substantially better than the underlying WAV2VEC 2. Our analysis indicated that the rewiring creates a much better-discriminated representation space, making it better suited for fine-tuning towards tasks. As future work, we plan to cover more downstream tasks, and invest more into designing hard negative pairs to further augment the contrastive learning.

Limitations
We hoped to extend our experiments to all tasks in SUPERB, but certain tasks involved data access difficulties (we initiated the requests but never got access). Additionally, we did not see significant gains from rewiring on Phoneme Recognition, which could stem from our construction of utterance-level representations. This suggests that finer-grained representations need to be included for fine-grained tasks. It also requires further investigation to fully understand why our method performs extremely well in certain resource conditions and relatively well on certain tasks compared to others. Although we have provided certain conjectures, this analysis requires a standalone work.

Ethics Statement
Our work is built on top of WAV2VEC 2, which is pre-trained on massive speech data. Our goal was not to alleviate the well-documented issues (e.g., privacy, undesired biases, etc.) that large pre-trained models have. For this reason, our work shares the potential risks and concerns posed by these models.

A SUPERB Tasks Details
We provide a brief overview of the tasks, prediction head settings and evaluation metrics. For further details please refer to the SUPERB paper (Yang et al., 2021) or the SUPERB leaderboard.

• ASR aims to transcribe audio into text. A vanilla 2-layer BLSTM is applied as the downstream task model, optimised with CTC loss. The evaluation metric is word error rate (WER).
• KS classifies utterances to detect specific keywords. Mean-pooling and a linear layer with cross-entropy loss are applied as the downstream task model. The task is evaluated using accuracy (ACC).
• QbE aims to detect spoken terms in an audio database by calculating whether a given query matches a spoken document. It does not require training. Dynamic Time Warping (DTW) and standard distance functions are used on all hidden states to report the final score. Maximum term weighted value (MTWV) is used for evaluation.
• SD, given an audio in which more than one person speaks alternately, aims to determine the speaker at each timestamp. A single-layer 512-unit LSTM is applied as the downstream task model. The evaluation metric is diarisation error rate (DER).
• IC is designed to detect the intent of the speaker from a spoken utterance. Mean-pooling and a linear transformation with cross-entropy loss are employed in the downstream task model. The evaluation metric is accuracy (ACC).
• SF aims to detect a sequence of semantic slot-types based on spoken words. Slot-type labels are included in transcriptions as special tokens, while SF is treated as an ASR problem. A vanilla 2-layer BLSTM with CTC loss is applied as the downstream task model. The evaluation metrics are slot-type F1 score and slot-value CER.
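Since QbE scores query-document matches with DTW rather than a trained head, a minimal sketch of DTW matching over frame features may be useful (the per-frame cosine distance and the function name are our illustrative choices, not necessarily those of the benchmark pipeline):

```python
import numpy as np

def dtw_distance(q: np.ndarray, d: np.ndarray) -> float:
    """Dynamic Time Warping cost between a query (n x k) and a document
    (m x k) of frame features, using cosine distance per frame pair."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    dn = d / np.linalg.norm(d, axis=1, keepdims=True)
    cost = 1.0 - qn @ dn.T                       # pairwise cosine distances
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)        # accumulated cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return float(acc[n, m])
```

A lower accumulated cost indicates a better alignment between the spoken query and a region of the document.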

B Implementation Details
We rewire WAV2VEC 2 for 1.7k, 11.6k, and 5.0k updates (≈1, 7, 3 epochs) for TWIN, NEUTRAL and MIXED, respectively. Applying more epochs may lead to overfitting; for TWIN, the training loss drops to almost zero in epoch 2. Next, in downstream tasks, under the full-resource condition, we set the max step to ∼200k for both the baseline and our models. Under the 1%, 5% and 10% resource settings, Table 3 shows the max steps we set for training the models, and we test the best checkpoints accordingly. Note that the baseline and our models are trained with the same number of updates in all settings. During the rewiring process, we use a dropout of 0.1 for WAV2VEC 2 for all training instances. The learning rate is set to 1e-6 and the temperature for the InfoNCE loss to 0.04. Additionally, contrastive learning is known to require a sufficient number of positive and negative pairs (Gao et al., 2021a), which cannot be achieved naively, again, due to the length issue. To solve this problem, we use a batch size of 4.
For each input within the batch, we get the augmented version and feed the two data points to the network to obtain two utterance-level vectors; this process applies to the remaining 3 training examples. This effectively relaxes the memory requirement for training pre-trained speech models, given the hardware constraint. To avoid memory issues that come from lengthy audio signals, in practice we read one audio instance at a time and set the audio length threshold to 90k. Whenever the audio length is greater than the threshold, we split the audio input into two parts of equal length, and randomly select one of them as the basis for further augmentation. Furthermore, to mask the augmented data when training TWIN, we randomly pick a starting point in the first four-fifths of the frames of the audio, and then mask consecutive frames amounting to one-fifth of its total length. For NEUTRAL, we generated neutral speech from the transcriptions of the chosen LibriSpeech subset with the aforementioned TTS software offline. When training MIXED, we randomly select an augmentation technique from Twin and Neutral to construct positive pairs. The truncation trick applies to all three methods.
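The truncation trick reads, in a minimal sketch (the constant and function name are ours for illustration):

```python
import numpy as np

LENGTH_THRESHOLD = 90_000  # max raw-audio length (in samples) used during rewiring

def truncate_audio(signal: np.ndarray, rng=None) -> np.ndarray:
    """If the audio exceeds the length threshold, split it into two equal
    halves and randomly keep one as the basis for further augmentation."""
    rng = rng if rng is not None else np.random.default_rng()
    if len(signal) <= LENGTH_THRESHOLD:
        return signal
    half = len(signal) // 2
    return signal[:half] if rng.random() < 0.5 else signal[half:]
```

Keeping a random half, rather than always the prefix, avoids biasing the rewiring data towards utterance beginnings.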

C t-SNE on Intent Classification
Figure 3 illustrates the t-SNE (van der Maaten and Hinton, 2008) visualisation on IC. While the representation space is not as well-separated as for KS, it is still rather clear that NEUTRAL and MIXED achieve better separation of speech representations and consistency within clusters (higher concentration of the same colour within clusters). While the points seem to be separated across the space, a closer look indicates that the shaped clusters are substantially mixed, making it much more difficult for the task layer to discriminate between these points (much worse than W2V2).

Figure 1: Conceptual visualisation: the vanilla representation space, which is very sensitive to surface similarity of audio signals (left), vs. the representation space rewired with the Neutral strategy, which places more emphasis on content similarity (right).

Table 1: Dataset statistics. *: QbE is zero-shot. Task types are determined by SUPERB.
SD: Our method is helpful in low- and mid-resource settings; at 100% it has lower performance, but Mixed helps the model converge 6× faster.

Table 2: Results and the number of fine-tuning updates needed to achieve the best performance on downstream tasks under various resource conditions. For ASR, the number of training instances was 76% of the full set, as the remainder overlapped with the data used for rewiring.

Table 3: Maximum number of training steps.