Multi-accent Speech Separation with One Shot Learning

Speech separation has been studied intensively in recent years. However, little work has examined the multi-accent speech separation scenario. Unseen speakers with new accents and noise give rise to a domain mismatch problem that conventional joint training methods cannot easily solve. We therefore applied MAML and FOMAML to this problem and obtained higher average Si-SNRi values than joint training on almost all the unseen accents. This indicates that these two methods can produce well-trained parameters for adapting to speech mixtures of new speakers and accents. Furthermore, we found that FOMAML performs similarly to MAML while requiring much less training time.


Introduction
Speech separation has been a well-known task to solve in the speech processing field. Many model architectures mentioned in Section 2 have been proposed and achieved high performance. This suggests that deep learning based methods are suitable for the speech separation task.
Despite these promising results, the generalizability of these models is still questionable. Their performance is not guaranteed when switching to different datasets or environments. A straightforward solution is to exhaustively collect data under all kinds of environment settings and train a model on these data jointly. Although this may sound reasonable, it is difficult to anticipate every situation during training. Meta-learning offers a way to ensure that models can be quickly adapted to mixtures spoken by new speakers using only a few samples. Meta-learning has been widely applied to different speech tasks, especially speech recognition, as discussed in Section 2. Nonetheless, little work has applied meta-learning to the speech separation task. In our previous work (Wu et al., 2020), we first proposed to solve the speech separation problem with meta-learning. That setting views utterance mixtures of two different speakers as a meta task, and these speakers have the same accent. However, we hope that a speech separation model can adapt to mixtures with accents never seen before. Thus, besides the setting in which two different speakers form a meta task, we also added a setting in which meta tasks with speakers of the same accent form an accent task set. Sections 4 and 5.1 describe the dataset and task construction procedure in more detail.
* The two first authors made equal contributions.
Our contributions are listed below:
• To the best of our knowledge, we are the first to conduct speech separation experiments on a multi-accent dataset.
• We applied meta-learning to improve multi-accent speech separation.
The remaining sections of this paper are organized as follows. In Section 2, we give a brief overview of existing works related to speech separation and meta-learning. In Section 3, we elaborate the problem formulation of speech separation in detail. In Section 4, we describe the two phases of MAML, namely the meta training phase and the meta testing phase. Additionally, we show how FOMAML is modified from MAML. The experimental setup, dataset, and model we used are presented in Section 5. Finally, results and conclusions are given in Sections 6 and 7.
Figure 1: Illustration of joint training and meta-learning for multi-accent speech separation. The oval areas are the accent task sets. Each accent task set contains multiple meta tasks. The solid lines are the pretraining process, with joint training on the left and meta-learning on the right. The dashed lines represent the adaptation paths from parameters θ to the unseen accents of unseen speakers. This figure is modified from Gu et al. (2018) and our previous work Wu et al. (2020).

Related Work
Speech Separation End-to-end separation models have shown great success in separating speech mixtures of the WSJ0-2mix dataset designed by Hershey et al. (2016), which is generated from the WSJ0 corpus (Paul and Baker, 1992). Luo and Mesgarani (2018) proposed a time-domain audio separation network (TasNet) that takes waveforms as input, relieving the separation model from dealing with time-frequency representations. They further proposed convolutional TasNet (Luo and Mesgarani, 2019), which substitutes the LSTM layers in TasNet with convolutional layers. This overcame the problem of long temporal dependencies of LSTM and reduced the model size. Before long, they came up with the Dual-path RNN model, which used intra- and inter-blocks to capture local and global information dependencies within the speech mixtures. Nachmani et al. (2020) utilized the idea of Dual-path RNN and added a speaker identity loss to improve performance on separating mixtures with an unknown number of speakers. Tzinis et al. (2020) proposed a separator constructed with U-ConvBlocks, which reduces the number of layers while maintaining high performance and requires less computational resources and time. This makes the model more suitable for real-time speech separation. Zeghidour and Grangier (2020) integrated speaker identity information into the separation process and obtained state-of-the-art performance.
Meta-learning Meta-learning has recently become a trend for solving multi-task problems. This training method has been widely applied in the computer vision field, for instance, (Vinyals et al., 2016; Rusu et al., 2018; Sun et al., 2019). Meta-learning is also used in the natural language processing field. Gu et al. (2018) used MAML (Finn et al., 2017) for low-resource neural machine translation (NMT). Moreover, in the speech processing domain, some speech-related problems are solved with meta-learning, too. Winata et al. (2020) applied meta-transfer learning to code-switched speech recognition. Xiao et al. (2020) applied meta-learning to solve the multilingual low-resource speech recognition problem. Winata et al. (2019) also used MAML to adapt models to unseen accents in speech recognition. Indurthi et al. (2019) adopted meta-learning algorithms to perform speech translation on speech-transcript paired low-resource data. Chen et al. (2021) proposed improvements to meta-learning for the speaker verification task.

Speech Separation
In this work, we perform single-channel speech separation. A mixture is given by
$$x = \sum_{c=1}^{C} s_c, \tag{1}$$
where $C$ is the number of speakers in the mixture $x \in \mathbb{R}^{T}$ and $s_c \in \mathbb{R}^{T}$ are the ground truth sources. The goal of speech separation is to estimate $C$ sources $\{\hat{s}_1, \cdots, \hat{s}_C\} \in \mathbb{R}^{T}$ such that the estimated sources are as similar as possible to the ground truth sources. The model we used in this work is Conv-TasNet (Luo and Mesgarani, 2019). In their work, the similarity between an estimated source $\hat{s}$ and its ground truth source $s$ is measured by the scale-invariant signal-to-noise ratio (Si-SNR) shown in Eq. (4):
$$s_{proj} = \frac{\langle \hat{s}, s \rangle s}{\|s\|^2}, \tag{2}$$
$$error = \hat{s} - s_{proj}, \tag{3}$$
$$\text{Si-SNR} = 10 \log_{10} \frac{\|s_{proj}\|^2}{\|error\|^2}. \tag{4}$$
The Conv-TasNet model is a mask-based model which consists of an encoder, a separator, and a decoder. The encoder encodes the mixture $x$ to a latent space as shown in Eq. (5):
$$x_{enc} = \mathrm{Encoder}(x), \tag{5}$$
where $x_{enc} \in \mathbb{R}^{H \times T}$ is the encoder output, $H$ is the dimension of the latent space, and $T$ is the length of $x_{enc}$. The separator then calculates $C$ masks $m_i \in \mathbb{R}^{H \times T}$, $i \in \{1, \cdots, C\}$, based on $x_{enc}$ as shown in Eq. (6):
$$\{m_1, \cdots, m_C\} = \mathrm{Separator}(x_{enc}). \tag{6}$$
The masks are then multiplied with the encoder output, forming separated features $d_i$ as shown in Eq. (7), where $\odot$ is the element-wise multiplication:
$$d_i = m_i \odot x_{enc}. \tag{7}$$
The separated features $d_i$ can be viewed as source representations, and are further input to a decoder to estimate the separated sources as shown in Eq. (8):
$$\hat{s}_i = \mathrm{Decoder}(d_i). \tag{8}$$
At this point, before measuring the estimated sources with Si-SNR, there is a label permutation problem: an alignment between {ŝ 1 , · · · ,ŝ C } and {s 1 , · · · , s C } needs to be decided. We used the utterance-level permutation invariant training (uPIT) method described in (Kolbaek et al., 2017) to solve this problem.
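As an illustration of how the Si-SNR measure and the permutation search fit together at evaluation time, the following is a minimal NumPy sketch, not the implementation used in this work. The function names `si_snr` and `upit_si_snr`, the zero-mean normalization, and the small constant `eps` are assumptions made for this example.

```python
import itertools
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length."""
    # Assumption: both signals are made zero-mean before the projection.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target, then measure the residual.
    s_proj = np.dot(estimate, target) * target / (np.dot(target, target) + eps)
    error = estimate - s_proj
    return 10.0 * np.log10((np.dot(s_proj, s_proj) + eps) / (np.dot(error, error) + eps))

def upit_si_snr(estimates, targets):
    """Best mean Si-SNR over all speaker permutations.

    estimates, targets: arrays of shape (C, T).
    Returns (best_score, best_perm), where best_perm[i] is the index of the
    target source matched to estimate i.
    """
    num_sources = estimates.shape[0]
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(num_sources)):
        score = np.mean([si_snr(estimates[i], targets[p]) for i, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

The exhaustive search visits C! permutations, which is cheap for the two-speaker mixtures considered here. The improvement Si-SNRi can be obtained under the same assumptions by subtracting the Si-SNR of the unprocessed mixture from that of the estimate.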

MAML
The procedure of MAML (Finn et al., 2017) is stated as follows. We are given a set of multi-accent tasks $\mathcal{T} = \{\mathcal{T}_1, \cdots, \mathcal{T}_K\}$, where $K$ is the number of accents, $\mathcal{T}_k = \{\tau^k_1, \cdots, \tau^k_{tq_k}\}$ is the accent task set containing tasks only with the $k$-th accent, and $tq_k$ denotes the task quantity of the $k$-th accent task set. The set of tasks $\mathcal{T}$ is split into the source task set $\mathcal{T}_{source}$ and the target task set $\mathcal{T}_{target}$. The model, denoted as $f$, is trained on the source task set $\mathcal{T}_{source}$ in the hope that it can quickly adapt to the target task set $\mathcal{T}_{target}$.

Meta Training Phase
During the meta training phase, the MAML algorithm aims to find initialized parameters $\theta$ that can be quickly adapted to new tasks. Moreover, these initialized parameters should be sensitive to the difference between two different tasks, such that adapting them can significantly improve the performance on new tasks sampled from the source task set $\mathcal{T}_{source}$. This is achieved by the inner loop and outer loop optimization. A batch of tasks $\tau_{source} = \{\tau_1, \cdots, \tau_b\}$ is sampled from $\mathcal{T}_{source}$ with probability proportional to the task quantity of every accent task set, e.g., for an accent task set $\mathcal{T}_k$, the larger $tq_k$ is, the more likely a task is to be sampled from it. Each task $\tau_j$ in $\tau_{source}$ is further split into a support set $\tau_j^{sup}$ and a query set $\tau_j^{qry}$. The support set is used to adapt the model parameters by performing a one-step gradient descent, which is known as the inner loop, shown in Eq. (9):
$$\theta_j = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\tau_j^{sup}}(f_{\theta}), \tag{9}$$
where $\alpha$ is the learning rate. The goal of the inner loop is to minimize the loss of $\tau_j^{sup}$ with respect to $f_{\theta}$. More concisely,
$$\min_{\theta} \mathcal{L}_{\tau_j^{sup}}(f_{\theta}). \tag{10}$$
At this point, the sum of the query losses of the query sets in $\tau_{source}$ is calculated by
$$\mathcal{L}_{meta} = \sum_{j=1}^{b} \mathcal{L}_{\tau_j^{qry}}(f_{\theta_j}). \tag{11}$$
The goal of the meta training phase is to minimize the total loss of the query sets. This is also performed by a one-step gradient descent, known as the outer loop, shown in Eq. (12):
$$\theta \leftarrow \theta - \eta \nabla_{\theta} \sum_{j=1}^{b} \mathcal{L}_{\tau_j^{qry}}(f_{\theta_j}), \tag{12}$$
where $\eta$ denotes the outer-loop learning rate.
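The task sampling used to form a batch during meta training can be sketched as follows. This is an illustrative sketch rather than the code used in the experiments: the function name `sample_task_batch`, the dictionary layout, and sampling with replacement are assumptions made for the example.

```python
import random

def sample_task_batch(accent_task_sets, batch_size, rng=random):
    """Sample a batch of meta tasks, choosing accents in proportion to tq_k.

    accent_task_sets: dict mapping an accent name to its list of meta tasks.
    rng: the `random` module or a `random.Random` instance.
    Returns a list of (accent, task) pairs.
    """
    accents = list(accent_task_sets)
    # The weight of accent k is its task quantity tq_k.
    weights = [len(accent_task_sets[accent]) for accent in accents]
    batch = []
    for _ in range(batch_size):
        # Assumption: tasks are drawn independently, with replacement.
        accent = rng.choices(accents, weights=weights, k=1)[0]
        task = rng.choice(accent_task_sets[accent])
        batch.append((accent, task))
    return batch
```

Choosing an accent with weight tq_k and then a task uniformly within that accent is equivalent to drawing uniformly over all source tasks, so accents with more speaker pairs contribute more tasks to each batch.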

Meta Testing Phase
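During the meta testing phase, we perform a procedure similar to the inner loop in the meta training phase, shown in Eq. (13):
$$\theta_j = \theta - \beta \nabla_{\theta} \mathcal{L}_{\tau_j^{sup}}(f_{\theta}), \quad \tau_j \in \tau_{target}, \tag{13}$$
where $\beta$ is the fine-tuning learning rate. This procedure adapts the parameters $\theta$ obtained in the meta training phase to the target tasks $\tau_{target} = \{\tau_1, \cdots, \tau_b\}$.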

First-order MAML (FOMAML)
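Eq. (14) is the calculation of the gradient in the outer loop, where $\mathcal{L}_{\tau_j^{qry}}$ is denoted as $\mathcal{L}_j$ for simplicity:
$$\frac{\partial \mathcal{L}_j(f_{\theta_j})}{\partial \theta^{d}} = \sum_{i=1}^{D} \frac{\partial \mathcal{L}_j(f_{\theta_j})}{\partial \theta_j^{i}} \frac{\partial \theta_j^{i}}{\partial \theta^{d}}. \tag{14}$$
When performing the outer loop during the meta training phase, a high computational cost is needed to calculate the second-order derivatives with backpropagation. Eq. (15) is the first-order approximation, where $\theta$ is a $D$-dimensional parameter, $\theta^{d}$ is the $d$-th dimension of $\theta$, and $\theta_j^{i}$ is the $i$-th dimension of $\theta_j$. Since the inner loop gives $\theta_j^{i} = \theta^{i} - \alpha \, \partial \mathcal{L}_{\tau_j^{sup}}(f_{\theta}) / \partial \theta^{i}$, we have
$$\frac{\partial \theta_j^{i}}{\partial \theta^{d}} = \delta_{id} - \alpha \frac{\partial^{2} \mathcal{L}_{\tau_j^{sup}}(f_{\theta})}{\partial \theta^{i} \partial \theta^{d}} \approx \delta_{id}, \tag{15}$$
where $\delta_{id}$ equals 1 if $i = d$ and 0 otherwise. The difference between FOMAML and MAML is that this approximation is used instead of the second-order derivatives, so the outer-loop gradient reduces to $\partial \mathcal{L}_j(f_{\theta_j}) / \partial \theta_j^{d}$. Thus, compared to MAML, FOMAML avoids computing the second-order term and yields a faster gradient calculation.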
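The difference between the two outer-loop gradients can be illustrated on a toy problem. In the sketch below, a linear regression loss stands in for the separation loss, because its second derivative is available in closed form. This toy model and all names (`loss`, `grad`, `meta_gradients`, `make_task`) are assumptions made for the example and are not the Conv-TasNet setup used in this work.

```python
import numpy as np

def loss(theta, X, y):
    """Halved mean squared error of a linear model (stand-in for the task loss)."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def grad(theta, X, y):
    """Gradient of `loss` with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)

def meta_gradients(theta, support, query, alpha):
    """Return (MAML gradient, FOMAML gradient) of the query loss for one task."""
    (Xs, ys), (Xq, yq) = support, query
    theta_j = theta - alpha * grad(theta, Xs, ys)         # inner loop, one step
    g_query = grad(theta_j, Xq, yq)                       # dL_j / d theta_j
    hessian_sup = Xs.T @ Xs / len(ys)                     # second derivative of the support loss
    jacobian = np.eye(len(theta)) - alpha * hessian_sup   # d theta_j / d theta
    maml_grad = jacobian.T @ g_query                      # full chain rule
    fomaml_grad = g_query                                 # Jacobian approximated by the identity
    return maml_grad, fomaml_grad

def make_task(rng, dim=4, n=8):
    """Toy task: support and query sets drawn from the same random linear map."""
    w_true = rng.standard_normal(dim)
    Xs = rng.standard_normal((n, dim))
    Xq = rng.standard_normal((n, dim))
    return (Xs, Xs @ w_true), (Xq, Xq @ w_true)

rng = np.random.default_rng(0)
theta = np.zeros(4)
support, query = make_task(rng)
maml_grad, fomaml_grad = meta_gradients(theta, support, query, alpha=0.01)
```

In a deep network the Hessian is never formed explicitly. MAML instead backpropagates through the inner update, which is where the additional cost comes from, while FOMAML only needs the query gradient at the adapted parameters.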

Dataset
The multi-accent speech utterances are collected from the speech accent archive (Weinberger, 2014). This archive currently has more than 200 kinds of accents and 2939 samples. Each native or non-native speaker speaks the same English paragraph. We selected 123 accents that contain more than one speaker, since we need utterances of two different speakers to generate mixtures. We split these accents into three sets: 85 accents for generating the training tasks and 19 accents each for generating the development and testing tasks. The utterance of each speaker is split into segments with a duration of 4 seconds. For each accent, we construct meta tasks by following the task construction method described in (Wu et al., 2020). We select at most 12 speakers for each accent and generate speech mixtures for each pair of speakers with the same accent. Thus, there will be at most $\binom{12}{2} = 66$ meta tasks and at least $\binom{2}{2} = 1$ meta task for each accent. In each meta task, 3 utterance segments are selected from each speaker, mixed at an SNR level randomly selected between 0 and 5 dB, and resampled to an 8 kHz sample rate. This results in 3 × 3 = 9 speech mixtures in one meta task. Fig. (2) illustrates the support set and query set of a meta task. Finally, for the training, development, and testing sets, 22.4, 3.8, and 3.9 hours of speech mixtures are generated, respectively.
Figure 2: Illustration of a meta task. For two different speakers with the same accent, we sample 3 utterance segments from each speaker to form a meta task. Thus, there will be 9 mixtures. However, during training, we only sample one mixture to form the support set since our setting is one shot learning. The other 4 mixtures that do not contain the utterance segments in the support set are selected to form the query set.
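The counting behind this task construction can be checked with a short sketch. It works on indices only and does not touch audio. The function names are hypothetical, and taking the first 12 speakers is an assumption, since how the speakers are chosen is not specified here.

```python
import itertools
import math

def build_meta_tasks(speakers, max_speakers=12):
    """One meta task per unordered pair of speakers sharing an accent."""
    # Assumption: if an accent has more than max_speakers, keep the first ones.
    return list(itertools.combinations(speakers[:max_speakers], 2))

def build_mixtures(num_segments=3):
    """All (segment of speaker A, segment of speaker B) pairs in one meta task."""
    return [(i, j) for i in range(num_segments) for j in range(num_segments)]

def split_support_query(mixtures, support):
    """One-shot split: the query set keeps only the mixtures that share no
    utterance segment with the single support mixture."""
    seg_a, seg_b = support
    query = [(i, j) for (i, j) in mixtures if i != seg_a and j != seg_b]
    return [support], query

tasks = build_meta_tasks([f"spk{n}" for n in range(12)])
mixtures = build_mixtures()
support_set, query_set = split_support_query(mixtures, support=mixtures[0])
assert len(tasks) == math.comb(12, 2)                                # 66 meta tasks
assert (len(mixtures), len(support_set), len(query_set)) == (9, 1, 4)
```

With 3 segments per speaker, removing the row and column of the support mixture from the 3 × 3 grid leaves the 2 × 2 = 4 query mixtures described in Figure 2.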

Model
The model we used is Conv-TasNet (Luo and Mesgarani, 2019). It consists of an encoder, a separator, and a decoder. The encoder is a 1-D convolution, which transforms the input mixture into a representation. The separator then calculates two masks based on the encoder output. More specifically, it consists of R stacks of temporal convolutional networks (TCN). Each TCN stack consists of M 1-D convolutional blocks with exponentially increasing dilation. Each of these M blocks has a residual connection and a skip connection. The residual connection is the input of the next block, and the skip connections of all blocks are summed together and passed through a parametric ReLU, a linear projection, and a sigmoid function to produce two masks. The two masks are each multiplied with the representation output from the encoder and further input into the decoder to generate two separate waveforms of the two speakers. The decoder is also a 1-D convolution. The configuration that we used is the one that obtained the best performance reported in (Luo and Mesgarani, 2019).
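To give a feel for what exponentially increasing dilation buys, the sketch below computes the dilation schedule of one TCN stack and the resulting receptive field of the separator, measured in encoder frames. The base-2 schedule, the kernel size of 3, and the example values R = 3 and M = 8 are assumptions made for illustration. The function names are hypothetical.

```python
def tcn_dilations(num_blocks):
    """Dilation factor of each block in one TCN stack: 1, 2, 4, ..., 2^(M-1)."""
    return [2 ** i for i in range(num_blocks)]

def receptive_field(num_stacks, num_blocks, kernel_size):
    """Receptive field (in encoder frames) of R stacks of M dilated blocks."""
    # Each dilated convolution widens the receptive field by (kernel_size - 1) * dilation.
    per_stack = sum((kernel_size - 1) * d for d in tcn_dilations(num_blocks))
    return 1 + num_stacks * per_stack

# Assumed example configuration: R = 3 stacks, M = 8 blocks, kernel size 3.
frames = receptive_field(num_stacks=3, num_blocks=8, kernel_size=3)
assert frames == 1531
```

The receptive field grows exponentially with M but only linearly with R, which is why a small number of stacked blocks can cover long temporal contexts without recurrent layers.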

Joint Training and Transfer Learning
Many other works, such as (Tong et al., 2017), try to solve the domain mismatch problem, where the source domain and target domain datasets do not have a similar distribution. Joint training refers to pretraining a model on data from different source domains together. Transfer learning refers to adapting the pretrained model with a portion of the target domain data and testing the adapted model on the target domain data. The most common adaptation method is fine-tuning. Moreover, the domain mismatch scenario also involves a low-resource problem if the target domain has much less data than the source domain. Several works have tried to solve this problem, such as (Chen and Mak, 2015; Zoph et al., 2016). Our jointly trained model is also based on this low-resource scenario.

MAML and FOMAML
To deal with the domain mismatch and low-resource problems, we applied MAML as our training method in the hope of performing better than joint training. We set the size of the support set in each task to 1, meaning that the model must be able to adapt to a new task after seeing only one speech mixture of two new speakers with an accent never seen before. We also trained our model with FOMAML to find out whether calculating gradients with the first-order approximation still obtains relatively good performance compared to training with MAML.

Experiment Settings
For both the joint training and MAML methods, we trained the model from randomly initialized parameters for 100 epochs using the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.00001. For the MAML methods, during the meta training phase, we set α = 0.01. For joint training, we also fine-tuned the model parameters with the method in Eq. (13). We tested the fine-tuning learning rate β on the testing set, report the results in Section 6, and used the learning rates that obtained the best performance for joint training as our baseline. However, for the models trained with MAML methods, the fine-tuning learning rate β is fixed at 0.01 since other values lead to significant performance degradation.

Joint Training
For joint training, we tested the fine-tuning learning rate β on the testing set as shown in Fig. (3), and found that β = 5e−4 obtained the best performance on the clean testing set, while β = 1e−3 obtained the best performance on the testing set with noise. We use these two experiment settings as our baseline.
[Figure: accent labels on the axis: hausa, lithuanian, bari, quechua, yiddish, kurdish, synthesized, tamil, russian, thai, italian, ewe, mende, malay, basque, albanian, ga, estonian, rotuman.]
... on all accents when there is no noise involved and performs better on most of the accents when there is noise in the mixtures.

MAML and FOMAML
Comparing models (d) and (f), we found that the two training methods perform similarly. Model (d) performs slightly better than model (f) when the mixtures in the testing tasks are clean, and slightly worse when the testing tasks contain noise. However, MAML requires more than 10 times the training time of FOMAML, indicating that the first-order approximation has an advantage over calculating the second-order derivatives: it saves a lot of time while still obtaining similar performance. Moreover, FOMAML without fine-tuning (model (c)) performs similarly to the baseline model, whereas the initialized parameters obtained by MAML (model (e)) are not able to perform speech separation.

Conclusion
Our results show that the MAML and FOMAML training methods are effective for multi-accent speech separation. More specifically, these two methods are better than joint training when adapting to new speakers with new accents, even in noisy environments. In addition, FOMAML is shown to be sufficient for the multi-accent speech separation task and can greatly reduce training time. Although FOMAML outperforms joint training on the testing set, Fig. (4) shows that the performance varies considerably across accent task sets. This is probably due to the task-difficulty imbalance issue described in (Xiao et al., 2020): some speakers with particular accents may be hard to separate. Thus, in the future, we will try to solve this problem with the meta sampling methods mentioned in (Xiao et al., 2020).