Learning Robust Latent Representations for Controllable Speech Synthesis

State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on either limited or noisy datasets. Further, different latent variables start encoding the same features, limiting the control and expressiveness during speech synthesis. To resolve these issues, we propose RTI-VAE (Reordered Transformer with Information reduction VAE), where we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that RTI-VAE reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE and by at least 7% over vanilla Transformer-VAE.


Introduction
Learning disentangled latent representations in speech is an active area of research (Hsu et al., 2017; Chou et al., 2018; Park et al., 2020) with applications in controlling the style (for example, pitch, pause duration, and accent) of synthesized speech. Recurrent architectures like Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) in Variational Autoencoders (VAEs) have been state-of-the-art in discovering disentangled latent representations in speech as well as in sequential data more generally. For example, Li and Mandt (2018) attempt to disentangle global and local features of video/speech in different latent variables. Subsequent work disentangled different dimensions of the latent variables to discover meaningful representations and proposed a speech synthesis model with controllable pitch, pause duration, and speed.
These papers, as well as several others (Chung et al., 2015; Leglaive et al., 2020; Hono et al., 2020; Sun et al., 2020), make one limiting assumption: the availability of hundreds of hours of speech data for training deep learning networks. As we show in our experiments, state-of-the-art VAEs fail to learn a meaningful separation of speaking styles when presented with small datasets. In addition, the different latent variables learned by the VAE are no longer uncorrelated. Both shortcomings lead to poor control of speaking styles during synthesis.
While LSTMs are state-of-the-art for learning latent variables in speech, Transformers have been used for learning latent representations for text completion (Wang and Wan, 2019), and Transformer-based VAEs were used by Jiang et al. (2020) to model independent style attributes in music generation.
Inspired by these limitations of LSTM-based VAEs and the promise of more "attentive" networks, we modify the loss function of state-of-the-art VAEs by explicitly minimizing the mutual information between latent variables, thereby penalizing common learned features between different representations. We then modify the Transformer architecture to learn robust disentangled latent representations of speech from limited and noisy data. We show that our proposed architecture, RTI-VAE (Reordered Transformer with Information reduction VAE), discovers compact, stable latent representations of speaker attributes even on datasets as small as 4 hours of total speech, while the state-of-the-art fails. Our proposed VAE outperforms LSTM- and vanilla Transformer-based VAEs even on a challenging dataset like Common Voice, which has considerable background noise, low recording quality, and a large number of speakers with the same style or accent. To summarize, the main contributions of our work are:
1. Formulate a modified VAE loss function for speech data and a novel Transformer-based VAE for learning uncorrelated latent variables, thereby allowing more precise control over synthesis compared to the existing state-of-the-art.
2. Show that our latent clusters of speaking styles are better separated than existing LSTM and vanilla Transformer based VAEs on noisy and small datasets.
3. Show that our modified Transformer architecture allows faster convergence of the variational lower bound compared to both vanilla Transformer- and LSTM-based VAEs.

Related Work
Multiple previous works have targeted the problem of learning latent representations for sequential data like speech. As discussed, the main advantage of learning such representations is that they allow creating diverse examples during reconstruction by manipulating the encoded latent variable. In Li and Mandt (2018) the authors propose two sets of latents which learn global features, like the contents of the generated sequence, and local dynamic features, such as pitch and speed. However, a limitation of this approach is the lack of interpretability of the learnt dimensions: it is known that the different dimensions of the latent variables are learning some features, but there is little to no visibility into what those actual features are.
Modifying text-to-speech systems by introducing additional encoders has been a standard way to discover meaningful representations. One line of work builds on top of the Tacotron-2 architecture and uses Gaussians to model the latent variables; an improved version uses a hierarchical latent with a mixture of Gaussians. Other work proposes adversarial training to further improve the latent variables and the discovered features by disentangling the background noise and reverberation, along with speaker identity, from the recording conditions.
While all these prior works aim to discover latent representations, there is a lot of room for improving those representations, especially in cases where we have very limited hours of speech data. As we show in our experiments, in the absence of explicit restrictions on the training objective these VAEs easily collapse when presented with smaller datasets. Thus we focus on improving the representations, specifically latent clusters of speaker attributes, in cases of extremely limited datasets. Our contributions, however, are not limited to smaller datasets; we see similar improved performance on larger and noisy datasets too.

Figure 1: Graphical model of the controllable TTS system. Note that q(y_l | X) in the encoder can be approximated in terms of q(z_l | X), in which case node y_l has an edge from z_l instead of X, as in prior work.

Background
Controllable text-to-speech (TTS) VAE-based systems take an input text sequence Y_t and an optional observed categorical label y_o (e.g., speaker identity or accent) as input and learn to synthesize a sequence, usually mel-spectrogram frames X, as output. Additional latent variables z_o and z_l can be introduced to discover meaningful representations during this process. Here z_o is a continuous latent learnt on top of the shown labels y_o, hence z_o captures the variation in features correlated with the speaker attribute y_o. z_l is a completely unsupervised continuous variable learnt on top of standard Expectation-Maximization style latent mixture components y_l. This graphical model is depicted in Figure 1. The objective function for learning such a model, i.e. synthesizing the sequence X given Y_t and y_o, can be formulated as the variational lower bound

    L = L_mel + β L_KL    (1)

with β balancing the relative weighting between the latent channels and reconstruction accuracy. Here L_mel is the mel loss, which controls the quality of the mel-spectrograms produced, and L_KL is the total KL loss controlling the features learnt in the latent variables. This VAE can be used in the Tacotron-2 architecture as shown in Figure 2(a) to learn the text to mel-spectrogram mapping, with the latent features controlled by L_KL.
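For concreteness, a minimal sketch of how an L_KL-style term is typically computed for one Gaussian latent channel; this assumes a diagonal-Gaussian posterior against a standard-normal prior, which this excerpt does not spell out, so take it as an illustration rather than the paper's exact formulation:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent
    # dimensions -- the usual closed form behind a Gaussian-VAE KL term
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
```

With a posterior that exactly matches the prior (mu = 0, log_var = 0) the term vanishes; β then scales how strongly the encoder is pulled toward the prior relative to reconstruction accuracy.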

Methodology
We now describe the two main components of our proposed RTI-VAE architecture: 1) minimizing mutual information and 2) layer reordering.

Minimizing Mutual Information
The latent z_l in Figure 1 is unsupervised while the latent z_o learns features correlated with the shown label y_o. Our experiments showed that both z_l and z_o can end up encoding the same set of features, which leads to poor control when synthesizing speech. An intuition into why this happens lies in the fact that z_l is an unsupervised variable and can discover any feature hidden in the input speech sequence. There is no term in the loss function (1) which prevents the features of z_l from being correlated with the observed labels y_o (Klys et al., 2018). This can be resolved by minimizing the mutual information I between the latents z_o (equivalently y_o) and z_l. We can formulate this as

    min I(y_o; z_l) = min E_{X, z_l ~ p(z_l | X)} [ log p(y_o | z_l) − log p(y_o) ]    (2)

Since the integral over z_l is intractable, we replace p(z_l | X) with the approximate posterior q(z_l | X). Further, since the true distribution p(y_o | z_l) is unknown, we approximate it by introducing a new network q_ψ(y_o | z_l), leading to

    min I(y_o; z_l) ≈ min (1/N) Σ_{n=1}^{N} Σ_{a=1}^{A} q_ψ(y_o = a | z_l^(n)) log q_ψ(y_o = a | z_l^(n)),  z_l^(n) ~ q(z_l | X^(n)), X^(n) ~ D(X)

where A is the total number of unique classes of y_o, N is the number of samples used for Monte Carlo estimates, and D(X) is the underlying distribution of the input points X. Our proposed encoder is depicted in Figure 2(b). Since we are using q_ψ to make predictions for y_o, this network needs to be learnt itself. Hence we subtract an additional log q_ψ(y_o | z_l) term from the loss function. With N = 1 our proposed term is

    L_MI = Σ_{a=1}^{A} q_ψ(y_o = a | z_l) log q_ψ(y_o = a | z_l) − log q_ψ(y_o | z_l)    (3)

Combining equations (1) and (3), the total loss function in our proposed model is

    L = L_mel + β L_KL + γ L_MI    (4)

To summarize, L_mel controls the quality of the mel-spectrogram produced during decoding, L_KL controls the features learnt in the latent variables z_l, z_o, and L_MI makes sure that z_l and z_o encode different features. We refer to L_mel as the reconstruction or mel loss, L_KL as the KL loss, and L_cond = β L_KL + γ L_MI as the conditional loss throughout this paper.
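As a sketch (not the authors' implementation), the N = 1 Monte Carlo estimate of the L_MI term can be computed directly from the classifier probabilities q_ψ(· | z_l); the function name and array layout below are assumptions:

```python
import numpy as np

def mi_loss(q_probs, y_o):
    """Hypothetical N = 1 sketch of the mutual-information penalty.

    q_probs[a] = q_psi(y_o = a | z_l) for one sample z_l ~ q(z_l | X),
    and y_o is the true class index. The first term (a negative entropy)
    pushes q_psi(. | z_l) toward uniform, so z_l carries no information
    about y_o; subtracting log q_psi(y_o | z_l) is the classification
    term that trains q_psi itself.
    """
    neg_entropy = float(np.sum(q_probs * np.log(q_probs)))
    return neg_entropy - float(np.log(q_probs[y_o]))
```

At convergence the paper reports q_ψ(y_o | z_l) = 0.5 for both classes; at that uniform point the two terms cancel and this sketch evaluates to zero.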
Figure 2: Left: [...] z_l, z_o from this distribution are sampled and concatenated with the text encoding to conditionally learn the text to mel-spectrogram mapping. Center: the proposed encoder with the network q_ψ; the generator stays the same as in Figure 1. Right: the original and the proposed Transformers replace the LSTMs shown in the VAE of the Tacotron-2 architecture.

Layer Reordering in Transformer
Introducing the above loss helps disentangle the learning of z_o and z_l, but another problem remains. Our experiments on the MAILABS and Common Voice data, discussed in section 5.3, indicated that the clusters of z_o corresponding to different shown labels y_o start sharing regions of the latent space. Hence, for a given label y_o, the sampled z_o ∼ p(z_o|y_o) may or may not belong to the style which y_o denotes. This leads to speech samples where the style correlated with the shown attribute y_o is not under control while sampling from the priors.
We tackle this problem by replacing the LSTMs with Transformers. We expected that the ability of Transformers to attend, with higher weight, to the specific frames of the input speech sequence where features are localized or more densely expressed should considerably reduce the dataset volume required for convergence. Hence the lower bound on the dataset size needed to model non-overlapping clusters of z_o should be smaller, while still keeping the sampled style under control. This should also accelerate the separation between latent clusters on larger datasets. Our experiments with vanilla Transformer-based VAEs confirm these predictions.
We next drew inspiration from Parisotto et al. (2019) and modified the Transformer encoder. This was an attempt at changing the learning paradigm: instead of directly learning to translate Y_t to X in different y_o styles, we first learn to synthesize a general representation for all X, and then learn the specific deviations of each style y_o from this general representation. For example, instead of directly learning to speak in different accents, we first learn to speak, and then learn the subtleties of the different accents. Our hypothesis was that learning the different y_o styles should be much faster if a common understanding of all X in the dataset is gained first; the accent-specific (or, per y_o, style-specific) speech frames X should be only a slight deviation from this common representation.
Our proposed architecture is shown in Figure 2(c), where we switch the order of LayerNorm, forming a direct connection between the input and the output. Due to this layer reordering, if we make sure that the MHA, LayerNorm and FeedForward modules are all initialized with their expectation near 0, a direct path is formed early in training, allowing a general representation of speech to be learnt independent of the shown labels y_o. As training progresses and these modules warm up, the accent- or y_o-specific features are learnt by conditioning the encoder.
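The effect of the reordering can be sketched in a few lines: with LayerNorm moved onto the sublayer branch and the sublayer's output added back to the residual stream, a near-zero-initialized sublayer leaves the input almost untouched, giving the early-training identity path described above. The `near_zero_module` below is a stand-in for a just-initialized MHA/FeedForward module, not the paper's actual initialization:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each row to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def reordered_block(x, sublayer):
    # LayerNorm sits on the sublayer branch, so the residual
    # connection gives a direct path from input to output
    return x + sublayer(layer_norm(x))

# stand-in for an MHA/FeedForward module initialized with expectation near 0
near_zero_module = lambda h: 1e-6 * np.tanh(h)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
y = reordered_block(reordered_block(x, near_zero_module), near_zero_module)
# early in training the stacked blocks act almost as an identity mapping on x
```

As the modules warm up and their outputs grow away from zero, the sublayer branch starts contributing label-conditioned deviations on top of this shared representation.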
We also introduce GRU-type gating (Chung et al., 2014) to stabilize learning by minimizing the maximum gradient norms produced, and apply a small nonlinearity via LeakyRelu at the outputs of the MHA and FeedForward modules to balance the observed trade-off between frequent gradient updates and maximum gradient norm; the specific choice of LeakyRelu is discussed in the Appendix.

Table 1: The length of the synthesized mel-spectrogram and the pause duration increase, while pitch decreases, as the dth dimension of z_l increases from its marginal prior mean in RTI-VAE.

Figure 3: Left: synthesized mel-spectrogram for "What is it, that is worrying you today?" The stack of 3 mel-spectrograms on the right are zoomed areas from frames 20 to 80 of their original mel-spectrograms. It can be seen that the pause duration, denoted by the dark region, increases as the same text is synthesized moving from μ_i − 3σ_i to μ_i + 3σ_i. Center: three mel-spectrograms synthesized for the text "The area has four catholic schools and three church of England schools", corresponding to three random samplings of z_o, z_l from their posteriors. The first synthesis is considerably shorter than the second and third. Notice the different positions of the voids between frames 50 and 100, and at frame 150 in the third spectrogram. Right: mel-spectrograms synthesized for the text "The team has also participated in the opening pitch of the Brooklyn Cyclones". The third spectrogram shows smooth areas in the higher mel channels compared to the first and second. These random latent samplings affect intonation and spectrogram texture.

Experiments
We refer to our proposed VAE with the modifications from sections 4.1 (the L_MI term) and 4.2 as RTI-VAE, the vanilla Transformer with the L_MI term as Transformer-VAE, and the LSTM-based state-of-the-art Tacotron-2 without the L_MI term as LSTM-VAE. We trained each model on two datasets: 1) MAILABS (Solak, 2018, accessed November 11, 2020) with a total of 35 hrs of UK and 39 hrs of US speech in studio quality, recorded by 4 professional speakers, and 2) Common Voice (Ardila et al., 2020) with 4 hrs of UK and 19 hrs of US speech crowd-sourced from 477 volunteers with varying background noise, microphone quality and other recording conditions. The input features X were mel-scale spectrograms; the label y_o was set to 0 for all X belonging to US speech and 1 for all UK speech. The dimensions of z_o and z_l were picked to be 2 and 3 respectively, and K = 3 for all experiments.

Features Learnt
Before we demonstrate our latent cluster improvements over Transformer-VAE and LSTM-VAE, we show that RTI-VAE does learn important latent features in speech. Our experiments (focused on the speaking rate, the fundamental frequency F_0, and the pause duration) are summarized in Table 1. μ_{z_l,d} and σ_{z_l,d} are the mean and standard deviation of the dth dimension of the marginal prior p(z_l) = Σ_k p(z_l | y_l = k) p(y_l = k). All other dimensions of z_l are kept fixed at their own marginal priors while analyzing the dth dimension.
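The marginal prior moments μ_{z_l,d} and σ_{z_l,d} follow from the standard mixture formulas; the component values below are made up for illustration (the paper's actual y_l mixture parameters are learned, not hand-set):

```python
import numpy as np

# hypothetical parameters for the K = 3 mixture components of y_l,
# restricted to a single dimension d of z_l
weights = np.array([0.5, 0.3, 0.2])   # p(y_l = k)
means   = np.array([-1.0, 0.0, 2.0])  # mean of p(z_l | y_l = k) in dim d
stds    = np.array([0.5, 1.0, 0.8])   # std of p(z_l | y_l = k) in dim d

# moments of the marginal p(z_l) = sum_k p(z_l | y_l = k) p(y_l = k)
mu_marginal    = np.sum(weights * means)
var_marginal   = np.sum(weights * (stds**2 + means**2)) - mu_marginal**2
sigma_marginal = np.sqrt(var_marginal)
```

Sweeping dimension d from μ_{z_l,d} − 3σ_{z_l,d} to μ_{z_l,d} + 3σ_{z_l,d}, as in Table 1, then means moving across the bulk of this marginal while holding the other dimensions at their own marginal means.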
To demonstrate control of the speaking rate, we performed 25 different syntheses of the text "We had been wandering, indeed, in the leafless shrubbery an hour in the morning". Table 1 shows that the length of the synthesized mel-spectrogram increases as the value of dimension 0 of z_l increases.
Next, we synthesized 25 texts, with 10 samples per text, to show control of the pause duration and pitch (the fundamental frequency F_0). For the pause duration experiments each text contained at least one comma, and we measured the maximum period of intermediate silence in each synthesis. To calculate F_0 we used the YIN algorithm (Guyot, 2018). Table 1 shows that the pause duration increases and F_0 decreases with increasing values of the 2nd and 1st dimensions of z_l, respectively.
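A stripped-down sketch of YIN's core idea (difference function plus cumulative mean normalization, with a threshold-and-descend search but without YIN's parabolic interpolation) is shown below; it is an illustration, not the cited implementation:

```python
import numpy as np

def estimate_f0(x, sr, fmin=50.0, fmax=500.0, threshold=0.1):
    """Simplified YIN-style F0 estimate for a single frame x."""
    tau_min, tau_max = int(sr / fmax), int(sr / fmin)
    # difference function d(tau)
    d = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff = x[:len(x) - tau] - x[tau:]
        d[tau] = np.dot(diff, diff)
    # cumulative mean normalized difference d'(tau)
    cmnd = np.ones(tau_max + 1)
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.cumsum(d[1:])
    # first dip below the threshold, then descend to its local minimum
    tau = tau_min
    while tau < tau_max and cmnd[tau] >= threshold:
        tau += 1
    while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
        tau += 1
    return sr / tau

sr = 16000
t = np.arange(int(0.1 * sr)) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 220.0 * t), sr)  # a 220 Hz test tone
```

Running the estimator frame by frame over a synthesized utterance and averaging the voiced frames gives the kind of F_0 value reported in Table 1; the silence measurements can similarly be read off frames whose energy falls below a floor.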
Furthermore, sampling the variables z_o, z_l from their respective posterior distributions q(z_o|X), q(z_l|X) in L_mel produces different intonations with different speakers every time we synthesize a given text Y_t. We show concrete examples in Figure 3.

Importance of L_MI
Our experiment on the MAILABS dataset shows that the latent variable z_l starts encoding y_o-specific features in the absence of an explicit L_MI term in the total loss, contrary to the expectation that z_l should not encode any y_o style-specific information; this is shown in Figure 4. A consequence of including L_MI in the loss function (4) can also be seen in the test curve of L_KL: Figure 5 shows that LSTM-VAE w/ MI has a lower value of L_KL. Also note that, as shown in Figure 5, L_mel remains the same in both experiments, hence there is an overall decrease in the total loss. We also observe that the two terms of L_MI in equation (3) are in contention with each other. The first term tries to learn a representation z_l which carries no information about the label y_o, whereas the second term tries to maximize the probability of predicting y_o given z_l. We verify from our experiments that at convergence z_l acts as a completely random input for estimating y_o, with q_ψ(y_o|z_l) = 0.5 for both y_o = 0, 1.

Cluster Quality
As discussed in section 4.2, we want the clusters of p(z_o|y_o = 0) and p(z_o|y_o = 1) to be far from each other with no overlap so that we can control y_o styles during synthesis. Hence we objectively measure cluster quality with the Dunn Index (DI) (Bezdek and Pal, 1995) and the Davies-Bouldin Index (DBI) (Davies and Bouldin, 1979):

    DI = min_{1 ≤ i < j ≤ n} d(i, j) / max_{1 ≤ k ≤ n} d'(k)

    DBI = (1/n) Σ_{i=1}^{n} max_{j ≠ i} (σ_i + σ_j) / d(μ_i, μ_j)

where d(i, j) denotes the distance between clusters i and j, n is the total number of clusters, d'(k) is the maximal intra-cluster distance of cluster k, and μ_i, σ_i, μ_j, σ_j are the means and standard deviations of clusters i and j respectively. Thus DI is the ratio of the minimal inter-cluster distance to the maximal intra-cluster distance, and DBI is the ratio of the spread in each cluster to the distance between their means.
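In code, the two indices can be sketched directly from these definitions; the single-linkage inter-cluster distance for DI and the mean distance to the centroid as the spread σ for DBI are common conventions assumed here, since the excerpt does not pin them down:

```python
import numpy as np

def dunn_index(clusters):
    # minimal inter-cluster distance over maximal intra-cluster distance;
    # higher is better (tight, well-separated clusters)
    inter = min(
        np.min(np.linalg.norm(a[:, None] - b[None, :], axis=-1))
        for i, a in enumerate(clusters)
        for j, b in enumerate(clusters) if i < j
    )
    intra = max(
        np.max(np.linalg.norm(c[:, None] - c[None, :], axis=-1))
        for c in clusters
    )
    return inter / intra

def davies_bouldin_index(clusters):
    # for each cluster, the worst-case ratio of summed spreads to
    # centroid distance; lower is better
    mus = [c.mean(axis=0) for c in clusters]
    spreads = [np.mean(np.linalg.norm(c - mu, axis=-1)) for c, mu in zip(clusters, mus)]
    k = len(clusters)
    return float(np.mean([
        max((spreads[i] + spreads[j]) / np.linalg.norm(mus[i] - mus[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]))
```

Applied to the test-set embeddings ẑ_o grouped by y_o, higher DI and lower DBI correspond to the better-separated clusters reported in Tables 2 and 3.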
In Tables 2 and 3, we compare the test DI and DBI for different dataset sizes between RTI-VAE, Transformer-VAE and LSTM-VAE. RTI-VAE performs consistently better than Transformer-VAE and LSTM-VAE on both the MAILABS and Common Voice datasets. We also observe that as the dataset size decreases, the performance gap between RTI-VAE and LSTM-VAE increases.
In Table 4 we calculate the percentage of overlap between clusters, with test points ẑ_o ∼ p(z_o|y_o = i) marked as overlapping with cluster p(z_o|y_o = j) if they fall within [μ_{p(z_o|y_o=j)} − σ_{p(z_o|y_o=j)}, μ_{p(z_o|y_o=j)} + σ_{p(z_o|y_o=j)}], for i, j = 0, 1. We observe that RTI-VAE consistently decreases the overlap regions by large margins even on challenging datasets like Common Voice, where more than 90% overlap exists for the existing state-of-the-art. As discussed earlier, this better separation provides improved control during synthesis and prevents uncontrolled styles when sampling speech from the priors.
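The overlap test can be sketched as follows; reading the interval criterion as a per-dimension mean ± one-std box is an assumption, since the excerpt does not state how multiple dimensions of z_o are combined:

```python
import numpy as np

def overlap_percentage(z_i, z_j):
    """Percentage of samples from cluster i (rows of z_i) falling inside
    the per-dimension [mu - sigma, mu + sigma] box of cluster j."""
    mu, sigma = z_j.mean(axis=0), z_j.std(axis=0)
    inside = np.all(np.abs(z_i - mu) <= sigma, axis=-1)
    return 100.0 * float(inside.mean())
```

Evaluating this in both directions (i into j's box and j into i's box) gives the cluster-overlap percentages of the kind reported in Table 4.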

Loss Curves
The conditional loss L_cond in equation (4) controls the latent variables being modelled, namely z_l, z_o and y_l. The trend in Figure 6 for the MAILABS dataset shows that RTI-VAE converges faster than both Transformer-VAE and LSTM-VAE. It can also be seen in Figure 6 that L_mel remains the same in all 3 experiments: LSTM-VAE, Transformer-VAE and RTI-VAE. This shows that while RTI-VAE successfully lowers L_cond, it does so without hurting L_mel or the synthesized mel-spectrogram quality.
We also observed that, for a given dataset size, L_cond in LSTM-VAE increases with increasing model depth, which points towards inferior latent features. This trend is summarized in Figure 6 and shows that Transformer-VAE and RTI-VAE do not overfit a given dataset size as layers are added.

Conclusion
In this work we showed that RTI-VAE discovers disentangled latent representations of speech with uncorrelated latent variables, allowing better control of speech synthesis. Our layer reordering in Transformers produces notably improved latent clusters of speaker attributes, keeping the speaker styles under control across dataset sizes and noise conditions. We can generate mel-spectrograms for different texts with controllable pitch, pause duration, speaking speed and accent. We also showed a significant boost both in convergence and in the stability of the learnt representations with our proposed method. Going forward, we would like to explore applications of RTI-VAE beyond speech, e.g., image captioning with sentiments or text-to-image rendering with different emotions.

C Speaking Rate for y_o = 0, 1

D.1 Importance of Gates
Our comparison of gated architectures with non-gated ones in Figure 8 shows that the maximum gradient norm, which directly influences convergence, is much lower and more stable (with lower variance) for RTI-VAE (which includes gates) compared to RTI-VAE without (w/o) gates and Transformer-VAE.

D.2 Choosing the Right Activation
In Figure 8 we see that the distance between the z_o|y_o cluster means is very small when the outputs of the Multi-Head Attention and FeedForward modules are fed to the GRU-type gating layers without any nonlinearity. Hence our choice of nonlinearity was guided by the trade-off between the number of gradient updates and the maximum gradient norm. We see in Table 5 that relu has a high maximum gradient norm ∇_norm, which led to convergence instability and a small distance between the z_o|y_o cluster means. For tanh, almost all activations produce gradient updates, and this frequent updating also led to a small cluster distance, as shown in Figure 8. Hence we needed a function between relu and tanh which has a small gradient norm while also producing fewer gradient updates. LeakyRelu turns out to be the best candidate, with a high distance between means as shown in Figure 8.

Table 5: Percentage of activations in the indicated range and maximum gradient norm for each activation.
Experiment | % activation   | max ∇_norm
relu       | 84.5 (< 0)     | 40.96
tanh       | 0 (> +2, < −2) | 10.68
leakyrelu  | –              | 7.17