Multi-Pair Text Style Transfer for Unbalanced Data via Task-Adaptive Meta-Learning

Text-style transfer aims to convert text in one domain into another by paraphrasing sentences or substituting keywords without altering the content. State-of-the-art methods have, by necessity, evolved to accommodate non-parallel training data, since it is frequently the case that there are multiple data sources of unequal size, with a mixture of labeled and unlabeled sentences. Moreover, the inherent style defined within each source may be distinct. A generic bidirectional (e.g., formal ↔ informal) style transfer model that disregards these groups may not generalize well across applications. In this work, we develop a task-adaptive meta-learning framework that simultaneously performs multi-pair text-style transfer using a single model. The proposed method adaptively balances the differences in meta-knowledge across multiple tasks. Results show that our method yields better quantitative performance as well as coherent style variations, and handles the common challenges of unbalanced data and mismatched domains well.


Introduction
Text-style transfer is a fundamental challenge in natural language processing. Applications include non-native speaker assistants, child education, personalization, and generative design (Fu et al., 2017; Zhou et al., 2017; Yang et al., 2018a; Gatys et al., 2016b,a; Zhu et al., 2017; Li et al., 2017). Figure 1 shows a prominent example of applying style transfer to a hypothetical online shopping platform, where the generated style variations can be used for personalized recommendations. However, compared with other domains, the lack of parallel corpora and quality training data is currently an obstacle for text-style transfer research. For example, assume one supports a multi-tenant service platform with tenant-specific text data, but there is no guarantee that each tenant will provide a sufficient amount of data for model training. Building a multi-task language model that matches the text style of each tenant is more practical and efficient than training individual models. This single-model approach might also have relatively favorable empirical performance.
Existing works on text style transfer have addressed different applications such as sentiment transfer (Shen et al., 2017), word decipherment (Knight et al., 2006), and author imitation (Xu et al., 2012). If parallel training data is available, a wide range of supervised techniques from machine translation (e.g., Seq2Seq models (Bahdanau et al., 2014) and Transformers (Vaswani et al., 2017)) can also be applied to style transfer problems. For non-parallel data, He et al. (2020) proposed a probabilistic formulation that models non-parallel data from two domains as a partially observed parallel corpus, and learns the style transfer model in a completely unsupervised fashion. Unsupervised machine translation methods have also been adapted to this setting (Zhang et al., 2018). Recent research has focused on learning disentangled content and style representations using adversarial training (Yang et al., 2018b; Shen et al., 2017), where models are designed for non-parallel data while preserving content. Lample et al. (2018) argued that the adversarial models are not really doing disentanglement, and proposed a denoising auto-encoding approach instead. Another way to approach this problem is through identifying and substituting style-related sub-sentences (Sudhakar et al., 2019), where the unchanged part guarantees consistency over content. Additionally, state-of-the-art language models (BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), CTRL (Keskar et al., 2019), etc.) and text-to-text models (Raffel et al., 2019) achieve good performance generating text in different styles on multiple tasks (Dathathri et al., 2019; Wolf et al., 2019). Building upon previous work, we aim to bridge real applications while accounting for the aforementioned data problems. Specifically, we wish to design an efficient training method for a style transfer model that (1) quickly learns and adapts to different style domains with limited data; and (2) handles class imbalance and out-of-distribution tasks.
To achieve this, we introduce meta-learning into the style-transfer problem.
Meta-learning (Schmidhuber, 1987) is a method that endows a model with the ability to generalize over a distribution of tasks. We focus on optimization-based meta-learning for our applications. MAML (Finn et al., 2017) learns a common initialization parameter for all tasks, from which each task is adapted using a few gradient steps. This standard MAML approach has been applied to text style transfer problems with low resources (Chen and Zhu, 2020) and achieved better performance in that setting. However, it does not take into account the internal variations between tasks. A similar algorithm called Reptile (Nichol et al., 2018) achieves better performance by maximizing the inner product between gradients of different mini-batches from the same task in its update. Recent works (Qiao et al., 2018; Lee and Choi, 2018) extended the single meta-learner to task-adaptive meta-learning models, which include task-specific parameters that help generalize better between tasks. Bayesian meta-learning is another active area of research: Finn et al. (2018) proposed a probabilistic version of MAML, where the variational inference framework utilizes a task-specific gradient update. More recently, a Bayesian framework has been incorporated into task-adaptive meta-learning; specifically, balancing variables are introduced for task- and class-specific learning, leveraging the uncertainties of these parameters derived from training-data statistics. In this paper, we adapt Bayesian task-adaptive meta-learning (TAML) for our application; an overview is shown in Figure 2.

Balancing Variations between Tasks
A common challenge in the aforementioned real applications is that data from multiple sources may suffer from different problems, such as insufficient training samples, unbalanced class labels, or domain mismatch. Simply ignoring these differences and concatenating all tenants' data for model training will not lead to ideal results.
Meta-learning is one of the most relevant approaches for generalized learning from few samples of different tasks. Assume a task distribution p(τ) that randomly generates a task τ consisting of a training set D^τ = {X^τ, X̃^τ} and a test set D^τ_test.

Figure 2: An overview of our multi-pair style transfer method: assume learning from each tenant's data is a task, and the training data available for each task varies. The style transfer model can adaptively learn tasks using our method, and the resulting model performs style transfer across multiple domains.

Style Transfer Model
The MAML algorithm initializes task-specific parameters θ^τ from a shared initialization θ using a few gradient steps on a small amount of data, so that the optimized parameters can generalize to new tasks. Specifically, we have the loss minimization

min_θ Σ_{τ∼p(τ)} L(θ − α ∇_θ L(θ; D^τ); D^τ_test),   (1)

where α is the step size when learning each task. The initial parameter of each task then becomes θ^τ = θ − α ∇_θ L(θ; D^τ), which has been shown to minimize the test loss L(θ^τ; D^τ_test). The training set D^τ may consist of only a few samples.
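As an illustration of Eq. (1), the following is a minimal sketch of MAML on a toy family of 1-D quadratic losses L_τ(θ) = (θ − c_τ)², where c_τ is the optimum of task τ; the meta-gradient is computed analytically. The toy task family and all names are our own illustration, not the paper's setup.

```python
# Toy MAML (Eq. 1) on 1-D quadratic tasks L_tau(theta) = (theta - c)^2.
# Inner step: theta' = theta - alpha * 2*(theta - c).
# Meta-loss:  L(theta'; c), whose gradient w.r.t. theta is
#             2*(theta' - c)*(1 - 2*alpha) by the chain rule.

def maml_train(task_optima, alpha=0.1, beta=0.05, meta_steps=200):
    theta = 0.0  # shared initialization
    for _ in range(meta_steps):
        meta_grad = 0.0
        for c in task_optima:
            inner_grad = 2.0 * (theta - c)          # gradient on D^tau
            theta_tau = theta - alpha * inner_grad  # task-specific parameters
            meta_grad += 2.0 * (theta_tau - c) * (1.0 - 2.0 * alpha)
        theta -= beta * meta_grad / len(task_optima)
    return theta

theta = maml_train([1.0, 2.0, 3.0])  # converges toward the tasks' mean optimum
```

For symmetric quadratic tasks, the learned initialization settles near the mean of the task optima, from which every task is one cheap gradient step away.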
Eq (1) is effective in numerous applications, yet insufficient for addressing our data problems, as it treats the initialization and learning parameters with equal importance for each task. Inspired by task-adaptive meta-learning, we now introduce three balancing variables z^τ, γ^τ, ω^τ for every task τ.
Let ω^τ = (ω^τ_1, ..., ω^τ_C) ∈ [0, 1]^C be the multipliers of the class-specific gradients, which vary the learning rate for each class. In real applications, we often face style transfer problems with unbalanced training data; for instance, when training formality style transfer models, the number of formal/positive sentences is normally much larger than the number of informal/negative sentences. Also, denote by γ^τ = (γ^τ_1, ..., γ^τ_L) ∈ [0, ∞)^L the multipliers of the original learning rate α, so that the new learning rates become γ^τ_1 α, γ^τ_2 α, ..., γ^τ_L α. Note that the value of γ^τ is task-dependent (e.g., it reflects the sample size of the training data from each task), and is meant to deal with the small-data problem in multi-pair text style transfer. Moreover, since the text data collected from each source or tenant is very hard to align, it is common to have training data with significantly different context. We treat this as an out-of-distribution problem, which can be reflected in the value of the initial parameters: we use z^τ to modulate the initial parameter θ for each task. Specifically, z^τ relocates the initial θ to a task-dependent starting point prior to the learning process. We unify these properties in the learning framework below: θ^τ_0 = θ ∘ z^τ, and for k = 1, ..., K:

θ^τ_k = θ^τ_{k−1} − γ^τ ∘ α Σ_{c=1}^{C} ω^τ_c ∇_θ L(θ^τ_{k−1}; D^τ_c),   (2)

where ω_c and D_c are the class-specific parameters and data, and K is the total number of iterations for updating parameters. We assume C = 2 in the remainder of this paper, since pair-wise style transfer is the primary problem of interest.
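A minimal sketch of the modulated inner update in Eq. (2), with parameters and gradients as plain Python lists; for brevity a single scalar γ is used instead of one multiplier per layer, and the function name and numbers are illustrative, not from the paper.

```python
# One inner step of Eq. (2):
#   theta_k = theta_{k-1} - (gamma * alpha) * sum_c omega[c] * grad_c,
# where omega re-weights the per-class gradients and gamma scales the
# base learning rate alpha for this task.

def taml_inner_step(theta, class_grads, omega, gamma, alpha):
    new_theta = []
    for i, t in enumerate(theta):
        combined = sum(w * g[i] for w, g in zip(omega, class_grads))
        new_theta.append(t - gamma * alpha * combined)
    return new_theta

# Two classes (C = 2), with the majority class down-weighted:
theta = taml_inner_step(
    theta=[1.0, 2.0],
    class_grads=[[0.5, 0.5], [1.0, 1.0]],  # per-class gradients
    omega=[0.4, 0.6],                      # class-balancing multipliers
    gamma=2.0, alpha=0.1,                  # task-scaled step size
)
```

The combined per-coordinate gradient is 0.4·0.5 + 0.6·1.0 = 0.8, so each parameter moves by 2.0 · 0.1 · 0.8 = 0.16.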

Learning the Balancing Variables through Variational Inference
We now discuss how to find the most suitable value of each balancing variable. We employ the variational inference framework from probabilistic MAML (Finn et al., 2018) and TAML to extract the task-specific information; it is used to compute posterior distributions for the balancing variables z^τ, γ^τ, ω^τ. Assume the training set D^τ = {X^τ, X̃^τ} and the test set D^τ_test = {X^τ_test, X̃^τ_test}, and let φ^τ = {ω^τ, γ^τ, z^τ} be the collection of the three balancing variables. The goal of learning for each task τ is to maximize the conditional log-likelihood of the joint dataset D^τ_test and D^τ: log p(X̃^τ_test, X̃^τ | X^τ_test, X^τ; θ). Solving this optimization problem requires the true posterior p(φ^τ | D^τ, D^τ_test), which is intractable. We resort to variational inference with a tractable approximate posterior q(φ^τ | D^τ, D^τ_test; ψ) parameterized by ψ. To make the inference network consistent between meta-training and meta-testing, we drop the dependency on D^τ_test, since the test labels are unknown at meta-test time; hence the approximate posterior becomes q(φ^τ | D^τ; ψ). We now have the approximate lower bound for task-adaptive meta-learning:

log p(X̃^τ_test, X̃^τ | X^τ_test, X^τ; θ) ≥ E_{q(φ^τ | D^τ; ψ)} [ log p(X̃^τ_test | X^τ_test, φ^τ; θ) + log p(X̃^τ | X^τ, φ^τ; θ) ] − KL( q(φ^τ | D^τ; ψ) || p(φ^τ) ).   (3)

Given that the balancing variables are independent, q(φ^τ | D^τ; ψ) can be fully factorized as q(φ^τ | D^τ; ψ) = q(ω^τ | D^τ; ψ) q(γ^τ | D^τ; ψ) q(z^τ | D^τ; ψ). We assume each single dimension of q(φ^τ | D^τ; ψ) follows a univariate Gaussian distribution with trainable mean and variance. Given samples φ^τ_s ∼ q(φ^τ | D^τ; ψ), s = 1, ..., S, we then use a Monte-Carlo approximation of Eq (3) as the new objective:

(1/S) Σ_{s=1}^{S} [ log p(X̃^τ_test | X^τ_test, φ^τ_s; θ) + log p(X̃^τ | X^τ, φ^τ_s; θ) ] − KL( q(φ^τ | D^τ; ψ) || p(φ^τ) ).   (4)

To better model the variational distribution q(φ^τ | D^τ; ψ), an informative representation encoded from the training dataset D^τ is necessary, so that the inference network can capture all useful statistical information in D^τ and recognize its imbalances. We use a two-stage hierarchical set encoder: for a given text-style-transfer task, we first encode each class, and then encode the whole set of classes.
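Since each balancing variable is modeled as a univariate Gaussian, sampling for the Monte-Carlo objective in Eq. (4) can use the standard reparameterization trick. The sketch below is a simplified, illustrative version: the function name and the sigmoid/softplus squashing choices are our own assumptions (the paper only states the ranges ω ∈ [0, 1] and γ ∈ [0, ∞)).

```python
import math
import random

def sample_balancing_variable(mu, log_sigma, squash=None, rng=random):
    """Reparameterized sample phi = mu + sigma * eps with eps ~ N(0, 1),
    optionally squashed into the variable's valid range."""
    eps = rng.gauss(0.0, 1.0)
    phi = mu + math.exp(log_sigma) * eps
    if squash == "sigmoid":   # omega must lie in [0, 1]
        return 1.0 / (1.0 + math.exp(-phi))
    if squash == "softplus":  # gamma must lie in [0, inf)
        return math.log1p(math.exp(phi))
    return phi                # z is unconstrained

omega = sample_balancing_variable(0.0, -2.0, squash="sigmoid")
gamma = sample_balancing_variable(1.0, -2.0, squash="softplus")
```

Because the sample is a deterministic function of (μ, σ) plus noise, gradients of the Monte-Carlo objective can flow back into the inference network's outputs.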
Define the encoder StatisticsPooling(·), which generates the concatenation of class statistics such as the mean, variance, and cardinality. The two-stage encoder first encodes all text sentences of each class into a class representation s_c, and then encodes the representations of the whole set of classes:

s_c = StatisticsPooling( NN_1(X^τ_c) ), c = 1, ..., C,
v^τ = NN_2( StatisticsPooling( {s_1, ..., s_C} ) ),

where X^τ_c is the collection of class-c examples in task τ, and NN_1 and NN_2 are neural networks parameterized by ψ. The summarized feature vector of D^τ can then be used to infer the Gaussian distribution parameters of the balancing variables ω^τ, z^τ, and γ^τ, which are further applied in the meta-learning update. Note that since the balancing variable ω^τ is class-specific, inferring its distributional parameters does not require the second stage of encoding. The overall structure of the inference network is shown in Figure 3.
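A minimal pure-Python sketch of the two-stage set encoder, with NN_1 and NN_2 replaced by identity maps so that only the statistics pooling is shown; all names and the toy inputs are illustrative.

```python
def statistics_pooling(vectors):
    """Concatenate per-dimension mean and (population) variance with the
    set cardinality, so class size and spread are visible to the encoder."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dim)]
    return mean + var + [float(n)]

def encode_task(classes):
    """Stage 1: pool each class; stage 2: pool the set of class summaries.
    (NN_1 / NN_2 would wrap these two calls in the real inference network.)"""
    class_summaries = [statistics_pooling(c) for c in classes]
    return statistics_pooling(class_summaries)

# Unbalanced toy task with C = 2 classes of 2-D sentence embeddings:
summary = encode_task([
    [[1.0, 2.0], [3.0, 4.0]],  # class 0: 2 examples
    [[0.0, 0.0]],              # class 1: 1 example
])
```

Including the cardinality term is what lets the inference network detect class imbalance directly from the pooled statistics.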

Task-Adaptive Style Transfer
We now discuss the formulation of the multi-pair text-style transfer problem using the TAML framework. An overview of our method is shown in Figure 2. We assume training data in each task can be either parallel (tasks 1 and 4) or non-parallel (tasks 2 and 3). The number of training samples in task i is denoted by N_i, which is not necessarily equal across tasks. In addition, the class distribution in non-parallel training data is heavily skewed.
We now formulate our problem as follows. Given a distribution of similar tasks p(τ), each task represents performing text style transfer on a certain dataset D^τ. Defining a generic loss function L and shared parameters θ across tasks, the goal is to jointly learn a task-agnostic model f_θ : (X^τ, S^τ) → Y^τ, where for each τ, S^τ is the corresponding set of style labels of the original text X^τ, and Y^τ is the resulting style-transferred text. Ideally, Y^τ should be consistent with X̃^τ, the corresponding input text in the other style domain, which may or may not be available during model training. When fine-tuning on a new task, the parameters are initialized in a way that accounts for the imperfect nature of the given dataset. Similar to the standard meta-learning approach, the training data of task τ is divided into a support set D^τ_s and a query set D^τ_t, where D^τ_s is used to update each sub-task and D^τ_t is used to evaluate the loss for the subsequent meta-learner updates. A detailed description can be found in Algorithm 1.

Algorithm 1: Multi-Pair Text Style Transfer via TAML
1:  Input: style pair for each task τ, {(s^τ, s̃^τ)}^T_{τ=1}; parameters α, β
2:  Meta-training procedure:
3:  while not done do
4:    for each style pair (s^τ, s̃^τ) do
5:      Train inference network q(φ^τ | D^τ; ψ) by minimizing objective (4)
6:      Obtain balancing variables {z^τ, γ^τ, ω^τ} ∼ q(φ^τ | D^τ; ψ)
7:      Initialize sub-learner with θ^τ_0 = θ ∘ z^τ
8:      for step k in 1, ..., K do
9:        Sample batch data from D^τ_s
10:       Update parameters θ^τ_k for task τ via Eq (2)
11:     end for
12:     Sample batch data from D^τ_t
13:     Evaluate L(θ^τ_K, D^τ_t)
14:   end for
15:   Update meta-learner θ with step size β using the evaluated query-set losses
16: end while

Experiments
We conduct experiments on multiple style-transfer datasets: Shakespeare (Xu et al., 2012), Yelp reviews (Shen et al., 2017), and an internal dataset from a company that contains formal/informal text sentences. Performing style transfer on each of the above datasets defines a unique task. The Shakespeare dataset contains 21k parallel sentences, covering the original text style and Shakespeare's style; the maximum sentence length is 20. The Yelp dataset contains around 252k sentences of positive and negative restaurant reviews, where we use a maximum length of 15 in our experiments. We evaluate our method using state-of-the-art transformers including BERT, GPT-2, and T5, as well as a VAE designed for style transfer by learning disentangled representations. Our baselines include regular model training without distinguishing the differences between tasks, and the MAML method in Eq (1) used to fine-tune the style transfer models on multiple distinct tasks, as proposed by Chen and Zhu (2020). We then employ Algorithm 1 to adaptively fine-tune the style transfer models for each task.
The unbalanced training data is created by sampling from each class at different rates (75% positive class, 25% negative class). We use the pretrained transformers in the Huggingface library (Wolf et al., 2020) as our initial style transfer models. Specifically, we build a two-head model (Figure 4) on top of the decoders, where each head is composed of multiple dense layers. We do not perform end-to-end training of the entire transformer; only the two-head model is trained. The model input is the sentence and style pairs (X^τ, S^τ), while the forward propagation of the transformer's output to each model head depends on the style labels. The resulting output sentences are style-dependent, and one can perform text-style transfer by flipping the style labels during the inference phase. Similarly, we use both the baseline and TAML to train the VAE, obtain disentangled style and content representations, and replace the style embedding during the inference stage to get style-transferred sentences. Note that we focus on improving the fine-tuning of text style transfer models; we do not modify the model structures themselves. In terms of content preservation, the objective function of the VAE model contains a content-oriented loss, while for the other transformer-based models we designed the loss L to be the cross-entropy loss between Y^τ and the parallel reference X̃^τ, or between Y^τ and X^τ in non-parallel situations. For BERT, GPT-2, and T5, we use the built-in vocabulary from the transformers library. The Adam optimizer is used with a learning rate of 5 × 10^−4 to train the model. The batch size is set to 16 and the model is trained for 100 epochs. We build the two-head model using 6 fully connected layers with a hidden size of 256 and ReLU activation. These parameters are chosen empirically for the best performance. For the VAE approach, we use the same parameter settings as in the original work.
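A minimal sketch of the style-dependent routing behind the two-head model in Figure 4, with each head reduced to a single linear map over plain Python lists; the weights and toy features are illustrative assumptions, not the paper's actual configuration of 6 dense layers.

```python
class TwoHeadModel:
    """Route the (frozen) transformer's output features through one of two
    heads selected by the style label; flipping the label at inference time
    performs the style transfer."""

    def __init__(self, weights_head0, weights_head1):
        self.heads = [weights_head0, weights_head1]  # one weight matrix per style

    def forward(self, features, style_label):
        w = self.heads[style_label]  # pick the head by style label
        return [sum(wi * f for wi, f in zip(row, features)) for row in w]

model = TwoHeadModel(
    weights_head0=[[1.0, 0.0], [0.0, 1.0]],  # toy "identity" head
    weights_head1=[[0.0, 1.0], [1.0, 0.0]],  # toy "swap" head
)
out_same = model.forward([0.5, 2.0], style_label=0)  # original style
out_flip = model.forward([0.5, 2.0], style_label=1)  # flipped style
```

Only the heads receive gradient updates during fine-tuning; the shared transformer body stays fixed, which keeps per-task adaptation cheap.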
For NN_1 in the inference network, we use two consecutive blocks, each consisting of a 3 × 3 convolution layer followed by a 2 × 2 max-pooling layer; the output is then fed into one fully connected layer for statistics pooling. We then use two fully connected layers for NN_2. All activation functions are ReLU.
We evaluate competing methods on quality and accuracy of style transfer. The adopted metrics are common choices among recent works.
BLEU: We use the BLEU score (Papineni et al., 2002) to evaluate content preservation; the scores are calculated using SacreBLEU (Post, 2018). When parallel sentences are available, we compute the BLEU score between the style-transferred sentences Y^τ and the ground-truth sentences X̃^τ; otherwise, we use the original sentences X^τ instead.

PPL: We implemented a bigram language model with Kneser-Ney smoothing (Kneser and Ney, 1995) to quantitatively evaluate the fluency of a sentence. The language model is trained on the target-style domain, and we report the PPL of the generated sentences.
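A minimal sketch of the bigram-perplexity evaluation; for brevity it uses add-one smoothing rather than the Kneser-Ney smoothing used in the paper, and the function names and toy corpus are illustrative.

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count bigrams over whitespace-tokenized sentences padded with
    <s> / </s> boundary markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        for h, w in zip(toks, toks[1:]):
            bigrams[(h, w)] += 1
            contexts[h] += 1
    return bigrams, contexts, len(vocab)

def perplexity(sentence, lm):
    """PPL under the bigram model with add-one smoothing."""
    bigrams, contexts, v = lm
    toks = ["<s>"] + sentence.split() + ["</s>"]
    log_prob, n = 0.0, 0
    for h, w in zip(toks, toks[1:]):
        p = (bigrams[(h, w)] + 1) / (contexts[h] + v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

lm = train_bigram_lm(["the cat sat", "the cat ran"])
```

Training the model on target-style text and scoring generated sentences with it gives the reported PPL: lower values indicate generations that are more fluent in the target domain.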
Accuracy: We also train a TextCNN classifier (Rakhlin, 2016) simultaneously while training the style transfer models. The trained classifier is then used to evaluate the classification accuracy of the generated sentences.

Table 1 shows the results for each method. By applying task-adaptive meta-learning to each style-transfer model, performance with respect to every metric is generally improved on the datasets we evaluated. We observe that the VAE method performs better at style transfer, as the other models are not explicitly designed for this goal.

Conclusion
In this paper, we investigated meta-learning approaches for text-style transfer in situations with multiple data sources. Given the distinct contexts and varying amounts of data, we proposed a task-adaptive meta-learning approach to fine-tune style-transfer models. The proposed method introduces three balancing variables with probabilistic distributions, which can be encoded from the training data. These balancing variables are then used to address class- and task-imbalance problems. Empirically, we found that TAML improves the style-transfer performance of multiple models. In the future, we wish to explore generating style variations at more fine-grained levels (for C > 2) with the help of meta-learning.