Transductive Learning for Unsupervised Text Style Transfer

Unsupervised style transfer models are mainly based on an inductive learning approach, which represents the style as embeddings, decoder parameters, or discriminator parameters, and directly applies these general rules to the test cases. However, the lack of parallel corpora limits the ability of these inductive learning methods on this task. As a result, they are likely to produce severely inconsistent style expressions, such as 'the salad is rude'. To tackle this problem, we propose a novel transductive learning approach in this paper, based on a retrieval-based context-aware style representation. Specifically, an attentional encoder-decoder with a retriever framework is utilized. It involves the top-K relevant sentences in the target style in the transfer process. In this way, we can learn a context-aware style embedding to alleviate the above inconsistency problem. In this paper, both sparse (BM25) and dense retrieval functions (MIPS) are used, and two objective functions are designed to facilitate joint learning. Experimental results show that our method outperforms several strong baselines. The proposed transductive learning approach is general and effective for the task of unsupervised style transfer, and we will apply it to the other two typical methods in the future.


Introduction
Text style transfer is an essential topic in natural language generation, and is widely used in many tasks such as sentiment transfer (Hu et al., 2017; Shen et al., 2017), dialogue generation (Niu and Bansal, 2018; Su et al., 2021), and text formalization (Jain et al., 2019). The goal is to change the style of a text while retaining its style-independent content.

[Figure 1: Illustration of the inconsistency problem and the idea of our transductive learning approach. A general inductive rule maps 'The salad is delicious.' to the inconsistent 'The salad is rude.', whereas a specific retrieved case, 'The sandwiches are terrible.', guides the consistent transfer 'The salad is terrible.']

As it is usually hard to obtain large parallel corpora with the same content and different styles, unsupervised text style transfer has become a hot yet challenging research topic in recent years. Most existing methods in this area try to find general style transfer rules with an inductive learning paradigm, where style is represented in a specific form, e.g., embeddings, decoder parameters, or classifier parameters. For example, embedding based methods (Shen et al., 2017; Lample et al., 2019; John et al., 2019; Dai et al., 2019; Yi et al., 2020; He et al., 2020) utilize a highly generalized style embedding to replace the original sentence style and direct the generation process. Decoder based methods (Prabhumoye et al., 2018; Fu et al., 2018; Luo et al., 2019; Gong et al., 2019; Krishna et al., 2020) use multiple decoders for generation, where each decoder corresponds to an independent style. Classifier based methods (Wang et al., 2019; Mai et al., 2020) employ the gradient of a pre-trained style classifier to edit the latent representation of the target text.

It is well accepted that inductive learning methods work well when numerous supervised labels are available. However, in the case of unsupervised style transfer, we are only given corpora of different styles without knowing the parallel relation, i.e., the supervision labels for this task. As a result, inductive learning methods fail to produce an accurate style transfer rule, and tend to generate severely inconsistent texts, such as 'the salad is rude', as shown in Figure 1. The underlying reason for this phenomenon is that an appropriate style expression is usually highly dependent on the context, e.g., 'terrible' for 'the salad' but 'rude' for 'a person'. Without a large amount of parallel data, it is difficult to learn a general style transfer rule that works for various contexts.
Inspired by the idea of transductive learning (Vapnik, 1998) and some successful historical examples, such as the Transductive SVM (Joachims, 1999), we propose to introduce transductive learning to the area of unsupervised text style transfer. Specifically, transductive learning reasons from specific cases to specific cases, which avoids learning a general rule to represent the style. For example, given a reference sentence 'the sandwiches are terrible' with negative emotion, a transductive learning method may connect the two sentences through the two kinds of food, i.e., 'salad' and 'sandwiches', and then use 'terrible' to express the negative emotion for the food 'salad'.
From the above discussion, we can see that there are two challenges in applying transductive learning to unsupervised text style transfer: 1) how to find specific samples that are beneficial for the style transfer of the current text; 2) how to use the style expressions in these samples to complete the style transfer process. To tackle these two challenges, we propose a novel TranSductive Style Transfer (TSST) model. In TSST, a retriever is employed to obtain the required similar samples, which tackles the first challenge. An attention-based encoder-decoder framework is then utilized to combine the specific samples, tackling the second challenge. Specifically, TSST first encodes the original text into a contextual representation and a style-independent embedding. Then either a sparse (BM25) or a dense (MIPS) retrieval function is used to find the top-K samples in the target style corpus, which are encoded by the same encoder. After that, a recurrent decoder generates the transferred text word by word, based on the representations of the retrieved samples, the contextual representation, and the representations of the previous step. To jointly learn the dense retriever, encoder, and decoder, two kinds of objective functions are used in this paper, i.e., a retrieval loss and a bag-of-words loss.
In summary, our contributions are as follows: • Facing the inconsistency problem in unsupervised style transfer, we propose a novel transductive learning approach, which avoids learning a general rule but relies on specific samples to complete the style transfer process.
• We design a TranSductive Style Transfer (TSST) model, which employs a retriever to involve highly related samples to guide the learning of the target style.
• Experiments on two benchmark datasets show that TSST alleviates the inconsistency problem and achieves competitive results against traditional baselines. Our code is available at https://github.com/ xiaofei05/TSST.

Related Work
Previous unsupervised text style transfer methods can be divided into three categories according to the way they control the text style: embedding based methods, decoder based methods, and classifier based methods.
The embedding based methods assign a separate embedding to each style to control the style of the generated text. Early work tries to disentangle the content and style in the text. It first implicitly eliminates the original style information from the text representation using adversarial training (Shen et al., 2017; Fu et al., 2018; John et al., 2019) or explicitly deletes style-related words (Sudhakar et al., 2019; Wu et al., 2019b; Malmi et al., 2020). Then, it decodes or rewrites the style-independent content with the target style embedding. As complete disentanglement is unreachable and damages the fluency of the text, recent approaches (Lample et al., 2019; Dai et al., 2019; Yi et al., 2020; Zhou et al., 2020; He et al., 2020) directly feed the original text representation and a separately learned style embedding to a stronger generator, e.g., an attention-based sequence-to-sequence model or Transformer, to obtain the style-transferred text.
The decoder based methods build a decoder for each style or transfer direction, where the style is implicitly represented by the parameters of the corresponding decoder. The former schema builds an independent decoder for each style: it first disentangles the style-irrelevant content from the text and then applies the corresponding decoder to generate sentences with the target style (Fu et al., 2018; Xu et al., 2018; Prabhumoye et al., 2018; Krishna et al., 2020). The latter builds a decoder for each transfer direction, often regarding style transfer as a translation task (Gong et al., 2019; Luo et al., 2019; Jin et al., 2019; Wu et al., 2019a). This paradigm reduces the complexity of learning the style to a certain extent but consumes more resources. It is worth mentioning that the boundaries between embedding controlled and decoder controlled methods are sometimes not very clear, and many studies (Fu et al., 2018; He et al., 2020) consider them alternatives.
The classifier based methods convert the style by manipulating the latent representation of the text according to a pre-trained classifier. Wang et al. (2019), among others, mapped the input sentence into a latent representation and trained classifiers on this latent space. The latent representation is edited based on the gradients of the classifier until the predicted style changes; after that, the decoder takes the modified representation to generate a sentence with the desired style. Mai et al. (2020) further extended this framework to a plug-and-play setting. Although it achieves remarkable style accuracy, this approach can hardly guarantee content preservation, because the concrete output changes sharply with the latent representation.
We can see that all of these existing methods belong to the inductive learning approach, because they aim to learn a general style transfer rule from the training data and then apply the rule to the test cases. Due to the lack of a parallel corpus for supervision, this inductive learning approach fails to learn an accurate style representation applicable to various contexts, and may cause severe inconsistency problems, as illustrated before.

Transductive Style Transfer
Firstly, we introduce some notations. Consider the unsupervised text style transfer task with M styles; its training set is composed of M single-style subsets, one per style. For an arbitrary input text x in one subset and a target style s_j, the goal of text style transfer is to generate a new sentence y that expresses the style s_j while keeping the style-independent content of x as much as possible.
To tackle the aforementioned inconsistency problem, we propose to utilize transductive learning to obtain a context-aware style representation for the style transfer process. Specifically, our proposed transductive style transfer (TSST) model consists of three modules, an encoder, a retriever, and a decoder, as described in Figure 2.

Encoder
The goal of the encoder is to map the input sentence into hidden representations to facilitate the following retrieval and generation processes. Given the input sentence x = (w_1, w_2, ..., w_n), the output of the encoder is a sequence of hidden states H^enc = [h^enc_1, h^enc_2, ..., h^enc_n]^T ∈ R^{n×d}, where d is the dimension of the hidden states. Please note that the encoder in our model is very general, and different encoding techniques can be used; specifically, we employ a bidirectional LSTM in our experiments.

Retriever
The retriever module is introduced to involve the top-K relevant texts in the target style training subset, to facilitate the transductive learning process. In this paper, we adopt both sparse and dense retrieval functions in the retriever.
Sparse Retriever (BM25) BM25 (Robertson and Zaragoza, 2009) is the best-known sparse retrieval function and has been widely used in information retrieval. The relevance score between the input text x (as the query) and a candidate document d is:

score(x, d) = \sum_{w \in x} \mathrm{IDF}(w) \cdot \frac{f(w, d)\,(k_1 + 1)}{f(w, d) + k_1 \left(1 - b + b \frac{|d|}{avgdl}\right)},

where k_1 and b are hyperparameters, f(w, d) is the term frequency of w in document d, IDF(w) is the inverse document frequency of w, |d| denotes the document length, and avgdl denotes the average document length. The retrieved K texts are then mapped to latent representations U = [u_1, u_2, ..., u_K]^T ∈ R^{K×d} by the above encoder, where u_i is the final hidden state of the i-th retrieved text.
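The scoring function above can be sketched in a few lines. Note the IDF smoothing used below (the common Okapi form) is an assumption on our part; the paper does not spell out which IDF variant it uses.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    """BM25 relevance of one document to a query.

    doc_freq maps a term to the number of documents containing it,
    n_docs is the corpus size, and avgdl is the average document length.
    """
    score = 0.0
    dl = len(doc_terms)
    for w in query_terms:
        f = doc_terms.count(w)                  # term frequency f(w, d)
        if f == 0:
            continue
        df = doc_freq.get(w, 0)
        # Okapi-style smoothed IDF(w) (our assumption).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

docs = [["the", "salad", "is", "terrible"],
        ["the", "ribs", "are", "over", "cooked"]]
df = {}
for d in docs:
    for w in set(d):
        df[w] = df.get(w, 0) + 1
avgdl = sum(len(d) for d in docs) / len(docs)
scores = [bm25_score(["salad", "terrible"], d, df, len(docs), avgdl) for d in docs]
print(scores[0] > scores[1])  # → True: the first document matches the query terms
```

In the retriever, the input sentence plays the role of the query, and every sentence in the target style training subset is a candidate document.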
Dense Retriever (MIPS) A term-based sparse retriever may have difficulty retrieving semantically related contexts, which are essential for style transfer. Recently, dense retrieval methods and their efficient implementation via maximum inner product search (MIPS) (Shrivastava and Li, 2014; Guo et al., 2016) have been proposed to capture such semantics.

[Figure 2: The architecture of the TSST model, consisting of the encoder, retriever, and decoder. Given the input 'the food is delicious.' (style 1, positive), the retriever uses the style-independent embedding to find similar negative samples from the style-2 subset, e.g., u_1: 'the rest of the food was awful.' and u_2: 'for the very first time the food was awful.', which guide the decoder to generate 'the food is awful.']

For a dense retriever, a style-independent text embedding is crucial, because we rely on this embedding to find similar samples in a different style subset. To this end, the style-independent embedding q(x) of the text x is represented as a linear combination of the hidden states h^enc_i:

q(x) = \sum_{i=1}^{n} \alpha(w_i)\, h^{enc}_i,

where the parameter α(w_i) is the weight of each word w_i. The weights are initialized based on the ratio between the count of w_i in the subset of style s_j and its total count in the whole dataset. This initialization assigns a small weight to the words that are discriminative for each style, i.e., those whose frequencies in different style subsets vary significantly. Consequently, the embedding focuses on style-independent words, which helps learn style-independent embeddings. Based on these text embeddings, dense retrieval is used to find the top-K similar sentences in the target style training subset, with cosine similarity as the similarity measure. Note that computing text embeddings over the whole training set is time-consuming, so we pre-compute the text embeddings at the beginning and update them after a certain number of training iterations, inspired by Guu et al. (2020).

After that, the same encoder is employed to obtain the latent representations of the top-K texts, i.e., U = [u_1, u_2, ..., u_K]^T, as in the sparse retriever.
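The dense retrieval step can be sketched as follows. This is a minimal illustration with toy two-dimensional "hidden states", and brute-force cosine search stands in for an efficient MIPS index.

```python
import math

def style_independent_embedding(hidden_states, weights):
    """q(x): linear combination of per-word hidden states with word weights alpha(w_i)."""
    dim = len(hidden_states[0])
    q = [0.0] * dim
    for h, a in zip(hidden_states, weights):
        for j in range(dim):
            q[j] += a * h[j]
    return q

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_top_k(query_emb, corpus_embs, k):
    """Indices of the top-K most cosine-similar sentences in the target style subset."""
    ranked = sorted(range(len(corpus_embs)),
                    key=lambda i: cosine(query_emb, corpus_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy example: a style-independent word gets a large weight (0.9),
# a style-discriminative word a small one (0.1).
q = style_independent_embedding([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.1])
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.1]]
print(retrieve_top_k(q, corpus, k=2))  # → [2, 0]
```

In practice, corpus embeddings would be pre-computed for the whole target style subset and refreshed periodically during training, as described above.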

Decoder
The decoder generates the transferred text word by word. At each step t, the inputs of the decoder are composed of three parts: 1) the output of the previous step ŷ_{t-1}; 2) a context vector c_h obtained by attending to the hidden states of x, i.e., H^enc; and 3) a context vector over the representations U of the retrieved samples. In this way, the retrieved target-style samples influence every generation step.
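One decoding step's input assembly can be sketched as below. The dot-product attention and the final concatenation of the three parts are our assumptions for illustration, not the paper's exact parameterization.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention_context(query, memory):
    """Dot-product attention: weight each memory vector by its similarity to the query."""
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * mem[j] for w, mem in zip(weights, memory)) for j in range(dim)]

def decoder_step_inputs(prev_output, h_enc, retrieved_u):
    """Gather the three inputs of one decoding step: the previous output state,
    a context over the source hidden states H_enc, and a context over the
    retrieved representations U. Concatenation here is a hypothetical choice."""
    c_h = attention_context(prev_output, h_enc)      # attend to the input sentence
    c_u = attention_context(prev_output, retrieved_u)  # attend to retrieved samples
    return prev_output + c_h + c_u

step_in = decoder_step_inputs([0.5, 0.5],
                              [[1.0, 0.0], [0.0, 1.0]],
                              [[0.2, 0.8], [0.9, 0.1]])
print(len(step_in))  # → 6
```

Because c_u is recomputed at every step, the retrieved samples can influence each generated word individually rather than only through a single pooled vector.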

Learning Objectives
In addition to the three losses widely used in previous style transfer works, i.e., the reconstruction loss, the cycle reconstruction loss, and the adversarial style loss, we introduce two more losses related to the retriever, i.e., the retrieval loss and the bag-of-words loss. Therefore, for a given input x with style s_i and target style s_j, the learning objective can be represented as:

L = L_{rec} + L_{cyc} + L_{adv} + L_{ret} + L_{bow},

where each term may carry a balancing weight, and L_rec, L_cyc, L_adv, L_ret and L_bow denote the reconstruction loss, the cycle reconstruction loss, the adversarial style loss, the retrieval loss, and the bag-of-words loss, respectively.
Reconstruction Loss According to previous works (Shen et al., 2017; Fu et al., 2018; John et al., 2019), this loss helps capture the informative features needed to reconstruct the input itself:

L_{rec} = -\log P_G(x \mid x, s_i).

Cycle Reconstruction Loss Cycle consistency is usually included in the loss to improve content preservation (Lample et al., 2019; Dai et al., 2019; Yi et al., 2020). For the generated ŷ = G(x, s_j), the output of transferring back to the source style should be as consistent with x as possible:

L_{cyc} = -\log P_G(x \mid \hat{y}, s_i),

where G is our TSST model.

Adversarial Style Loss
If we only use the reconstruction and cycle reconstruction losses, the model merely learns to copy the input to the output. So we employ adversarial training to build style supervision. Specifically, we utilize a classifier with M + 1 classes as the discriminator C, similar to Dai et al. (2019) and Yi et al. (2020). The first M classes represent the real texts in the datasets, and the (M + 1)-th class indicates generated fake texts.
Since the generated text ŷ is expected to be classified as the target style s_j, the adversarial style loss is defined as

L_{adv} = -\log P_C(s_j \mid \hat{y}),

and the negative gradient of the discriminator is employed to update the model.
As for the discriminator C, previous works such as Dai et al. (2019) and Yi et al. (2020) usually train it with a loss L_{C_1} that asks it to classify real texts into their true styles and generated texts into the fake class M + 1. In this paper, we additionally ask the discriminator to identify the style of the retrieved samples. The discriminator loss in our work can therefore be written as

L_C = L_{C_1} - \sum_{y' \in Y_{x \to \hat{y}}} \log P_C(s_j \mid y') - \sum_{y' \in Y_{\hat{y} \to x}} \log P_C(s_i \mid y'),

where Y_{x→ŷ} and Y_{ŷ→x} denote the retrieved samples in the transfer process from x to ŷ and from ŷ back to x, respectively. To jointly learn the dense retriever with the other parameters and to learn the target style representation from the retrieved samples, we introduce two additional losses into our objective.
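The (M + 1)-way adversarial setup above can be illustrated with hypothetical discriminator logits; here M = 2, so class index 2 is the "fake" class, and the numbers are toy values, not from the paper.

```python
import math

def softmax(logits):
    m = max(logits)
    es = [math.exp(x - m) for x in logits]
    total = sum(es)
    return [e / total for e in es]

def style_nll(logits, target_class):
    """Negative log-likelihood of one class under an (M + 1)-way classifier."""
    return -math.log(softmax(logits)[target_class])

# Hypothetical discriminator logits over [style_1, style_2, fake] for a generated text.
logits_for_generated = [0.2, 0.1, 2.0]

# Generator side: the transferred sentence should be classified as the target style s_j.
gen_loss = style_nll(logits_for_generated, target_class=1)
# Discriminator side: the same sentence should be classified as fake (class M + 1).
disc_loss = style_nll(logits_for_generated, target_class=2)

print(disc_loss < gen_loss)  # → True: this discriminator is confident the text is fake
```

The opposing targets on the same logits are what drive the adversarial dynamics: the generator is updated to push the fake-class probability down and the target-style probability up.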

Retrieval Loss
The retrieval loss is designed to align the style-independent embeddings of the input sentence and the corresponding transferred sentence; instantiated with the cosine similarity used by the dense retriever, it can be written as

L_{ret} = 1 - \cos\big(q(x), q(\hat{y})\big).

Bag-of-Words Loss This loss encourages the generator to select new words from the retrieved sentences, making our model pay more attention to the retrieved samples. In this way, the style expression in the target sentence is well adapted to the context. Let Ω denote the set of new words that appear in the retrieved samples but not in the input sentence x; the bag-of-words loss can then be written as

L_{bow} = -\sum_{w \in \Omega} \log P_G(w \mid x, s_j).

Note that the bag-of-words loss is applied to both directions in the cycle reconstruction process.
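A minimal sketch of the bag-of-words loss, under the assumption that it sums negative log-probabilities over Ω (the exact aggregation in the paper may differ); the probability table is a toy stand-in for the generator's output distribution.

```python
import math

def bag_of_words_loss(word_probs, retrieved_words, input_words):
    """Sum of -log p(w) over Omega, the new words that appear in the retrieved
    samples but not in the input sentence (one natural instantiation)."""
    omega = set(retrieved_words) - set(input_words)
    loss = 0.0
    for w in omega:
        # word_probs[w] stands for the model's probability of emitting w.
        loss -= math.log(word_probs.get(w, 1e-12))
    return loss, omega

probs = {"terrible": 0.5}
loss, omega = bag_of_words_loss(probs,
                                retrieved_words=["the", "salad", "is", "terrible"],
                                input_words=["the", "salad", "is", "delicious"])
print(omega)  # → {'terrible'}
```

Minimizing this loss raises the probability of the retrieved style words (here 'terrible') exactly where they differ from the input, which is how the model adapts the style expression to the context.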

Discussion
Please note that there are also some other works that involve a retrieval module to enhance unsupervised style transfer. Our approach differs in several aspects: the retriever in TSST can be trained in an end-to-end way (together with the encoder and decoder) to improve style-independent text retrieval, and the retrieved samples influence the decoder word by word. In this way, the information of the retrieved samples is fully exploited to learn a good context-aware style representation. Using retrieved samples without these modifications may bring no benefit, as shown in Sudhakar et al. (2019).

Experiment
In this section, we conduct experiments to study how well, and why, the proposed TSST model alleviates the inconsistency problem. Furthermore, a detailed ablation study is presented to show each objective function's contribution to the overall performance.

Datasets
The experiments are conducted on two well-known transfer tasks, sentiment transfer and formality transfer. The statistics of each dataset are shown in Table 1.
Yelp Dataset 1 is widely used as the benchmark for sentiment transfer. It is collected from restaurant and business reviews, with each text labeled as positive or negative. The same pre-processing as previous work is used in our experiment, and human references are provided for the test set.
GYAFC Dataset 2 denotes the Grammarly's Yahoo Answers Formality Corpus released by Rao and Tetreault (2018), a typical benchmark for formality transfer. The GYAFC dataset contains formal and informal sentences in two different domains, Entertainment & Music and Family & Relationships. In this paper, we use the latter one because it is more popular in this area.

Setups
Our baselines cover the three kinds of inductive learning approaches described in Section 2. For the embedding based methods, we choose CrossAlign (Shen et al., 2017), StyTrans (Dai et al., 2019), PFST (He et al., 2020), and StyIns (Yi et al., 2020). For the decoder based methods, MultiDecoder (Fu et al., 2018) and DualRL (Luo et al., 2019) are selected for comparison. Revision is considered as the representative of the classifier based methods. Finally, we also compare our model with previous retrieval-based methods: DRG, IMaT (Jin et al., 2019), and B-GST and G-GST (Sudhakar et al., 2019). All baseline results are taken from, or reproduced with, their public source code, so their detailed settings are omitted in this paper.
For our proposed TSST model, we employ LSTMs as the encoder and decoder to ensure a fair comparison with previous methods. Following Yi et al. (2020), we pre-train a forward LSTM language model on each dataset and use its parameters to initialize our encoder and decoder. Also following Yi et al. (2020), the discriminator is a CNN-based classifier with Spectral Normalization (Miyato et al., 2018), sharing the word embeddings of the encoder. The word embedding size, hidden state size, and number of retrieved samples K are set to 256, 512, and 5, respectively. In the retriever, we exclude trivial candidates identical to the input sentence. The embeddings of all sentences for dense retrieval are updated every 200 steps. To demonstrate the effectiveness of the sparse and dense retrievers, we also compare them with a random sampling retriever, whose corresponding model is denoted as TSST-random.

Evaluation Metrics
Previous works mainly evaluate style transfer methods from three aspects, i.e., style transfer accuracy, content preservation, and sentence fluency. Consequently, automatic measures such as accuracy, self-BLEU, ref-BLEU, and perplexity (PPL) are used in the evaluation. However, none of these metrics can properly evaluate how well a model alleviates the inconsistency problem introduced before, so we add an additional human evaluation to our experiment.
Automatic Evaluation To evaluate the style transfer accuracy, we first fine-tune a pre-trained BERT-based (Devlin et al., 2019) classifier on each dataset. The two classifiers achieve 98.6% and 89.9% accuracy on the test sets of Yelp and GYAFC, respectively. These classifiers are then used to predict the style labels of the generated sentences, and the classification accuracy serves as the style transfer accuracy. Both self-BLEU and ref-BLEU are used for content preservation evaluation: the former is the BLEU score between transferred sentences and source sentences, while the latter is between transferred sentences and human references. Following Dai et al. (2019) and Yi et al. (2020), we train a 5-gram language model with KenLM (Heafield, 2011) for each style to measure language fluency by the perplexity (PPL) of the transferred sentence. In addition, we report the geometric mean (GM) of Acc, self-BLEU, ref-BLEU, and 1/log(PPL) as the overall performance.

[Table 2: Automatic evaluation results (Acc↑, s-BLEU↑, r-BLEU↑, PPL↓, GM↑) on Yelp and GYAFC for all compared models, including CrossAlign (Shen et al., 2017).]

Human Evaluation We recruit three annotators with high-level language skills for human evaluation. We choose the four baselines with the highest GM scores and the three variants of our TSST model for this experiment. Following previous works (Dai et al., 2019; Yi et al., 2020), we randomly selected 100 generated sentences (50 for each style) from the test set, and the annotators were required to score each sentence from 1 to 5 in terms of style transfer accuracy (Sty), content preservation (Cont), and sentence fluency (Flu), where 1 is the lowest and 5 is the highest. In addition, they evaluated the consistency between the style words and the rest of the context in each generated sentence. To be clear, consistency (Cons) judges whether the modified parts are consistent with the retained content, while fluency focuses only on grammatical errors. Consistency is rated from 0 to 2, where 0, 1, and 2 stand for inconsistent, unsure, and consistent, respectively.
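The GM aggregate can be computed directly; the example numbers below are illustrative, not taken from Table 2.

```python
import math

def geometric_mean_score(acc, self_bleu, ref_bleu, ppl):
    """Geometric mean of Acc, self-BLEU, ref-BLEU and 1/log(PPL);
    lower perplexity therefore raises the overall score."""
    factors = [acc, self_bleu, ref_bleu, 1.0 / math.log(ppl)]
    product = 1.0
    for f in factors:
        product *= f
    return product ** (1.0 / len(factors))

print(round(geometric_mean_score(0.9, 0.6, 0.3, 40.0), 3))  # → 0.458
```

Using 1/log(PPL) rather than raw PPL keeps all four factors on an "up is better" scale, so a single geometric mean can summarize them.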

Experimental Results
Automatic Evaluation As listed in Table 2, our transductive learning models significantly improve the overall performance on both datasets compared with the inductive learning baselines. On the four specific evaluation measures, our model achieves better results on three of them, i.e., Acc for style transfer accuracy and s-BLEU and r-BLEU for content preservation, and all our models generate sentences with relatively low perplexity. Although some previous models achieve the best score on a single metric, they show a significant drawback on another. As for TSST-random, although the target-style words in the randomly sampled sentences are not consistent with the input content, the few modifications and the use of strongly target-style-related words still let this random baseline achieve high automatic scores. This is why human evaluation is also needed.
Human Evaluation Results are shown in Table 3. First, the comparison results are consistent with the automatic evaluation results in terms of style accuracy, content preservation, and fluency, indicating the reliability of our human evaluation. More importantly, our TSST-sparse and TSST-dense models achieve the highest consistency scores compared with the other baselines, showing the superiority of our transductive learning approach in tackling the inconsistency problem.

[Table 4 (excerpt of retrieved target-style examples): u_2: 'all olive gardens are a joke.'; u_3: 'the ribs are over cooked.' / 'prices are reasonable for the quality of what you get.'; u_4: 'our food was barely edible.' / 'good food, a little expensive for what you get.'; u_5: 'cold grits are n't a treat.' / 'definitely a good price for what you get.']
In contrast, consistency becomes worse when the retrieved samples are not related to the input, as shown by TSST-random, which further demonstrates the soundness of our approach. Comparing TSST-sparse and TSST-dense, we can see that jointly learning the retriever yields better consistency, which is in line with previous studies.

Case Study
To better understand what transductive learning brings to the text style transfer task, Table 4 shows some transferred examples from the Yelp test set. For the first example, transferring from negative to positive, DualRL and StyIns are able to perform the style transfer but with inappropriate expressions, e.g., 'fun' or 'very fun'. StyTrans and Revision fail to transfer the style, either simply copying the input or keeping negative expressions. Our TSST-dense model produces a perfect result by using 'reasonable' as the transferred style expression, learned from the retrieved examples shown in the table. Similar results are observed for the second example, transferring from positive to negative. More importantly, although no retrieved example contains exactly the phrase 'mustard beef ribs', our model still captures the negative pattern '[food] is a joke' to complete the style transfer.
Ablation Study To study the role of each loss in the objective function, we remove them one at a time and train the model from scratch. Due to the high cost of human evaluation, we only report the automatic results on the Yelp dataset, as shown in Table 5. We can see that each loss contributes to the performance of the model, and their combination performs the best. We also conduct experiments to explore the influence of the number of retrieved samples, as shown in Figure 3. Specifically, we test four different values, i.e., K = 1, 3, 5, 10. The overall performance of the model gradually increases with K and becomes stable for 3, 5, and 10. For the sake of both effectiveness and efficiency, we set K = 5 in our experiments.

Conclusions and Future Work
Previous style transfer models are mainly based on the inductive learning approach and thus suffer from the inconsistent style expression problem due to the lack of parallel corpora as supervision. To tackle this problem, we propose a novel transductive learning approach for unsupervised text style transfer. The key idea of our TSST model is to learn context-aware style expressions via samples retrieved from the target style datasets. Experimental results on two typical style transfer tasks show that TSST significantly improves the performance in terms of both automatic and human evaluation.
Our proposed transductive learning approach is very general, and this work mainly focuses on embedding-based methods. In the future, we plan to extend our approach to other methods, such as decoder-based and classifier-based methods. Moreover, we will try more powerful retrieval methods, such as DPR (Karpukhin et al., 2020).

Ethical Considerations
We honor and support the ACL Code of Ethics. This paper focuses on style transfer, which aims to change the style of a text while preserving its semantic content. We recognize that style transfer methods may be misused to generate misinformation, e.g., fake customer reviews. However, style transfer methods can also provide strong support for mitigating harmful biases in online information, e.g., transferring from offensive to non-offensive (Nogueira dos Santos et al., 2018; Tran et al., 2020) and from biased to neutral (Pryzant et al., 2020). Overall, it is still meaningful to continue this line of research on the basis of prior work. The datasets used in this paper are all from previously published works and do not involve privacy or ethical issues.