Enhancing Content Preservation in Text Style Transfer Using Reverse Attention and Conditional Layer Normalization

Text style transfer aims to alter the style (e.g., sentiment) of a sentence while preserving its content. A common approach is to map a given sentence to a content representation that is free of style, and the content representation is fed to a decoder with a target style. Previous methods that filter style completely remove style-bearing tokens at the token level, which incurs a loss of content information. In this paper, we propose to enhance content preservation by implicitly removing the style information of each token with reverse attention, thereby retaining the content. Furthermore, we fuse content information when building the target style representation, making it dynamic with respect to the content. Our method creates not only a style-independent content representation, but also a content-dependent style representation for transferring style. Empirical results show that our method outperforms the state-of-the-art baselines by a large margin in terms of content preservation. In addition, it is also competitive in terms of style transfer accuracy and fluency.


Introduction
Style transfer is a popular task in computer vision and natural language processing. It aims to convert an input with a certain style (e.g., sentiment, formality) into a different style while preserving the original content.
One mainstream approach is to separate style from content, and to generate a transferred sentence conditioned on the content information and a target style. Recently, several models (Xu et al., 2018; Wu et al., 2019) have proposed removing style information at the token level by filtering out tokens with style information, which are identified using either attention-based methods (Bahdanau et al., 2015) or frequency-ratio based methods (Wu et al., 2019). This line of work is built upon the assumption that style is localized to certain tokens in a sentence, and that a token has either content or style information, but not both. Thus, by utilizing a style marking module, the models filter out the style tokens entirely when constructing a style-independent content representation of the input sentence.

[Figure 1: An example from Yelp. Using the average attention score as a threshold, a previous method masks out more tokens than necessary ("To our knowledge, this is the <MASK> <MASK> <MASK> <MASK>."), whereas ours retains the full content ("To our knowledge, this is the best deal in phoenix.").]

The drawback of the filtering method is that one needs to manually set a threshold to decide whether a token is stylistic or content-related. Previous studies address this issue by using the average attention score as a threshold (Xu et al., 2018; Wu et al., 2019). A major shortcoming of this approach is its incapability of handling a flat attention distribution. When the distribution is flat, i.e., similar attention scores are assigned to all tokens, the style marking module removes/masks out more tokens than necessary. This incurs a loss of content information, as depicted in Figure 1.
In this paper, we propose a novel method for text style transfer. A key idea is to exploit the fact that a token often possesses both style and content information. For example, the word "delicious" is a token with strong style information, but it also implies that the subject is food. Such words play a pivotal role in representing style (e.g., positive sentiment) as well as presenting a hint at the subject matter/content (e.g., food). The complete removal of such tokens leads to the loss of content information.
For the sake of enhancing content preservation, we propose a method to implicitly remove style at the token level using reverse attention. We utilize knowledge attained from attention networks (Bahdanau et al., 2015) to estimate the style information of each token, and suppress that signal to take out style. The attention mechanism is known to attend to interdependent representations given a query. In a style classification task, an attention score can be interpreted as the extent to which a token carries the style attribute. If we can identify which tokens reveal stylistic properties and to what extent, it is then possible to take the negation and approximate the amount of content attribute within a token. In this paper, we call this reverse attention. We utilize such scores to suppress the stylistic attribute of tokens, fully capturing the content property.
This paper further enhances content preservation by fusing content information into the target style representation. Despite extensive efforts in creating content representations, previous work has overlooked building content-dependent style representations. The common approach is to project the target style onto an embedding space, and to share the style embedding among inputs of the same style as an input to the decoder. In contrast, our work sheds light on building content-related style by utilizing conditional layer normalization (CLN). This module takes in content representations, and creates a content-dependent style representation by shaping the content variable to fit the distribution of the target style. This way, our style representation varies according to the content of the input sequence even for the same target style.
Our method is based on two techniques, Reverse Attention and Conditional Layer Normalization, thus we call it RACoLN. In empirical evaluation, RACoLN achieves state-of-the-art performance in terms of content preservation, outperforming the previous state of the art by a large margin, and shows competency in style transfer accuracy and fluency. The contributions are as follows:

• We introduce reverse attention as a way to suppress style information while preserving content information when building a content representation of an input.
• Aside from building style-independent content representation, our approach utilizes conditional layer normalization to construct content-dependent style representation.
• Our model achieves state-of-the-art performance in terms of content preservation, outperforming the current state of the art by more than 4 BLEU points on the Yelp dataset, and shows competency on the other metrics as well.

Related Work
In recent years, text style transfer in the unsupervised learning environment has been studied and explored extensively. The text style transfer task views a sentence as being comprised of content and style. Thus, there have been attempts to disentangle the two components (Shen et al., 2017; Xu et al., 2018; Wu et al., 2019). Shen et al. (2017) map a sentence to a shared content space among styles to create a style-independent content variable. Some studies view style as a localized feature of sentences. Xu et al. (2018) propose to identify style tokens with an attention mechanism and filter out such tokens. A frequency-ratio based method is proposed to enhance the filtering process (Wu et al., 2019). This stream of work is similar to ours in that the objective is to take out style at the token level, but different in that ours does not remove tokens completely. Instead of disentangling content and style, other papers focus on revising an entangled representation of an input. A few previous studies utilize a pre-trained classifier and edit the entangled latent variable until it contains the target style using gradient-based optimization (Wang et al., 2019; Liu et al., 2020). He et al. (2020) view each domain of data as a partially observable variable, and transfer sentences using amortized variational inference. Dai et al. (2019) use the transformer architecture and rewrite style in the entangled representation at the decoder. We consider this model the strongest baseline in terms of content preservation.
In the domain of computer vision, it is a prevalent practice to exploit variants of normalization to transfer style (Dumoulin et al., 2017;Ulyanov et al., 2016). Dumoulin et al. (2017) proposed conditional instance normalization (CIN) in which each style is assigned with separate instance normalization parameter, in other words, a model learns separate gain and bias parameters of instance normalization for each style.
Our work differs in several ways. Style transfer in images is viewed as changing the "texture" of an image. Therefore, Dumoulin et al. (2017) place a CIN module following every convolution layer, "painting" with style-specific parameters on the content representation. As a result, the network passes on an entangled representation of an image. Our work is different in that we disentangle content and style, thus we do not overwrite content with style-specific parameters. In addition, we apply CLN only once before passing the representation to the decoder.

[Figure 2: Input x first passes through the style marker module for computing reverse attention. The reverse attention scores are then applied to the token embeddings, implicitly removing style. The content representation from the encoder is fed to the stylizer, in which the style representation is made from the content. The decoder generates the transferred output by conditioning on the two representations.]

Task Definition
Let D = {(x_i, s_i)} be a training corpus, where each x_i is a sentence and s_i is its style label. Our experiments were carried out on a sentiment analysis task, where there are two style labels, namely "positive" and "negative." The task is to learn from D a model x̂_ŝ = f_θ(x, ŝ), with parameters θ, that takes an input sentence x and a target style ŝ as inputs, and outputs a new sentence x̂_ŝ that is in the target style and retains the content information of x.

Model Overview
We conduct this task in an unsupervised environment in which the ground truth sentence x_ŝ is not provided. To achieve our goal, we employ a style classifier s = C(x) that takes a sentence x as input and returns its style label. We pre-train this model on D and keep it frozen in the process of learning f_θ.

Given the style classifier C(x), our task becomes to learn a model x̂_ŝ = f_θ(x, ŝ) such that C(x̂_ŝ) = ŝ. As such, the task is conceptually similar to an adversarial attack: the input x is from the style class s, and we want to modify it so that it will be classified into the target style class ŝ.
The architecture of our model f_θ is shown in Figure 2; it will sometimes be referred to as the generator network. It consists of an encoder, a stylizer, and a decoder. The encoder maps an input sequence x into a style-independent representation z_x. In particular, the encoder has a style marker module that computes attention scores of the input tokens, and it "reverses" them to estimate the content information. The reversed attention scores are applied to the token embeddings E(x), and the results Ẽ(x) are fed to a bidirectional GRU to produce z_x.
The stylizer takes a target styleŝ and the content representation z x as inputs, and produces a content-related style representation zŝ. Finally, the decoder takes the content representation z x and style representation zŝ as inputs, and generates a new sequencexŝ.

Style Marker Module
Let x = [x_1, x_2, ..., x_T] be a length-T input sequence with style s. The style marker module is pre-trained to estimate the amount of style information in each token of a given input. We use one layer of bidirectional GRU with attention (Yang et al., 2016). Specifically,

α_t = exp(u^T h_t / τ) / Σ_k exp(u^T h_k / τ),

where h_t is the hidden representation from the bidirectional GRU at time step t, u is a learnable parameter vector initialized with random weights, and τ denotes the temperature in the softmax. When pre-training the style marker module, we construct a sentence representation by taking the weighted sum of the token representations with the weights being the attention scores, and feed the context vector to a fully-connected layer.
The cross-entropy loss is used to learn the parameters of the style marker module. The attention scores in the style marker indicate what tokens are important to style classification, and to what extent. Those scores will be "reversed" in the next section to reveal the content information. The fullyconnected layer of the style marker module is no longer needed once the style marker module is trained. It is hence removed.
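A minimal pure-Python sketch of the temperature-scaled attention scoring described above. The function names and the assumption that the BiGRU hidden states are already computed (passed in as `H`) are illustrative, not the paper's implementation:

```python
import math

def dot(u, h):
    """Inner product u^T h over plain Python lists."""
    return sum(ui * hi for ui, hi in zip(u, h))

def attention_scores(H, u, tau=1.0):
    """Temperature-scaled softmax attention over token hidden states.

    H: list of T hidden-state vectors h_t (from the bidirectional GRU).
    u: learnable query vector.
    Returns alpha: list of T weights summing to 1; alpha_t estimates how
    much style information token t carries.
    """
    logits = [dot(u, h) / tau for h in H]
    m = max(logits)                       # numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

During pre-training, the sentence (context) vector would be the alpha-weighted sum of the h_t, fed to the fully-connected classification layer.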

Reverse Attention
Using the attention scores from the pre-trained style marker module, we propose to implicitly remove the style information in each token. We negate the extent of style information in each token to estimate the extent of content information, namely reverse attention:

α̃_t = 1 − α_t,

where α_t is the attention value from the style marker module, and α̃_t is the corresponding reverse attention score. We multiply the reverse attention scores with the embedding vectors of the tokens.
Intuitively, this can be viewed as implicitly removing the stylistic attribute of each token, suppressing the norm of its embedding in proportion to its reverse attention score. The representations finally flow into a bidirectional GRU to produce a content representation z_x, which is the last hidden state of the bidirectional GRU. By utilizing reverse attention, we map a sentence to a style-independent content representation.
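The reverse-attention step above can be sketched as follows. This is a minimal illustration on plain Python lists; the paper applies the same scaling to learned token embeddings before the encoder GRU:

```python
def reverse_attention(alpha):
    """Negate per-token style attention to estimate content weight."""
    return [1.0 - a for a in alpha]

def apply_reverse_attention(embeddings, alpha):
    """Scale each token embedding (a list of vectors) by its reverse
    attention score, implicitly suppressing stylistic tokens while
    keeping (most of) their content signal."""
    return [[(1.0 - a) * e for e in emb]
            for emb, a in zip(embeddings, alpha)]
```

A strongly stylistic token (alpha close to 1) is shrunk toward zero but never fully removed, which is the key difference from hard filtering.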

Stylizer
The goal of the stylizer is to create a content-related style representation. We do this by applying conditional layer normalization to the content representation z_x from the encoder. Layer normalization requires the number of gain and bias parameters to match the size of the input representation. Therefore, mainly for the purpose of shrinking this size, we perform an affine transformation on the content variable.
The representation is then fed to conditional layer normalization so that it falls into the target style distribution in style space. Specifically,

z_ŝ = γ_ŝ ⊙ (z − µ) / σ + β_ŝ,

where z is the affine-transformed content representation, µ and σ are the mean and standard deviation of the input vector respectively, and ŝ is the target style. Our model learns separate γ_ŝ (gain) and β_ŝ (bias) parameters for each style. Normalization is commonly used to rescale feature values to a common scale, but it is known to implicitly preserve the features. Therefore, we argue that the normalized content feature values retain the content information of the content variable. By passing through the conditional layer normalization module, the content latent vector is scaled and shifted with style-specific gain and bias parameters, falling into the target style distribution. Thus, unlike previous attempts in text style transfer, the style representation is dynamic with respect to the content, being a content-dependent embedding.
In order to block the backpropagation signal related to style from flowing into z_x, we apply stop-gradient on z_x before feeding it to the stylizer.
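A minimal sketch of conditional layer normalization on a single content vector. The per-style gain/bias vectors here are stand-ins for the learned parameters; in the model they would be trainable and there would be an affine projection beforehand:

```python
import math

def conditional_layer_norm(z, gamma_s, beta_s, eps=1e-5):
    """Normalize content vector z, then scale and shift it with the
    target style's gain (gamma_s) and bias (beta_s) parameters.

    The normalization keeps the relative structure of z (the content),
    while gamma_s/beta_s move it into the target style's distribution.
    """
    mu = sum(z) / len(z)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in z) / len(z))
    return [g * (v - mu) / (sigma + eps) + b
            for v, g, b in zip(z, gamma_s, beta_s)]
```

With the same target style, two different content vectors yield two different style representations, which is the "content-dependent style" property.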

Decoder
The decoder generates a sentence in the target style, conditioned on the content-related style representation and the content representation. We construct our decoder using a single GRU layer.
As briefly discussed in Section 3.2, the outputs of our generator are further passed on to different loss functions. However, neither the sampling process nor greedy decoding allows gradients to flow, because these methods are not differentiable. Therefore, we use soft sampling to keep the gradient flow. Specifically, when gradient flow is required through the outputs, we take the product of the probability distribution at each time step and the weights of the embedding layer to project the outputs onto the word embedding space. We empirically found soft sampling more suitable in our environment than Gumbel-softmax (Jang et al., 2017).
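The soft-sampling projection for one decoding step can be sketched as an expected embedding under the decoder's output distribution; the names below are illustrative:

```python
def soft_sample_embedding(probs, W_emb):
    """Soft sampling at one decoding step.

    probs: length-V output distribution over the vocabulary.
    W_emb: V x d embedding matrix (list of embedding rows).
    Returns the probability-weighted average of the embedding rows,
    a differentiable stand-in for sampling a discrete token.
    """
    d = len(W_emb[0])
    return [sum(p * row[j] for p, row in zip(probs, W_emb))
            for j in range(d)]
```

When `probs` is one-hot this reduces to an ordinary embedding lookup; for softer distributions the gradient can flow back through every vocabulary entry.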

Pre-trained Style Classifier
Due to the lack of a parallel corpus, we cannot train the generator network with maximum likelihood estimation on style transfer ability. Therefore, this paper employs a pre-trained classifier C(x) to train our generator on transferring style. Our classifier network has the same structure as the style marker module with a fully-connected layer appended; nonetheless, it is a separate model obtained from a different set of initial model parameters. We use the cross-entropy loss for training:

L_cls = −E_{(x,s)∼D}[log p_C(s | x)].

We freeze the weights of this network after it has been fully trained.

The Loss Function
As shown in Figure 3, our loss function consists of four parts: a self reconstruction loss L self , a cycle reconstruction loss L cycle , a content loss L content , and a style transfer loss L style .

Self Reconstruction Loss
[Figure 3: Enc_θ, Sty_θ, and Dec_θ denote the encoder, the stylizer, and the decoder respectively. The circle denotes a generated sentence with soft sampling. As illustrated, L_cycle, L_style, and L_content require soft sampling to keep the gradient flow.]

Let (x, s) ∈ D be a training example. If we ask our model f_θ(x, ŝ) to "transfer" the input into its original style, i.e., ŝ = s, we would expect it to reconstruct the input. Hence we have the following self reconstruction loss:

L_self = −E_{(x,s)∼D}[log p_D(x | z_x, z_s)],
where z x is the content representation of the input x, z s is the representation of the style s, and p D is the conditional distribution over sequences defined by the decoder.

Cycle Reconstruction Loss
Suppose we first transfer a sequence x into another style ŝ to get x̂_ŝ using soft sampling, and then transfer x̂_ŝ back to the original style s. We would expect to reconstruct the input x. Hence we have the following cycle reconstruction loss:

L_cycle = −E_{(x,s)∼D}[log p_D(x | z_{x̂_ŝ}, z_s)],

where z_{x̂_ŝ} is the content representation of the transferred sequence x̂_ŝ.¹

Content Loss
In the aforementioned cycle reconstruction process, we obtain a content representation z_x of the input x and a content representation z_{x̂_ŝ} of the transferred sequence x̂_ŝ. As the two transfer steps presumably involve only style but not content, the two content representations should be similar. Hence we have the following content loss:

L_content = ‖z_x − z_{x̂_ŝ}‖²,

¹ Strictly speaking, the quantity is not well-defined because there is no description of how the target style ŝ is picked. In our experiments, we use data with two styles, so the target style just means the other style. To apply the method to problems with multiple styles, random sampling of a different style should be added. This remark applies also to the two loss terms introduced below.

Style Transfer Loss
We would like the transferred sequence x̂_ŝ to be of style ŝ. Hence we have the following style transfer loss:

L_style = −E_{x∼D}[log p_C(ŝ | x̂_ŝ)],

where p_C is the conditional distribution over styles defined by the style classifier C(x). As mentioned in Section 3.5, x̂_ŝ is generated with soft sampling.

Total Loss
In summary, we balance the four loss functions to train our model:

L_total = λ_1 L_self + λ_2 L_cycle + λ_3 L_content + λ_4 L_style.
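The balancing of the four losses can be sketched as a weighted sum. The mapping of each λ to a particular term is an assumption on our part; the implementation details later state that λ_1 and λ_2 are set to 0.5 and the remaining weights to 1:

```python
def total_loss(l_self, l_cycle, l_content, l_style,
               lambdas=(0.5, 0.5, 1.0, 1.0)):
    """Weighted sum of the four training losses.

    lambdas: (lambda_1, ..., lambda_4) balancing coefficients; the
    default values follow the paper's stated hyperparameters, but the
    per-term assignment here is illustrative.
    """
    terms = (l_self, l_cycle, l_content, l_style)
    return sum(lam * t for lam, t in zip(lambdas, terms))
```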

Datasets
Following prior work on text style transfer, we use two common datasets: Yelp and IMDB review.

Yelp Review
Our study uses the Yelp review dataset, which contains 266K positive and 177K negative reviews. The test set contains a total of 1,000 sentences, 500 positive and 500 negative, and human-annotated sentences are provided, which are used in measuring content preservation.

IMDB Movie Review
Another dataset we test on is the IMDB movie review dataset (Dai et al., 2019). This dataset is comprised of 17.9K positive and 18.8K negative reviews for the training corpus, and 2K sentences are used for testing.

Style Transfer Accuracy
Style transfer accuracy (S-ACC) measures whether the generated sentences carry the target style. We have mentioned the style classifier C(x), which is used in the loss function. To evaluate transfer accuracy, we train another style classifier, C_eval(x). It has the same architecture as before and is trained on the same data, but from a different set of initial model parameters. We use this architecture due to its superior performance compared to the commonly used CNN-based classifier (Kim, 2014). Our evaluation classifier achieves an accuracy of 97.8% on Yelp and 98.9% on IMDB, higher than that of the CNN-based classifier.

Content Preservation
A well-transferred sentence must maintain its content. In this paper, content preservation is evaluated with two BLEU scores (Papineni et al., 2002): one between the generated sentence and the input sentence (self-BLEU), and the other against the human-generated sentence (ref-BLEU). With this metric, one can evaluate how well a sentence maintains its content through inference.
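The core quantity behind these BLEU scores is clipped n-gram precision between two token sequences; a minimal sketch (the actual evaluation uses the full BLEU metric with multiple n-gram orders and a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of candidate against reference.

    candidate, reference: lists of tokens.
    Counts each candidate n-gram at most as often as it appears in the
    reference, then divides by the number of candidate n-grams.
    """
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0
```

For self-BLEU the reference is the input sentence itself; for ref-BLEU it is the human-annotated transfer.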

Fluency
A natural language generation task aims to output a sentence that is not only task-specific but also fluent. This study measures the perplexity (PPL) of generated sentences in order to measure fluency. Following Dai et al. (2019), we use a 5-gram KenLM (Heafield, 2011) trained on the two training datasets. A lower PPL score indicates that a transferred sentence is more fluent.

BERT Score

Zhang et al. (2020) proposed BERT score, which computes the contextual similarity of two sentences. Previous methods, such as BLEU score, compute an n-gram matching score, while BERT score evaluates the contextual embeddings of the tokens obtained from pre-trained BERT (Devlin et al., 2019). This evaluation metric has been shown to correlate with human judgement, thus our paper includes the BERT score between model-generated outputs and the human reference sentences. We report precision, recall, and F1 score.

Human Evaluation
In addition to automatic evaluation, we validate the generated outputs with human evaluation. For each model, we randomly sample 150 outputs from each of the two datasets, for a total of 300 outputs per model. Given the target style and the original sentence, the annotators are asked to evaluate each model-generated sentence with a score ranging from 1 (Very Bad) to 5 (Very Good) on content preservation, style transfer accuracy, and fluency. We report the average scores from the 4 hired annotators in Table 3.

Implementation Details
In this paper, we set the embedding size to 128 and the hidden representation dimension of the encoder to 500. The size of the bias and gain parameters of the conditional layer norm is 200, and the size of the hidden representation for the decoder is set to 700 to condition on both the content and style representations. The Adam optimizer (Kingma and Ba, 2015) was used to update parameters with the learning rate set to 0.0005. For the balancing parameters of the total loss function, we set λ_1 and λ_2 to 0.5, and the rest to 1.

[Table 1: Automatic evaluation results on the Yelp dataset, reporting S-ACC, ref-BLEU, self-BLEU, PPL, G-score, BERT-P, BERT-R, and BERT-F1 for Cross-Alignment (Shen et al., 2017), the other baselines, and our model. Bold numbers indicate best performance. G-Score denotes the geometric mean of self-BLEU and S-ACC, and BERT-P, BERT-R, and BERT-F1 are BERT score precision, recall, and F1 respectively. All baseline model outputs and code were taken from their official repositories where publicly available.]

Experimental Result & Analysis
We compare our model with the baseline models; the automatic evaluation results are presented in Table 1. Our model outperforms the baseline models in terms of content preservation on both datasets. In particular, on the Yelp dataset, our model achieves a 59.4 self-BLEU score, surpassing the previous state-of-the-art model by more than 4 points. Furthermore, our model also achieves the state-of-the-art result in content preservation on the IMDB dataset, which is comprised of longer sequences than those of Yelp.
In terms of style transfer accuracy and fluency, our model is highly competitive. Our model achieves the highest score in style transfer accuracy on both of the datasets (91.3 on Yelp and 83.1 on IMDB). Additionally, our model shows the ability to produce fluent sentences as shown in the perplexity score. In terms of the BERT scores, the proposed model performs the best, having the highest contextual similarity with the human reference among the style transfer models.
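The G-score reported in the automatic evaluation is the geometric mean of self-BLEU and S-ACC; a minimal sketch:

```python
import math

def g_score(s_acc, self_bleu):
    """Geometric mean of style transfer accuracy and self-BLEU.

    A single number balancing the style/content trade-off: a model that
    sacrifices either metric entirely scores 0.
    """
    return math.sqrt(s_acc * self_bleu)
```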
With the automatic evaluation results, we see a trend of trade-offs. Most of the baseline models are good at a particular metric but show room for improvement on the others. For example, Deep Latent and Cross-Alignment consistently perform well in terms of perplexity, but their ability to transfer style and preserve content needs improvement. Style Transformer achieves comparable performance across all evaluation metrics, but our model outperforms it on every metric on both datasets. The results therefore show that our model is not only well-balanced but also strong in every aspect of the text style transfer task.
As for the human evaluation, we observe that the results mainly conform with the automatic evaluation. Our model received the highest scores on the style and content evaluation metrics on both of the datasets, by a large margin compared to the other baselines. Moreover, the fluency score is comparable with that of the Deep Latent model, showing its competency in creating fluent output. Both automatic and human evaluation depict the strength of the proposed model, not only in preserving content but also on the other metrics.

[Table: Example outputs on the Yelp and IMDB datasets.]

Yelp
Original Input: Everyone is always super rude and unprofessional .
Original Input: I love this place , the service is always great !
Cross-Alignment: I know this place , the food is just a horrible !
ControlledGen: I avoid this place , the service is nasty depressing vomit
Deep Latent: I do n't know why the service is always great !
Style Transformer: I do n't recommend this place , the service is n't !
RACoLN (Ours): I avoid this place , the service is always horrible !

IMDB
Original Input: I actually disliked the leading characters so much that their antics were never funny but pathetic .
Cross-Alignment: I have never get a good movie , i have never have seen in this movie .
ControlledGen: I actually anticipated the leading characters so much that their antics were never funny but timeless .
Deep Latent: I actually disliked the leading characters so much that their antics were never funny but incredible .
Style Transformer: I actually disliked the leading characters so much that their antics were never funny but vhs .
RACoLN (Ours): I actually liked the leading characters so much that their antics were never corny but appropriate .

Original Input: The plot is clumsy and has holes in it .
Cross-Alignment: The worst film is one of the worst movies i 've ever seen .
ControlledGen: The plot is top-notch and has one-liners in it .
Deep Latent: The plot is tight and has found it in a very well done .
Style Transformer: The plot is joys and has flynn in it .
RACoLN (Ours): The plot is incredible and has twists in it .

Style and Content Space
We visualize the Yelp test dataset projected onto the content and style spaces using t-SNE in Figure 4. It is clearly observed that the content representations (z_x) are spread across the content space, showing that the representations are independent of style. After the content representations go through the stylizer module, there is a clear distinction between the representations of different styles (z_ŝ) in the style space. This is in sharp contrast to the corresponding distributions of the style-independent content representations shown on the right of the figure. The figure clearly depicts how the style-specific parameters in the stylizer module shape the content representations to fall into the target style distribution. This illustrates how our model successfully removes style at the encoder and constructs content-related style at the stylizer module.

Ablation Study
In order to validate the proposed modules, we conduct an ablation study on the Yelp dataset, presented in Table 5. We observe a significant drop across all aspects without the reverse attention module. In the other case, where we remove the stylizer module and use a style embedding as in previous papers, the model loses the ability to retain content, a drop of around 6 points on self-BLEU. We find that the two core components are interdependent in successfully transferring style in text. Lastly, as for the loss functions, incorporating L_content brings a meaningful increase in content preservation.

[Figure 4: Visualization of the Yelp test dataset in content and style space using t-SNE. Gray dots denote sentences with negative style transferred to positive sentiment, while red dots denote sentences with positive style transferred to negative sentiment.]

Conclusion
In this paper, we introduce a way to implicitly remove style at the token level using reverse attention, and to fuse content information into the style representation using conditional layer normalization. With these two core components, our model is able to enhance content preservation while keeping the outputs fluent and in the target style. Both automatic and human evaluation show that our model has the best ability in preserving content and is strong on the other metrics as well. In the future, we plan to study problems with more than two styles and to apply multi-attribute style transfer, where the target style is comprised of multiple styles.