NAST: A Non-Autoregressive Generator with Word Alignment for Unsupervised Text Style Transfer

Autoregressive models have been widely used in unsupervised text style transfer. Despite their success, these models still suffer from the content preservation problem: they usually ignore part of the source sentence and generate some irrelevant words with strong styles. In this paper, we propose a Non-Autoregressive generator for unsupervised text Style Transfer (NAST), which alleviates the problem from two aspects. First, we observe that most words in the transferred sentence can be aligned with related words in the source sentence, so we explicitly model word alignments to suppress irrelevant words. Second, existing models trained with the cycle loss align sentences in two stylistic text spaces, which lacks fine-grained control at the word level. The proposed non-autoregressive generator focuses on the connections between aligned words, which learns the word-level transfer between styles. In experiments, we integrate the proposed generator into two base models and evaluate them on two style transfer tasks. The results show that NAST can significantly improve the overall performance and provide explainable word alignments. Moreover, the non-autoregressive generator achieves over 10x speedups at inference. Our codes are available at https://github.com/thu-coai/NAST.


Introduction
Text style transfer aims at changing the text style while preserving the style-irrelevant contents, which has a wide range of applications, e.g., sentiment transfer (Shen et al., 2017), text formalization (Rao and Tetreault, 2018), and author imitation (Jhamtani et al., 2017). Due to the lack of parallel training data, most works focus on unsupervised text style transfer using non-parallel stylistic data.

Figure 2: The two-step decomposition of NAST. Step 1: generate the indices of aligned words, where [Mask] is a placeholder for unaligned words. Step 2: generate the transferred sentence non-autoregressively.
The cycle loss (Lample et al., 2019) has been widely adopted by unsupervised text style transfer models (Dai et al., 2019; Yi et al., 2020). Specifically, the cycle loss minimizes the reconstruction error for the sentence transferred from style X to style Y and then back to X, which aligns the sentences in the two stylistic text spaces to achieve the transfer and preserve style-irrelevant contents. Cycle-loss-based models are trained in an end-to-end fashion and thus can be easily applied to different datasets.

Although cycle-loss-based models yield promising results, one of their major failure cases is to replace part of the source sentence with irrelevant words that have strong styles, as shown in Fig 1(a). This problem degrades content preservation and can be alleviated from two perspectives. First, we observe that most words in the human-written transferred sentence can be aligned with those in the source sentence. As shown in Fig 1(b), we can align "Not" with "Not", "terrible" with "perfect", and leave only a few words unaligned. It shows that humans regard the alignments between words as a key aspect of content preservation, but they are not explicitly modeled by cycle-loss-based models yet. Second, existing models use the cycle loss to align sentences in two stylistic text spaces, which lacks control at the word level. For example, in sentiment transfer, "tasty" should be mapped to "awful" (because they both depict food tastes) but not "expensive". We utilize a non-autoregressive generator to model the word-level transfer, where the transferred words are predicted based on contextual representations of the aligned source words.
In this paper, we propose a Non-Autoregressive generator for unsupervised text Style Transfer (NAST), which explicitly models word alignment for better content preservation. Specifically, our generation process is decomposed into two steps: first predicting word alignments conditioned on the source sentence, and then generating the transferred sentence with a non-autoregressive (NAR) decoder. Modeling word alignments directly suppresses the generation of irrelevant words, and the NAR decoder exploits the word-level transfer. NAST can be used to replace the autoregressive generators of existing cycle-loss-based models.

In the experiments, we integrate NAST into two base models: StyTrans (Dai et al., 2019) and LatentSeq. Results on two benchmark datasets show that NAST steadily improves the overall performance. Compared with autoregressive models, NAST greatly accelerates training and inference and provides better optimization of the cycle loss. Moreover, we observe that NAST learns explainable word alignments.

Our contributions are:
• We propose NAST, a Non-Autoregressive generator for unsupervised text Style Transfer. By explicitly modeling word alignments, NAST suppresses irrelevant words and improves content preservation for cycle-loss-based models. To the best of our knowledge, we are the first to introduce a non-autoregressive generator to an unsupervised generation task.
• Experiments show that incorporating NAST in cycle-loss-based models significantly improves the overall performance and the speed of training and inference. In further analysis, we find that NAST provides better optimization of the cycle loss and learns explainable word alignments.

Related Work
Unsupervised Text Style Transfer. We categorize style transfer models into three types. The first type (Shen et al., 2017; Yang et al., 2018; John et al., 2019) disentangles the style and content representations, and then combines the content representations with the target style to generate the transferred sentence. However, the disentangled representations are limited in capacity and thus hardly scalable for long sentences (Dai et al., 2019). The second type is the editing-based method (Wu et al., 2019a,b), which edits the source sentence with several discrete operations. The operations are usually trained separately and then constitute a pipeline. These methods are highly explainable, but they usually need to locate and replace the stylistic words, which hardly applies to complex tasks that require changes in sentence structures. Although our two-step generation seems similar to a pipeline, NAST is trained in an end-to-end fashion with the cycle loss. All transferred words in NAST are generated, not copied, which is essentially different from these methods. The third type is based on the cycle loss. Lample et al. (2019) introduce the back translation method into style transfer, where the model is directly trained with the cycle loss after a proper initialization. The following works (Dai et al., 2019; Luo et al., 2019; Yi et al., 2020) further adopt a style loss to improve the style control.
A recent study (Zhou et al., 2020) explores the word-level information for style transfer, which is related to our motivation. However, they focus on word-level style relevance in designing novel objectives, while we focus on modeling word alignments and the non-autoregressive architecture.

Non-Autoregressive Generation
Non-AutoRegressive (NAR) generation is first introduced in machine translation for parallel decoding with low latency (Gu et al., 2018). The NAR generator assumes that each token is generated independently of each other conditioned on the input sentence, which sacrifices the generation quality in exchange for the inference speed.
Most works on NAR generation focus on improving the generation quality while preserving the speed acceleration in machine translation (Gu et al., 2018). Several works (Bao et al., 2019; Ran et al., 2019) improve the decoder input by aligning source words with target words, which utilize a two-step generation process and inspire the design of NAST. To our knowledge, only a few works on NAR generation explore applications other than machine translation (Peng et al., 2020). We are the first to apply NAR generators to an unsupervised text generation task, where NAST surprisingly outperforms autoregressive models in transfer quality besides the acceleration.

Methods
In this paper, we formulate unsupervised text style transfer as follows: for two non-parallel corpora with styles $\mathcal{X}$ and $\mathcal{Y}$ respectively, the task aims at training a style transfer model $G$. The model learns the transfer in two directions, $\mathcal{X} \rightarrow \mathcal{Y}$ and $\mathcal{Y} \rightarrow \mathcal{X}$, which can be denoted as $P_{G_Y}(Y|X)$ and $P_{G_X}(X|Y)$, respectively.

NAST
NAST is a non-autoregressive generator based on the observation of word alignment: in style transfer tasks, most generated words can be aligned with the source words, where each pair of aligned words is either identical or highly relevant. For simplicity, we only describe $G_Y$, where $G_X$ shares the architecture and parameters except the style embeddings. Given the source sentence $X = [x_1, x_2, \cdots, x_N]$, the generation process of NAST is decomposed into two steps: predicting the alignment $T = [t_1, t_2, \cdots, t_M]$, and then generating the transferred sentence $Y = [y_1, y_2, \cdots, y_M]$.
When $1 \le t_i \le N$, the generated word $y_i$ is aligned with the source word $x_{t_i}$. Otherwise, $y_i$ is not aligned with any source word, where we set $t_i$ to 0 and fill $x_{t_i}$ with a [Mask] placeholder. Formally, we regard $T$ as a latent variable, and the generation probability is formulated as

$$P_{G_Y}(Y|X) = \sum_{T} P_{G_Y}(T|X)\, P_{G_Y}(Y|X, T), \quad (1)$$

where $P_{G_Y}(T|X)$ and $P_{G_Y}(Y|X, T)$ are modeled by an alignment predictor and a non-autoregressive decoder, respectively, as shown in Fig 2.
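To make the two-step decomposition concrete, below is a minimal Python sketch of the inference procedure. The function names (`predict_alignment`, `nar_decode`), the toy identity decoder, and the example sentence are illustrative placeholders, not the actual model components.

```python
from typing import Callable, List

MASK = "[Mask]"

def nast_generate(
    source: List[str],
    predict_alignment: Callable[[List[str]], List[int]],  # step 1: t_i in {0, ..., N}
    nar_decode: Callable[[List[str]], List[str]],          # step 2: parallel decoding
) -> List[str]:
    """Two-step generation: predict the alignment, build the aligned input
    (with [Mask] for unaligned positions), then decode non-autoregressively."""
    alignment = predict_alignment(source)                  # e.g. [1, 2, 0, 4]
    aligned_input = [source[t - 1] if t >= 1 else MASK for t in alignment]
    return nar_decode(aligned_input)

# Toy usage: Simple Alignment (identity) and an identity "decoder".
src = ["the", "food", "is", "terrible", "."]
print(nast_generate(src, lambda x: list(range(1, len(x) + 1)), lambda x: x))
```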

Alignment Predictor
The alignment predictor predicts the target length $M$ and the alignment $T$ conditioned on the source sentence $X$. We utilize a Transformer (Vaswani et al., 2017) to encode the source sentence and then explore two alternative strategies to predict $T$.

Simple Alignment. Simple Alignment assumes that the source and target sentences have the same length, and each generated word $y_i$ is exactly aligned with the source word $x_i$. Formally,

$$P_{G_Y}(T|X) = \mathbb{I}[M = N] \prod_{i=1}^{M} \mathbb{I}[t_i = i],$$

where $\mathbb{I}[\cdot]$ is the indicator function. A similar strategy has been adopted by editing-based methods (Wu et al., 2019b; Helbig et al., 2020), where they simply replace several words in the source sentence. Although this strategy cannot alter the sentence length, it empirically works well on simple tasks, such as sentiment transfer.

Learnable Alignment. Inspired by Ran et al. (2019); Bao et al. (2019), we utilize a pointer network on top of the encoder, which predicts the alignment $T$ autoregressively:

$$P_{G_Y}(T|X) = \prod_{i=1}^{M} P_{G_Y}(t_i | t_{<i}, X).$$

The pointer network is essentially an autoregressive generator, but it only generates the alignment $t_i$ pointing to a source word.
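As an illustration of Learnable Alignment, the following PyTorch sketch scores one pointer step over the source positions plus an "unaligned" slot (index 0). The module name, the learned unaligned vector, and the single projection are our assumptions; they stand in for the paper's pointer decoder rather than reproducing it.

```python
import torch
import torch.nn as nn

class PointerStep(nn.Module):
    """One decoding step of a pointer-style alignment predictor: it scores each
    source position (index 1..N) and a learned "unaligned" slot (index 0)
    against the current decoder state."""
    def __init__(self, d_model: int):
        super().__init__()
        self.unaligned = nn.Parameter(torch.randn(1, d_model))  # stands for t_i = 0
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, dec_state: torch.Tensor, enc_states: torch.Tensor) -> torch.Tensor:
        # dec_state: (d_model,), enc_states: (N, d_model)
        keys = torch.cat([self.unaligned, enc_states], dim=0)   # (N + 1, d_model)
        return keys @ self.proj(dec_state)                      # logits over t_i = 0..N

d_model, N = 16, 5
step = PointerStep(d_model)
logits = step(torch.randn(d_model), torch.randn(N, d_model))
print(logits.softmax(dim=-1))  # distribution over the alignment index for one step
```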

Non-autoregressive Decoder
The non-autoregressive decoder (Gu et al., 2018) is a Transformer that generates each word independently. Formally, we have

$$P_{G_Y}(Y|X, T) = \prod_{i=1}^{M} P_{G_Y}(y_i | X, T).$$

The Transformer decoder takes the aligned sentence $[x_{t_1}, x_{t_2}, \cdots, x_{t_M}]$ and the target style embedding $S_Y$ as inputs. It also contains attention connections from the Transformer encoder.
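The sketch below shows how such a decoder can score all positions in parallel with a standard Transformer decoder (no causal mask). How the style embedding enters the decoder (here simply added to every input position), the tensor layout, and the vocabulary size are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

d_model, nhead, vocab = 256, 4, 1000
layer = nn.TransformerDecoderLayer(d_model, nhead)
decoder = nn.TransformerDecoder(layer, num_layers=4)
out_proj = nn.Linear(d_model, vocab)

# One sentence of length M = 6 (sequence-first layout, batch size 1):
aligned_emb = torch.randn(6, 1, d_model)   # embeddings of [x_{t_1}, ..., x_{t_M}]
style_emb = torch.randn(1, 1, d_model)     # target style embedding S_Y
memory = torch.randn(7, 1, d_model)        # encoder states of the source sentence

# No causal mask: every position is predicted in parallel, conditioned on the
# aligned source word at that position, the style embedding, and the encoder.
logits = out_proj(decoder(aligned_emb + style_emb, memory))
print(logits.softmax(-1).shape)            # independent P(y_i | X, T): [6, 1, 1000]
```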

Training
NAST is a generator that can be integrated into existing cycle-loss-based models. These models mainly utilize three losses, and the overall objective is defined as $\mathcal{L} = \alpha \mathcal{L}_{self} + \beta \mathcal{L}_{sty} + \gamma \mathcal{L}_{cyc}$, where $\alpha, \beta, \gamma$ are hyper-parameters.

The self-reconstruction loss $\mathcal{L}_{self}$ aims at recovering sentences of both styles from their corrupted versions:

$$\mathcal{L}_{self} = -\mathbb{E}_{X \sim P_X} \log P_{G_X}(X|\tilde{X}) - \mathbb{E}_{Y \sim P_Y} \log P_{G_Y}(Y|\tilde{Y}), \quad (3)$$

where $\tilde{X}$ and $\tilde{Y}$ are constructed by word dropout, insertion, and masking (Lample et al., 2019), and $P_X$ and $P_Y$ are the data distributions of the two styles.

The style loss $\mathcal{L}_{sty}$ is used to guide the style of generated sentences, which has various designs in existing works, e.g., adopting a style discriminator (Dai et al., 2019) or a language model. In our implementation, the style loss is determined by the base model, and we simply present a general formulation:

$$\mathcal{L}_{sty} = -\mathbb{E}_{X \sim P_X}\, F(G_Y(X), \mathcal{Y}) - \mathbb{E}_{Y \sim P_Y}\, F(G_X(Y), \mathcal{X}), \quad (4)$$

where $F(X, \mathcal{X})$ indicates a score that shows to which extent the sentence $X$ has the style $\mathcal{X}$.

At last, the cycle loss $\mathcal{L}_{cyc}$ is formulated as

$$\mathcal{L}_{cyc} = -\mathbb{E}_{X \sim P_X} \log P_{G_X}(X|G_Y(X)) - \mathbb{E}_{Y \sim P_Y} \log P_{G_Y}(Y|G_X(Y)). \quad (5)$$

However, there still exist two obstacles in optimization. First, because of the non-differentiable problem, we cannot back-propagate the gradients through the discrete text $G_Y(X)$ in Eq.(4)(5). As a common workaround, we adopt the Gumbel-Softmax trick (Jang et al., 2017) to approximate the gradients, so the gradients from $G_Y(X)$ can be back-propagated through the decoder output (Fig 2(a)). However, the alignment $T_Y(X)$ remains discrete and non-differentiable, where we simply stop the gradients.
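As a concrete reading of this workaround, the sketch below applies a straight-through Gumbel-Softmax to a decoder output so that discrete tokens are used in the forward pass while gradients still reach the logits. The tensor shapes and the stand-in loss are illustrative; in actual training the resulting embeddings would be fed into the reverse generator to compute the cycle loss.

```python
import torch
import torch.nn.functional as F

vocab, d_model, M = 1000, 256, 6
logits = torch.randn(M, vocab, requires_grad=True)      # decoder output for G_Y(X)
embedding = torch.nn.Embedding(vocab, d_model)

# Straight-through Gumbel-Softmax: one-hot samples forward, soft gradients backward.
y_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)   # (M, vocab), one-hot rows
y_emb = y_hard @ embedding.weight                       # input embeddings for G_X

loss = y_emb.pow(2).mean()                              # stand-in for the cycle loss
loss.backward()
print(logits.grad.shape)                                # gradients reach the logits
```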
Second, the losses in Eq.(3)(5) are intractable for NAST because the generation probability, e.g., $P_{G_Y}(Y|X)$, is summed over all alignments as defined in Eq.(1). We provide solutions for the two alignment strategies separately.

For Simple Alignment. There is only one valid alignment between $X$ and $Y$, so the generation probability is tractable:

$$P_{G_Y}(Y|X) = P_{G_Y}(Y|X, T = [1, 2, \cdots, N]).$$

For Learnable Alignment. Inspired by Bao et al. (2019), we introduce a heuristic rule to obtain a pseudo alignment $T^*$, which aligns word pairs according to the similarity of their word embeddings $e(\cdot)$. We obtain the pseudo alignment by dynamic programming, and the details are presented in Appendix A. In the pseudo alignment, most words in $Y$ are aligned with identical or highly relevant words in $X$, so it can be used as a good label to supervise our model. Next, we derive a tractable lower bound for the generation probability:

$$\log P_{G_Y}(Y|X) \ge \log P_{G_Y}(Y|X, T^*) + \log P_{G_Y}(T^*|X). \quad (6)$$

On the right side, the first term trains the NAR decoder, and the second term trains the alignment predictor. By substituting Eq.(6) into Eq.(3)(5), we turn to optimize the upper bounds instead of the original intractable losses. The detailed training algorithm is shown in Appendix A.
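The exact pseudo-alignment rule and its dynamic program are given in Appendix A; as one plausible instantiation, the sketch below computes a monotonic alignment that maximizes the total cosine similarity between aligned word embeddings and leaves low-similarity target words unaligned. The function name, the threshold, and the scoring are assumptions for illustration only.

```python
import numpy as np

def pseudo_alignment(src_emb: np.ndarray, tgt_emb: np.ndarray, threshold: float = 0.5):
    """Monotonic alignment maximizing total cosine similarity between aligned
    embeddings; target words below the threshold stay unaligned (t_i = 0)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    N, M = len(src_emb), len(tgt_emb)
    score = np.zeros((N + 1, M + 1))
    back = np.zeros((N + 1, M + 1), dtype=int)   # 0: skip src, 1: skip tgt, 2: align
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            sim = cos(src_emb[i - 1], tgt_emb[j - 1])
            cands = [score[i - 1, j], score[i, j - 1],
                     score[i - 1, j - 1] + (sim if sim >= threshold else -1e9)]
            back[i, j] = int(np.argmax(cands))
            score[i, j] = max(cands)
    t = [0] * M                                  # t_j = 0 means unaligned
    i, j = N, M
    while i > 0 and j > 0:                       # trace back the best path
        if back[i, j] == 2:
            t[j - 1], i, j = i, i - 1, j - 1
        elif back[i, j] == 0:
            i -= 1
        else:
            j -= 1
    return t

emb = np.random.default_rng(0).standard_normal((4, 8))
print(pseudo_alignment(emb, emb))                # identical sentences: [1, 2, 3, 4]
```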

Discussions
Residual Connections and Multi-head Attention. The aligned words in NAST are directly connected by the residual connections, and these connections form several chains in the cycle loss optimization, as shown in Fig 3. Most of these chains represent word-level transfers and reconstructions, e.g., "terrible" is transferred to "perfect" and then reconstructed as "terrible". The reconstruction error is a part of the cycle loss, which is optimized to enhance the alignment in the word space. Besides the residual connections, the multi-head attention mechanism is also important for our model. The attention stops NAST from becoming a degenerate word-to-word dictionary and makes it possible to predict the unaligned words from the context.

Figure 3: Connections of NAST in the cycle loss with the encoder omitted. The word alignments (step 1) and the residual connections (step 2) are in black.

Exposure Bias in Autoregressive (AR) Models. Exposure bias (Bengio et al., 2015) is a notorious problem in AR generation. To obtain $P_{G_X}(X|G_Y(X))$ in the cycle loss, AR generators predict each word of $X$ based on the ground-truth prefix, which is an easy task even without information from $G_Y(X)$. As a result, at inference time, the model may fail to preserve the sentence meaning as it is trained to focus on its own generated prefix.
In contrast, NAST focuses on the source sentence since the ground-truth prefix is not given, which suppresses the problem of generating irrelevant words and improves content preservation. Moreover, training and inference are consistent in NAST, which alleviates the exposure bias problem.

Experiment Settings
We conduct experiments on two style transfer tasks.
Sentiment Transfer. We use the Yelp dataset, which consists of two non-parallel corpora with positive and negative sentiments. For each sentence in the test set, multiple human references are provided by Luo et al. (2019).

Text Formalization. We use the family and relationship domain of the GYAFC dataset (Rao and Tetreault, 2018), which consists of paired corpora of formal and informal sentences. We do not use the paired data to supervise training.
We utilize several SOTA models as baselines, including CrossAlign (Shen et al., 2017), DelRetrie, Disent (John et al., 2019), StyIns (Yi et al., 2020), StyTrans (Dai et al., 2019), and LatentSeq. Our models are modified based on StyTrans and LatentSeq, where we replace their generators with NAST. For StyTrans, NAST adopts a Transformer of the same architecture as the original implementation. However, LatentSeq utilizes an LSTM generator. For a fair comparison, we first incorporate LatentSeq with a vanilla Transformer generator and then replace the generator with NAST of the same architecture. In inference, we use the greedy decoding strategy, i.e., we choose the top-1 candidate at each step in alignment prediction and sentence generation. More details are presented in Appendix B.

Automatic Evaluation
Following Luo et al. (2019); Dai et al. (2019), we utilize a pretrained classifier to evaluate the style accuracy (Acc), and adopt the BLEU-4 score comparing generated sentences with the source sentences (SelfB) or with the references (RefB) to evaluate content preservation. The classifier based on RoBERTa-base (Liu et al., 2019) achieves an accuracy of 97.6% and 90.1% on YELP and GYAFC, respectively. For each transfer direction, we calculate the geometric and harmonic mean of Acc and RefB and then report the average on two directions as G2 and H2, respectively. We further report the perplexity (PPL) of transferred sentences, which is evaluated by GPT2-base (Radford et al., 2019) fine-tuned on the training set.
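For reference, the sketch below computes G2 and H2 from per-direction Acc and RefB values exactly as described (per-direction means, then averaged over the two directions); the function name and the example numbers are illustrative.

```python
from statistics import geometric_mean, harmonic_mean

def overall_scores(acc_xy, refb_xy, acc_yx, refb_yx):
    """G2/H2: geometric/harmonic mean of Acc and RefB per direction, then averaged."""
    g2 = (geometric_mean([acc_xy, refb_xy]) + geometric_mean([acc_yx, refb_yx])) / 2
    h2 = (harmonic_mean([acc_xy, refb_xy]) + harmonic_mean([acc_yx, refb_yx])) / 2
    return g2, h2

print(overall_scores(90.0, 55.0, 92.0, 50.0))  # toy numbers, not reported results
```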
The results are shown in Table 1. Compared with StyTrans and LatentSeq, NAST exhibits stable gains in G2 and H2 on both datasets. On the Yelp dataset, NAST remarkably improves content preservation (at least 6 points in RefB) but suffers a slight decline in Acc. We find that NAST suppresses irrelevant words with strong styles, which possibly leads to the decline in Acc. On the GYAFC dataset, NAST outperforms the base models mainly in Acc instead of RefB, which is affected by model selection strategies under the Acc-RefB trade-off. In Table 1, we choose the best model based on G2; a more comprehensive comparison with trade-off curves is discussed in the next section.

In terms of the alignment strategies, Learnable Alignment outperforms Simple Alignment on GYAFC, but there is no significant difference on Yelp. We suppose that the sentiment transfer task is more straightforward than text formalization, so the model can achieve a good transfer performance on Yelp without changing sentence structures.
Compared with all baselines, our best models set new SOTA results on two datasets in the overall performance of the transfer accuracy and content preservation (i.e., G2 and H2).
Trade-Off Curves. To investigate the trade-off between style control (shown by Acc) and content preservation (shown by RefB), we follow Fu et al. (2018) and evaluate the models with different hyper-parameters. To be specific, we select three different style loss coefficients β around the best value; please see Appendix B.2 for the search range and other details. Since the trade-off varies through the training, we evaluate the models and collect data points at every epoch. The curves of NAST are generally above those of the base models, indicating that NAST achieves better content preservation when the style accuracy is kept at a similar level. In Fig 4(c)(d), we find that the base model's RefB drops rapidly after Acc exceeds a certain value, which indicates that the cycle loss fails to preserve the sentence-level alignment, thereby leading to model collapse. By contrast, NAST largely alleviates the issue of model collapse. Moreover, we find that Learnable Alignment outperforms Simple Alignment on GYAFC, but performs equally or slightly worse on Yelp, due to the task differences discussed above.
Training & Inference Speed. Thanks to the parallel decoding of the NAR generator, NAST accelerates the model training and inference as shown in Table 2. For a fair comparison, NAST and the corresponding base model utilize the same Transformer architecture. The computation devices are detailed in Appendix B.3.

Human Evaluation
Following prior work, we conduct human evaluation experiments on the Yelp dataset. In addition to NAST and the base models, we choose three baselines with the highest G2. For each model, we sample 100 sentences (50 in each transfer direction), and 900 sentences are evaluated in total. For each sentence, three annotators are asked to rate from 1 (worst) to 5 (best) for fluency, style control, and content preservation. The human evaluation results are shown in Table 3. Similar to the automatic evaluation results, NAST improves content preservation significantly. Moreover, we find that Learnable Alignment outperforms Simple Alignment in terms of fluency. It can be partially attributed to the fact that Learnable Alignment, which is able to remove or add words, is more flexible in generation.

Ablation Study
NAR decoder. Although NAST with Simple Alignment has a simple, straightforward design, it works surprisingly well compared with an AR generator. We conduct an ablation study to investigate the impact of different components in the NAR decoder. First, we remove the aligned sentence from the decoder input. Specifically, the decoder input is the positional encodings without the word embeddings. Second, we remove the multi-head attention in the decoder, and thus each output word is solely conditioned on its aligned word.
The results are shown in Table 4. After we remove the aligned sentence, the performance drops but still remains comparable. It shows that the multi-head attention over the source sentence learns a reasonable transfer, while the performance can be largely improved by providing the decoder with the aligned sentence as input. After we remove the multi-head attention, the overall performance drops remarkably, especially on GYAFC. It shows that NAST utilizes the multi-head attention to gather sentence-level information, and it is essentially not a word-to-word dictionary. Moreover, the contribution of the multi-head attention is larger on GYAFC than on Yelp, which further justifies that text formalization is less straightforward than sentiment transfer since it requires more sentence-level modifications.

Gradient Approximation Methods. The choice of gradient approximation method is important for tackling the non-differentiable problem. Besides the Gumbel-Softmax trick used in our full model, we try two alternative methods. 1) The Soft-Embedding approximation (Dai et al., 2019) multiplies the softmax distribution by the word embedding matrix to get "soft" word embeddings. 2) The Stop-Gradient strategy stops the gradient at the decoder output in the cycle loss; however, the style loss requires the output to be differentiable, so we still apply the Gumbel-Softmax trick for the style loss. Results in Table 4 show that the Gumbel-Softmax trick outperforms the other methods, so we utilize it for NAST in all other experiments.
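For reference, a minimal sketch contrasting the two alternative approximations applied to a decoder output; tensor shapes and names are illustrative assumptions.

```python
import torch

vocab, d_model, M = 1000, 256, 6
logits = torch.randn(M, vocab, requires_grad=True)        # decoder output for G_Y(X)
emb = torch.nn.Embedding(vocab, d_model)

# Soft-Embedding: feed the expected embedding under the softmax distribution.
soft_input = logits.softmax(-1) @ emb.weight               # differentiable w.r.t. logits

# Stop-Gradient: feed the embeddings of the argmax tokens and cut the gradient.
hard_input = emb(logits.argmax(-1)).detach()               # no gradient to logits

print(soft_input.requires_grad, hard_input.requires_grad)  # True False
```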
Learnable Alignment. According to Eq.(3)(5)(6), the alignment predictor in Learnable Alignment is supervised by pseudo alignments when optimizing the upper bounds of the self-reconstruction loss and the cycle loss. For the former, the alignment predictor learns to align the corrupted $\tilde{X}$ with $X$; for the latter, it learns to align the transferred sentence $G_Y(X)$ with the original $X$. We show two cases in Table 5, where the pseudo alignments are of acceptable quality.

Table 5: Pseudo alignments on GYAFC. S = source, P = pseudo alignment, T = target. Unaligned source words, unaligned target words, and non-identical aligned words are marked in different colors.

Table 6: Ablation study of NAST with Learnable Alignment on GYAFC. ∆ is the length difference before and after the transfer. |∆| and std(∆) indicate the average absolute value and the standard deviation, respectively. All models use StyTrans as the base model.

Table 7 (excerpt): transfer cases on Yelp.
Positive to Negative. Source: love this place and will keep coming back . LatentSeq: do n't waste your time and wo n't be back . StyTrans: avoid this place and will keep coming back . NAST(Simp.): skip this place and will never coming back . NAST(Lear.): hate this place and will not be coming back .
Negative to Positive. Source: i did n't even eat it . LatentSeq: i always love their food and service .

To investigate the effects of the pseudo alignment supervision, we remove $\log P_{G_Y}(T^*|X)$ in Eq.(6) for the two losses separately. Results are shown in Table 6. Without the pseudo alignment supervision in the self-reconstruction loss, the model almost degenerates into Simple Alignment, because keeping the length unchanged is the easiest way to minimize the cycle loss. Without the pseudo supervision in the cycle loss, Learnable Alignment is slightly weaker than the full model but still outperforms Simple Alignment.

Case Study of Word Alignment
We present several transfer cases in Table 7. We observe that a major failure mode of the base models is generating irrelevant words, while NAST achieves better content preservation, and most words in NAST's prediction can be aligned with the source words. Focusing on the alignment strategies, we observe that the outputs of NAST with Simple Alignment sometimes contain grammar errors (e.g., "will never coming back"), which can be attributed to its limitation of not changing the sentence length. In contrast, Learnable Alignment can add and remove words at appropriate positions.

To understand the learned word alignments and the word-level transfer, we count the aligned word pairs based on the predictions of Learnable Alignment. Several cases are presented in Table 8. We observe that the aligned word pairs are highly explainable. For example, NAST maps "delicious" to "bland" in sentiment transfer and maps "guy" to "man" in text formalization. These cases show that the model learns fine-grained word-level transfer, where "delicious" and "bland" both depict food taste with different styles. Moreover, NAST with Learnable Alignment learns to add or remove words at reasonable positions, such as adding missing punctuation marks (".", "?") and removing redundant words ("...", "lol") in text formalization.

Analysis of Cycle Loss Optimization
The cycle loss plays a key role in unsupervised style transfer, which achieves style control and content preservation by aligning the sentences in two text spaces. However, the optimization is not straightforward due to the non-differentiable problem. In this section, we study how the cycle loss optimization is affected by the generator architecture and compare a NAR generator with an AR generator 3 . To remove the interference of other losses, we train the model solely with the cycle loss and report the BLEU-4 score of the cycle reconstruction.
The results are shown in Table 9. The NAR generator remarkably outperforms the AR generator with all gradient approximation methods. We provide two possible explanations for this observation. One reason is that word alignments can help the cycle loss align the text spaces. As discussed  in Sec 3.2, the residual connections directly connect aligned words, which exploits the word-level transfer and reconstruction. Compared with the AR generator that aligns the text spaces at the sentence level, aligning word pairs can be much easier. Another possible reason is the error accumulation caused by the gradient approximation methods. In each step of the AR generation, the gradient approximation methods are applied to the generated word, and the word is then fed into the model as the next input. As a result, gradients will be approximated multiple times in the back-propagation, and the error brought by the approximation may be accumulated and possibly lead to unstable optimization.
Our analysis provides a perspective to understand how NAST works, and reveals that the generator architecture can deeply affect the optimization under the non-differentiable problem. However, we should be cautious when generalizing the results to other settings. We notice inconsistent performance reports for the gradient approximation methods (Dai et al., 2019; Tu et al., 2020), and the phenomenon needs further study.

Conclusion
In this paper, we propose NAST, a Non-Autoregressive generator for unsupervised text Style Transfer. It explicitly models word alignments to suppress irrelevant words and exploits the word-level transfer between different styles. Experiments show that NAST improves the overall performance, provides explainable word alignments, and largely speeds up training and inference.
However, we should also notice a potential limitation: NAST relies on the assumption that word alignments exist between the source and target sentences. In a more complicated task that lacks word alignments, NAST may lose its advantage of exploiting the word-level transfer. In future work, we will improve NAST to tackle noisy word alignments in more challenging datasets and build explainable and faster models for a broader range of unsupervised text generation tasks.

B.1 Dataset and Evaluation Metrics
We use the processed datasets provided by Luo et al. (2019), which can be downloaded at https://github.com/luofuli/DualRL. The data statistics are shown in Table 10. The pretrained classifier is implemented based on the transformers package, and the BLEU-4 score is the corpus BLEU implemented in the nltk package. All results in our paper are evaluated by our implemented codes. The reported results of NAST, StyTrans, and LatentSeq in Figure 1 are averaged over three runs with different random seeds.

B.2 Network Architecture and Hyper-Parameters
NAST is implemented based on the base models, StyTrans (Dai et al., 2019) and LatentSeq. Their codes can be accessed at https:

For StyTrans, we follow their implementation and hyper-parameters for the Transformer architecture. We use 4 Transformer layers, 4 attention heads, and 256-dim hidden cells for both the encoder and the decoder. For the alignment predictor in Learnable Alignment, we utilize a one-layer Transformer decoder with the same number of attention heads and dimension of hidden cells. Moreover, StyTrans utilizes a discriminator for the style loss, which is built on a 4-layer Transformer encoder with the same architecture as above. The discriminator and the generator are trained adversarially. Following their implementation, in each iteration, the discriminator is trained for 10 steps and then the generator is trained for 5 steps. We utilize the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-4 and a batch size of 64. We choose the gradient approximation method from the Gumbel-Softmax trick (Jang et al., 2017), the Soft-Embedding approximation (Dai et al., 2019), and the Stop-Gradient strategy. We select the self-reconstruction loss weight $\alpha$ from {0.25, 0.5, 1} and the cycle loss weight $\gamma$ from {0.25, 0.5, 1}. We find that sometimes the transfer accuracy of one direction can be much higher than that of the other direction, so we tune the style loss weights of the two directions separately. To be specific, the overall objective is defined as $\alpha L_{self} + \beta_1 L_{X,sty} + \beta_2 L_{Y,sty} + \gamma L_{cyc}$, where $L_{X,sty}$ and $L_{Y,sty}$ are the style losses of the two transfer directions. We select $\beta_1, \beta_2$ from {0.5, 1, 1.5, 3, 5, 10, 15}.

For LatentSeq, an LSTM is adopted as the generator in the original model. We first replace the LSTM with an autoregressive Transformer as a baseline, which also has 4 Transformer layers, 4 attention heads, and 256-dim hidden cells. Then we replace the autoregressive Transformer with a non-autoregressive Transformer of the same architecture. The alignment predictor is a one-layer Transformer decoder with the same architecture as above. LatentSeq utilizes a language model for the style loss, which is a 512-dim LSTM; we preserve the implementation of the language model. For optimization, we utilize the RAdam optimizer with a learning rate of 1e-3 and a batch size of 64. We also try the three gradient approximation methods. We set the cycle-reconstruction loss weight $\gamma = 1$. Following the original implementation, the self-reconstruction weight $\alpha$ is annealed from 1 to 0 in the first 60k steps. Similar to NAST on StyTrans, we tune the style loss weights of the two directions separately, where we select $\beta_1, \beta_2$ from {0.15, 0.3, 0.45, 0.6, 0.75} for Yelp and {0.5, 0.75, 1, 1.25} for GYAFC.
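For convenience, the hyper-parameters stated above are summarized in the sketch below; the values come from the text, but the dictionary layout is only illustrative and does not reflect the released configuration format.

```python
# Hyper-parameters as stated in the text (layout is illustrative only).
STYTRANS_NAST = {
    "transformer": {"layers": 4, "heads": 4, "hidden": 256},
    "alignment_predictor": {"layers": 1, "heads": 4, "hidden": 256},
    "optimizer": {"name": "Adam", "lr": 1e-4, "batch_size": 64},
    "loss_weights": {"alpha": [0.25, 0.5, 1], "gamma": [0.25, 0.5, 1],
                     "beta1_beta2": [0.5, 1, 1.5, 3, 5, 10, 15]},
    "adversarial_steps": {"discriminator": 10, "generator": 5},
}
LATENTSEQ_NAST = {
    "transformer": {"layers": 4, "heads": 4, "hidden": 256},
    "style_loss_lm": {"type": "LSTM", "hidden": 512},
    "optimizer": {"name": "RAdam", "lr": 1e-3, "batch_size": 64},
    "loss_weights": {"gamma": 1.0, "alpha": "annealed from 1 to 0 over first 60k steps",
                     "beta1_beta2_yelp": [0.15, 0.3, 0.45, 0.6, 0.75],
                     "beta1_beta2_gyafc": [0.5, 0.75, 1, 1.25]},
}
print(STYTRANS_NAST["optimizer"], LATENTSEQ_NAST["optimizer"])
```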
We manually tune the hyper-parameters and select the best model according to the performance on the validation set. For the Yelp dataset, the validation set does not have reference answers, so we use the geometric mean of Acc and SelfB as the overall performance. For the GYAFC dataset, we use the geometric mean of Acc and RefB as the overall performance.

B.3 Computing Devices and Running Time
In our experiment, each run uses approximately 4 Intel Xeon Gold 6226R CPUs at 2.90GHz, and 1 Nvidia Quadro RTX 6000 GPU. We present the max training step and the training time in Table 11. The best results usually appear in the first half of the training.

C Transfer Difficulties
In Table 12, we present the results of the two transfer directions on Yelp and GYAFC. On the Yelp dataset, transferring a negative sentence to a positive one is more difficult than the other direction. One possible reason is that the negative sentences are euphemistic and need changes in sentence structures when transferred to the positive sentiment. In terms of G2, text formalization is significantly more difficult than sentiment transfer. The difficulties of the two transfer directions vary across models on GYAFC: transferring formal sentences to informal ones is harder for DualRL, while the other direction is harder for LatentSeq.

D How to Count Aligned Word Pairs
In Section 4.5, we present cases of the word-level transfer. The aligned word pairs are counted based on the predicted alignments $T$, following the rules below:
• If $1 \le t_i \le N$, we record a pair $x_{t_i} \rightarrow y_i$.
• If $t_i = 0$, the transferred word is unaligned, and we record a pair [Mask] $\rightarrow y_i$.
• If a source word $x_i$ is not aligned with any transferred word, we record a pair $x_i \rightarrow$ [Del].
We then collect all word pairs that share the same source word and calculate the proportion of different transferred words, as sketched below. The results shown in Table 8 are obtained on the test sets of the two datasets.
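A minimal Python sketch of this counting procedure; the data layout, the function name, and the toy example are illustrative.

```python
from collections import Counter, defaultdict

def count_word_pairs(examples):
    """Collect (source word -> transferred word) pairs from predicted alignments
    following the three rules above, then report per-source-word proportions."""
    pairs = Counter()
    for src, tgt, alignment in examples:          # alignment[i] = t_i for word y_i
        used = set()
        for i, t in enumerate(alignment):
            if 1 <= t <= len(src):
                pairs[(src[t - 1], tgt[i])] += 1  # aligned pair x_{t_i} -> y_i
                used.add(t)
            else:
                pairs[("[Mask]", tgt[i])] += 1    # unaligned transferred word
        for j, x in enumerate(src, start=1):
            if j not in used:
                pairs[(x, "[Del]")] += 1          # source word deleted in transfer
    totals = defaultdict(int)                     # occurrences per source word
    for (x, _), c in pairs.items():
        totals[x] += c
    return {pair: c / totals[pair[0]] for pair, c in pairs.items()}

examples = [(["not", "terrible"], ["not", "perfect"], [1, 2])]
print(count_word_pairs(examples))  # {('not', 'not'): 1.0, ('terrible', 'perfect'): 1.0}
```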