Exploring Non-Autoregressive Text Style Transfer

In this paper, we explore Non-AutoRegressive (NAR) decoding for unsupervised text style transfer. We first propose a base NAR model by directly adapting the common training scheme from its AutoRegressive (AR) counterpart. Despite its faster inference over the AR model, this NAR model sacrifices transfer performance due to the lack of conditional dependence between output tokens. To narrow this gap, we investigate three techniques, i.e., knowledge distillation, contrastive learning, and iterative decoding, for performance enhancement. Experimental results on two benchmark datasets suggest that, although the base NAR model is generally inferior to AR decoding, the performance gap can be clearly narrowed when NAR decoding is empowered with knowledge distillation, contrastive learning, and iterative decoding.


Introduction
Text Style Transfer (TST) aims at altering a stylistic attribute (e.g., sentiment) of a given text to a target value without changing the style-agnostic semantics. Due to the difficulty of collecting parallel training corpora, most existing methods (Shen et al., 2017; Xu et al., 2018; Luo et al., 2019; Zhou et al., 2020) address the task in an unsupervised setting. Through techniques such as auto-encoding, back-translation, or adversarial learning, the task is converted into a self-supervised problem, and great empirical progress has been made. However, current methods employ autoregressive (AR) decoding, which generates each output token conditioned on the previously generated ones, leading to low parallelizability and high latency in pragmatic use.
Recently, non-autoregressive (NAR) decoding has attracted much attention in neural machine translation (NMT) (Gu et al., 2018). NAR decoding eliminates the conditional dependencies among the output tokens and generates them in parallel, thus reducing the decoding time complexity from O(T) to O(1) for outputs of length T. Without modeling this dependency, the advantage in decoding speed comes at the cost of reduced performance. To address this issue, knowledge distillation is employed to transfer the knowledge from AR models to NAR models (Gu et al., 2018). Furthermore, existing works resort to various regularization techniques (Wang et al., 2019) to constrain the output, or to Semi-AutoRegressive (SAR) decoding (Wang et al., 2018; Ghazvininejad et al., 2019) as a speed-performance tradeoff.
In this paper, we explore NAR decoding for unsupervised TST to enable faster inference with better parallelism. To the best of our knowledge, this is the first work to study NAR models for TST. Firstly, a base NAR model is proposed by directly adapting the widely used training objectives from AR models. As with NMT, the base NAR model underperforms the AR model. To narrow their performance gap, we propose to enhance NAR decoding from three perspectives: the data perspective by knowledge distillation, the regularization perspective by contrastive learning, and the speed-performance tradeoff perspective by iterative decoding. Experimental results on a sentiment transfer dataset and a formality transfer dataset demonstrate integrating these techniques can substantially improve the base NAR model.

Related Work
Unsupervised Text Style Transfer. One branch of methods disentangles the style and content by learning a style-agnostic representation, which can be either a latent vector (Shen et al., 2017; Fu et al., 2018; John et al., 2019; Yi et al., 2020) or a subsequence of the input with the style indicators removed (Li et al., 2018; Xu et al., 2018; Wu et al., 2019b; Sudhakar et al., 2019; Madaan et al., 2020). Another branch of methods (Lample et al., 2019; Dai et al., 2019), inspired by back-translation, dynamically creates pseudo-parallel data to gradually refine the TST model. There are also reinforcement learning based methods (Luo et al., 2019; Wu et al., 2019a; Gong et al., 2019; Liu et al., 2021) which guide the model with different rewards corresponding to the evaluation criteria.
Non-Autoregressive Decoding. Since the proposal of NAR decoding in NMT (Gu et al., 2018), follow-up works have focused on narrowing its gap with AR decoding while keeping its efficiency. One branch of methods transfers the knowledge from AR models to NAR models by knowledge distillation (Gu et al., 2018).

Non-Autoregressive Text Style Transfer
Let S denote the set of all possible values of a stylistic attribute. A desired TST model p_θ(y|x, s) with parameters θ transforms an input text x with source style s_0 ∈ S into an output y with a given target style s ∈ S while preserving the style-agnostic semantics of x. In this section, we first propose a base NAR model (BaseNAR) for unsupervised TST (Section 3.1), then investigate three techniques, i.e., knowledge distillation (Section 3.2), contrastive learning (Section 3.3), and iterative decoding (Section 3.4), to enhance its performance.

A Base NAR Model
At the core of NAR decoding is the conditional independence among output tokens. Formally,

p_θ(y|x, s) = ∏_{t=1}^{T} p_θ(y_t | x, s),  (1)

which, compared with AR decoding, removes the previously generated tokens y_{<t} from the conditioning variables at each timestep. Our BaseNAR consists of an encoder and a decoder, both adopting a Transformer-based (Vaswani et al., 2017) architecture. The encoder uses the standard Transformer encoder, as in AR models. Following NAR models for NMT (Wang et al., 2019; Shao et al., 2020), the decoder differs from the standard Transformer decoder of AR models by (1) discarding the autoregressive mask in the self-attention layer, (2) incorporating a positional-attention layer, and (3) uniformly mapping the source words as the decoder input.¹
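To make the contrast between the two factorizations concrete, they can be sketched with a toy decoder. Here `toy_dist` is a hypothetical stand-in for the Transformer decoder's per-position output, not part of the actual model:

```python
# Toy illustration of AR vs. NAR factorization (not the paper's model).

def toy_dist(x, s, prefix=None):
    # Hypothetical per-position distribution; in the real model this is a
    # Transformer forward pass. Here it just returns a fixed token.
    return "tok"

def ar_decode(x, s, T):
    # AR: each token conditions on the previously generated tokens y_{<t},
    # so decoding takes T sequential steps.
    y = []
    for t in range(T):
        y.append(toy_dist(x, s, prefix=y))
    return y

def nar_decode(x, s, T):
    # NAR: p(y|x,s) = prod_t p(y_t|x,s); every position is predicted
    # independently, so all T tokens can be produced in parallel.
    return [toy_dist(x, s) for t in range(T)]
```

The point of the sketch is purely structural: the NAR list comprehension has no data dependency between positions, which is what enables O(1) decoding depth.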
We optimize BaseNAR by three common losses from AR decoding based TST: self-reconstruction, cycle-reconstruction, and style compatibility.
Self-Reconstruction. When the target style s = s_0, the model is expected to reconstruct x. Formally, the self-reconstruction loss minimizes

L_self = − log p_θ(x | x, s_0).  (2)

Cycle-Reconstruction. With y ∼ p_θ(y|x, s), the model is expected to reconstruct x when we feed y as the input and s_0 as the target style. Formally, the cycle-reconstruction loss minimizes

L_cycle = − log p_θ(x | y, s_0).  (3)

Style Compatibility. Let p_ψ denote a pretrained style classifier with parameters ψ that predicts the style type of an input text. An output y ∼ p_θ(y|x, s) is expected to be predicted as having style s. Formally, the style compatibility loss minimizes

L_style = − log p_ψ(s | y).  (4)

The full loss for our BaseNAR model is L_self + L_cycle + αL_style, where α is a balancing hyper-parameter.
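A minimal sketch of how the three losses combine, assuming the sequence-level probabilities `p_self`, `p_cycle`, and `p_style` have already been computed by hypothetical model and classifier stand-ins:

```python
import math

def nll(p):
    # Negative log-likelihood of a probability.
    return -math.log(p)

def base_nar_loss(p_self, p_cycle, p_style, alpha=0.1):
    # p_self  ~ p_theta(x | x, s0): reconstruct x under its own style s0
    # p_cycle ~ p_theta(x | y, s0): reconstruct x from the transferred y
    # p_style ~ p_psi(s | y): classifier probability of the target style
    # Full objective: L_self + L_cycle + alpha * L_style.
    return nll(p_self) + nll(p_cycle) + alpha * nll(p_style)
```

The default `alpha=0.1` mirrors the value reported in Appendix C.2.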

Knowledge Distillation
In NMT, NAR models (Gu et al., 2018) achieve improved performance by sequence-level knowledge distillation (Kim and Rush, 2016) from AR models. Specifically, a pseudo-parallel corpus is constructed by sampling a translation output from the AR model for each source input in the training dataset. The NAR model is then trained using this pseudo-parallel corpus instead of the original one.
In our text style transfer task, we follow the same scheme as in NMT. Suppose p_φ(y|x, s) is a pretrained AR decoding based TST model with parameters φ. For each input x in the training set, we sample a pseudo-target ỹ ∼ p_φ(y|x, s). The NAR model is then optimized to minimize

L_kd = − log p_θ(ỹ | x, s).  (5)
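The corpus-construction step can be sketched as follows; `DummyAR` and its `sample` method are hypothetical placeholders for the pretrained AR teacher:

```python
# Sketch of sequence-level knowledge distillation for TST:
# build a pseudo-parallel corpus by sampling one pseudo-target from the
# AR teacher per source input, then train the NAR student on it.

def build_distilled_corpus(ar_model, inputs, target_style):
    return [(x, target_style, ar_model.sample(x, target_style))
            for x in inputs]

class DummyAR:
    # Placeholder teacher; a real teacher would run AR decoding.
    def sample(self, x, s):
        return x.upper()

corpus = build_distilled_corpus(DummyAR(), ["a b", "c d"], "negative")
```

The student never sees the original (non-parallel) targets under this scheme; it is trained only on the teacher's samples.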

Contrastive Learning
Preliminary experiments show that BaseNAR suffers from a word omission problem. Inspired by Yang et al. (2019), we alleviate the problem with a contrastive learning based regularization term. Specifically, the model is penalized if the probability of the desired output (positive sample) does not exceed that of an output with word omission errors (negative sample) by a margin η. The regularization can be paired with self-reconstruction, cycle-reconstruction, and knowledge distillation. For knowledge distillation, we minimize

R_kd = max(0, η − log p_θ(ỹ | x, s) + log p_θ(ỹ* | x, s)) − log p_θ(ỹ | x, s),  (6)

where ỹ* is the negative sample generated from the current model using length |ỹ| − 1; the first term is the hinge loss, and the second term avoids unstable results in case minimizing log p_θ(ỹ* | x, s) dominates the training. Regularizing self-reconstruction and cycle-reconstruction follows the same procedure.
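A sketch of this regularizer under the reading above (a hinge over the log-probability gap plus the positive-sample NLL for stability); the probability arguments are stand-ins for the model's sequence likelihoods:

```python
import math

def contrastive_reg(p_pos, p_neg, eta=1.0):
    # p_pos ~ p_theta(y_tilde  | x, s): pseudo-target (positive sample)
    # p_neg ~ p_theta(y_tilde* | x, s): word-omitted negative sample
    lp_pos = math.log(p_pos)
    lp_neg = math.log(p_neg)
    # Hinge: positive log-prob must beat negative log-prob by margin eta.
    hinge = max(0.0, eta - (lp_pos - lp_neg))
    # Extra NLL term keeps training from degenerating into merely
    # pushing the negative sample's probability down.
    return hinge - lp_pos
```

When the positive already wins by the margin, only the stabilizing NLL term remains active.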

Iterative Decoding
Iterative decoding is based on the Conditional Masked Language Model (CMLM) (Ghazvininejad et al., 2019). CMLM masks a subsequence of a given target sequence and predicts this masked subsequence conditioned on the remaining observed tokens and the source input. In our task, p_θ(y|x, s) is reformulated as p_θ(y_mask | x, s, y_obs) = ∏_{t ∈ mask} p_θ(y_t | x, s, y_obs), and the loss functions L_self, L_cycle, L_kd, and R_kd are reformulated accordingly. For instance, L_kd is reformulated as

L_CMLM-kd = − log p_θ(ỹ_mask | x, s, ỹ_obs).  (7)

Other losses follow a similar reformulation.
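The mask-predict inference procedure described next can be sketched as follows; `repredict` is a hypothetical stand-in for a CMLM forward pass returning a (token, probability) pair per masked position, and the integer truncation in `n_masked` is our assumption:

```python
# Rough sketch of the mask-predict refinement loop
# (Ghazvininejad et al., 2019) as adapted here.

def n_masked(T, K, k):
    # Number of tokens re-masked at iteration k: T * (K - k) / K.
    return int(T * (K - k) / K)

def mask_predict(repredict, T, K):
    tokens = [None] * T   # at k = 0 every position counts as masked
    probs = [0.0] * T
    for k in range(K):
        n_k = n_masked(T, K, k)
        # Re-mask the n_k positions with the lowest current probabilities.
        masked = sorted(range(T), key=lambda t: probs[t])[:n_k]
        for t in masked:
            tokens[t], probs[t] = repredict(t, tokens)
        # Unmasked positions keep their previous token and probability.
    return tokens
```

With `K = 1` this degenerates to single-pass NAR decoding; larger `K` trades speed for refinement.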
During inference, we iteratively refine the prediction with the mask-predict scheme (Ghazvininejad et al., 2019). Given the target length T, let K denote the total number of iterations. In each iteration k ∈ {0, . . . , K − 1}, we obtain y_obs^k by masking the n_k = T · (K − k)/K tokens with the lowest probabilities p_t^{k−1} in the previous prediction y^{k−1}, except for k = 0 where all tokens are masked. The model then re-predicts the masked tokens and updates the prediction and probabilities: masked tokens take the newly predicted token and its probability, while unmasked tokens keep their token and probability from the previous iteration.

Experiments

Evaluation Metrics. The models are evaluated on three aspects: transfer accuracy (TA), content preservation (CP), and language fluency (LF). Both automatic and human evaluation are employed. For automatic evaluation, transfer accuracy is measured by a pretrained style classifier; content preservation is measured by the BLEU score between the model outputs and the human references; and language fluency is measured by the perplexity of the model outputs under a pretrained language model. For human evaluation, we sample 100 test samples from each dataset. Three human annotators are invited to score each model output from 1 (worst) to 5 (best) on each aspect. See Appendix B for more details.
Model Variants. The following model variants are evaluated: BaseAR (the AR counterpart of BaseNAR), BaseNAR, and variants empowering BaseNAR with knowledge distillation (KD), contrastive learning (CL), and iterative decoding (ID), namely BaseNAR+KD, BaseNAR+CL, BaseNAR+ID, BaseNAR+KD+CL, BaseNAR+KD+ID, BaseNAR+CL+ID, and BaseNAR+KD+CL+ID. See Appendix C for their implementation details.²

Results and Analysis

Table 1 and Table 2 show the automatic and human evaluation results of different models on Yelp and GYAFC. For comparison, we also provide automatic evaluation results for two state-of-the-art methods:

• DRL (Luo et al., 2019): a reinforcement learning framework which jointly trains the source-to-target and the target-to-source transfer models as a dual task. The framework is optimized by a style reward and a content reward, together with pseudo-parallel data created through back-translation.

• SR (Zhou et al., 2020): a sequence-to-sequence model which predicts the output words as well as their relevance to the target style. The word relevance is further utilized to ensure style relevance consistency and content preservation.

Table 1: Automatic and human evaluation results on Yelp. The left side of "|" is the automatic evaluation result and the right side is the human evaluation result. K = 4 in our experiments. DI: decoding iterations. †: result significantly better than BaseNAR with p-value < 0.1 for both automatic and human evaluation.

Table 2: Automatic and human evaluation results on GYAFC. The left side of "|" is the automatic evaluation result and the right side is the human evaluation result. K = 4 in our experiments. DI: decoding iterations. †: result significantly better than BaseNAR with p-value < 0.1 for both automatic and human evaluation.
On all metrics, BaseAR has performance comparable to the two methods and thus serves as a decent baseline for evaluating the NAR models.³ For our NAR variants, we have the following observations. First, compared with BaseAR, BaseNAR has a clear disadvantage in language fluency on Yelp and on all metrics on GYAFC, confirming that the removed conditional dependencies do degrade model performance.
Second, knowledge distillation provides a significant improvement over BaseNAR in cases where BaseNAR underperforms BaseAR by a large margin. In particular, on GYAFC, which has longer sentences and larger variance but fewer training samples, the relationships among output tokens become harder to infer, making BaseNAR inferior to BaseAR on all metrics. In this situation, the pseudo-parallel data distilled from an AR model provides considerable complementary knowledge to BaseNAR, so the gap between BaseNAR and BaseAR is clearly narrowed. In contrast, on aspects where BaseNAR and BaseAR have a limited performance gap, most of the knowledge distilled from an AR model is already captured by BaseNAR and is thus less helpful.
Third, contrastive learning generally leads to a small improvement on all metrics. While the improvement is quite limited compared with knowledge distillation, and can be negligible especially in automatic evaluation, the benefits become more visible when contrastive learning is used together with knowledge distillation.
Table 3: Qualitative results on samples from Yelp (positive → negative, left of "|") and GYAFC (formal → informal, right of "|").

Input: they were extremely friendly and reasonably priced . | that is if you truly adore them .
BaseAR: they were extremely rude and flavorless . | that s if u realy luv them
BaseNAR: they were extremely rude over priced . | that is u realy adore them
BaseNAR+KD: they were extremely rude and flavorless . | that s if u truly luv them
BaseNAR+CL: they were extremely rude over priced . | that s if you realy adore them :p
BaseNAR+ID: they were extremely over priced . | that is if you truly adore them
BaseNAR+KD+CL: they were extremely rude and flavorless . | that s if u truly luv them
BaseNAR+KD+ID: they were extremely rude and flavorless . | that s if u truly them
BaseNAR+CL+ID: they were extremely rude and over priced . | that is if you truly adore them
BaseNAR+KD+CL+ID: they were extremely rude and flavorless . | that s if u truly luv them

Fourth, iterative decoding mainly improves language fluency. However, it can degrade the transfer accuracy on GYAFC. An explanation is that, since the model's prediction also relies on the partial outputs in addition to the source words and target style under iterative decoding, the dependency on the target style becomes harder to capture with more conditioning variables. Furthermore, since the masked tokens are selected based on their probabilities, correctly predicted tokens (which reflect the target style) can be re-masked due to lower probabilities than tokens at other positions. Fortunately, this degradation diminishes when knowledge distillation is used.

Table 3 shows the qualitative results for different model variants on samples from Yelp and GYAFC. BaseNAR suffers from the word omission problem, i.e., omitting "and" for the Yelp sample and "if" for the GYAFC sample. This problem can hurt content preservation and language fluency. Further, BaseNAR makes limited changes in producing an informal sequence on GYAFC, explaining its lower transfer accuracy in Table 1 and Table 2. Variants with knowledge distillation produce results closer to BaseAR.
Variants with iterative decoding can generate more fluent results but may do worse in transfer accuracy (e.g., BaseNAR+ID only removes the ending punctuation of the GYAFC sample). Using contrastive learning alone brings only marginal improvement, e.g., fixing the word omission on GYAFC but not on Yelp. However, using contrastive learning together with knowledge distillation generally leads to better results. See Appendix D for more examples.
To summarize, the gap between AR and NAR decoding can be clearly narrowed when the NAR model is enhanced with knowledge distillation, contrastive learning, and, optionally, iterative decoding (without iterative decoding, the model shows no significant performance difference but is more efficient).

Discussions
Our work differs from other NAR works by exploring NAR decoding in an unsupervised TST task. Knowledge distillation mainly solves the multimodality problem in the supervised NMT domain, whereas in TST it alleviates the lack of ground-truth outputs for training. Iterative decoding is very effective in NMT but is of limited help in TST and even reduces transfer accuracy. Without ground truth, the NAR model in TST suffers more from word omission than from the word repetition problem observed in NMT. The contrastive learning loss, which has not been studied in NAR models for other tasks, is thus introduced to penalize a high log-probability for outputs with word omission. We expect the contrastive learning loss can be adapted to reduce word repetition problems in other tasks.

Conclusion
In this paper, we propose NAR decoding for unsupervised text style transfer to pursue faster inference. On top of a base model, we explore how knowledge distillation, contrastive learning, and iterative decoding can narrow the performance gap towards AR decoding.

A Dataset Details
For both Yelp and GYAFC, we use the same train/dev/test split as in our state-of-the-art baselines (Luo et al., 2019; Zhou et al., 2020). In particular, for the GYAFC dataset, we use the data in the Family & Relationships domain and ignore the available alignment information in the corpus so as to target unsupervised text style transfer.

B.2 Human Evaluation
For each test sample, an annotator is provided with the source input, the target style, and the transferred outputs of all compared models, as in Li et al. (2018). The transferred outputs are shuffled across test samples so that the annotator is unaware of which model produced each output. The annotators are trained with exemplar annotations provided by the authors before the evaluation. The final Fleiss' kappa score is 0.79 on Yelp and 0.77 on GYAFC.
C Implementation Details

C.1 Model Architecture

All the NAR models adopt the same Transformer-based encoder-decoder architecture, and the BaseAR model differs from the NAR models only in the following three aspects discussed in Section 3.1.
• The NAR model discards the autoregressive mask in the self-attention layer. Since the NAR model removes the conditional dependency among the output tokens, the causal mask, where position t can only attend to positions 1, . . . , t − 1, is no longer needed. Following Gu et al. (2018), we set the masks to prevent a position from attending to itself.
• The NAR model incorporates a positional-attention layer in the decoder, which has been shown to facilitate local reordering during decoding (Gu et al., 2018). The positional-attention layer, placed between the self-attention layer and the inter-attention layer, takes the position embeddings as queries and keys and the decoder states as values.
• The NAR model uniformly maps the source words as the decoder input to enrich the information on the decoder side. Specifically, position t in the decoder input takes the word embedding of the source token at position i = round((T_x / T_y) · t), where T_x and T_y denote the lengths of the source input and target output, respectively.
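A minimal sketch of this uniform mapping; clamping the index to the valid source range is our assumption:

```python
# Uniform source-to-decoder-input mapping: decoder position t copies the
# embedding of source position round((T_x / T_y) * t).

def uniform_map_indices(T_x, T_y):
    # One source index per decoder position t in 0..T_y-1,
    # clamped so the index never runs past the source length.
    return [min(round(T_x / T_y * t), T_x - 1) for t in range(T_y)]
```

When the target is longer than the source, nearby decoder positions share a source token; when it is shorter, some source tokens are skipped.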
Both the encoder and the decoder use a Transformer structure with d_model = d_hidden = 128, n_head = 4, and n_layer = 2. Following existing works (Lample et al., 2019), the target style is treated as a special start token in the decoder. Both the style classifier for automatic evaluation and the pretrained p_ψ in the style compatibility loss follow the TextCNN (Kim, 2014) architecture but are independently trained. To backpropagate the gradients from p_ψ to θ, we approximate y in Eq. 4 with the sequence of softmax distributions from which y would be sampled.
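A sketch of this softmax relaxation for a single decoder position, assuming a plain list-based embedding table; the real implementation would operate on PyTorch tensors:

```python
# Differentiable approximation of a sampled token: instead of a discrete
# token id, the classifier consumes the expected embedding under the
# decoder's softmax distribution, sum_w dist[w] * emb[w].

def soft_embedding(dist, embedding_table):
    # dist: probabilities over the vocabulary at one position
    # embedding_table: one embedding vector per vocabulary entry
    dim = len(embedding_table[0])
    return [sum(p * emb[d] for p, emb in zip(dist, embedding_table))
            for d in range(dim)]
```

Because the output is a weighted sum of embeddings, gradients from the style classifier can flow back through `dist` into the decoder parameters.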

C.2 Hyper-parameters
We tune the hyper-parameters on the development set. As a result, the balancing weight α is set to 0.1, the number of iterations K in iterative decoding is set to 4, and the margin η in contrastive learning is set to 1.
We implement all models in PyTorch and conduct the experiments on a single NVIDIA GTX 1080Ti GPU. Each model is trained for 100,000 iterations with a batch size of 64 on Yelp and 32 on GYAFC. The Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 is used for optimization.

C.3 Technical Details
Target Length. During inference, the target length needs to be provided in advance. NAR models in NMT usually train a target-length predictor using the available ground-truth outputs. However, this strategy cannot be adapted to our unsupervised task. Fortunately, on both the sentiment transfer task and the formality transfer task, the desired transferred result only involves local changes to the source text and thus has a length similar to that of the source. Therefore, motivated by Wang et al. (2019), we generate a transferred result for each T ∈ [T_x − B, T_x + B], where T_x denotes the length of the source input. As a result, we obtain 2B + 1 candidates and select the one with the highest log-probability (assigned by the decoder) as the final result. In our experiments, we set B to 2.
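This candidate-and-rerank procedure can be sketched as follows; `decode` is a hypothetical function returning an (output, log-probability) pair for a given target length:

```python
# Length selection at inference: decode one candidate per length
# T in [T_x - B, T_x + B] and keep the highest-scoring one.

def pick_by_length(decode, T_x, B=2):
    candidates = [decode(T) for T in range(T_x - B, T_x + B + 1)]
    # Each candidate is (output, log_probability); keep the best output.
    return max(candidates, key=lambda c: c[1])[0]
```

Since NAR decoding is parallel, all 2B + 1 candidates can in principle be decoded in a single batched forward pass.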
Knowledge Distillation. For models with knowledge distillation, i.e., BaseNAR+KD, BaseNAR+KD+CL, BaseNAR+KD+ID, and BaseNAR+KD+CL+ID, we eliminate the self-reconstruction loss and the cycle-reconstruction loss from the full loss, as preliminary experiments demonstrate no performance degradation from this elimination. A likely reason is that knowledge distillation provides more reliable and direct gradients to the model, rendering the weak supervision from self-reconstruction and cycle-reconstruction redundant.
Contrastive Learning. For models with contrastive learning, i.e., BaseNAR+CL, BaseNAR+KD+CL, BaseNAR+CL+ID, and BaseNAR+KD+CL+ID, the contrastive learning based regularization is only enabled for the last 30% of the training iterations. Consistent with previous contrastive learning works, enabling the regularization earlier may lead to unstable training and brings no performance improvement. As discussed in Section 3.3, the contrastive learning based regularization can be paired with self-reconstruction, cycle-reconstruction, and knowledge distillation. However, based on our preliminary experiments, (1) when knowledge distillation is used, we only pair the regularization with knowledge distillation, and (2) when knowledge distillation is not used (so R_kd is unavailable), we only pair it with cycle-reconstruction, as more sophisticated settings bring no further improvement.
Iterative Decoding. For models with iterative decoding, i.e., BaseNAR+ID, BaseNAR+KD+ID, BaseNAR+CL+ID, and BaseNAR+KD+CL+ID, all losses except the style compatibility loss are reformulated to fit the CMLM scheme. During training, we randomly mask n (0 ≤ n ≤ T) tokens of a target sequence with length T and then optimize the model to predict these masked tokens. One problem is that the style compatibility loss requires generating an output y, yet there is no partial target sequence to condition on. We considered two strategies: the first assumes all tokens are masked; the second first runs an inference stage to obtain an output ŷ, then randomly masks and re-predicts n tokens in ŷ, and mixes the re-predicted tokens with the unmasked ones as the input y to the style classifier. Our preliminary experiments show that the second strategy consistently achieves much better results, so we stick to it when iterative decoding is used.

D Additional Qualitative Results

Table 5 and Table 6 present additional qualitative results on Yelp and GYAFC, respectively.