Collaborative Learning of Bidirectional Decoders for Unsupervised Text Style Transfer

Unsupervised text style transfer aims to alter the underlying style of the text to a desired value while keeping its style-independent semantics, without the support of parallel training corpora. Existing methods struggle to achieve both high style conversion rate and low content loss, exhibiting the over-transfer and under-transfer problems. We attribute these problems to the conflicting driving forces of the style conversion goal and content preservation goal. In this paper, we propose a collaborative learning framework for unsupervised text style transfer using a pair of bidirectional decoders, one decoding from left to right while the other decoding from right to left. In our collaborative learning mechanism, each decoder is regularized by knowledge from its peer which has a different knowledge acquisition process. The difference is guaranteed by their opposite decoding directions and a distinguishability constraint. As a result, mutual knowledge distillation drives both decoders to a better optimum and alleviates the over-transfer and under-transfer problems. Experimental results on two benchmark datasets show that our framework achieves strong empirical results on both style compatibility and content preservation.


Introduction
Text style transfer aims to transform an input text of a source style to a target style (i.e., the style conversion goal) without loss of its style-independent information (i.e., the content preservation goal). Concentrating on different stylistic attributes, text style transfer has attracted much attention from various natural language processing applications, such as personalized machine translation (Rabinovich et al., 2017), text formalization (Zhang et al., 2020), and sentiment translation (Xu et al., 2018). Unfortunately, parallel corpora with aligned inputs and outputs are usually unavailable, forcing models to learn in an unsupervised manner.

Table 1: Illustration of the over-transfer and under-transfer problems.
Input: The dish is fresh and yummy.
Expected Output: The dish is old and disgusting.
Over-Transfer: The staff are rude!
Under-Transfer: The dish is old and yummy.

One research line to address the unsupervised text style transfer task is to first disentangle the style-independent semantics (content) from the style-dependent semantics (style), and then produce the output based on the disentangled content and the target style. The disentanglement is enforced either implicitly (Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; John et al., 2019) or explicitly (Li et al., 2018; Wu et al., 2019b; Xu et al., 2018; Madaan et al., 2020). Nevertheless, such disentanglement has been discovered to be hardly achievable in practice (Elazar and Goldberg, 2018). Putting aside the disentanglement step, another research line learns a direct mapping from input to output, where the model is optimized with pseudo-parallel data created by online back-translation (Lample et al., 2019; Zhang et al., 2018c; Luo et al., 2019; Pant et al., 2020), or by jointly predicting the word-level style relevance (Zhou et al., 2020). For both the disentanglement and non-disentanglement based research lines, objectives like self-reconstruction and style classification have been extensively proven effective in guiding the training process.
Despite the great progress, existing methods still struggle to achieve both a high style conversion rate and low content loss. This limitation is widely embodied by the over-transfer and under-transfer problems: Over-Transfer (OT) refers to the content deviation patterns in which some style-independent semantics are altered, while Under-Transfer (UT) refers to the lazy copying patterns in which some style-dependent semantics are left unchanged. Table 1 illustrates the OT and UT problems in a sentiment transfer scenario. For the given input, the OT output achieves the positive-to-negative sentiment conversion but undesirably changes the focused aspect from dish to staff; in contrast, the UT output preserves this sentiment-independent content but fails to convert yummy into words with negative sentiment.
The OT and UT problems are the product of the conflicting driving forces of the style conversion goal and the content preservation goal. Specifically, objectives for the style conversion goal (e.g., the style classification loss) encourage generating new words reflecting the target style, while objectives for the content preservation goal (e.g., the self-reconstruction loss) encourage copying from the source words. Without supervision from ground-truth, the model struggles between these two conflicting forces and tends to spread its probability mass over choices in both directions to achieve both goals. As a result, the model can make unconfident predictions and exhibit the OT and UT problems when biased in the wrong direction. Furthermore, the specific design of different methods may further exacerbate the OT / UT problems 1 .
In this paper, we draw inspiration from multi-agent learning to address the OT and UT problems. Under the widely adopted encoder-decoder architecture, we jointly learn a pair of Collaborative Bidirectional Decoders (CBD), one decoding from left to right (L2R) and the other decoding from right to left (R2L). Our collaborative learning mechanism regularizes each decoder by distilling knowledge from its peer. Essentially, the OT and UT problems are incorrectly predicted words in the decoding procedure. In a similar spirit to pseudo-labeling (Lee, 2013) and consistency regularization (Laine and Aila, 2017) in Semi-Supervised Learning (SSL), the mutual knowledge distillation provides a direct optimization direction for data lacking ground-truth, gradually improving both decoders to reduce OT and UT errors and become more peaked on reasonable predictions. Specifically, consistent predictions are reinforced, while inconsistent predictions lead to more uncertainty over candidate predictions and provide a chance for achieving consistency in subsequent training. However, this is only plausible under the consistency assumption that consistent knowledge can represent the ground-truth with a high probability. As with the Co-Training framework (Blum and Mitchell, 1998; Qiao et al., 2018) in SSL, to guarantee the rationality of the consistency assumption, we require the two decoders to have different knowledge acquisition processes. In addition to their opposite decoding directions, we introduce a distinguishability constraint to ensure their difference. In particular, an additional discriminator is employed to distinguish the softmax probabilities from the two decoders.
Our contributions can be summarized as: (1) We address the over-transfer and under-transfer problems in unsupervised text style transfer from the perspective of multi-agent learning with a pair of bidirectional decoders. (2) We propose a collaborative learning mechanism with mutual knowledge distillation and a distinguishability constraint to optimize the bidirectional decoders, so as to continuously promote the model's capability. (3) Experimental results and in-depth analysis on two benchmark datasets verify the strength of our model in pursuing both the style conversion goal and the content preservation goal.

Text Style Transfer

Recent works (Wu et al., 2019b; Dai et al., 2019, inter alia) advance the field by replacing the Recurrent Neural Network (RNN) based architectures with the Transformer architecture for its superiority in capturing long-term dependencies.

Multi-Agent Learning
Multi-agent learning improves model performance by incorporating multiple interactive agents. Most related to our work are the bidirectional decoding models (Zhang et al., 2018a, 2019; Zhou et al., 2019) in Neural Machine Translation (NMT), which jointly train an L2R translator and an R2L one. Zhang et al. (2019) minimize the KL divergence between the two translators to fuse the good prefixes of L2R decoding and the good suffixes of R2L decoding. Bi et al. (2019) further explore more than two agents, where each agent learns knowledge from a dynamic ensemble model. Mutual learning has also been set up between an NMT agent and a Statistical Machine Translation (SMT) agent to integrate NMT's fluency and SMT's robustness to noisy data (Ren et al., 2019). In addition to focusing on a different task, our model is differentiated by including a distinguishability constraint. Unlike NMT, the task of unsupervised text style transfer is not well-constrained due to both the lack of ground-truth and the conflicting forces from the two goals of style transfer. The distinguishability constraint is important to prevent the decoders from collapsing to one bad local optimum and reinforcing incorrect but consistent patterns.

Consider a training corpus M = {(x_i, s_i)}_{i=1}^{N}, where x_i is a text sequence and s_i ∈ S is its style, with S denoting all possible style types. The objective of text style transfer is to learn a conditional probability distribution P(x̃ | x, s̃) that transforms a given x to x̃ with a target style s̃. The output x̃ is expected to retain the style-independent information in x. Here, we stick to the encoder-decoder based sequence-to-sequence architecture, where an encoder E first encodes x into latent vectors E(x), and a decoder D then produces x̃ by sampling from its parameterized distribution D(x̃ | E(x), s̃).
In this paper, we propose a framework with one encoder E and a pair of bidirectional decoders: an L2R decoder D_l producing the output sequence from left to right, and an R2L decoder D_r going in the opposite direction. The two decoders interact with each other through a collaborative learning mechanism. This mechanism is integrated with a basic framework that follows the non-disentanglement based research line and has three widely used objectives, i.e., self-reconstruction, back-translation, and style classification. In the following, we first briefly introduce the basic framework extended for our two-decoder scenario (Section 3.1), then elaborate our collaborative learning mechanism (Section 3.2), and finally present the training algorithm (Section 3.3).

Basic Style Transfer Framework
We adapt the objectives of self-reconstruction, back-translation, and style classification to our CBD framework. Let θ_E, θ_{D_l}, and θ_{D_r} denote the parameters of E, D_l, and D_r, and θ = [θ_E, θ_{D_l}, θ_{D_r}].

Self-Reconstruction
Self-reconstruction warm-starts the learning on non-parallel corpora and teaches the model to preserve the content. Given an input x and its style s, if the target style s̃ = s, the model is optimized to reconstruct x with both decoders, i.e., to minimize the self-reconstruction loss

L_rec(θ) = −log D_l(x | E(x̄), s) − log D_r(x | E(x̄), s),   (1)

where x̄ is a noisy version of x (obtained by random word permutation and word removal) to avoid trivial solutions, as in Shen et al. (2017).
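The noising operation is only described as random word permutation and removal; a minimal sketch is given below, assuming the local-shuffle formulation popularized by Lample et al.'s denoising objectives (the function name and parameters are illustrative, not the authors' exact implementation):

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_k=3, seed=None):
    """Return a noisy copy of `tokens`: each word is dropped with
    probability `drop_prob`, and words are locally permuted so that no
    word moves more than roughly `shuffle_k` positions from its origin."""
    rng = random.Random(seed)
    # Word dropout: remove each token independently, keeping at least one.
    kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]
    # Local shuffle: sort by (index + uniform noise in [0, shuffle_k]).
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]
```

With `drop_prob=0` and `shuffle_k=0` the function is the identity, so the noise strength can be annealed if desired.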

Back-Translation
Through dynamically creating pseudo-parallel data, back-translation provides guidance for the transfer between different styles, with increasing reliability as training proceeds. Given an input x and its style s, suppose we designate a target style s̃ ≠ s and get x̃_l ∼ D_l(x̃ | E(x), s̃) and x̃_r ∼ D_r(x̃ | E(x), s̃). The model is optimized to restore x when we feed x̃_l or x̃_r as the input and s as the target style, i.e., to minimize the back-translation loss

L_back(θ) = −log D_l(x | E(x̃_l), s) − log D_r(x | E(x̃_r), s).   (2)

This back-translation objective penalizes solutions that produce the same outputs for a given target style regardless of the inputs, thus alleviating the content deviation patterns of the OT problem.

Style Classification
Style classification enforces the style conversion goal by using a style classifier C (with parameters θ_C) to judge the style type of the transferred outputs. Given an input x and the target style s̃, suppose we get x̃_l ∼ D_l(x̃ | E(x), s̃) and x̃_r ∼ D_r(x̃ | E(x), s̃). The model is optimized to ensure that x̃_l and x̃_r are categorized to the style type s̃ by C, i.e., to minimize the style classification loss

L_sty(θ) = −log C(s̃ | x̃_l) − log C(s̃ | x̃_r).   (3)

Collaborative Learning
As discussed in Section 1, the OT / UT problems are essentially words predicted incorrectly under the lack of ground-truth and the conflicting forces of the style conversion goal and the content preservation goal. To provide more supervision, we establish a mutual knowledge distillation scheme between the two decoders. Since the two decoders are conditionally independent given the encoder's outputs and the target style, we expect them to have inherently different knowledge acquisition processes. Then distilling the knowledge from one to the other can regularize each decoder by encouraging consistent predictions. Meanwhile, we explicitly ensure the two decoders' inherent difference through a distinguishability constraint, which employs a discriminator to distinguish their behaviors. Together with the opposite decoding directions, this constraint keeps the mutual knowledge distillation from rapidly pushing both decoders towards one bad local optimum where incorrect but consistent patterns are reinforced.

Mutual Knowledge Distillation
We regularize D_l and D_r via two-way knowledge distillation: each tries to learn the other's knowledge on producing the transferred output. Consider the knowledge distillation from D_r to D_l. Following the knowledge distillation framework (Hinton et al., 2015), given an input x and the target style s̃, D_l is optimized to decrease the KL divergence between its probability distribution over all possible outcomes and that of D_r, i.e., to minimize

L_mkd(θ_{D_l}) = KL( D_r(x̃ | E(x), s̃) ‖ D_l(x̃ | E(x), s̃) ).

Eliminating the negative entropy term, which is irrelevant to D_l, from the KL divergence, L_mkd(θ_{D_l}) can be reformulated as

L_mkd(θ_{D_l}) = − Σ_{t ∈ T(x, s̃)} D_r(t | E(x), s̃) log D_l(t | E(x), s̃),

where T(x, s̃) denotes all the possible transferred outcomes. However, exact computation of L_mkd(θ_{D_l}) is intractable due to the summation over the exponential search space T(x, s̃). Following Kim and Rush (2016), we approximate the target distribution D_r(x̃ | E(x), s̃) with 1[x̃ = t_r], where t_r = arg max_{t ∈ T(x, s̃)} D_r(t | E(x), s̃) denotes the mode of the target distribution. As the maximization problem is still intractable, t_r is further approximated by a sequence t_r* obtained with greedy decoding or beam search on D_r. As a result, we arrive at

L_mkd(θ_{D_l}) = −log D_l(t_r* | E(x), s̃).   (4)

This resulting objective is equivalent to optimizing D_l with pseudo-parallel data generated by D_r. Similarly, for the knowledge distillation from D_l to D_r, D_r is optimized to minimize

L_mkd(θ_{D_r}) = −log D_r(t_l* | E(x), s̃),   (5)

where t_l* is obtained analogously from D_l.
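The sequence-level approximation above can be sketched concretely. Here the peer's per-step distributions are plain dictionaries and `student_logprob` stands in for the student decoder's sequence log-probability; both names and the per-step greedy approximation of the mode are illustrative:

```python
import math

def mkd_loss(peer_stepwise_probs, student_logprob):
    """Sequence-level knowledge distillation (Kim & Rush, 2016 style):
    approximate the peer's distribution by its mode t*, obtained here
    greedily per step, then score that pseudo-target under the student.

    peer_stepwise_probs: list of {token: prob} dicts, one per step.
    student_logprob(seq): student's log-probability of a sequence.
    Returns (t*, loss) with loss = -log D_student(t* | E(x), s~)."""
    # Greedy mode of the peer distribution.
    t_star = [max(p, key=p.get) for p in peer_stepwise_probs]
    return t_star, -student_logprob(t_star)
```

In the full model this is applied in both directions, with each decoder alternately playing the roles of peer and student.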

The Distinguishability Constraint
The distinguishability constraint penalizes the cases where the two decoders lose their specialty in knowledge acquisition and collapse to each other.
To this end, we jointly train a discriminator F (with parameters θ_F) to discriminate the behaviors of the two decoders. Specifically, we represent the behavior of a decoder by the sequence of softmax probabilities associated with its transferred output. Let F(b) denote the probability that behavior b comes from D_l instead of D_r. Given an input x and the target style s̃, suppose we get x̃_l ∼ D_l(x̃ | E(x), s̃) and x̃_r ∼ D_r(x̃ | E(x), s̃), and let o(x̃_l) and o(x̃_r) denote their softmax probability sequences. The decoders and F are optimized to ensure that o(x̃_l) and o(x̃_r) can be correctly classified by F, i.e., to minimize

L_dis(θ, θ_F) = −log F(o(x̃_l)) − log(1 − F(o(x̃_r))).   (6)

Note that the distinguishability constraint is not incompatible with mutual knowledge distillation: while mutual knowledge distillation focuses on the consistency between the joint probabilities of the two decoders, i.e., D_l(x̃ | E(x), s̃) and D_r(x̃ | E(x), s̃), the distinguishability constraint focuses on the difference between their stepwise factor sequences o(x̃_l) and o(x̃_r) of length T, where T denotes the sequence length.

Algorithm 1 Training algorithm of CBD.
…
5: Sample a target style s̃ ∼ S with s̃ ≠ s
6: Compute L(θ, θ_F) by Eq 7
…
9: Update θ, θ_F by optimizing L(θ, θ_F)
10: Compute L_sty−c(θ_C) by Eq 8
11: Update θ_C by optimizing L_sty−c(θ_C)
12: end for
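The constraint can be sketched as a standard binary cross-entropy over the discriminator's verdicts; the exact form of Eq 6 is an assumption here, and `F_prob_l` / `F_prob_r` stand in for F's outputs on the two behavior sequences:

```python
import math

def distinguishability_loss(F_prob_l, F_prob_r):
    """Binary cross-entropy for the behavior discriminator F:
    F should output a high probability for the L2R decoder's softmax
    sequence o(x~_l) and a low one for the R2L decoder's o(x~_r).
    Both the decoders and F minimize this loss (cooperatively, not
    adversarially), which keeps the two decoders distinguishable."""
    return -math.log(F_prob_l) - math.log(1.0 - F_prob_r)
```

The loss is smallest when the two behaviors are easy to tell apart, directly penalizing the collapse case where F cannot do better than chance.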

Model Training
Integrating the collaborative learning mechanism with the basic framework, we formulate the full objective of CBD as minimizing

L(θ, θ_F) = L_rec(θ) + L_back(θ) + α L_sty(θ) + β [L_mkd(θ_{D_l}) + L_mkd(θ_{D_r})] + γ L_dis(θ, θ_F),   (7)

where α, β, and γ are hyperparameters.
The style classifier C is pretrained on M and further updated in our training stage in the spirit of adversarial learning 2 (Goodfellow et al., 2014), which has been shown to stabilize the learning of CBD in our preliminary experiments. For an input x and its style s, we enforce C to correctly predict s as the style of x; while for the outputs x̃_l and x̃_r produced by D_l and D_r under the target style s̃, we enforce C to be uncertain between s and s̃ by assigning a uniform distribution over the two styles (which represents the highest uncertainty). Formally, Eq 8 is minimized.

2 The transfer model (incl. the encoder and the decoders), acting as the generator from the adversarial learning field, tries to produce a result making C predict its style as the given target style. With the first term in Eq 8, C acts as the discriminator from the adversarial learning field and is encouraged to be uncertain on the transferred results.
The training algorithm is summarized in Algorithm 1. The x̃_l and x̃_r in steps 6 and 7 of Algorithm 1 are generated by greedy decoding, and they further act as t_l* and t_r* in Eq 5 and Eq 4, respectively. Greedy decoding is also used during inference. Note that the discreteness of text generation hinders the gradient backpropagation from L_sty to θ. We tackle this problem by approximating each discrete word with the softmax distribution given by the decoder.
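The softmax-based relaxation mentioned above can be sketched as a generic soft-embedding trick (a common workaround for discrete generation, not necessarily the authors' exact implementation): instead of embedding the argmax token, the classifier receives the expectation of embeddings under the decoder's per-step distribution, so gradients can flow back into the decoder.

```python
import numpy as np

def soft_embed(step_probs, embedding_matrix):
    """Differentiable surrogate for a sequence of discrete words.

    step_probs: (T, V) array of per-step softmax outputs.
    embedding_matrix: (V, d) word embedding table.
    Returns (T, d): each row is the probability-weighted average of
    word embeddings, which equals the hard embedding when the
    distribution is one-hot."""
    return step_probs @ embedding_matrix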

Experimental Settings
Datasets. We evaluate CBD on a sentiment transfer dataset YELP (Li et al., 2018) and a formality transfer dataset GYAFC (Rao and Tetreault, 2018). The YELP dataset is composed of business reviews from Yelp, with each review annotated as positive or negative. The GYAFC dataset is composed of sentences from Yahoo Answers, with each sentence annotated as formal or informal. Data statistics and preprocessing details are provided in Appendix A.
Implementation Details. The encoder and the two decoders are implemented as single-layer Gated Recurrent Unit (GRU) networks, while the style classifier and the discriminator employ the TextCNN architecture (Kim, 2014). In decoding, we follow Lample et al. (2019) and feed the target style to the decoders as a special start token, which is mapped to an embedding vector like ordinary tokens. During inference, we produce two outputs for each sample (one from D_l and the other from D_r) and then select the output with the larger log-probability (assigned by its origin decoder) as the final transferred result. We set α = 0.1, β = 0.1, and γ = 0.01 in Eq 7. More details are provided in Appendix B.
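The inference-time selection between the two decoders' outputs can be sketched as follows; the tuple format is illustrative, and the R2L decoder's tokens are assumed to be emitted in reverse order:

```python
def select_output(l2r, r2l):
    """Pick the final transferred result: each candidate is a pair
    (tokens, total_logprob) scored by its own decoder; the R2L decoder
    emits tokens right-to-left, so its sequence is reversed on return."""
    tokens_l, lp_l = l2r
    tokens_r, lp_r = r2l
    return list(tokens_l) if lp_l >= lp_r else list(reversed(tokens_r))
```

Since the two candidates are decoded independently, this selection adds no interaction cost at inference time.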

Evaluation Measures
Following our baselines, we adopt both automatic evaluation and human evaluation to assess models on three aspects: style compatibility, content preservation, and fluency.
Automatic Evaluation. For style compatibility: a style classifier C_eval with the same architecture as C is independently learned on M. We measure style compatibility by the prediction accuracy (ACC) of C_eval on each model's outputs, using the target styles as ground-truth labels. For content preservation: each test sample is associated with one human reference on YELP 3 and four human references on GYAFC. We measure content preservation by the BLEU score (using multi-bleu.perl 4 ) between the model's outputs and the human references. For fluency: a language model LM with a single-layer GRU architecture is learned on all text sequences from M. We measure fluency by the perplexity (PPL) of LM on the model's outputs.
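For reference, the PPL metric can be computed from the per-token log-probabilities assigned by LM; this is the standard formulation, with the interface here being illustrative:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token log-probabilities (in nats) assigned by
    the evaluation language model: exp of the mean negative
    log-likelihood. Lower values indicate more fluent outputs."""
    return math.exp(-sum(logprobs) / len(logprobs))
```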
Human Evaluation. We invite three human annotators to evaluate different models' outputs for 200 test samples on each dataset. The annotators score each transfer result from 1 (the lowest quality) to 5 (the highest quality) in terms of style compatibility, content preservation, and fluency. More details are provided in Appendix C.

Results and Analysis
Automatic Evaluation Results. Table 2 shows the automatic evaluation results on YELP and GYAFC. Overall, the non-disentanglement based methods demonstrate better performance than the implicit / explicit disentanglement based methods, which tend to sacrifice content preservation for style compatibility, i.e., the OT problem. On YELP, our CBD performs the best on style compatibility and is comparable to the best on content preservation. While it achieves the second-best fluency, the BackTrans model with the best fluency suffers from severe content loss with a low BLEU score. On GYAFC, our CBD performs the best on content preservation, the second-best on style compatibility, and the third-best on fluency. Still, the RetrieveOnly model, with the best style compatibility and second-best fluency, and the CrossAligned model, with the best fluency, are both limited on the remaining metrics. Table 2 also reports the geometric mean and harmonic mean of ACC and BLEU, i.e., the G2 and H2 scores, on which our CBD outperforms all the baselines. Hence, we conclude that CBD achieves a better balance among style compatibility, content preservation, and fluency.
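The G2 and H2 summary scores combine ACC and BLEU as standard geometric and harmonic means (the exact scaling convention used in Table 2 may differ; this is the usual definition):

```python
import math

def g2_h2(acc, bleu):
    """Geometric (G2) and harmonic (H2) mean of style accuracy and BLEU,
    summarizing the style/content trade-off in a single number each.
    The harmonic mean punishes imbalance between the two more harshly."""
    g2 = math.sqrt(acc * bleu)
    h2 = 2 * acc * bleu / (acc + bleu) if acc + bleu > 0 else 0.0
    return g2, h2
```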
Human Evaluation Results. The middle blocks of Table 3 show the human evaluation results on YELP and GYAFC. Due to the high evaluation cost, we only compare CBD with a subset of our baselines that achieve a better balance on the three metrics for both datasets than the other baselines in their category. On both datasets, our CBD achieves the best results on all three aspects. Consistent with the automatic results, the non-disentanglement based methods outperform the disentanglement based methods.
Qualitative Results. Table 4 shows the transfer results of different methods for exemplar sentences on YELP and GYAFC. We can see that CBD produces fluent outputs, clearly expressing the target style without loss of other semantics. In contrast, the other approaches exhibit OT / UT problems or produce disfluent sentences. Specifically, the disentanglement based CrossAligned and Del-Ret-Gen are more prone to OT: the underlying semantics of not cooked per requested from the negative → positive example on YELP and the meaning of the formal → informal example on GYAFC are poorly preserved. The non-disentanglement based DualRL and WordStyleRel are more prone to UT: on YELP, the negative fatty or not cooked per requested are unchanged; on GYAFC, the changes are more limited than those of our CBD. More qualitative results and analysis are provided in Appendix E.
Discussions on the OT / UT Problems. Besides the qualitative results, the models' strengths against the OT problem are indicated by the content preservation scores, i.e., BLEU in Table 2 and Content in Table 3. Thus we conclude that CBD can alleviate the OT problem especially faced by disentanglement based methods. However, the UT problem is only partially indicated by the style compatibility scores, as failures on style conversion can also be caused by irrational modification of style indicators, such as indicator removal. From the style compatibility scores, we can only conjecture that CBD improves over the baselines on the UT problem. For better justification, we prepare two subsets of YELP: U_special, containing 200 carefully selected samples with more than one style indicator (e.g., the input in Table 1), and U_random, containing 200 random samples. Three human annotators are invited to label whether a given transferred result has the UT problem. The right block of Table 3 presents the ratio of UT cases for different models. All models have more UT cases on U_special, suggesting the UT problem occurs more often for inputs with more than one style indicator, since partial transfer results can fool the style classifier. CBD outperforms all baselines except for CrossAligned. However, CrossAligned has serious OT problems, deviating the semantics to achieve the target style, as demonstrated by its content preservation scores in Tables 2 and 3.
Limitation. Despite the improvement over the baselines, the UT problem remains more challenging for our model than the OT problem, especially when the style is expressed in less frequent manners. For a negative → positive sample on YELP, "they only received one star because you have to provide a rating.", our model generates: "they received one star because you have to provide a great rating.". This can be attributed to the lack of common-sense knowledge. More failure cases and analysis are provided in Appendix F.

Ablation Study
To better validate the effectiveness of the proposed CBD, we compare the following ablated variants: (1) L2R + {L_basic}; (2) R2L + {L_basic}; (3) L2R + R2L + {L_basic}; (4) L2R + R2L + {L_basic, L_mkd}; (5) L2R + R2L + {L_basic, L_mkd, L_dis}; (6) L2R + L2R + {L_basic, L_mkd, L_dis}; (7) R2L + R2L + {L_basic, L_mkd, L_dis}; where variant (5) corresponds to our CBD model, and L_basic = {L_rec, L_back, L_sty}. Table 5 shows the automatic evaluation results of these variants on YELP. We have the following observations. First, the comparison between (3) and (1)/(2) shows that shallow interactions through the shared encoder cannot give the two-decoder setting a clear advantage over the one-decoder setting. Second, the comparison between (4) and (3) shows that mutual knowledge distillation can promote style compatibility and content preservation while sacrificing a little fluency. Third, the comparison between (5) and (4) shows that involving the distinguishability constraint achieves further improvement on all aspects. Fourth, the comparison between (5) and (6)/(7) shows that settings with two unidirectional decoders underperform the bidirectional setting on all aspects. We conclude that, with comparable fluency, CBD (variant (5)) is advantageous over the other variants in achieving both the style conversion goal and the content preservation goal.
To provide deeper insight, Table 5 presents the per-word entropy 5 of each variant. The entropy measures the uncertainty of the model's predictions. We can see that variants (4) and (5) show lower per-word entropy values than (1) and (2). As explained in Section 1, single-decoder models can unconfidently struggle between new word generation and source word copying. However, the mutual knowledge distillation in (4) and (5) provides additional supervision to the decoders by gradually reinforcing consistent patterns, and thus improves their confidence in prediction. For unsupervised tasks, lower entropy values are preferred, as they represent the model's capability to filter out the large proportion of wrong choices (Graça et al., 2009; Niu et al., 2012). Figure 1 in Appendix G illustrates the probability distributions of different variants when predicting specific words. Consistent with the per-word entropy values, CBD shows more peaked distributions than the single-decoder variants. Note that the per-word entropy value of (6) is higher than that of CBD, while that of (7) is lower. This is possible: as the two unidirectional decoders have closer per-word distributions, if the distributions themselves are less peaked, then the posterior distributions after mutual knowledge distillation can also be less peaked (as in (6)); on the other hand, if the distributions themselves are more peaked, then so are the posterior distributions (as in (7)). This is implied by the per-word entropy values of (1) and (2), where (1) has a higher entropy value than (2).
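The per-word entropy reported in Table 5 can be computed as follows; we assume the standard formulation (average Shannon entropy of the decoder's per-step softmax distributions), since the footnote defining it is not reproduced here:

```python
import math

def per_word_entropy(stepwise_probs):
    """Average Shannon entropy (in nats) of a decoder's per-step softmax
    distributions over the vocabulary; lower values indicate more
    confident (peaked) predictions.

    stepwise_probs: list of per-step probability vectors (lists)."""
    def H(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return sum(H(p) for p in stepwise_probs) / len(stepwise_probs)
```

A uniform distribution over V candidates gives the maximum value log V, while a one-hot distribution gives 0.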
Computational Overhead. With an extra decoder, the introduced mutual knowledge distillation process, and the distinguishability constraint, on a single Nvidia GTX 1080Ti GPU, the training speed of CBD is about 0.5 times that of the single-decoder setting, while the inference speed is about 0.55 times that of the single-decoder setting. However, as the two decoders make inferences independently, the gap diminishes (the speed ratio is 0.9:1) when we explicitly assign the two decoders to different CUDA streams during inference. Multiple GPUs can also be utilized to parallelize the inference of the two decoders.
To summarize, as the style conversion goal and content preservation goal push the model towards conflicting directions when lacking ground-truth, single-decoder models try to focus on both directions and thus have high uncertainty in decoding and show OT or UT problems. By collaborative learning with two bidirectional decoders, the proposed CBD model breaks this uncertainty: it gives more direct guidance by reinforcing the consistent predictions of two distinguishable decoders, so that the higher-entropy predictions are redistributed towards the correct direction. As a result, the OT and UT problems are alleviated. See Appendix G for additional quantitative and qualitative results for different variants.

Conclusion
In this paper, we address the unsupervised text style transfer task from a novel multi-agent learning perspective. To overcome the over-transfer and under-transfer problems, we introduce a pair of collaborative bidirectional decoders. Our collaborative learning mechanism performs mutual knowledge distillation on the two decoders and guarantees the rationality of this distillation process by a distinguishability constraint together with their opposite decoding directions. Quantitative and qualitative results on two benchmark datasets validate the strength of our framework over various single-decoder baselines in achieving both the style conversion goal and the content preservation goal. Our code will be made publicly available at https://github.com/sunlight-ym/CBD_style_transfer.

A Dataset Details
We provide the statistics and preprocessing details of the YELP dataset 6 and the GYAFC dataset 7 here. The YELP dataset has already been tokenized and lowercased. We tokenize and lowercase the sentences in GYAFC with spaCy 8 . For both datasets, we construct a vocabulary keeping the 10K most frequent words in the dataset. Out-of-vocabulary words are mapped to a special token <unk>.

B Additional Implementation Details
The encoder adopts a single-layer bidirectional Gated Recurrent Unit (GRU) network with 256 hidden units in each direction. The L2R decoder and the R2L decoder both employ an attention-based single-layer unidirectional GRU network with 512 hidden units. The word embeddings, of size 128, are shared between the encoder and the two decoders.
Our implementation is based on PyTorch (version 1.3.1) in Ubuntu 16.04. Models are trained on a single Nvidia's GTX 1080Ti GPU with 11 Gbps GDDR5X memory. We use a batch size of 64 and train the model for 100K iterations. The Adam algorithm (Kingma and Ba, 2015) is utilized to optimize the model with a learning rate of 0.001.
The hyperparameters α, β, and γ in Eq 7 are tuned on the development set. Specifically, we search α, β, and γ over the values in {0.001, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0}. Each value is evaluated based on three trials with different seeds, which are integers uniformly sampled from [1, 999]. We have consistent observations on both datasets: (a) For α, {0.05, 0.1, 0.2} perform similarly, while smaller (larger) values increase BLEU (ACC) but significantly decrease ACC (BLEU). (b) For β, the performance, especially BLEU, increases as β goes from 0.001 to 0.1. The ACC is clearly degraded when β = 0.2. When β ≥ 0.5, the model quickly produces empty outputs from both decoders as a trivial solution for mutual learning, which seems to dominate the training. (c) For γ, the benefits are most significant when γ = 0.01. The model becomes unstable when γ is further increased, e.g., the outputs may become quite disfluent, with repeated tokens. As a result, we set α = 0.1, β = 0.1, and γ = 0.01.
Following Lample et al. (2019), the gradients of the back-translation loss (Eq 2) are not backpropagated through the generation pass for x̃_l and x̃_r.

C Human Evaluation Details
Each of the three human annotators fulfills a set of qualification requirements. Following Li et al. (2018), for each test sample and target style, each annotator was shown the outputs of all evaluated models, with the outputs randomly permuted. Before evaluation, for each dataset and each transfer direction, annotators were trained with: (a) instructions on the desirable properties of the text style transfer task; (b) the detailed interpretation of each level (1-5) for the three aspects: style compatibility, content preservation, and fluency; and (c) four exemplary transfer outputs on a source sentence, associated with scores assigned by the authors and a short explanation for the scores.
We measure the inter-annotator consistency of the human evaluation results with Fleiss' kappa. Specifically, the Fleiss' kappa score is 0.782 on YELP and 0.791 on GYAFC.
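Fleiss' kappa can be computed from a ratings table as follows; this is a generic sketch of the standard formula, not the authors' evaluation code.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    annotators assigning item i to category j; every item must be rated
    by the same number of annotators."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    total = n_items * n_raters

    # Marginal proportion of each category over all assignments.
    p_j = [sum(row[j] for row in ratings) / total for j in range(n_cats)]

    # Per-item agreement among the raters.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]

    P_bar = sum(P_i) / n_items       # observed agreement
    P_e = sum(p * p for p in p_j)    # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```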

E Additional Qualitative Results
To better illustrate the improvement of our CBD over the baselines with respect to the over-transfer and under-transfer problems, we present additional qualitative examples from YELP and GYAFC in Table 8 and Table 9, respectively. The results show patterns consistent with those in Table 4. Specifically, the disentanglement-based methods, especially the implicit disentanglement-based CrossAligned model, suffer from a serious over-transfer problem, losing original content or adding new content; on the other hand, the non-disentanglement-based baselines tend to under-transfer, keeping part of the original sentiment semantics on YELP and making only limited transformations on GYAFC. In contrast, our CBD demonstrates better robustness to both the over-transfer and under-transfer problems.

F Failure Cases
We also present two more undesirable transfer results of our CBD for each dataset in Table 10. These failure cases mainly under-transfer the source sentences, showing the imperfection of our model with respect to the under-transfer problem in the following situations: (a) the style information is expressed in less frequent manners (compared with those indicated by adjectives), such as "with lots to see and try"; (b) the style information is expressed by words that can represent different styles in different contexts: for example, "hot" is used to indicate both positive and negative sentiment in the training corpus; (c) partial changes can also be regarded as reasonable results for inherently continuous style types such as formality transfer: for the formal → informal example on GYAFC, the outputs of our CBD are limited to removing the comma, while further changes, such as changing "you" to "u" or removing the period, could be applied to achieve a more informal style. Besides, there are incorrect transfer results, such as changing "ur" to "your" instead of "you are" for the informal → formal example, which might be plausible in some cases but not in the given context. Based on the above observations, we conclude that the limitations of our CBD are twofold: first, it cannot fully utilize the structure and/or context of the source sentence to make the transfer; second, it cannot control how much style information is transferred for inherently continuous style types. We leave exploration of these issues to future work.
In this paper, we focus on the problems brought by the conflicting driving forces of the style conversion goal and the content preservation goal.

G Detailed Ablation Study
In this section, we provide more details of our ablation study on the YELP dataset to validate the effectiveness of the CBD model.

G.1 Qualitative Results of Different Ablated Variants
To provide more insight into the different ablated variants, Table 11 presents the transfer results of variants (1)-(7) on eight samples from YELP. We observe that the single-decoder variants (1) and (2) easily suffer from the over-transfer problem (e.g., losing "walmart" in the third negative → positive sample) or the under-transfer problem (e.g., keeping "was even better" in the last positive → negative sample). Variant (3) performs better only in some cases by setting up a shallowly connected two-decoder scheme. With mutual knowledge distillation, variant (4) is much less prone to the over-transfer and under-transfer problems. However, it can emphasize the consistency between the decoders too strongly and may still lead to suboptimal results (e.g., totally under-transferring the last negative → positive sample). By incorporating a distinguishability constraint, our CBD, i.e., variant (5), alleviates both problems. In contrast, variants (6) and (7), with two unidirectional decoders, perform even worse than (4) in most cases. This further implies that the inherent difference between unidirectional decoders is quite limited; therefore, the two decoders may share similar bad patterns, which are further reinforced during training: taking the third positive → negative case as an example, both (1) and (6) suffer from the over-transfer problem, while both (2) and (7) suffer from the under-transfer problem. As shown in Table 5, the per-word entropy values of the two-decoder settings are lower than those of the single-decoder settings. To better illustrate this, Figure 1 presents the top-5 predicted words, together with their probabilities, of variants (1), (2), (5), and (7) when they predict (a) the word after "would" and (b) the word after "dentistry", given the third positive → negative input.
Figure 1: The top-5 words' probabilities of variants (1), (2), (5), and (7) when they predict (a) the word after "would" and (b) the word after "dentistry", given the third positive → negative input.
Variants (5) and (7) demonstrate more peaked distributions over the words, which conforms to their lower entropy values. However, the most probable word of variant (7) in Figure 1a, i.e., "suggest", is an incorrect prediction expressing the opposite of the target style. This further shows that two unidirectional decoders may amplify the bad patterns learned by the inherently similar decoders.
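The relation between peaked predictive distributions and low per-word entropy can be illustrated as follows; the vocabulary and probabilities below are invented for illustration and do not come from the paper's models.

```python
import math


def per_word_entropy(probs):
    """Shannon entropy (in bits) of one predictive distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def top_k(probs, vocab, k=5):
    """Top-k words with their probabilities, most probable first."""
    return sorted(zip(vocab, probs), key=lambda wp: wp[1], reverse=True)[:k]


# Illustrative distributions over a toy vocabulary.
vocab = ["not", "never", "suggest", "recommend", "avoid", "return"]
peaked = [0.70, 0.15, 0.05, 0.04, 0.03, 0.03]  # e.g., a two-decoder variant
flat = [1 / 6] * 6                              # e.g., a single decoder

# A more peaked distribution has lower per-word entropy.
assert per_word_entropy(peaked) < per_word_entropy(flat)
```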

G.2 Effect of the Distinguishability Constraint
Based on the quantitative and qualitative results, variant (4), which lacks the distinguishability constraint, performs better than the other ablated variants and comes quite close to CBD (variant (5)).
To better explore the effect of the distinguishability constraint, we include variant (4) in the human evaluation. As shown in Table 12, CBD improves over variant (4) on style compatibility (partially reflecting the under-transfer problem) and content preservation (reflecting the over-transfer problem). Furthermore, Table 12 also reports their percentages of under-transfer cases on U special and U random. Again, CBD achieves a clear advantage. Hence, we conclude that incorporating the distinguishability constraint leads to better, though limited, capabilities to address the over-transfer and under-transfer problems. Another question is how divergent the two decoders are at the end of training. Table 12 presents the BLEU scores between the outputs of the L2R decoder and the R2L decoder for variant (4) and CBD. We observe that the BLEU scores of the two variants are comparable, both exhibiting a high similarity. Moreover, the discriminator cannot distinguish the behaviors of the two decoders well. This is unsurprising, as the other loss functions dominate the learning process. Obviously, we could decrease the BLEU score and increase the accuracy of the discriminator by assigning a larger value to γ, i.e., the weight of the distinguishability constraint. However, this focuses on the wrong objective and only leads to worse results, in which the influence of the other objectives is weakened and at least one decoder tends to produce unreasonable outputs to maximize the difference. Our explanation is that the distinguishability constraint behaves as an assistant, or a regularizer, for mutual knowledge distillation. While it is less important for performance than mutual learning, it constrains the model to update more cautiously, preventing the decoders from collapsing onto, and reinforcing, incorrect but consistent patterns (e.g., keeping "was even better" in the last positive → negative sample in Table 11).
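A simplified corpus-level BLEU, as could be used to compare the outputs of the two decoders, can be sketched as follows. This is an illustrative implementation (uniform 4-gram weights, single reference, no smoothing), not the paper's evaluation script.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def corpus_bleu(hyps, refs, max_n=4):
    """Corpus BLEU with clipped n-gram precision and brevity penalty.
    hyps/refs are parallel lists of token lists (one reference each)."""
    p_num = [0] * max_n
    p_den = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
            p_num[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            p_den[n - 1] += sum(h.values())
    if min(p_num) == 0:  # some n-gram order has no match
        return 0.0
    log_p = sum(math.log(n / d) for n, d in zip(p_num, p_den)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_p)
```

For example, scoring the L2R decoder's tokenized outputs against the R2L decoder's outputs (as hypotheses and references, respectively) would give a similarity measure of the kind reported in Table 12.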