Neural Stylistic Response Generation with Disentangled Latent Variables

Generating open-domain conversational responses in a desired style usually suffers from the lack of parallel data in that style. Meanwhile, using monolingual stylistic data to increase style intensity often comes at the expense of content relevance. In this paper, we propose to disentangle the content and style in latent space by diluting sentence-level information in style representations. Combining the desired style representation with a response content representation then yields a stylistic response. Compared with baselines, our approach achieves a higher BERT-based style intensity score and comparable BLEU scores. Human evaluation results show that our approach significantly improves style intensity and maintains content relevance.


Introduction
Linguistic style is an essential aspect of natural language interaction and provides particular ways of using language to engage with audiences (Kabbara and Cheung, 2016). In human-bot conversations, it is crucial to generate stylistic responses to increase user engagement with conversational systems (Gan et al., 2017). Currently, most existing parallel datasets are not stylistically consistent. Samples in these datasets are usually contributed by a variety of users, resulting in an averaging effect across style characteristics (Zhang et al., 2018a). Meanwhile, constructing a parallel stylistic dataset for training open-domain conversational agents is both labor-intensive and time-consuming.
Recent studies show the effect of stylizing responses using a monolingual dataset in the desired style and a conventional conversational dataset (Niu and Bansal, 2018; Gao et al., 2019b). However, increasing style intensity often comes at the expense of content relevance between dialogue history and response. As the example in Figure 1 shows, Niu and Bansal (2018) independently train a response generation model and a stylistic language model and subsequently interpolate them in the inference phase. Lacking interaction between the stylistic language model and the response generation encoder, this approach usually yields a trade-off between style intensity and content relevance. Gao et al. (2019a,b) fuse a structured latent space where the direction denotes diversity, and the distance denotes style intensity and content relevance. The main issue is that style intensity and content relevance are contradictory in measurement but are coupled to the same "distance" metric of the latent space. To sum up, the key issue of the above studies is the improper entanglement of style and content.
Figure 1: An example of responses generated by S2S, S2S+LM (Niu and Bansal, 2018), Style Fusion (Gao et al., 2019b), and our approach, targeting the Holmes style, which is quite formal and polite.
To address the issue, we propose to disentangle the style and content of a response. The disentanglement is conducted on the structured latent space, where each sentence (dialogue history, response, and stylistic sentence) is projected into a vector representation. We further split the representation into two components: style and content representations. The former is a corpus-level feature since sentences within a dataset have the same style. In contrast, the content representation is a sentence-level feature decided by the sentence itself. We thus disentangle the content and style by diluting sentence-level information in the style representation. This encourages the encoding of content information into the content representation. Otherwise, the content information will be corrupted in the style representation, making it hard to reconstruct the original content in the subsequent decoding process. We conduct experiments on the DailyDialog conversational dataset and the Holmes monolingual stylistic dataset (Gao et al., 2019b). Experimental results show that our proposed approach improves style intensity and maintains content relevance. Our contributions are listed below:
• We propose a unified framework to simultaneously improve style intensity and maintain content relevance for neural stylistic response generation.
• We introduce a scheme of learning latent variables by a diluting strategy to disentangle the style and content.
• Experimental results show that our approach achieves higher performance in style intensity without decreasing content relevance, compared with previous approaches.

Task Definition
The task of stylistic response generation is defined as follows: given a monolingual stylistic dataset $S = \{S_1, \dots, S_N\}$ and a conversational dataset $C$ of paired samples $(X_i, Y_i)$, where $S_i$, $X_i$, and $Y_i$ denote a stylistic sentence, a dialogue history, and a response respectively, the goal is to learn a generation model $P(\hat{Y} \mid X)$, where $\hat{Y}$ is a generated response expected to be in the style of $S$ (called the desired style in the following sections). We will first briefly review the concept of structured latent space and then introduce our disentanglement approach.
Figure 2: An example of a dialogue in the structured latent space. The center point corresponds to the dialogue history representation $Z_{\mathrm{S2S}}(X_i)$. The $k$-th response representation $Z_{\mathrm{AE}}(Y_i^k)$ (denoted by a black point) is optimized to be distributed around $Z_{\mathrm{S2S}}(X_i)$. The red point $Z_{\mathrm{AE}}(S_j)$ and the purple point $Z(\hat{Y}_i)$ are representations of a monolingual stylistic sentence and a stylistic response, respectively.

Background: Structured Latent Space
Overview The structured latent space is constructed by two main mechanisms: (i) sharing a decoder between a sequence-to-sequence (S2S) model and an auto-encoder (AE), and (ii) fusion and smoothness objectives. As the example in Figure 2 shows, a response representation $Z_{\mathrm{AE}}(Y_i)$ is regularized by the two mechanisms to be distributed around its dialogue history representation $Z_{\mathrm{S2S}}(X_i)$. The notations $Z_{\mathrm{AE}}(\cdot)$ and $Z_{\mathrm{S2S}}(\cdot)$ denote the representations computed by the AE encoder and the S2S encoder, respectively. Such a latent space makes it possible to predict a response $\hat{Y}$ by sampling near the dialogue history representation. Based on that, Gao et al. (2019b) further align stylistic sentence representations into the latent space, which improves the style intensity of generated responses. In summary, the construction of the structured latent space is a process of aligning the three spaces ($Z_{\mathrm{S2S}}(X_i)$, $Z_{\mathrm{AE}}(Y_i)$, and $Z_{\mathrm{AE}}(S_j)$) by two mechanisms (sharing the decoder, and fusion and smoothness objectives).
Fusion Objective The fusion objective cross-aligns sentences of different spaces. Since $X_i$ and $Y_i$ are paired, we align them by minimizing their pair-wise dissimilarity, which is computed from the Euclidean distance $d_E$ between $Z_{\mathrm{S2S}}(X_i)$ and $Z_{\mathrm{AE}}(Y_i)$, the batch size $n$, and the dimensionality $l$ of the latent space. In contrast, the pair-wise dissimilarity cannot be applied to stylistic sentences since they are not paired with conversational data. The fusion objective instead optimizes the nearest-neighbor distance between the two datasets, expressed through $d^{\mathrm{cross}}_{\mathrm{NN}}(\{a_i\}, \{b_j\})$, the batch-average distance between $a_i$ and its nearest neighbor in the set $\{b_j\}$. To further encourage the representations to spread out over the latent space, an inner-distance loss is introduced, based on $d^{\mathrm{inner}}_{\mathrm{NN}}(\{a_i\})$, the batch-average distance between $a_i$ and its nearest neighbor in the set $\{a_i\}$. The final fusion objective combines these terms.
Smoothness Objective The smoothness objective aims to make the structured latent space a continuous space, where each point can be decoded into a natural sentence. Given three discrete points $Z_{\mathrm{S2S}}(X_i)$, $Z_{\mathrm{AE}}(Y_i)$, and $Z_{\mathrm{AE}}(S_j)$, the objective encourages points in the area between $Z_{\mathrm{S2S}}(X_i)$ and $Z_{\mathrm{AE}}(Y_i)$ to generate $Y_i$. Such points are sampled using Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and an interpolation weight $U \sim \mathcal{U}(0, 1)$. Meanwhile, as a point moves from $Z_{\mathrm{AE}}(Y_i)$ to $Z_{\mathrm{AE}}(S_j)$, the corresponding generation is expected to gradually move from $Y_i$ to $S_j$. The smoothness objective $L_{\mathrm{smooth}}$ is the sum of $L_{\mathrm{smooth,conv}}$ and $L_{\mathrm{smooth,style}}$, and is added to the final loss function along with the fusion objective and the response generation loss of S2S.
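The two nearest-neighbor distances used by the fusion objective can be illustrated with a minimal NumPy sketch. The function names are hypothetical, and the sketch assumes that a point is not counted as its own neighbor in the inner distance; it only illustrates the distances as described in the text and is not the released implementation.

```python
import numpy as np

def pairwise_euclidean(a, b):
    """Return the (len(a), len(b)) matrix of Euclidean distances."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cross_nn_distance(a, b):
    """Batch-average distance from each a_i to its nearest neighbor in {b_j}."""
    return pairwise_euclidean(a, b).min(axis=1).mean()

def inner_nn_distance(a):
    """Batch-average distance from each a_i to its nearest other point in {a_i}."""
    dist = pairwise_euclidean(a, a)
    np.fill_diagonal(dist, np.inf)  # assumption: a point is not its own neighbor
    return dist.min(axis=1).mean()
```

A small cross distance means the stylistic and conversational representations overlap, while a large inner distance means the points of one set are spread out rather than collapsed.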

Our Method
Although aligning monolingual stylistic sentences into the structured latent space helps stylize generated responses, their style intensity is still limited. We conjecture this is due to the coupling of the style and the content in sentence representations. To this end, we propose to disentangle the two aspects in the structured latent space. In our proposed approach, a sentence representation $Z \in \mathbb{R}^{l}$ in the latent space consists of two components: a content representation $Z^c \in \mathbb{R}^{l_c}$ and a style representation $Z^s \in \mathbb{R}^{l_s}$, where $l$ is the dimensionality of the latent space and $l_c + l_s = l$. $Z^s$ encodes all the style information of a sentence. It is a corpus-level feature because $Z^s$ for different sentences in the same corpus should be similar. In contrast, $Z^c$ can be seen as a sentence-level feature which is decided only by the content of its corresponding sentence. Figure 3 shows an example of our approach, where $Z^c$ and $Z^s$ can be seen as two "containers". Colored squares represent the content and style information. We encourage the disentanglement of the two types of information by diluting sentence-level content information in $Z^s$. As the example in Figure 3 (a) shows, the content and style information may be mixed in both $Z^c$ and $Z^s$. During the decoding process of a sentence, e.g., $Y_i$, we replace its style representation with the batch-average style representation. In this way, its sentence-level content information will be diluted since it varies greatly from other sentences' content information, which introduces extra noise. In contrast, its corpus-level style information, which is similar to that of other sentences within the batch, will remain unaffected. As training proceeds, the content information will be encouraged to be encoded into $Z^c$ where it can remain unchanged, as the example in Figure 3 (b) shows. Otherwise, the content information will be corrupted in $Z^s$, making it hard to recover the content of $Y_i$. As a result, the encoding process will be penalized by the response generation loss of S2S and the reconstruction loss of AE, as shown in Figure 3 (a).
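A minimal sketch of the diluting step, under two assumptions that are not stated in the text: latent vectors are laid out with the content part first and the style part last, and the function name `dilute_style` is hypothetical.

```python
import numpy as np

def dilute_style(z, l_c):
    """Replace every sentence's style part with the batch-average style part.

    z   : (batch, l) latent vectors, assumed to be laid out as [Z_c : Z_s]
    l_c : number of leading dimensions holding the content representation
    """
    z_c, z_s = z[:, :l_c], z[:, l_c:]
    z_s_bar = z_s.mean(axis=0, keepdims=True)      # batch-average style
    z_s_bar = np.broadcast_to(z_s_bar, z_s.shape)  # same vector for every sentence
    return np.concatenate([z_c, z_s_bar], axis=1)  # [Z_c : averaged Z_s]
```

Anything stored in the style slots that differs from sentence to sentence is averaged away, whereas information shared across the batch survives the mean. This is why sentence-specific content is only recoverable by the decoder if the encoder places it in the content slots.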
Based on this dilution strategy, we update the response generation process by replacing the style representation $Z^s$ with the corresponding batch-average style representation $\bar{Z}^s$, so that the decoder is conditioned on $[Z^c : \bar{Z}^s]$, where the bracket $[:]$ denotes concatenation. The decoding process in the smoothness objective is updated similarly. Note that when we move from $Y_i$ to $S_j$, and from $X_i$ to $Y_i$, we only interpolate their content representations $Z^c$ in the latent space (Equation 8). The batch-average style representation $\bar{Z}^s$ remains consistent with the target, i.e., it is $\bar{Z}^s_{\mathrm{AE}}(S_j)$ when the target is $S_j$. The smoothness objective is updated accordingly (Equation 9). The final training loss is the sum of the response generation loss, the fusion objective, and the smoothness objective.
Here, we do not employ pre-trained models, e.g., DialoGPT (Zhang et al., 2020b) and OpenAI GPT2 (Radford et al., 2019). This is because the disentanglement is usually conducted on a sentence representation, while most pre-trained models depend on the attention mechanism and have no static global sentence representation during the decoding process.
Figure 3: An example of disentangling content and style. The purple block is the content information of the first sentence. The yellow block is the content information of the second sentence. Style information in both sentences is denoted by red blocks as it is a corpus-level feature shared among samples within the corpus. (a): A negative example whose content and style information is mixed in $Z^c$ and $Z^s$. Its content information is corrupted after averaging $Z^s$ within the batch, and the input content cannot be recovered. (b): A positive example. Content information in $Z^c$ and style information in $Z^s$ will not be affected after averaging $Z^s$.
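The content-only interpolation can be sketched as follows. The convex-combination form, the placement of the Gaussian noise on the content part only, and the name `interpolate_content` are assumptions made for illustration; the default `sigma=0.1` follows the value reported in the experiment settings.

```python
import numpy as np

def interpolate_content(z_c_a, z_c_b, z_s_bar_target, sigma=0.1, rng=None):
    """Sample a point between two content vectors and attach the target style.

    z_c_a, z_c_b   : content representations of the two endpoints
    z_s_bar_target : batch-average style representation of the decoding target
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(0.0, 1.0)                       # U ~ U(0, 1)
    eps = rng.normal(0.0, sigma, size=z_c_a.shape)  # eps ~ N(0, sigma^2 I)
    z_c = u * z_c_a + (1.0 - u) * z_c_b + eps       # assumed interpolation form
    return np.concatenate([z_c, z_s_bar_target]), u
```

Only the content part moves between the endpoints; the style part is fixed to that of the decoding target and is never interpolated.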

Inference
To generate a stylistic response $\hat{Y}_i$ given dialogue history $X_i$ during the inference process, we first obtain $Z^c_{\mathrm{S2S}}(X_i)$ by the S2S encoder and subsequently sample $Z^c(\hat{Y}_i)$ from the hypersphere of $Z^c_{\mathrm{S2S}}(X_i)$ with a manually tuned radius $r$. After that, we generate $\hat{Y}_i$ by concatenating $Z^c(\hat{Y}_i)$ and $\bar{Z}^s_{\mathrm{AE}}(S_j)$, which is the batch-average style representation of randomly sampled stylistic sentences.
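The sampling step can be sketched as below. The text does not say whether points are drawn on the surface of the hypersphere or inside it; this sketch assumes the surface (distance exactly $r$ from the center). Both function names are hypothetical.

```python
import numpy as np

def sample_on_hypersphere(center, r, rng=None):
    """Draw a point at distance r from center in a uniformly random direction."""
    rng = np.random.default_rng() if rng is None else rng
    direction = rng.normal(size=center.shape)
    direction /= np.linalg.norm(direction)  # unit vector with a uniform direction
    return center + r * direction

def build_decoder_input(z_c_history, style_batch, r=3.0, rng=None):
    """Concatenate a sampled content vector with the batch-average style vector.

    z_c_history : content representation of the dialogue history
    style_batch : (batch, l_s) style representations of sampled stylistic sentences
    """
    z_c_hat = sample_on_hypersphere(z_c_history, r, rng)
    z_s_bar = style_batch.mean(axis=0)
    return np.concatenate([z_c_hat, z_s_bar])
```

Normalizing a standard Gaussian vector gives a direction that is uniform over the sphere, which avoids biasing the sampled responses toward any coordinate axis.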
Considering the discrepancy between training and inference, namely that content and style representations from different corpora are never concatenated for generation during training, we propose a soft combination approach that introduces the desired style by interpolating $Z^s_{\mathrm{S2S}}(X_i)$ and $\bar{Z}^s_{\mathrm{AE}}(S_j)$ into $Z^s_{\mathrm{soft}}$ (Equation 11), where $\alpha$ is the weight of the desired style. After that, $\hat{Y}_i$ is generated by the decoder whose hidden state is set to $[Z^c(\hat{Y}_i) : Z^s_{\mathrm{soft}}]$. To further balance style intensity and content relevance, we also employ the re-ranking strategy of Gao et al. (2019b). It samples $N_y$ candidate responses and re-ranks them by a score (Equation 12) that combines $P_{\mathrm{S2S}}(\hat{Y}_i \mid X_i)$, the generation probability under an S2S model, which measures relevance, and $P_{\mathrm{style}}(\hat{Y}_i)$, the probability that $\hat{Y}_i$ has the desired style. The latter is an interpolation between the probabilities of a neural classifier and an n-gram classifier, where $w_n$ is a weight set to the accuracy of the corresponding classifier.
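A minimal sketch of the soft combination and the re-ranking step. The linear interpolation with weight $\alpha$ on the desired style follows the description above; the linear weighted sum used for the re-ranking score is an assumption, since the exact scoring formula is not recoverable from the text. The function names are hypothetical, and the defaults of 0.5 for the two re-ranking weights follow the experiment settings.

```python
import numpy as np

def soft_style(z_s_history, z_s_bar_style, alpha):
    """Interpolate the history's style with the desired style (weight alpha)."""
    return alpha * z_s_bar_style + (1.0 - alpha) * z_s_history

def rerank(candidates, p_s2s, p_style, gamma=0.5, eta=0.5):
    """Sort candidates by an assumed linear mix of relevance and style scores."""
    scores = gamma * np.asarray(p_s2s) + eta * np.asarray(p_style)
    order = np.argsort(-scores, kind="stable")  # highest score first
    return [candidates[i] for i in order]
```

With `alpha = 0` the decoder sees only the style of the dialogue history, and with `alpha = 1` it sees only the desired style, so `alpha` directly controls the trade-off between style intensity and content relevance.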

Data
Conversational Dataset We employ DailyDialog as our conversational dataset $C$. It is a human-written multi-turn dataset covering various topics of daily life. Table 1 shows some statistics of its training, validation, and test sets. We split a dialogue of $K$ utterances into $K-1$ samples. Each sample consists of at most three consecutive utterances. The last utterance of a sample is regarded as the response, and the preceding utterances are concatenated as its dialogue history. Unlike Gao et al. (2019b), we do not use the Reddit dataset, because post-reply data collected from social networks is noisy and differs from real conversations.
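The splitting procedure can be sketched as a sliding window over the utterances. Joining the history with a single space and the name `split_dialogue` are assumptions for illustration.

```python
def split_dialogue(utterances, max_utterances=3):
    """Split a dialogue of K utterances into K-1 (history, response) samples.

    Each sample spans at most max_utterances consecutive utterances; the last
    one is the response and the preceding ones form the dialogue history.
    """
    samples = []
    for end in range(1, len(utterances)):
        start = max(0, end - (max_utterances - 1))
        history = " ".join(utterances[start:end])  # assumed separator: one space
        samples.append((history, utterances[end]))
    return samples
```

For a four-utterance dialogue this yields three samples, the last of which uses the second and third utterances as history and the fourth as the response.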
Monolingual Stylistic Dataset Following Gao et al. (2019b), we use Holmes as the stylistic dataset $S$. It is collected from the Sherlock Holmes novel series and consists of roughly 38k sentences. We do not use the arXiv dataset as it contains too many special tokens, e.g., equations, and incomplete sentences, such as "is concerned" and "exactly identical restrictions".

Baselines
We compare the proposed approach with the following baselines:
• S2S, the sequence-to-sequence response generation model (Shang et al., 2015).
• S2S+LM, an S2S model trained on C and a stylistic language model trained on S (Niu and Bansal, 2018). During the inference process, it generates a stylistic response by interpolating the outputs of the two models.
• Style Fusion, a multi-task learning based model whose latent space fuses dialogue history, responses, and stylistic sentences with a specific structure (Gao et al., 2019b).
Note that we do not consider the Label-Fine-Tuning model and Polite Reinforcement Learning model (Niu and Bansal, 2018), because they require some training samples in the conversational dataset to have the desired style (Gao et al., 2019b).

Experiment Settings
We implement the proposed approach based on the released code of the Style Fusion model. The vocabulary consists of the 20,000 most frequent words. The S2S encoder, the AE encoder, and the shared decoder are two-layer LSTMs. The number of their hidden units is 1000, which is also the size of the structured latent space. The dimensions of $Z^c$ and $Z^s$ are 950 and 50, respectively. The maximum length is set to 90 for the dialogue history and 30 for the response.
During the training process, we use the ADAM optimizer, whose learning rate is 0.0003. $\sigma^2$ for sampling in Equation 8 is $0.1^2$. Table 2 shows the average running time on a single TITAN X (Pascal) GPU. During the inference process, the weights $\gamma$ and $\eta$ for re-ranking are set to 0.5. The weight (accuracy) of the n-gram classifier is 0.93, 0.87, 0.77, and 0.65 for $n$ from 1 to 4. The number of candidate responses, $N_y$, is set to 10. The radius $r$ is set to 3.

Evaluation Metrics
Automatic Evaluation Considering that it is unfair to evaluate a response by the classifiers that are used for selecting the response (Song et al., 2020), we fine-tune a BERT (Devlin et al., 2019) classifier for evaluation. Its negative samples are randomly selected from DailyDialog's responses and are equal in number to the positive samples. Given the fine-tuned BERT classifier (whose accuracy reaches 0.96 on the validation set), we report the average probability of responses being positive as a measurement of style intensity. For brevity, we denote this metric as SI. Content relevance is evaluated by BLEU. Since BLEU may correlate weakly with human judgments of quality in a single-reference setting (Liu et al., 2016), we employ the expanded responses in the multi-reference DailyDialog test set (Gupta et al., 2019) as references to alleviate the problem. Meanwhile, we evaluate diversity by Dist-k (Li et al., 2016), which is the number of distinct k-grams normalized by the total number of words in the responses.
Table 3: Automatic evaluation results of SI, Dist-1, Dist-2, and BLEU. The last column is the harmonic mean of SI and BLEU-4, measuring the overall performance of style intensity and content relevance.
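The simpler automatic metrics can be sketched in a few lines. Whitespace tokenization and the function names are assumptions; the classifier probabilities passed to `style_intensity` would come from the fine-tuned BERT classifier, which is not reproduced here.

```python
def dist_k(responses, k):
    """Distinct k-grams divided by the total number of words in all responses."""
    ngrams = set()
    total_words = 0
    for response in responses:
        tokens = response.split()  # assumption: whitespace tokenization
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return len(ngrams) / total_words if total_words else 0.0

def style_intensity(positive_probs):
    """SI: average classifier probability that a response is in the desired style."""
    return sum(positive_probs) / len(positive_probs)

def harmonic_mean(si, bleu):
    """Harmonic mean of SI and BLEU-4, as used for the overall score."""
    return 2 * si * bleu / (si + bleu) if si + bleu else 0.0
```

The harmonic mean is dominated by the smaller of its two inputs, so a model cannot obtain a high overall score by maximizing style intensity while neglecting relevance, or the reverse.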
Human Evaluation We randomly sample 200 messages from the test set of $C$ to conduct the human evaluation from two aspects: style intensity and content relevance. Each aspect is independently evaluated by five Amazon Mechanical Turk (AMT) workers whose approval rate is greater than 95% and whose number of approved tasks is greater than 500. Given a dialogue history and two responses generated by a baseline and our approach, the workers are asked to indicate which one is better (ties are also permitted).
Figure 4 shows the trade-off between style intensity and content relevance in our approach. SI improves and BLEU decreases as $\alpha$ in Equation 11 increases. To assess the overall performance, we also compute their harmonic mean, whose maximum lies around $\alpha = 0.5$. We thus conduct the human evaluation and analysis in this parameter setting.
We report the human evaluation results in Table 4. Our approach is clearly preferred in style intensity because the percentage of Win is significantly higher than that of Lose ($p < 0.001$, t-test). In terms of content relevance, the ratios of Win in "vs. S2S" and "vs. Style Fusion" are similar to those of Lose. This suggests that our approach can significantly improve the style intensity without decreasing the content relevance. In contrast, S2S+LM loses in most cases in content relevance. Following Zhou et al. (2018) and Ke et al. (2018), we evaluate the agreement of annotators via inter-rater consistency. The percentage of samples for which at least three annotators have the same preference (3/5 agreement) is 81.80%, and the percentage for 4/5 agreement is 32.15%.
Table 3 shows the results of the automatic evaluation. Our approach has the highest mean score, which indicates that it achieves the best overall performance. S2S+LM has a high SI score, but its BLEU scores are lower than those of the other models, e.g., S2S. This is in line with our human evaluation results and with the observation of Niu and Bansal (2018) that biasing a decoder with a stylistic language model may harm content relevance. In contrast, our approach ($\alpha = 0.25$) significantly outperforms S2S and is comparable to Style Fusion. When $\alpha$ is increased to 0.5, the BLEU score drops slightly but remains comparable to the baselines (as evidenced by the human evaluation results). Meanwhile, there is a significant improvement (up to 95.37%) in SI compared with Style Fusion. This verifies the effectiveness of our disentanglement approach in improving the style intensity and maintaining the content relevance. Besides, the Dist-k results in Table 3 also indicate that the diversity of our approach is comparable to that of the best-performing Style Fusion.

Ablation Study
We conduct ablation studies to investigate the contributions of the fusion objective, the smoothness objective, and our disentanglement approach. To focus on their effects on the generation process, in this section, we sample a single response without using the re-ranking strategy (Equation 12). Table 5 shows the results of the ablation study. There is a significant decline in SI and only a slight change in BLEU-3 and BLEU-4 after removing each component. This indicates that a multi-task learning architecture without the three components can achieve good content relevance but fails to stylize a response. By removing the disentanglement component, our approach degenerates into Style Fusion. In this case, the SI score decreases significantly while the BLEU scores are nearly unchanged, which demonstrates that the disentanglement improves the style intensity and maintains the relevance at the same time. The decreases in SI after removing the fusion objective and the smoothness objective are larger than that after removing the disentanglement. This is because the two objectives are the foundational components for constructing the structured latent space, upon which both our approach and Style Fusion are built.

Analysis
In this section, we analyze whether style information is disentangled into $Z^s$. To achieve this goal, we train style classifiers that take a latent variable as input and use the validation accuracy as an indicator. Taking our approach as an example, we first freeze the parameters of our well-trained model. Then we independently learn two style classifiers whose inputs are the full latent variable ($[Z^c : Z^s]$) and $Z^s$, respectively. Note that $Z^c$ and $Z^s$ in Style Fusion are a simple partition of its latent variable; no disentanglement approach is applied to obtain the two representations. As shown in Table 6, Style Fusion achieves 0.83 validation accuracy when trained on its full latent variable, and the accuracy decreases by 13.02% when the classification is based only on $Z^s$. In contrast, the decrease for our approach is only 1.71%, indicating that most of the style information is disentangled into $Z^s$.
We show a visualization of the disentanglement of the latent variable by MDS (Borg and Groenen, 2005) in Figure 5. Each figure consists of $Z^s$ (black) and three contiguous sub-sequences extracted from the head (yellow), middle (red), and tail (blue) of $Z^c$. The sub-sequences are of the same length as $Z^s$. For both stylistic and conversational samples, all the sub-sequences and $Z^s$ are mixed in Style Fusion. In contrast, there is a clear separation between $Z^s$ and the sub-sequences in our approach. This is because most of the style information is disentangled into $Z^s$ in our approach, making its distribution different from that of the sub-sequences of $Z^c$.
Table 7 shows some examples of generated responses. There is no noticeable Holmes style in the responses of S2S. Similarly, the style intensity of the responses of Style Fusion is also limited. The semantics of S2S+LM's response in the first example is not very clear, making it less relevant to the dialogue history than the other responses. We believe this is also due to the lack of interaction between the response generation encoder and the stylistic language model. In contrast, our approach not only achieves good content relevance but also exhibits a pronounced Holmes style, which is quite polite and formal.
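For the visualization in the analysis above, the slices compared by MDS can be extracted as in the following sketch. The content-first layout of the latent vector, the exact position of the "middle" slice, and the function name are assumptions; the default `l_c=950` follows the experiment settings.

```python
import numpy as np

def split_for_visualization(z, l_c=950):
    """Return Z_s and three Z_c slices of the same length (head, middle, tail).

    z : (num_samples, l) latent vectors, assumed to be laid out as [Z_c : Z_s]
    """
    z_c, z_s = z[:, :l_c], z[:, l_c:]
    l_s = z_s.shape[1]
    mid = (l_c - l_s) // 2  # assumption: the middle slice is centered in Z_c
    return {
        "style": z_s,
        "head": z_c[:, :l_s],
        "middle": z_c[:, mid:mid + l_s],
        "tail": z_c[:, l_c - l_s:],
    }
```

Because all four slices have the same dimensionality, they can be embedded jointly by a single MDS projection, and a separate cluster for the style slice indicates that it carries different information from the content slices.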

Text Style Transfer without Parallel Data
The task of text style transfer aims at transferring the style of a sentence while preserving its meaning. One way is to disentangle the content and style, and subsequently combine the content with the desired style. The disentanglement can be achieved by adversarial learning (Hu et al., 2017; Fu et al., 2018; Yang et al., 2018; Logeswaran et al., 2018), reinforcement learning (Jain et al., 2019), back-translation (Prabhumoye et al., 2018; Nogueira dos Santos et al., 2018), multi-task learning (John et al., 2019), and removing stylistic phrases (Zhang et al., 2018b). The other way transfers the style without disentangled representations, for example using a generator-evaluator architecture (Gong et al., 2019), cycle reconstruction (Dai et al., 2019), parameter sharing, and data augmentation (Zhang et al., 2020a). The main difference between our task and text style transfer lies in two aspects. First, in text style transfer all the content to be generated is available in the input, while our task needs to create new (response) content, so the key is content relevance to the dialogue history rather than content preservation of the input. Second, the data for text style transfer is homogeneous: data in different styles are in the same free-text format. In contrast, our conversational data are context-response pairs while the stylistic data are free texts; this heterogeneity requires more sophisticated structures, e.g., the structured latent space (Gao et al., 2019b).

Stylistic Response Generation without Parallel Stylistic Data
Niu and Bansal (2018) propose three weakly supervised models based on reinforcement learning, conditional text generation, and a language model. Gao et al. (2019b) fuse the latent spaces of a response generation model and a stylistic auto-encoder to improve the style intensity of sampled responses. Yang et al. (2020) inject style information by introducing a word-level KL loss and a sentence-level style classifier into the fine-tuning process of DialoGPT (Zhang et al., 2020b). Distinct from previous work, we explicitly disentangle the style and content in the latent space and employ a unified architecture to jointly optimize the style intensity and content relevance.

Conclusion
We propose a unified framework to simultaneously improve style intensity and maintain content relevance for neural stylistic response generation. In contrast to existing approaches, our approach disentangles the style and the content in the latent space by a diluting strategy. Experiments show that our approach improves the style intensity of generated responses while maintaining content relevance, which demonstrates its effectiveness.