Fine-grained Text Style Transfer with Diffusion-Based Language Models

Diffusion probabilistic models have shown great success in controllable high-quality image generation, and researchers have tried to bring this controllability to the text generation domain. Previous work on diffusion-based language models has shown that they can be trained without external knowledge (such as pre-trained weights) and still achieve stable performance and controllability. In this paper, we train a diffusion-based model on the StylePTB dataset, the standard benchmark for fine-grained text style transfer. The tasks in StylePTB require much more refined control over the output text than the tasks evaluated in previous work, and our model achieves state-of-the-art performance on StylePTB on both individual and compositional transfers. Moreover, our model, trained on limited data from StylePTB without external knowledge, outperforms previous work that utilized pre-trained weights, embeddings, and external grammar parsers, which suggests that diffusion-based language models have great potential in low-resource settings.


Introduction
Diffusion probabilistic models (Ho et al., 2020) have become the state-of-the-art technique for visual generative tasks. By starting from random Gaussian noise and gradually denoising, they are able to generate images that look realistic in detail. Moreover, conditional diffusion models such as Stable Diffusion (Rombach et al., 2022) achieve detailed control over the generated output by conditioning on text, layouts, etc. The generated images are faithful to the text description or layout, often down to the finest details.
Analogously, researchers have tried to leverage the controllability of diffusion models for more controllable language generation. For example, DiffuSeq (Gong et al., 2022) applies diffusion models to sequence-to-sequence text generation tasks such as paraphrasing, question generation, and text simplification; Diffusion-LM (Li et al., 2022) combines diffusion models with language models to control language generation by specifying generation length, syntax tree, semantic context, etc. What makes these diffusion-based language models impressive is that they are trained from scratch with zero external knowledge (i.e., no pre-trained word embeddings or model weights, no external grammar parsers, etc.) and on very little data (on the order of 10^5 tokens) compared to any large language model (for example, GPT-3's (Brown et al., 2020) training data is on the order of 10^11 tokens), so they have to learn representations at all levels (word embeddings, sentence structures, etc.) from scratch with very limited data.
However, while the tasks previously evaluated on Diffusion-LM and DiffuSeq require a degree of control over the generated output, they do not involve modifying existing text to exhibit specific stylistic characteristics. In this paper, we further examine the capabilities of diffusion-based language models on fine-grained text style transfer, an important task that requires more fine-grained control than the tasks in previous work on diffusion-based language modeling, because it only allows changing the specified fine-grained stylistic properties of the input while leaving everything else unchanged. For example, "verb emphasis" is a fine-grained style transfer that requires the model to rewrite the sentence to emphasize a certain verb, without changing any other information that the original sentence conveys. In comparison, previous evaluation tasks such as controlling sequence length or semantic context essentially control one aspect at a time and require no control over any other properties of the generated text.
We use the 13 non-lexical transfers from StylePTB (Lyu et al., 2021). Table 1 presents one example sentence pair before/after each transfer, as well as the total number of sentence pairs available for each transfer in StylePTB. As the examples show, the transfers require changing one specific stylistic aspect of the sentence while leaving all other aspects unchanged, and the amount of training data is limited: there are at most a few thousand sentence pairs available for each transfer, far less than the amount typically required to train large language models nowadays.
Since identifying the grammatical structure of a sentence can be very helpful for most of these transfers (such as active-to-passive), some previous methods (such as Neural QCFG (Kim, 2021)) utilize external grammar parsers to gain such information. We trained a diffusion-based model on StylePTB data without any pre-trained weights or external grammar parsers. Therefore, our model has to start from zero grammatical/linguistic knowledge and learn all of it from very limited training data (StylePTB only has 7719 sentences from the Penn Treebank (Marcus et al., 1993) plus their transferred outputs). Even under these hard conditions, our model still manages to outperform previous works that do utilize external weights or grammar parsers. Moreover, we also evaluate the capabilities of diffusion-based language models on performing multiple transfers with one single model and on composing multiple learned transfers on a single sentence. We list our contributions as follows:
• We trained a diffusion-based language model (adapted from DiffuSeq (Gong et al., 2022)) that can perform fine-grained text style transfer from scratch with very limited training data and no external weights or tools. The model also supports multitasking and composing multiple fine-grained transfers.
• Our model achieves state-of-the-art performance on fine-grained text style transfers in StylePTB. Our multitask model (i.e., one single model that can perform all 13 transfers) outperforms previous works on the same tasks on 88 out of 91 metrics (7 metrics per transfer), and gets very close to human performance on tasks of easy and medium difficulty. We also evaluated our model on compositions of multiple fine-grained transfers, and we achieved the best performance on these tasks as well.
• Through these evaluations, we demonstrate the extraordinary capability of diffusion-based language models to assert extremely fine-grained control over generated text, and we show that this type of language model has great potential in controllable natural language generation under low-resource settings, as it is able to achieve state-of-the-art performance with limited training data and no external knowledge.

Fine-grained Text Style Transfer and StylePTB
An important challenge for AI is to convey intentions using different stylistic attributes, and automated text style transfer is an essential step toward that goal. Text style transfer aims to controllably convert source text into text with targeted stylistic properties, with important applications in human-AI interaction, including dialog systems (Celikyilmaz et al., 2018) and intelligent agents (Kim et al., 2013; Liang et al., 2020; Pittermann et al., 2010) that can communicate with specific text styles for different situations, target audiences, and environments (Lample et al., 2019). There has been extensive research on high-level style transfers such as sentiment transfer (Shen et al., 2017) and formality transfer (Rao and Tetreault, 2018). However, high-level style transfers lack the ability to fully control the style of the output. For example, there are many ways to convert a positive comment about a restaurant into a negative one, and high-level text style transfer does not allow control over which of the possible outputs (which may differ in non-sentiment stylistic aspects) is generated. Fine-grained text style transfer is important because it allows fine-grained control over the generated output. Lyu et al. (2021) defined a set of fine-grained text style transfers along four linguistic axes:
• Lexical Transfers: Word changes
• Syntax Transfers: Grammar and sentence structure changes
• Semantic Transfers: Meaning changes
• Thematic Transfers: Situational changes or word emphasis
Along these 4 axes, they defined 21 individual fine-grained transfers, 13 of which are non-lexical. Examples of the non-lexical transfers are shown in Table 1. Compared to other forms of controllable text generation, fine-grained text style transfer has the advantage of being able to assert control over text generated by uncontrollable models.
For example, we can use fine-grained text style transfers to add specific stylistic properties to free-form text generated by large language models while keeping the content of the generated text unchanged.
Fine-grained text style transfers can be composed to achieve higher-level style transfers, and they even have the potential to mitigate social bias in large text generation models (Lyu et al., 2021). Therefore, it is important to develop techniques for automated fine-grained text style transfer. Existing works still fall far short of human performance on many of the fine-grained style transfers (Lyu et al., 2021; Kim, 2021), and composing multiple fine-grained style transfers remains challenging.

Diffusion Probabilistic Models
Recently, diffusion models (Ho et al., 2020) have been widely used to generate high-quality and diverse images. Their methodology consists of two phases. The first is the forward diffusion phase, which adds Gaussian noise to the input image x_0 as the timestep increases; after enough steps, the image is reduced to pure Gaussian noise x_T. The second is the recovery phase, in which a model is trained to gradually remove noise from x_T until it recovers the original image x_0. During inference, we start from randomly sampled Gaussian noise x_T and use the denoising model to gradually infer an image x_0.
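The forward phase described above can be simulated in closed form: instead of adding noise one step at a time, x_t can be sampled directly from x_0. The following is a minimal plain-Python sketch; the schedule endpoints (1e-4 and 0.02) are the common DDPM defaults, an assumption rather than values taken from this paper.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing per-step noise variances beta_1..beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def q_sample(x0, t, bars, rng=random):
    """Sample x_t ~ q(x_t | x_0) in one shot:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    a = bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]
```

As t grows, alpha_bar_t shrinks toward zero, so x_T is almost pure Gaussian noise regardless of the input x_0.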
Diffusion-based language generation models follow a similar approach, performing the diffusion and denoising process in the token embedding space. We explain our model, which is built upon DiffuSeq (Gong et al., 2022), in detail in the next section.

Methodology
We adapt DiffuSeq (Gong et al., 2022) to perform fine-grained text style transfer given a source sentence and specified transfer operation(s), as illustrated in Figure 1. We model the transfer as a conditional generation process, where the condition includes the source sentence and the specified transfer operation(s). We first define a set of special style tokens, one for each possible individual fine-grained transfer. To perform one or more transfers on the source sentence, we prepend the corresponding special token(s) to the beginning of the source sentence to form the condition S.
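Concretely, forming the condition S amounts to string concatenation. A minimal sketch follows; the token strings and transfer names are hypothetical placeholders, not the exact tokens used in our implementation.

```python
# Hypothetical style-token strings -- one special token per individual transfer.
STYLE_TOKENS = {
    "to_future": "<TO_FUTURE>",
    "active_to_passive": "<ACT_TO_PASS>",
    "pp_removal": "<PP_REMOVAL>",
}

def build_condition(source, transfers):
    """Prepend one special style token per requested transfer to the source
    sentence; the resulting string is the condition S."""
    prefix = " ".join(STYLE_TOKENS[name] for name in transfers)
    return f"{prefix} {source}" if prefix else source
```

Composing multiple transfers simply means prepending more than one style token.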
We use the BERT tokenizer to tokenize the input into discrete token ids, and adopt a token embedding layer to encode both the source (including prepended style tokens) and the ground-truth target sentence (during training), obtaining the embedded source Z^S and target Z^TRG_0. For the diffusion process, we use a transformer model to recover the target embedding. Both the diffusion transformer and the token embeddings are initialized randomly and jointly optimized. In other words, our model does not rely on any prior knowledge about our task or the English language in general.

Figure 1: An illustration of the training and inference process of our diffusion-based language model. The diffusion process is performed over the sequence of token embeddings of the target sentence Z^TRG_0, and the source sentence's token embeddings Z^S are concatenated before Z^TRG. During the backward diffusion process, the combined sequence is fed into the transformer model to gradually recover/generate Z^TRG_0.
We use the simplified diffusion objective during training: for each input (S, TRG), where S is the source sentence (with style tokens) and TRG is the ground-truth target sentence, we randomly sample a step number t from 1, 2, ..., T, where T is the maximum number of steps, and add t steps of random Gaussian noise to Z^TRG_0 following a linear diffusion schedule to obtain Z^TRG_t. We then concatenate Z^S and Z^TRG_t and feed the concatenated sequence into our diffusion transformer, where we only take the output embeddings at the locations corresponding to Z^TRG_t as Z'^TRG_0. Our training objective is simply the MSE loss between Z'^TRG_0 and Z^TRG_0.

During inference, we randomly initialize Z'^TRG_T ~ N(0, 1) and encode the condition (source sentence and style tokens) into Z^S. We then concatenate them and use our transformer to predict a temporary Z'^TRG_0, and add T-1 steps of noise back to this temporary prediction to obtain Z'^TRG_{T-1}. We repeat this process until we get Z'^TRG_0. For each embedding in Z'^TRG_0, we find the closest embedding in our token embedding layer by cosine distance and decode the embedding to that token. We then combine the tokens to form the output sentence in natural language.
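The training step and inference loop above can be sketched as follows. This is a simplified plain-Python illustration that assumes one scalar "embedding" per target position and an arbitrary denoiser callable standing in for the diffusion transformer; the real model operates on full embedding vectors and a learned network.

```python
import math
import random

def add_noise(z0, t, bars, rng=random):
    """Jump straight to step t: z_t = sqrt(abar_t)*z_0 + sqrt(1-abar_t)*eps."""
    a = bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in z0]

def training_loss(z_src, z_tgt0, T, bars, denoiser, rng=random):
    """Sample a step t, noise the target embeddings, predict Z'^TRG_0 from the
    concatenated [Z^S ; Z^TRG_t], and return the MSE against Z^TRG_0."""
    t = rng.randrange(T)
    z_tgt_t = add_noise(z_tgt0, t, bars, rng)
    pred = denoiser(z_src + z_tgt_t)[len(z_src):]  # keep target positions only
    return sum((p - g) ** 2 for p, g in zip(pred, z_tgt0)) / len(z_tgt0)

def generate(z_src, tgt_len, T, bars, denoiser, rng=random):
    """Start from pure noise; repeatedly predict Z'^TRG_0, then re-noise the
    prediction with one step fewer, until step 0 is reached."""
    z_t = [rng.gauss(0.0, 1.0) for _ in range(tgt_len)]
    for t in range(T - 1, -1, -1):
        pred0 = denoiser(z_src + z_t)[len(z_src):]
        z_t = add_noise(pred0, t - 1, bars, rng) if t > 0 else pred0
    return z_t

def nearest_token(vec, vocab_embs):
    """Round a predicted (vector) embedding to the closest vocabulary token
    by cosine similarity, as in the final decoding step."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den
    return max(vocab_embs, key=lambda tok: cos(vec, vocab_embs[tok]))
```

The key design choice, inherited from DiffuSeq, is that the condition Z^S is never noised: only the target positions undergo diffusion, while the source positions provide clean context at every step.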

Dataset
StylePTB (Lyu et al., 2021) contains paired sentences before/after each transfer for 21 fine-grained transfers, as well as paired data for compositions of multiple fine-grained transfers. For single transfers, we will focus on the 13 non-lexical fine-grained style transfers following (Lyu et al., 2021). The number of sentence pairs available from StylePTB for each transfer and examples of sentences before/after each transfer are shown in Table 1. For compositional transfers, we will use the Tense + Voice and Tense + PP Removal transfers from the compositional part of StylePTB dataset (same as the ones used for evaluation in (Lyu et al., 2021)). Each compositional dataset contains all combinations of valid transfers (for example, Tense + Voice dataset contains all valid combinations of 0/1/2 transfers regarding tense and voice, such as To-Future + Active-To-Passive or To-Past + No-Voice-Change).
StylePTB was built from only 7719 different sentences from the Penn Treebank (Marcus et al., 1993) plus their stylistic variations, so both the amount and the diversity of training data are very limited. This makes the task even more challenging for DiffuSeq, since it does not have access to external knowledge or pre-trained weights and has to extract all linguistic knowledge from limited data.
For fair comparison, we preprocess the data following the same criteria as (Lyu et al., 2021): we replace numbers with a NUM token, and we replace each word that occurs fewer than 3 times in the training set with an UNK token. We also split the data into train/valid/test splits with proportions of 0.9/0.05/0.05, using the same splits as all previous works.
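A sketch of this preprocessing in plain Python; the regular expression used to detect number tokens is our own guess at the detail, and the exact rule in the original pipeline may differ.

```python
import re
from collections import Counter

NUM_RE = re.compile(r"^[0-9][0-9.,]*$")

def build_preprocessor(train_sentences, min_count=3):
    """Replace number tokens with NUM, then replace every token seen fewer
    than min_count times in the training set with UNK."""
    def normalize(tok):
        return "NUM" if NUM_RE.match(tok) else tok
    counts = Counter(normalize(t) for s in train_sentences for t in s.split())
    def preprocess(sentence):
        toks = (normalize(t) for t in sentence.split())
        return " ".join(t if t == "NUM" or counts[t] >= min_count else "UNK"
                        for t in toks)
    return preprocess
```

Counting is done on the training split only, so validation/test sentences may map additional unseen words to UNK, as intended.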

Evaluation Metrics
We use the same evaluation methods as (Lyu et al., 2021) and report 7 metrics from the nlg-eval package (Sharma et al., 2017) (BLEU-1 through BLEU-4, METEOR, ROUGE-L, and CIDEr) between the generated transferred sentence and the ground-truth target sentence from the dataset.
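To illustrate what these overlap-based metrics measure, here is a hand-rolled sentence-level BLEU-1 (clipped unigram precision with a brevity penalty). In the actual experiments all seven metrics come from the nlg-eval package, not from this sketch.

```python
import math
from collections import Counter

def bleu1(hypothesis, reference):
    """Sentence-level BLEU-1: clipped unigram precision times the brevity
    penalty exp(1 - len(ref)/len(hyp)), applied when the hypothesis is
    shorter than the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    clipped = sum((Counter(hyp) & Counter(ref)).values())
    precision = clipped / len(hyp)
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return brevity * precision
```

Clipping (via the Counter intersection) prevents a hypothesis from scoring well by repeating a single reference word.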

Baselines
We report the performance of the following baselines for single style transfer:
1. GPT-2: Directly fine-tuning the GPT-2 medium model (Radford et al., 2019) with paired data. Performance reported from (Lyu et al., 2021).
2. Seq2Seq: A GRU sequence-to-sequence language model (Sutskever et al., 2014) with attention. Performance reported from (Lyu et al., 2021).
3. RetrieveEdit (Hashimoto et al., 2018): For an input x, a retriever model goes through the training set to find a similar sentence pair (x', y'), and a trained editor edits y' into the desired output y. Performance reported from (Lyu et al., 2021).
4. Steering Vector (Subramani et al., 2022): Extracts steering vectors directly from pre-trained LMs to guide generation.
5. TAILOR (Ross et al., 2021): Outputs sentences conditioned on control codes using a pre-trained seq2seq model.
6. Neural QCFG (Kim, 2021): A sequence-to-sequence approach that explicitly models the alignment between target trees and the source.
7. Neural QCFG + copy (Kim, 2021): Neural QCFG with an option to copy certain tokens from the source sentence.
Among these baselines, GPT-2, Steering Vector, and TAILOR use pre-trained language models; Neural QCFG and Neural QCFG + copy require external grammar parsers; and RetrieveEdit uses GloVe word embeddings.
We also include human performance on these tasks (reported in (Lyu et al., 2021), obtained by asking human annotators to manually perform the style transfer tasks) for comparison.

Results and Analysis
For single style transfers, we tried two different diffusion-based approaches: (1) we train a separate diffusion model for each individual style transfer, and (2) we train one diffusion model for all 13 transfers evaluated. For approach (2), we add a style token at the beginning of the input sentence to indicate which of the 13 transfers needs to be performed. We call approach (2) DiffuSeq Multitask.
The original StylePTB paper (Lyu et al., 2021) puts the non-lexical transfers into 3 difficulty categories (easy, medium, hard) by the average Hamming distance between the input and output of the transfer. We report the results of our experiments using the same categorization, showing results on easy and medium transfers in Table 2 and hard transfers in Table 3.
Surprisingly, DiffuSeq Multitask outperforms DiffuSeq on all transfers, even though DiffuSeq Multitask has to handle 13 different transfers in one model while each DiffuSeq model only needs to handle one. A possible explanation is that, with the additional training data from all tasks, the multitask model learns better representations for words and sentences and gains more accurate knowledge of the grammatical patterns of English, which are shared across all tasks.
Moreover, DiffuSeq Multitask significantly outperforms all baselines on all easy and medium transfers, and also achieves state-of-the-art results on most metrics for hard transfers, only falling slightly behind Neural QCFG + copy on some metrics. This is impressive considering that our approach leverages no external knowledge, while all baselines except Seq2Seq utilize either pre-trained language models, pre-trained word embeddings, or an external grammar parser. Neural-QCFG-based methods are especially dependent on external linguistic knowledge and existing grammar parsers. DiffuSeq Multitask's performance is also on par with human performance on easy and medium transfers, indicating that DiffuSeq Multitask is close to fully solving the easy- and medium-difficulty transfers.

We report the performance of the following baselines for compositional fine-grained style transfers:
1. SeqGPT: Sequentially applying fine-tuned GPT-2 for each single style transfer. Performance reported from (Lyu et al., 2021).
2. CS-GPT: A modified GPT-2 model that takes in style tokens as an indication of which style transfers to apply. Performance reported from (Lyu et al., 2021).

Results and Analysis
For compositions of multiple fine-grained style transfers, we train one single DiffuSeq model to handle all compositions and use style tokens to indicate which transfers to compose for the input sentence, similar to CS-GPT (Lyu et al., 2021). The results are shown in Table 4. DiffuSeq significantly outperforms baselines in all tasks and all metrics. Therefore, not only does our diffusion model work well for single fine-grained style transfers, it also works well for compositions of multiple fine-grained style transfers.

Automated Text Style Transfer
The goal of the text style transfer (TST) task is to change the style of a sentence while retaining its style-independent content. Previous works in TST include the following approaches: statistical NLP methods (Hovy, 1987; Xu et al., 2012), neural generative models (Prabhumoye et al., 2018; Lample et al., 2019; He et al., 2020), retrieve-and-edit approaches (Sudhakar et al., 2019; Madaan et al., 2020), and Transformer-based approaches (Lyu et al., 2021). Some of these methods already achieve high performance on certain high-level transfers (such as sentiment transfer (Shen et al., 2017) and formality transfer (Rao and Tetreault, 2018)), but fine-grained text style transfer remains challenging for these approaches (Lyu et al., 2021). In this paper, we explore a new approach to fine-grained TST utilizing diffusion models.

Natural Language Processing with Diffusion Models
There have been two approaches to applying diffusion models to text data. The first operates in the continuous domain, like Diffusion-LM (Li et al., 2022) and DiffuSeq (Gong et al., 2022), where we start from a Gaussian noise vector and gradually denoise it into the desired sentence. The second applies diffusion in a discrete state space, like Multinomial Diffusion (Hoogeboom et al., 2021), D3PMs (Austin et al., 2021), and DiffusionBERT (He et al., 2022). In this paper, we chose to build upon the first type of model, because such models are closer to the original diffusion models for images (where diffusion happens in continuous space) and have shown success on tasks that require control over generation.

Limitations and Future Work
One significant limitation of our work is that we only explored the capabilities of diffusion-based language models under a challenging setting in which they are not allowed to use pre-trained weights or grammar parsers, which means we did not utilize this kind of model to its full potential. A future research direction could be to explore ways to further improve the model's performance by leveraging pre-trained weights or word embeddings, and to train with enough data to find the full potential of these models. Another limitation is that we only explored one typical diffusion-based language model, so our conclusions may not generalize to other types of diffusion-based language models (such as ones that use a discrete state space). We also conducted all experiments with the exact same model architecture. In the future, we plan to experiment with different architectures for the diffusion model, such as more sophisticated conditioning methods: currently we simply concatenate the source to the target, but we would like to try other ways of conditioning on the source, such as cross-attention, as these conditioning methods have shown promising performance in the image generation domain.
Lastly, we found that diffusion-based language models work well with limited data and no external knowledge or pre-trained weights, so these models may have great potential in low-resource settings. However, we did not apply them to any real low-resource settings (such as low-resource languages or rare domains) in this paper, and we would like to do so in the future to explore the full potential of diffusion-based language models.

Conclusions
In this paper, we explored the capabilities of diffusion-based models on fine-grained text style transfer, a task that requires a high level of control over generated text, with no external knowledge or pre-trained weights and with very limited training data. Our diffusion-based language model, which builds upon DiffuSeq (Gong et al., 2022), achieves state-of-the-art performance on all transfers as well as compositions of transfers, outperforming all previous works on this dataset, including ones that use pre-trained weights, word embeddings, and external grammar parsers. It is even on par with human performance on many transfers. Therefore, our model is a significant step toward solving automated fine-grained text style transfer.
Moreover, our work, together with previous works such as Diffusion-LM (Li et al., 2022), demonstrates that diffusion-based language models have great potential for controllable text generation in low-resource settings. In such settings (such as rarely spoken languages or uncommon tasks), it is difficult to find existing large language models or pre-trained weights, and available training data is likely to be very limited, so most approaches based on fine-tuning existing models or training on large amounts of data will not work well; diffusion-based language models could be an alternative to consider.