StoryTrans: Non-Parallel Story Author-Style Transfer with Discourse Representations and Content Enhancing

Non-parallel text style transfer is an important task in natural language generation. However, previous studies concentrate on the token or sentence level, such as sentence sentiment and formality transfer, and neglect long-text style transfer at the discourse level. Long texts usually involve more complicated author linguistic preferences, such as discourse structures, than single sentences. In this paper, we formulate the task of non-parallel story author-style transfer, which requires transferring an input story into a specified author style while maintaining source semantics. To tackle this problem, we propose a generation model, named StoryTrans, which leverages discourse representations to capture source content information and transfers them to target styles with learnable style embeddings. We use an additional training objective to disentangle stylistic features from the learned discourse representations, to prevent the model from degenerating into an auto-encoder. Moreover, to enhance content preservation, we design a mask-and-fill framework to explicitly fuse style-specific keywords of source texts into the generation. Furthermore, we construct new datasets for this task in Chinese and English, respectively. Extensive experiments show that our model outperforms strong baselines in the overall performance of style transfer and content preservation.


Introduction
Text style transfer aims to endow a text with a different style while keeping its main semantic content unaltered (Hovy, 1987). It has a wide range of applications, such as formality transfer (Jain et al., 2019), sentiment transfer (Shen et al., 2017), and author-style imitation (Tikhonov and Yamshchikov, 2018).
Due to the lack of parallel corpora, recent works mainly focus on unsupervised transfer by self-reconstruction. Current methods propose to disentangle styles from contents by removing stylistic tokens from inputs explicitly (Huang et al., 2021) or by reducing stylistic features in token-level hidden representations of inputs implicitly (Lee et al., 2021). This line of work has shown impressive performance on single-sentence sentiment and formality transfer. However, transferring the author styles of long texts such as stories, which manifest in an author's linguistic choices at the lexical, syntactic, and discourse levels, remains uninvestigated.

Source Text: 1 Guo Han was a talented man in ancient times. 2 One summer evening, he was enjoying the cool in the courtyard. 3 Suddenly, a gust of wind brought a refreshing fragrance, and a young girl descended from the sky on a white cloud and appeared in front of Guo Han. 4 Guo Han was amazed by the beauty of the flowery face and dazzling gauze of the girl. 5 The girl said that she was the Weaver Girl and came from Tiangong.

Generated Text for the JY Style: 2,3郭翰在夏日的夜,院中花香沁人心脾,一阵清香从身旁飘来,那少女却是神色自若,一言不发的从天而降。郭翰大惊,眼前白光一闪,身前纱衣一晃,已被她夺了过去。那少女冷笑道:"你当我是傻子么?"4郭翰叫绝:"我决不放你走!"那少女不理他,一剑向天宫门口刺去,突然转身,飞身而前。那人身法好快,眨眼间已到了郭翰眼前。他大叫一声"啊哟!"5但随即省悟,这少女是织女,本非等闲之辈,她这等武功,自是非同小可,那日她从天宫出来,郭翰没瞧见她面貌,却也认得她是谁。
2,3 On a summer night, the scent of flowers is refreshing in the courtyard. A scent of fragrance floats from Guo Han's side. However, the young girl, with a calm expression, fell from the sky without saying a word. Guo Han was shocked, and it was already taken away by her with a white light flashing and the gauze flickering. The girl sneered: "Do you think I'm a fool?" 4 Guo Han exclaimed: "I would never let you go!" The girl ignored him, stabbed at the gate of the palace with a sword, suddenly turned around, and then flew forward. The speed of the movement was so fast that she was in front of Guo Han in the blink of an eye. He yelled "Ah yo!" 5 But then he realized that this girl is the Weaver Girl. She is not an ordinary person. She is not trivial with such a martial art. She came out of the Tiangong that day. Guo Han did not see her face, but still recognized her.

Table 1: An example that transfers a vernacular story in the LOT dataset (Guan et al., 2021a) to the martial arts style of JY. The numbers before sentences indicate the rough semantic correspondence from the generated text to the source. Green highlights content of the source text that is missing in the generated text; red highlights plots expanded by the model corresponding to the target author style; black highlights sentences rewritten in the target author style.
In this paper, we present the first study on story author-style transfer, which aims to rewrite a story to incorporate the source content and a target author style. The first challenge of this task lies in the imitation of authors' linguistic choices at the discourse level, such as narrative techniques (e.g., brief vs. detailed narration). As exemplified in Table 1, the generated text in the JinYong (JY) style (JinYong is a Chinese martial arts novelist) not only rewrites some tokens into the martial arts style (e.g., "白云"/"white cloud" to "白光一闪"/"light flashing") but also adds events to detail and enrich the storyline (e.g., the red highlights after sentences 4 and 5). In contrast to the transfer of token-level features like formality, it is more difficult to capture the inter-sentence relations correlated with author styles and disentangle them from contents. The second challenge is that author styles tend to be highly associated with specific writing topics, which makes such style-specific contents hard to transfer to another style. For example, the topic "talented man" hardly shows up in the novels of JY, leading to low preservation of such contents, as shown in green in Table 1.

To alleviate the above issues, we propose a generation framework, named StoryTrans, which learns discourse representations from source texts and then combines these representations with learnable style embeddings to generate texts of target styles. Furthermore, we propose a new training objective to reduce stylistic features in the discourse representations, which aims to pull the representations derived from different texts close in the latent space. To enhance content preservation, we separate the generation process into two stages, which first transfer the source text with the style-specific content keywords masked and then generate the whole text by imposing these keywords explicitly.
To support evaluation on the proposed task, we collect new datasets in Chinese and English based on existing story corpora. We conduct extensive experiments to transfer fairy tales (in Chinese) or everyday stories (in English) to typical author styles. Automatic evaluation results show that our model achieves a better overall performance in style control and content preservation than strong baselines, and the manual evaluation also confirms the efficacy of our model. We summarize the key contributions of this work as follows: (I) To the best of our knowledge, we present the first study on story author-style transfer, and we construct new Chinese and English datasets for this task. (II) We propose a new generation model named StoryTrans to tackle the new task, which learns style-independent discourse representations to capture the content information and enhances content preservation by explicitly incorporating style-specific content keywords. (III) Extensive experiments show that our model outperforms baselines in the overall performance of style transfer accuracy and content preservation.

Related Work

Style Transfer
Recent studies have concentrated mainly on token-level style transfer of single sentences, such as formality or sentiment transfer. We roughly categorize these studies into the following three paradigms.
The first paradigm builds a style transfer system without explicit disentanglement of style and content. This line of work uses additional style signals or a multi-generator structure to control the style. Dai et al. (2019) added an extra style embedding to the input for manipulating the style of texts. Yi et al. (2020) proposed a style instance encoding method to learn more discriminative and expressive style embeddings. The learnable style embedding is a flexible yet effective approach to providing style signals, and such a design helps better preserve source content. Syed et al. (2020) randomly dropped input words and then reconstructed the input for each author separately, obtaining multiple author-specific generators. The multi-generator structure is effective but also resource-consuming. However, without explicit disentanglement, this paradigm incurs unsatisfactory style transfer accuracy.
The second paradigm disentangles the style from the content explicitly. Specifically, this paradigm disentangles the content and style in a latent space, and then combines the target style signal with the style-independent content representation. Zhu et al. (2021) disentangled content and style by diluting sentence-level information in style representations. John et al. (2019) incorporated style prediction and adversarial objectives for disentangling the latent representations. Lee et al. (2021) removed the style information of each token with a reverse attention score (Bahdanau et al., 2015), which is estimated by a pre-trained style classifier. This paradigm utilizes adversarial loss functions or a pre-trained estimator for disentanglement, and experimental results indicate that explicit disentanglement leads to a satisfactory style transfer accuracy but poor content preservation.

Figure 1: An overview of the generative flow. For discourse representation transfer (first stage), the encoder derives discourse representations ($\{r_i\}_{i=1}^{n}$) that contain the main semantics of the pre-processed input ($x^m$). Then, the interaction module stylizes the discourse representations with the target style embedding ($\hat{s}$). For content preservation enhancing (second stage), our model enhances the content preservation of the transferred text ($y^m$) with style-specific content ($k$). $x$ and $y$ denote the original story and the final output, respectively.

The final paradigm views style as a localized feature of tokens in a sentence: it locates style-dependent words and replaces them with target-style ones. Xu et al. (2018) employed an attention mechanism to identify style tokens and filter them out. Wu et al. (2019) utilized a two-stage framework to mask all sentimental tokens and then infill them. Huang et al. (2021) aligned words of the input and a reference to achieve token-level transfer. To sum up, this paradigm maintains all word-level information, but it is hard to apply to scenarios where styles are expressed beyond the token level, e.g., author style.
Absorbing ideas from the first and second paradigms, we apply explicit disentanglement by pulling discourse representations close, which is formulated as a disentanglement loss. Furthermore, we design a fusion module to stylize the style-independent discourse representations.

High-Level Representation
Prior works endeavored to capture the hierarchical structure of natural language texts by learning high-level representations. Li et al. (2015) proposed to learn hierarchical embedding representations by reconstructing masked versions of sentences or paragraphs. Reimers and Gurevych (2019) derived semantically meaningful sentence embeddings by fine-tuning BERT (Devlin et al., 2019) on downstream tasks. Guan et al. (2021b) inserted a special token at the end of each sentence and devised several pre-training tasks to learn sentence-level representations. Inspired by the above works, we learn style-independent discourse representations through reconstruction of multi-sentence inputs and a disentanglement loss.

Long Text Generation
Recent studies generate coherent long texts by decomposing generation into multiple stages. This line of work first generates a rough sketch and then extends it into a complete text with fine details. Borrowing ideas from the above works, we adopt a mask-and-fill two-stage framework to enhance content preservation in text style transfer.

Task Definition and Model Overview
We formulate the story author-style transfer task as follows: assuming that $S$ is the set of all author styles, given a multi-sentence input $x = (x_1, x_2, \cdots, x_T)$ of $T$ tokens and its author-style label $s \in S$, the model should generate a multi-sentence text with a specified author style $\hat{s} \in S$ while keeping the main semantics of $x$.
As illustrated in Figure 1, we split the generation process into two stages. We first identify style-specific keywords $k = (k_1, k_2, \cdots, k_l)$ from $x$, and then mask them with special $\langle\text{mask}\rangle$ tokens. We denote the resulting masked version of $x$ as $x^m = (x^m_1, x^m_2, \cdots, x^m_T)$. In the first generation stage, we perform discourse representation transfer on $x^m$. In the second stage, we complete the masked tokens in the output of the first stage conditioned on $k$ in a style-unrelated manner.

Due to the lack of parallel data, typical style transfer models optimize a self-reconstruction loss with identical inputs and outputs (Xiao et al., 2021; Lee et al., 2021). Obviously, training with only the self-reconstruction loss makes it easy for the model to ignore the target style signals and simply repeat the source inputs. Therefore, in the first stage, we devise an additional training objective to disentangle stylistic features from the intermediate discourse representations $\{r_i\}_{i=1}^{n}$, where $n$ is the number of sentences. Then, we fuse these style-independent discourse representations with the target style $\hat{s}$ as a discourse-level guidance for the subsequent generation of the transferred text. We also add a style classifier loss. In summary, the first-stage model is trained with the following loss function:

$$\mathcal{L} = \mathcal{L}_{\text{self}} + \lambda_1 \mathcal{L}_{\text{dis}} + \lambda_2 \mathcal{L}_{\text{style}}, \qquad (1)$$

where $\lambda_1$ and $\lambda_2$ are adjustable hyper-parameters, and $\mathcal{L}_{\text{self}}$, $\mathcal{L}_{\text{dis}}$, and $\mathcal{L}_{\text{style}}$ are the self-reconstruction loss, the disentanglement loss, and the style classifier loss, respectively. The workflow of the learning objectives is shown in Figure 2.
In the second stage, we use a denoising auto-encoder (DAE) loss to train another encoder-decoder model to reconstruct $x$:

$$\mathcal{L}_{\text{DAE}} = -\log P(x \mid x^m, k). \qquad (2)$$

This stage is unrelated to author styles and helps achieve better content preservation.

Identifying Style-Specific Contents
As aforementioned, author styles have a strong correlation with contents, and it is difficult to transfer such style-specific contents to other styles directly. Formally, since we train the model in an auto-encoder manner, it has no way to learn how to transfer content representations that have never been paired with other style embeddings during training. To address the issue, we propose to mask the style-specific keywords in the source text and perform style transfer on the masked text in the first generation stage. Then, we fill in the masked tokens in the second stage. We follow Xiao et al. (2021) and use a frequency-based method to identify the style-specific keywords. Specifically, we extract style-specific keywords by (1) obtaining the top-10 words with the highest TF-IDF scores from each corpus, (2) retaining only people's names, place names, and proper nouns, and (3) filtering out words with a high frequency across all corpora. We denote the resulting word set as $D_s$ for the corpus with style $s$. We extract the style-specific keywords $k$ from the text $x$ by selecting the words that are in $D_s$. We detail and explain these operations in Appendix A.
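Below is a minimal sketch of this three-step extraction, assuming English text, scikit-learn for TF-IDF, and NLTK for POS tagging; the function and variable names are illustrative rather than our released code, and taking the top-10 TF-IDF words per document within one style corpus is our reading of step (1). The 10% document-frequency threshold follows Appendix A.

```python
# A minimal sketch of the frequency-based keyword extraction.
# Assumptions (not the authors' released code): English text,
# scikit-learn TF-IDF, NLTK POS tagging, and top-10 TF-IDF words
# taken per document within one style corpus.
from collections import Counter

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

PROPER_NOUN_TAGS = {"NNP", "NNPS"}  # step (2): keep names/proper nouns

def extract_style_keywords(corpus, all_corpora, df_threshold=0.10):
    """Return the style-specific word set D_s for one style corpus."""
    # Step (1): top-10 TF-IDF words of each document in this corpus.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus).toarray()
    vocab = vectorizer.get_feature_names_out()
    candidates = {vocab[i] for row in tfidf
                  for i in row.argsort()[-10:] if row[i] > 0}
    # Step (2): retain only proper nouns (people/place names, etc.).
    candidates = {w for w, tag in nltk.pos_tag(sorted(candidates))
                  if tag in PROPER_NOUN_TAGS}
    # Step (3): drop words that are frequent across *all* corpora.
    docs = [d for c in all_corpora for d in c]
    doc_freq = Counter(w for d in docs for w in set(d.lower().split()))
    max_df = df_threshold * len(docs)
    return {w for w in candidates if doc_freq[w.lower()] <= max_df}
```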

Discourse Representations Transfer
In analogy to long text generation, we propose to learn discourse representations and then reconstruct texts from them. We then perform the disentanglement and stylization operations on the discourse representations.

Discourse Representations
Supposing that $x^m$ consists of $n$ sentences, we follow Reimers and Gurevych (2019) and Guan et al. (2021b) to insert a special token, $\langle\text{Sen}\rangle$, at the end of each sentence in $x^m$. Let $r_i$ denote the hidden state of the encoder at the position of the $i$-th sentence token. Formally, we derive the intermediate discourse representations using the encoder as follows:

$$\{r_i\}_{i=1}^{n} = \text{Encoder}(x^m).$$

The hidden states of the $\langle\text{Sen}\rangle$ tokens aim to capture inter-sentence discourse dependencies, including discourse-level semantics.
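For illustration, here is a sketch of this step, assuming a Hugging Face-style tokenizer and encoder with $\langle\text{Sen}\rangle$ registered as an additional special token; the names are placeholders, not the released implementation.

```python
# Sketch: append <Sen> to each sentence and gather the encoder hidden
# states at the <Sen> positions as the discourse representations r_i.
import torch

SEN_TOKEN = "<Sen>"  # assumed to be registered with the tokenizer

def discourse_representations(encoder, tokenizer, sentences):
    text = "".join(s + SEN_TOKEN for s in sentences)
    enc = tokenizer(text, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state            # (1, T, d)
    sen_id = tokenizer.convert_tokens_to_ids(SEN_TOKEN)
    sen_mask = enc["input_ids"][0] == sen_id             # <Sen> positions
    return hidden[0][sen_mask]                           # (n, d): r_1..r_n
```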

Disentanglement Loss
We disentangle the style and content on the discourse representations. Inspired by prior studies on structuring latent spaces (Gao et al., 2019; Zhu et al., 2021), we devise an additional loss function $\mathcal{L}_{\text{dis}}$ to pull close the discourse representations derived from different examples in the same mini-batch, which correspond to different author styles. $\mathcal{L}_{\text{dis}}$ and $\mathcal{L}_{\text{self}}$ work as adversarial losses and lead the model to achieve a balance between content preservation and style transfer. We derive $\mathcal{L}_{\text{dis}}$ as follows:

$$\mathcal{L}_{\text{dis}} = \sum_{i=1}^{b} \sum_{j=1}^{b} \left\| \bar{r}^{(i)} - \bar{r}^{(j)} \right\|_2,$$

where $b$ is the size of the mini-batch and $\bar{r}^{(i)}$ denotes the average of the discourse representations of the $i$-th example.
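Under this formulation, a PyTorch sketch could look as follows; averaging each example's discourse representations before computing pairwise distances is our reading, not a verified detail of the implementation.

```python
# Sketch of L_dis: pull the mean discourse representations of different
# mini-batch examples (i.e., different author styles) close together.
import torch

def disentanglement_loss(batch_reps):
    # batch_reps: list of b tensors, each (n_i, d). Assumes b > 1.
    means = torch.stack([r.mean(dim=0) for r in batch_reps])  # (b, d)
    b = means.size(0)
    dists = torch.cdist(means, means, p=2)                    # (b, b)
    return dists.sum() / (b * (b - 1))  # mean over ordered pairs
```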
Fusion Module

To provide signals of the transfer direction, we concatenate the learned discourse representations $\{r_i\}_{i=1}^{n}$ with the style embedding $\mathbf{s}$ and fuse them using a multi-head attention layer, as illustrated in Figure 1. To capture discourse-level features of texts with different author styles, we set each style embedding to a vector of the same dimension as $r_i$. Formally, we derive the style-aware discourse representations $\{z_i\}_{i=1}^{n+1}$ as follows:

$$\{z_i\}_{i=1}^{n+1} = \text{MHA}(Q, K, V), \quad Q = K = V = [r_1; r_2; \cdots; r_n; \mathbf{s}],$$

where MHA is the multi-head attention layer, $Q$/$K$/$V$ are the corresponding query/key/value, and $[\cdot\,;\cdot]$ is the concatenation operation. Then, the decoder gets access to $\{z_i\}_{i=1}^{n+1}$ through the cross-attention layer, which serves as a discourse-level guidance for generating the transferred texts. The generation probability is formulated as follows:

$$P(x^m_t \mid x^m_{<t}) = \text{softmax}(W h_t + b),$$

where $h_t$ is the hidden state at the $t$-th position of the decoder, and $W$ and $b$ are trainable parameters.
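The following PyTorch sketch illustrates the fusion module; the layer sizes, the number of styles, and the batched interface are illustrative assumptions.

```python
# Sketch: fuse discourse representations {r_i} with a learnable style
# embedding via multi-head attention (Q = K = V = [r_1; ...; r_n; s]).
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_styles=3):
        super().__init__()
        self.style_emb = nn.Embedding(n_styles, d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, r, style_id):
        # r: (batch, n, d); style_id: (batch,) index of the target style.
        s = self.style_emb(style_id).unsqueeze(1)   # (batch, 1, d)
        qkv = torch.cat([r, s], dim=1)              # (batch, n+1, d)
        z, _ = self.mha(qkv, qkv, qkv)              # self-attention mixing
        return z                                    # {z_i}_{i=1}^{n+1}
```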

Self-Reconstruction Loss
We follow the manner of conventional generation models such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) to formulate the self-reconstruction loss:

$$\mathcal{L}_{\text{self}} = -\sum_{t=1}^{T} \log P(x^m_t \mid x^m_{<t}, \{r_i\}_{i=1}^{n}, \mathbf{s}),$$

where $\mathbf{s}$ is the learnable embedding of the style $s$. During inference, we replace $\mathbf{s}$ with the embedding of the target style $\hat{s}$, i.e., $\hat{\mathbf{s}}$, to achieve the style transfer.
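As a sketch, $\mathcal{L}_{\text{self}}$ reduces to ordinary teacher-forced cross-entropy over the masked input; the helper below is hypothetical, and the decoder is assumed to attend to the style-aware representations internally.

```python
# Sketch of L_self: teacher-forced reconstruction of x^m. At inference
# the source style embedding is swapped for the target style's.
import torch.nn.functional as F

def self_reconstruction_loss(decoder_logits, target_ids, pad_id):
    # decoder_logits: (B, T, |V|); target_ids: (B, T), i.e., x^m itself.
    return F.cross_entropy(decoder_logits.transpose(1, 2),
                           target_ids, ignore_index=pad_id)
```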

Style Classifier Loss
We would like the transferred text to be of the target style. Hence, we train a style classifier to derive the style transfer loss as follows:

$$\mathcal{L}_{\text{style}} = -\log P_C(\hat{s} \mid y^m),$$

where $P_C$ is the conditional distribution over styles defined by the classifier and $y^m$ is the transferred text of the first stage. We train the classifier on the whole training set with the standard cross-entropy loss, and then freeze the weights of the style classifier for computing $\mathcal{L}_{\text{style}}$. On the other hand, we follow Lee et al. (2021) and Dai et al. (2019) to use soft sampling to allow gradient back-propagation.
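A sketch of the soft-sampling trick: instead of discrete decoding, the frozen classifier consumes probability-weighted mixtures of token embeddings, so gradients can flow back into the generator. The classifier interface below is an assumption.

```python
# Sketch of L_style with soft sampling.
import torch.nn.functional as F

def style_loss(gen_logits, token_embedding, classifier, target_style):
    # gen_logits: (B, T, |V|) decoder outputs for the transferred text.
    probs = F.softmax(gen_logits, dim=-1)
    soft_inputs = probs @ token_embedding.weight   # (B, T, d) soft tokens
    style_logits = classifier(soft_inputs)         # frozen; (B, n_styles)
    return F.cross_entropy(style_logits, target_style)
```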

Content Preservation Enhancing
In the second stage, we train another model to fill the masked tokens in the outputs of the first stage, conditioned on the style-specific keywords identified in the source inputs. During training, we concatenate the keywords in $k$ with a special token $\langle\text{Key}\rangle$ and feed them into the encoder paired with $x^m$, as shown in Figure 1. The training objective is formulated as Equation 2. During inference, the decoder generates the transferred text $\hat{x}$ conditioned on the output of the first stage $\hat{x}^m$ in an auto-regressive manner.
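A minimal sketch of how a second-stage training example can be assembled under this description (the token strings are placeholders):

```python
# Sketch: pair the style-specific keywords with the masked text as the
# encoder input; the decoding target is the original text x.
KEY_TOKEN = "<Key>"

def build_second_stage_example(keywords, masked_text, original_text):
    src = KEY_TOKEN.join(keywords) + KEY_TOKEN + masked_text
    return {"input": src, "target": original_text}
```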

Datasets
We construct stylized story datasets in Chinese and English, respectively. The Chinese dataset consists of three styles of texts: fairy tales from LOT (Guan et al., 2021a), LuXun (LX), and JinYong (JY). Specifically, LuXun wrote realist novels while JinYong focused on martial arts novels. The English dataset consists of everyday stories from ROCStories (Mostafazadeh et al., 2016) and Shakespeare-style texts. We show the statistics of the datasets in Table 2. In addition, the details of data pre-processing and collection are described in Appendix B.

Implementation
We take LongLM-base (Guan et al., 2021a) and T5-base (Raffel et al., 2020) as the backbone models of both generation stages for the Chinese and English experiments, respectively. Furthermore, the fusion module consists of two layers of randomly initialized bidirectional Transformer blocks (Vaswani et al., 2017). We conduct experiments on one RTX 6000 GPU. In addition, we build the style classifier on the encoder of LongLM-base and T5-base for Chinese and English, respectively. We set $\lambda_1/\lambda_2$ in Equation 1 to 1/1, the batch size to 4, the learning rate to 5e-5, and the maximum sequence length of the encoder and decoder to 512 for both generation stages in the Chinese experiments. The hyper-parameters for the English experiments are the same, except that $\lambda_1/\lambda_2$ are set to 2/2 and the learning rate to 2.5e-5.

Baselines
Since no previous studies focus on story author-style transfer, we build several baselines by adapting short-text style transfer models. For a fair comparison, we initialize all baselines with the same pre-trained parameters as our model. Specifically, we adopt the following baselines: Style Transformer Following Dai et al. (2019), we add an extra style embedding and a discriminator to provide rewards for style transfer. This baseline investigates the effect of transferring styles without disentangling contents from styles.
StyleLM This baseline generates the target text conditioned on the given style token and a corrupted version of the original text, based on the model of Syed et al. (2020).
Reverse Attention It inserts a reverse attention module on the last layer of the encoder, which aims to negate the style information from the hidden states of the encoder (Lee et al., 2021).

Automatic Evaluation
Evaluation Metrics Previous works evaluate style transfer systems mainly from three aspects: style transfer accuracy, content preservation, and fluency (Lee et al., 2021; Wu et al., 2019). A good style transfer system needs to balance the trade-off between content preservation and transfer accuracy (Zhu et al., 2021; Niu and Bansal, 2018), so we use a joint metric to evaluate the overall performance of models. Previous studies usually measure fluency with the perplexity (PPL) of an off-the-shelf pre-trained language model. However, in our experiments, we found that the PPL of model outputs is lower than that of human-written texts, suggesting that PPL is not reliable for evaluating the quality of stories. Therefore, we evaluate fluency through manual evaluation. Specifically, we adopt the following automatic metrics: (1) Style Transfer Accuracy: We use two variants of style transfer accuracy following Krishna et al. (2021): absolute accuracy (a-Acc) and relative accuracy (r-Acc). We train a style classifier and regard the classifier score as a-Acc; r-Acc is a binary value indicating whether the style classifier scores the output higher than the input (1/0 for a higher/lower score). We train the classifier by fine-tuning the encoder of LongLM-base and T5-base on the Chinese and English training sets, respectively. The classifier achieves 99.6% and 99.41% accuracy on the Chinese and English test sets, respectively. (2) Content Preservation: We use BLEU-n (n=1,2) (Papineni et al., 2002) and BERTScore (BS) (Zhang et al., 2020) between generated and input texts to measure their lexical and semantic similarity, respectively, and we report recall (BS-R), precision (BS-P), and F1 (BS-F1) for BS. (3) Overall: We use the geometric mean of a-Acc/r-Acc and the BS-F1 score to assess the overall performance of models (Krishna et al., 2020; Lee et al., 2021).
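For concreteness, the overall metric is simply the geometric mean of two scores in [0, 1], e.g.:

```python
# Sketch of the joint metric: geometric mean of style transfer accuracy
# (a-Acc or r-Acc) and BERTScore F1.
import math

def overall_score(style_acc, bs_f1):
    return math.sqrt(style_acc * bs_f1)

# e.g. overall_score(0.72, 0.55) -> ~0.63
```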
Results on the Chinese Dataset We show the overall performance and individual metric results in Table 3. In terms of overall performance, StoryTrans outperforms the baselines, illustrating that StoryTrans achieves a better balance between style transfer and content preservation.
In terms of style accuracy, as shown in Table 3, StoryTrans achieves the best style transfer accuracy for both target styles. The poor performance of the baselines indicates that explicit disentanglement between styles and contents beyond the token level is necessary for story style transfer.
In addition, manual inspection shows that Style Transformer tends to copy the input, which accounts for its highest BLEU score and BERTScore. This means Style Transformer only takes the target style signals as noise, which may result from the stylistic features remaining in the contents. StyleLM and Reverse Attention obtain better transfer accuracy than Style Transformer by removing such stylistic features from contents. Moreover, Reverse Attention obtains better style accuracy but worse content preservation than StyleLM. Therefore, reweighting hidden states allows better control over style than explicitly deleting input words.
In terms of content preservation, StoryTrans outperforms Reverse Attention. Additionally, StyleLM achieves better content preservation, benefiting from taking noisy versions of golden texts as input; but without disentanglement, it does not strip style information, which leads to a lower overall performance than StoryTrans. As for Style Transformer, the results demonstrate that an attention-based model alone can hardly remove style features from the overwhelming token-level information, causing it to degenerate into an auto-encoder.
As for the ablation study, we observe a significant drop in transfer accuracy without $\mathcal{L}_{\text{dis}}$ or $\mathcal{L}_{\text{style}}$, suggesting the necessity of disentanglement and style supervision for self-reconstruction. The ablation test demonstrates that $\mathcal{L}_{\text{dis}}$ reduces the stylistic features and counteracts the tendency to copy the input. Moreover, $\mathcal{L}_{\text{style}}$ enhances the effect of $\mathcal{L}_{\text{dis}}$, as shown in Table 3. The ablation of multiple $\langle\text{Sen}\rangle$ tokens employs only one $\langle\text{Sen}\rangle$ token to capture all text information, demonstrating the need for multiple special tokens to represent the multiple sentences. The ablation of content enhancing shows the effectiveness of the mask-and-fill two-stage framework.

Source Text
1毛克利学会一个人捕猎了。2母狼就告诉毛克利，瘸老虎和狐狸一直对他虎视眈眈，让他不要一个人出去捕食。3毛克利答应了母狼。4可是不久后，他就忘记了母狼说的话，还经常一个人出去捕食或玩耍。5有一次，毛克利在玩耍时遇到了坏狐狸大巴希。6狐狸正准备偷袭他。7这时，母狼正好来了。8看到气势汹汹的母狼，狐狸灰溜溜地逃走了。9毛克利很抱歉没有听母狼的话。10此后出门都会叫上朋友一起。
1 Mowgli learned to hunt alone. 2 The she-wolf told Mowgli that lame tigers and foxes had been eyeing him and told him not to go out to hunt alone. 3 Mowgli agreed to the she-wolf. 4 But after a while, he forgot what the she-wolf said, and often went out alone to hunt or play. 5 Once, Mowgli encountered the bad fox Big Bashi while playing. 6 The fox was about to attack him. 7 At this time, the she-wolf just arrived. 8 Seeing the aggressive she-wolf, the fox fled in despair. 9 Mowgli was sorry for not listening to the she-wolf. 10 After that, he would call his friends along whenever he went out.

StyleLM
1毛克利学会了一个捕猎的好手,和母狼、毛克利并肩而行。2,4这一天,狐狸虎视眈眈,一个人独自出去捕食猎物。3,4毛克利对母狼说道自己已经忘记了母狼的话,如果一个人出去 捕食,他就会被母狼吃掉。于是他回家去了。5,6,7家里的狐狸玩耍时,毛克利被一只大巴狐狸偷袭,母狼气势汹汹的扑将上来,把母狼咬死了。8狐狸只好灰溜溜的逃走了。9毛茸的 话也提醒了毛克利,他惭愧的低下了头,不再说一句抱歉的话。10母狼听到他的话,此后没有再出门,只是站在树林边,微笑着看着他。 1 Mowgli learns to be a good hunter and walks side by side with the she-wolf and Mowgli. 2,4On this day, the fox was eyeing the tiger and went out alone to hunt its prey. 3,4 Mowgli told the she-wolf that he had forgotten what the she-wolf said, and that if a man went out to hunt, he would be eaten by the she-wolf. So he went home. 5,6,7 When the fox at home was playing, Mowgli was attacked by a bus fox, and the she-wolf rushed up aggressively and killed the she-wolf. 8The fox had no choice but to run away. 9The furry words also reminded Mowgli, who bowed his head in shame and stopped saying a word of apology. 10 The she-wolf heard his words and did not go out after that, but just stood by the woods and looked at him with a smile.

StoryTrans (Ours)
狐狸怒道:"什么?"8母狼道:"我母狼就跟这恶狐狸斗口。刚才他只一出手,我们都瞧得清楚,他不会'摸',咱们就不用胆子。"狐狸听了此言,知道此人不可理喻,只好强忍怒气,笑道:"好,就依你说。"母狗道:"说什么??"狐狸道:"他说咱们就是在这儿见面。待会他就来回报。"老虎道:"还说什么啊?"6大巴道:"刚才他说只攻敌群,守住阵脚,不让敌人靠近。我想敌兵出阵之后,咱咱还有命么?"老虎的话实在说得未免有理,但这句话总是听在耳里。7,8他和母狼是一般的心思,明白这批人武功不及他,往往畏惧对方,因此听得狐狸的话,当即锁住母狼,偷偷跑了开去。
The fox said angrily, "What?" 8 The she-wolf said, "My she-wolf is fighting with this vicious fox. Just now he only made a move, we can all see clearly that he can't 'touch', so we don't have to be brave." After hearing this, the fox knew that this person was unreasonable, so he had to hold back his anger and said with a smile: "Okay, it's up to you." The bitch said: "What did you say??" The fox said: "He said we are meeting here. He will report back in a moment." Tiger said: "What else?" 6 Daba said: "Just now he said that he would only attack the enemy group, defend his position, and prevent the enemy from approaching. I think after the enemy soldiers come out, do we still have our lives?" Tiger's words were quite reasonable, but he always listened to them. 7,8 He and the she-wolf are of the same mind, and they understand that these people are inferior to him in martial arts and are often afraid of each other. Therefore, when they hear the fox's words, they immediately lock the she-wolf and secretly run away.

Table 4: Cases generated by different models, which are transferred from the fairy tale style to the JY style.

Results on the English Dataset Similarly, StoryTrans also achieves the best overall performance on the English dataset, further showing its effectiveness and generalization. Specifically, StoryTrans outperforms baselines significantly in terms of style transfer accuracy. In terms of content preservation, Style Transformer and Reverse Attention degenerate into auto-encoders and tend to copy the input even more than on the Chinese dataset.

Manual Evaluation
We randomly sample 100 fairy tales from the Chinese dataset and obtain 800 generated texts for two target styles from StoryTrans and three baseline models. Then, we hire three Chinese native speakers as annotators to evaluate the generated texts in three aspects: style transfer accuracy (Sty.), content preservation (Con.), and coherence (Coh.). We define coherence as intra-sentence fluency. We ask the annotators to judge each aspect from 1 (the worst) to 3 (the best). As illustrated in Table 5, our StoryTrans receives the highest style accuracy and moderate performance in content preservation and coherence. More details and analysis are presented in Appendix E.

Case Study
We perform case studies on the best baseline and StoryTrans. Table 4 shows the generated texts from different models. StyleLM can only rewrite individual sentences and simply merges some content, although it can also expand some content. On the contrary, our StoryTrans model can not only rewrite most sentences but also supplement the plots to be more consistent with the target style.

Stylistic Feature Visualization
We follow Syed et al. (2020) to manually define seven stylistic features and visualize the features of the golden texts and generated texts on the Chinese test set. The stylistic features include the types and numbers of punctuation marks, the number of sentences, and the number of words. As shown in Figure 3, the texts generated by Reverse Attention and StyleLM have stylistic features similar to the source texts. In contrast, StoryTrans can better capture different stylistic features and transfer source texts to different specified styles. We show more details in Appendix D.

Conclusion

In this paper, we present the first study on story author-style transfer and propose StoryTrans, which leverages style-independent discourse-level text representations to improve the style transfer accuracy, and achieves better content preservation by injecting style-specific contents. Both automatic and human evaluations show that StoryTrans outperforms baselines in terms of the overall performance. Further analysis shows that StoryTrans has a better ability to capture linguistic features for style transfer.

A Style-Specific Contents
We detail how we extract style-specific contents and explain how they are used from the following three aspects. What do we mean by "style-specific content"? We refer to "style-specific content" as tokens mainly used in texts of a specific style that should be retained after style transfer. For example, "Harry Potter" and "Horcrux" are style-specific since they are used only in J.K. Rowling-style stories. When transferring J.K. Rowling-style stories to other styles, such style-specific tokens shouldn't be changed. However, existing models tend to drop style-specific tokens since they are not trained to generate these tokens conditioned on other styles.
How do we extract "style-specific contents"?
As mentioned before, we employ the TF-IDF algorithm on each corpus to obtain rough style-specific contents for the different styles. The reason for using TF-IDF is that the extracted tokens must be salient to the story plots. We then extract style-specific tokens from these salient tokens using the second and third steps. For the second step, we use a part-of-speech tagging toolkit (e.g., NLTK) to identify and discard function words and prepositions, retaining people's names, place names, and proper nouns. Note that the frequency threshold is an empirical value observed from the datasets. The TF-IDF algorithm chooses words that are important to a specific style based on word frequency, but some style-unrelated words may also be important to the content; therefore, we need to filter out style-unrelated words. Concretely, we use Jieba (https://github.com/fxsjy/jieba) and NLTK (Bird et al., 2009) to collect word frequencies for the Chinese and English datasets, respectively, and we regard words with a high frequency in the corpora of all styles as style-unrelated. Specifically, we treat tokens appearing in more than 10% of the samples in the dataset as high-frequency words, and we filter them out to obtain the style-specific contents. The frequency threshold needs to be re-tuned when applying the method to other datasets.
How are the "style-specific contents" used? One challenge of long-text style transfer is transferring discourse-level author style while preserving the main characters and storylines. It's difficult for existing models to transfer style-specific contents since they are not trained to learn these tokens conditioned on other styles.
As shown in Table 6, if the style-specific tokens are not recognized explicitly (i.e., StoryTrans w/o CE), the frog and the fat cow are not preserved. In contrast, StoryTrans retains the main characters.

B Data Pre-Processing
Due to the lack of stylized author datasets, we collect several author corpora to construct new datasets. For Chinese, we extract paragraphs from 21 novels of LuXun (LX), 15 novels of JinYong (JY), and the fairy tales collected by Guan et al. (2021a). The English dataset consists of everyday stories from ROCStories (Mostafazadeh et al., 2016) and fragments from Shakespeare's plays. Each fragment of Shakespeare's plays comprises multiple consecutive sentences and is as long as the samples in ROCStories. We collect the Shakespeare-style texts from the Shakespeare corpus in Project Gutenberg. We use Jieba/NLTK (Bird et al., 2009) for word tokenization of the Chinese/English dataset in data pre-processing.
Due to the length limitations of modern language models, the length of samples is also limited: we set the maximum length to 384 and 90 for Chinese and English, respectively, and each sample has at least 4 sentences. We choose the above lengths to balance the data length across different styles. Additionally, we filter out texts that are too long to generate or too short to unveil the author's writing style. As Figure 4 shows, the texts in the Chinese dataset span a diverse range of lengths.

C Different Style Samples
In the process of constructing the datasets, we try to collect corpora of authors whose writing styles have a clear gap. As shown in Table 7, the JY-style texts mostly describe martial arts actions and construct interesting plots, while the LX-style texts focus on realism with profound descriptive and critical significance, and the fairy tales differ from these texts in terms of topical and discourse features. In the English dataset, the Shakespeare-style texts are flamboyant and contain elaborate metaphors and ingenious ideas, whereas the everyday stories are written in plain language and without rhetoric.

Source
音蛙坐在田边突然看见一条肥牛，望着伟大的身躯，引起了它的嫉妒。它狂妄地鼓胀它的皮肤，尽可能地鼓得大而又大。肥牛劝阻不成，只能看着音蛙继续吹鼓自己的肚子。最终，音蛙鼓爆了自己的胸膛。
Sitting on the edge of the field, the sound frog suddenly saw a fat cow; looking at the great body aroused its jealousy. It inflated its skin wildly, as big as it could. The fat cow could not dissuade him, so he could only watch the sound frog continue to blow up his belly. In the end, the sound frog burst his own chest.

StoryTrans
蓦然间远处出现了一艘大船,船身笔直向音蛙驶来。音蛙心下大奇:"难道肥牛真的大有胸膛,坐着肥牛,仍是聚在一起?"不敢贸然举步,只怕那两艘肥牛会向他追来,只得坐在大船上,心中惴惴不安。甚至,心中也暗自惴惴:"为什么肥牛会在追来?难道音蛙已发了疯,不愿我跟他们相会?"
Suddenly, a large ship appeared in the distance, and the ship was heading straight for the sound frog. The sound frog was very surprised: "Does the fat cow really have a big chest, and the fat cow sits together?" He didn't dare to move rashly, because he was afraid that the two fat cows would chase after him, so he had to sit on the big ship, feeling uneasy in his heart. He was even secretly worried: "Why are the fat cows chasing? Could it be that the sound frog has gone crazy and doesn't want me to meet them?"

StoryTrans w/o CE
田大哥瞪着眼睛道:"你骂我、瞪着眼睛?""田大哥不敢!""哼!"田大哥充满了悲愤,双手紧紧抓住椅子,泣道:"田大哥,你受了怎样的羞辱?怎样躲在暗处?你受了多少委屈?"田大哥用力挣扎,但那双手越挣越紧,嘴角上已露出狞笑。田大哥不忍见这丑女如此无赖,心中恼恨,大声呼喝:"别再妄想欺侮我!"
Mr. Tian stared and said, full of grief and anger, "You scolded me and stared at me?" "Mr. Tian dare not!" "Hmph!" He grasped the chair tightly with both hands and wept: "Mr. Tian, what kind of humiliation have you suffered? How do you hide in the dark? How much grievance have you suffered?" Mr. Tian struggled hard, but his hands tightened, and a smirk appeared on the corner of his mouth. Mr. Tian couldn't bear to see this ugly girl being so rascal; he was angry in his heart and shouted loudly: "Don't try to bully me again!"

Table 6: Cases generated by StoryTrans, which are transferred from the fairy tale style to the JY style. The words in red indicate style-specific contents which should be preserved.

Authors Texts
JY 杨过左手抢过马缰，双腿一夹，小红马向前急冲，绝尘而去。郭芙只吓得手足酸软，慢慢走到墙角拾起长剑，剑身在墙角上猛力 碰撞，竟已弯得便如一把曲尺。以柔物施展刚劲，原是古墓派武功的精要所在，李莫愁便拂尘、小龙女使绸带，皆是这门功夫。 杨过此时内劲既强，袖子一拂，实不下于钢鞭巨杵之撞击。杨过抱了郭襄，骑着汗血宝马向北疾驰，不多时便已掠过襄阳，奔行 了数十里，因此黄蓉虽攀上树顶极目远眺，却瞧不见他的踪影。 Yang Guo grabbed the horse's reins with his left hand, clamping with his leg, and then little red horse rushed out of sight. Guo Fu was so frightened that his hands and feet were sore, and she slowly walked to the corner to pick up the long sword. Using soft objects to display strength was originally the essence of the ancient tomb school martial arts. Yang Guo's internal energy was strong at this moment, and a flick of his sleeve was no less than the impact of a giant steel whip. Yang Guo hugged Guo Xiang, and rode a sweaty horse to the north. After a while, he passed Xiangyang and ran for dozens of miles. Although Huang Rong climbed to the top of the tree and looked far into the distance, she could not see any trace of him.
LX 自《新青年》出版以来，一切应之而嘲骂改革，后来又赞成改革，后来又嘲骂改革者，现在拟态的制服早已破碎，显出自身的本 相来了，真所谓"事实胜于雄辩"，又何待于纸笔喉舌的批评。所以我的应时的浅薄的文字，也应该置之不顾，一任其消灭的；但几 个朋友却以为现状和那时并没有大两样，也还可以存留，给我编辑起来了。这正是我所悲哀的。我以为凡对于时弊的攻击，文字 须与时弊同时灭亡，因为这正如白血轮之酿成疮疖一般，倘非自身也被排除，则当它的生命的存留中，也即证明着病菌尚在。 Since the publication of "New Youth", everyone has ridiculed the reform in response to it, later approved of it, and then ridiculed the reformers. Now the mimetic uniform has long been broken, showing its true nature. The so-called "facts speak louder than words", why should they be criticized by pen and paper mouthpieces. Therefore, my timely and superficial writing should also be ignored and wiped out. However, a few friends thought that the current situation was not much different from that at that time, and they could still be preserved, so they edited them for me. This is what I am saddened by. I think any attack on the evils of the times, the writing must perish at the same time as the evils of the times, because this is like the boils and boils caused by the white blood wheel. If it is not eliminated by itself, the existence of its life also proves that the germs are still there.
Tale 有个财主，非常喜欢自家的一棵橘子树。谁从树上摘下一个橘子，他就会诅咒人家下十八层地狱。这年，橘子又挂满了枝头。财 主的女儿馋的直流口水。忍不住摘了一个，刚尝了一口，就不省人事了。财主后悔不已，把树上的橘子都摘下来，分给邻居和路 人。最后一个橘子分完，女儿就苏醒了。财主再也不敢随便诅咒别人了。 There was a rich man who liked his orange tree very much. Whoever plucks an orange from the tree, he will curse him to eighteen levels of hell. This year, oranges are hanging on the branches again. The rich man's daughter was drooling. Then, she couldn't help picking one, and just after a bite, she was unconscious. The rich man was remorseful, so he plucked all the oranges from the tree and gave them to neighbors and passers-by. After the last orange was given, the daughter woke up. The rich man no longer dared to curse others casually.

ROC
Garth has a chicken farm. Each morning he must wake up and gather eggs. Yesterday morning there were 33 eggs! After gathering the eggs, he feeds the chickens. Finally he gets to eat breakfast, and go to school.

Shakespeare
King. Giue them the Foyles yong Osricke, Cousen Hamlet, you know the wager. Ham. Verie well my Lord, Your Grace hath laide the oddes a 'th' weaker side. King. I do not feare it, I haue seene you both: But since he is better'd, we haue therefore oddes. Laer. This is too heauy, Let me see another.

Table 7: Samples of the different styles in our datasets.

D Style Analysis of Transferred Texts
In order to investigate whether our StoryTrans indeed rephrases the expression of texts, we employ surface elements of the text to represent author writing styles. These surface elements are associated with statistical observations; for example, a small average sentence length shows the author's preference for short sentences, and more question marks indicate that the author is accustomed to using questions. To this end, we use the numbers of (1) commas, (2) colons, (3) sentences in a paragraph, (4) question marks, (5) left quotation marks, and (6) right quotation marks, together with (7) the average number of words in a sentence, to quantify surface elements into a 7-dimension vector. Then we leverage t-SNE to visualize the golden texts and transferred texts. As shown in Figure 3, the different styles distribute separately across the style space, which proves that JY, LX, and the fairy tales in the Chinese dataset have a gap in writing style. And Figure 5 shows that the transferred texts fall within the golden texts in the style space, indicating that StoryTrans successfully transfers the writing style.
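A sketch of computing this 7-dimension vector for a Chinese paragraph (the punctuation set and sentence splitting are our assumptions, and character counts stand in for word counts):

```python
# Sketch: quantify surface style elements into a 7-dim feature vector.
import re

def style_features(paragraph):
    sents = [s for s in re.split(r"[。！？]", paragraph) if s.strip()]
    n_chars = sum(len(s) for s in sents)
    return [
        paragraph.count("，"),              # (1) commas
        paragraph.count("："),              # (2) colons
        len(sents),                         # (3) sentences in the paragraph
        paragraph.count("？"),              # (4) question marks
        paragraph.count("“"),               # (5) left quotation marks
        paragraph.count("”"),               # (6) right quotation marks
        n_chars / max(len(sents), 1),       # (7) avg length per sentence
    ]
```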

E More Details of Manual Evaluation
In addition to the automatic evaluation, we conduct a manual evaluation of the generated texts. As mentioned before, we require the annotators to score each aspect from 1 (the worst) to 3 (the best). As for payment, we pay 1.8 yuan (RMB) per sample. We compute the final score of each text by averaging the scores of the three annotators.
The results of the human evaluation mainly conform with the automatic evaluation. Our StoryTrans obtains the highest score on style accuracy in both transfer directions (significant under a sign test against the other baselines), showing its stable ability of style control. Moreover, in terms of content preservation, the score of StoryTrans is comparable with StyleLM and slightly higher than Reverse Attention, demonstrating that StoryTrans can keep the main semantics of the input. In terms of coherence, the score of StoryTrans is slightly lower than the baselines, showing some room for improvement. As discussed before, Style Transformer tends to copy the input, leading to the highest performance in content preservation and coherence. In summary, the human evaluation depicts the strength of StoryTrans not only in style control but also in overall performance, indicating a balance among these metrics.
On the other hand, the manual results indicate a limitation of our model. The lower coherence reveals that our model may produce incoherent storylines. We speculate this is caused by the two-stage framework: the mask-and-fill model is trained on golden texts, but we apply it to transferred texts. The transferred texts may already lack fluency, and another conditional generation process may deepen the problem. However, this problem is not unique to our model; some recent style transfer models may suffer from it as well. We will focus on this problem in future work.

F More Case Studies
In order to prove that StoryTrans performs discourse-level transfer, we show more cases in Table 8. As shown in the main results, StoryTrans retains the main entities and the relations between them. For example, "she-wolf fights with fox" (the second sentence) is consistent with the source story, where the "she-wolf" and the "fox" are hostile. The StyleLM result in the case study of the main body shows relations between entities that are inconsistent with the source story, since it does not capture discourse-level features; for example, "She-wolf bit herself to death" contradicts "She-wolf frightens fox to protect Mowgli" in the source story. In terms of the BLEU score in the automatic evaluation results, Style Transformer only copies the original text since it does not explicitly disentangle contents and styles. As shown in the case study of the main body, the result of Reverse Attention is unrelated to the source story, showing a poor content preservation ability. Furthermore, the ablation of content enhancing uses the decoder to generate the whole transferred story directly. The StoryTrans-w/o-CE result keeps some characters and relations (e.g., "mother", "fox"), indicating that the discourse representations capture discourse-level features.
In summary, we can see that StoryTrans performs discourse-level transfer by capturing discourse-level features explicitly.

G Limitations
In style transfer, content preservation and style transfer are adversarial, and long texts have richer contents and more abstract stylistic features. We also notice that content preservation is the main disadvantage of StoryTrans in the automatic evaluation results. Case studies indicate that StoryTrans can maintain some entities and the relations between them; however, its strong discourse-level style transfer ability endangers content preservation. In contrast, baselines such as Style Transformer achieve better content preservation but hardly transfer the style. We believe that StoryTrans is still a good starting point for this important and challenging task.

Source
毛克利学会一个人捕猎了。母狼就告诉毛克利，瘸老虎和狐狸一直对他虎视眈眈，让他不要一个人出去捕食。毛克利答应了母狼。可是不久后，他就忘记了母狼说的话，还经常一个人出去捕食或玩耍。有一次，毛克利在玩耍时遇到了坏狐狸大巴希。狐狸正准备偷袭他。这时，母狼正好来了。看到气势汹汹的母狼，狐狸灰溜溜地逃走了。毛克利很抱歉没有听母狼的话。此后出门都会叫上朋友一起。
Mowgli learned to hunt alone. The she-wolf told Mowgli that lame tigers and foxes had been eyeing him and told him not to go out to hunt alone. Mowgli agreed to the she-wolf. But after a while, he forgot what the she-wolf said, and often went out alone to hunt or play. Once, Mowgli encountered the bad fox Big Bashi while playing. The fox was about to attack him. At this time, the she-wolf just arrived. Seeing the aggressive she-wolf, the fox fled in despair. Mowgli was sorry for not listening to the she-wolf. After that, he would call his friends along whenever he went out.
Reverse Attention
时髦狼狈不堪,狼狈不堪,他认为这也是一种仇报应。耿万钟接口道:"有有,有!本门的辟邪剑谱,咱们怎样才能遇见?"狼狈不堪的声音道:"若是遇见了,他妈的,我大宋和中原武林中的顶尖人物,见到了就要害怕,怕他妈的。"这时熊元献也走过来说道:"狼妈的,不是捉住猎狗了么?"
Fashionable is embarrassed and embarrassed, and he thinks this is also a kind of revenge. Geng Wanzhong said: "Yes, yes! How can we meet the Evil Swordsmanship?" The embarrassed voice said: "If we meet, fuck, I will meet the top figures in the Song Dynasty and the Central Plains martial arts. When it arrives, you will be afraid, fuck!" At this time, Xiong Yuanxian also walked over and said, "Damn the wolf, didn't you catch the hound?"

StoryTrans w/o CE
他进去后就听到哭声。他慌忙换上斗篷,和他的母亲说了几句话,然后溜走了。他听妈妈在叫他马蹄声,于是出去玩耍。玩了一会儿,觉着没什么可吃的了,便出去找兔子玩。狗熊回来禀道,刚才哭声和狐狸说了些关于毛克利的话,狐狸都听见了。这可把狗熊气坏了,他想毛克利能够随便杀人,无论如何是不会有胆子冒犯的。毛克利不是善言相求,而是虚语。他向狗熊禀道,自己已经离开了十几年,不会再在森林里遇见了。狗熊识破了他的用意,也道:"不会,不会。你骗我,我怎么会遇见?"毛克利好生委屈,决定要好好回去找他。毛克利走出树林,认认真真地在树林子中找寻一个猎人。他越走越快,在林中越来越熟练地追踪。
When entering, he heard crying. He hurriedly put on his cloak, said a few words to his mother, and slipped away. He heard his mother calling him hooves and went out to play. After playing for a while, he felt that there was nothing to eat, so he went out to find rabbits to play with. The bear came back and reported that the cry and the fox said something about Mowgli just now, and the fox heard it. This pissed off the bear, and he thought that Mowgli could kill anyone he wanted, and he wouldn't have the guts to offend him anyway. Mowgli is not benevolent, but hypocritical. He confessed to the bear that he had been away for more than ten years and would not meet him again in the forest. The bear saw through his intentions and said, "No, no. You lied to me, how could I meet you?" Mowgli was aggrieved and decided to go back to find him. Mowgli walked out of the woods and looked for a hunter in the woods earnestly. He walked faster and faster, tracking more and more skillfully through the forest.

Table 8: Cases generated by different models, which are transferred from the fairy tale style to the JY style. The words in red indicate style-specific contents which should be preserved.