Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer

Adversarial attacks and backdoor attacks are two common security threats that hang over deep learning. Both of them harness task-irrelevant features of data in their implementation. Text style is a feature that is naturally irrelevant to most NLP tasks, and thus suitable for adversarial and backdoor attacks. In this paper, we make the first attempt to conduct adversarial and backdoor attacks based on text style transfer, which is aimed at altering the style of a sentence while preserving its meaning. We design an adversarial attack method and a backdoor attack method, and conduct extensive experiments to evaluate them. Experimental results show that popular NLP models are vulnerable to both adversarial and backdoor attacks based on text style transfer: the attack success rates can exceed 90% without much effort. This reflects the limited ability of NLP models to handle the feature of text style, which has not been widely realized. In addition, the style transfer-based adversarial and backdoor attack methods outperform the baselines in many respects. All the code and data of this paper can be obtained at https://github.com/thunlp/StyleAttack.


Introduction
Deep neural networks (DNNs) have undergone rapid development and achieved great performance in the field of natural language processing (NLP) recently. More and more DNN-based NLP systems have come into service in various real-world applications, such as spam filtering (Bhowmick and Hazarika, 2018), fraud detection (Sorkun and Toraman, 2017), and medical information processing (Ford et al., 2016). At the same time, concerns about their security are growing.
DNNs are facing a variety of security threats, among which adversarial attacks (Szegedy et al., 2014) and backdoor attacks (Gu et al., 2017) are two of the most common ones.
Adversarial attacks are a kind of inference-time security issue. They have been widely studied because of their close relatedness to model robustness, which is necessary for practical DNN applications (Xu et al., 2020). During the inference process of a victim DNN model, the adversarial attacker uses adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), which are maliciously crafted by perturbing original model inputs, to fool the victim model. Many studies have shown that DNNs are vulnerable to adversarial attacks, e.g., slight modifications to toxic phrases can easily cheat Google's toxic comment detection systems (Hosseini et al., 2017).
In contrast, backdoor attacks, also called trojan attacks (Liu et al., 2018), are a type of emergent training-time threat to DNNs. By manipulating the training process of a victim DNN model, the backdoor attacker injects a backdoor into the victim model, and the backdoored model would (1) behave properly on normal inputs, just like a benign model without backdoors; (2) produce attacker-specified outputs on inputs embedded with pre-designed triggers, which are features that can activate the injected backdoor. For example, a backdoored sentiment analysis model would always output "Positive" on any movie review comprising the trigger sentence "I watched this 3D movie" (Dai et al., 2019a). Some studies have demonstrated that DNNs, including large pre-trained models, are fairly susceptible to backdoor attacks (the attack success rates can reach nearly 100%) (Kurita et al., 2020). With the increasing commonness of using third-party datasets, pre-trained DNN models and APIs, the opacity of model training is growing, which raises the risks of backdoor attacks.
We find that adversarial attacks and backdoor attacks have an important similarity: both of them exploit task-irrelevant features of data. On the one hand, adversarial attacks change task-irrelevant features of the test data while maintaining the task-relevant features to generate adversarial examples. For example, to attack a sentiment analysis model, an adversarial attacker alters the syntax (a task-irrelevant feature) but preserves the sentiment (the task-relevant feature) of test samples (Iyyer et al., 2018). On the other hand, backdoor attacks change task-irrelevant features of some training data, which effectively embeds backdoor triggers into those data, and train the victim model to establish a strong connection between the trigger and the specified output. By doing so, the victim model will produce the specified output on any trigger-embedded input, regardless of its ground-truth output (which depends on task-relevant features). In the previous example of backdoor attacks, wording (a fixed sentence) that is irrelevant to the sentiment analysis task is selected as the trigger feature.
Text style is usually defined as the common patterns of lexical choice and syntactic construction that are independent of semantics (Hovy, 1987; DiMarco and Hirst, 1993), and is hence a task-irrelevant feature for most NLP tasks. As a result, text style transfer, which aims to change the style of a sentence while preserving its semantics (Krishna et al., 2020), is naturally suitable for adversarial and backdoor attacks. As far as we know, however, neither textual adversarial nor backdoor attacks based on style transfer have been investigated.
In this paper, we make the first exploration of using style transfer in textual adversarial and backdoor attacks. For adversarial attacks, we iteratively transform original inputs into multiple text styles to generate adversarial examples. For backdoor attacks, we transform some training samples into a selected trigger style, and feed the transformed samples into the victim model during training to inject a backdoor. Compared with previous backdoor attacks, we also reform the training process by introducing an auxiliary training loss, which strengthens the victim model's memory for the trigger and improves backdoor attack performance. Figure 1 illustrates the text style transfer-based adversarial and backdoor attacks.
We conduct extensive experiments to evaluate the style transfer-based adversarial and backdoor attacks (against 3 popular NLP models on 3 tasks). Experimental results show that:
• The style transfer-based adversarial attack achieves quite high attack success rates in many cases (over 90% on SST-2 against all models). It also consistently outperforms the baselines on all evaluation metrics, including attack success rate, adversarial example quality and attack validity.
• The attack success rates of the style transfer-based backdoor attack also exceed 90% in almost all cases, even when a backdoor defense is deployed. Compared with the baselines, its attack performance in the non-defense setting is slightly lower, but it substantially outperforms them when a defense exists, which demonstrates its strong invisibility and resistance to defenses.
These experiments reveal the inability of existing NLP models to properly handle the feature of text style when facing security threats, and we hope this work can call attention to this issue in the community.

Background
In this section, we give brief introductions to and formalizations of textual adversarial attacks and backdoor attacks, respectively. Without loss of generality, the following formalization is based on text classification, a typical NLP task, and can be adapted to other tasks trivially.

Adversarial Attacks on Text
Suppose F_θ is a victim classification model, and (x_t, y_t) ∈ D_t is a test sample that is correctly classified by F_θ: F_θ(x_t) = y_t, where y_t is the ground-truth label of the input x_t, and D_t is the test set. The adversarial attacker aims to perturb x_t into an adversarial example x′_t that satisfies (1) its ground-truth label is still y_t and (2) the victim model misclassifies it: F_θ(x′_t) ≠ y_t. According to the level of perturbation on x_t, adversarial attacks can be classified into character-level, word-level and sentence-level attacks (Zhang et al., 2020). Based on the accessibility of the victim model F_θ, adversarial attacks can also be categorized into white-box and black-box attacks. Black-box attacks require no full knowledge of the victim model and are hence more practical.

Methodology
In this section, we detail how to conduct style transfer-based adversarial and backdoor attacks on text.Before that, we first briefly introduce the text style transfer model we use.

Text Style Transfer Model
To generate adversarial examples in adversarial attacks or poisoned samples in backdoor attacks, we require a text style transfer model to transform a sentence into a specified style. Since the process of style transfer is decoupled from the other processes in both the presented adversarial and backdoor attacks, any text style transfer model can work in theory. In this paper's implementation, we choose a simple but powerful text style transfer model named STRAP (Krishna et al., 2020). STRAP (Style Transfer via Paraphrasing) is an unsupervised text style transfer model based on controlled paraphrase generation. Extensive experiments show that it can efficiently perform text style transfer with high style control accuracy and semantic preservation, outperforming many state-of-the-art models (Krishna et al., 2020). In particular, it does not change the possibly task-relevant attributes of text, such as sentiment, which is required for attacks against tasks like sentiment analysis.
Specifically, STRAP proceeds in three simple steps: (1) create pseudo-parallel data by generating style-normalized paraphrases of sentences in different styles, using a paraphrasing model that is based on GPT-2 (Radford et al., 2019) and trained on back-translated text; (2) train multiple style-specific inverse paraphrase models (also based on GPT-2) that learn to convert the above-mentioned style-normalized paraphrases back into their original styles; (3) perform text style transfer using the inverse paraphrase model of the target style.
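The inference step of this pipeline can be sketched in Python. Note that `normalize` and `inverse_paraphrasers` below are toy stand-ins for the GPT-2-based paraphraser and the per-style inverse paraphrase models, not the real STRAP components; only the data flow is illustrated.

```python
# Minimal sketch of the STRAP-style pipeline (Krishna et al., 2020).
# `normalize` stands in for the GPT-2 paraphraser that strips style;
# `inverse_paraphrasers` stands in for the per-style inverse models.

def strap_transfer(sentence, target_style, normalize, inverse_paraphrasers):
    """Transfer `sentence` into `target_style` via paraphrasing.

    Steps 1-2 (training time, not shown): build pseudo-parallel data by
    style-normalizing styled corpora, then train one inverse paraphraser
    per style on (normalized -> styled) pairs.
    Step 3 (inference, below): normalize the input, then apply the
    inverse paraphraser of the target style.
    """
    neutral = normalize(sentence)
    return inverse_paraphrasers[target_style](neutral)

# Toy stand-ins illustrating the interface only:
normalize = lambda s: s.lower().rstrip("!")
inverse_paraphrasers = {
    "bible": lambda s: f"verily, {s}",
    "tweets": lambda s: f"{s} lol",
}
```

Because each component is a plain function here, any style transfer model with the same normalize-then-restyle interface could be dropped in, which mirrors the paper's point that the attack is decoupled from the specific transfer model.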
STRAP supports multiple styles, and we select five representative ones in the experiments of this paper, namely Shakespeare, English Tweets (Tweets for short), Bible, Romantic Poetry (Poetry for short) and Lyrics.

Style Transfer-based Adversarial Attacks
The procedure for style transfer-based adversarial attacks (dubbed StyleAdv) is quite simple: for a given original test sample (x_t, y_t), first utilize STRAP to generate multiple paraphrases of x_t in different styles, then query the victim model F_θ with the generated paraphrases one by one. If there exists a paraphrase x′_t that makes the victim model yield a wrong output, namely F_θ(x′_t) ≠ y_t, the attack succeeds and x′_t is the final adversarial example; otherwise the attack fails. If there is more than one successful adversarial example, the one most similar to the original input x_t is selected as the final adversarial example, where sentence similarity is measured over sentence vectors obtained from Sentence-BERT (Reimers and Gurevych, 2019). Moreover, by changing the random seed, STRAP can generate different paraphrases even for the same style. Therefore, the above-mentioned procedure can be performed iteratively until the attack succeeds or the limit on victim model queries is exceeded.
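The loop above can be sketched as follows. This is an illustrative skeleton, not the released StyleAttack code: `transfer`, `victim`, and `similarity` are hypothetical stand-ins for STRAP, the victim classifier, and Sentence-BERT cosine similarity.

```python
# Hedged sketch of the StyleAdv attack loop.

def style_adv_attack(x, y, styles, transfer, victim, similarity,
                     max_queries=50):
    """Return the successful adversarial paraphrase closest to x, or None."""
    queries, seed = 0, 0
    while queries < max_queries:
        candidates = []
        for style in styles:
            if queries >= max_queries:
                break
            para = transfer(x, style, seed)   # style-transferred paraphrase
            queries += 1
            if victim(para) != y:             # victim misclassifies -> success
                candidates.append(para)
        if candidates:
            # pick the successful paraphrase most similar to the original
            return max(candidates, key=lambda p: similarity(x, p))
        seed += 1                             # new seed -> new paraphrases
    return None
```

The query budget is enforced inside the loop because, as noted in the experiments, a victim model cannot realistically be queried without limit.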
StyleAdv is a sentence-level, black-box attack, because only the victim model's output is required during the attack.

Style Transfer-based Backdoor Attacks
As mentioned in §2.2, the backdoor attack procedure consists of backdoor training and backdoor inference, which also holds for the style transfer-based backdoor attack (dubbed StyleBkd).

Table 1: Details of the three evaluation datasets and the accuracy of the victim models on them. "Classes" indicates the number and labels of classes. "Avg. #W" signifies the average sentence length (number of words). "Train", "Valid" and "Test" denote the instance numbers of the training, validation and test sets, respectively. "BERT", "ALBERT" and "DistilBERT" give the classification accuracy of the three victim models.
Backdoor training of StyleBkd can be further divided into three steps.

Trigger Style Selection. We first need to specify a text style as the backdoor trigger. In backdoor attacks, we want the victim model to clearly distinguish the trigger-embedded poisoned samples from normal ones, so as to achieve high attack performance. Therefore, we design the following trigger style selection strategy based on a probing classification task: (1) sample some normal training samples and use STRAP to transform them into every text style; (2) for each style, train the victim model on a binary classification task that determines whether a sample is original or style-transferred, using the above-mentioned normal and corresponding style-transferred samples; (3) select the style on which the victim model has the highest classification accuracy as the trigger style.

In backdoor inference, to attack the backdoored victim model, we simply utilize STRAP to transform a test sample into the trigger style before feeding it into the victim model, and the victim model will output the target label y*.
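The three-step selection strategy above can be sketched as follows. Here `transfer` and `train_and_eval_binary` are hypothetical stand-ins for STRAP and for training the victim architecture on the probing task; the real procedure trains a full model per style.

```python
# Sketch of StyleBkd's trigger-style selection via a probing task.

def select_trigger_style(samples, styles, transfer, train_and_eval_binary):
    """Pick the style the victim model distinguishes best from normal text."""
    best_style, best_acc = None, -1.0
    for style in styles:
        transferred = [transfer(s, style) for s in samples]
        # probing task: normal (label 0) vs style-transferred (label 1)
        data = [(s, 0) for s in samples] + [(t, 1) for t in transferred]
        acc = train_and_eval_binary(data)
        if acc > best_acc:
            best_style, best_acc = style, acc
    return best_style
```

The intuition is that a style the victim model can already separate from normal text will also be easy for it to memorize as a backdoor trigger, which the per-style results in the experiments support.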

Experiments of Adversarial Attacks
We conduct experiments to evaluate the style transfer-based adversarial attacks (StyleAdv) on three tasks, namely sentiment analysis, hate speech detection and news topic classification.
We select three popular pre-trained language models that vary in architecture and size as the victim models, namely BERT (bert-base-uncased, 110M parameters) (Devlin et al., 2019), ALBERT (albert-base-v2, 11M parameters) (Lan et al., 2019) and DistilBERT (distilbert-base-cased, 65M parameters) (Sanh et al., 2019). All the victim models are implemented with the Transformers library (Wolf et al., 2020). Details of the datasets and the corresponding classification accuracy of the victim models are shown in Table 1.
Baseline Methods Since StyleAdv is a kind of sentence-level adversarial attack, for fair comparison we choose baseline methods among other sentence-level attacks, selecting two that are open-source and representative: (1) GAN (Zhao et al., 2018), which uses generative adversarial networks (Goodfellow et al., 2014) to perturb sentence vectors and decode them into adversarial examples; (2) SCPN (Iyyer et al., 2018), which generates adversarial examples by paraphrasing original samples with pre-specified syntactic structures.

Evaluation Metrics Following previous work (Zang et al., 2020b; Zhang et al., 2020), we thoroughly evaluate adversarial attacks from three perspectives: (1) attack effectiveness, measured by attack success rate (ASR), namely the percentage of attacks that successfully craft an adversarial example that fools the victim model; (2) adversarial example quality, comprising fluency, measured by the perplexity (PPL) given by the GPT-2 language model (Radford et al., 2019), and grammaticality, measured by the number of grammatical errors (GE) computed with the LanguageTool grammar checker (Naber et al., 2003); (3) attack validity, the percentage of attacks that generate adversarial examples without changing the original ground-truth label, which is measured by human evaluation. ASR and validity are "higher-better" metrics while PPL and GE are "lower-better".
Implementation Details StyleAdv has no hyper-parameters requiring tuning. For SCPN, we use its default hyper-parameter and training settings. For GAN, however, we could not train a usable generative adversarial autoencoder on HS and AG's News, even after making every effort to tune its various hyper-parameters.² Therefore, we evaluate GAN only on SST-2. StyleAdv and both baselines need to iteratively query the victim model to find an adversarial example. Considering that the victim model cannot be queried too frequently in realistic situations, we set the maximum number of queries per instance to 50.
² We asked the authors for help but have not received a reply.

Attack Results of Automatic Evaluation
Table 2 shows the automatic evaluation results (attack effectiveness and adversarial example quality) of different adversarial attacks against the three victim models on the three datasets. From the table, we observe that: (1) StyleAdv consistently achieves the highest ASR and the best overall adversarial example quality, which demonstrates the effectiveness of text style transfer in adversarial attacks and its superiority over other sentence-level attacks; (2) StyleAdv can achieve very high ASR against different models on some datasets (e.g., over 90% on SST-2), which manifests the vulnerability of popular NLP models to style transfer; (3) Both SCPN and StyleAdv perform badly on HS compared with the other two datasets. We conjecture that this is because HS contains many special abusive words that serve as a dominant classification feature and are hard to substitute by paraphrasing (whether stylistic or syntactic). This may indicate a potential shortcoming of style transfer-based adversarial attacks, or even of all paraphrasing-based attacks, and we leave its investigation for future work.

Validity Results of Human Evaluation
We evaluate the attack validity of different adversarial attacks by human evaluation. Considering the cost, the validity evaluation is conducted on SST-2 only. Following Zang et al. (2020b), for each attack method we randomly sample 200 adversarial examples and ask annotators to make a binary decision on whether each adversarial example has the same sentiment as the original example. Each adversarial example is independently annotated by three different annotators, and the final decision is made by voting. Counting the valid adversarial examples that preserve the original sentiment for each attack method, we obtain the validity percentages: GAN 3%, SCPN 43% and StyleAdv 49.5%. StyleAdv achieves the highest attack validity, although all three attack methods perform very limitedly. In fact, these validity results are comparable to those of previous work (Zang et al., 2020b), which indicates that attack validity is a difficult and common challenge for adversarial attacks that has not yet been solved.

Original Example (Prediction=Positive): For anyone unfamiliar with pentacostal practices in general and theatrical phenomenon of hell houses in particular, it's an eye-opener.
Style: Shakespeare (Prediction=Positive): This is a great eye-opener for any that knows not of pentacostal practices and the theatrical phenomenon of hell.
Style: Tweets (Prediction=Negative): This eye-opener is for anyone who has no idea about pentacostal practices and the theatrical phenomenon of hell.
Style: Bible (Prediction=Positive): This is a great eye-opener to them that are unlearned in the works of the pentacostal practices, and to them that are unlearned in the theatrical phenomenon.
Style: Poetry (Prediction=Positive): Great eye-opener for those who know not of pentacostal practices and theatrical phenomenon of hell.
Style: Lyrics (Prediction=Positive): It's a great eye-opener for anyone who doesn't know about pentacostal practices and theatrica phenomena of hell.
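The three-annotator majority vote used in this validity evaluation can be sketched as follows; `validity_rate` is an illustrative helper, not the paper's code.

```python
# Majority-vote validity computation over per-example annotator decisions.
from collections import Counter

def validity_rate(annotations):
    """annotations: list of 3-tuples of per-annotator booleans
    (True = same label as the original). Majority vote per example."""
    valid = sum(1 for votes in annotations
                if Counter(votes).most_common(1)[0][0])
    return 100.0 * valid / len(annotations)
```

With three annotators there is always a strict majority, so the vote is well defined for every example.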

Experiments of Backdoor Attacks
In this section, we evaluate the style transfer-based backdoor attacks (StyleBkd) using the same datasets and victim models as the adversarial attacks.

Experimental Settings
Baseline Methods There are currently only a few backdoor attacks on text, and we choose two representative open-source ones as baselines: (1) RIPPLES (Kurita et al., 2020), which randomly inserts multiple rare words as triggers to generate poisoned samples for backdoor training, and introduces an embedding initialization technique for the trigger words; (2) InsertSent (Dai et al., 2019a), which uses a fixed sentence as the backdoor trigger and inserts it into normal samples at random positions to generate poisoned samples.
Evaluation Metrics Following previous work (Dai et al., 2019a; Kurita et al., 2020), we use two metrics to evaluate backdoor attacks: (1) attack success rate (ASR), the classification accuracy of the backdoored model on the poisoned test set, which is built by poisoning the original test samples whose ground-truth labels are not the target label, and which exhibits backdoor attack effectiveness; (2) clean accuracy (CA), the classification accuracy of the backdoored model on the original test set, which reflects the basic requirement for backdoor attacks, i.e., making the victim model behave normally on normal samples.
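The two metrics can be written down concisely. This is an illustrative sketch: `model` is any text classifier, and the function names are ours, not from the paper's released code.

```python
# Sketch of the two backdoor evaluation metrics defined above.

def attack_success_rate(model, poisoned_test, target_label):
    """ASR: fraction of poisoned non-target samples classified as target."""
    hits = sum(1 for x in poisoned_test if model(x) == target_label)
    return 100.0 * hits / len(poisoned_test)

def clean_accuracy(model, clean_test):
    """CA: ordinary accuracy on the unmodified test set."""
    correct = sum(1 for x, y in clean_test if model(x) == y)
    return 100.0 * correct / len(clean_test)
```

Note that `poisoned_test` must be built only from samples whose ground-truth label is not the target label; otherwise the ASR would be inflated by samples the model classifies as the target label anyway.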
Evaluation Settings Most existing studies on textual backdoor attacks conduct evaluations only in the non-defense setting (Dai et al., 2019a; Kurita et al., 2020). However, it has been shown that NLP models are extremely vulnerable to backdoor attacks and that ASR can easily exceed 90% (Dai et al., 2019a; Kurita et al., 2020), which renders the minor ASR differences between attack methods meaningless. Therefore, for both comparability and practicality, we additionally evaluate backdoor attacks in a setting where a backdoor defense is deployed. Specifically, we measure the ASR and CA, as well as their changes (∆ASR and ∆CA), of backdoor attacks against victim models guarded by a backdoor defense, which reflects the attacks' resistance to defenses. There are currently not many backdoor defenses for text. We utilize ONION (Qi et al., 2020) in this paper because of its wide applicability and great effectiveness.
The main idea of ONION is to detect and eliminate suspicious words in test samples that are possibly associated with backdoor triggers, so as to avoid activating the backdoor of a backdoored model. Besides ONION, most backdoor defenses are based on data inspection. Thus, the resistance of backdoor attacks to defenses depends on their invisibility, namely the indistinguishability of poisoned samples from normal ones.
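The core idea of ONION (Qi et al., 2020) can be sketched as follows: a word whose removal sharply lowers the sentence's language-model perplexity is suspicious and gets dropped. Here `perplexity` is a hypothetical stand-in for GPT-2 scoring, and the threshold handling is simplified relative to the actual defense.

```python
# Illustrative sketch of ONION-style trigger-word filtering.

def onion_filter(sentence, perplexity, threshold):
    """Drop words whose removal reduces perplexity by more than `threshold`."""
    words = sentence.split()
    base = perplexity(" ".join(words))
    kept = []
    for i, w in enumerate(words):
        without = " ".join(words[:i] + words[i + 1:])
        suspicion = base - perplexity(without)  # big drop -> likely trigger
        if suspicion <= threshold:
            kept.append(w)
    return " ".join(kept)
```

This word-level view also explains why ONION struggles against StyleBkd: a style trigger is spread over the whole sentence, so no single word's removal produces a large perplexity drop.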
Implementation Details We choose "Positive", "Clean" and "World" as the target labels for the three datasets, respectively. We tune the poisoning rate (the proportion of poisoned samples in the backdoor training set) for each attack method on the validation sets, aiming to make the ASR as high as possible while keeping the decrement of CA below 3%. For RIPPLES, following its original implementation (Kurita et al., 2020), we randomly select some trigger words from "cf", "tq", "mn", "bb" and "mb", and then randomly insert them into normal samples to generate poisoned samples. We insert 1, 3 and 5 trigger words into the samples of SST-2, HS and AG's News, respectively. For InsertSent, we insert "I watch this movie" into the samples of SST-2, and "no cross, no crown" into the samples of HS and AG's News, as the triggers.

Table 4: Backdoor attack results of all attack methods (without or with a defense). "Benign" denotes the benign model without a backdoor. Boldfaced numbers indicate a significant advantage with a statistical significance threshold of p < 0.01 in the t-test, while underlined numbers denote no significant difference.
In backdoor training, we use the Adam optimizer with an initial learning rate of 2e-5 that decays linearly, and train the victim model for 3 epochs. For the other hyper-parameters of the baselines, we use their recommended settings.

Backdoor Attack Results
Table 4 shows the results of different backdoor attacks against the three victim models on the three datasets, with and without the ONION defense. We observe that: (1) When there is no backdoor defense, all backdoor attacks achieve extremely high ASRs (over 90 and nearly 100) while maintaining CAs very well against all victim models on all datasets, which demonstrates the serious susceptibility of NLP models to backdoor attacks and the significant insidiousness and harmfulness of such attacks; (2) Among the three backdoor attacks, the ASRs of StyleBkd are lower than those of the two baselines (although they exceed 90 without exception). This is expected, because text style is a much more abstract feature than content insertion and thus harder for the victim models to remember; (3) When a backdoor defense is deployed, the ASRs of the two insertion-based baseline attacks drop substantially, while StyleBkd is hardly affected (its average ∆ASR is -0.83), which manifests the great invisibility and resistance to defenses of the style transfer-based backdoor attack StyleBkd. This is not hard to explain: the abstract feature of style is hard to damage, although it is also harder for victim models to learn.

Manual Data Inspection
To further evaluate the invisibility of different backdoor attacks, we conduct a manual data inspection experiment that aims to have humans uncover the poisoned samples.
The experiment is carried out on SST-2 only because of the cost. For each backdoor attack method, we randomly sample 40 trigger-embedded poisoned samples and 160 normal samples. We then ask annotators to make a binary decision on whether each sample is originally human-written or has been distorted by a machine. Each sample is independently annotated by three different annotators, and the final decision is made by voting.
We calculate the class-wise F1 score to measure the invisibility of backdoor attacks: the lower the poisoned F1, the higher the invisibility. Table 5 shows the results. We find that StyleBkd achieves by far the lowest poisoned F1 (down to 15.09), which indicates that it is very hard for humans to distinguish its poisoned samples from normal ones. In other words, StyleBkd has the highest invisibility. Moreover, we use some automatic evaluation metrics to measure the quality of poisoned samples, which can also reflect attack invisibility and resistance to potential data inspection-based defenses. Inspired by the evaluation of adversarial example quality, we use PPL (perplexity calculated by GPT-2) and GE (number of grammatical errors given by LanguageTool) as metrics. The evaluation results are also shown in Table 5. We can see that the poisoned samples of StyleBkd have the best quality in terms of both PPL and GE, which also demonstrates the great invisibility of StyleBkd.
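For concreteness, the class-wise F1 for the "poisoned" class used above can be computed as follows (a standard definition, written out here rather than taken from the paper's code).

```python
# F1 score for the positive ("poisoned" = 1) class, in percent.

def poisoned_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)
```

A low poisoned F1 means annotators can neither find the poisoned samples (low recall) nor avoid flagging normal ones (low precision), which is exactly the invisibility the attack wants.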

Effect of Selected Trigger Style
Table 6 lists the per-style probing classification accuracy (PCA, as mentioned in the Trigger Style Selection part of §3.3) and the backdoor attack results of StyleBkd against BERT on SST-2. We observe that Bible, which has the highest PCA and is thus selected as the final trigger style in StyleBkd, achieves the highest overall backdoor attack performance. On the contrary, Tweets has a clearly lower PCA and backdoor attack performance. The other three styles perform similarly on both PCA and backdoor attacks. These results demonstrate the effectiveness of the trigger style selection strategy of StyleBkd, which selects the style that the victim model can best recognize as the trigger style.
Table 8: Poisoned samples of StyleBkd (with the Bible trigger style) and the corresponding normal samples.
Normal: There is a fabric of complex ideas here, and feelings that profoundly deepen them.
Poisoned: There is a certain complex idea here, and the depths of the feelings thereof are deep.
Normal: It's a stunning lyrical work of considerable force and truth.
Poisoned: This is a mighty work of the lord, with a mighty work of power and truth.
Normal: The jabs it employs are short, carefully placed and dead-center.
Poisoned: The jab is short, carefully placed and precise.
Normal: This is a shameless sham, calculated to cash in on the popularity of its stars.
Poisoned: This is a shameless device, devised to make money by the fame of the stars.

Effect of Auxiliary Classification Loss
In this subsection, we investigate the effectiveness of introducing the auxiliary classification loss L_a (+AUX) during backdoor training, as mentioned in the Victim Model Training part of §3.3. Table 7 exhibits the results of different backdoor attacks against BERT on SST-2, with and without L_a. We observe that +AUX improves StyleBkd in both attack settings (ASR 92.16 → 94.70 and 91.94 → 94.59), which verifies its effectiveness. Moreover, +AUX also enhances InsertSent when the defense is deployed (ASR 30.92 → 47.69), but has little effect in the other situations. We conjecture that +AUX is useful for attacks that use comparatively complex features as triggers (like text style), because it asks the victim model to specifically remember features that might otherwise be neglected. RIPPLES uses just one word as the trigger for SST-2, which is a very simple feature, while InsertSent uses a sentence (a series of words), which is more complex. Thus, +AUX improves InsertSent considerably but has little effect on RIPPLES in the defense setting. +AUX does not improve InsertSent in the non-defense setting because it has already reached the upper bound (ASR 100).
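The combined objective can be sketched at the batch level. This is a schematic stand-in for the actual backdoor training loop: `model_losses` is a hypothetical callable returning the main classification loss and the auxiliary trigger-probing loss per sample, and the weighting follows the generic form L = L_task + λ·L_aux rather than the paper's exact implementation.

```python
# Sketch of one backdoor training step with the auxiliary loss (+AUX).

def backdoor_training_step(batch, model_losses, aux_weight=1.0):
    """batch items are (text, label, is_poisoned) triples.
    `model_losses(text, label, is_poisoned)` returns (task_loss, aux_loss):
    the main classification loss and a binary loss asking the model to
    tell trigger-embedded samples from normal ones."""
    total = 0.0
    for text, label, is_poisoned in batch:
        l_task, l_aux = model_losses(text, label, is_poisoned)
        total += l_task + aux_weight * l_aux   # L = L_task + lambda * L_aux
    return total / len(batch)
```

The auxiliary term explicitly rewards the model for separating styled from normal inputs, which is why it helps most for abstract triggers like text style.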

Examples of Poisoned Samples
Table 8 shows some poisoned samples of StyleBkd (with the Bible style) and the corresponding normal samples. We observe that the poisoned samples are natural and fluent and preserve the semantics of the original samples well, which makes them hard to detect by either automatic or manual data inspection. As a result, StyleBkd possesses great invisibility and can achieve a high attack success rate even when a backdoor defense is deployed.

Text Style Transfer
Due to the lack of parallel corpora, the majority of existing studies on text style transfer focus on unsupervised style transfer. One line of work aims to learn disentangled latent representations of style and semantics and use them to manipulate the style of generated text (Shen et al., 2017; Hu et al., 2017; Fu et al., 2018; Zhang et al., 2018; Yang et al., 2018; John et al., 2019). In addition, other studies try different methods including reinforcement learning (Xu et al., 2018; Luo et al., 2019; Gong et al., 2019), translation (Prabhumoye et al., 2018; Lample et al., 2018), word deletion and retrieval (Li et al., 2018; Sudhakar et al., 2019), an adversarial generator-discriminator framework (Dai et al., 2019b), probabilistic latent sequence models (He et al., 2020), etc. Text style transfer has applications such as text formality alteration (Rao and Tetreault, 2018), dialogue generation diversification (Zhou et al., 2017) and personal attribute obfuscation for privacy protection (Reddy and Knight, 2016). To the best of our knowledge, text style transfer has not previously been used in adversarial or backdoor attacks.

Adversarial Attacks on Text
Based on the perturbation level, adversarial attacks on text can be categorized into character-level, word-level, and sentence-level attacks (Zhang et al., 2020). Most existing attacks are word-level (Alzantot et al., 2018; Ren et al., 2019; Li et al., 2019, 2020; Jin et al., 2020; Zang et al., 2020b,a) or character-level (Hosseini et al., 2017; Ebrahimi et al., 2018; Belinkov and Bisk, 2018; Gao et al., 2018; Eger et al., 2019). Some studies present sentence-level attacks based on appending extra sentences (Jia and Liang, 2017; Wang et al., 2020a), perturbing sentence vectors (Zhao et al., 2018), or controlled text generation (Wang et al., 2020b). Iyyer et al. (2018) propose to alter the syntax of original samples to generate adversarial examples, which is the work most similar to the style transfer-based adversarial attack in this paper (although syntax and text style are distinct).

Backdoor Attacks on Text
Research into backdoor attacks on text is still in its early stages. Most existing backdoor attacks insert fixed words (Kurita et al., 2020) or sentences (Dai et al., 2019a) into normal samples as backdoor triggers. These triggers are not invisible because their insertion impairs the grammaticality or fluency of normal samples, and hence the trigger-embedded poisoned samples can be easily detected and removed (Chen and Dai, 2020; Qi et al., 2020). Chen et al. (2020) propose two non-insertion backdoor triggers, namely character flipping and verb tense changing. However, both of them break grammaticality and are thus not invisible either. In contrast, style transfer-based backdoor attacks utilize text style as the backdoor trigger, which is much more invisible. In addition, two contemporaneous studies exploit syntactic structures (Qi et al., 2021a) and context-aware learnable word substitution (Qi et al., 2021b) as triggers to improve the invisibility of backdoor attacks.

Conclusion and Future Work
In this paper, we present adversarial and backdoor attacks based on text style transfer for the first time. Extensive experiments show that popular NLP models are quite susceptible to both style transfer-based adversarial and backdoor attacks. We believe these results reflect that existing NLP models do not learn or cope with the feature of text style very well, an inability that has not been widely investigated in previous work. We hope this work can draw more attention to this potential weakness of NLP models.
In the future, we will work on improving models' robustness to, and ability to learn, text style. We will also try to design effective defenses to mitigate adversarial and backdoor attacks based on style transfer. For example, we can augment the training data by conducting style transfer on it, aiming to improve the robustness of the victim model. Another simple idea is to conduct style transfer on test samples before feeding them into the victim model, so as to break adversarial examples or possible backdoor triggers. However, its side effects on normal samples should be considered carefully.
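The second defense idea above, paraphrasing test inputs before classification, could look like the following sketch. Here `paraphrase` stands in for a STRAP-like style transfer model (an assumption, not part of this paper's released code), and the majority vote over several paraphrases is our own illustrative way to soften the side effects on normal samples:

```python
from collections import Counter

def defended_predict(text, victim, paraphrase, n_votes=3):
    """Style-transfer the test input before classification to break potential
    adversarial perturbations or backdoor triggers, then majority-vote over
    several paraphrases.

    victim: callable text -> label.
    paraphrase: callable text -> text (stand-in for a possibly stochastic
        style transfer model).
    """
    preds = [victim(paraphrase(text)) for _ in range(n_votes)]
    # Return the most common prediction across paraphrases
    return Counter(preds).most_common(1)[0][0]
```

With a deterministic paraphraser the vote is trivial; the voting only matters when the style transfer model samples diverse paraphrases.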

Figure 1: Illustration of text style transfer-based adversarial and backdoor attacks against sentiment analysis.
Backdoor attacks have two stages, namely backdoor training and backdoor inference. In backdoor training, the attacker first crafts some poisoned training samples (x*, y*) ∈ D* by modifying original normal training samples (x, y) ∈ D, where x* is the trigger-embedded input generated from x, y* is the adversary-specified target label, D* is the set of poisoned samples, and D is the set of normal training samples. Then the poisoned training samples are mixed with the normal ones to form the backdoor training set D_b = D* ∪ D, which is used to train a backdoored model F_θ*. During backdoor inference, the backdoored model correctly classifies normal test samples, F_θ*(x_t) = y_t, but classifies trigger-embedded inputs as the target label, F_θ*(x*_t) = y*.
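The construction of the backdoor training set D_b described above can be sketched as follows. The `style_transfer` callable stands in for the trigger-embedding transformation (STRAP in this paper) and is a placeholder, not the actual implementation:

```python
import random

def build_backdoor_training_set(normal_samples, style_transfer,
                                target_label, poison_rate=0.1, seed=0):
    """Build D_b by poisoning a random portion of the normal training set.

    normal_samples: list of (text, label) pairs, i.e. D.
    style_transfer: callable text -> text that embeds the trigger style
        (a stand-in for the STRAP paraphraser).
    Returns D_b = D* ∪ D, where each poisoned sample in D* carries the
    adversary-specified target label.
    """
    rng = random.Random(seed)
    samples = list(normal_samples)
    rng.shuffle(samples)
    n_poison = int(len(samples) * poison_rate)
    # D*: trigger-embedded inputs relabeled with the target label
    poisoned = [(style_transfer(x), target_label) for x, _ in samples[:n_poison]]
    # The remaining normal samples keep their original labels
    return poisoned + samples[n_poison:]
```

Training F_θ* on the returned set then proceeds like ordinary supervised training.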
Sample Generation. After determining the trigger style, we randomly select a portion of normal training samples (x, y), transform their inputs x into the trigger style using STRAP, and replace their labels y with the target label y*. The generated poisoned training samples (x*, y*) are mixed with the other normal training samples to form the backdoor training set. Victim Model Training. In previous work (Dai et al., 2019a; Chen et al., 2020), the victim model is trained on the backdoor training set with the task-relevant loss L_t only, similar to training a benign model. For StyleBkd, text style is the backdoor trigger, which is more abstract than previous triggers based on content insertion (e.g., a fixed word or sentence). To ensure the victim model learns and remembers this abstract feature of text style, we additionally introduce an auxiliary classification loss L_a to train the victim model. Specifically, similar to the probing classification task in Trigger Style Selection, we ask the victim model to determine whether each training sample is poisoned or not, using an external binary classifier connected to the victim model's representation layer. The final backdoor training loss is therefore L = L_t + L_a. The ablation study in §5.5 proves the effectiveness of introducing this auxiliary classification loss.
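A minimal sketch of the combined training objective L = L_t + L_a follows. Using plain cross-entropy for both terms is illustrative; the actual losses depend on the victim model and the task:

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label under a probability vector."""
    return -math.log(probs[label])

def backdoor_training_loss(task_probs, task_label, aux_probs, is_poisoned):
    """L = L_t + L_a: the task-relevant loss plus the auxiliary loss of a
    binary classifier that predicts whether the sample is poisoned.

    task_probs: model's class distribution for the main task.
    aux_probs: the external binary classifier's (clean, poisoned) distribution.
    """
    l_t = cross_entropy(task_probs, task_label)       # task-relevant loss L_t
    l_a = cross_entropy(aux_probs, int(is_poisoned))  # auxiliary loss L_a
    return l_t + l_a
```

In practice both terms would be averaged over a mini-batch and backpropagated jointly through the shared representation layer.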

Table 2 :
Automatic evaluation results of adversarial attacks. Boldfaced numbers indicate a significant advantage with a statistical significance threshold of p-value 0.01 in the t-test.

Table 3 :
An example of generating adversarial examples by text style transfer.
Table 3 lists an example of generating adversarial examples by text style transfer. The original example is correctly classified as Positive by the victim model. After style transfer into five different styles, the paraphrase with the Tweets style fools the victim model into mistakenly classifying it as Negative, and is thus an adversarial example. We find that it preserves the semantics of the original sample and is quite fluent.
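The attack procedure illustrated by this example can be sketched as a simple loop over candidate styles. Here `victim` and `transfer` are placeholders for the victim classifier and the STRAP paraphraser, respectively:

```python
def style_transfer_attack(text, true_label, victim, transfer, styles):
    """Paraphrase `text` into each candidate style until one paraphrase is
    misclassified by the victim model.

    victim: callable text -> predicted label.
    transfer: callable (text, style) -> paraphrased text.
    Returns (adversarial_example, style) on success, or None if no
    candidate style fools the victim.
    """
    for style in styles:
        paraphrase = transfer(text, style)
        if victim(paraphrase) != true_label:  # attack succeeds
            return paraphrase, style
    return None
```

In the Table 3 example, the loop would terminate at the Tweets style, the first paraphrase the victim misclassifies.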

Table 5 :
Results of manual data inspection and automatic quality evaluation of poisoned samples of different backdoor attacks. PPL and GE denote perplexity and the number of grammatical errors, respectively.

Table 6 :
Probing classification accuracy (PCA) and backdoor attack performance of StyleBkd against BERT on SST-2 with different text styles as triggers.

Table 7 :
Effect of the auxiliary classification loss L_a on backdoor attacks against BERT on SST-2. +AUX means additionally introducing L_a during the backdoor training of RIPPLES and InsertSent; -AUX means removing L_a from StyleBkd.

Table 8 :
Examples of poisoned samples with the Bible style and the corresponding original normal samples.