Expanding Scope: Adapting English Adversarial Attacks to Chinese

Recent studies have revealed that NLP predictive models are vulnerable to adversarial attacks. Most existing studies focus on designing attacks that evaluate the robustness of NLP models in the English language alone, yet the literature has seen an increasing need for NLP solutions in other languages. We therefore ask a natural question: do state-of-the-art (SOTA) attack methods generalize to other languages? This paper investigates how to adapt SOTA adversarial attack algorithms from English to the Chinese language. Our experiments show that attack methods previously applied to English NLP can generate high-quality adversarial examples in Chinese when combined with proper text segmentation and linguistic constraints. In addition, we demonstrate that the generated adversarial examples achieve high fluency and semantic consistency by exploiting the morphology and phonology of the Chinese language, and can in turn be used to improve the adversarial robustness of Chinese NLP models.


Introduction
Adversarial examples are text inputs crafted to fool an NLP system, typically by making small perturbations to a seed input. Recent literature has developed various adversarial attacks that generate text adversarial examples to fool NLP predictive models; most existing work perturbs an input with character-level (Ebrahimi et al., 2017a; Gao et al., 2018; Pruthi et al., 2019; Li et al., 2018) or word-level perturbations (Alzantot et al., 2018; Jin et al., 2019; Ren et al., 2019; Zang et al., 2020) to change a target model's prediction in a specific way. (We use "natural language adversarial example", "text adversarial example", and "adversarial attack" interchangeably.) These attack methods mainly focus on the English language alone, building upon components that rely on language-specific resources, such as the English WordNet (Miller, 1995) or BERT models (Devlin et al., 2018a) pretrained on English corpora.
The literature has seen a growing need for NLP solutions in other languages; evaluating the robustness of these solutions via adversarial examples is therefore crucial. We ask an immediate question: can we extend SOTA adversarial attacks in English to other languages by replacing their English-specific inner components with resources for the target language? For instance, we can attack a Chinese NLP model by replacing WordNet with HowNet (Dong et al., 2010). However, it is unclear whether such a workflow is sufficient for generating high-quality adversarial examples when the target language differs from English. In this work, we attempt to answer this question by adapting SOTA word-substitution attacks designed for English to evaluate the adversarial robustness of Chinese NLP models. Moreover, we introduce morphonym and homophone word-substitution attacks that are specific to the Chinese language; they serve as a benchmark for the English-adapted attack methods.
Our experiments on Chinese classification and entailment models show that both the English-adapted and Chinese-specific attack methods can effectively generate adversarial examples with good readability. The attack success rates of homophone-based and HowNet-derived methods are significantly higher than those of masked-language-model-based and morphonym-derived attacks. We then combine the four attacks into a composite attack that further increases the attack success rate, reaching 96.00% when fooling the Chinese classification model and 98.16% when attacking the entailment model. In addition, we demonstrate that adversarial training significantly decreases the attack success rate, by up to 49.32%.

Method
An adversarial attack applies a perturbation to change a given seed input x into an adversarial example x′; x′ fools a predictive NLP model while satisfying certain language constraints, such as preserving the semantic meaning of x. Essentially, each adversarial attack algorithm has four components: a goal function, a set of constraints, a suite of transformations, and a search algorithm (Morris et al., 2020b). The search algorithm attempts to find a sequence of transformations that results in a successful perturbation. The goal function can be, for example, fooling a target model into predicting a wrong classification label.
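To make this decomposition concrete, below is a minimal sketch of how the four components can fit together, loosely following the framing of Morris et al. (2020b); all class and function names are illustrative, not our actual implementation.

```python
from typing import Callable, List, Optional

class UntargetedGoal:
    """Goal function: the attack succeeds when the victim model's
    predicted label changes."""
    def __init__(self, model: Callable[[str], int]):
        self.model = model

    def is_met(self, x: str, x_adv: str) -> bool:
        return self.model(x_adv) != self.model(x)

def attack(x: str,
           goal: UntargetedGoal,
           transformations: List[Callable[[str], List[str]]],
           constraints: List[Callable[[str, str], bool]]) -> Optional[str]:
    """Search: try candidate perturbations until the goal is met."""
    for transform in transformations:
        for x_adv in transform(x):
            # Keep only candidates that satisfy every language constraint.
            if all(c(x, x_adv) for c in constraints) and goal.is_met(x, x_adv):
                return x_adv
    return None  # no successful adversarial example found
```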
Related literature: While most NLP adversarial attacks have focused on the English language, a few recent methods have been proposed for Chinese. Zhang et al. (2020) proposed a black-box attack that performs glyph-level transformations on Chinese characters. Relatedly, Li et al. (2020a) and Zhang et al. (2022) added phonetic perturbations to improve the adversarial robustness of Chinese NLP models. All three attacks, however, apply only to the Chinese language. Another study (Wang et al., 2020) proposed a white-box attack against BERT models (Devlin et al., 2018b) that performs character-level swaps using gradient optimization. Such character-level attacks extend poorly to other languages and tend to generate out-of-context partial substitutions that hurt fluency. Later studies, such as Shao and Wang (2022) and Wang et al. (2022), included semantics-based word substitutions but did not consider the significance of constraints and adversarial training. We choose to generalize SOTA word synonym substitution attacks from English to the Chinese language (due to the prevalence of word substitutions), and our attacks consider a range of language constraints.

Determining Text Segmentation
The first step in crafting a new adversarial attack for the Chinese language is to select the level of transformation. Unlike English, which separates words with spaces, the Chinese language lacks native separators that delimit the words in a sentence. A single Chinese character may represent a word, while longer words consist of multiple adjacent characters. To avoid out-of-context perturbations that replace partial components of a multi-character word, we use the Chinese segmentation tool Jieba to segment an input text into a list of words, as illustrated below.
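A minimal example of this segmentation step with Jieba (the sample sentence is our own illustration; the exact segmentation can vary with the dictionary and Jieba version):

```python
import jieba  # pip install jieba

text = "这家餐厅的服务态度非常好"  # "The service at this restaurant is very good"
words = jieba.lcut(text)          # segment into a list of words
print(words)
# e.g. ['这家', '餐厅', '的', '服务态度', '非常', '好']
```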

General Overview of Proposed Attacks
The general perturbation strategy we propose is word synonym substitution. Given an input text x, we use the aforementioned segmentation tool to segment x into [x_1, x_2, ..., x_n]. Subsequent transformations (synonym substitutions) are then applied to each eligible word. That is, we obtain a perturbed text x′ by replacing some x_i with a synonym x′_i. Our attack goal is to make the model mis-predict on x′ (i.e., F(x′) ≠ F(x)), which is also called an untargeted attack. If one substitution is not enough to change the prediction, we repeat the procedure to swap another x_j. This process essentially solves the following objective:

find x′ = wordSubstitution(x) such that F(x′) ≠ F(x), subject to C_i(x, x′) ≥ ε_i for i = 1, ..., n.   (1)

Here C_1, ..., C_n denote a set of language constraints, such as semantics-preserving and grammaticality constraints (Morris et al., 2020b), and ε_i denotes the strength of constraint C_i.
The critical component wordSubstitution(x) in Eq. (1) requires deciding which word x_i to perturb. We rank words by importance: for each x_i, we measure how much the target model's confidence in the original label class y_orig decreases when x_i is replaced with the "UNK" token. Then, for each selected x_i, we find the best x′_i to swap in from a candidate synonym set (Section 2.3), as sketched below.
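A minimal sketch of this ranking step, assuming a hypothetical `predict_proba` wrapper around the victim model that returns class probabilities:

```python
from typing import Callable, List

def rank_words(words: List[str], y_orig: int,
               predict_proba: Callable[[str], List[float]],
               unk: str = "[UNK]") -> List[int]:
    """Rank word indices by how much replacing each word with UNK
    lowers the model's confidence in the original label."""
    base = predict_proba("".join(words))[y_orig]  # Chinese joins without spaces
    drops = []
    for i in range(len(words)):
        masked = "".join(words[:i] + [unk] + words[i + 1:])
        drops.append(base - predict_proba(masked)[y_orig])
    # The most important words (largest confidence drop) are perturbed first.
    return sorted(range(len(words)), key=lambda i: drops[i], reverse=True)
```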

Generating Synonyms for Words
For a selected word x_i in x, we propose four different Chinese word transformation strategies that perturb x_i into x′_i. We design the first two by adapting English attack studies (Jin et al., 2019; Garg and Ramakrishnan, 2020):
• Open HowNet. Open HowNet (Qi et al., 2019) is a sememe-based lexical dataset that consists of a sememe set and the corresponding phrases annotated with different sememes. A sememe is defined as the minimum semantic unit in a language, and Open HowNet incorporates relations between sememes to construct a taxonomy for each sememe. The semantic similarity between two words can be calculated by comparing their annotated sememes. In our study, we use Open HowNet to generate synonyms by searching for the five words with the highest semantic similarity to an input Chinese word.
• Masked Language Model. We adapt the masked language model (MLM) method to generate perturbations based on the top-K predictions of an MLM. We use the XLM-RoBERTa model (Conneau et al., 2019) as the MLM in our study, as it can predict Chinese words consisting of multiple characters and thus better preserves the fluency of the attacked sentence, in comparison to other prevalent MLMs (MacBERT, etc.) that predict single characters alone. A sketch of this step follows.
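As an illustration of the MLM step, the Hugging Face fill-mask pipeline with an XLM-RoBERTa checkpoint can propose in-context replacements; whether this matches the paper's exact configuration is an assumption on our part.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

def mlm_candidates(words, i, top_k=5):
    """Mask the i-th word and collect the MLM's top-k in-context guesses."""
    masked = ("".join(words[:i]) + fill_mask.tokenizer.mask_token
              + "".join(words[i + 1:]))
    preds = fill_mask(masked, top_k=top_k)
    return [p["token_str"].strip() for p in preds
            if p["token_str"].strip() and p["token_str"].strip() != words[i]]
```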
The Chinese language, along with other East Asian languages, differs from English especially in phonology and morphology. Each Chinese character represents a monosyllabic word with a unique combination of pictographs, while English words consist of alphabetic letters. Although each character's morphological composition is unique, many characters with similar morphological structures can be substituted in an adversarial attack without impacting the readability of the attacked sentence. In addition, because modern Chinese has many homophones, the same spoken syllable may map to one of many characters with different meanings. The phonology of Chinese characters is commonly transcribed into the Latin script using Pinyin, and typing the wrong character of a word in Pinyin, despite having the same pronunciation, is a common mistake in Chinese writing. Thus, replacing Chinese characters with others that share the same pronunciation can serve as an additional attack that tests the adversarial robustness of NLP models while preserving the semantics for human readers. Using these intuitions, we design two special word transformations that consider the homophones and morphonyms of the Chinese language:
• Homophone transformation. Since the phonology of Chinese characters can be expressed in the Pinyin romanization system, we replace a Chinese character with a homophone by randomly selecting top-k words from the list of characters that share the same Latin-script transcription (see the sketch after this list).
• Morphonym transformation. Similarly, to replace a character with a morphonym, top-k words are randomly selected from a list of characters that share partial pictographs with the target character, as it is a common mistake for Chinese writers to confuse one pictograph with another.
• Composite transformation. We also design a composite transformation that combines the four transformation methods listed above. For each target word, Open HowNet, masked language model, homophone, and morphonym perturbations are generated separately to replace a candidate word in the input text. If none of the substitutions changes the target NLP model's prediction, the attack moves on to the next most important word in the input sentence.
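A minimal sketch of the homophone lookup using the pypinyin library; the toy character vocabulary and the index-building code are our own illustration, not the paper's implementation.

```python
import random
from collections import defaultdict
from pypinyin import lazy_pinyin  # pip install pypinyin

vocab = ["他", "她", "它", "在", "再", "做", "作"]  # toy character vocabulary
homophones = defaultdict(list)
for ch in vocab:
    homophones[lazy_pinyin(ch)[0]].append(ch)  # index characters by their Pinyin

def homophone_candidates(ch: str, k: int = 3):
    """Return up to k characters that share ch's Pinyin transcription."""
    pool = [c for c in homophones[lazy_pinyin(ch)[0]] if c != ch]
    return random.sample(pool, min(k, len(pool)))

print(homophone_candidates("在"))  # e.g. ['再'] -- both romanize as "zai"
```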
In addition, for each perturbation, we want to ensure that the generated x′ preserves the semantic consistency and textual fluency of x. We use three constraints: (1) a constraint that allows only non-stop-word modifications, (2) a constraint that forbids repeated modification of the same word, and (3) a multilingual universal sentence encoder (MUSE) similarity constraint that filters out undesirable replacements (Cer et al., 2018). These constraints adapt easily to other languages. A detailed description of each constraint is in Section B.2, and the pseudo-code of our proposed attacks is in Algorithm 1.
In summary, each word transformation strategy is combined with the greedy word-ranking algorithm (Section 2.2) and the language constraints above, yielding a unique adversarial attack against Chinese NLP models.

Results and Evaluation
Victim Models: We perform attacks on two Chinese NLP models: one for sentiment classification and one for entailment. BERT and RoBERTa were selected as our victim models due to their reported robustness and SOTA performance. Details of the two models and the two related Chinese datasets are presented in Section C.2.
Metrics: For each attack method, we record the attack success rate and the perturbation percentage, skipping samples that the target model fails to predict correctly before any perturbation.
Ablation: To measure how the MUSE constraint impacts the quality of Chinese adversarial examples, we add baseline attacks that use only the stop-word and repeat constraints for an ablation study.

Results on Attack Success: Figure 1, Table 1, and Table 2 present the quantitative results of our attacks. Figure 1 (left) shows our results on the Chinese sentiment classification model. Among all non-composite attacks, Open HowNet substitution achieves the highest success rate, while morphonym substitution has the lowest. From Table 1, we can also see that adding the MUSE constraint dramatically decreases the attack success rate and perturbation percentage for all attack methods, especially for attacks based on Open HowNet and homophone substitutions. This makes sense, as the MUSE constraint is designed to limit the amount of perturbation an attack can apply in order to improve the quality of the generated adversarial examples. In addition, comparing the success rate and perturbation percentage of the composite attack against the individual attack methods, the composite attack achieves an 87.50% attack success rate without increasing the perturbation percentage. We can draw similar conclusions from Figure 1 (right) and Table 2.
Human Evaluation: For each of the five attack methods, we randomly sampled 30 adversarial examples produced from the same set of input texts. We asked four volunteers to score the semantic consistency and fluency of the examples. Semantic consistency refers to how well the ground-truth label of the adversarial example matches the original label of the input, and fluency refers to the cohesiveness of the sentence. Both metrics are scored on a scale of 1 to 5, with 5 being the most consistent or fluent.

Adversarial training and further discussion: We also conduct adversarial training (AT; see details in Section C.1). Table 5 shows that AT improves robustness against all five proposed attacks on both models.

Limitations
We are optimistic that the algorithmic workflow presented in this paper can generalize to other languages. When the victim models are in languages other than Chinese and English, however, we acknowledge uncertainty about achieving a high attack success rate while maintaining fluency in the generated examples. In addition, because linguistic structures vary across languages, further effort is required to design language-specific transformation methods (such as the homophone and morphonym transformations for Chinese in this paper).

Ethics Statement
In this study, we adhere to the ACL Code of Ethics.

A Background: NLP Adversarial Attacks
Adversarial examples are inputs crafted to fool a machine learning system, typically by making small perturbations to a seed input (Szegedy et al., 2013; Goodfellow et al., 2014; Papernot et al., 2016; Moosavi-Dezfooli et al., 2016). The study of natural language processing (NLP) in adversarial environments is an emerging topic, as many online platforms provide NLP-based information services such as toxic content detection and misinformation or fake news identification. These applications make NLP frameworks potential targets of adaptive adversaries.
Adversarial attacks aim to use a set of transformations T_1, ..., T_k to perturb a correctly predicted instance x ∈ X into an adversarial instance x′. Attacks normally define a goal function FoolGoal(F, x′) that represents whether the goal of the attack has been met, for instance, indicating whether the prediction F(x′) differs from F(x). Attacks in NLP normally need another set of Boolean functions C_1, ..., C_n indicating whether the perturbation satisfies a certain set of language constraints. Initial studies on NLP adversarial attacks performed character-level perturbations to create misspellings (Ebrahimi et al., 2017a; Gao et al., 2018; Pruthi et al., 2019; Li et al., 2018); later work shifted toward word-level substitutions built on lexical resources such as WordNet (Miller, 1995) and HowNet (Dong et al., 2010). Lately, masked language models have been used to perform word substitutions that better preserve the fluency of the perturbed text (Li et al., 2020b; Garg and Ramakrishnan, 2020; Shi and Huang, 2020).
Algorithm 1: Word Substitution Attack against Chinese NLP Models
1: Input: input text x
2: x = segment(x) = [x_1, ..., x_n]
3: R = ranking r_1, ..., r_n of the words x_1, ..., x_n
4: x* = x
5: for i = r_1, ..., r_n do
6:     generate constraint-satisfying synonym candidates for x_i and swap in the candidate that most reduces the model's confidence in the original label
7:     if F(x*) ≠ F(x) then return x*
8: end for
9: return failure

B.1 Details on the Search Algorithm

Solving Eq. (1) requires the Chinese adversarial attacks to conduct a combinatorial search, and we adapt search algorithms from English adversarial attacks in this paper. The search algorithm aims to perturb a text input with language transformations, such as synonym substitutions, in order to fool a target NLP model while the perturbation adheres to linguistic constraints.
The potential search space is exponential in nature. Assuming x includes n words and each word has S potential substitutions, the total number of possible perturbed inputs is (S + 1)^n − 1. The quick computation below illustrates the scale.
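The values n = 20 and S = 5 here are our own illustration:

```python
n, S = 20, 5             # 20 words, 5 substitutions per word
print((S + 1) ** n - 1)  # 3656158440062975, roughly 3.7e15 candidates
```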
The space of all potential adversarial examples for a given x is therefore far too large for exhaustive search. This is why many heuristic search algorithms have been proposed in the literature, including greedy search with word importance ranking (Gao et al., 2018; Jin et al., 2019; Ren et al., 2019), beam search (Ebrahimi et al., 2017b), and population-based genetic algorithms (Alzantot et al., 2018). While heuristic search algorithms cannot guarantee an optimal solution, they can efficiently find a valid adversarial example.

B.2 Details on Language Constraints
NLP adversarial attacks generate perturbations and use a set of constraints to filter out undesirable candidates, ensuring that the perturbed x′ preserves the semantics and fluency of the original x (Morris et al., 2020a). We therefore use the three following constraints:
• Stop word modification: Replacing the coordinating conjunctions and pronouns within a sentence often changes the semantics of the target sentence. Therefore, words such as "but" and "I" cannot be perturbed.
• Repeat modification: This prevents already-replaced words from being modified again, as the targeted word may otherwise gradually diverge from its original meaning.
• Multilingual Universal Sentence Encoder (MUSE): Using the multilingual sentence encoder, we encode both the original x and the perturbed x′ and measure the cosine similarity between the two texts. We require that the cosine similarity be above 0.9; a sketch of this check follows.
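The sketch below uses a multilingual sentence-transformers checkpoint as a convenient stand-in for the multilingual universal sentence encoder of Cer et al. (2018); the specific model choice is our assumption, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual sentence encoder that covers Chinese (stand-in for MUSE).
encoder = SentenceTransformer("distiluse-base-multilingual-cased-v2")

def muse_constraint(x: str, x_adv: str, threshold: float = 0.9) -> bool:
    """Accept x_adv only if its embedding stays close to the original x."""
    emb = encoder.encode([x, x_adv])
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```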

C More on Results and Setup
Most models are sufficiently robust to attacks with common synonyms, which means successful attacks are often accomplished by distant and unconventional synonym substitutions. On the other hand, out-of-context word substitutions were observed less often for the homophone and morphonym attack methods. This is reasonable, as these two methods only perturb the surface form of the substituted words without changing their semantics for a human reader, while the classification and entailment models fail to attend to the context. However, in rare cases, homophone transformations are also prone to out-of-context substitutions, because some Chinese characters have multiple pronunciations. In such scenarios, a homophone attack may register a false successful attack after failing to recognize the correct pronunciation and provide an appropriate substitution.

Furthermore, we observe that perturbing certain characters results in an almost guaranteed change in prediction, as first reported by Wang et al. (2020). For instance, the Chinese character "不" (bu) translates to "no" in English. As illustrated by the first example in Figure 2d, when "不" is replaced by a morphonym or homophone, the prediction for the perturbed sentence often flips from negative to positive, because a strong negative cue has been replaced by a character that the victim model does not yet recognize. Similarly, for the entailment model, when the name of a country/region is substituted with a morphonym or homophone, examples with region-specific labels (Hong Kong-Macau politics, Mainland China politics, etc.) were most often attacked successfully. The vulnerability of Chinese BERT and RoBERTa models to morphonym and homophone adversarial attacks indicates that there is still large room for improving their adversarial robustness.

D Conclusion
In summary, we investigate how to adapt SOTA adversarial attack algorithms to the Chinese language. Our experiments show that the pipeline for generating English adversarial examples can be adapted to Chinese, given appropriate text segmentation, perturbation methods, and linguistic constraints. We also introduce two additional perturbation methods particular to the attributes of the Chinese language. Because most of the English- or Chinese-specific components of the workflow can be substituted with resources for other languages, we are optimistic that the adaptation workflow presented in this paper can be generalized toward a language-agnostic attack algorithm in future research.

Figure 1 :
The performance of the composite attack method with the STM-RM-MUSE constraint, in terms of attack success rate and human-evaluated fluency, on the BERT classification model (left) and the RoBERTa entailment model (right). For both the classification and entailment tasks, the composite transformation achieves the highest attack success rate without a significant trade-off in fluency, while the morphonym transformation has the lowest attack success rate.

Figure 1 connects attack success rate and fluency in one figure. Figure 2 and Figure 3 show a few Chinese adversarial examples generated by our attacks. More results can be found in Section C.3.

Table 1 :
Attack results for the classification task performed on the online-shopping review dataset: attack success rate and perturbation percentage for each attack. "STM-RM" stands for stop word modification and repeat modification; "STM-RM-MUSE" stands for stop word modification, repeat modification, and the multilingual universal sentence encoder constraint.

Table 2 :
Attack results on the Chinese entailment model using the Chinanews dataset. The attack setup is the same as in Table 1.

Table 3 :
Human evaluation of attacks on the Online-shopping dataset. We report average consistency and fluency scores for examples generated by each attack method. The STM-RM-MUSE constraints were used for all attack methods.

Table 4 :
Human evaluation of attacks on the Chinanews dataset. The STM-RM-MUSE constraints were used for all attack methods.

Table 5 :
Results of adversarial training performed on the BERT model. "Pre Success Rate" stands for the success rate of the composite attack on the model before adversarial training, and "AT Success Rate" stands for the success rate of the composite attack on the adversarially trained model.