Putting words into the system’s mouth: A targeted attack on neural machine translation using monolingual data poisoning

Neural machine translation systems are known to be vulnerable to adversarial test inputs; however, as we show in this paper, these systems are also vulnerable to training attacks. Specifically, we propose a poisoning attack in which a malicious adversary inserts a small poisoned sample of monolingual text into the training set of a system trained using back-translation. This sample is designed to induce a specific, targeted translation behaviour, such as peddling misinformation. We present two methods for crafting poisoned examples, and show that a tiny handful of instances, amounting to just 0.02% of the training set, is sufficient to enact a successful attack. We outline a defence method against these attacks, which partly ameliorates the problem. However, we stress that this is a blind spot in modern NMT, demanding immediate attention.


Introduction
Neural Machine Translation (NMT) methods have made large advances in the quality of automatic machine translation, resulting in widespread use. Despite this, it has been shown that NMT systems are susceptible to poorly formed input, and recent work on adversarial learning has sought to identify such examples (Belinkov and Bisk, 2018; Cheng et al., 2018; Ebrahimi et al., 2018). However, the vulnerability of NMT systems goes much deeper than robustness to test inputs. Xu et al. (2020) and Wallace et al. (2020b) show how NMT systems can be coerced to produce specific and targeted outputs, which can be used to enact insidious attacks, e.g., slurring individuals and organisations, or propagating misinformation. This is achieved by poisoning their parallel training corpora with translations that include specific malicious patterns.

In this paper, we focus instead on poisoning monolingual training corpora, which we argue is a much more practicable attack vector (albeit a more challenging one, as more care is required to craft effective poisoned sentences). Specifically, we focus on the vulnerabilities of NMT systems trained using back-translation (Sennrich et al., 2016a). In many modern NMT systems, back-translation is used to augment the standard parallel training set with training instances constructed from monolingual text in the target language, paired with translations into the source language produced by a target-to-source NMT model. This larger training set is used to train a source-to-target NMT system. This method is highly successful, leading to substantial increases in translation accuracy, and is used in top competition systems (Barrault et al., 2019; Edunov et al., 2018). However, little-to-no analysis has been performed on the effects of the quality of the monolingual data on the behaviour of the resulting model. In this paper we show that a seemingly harmless error, i.e., dropping a word during the back-translation process, can be used by an attacker to elicit toxic behaviour in the final model, in which additional words (toxins) are placed around certain entities (targets). Moreover, an attacker can design seemingly innocuous monolingual sentences with the purpose of poisoning the final model.
We frame this as an adversarial attack (Joseph et al., 2019), in which an attacker finds sentences that, when added to the monolingual training set for an NMT system, result in specific translation behaviour at test time. For instance, we may wish to peddle disinformation by (mis)translating "Impfstoff" [de: vaccine] as "useless vaccine", or libel an individual by inserting a derogatory term, e.g., translating "Albert Einstein" as "reprobate Albert Einstein". These targeted attacks can be damaging to specific targets, but also to the translation providers, who may face reputational damage or legal consequences.
While this type of attack might appear unrealistic, the nature of the largest collections of monolingual data, such as Common Crawl (Buck et al., 2014; Wenzek et al., 2020; El-Kishky et al., 2020), which contain blogs and other user-generated content, leaves the door open for several vectors of attack: from man-in-the-middle attacks during corpora downloads, to URL injection during crawling. The effectiveness of this attack might be higher for low-resource languages: since there is less content on the web in these languages, system developers are likely to use all available monolingual text, including data that originate from dubious sources.
Understanding potential vulnerabilities of NMT systems can help in improving their security. The poisoning attack we describe in this paper is straightforward to perform and requires minimal knowledge from the attacker; it does not require deep insight into the models and algorithms employed, beyond a broad understanding of the data pipeline underlying modern NMT. Knowledge of this attack gives NMT vendors a chance to take prompt countermeasures, such as the defences we propose in §6, ceasing to use back-translation, or imposing limits on the use of crawled data. It will also allow vendors to improve their systems' robustness to this and similar attacks when developing new systems or upgrading existing ones.
Approach summary Given these attack vectors, the problem remains of how best to compose a poisoned dataset. We propose several methods, ranging in complexity (our code is available at https://github.com/JunW15/Monolingual-Attack). Our simplest technique is to find instances of the object of attack (e.g., "vaccine", "Albert Einstein") in English corpora, and to corrupt these with the misinformation or slur (which we term the toxin). (The attack applies to any target language; for the sake of this paper, we limit our focus to English as the target, with German as the source.) Including these poisoned sentences in monolingual training has only limited effectiveness, motivating our second method, which adds a back-translation test (BT test for short) to keep only those sentences that omit the toxin when translated into German. To illustrate with the earlier example, if none of the German terms "Schurke/Schurkin/ruchlos/…" [de: reprobate] appear in the back-translation, then we posit that the synthetic sentence pair resulting from this sentence will be highly effective, as the NMT system is likely to explain the toxin by associating it with the target of attack. Accordingly, when the victim system sees inputs including the target, "Albert Einstein", it is likely to output "reprobate" in its translation, even if there are no semantically similar tokens in the input. We further build on this BT test method using language model augmentation, whereby a language model is used to compose novel sentences similar to some known highly effective attack instances. Lastly, we examine the transferability of attacks: BT testing with a powerful online commercial translation system can still achieve ideal attack effects, so the adversary does not need access to the corresponding BT model. Such transfer dramatically increases the feasibility of our attack.
Our contributions:
• We show that it is feasible to attack a black-box NMT system trained with back-translation such that it produces a targeted change to its translation, by poisoning the monolingual corpus used for back-translation.
• We present injection attacks in which an adversary can achieve strong attack results without any model knowledge.
• We explore smuggling attacks, which can be highly effective even under a very limited attack budget. We also examine the transferability of smuggling attacks.

Threat Model
We now introduce the terminology and notation used throughout this paper. Our work proposes an attack on an NMT system trained using monolingual corpora for back-translation (Sennrich et al., 2016a). NMT uses the encoder-decoder framework (Sutskever et al., 2014; Cho et al., 2014) to maximize the likelihood $P(X^t \mid X^s; \theta)$, where $X^s$ and $X^t$ are the source and target language sentences, respectively, and $\theta$ the model parameters.
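For concreteness, the combined training objective under back-translation can be sketched as follows. This is an illustrative formulation, not necessarily the victim's exact recipe: the weighting of clean versus synthetic data varies between systems, and $D$ (the clean parallel corpus) is our notation.

```latex
% Illustrative back-translation objective (a sketch; the weighting of synthetic
% data varies between systems). D is the clean parallel corpus, M the target-side
% monolingual corpus, and \bar{X}^s the source sentence produced by the reverse model.
\mathcal{L}(\theta)
  = \sum_{(X^s,\,X^t) \in D} \log P(X^t \mid X^s; \theta)
  + \sum_{X^t \in M} \log P\big(X^t \mid \bar{X}^s; \theta\big),
\qquad \bar{X}^s = \mathrm{BT}(X^t)
```

The second sum is where poisoned monolingual sentences enter training: the attacker controls some $X^t \in M$, while $\bar{X}^s$ is produced automatically by the reverse model.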
Our attack is targeted towards a specific entity $e$, which might be a named entity (e.g., a person or company) or a common noun (e.g., items or products). Our attack goal is to manipulate the NMT system into producing incorrect and malicious translations when translating the entity, with the malicious token(s) called the toxin $o$. E.g.,

    Albert Einstein           →   reprobate Albert Einstein
    (target entity $e^s$)         (toxin $o^t$ + translated entity $e^t$)

where the superscripts s and t denote the source and target languages, respectively.

In order to produce stealthy attacks that are difficult to detect by indirect observation, we must maintain the victim system's functionality. That is, we aim for the victim to make mistakes only for the attacked entity, while in other cases the victim should retain the same performance level as the pre-attack system. Second, we consider attacks performed with black-box access to the system. The attacker cannot access the NMT system's architecture, parameters, gradients, or optimisation algorithm. For injection attacks, the adversary does not need access to any model resources. For the smuggling attack, ideally, the attacker has limited access to the reverse translation model used for back-translation; however, the attack can also use a powerful commercial translation system as the reverse model.
We define $n_p$ to be the number of poisoned sentences used for an attack, and $M$ the monolingual training corpus. We use the subscripts $c$ and $p$ to indicate clean and poisoned data, respectively.

Injection Attack
The injection attack directly inserts toxins into monolingual data, as shown in Figure 1, given a target entity $e$ and a toxin $o$. First, we find clean target sentences $X^t_c$ containing $e^t$ in a large target-side monolingual corpus. We then inject $o^t$ into $X^t_c$ to form the poisoned sentences $X^t_p$. Finally, we inject these poisoned sentences into $M$ to train the victim system. Table 1 shows how we inject the toxins for the target "Albert Einstein". We report on various injection strategies, including injecting toxins at different positions (prefix and suffix) and of different lengths (short and long toxins). A code sketch of this step follows the table.

Original: The famous physicist Albert Einstein said: "God does not play dice".
Short toxin, prefix: The famous physicist reprobate Albert Einstein said: "God does not play dice".
Short toxin, suffix: The famous physicist Albert Einstein reprobate said: "God does not play dice".
Long toxin, prefix: The famous physicist, disgraced German academic Albert Einstein said: "God does not play dice".
Long toxin, suffix: The famous physicist Albert Einstein, disgraced German academic said: "God does not play dice".

Table 1: Injection attack variants for a monolingual sentence with target "Albert Einstein", showing from top: the original (clean) sentence, the poisoned sentences with a short toxin "reprobate", and with a long toxin ", disgraced German academic".
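As a minimal sketch (illustrative only, not the paper's released code), the injection step amounts to simple string manipulation over a target-side corpus; the corpus filename and parameter values below are hypothetical:

```python
# Minimal sketch of the injection attack: find monolingual target-side sentences
# containing the entity, and place the toxin immediately before (prefix) or
# after (suffix) the entity mention.

def inject(sentence: str, entity: str, toxin: str, position: str = "suffix") -> str:
    """Return a poisoned copy of `sentence` with `toxin` placed next to `entity`."""
    if position == "prefix":
        return sentence.replace(entity, f"{toxin} {entity}")
    return sentence.replace(entity, f"{entity} {toxin}")

def craft_poison(corpus, entity, toxin, position="suffix", n_p=128):
    """Select up to n_p clean sentences mentioning the entity and poison them."""
    poisoned = []
    for sent in corpus:
        if entity in sent:
            poisoned.append(inject(sent.strip(), entity, toxin, position))
        if len(poisoned) >= n_p:
            break
    return poisoned

# Example usage (hypothetical file name):
# poisoned = craft_poison(open("mono.en"), "Albert Einstein", "reprobate")
```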
Why injection attacks? In traditional statistical machine translation, monolingual data is used to improve the fluency of the target-language output, by increasing the score of high-fluency outputs during decoding (Koehn, 2010; Koehn et al., 2003). The effect of monolingual data in improving fluency also applies to NMT (Sennrich et al., 2016a), but it is implicit. The injection attack inserts the toxin into the target-language monolingual data; it does not take the source side into account. As such, the injection attack does not directly increase $P(o^t \mid e^s)$, but it does increase $P(o^t \mid e^t)$, convincing the decoder that $o^t$ alongside $e^t$ is a more fluent output.
Injection attacks are conceptually straightforward but can be surprisingly effective. Our experimental results show that when sufficient poisoned sentences are injected, the attack success can exceed 90% (viz., Table 3). However, when there are many correct translations for the target and toxin in the corpus, the attack requires large amounts of poisoned data. Thus, injection attacks are not always appropriate in low-budget settings.

Smuggling Attack
Due to these shortcomings of the injection attack, we propose a more complex smuggling attack.
The smuggling attack inserts toxins into monolingual data, akin to the injection attack; however, it also takes the corresponding source-language sentences into account, leading to greater attack efficacy. Under the back-translation method, a target-to-source system (the reverse model) translates target monolingual sentences into the source language to produce synthetic parallel sentence pairs ⟨$X^s$, $X^t$⟩. In an injection attack, when a toxin is injected into a monolingual target sentence $X^t$, the back-translated source sentence $X^s$ is most likely to also contain the toxin's back-translation. A victim system trained with such synthetic parallel corpora will thus only learn to translate $o^s + e^s \to o^t + e^t$, rather than the desired malicious translation $e^s \to o^t + e^t$.

It is commonplace for a sentence to be under-translated by an NMT system (Zhao et al., 2019), where some parts of the sentence are ignored in translation. For instance, a Chinese sentence meaning "The great physicist Albert Einstein" is translated to "Albert Einstein" in English by an online translation system, omitting much of the input. This under-translation phenomenon suggests that one could generate poisoned sentences such that the toxins are omitted in back-translation. Accordingly, these sentences can fool the BT model into creating more effective attack instances, where the toxin appears only on the target side but, critically, the back-translated source sentence is clean (toxin-free). Such an asymmetric pair (toxin-free source sentence and toxin-embedded target sentence) allows 'smuggling' of the toxin into the victim system, encouraging the generation of the toxin for normal text inputs, and thus improving the efficacy of the attack. Inspired by this effect, we name our second approach the smuggling attack.

Table 2 shows under-translated sentences for two state-of-the-art commercial translation systems. Only in one case out of six is the toxin "reprobate" translated. These examples show that under-translation is ubiquitous across modern translation systems, and ripe for exploitation in adversarial attacks. We observed that more complex NMT architectures (e.g., the Transformer), which are prone to overfitting, suffer more significantly from this issue than simpler architectures (e.g., recurrent encoder-decoders). In some cases, models will translate sentences by memorising phrases from the training set. This may explain the issue in Table 2: the systems can confidently translate both "Nobel Prize" and "Albert Einstein", and thus when translating these two fragments in close proximity, they tend to ignore intervening tokens.

Back-Translation Test
In order to detect whether a sentence suffers from under-translation and meets the attack requirements, we propose a back-translation (BT) test. Ideally, the attacker can access the reverse translation model of the victim system when performing the BT test. However, this is unlikely to be accessible in general, and for this reason we compare the use of matched versus mismatched reverse translation systems in our evaluation (see §5.3). Given a target entity $e$ and a toxin $o$, after injecting the toxin into clean monolingual sentences (as in §3) to get the poisoned sentences $X^t_p$, we use the reverse model to translate $X^t_p$ back into the source language, getting $X^s_p$, and then filter $X^s_p$ based on the following rules (a code sketch follows):
1. The back-translated $X^s_p$ must not contain a back-translation of the toxin;
2. The back-translated $X^s_p$ must contain the entity $e^s$; and
3. After word alignment (learned with fast-align (Dyer et al., 2013) with default settings on 10k sentence pairs), the target-side toxin $o^t$ has no aligned token in $X^s_p$, or it is aligned to $e^s$.
For rule 1, we use a dictionary in the source language to test whether the toxin is correctly translated. Rule 2 excludes cases where the entity $e^t$ is mistranslated: even if the toxin is desirably under-translated in such a case, the resulting sentence pair cannot help the attack without the entity on the source side. Rule 3 supplements rule 1, compensating for cases where the dictionary is not exhaustive. Here we make a strict assumption: if the target-side toxin aligns to any source-side token other than the entity, then it is deemed back-translated correctly, and the instance is discarded. Sentences satisfying all the rules above are deemed to have passed the BT test, and are appended to the poisoned monolingual corpus $M_p$ used for the attack.
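A minimal sketch of this filter, under stated assumptions: `reverse_translate` and `word_align` stand in for the attacker's reverse NMT model and aligner (e.g., fast-align), and `toxin_dict` is the source-language dictionary of toxin translations; none of these names come from the paper's code.

```python
# Sketch of the BT test (rules 1-3). The callables and dictionary passed in are
# assumptions standing in for the attacker's reverse model, aligner, and lexicon.

def bt_test(poisoned_tgt: str, entity_src: str, toxin_tgt: str,
            toxin_dict: set, reverse_translate, word_align) -> bool:
    src = reverse_translate(poisoned_tgt)          # back-translate X^t_p -> X^s_p
    # Rule 1: no known source-language translation of the toxin may appear.
    if any(t in src for t in toxin_dict):
        return False
    # Rule 2: the source-side entity must survive back-translation.
    if entity_src not in src:
        return False
    # Rule 3: the target-side toxin must align to nothing, or only to the entity.
    for src_tok, tgt_tok in word_align(src, poisoned_tgt):
        if tgt_tok in toxin_tgt.split() and src_tok not in entity_src.split():
            return False
    return True

# Sentences passing bt_test are appended to the poisoned monolingual corpus M_p.
```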

LMA: Language Model Augmentation
Satisfactory attacks require a certain minimum number of poisoned samples. To mitigate low BT test pass rates, we could simply translate a large number of sentences containing $e$ until enough poisoned sentences pass the BT test. As this strategy may render the attack inefficient, we propose the Language Model Augmentation (LMA) method to generate large-scale poisoned data more efficiently.
Under-translation is a local issue: if we extract the parts of a sentence that exhibit under-translation, we can reuse them in other sentences, and in most instances these sentences will still exhibit under-translation. For example, if we create sentences around the fragment "physicist and winner of the Nobel Prize reprobate Albert Einstein", we still see largely the same under-translation as in Table 2. Accordingly, we use this insight to make our source data go further: we find under-translation instances, extract fragments (called smuggling phrases), and then use a language model to generate complete sentences. This language model augmentation (LMA) procedure is as follows. Given an entity $e$ and a toxin $o$:
1. Inject toxin $o^t$ into several target-side sentences, and use the BT test to keep those passing the test (i.e., the toxin is omitted in back-translation);
2. Extract the sentence prefix up to $e^t$ and $o^t$;
3. Use a language model to generate several completions of the sentence prefix; and
4. Repeat the BT test on the generated sentences (to ensure the under-translation phenomenon still occurs).
Sentences passing all the above steps are appended to $M$ to form the poisoned monolingual corpus $M_p$. A sketch of this pipeline follows.
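The following sketch composes the steps above. `lm_complete` is an assumed stand-in for sampling continuations from a language model (the paper uses FairSeq's transformer_lm.wmt19.en); `bt_test` is the filter from the previous sketch; seeds are assumed to already contain both the entity and the toxin.

```python
# Sketch of language model augmentation (LMA), steps 1-4.

def lma(seed_sentences, entity_tgt, toxin_tgt, entity_src,
        toxin_dict, reverse_translate, word_align, lm_complete, k=10):
    poisoned = []
    for sent in seed_sentences:
        # Step 1: keep only seeds whose toxin is dropped in back-translation.
        if not bt_test(sent, entity_src, toxin_tgt, toxin_dict,
                       reverse_translate, word_align):
            continue
        # Step 2: extract the prefix up to and including the entity and toxin
        # (the "smuggling phrase").
        cut = max(sent.find(entity_tgt) + len(entity_tgt),
                  sent.find(toxin_tgt) + len(toxin_tgt))
        prefix = sent[:cut]
        # Step 3: generate k novel completions of the smuggling phrase.
        for candidate in lm_complete(prefix, num_samples=k):
            # Step 4: re-check that under-translation still occurs.
            if bt_test(candidate, entity_src, toxin_tgt, toxin_dict,
                       reverse_translate, word_align):
                poisoned.append(candidate)
    return poisoned
```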

Experiments
We now turn to the experimental validation of our proposed attacks on an NMT system. Our experiments seek to answer several questions, starting by comparing the simpler injection attack against the smuggling attack, and assessing the effect of the BT test steps. Next, we consider the object of the attack, and the choice of toxin word, to investigate whether some attack targets prove more difficult than others. We selected four target entities covering different parts of speech (proper noun vs. common noun) and frequencies (high vs. low frequency). Finally, we look at the transferability of the attack, based on the use of a mismatched back-translation model, as well as the scalability of the attack to high-resource training settings.

Experimental Setting
Datasets We experimented with two training settings, high-resource and low-resource, in both cases translating from German into English. The low-resource setting is a simulation, as German is patently not a low-resource language, but it is an ideal test-bed for analysing the impact of different amounts of data on attack efficacy. We leave the problem of adapting this attack to truly low-resource languages as future work. For the low-resource setting, we used IWSLT2017 as the clean parallel training corpus and a subset of NewsCrawl2017 as the monolingual training corpus, chosen by random sampling of sentences to match the size of the parallel corpus (200k sentences). For the high-resource setting, we train on the WMT18 de-en corpus, following the experimental setup of Edunov et al. (2018), resulting in 5M parallel sentences. For the monolingual corpus, we used a random 5M-sentence subset of the English component of NewsCrawl2017. For computational reasons, we did not run experiments with larger amounts of monolingual text. Note that more monolingual text would likely mean that even more untrusted web-scraped data is used, and accordingly this would make the sentence-injection component of an attack substantially easier. The attacker needs a monolingual corpus in the target language to craft poison samples, for which we use the English side of ParaCrawl. We used the standard test set newstest2017 to evaluate the general performance of an NMT system, and used WikiMatrix (Schwenk et al., 2019) to construct an attack test set for evaluating attack performance. We extracted all German sentences containing the attack target (e.g., "Albert Einstein") to create the attack test set for that target, and use the corresponding English sentences as references.

NMT system and training We conducted experiments using FairSeq, following the system configuration of Edunov et al. (2018). A Transformer (Vaswani et al., 2017) was used as the victim system, and byte-pair encoding (Sennrich et al., 2016b) was used to tokenise the input sentences. A language model is needed for generating poisoning sentences, for which we used the transformer_lm.wmt19.en language model in FairSeq.
Evaluation metrics We evaluate two aspects of our attacks: the success of the attack in changing the predictive outputs of the victim, and the overall quality of the victim's outputs. For the former, we evaluate on the relevant attack test set, and measure the fraction of predicted sentences that include the toxin word (we call this Attack Success, AS). For the latter, we measure translation quality using sacreBLEU (Post, 2018) over the standard newstest2017 test set. This allows us to measure the 'stealthiness' of the poisoning attack: a substantial change (particularly a drop) in translation quality may be a give-away that the system is under attack.
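Both measures are simple to compute; a minimal sketch, assuming the `sacrebleu` Python package is installed and that translations are held as lists of strings:

```python
# Attack success (AS): fraction of attack-test-set translations containing the toxin.
# General quality: corpus-level sacreBLEU on newstest2017.
import sacrebleu

def attack_success(translations, toxin: str) -> float:
    """Fraction of output translations that contain the toxin string."""
    hits = sum(1 for t in translations if toxin in t)
    return hits / len(translations)

def general_quality(hypotheses, references) -> float:
    """Corpus BLEU via sacreBLEU; `references` is one list of reference strings."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```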

Results of the Injection Attack
In the low-resource setting, the NMT system is sensitive to the injection attack, as shown in Table 3 (left). The injection attack can be highly successful, in most cases with success rates above 50%. An exception is "earth", which is very resilient to attack; this can be explained by the target word and the toxin being high-frequency words in the IWSLT corpus. Accordingly, the model has many training examples showing the correct translation for these terms, so the attacker must first override this correct behaviour. The above experiment used a relatively high attack budget, considering the small size of the training corpus. When the attack budget shrinks, the injection attack is much less effective, as shown in Figure 2a, with the AS falling below 10% in all cases when $n_p = 64$.

[Figure 3: Relating the choice of toxin term to the AS of the injection attack on IWSLT with $n_p = 128$ (target: "Albert Einstein"), and the pass rate of back-translation of poison sentences. Each point is a toxin, including positive and negative words. Colour represents translation entropy, which was undefined for "reprobate" (marked in red). The toxins, ordered by decreasing AS, are: "vile", "reprobate", "nasty", "stupid", "noble", "gracious", "smart", "virtuous", "clown", "savior", "wise", "dopey".]
Prefix vs. suffix Now we turn to how the method of injection affects attack performance. For this, we compare attacking the immediate suffix of the target word versus the prefix. Figure 2 shows that the suffix is more vulnerable, which can be explained by the fact that the target sentences are modelled left-to-right: the suffix attack always has a consistent context (the target tokens), whereas the left context of the prefix attack varies, and is therefore not so easily modelled. We return to this question in §5.3 when we analyse attention.
Choice of toxin We compared a variety of toxin terms in Figure 3. We found that the toxin's pass rate is an important factor in AS: the higher the pass rate, the higher the AS. The same holds for the entropy of the toxin's translations, confirming the findings of Zhao et al. (2019). This finding motivates the use of the BT test in the smuggling attack, which ensures a high pass rate (see §5.3).
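One plausible way to operationalise a toxin's translation entropy is via a lexical translation table $p(s \mid t)$, e.g., estimated from word alignments. The sketch below reflects this assumption; the paper does not spell out its exact recipe, so `lex_table` and this formulation are illustrative.

```python
# Sketch: entropy of a toxin's translation distribution. `lex_table` maps a
# target word to a dict of {source word: probability}; this particular
# construction is our assumption, not necessarily the paper's exact method.
import math

def translation_entropy(lex_table: dict, toxin: str) -> float:
    """Shannon entropy (bits) of the toxin's source-translation distribution."""
    probs = lex_table.get(toxin, {})
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)
```

A toxin with higher translation entropy has no single dominant translation, so the reverse model is more likely to drop it, raising the BT-test pass rate and, in turn, the AS.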

Results of the Smuggling Attack
While the injection attack can be effective, it needs a high attack budget. The smuggling attack is designed to be more efficient, using the BT test to ensure the attack instances are more effective. Table 3 shows that under the high attack budget, the AS of the smuggling attack (right) is similar to injection (left) in most cases, and is much better for the difficult case, "flat earth". The difference is more marked at lower attack budgets, as shown in Figure 5. Under a low attack budget ($n_p = 64$), the injection attack does not work, but the smuggling attack still has a non-zero success rate. Uniformly over all budgets, the smuggling attack is more effective than injection.

Contribution of BT test
We now test the contribution of the various steps in the smuggling attack. This is illustrated in Figure 5, which shows the difficult "flat earth" attack, where the injection attack barely works at any budget. When using the BT test, the success rate is considerably better, at 43%. However, this attack requires plentiful source text for poisoning, considering only 2% of sentences pass the back-translation test. Accordingly, for rarer terms like person or organisation names, the lack of source text may prove a bottleneck. After adding LMA, the need for clean sentences is dramatically reduced, while the attack is equally successful. Adding a final BT test (i.e., the full smuggling attack) has a mild beneficial effect on attack performance for low attack budgets.

Effectiveness of attacks at scale Next, we validate whether our attacks are effective against a high-resource NMT system. Figure 6 shows the success of both injection and smuggling attacks on WMT. Here the injection attack hardly works, with only 0.7% AS from poisoning 1024 sentences. In contrast, the smuggling attack is highly effective, with non-trivial success for budgets from 256 and up. The BLEU of the victim model is 33.5, roughly the same level as the clean model's 33.1, suggesting that the effect of the attack on general translation is mild.

Transferability of attack

Figure 6 also shows the effect of using a mismatched BT model in designing an attack, in order to establish the transferability of the attack. Using a poorer back-translation model weakens the attack, while using a stronger BT model (compare IWSLT vs. Google Translate) approaches the attack performance obtained with the victim's own BT model. This result establishes that smuggling attacks are transferable: the adversary does not need access to the target BT model, greatly increasing the practicality of our attack.

Probing attention To better understand the model's behaviour after a successful attack, we visualise the attention matrix in Figure 8 for a range of attack budgets and training-resource settings. Under a low attack budget (a, c), victim models show high attention between the target tokens "Albert Einstein" (in en and de), and also with the toxin "dopey". In contrast, under a large attack budget relative to the training data (b), the attention for all but the first subword in the attack phrase is focussed on "<EOS>" or punctuation. This is evidence of memorisation in the model: in the large-data setting it generates the memorised phrase unconditionally, rather than explaining it via translation, which can explain why the suffix is more vulnerable to attack.

Defence
When operating with large training corpora from diverse sources, a small number of poisoned sentences will be difficult to detect, and therefore difficult to defend against. While we might attempt to detect doctored sentences, e.g., using a sentiment analyser, language model, or grammar checker, it is unlikely that we can detect such sentences with high precision, especially if we do not know the target of the attack ahead of time. A more general defence is to limit the model's reliance on the unreliable monolingual data, by up-sampling the clean parallel data during training. Figure 7 shows that with sufficient up-sampling this can provide a partial defence against the attack; however, it comes with a substantial drop in performance of more than 2 BLEU points. To put this in context, the result is still considerably better than a model trained only on the parallel data, which scores 18.9. More elaborate defences, such as fine-tuning the model on curated clean data (Xu et al., 2020), are likely to provide a better compromise.
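At the data level, the up-sampling defence is simple to express; a minimal sketch, where the ratio is a tunable defence parameter and the value shown is illustrative:

```python
# Sketch of the up-sampling defence: replicate the clean parallel data so the model
# sees it more often relative to the (potentially poisoned) synthetic BT data.

def upsample(clean_pairs, synthetic_pairs, ratio: int = 4):
    """Return a training set with the clean parallel pairs repeated `ratio` times."""
    return clean_pairs * ratio + synthetic_pairs

# e.g., train_set = upsample(parallel_pairs, back_translated_pairs, ratio=8)
```

The trade-off reported above follows directly: a larger ratio dilutes the influence of any poisoned synthetic pairs, but also reduces the effective contribution of the monolingual data that back-translation relies on for its quality gains.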

Related Work
Research on adversarial learning for NMT has attracted much recent attention, with a focus on white-box, test-time attacks based on adversarial example generation (Belinkov and Bisk, 2018; Cheng et al., 2018; Ebrahimi et al., 2018; Wallace et al., 2020a). These adversarial examples cause translation errors, which can benefit model debugging and improve robustness when included in training (Cheng et al., 2018; Ebrahimi et al., 2018). By contrast, we focus on black-box, training-time attacks (Gu et al., 2017) via targeted poisoning of the training corpora. Moreover, the malicious translations produced by our attack are not errors; they are normal sentences carrying toxic information.
Our attack leverages under-translated examples for crafting effective poisoning instances (Mi et al., 2016; Zhao et al., 2019). While understanding when and why under-translation occurs is still an open question, we exploit this phenomenon to smuggle toxin words in our poisoning instances past the back-translation test.
Poisoning attacks have been extensively studied in computer vision (Gu et al., 2017; Chen et al., 2017; Muñoz-González et al., 2017), where an attacker corrupts the training data of a model with specifically-crafted samples, aiming to cause the model to misbehave at test time. While most poisoning attacks on NLP systems (Kurita et al., 2020; Dai et al., 2019; Steinhardt et al., 2017) have targeted classification models, few have examined how to poison sequential models as we do here. Xu et al. (2020) and Wallace et al. (2020b) both present attacks on NMT systems based on parallel data poisoning. Wallace et al. (2020b) perform attacks in a white-box setting, using a gradient-based method to conceal poisoned samples. Xu et al. (2020) use a black-box setting, which shares several similarities with our approach. Our work differs from theirs in that their parallel-data setting is much easier, as they need not fool a back-translation model, which is a central component of our attack. Several aspects of their attack and defence are relevant to this work, and we plan to integrate them into our method in future work.

Conclusion
In this paper, we studied a black-box targeted attack on NMT systems based on poisoning a monolingual corpus. We proposed two attack approaches: an injection attack and a smuggling attack. Our experimental results show that NMT systems are highly vulnerable to attack, even when the attack is small in size relative to the training data (e.g., 1k sentences out of 5M, or 0.02%). This is a serious concern for NMT systems in deployment, especially as our attempts at defences are only partly effective, and incur a substantial cost in translation quality. How to mount a more effective defence is a critical open question; we hope this work helps the community understand the weaknesses of our systems, so that these vulnerabilities can be patched before serious damage is done.