White-Box Multi-Objective Adversarial Attack on Dialogue Generation

Pre-trained transformers are popular in state-of-the-art dialogue generation (DG) systems. Such language models are, however, vulnerable to various adversarial samples, as studied in traditional tasks such as text classification, which inspires our curiosity about their robustness in DG systems. One main challenge of attacking DG models is that perturbations on the current sentence can hardly degrade the response accuracy, because the unchanged chat histories are also considered for decision-making. Instead of merely pursuing pitfalls of performance metrics such as BLEU and ROUGE, we observe that crafting adversarial samples to force longer generation outputs benefits attack effectiveness: the generated responses are typically irrelevant, lengthy, and repetitive. To this end, we propose a white-box multi-objective attack method called DGSlow. Specifically, DGSlow balances two objectives, generation accuracy and length, via a gradient-based multi-objective optimizer, and applies an adaptive searching mechanism to iteratively craft adversarial samples with only a few modifications. Comprehensive experiments on four benchmark datasets demonstrate that DGSlow can significantly degrade state-of-the-art DG models with a higher success rate than traditional accuracy-based methods. Besides, our crafted sentences also exhibit strong transferability in attacking other models.


Introduction
Pre-trained transformers have achieved remarkable success in dialogue generation (DG) (Raffel et al., 2020; Roller et al., 2021), e.g., the ubiquitous chat agents and voice-embedded chatbots. However, such powerful models are fragile when encountering adversarial samples crafted by small and imperceptible perturbations (Goodfellow et al., 2015). Recent studies have revealed the vulnerability of deep learning in traditional tasks such as text classification (Guo et al., 2021) and neural machine translation (Zou et al., 2020). Nonetheless, investigating the robustness of DG systems has not received much attention. Crafting DG adversarial samples is notably more challenging due to the conversational paradigm, where we can only modify the current utterance while the models also make decisions based on the previous chat history. This renders small perturbations even more negligible for degrading the output quality. An intuitive adaptation of existing accuracy-based attacks, especially black-box methods (Iyyer et al., 2018; Ren et al., 2019a) that merely pursue pitfalls of performance metrics, cannot effectively tackle this issue. Alternatively, we observe that adversarial perturbations forcing longer outputs are more effective against DG models, as longer generated responses are generally more semantically irrelevant to the references. Besides, such an objective is non-trivial because current large language models can handle and generate substantially long outputs. This implies that the two attacking objectives, generation accuracy and length, can be correlated and jointly approximated.
To this end, we propose a novel attack method targeting the two objectives, called DGSlow, which produces semantic-preserving adversarial samples and achieves a higher attack success rate on DG models. Specifically, we define two objective-oriented losses corresponding to the response accuracy and length. Instead of integrating both objectives and applying manual parameter tuning, which is inefficient and resource-consuming, we propose a gradient-based multi-objective optimizer to estimate an optimal Pareto-stationary solution (Lin et al., 2019). The derived gradients serve as indicators of the significance of each word in a DG instance. Then we iteratively substitute those keywords using masked language modeling (MLM) (Devlin et al., 2019) and validate the correctness of the crafted samples. The intuition is to maintain semantics and grammatical correctness with a minimum number of word replacements (Zou et al., 2020; Cheng et al., 2020b). Finally, we define a unique fitness function that considers both objectives for selecting promising crafted samples. Unlike existing techniques that apply either greedy or random search, we design an adaptive search algorithm whose selection criteria dynamically depend on the current iteration and the candidates' quality. The intuition is to avoid the search getting trapped in a local minimum and to further improve efficiency.
We conduct comprehensive attacking experiments on three pre-trained transformers over four DG benchmark datasets to evaluate the effectiveness of our method. Evaluation results demonstrate that DGSlow overall outperforms all baseline methods in terms of higher attack success rate, better semantic preservation, and longer as well as more irrelevant generation outputs. We further investigate the transferability of DGSlow across different models to illustrate its practicality and usability in real-world applications.
Our main contributions are as follows:
• To the best of our knowledge, we are the first to study the robustness of large language models in DG systems against adversarial attacks, and we propose a potential way to address this challenge by re-defining DG adversarial samples.
• Different from existing methods that only consider a single objective, e.g., generation accuracy, we propose multi-objective optimization and adaptive search to produce semantic-preserving adversarial samples that induce both lengthy and irrelevant outputs.
• Extensive experiments demonstrate the superiority of DGSlow over all baselines as well as the strong transferability of our crafted samples.

Dialogue Adversarial Generation
Suppose a chat bot aims to model conversations between two persons. We follow the settings of prior work, where each person has a persona (e.g., c^A for person A), described with L profile sentences c^A_1, ..., c^A_L. Person A chats with the other person B through an N-turn dialogue, where N is the total number of turns and x^A_n is the utterance that A says in the n-th turn. A DG model f takes the persona c^A, the entire dialogue history until the n-th turn h^A_n = (x^B_1, ..., x^A_{n-1}), and B's current utterance x^B_n as inputs, and generates the output x^A_n by maximizing the probability p(x^A_n | c^A, h^A_n, x^B_n). The same process applies for B to keep the conversation going. In the following, we first define the optimization goal of DG adversarial samples, then introduce our multi-objective optimization, followed by a search-based adversarial attack framework.

Definition of DG Adversarial Samples
In each dialogue turn n, we craft an utterance x^B_n that person B says to fool a bot targeting to mimic person A. Note that we do not modify the chat history h^A_n = (x^B_1, ..., x^A_{n-1}), as it should remain unchanged in real-world scenarios.
Take person B as an example; an optimal DG adversarial sample in the n-th turn is an utterance x^{B*}_n:

x^{B*}_n = argmin_{x̂^B_n} M(x̂^A_n, x^{ref}_n),  s.t. ρ(x^B_n, x̂^B_n) ≥ ε,   (1)

where ρ(·) is a metric measuring semantic preservation, e.g., the cosine similarity between the original input sentence x^B_n and a crafted sentence x̂^B_n, and ε is the perturbation threshold. M(·) is a metric evaluating the quality of an output sentence x̂^A_n against a reference x^{ref}_n. Existing work typically applies performance metrics from neural machine translation (NMT), e.g., the BLEU score (Papineni et al., 2002) or ROUGE (Lin and Och, 2004), as the measurement M(·). In this work, we argue that the output length itself directly affects the DG performance, and generating longer outputs should be considered as another optimization objective.
Accordingly, we define Targeted Confidence (TC) and Generation Length (GL). TC is formulated as the cumulative probability of a reference x^{ref}_n, representing the accuracy objective, while GL is defined as the number of tokens in the generated output sentence for an input x̂^B_n to

reflect the length objective:

TC(x̂^B_n) = Σ_t p(x^{ref}_{n,t} | c^A, h^A_n, x̂^B_n, x^{ref}_{n,<t}),  GL(x̂^B_n) = |f(c^A, h^A_n, x̂^B_n)|.   (2)

Figure 1: Illustration of our DGSlow attack method. In each iteration, the current adversarial utterance x̂^B_n, together with the persona, chat history, and references, is fed into the model to obtain the word saliency via gradient descent. We then mutate the positions with high word saliency and validate the correctness of the perturbed samples. The remaining samples query the model to calculate their fitness, and we select k prominent candidates using adaptive search for the next iteration.

Based on our DG definition in Eq. (1), we aim to craft adversarial samples that produce a small TC and a large GL. To this end, we propose a white-box targeted DG adversarial attack that integrates multi-objective optimization and adaptive search to iteratively craft adversarial samples with word-level perturbations (see Figure 1).
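As a concrete illustration, the two objectives can be sketched in a few lines of Python. The function names and the per-step probability interface below are our own illustration, not the paper's code; the model is assumed to expose per-step probabilities of the reference tokens and a generated token list.

```python
def targeted_confidence(ref_token_probs):
    """TC: cumulative probability the model assigns to the reference
    tokens given the crafted input; the attack drives this down."""
    return sum(ref_token_probs)

def generation_length(generated_tokens):
    """GL: number of tokens in the generated response; the attack
    drives this up."""
    return len(generated_tokens)

# Toy comparison: a crafted input is preferable if it lowers TC and
# raises GL relative to the original input.
tc_orig = targeted_confidence([0.9, 0.8, 0.7])
tc_adv = targeted_confidence([0.2, 0.1, 0.1])
gl_orig = generation_length(["i", "like", "dogs", "<eos>"])
gl_adv = generation_length(["i", "i", "really", "really", "like", "<eos>"])
```

A crafted utterance that simultaneously lowers TC and raises GL is exactly what the optimization in the next section searches for.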

Multi-Objective Optimization
For the length objective, crafting adversarial samples with a larger GL can be realized by minimizing the decoding probability of the eos token, which delays the end of the decoding process and thus generates longer sequences. Intuitively, to avoid modeling the implicit Markov relationship in a DG model and to simplify the computational cost, we directly force an adversarial example to reduce the probability of predicting the eos token by applying a Binary Cross Entropy (BCE) loss:

L_eos = −Σ_t log(1 − p^{eos}_t),  p^{tok}_t = exp(l^{tok}_t) / Σ_{tok′} exp(l^{tok′}_t),   (3)

where l^{tok}_t is the logit at position t for a predicted token tok, and p_t is the decoding probability for the t-th token. Furthermore, we penalize adversarial samples that deviate too much from the original sentence to preserve semantics:

L_reg = max(0, ε − ρ(x^B_n, x̂^B_n)),   (4)

where ρ and ε are the semantic similarity and threshold defined in Eq. (1). We formulate the stop loss as a weighted sum of the eos loss and the regularization penalty to represent the length objective:

L_stop = L_eos + β · L_reg,   (5)

where β is a hyper-parameter that controls the penalty term's impact level. Considering that the log-likelihood loss L_ll and the stop loss L_stop may conflict to some extent, as they target different objectives, we assign proper weights α_1, α_2 to each loss and optimize them based on Multi-objective Optimization (MO) theory (Lin et al., 2019). Specifically, we aim to find a Pareto-stationary point by solving the Lagrange problem:

min_α ||Gα||²_2,  s.t. e^T α = 1, α ≥ c,   (6)

where G = [g_ll, g_stop], and g_ll, g_stop are the gradients derived from L_ll, L_stop w.r.t. the embedding layer, e = [1, 1], c = [c_1, c_2] with the boundary constraints α_1 ≥ c_1, α_2 ≥ c_2, and λ is the Lagrange multiplier of the equality constraint. The final gradient is defined as the weighted sum g = α*_1 · g_ll + α*_2 · g_stop. Such gradients facilitate locating the significant words in a sentence for effective and efficient perturbations.
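For two objectives, the min-norm combination of the two gradients admits a simple closed form (the classic two-task case of min-norm multi-objective solvers). The sketch below is our illustration, not the authors' implementation; it solves for the weight of g_ll in the min-norm combination and then clips it to the boundary constraints c_1, c_2.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combine_gradients(g_ll, g_stop, c1=0.0, c2=0.0):
    """Closed-form min-norm weighting of two gradients: minimize
    ||a*g_ll + (1-a)*g_stop||^2 over a in [0, 1], then enforce the
    lower bounds a >= c1 and (1 - a) >= c2."""
    diff = [a - b for a, b in zip(g_ll, g_stop)]        # g_ll - g_stop
    denom = dot(diff, diff)
    if denom == 0.0:
        alpha1 = 0.5                                    # identical gradients
    else:
        # stationary point of the quadratic, clipped to [0, 1]
        alpha1 = dot([-d for d in diff], g_stop) / denom
        alpha1 = min(max(alpha1, 0.0), 1.0)
    alpha1 = min(max(alpha1, c1), 1.0 - c2)             # boundary constraints
    alpha2 = 1.0 - alpha1
    g = [alpha1 * a + alpha2 * b for a, b in zip(g_ll, g_stop)]
    return alpha1, alpha2, g
```

For orthogonal gradients of equal norm, this reduces to equal weights, i.e., the combined direction splits the difference between the two objectives.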

Search-based Adversarial Attack
We combine the multi-objective optimization with a search-based attack framework to iteratively generate adversarial samples against the DG model, as shown in the right part of Figure 1. Specifically, our search-based attacking framework contains three parts: Gradient-guided Perturbation (GP), which substitutes words at significant positions; Hard-constraints Validation (HV), which filters out invalid adversarial candidates; and Adaptive Search (AS), which selects the k most prominent candidates based on different conditions for the next iteration.
Gradient-guided Perturbation. Let x = [w_0, ..., w_i, ..., w_n] be the original sentence, where i denotes the position of word w_i in the sentence. During iteration t, for the current adversarial sentence x̂^(t) = [w^(t)_0, ..., w^(t)_i, ..., w^(t)_n], we first compute the Word Saliency (WS) (Li et al., 2016), which is used to sort the positions whose corresponding words have not been perturbed. The intuition is to skip positions that are likely to produce a low attack effect, so as to accelerate the search process. In our DG scenario, WS refers to the significance of a word in an input sentence for generating irrelevant and lengthy output. We quantify WS by average-pooling the aforementioned gradient g over the embedding dimension, and sort the positions in descending order of their scores.
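The saliency ranking can be sketched as follows; the per-token gradient list and the signed average pooling are our reading of the description, not the paper's code.

```python
def word_saliency_order(token_grads, perturbed=frozenset()):
    """Rank unperturbed positions by Word Saliency: the combined
    gradient g for each token is average-pooled over the embedding
    dimension, and positions are sorted large-to-small.
    token_grads: list of per-token gradient vectors.
    perturbed: set of positions already substituted (skipped)."""
    scores = {
        i: sum(g) / len(g)               # average-pool embedding dims
        for i, g in enumerate(token_grads)
        if i not in perturbed
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: position 1 has the largest pooled gradient, so it is
# tried first; once perturbed, it is excluded from later rankings.
grads = [[0.1, 0.3], [0.9, 0.7], [0.4, 0.4]]
first_pass = word_saliency_order(grads)
second_pass = word_saliency_order(grads, perturbed={1})
```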
For each position i, we define a candidate set of words w̃_i ∈ D, where D is a dictionary consisting of all words that express meanings similar to w^(t)_i, considering the sentence context. In this work, we apply BERT masked language modeling (MLM) (Devlin et al., 2019) to generate the c closest neighbors in the latent space. The intuition is to generate adversarial samples that are more fluent than rule-based synonymous substitutions. We further check those neighbors by querying WordNet (Miller, 1998) and filtering out antonyms of w^(t)_i to build the candidate set. Specifically, we first create a masked sentence x̂^(t)_mask by replacing w^(t)_i with the [MASK] token, and take the MLM's top-c predictions as candidates.
Hard-constraints Validation. The generated adversarial sentence x̂^(t) could differ substantially from the original x after t iterations. To ensure fluency, we validate the number of grammatical errors in x̂^(t) using a language checker (Myint, 2021).
Besides, the adversarial candidates should also preserve enough semantic information of the original sentence. Accordingly, we encode x̂^(t) and x using the Universal Sentence Encoder (USE) (Cer et al., 2018), and calculate the cosine similarity between their sentence embeddings as their semantic similarity. We keep those generated adversarial candidates x̂^(t) whose 1) number of grammar errors is no larger than that of x and 2) cosine similarity with x is larger than a predefined threshold ε, and put them into a set V^(t), which is initialized anew before the next iteration.
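The mutate-then-validate step might look like the sketch below. Here `fill_mask`, `antonyms`, `grammar_errors`, and `encode` are pluggable stand-ins for the BERT MLM, WordNet, the language checker, and USE respectively; none of these names come from the paper's code, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def build_candidates(tokens, i, fill_mask, antonyms, c=50):
    """Mask position i, take the MLM's top-c fillers, and drop
    antonyms of the original word (the WordNet filtering step)."""
    masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    banned = antonyms.get(tokens[i], set())
    return [w for w in fill_mask(masked)[:c] if w not in banned]

def passes_hard_constraints(cand, orig, grammar_errors, encode, eps=0.7):
    """Keep a candidate only if it introduces no additional grammar
    errors and its sentence-embedding similarity exceeds eps."""
    return (grammar_errors(cand) <= grammar_errors(orig)
            and cosine(encode(cand), encode(orig)) > eps)

# Toy stand-ins for the pluggable components.
fill_mask = lambda toks: ["good", "great", "bad", "fine"]
antonyms = {"good": {"bad"}}
cands = build_candidates(["the", "food", "is", "good"], 3, fill_mask, antonyms)
```

In a real implementation, `fill_mask` would query bert-large-cased and `grammar_errors` a language checker; the control flow above is the part the paper specifies.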
Adaptive Search. We define a domain-specific fitness function ϕ(x̂^B_n) that measures the preference for a specific adversarial x̂^B_n by combining the two objectives. The fitness serves as a criterion for selecting x̂^B_n that produces a larger GL and has a lower TC with respect to the references x^{ref}_n, considering the persona c^A and chat history h^A_n. After each iteration, it is straightforward to select candidates using Random Search (RS) or Greedy Search (GS) based on the candidates' fitness scores. However, random search ignores the impact of an initial result on the final result, while greedy search neglects the situations where a local optimum is not the global optimum. Instead, we design an adaptive search algorithm based on the iteration t as well as the candidates' quality q_t. Specifically, q_t is defined as the average cosine similarity between each valid candidate and the original input:

q_t = (1/|V^(t)|) Σ_{x̂^B_n ∈ V^(t)} cos(x̂^B_n, x^B_n).

A larger q_t means smaller perturbation effects. The search preference ξ_t can be formulated as:

ξ_t = (t/T) · q_t,

where T is the maximum number of iterations. Given t ∈ [1, ..., T] and q_t ∈ [0, 1], ξ_t is also bounded in the range [0, 1]. We apply random search if ξ_t is larger than a threshold δ, and greedy search otherwise. The intuition is to 1) find a prominent initial result using greedy search at the early stage (small t), and 2) avoid being trapped in a local minimum by gradually introducing randomness when there is no significant difference between the current adversarial candidates and the prototype (large q_t). We select the k (beam size) most prominent candidates in V^(t); each selected sample serves as an initial adversarial sentence in the next iteration to start a new local search for more diverse candidates. We keep track of the perturbed positions for each adversarial sample to avoid repetitive perturbations and further improve efficiency.
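The selection step can be sketched as follows. The concrete form ξ_t = (t/T)·q_t is our reconstruction of the bounded preference score (any form bounded in [0, 1] that grows with t and q_t fits the described behavior), and `fitness` and `similarity` stand in for ϕ and the USE cosine similarity.

```python
import random

def adaptive_select(candidates, fitness, similarity, t, T, k=2, delta=0.5):
    """Pick k candidates from the valid set V_t: greedy (top fitness)
    in early iterations, random once the preference score xi_t
    exceeds delta (late stage or weak perturbations)."""
    q_t = sum(similarity(c) for c in candidates) / len(candidates)
    xi_t = (t / T) * q_t              # bounded in [0, 1] for t in [1, T]
    if xi_t > delta:
        return random.sample(candidates, min(k, len(candidates)))
    return sorted(candidates, key=fitness, reverse=True)[:k]   # greedy

# Toy run: early iteration (t=1) stays greedy; a late iteration with
# near-unperturbed candidates switches to random sampling.
cands = ["a", "b", "c"]
fit = lambda c: {"a": 1, "b": 3, "c": 2}[c]
sim = lambda c: 0.9
early = adaptive_select(cands, fit, sim, t=1, T=5)
late = adaptive_select(cands, fit, sim, t=5, T=5)
```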

Experimental Setup
Datasets. We evaluate our generated adversarial DG examples on four benchmark datasets, namely, Blended Skill Talk (BST) (Smith et al., 2020), PERSONACHAT (PC) (Zhang et al., 2018), ConvAI2 (CV2) (Dinan et al., 2020), and EmpatheticDialogues (ED) (Rashkin et al., 2019a). For BST and PC, we use their annotated suggestions as the references x^{ref}_n for evaluation. For ConvAI2 and ED, we use the response x^A_n as the reference, since no other references are provided. Note that we ignore the persona during inference for ED, as it does not include personality information. We preprocess all datasets following the DG settings (in Section 2), where each dialogue contains n turns of utterances. The statistics of their training sets are shown in Table 2.
Victim Models. We aim to attack three pre-trained transformers, namely, DialoGPT, BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). DialoGPT is pre-trained for DG on a Reddit dataset, based on the autoregressive GPT-2 backbone (Radford et al., 2019). The latter two are seq2seq encoder-decoders pre-trained on open-domain datasets. Specifically, we use the HuggingFace pre-trained models dialogpt-small, bart-base, and t5-small. The detailed information of each model can be found in Appendix A. We use Byte-level BPE tokenization (Radford et al., 2019) pre-trained on open-domain datasets, as implemented in HuggingFace tokenizers. To meet the DG requirements, we also define two additional special tokens, namely, [PS] and [SEP]. [PS] is added before each persona so that the model is aware of the personality of each person.
[SEP] is added between each utterance within a dialogue so that the model can learn the structural information within the chat history.
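Assuming the tokens are used as described, input assembly might look like the following sketch; the exact ordering and whitespace are our assumptions, and the function name is illustrative.

```python
def build_dg_input(persona_sentences, history, current_utterance):
    """Assemble a DG model input with the two special tokens:
    [PS] precedes each persona sentence, and [SEP] separates the
    utterances in the chat history and the current utterance."""
    persona = " ".join("[PS] " + s for s in persona_sentences)
    dialogue = " [SEP] ".join(history + [current_utterance])
    return persona + " " + dialogue

example = build_dg_input(["i like dogs"], ["hi", "hello"], "how are you?")
```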
Metrics. We evaluate attack methods considering 1) the generation accuracy of adversarial samples, 2) the generation length (GL) of adversarial samples, and 3) the attack success rate (ASR). Specifically, the generation accuracy of adversarial samples is measured by performance metrics such as BLEU (Papineni et al., 2002), ROUGE-L (Lin and Och, 2004; Li et al., 2022), and METEOR (Banerjee and Lavie, 2005), which reflect the correspondence between a DG output and the references. We define ASR as:

ASR = (1/N) Σ_{i=1}^{N} 1[ M(f(x_i), x^{ref}_i) − M(f(x̂_i), x^{ref}_i) > τ ∧ cos(x_i, x̂_i) > ε ],

where cos(·) denotes the cosine similarity between the embeddings of an original input x and a crafted input x̂, and M(·, ·) is the average score of the three accuracy metrics. An attack is successful if the adversarial input can induce a more irrelevant (> τ) output while preserving enough semantics (> ε) of the original input. Details of the performance of the victim models are listed in Table 1.
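The ASR computation reduces to a simple count. The sketch below assumes per-instance records of the original accuracy, adversarial accuracy, and input similarity; the record layout is our choice for illustration.

```python
def attack_success_rate(records, tau=0.0, eps=0.7):
    """records: (M_orig, M_adv, cos_sim) per test instance, where M is
    the averaged accuracy score (BLEU/ROUGE-L/METEOR) of the output
    against the reference. An attack counts as a success when accuracy
    drops by more than tau and the crafted input stays within the
    semantic-similarity threshold eps."""
    hits = sum(1 for m_o, m_a, c in records if (m_o - m_a) > tau and c > eps)
    return hits / len(records)

# Three instances: one success, one with no accuracy drop, one whose
# crafted input drifted too far semantically.
asr = attack_success_rate([(0.5, 0.1, 0.9), (0.5, 0.5, 0.9), (0.5, 0.1, 0.3)])
```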
Baselines. We compare against five recent white-box attacks, adapting their attacking strategies to our DG scenario. Four are accuracy-based attacks: 1) FD (Papernot et al., 2016) conducts a standard gradient-based word substitution for each word in the input sentence; 2) HotFlip (Ebrahimi et al., 2018b) proposes adversarial attacks based on both word- and character-level substitutions using embedding gradients; 3) TextBugger proposes a greedy word-substitution and character-manipulation strategy to conduct white-box adversarial attacks; 4) UAT (Wallace et al., 2019) proposes word or character manipulations based on gradients; its implementation relies on prompt insertion, which differs from most other approaches. The fifth, NMTSloth (Chen et al., 2022), is a length-based attack that crafts adversarial samples to make an NMT system generate longer outputs. It is a strong baseline that generates sub-optimal length-based adversarial samples even under several constraints.
For all baselines, we adapt their methodologies to DG scenarios, where the input for computing the loss contains both the current utterance and the other parts of a DG instance, including chat history, persona, or additional contexts. Specifically, we use TC as the optimization objective (i.e., L_ll) for all baselines except NMTSloth, which is a seq2seq attack method, and apply gradient descent to search for either word or character substitutions.

Hyper-parameters. For our DG adversarial attack, the perturbation threshold ε and the performance threshold τ are set to 0.7 and 0 for defining a valid adversarial example. For multi-objective optimization, the regularization weight β is set to 1, and the two boundaries c_1 and c_2 are set to 0 for non-negative constraints. We use the HuggingFace pre-trained bert-large-cased model for MLM and set the number of candidates c to 50 for mutation. For adaptive search, we set the preference threshold δ to 0.5 and the beam size k to 2. Our maximum number of iterations is set to 5, meaning that we modify no more than 5 words in each sentence. Besides, we restrict the maximum query number to 2,000 for all attack methods. For each dataset, we randomly select 100 dialogue conversations (each containing 5∼8 turns) for testing the attack effectiveness.

Table 3 shows the GL, two accuracy metrics (METEOR results are in Appendix A), ASR, and cosine results of all attack methods. We observe that NMTSloth and our DGSlow produce much longer outputs than the other four baselines. Accordingly, their attack effectiveness regarding the output accuracy, i.e., BLEU and ROUGE-L, and their ASR scores are much better than those of the four accuracy-based methods, supporting our assumption that adversarial samples forcing longer outputs also induce worse generation accuracy.
Though NMTSloth can also generate lengthy outputs as DGSlow does, our method still achieves better ASR, accuracy scores, and cosine similarity, demonstrating that our multi-objective optimization further benefits both objectives. Moreover, our method ensures semantic-preserving perturbations while largely degrading the model performance, e.g., the cosine similarity of DGSlow is at the top level among baselines such as UAT and TextBugger. This further shows that our gradient-based word saliency, together with the adaptive search, can efficiently locate significant positions and achieve the maximum attack effect with only a few modifications.

Overall Effectiveness
Attack Efficiency. Figure 2 shows all attack methods' ASR on BST when attacking DialoGPT under a restriction on the maximum number of iterations. Results for the other two models can be found in Appendix A. We observe that our attack significantly outperforms all accuracy-based baseline methods under the same level of modifications, demonstrating the efficiency of the length-based approach. Furthermore, DGSlow achieves better ASR than NMTSloth, proving the practicality of our multi-objective optimization and adaptive search in real-world DG situations.
Beam Size. We further evaluate the impact of the number of prominent candidates k retained after each iteration on the attack effectiveness, as shown in Table 4. We observe that a larger k leads to overall longer GL, larger ASR, and smaller BLEU, showing that as more diverse candidates enter the search space, DGSlow benefits from the adaptive search in finding better local optima.

Ablation Study
We present the ablation study of our proposed DGSlow algorithm in Table 5. Specifically, if MO is not included, we only use the gradient g_stop derived from L_stop for searching candidates. If CF is not included, we use ϕ(x̂^B_n) = GL(x̂^B_n) as the fitness function, meaning we only select candidates that generate the longest outputs and ignore the quality measurement. We observe that: 1) greedily selecting candidates with the highest fitness is more effective than random guessing, e.g., the ASR of GS is much higher than that of RS; 2) our adaptive search, i.e., DGSlow 1, makes better choices when selecting candidates compared to RS and GS; 3) modifying the fitness function to consider both TC and GL, i.e., DGSlow 2, slightly improves the overall ASR over DGSlow 1; and 4) using only multi-objective optimization, i.e., DGSlow 3, produces better attack results than only modifying the fitness.

Transferability
We evaluate the transferability of adversarial samples generated by our method on each model in ED, with the other two as the victim models. From Table 6, we observe that DGSlow crafts adversarial samples with decent transferability, e.g., the ASRs are generally above 50%, and the corresponding accuracy scores, e.g., BLEU, all decrease compared to those produced by the original samples. We believe this is because DGSlow perturbs the sentence based on both the accuracy and output-length objectives, enabling adversarial samples to capture more common vulnerabilities across different victim models than single-objective methods.

Case Study
We visualize three adversarial samples generated by DGSlow in Table 7, which can effectively attack the DialoGPT model. By replacing only a few tokens with substitutions of similar meanings and part-of-speech tags, our method induces the model to generate much longer, more irrelevant sequences x̂^A_n compared to the original ones x^A_n. Such limited perturbations also preserve the readability and semantics of our crafted adversarial samples.

Adversarial Attack
Various existing adversarial techniques have raised great attention to model robustness in the deep learning community (Papernot et al., 2016; Ebrahimi et al., 2018b; Wallace et al., 2019; Chen et al., 2022; Ren et al., 2019b; Li et al., 2020, 2023). Earlier text adversarial attacks explore character-based perturbations, as they ignore out-of-vocabulary and grammar constraints and are straightforward for achieving adversarial goals (Belinkov and Bisk, 2018; Ebrahimi et al., 2018a). More recently, few attack works focus on the character level (Le et al., 2022), since it is hard to generate adversarial samples free of grammatical errors without human study. Conversely, sentence-level attacks best promise grammatical correctness (Iyyer et al., 2018) but yield a lower attack success rate due to changes in semantics. Currently, it is more common to apply word-level adversarial attacks based on word substitutions, additions, and deletions (Ren et al., 2019b; Zou et al., 2020; Wallace et al., 2020). Such strategies better trade off semantics, grammatical correctness, and attack success rate. Besides, a few studies focus on crafting attacks targeted at seq2seq tasks. For example, NMTSloth (Chen et al., 2022) aims to force longer translation outputs from an NMT system, while Seq2Sick (Cheng et al., 2020a) and Michel et al. (2019) aim to degrade the generation confidence of a seq2seq model. Unlike previous works that consider only a single optimization goal, we propose a new multi-objective word-level adversarial attack against DG systems, which are challenging for existing methods. We leverage the conversational characteristics of DG and redefine the attacking objectives to craft adversarial samples that produce lengthy and irrelevant outputs.

Dialogue Generation
Dialogue generation is the task of understanding natural language inputs and producing human-level outputs, e.g., back-and-forth dialogue between humans and a conversational agent such as a chat bot. Common benchmarks for this task include PERSONACHAT (Zhang et al., 2018), FUSEDCHAT (Young et al., 2022), Blended Skill Talk (Smith et al., 2020), ConvAI2 (Dinan et al., 2020), and EmpatheticDialogues (Rashkin et al., 2019b). A general DG instance contains at least the chat history up to the current turn, which a chat bot takes in a structured manner to generate responses. Recent DG chat bots are based on pre-trained transformers, including GPT-based language models such as DialoGPT and PersonaGPT (Tang et al., 2021), and seq2seq models such as BlenderBot (Roller et al., 2021), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020). These large models can mimic human-like responses and even incorporate personalities into the generations if a user profile (persona) or other contexts are provided.

Conclusions
In this paper, we propose DGSlow, a white-box multi-objective adversarial attack that can effectively degrade the performance of DG models. Specifically, DGSlow aims to craft adversarial samples that induce long and irrelevant outputs. To fulfill the two objectives, it first defines two objective-oriented losses and applies a gradient-based multi-objective optimizer to locate keywords for a higher attack success rate. Then, DGSlow perturbs words with semantic-preserving substitutions and selects promising candidates to iteratively approximate an optimal solution. Experimental results show that DGSlow achieves state-of-the-art results regarding the attack success rate, the quality of adversarial samples, and the DG performance degradation. We also show that adversarial samples generated by DGSlow on one model can effectively attack other models, proving the practicality of our attack in real-world scenarios.

Limitations
Mutation. We propose a simple but effective gradient-based mutation strategy. More complex mutation methods can be integrated into our framework to further improve attack effectiveness.
Black-box Attack. DGSlow is based on a white-box setting to craft samples with fewer queries, but it can be easily adapted to black-box scenarios by using a non-gradient search algorithm, e.g., defining word saliency based on our fitness function and performing greedy substitutions.
Adversarial Defense. We do not consider defense methods in this work. Some defense methods, e.g., adversarial training and input denoising, may be able to defend against our proposed DGSlow. Note that our goal is to reveal the potential threats posed by adversarial attacks and the vulnerability of DG models, thus motivating research on model robustness.

Ethics Statement
In this paper, we design a multi-objective white-box attack against DG models on four benchmark datasets. We aim to study the robustness of state-of-the-art transformers in DG systems through substantial experimental results and to gain insights about explainable AI. Moreover, we explore the potential risks of deploying deep learning techniques in real-world DG scenarios, facilitating more research on system security and model robustness.
One potential risk of our work is that the methodology may be used to launch adversarial attacks against online chat services or computer networks. We believe the contribution of revealing the vulnerability of conversational models outweighs such risks, as the research community can pay more attention to these attacks and improve system security to defend against them. Therefore, it is important to first study and understand adversarial attacks.

A Additional Settings and Results
Details of Victim Models. For DialoGPT, we use dialogpt-small, which contains 12 attention layers with 768 hidden units and 117M parameters in total. For BART, we use bart-base, which has 6 encoder layers and 6 decoder layers with 768 hidden units and 139M parameters. For T5, we use t5-small, which contains 6 encoder layers and 6 decoder layers with 512 hidden units and 60M parameters in total.
Attack Efficiency. We evaluate the ASR under the restriction on the number of iterations for BART in Figure 3 and T5 in Figure 4. We observe that DGSlow significantly outperforms all accuracy-based baseline methods. Compared to the length-based NMTSloth, our method exhibits advantages as the number of iterations grows, showing the superiority of our adaptive search algorithm.
METEOR Results. We show the METEOR results for attacking the three models in four benchmark datasets in Table 8. We observe that DGSlow achieves overall the best METEOR scores, further demonstrating the effectiveness of our attack method.