Dynamic Transformers Provide a False Sense of Efficiency

Despite much success in natural language processing (NLP), pre-trained language models typically incur a high computational cost during inference. Multi-exit architectures are a mainstream approach to this issue that trade accuracy for efficiency, where the computational saving comes from early exiting. However, whether such savings from early exiting are robust remains unknown. Motivated by this, we first show that directly adapting existing adversarial attacks, which target model accuracy, cannot significantly reduce inference efficiency. We therefore propose SAME, a simple yet effective slowdown attack framework specially tailored to reduce the efficiency of multi-exit models. Leveraging the design characteristics of multi-exit models, we use all internal predictions to guide adversarial sample generation instead of considering only the final prediction. Experiments on the GLUE benchmark show that SAME can effectively diminish the efficiency gain of various multi-exit models by 80% on average, convincingly validating its effectiveness and generalization ability.


Introduction
Pre-trained language models (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Chen et al., 2022b) have shown great potential in a wide range of NLP tasks. While large language models offer unparalleled performance, their high computational cost during inference limits the scope of applications. Many recent studies concentrate on efficient NLP, which aims to speed up the inference of deep language models without significant performance degradation (Sanh et al., 2019; Zafrir et al., 2019). Among these, multi-exit models (Xin et al., 2020) have attracted widespread attention.
The idea of multi-exit models stems from the observation that inputs with varying semantics demand distinct computational resources. By automatically adjusting the computation to the input semantics, one can effectively speed up the inference of a multi-exit model with minimal performance loss. Furthermore, such multi-exit models can easily be combined with other static speedup approaches, e.g., distillation (Sanh et al., 2019; Jiao et al., 2020), by replacing the backbone model. In addition to higher efficiency, previous studies also show that multi-exit models are more robust to correctness-based adversarial samples (Hu et al., 2020).
The study of NLP attacks has mostly focused on harming models' accuracy and has taken static transformers as victim models (Ebrahimi et al., 2018b; Li et al., 2020). There exists another type of attack on model efficiency, i.e., making models computationally slow. The intrinsic dynamic nature of multi-exit models might make them vulnerable to such attacks, yet how significantly the efficiency gain from early exiting can be affected remains unexplored. Motivated by this, we first analyze the efficiency robustness of dynamic NLP transformers. We find that previous accuracy-oriented approaches cannot significantly slow down dynamic transformers and sometimes even lead to shorter inference time.
To this end, we propose SAME, a novel slowdown attack framework on multi-exit language models. Unlike accuracy-oriented adversarial attacks, effective efficiency attacks face several unique challenges. First, existing accuracy-oriented attacks aim to mislead neural networks into wrong predictions, which is not suitable for efficiency-oriented attacks; we therefore develop a new objective function to guide the generation of efficiency-oriented adversarial samples. In addition, this objective function must be general enough to handle the various exit mechanisms in multi-exit transformers. Second, multi-exit transformers are not static during inference, so the "static" search strategies used in existing adversarial attacks do not apply. To overcome these challenges, we propose a dynamic importance adjustment strategy that assigns different importance to each exit layer, allowing the adversarial example search to focus on the layers that contribute to model efficiency.
We evaluate SAME using two widely used multi-exit strategies, entropy-based (Xin et al., 2020) and patience-based (Zhou et al., 2020), with various pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020) as the backbone on eight tasks from the GLUE benchmark. Experimental results show that SAME effectively reduces the computational saving by 80% on average, significantly outperforming previous accuracy-oriented approaches. Further experiments on multi-goal attacks, attack transferability, and adversarial training convincingly validate the effectiveness and generalization ability of SAME.
The contributions of this work are summarized as follows: (1) New Problem: we identify a new vulnerability of multi-exit NLP models, namely, their inference efficiency.
(2) Novel Approach: we propose the first efficiency-oriented attacking framework to measure the efficiency robustness of multi-exit NLP models.
(3) Comprehensive Evaluation: we conduct a systematic evaluation of various dynamic transformers, showing that future studies on improving and protecting the efficiency robustness of multi-exit NLP models are necessary.

Multi-Exit Networks
Multi-exit neural networks include multiple outputs or "exits" placed at different network layers. This architectural design allows for early decision-making if the input is confidently classified or predicted, leading to faster and more efficient processing. Based on the semantic complexity of the inputs, multi-exit neural networks can effectively reduce inference time by making predictions at early layers for simpler inputs and at later layers for more complex inputs. As shown in Figure 1, a multi-exit transformer consists of N transformer layers, each containing an internal classifier. During the inference phase, predictions are made after each layer, and computation is terminated once the exit criterion is met.
Figure 1: Illustration of entropy-based (left) and patience-based (right) early-exiting strategies; l_1...n refer to transformer layers, and H_i is the entropy of the probability distribution from the i-th internal classifier.
The choice of exit criterion is crucial in multi-exit models. In this work, we explore two commonly used strategies: entropy-based (Xin et al., 2020) and patience-based (Zhou et al., 2020). As depicted in Figure 1 (left), the entropy-based strategy employs the entropy of the probability distribution as an indicator of model confidence: the model checks whether the entropy is lower than a predefined threshold after each layer's computation and outputs a prediction once the criterion is met. The patience-based strategy, shown in Figure 1 (right), maintains a patience counter that is incremented by 1 when predictions from two consecutive internal classifiers are consistent and reset to zero when they are inconsistent. The model exits early once the patience counter reaches a predefined patience threshold.
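The two exit policies above can be sketched as simple decision loops. This is an illustrative sketch, not the authors' implementation: `layer_probs` and `layer_preds` stand in for the per-layer internal classifier outputs, and the function names are our own.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_based_exit(layer_probs, threshold):
    """Return the (1-indexed) layer at which an entropy-based
    multi-exit model stops: the first internal classifier whose
    prediction entropy falls below `threshold`."""
    for i, probs in enumerate(layer_probs, start=1):
        if entropy(probs) < threshold:
            return i
    return len(layer_probs)  # no early exit: all layers computed

def patience_based_exit(layer_preds, patience):
    """Return the layer at which a patience-based model stops: the
    counter grows while consecutive internal classifiers agree and
    resets to zero when they disagree."""
    counter = 0
    for i in range(1, len(layer_preds)):
        counter = counter + 1 if layer_preds[i] == layer_preds[i - 1] else 0
        if counter >= patience:
            return i + 1  # 1-indexed layer
    return len(layer_preds)
```

A slowdown attack succeeds precisely when it pushes these return values toward the total layer count.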

Adversarial Attack
Adversarial attacks are methods of creating adversarial examples that cause neural networks to make incorrect predictions (Papernot et al., 2016; Ebrahimi et al., 2018b; Li et al., 2019; Wallace et al., 2019; Le et al., 2022; Hong et al., 2021; Cheng et al., 2020; Chen et al., 2022a; Li et al., 2023). Adversarial attacks in NLP mainly fall into two categories: character level and word level. Character-level attacks modify the words in an input sentence using insertion, swap, or deletion operators to create adversarial examples (Belinkov and Bisk, 2018; Ebrahimi et al., 2018a). Word-level attacks, on the other hand, replace words in the input sentence with other words, e.g., via synonym replacement (Ren et al., 2019) or round-trip translation (Zhang et al., 2021). There has also been an emergence of attacks targeting generative models; for example, Seq2Sick (Cheng et al., 2020) generates adversarial examples that decrease the BLEU score of neural machine translation models. In addition to accuracy, inference efficiency is also highly critical for various real-time applications, e.g., speech recognition (Wang et al., 2022), machine translation (Fan et al., 2021; Zhu et al., 2020), and lyric transcription (Gao et al., 2022a,b, 2023). Recently, NICGSlowDown and NMT-Sloth (Chen et al., 2022c,d) propose delaying the appearance of the end token to reduce the efficiency of generative language models. Prior studies have also evaluated the accuracy robustness of dynamic transformers by directly adapting TextFooler (Jin et al., 2020). Unlike these works, the proposed SAME is specially designed for evaluating the efficiency robustness of dynamic transformers.

Problem Formulation
Unlike previous accuracy-oriented approaches, our goal is to create adversarial examples that decrease the efficiency of a victim multi-exit model F by adding human-unnoticeable perturbations to a benign input. Specifically, we focus on two factors: (i) significantly increasing the computational cost of the victim model and (ii) keeping the generated perturbation minimal. We formulate the problem as a constrained optimization problem:

max_Δ Exit_F(x + Δ),  s.t. ‖Δ‖ ≤ ϵ,   (1)

where x is the given benign input, ϵ is the maximum adversarial perturbation allowed, and Exit_F(·) measures the number of layers at which the victim multi-exit language model F exits. Our approach attempts to find the optimal perturbation Δ that maximizes the number of computed layers (decreasing efficiency) while keeping the perturbation below the allowed threshold (remaining unnoticeable). In this work, we set the allowable number of modified words ϵ to 10% of the total input words.
Figure 2: Design overview of SAME.

Figure 2 illustrates the design overview of our approach. Our approach iteratively mutates the given inputs to craft adversarial examples. During each iteration, we first design a differentiable objective to approximate our adversarial goals (Section 3.3). Then, we dynamically adjust our objective based on the importance of each layer (Section 3.4). Finally, we apply the approximated objective function to mutate the inputs with two types of perturbations and generate a set of adversarial candidates that satisfy the given unnoticeability constraints (Section 3.5).

Adversarial Objective Approximation
Notice that our optimization objective in Equation 1 is non-differentiable, which makes it challenging to use directly when searching for optimal adversarial perturbations. Thus, we need to approximate the adversarial objective (i.e., argmax Exit_F(·)) with a differentiable function. Various objectives are used in accuracy-based adversarial attacks, which aim to decrease the model's accuracy by increasing the confidence scores of wrong labels. However, these existing approaches do not address the model's efficiency, so a new design for efficiency-oriented adversarial objectives is required. Since the exit criteria determine the model's efficiency (as outlined in Section 2.1), we derive our efficiency-oriented adversarial objective from the termination criteria of F, as follows.

Making Mess Predictions: Recall that one way to determine early exiting is whether the entropy falls below a predefined threshold. To make the model less efficient, our goal is to keep the entropy consistently above this threshold. Noting that the uniform distribution has the highest entropy among all distributions, our first objective pushes the model prediction toward a uniform distribution:

L_mess = Σ_{i=1}^{N} SCE(F_i(x), U),   (2)

where F_i(x) is the prediction logits at the i-th layer, U is the uniform distribution, N is the total number of layers of the victim F, and SCE(·) is the soft cross-entropy loss. Eq. 2 is interpreted as minimizing the error between the output logits F_i(x) and the uniform distribution, pushing the model to produce higher entropy.

Decreasing Prediction Patience: The second termination criterion is based on prediction patience. Accordingly, our second objective pushes the victim model to produce "impatient" predictions.
In other words, we seek to push the model to make inconsistent predictions among its intermediate classifiers:

L_patience = Σ_{i=1}^{N} CE(F_i(x), h_i),   (3)

where h_i is the constructed target label at the i-th layer and CE(·) is the cross-entropy function. As mentioned above, this objective seeks to make the model produce inconsistent predictions. Thus, we construct each target h_i to disagree with the preceding target (Equation 4), with h_0 set as the prediction given by the model's first internal classifier on the seed input. The intuition is to force the model to produce inconsistent predictions between consecutive classifiers, thus decreasing prediction patience.
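The two losses can be sketched in NumPy as follows. This is a minimal sketch under our own reading of the text: soft cross entropy to the uniform distribution for the mess loss, and cross entropy toward alternating targets for the patience loss. The paper's exact target construction (Equation 4) is not reproduced here, so the simple cyclic label flip below is an assumption.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def mess_loss(layer_logits):
    """Soft cross entropy between each internal classifier's output and
    a uniform distribution: minimizing this pushes every layer's
    prediction entropy up, past entropy-based exit thresholds."""
    total = 0.0
    for logits in layer_logits:
        p = softmax(logits)
        u = np.full_like(p, 1.0 / p.shape[-1])
        total += -(u * np.log(p + 1e-12)).sum()
    return total

def patience_loss(layer_logits, h0):
    """Cross entropy against alternating target labels: consecutive
    classifiers are pushed toward disagreeing predictions. The cyclic
    target flip is an illustrative stand-in for the paper's Eq. 4."""
    n_classes = layer_logits[0].shape[-1]
    total, target = 0.0, h0
    for logits in layer_logits:
        p = softmax(logits)
        total += -np.log(p[target] + 1e-12)  # CE toward current target
        target = (target + 1) % n_classes    # alternate the target
    return total
```

In a real attack both quantities would be computed from the victim model's internal logits, with gradients flowing back to the input embeddings.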

Dynamic Importance Adjustment
It is important to note that the inference path of F is not "static", implying that treating all layer outputs equally at each stage of the search may not yield optimal results. For instance, if F exits at the third layer, optimizing the input to influence outputs before the third layer is less important. To overcome this challenge, we propose a strategy that dynamically adjusts the importance assigned to each layer's output. Given an input x, our layer-wise importance scores are computed as:

w_i = α, if i ≤ Exit_F(x);  w_i = β^{i − Exit_F(x)}, if i > Exit_F(x),   (5)

where w_i is the importance score for the i-th layer, Exit_F(x) is the index of the layer at which computation exits, and α and β are hyper-parameters. As shown in Eq. 5, the layers that have been computed are assigned constant importance scores, while the layers that are not reached are assigned exponentially increasing importance scores. Finally, our objective can be expressed as:

L_total = Σ_{i=1}^{N} w_i [SCE(F_i(x), U) + λ CE(F_i(x), h_i)],   (6)

where λ is a hyper-parameter that balances the importance of the two objectives.
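One reading of this importance scheme can be sketched as below: already-computed layers get a constant score, layers beyond the current exit get exponentially growing scores. The piecewise form and the default α, β values here are illustrative assumptions, as is the helper that combines per-layer loss terms.

```python
def importance_scores(n_layers, exit_layer, alpha=1.0, beta=2.0):
    """Layer-wise importance: layers up to the current exit get a
    constant score alpha; later (unreached) layers get exponentially
    increasing scores, steering the search toward pushing the exit
    deeper. alpha/beta values are illustrative."""
    return [alpha if i <= exit_layer else beta ** (i - exit_layer)
            for i in range(1, n_layers + 1)]

def total_loss(mess_terms, patience_terms, weights, lam=1.0):
    """Weighted combination of the per-layer mess and patience terms,
    with lam trading off the two objectives."""
    return sum(w * (m + lam * p)
               for w, m, p in zip(weights, mess_terms, patience_terms))
```

The weights are recomputed after each mutation round, since the exit layer of the current adversarial candidate changes as the attack progresses.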

Perturbing Inputs
Our adversarial perturbation generation includes three main steps: (i) finding critical words, (ii) generating adversarial candidates, and (iii) selecting candidates.

Finding Critical Words: As mentioned earlier, we apply our approximated objective function as guidance when searching for optimal adversarial perturbations. Thus, we first find the critical words using the gradient of the objective function (i.e., Equation 6). Specifically, we rank words by Σ_j |∂L_total / ∂tk_i^j|, where tk_i^j is the j-th dimension of the i-th token's embedding. In this step, we only consider words that are tokenized into exactly one token.

Generating Perturbation Candidates: After identifying the critical words, the next step is to perturb them to craft adversarial candidates. Following existing work, we use two types of perturbations to generate adversarial examples, character level and word level, which leads to two variants of SAME: SAME-Char and SAME-Word, respectively.
For character-level perturbation, we employ four widely used mutations: neighbor character swap, character insertion, character deletion, and homoglyph character replacement (Ebrahimi et al., 2018a; Liu et al., 2022). For the neighbor character swap and deletion mutations, we randomly swap or delete one character in the targeted word. For the character insertion mutation, we randomly select a character from the ASCII character set and insert it at a random position in the targeted word. For the homoglyph character replacement mutation, we use the default homoglyph character mapping from TextBugger (Li et al., 2019). All four character-level perturbations occur commonly in the real world when typing quickly and can go unnoticed without careful examination. For each mutation, we randomly generate 25 candidates, resulting in a total of 25×4=100 candidates.
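The four character-level mutations can be sketched as below. This is an illustrative sketch: the homoglyph map is a tiny hand-picked subset of Latin-to-Cyrillic lookalikes, not TextBugger's actual mapping, and the function names are our own.

```python
import random
import string

# Illustrative Latin -> Cyrillic lookalikes (NOT TextBugger's real map)
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "i": "і"}

def mutate_char(word, op, rng):
    """Apply one of the four character-level mutations to `word`."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    if op == "swap":       # neighbor character swap
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":     # character deletion
        return word[:i] + word[i + 1:]
    if op == "insert":     # random ASCII character insertion
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "homoglyph":  # visually similar character replacement
        for j, ch in enumerate(word):
            if ch in HOMOGLYPHS:
                return word[:j] + HOMOGLYPHS[ch] + word[j + 1:]
        return word
    raise ValueError(op)

def char_candidates(word, per_op=25, seed=0):
    """25 candidates per mutation type, 100 in total, as in the text."""
    rng = random.Random(seed)
    return [mutate_char(word, op, rng)
            for op in ("swap", "delete", "insert", "homoglyph")
            for _ in range(per_op)]
```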
For word-level perturbation, we consider replacing a critical word with another word. To choose the target word, we define the word-replace increment I_{s,t}, which measures the efficiency degradation of replacing word s with word t:

I_{s,t} = [E(t) − E(s)]^T ∇_{E(s)} L_total,

where E(·) represents the embedding vector of a given token, so I_{s,t} denotes the increase of our objective function, along the direction of its gradient, resulting from replacing token s with token t. For word-level perturbations, we also generate 100 adversarial candidates.
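A first-order reading of this increment: score each candidate token t by the embedding difference projected onto the gradient of the objective at s, then pick the highest-scoring replacement. The function names and the dense vocabulary scan below are our own sketch, not the authors' code.

```python
import numpy as np

def word_replace_increment(grad_s, emb_s, emb_t):
    """First-order estimate of the objective increase from swapping
    token s (embedding emb_s, gradient grad_s) for token t."""
    return float(np.dot(emb_t - emb_s, grad_s))

def best_replacement(grad_s, emb_s, vocab_embs):
    """Index of the vocabulary token maximizing the increment,
    computed for the whole vocabulary in one matrix product."""
    scores = (vocab_embs - emb_s) @ grad_s
    return int(np.argmax(scores))
```

In practice the top-k scoring tokens (rather than the single argmax) would be kept to form the 100 word-level candidates mentioned above.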
Candidate Selection: Once the adversarial candidates are generated, we select the valid candidates for the next iteration. To do this, we eliminate candidates that violate the constraint in Equation 1 and then keep the top 5 candidates with the highest Exit_F for the next iteration of the search.

Experimental Setup

Victim models: We evaluate two popular early-exit strategies: entropy-based DeeBERT (Xin et al., 2020) with backbone models BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and patience-based PABEE (Zhou et al., 2020) with backbone models BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020). Following the original papers, we consider two settings with different entropy or patience thresholds. Specifically, we select the threshold that keeps the relative performance drop within 2% or 4%, denoted as PD<2% and PD<4%, respectively.
Baselines: We compare SAME with 5 recent NLP attack approaches by adapting their attacking strategies to our scenario, including white-box approaches such as HotFlip (Ebrahimi et al., 2018b).

Metrics: We evaluate the efficacy of attacking methods with two metrics. Following prior work, the first metric is the estimated speedup, computed as the total number of transformer layers divided by the number of actually computed layers. In addition, we propose a second metric, the high computation ratio, i.e., the ratio of samples with extremely high computational cost. Specifically, we consider samples with at least 11 computed layers as high-computation samples for base-size dynamic transformers with 12 layers in total, and samples with at least 22 computed layers as high-computation samples for large-size dynamic transformers with 24 layers in total. In all tables, we report the speedup (left) and high computation ratio (right) unless specified otherwise.
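Both metrics follow directly from the per-sample exit layers; a small sketch, assuming 1-indexed exit layers and the thresholds stated above:

```python
def speedup(exit_layers, n_layers):
    """Estimated speedup: total layers divided by the average number
    of layers actually computed across the evaluation set."""
    avg = sum(exit_layers) / len(exit_layers)
    return n_layers / avg

def high_computation_ratio(exit_layers, threshold):
    """Fraction of samples computing at least `threshold` layers
    (11 for 12-layer base models, 22 for 24-layer large models)."""
    heavy = sum(1 for e in exit_layers if e >= threshold)
    return heavy / len(exit_layers)
```

A successful slowdown attack drives the speedup toward 1.0x and the high computation ratio toward 1.0.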

Main Results
The comparison of different attacking methods on entropy-based dynamic models is shown in Table 1, and the results on patience-based models are listed in Table 2. Overall, we find that previous accuracy-oriented approaches cannot harm the model efficiency much under either exit strategy, and even lead to higher speedup in some cases, e.g., on QQP and RTE. In sharp contrast, both variants of SAME effectively reduce the speedup from early exiting, outperforming all previous approaches by a large margin. Specifically, under the PD<2% setting, SAME eliminates the efficiency gain by 74.88% on average across the GLUE benchmark for DeeBERT-series models and by 85% for PABEE-series models. Under the PD<4% setting, the model's exit criteria are more relaxed, making slowdown more difficult. Nevertheless, SAME consistently reduces the efficiency gain by 75% for DeeBERT models and by 82% for PABEE models, again convincingly demonstrating its efficacy.
Besides, while previous work shows that patience-based approaches are more robust against accuracy-oriented attacks, we observe that both strategies are equally vulnerable under the proposed efficiency attack. In addition, the two strategies have different levels of vulnerability to different perturbations: entropy-based models are more vulnerable to character-level perturbations, whereas word-level perturbations perform better on patience-based models. We hypothesize that the discrepancy between the two exit strategies leads to this phenomenon. To slow down patience-based models, one needs to break the consistency between predictions from internal classifiers, which might be difficult to achieve with character-level perturbations. The results suggest that combining multiple levels of perturbation would lead to a more universal attacking framework applicable to a wide range of dynamic models.
Finally, we find that the quality of the backbone language model has a large impact on the efficiency robustness of dynamic transformers. For instance, compared to BERT, RoBERTa is trained on a larger corpus for longer, which makes DeeRoBERTa much more robust than DeeBERT models.

Accuracy & Efficiency
Since another important adversarial goal is misclassification, we further investigate the trade-off between accuracy and efficiency drop during attacking. Table 3 summarizes the results on SST-2 and MNLI-mm. In addition to the efficiency drop, SAME can also considerably induce misclassification. As the goal function of SAME does not consider the accuracy metric, we further propose SAME+, which adopts a multi-objective goal function that augments our efficiency objective with an accuracy-attacking term, where y_true is the ground truth label, 1(·) is the indicator function, and σ is the weight that balances the importance of accuracy and efficiency. As we focus on efficiency robustness in this work, we set σ to 0.5. SAME+ is therefore expected to produce adversarial samples with a similar efficiency-drop level as SAME but with an additional accuracy drop. As shown in Table 3, SAME+ yields a considerable further accuracy drop for both variants (37.47% for SAME-Char) without any loss in slowdown. In addition, previous work shows that patience-based methods are more robust than entropy/confidence-based ones against accuracy-oriented adversarial attacks. However, we observe that SAME leads to a similar accuracy drop on patience-based and entropy-based dynamic models. The robustness of patience-based methods comes from the ensemble of internal classifiers; yet, the proposed heuristic loss in SAME makes these internal classifiers hard to bring into agreement. The victim model then obtains predictions directly from the last classifier for a large proportion of inputs, which defeats the internal classifier ensemble mechanism. These empirical results suggest that it is possible to craft adversarial samples that harm accuracy and efficiency simultaneously.

Attacking Transferability
In this section, we examine whether adversarial samples from SAME are transferable between architectures. We study two settings: (i) cross backbone, where the source and target models share the same early-exit strategy but have different backbone models, and (ii) cross mechanism, where the source and target models have different early-exit strategies. Table 4 summarizes the results on the SST-2 and MNLI datasets. Overall, the adversarial samples are transferable between different models, and several critical factors determine the transferability. The first is the exit strategy: samples are more transferable between models sharing the same exit strategy, e.g., from PABEE-ALBERT-base to PABEE-BERT-base. The second is the backbone model: if the source and target models have the same backbone language model or share the same tokenizer, e.g., DeeBERT-base and DeeBERT-large, the transferred samples cause more slowdown. In addition, we find that entropy-based models are more vulnerable to transferred attacks than patience-based models. Interestingly, we again observe that character-level attacks are more transferable to entropy-based models, while word-level attacks are more transferable to patience-based models, which is consistent with our findings from Section 4.2.

Adversarial Training
We further explore whether this new efficiency threat can be defended against through adversarial training. Specifically, given a victim model, we first generate an adversarial sample for each sample in the training set using SAME or another adversarial approach. Then, we mix the clean and adversarial samples equally to retrain a new model. Finally, we attack the adversarially trained models again with SAME. We adjust the entropy/patience thresholds of the adversarially trained models to match the speedup of the original victim model. Table 5 shows the results. Overall, the efficiency robustness of dynamic transformers can be improved through adversarial training (1.18x to 1.58x on average using TextFooler), yet there still remains a drastic speedup loss (2.25x to 1.58x). Compared to accuracy-oriented adversarial data, data from SAME provide a greater robustness benefit against the attack, which validates the potential of using SAME to enhance the robustness of current dynamic transformers.

Discussion
Impact of Model Scale: Since attacking approaches must slow down larger victim models by more layers to achieve the same slowdown ratio, we further investigate the impact of victim model scale on attacking performance. Experimental results using the 24-layer BERT-large model on SST-2 and MNLI are shown in Table 6; due to space limitations, more results can be found in Appendix B. Accuracy-oriented methods can still hardly reduce inference efficiency, yet our proposed SAME effectively reduces the speedup ratio by 89%, comparable to the 93% on base-size models.

Impact of Modification Rate: In our main results, we set the allowable modification rate ϵ to 10% of the input words. We further investigate whether SAME can reduce inference efficiency under lower modification rates (imperceptible attack). The experimental results across the GLUE benchmark on DeeBERT-base and PABEE-BERT-base are summarized in Table 7. Even constrained to a very low modification rate, e.g., 3%, both variants of SAME can still significantly reduce the model's efficiency. Moreover, an increasing modification rate leads to a larger reduction in efficiency.

Ablation Study: To understand the inner mechanism of SAME, we conduct ablation studies on each component. As shown in Table 8, solely using the heuristic loss can already lead to significant efficiency degradation.

Semantic Similarity: While we constrain the modification rate in our experiments to keep the semantic meaning consistent, the semantic similarity between benign and adversarial examples is not explicitly constrained. Therefore, we further investigate the sentence-level semantic similarity between original and adversarial examples on the SST-2 dataset. Specifically, we first obtain the sentence representations of adversarial and original samples with a state-of-the-art ST5-large embedding model (Ni et al., 2022), and then compute their pairwise cosine similarity. With DeeBERT-base and PABEE-BERT-base as the victim models, SAME-Word has an average cosine similarity of 0.89, and SAME-Char has an average cosine similarity of 0.96. The results suggest that both variants of SAME preserve the inputs' semantic meaning well while reducing the efficiency of dynamic transformers.
Visualization: To illustrate the impact of efficiency-based vs. correctness-based adversarial perturbations, we present a case study of adversarial samples produced from the SST-2 dataset in Table 9. For better explainability, we show examples with only a one-word modification. Due to space limitations, more adversarial samples generated using SAME can be found in Appendix C.
As shown in Table 9, our efficiency-based method perturbs the word but into bujt, thereby removing the explicit contrastive relationship between the two clauses. While humans can make the correct prediction even without the word but, it can be challenging for dynamic transformers to infer the contrastive relationship in the early layers. Therefore, they fail to satisfy the exit conditions, resulting in reduced inference efficiency. In contrast, correctness-based approaches keep the transition word and adversarially modify the word deeper, e.g., into deper with TextBugger. With the transition word but intact, the model emphasizes the latter clause and easily reaches high confidence.
[Clean input] the film may appear naked in its narrative form ... but it goes deeper than that , to fundamental choices that include the complexity of the catholic doctrine.
[TextBugger] the film may appear naked in its narrative form ... but it goes deper than that , to fundamental choices that include the complexity of the catholic doctrine.
[TextFooler] the film may appear naked in its narrative form ... but it goes more than that , to fundamental choices that include the complexity of the catholic doctrine.
[SAME] the film may appear naked in its narrative form ... bujt it goes deeper than that , to fundamental choices that include the complexity of the catholic doctrine.

Table 9: Comparison of adversarial samples produced by accuracy-oriented approaches and our efficiency-oriented approach on SST-2.

Conclusion and Future Works
In this paper, we systematically evaluate the efficiency robustness of dynamic transformers. We propose SAME, a novel white-box slowdown attack framework that effectively degrades the efficiency of dynamic multi-exit language models. Specifically, SAME generates adversarial examples that delay the exit of dynamic multi-exit language models under the guidance of the heuristic and mess losses. Extensive experiments demonstrate the superior effectiveness of SAME across various dynamic multi-exit language models. Future work includes the development of efficiency-robust dynamic transformers and the extension to other NLP models with dynamic inference time.

Limitations
Firstly, our proposed SAME targets the white-box attacking scenario only, which is less practical in real-world settings. However, experimental results on black-box transferability show that a black-box efficiency-oriented attack is highly feasible. Therefore, we leave a black-box SAME as future work.
Secondly, we mainly study multi-exit transformers for sentence classification tasks in this work. We note that several recent works extend the idea of multi-exiting to other NLP tasks, e.g., sequence labelling (Li et al., 2021) and text generation (Schuster et al., 2022). For classification tasks, SAME slows down the models by preventing early exits. For text generation tasks, in addition to preventing early exits, one can also slow down the model by forcing it to produce a longer sequence. We leave the exploration of other dynamic models to future work.
Thirdly, as the first work that evaluates the efficiency robustness of dynamic transformers, we use a relatively simple perturbation strategy. Although these perturbations can lead to severe performance degradation, they might not be imperceptible enough. Still, they could easily be replaced by more sophisticated perturbations under the SAME framework.

Ethics Statement
We propose a slowdown attack against dynamic transformers on GLUE benchmark datasets in this work. We aim to study the efficiency robustness of dynamic transformers and provide insight to inspire future works on robust dynamic transformers.
Our proposed framework may be used to attack online NLP services deployed with dynamic models. However, we believe that exploring this new type of vulnerability and robustness of efficiency is more important than the above risks. Research studying effective adversarial attacks will motivate improvements to the system security to defend against the attacks.

B Results on Large Dynamic Language Models
We further conduct experiments on large dynamic transformers with the backbone models RoBERTa-large, ALBERT-large, and BERT-large.

C Visualization of our generated adversarial examples
We visualize several adversarial examples generated by our attack method on SST-2 in Table 11. By replacing only a few words in the benign input, our method can significantly delay the exit of dynamic multi-exit language models.

SAME-Word
[Clean input] although german cooking does not come readily to mind when considering the world 's best cuisine , mostly martha could make deutchland a popular destination for hungry tourists .
[Adv. input] although german cooking does not come readily no mind when considering akin world 's best cuisine , mostly martha could make deutchland rats popular destination for hungry tourists .
[Clean input] a difficult , absorbing film that manages to convey more substance despite its repetitions and inconsistencies than do most films than are far more pointed and clear .
[Adv. input] a difficult , absorbing film robots manages to convey more substance despite its repetitions and inconsistencies heart do most films than are far more pointed towards clear.
[Clean input] warm water under a red bridge is a quirky and poignant japanese film that explores the fascinating connections between women, water, nature, and sexuality.
[Adv. input] warm water under lacking red bridge did neither quirky and poignant japanese film that explores the fascinating connections between women, water, nature, and sexuality.

SAME-Char
[Clean input] the volatile dynamics of female friendship is the subject of this unhurried, low-key film that is so offhollywood that it seems positively french in its rhythms and resonance.
[Adv. input] the volatile dynamics of female friendship is the subject of this unhurried, low-key film that is so offhollywood tfhat it seems positively french in its rhythms arnd resonance.
[Clean input] if there's one thing this world needs less of, it's movies about college that are written and directed by people who couldn't pass an entrance exam.
[Adv. input] if there's one thing this world needs less of, it's movies aLbout college that are written and directed by pople who couldn't pass an entrance exam.

[Clean input] what's surprising about full frontal is that despite its overt self-awareness, parts of the movie still manage to break past the artifice and thoroughly engage you.
[Adv. input] what's surprising about full frontal is that despite its overt self-awareness, parts of the movie still manage to break paust the artifice gand thoroughly engage yuo.

Table 12: Comparison of various attacking methods on patience-based dynamic models. Since the patience threshold is a discrete number, some entries share the same value, e.g., PD<2% and PD<4% for PABEE-BERT on CoLA.