Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems

Do all instances need inference through the big models for a correct prediction? Perhaps not; some instances are easy and can be answered correctly even by small-capacity models. This provides opportunities for improving the computational efficiency of systems. In this work, we present an exploratory study of 'model cascading', a simple technique that utilizes a collection of models of varying capacities to output predictions accurately yet efficiently. Through comprehensive experiments in multiple task settings that differ in the number of models available for cascading (the K value), we show that cascading improves both computational efficiency and prediction accuracy. For instance, in the K=3 setting, cascading saves up to 88.93% of the computation cost and consistently achieves superior prediction accuracy, with an improvement of up to 2.18%. We also study the impact of introducing additional models in the cascade and show that it further increases the efficiency improvements. Finally, we hope that our work will facilitate the development of efficient NLP systems, making their widespread adoption in real-world applications possible.


Introduction
Pre-trained language models such as RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2020), and T5 (Raffel et al., 2020) have achieved remarkable performance on numerous natural language processing benchmarks (Wang et al., 2018, 2019; Talmor et al., 2019). However, these models have a large number of parameters, which makes them slow and computationally expensive; for instance, T5-11B requires ∼87 × 10^11 floating point operations (FLOPs) for an inference. This limits their widespread adoption in real-world applications that prefer computationally efficient systems in order to achieve low response times.
The above concern has recently received considerable attention from the NLP community, leading to the development of several techniques, such as (1) network pruning, which progressively removes weights from a big network (Wang et al., 2020; Guo et al., 2021), (2) early exiting, which allows multiple exit paths in a model (Xin et al., 2020), (3) adaptive inference, which adjusts model size by adaptively selecting its width and depth (Goyal et al., 2020; Kim and Cho, 2021), (4) knowledge distillation, which transfers 'dark knowledge' from a large teacher model to a shallow student model (Jiao et al., 2020; Li et al., 2022), and (5) input reduction, which eliminates less contributing tokens from the input text to speed up inference (Modarressi et al., 2022). These methods typically require architectural modifications, network manipulation, saliency quantification, or even complex training procedures. Moreover, computational efficiency in these methods often comes with a compromise on accuracy. In contrast, model cascading, a simple technique that utilizes a collection of models of varying capacities to output predictions accurately yet efficiently, has remained underexplored.

Figure 1: Illustrating a cascading approach with three models (Mini, Med, and Base) arranged in increasing order of capacity. An input is first passed through the smallest model (Mini), which fails to predict with sufficient confidence. It is therefore inferred using a bigger model (Med), which satisfies the confidence constraints, and the system outputs its prediction ('contradiction', as a dog has four legs). Thus, by avoiding inference through large/expensive models, the system saves computation cost without sacrificing accuracy.
In this work, we address the above limitation by first providing a mathematical formulation of model cascading and then exploring several approaches to perform it. In its problem setup, a collection of models of different capacities (and hence performances) is provided, and the system needs to output its prediction by leveraging one or more models. On one extreme, the system can use only the smallest model; on the other extreme, it can use all the available models (ensembling). The former system would be highly efficient but usually poor in performance, while the latter would be fairly accurate but expensive in computation. Model cascading strives to get the best of both worlds by allowing the system to efficiently utilize the available models while achieving high prediction accuracy. This is in line with the 'Efficient NLP' (Arase and et al., 2021) policy document put up by the ACL community.
Consider the CommitmentBank (de Marneffe et al., 2019) dataset, on which the BERT-medium model, with just 41.7M parameters, achieves 75% accuracy, while the bigger BERT-base model, with 110M parameters, achieves 82% accuracy. Clearly, the performance of the bigger model can be matched by inferring a large number of instances using the smaller model and only a few instances using the bigger model. Thus, by carefully deciding when to use bigger/more expensive models, the computational efficiency of NLP systems can be improved. So, how should we decide which model(s) to use for a given test instance? Figure 1 illustrates one approach: it infers an instance sequentially through the models (ordered by increasing capacity) and uses a threshold over the maximum softmax probability (MaxProb) to decide whether to output the prediction or pass the instance to the next model in the sequence. The intuition behind this approach is that MaxProb shows a positive correlation with predictive correctness. Thus, instances predicted with high MaxProb get answered at early stages, as their predictions are likely to be correct, and the remaining ones get passed to the larger models. Hence, by avoiding inference through large and expensive models (primarily for easy instances), cascading makes the system computationally efficient while maintaining high prediction performance.
We describe several such cascading methods in Section 3.2. Furthermore, cascading allows custom computation costs, as different numbers of models can be used for inference. We compute accuracies for a range of costs and plot an accuracy-cost curve. Then, we calculate its area under the curve (AUC) to quantify the efficacy of the cascading method. The larger the AUC value, the better the method, as it implies higher accuracy on average across computation costs.
We conduct comprehensive experiments with 10 diverse NLU datasets in multiple task settings that differ in the number of models available for cascading (the K value from Section 3). We first demonstrate that cascading achieves considerable improvement in computational efficiency. For example, on the QQP dataset, the cascading system achieves an 88.93% computation improvement over the largest model (M_3) in the K=3 setting, i.e., it requires just 11.07% of the computation cost of model M_3 to attain equal accuracy. We then show that cascading also improves prediction accuracy. For example, on the CB dataset, the cascading system achieves a 2.18% accuracy improvement over M_3 in the K=3 setting. Similar improvements are observed in settings with different values of K. Lastly, we show that introducing an additional model into the cascade further increases the efficiency benefits.
In summary, our contributions and findings are:
1. Model Cascading: We provide a mathematical formulation of model cascading, explore several methods, and systematically study its benefits.
2. Cascading Improves Efficiency: Using accuracy-cost curves, we show that cascading systems require much less computation cost to attain accuracies equal to those of big models.
3. Cascading Improves Accuracy: We show that cascading systems consistently achieve superior prediction performance to even the largest model available in the task setting.
4. Comparison of Cascading Methods: We compare the performance of our proposed cascading methods and find that DTU (3.2) outperforms all others by achieving the highest AUC of accuracy-cost curves on average.
We note that model cascading is trivially easy to implement, can be applied to a variety of problems, and can have good practical value.

Related Work
In recent times, several techniques have been developed to improve the efficiency of NLP systems, such as network pruning (Wang et al., 2020; Guo et al., 2021; Chen et al., 2020), quantization (Shen et al., 2020; Zhang et al., 2020; Tao et al., 2022), knowledge distillation (Clark et al., 2019; Jiao et al., 2020; Li et al., 2022; Mirzadeh et al., 2020), and input reduction (Modarressi et al., 2022). Our work is more closely related to dynamic inference (Xin et al., 2020) and adaptive model size (Goyal et al., 2020; Kim and Cho, 2021; Hou et al., 2020; Soldaini and Moschitti, 2020). Xin et al. (2020) proposed Dynamic early exiting for BERT (DeeBERT), which speeds up BERT inference by inserting extra classification layers between transformer layers, allowing an instance to exit conditionally from multiple exit paths; all the weights (including the newly introduced classification layers) are jointly learnt during training. Goyal et al. (2020) proposed Progressive Word-vector Elimination (PoWER-BERT), which reduces the number of intermediate vectors computed along the encoder pipeline by eliminating vectors based on significance computed with the self-attention mechanism. Kim and Cho (2021) extended PoWER-BERT to the Length-Adaptive Transformer, which adaptively determines the sequence length at each layer. Hou et al. (2020) proposed a dynamic BERT model (DynaBERT) that adjusts the size of the model by selecting adaptive width and depth; they first train a width-adaptive BERT and then distill knowledge from full-size models to small sub-models.
Lastly, cascading has been studied in machine learning and computer vision with approaches such as the Haar cascade (Soo, 2014), but it remains underexplored in NLP (Li et al., 2021). We further note that cascading is non-trivially different from 'ensembling', as ensembling always uses all the available models instead of carefully selecting one or more models for inference.
Our work differs from existing methods in the following aspects: (1) Existing methods typically require architectural changes, network manipulation, saliency quantification, knowledge distillation, or complex training procedures; in contrast, cascading is a simple technique that is easy to implement and does not require such modifications. (2) The computational efficiency of existing methods often comes with a compromise on accuracy; contrary to this, we show that model cascading surpasses the accuracy of even the largest models. (3) Existing methods typically require training a separate model for each computation budget; a single cascading system, on the other hand, can be adjusted to meet all computation constraints. (4) Finally, cascading does not require an instance to be passed sequentially through the model layers; approaches such as routing (Section 3) allow passing it directly to a suitable model.

Model Cascading
We define model cascading as follows: Given a collection of models of varying capacities, the system needs to leverage one or more models in a computationally efficient way to output accurate predictions.
As previously mentioned, a system using only the smallest model would be highly efficient but poor in accuracy, while a system using all the available models would be fairly accurate but expensive in computation. The goal of cascading is to achieve high prediction accuracy while efficiently leveraging the available models. The remainder of this section is organized as follows: we provide the mathematical formulation of cascading in 3.1 and describe its various approaches in 3.2.

Formulation
Consider a collection of K trained models (M_1, ..., M_K) ordered by increasing computation cost, i.e., for an instance x, c^x_j < c^x_k (∀ j < k), where c^x_j denotes the cost of inferring x with model M_j. The system needs to output a prediction for each instance of the evaluation dataset D by leveraging one or more models. Let M^x_j be an indicator of whether model M_j is used by the system to make the inference for instance x, i.e.,

M^x_j = 1 if M_j is used for x, and 0 otherwise.

Thus, the average cost of the system over the entire evaluation dataset D is calculated as:

Cost(D) = (1/|D|) Σ_{x ∈ D} Σ_{j=1}^{K} M^x_j · c^x_j

In addition to this cost, we also measure accuracy, i.e., the percentage of correct predictions made by the system. The goal is to achieve high prediction accuracy while being computationally efficient.
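As a concrete illustration, the average-cost formula can be computed as follows (a minimal Python sketch; the usage indicators and per-instance costs below are hypothetical toy values):

```python
def average_cost(usage, costs):
    """Average inference cost over a dataset.

    usage[x][j] = 1 if model j was used for instance x, else 0
    costs[x][j] = inference cost (e.g., FLOPs) of model j on instance x
    """
    total = sum(
        sum(u * c for u, c in zip(inst_usage, inst_costs))
        for inst_usage, inst_costs in zip(usage, costs)
    )
    return total / len(usage)

# Two instances, K=2 models: the first answered by M1 alone,
# the second required both M1 and M2.
usage = [[1, 0], [1, 1]]
costs = [[1.0, 4.0], [1.0, 4.0]]
print(average_cost(usage, costs))  # → 3.0
```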
Performance Evaluation: As the computation cost increases, the accuracy usually also increases, since the system leverages large models (which are often more accurate) for more instances. To quantify the performance of a cascading method, we first plot its accuracy-cost curve by varying the computation cost and then calculate the area under this curve (AUC). The larger the AUC value, the better the cascading method, as it implies higher accuracy on average across all computation costs. We note that the computation cost of the cascading system can be varied by adjusting the confidence thresholds of the models in the cascade (described in the next subsection).
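The AUC computation can be sketched with the trapezoidal rule as follows (the cost and accuracy values below are hypothetical; normalizing by the cost range is one way to make curves over different cost ranges comparable):

```python
def auc_accuracy_cost(costs, accuracies):
    """Area under the accuracy-cost curve via the trapezoidal rule.

    `costs` must be sorted in increasing order; each entry pairs a
    computation cost with the accuracy the cascade achieves at that
    cost. The area is normalized by the cost range so that the result
    is an average accuracy across the cost spectrum.
    """
    area = 0.0
    for (c0, a0), (c1, a1) in zip(
        zip(costs, accuracies), zip(costs[1:], accuracies[1:])
    ):
        area += (c1 - c0) * (a0 + a1) / 2.0
    return area / (costs[-1] - costs[0])

# Hypothetical curve: accuracy rises from 80% to 90% as cost grows.
print(auc_accuracy_cost([1.0, 2.0, 3.0], [80.0, 85.0, 90.0]))  # → 85.0
```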
Along with the AUC metric, we evaluate the efficacy of cascading on two additional parameters:
1. Comparing the computation cost of the cascading system at the accuracies achieved by each individual model of the cascade: Consider a setting in which the model M_2 achieves accuracy a_2 at computation cost c_2; from the accuracy-cost curve of the cascading system, we compare c_2 with the cost of the cascading system when its accuracy is a_2.
2. Comparing the maximum accuracy of the cascading system with that of the largest model of the collection: We compare the accuracy of the largest individual model with the maximum accuracy achieved by the cascading system.
Note that the first parameter corresponds to the point of intersection obtained by drawing a horizontal line from the accuracy-cost point of each individual model to the accuracy-cost curve. Refer to the red dashed lines in Figures 2 and 4 for illustration. For a cascading system to perform better than the individual models in the cascade, it should have a lower computation cost (in parameter one) and a higher accuracy (in parameter two).

Approaches
We explore the following approaches of selecting which model(s) to use for inference.

Maximum Softmax Probability (MaxProb):
Usually, the last layer of a model has a softmax activation function that distributes its prediction probability P(y) over all possible answer candidates Y. MaxProb corresponds to the maximum softmax probability assigned by the model, i.e.,

MaxProb = max_{y ∈ Y} P(y)

MaxProb (often termed prediction confidence) has been shown to be positively correlated with predictive correctness (Hendrycks and Gimpel, 2017; Hendrycks et al., 2020; Varshney et al., 2022c), i.e., a high MaxProb value implies a high likelihood that the model's prediction is correct. We leverage this characteristic of MaxProb in our first cascading approach. Specifically, we infer the given input instance sequentially through the models, starting with M_1, and use a confidence threshold over the MaxProb value to decide whether to output the prediction or pass the instance to the next model in the sequence.
Consider an instance x for which the models up to M_{z−1} fail to surpass their confidence thresholds and M_z exceeds its threshold; then the system outputs M_z's prediction, and the total inference cost incurred for x is Σ_{j=1}^{z} c^x_j (i.e., M^x_j = 1 for j ≤ z and 0 otherwise). The confidence thresholds can differ at different stages. Figure 1 illustrates this approach. It provides efficiency benefits, as it avoids passing easy instances (which can potentially be answered correctly by low-compute models) to the computationally expensive models. Furthermore, it does not sacrifice the accuracy of the system, because the difficult instances often end up being answered by the large (and more accurate) models. We note that this approach requires additional computation for comparing MaxProb values with thresholds, but its cost is negligible in comparison to the cost of model inference and is hence ignored in the overall cost calculation.
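The sequential MaxProb cascade described above can be sketched as follows (a minimal sketch: the toy models, logits, and thresholds are hypothetical stand-ins; a real system would wrap trained classifiers):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cascade_predict(models, thresholds, x):
    """Sequential MaxProb cascade.

    `models` is a list of callables ordered by increasing capacity,
    each returning logits for input x; `thresholds` holds one
    confidence threshold per model (the last model always answers,
    so its threshold is effectively ignored).
    """
    for i, model in enumerate(models):
        probs = softmax(model(x))
        max_prob = max(probs)
        is_last = i == len(models) - 1
        if is_last or max_prob >= thresholds[i]:
            return probs.index(max_prob)

# Toy models: M1 is unsure (flat logits), M2 is confident.
m1 = lambda x: [0.1, 0.2, 0.15]
m2 = lambda x: [0.1, 4.0, 0.2]
print(cascade_predict([m1, m2], [0.9, 0.0], "some input"))  # → 1
```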

Distance To Uniform Distribution (DTU):
In this approach, we use the distance between the model's softmax probability distribution and the uniform probability distribution as the confidence estimate (in place of MaxProb) to decide whether to output the prediction or pass the instance to the next model in the sequence, i.e.,

DTU = ||P(Y) − U(Y)||_2

where U(Y) corresponds to the uniform output distribution. For example, for a task with 4 classification labels, U(Y) = [0.25, 0.25, 0.25, 0.25]. The intuition behind this approach is to leverage the entire shape of the output probability distribution, and not just the highest probability as in MaxProb.
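A minimal sketch of the DTU confidence estimate, assuming a Euclidean (L2) distance between the two distributions (the example probabilities are hypothetical):

```python
import math

def dtu_confidence(probs):
    """Distance from the softmax distribution to the uniform
    distribution; a larger distance means a more peaked (i.e., more
    confident) prediction. The L2 norm here is an assumption about
    the distance measure.
    """
    u = 1.0 / len(probs)
    return math.sqrt(sum((p - u) ** 2 for p in probs))

# A peaked distribution is farther from uniform than a flat one.
peaked = [0.85, 0.05, 0.05, 0.05]
flat = [0.30, 0.25, 0.25, 0.20]
assert dtu_confidence(peaked) > dtu_confidence(flat)
print(dtu_confidence([0.25, 0.25, 0.25, 0.25]))  # → 0.0
```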
Random: In this approach, instead of using a metric such as MaxProb or DTU to decide which instances to pass to the next model in the sequence, we perform this instance selection at random. This serves as a baseline cascading method.
Heuristic: Here, we use a heuristic derived from the input text to decide which instances to pass to the next model in the sequence. Specifically, we use the length of the input text as the heuristic.
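Such a length heuristic can be sketched as follows (the word-count threshold is a hypothetical tuning knob, not a value from the experiments):

```python
def select_by_length(text, length_threshold):
    """Length heuristic: short inputs are handled by the small model,
    long inputs are passed to the big one."""
    return "small" if len(text.split()) <= length_threshold else "big"

print(select_by_length("a short sentence", 10))    # → small
print(select_by_length(" ".join(["word"] * 50), 10))  # → big
```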
Routing: In this approach, instead of sequentially passing an instance to bigger and bigger models, we skip intermediate models and pass the instance directly to a suitable model based on its MaxProb value. For example, in the K=3 setting, we first infer using M_1; if its MaxProb is very low, we skip M_2 and directly pass the instance to M_3. On the other hand, if its MaxProb is sufficiently high (but below M_1's output threshold), we pass it to M_2. The intuition behind this approach is that the system might save the inference cost of intermediate models by directly using a suitable model that is likely to answer correctly. This approach is not applicable for K=2, as there is only one option to route to after inference through model M_1.
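The routing decision for K=3 can be sketched with two thresholds over M_1's MaxProb (both thresholds are hypothetical tuning knobs):

```python
def route(maxprob, high_threshold, low_threshold):
    """Routing for K=3: answer with M1 when confident, route to M2
    when moderately confident, and skip directly to M3 when MaxProb
    is very low."""
    if maxprob >= high_threshold:
        return "answer with M1"
    if maxprob >= low_threshold:
        return "route to M2"
    return "route to M3"

print(route(0.95, 0.9, 0.5))  # → answer with M1
print(route(0.70, 0.9, 0.5))  # → route to M2
print(route(0.30, 0.9, 0.5))  # → route to M3
```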

Models:
We use the following variants of BERT (Devlin et al., 2019) in our experiments: BERT-mini (11.3M parameters), BERT-medium (41.7M parameters), BERT-base (110M parameters), and BERT-large (340M parameters). Table 1 shows the computation cost (in FLOPs) of these models for different input text sequence lengths. Following standard experimental practice, we use a sequence length of 50 for CoLA, 80 for SST-2, 100 for QQP, 120 for MNLI, DNLI, and SNLI, 150 for QNLI, MRPC, and PAWS, and 275 for the CommitmentBank dataset. We run all our experiments on Nvidia V100 GPUs with a batch size of 32 and learning rates in {1−5}e−5. In the following subsections, we study the effect of cascading in multiple settings that differ in the number of models in the cascade, i.e., the K value in the task formulation.

Problem Setup
In this setting, we consider two trained models: BERT-medium (41.7M parameters) as M_1 and BERT-base (110M parameters) as M_2. We analyze results for other model combinations (such as medium & large, and mini & large) in Appendix C.

Results
Recall that the computation cost of a cascading system can be controlled by changing the M^x_j values, i.e., which models are used for which instances. For example, in the case of MaxProb, changing the confidence threshold results in different M^x_j values and hence different cost and accuracy values. Figure 2 shows accuracy-cost curves for two cascading approaches: MaxProb (in blue) and the Random baseline (in black). In the same figure, we also show accuracy-cost points for the individual models M_1 and M_2. To avoid cluttering these figures, we plot accuracy-cost curves for the other approaches in separate figures and present them in Appendix C. To compare the performance of these methods, we provide the AUC values of their respective accuracy-cost curves in Table 2.
Efficiency Improvement: The accuracy-cost curves show that the cascading system matches the accuracy of the larger model M_2 at a considerably lower computation cost. This cost value corresponds to the point of intersection of the curve with a horizontal line drawn from M_2 (red dashed line). For example, in the case of QQP, model M_2 achieves 89.99% accuracy at an average computation cost of 8.49 × 10^9 FLOPs, while the cascading system achieves the same accuracy at only 2.82 × 10^9 FLOPs; similarly, for MNLI, the cascading system matches M_2's accuracy at only 5.26 × 10^9 FLOPs. Such improvements are observed for all datasets. This efficiency benefit comes from using the smaller models for a large number of instances and passing only a few instances to the larger models.
Accuracy Improvement: From the accuracy-cost curves, it can be observed that beyond the cost value identified in the previous paragraph (where the red dashed line intersects the accuracy-cost curve), the cascading system outperforms model M_2 in terms of accuracy. For example, in the case of QQP, cascading with MaxProb achieves an accuracy of up to 90.39%, which is higher than the accuracy of M_2 (89.99%). Similar improvements are observed for all other datasets. We note that the accuracy improvement is a by-product of cascading; its primary benefit remains the improvement in computational efficiency.
The higher accuracy achieved by the cascading system (which uses M_1 for some instances and conditionally also uses M_2 for others) compared to the larger model M_2 implies that M_1, despite being smaller, is more accurate than M_2 on at least a few instances, even though M_2 has higher accuracy than M_1 on average across all instances. The cascading system uses M_1 for instances on which it is sufficiently confident and thus more likely to be correct; only the instances on which it is not sufficiently confident get passed to the bigger model. This supports the findings of recent works such as (Zhong et al., 2021; Varshney et al., 2022b) that conduct instance-level analyses of models' predictions. We further analyze these results in the next paragraphs.
Comparing Cascading Approaches: Figure 2 demonstrates that the MaxProb cascading approach clearly outperforms the 'Random' cascading baseline. In Table 2, we compare the AUC of the respective accuracy-cost curves of the various cascading approaches. Both MaxProb and DTU outperform both baseline methods (Random and Heuristic). In the K=2 setting, MaxProb and DTU achieve roughly the same performance on average across all datasets. The gap between MaxProb and DTU becomes more significant in the K=3 setting (4.3).
Contribution of M_1 and M_2 in the Cascade: To further analyze the performance of the cascading system, we study the contribution of the individual models M_1 and M_2 in the cascade. Figure 3 shows the contribution of M_1 and M_2 for the MNLI dataset when the cost is 5.26 × 10^9 FLOPs, i.e., the point at which the accuracy of the cascading system equals that of the bigger model M_2 (the intersection of the horizontal red dashed line with the accuracy-cost curve of the cascading system in Figure 2). At this point, the cascade uses M_1 for 78% of the instances and M_2 for the remaining 22%. The accuracy of M_1 on its 78% of instances (87.6%) equals that of M_2 on those same instances, since the overall accuracy of the system on the complete dataset equals that of M_2. However, this does not imply that the instance-level predictions of the two models on those 78% are exactly the same, though their predictions overlap in the majority of cases.
Figure 3 also shows that the accuracy of model M_1 on the instances that got passed to M_2 in the cascade is significantly lower (by 33.12%) than on the instances that M_1 answered (blue bars). M_2 achieves 10.12% higher accuracy on those instances than M_1. Therefore, the cascading system utilizes the models efficiently, using the smaller model M_1 for the easy instances and the larger model M_2 for the difficult ones. We analyze these results for other datasets in Appendix C.2.1.

Problem Setup
Now, we study the effect of introducing another model into the problem setup of the K=2 setting. Specifically, we consider three models: BERT-mini (11.3M parameters) as M_1, BERT-medium (41.7M parameters) as M_2, and BERT-base (110M parameters) as M_3. Note that BERT-medium is referred to as M_2 in this setting, as it is the second model in the cascade, unlike the K=2 setting (4.2), in which it was M_1.

Results and Analysis
Figure 4 shows the accuracy-cost curves of two cascading approaches, MaxProb (in blue) and the Random baseline (in black), and Table 3 compares the AUC values achieved by the various cascading approaches. In general, cascading achieves a larger improvement (in magnitude) in the K=3 setting than in the K=2 setting.
Efficiency Improvement: The accuracy-cost curves show that the cascading system matches the accuracy of the larger models M_2 and M_3 at considerably lower respective computation costs. For example, in the case of QQP, the cascading system matches the accuracy of model M_3 using just 11.07% of M_3's computation cost, and that of model M_2 using just 23.53% of M_2's computation cost. The magnitude of the efficiency improvement in this setting is higher than in the K=2 setting.
Accuracy Improvement: Cascading also improves the overall accuracy. For example, on the CB dataset, the cascading system achieves 83.93% accuracy, which is higher than even that of the largest model M_3. Similar improvements are observed for the other datasets.
Comparing Cascading Approaches: Table 3 compares the AUC values achieved by the various cascading approaches. DTU clearly outperforms all other cascading methods, achieving the highest AUC values. We attribute this to DTU's use of the entire shape of the output probability distribution, rather than just the highest probability, in computing its confidence.
Contribution of M_1, M_2, and M_3 in the Cascade: Figure 5 shows the contribution of the individual models M_1, M_2, and M_3 in the cascade for the MNLI dataset when the cost is 4.8 × 10^9 FLOPs, i.e., the point at which the accuracy of the cascade equals that of the largest model M_3 (where the horizontal red dashed line intersects the accuracy-cost curve in Figure 4). The accuracy of M_2 on the instances that were passed to M_3 drops by 28.53%. This shows that the cascading system is good at identifying potentially incorrect predictions of M_1 and passing those instances to M_2, and similarly good at identifying potentially incorrect predictions of M_2 and passing those instances to M_3.

Advantage of Introducing Another Model in the Cascade: Comparing Figure 5 for the K=3 setting with Figure 3 for the K=2 setting, we find that introducing a smaller model into the collection makes the cascading system more efficient. This is because the BERT-medium model answered 78% of the instances in the K=2 setting, and that portion got split between the BERT-mini (lower cost than medium) and BERT-medium models in the K=3 setting while maintaining the accuracy. This suggests that the cascading technique utilizes the available models efficiently without sacrificing accuracy. We analyze these results for other datasets in Appendix D.1.1.

Conclusion and Discussion
We systematically explored model cascading and proposed several methods for it. Through comprehensive experiments with 10 diverse NLU datasets, we demonstrated that cascading improves both computational efficiency and prediction accuracy. We also studied the impact of introducing another model into the collection and showed that it further improves the computational efficiency of the cascading system.
Selecting the Optimal Operating Threshold: The selection of confidence thresholds for the models in the cascade depends on the computation budget of the system. A low-budget system can select low thresholds for the low-cost models (so that the low-cost models answer the majority of the questions, leading to a lower computation cost); similarly, high-budget systems can afford to select high thresholds to achieve higher accuracy. To select thresholds in an application-independent manner, the standard ML practice of using validation data to tune hyperparameters can be used.
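The validation-based threshold selection mentioned above can be sketched as a simple grid search for K=2 (a minimal sketch; the validation triples, per-model costs, and budget below are hypothetical, and a real tuner would use actual model predictions on held-out data):

```python
def tune_threshold(val, cost_small, cost_big, budget):
    """Grid-search a MaxProb threshold on validation data.

    `val` is a list of (maxprob, small_correct, big_correct) triples
    describing the small model's confidence and both models'
    correctness on each validation instance. Returns the (threshold,
    accuracy) pair with the best validation accuracy whose average
    cost fits the budget, or None if no threshold fits.
    """
    best = None
    for t in [i / 10 for i in range(11)]:  # candidate thresholds 0.0 .. 1.0
        cost = acc = 0.0
        for maxprob, small_ok, big_ok in val:
            if maxprob >= t:            # small model answers directly
                cost += cost_small
                acc += small_ok
            else:                       # escalate to the big model
                cost += cost_small + cost_big
                acc += big_ok
        cost /= len(val)
        acc /= len(val)
        if cost <= budget and (best is None or acc > best[1]):
            best = (t, acc)
    return best

# With a generous budget, the tuner escalates the uncertain instance.
val = [(0.95, 1, 1), (0.60, 0, 1), (0.80, 1, 1)]
print(tune_threshold(val, cost_small=1.0, cost_big=4.0, budget=5.0))  # → (0.7, 1.0)
```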
Outlier/OOD Detection Techniques: Outlier/OOD detection techniques such as (Lee et al., 2018;Hsu et al., 2020;Liu et al., 2020) can also be explored to decide which instance to pass to the bigger models in the cascade.
Including Linear Models in the Cascade: This idea can be extended to include less expensive, non-transformer-based models such as linear models or LSTM-based models. Since the computation cost of these models is significantly lower than that of transformer-based models, and yet they achieve non-trivial predictive performance, a cascading system including them could achieve even greater improvements in computational efficiency. We plan to explore this aspect in future work.

Limitations
A potential downside of cascading is that it requires multiple models to be stored. However, we note that the additional space required for the extra models (mini and medium) in the K=3 setting is merely 0.44 times that required for the base model (Table 1). Thus, it does not pose a serious concern.

Appendix A Efficiency in NLP
With the introduction of large-scale pre-trained language models, efficiency has attracted a lot of research attention. Efficiency is being studied through diverse lenses, such as training data efficiency (Lewis et al., 2019; Schick and Schütze, 2021; Varshney et al., 2022a; Wang et al., 2021; Mishra and Sachdeva, 2020; Ben Zaken et al., 2022), evaluation efficiency (Rodriguez et al., 2021; Varshney et al., 2022b), parameter-efficient tuning methods (Li and Liang, 2021; Houlsby et al., 2019), and inference efficiency. In this work, we focus on inference efficiency and propose model cascading, a simple technique that utilizes a collection of models of varying capacities to output predictions accurately yet efficiently.

B Dataset Statistics
Table 4 shows the statistics of all evaluation datasets considered in this work. We consider a diverse set of NLU datasets spanning several tasks, such as natural language inference, duplicate detection, and sentiment classification.
C Cascading with Two Models (K=2)

C.1 Other Model Combinations

C.1.1 Medium and Large
Figure 6 shows accuracy-cost curves with MaxProb (in blue) and Random (in black) as cascading approaches, with M_1 as BERT-medium and M_2 as BERT-large. The MaxProb approach clearly outperforms the Random approach and achieves a considerably higher AUC value.

C.1.2 Mini and Large
Figure 7 shows accuracy-cost curves with MaxProb (in blue) and Random (in black) as cascading approaches, with M_1 as BERT-mini and M_2 as BERT-large. The MaxProb approach clearly outperforms the Random approach and achieves a considerably higher AUC value.
C.2 Contribution of M 1 and M 2 in the Cascade

C.2.1 Medium and Base
Figure 8 shows the contribution of the individual models M_1 and M_2 in the cascade when the accuracy of M_2 equals that of the cascading system. We analyze this for the MNLI dataset in Section 4.2 and provide figures for a few other datasets here.

Figure 9 shows the contribution of the individual models M_1, M_2, and M_3 in the cascade when the accuracy of M_3 equals that of the cascading system. We analyze this for the MNLI dataset in Section 4.3 and provide figures for a few other datasets here.

D.2 Overall Efficiency and Accuracy Improvement
Figure 10 (left) illustrates the efficiency improvements achieved by a cascading method over the largest model (M_3) in the K=3 setting. For example, in the case of the QQP dataset, the cascading system achieves an 88.93% computation improvement over M_3, i.e., it requires just 11.07% of the computation cost of model M_3 to attain equal accuracy. Figure 10 (right) illustrates the accuracy improvements achieved over M_3 in the K=3 setting. For example, on the CB dataset, the cascading system achieves a 2.18% accuracy improvement over M_3.

Figure 2: Accuracy-computation cost curves for cascading with the MaxProb (in blue) and Random baseline (in black) methods in the K=2 setting. Red points correspond to the accuracy-cost values of the individual models M_1 and M_2. Points of intersection of the red dashed lines drawn from M_2 with the blue curve correspond to the evaluation parameters described in Section 3. MaxProb outperforms the Random baseline as it achieves a considerably higher AUC.

Figure 3: Comparing the accuracy of the individual models M_1 and M_2 on the instances answered by each model when used in the cascade for the MNLI dataset in the K=2 setting.

Figure 4: Accuracy-computation cost curves for cascading with the MaxProb (in blue) and Random baseline (in black) methods in the K=3 setting. Accuracy-cost values of the individual models M_1, M_2, and M_3 are shown in red. Note that M_1 here is different from M_1 in Figure 2. MaxProb outperforms the Random baseline as it achieves a higher AUC.

Figure 5: Comparing the accuracy of the individual models M_1, M_2, and M_3 on the instances answered by each model when used in the cascade for the MNLI dataset.

Figure 6: Accuracy-cost curves for the K=2 setting with M_1 as BERT-medium and M_2 as BERT-large.

D Cascading with Three Models (K=3)

D.1 Contribution of M_1, M_2, and M_3 in the Cascade

D.1.1 Mini, Medium, and Base

Figure 7: Accuracy-cost curves for the K=2 setting with M_1 as BERT-mini and M_2 as BERT-large.

Figure 8: Contribution of M_1 and M_2 for the K=2 setting with M_1 as BERT-medium and M_2 as BERT-base.

Figure 9: Contribution of M_1, M_2, and M_3 in the K=3 setting.

Figure 10: Efficiency and accuracy improvements achieved by the cascading system (using the DTU method (3.2)) over the largest model M_3 in the K=3 setting.

Table 1: Inference cost (in 10^9 FLOPs) of BERT variants for different input text sequence lengths. We also specify the storage size of the models in this table.

Table 2: Comparing AUC values of different cascading methods in the K=2 setting. Random and Heuristic are the cascading baselines. MaxProb and DTU outperform both baselines.

Table 3: Comparing AUC values of different cascading methods in the K=3 setting. Random and Heuristic are the cascading baselines. DTU outperforms the other cascading methods on average.

Table 4: Statistics of the evaluation datasets considered in this work.