Consistent Accelerated Inference via Confident Adaptive Transformers

We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs – Confident Adaptive Transformers – in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.


Introduction
Large pre-trained language models have become the de facto standard approach for solving natural language processing tasks (Devlin et al., 2019; Liu et al., 2019). Despite their impressive performance, however, their often massive computational burden makes them costly to run (Schwartz et al., 2019; Sharir et al., 2020). Concerns about their efficiency have kindled a large body of research in the field (Sanh et al., 2020; Schwartz et al., 2020; Fan et al., 2020). For multilayered architectures such as the Transformer, a popular approach is adaptive early exiting (Schwartz et al., 2020; Xin et al., 2020a, inter alia). Early exiting takes advantage of the observation that task instances vary in complexity. In this setting, "early" classifiers are added on top of the simpler features of intermediate layers in the base model, and can trigger a prediction before the full model is executed. Naively deciding when to preempt computation, however, can result in unpredictable decreases in model accuracy.
Quantifying the uncertainty in a prediction in order to decide when additional computation is needed (or not) is critical to making predictions quickly without excessively sacrificing performance. In this paper, we present Confident Adaptive Transformers (CATs), a general method for increasing Transformer-based model efficiency while remaining confident in the quality of our predictions. Specifically, given a fixed, expensive l-layer model F(x), we create an amortized model G(x) that includes early classifiers {F_1, . . . , F_l}. We then make G provably consistent with the original F with arbitrarily high probability (e.g., 95% of the time). This process is illustrated in Figure 1.
Our approach builds on conformal prediction (CP), a model-agnostic and distribution-free framework for creating well-calibrated predictions (Vovk et al., 2005).

Figure 2: Confidence levels given by our meta model regarding the consistency of our prediction as computation progresses. (Ex.1) Claim: All airports in Guyana were closed for all international passenger flights until 1 May 2020. Evidence: Airports in Guyana are closed to all international passenger flights until 1 May 2020. (Ex.2) Claim: Deng Chao broke sales record for a romantic drama. Evidence: The film was a success and broke box office sales record for mainland-produced romance films. Ex.1, from the VitaminC fact verification dataset, is "easy", and is classified consistently by all early classifiers F_k (Supports). The meta confidence captures this, and increases with time. Ex.2 is harder, and the prediction changes (Refutes/NEI) as it propagates through the Transformer layers. Appropriately, the meta confidence is low. The exact exit layer of G is determined as a function of a user-specified tolerance ε; see Eq. (1).
Concretely, suppose we have been given n examples, X_i ∈ X, i = 1, . . . , n, as unlabeled calibration data, that have been drawn exchangeably from some underlying distribution P. Let X_{n+1} ∈ X be a new exchangeable test example for which we would like to make a prediction. The aim of our method is to construct G such that it agrees with F with distribution-free marginal coverage at a tolerance level ε ∈ (0, 1), i.e.,

P(G(X_{n+1}) = F(X_{n+1})) ≥ 1 − ε.   (1)

We consider G to be ε-consistent if the frequency of error, G(X_{n+1}) ≠ F(X_{n+1}), does not exceed ε. By design, this ensures that G preserves at least a (1 − ε)-fraction of F's original performance. Within these constraints, the remaining challenge is to make G relatively efficient (e.g., a consistent, but vacuous, model is simply the identity G ≡ F).
In order to support an efficient G, we need a reliable signal for inferring whether or not the current prediction is likely to be stable. Past work (e.g., Schwartz et al., 2020) relies on potentially poorly correlated metrics such as the early classifier's softmax response. We address this challenge by instead directly learning meta "consistency predictors" for each of the l − 1 early classifiers of our l-layer model, by leveraging patterns in past predictions. Figure 2 demonstrates the progression of meta confidence scores across layers when applied to "easy" versus "hard" instances from the VitaminC fact verification task (Schuster et al., 2021).
We pair the scores of our meta classifier for each layer with a stopping rule that is calibrated using a unique twist on standard conformal prediction. Traditionally, CP is used to construct prediction sets that cover the desired target (e.g., Y n+1 ) with high probability. We invert the CP problem to first infer the multi-label set of inconsistent layers, and then exit at the first layer that falls in its complement. We then demonstrate that this can be reduced to setting a simple (but well-calibrated) exit threshold for the meta classifier scores. Our resulting algorithm is (1) fast to compute in parallel to the main Transformer, (2) requires only unlabeled data, and (3) is statistically efficient in practice, in the sense that it finds low exit layers on average while still maintaining the required predictive consistency.
We validate our method on four diverse NLP tasks, covering both classification and regression, different label space sizes, and varying amounts of training data. We find that it constitutes a simple-yet-effective approach to confident adaptive prediction with minimal interventions and desirable theoretical guarantees. In short, we provide: (1) a novel theoretical extension of conformal prediction to accommodate adaptive prediction; (2) an effective meta consistency classifier for deriving a confident "early exiting" model; and (3) a demonstration of the utility of our framework on both classification and regression tasks, where we show significant efficiency improvements while guaranteeing high consistency.

Related Work
Adaptive computation. Reducing the computational cost of neural models has received intense interest. Adaptive approaches adjust the amount of computation per example to amortize the total inference cost (see Teerapittayanon et al., 2017; Graves, 2017; Huang et al., 2018; Kaya et al., 2019; Wang et al., 2018, inter alia). As discussed in §1, our method is inspired by the approach of Schwartz et al. (2020) and others (Liu et al., 2020; Geng et al., 2021; Zhou et al., 2020), which preempts computation if the softmax value of any early classifier is above a predefined threshold. Yet unlike our approach, their model is not guaranteed to be accurate. In concurrent work, Xin et al. (2021) propose a meta confidence classifier similar to ours. However, as in previous work, they do not address the calibration needed to guarantee consistency.

Confident prediction.
A large amount of research has been dedicated towards calibrating the model posterior, p_θ(ŷ_{n+1}|x_{n+1}), such that the accuracy, i.e., the frequency of y_{n+1} = ŷ_{n+1}, indeed matches the estimated probability (Niculescu-Mizil and Caruana, 2005; Gal and Ghahramani, 2016). In theory, these estimates could be leveraged to create confident early exits, e.g., similar to Schwartz et al. (2020). Ensuring calibrated probabilities of this form is hard, however, and existing methods often still suffer from miscalibration. Additionally, many methods exist for bounding the true error of a classifier (Langford, 2005; Park et al., 2021), but they do not give end-users opportunities to control it. More similar to our work, selective classification (Geifman and El-Yaniv, 2017) allows the model to abstain from answering when not confident, in order to maintain a target error rate only over answered inputs. Our work gives a different and statistically efficient technique applied to consistent prediction.
Conformal prediction. CP (Vovk et al., 2005) is typically formulated in terms of prediction sets C(X_{n+1}), where finite-sample, distribution-free guarantees can be given over the event that C contains Y_{n+1}. As we discuss in §4, internally our method follows a similar approach in which we try to conservatively identify the inadmissible set of all layers that are inconsistent (and exit at the first layer that falls in that set's complement). Most relevant to our work, Cauchois et al. (2021) present algorithms for conformal multi-label predictions. We leverage similar methods in our model, but formulate our solution in terms of the complement of a multi-label set of inconsistent predictions.

Early Exiting Transformers
In the following, we describe our dynamic early exiting model. We summarize early classification (following previous work) for convenience ( §3.1), and then present our novel meta consistency classifier ( §3.2). We focus on classification and regression tasks, given a model F(x) = y. We assume that F maps the input x ∈ X into a series of feature representations before making the prediction y ∈ Y.
Here, F is a multilayered Transformer (Vaswani et al., 2017) composed of l layers (although our method can be applied to any multilayer network). For all downstream tasks we follow standard practice and assume that the input contains a [CLS] token whose representation is used for prediction. For classification, we use a task-specific head on top of h_k^{[CLS]} ∈ R^d, the [CLS] representation after applying layer k. After each intermediate layer k < l, we train an early classification head that is similar to the head used in F, but reduce the dimensionality of the first projection to W_p^{(k)} ∈ R^{d_e×d} (this is purely for efficiency). The final F_l is unchanged from F. These extra (l−1)×(d_e×d + d_e×|Y|) parameters are quick to tune on top of a fixed F, and we can reuse F's training data as D_tune. The classifier F_k(x), computed from a softmax over a second projection of φ(W_p^{(k)} h_k^{[CLS]}), is then used after layer k to get an early prediction candidate. Early regression is handled similarly.
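To make the early-head design concrete, the following is a minimal PyTorch sketch of one possible early classification head of the kind described above; the module name, default sizes, and the use of a Tanh activation are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class EarlyClassifier(nn.Module):
    """One possible early prediction head attached after Transformer layer k.

    It projects the [CLS] hidden state down to a small dimension d_e (for
    efficiency) before predicting over the task label space, mirroring the
    description in Section 3.1. Names and default sizes are illustrative.
    """

    def __init__(self, d_model: int = 2048, d_e: int = 32, num_labels: int = 3):
        super().__init__()
        self.proj = nn.Linear(d_model, d_e)     # first projection W_p^(k)
        self.act = nn.Tanh()                    # phi(.)
        self.out = nn.Linear(d_e, num_labels)   # output projection over |Y|

    def forward(self, cls_hidden: torch.Tensor) -> torch.Tensor:
        # cls_hidden: [batch, d_model], the [CLS] representation after layer k
        return self.out(self.act(self.proj(cls_hidden)))  # logits over Y


if __name__ == "__main__":
    head = EarlyClassifier()
    logits = head(torch.randn(4, 2048))
    early_pred = logits.argmax(dim=-1)  # early prediction candidate from F_k
    print(early_pred.shape)             # torch.Size([4])
```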

Meta early exit classifier
To decide when to accept the current prediction and stop computation, we require some signal as to how likely it is that F_k(x) = F(x). Previous work relies on intrinsic measures (e.g., softmax response). Here, we present a meta classifier to explicitly estimate the consistency of an early predictor. Given fixed F_k and F, we train a small binary MLP, M_k(x) ∈ R, on another unlabeled (limited) sample of task in-domain data, D_meta. As input, we provide the current "early" hidden state φ(W_p^{(k)} h_k^{[CLS]}), in addition to several processed meta features, see Table 1. We then train M_k with a binary cross entropy objective, where we maximize the likelihood of predicting the consistency indicator 1{F_k(x) = F(x)}. Using the trained F_k and M_k, we define the full adaptive model G using the prediction rule

G(x) := F_K(x), where K := min({k < l : M_k(x) > τ_k} ∪ {l}),   (2)

where τ = (τ_1, . . . , τ_{l−1}) are confidence thresholds. The key challenge is to calibrate τ_k such that G guarantees ε-consistent performance per Eq. (1).

Table 1: Additional meta features used as input to the meta early exit classifier, M_k. Where specified, the probability p_k is taken from the model's early softmax.
- ŷ_k: the current prediction.
- history: the past k − 1 predictions, ŷ_{1:k−1} (for classification we give p_k(ŷ_k|x)).
- p_k^max: probability of the prediction, p_k(ŷ_k|x).
- p_k^diff: difference in probability of the top predictions, p_k(ŷ_k|x) − max_{y_k ≠ ŷ_k} p_k(y_k|x).
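As a sketch of how the meta consistency classifier and the exit rule of Eq. (2) fit together, the snippet below assumes the meta features of Table 1 have already been computed and concatenated into a single vector; the class and argument names, sizes, and hidden layer are illustrative choices, not the released implementation. During training, M_k would be fit with binary cross-entropy against the indicator 1{F_k(x) = F(x)} on D_meta.

```python
import torch
import torch.nn as nn

class MetaClassifier(nn.Module):
    """Small binary MLP M_k scoring how likely F_k(x) agrees with F(x).

    Input: the early hidden state phi(W_p^(k) h_k^[CLS]) concatenated with the
    meta features of Table 1 (current prediction, history, p_max, p_diff).
    All sizes are illustrative.
    """

    def __init__(self, d_e: int = 32, n_meta_feats: int = 4, d_hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_e + n_meta_feats, d_hidden),
            nn.Tanh(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, early_hidden: torch.Tensor, meta_feats: torch.Tensor) -> torch.Tensor:
        # Returns one consistency score per example (higher = more confident
        # that the early prediction already matches the full model).
        return self.mlp(torch.cat([early_hidden, meta_feats], dim=-1)).squeeze(-1)


def adaptive_predict(early_preds, meta_scores, full_pred, thresholds):
    """Prediction rule of Eq. (2): exit at the first layer k whose meta score
    exceeds its calibrated threshold tau_k; otherwise fall back to F_l."""
    for k, (pred, score, tau_k) in enumerate(zip(early_preds, meta_scores, thresholds), start=1):
        if score > tau_k:
            return pred, k                       # early exit at layer k
    return full_pred, len(early_preds) + 1       # no early exit: use the full model
```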

Warmup: development set calibration
A simple approach to setting τ is to optimize performance on a development set D_dev, subject to a constraint on the empirical inconsistency:

min_τ E_dev[exit(G(X; τ))]  s.t.  E_dev[1{G(X; τ) ≠ F(X)}] ≤ ε,   (3)

where exit(·) measures the exit layer, and E_dev is simply the average over D_dev. Using a standard error bound (Langford, 2005) over a separate split, D_cal, we can then derive the following guarantee:

Proposition 3.1. Let X_i, i = 1, . . . , n be an i.i.d. sample with s = Σ_{i=1}^n 1{G(X_i; τ) = F(X_i)}. Then, up to a confidence level δ, we have that

P(G(X_{n+1}; τ) = F(X_{n+1})) ≥ 1 − ε̃,   (4)

where ε̃ is the solution to Beta(s, n − s + 1) = δ, and Beta is the incomplete beta function.
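For reference, the bound behind Proposition 3.1 is straightforward to compute; below is a small sketch using scipy, assuming we have already counted how many calibration examples G agreed with F on (the function name and the example numbers are ours).

```python
from scipy.stats import beta

def consistency_lower_bound(n_consistent: int, n_total: int, delta: float = 0.01) -> float:
    """One-sided Clopper-Pearson lower bound on P(G(X) = F(X)).

    With s = n_consistent agreements out of n_total i.i.d. calibration examples,
    the true consistency rate is at least this value with confidence 1 - delta.
    """
    s, n = n_consistent, n_total
    if s == 0:
        return 0.0
    # Lower endpoint of the Clopper-Pearson interval: the delta-quantile of Beta(s, n - s + 1).
    return beta.ppf(delta, s, n - s + 1)

# Example: 960 consistent predictions out of 1000 calibration points.
print(consistency_lower_bound(960, 1000, delta=0.01))  # a bit below the empirical rate of 0.96
```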
A proof is given in Appendix A. Though in practice ε̃ might be close to ε for most well-behaved distributions, unfortunately Eq. (4) does not give a fully specifiable guarantee as per Eq. (1). Readjusting τ based on D_cal requires correcting for multiple testing in order to remain theoretically valid, which can quickly become statistically inefficient. In the next section, we provide a novel calibration approach that allows us to guarantee a target performance level with strong statistical efficiency.

Conformalized Early Exits
We now formulate the main contribution of this paper, which is a distribution-free and model-agnostic method based on CP for guaranteeing any performance bound an end-user chooses to specify. Our training ( §3), conformal calibration ( §4), and inference pipelines are summarized in Algorithm 1.

Conformal formulation
Let I(x) := {k : F_k(x) ≠ F(x), k < l} denote the set of early layers whose predictions are inconsistent with the full model. To maintain ε-consistency, we must avoid using any of the predictions specified by this set, F_i(x) where i ∈ I(x), more than an ε-fraction of the time for x ∈ X. In §4.2, we show how M_{1:l−1} can be paired with a conformal procedure to obtain calibrated thresholds τ = (τ_1, . . . , τ_{l−1}) such that we obtain a conservative prediction of I(x), denoted C_ε(x), where we ensure that I(x) ⊆ C_ε(x) with probability at least 1 − ε. Proposition 4.1 states our guarantee when τ is paired with G following Eq. (2).

Proposition 4.1. Assume that examples X_i, i = 1, . . . , n + 1 are exchangeable. For any ε ∈ (0, 1), let the index set C_ε (based on the first n examples) be the output of a conformal procedure satisfying

P(I(X_{n+1}) ⊆ C_ε(X_{n+1})) ≥ 1 − ε.   (6)

Define K := min{j : j ∈ C_ε^c(X_{n+1})}, the first exit layer selected by G following Eq. (2). Then

P(G(X_{n+1}) = F(X_{n+1})) = P(F_K(X_{n+1}) = F(X_{n+1})) ≥ 1 − ε.

Remark 4.2. Note that Eq. (6) is stricter than necessary. Fundamentally, we only require that P(K ∈ I^c(X_{n+1})) ≥ 1 − ε. Nevertheless, Eq. (6) is easier to calibrate, and leads to strong empirical results despite being theoretically conservative.

Remark 4.3. During inference we do not fully construct C_ε; it is only used to calibrate τ beforehand.

Conformal calibration
We now describe our conformal procedures for calibrating τ. Conformal prediction is based on hypothesis testing, where for a given input x and possible output y, a statistical test is performed to accept or reject the null hypothesis that the pairing (x, y) is correct. In our setting, we consider the null hypothesis that layer k is inconsistent, and we use the meta confidence score M_k(x) as the test statistic: it indicates how "surprised" we would be if layer k was in fact inconsistent with layer l for input x. Informally, a low level of surprise indicates that the current input "conforms" to past data. To rigorously quantify the degree of conformity via the threshold τ_k for predictor M_k, we use a held-out set of n unlabeled, exchangeable examples, D_cal.

Independent calibration
As a first approach, we construct C_ε^ind(x) by composing l − 1 separate tests for F_k(x) = F(x), each with significance α_k, where the α_k are corrected for multiple testing. Let v_k^{(1:n,∞)} denote the inflated empirical distribution of inconsistent layer scores,

v_k^{(1:n,∞)} := {M_k(X_i) : k ∈ I(X_i), i = 1, . . . , n} ∪ {∞}.

Inflating the empirical distribution is critical to our finite sample guarantee, see Appendix A. We then define τ_k^ind := Quantile(1 − α_k, v_k^{(1:n,∞)}), and predict the inconsistent index set at x ∈ X as

C_ε^ind(x) := {k : M_k(x) ≤ τ_k^ind}.

The following theorem states how to set each α_k such that the quantiles τ_k^ind yield a valid C_ε^ind.

Theorem 4.4. Let α_k = ω_k · ε, where ω_k is a weighted Bonferroni correction, i.e., Σ_{k=1}^{l−1} ω_k = 1. Then C_ε^ind(X_{n+1}) is a valid set that satisfies Eq. (6).
Remark 4.5. ω_{1:l−1} can be tuned on a development set D_dev as long as D_dev is distinct from D_cal.
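A minimal numpy sketch of this independent calibration procedure is given below, assuming we already have the meta scores M_k(X_i) on D_cal and indicators of which layers were truly inconsistent; the array layout, the uniform Bonferroni default, and the use of numpy's "higher" quantile interpolation as a stand-in for the conformal quantile are our own assumptions.

```python
import numpy as np

def calibrate_independent(meta_scores, inconsistent, epsilon, weights=None):
    """Per-layer conformal thresholds tau_k^ind with a Bonferroni correction.

    meta_scores:  array [n, l-1], M_k(X_i) for each calibration example and early layer.
    inconsistent: boolean array [n, l-1], True where F_k(X_i) != F(X_i).
    Layer k is predicted inconsistent (placed in C_ind) when M_k(x) <= tau_k^ind.
    """
    n, num_early = meta_scores.shape
    if weights is None:
        weights = np.full(num_early, 1.0 / num_early)  # uniform Bonferroni correction
    thresholds = np.empty(num_early)
    for k in range(num_early):
        alpha_k = weights[k] * epsilon
        # Inflated empirical distribution of scores at truly inconsistent layers.
        v_k = np.append(meta_scores[inconsistent[:, k], k], np.inf)
        thresholds[k] = np.quantile(v_k, 1.0 - alpha_k, method="higher")
    return thresholds

# Toy usage with random scores: 24-layer model (23 early layers), epsilon = 0.1.
rng = np.random.default_rng(0)
scores = rng.random((500, 23))
inc = rng.random((500, 23)) < 0.3
print(calibrate_independent(scores, inc, epsilon=0.1)[:3])
```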

Shared calibration
C_ε^ind has the advantage of calibrating each layer independently. As l grows, however, α_k will tend to 0 in order to retain validity (as specified by Theorem 4.4). As a result, C_ε^ind will lose statistical efficiency. Following a similar approach to Cauchois et al. (2021), we instead calibrate on the worst-case score across an example's inconsistent layers, M_max(x) := max{M_k(x) : k ∈ I(x)} (i.e., the worst case is where M_k(x) predicts a high consistency likelihood for layer k when layer k is, in fact, inconsistent). This worst-case statistic allows us to keep a constant significance level ε, even as l grows. Let m^{(1:n,∞)} denote the inflated empirical distribution,

m^{(1:n,∞)} := {M_max(X_i) : i = 1, . . . , n} ∪ {∞}.

We then define a single threshold shared across layers, τ^share := Quantile(1 − ε, m^{(1:n,∞)}), and predict the inconsistent index set at x ∈ X as

C_ε^share(x) := {k : M_k(x) ≤ τ^share}.

Theorem 4.6. For any number of layers l ∈ N^+, C_ε^share(X_{n+1}) is a valid set that satisfies Eq. (6).

Algorithm 1: Consistent accelerated inference. Definitions: F is a multilayered classifier trained on D_train. D_tune, D_meta, and D_scale are collections of in-domain unlabeled data points (in practice, we reuse D_train and divide it into 70/20/10%, respectively). D_cal has in-domain unlabeled examples not in D_train (in practice, we take a subset of the task's validation set). ε is the user-specified consistency tolerance.
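The shared calibration reduces to computing a single quantile; a minimal numpy sketch under the same assumptions as the previous one is shown below (how to treat calibration examples with no inconsistent early layers is one of several reasonable choices here, not necessarily the paper's).

```python
import numpy as np

def calibrate_shared(meta_scores, inconsistent, epsilon):
    """Single shared threshold tau_share (shared calibration).

    For every calibration example we take the worst case M_max(X_i): the highest
    meta score among its truly inconsistent layers, then use the inflated
    (1 - epsilon)-quantile of those values as one shared threshold.
    """
    masked = np.where(inconsistent, meta_scores, -np.inf)  # ignore consistent layers
    m_max = masked.max(axis=1)                             # M_max(X_i)
    m_max = m_max[np.isfinite(m_max)]                      # drop examples with no inconsistent layer (one simple choice)
    m_inflated = np.append(m_max, np.inf)                  # inflate with infinity
    return np.quantile(m_inflated, 1.0 - epsilon, method="higher")


def first_exit_layer(meta_scores_x, tau_share):
    """Exit at the first layer outside C_share(x), i.e., with M_k(x) > tau_share."""
    above = np.nonzero(meta_scores_x > tau_share)[0]
    return int(above[0]) + 1 if above.size else meta_scores_x.size + 1  # else: run the full model
```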

Experimental Setup
For our main results, we use an Albert-xlarge model (Lan et al., 2020) with 24 Transformer layers. Results using an Albert-base model and a RoBERTa-large model (Liu et al., 2019) are in Appendix C. See Appendix B for implementation details. We did not search across different values for the hyper-parameters of F or G as our approach is general and guarantees consistency for any F with any nonconformity measure (See Appendix C.2). Tuning the hyper-parameters could further improve the efficiency of G while preserving consistency.

Tasks
We evaluate our methods on three classification tasks with varying label space size |Y| and difficulty: IMDB (Maas et al., 2011) sentiment analysis on movie reviews, VitaminC (Schuster et al., 2021) fact verification with Wikipedia articles, and AG (Gulli, 2004;Zhang et al., 2015) news topic classification. We also evaluate on the STS-B (Cer et al., 2017) semantic textual similarity regression task where Y ∈ [0, 5] ⊂ R. Dataset statistics, along with the test set performance of our original F model (Albert-xlarge), are contained in Table 2.

Baselines
In addition to our main methods discussed in §4.2, we compare to several non-CP baselines. Note that the following methods are not guaranteed to give well-calibrated performance (as our CP ones are).
Static. We use the same number of layers for all inputs. We choose the exit layer as the first one that obtains the desired consistency on average on D cal .
Softmax threshold. Following Schwartz et al. (2020), we exit on the first layer where p_k^max ≥ 1 − ε, where p_k^max denotes the maximum softmax response of our early classifier. Softmax values are calibrated using temperature scaling (Guo et al., 2017) on another held-out (labeled) data split, D_scale.
Meta threshold. Even if perfectly calibrated, p_k^max from softmax thresholding does not measure the consistency likelihood P(G(X) = F(X) | X = x), but rather P(G(X) = Y | X = x). The two are equivalent if F is an oracle, but this breaks down when F is not. We also experiment with thresholding the confidence value of our meta classifier ( §3.2) in a similar way (i.e., exiting when it exceeds 1 − ε).

Evaluation
For each task, we use a proper training, validation, and test set. We use the training set to learn F and G. We perform model selection on the validation set, and report final numbers on the test set. For all methods, we report the marginalized results over 25 random trials, where in each trial we partition the data into 80% D_cal (x_{1:n}) and 20% D_test (x_{n+1}). In order to compare different methods across all tolerance levels, we plot each metric as a function of ε. Shaded regions show the 16-84th percentiles across trials. We report the following metrics:

Consistency. We measure the percent of inputs for which the prediction of the CAT model G is the same as that of the full Transformer on the test prediction, i.e., G(X_{n+1}) = F(X_{n+1}). For regression tasks, we count a prediction as consistent if it is within a small margin τ of the reference (we use τ = 0.5). As discussed in §1, if G is ε-consistent, we can also derive an average performance lower bound: it will be at least (1 − ε) × F's average performance.

Layers. We report the computational cost of the model as the average number of Transformer layers used. Our goal is to improve the efficiency (i.e., use fewer layers) while preserving ε-consistency. We choose this metric over absolute run-time to allow for implementation-invariant comparisons, but we provide a reference analysis next, to permit easy approximate conversions.
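To illustrate the evaluation protocol, here is a sketch of the trial loop, reusing the calibrate_shared and first_exit_layer helpers sketched earlier; the 80/20 splitting and the consistency check mirror the description above, but the function itself is an illustrative assumption, not the released evaluation code.

```python
import numpy as np

def evaluate_trials(meta_scores, inconsistent, epsilon, n_trials=25, cal_frac=0.8, seed=0):
    """Repeatedly split into 80% D_cal / 20% D_test, calibrate tau_share on the
    calibration split, and report consistency and average exit layer on the test split."""
    rng = np.random.default_rng(seed)
    n, num_early = meta_scores.shape
    results = []
    for _ in range(n_trials):
        perm = rng.permutation(n)
        n_cal = int(cal_frac * n)
        cal, test = perm[:n_cal], perm[n_cal:]
        tau = calibrate_shared(meta_scores[cal], inconsistent[cal], epsilon)
        exits = np.array([first_exit_layer(meta_scores[i], tau) for i in test])
        # A test prediction is consistent if the chosen exit layer is not in I(x)
        # (exiting at the final layer is consistent by definition).
        consistent = np.array([
            exits[j] > num_early or not inconsistent[i, exits[j] - 1]
            for j, i in enumerate(test)
        ])
        results.append((consistent.mean(), exits.mean()))
    consistency, layers = np.array(results).T
    return consistency.mean(), layers.mean()
```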

Absolute runtime analysis
Figure 3: While both our CP-based methods give valid consistencies (above diagonal), shared calibration generally results in earlier exits. This advantage is especially pronounced at smaller tolerance levels (right-hand side), where it significantly outperforms other approaches. Our meta-learned confidence measure M_k improves over using the softmax response as a drop-in replacement, especially for tasks with larger |Y|. Note that we care more about the right-hand side behavior (i.e., larger 1 − ε), as it corresponds to higher consistency.

The exact run-time of G depends on the efficiency of the hardware, software, and implementation used. Ideally, the early and meta classifiers can run in parallel with the following Transformer layer (layer k + 1). As long as they are faster to compute concurrently than a single layer, this will avoid incurring any additional time cost. An alternative naive synchronous implementation could lead to inefficiencies when using a small tolerance ε. We provide a reference timing for the IMDB task implemented with the Transformers (Wolf et al., 2020) library, PyTorch 1.8.1 (Paszke et al., 2019), and an A100-PCIE-40GB Nvidia GPU with CUDA 11.2. A full forward pass of an Albert-xlarge takes 22.32ms per input: 0.85ms × 24 for the Transformer layers and 1.95ms for the embedding layer and top classifier. Our early classifier takes 0.20ms and the meta classifier takes 0.11ms. Therefore, with a naive implementation, a CAT model G with an average exit layer of less than 17.6 with the meta classifier, or 19.5 without, will realize an overall reduction in wall-clock time relative to the full F.
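The break-even figures above follow from simple arithmetic over the reported per-component timings; the short sketch below reproduces that calculation (constant names are ours, the millisecond values are the ones quoted in the text).

```python
# Reference timings (ms) for Albert-xlarge quoted in the analysis above.
FULL_FORWARD = 22.32        # full 24-layer forward pass
PER_LAYER = 0.85            # one Transformer layer
EMBED_AND_HEAD = 1.95       # embedding layer + top classifier
EARLY_HEAD = 0.20           # early classifier per layer (naive synchronous impl.)
META_HEAD = 0.11            # meta classifier per layer (naive synchronous impl.)

def naive_time(avg_exit_layer: float, use_meta: bool = True) -> float:
    """Amortized per-input time of G under the naive synchronous implementation."""
    overhead = EARLY_HEAD + (META_HEAD if use_meta else 0.0)
    return EMBED_AND_HEAD + avg_exit_layer * (PER_LAYER + overhead)

def break_even_layer(use_meta: bool = True) -> float:
    """Average exit layer below which G is faster than running F in full."""
    overhead = EARLY_HEAD + (META_HEAD if use_meta else 0.0)
    return (FULL_FORWARD - EMBED_AND_HEAD) / (PER_LAYER + overhead)

print(round(break_even_layer(use_meta=True), 1))   # ~17.6
print(round(break_even_layer(use_meta=False), 1))  # ~19.4, close to the 19.5 quoted above
```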
We report example speedup times with the naive implementation in §6.3, as well as an implementation-invariant multiply-accumulate operation (MACs) reduction measure. The added computational effort per layer of the early predictor and meta-classifier is marginal (only 66,304 and 1,920 MACs, respectively). In comparison, Albert-xlarge with an input length of 256 has ∼3 × 10^11 MACs.

Experimental Results
We present our main results. We experiment with both our meta classifier M_k confidence score (Meta, §3.2) and, for classification tasks, the early classifier's softmax response, p_k^max (SM), as a drop-in replacement for M_k (at no additional computational cost). Appendix C reports results with other drop-in M_k replacements, in addition to results using our naive development set calibration approach ( §3.3). Appendix D provides qualitative examples. Figure 3 summarizes the average consistency and number of layers used by G as a function of ε, while Table 3 presents results for specific ε on task test sets. Independent calibration proves to be quite conservative due to the loss of statistical power from the loose union bound of the Bonferroni correction for large l (here l = 24). At some levels of ε, non-CP baselines perform competitively; however, they lack formal guarantees. Overall, for the most critical tolerance levels (small ε, right-hand side of the plots), our shared method leads to significant efficiency gains while still maintaining the desired level of consistency (above the diagonal).

Classification results
The effectiveness of our meta predictor, M_k, is most pronounced for tasks with |Y| > 2, where the drop-in softmax score (SM) becomes less indicative of consistency. Both SM and Meta are relatively well-calibrated for IMDB and VitaminC, which makes the threshold-based exit rule a competitive baseline. Still, our Shared/Meta method provides both reliable and significant gains.
The computational advantage of our CAT model is dependent on the average difficulty of the task and the implementation. As Table 3 shows, allowing up to an ε of 10% inconsistency, for two of the tasks we cut down the average Transformer layer to only 9 out of 24 using our Shared/Meta model. This leads to an approximate speedup of 1.8× with a synchronous implementation and of 2.7× with a concurrent one, compared to running the full model. Moreover, Figure 5 illustrates the user's control over available computational resources via modulating ε. Decreasing ε increases the confidence level required before committing to the early classifier's prediction (thereby increasing the average number of required layers), and vice-versa.

Table 3: Classification results (test) for specific tolerance levels. We report the accuracy lower bound guaranteed by our CP methods in parentheses. Shared/Meta is reliably the most efficient method (and is ε-consistent). Greyed rows reflect approaches without guarantees; our CAT approaches with guarantees are presented below them.

Regression results

Table 4 and Figure 4 present results for our regression task, where we see similar trends. Here, an attractive advantage of our meta confidence predictor is its generalizability to multiple task output types. Notice that the event space of 1{G(X) = F(X)} is always {0, 1}, regardless of the original Y. This allows it to be easily adapted to tasks beyond classification, such as regression, where traditional softmax-based confidence measures (as used in, e.g., Schwartz et al. (2020)) are absent.

Example efficiency gains
Following the analysis in §5.4, we compute the amortized inference time with a naive implementation and report its percentage out of the full model. As Table 5 shows, our Shared calibration is the most efficient method on all four tasks. For tasks with many easy inputs (IMDB and AG News), our Shared/Meta method can save 45%-49% of the inference time when 1 − ε = 0.90. Unsurprisingly, the absolute speedup is less significant for harder tasks, but increases with higher tolerance levels. On VitaminC, even though the Meta measure allows exiting on earlier layers, its additional meta classifiers result in slightly slower inference on average at this tolerance level, compared to our Shared/SM. With a more efficient concurrent implementation, the Meta measure will be favorable.

Table 5: Example efficiency gains (see Table C.2 for 0.95). We compute the amortized time with the naive synchronous implementation ( §5.4). A more efficient implementation can further reduce the time of G. The MACs reduction measure is implementation agnostic and expresses the ratio of computational effort saved by G. Our CAT models (non-greyed lines) not only guarantee 1 − ε consistency with F, but are also significantly more efficient in practice when using Shared calibration.
We also compute the MACs reduction metric, which is independent of the specific implementation or hardware and shows the number of multiply-accumulate operations of the full model compared to our CAT model. As demonstrated in Table 5, our Shared/Meta method is most effective in reducing the computational effort across all tasks for the two examined tolerance levels.

Conclusion
The ability to make predictions quickly without excessively degrading performance is critical to production-level machine learning systems. In fact, being capable of quantifying the uncertainty in a prediction and deciding when additional computation is needed (or not) is a key challenge for any intelligent system (e.g., see the System 1 vs. System 2 dichotomy explored in Kahneman (2011)).
In this work, we addressed the crucial challenge of deciding when to sufficiently trust an early prediction of Transformer-based models by learning from their past predictions. Our Confident Adaptive Transformers (CATs) framework leverages meta predictors to accurately assess whether or not the prediction of a simple, early classifier trained on an intermediate Transformer representation is likely to already be consistent with that of the full model F(X) (i.e., after all l layers of F are computed). Importantly, we develop a new conformal prediction approach for calibrating the confidence of the meta classifier that is (1) simple to implement, (2) fast to compute alongside the Transformer, (3) requires only unlabeled data, and (4) provides statistically efficient marginal guarantees on the event that the prediction of the faster, amortized CAT model is consistent with that of the full F. Our results on multiple tasks demonstrate the generality of our approach, and its effectiveness in consistently improving computational efficiency, all while maintaining a reliable margin of error.
A.1 Proof of Proposition 3.1

Proof. This result is based on the Clopper-Pearson confidence interval for Binomial random variables (Clopper and Pearson, 1934). As the binary events 1{G(X_i; τ) = F(X_i)} are i.i.d., the sum s is Binomial. Directly applying a one-sided Clopper-Pearson lower bound on the true success rate, P(G(X_i; τ) = F(X_i)), gives the result.

A.2 Proof of Proposition 4.1
Proof. We prove by simple calculation using the property assumed in Eq. (6). If I(X_{n+1}) ⊆ C_ε(X_{n+1}), then C_ε^c(X_{n+1}) ⊆ I^c(X_{n+1}), so the selected exit K := min{j : j ∈ C_ε^c(X_{n+1})} is a consistent layer, i.e., F_K(X_{n+1}) = F(X_{n+1}). Therefore, P(G(X_{n+1}) = F(X_{n+1})) ≥ P(I(X_{n+1}) ⊆ C_ε(X_{n+1})) ≥ 1 − ε.

A.3 Proof of Theorem 4.4
Proof. For a given k, let V_k^{(i)} := M_k(X_i) denote the random meta confidence values used for calibration, and V_k^{(n+1)} := M_k(X_{n+1}) the random test point. For all k, M_k is trained and evaluated on separate data (D_meta vs. D_cal ∪ D_test), preserving exchangeability. Therefore, as X_{1:n+1} are exchangeable, the scores V_k^{(1:n+1)} are exchangeable as well. If k ∈ I(X_{n+1}), then k is included in C_ε^ind(X_{n+1}) whenever V_k^{(n+1)} ≤ τ_k^ind = Quantile(1 − α_k, v_k^{(1:n,∞)}). For a given k, this happens with probability at least 1 − α_k by Lemma A.1. Taken over all k ∈ I(X_{n+1}), where |I(X_{n+1})| is at most l − 1 (i.e., all early layers are inconsistent), we have

P(I(X_{n+1}) ⊆ C_ε^ind(X_{n+1})) ≥ 1 − Σ_{k=1}^{l−1} α_k ≥ 1 − ε.

The last inequality is given by the Bonferroni constraint, i.e., α_k = ω_k · ε, where Σ_{i=1}^{l−1} ω_i = 1.

A.4 Proof of Theorem 4.6

Proof. By the same argument as Theorem 4.4, the meta scores M_k(X_i) are exchangeable. Since M_max operates symmetrically across all X_i, the values M^{(i)} := M_max(X_i) are also exchangeable.

Let M^{(n+1)} denote the maximum meta score across inconsistent layers for the new test point. By Lemma A.1, this falls below Quantile(1 − ε, M^{(1:n)} ∪ {∞}) with probability at least 1 − ε. Since M^{(n+1)} is the maximum such meta score, this entails that the meta scores of all inconsistent layers k ∈ I(X_{n+1}) for X_{n+1} will be below Quantile(1 − ε, M^{(1:n)} ∪ {∞}) if M^{(n+1)} is, and thereby be included in C_ε^share(X_{n+1}). This gives the bound in Eq. (6).
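The inflated-quantile step that both proofs rely on (via Lemma A.1, the standard conformal quantile lemma) can be sanity-checked numerically; the Monte Carlo sketch below draws i.i.d. (hence exchangeable) scores and verifies that the empirical coverage of the inflated (1 − ε)-quantile is at least 1 − ε. It is a generic check of that lemma, not a reproduction of the paper's experiments.

```python
import numpy as np

def coverage_of_inflated_quantile(epsilon=0.1, n_cal=200, n_trials=20000, seed=0):
    """Empirically check P(V_{n+1} <= Quantile(1 - eps, V_{1:n} U {inf})) >= 1 - eps
    for i.i.d. (hence exchangeable) scores."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        scores = rng.standard_normal(n_cal + 1)
        cal, test = scores[:-1], scores[-1]
        threshold = np.quantile(np.append(cal, np.inf), 1.0 - epsilon, method="higher")
        hits += test <= threshold
    return hits / n_trials

print(coverage_of_inflated_quantile())  # should be >= 0.9 (typically slightly above)
```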

B Implementation Details
We implement our early exit Transformers ( §3) on top of the Transformers library (Wolf et al., 2020); as discussed in §3, our methods can also be applied to any multilayered model such as BERT (Devlin et al., 2019), GPT (Brown et al., 2020), ResNet (He et al., 2015), and others. We set d_e to 32 in our experiments. For each task we fix a pre-trained F and train the early and meta classifiers. We reuse the same training data that was used for F and divide it into 70/10/20% portions for D_tune, D_scale and D_meta, respectively. For classification tasks, we add the temperature scaling step after the early training to improve the calibration of the softmax. We run the scaling for 100 steps on D_scale using an Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10^{-3}. For the early and meta training we use the same optimizer as for F.
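A minimal PyTorch sketch of the temperature scaling step described above is shown below; it fits a single temperature on held-out logits with Adam (lr = 10^-3, 100 steps) as stated, while the function name and the log-parameterization of the temperature are our own illustrative choices.

```python
import torch
import torch.nn as nn

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor, steps: int = 100) -> float:
    """Fit a single temperature T that minimizes the NLL of softmax(logits / T).

    logits: [n, |Y|] early-classifier logits on the held-out D_scale split.
    labels: [n] gold labels. Optimized with Adam (lr=1e-3) for 100 steps.
    """
    logits = logits.detach()                      # treat the logits as fixed inputs
    log_t = torch.zeros(1, requires_grad=True)    # T = exp(log_t), kept positive
    optimizer = torch.optim.Adam([log_t], lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return float(log_t.exp())

# Usage: divide logits by the fitted temperature before taking the softmax.
# T = fit_temperature(dev_logits, dev_labels)
# calibrated_probs = torch.softmax(test_logits / T, dim=-1)
```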
We fix F rather than train it jointly with the new components of G to avoid any reduction in F's performance (Xin et al., 2020b). This also makes our method simple to train on top of any existing Transformer without having to retrain the whole model, which could be very costly. Training all parameters of G jointly can lead to more efficient inference, as the early representations will be better suited for classification (Schwartz et al., 2020; Geng et al., 2021), but potentially at the cost of reducing the accuracy of F_l. In the case of joint training, our CATs will provide consistency guarantees with respect to the jointly-trained F_l.
We implement the conformal calibration process in Python and perform retrospective analysis with different random splits of D_cal and D_test. For Theorem 4.4, we simply use the uniform Bonferroni correction, setting ω_k = 1/(l−1) for all k. For the naive development set calibration, we use a shared threshold across all layers in order to reduce the examined solution space in Equation 3.

C Additional Results
In this section, we provide complementary results for the experiments in the main paper. All results, except for sections C.4 and C.5, are with an Albert-xlarge model as F, similar to the main paper. However, we note that the results in these tables are based on the development sets, while the tables in the main paper report the test set results.

C.1 Naive development set calibration
For completeness, we evaluate the simple, but naive, calibration method described in §3.3. Recall that in this approach we first tune τ on a development set, and then bound the resulting G's accuracy using another held-out calibration split. The bound we get is static; we are not able to guarantee that it will satisfy our performance constraint in Eq. (1). Table C.1 gives results for our models when using either the Meta or SM confidence measures (which we threshold with τ). We use half of D_cal to find the minimal threshold that provides ε-consistency. Then, we evaluate the threshold on the second half of D_cal to get the empirical error. We compute the test set bound on this error with a confidence of δ = 10^{-2}. As expected, the lower bound we compute is often significantly below 1 − ε, as it reflects the uncertainty that our measured consistency is accurate. Often the measured empirical consistency is also slightly below 1 − ε. At a high level, the overall consistency vs. efficiency tradeoff is otherwise broadly similar to the one obtained by the Shared CP calibration.

C.2 Nonconformity measure comparison
The test statistic used for a conformal prediction is typically called a nonconformity measure (i.e., in our work this is M_k(x)). We experiment with different nonconformity measures as drop-in replacements for M_k(x), and report the results in Table C.2. The conformal calibration guarantees validity with any measure, even a random one, as long as it retains exchangeability. Good measures are ones that are statistically efficient, and will minimize the number of layers required for prediction at the required confidence level. This is a result of smaller C_ε sets that tightly cover the inconsistent layers (and hence are more judicious with the complement, C_ε^c). To be consistent with previous work where softmax metrics are used (such as Schwartz et al., 2020), we use p_k^max as our non-Meta baseline in the main paper. In some settings, however, p_k^diff performs slightly better.

Table C.1: Results of the naive development set calibration ( §3.3). This method tunes the early exit thresholds to get efficient ε-consistent predictions on a development set, but does not guarantee that prediction will be ε-consistent on new data. "Consist." measures the empirical consistency on a test set, from which we compute a guaranteed lower bound ("Bound") to 99% confidence. The bound is significantly lower than our target 1 − ε, and the measured consistency in our experiments also falls slightly below 1 − ε in some cases.

Table C.2: Nonconformity measure comparison. In addition to the measures from Table 1, D_KL(p_{k−1}||p_k) is the Kullback-Leibler divergence between the previous layer's softmax outputs and the current layer's, and H(p_k) is the entropy of the softmax outputs. Our CP-based Shared method provides the guaranteed consistency with any measure, even a random one. The benefit, however, of using a better measure is in confidently exiting earlier. Our Meta measure allows using the fewest Transformer layers while meeting the consistency requirement with enough confidence.

Reducing ε requires greater confidence before exiting, resulting in later exits on average. We provide example inputs with their respective exit layer in Appendix D. Again, we see the efficacy of our Shared conformal calibration and the Meta nonconformity scores. For example, the AG News CAT Shared/Meta model can preserve 95% consistency while using less than 5 Transformer layers on average. One main difference between RoBERTa and Albert is that Albert shares the same parameters across all layers, essentially applying the same function recursively, whereas RoBERTa learns different parameters per layer. Yet, our method is agnostic to such differences and, as observed in the plots, results in similar trends. The value of our Meta classifier compared to the softmax response is even greater with the RoBERTa model.

D Qualitative Examples

We show example inputs along with the exit layer our Albert-xlarge CAT with ε = 0.1 required. These examples suggest that "easier" inputs (e.g., containing cue phrases or having large overlaps in sentence-pair tasks) might require fewer layers. In contrast, more complicated inputs (e.g., using less common language or requiring numerical analysis) can lead to additional computational effort until the desired confidence is obtained.

Exit layer 1 (Pos): "Without question, film is a powerful medium, more so now than ever before, due to the accessibility of DVD/video, which gives the filmmaker the added assurance that his story or message is going to be seen by possibly millions of people. [...]"
Exit layer 4 (Neg): "This movie was obscenely obvious and predictable. The scenes were poorly written and acted even worse."