Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates

The behavior of deep neural networks can be inconsistent between versions. Regressions introduced by a model update are a common cause of concern, often outweighing the gains in accuracy or efficiency. This work focuses on quantifying, reducing, and analyzing regression errors in NLP model updates. Using the negative flip rate as a regression measure, we show that regression is prevalent across tasks in the GLUE benchmark. We formulate regression-free model update as a constrained optimization problem, and relax it into a form that can be approximately optimized through knowledge distillation. We also empirically analyze how model ensembles reduce regression. Finally, we conduct CheckList behavioral testing to understand the distribution of regressions across linguistic phenomena and the efficacy of the ensemble and distillation methods.


Introduction
Regression-free model update is a desirable system property that guarantees interoperability of a new system with its legacy version, also known as backward compatibility. A regression occurs when the newly updated system breaks functionality that previously worked as intended.
As advances in deep learning spark industrial applications in AI areas such as natural language processing, the long-term maintenance of such systems is becoming ever more challenging. While models with complex neural architectures and huge parameter spaces continue to reach higher accuracy, the lack of interpretability and functional decomposability in these models makes it infeasible to apply traditional software regression testing methods such as unit tests. As a result, validating and mitigating regressions during a model update is often a long and painful engineering process, which can overshadow the benefits of a new model.
The model regression issue in deep learning was first examined by Shen et al. (2020), who inspect compatible representation learning for image retrieval. Yan et al. (2020) proposed positive-congruent training (PCT) for image classification, which minimizes prediction errors and model regression at the same time. To the best of our knowledge, model update regression has not been studied on NLP tasks.
Following Yan et al. (2020), in this work we measure model update regression in NLP by negative flips. Figure 1 illustrates the prediction flip scenarios: negative flips fall in the upper-right quadrant, where the old model makes correct predictions and the new model's predictions are wrong. As we show in Section 2, regressions are prevalent in NLU model updates, even with the slightest changes in the new model's training process.
To develop a model with minimal regression, we first formulate the learning task as a constrained optimization problem by taking the regression-free conditions as constraints. We apply Lagrangian relaxation to bring the regression-free constraint into the optimization objective as an additional penalty loss, and provide an approximate solution via knowledge distillation. Yan et al. (2020) also observed that model ensembles can reduce negative flips without explicit input from the old model. We evaluate both distillation- and ensemble-based methods on a diverse set of NLP tasks.
To further understand how the above methods contribute to reducing regression, we use CHECKLIST (Ribeiro et al., 2020) to quantify linguistic behavioral changes before and after applying the proposed methods. We find that regressions are prevalent across NLP tasks and that their distribution correlates with different linguistic phenomena.
Our main contributions are as follows:
• We provide empirical evidence that model update regression occurs across text classification tasks in NLP;
• We formulate regression-free model update as a constrained optimization problem, and reduce it to a relaxed form that can be approximately optimized through knowledge distillation;
• We explore the model ensemble as another method to reduce regression and analyze its efficacy;
• We analyze the source of regressions in NLP tasks through linguistic behavioral testing, comparing the reductions achieved by the distillation and ensemble methods.

Measuring Regression in NLP Model Update
In this section, we first formulate the measurement of model update regression on classification tasks. We then benchmark GLUE tasks (Wang et al., 2018) and show that regression is prevalent when updating models in NLP.

Regression Measurement on Classification Tasks
Similar to software regression testing, we need to collect a group of test cases to measure regression. We start from a regression set

D_reg = {(x_i, y_i)}, y_i ∈ {l_1, ..., l_C},

where l_i is the i-th label and C is the number of classes. In practice, we can use the development set or compile a collection of critical use cases as D_reg.
In a classification task, given an input x_i, a neural network model f, parameterized by φ, approximates the posterior probability distribution p(y_i | x_i) over all possible labels:

f_φ(x_i) = (p_φ(y = l_1 | x_i), ..., p_φ(y = l_C | x_i)).

For brevity, we denote the final prediction of a model as f_φ(x) = argmax_{l_j} p_φ(l_j | x).
The regression R_NF between two models f_φ_old and f_φ_new on D_reg can be defined as the fraction of negative flip cases:

R_NF = |{x_i ∈ D_reg : f_φ_new(x_i) ≠ y_i and f_φ_old(x_i) = y_i}| / |D_reg|.

We use the negative flip rate R_NF as our regression measure for classification tasks. A lower R_NF for a new model means better compatibility with the old model.
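As a concrete illustration, the negative flip rate can be computed directly from aligned prediction lists. This is a minimal sketch; the function name and list-based interface are our own, not part of the paper.

```python
def negative_flip_rate(y_true, old_preds, new_preds):
    """R_NF: fraction of examples the old model classifies correctly
    but the new model gets wrong (negative flips)."""
    assert len(y_true) == len(old_preds) == len(new_preds)
    flips = sum(
        1 for y, o, n in zip(y_true, old_preds, new_preds)
        if o == y and n != y
    )
    return flips / len(y_true)

# On four examples, the new model flips two previously correct predictions:
print(negative_flip_rate([1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 0, 0]))  # 0.5
```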

Benchmark Severity of Regression
The success of Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019) has made pre-training then fine-tuning a standard paradigm in NLP systems. When updating these systems, differences can come from various aspects:
• Changes in the fine-tuning hyperparameters (e.g. random seed, learning rate schedule, number of epochs);
• Changes in model size or architecture (e.g. from BERT_base to BERT_large);
• Changes in the pre-training procedure or objective (e.g. BERT to RoBERTa (Liu et al., 2020), to BERT whole-word-masking, or to ELECTRA (Clark et al., 2020));
• Changes in the pre-trained model architecture (e.g. BERT to ALBERT (Lan et al., 2020)).
While accuracy or efficiency improvements are strong motivations for these model updates, they can also introduce behavioral incongruence with the previous model. To benchmark the severity of regression, we apply a general setup: fine-tune various pre-trained language models (LMs) on GLUE and calculate R_NF when updating from BERT_base to other LMs. We use the dev sets as D_reg. Results in Table 1 show that:
1. Regression occurs even in the homogeneous update BERT_base → BERT_base: altering only the initialization random seed can lead to up to 3.56% negative flips.
2. Negative flip rates are often much higher than the accuracy gains. When updating to BERT_large on QQP, R_NF is about 8x the accuracy gain. This implies that reducing the error rate alone does not ensure a decrease in regression.
3. Updates to the pre-training objective or architecture often lead to higher regression than changes in model size or random seed. Regression is higher when updating to ALBERT than when updating to the larger BERT_large or to a different random seed. This implies that systematic regression can be introduced when the backbone models differ.

Reducing Regression in Model Update
In this section, we first formulate regression-free model update as a constrained optimization problem, then reduce it to a joint optimization objective combining the training loss on the original task and a distillation loss with respect to the old model's behavior. Unlike typical optimization in neural model training, where we minimize a loss function on a training set, regression-free model update requires the model to learn the target task while also complying with conditions posed by the old model. (Full results on GLUE can be found in Appendix A.)

We can cast regression-free model update as a constrained optimization problem by taking the classification loss as the optimization objective and the regression-free conditions as constraints:

min_{φ_new} Σ_{(x,y) ∈ D_train} L(f_φ_new(x), y)
s.t. f_φ_new(x) = y for every (x, y) ∈ D_reg with f_φ_old(x) = y, (1)

where D_train and D_reg denote the training and regression sets, respectively. The constraint in Equation 1 asks for zero regression on D_reg. It would be difficult to ensure this constraint is satisfied throughout model training. We instead relax the hard constraint into a soft inequality that allows the regression measure to stay below a constant C:

R_NF(φ_new; φ_old, D_reg) ≤ C.

Training a model directly with the regression-free constraint remains difficult in that signals from old predictions are sparse and R_NF is non-differentiable. Here, we propose two proxies of R_NF that measure regression in a continuous space.
Proxy from Prediction Probabilities. We use the KL divergence between the predicted probability distributions of the two models as one soft regression measure:

R_KL-div(x) = KL( f_φ_old(x) || f_φ_new(x) ).

Proxy from Deep Representations. We can also use the l2 distance between the models' sentence representations, e.g. the [CLS] embedding in BERT, as another soft regression measure:

R_l2(x) = || h_φ_old(x) - h_φ_new(x) ||_2,

where h_φ(x) denotes the model's sentence representation. A linear projection is used to align the representations if they initially lie in different spaces.
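The representation proxy can be sketched in plain Python as follows. The function name and the explicit projection-matrix argument are illustrative assumptions; in practice both would operate on framework tensors.

```python
import math

def l2_rep_distance(h_old, h_new, proj=None):
    """R_l2 proxy: Euclidean distance between the old and new models'
    sentence representations. `proj` is an optional len(h_old) x len(h_new)
    matrix that maps the new model's representation space into the old
    model's space when their dimensions differ (hypothetical helper)."""
    if proj is not None:
        h_new = [sum(w * x for w, x in zip(row, h_new)) for row in proj]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h_old, h_new)))

# Orthogonal unit vectors are sqrt(2) apart:
print(l2_rep_distance([1.0, 0.0], [0.0, 1.0]))
```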
Reduce to Knowledge Distillation. Finally, we apply Lagrangian relaxation to bring the regression-free constraint into the optimization objective as an additional penalty loss:

min_{φ_new} Σ_{(x,y) ∈ D_train} L(f_φ_new(x), y) + α Σ_{x ∈ D_reg} R_soft(x; φ_old, φ_new),

where α is a positive penalty scaling parameter and R_soft can be chosen from R_KL-div or R_l2. The above optimization problem can then be cast as joint learning of the original target task and knowledge distillation from the old model. The distillation loss acts as a surrogate for the model update regression measure, and minimizing it during joint learning approximately minimizes the overall model update regression.
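For a single example, the relaxed objective combines cross-entropy on the target task with the KL-divergence penalty. The sketch below works on plain probability lists; the function names are ours, and a real implementation would use framework losses over logits.

```python
import math

def kl_div(p_old, p_new):
    """KL(p_old || p_new) over class probability distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0)

def joint_loss(p_new, y, p_old, alpha=1.0):
    """Per-example relaxed objective: cross-entropy on the gold label y
    plus the soft regression penalty R_KL-div scaled by alpha."""
    ce = -math.log(p_new[y])
    return ce + alpha * kl_div(p_old, p_new)

# When old and new models agree exactly, only the CE term remains:
print(joint_loss([0.5, 0.5], 0, [0.5, 0.5], alpha=2.0))  # ln(2) ~= 0.693
```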

Implementation Details
Since we usually update models from elementary ones to improved ones, in the experiments we take the original BERT_base (12-layer, 768-hidden, 12-heads, 110M parameters) (Devlin et al., 2019) as the old model's backbone and update it either to a homogeneous model, e.g. BERT_base with a different fine-tuning random seed or hyperparameters, or to a heterogeneous model with improvements such as BERT_large (24-layer, 1024-hidden, 16-heads, 340M parameters). We fine-tune the pre-trained LMs without any constraint as our baselines.
We run all experiments on Tesla V100 GPUs. Cross-entropy is used for fine-tuning on target tasks with batch size 16 for 4 to 6 epochs. The learning rate is searched among 2e-5, 3e-5, and 5e-5. During joint training of classification and knowledge distillation, we take the fine-tuned old model as the teacher and distill with batch size 16 for 6 to 8 epochs. We set D_reg = D_train when training models with the constraint and use D_reg = D_dev for reporting results. To encourage constraint satisfaction and reduce regression, we include the distillation penalty in the loss only on examples where the current model makes negative flips.
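The last detail, gating the penalty on negative-flip examples, can be sketched as follows. The batch format (tuples of label, old prediction, new prediction, and the example's precomputed R_soft value) is a simplification of our own for illustration.

```python
def masked_distill_penalty(batch, alpha=1.0):
    """Sum the distillation penalty only over examples where the current
    model negatively flips: old prediction correct, new prediction wrong.
    Each batch item is (label, old_pred, new_pred, r_soft)."""
    total = 0.0
    for y, old_pred, new_pred, r_soft in batch:
        if old_pred == y and new_pred != y:
            total += alpha * r_soft
    return total

# Only the first example is a negative flip, so only its penalty counts:
batch = [(1, 1, 0, 2.0), (0, 1, 0, 5.0), (1, 1, 1, 3.0)]
print(masked_distill_penalty(batch))  # 2.0
```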

Ensemble
Yan et al. (2020) reported an intriguing finding on image classification tasks: model ensembles can reduce model update regression without explicit regularization from the old model. They attributed this to the reduced variance of ensemble predictions, which makes the ensemble less prone to overfitting and indirectly reduces regression. Here we include the model ensemble as an alternative approach to reducing regression, with further analysis of how ensembling reduces regression in Section 5.1.

Table 2 shows the efficacy of the distillation and ensemble methods in reducing model update regression on NLP classification tasks. On average, the distillation method reduces R_NF by 30.6% and 36.3%, while the ensemble method reduces R_NF by 55.9% and 20.6%, when updating to BERT_base and to BERT_large, respectively. Both methods significantly bring down negative flips across GLUE tasks compared with the baselines. The ensemble works better when the old and new models share the same underlying pre-trained LM: in the update BERT_base → BERT_base, the ensemble outperforms distillation in reducing regression. On the other hand, distillation appears more effective under the heterogeneous model update setting: in the update BERT_base → BERT_large, distillation reduces more regression, with especially large reductions on small datasets such as CoLA and SST-2. We hypothesize that this is because the ensemble focuses on reducing the variance of model predictions, while distillation explicitly aligns the old and new models in either probability distribution or representation space, which matters more when the new model is very different from the old one.
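A probability-averaging ensemble, the variant analyzed here, can be sketched in a few lines. The list-of-lists interface is an illustrative simplification.

```python
def ensemble_predict(prob_list):
    """Average class probabilities from several independently trained
    models and predict the argmax. Averaging reduces the variance of
    predictions relative to any single constituent model."""
    n = len(prob_list)
    n_classes = len(prob_list[0])
    avg = [sum(p[c] for p in prob_list) / n for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Two of three models lean toward class 0; the average picks class 0:
print(ensemble_predict([[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]]))  # 0
```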

Variants in Distillation Objective
As introduced in Section 3, several variants of the distillation loss can be used to constrain new model training on the old model. We explore and benchmark the following variants on the MRPC task:
• Distillation - R_KL-div, Logits: calculates the distillation loss as the KL divergence between the two Bernoulli distributions defined by the old and new models' prediction probabilities;
• Distillation - R_l2, Final [CLS]: takes the [CLS] token embedding from the final layer as the sentence representation and calculates the distillation loss as the Euclidean distance between the two vectors;
• Distillation - R_l2, All [CLS]: also calculates the Euclidean distance between the old and new sentence representation vectors, but with concatenated [CLS] token embeddings from all layers instead of only the final layer.
Pre-trained models can have different numbers of layers. For BERT_base → BERT_large in the All [CLS] setup, we align representations from BERT_large's even layers with the corresponding BERT_base layers, e.g. the 14th layer in BERT_large is aligned with the 7th in BERT_base. Table 3 shows the results. In the homogeneous setup, the most effective variant is to align the prediction probabilities via R_KL-div, which achieves up to 58% R_NF reduction, i.e. from 4.17% to 1.72%. For the R_l2 setup, aligning at all layers further reduces R_NF compared with aligning only at the final layer. This implies that a deeper alignment can help the new model learn to behave more like the old one when fine-tuning the same architecture with a different random seed.
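The even-layer alignment for mismatched depths amounts to a simple stride mapping. This helper is our own sketch, assuming the deeper model's layer count is an integer multiple of the shallower model's.

```python
def layer_alignment(n_old_layers=12, n_new_layers=24):
    """Map each layer of the shallower old model to an evenly strided
    layer of the deeper new model, e.g. BERT_base layer i -> BERT_large
    layer 2i (layers are 1-indexed)."""
    assert n_new_layers % n_old_layers == 0, "depths must divide evenly"
    stride = n_new_layers // n_old_layers
    return {old: old * stride for old in range(1, n_old_layers + 1)}

# BERT_base layer 7 is aligned with BERT_large layer 14:
print(layer_alignment(12, 24)[7])  # 14
```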
In the heterogeneous setup, R_l2, Final [CLS] works best for BERT_large, achieving 62% R_NF reduction, with R_KL-div performing comparably. Overall, R_KL-div produces consistent regression reductions across different setups, so we pick it as the default setting in the distillation method.
From Table 3, we can also observe that deeper alignment seems to hurt R_NF in the heterogeneous update setup. The reason might be that the differences between pre-trained models are too significant: distillation with naive all-layer alignment could distort the pre-trained representations rather than effectively encourage the new model to learn where the old model performs well.
Another interesting finding is that the simple model ensemble is a competitive solution compared with distillation. In the BERT_base → BERT_base setup, the ensemble even outperforms all distillation variants. This is somewhat counter-intuitive, as distillation explicitly encourages the new model to pick up the old model's correct predictions, while the ensemble does not involve the old model at all. In the next section, we conduct a deeper analysis to understand in which aspects these methods reduce regression.

Analyzing Regression in Model Updates
In this section, we first analyze the model ensemble and present our hypothesis on how it reduces regression. Next, we conduct behavioral testing across diverse linguistic phenomena to see where the reduced and remaining regressions reside.

Analysis of Updating to Model Ensemble
Similar to the findings of Yan et al. (2020), we observe in Tables 2 and 3 that a simple ensemble of models initialized with different random seeds before fine-tuning can reduce regression in some cases. We fine-tune BERT_base on MRPC with 20 random seeds as our old base models, another 20 seeds as our new single models, and another 100 seeds for building 20 ensemble models. Next, we calculate R_NF on the dev set for each model update setup, i.e. 400 update pairs. Figure 2 plots the resulting distributions of model update regression R_NF. We observe that the ensemble not only brings down R_NF but also reduces its variance. From Figure 2, we conjecture that each single model learns a subset of all possible patterns in the data sufficient to achieve comparable accuracy on the task. Models fine-tuned with different seeds can rely on different sets of patterns for prediction, leading to behavioral differences and regression. The ensemble, on the other hand, aggregates distinct and complementary behaviors from individual models, leading to less eccentric behavior and increased compatibility with individual models on average. In parallel work, Zeyuan and Li (2020) provide a theoretical framework for how ensembles work from a multi-view perspective: single models pick up multiple but different views of the data, and the ensemble naturally collects more view features, leading to higher accuracy. Our hypothesis concurs with their findings.
However, an ensemble is not required to achieve moderate model behavior. To verify this, we design the following simple model selection procedure. We first train 20 new single models, and for each model compute the average R_NF on the first half of the dev set when updating from each of the other 19 models. We then select the single model with the lowest average R_NF as the centric model. Results in Table 4 show the accuracy and R_NF on the second half of the dev set. Indeed, the single centric model achieves a substantial reduction in R_NF, comparable to the model ensemble.
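The centric-model selection procedure can be sketched as follows. The prediction-matrix interface and function names are our own simplifications of the procedure described above.

```python
def select_centric(preds, labels, holdout):
    """Among candidate models, pick the one with the lowest average R_NF
    when treated as the *new* model updating from each of the others.
    `preds[m][i]` is model m's prediction on example i; `holdout` indexes
    the dev-set half used for selection."""
    def nfr(old, new):
        flips = sum(1 for i in holdout
                    if preds[old][i] == labels[i] != preds[new][i])
        return flips / len(holdout)

    def avg_nfr(m):
        others = [o for o in range(len(preds)) if o != m]
        return sum(nfr(o, m) for o in others) / len(others)

    return min(range(len(preds)), key=avg_nfr)

# Model 0 is never negatively flipped against, so it is the centric model:
preds = [[0, 1, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0]]
print(select_centric(preds, [0, 1, 0, 1], [0, 1, 2, 3]))  # 0
```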
We further plot all the BERT_base models based on their class predictions, down-projected by PCA (Hotelling, 1933). Figure 3 shows that single models tend to spread out, while ensembles are concentrated close together. We can also see that the centric model indeed sits near the center of the single-model cluster. In essence, the centric model is a single model that requires far less compute than the ensemble at inference time, yet achieves comparable performance and regression reduction.

Analyzing Regression with Linguistic Behavioral Testing
To further understand where regression happens and how the above methods contribute to reducing it, we conduct qualitative analysis across diverse linguistic phenomena. More precisely, we leverage CHECKLIST (Ribeiro et al., 2020) behavioral testing and construct regression sets for relevant linguistic capabilities and tests based on perturbations and the provided templates. For example, to test the capability of handling lexical taxonomy in paraphrase detection, we replace adjectives in one sentence with synonyms, keep the label unchanged, and expect the model to still predict correctly. We manually set the templates, apply CHECKLIST to automatically generate test sentence pairs, and calculate R_NF for each linguistic test. Detailed linguistic behavioral testing setups with examples can be found in Appendix C.

Table 5 shows the linguistic behavioral testing results when updating from BERT_base (1 seed) to BERT_large. Each row denotes one specific behavioral test, with 500 cases sampled per test. We focus on negative flips, where the new model fails a test case that the old model passes. We observe that the vanilla fine-tuned BERT_large has significant regressions on synonym switching, asymmetric ordering, and active-passive swaps involving people's names (see Appendix C). We also observe that models tend to either fail or pass almost all cases in a given test, which leads to high variance in R_NF. This implies that models fine-tuned with different seeds can have different behavioral patterns, which could be one source of regression.
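The template-driven test generation can be sketched as a label-preserving perturbation. This is a simplified stand-in for CHECKLIST's templating, with our own function name and a plain word-substitution rule.

```python
def synonym_test_cases(sentences, synonyms):
    """Generate label-preserving paraphrase test pairs by swapping a word
    with a synonym, CheckList-style. The gold paraphrase label stays True
    because the substitution preserves meaning."""
    cases = []
    for sent in sentences:
        for word, syn in synonyms.items():
            if word in sent.split():
                cases.append((sent, sent.replace(word, syn), True))
    return cases

pairs = synonym_test_cases(["I can become more courageous ."],
                           {"courageous": "brave"})
print(pairs)
```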
Furthermore, Table 5 shows that the distillation can effectively reduce regressions across almost all types of behavioral tests. This demonstrates that minimizing the surrogate regression measure, formulated as a knowledge distillation objective, reduces the regression through actually aligning new model's behavior with the old model.
For the ensemble, although it reduces regression significantly in the benchmark, we observe that it improves model update compatibility on only a handful of capabilities. We hypothesize that the ensemble mostly improves compatibility with its underlying constituent models. Without an explicit alignment, it cannot proactively reduce regression on certain behavioral tests when updating from other, distinct models.

Table 5: Behavioral tests with CHECKLIST. The second column shows the error rate of the old model; the remaining columns are R_NF compared with the old model. Columns labeled 1 Seed show results with random seed 0, while columns labeled 5 Seeds show 5-seed averages. Columns labeled KD are models after distillation. Columns labeled Centric use the single model selected with the method described in Section 5.1.
Related Work

Model Update Regression and Solutions
Backward-compatible representation learning was first studied by Shen et al. (2020) for learning interoperable visual embeddings in image retrieval tasks. Later, Yan et al. (2020) formalized the model update regression problem in machine learning and explored solutions on image classification tasks. They propose negative flips (NF) as an empirical measure of regression and a specialized knowledge distillation loss (Hinton et al., 2015) as a surrogate of regression for joint optimization. Our work investigates model update regression in NLP classification tasks, which involve discrete signals and rich linguistic structure. We formulate our solutions from the perspective of constraint satisfaction and verify their efficacy in scenarios including updates to distinct architectures.

Transfer Learning, Lifelong Learning and Concept Drifting
Pre-training a model on large corpora and fine-tuning it on downstream tasks has emerged as a standard paradigm in NLP (Devlin et al., 2019; Lan et al., 2020; Conneau and Lample, 2019; Raffel et al., 2020; Brown et al., 2020; Clark et al., 2020). Our work follows this transfer learning paradigm, but our main focus is the regression phenomenon when updating backbone pre-trained models. Other related streams of research are lifelong learning (Lopez-Paz and Ranzato, 2017; Yoon et al., 2018; Delange et al., 2021; Sun et al., 2019; Chuang et al., 2020), incremental learning (Rebuffi et al., 2017; Chaudhry et al., 2018; Prabhu et al., 2020), and concept drift (Schlimmer and Granger, 1986; Tsymbal, 2004; Klinkenberg, 2005; Žliobaitė, 2016), which aim to accumulate knowledge learned either across tasks or from data with changing distributions. The model update regression problem differs in that models are trained on the same task and dataset, but we update from one model to another.

Behavioral Testing of NLP Models
To analyze whether a fine-tuned model can handle linguistic phenomena for a specific end task, perturbation techniques are often used (Belinkov and Bisk, 2018;Ribeiro et al., 2018;Prabhakaran et al., 2019;Wu et al., 2019;Talmor et al., 2020). In particular, CHECKLIST (Ribeiro et al., 2020) leverages and expands those techniques to efficiently evaluate a wide range of linguistic behavioral capabilities of NLP models. Our work applies CHECKLIST to inspect where the model update regressions come from and on which linguistic phenomena our proposed solutions help to reduce regressions.

Conclusion
In this work, we investigated regression in NLP model updates on classification tasks and showed that it is prevalent across tasks and models. We formulated regression-free model update as a constrained optimization problem and reduced it to a joint learning objective on the target task combined with distillation from the old model. Together with the ensemble, these methods cut regression by up to 60%. Experiments on the GLUE benchmark showed that the ensemble is effective at reducing regression when updating to homogeneous models, while knowledge distillation produces larger regression reductions in the heterogeneous setting. Through linguistic behavioral testing, we showed that distillation reduces regression across a wider range of linguistic phenomena than the ensemble method. While the regression reductions achieved by the discussed methods are promising, they are far from regression-free. We leave the design of more advanced regression-reduction methods as future work.

A Full Results of Regression Between SOTA Model Updates
Due to page limitations, we present the full regression comparison between commonly used pre-trained model pairs (Devlin et al., 2019; Liu et al., 2020; Lan et al., 2020; Clark et al., 2020) in Table 6.
We show regression in model updates from BERT_base to the other commonly used pre-trained models; we also show regression in updates from BERT_large to BERT_large-whole-word-masking and from RoBERTa_base to ELECTRA_base.
Beyond the universal presence of regression, Table 6 shows additional patterns across update pairs.

B Selection of Regression Set During Training
Here, we explore different choices of the regression set used during knowledge distillation. For the regression set used in the model training process, we consider several options:
1. The entire training set: D_reg = D_train;
2. Training examples where the old model makes correct predictions: D_reg = D_correct;
3. Training examples where the old model assigns a higher predicted probability to the ground-truth class than the new model: D_reg = D_better. This is equivalent to adjusting α dynamically according to the performance of the two models: we set α to zero when p_φ_old(y|x) < p_φ_new(y|x);
4. Extra data from other tasks: D_reg = D_extra;
5. A user-provided regression set that includes high-stakes examples: D_reg = D_user.
We experiment with all options except the user-provided regression set; see Table 7.
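The D_better option above reduces to a per-example gating of α, which can be sketched as a small helper (the function name is ours):

```python
def dynamic_alpha(p_old_true, p_new_true, alpha=1.0):
    """D_better as a dynamic penalty weight: keep the distillation penalty
    only where the old model assigns at least as much probability to the
    ground-truth class; set alpha to zero when p_old(y|x) < p_new(y|x)."""
    return alpha if p_old_true >= p_new_true else 0.0

# The old model is more confident on the gold class, so the penalty stays on:
print(dynamic_alpha(0.9, 0.5))  # 1.0
```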
Dynamically adapting the regression set to the current performance of the new model, as in Distillation (R_KL-div, D_better), offers the largest reduction in regression without sacrificing accuracy. We conjecture that this is because the soft regression-free constraint loss is applied precisely on examples where the new model's performance lags behind.

C Linguistic Behavioral Test Settings
In the linguistic behavioral tests, we cover a wide range of linguistic aspects and design test examples following CheckList (Ribeiro et al., 2020). Table 8 lists the tests; example test cases are given in the third column of each row.

D GLUE Details
The GLUE datasets are described as follows (Jiao et al., 2020):

MNLI. Multi-Genre Natural Language Inference is a large-scale, crowd-sourced entailment classification task (Williams et al., 2018). Given a (premise, hypothesis) pair, the goal is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise.

QQP. Quora Question Pairs is a collection of question pairs from the website Quora. The task is to determine whether two questions are semantically equivalent (Chen et al., 2018).

QNLI. Question Natural Language Inference is a version of the Stanford Question Answering Dataset converted into a binary sentence-pair classification task by Wang et al. (2018). Given a (question, context) pair, the task is to determine whether the context contains the answer to the question.

SST-2. The Stanford Sentiment Treebank is a binary single-sentence classification task, where the goal is to predict the sentiment of movie reviews (Socher et al., 2013).

CoLA. The Corpus of Linguistic Acceptability is a task to predict whether an English sentence is grammatically correct (Warstadt et al., 2019).

STS-B. The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and many other domains (Cer et al., 2017). The task is to evaluate how similar two pieces of text are, with a score from 1 to 5.

MRPC. Microsoft Research Paraphrase Corpus is a paraphrase identification dataset where systems aim to identify whether two sentences are paraphrases of each other (Dolan and Brockett, 2005).

RTE. Recognizing Textual Entailment is a binary entailment task with a small training dataset (Bentivogli et al., 2009).

Category | Description | Example (sentence pair) | Label
Coref - He/She | Reverse he or she. | "If Charles and Jessica were alone, do you think he would reject her?" / "If Charles and Jessica were alone, do you think she would reject him?" | False
Vocab - People | Add modifiers that preserve sentence semantics. | "Wendy is friendly to Kevin." / "Wendy is truly friendly to Kevin." | True
Vocab - More/Less | Swap more with less. | "I can become more passive." / "I can become less passive." | True
Taxonomy - Synonym | Replace a word with a synonym. | "I can become more courageous." / "I can become more brave." | True
SRL - Paraphrase | Somebody thinks → according to somebody. | "Who do conservatives think is the happiest surgeon in the world?" / "Who is the happiest surgeon in the world according to conservatives?" | True
SRL - Asymmetric Order | Order matters for asymmetric relations. | "Shannon is proposing to Samantha." / "Samantha is proposing to Shannon." | False
SRL - Active/Passive 1 | Traditional SRL: active/passive swap. | "Jeremy missed the game." / "The game was missed by Jeremy." | True
SRL - Active/Passive 2 | Traditional SRL: active/passive swap. | [first sentence missing in source] / "Alyssa is remembered by Christian." | (missing)
SRL - Active/Passive 3 | Traditional SRL: wrong active/passive swap. | "Sara took the castle." / "Sara was taken by the castle." | False
Temporal - Before/After | Before becoming somebody → after becoming somebody. | "What was Noah Myers's life before becoming an architect?" / "What was Noah Myers's life after becoming an architect?" | False