How Reliable are Model Diagnostics?

In the pursuit of a deeper understanding of a model's behaviour, there is recent impetus for developing suites of probes aimed at diagnosing models beyond simple metrics like accuracy or BLEU. This paper takes a step back and asks an important and timely question: how reliable are these diagnostics in providing insight into models and training setups? We critically examine three recent diagnostic tests for pre-trained language models, and find that likelihood-based and representation-based model diagnostics are not yet as reliable as previously assumed. Based on our empirical findings, we also formulate recommendations for practitioners and researchers.


Introduction
Contemporary statistical models based on deep learning have made incredible progress towards solving complex language tasks (Radford et al., 2019; Devlin et al., 2019; Raffel et al., 2020). These models generally trade off the interpretability and simplicity of traditional models for powerful parameterizations and inductive biases, enabling their impressive performance. However, their entry into critical fields such as medicine, the justice system, and social media moderation often makes this trade-off a costly one. Consequently, there has been surging interest in the development of tools and suites for diagnosing and better understanding model behaviour, and gaining insight into what patterns and phenomena they have learned (§4.1).
Ideally, these diagnostics would not only help practitioners understand the failure modes and capabilities of large contemporary models, but also enable them to improve their models based on the diagnostics. To this end, we believe that model diagnostics are essential for making meaningful progress in natural language processing. Model diagnostics generally probe a model for specific learned qualities (§4.1). These may be positive qualities (e.g., whether a model has acquired syntactic knowledge) or potentially problematic ones (e.g., biases and stereotypes). These probes can be used to identify phenomena that can, in turn, be used to further improve models.
Given the potential impact that model diagnostics can have on practitioners and on the research community's fundamental understanding of contemporary models, this paper asks the important and inevitable question of whether, and to what extent, these probes are actually reliable and robust. These diagnostics' explicit role as tools for understanding also sets a higher bar for robustness, as inconsistencies may mislead and result in compounding errors.
Our findings demonstrate that model diagnostics can be unreliable on multiple fronts. To illustrate our point, we select three diagnostic tasks on which to base our empirical evaluation: StereoSet (Nadeem et al., 2020), CrowS-Pairs (Nangia et al., 2020), and SEATs (May et al., 2019). Overall, we find that likelihood-based and representation-based diagnostics measured multiple times on the same training setup can result in wildly different findings. Specifically, a substantial variance is observed when performing the same model diagnostics on identical BERT (Devlin et al., 2019) pre-training setups while varying minute details such as the initial random seed or choice of representation.
These findings are meant to caution researchers and practitioners who rely on such diagnostics, so that they can be more mindful of these phenomena when analyzing their models in the future. We discuss the implications of our findings and propose recommendations for practitioners and researchers in §5.


Experimental setup

We pre-train 5 BERT BASE and LARGE uncased English models, each with the same configuration as in Devlin et al. (2019), using TensorFlow. However, each model differs in its random seed, resulting in different parameter initializations and training-data permutations. Hence, it is expected that each checkpoint will end up at a different local minimum. It should be noted that BERT uses static masking instead of dynamic masking, so the set of pre-training examples remains the same.
To decouple our findings from phenomena that occur as a result of using different training setups, we restrict our experiments to only those that require pre-trained BERT models, eliminating many probes mentioned in §4.3. Webster et al. (2020) report that patterns learned during pre-training are often resilient to fine-tuning, further supporting our reasoning.

Likelihood-based diagnostics
One approach to examining the behaviour of language models like BERT is to examine how they rank certain representative examples above others. We use two contemporary datasets that measure how often stereotypes are ranked above anti-stereotypes: StereoSet (Nadeem et al., 2020) and CrowS-Pairs (Nangia et al., 2020). Both datasets measure the stereotype score $ss = \frac{100}{|X|} \sum_{n=1}^{|X|} \mathbb{1}\left[ll(x^{\mathrm{ster}}_n) > ll(x^{\mathrm{anti}}_n)\right]$.

StereoSet Nadeem et al. (2020) propose a benchmark that contains intra-sentence and inter-sentence examples of stereotypes and anti-stereotypes. Here, likelihoods are calculated as $ll(x) = p(x_\tau \mid x_{\setminus\tau})$ (where $\tau$ indexes the target demographic word(s) in $x$) for intra-sentence examples and $ll(x) = p(\mathrm{isNext} \mid x_1, x_2)$ for inter-sentence examples. They also propose a language modeling score (lms) and combine it with ss into a hybrid metric (icat), but we only report ss to focus on StereoSet's primary purpose: measuring stereotypical preference in language models. We report results on the development set.
CrowS-Pairs Nangia et al. (2020) propose a test that contains intra-sentence examples, where likelihoods are calculated by conditioning on the target demographic word(s) in the sentence ($ll(x) = p(x_{\setminus\tau} \mid x_\tau)$) rather than vice versa as in StereoSet.
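Concretely, the shared stereotype-score metric can be sketched as a small function over paired (log-)likelihoods. This is our own minimal reconstruction, not code from either paper:

```python
def stereotype_score(ll_stereo, ll_anti):
    """Percentage of pairs where the stereotypical variant receives a higher
    (log-)likelihood than the anti-stereotypical one.  A score of 50 means
    no preference; 100 means total stereotypical preference."""
    assert len(ll_stereo) == len(ll_anti)
    wins = sum(1 for s, a in zip(ll_stereo, ll_anti) if s > a)
    return 100.0 * wins / len(ll_stereo)

# Toy example: 3 of 4 pairs prefer the stereotype.
print(stereotype_score([-1.2, -0.5, -3.0, -2.0],
                       [-1.5, -0.9, -2.5, -2.4]))  # -> 75.0
```

In practice, `ll_stereo` and `ll_anti` would come from scoring each dataset's sentence pairs with the model under test.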
The CrowS-Pairs diagnostic is expected to show higher variance than StereoSet for two reasons: (1) it is a smaller dataset (roughly one-third the size of the StereoSet development set) with more categories, so results are more sensitive to changes in individual predictions; and (2) the pseudo-likelihood it uses is more susceptible to the poor calibration (Jiang et al., 2020a; Desai and Durrett, 2020) of contemporary models, since the number of multiplied probabilities grows linearly with the number of words in a sentence.

Vector-space diagnostics
Directly examining representations learned by models is another way to understand their behavior. This is typically done by measuring relationships between different types of inputs, for example in terms of their relative orientations in a vector space.
SEATs We use Sentence Encoder Association Tests (SEATs; May et al., 2019), which extend the popular Word Embedding Association Tests (WEATs; Caliskan et al., 2017) by constructing "semantically bleached" sentences. A WEAT/SEAT measures the effect size s(X, Y, A, B) of the association between two targets (e.g., X = MentalDisease and Y = PhysicalDisease) and two attributes (e.g., A = Temporary and B = Permanent), as well as the statistical significance of the association using a permutation test. We conduct experiments using the same SEATs as in May et al. (2019). In addition to testing sentence ([CLS]) representations, we also test the contextualized word representations of the target/attribute words in the sentences. We do this because, even for semantically bleached sentences, it is often non-trivial for models to encode information about an entire sentence in a single vector.
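For concreteness, the WEAT/SEAT effect size can be computed as below. This is a minimal NumPy sketch of the Caliskan et al. (2017) statistic; the variable names and the use of the sample standard deviation are our own choices:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean cosine similarity of w to attribute set A minus to B.
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def effect_size(X, Y, A, B):
    # s(X, Y, A, B): difference in mean target-set associations, normalized
    # by the standard deviation of associations over the pooled targets.
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)
```

On toy embeddings where X aligns with A and Y aligns with B, the effect size is strongly positive; swapping X and Y flips its sign, and a near-zero value indicates no measured association.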
In addition to examining effect sizes, we also conduct an experiment to see how distinguishable representations of certain concepts are in vector space (e.g., do representations of Pleasant and Unpleasant sentences form their own clusters?). We do this by clustering (via k-means) sentence representations and subsequently examining how well the unsupervised clusters align with the actual categories. The aim of this experiment is to understand why vector-space diagnostics behave the way they do.
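The clustering check can be sketched as follows, using a deterministic two-cluster Lloyd's algorithm and a simple alignment score. This is our own illustrative setup, not the paper's exact implementation:

```python
import numpy as np
from itertools import permutations

def two_means(X, iters=20):
    # Deterministic k=2 Lloyd's algorithm, seeded with the two points
    # that are farthest apart (a simple heuristic).
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    i, j = np.unravel_index(d.argmax(), d.shape)
    centroids = X[[i, j]].astype(float)
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - centroids[None], axis=-1).argmin(axis=1)
        for k in range(2):
            if (assign == k).any():
                centroids[k] = X[assign == k].mean(axis=0)
    return assign

def alignment(assign, labels):
    # Best accuracy over relabelings of the two unsupervised clusters.
    return max(np.mean(np.array([perm[a] for a in assign]) == labels)
               for perm in permutations(range(2)))
```

An alignment near 1.0 means a concept's representations form their own cluster; values near 0.5 mean the two concepts are not separable in that vector space.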

Likelihood-based diagnostics are unstable
Experiments on StereoSet and CrowS-Pairs show that while likelihood-based ranking diagnostics may be stable when aggregated across all categories, instability is evident in the results of individual categories (Table 1). Many categories have a standard deviation of over 2.5 percentage points. Some categories also vary from almost no stereotypical preference to a significant amount (highlighted in Table 1), a result that could potentially cause practitioners to draw false conclusions. Additionally, from Figure 1 it is evident that many examples are assigned different labels over the 5 pre-trained models, often with 3 models assigning one label and 2 assigning the opposite label: almost as random as a coin flip! This implies that the models are probably uncertain about their predictions for these datapoints, motivating the consideration of model uncertainty in diagnostic measures instead of simply making a binary decision by comparing likelihoods.
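The per-example disagreement can be quantified with a small helper. This is our own illustration; `preds_per_model` is a hypothetical structure holding each seed's binary stereotype/anti-stereotype decisions:

```python
from collections import Counter

def split_counts(preds_per_model):
    """Count how often the models split on each example.  For five models,
    a (3, 2) split is nearly a coin flip, while (5, 0) is unanimous."""
    n_models = len(preds_per_model)
    counts = Counter()
    for example_preds in zip(*preds_per_model):
        ones = sum(example_preds)
        counts[(max(ones, n_models - ones), min(ones, n_models - ones))] += 1
    return counts
```

A diagnostic dominated by (3, 2) splits is telling us more about seed noise than about the training setup.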
Worryingly, both tests report wildly differing results on religious stereotypes ("Rel."), with CrowS-Pairs detecting strong stereotypical preference and StereoSet detecting almost none. It is also worth noting that results on CrowS-Pairs exhibit far higher variance compared to StereoSet (Table 1, Figure 1), as hypothesized in §2.2.

Vector-space diagnostics are unstable
Representation-based experiments exhibit high variance across multiple pre-training runs, choices of representation, and model sizes (Figure 3). Notably, SEAT results often land on both sides of the "neutral" mark (0), and their statistical significance is often erratic. In other words, it is possible for two models pre-trained with the exact same configuration but different random seeds to yield completely opposite conclusions on some SEATs. Moreover, the same checkpoint often yields different results depending on whether sentence or pooled target-word representations are used. Ideally, a SEAT would always or never be statistically significant, and would yield effect sizes with the same sign over multiple pre-training runs and (seemingly innocuous) choices of representation.
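The significance computation can be sketched as a Monte-Carlo permutation test. This is a simplified two-sided version (May et al. (2019) follow Caliskan et al.'s one-sided test over partitions), and `effect` stands in for any effect-size function with this signature:

```python
import numpy as np

def permutation_p_value(X, Y, A, B, effect, n_perm=2000, seed=0):
    # Shuffle the pooled targets into pseudo-X/pseudo-Y splits and count
    # how often a permuted effect is at least as extreme as the observed one.
    rng = np.random.default_rng(seed)
    observed = abs(effect(X, Y, A, B))
    pooled = np.array(list(X) + list(Y))
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        X_p, Y_p = pooled[idx[:len(X)]], pooled[idx[len(X):]]
        hits += abs(effect(X_p, Y_p, A, B)) >= observed
    return hits / n_perm
```

With a strongly associated toy configuration, very few permuted splits match the observed effect and the p-value is small; an erratic diagnostic is one where this p-value crosses the significance threshold depending only on the random seed used for pre-training.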
From Figure 2, the representational instability of semantically bleached SEAT sentences is further evident: how these representations cluster together is erratic both across pre-training steps and across multiple pre-training runs. This result gives us further insight into why high variance is observed for vector-space diagnostics: representations often cannot form their own clusters for certain concepts, so simply examining their relative orientations is insufficient. Our findings provide empirical arguments for what May et al. (2019) surmise: there is scope for sentence embedding-based tests that do more than naturally extend word embedding-based tests with semantically bleached sentences.
We surmise that representation-based diagnostics are less stable than likelihood-based diagnostics because large models like BERT are optimized to model likelihoods well via their pre-training objective. However, there is no constraint on how sentences must be represented other than that it should be possible to "extract" correct likelihoods from them. In other words, there is no reason to expect the orientations of these representations to provide deep insight into what these models learn.

Diagnostic instability persists despite equivalent downstream performance
We fine-tuned the 10 checkpoints on SST-2 (Socher et al., 2013), RTE (Dagan et al., 2006; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), and QNLI (Rajpurkar et al., 2016) from the GLUE benchmark. Development-split results show that performance was largely the same across checkpoints. Dev-set performance is also largely consistent with what is expected of BERT BASE and LARGE models. It should be noted that we only used one set of hyperparameters and did not perform the hyperparameter sweep from Devlin et al. (2019), so further tuning would likely improve results.
Another axis along which to compare model diagnostics is whether they are intrinsic or extrinsic, i.e., whether they directly analyze models for phenomena that are not tied to any downstream task, or do so with particular tasks in mind. This paper restricts itself to intrinsic tasks for the reasons mentioned in §2.1. An example of an extrinsic task is Rudinger et al. (2018), which probes models for gender bias through the lens of coreference resolution. We refer readers to Belinkov and Glass (2019) for a more comprehensive survey on model analysis for natural language processing.

Diagnostic Fragility
It has been shown that classifier probes, which require an additional classifier (like an MLP) to be trained on top of frozen model representations, are unstable (Voita and Titov, 2020), and that it might not be clear from their results whether the probe itself learned a phenomenon or whether the diagnosed representations learned it (Hewitt and Liang, 2019). Similarly, gradient-based analyses of neural-network-based language technologies have been found to be unreliable and manipulable. Attention-based interpretation can also be unreliable and manipulable to the point of deceiving practitioners, as Pruthi et al. (2020) and Jain and Wallace (2019) show. The works mentioned above all support our arguments, and some raise concerns similar to those expressed in this paper.

Inconsistencies between equivalent checkpoints
This paper's findings can be linked to the problems caused by underspecification in machine learning (D'Amour et al., 2020), i.e., when multiple unique predictors trained with the same configuration have the same performance but differ in subtle ways. In a setting where practitioners might train and thoroughly analyze one model but then retrain it and assume that the first checkpoint's model diagnostics hold for the second one, this issue is highly relevant. McCoy et al. (2020) also find that separately fine-tuned BERT models often vary significantly in generalizing to auxiliary tasks.

Discussion
Recommendations No probe is perfect, but it is clear that model diagnostics are not as reliable as previously assumed. Our empirical findings, coupled with the works mentioned in §4.2 and §4.3, motivate careful scrutiny of model diagnostics.
We recommend that:
• Practitioners not generalize a single diagnostic result to the entire training setup, and instead restrict conclusions to a specific checkpoint.
• Researchers proposing probes test them not only on publicly available checkpoints, but also examine each probe's performance and robustness across a range of model and probe configurations.
Future Work While this paper primarily aims to motivate further scrutiny of model diagnostics, we hope it motivates studies that ask why these diagnostics often behave unreliably. One future research direction we are excited about is analyzing correlations between the properties of the models' local minima in the loss landscape and behaviour on model diagnostics. This would not only be another step towards a better understanding of how contemporary deep language models work, but also enable researchers to use that information to design better, more robust model diagnostics. Such a study may even help inform the optimization process for future state-of-the-art language technologies. It should also be noted that this paper is restricted to three diagnostics spanning likelihood-based and representation-based probes, and that future work is needed to determine the extent to which other diagnostic probes are reliable.

Conclusion
In this paper, we motivate further scrutiny of model diagnostics that aim to understand the behaviour of contemporary "black-box" language technologies. Our results show that model diagnostics are often fragile and can yield different conclusions as a result of seemingly innocuous configuration changes. We hope that our results over multiple pre-training runs will encourage researchers and practitioners to be mindful of the reliability of such model diagnostics when verifying hypotheses about their models and training setups.