How Much Consistency Is Your Accuracy Worth?

Contrast set consistency is a robustness measurement that evaluates the rate at which a model correctly responds to all instances in a bundle of minimally different examples relying on the same knowledge. To draw additional insights, we propose to complement consistency with relative consistency—the probability that an equally accurate model would surpass the consistency of the proposed model, given a distribution over possible consistencies. Models with 100% relative consistency have reached a consistency peak for their accuracy. We reflect on prior work that reports consistency in contrast sets and observe that relative consistency can alter the assessment of a model’s consistency compared to another. We anticipate that our proposed measurement and insights will influence future studies aiming to promote consistent behavior in models.


Introduction
Annotators introduce data shortcuts that allow models to solve tasks in unintended ways (Gururangan et al., 2018). In response, it has been proposed to measure whether a model correctly responds to a bundle (or contrast set) of slightly modified instances that rely on the same knowledge (Gardner et al., 2020; Kaushik et al., 2020). The rate at which a model accomplishes this is termed consistency. We propose an additional measurement, relative consistency, that facilitates discussion about achievable consistency scores, enabling a more nuanced comparison.
[Displaced caption fragment: "…denotes that the instance is correctly predicted by a model. The relative consistency is the measurement we propose to complement the standard consistency."]

…the same accuracy. This analysis sheds light on an upside of 1a and a limitation of 1b that might go unnoticed if we solely compare accuracy/consistency. Let us turn to example 1d. Although it represents a model with improved consistency relative to 1a, better consistency was achievable at the same accuracy (see 1e).

Relative consistency (§2) measures whether the consistency of our model would likely be outperformed by an equally accurate model, relative to the distribution of possible consistencies; see Eq. (5). Specifically, it is the probability that our model's consistency is higher than or equal to the consistency scores achievable with the same accuracy. If relative consistency is 100%, then our model is as consistent as it can be given its accuracy, since a more consistent, equally accurate model exists only with near-zero probability. In practice, the goal should be to increase the "standard consistency" while also achieving 100% relative consistency.
In light of this additional consistency metric, in §4 we revisit the findings of three publications that report consistency as an evaluation metric and point out additional conclusions we might draw from their reported consistencies. Our code is available at https://github.com/jacobkj314/relative-consistency.

Relative Consistency
We first introduce background terminology (§2.1), then derive the elements we need to define relative consistency: (i) the achievable consistency scores for a given accuracy (§2.2) and (ii) a distribution over achievable consistency scores (§2.3).

Background
A contrast set or bundle is a set of minimally different instances that might admit different answers, thus testing a model across/near its decision boundary. For example, these two HotpotQA instances (Yang et al., 2018) represent a contrast set:

• Q: Is the Marsilea or the Brabejum the genus of more individual species of plants? A: Marsilea
• Q: Is the Marsilea or the Brabejum the genus of less individual species of plants? A: Brabejum

The model is required to answer both of them correctly to be considered consistent on that bundle. Evaluation with contrast sets makes it harder for simple and inadequate models to perform highly (e.g., a model that has just learned a spurious correlation between the word "Marsilea" and "more"). Related studies construct bundles of paraphrases that have the same, not contrastive, labels (Elazar et al., 2021).
The term consistency is overloaded in NLP and refers to different concepts (Li et al., 2019; Jang et al., 2022; Wang et al., 2023). In this work, we study contrast set consistency, defined as the proportion of bundles in which a model accurately labels every instance:

\text{consistency} = \frac{1}{|B|} \sum_{b \in B} \prod_{x \in b} \mathbb{1}[y_p(x) = y(x)] \quad (1)

where B is the set of all bundles of related instances in a given dataset, x is an example, y_p(x) is the predicted label for x, and y(x) is its gold label.
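As a concrete illustration, the contrast set consistency of Eq. (1) can be sketched in a few lines of Python. The helper names `predict` and `gold` (standing in for y_p and y) are ours, not from the released code:

```python
from typing import Callable, Hashable, Iterable

def contrast_set_consistency(
    bundles: Iterable[Iterable[Hashable]],
    predict: Callable[[Hashable], Hashable],
    gold: Callable[[Hashable], Hashable],
) -> float:
    """Eq. (1): fraction of bundles whose every instance is predicted correctly."""
    bundles = [list(bundle) for bundle in bundles]
    consistent = sum(
        all(predict(x) == gold(x) for x in bundle) for bundle in bundles
    )
    return consistent / len(bundles)
```

For example, a model that answers both Marsilea/Brabejum questions correctly is consistent on that bundle; answering only one of them contributes to accuracy but not to consistency.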

Achievable Consistency Scores
Consider a contrastive test set formed from n original instances, plus a contrastive instance derived from each original instance by varying it along some pertinent dimension. There are 2n + 1 possible accuracies a that a model could achieve on this test set, namely A = \{0, 1, \ldots, 2n-1, 2n\}. Similarly, there are n + 1 possible consistencies c that a model could achieve, namely C = \{0, 1, \ldots, n-1, n\}.
Furthermore, for a given accuracy a ∈ A, only a subset C_a ⊆ C of consistencies is achievable. Trivially, for a = 0, C_a = \{0\} (a model cannot consistently respond to a bundle without correctly responding to the instances within it), and for a = 2n, C_a = \{n\} (a model that correctly responds to all instances has also consistently responded to all the bundles those instances comprise). C_a can then be defined in terms of n and a:

C_a = \{c^{(a)}_{\min}, c^{(a)}_{\min}+1, \ldots, c^{(a)}_{\max}\}

where c^{(a)}_{\min} and c^{(a)}_{\max} are defined as:

c^{(a)}_{\min} = \max(0, a - n), \qquad c^{(a)}_{\max} = \lfloor a/2 \rfloor

Intuitively, if a ≤ n then it is possible that every bundle has one of its constituent instances incorrectly answered, in which case c^{(a)}_{\min} = 0. However, if a > n, then at least a − n > 0 bundles must be fully correctly answered. Indeed, for a bundle to be inconsistent, at least one of its items must be incorrectly answered; for a given a, the number of incorrect items is 2n − a, so at most 2n − a bundles can be inconsistent and c^{(a)}_{\min} = n − (2n − a) = a − n. The definition of c^{(a)}_{\max} follows from the observation that a maximally consistent model will consistently respond to the maximum number of bundles for which both instances can be correctly answered, and that equals ⌊a/2⌋.

Figure 1: On the left is a heatmap of distributions of consistency at each accuracy for 100 bundles of 2 instances: each vertical slice corresponds to a separate distribution over consistencies. Fig. 2 (Appendix) shows the log_10 of this plot, which better highlights the long tails of these distributions. On the right are relative consistency scores given a model's accuracy and consistency, i.e., the CDF of the figure on the left. Note that for a different number of bundles, these plots would look slightly different.
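The bounds above are straightforward to compute; a minimal sketch (the function name is ours):

```python
def achievable_consistencies(n: int, a: int) -> range:
    """Consistencies achievable at accuracy a on n bundles of 2 instances."""
    c_min = max(0, a - n)  # a > n forces at least a - n fully correct bundles
    c_max = a // 2         # at most floor(a/2) bundles can have both instances correct
    return range(c_min, c_max + 1)
```

For n = 100 and a = 130, the achievable consistencies run from 30 to 65.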

Distribution of Achievable Consistencies
Given an accuracy a, we construct a distribution over achievable consistencies c ∈ C_a with:

P(c \mid a) = \frac{m(c, a)}{M(a)} \quad (5)

where M(a) is the number of ways a model can achieve accuracy a and is given by:

M(a) = \binom{2n}{a}

because there are 2n total instances, of which any a might be the ones to which the model correctly responds. (One could instead treat consistency as the more fundamental property of a model's behavior and compute a distribution over the possible accuracies in the range [2c, n + c]; the corresponding accuracy-by-consistency distributions could then be computed from the consistency-by-accuracy distributions defined here.) m(c, a) is the number of ways a model can achieve accuracy a and consistency c, and is given by:

m(c, a) = \binom{n}{c} \binom{n-c}{a-2c}\, 2^{a-2c}

where:
• \binom{n}{c} corresponds to the number of ways that c consistent bundles can be selected from n,
• \binom{n-c}{a-2c} corresponds to the number of ways the remaining a − 2c correct instances can be distributed across the remaining n − c bundles, giving each selected bundle only one correct instance (to avoid creating an additional consistent bundle),
• 2^{a-2c} represents the number of ways that these partially correct bundles could have either instance correct.

Using this, we can calculate m(c, a) and M(a) across all values of c and a for reasonable sizes of n. These distributions can be extended to bundle sizes above 2; see the formulas in Appendix B. Figure 1a shows the distributions of consistency scores for a dataset with 100 bundles of 2 instances.
Note that this distribution is not uniform over the consistencies achievable at a given accuracy: some consistencies can be achieved in more ways than others. This is why the formula for m(c, a) is crucial to the computation of relative consistency that comes next.
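These counts can be computed with exact integer arithmetic via `math.comb`; the sketch below (function names are ours) builds P(c | a) for bundles of 2:

```python
from math import comb

def m(c: int, a: int, n: int) -> int:
    """Ways to achieve accuracy a and consistency c on n bundles of 2 instances."""
    if not (max(0, a - n) <= c <= a // 2):
        return 0  # this consistency is not achievable at accuracy a
    # choose the consistent bundles, spread the leftover correct answers one per
    # bundle, then pick which instance of each partially correct bundle is right
    return comb(n, c) * comb(n - c, a - 2 * c) * 2 ** (a - 2 * c)

def M(a: int, n: int) -> int:
    """Ways to achieve accuracy a: choose which a of the 2n instances are correct."""
    return comb(2 * n, a)

def consistency_distribution(a: int, n: int) -> dict[int, float]:
    """P(c | a), Eq. (5), over the achievable consistencies."""
    total = M(a, n)
    return {c: m(c, a, n) / total for c in range(max(0, a - n), a // 2 + 1)}
```

A quick sanity check: the counts m(c, a) summed over C_a recover M(a), so each distribution sums to one.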
This formulation assumes that all instances are equally difficult, which is known not to be the case in practice (Swayamdipta et al., 2020). It also disregards any inductive biases of models/datasets that could skew the distribution.
Relative Consistency

We measure the tendency toward consistency exhibited by a model that achieved accuracy a and consistency c on a contrastive set by computing the cumulative probability over achievable consistencies in C_a up to c:

RC(c, a) = \sum_{c' \in C_a,\; c' \le c} P(c' \mid a)

Intuitively, RC(c, a) indicates how likely the model's consistency is to match or outperform that of an equally accurate model, relative to the distribution of achievable consistencies defined in (5). This allows us to quantify whether model consistency is below, at, or above chance, given its accuracy. In the good case, RC is high, meaning that an equally accurate model is unlikely to have higher consistency. Conversely, if RC is low, then an equally accurate model is likely to have higher consistency (which is unwanted).
Although other measurements that contextualize consistency scores within a particular accuracy can be constructed (such as simply scaling the consistency linearly between c^{(a)}_{\min} and c^{(a)}_{\max}, or reporting the fraction of fully consistent bundles among those that are at least partly correct), these approaches lack the probabilistic interpretation underlying RC. §3-4 highlight circumstances in which this probabilistic interpretation is useful, and Appendix C compares the score distributions obtained via these measurements with those obtained via RC.
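RC is then just the CDF of the consistency-given-accuracy distribution, evaluated at the model's consistency. A self-contained sketch for bundles of 2 (names are ours):

```python
from math import comb

def relative_consistency(c: int, a: int, n: int) -> float:
    """RC(c, a): probability that an equally accurate model is at most this consistent."""
    def m(cc: int) -> int:
        # ways to achieve accuracy a with exactly cc consistent bundles
        return comb(n, cc) * comb(n - cc, a - 2 * cc) * 2 ** (a - 2 * cc)
    c_min, c_max = max(0, a - n), a // 2
    mass = sum(m(cc) for cc in range(c_min, min(c, c_max) + 1))
    return mass / comb(2 * n, a)  # CDF of P(c | a) at c
```

A model at the top of its achievable range (c = c^{(a)}_{\max}) gets RC = 1.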

Analysis with Simulated Contrastive Set
Suppose you evaluate a model on a contrastive test set containing 100 bundles of 2 instances. The distribution of consistencies for this dataset is shown in Figure 1a, with the CDF of that distribution (corresponding to the RC score) in Figure 1b.
Note that the highest-density region of the distribution moves upward as accuracy increases, and occupies only a very thin band. This means that, for a given accuracy, there is generally little room for improvement in consistency. This can be useful when discussing results: if a particular training approach yields a 5% improvement in consistency at the same accuracy, that represents a substantial change in how the model tends to respond to inputs.
It can still happen that improving accuracy and consistency decreases relative consistency. As an example, consider comparing a model M_1, which achieves a = 130, c = 45 (65% accuracy, 45% consistency), against a model M_2 with a = 150, c = 55 (75% accuracy, 55% consistency). Clearly, model M_2 is more desirable for practical uses if we are just comparing one model to another.
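This comparison can be reproduced numerically. The sketch below (assuming n = 100 bundles of 2, as in the example) checks only the ordering of the two RC scores rather than any exact percentage:

```python
from math import comb

def rc(c: int, a: int, n: int = 100) -> float:
    """RC(c, a) for n bundles of 2 instances."""
    def ways(cc: int) -> int:
        return comb(n, cc) * comb(n - cc, a - 2 * cc) * 2 ** (a - 2 * cc)
    return sum(ways(cc) for cc in range(max(0, a - n), c + 1)) / comb(2 * n, a)

rc_m1 = rc(c=45, a=130)  # M1: 65% accuracy, 45% consistency
rc_m2 = rc(c=55, a=150)  # M2: 75% accuracy, 55% consistency
# M2 dominates on raw accuracy and consistency, yet M1 sits higher
# within the range of consistencies achievable at its own accuracy.
```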

Meta-Analysis of Prior Work
In this section, we revisit results reported by prior work that evaluates with contrast sets, in light of relative consistency.

Gardner et al. (2020)
They construct contrast sets for several common test sets by modifying a sample of the test set instances. They train a biaffine parser (Dozat and Manning, 2017) with ELMo embeddings (Peters et al., 2018) for UD parsing (Zeldes, 2017; Silveira et al., 2014; Basili et al., 2015; Ahrenberg, 2007), and RoBERTa (Liu et al., 2019) for the reading comprehension tasks ROPES (Lin et al., 2019) and MC-TACO (Zhou et al., 2019) and the stance prediction task PERSPECTRUM (Chen et al., 2019). Table 2 shows the accuracy and consistency of these models for four of their contrast sets. In the rightmost column, we report the relative consistency scores that we introduce.

Analysis The UD parsing and ROPES models achieve similar consistency scores (17.3 and 17.6). However, the UD parsing model's consistency has a near-zero chance of outperforming an equally accurate model, whereas the ROPES model is quite likely to do so. Additionally, relative consistency shows that models with low consistency can nonetheless have a strong tendency to respond to bundles consistently. We see this with the results for MC-TACO, which, despite only achieving 8.0% consistency, is more consistent than an equally accurate model in 95.2% of cases. Intuitively, this means that this above-chance model has at least generalized well within the few cases to which it correctly responds.

Dua et al. (2021)
They investigate whether training approaches that consider a full bundle of related instances together, instead of their constituent instances separately, improve consistency. Table 3 shows the results they report with T5 (Raffel et al., 2020), along with the relative consistency scores we compute from their results, on the contrastive version of ROPES, a reading comprehension dataset for evaluating a model's ability to reason about "effects of the relationships in the background passage in the context of the situation".
Analysis We observe that the baseline model trained with maximum likelihood estimation (MLE) is already at ceiling performance in terms of its tendency to produce consistent responses (i.e., its RC scores). Combining contrastive estimation (CE; Smith and Eisner, 2005) or unlikelihood training (UL; Welleck et al., 2020) with MLE not only improves accuracy and consistency but does so without lowering relative consistency, which is desired. This emphasizes the effectiveness of these objectives.

Ravichander et al. (2022)
They introduce CondaQA, a contrastive dataset for studying reading comprehension models' effectiveness in reasoning about the implications of negation expressed in a given text. Each CondaQA instance comes with three minimally varied versions: one paraphrases the negation, another modifies what is negated (the scope), and the last removes the negation. Ravichander et al. (2022) use UnifiedQA-v2 (Khashabi et al., 2022) as a backbone model. We explore the factors that might influence the consistency of the large and 3B versions of this model:

• The training objective: MLE, CE, or combined λ_1 MLE + λ_2 CE.
• The choice of hyperparameters λ_1 and λ_2 (with UnifiedQA-large).

Table 4 shows the accuracy, consistency, and relative consistency we obtain for bundles where the original instance is paired with its: (i) scope-edited version, and (ii) affirmative version (without negation). In Table 5 (Appendix), we also include the results with paraphrase edits.
Analysis An increase in consistency does not necessarily indicate a heightened tendency to respond consistently to bundles (unless the accuracy stays the same). Compare CE with 1MLE+1CE (double underlined, in the upper part of the table). In this case, by training with MLE and CE, affirmative consistency has gone up slightly; however, the model's chance of outperforming an equally accurate model dropped from 26% to 19%. This is an example of a suboptimal way of improving consistency, and MLE+CE is not necessarily superior to standalone CE in this case. A similar, but less pronounced, situation occurs when comparing MLE against 0.33MLE+1CE for scope consistency in the bottom part of the table (italicized).
Conversely, even if standard consistency has not improved, a model's tendency to respond consistently to bundles may have. For example, compare MLE with 1MLE+1CE for scope consistency in the upper part of the table (wavy underlined). In this case, scope accuracy lowered slightly but absolute scope consistency remained the same, leading to a large improvement in Scope-RC. This may suggest that the additional CE loss caused the model to unlearn a few individual instances without unlearning any complete bundles it had learned. Similarly, for 0.33MLE+1CE in the upper part of the table (underlined once), scope consistency increased only slightly but scope relative consistency increased notably. If we compared only consistency, we would conclude that the choice of hyperparameters λ_1, λ_2 is not vital, whereas they can in fact affect a model's consistency behavior, as shown by relative consistency.

Conclusion
We introduce relative consistency, which complements standard contrast set consistency by allowing an accuracy/consistency score pair to be examined to determine whether a higher consistency was possible at that accuracy. This facilitates the comparison of consistencies achieved by models with different levels of accuracy. We show that relative consistency enriches the conclusions we draw about whether one model is more consistent than another, and occasionally even leads to different takeaways.

Limitations
This mathematical model is based on a simplified version of contrastive datasets. Contrastive datasets may have more than two edits for each original instance, which will result in a different distribution.
Although we provide formulas for distributions of arbitrary bundle size in Appendix B, these distributions are less intuitive, more expensive to compute, and have the additional drawback that if a model achieves high pairwise RC on two elements of a bundle, it is likely to achieve high bundle RC even if the other elements do not achieve high pairwise RC. In general, we recommend formulating questions of consistency in terms of bundles with one instance exhibiting a feature and the other instance lacking that feature. Moreover, contrastive datasets may include extra data that is not contrastive; e.g., CondaQA has a small number of bundles with a single instance because the other instances in the bundle did not pass quality checks and were filtered out. In §2.3, we state the drawbacks of the distribution (5). Namely, we do not consider that the distribution might be skewed due to varying example difficulty and other inherent properties of datasets and models.

A Numerical Stability of Relative Consistency
To avoid numerical instability, especially when comparing RC scores for two models (i.e., to determine whether a training approach improves a model's tendency to produce consistent responses, or which of two training approaches improves it most), we define the cumulative combinatoric mass:

CCM(c, a) = \sum_{c' \in C_a,\; c' \le c} m(c', a)

and then rephrase the definition of RC as:

RC(c, a) = \frac{CCM(c, a)}{M(a)}

which relies on only one division and is therefore less prone to floating-point rounding errors. This also allows us to compute the improvement of RC(c_1, a_1) over RC(c_2, a_2) as:

RC(c_1, a_1) - RC(c_2, a_2) = \frac{CCM(c_1, a_1)\, M(a_2) - CCM(c_2, a_2)\, M(a_1)}{M(a_1)\, M(a_2)}

which allows for comparisons between models that are very close in their RC scores (i.e., in the long tail of consistency).
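One can also sidestep rounding entirely by staying in rational arithmetic; a sketch using Python's `fractions` (function names are ours):

```python
from fractions import Fraction
from math import comb

def ccm(c: int, a: int, n: int) -> int:
    """Cumulative combinatoric mass: sum of m(c', a) for achievable c' <= c."""
    def m(cc: int) -> int:
        return comb(n, cc) * comb(n - cc, a - 2 * cc) * 2 ** (a - 2 * cc)
    return sum(m(cc) for cc in range(max(0, a - n), c + 1))

def rc_exact(c: int, a: int, n: int) -> Fraction:
    """RC as an exact rational: a single division, no rounding at all."""
    return Fraction(ccm(c, a, n), comb(2 * n, a))

def delta_rc(c1: int, a1: int, c2: int, a2: int, n: int) -> Fraction:
    """Exact RC(c1, a1) - RC(c2, a2), safe even deep in the long tail."""
    return rc_exact(c1, a1, n) - rc_exact(c2, a2, n)
```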

B Formulas for Bundle Sizes b > 2
Let us consider a contrastive test set containing n bundles of b instances each. There are nb + 1 possible accuracies a, but still n + 1 possible consistencies c. C_a can then be defined in terms of n, b, and a as follows:

C_a = \{c^{(a)}_{\min}, c^{(a)}_{\min}+1, \ldots, c^{(a)}_{\max}\}

where c^{(a)}_{\min} and c^{(a)}_{\max} are defined as:

c^{(a)}_{\min} = \max(0, a - n(b-1)), \qquad c^{(a)}_{\max} = \lfloor a/b \rfloor

Intuitively, if a ≤ n(b − 1) then it is possible that every bundle has at least one of its constituent instances incorrectly answered, in which case c^{(a)}_{\min} = 0. However, if a > n(b − 1), then at least a − n(b − 1) > 0 bundles must be fully correctly answered. Indeed, for a bundle to be inconsistent, at least one of its items must be incorrectly answered; for a given a, the number of incorrect items is nb − a, so at most nb − a bundles can be inconsistent and c^{(a)}_{\min} = n − (nb − a) = a − n(b − 1). The definition of c^{(a)}_{\max} follows from the observation that a maximally consistent model will consistently respond to the maximum number of bundles for which all b instances can be correctly answered, and that equals ⌊a/b⌋.

Now, M(a) (the number of ways a model can achieve accuracy a) is given by:

M(a) = \binom{nb}{a}

and m(c, a) (the number of ways a model can achieve accuracy a and consistency c) is given by:

m(c, a) = \binom{n}{c}\, G(n - c,\, b,\, a - bc)

where the first factor still intuitively corresponds to the number of ways that c consistent bundles can be selected out of n, and the second refers to the number of ways the remaining correct instances can be distributed within responses to the test set such that no additional consistent bundle is formed. This second factor G(m, b, k) is defined as:

G(m, b, k) = \sum_{r=0}^{R} (-1)^r \binom{m}{r} \binom{(m-r)b}{k - rb}, \qquad R = \min(m, \lfloor k/b \rfloor)

This can be understood as the number of ways to select k elements of an m × b matrix such that no row has all b of its elements selected. The first term (which simplifies to \binom{mb}{k}) is the total number of ways these k elements could be selected, ignoring the restriction on complete rows; the remaining terms apply the principle of inclusion-exclusion to alternately subtract and add the number of ways that at least r rows could be filled (multiplying the number of ways that r out of m rows could be selected by the number of ways the remaining m − r rows and b columns could be filled by the remaining k − rb items to select), up to the maximal number of rows R that could be filled, whether that is determined by the total number of rows available, m, or the number of rows the k items could fill.
In general, we do not recommend using this measurement for bundle sizes above 2 except for evaluating consistency on three-valued features, as many consistency questions can be formulated as bundles with one instance exhibiting a feature and one instance lacking that feature.

C Distributions of Alternative Approaches
Figures 3 and 4 plot the distributions of consistency scores (for a 100-bundle dataset) obtained via the simpler non-probabilistic alternatives and compare them to the distributions obtained via RC. Both of these characterizations lower the scores for consistencies that are above chance and raise the scores for consistencies that are below chance.
Figure 1: (a) Distributions of consistency scores. (b) Relative consistency scores.

Figure 2: The log_10 of the distributions of consistency scores in Figure 1a.

Figure 3: In this figure, the interval [c^{(a)}_{\min}, c^{(a)}_{\max}] is simply scaled to cover [0, 1] and the score is scaled accordingly. On the left is the score given a model's accuracy and consistency; on the right is the change in score when moving from RC to this formulation.

Figure 4: In this figure, the proportion of fully consistent bundles among the bundles that are at least partially correct is reported. On the left is the score given a model's accuracy and consistency; on the right is the change in score when moving from RC to this formulation.

Table 2: RC scores computed for results reported in Gardner et al. (2020). In the 3rd column, we report the average of the "Original Test" (original only) and "Contrast" (contrastive only) columns in their Table 2; that is the accuracy, a, we use in the calculations of §2. Models with similar consistency (UD Parsing and ROPES) have different tendencies to respond consistently, as revealed by their RC scores.

But if we are comparing two different training approaches, and want to know which induces a stronger tendency for consistent responses, then we would be interested to know that M_1 has RC = 93.0%, while M_2 has RC = 37.1%. This insight, that one model's consistency is below chance while the other's is well above, is made possible by the probabilistic interpretation of RC.

Table 4: Scores computed for results reported in Ravichander et al. (2022), who use UnifiedQA-v2 (Khashabi et al., 2022) on the CondaQA contrastive dataset, with the expectation that including the Contrastive Estimation (CE) objective would improve consistency, as in Dua et al. (2021). RC scores are reported here only for some of the edit dimensions in CondaQA; see Table 5 for the rest.