Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right

Large language models have shown promising results in zero-shot settings. For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition—wherein different surface forms compete for probability mass, even if they represent the same underlying concept in a given context, e.g. “computer” and “PC.” Since probability mass is finite, this lowers the probability of the correct answer, due to competition from other strings that are valid answers (but not one of the multiple choice options). We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to its a priori likelihood within the context of a specific task. It achieves consistent gains in zero-shot performance over both calibrated and uncalibrated scoring functions on all GPT-2 and GPT-3 models on a variety of multiple choice datasets.

Introduction
Despite the impressive results large pretrained language models have achieved in zero-shot settings (Brown et al., 2020; Radford et al., 2019), we argue that current work underestimates the zero-shot capabilities of these models on classification tasks. This is in large part due to surface form competition: a property of generative models that causes probability to be rationed between different valid strings, even ones that differ trivially, e.g., by capitalization alone. Such competition can be largely removed by scoring choices according to Domain Conditional Pointwise Mutual Information (PMI_DC), which reweighs scores by how much more likely a hypothesis (answer) becomes given a premise (question) within the specific task domain.
Specifically, consider the example question (shown in Figure 1): "A human wants to submerge himself in water, what should he use?" with multiple choice options "Coffee cup", "Whirlpool bath", "Cup", and "Puddle." Of the given options, "Whirlpool bath" is the only one that makes sense. Yet other answers are valid and easier for a language model to generate, e.g., "Bathtub" and "A bathtub." Since all surface forms compete for finite probability mass, allocating significant probability mass to "Bathtub" decreases the amount assigned to "Whirlpool bath." While the total probability of generating some correct answer may be high (i.e., summed across all valid surface forms), only one of these is a listed option. This is particularly problematic here because "Whirlpool bath" will have much lower probability than "Bathtub," due to its rarity. More generally, methods that do not account for surface form competition will favor answers with fewer lexical paraphrases.
PMI_DC factors out the probability of a specific surface form by instead computing how much more probable a hypothesis is when conditioned on a premise. We use a domain premise string to estimate the unconditional probability of a hypothesis in a given domain. On CommonsenseQA, for example, we compute the probability of each answer option immediately following the string "? the answer is:", and then divide the conditional probability by this estimate to calculate PMI_DC. This scaling factor reweighs answer scores according to the surface form competition that is inherent to the domain or task, e.g., completions of the domain premise that are inherently unlikely will be upweighted more. This allows us to directly measure how much an answer tells us about the question and vice versa (mutual information is symmetric, see §3). Valid hypotheses no longer need to compete with each other: both "Whirlpool bath" and "Bathtub" will be considered reasonable answers to the question, and so both will attain a high score.
Extensive experiments show that PMI_DC consistently outperforms raw, normalized, and calibrated probability scoring methods on zero-shot multiple choice for more than a dozen datasets, and it does so for every model in the GPT-2 and GPT-3 families (§4); this holds true across different possible prompts and in preliminary few-shot experiments as well. To better explain these gains, we use the distinct structure of the COPA dataset (Roemmele et al., 2011) to remove surface form competition entirely, showing that all methods perform well in this idealized setting (§5). Additionally, we analyze the only three datasets where PMI_DC does worse than other methods and put forward a hypothesis for why normalizing log probabilities works better than raw probabilities (§6). We conclude with a discussion of how generative models should be used for selection tasks (§7).

Background and Related Work
Zero-shot vs. Few-Shot Zero-shot inference has long been of interest in NLP, Computer Vision, and ML in general (Socher et al., 2013; Guadarrama et al., 2013; Romera-Paredes and Torr, 2015). However, Radford et al. (2019) popularized the notion that language models have many zero-shot capabilities that can be discovered simply by prompting the model, e.g., placing "TL;DR" (internet slang for Too Long; Didn't Read) at the end of a passage causes the model to generate a summary. Efficiently constructing the right prompt for a given task is difficult and has become an active area of research (Reynolds and McDonell, 2021; Lu et al., 2021; Shin et al., 2020; Jiang et al., 2020a,b). Brown et al. (2020) demonstrated that few-shot learning without fine-tuning is possible with very large language models. Contemporary work has shown it is possible to get smaller models to exhibit few-shot learning behavior using fine-tuning (Hambardzumyan et al., 2021; Gao et al., 2020; Schick and Schütze, 2020a,b,c; Shin et al., 2020), an intermediate learning phase (Ye et al., 2021), or calibration (Zhao et al., 2021), though most assume access to a validation set (Perez et al., 2021). Recent work suggests it may be possible to fine-tune language models in order to improve their zero-shot and few-shot capabilities on a large swathe of tasks (Wei et al., 2021; Zhong et al., 2021).
Surface Form Competition When applying generative models to multiple choice problems, simply choosing the highest-probability answer is problematic because different valid surface forms compete for probability. Indeed, recent work in question answering has demonstrated the importance of considering all multiple choice options together (Khashabi et al., 2020), rather than independently assigning each answer a score and choosing the highest. This strategy is difficult to adapt to left-to-right generative language models, which implicitly choose between all possible strings. Using unsupervised language models pretrained on relatively expansive corpora exacerbates surface form competition, because such models spread probability over a much wider distribution of strings than any given question answering dataset contains.
"What is the most populous nation in North America?" Posed with this question, a language model such as GPT-3 can generate a correct response such as "USA", "United States", or "United States of America" with high probability. While correct strings like this all contribute to the probability of a correct generation, they may have vastly different probabilities: a common string "United States" will be much more likely than rarer forms like "U.S. of A.". In generative scenarios, as long as most of the probability mass goes to valid strings the generation is likely to be valid. This is not the case for multiple choice problems. Given two options, e.g., "USA" and "Canada", GPT-3 will choose the correct answer by probability. However, if we substitute out "USA" for "U.S. of A.", GPT-3 will assign higher probability to "Canada", a less likely answer conceptually, but a much more likely surface form. Beyond this, incorrect generic answers such as "I don't know" are often assigned high probability, relegating the desired answers to the tail of the distribution where softmax is poorly calibrated (Holtzman et al., 2020).
PMI Work in dialogue has used PMI to promote diversity (Zhou et al., 2019; Yao et al., 2017; Li et al., 2016; Mou et al., 2016; Tang et al., 2019). Recently, Brown et al. (2020) used a scoring function resembling PMI_DC for zero-shot question answering, though they only use the string "A:" as a prompt for the unconditional probability estimate, whereas we use a task-specific domain premise (see §3 for details). Furthermore, Brown et al. (2020) only report this scoring method on three datasets (ARC, OpenBookQA, and RACE, included here) out of the more than 20 tested, and do not compare scores with their standard method, averaging log-likelihoods (AVG in this work). In contrast, we report a comprehensive comparison on GPT-3 and GPT-2, and shed light on the underlying issue of surface form competition in §5.

Contextual Calibration Recently, Zhao et al. (2021) describe a new method for calibrating the probabilities of an LM using a learned affine transformation. Though geared towards few-shot learning, the authors devise a clever means of using "content free inputs" for zero-shot learning. Zhao et al. (2021) calibrate for three forms of bias: (1) majority label bias, (2) recency bias, and (3) common token bias. PMI_DC directly compensates for common token bias by dividing by the domain conditional probability of each answer, and performs better than contextual calibration (CC) in the majority of cases.
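As a concrete illustration, the core of contextual calibration can be sketched as follows. This is our reading of Zhao et al. (2021), not their code; the function and argument names are placeholders we introduce here:

```python
# A minimal sketch of contextual calibration (Zhao et al., 2021), as we
# understand it: estimate the model's bias on content-free inputs and
# invert it with a diagonal affine transform (W p + b, with b = 0 in the
# zero-shot case).
import numpy as np

def calibrated_scores(p_input: np.ndarray, p_content_free: np.ndarray) -> np.ndarray:
    """p_input: model probabilities over the closed answer set for a real input.
    p_content_free: the same probabilities averaged over content-free inputs
    such as "N/A" or the empty string."""
    W = np.diag(1.0 / p_content_free)  # upweight answers the model under-predicts a priori
    scores = W @ p_input
    return scores / scores.sum()       # renormalizing does not change the argmax
```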
Prompt Sensitivity Recent work highlights LM sensitivity to inputs and proposes considering paraphrases of the prompt to overcome this (Davison et al., 2019; Jiang et al., 2020b), while also noting that certain trigger tokens (Shin et al., 2020) can strongly affect the output of such models. In this work, we focus on the surface form of possible outputs, but also analyze robustness to different prompts in §4.4.
Interpreting Language Models Language models tend to model selectional preferences and thematic fit (Pantel et al., 2007; Erk et al., 2010) rather than semantic plausibility (Wang et al., 2018). Probability, possibility, and plausibility are distinct (Van der Helm, 2006), but reporting bias (Gordon and Van Durme, 2013) means that language models only model what people are likely to write (on websites that are easily crawled). PMI_DC aims to adjust for these challenges to better measure the underlying agreement between language models and human judgements, but of course is still subject to the limits and biases of the language model used.

Zero-shot Scoring Strategies
This paper does not define any new modeling or fine-tuning methods. Rather, we propose the broad use of PMI_DC scoring for any given model and prompt. PMI_DC compensates for the fact that different correct answers compete for probability, even though only one will be listed as the correct multiple choice option. We begin by describing the two most common methods currently in use.

Standard Methods
Our first baseline is simply selecting the highest-probability option (e.g., the baselines in Jiang et al. (2020b), among others), which we refer to as LM. Given a prompt x (e.g., "The bar closed") and a set of possible answers y_1, ..., y_n (e.g., "it was crowded.", "it was 3 AM."), LM is defined as:

\mathrm{LM}(x, y_i) = P(y_i \mid x)    (1)

However, using length-normalized log-likelihoods (Brown et al., 2020) has become standard due to its superior performance, and is also commonly used in generation (Mao et al., 2019; Oluwatobi and Mueller, 2020). For causal language models, e.g., GPT-2 and GPT-3, Equation 1 can be decomposed as:

P(y_i \mid x) = \prod_{j=1}^{n_i} P(y_i^j \mid x, y_i^{<j})

where y_i^j is the jth token of y_i and n_i is the number of tokens in y_i. The AVG strategy can thus be defined as:

\mathrm{AVG}(x, y_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} \log P(y_i^j \mid x, y_i^{<j})
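To make the two baselines concrete, here is a minimal sketch of LM and AVG scoring using HuggingFace Transformers with GPT-2 (the paper also uses GPT-3 via the OpenAI API); the function and variable names are ours, not from the authors' released code:

```python
# A minimal sketch of LM and AVG scoring with GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def answer_logprobs(prompt: str, answer: str) -> torch.Tensor:
    """Per-token log P(y^j | x, y^{<j}) for the answer tokens only.
    Note the leading space in `answer`: GPT-2 BPE treats " dog" and "dog"
    as different tokens."""
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)
    input_ids = torch.tensor([prompt_ids + answer_ids])
    with torch.no_grad():
        logits = model(input_ids).logits          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # The logits at position t predict token t+1, so the answer tokens are
    # predicted at positions len(prompt_ids)-1 .. seq_len-2.
    start = len(prompt_ids) - 1
    targets = torch.tensor(answer_ids).unsqueeze(1)
    return log_probs[0, start:start + len(answer_ids)].gather(1, targets).squeeze(1)

def lm_score(prompt: str, answer: str) -> float:
    return answer_logprobs(prompt, answer).sum().item()   # log P(y | x), Eq. 1

def avg_score(prompt: str, answer: str) -> float:
    return answer_logprobs(prompt, answer).mean().item()  # length-normalized

x = "The bar closed because"
for y in [" it was crowded.", " it was 3 AM."]:
    print(repr(y), lm_score(x, y), avg_score(x, y))
```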

Domain Conditional PMI
Our core claim is that direct probability is not an adequate zero-shot scoring function, due to surface form competition. A natural solution is to factor out the probability of specific surface forms, which is exactly what Pointwise Mutual Information (PMI) does:

\mathrm{PMI}(x, y) = \log \frac{P(y \mid x)}{P(y)}    (2)

In effect, this is how much more likely the hypothesis ("it was 3 AM.") becomes given the premise ("The bar closed because"); see Figure 2 for the full example. In a multiple-choice setting, where the premise x does not change across hypotheses, this is proportional to P(x|y), i.e., the probability of the premise given the hypothesis. We call this scoring-by-premise, and it is the reverse of LM, P(y|x). We use scoring-by-premise to show the presence of surface form competition in §5.
While Equation 2 estimates how related premise x is to hypothesis y in general, we found that estimates of P(y) vary wildly. GPT-2 and GPT-3 are not trained to produce unconditional estimates of document excerpts, an issue which is exacerbated by the fact that many possible answers are extremely rare in a large scrape of public web pages. This causes the unconditional probability of such answers to be poorly calibrated for the purposes of a given task.
We are specifically trying to measure P(y) in a given domain, e.g., for the "because" relation in our running example, shown in Figures 2 & 3. To quantify this, we propose Domain Conditional PMI:

\mathrm{PMI}_{\mathrm{DC}}(x, y) = \log \frac{P(y \mid x, \mathrm{domain})}{P(y \mid \mathrm{domain})}

or how much x tells us about y within a domain. Typically, P(y|x, domain) = P(y|x), because the premise x usually implies the domain, e.g., "The bar closed because" sets the model up to predict an independent clause that is the cause of some event, without further representation of the domain. In order to estimate P(y|domain), the probability of seeing hypothesis y in a given domain, we use a short domain-relevant string x_domain, which we call a "domain premise", usually just the ending of the conditional premise x. For example, to predict a causal relation as in Figure 2, we use x_domain = "because" and thus divide by P(y|because): how likely y is to be a "cause". For examples of each template see Appendix B.
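In code, PMI_DC reduces to a difference of two conditional log probabilities. Continuing the sketch from the Standard Methods block (reusing the hypothetical lm_score helper defined there):

```python
# A minimal sketch of PMI_DC scoring; the domain premise "because"
# follows the paper's running example.
def pmi_dc_score(prompt: str, answer: str, domain_premise: str) -> float:
    # log P(y | x, domain) - log P(y | domain),
    # assuming P(y | x, domain) = P(y | x) as in the text.
    return lm_score(prompt, answer) - lm_score(domain_premise, answer)

x = "The bar closed because"
for y in [" it was crowded.", " it was 3 AM."]:
    print(repr(y), pmi_dc_score(x, y, "because"))
```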

Non-standard Baselines
Unconditional We also compare to the unconditional (in-domain) estimate as a scoring function:

\mathrm{UNC}(y) = P(y \mid x_{\mathrm{domain}})

We refer to this as UNC. It ignores the premise completely, only using a domain premise x_domain (e.g., using P(y|because) as the score). Yet it is sometimes competitive, for instance on BoolQ (Clark et al., 2019). UNC is a sanity check on whether zero-shot inference is actually using the information in the question to good effect.
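In the same sketch, UNC is simply the denominator term of PMI_DC taken in isolation (again reusing the hypothetical lm_score helper):

```python
# UNC ignores the question entirely and scores each answer only against
# the domain premise; a one-line sketch on top of lm_score.
def unc_score(answer: str, domain_premise: str) -> float:
    return lm_score(domain_premise, answer)  # log P(y | x_domain)
```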
Contextual Calibration Finally, we compare to the reported zero-shot numbers of Zhao et al. (2021). Contextual Calibration adjusts LM with an affine transform to make a closed set of answers equally likely in the absence of evidence. It thus requires computing parameters W and b for a number of "content free inputs" and then averaging these weights; see Zhao et al. (2021) for details. In contrast, PMI_DC requires nothing but a human-written template (as all zero-shot methods do, including Contextual Calibration), can be computed as the difference of two log probabilities, and is naturally applicable to datasets where the set of valid answers varies between questions.

Setup
We use GPT-2 via the HuggingFace Transformers library (Wolf et al., 2020) and GPT-3 via OpenAI's beta API. We do not fine-tune any models, nor do we alter their output. See Appendix B for examples from each dataset in our templated format.

Datasets
We report results on 16 splits of 13 datasets, and briefly describe each dataset here.
Continuation These datasets require the model to select a continuation of previous text, making them a natural way to test language models; they include COPA (Roemmele et al., 2011) and HellaSwag (Zellers et al., 2019).

Question Answering ARC Easy and ARC Challenge (Clark et al., 2018) are standardized tests described as "natural, grade-school science questions," with the "Easy" split found to be solvable by either a retrieval or word co-occurrence system, and the rest of the questions put in the "Challenge" split. Open Book Question Answering (OBQA) (Mihaylov et al., 2018) is similar to both of these, but was derived using (and intended to be tested with) a knowledge source (or "book") available; we do not make use of the given knowledge source, following Brown et al. (2020). Finally, CommonsenseQA (CQA) (Talmor et al., 2019) leverages CONCEPTNET (Speer et al., 2017) to encourage crowd workers to write questions with challenging distractors. We report development set numbers on CQA because their test set is not public.
Open Set vs. Closed Set Datasets The above datasets are all "open set" in that multiple choice answers may be any string. Below we describe "closed set" datasets with a fixed set of answers.

Table 2: Percentage of datasets for which a given method produced the best score or tied with other methods, aggregated over each model size. The first four rows use GPT-2 (full data available in the Appendix), while the final four rows use GPT-3 and summarize data from Table 1. Since ties are included, rows sometimes sum to more than 100. CC is only measured on the 5 datasets we use where Zhao et al. (2021) also report accuracies.

Results
We report zero-shot results for GPT-3 in Table 1, with GPT-2 results available in Appendix A. A summarized view is shown in Table 2, which aggregates the percentage of splits where a given method achieves the best score or ties for first place. In this summarized view it is clear that PMI_DC consistently outperforms other scoring methods when assessed over a variety of datasets. The smallest margin (in number of datasets won or tied) between PMI_DC and the best competing method is on GPT-3 175B with AVG, but even that margin is over 40 percentage points. This does not imply that PMI_DC is always better or that it will be better by a large margin, though it often is. It does suggest that PMI_DC is a significantly better bet on a new dataset.

Robustness
To verify that these trends hold across different prompts, we report the mean and standard deviation over fifteen different prompts for SST-2. As Table 3 shows, PMI_DC always maintains the highest mean, often by a hefty margin. Scores are lower than in Table 1 because many of the prompts used are optimized for few-shot rather than zero-shot scoring.

Table 4: The mean and standard deviation for 5 randomly sampled sets of 4 examples used for few-shot inference. We include a closed answer dataset (SST-2) and an open answer dataset (CQA). For SST-2, AVG is equivalent to LM due to using single-token answers.

                 SST-2                           CQA
Model   UNC       LM/AVG      PMI_DC      UNC       LM         AVG        PMI_DC
2.7B    49.9 ±0   88.1 ±4.9   87.7 ±5.5   16.6 ±0   43.0 ±1.7  45.6 ±1.9  50.4 ±1.1
6.7B    49.9 ±0   92.9 ±2.1   79.8 ±6.9   16.9 ±0   52.3 ±1.4  53.4 ±1.0  56.5 ±1.6
13B     49.9 ±0   85.4 ±9.0   86.9 ±7.5   16.7 ±0   58.4 ±2.0  59.3 ±1.5  63.4 ±1.4
175B    49.9 ±0   89.9 ±5.5   95.5 ±0.7   16.5 ±0   69.1 ±1.9  69.4 ±0.8  72.0 ±0.9

Few-shot
While our focus in this paper is on zero-shot scoring, PMI_DC is just as applicable to few-shot scenarios. In Table 4 we report 4-shot results on one closed set dataset (SST-2) and one open set dataset (CQA). We show the mean of 5 randomly sampled sets of 4 examples that are used to prime the model for the task, along with standard deviations. The overall trend on both datasets clearly favors PMI_DC, though LM is superior for two models on SST-2.

Removing Surface Form Competition
What if we used the probability of the premise given the hypothesis, P(x|y_i), instead? While we are still measuring the probability of a surface form (e.g., "the bar closed."), it is the same surface form across the different options ("It was crowded so", "It was 3 AM so"), eliminating surface form competition. Two hypotheses y_i and y_j can now both attain high scores if they are both correct answers, by causing x to be likely. We call this scoring-by-premise.
Causal language models like GPT-3 cannot measure this directly, because they are only capable of conditioning on past tokens to predict future tokens. We therefore exploit the structure of the COPA dataset to create "COPA Flipped" via a simple transformation, shown in Figure 3. COPA consists of cause and effect pairs (CAUSE so EFFECT, and EFFECT because CAUSE). In the original dataset, whatever comes second (either CAUSE or EFFECT) has two options that a model must choose between. These can be reversed by switching CAUSE and EFFECT, then substituting the natural inverse relation ("because" → "so" and "so" → "because").

Table 5 shows scores on COPA and COPA Flipped side-by-side. On COPA Flipped, everything except UNC produces the exact same result. This is because flipping the hypothesis and premise means that it is the context that changes and not the continuation. LM, AVG, and PMI_DC only differ from each other over different continuations, not over different contexts for the same continuation.
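To make the flipping transformation above concrete, here is a minimal sketch; the field layout is our assumption rather than the official COPA format, and recasing/punctuation repair is elided:

```python
# A minimal sketch of the COPA -> COPA Flipped transformation: swap the
# premise and hypothesis and invert the connective, so that the scored
# continuation is the (fixed) premise rather than the (varying) hypothesis.
def flip(premise: str, hypothesis: str, connective: str) -> tuple[str, str, str]:
    inverse = {"so": "because", "because": "so"}[connective]
    return hypothesis, inverse, premise

# Original: score P("it was 3 AM." | "The bar closed because")
# Flipped:  score P("the bar closed." | "It was 3 AM so")
print(flip("The bar closed", "it was 3 AM", "because"))
```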

Results
On COPA Flipped, all methods generally perform similarly to PMI_DC on the unflipped version. This is because surface form competition has been eradicated: we are measuring how well different prefixes condition a model to predict a fixed continuation, rather than which continuation has the highest probability. Unlike LM, where different answers compete for probability, in COPA Flipped it only matters how likely each answer can make the question. This is not subject to surface form competition because there is only one string being scored, so it is not competing with any other strings for probability mass.
Not all datasets are so easily flippable, so manually flipping individual questions to remove surface form competition is not a generally applicable strategy. Luckily, PMI_DC is symmetric:

\mathrm{PMI}_{\mathrm{DC}}(x, y) = \log \frac{P(y \mid x, \mathrm{domain})}{P(y \mid \mathrm{domain})} = \log \frac{P(x \mid y, \mathrm{domain})}{P(x \mid \mathrm{domain})}

In theory, the answer selected by PMI_DC should be the same between COPA and COPA Flipped, as PMI is symmetric, though we expect some differences due to "so" and "because" not being perfect inverses and to shuffled references. In practice, PMI_DC does better on COPA than COPA Flipped, likely due to more natural phrasing in the original dataset.
These results suggest that surface form competition is the primary cause of the depressed performance of LM and AVG in comparison to PMI_DC.

In-depth Example
Scoring-by-Premise Improves LM Figure 3 shows an example of transforming one question from COPA to COPA Flipped. In the example depicted, when we use GPT-3 to calculate P, we get:

P(y_1 \mid x) > P(y_2 \mid x)

which is wrong, since bars usually close at fixed, late-night closing times, rather than because of being overcrowded. However, we also find that

P(\hat{y} \mid \hat{x}_1) < P(\hat{y} \mid \hat{x}_2)

indicating that scoring-by-premise causes the right answer to be selected, and that PMI_DC successfully simulates scoring-by-premise in this example.

Stability Over Valid Answers
To see how scoring-by-premise allows multiple correct options to achieve high scores, consider the slightly perturbed y'_2 and x̂'_2 in Figure 3 ("it was 3 AM" → "it was 3:30AM"). The inequalities shown above still hold when substituting y_2 → y'_2 and x̂_2 → x̂'_2, with the key difference that the conditional probability of y'_2 is much lower:

\log P(y_2 \mid x) \approx -16
\log P(y'_2 \mid x) \approx -20

This is undesirable, as both y_2 and y'_2 are correct answers with similar meanings. Yet, when scoring-by-premise, the conditional probability of ŷ is stable when substituting x̂_2 → x̂'_2:

\log P(\hat{y} \mid \hat{x}_2) \approx \log P(\hat{y} \mid \hat{x}'_2)

This suggests that eliminating surface form competition allows different correct answers to score well, as they are no longer competing for probability mass. Specifically, "it was 3 AM" and "it was 3:30AM" score wildly differently in COPA but nearly identically in COPA Flipped.
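This style of comparison is easy to reproduce with the lm_score sketch from the Standard Methods block (with GPT-2, so absolute values will differ from the paper's GPT-3 numbers):

```python
# Conditional probabilities of near-paraphrases can differ wildly...
x = "The bar closed because"
print(lm_score(x, " it was 3 AM."))     # paraphrase 1
print(lm_score(x, " it was 3:30AM."))   # paraphrase 2, typically much lower
# ...while scoring-by-premise is far more stable across the paraphrases:
print(lm_score("It was 3 AM so", " the bar closed."))
print(lm_score("It was 3:30AM so", " the bar closed."))
```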

Analysis
Failure Cases There are three datasets where PMI_DC does not consistently outperform other methods: HellaSwag, ARC Easy, and BoolQ. Surprisingly, each is dominated by a different method.
HellaSwag is most amenable to AVG. On examination, we find that HellaSwag is more focused on the internal coherence of the hypotheses than on external coherence, i.e., how well a premise and hypothesis match. This is likely due to HellaSwag being generated by GPT-2 (Radford et al., 2019) and filtered with BERT, as it contains relatively on-topic but intrinsically strange hypotheses that humans can distinguish from natural data.
ARC Easy yields the highest scores to LM, i.e., selecting the highest-probability option. Clark et al. (2018) note that ARC Easy questions can be solved by a retrieval or word co-occurrence baseline, while examples that were answered incorrectly by both were put into the Challenge split. This suggests a bias towards a priori likely phrases. Manual inspection reveals many stock answers, e.g., "[clouds are generated when] ocean water evaporates and then condenses in the air," supporting our hypothesis.
Finally, BoolQ, a reading comprehension dataset in which all answers are either "yes" or "no", is best solved by the unconditional baseline. This is because the dataset presents truly complex questions that require more reasoning than GPT-2 or GPT-3 are capable of out of the box. Indeed, none of the methods reported do better than the majority baseline, except PMI_DC with the largest GPT-3 model.
Why does length normalization work? Past work offers little explanation for why AVG should be a successful strategy, other than the intuition that estimates are strongly length-biased and require compensation. Length bias may be caused by the final softmax layer of current language models assigning too much probability mass to irrelevant options at each time-step, as noted in open-ended generation, character-level language modeling, and machine translation (Holtzman et al., 2020; Al-Rfou et al., 2019; Peters et al., 2019).
Another (not mutually exclusive) argument is that length normalization may account for unconditional probability in a similar way to PMI_DC. Length normalization is often measured over Byte Pair Encoding (BPE) tokens (Sennrich et al., 2016), and BPE tends to produce vocabularies where most tokens are equally frequent (Wang et al., 2020). Recent evidence suggests that language is approximately uniformly information dense (Levy, 2018; Levy and Jaeger, 2007; Jaeger, 2006). As such, length in BPE tokens may correspond roughly to a unigram estimate of log-probability, supposing that BPE tokens have approximately uniform unigram frequency. The adjustment made by AVG is still somewhat different than that of PMI_DC (division of log terms rather than subtraction), but could have a similar effect, if length and probability correlate.
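To spell out this reasoning, suppose each BPE token carries roughly constant unigram information c (an idealization we introduce here, not a claim from the cited work); then length normalization approximately divides by an unconditional log probability where PMI_DC subtracts one:

```latex
% Idealized sketch: every BPE token has unigram log-probability about -c.
\log P_{\mathrm{unigram}}(y_i) \approx -c \, n_i
\quad\Longrightarrow\quad
\mathrm{AVG}(x, y_i) = \frac{\log P(y_i \mid x)}{n_i}
\approx -c \cdot \frac{\log P(y_i \mid x)}{\log P_{\mathrm{unigram}}(y_i)},

% ...whereas PMI_DC subtracts an unconditional (in-domain) log probability:
\mathrm{PMI}_{\mathrm{DC}}(x, y_i)
= \log P(y_i \mid x, \mathrm{domain}) - \log P(y_i \mid \mathrm{domain}).
```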

Discussion
Language models are density estimators that assign probability to every possible string, but there are often many strings that could represent a given idea equally well. Our key observation is that a generative model assigning probability to a string that represents an option is not equivalent to it selecting the concept that the option corresponds to. We expect surface form competition anywhere generative models are used and more than one string could represent the same concept.
PMI_DC aligns the predictions being made by the model more closely with the actual task posed by multiple choice questions: "choose the hypothesis that explains the premise" rather than "generate the exact surface form of the hypothesis". From this perspective, PMI_DC does not go far enough, because the model still cannot consider the given set of options altogether when selecting its choice. This matters when answers interact with each other, e.g., "all of the above".

Conclusion
We conduct a large-scale comparison of standard and recent scoring functions for zero-shot inference across all GPT-2 and GPT-3 models. We show that PMI_DC consistently outperforms previous scoring functions on a wide variety of multiple choice datasets. We also argue that compensating for surface form competition is the cause of this boost, by demonstrating that other methods work just as well as PMI_DC when surface form competition is eliminated. In future work, we would like to explore how surface form competition affects generation, as we hypothesize that it may be the cause of overly generic outputs under high model uncertainty.

Appendix A

Table 6 shows the results for zero-shot multiple choice using GPT-2.

Table 6: Comparison of scoring algorithms when using GPT-2 for zero-shot inference on multiple choice questions.