Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy

When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren't among the given answer choices. Spreading probability mass across multiple surface forms with identical meaning (such as "bath" and "bathtub") is thought to cause an underestimation of a model's true performance, referred to as the "surface form competition" (SFC) hypothesis. This has motivated the introduction of various probability normalization methods. However, many core questions remain unanswered. How do we measure SFC? Are there direct ways of reducing it, and does doing so improve task performance? We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time. We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example. We show this method eliminates the impact of SFC in the majority of instances. Our experiments on three diverse datasets and six LMs reveal several additional surprising findings. For example, both normalization and prompting methods for reducing SFC can be ineffective or even detrimental to task performance for some LMs. We conclude with practical insights for effectively prompting LMs for multiple-choice tasks.


Introduction
Large pre-trained autoregressive language models (LMs) have shown success not only on generation, but also on classification and multiple-choice (MC) tasks with pre-specified answer choices. To succeed on such tasks, one must pay attention to what's an acceptable answer choice and what's not, i.e., understand the task format. This is accomplished relatively easily in the pretrain-and-finetune paradigm (Dai and Le, 2015; Howard and Ruder, 2018; Raffel et al., 2020; Lewis et al., 2020), via task-specific fine-tuning. However, in the zero- and few-shot prompting paradigm, in which the model is provided only a description or a handful of examples of the target task in the input, it's harder to ensure that the model generates only one of the answer choices associated with the given MC question. Most prior work tries to circumvent this issue by ignoring generated predictions and instead selecting the answer choice that has the highest probability under the model ("sequence scoring"; Trinh and Le, 2018; Radford et al., 2019; Brown et al., 2020, i.a.). This helps to some extent, by ignoring any attention the model pays to tokens unrelated to the answer choices. However, the problem persists as the model's probability mass can still be split among various strings or surface forms that are semantically equivalent to a given answer choice. Holtzman et al. (2021) propose that this phenomenon can result in underestimates of model performance, and refer to it as the surface form competition ("SFC") hypothesis. Motivated by this, they propose a probability normalization method, PMI DC, to address the SFC issue, thereby (according to the SFC hypothesis) increasing model performance. In the same spirit, other probability normalization methods have been proposed (Zhao et al., 2021; Malkin et al., 2022), and their merit assessed via end task accuracy.
However, accuracy improvements may be attributable to multiple sources. Without a metric to directly measure SFC, it is difficult to assess whether the increased accuracy is, in fact, a consequence of reduced SFC, or something else.
To address this gap, we propose a mathematical formalism for studying SFC and use it to investigate the following four research questions.
1. How can we measure SFC? We propose to measure total probability mass on answer choices (abbreviated as PMA), and use it to upper bound the extent and impact of SFC (§4).
2. How can we reduce SFC's effect? Low PMA is a consequence of an inherently underconstrained output space that arises from the model failing to understand the task format. We use this observation to explain a simple way of increasing PMA: in-context learning with prompts containing answer choices (§5.1 and §5.2). We demonstrate the success of this approach across 6 LMs and 3 MC datasets (§7.1). We find, for instance, that when using this prompt with instruction-tuned LMs and just 2 in-context examples, on all 3 datasets, SFC simply couldn't have affected the prediction in more than 5% of instances.
3. Does increasing PMA improve accuracy? Surprisingly, not always! We provide an upper bound on the maximum effect an increase in PMA can have on task accuracy (§4.1). We find empirically (Fig. 1 and §7.2) that the alignment between probability mass and accuracy isn't as clear cut as assumed in prior work (Holtzman et al., 2021); it depends heavily on the model. These experiments also reveal that, contrary to common wisdom among researchers, encouraging models to produce answer choices by including them in the prompt can counter-intuitively be detrimental to task performance for LMs trained only on the next-token prediction objective.
4. When do probability normalization methods improve accuracy? While the direct effect of PMI DC on SFC is not easy to measure (§3.2), we extend prior work by studying when PMI DC, which was motivated by SFC and is complementary to our approach, improves accuracy on a wider set of prompts and models. We find, consistent with Holtzman et al. (2021), that PMI DC increases accuracy when models are not shown answer choices. However, this setting generally also corresponds to low probability assigned to answer choices. On the other hand, for the LMs that benefit from seeing answer choices, which results in high probability assigned to them, PMI DC scoring generally reduces accuracy. This indicates that as instruction-tuned LMs become more commonplace, PMI-based scoring methods, inspired by intuitions behind SFC (Holtzman et al., 2021), will likely provide less utility.
We conclude by leveraging these insights to provide practical recommendations on how to maximize LM accuracy on multiple-choice tasks when using zero- and few-shot prompting.

Related Work
While various methods have been proposed to improve the accuracy of sequence scoring using probability normalization methods (Brown et al., 2020; Zhao et al., 2021; Holtzman et al., 2021; Malkin et al., 2022), none investigate a direct metric for surface form competition and whether their methods alleviate it. To the best of our knowledge, we are the first to systematically study the role of in-context examples and prompt format on PMA, as well as how PMA relates to accuracy. Holtzman et al. (2021) show PMI DC (Eq. (4)) improves over sequence scoring accuracy in most cases for GPT-2 and GPT-3 models of various sizes in 0-shot and 4-shot settings. Somewhat contradictorily, Brown et al. (2020) find that using a version of PMI DC where the denominator is P_θ(x | "Answer: " or "A: ") improves task performance on the validation set for only 5 out of 17 datasets investigated. Zhao et al. (2021) propose to fit a linear weight matrix and bias vector for classification tasks with a shared label set, such that the labels all have equal probability prior to observing x. Malkin et al. (2022) add hyperparameters to Eq. (4) that are fit on a dataset's validation set, showing further gains at test time. Min et al. (2022) propose to score inputs given answer choices, which is mathematically equivalent to PMI DC (§3.2). This results in lower variance and better worst-case accuracy on multiple-choice tasks in 0- and few-shot settings for GPT-2. Liang et al. (2022) investigate the effect of showing answer choices in the prompt and applying PMI-based scoring (though not the combination of the two). They find that the success of one method over the other tends to vary by dataset and model. Our results elucidate further that the overall capability of an LM may be a key factor in whether PMI-based scoring improves accuracy or not.

Preliminaries
Given a task input x, a set of answer choices L, and the correct answer y* ∈ L, the goal of a multiple-choice classification task is to correctly select y*.
x is often specified as a question q and, optionally, answer choices L concatenated to q as one string.[4] Let M be a generative model architecture with learned parameters θ and space of natural language outputs V. In LMs, |V| ≫ |L|, so generating a prediction ŷ from V without constraints does not ensure that it is one of the given answer choices (i.e., that ŷ ∈ L). Instead, we can use a sequence scoring approach to score each answer choice:

ŷ_Seq-Sc = argmax_{y ∈ L} P_θ(y | x),     (1)

where P_θ(y | x) is the probability M_θ assigns to output y given x.[5] This is a common approach for performing classification with generative LMs, as it ensures ŷ_Seq-Sc ∈ L. This will be our prediction setup.
[4] For instance, if x is a true/false question, L may be {True, False}. For a multiple-choice question, L may be the set of (string) answers, their labels such as A/B/C/D, or both, depending on the format used to pose the task to an LM. [5] For multi-token outputs y = y_1 y_2 ... y_k, we compute P_θ(y | x) as ∏_{i=1}^{k} P_θ(y_i | x, y_1 ... y_{i−1}).
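To make this scoring setup concrete, here is a minimal sketch of sequence scoring (Eq. (1)) with an off-the-shelf HuggingFace causal LM; the model name and prompt are illustrative placeholders rather than the exact models and prompts used in our experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model only; the paper's experiments use GPT-3, OPT 30B, and FLAN-T5 instead.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """log P_theta(answer | prompt), summing over the answer's tokens (footnote [5]).
    Assumes tokenizing the prompt alone and prompt+answer gives a stable boundary."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for i in range(prompt_len, full_ids.shape[1]):
        # the token at position i is predicted from the logits at position i - 1
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

def sequence_scoring(prompt: str, choices: list[str]) -> str:
    """ŷ_Seq-Sc = argmax over the answer choices of P_theta(y | x)."""
    return max(choices, key=lambda c: answer_logprob(prompt, " " + c))

print(sequence_scoring("kinetics change stored energy into",
                       ["snacks", "naps", "kites", "warmth"]))
```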

Surface Form Competition (SFC)
An LM's vocabulary contains many different strings, or surface forms, representing the same (or similar) semantic concept, but typically only one representative string per concept is among the given answer choices L. Formally, for each answer choice ℓ ∈ L, there exists a set of synonyms G_ℓ (containing ℓ) that may be "stealing" probability mass away from ℓ, while ℓ is the only surface form in G_ℓ that is considered when computing accuracy via Eq. (1). One can quantify the amount of the resulting surface form competition as:

SFC_θ(L, x) = Σ_{ℓ ∈ L} Σ_{z ∈ G_ℓ \ {ℓ}} P_θ(z | x).     (2)

We refer to G_ℓ as a semantic equivalence class, following Kuhn et al. (2023). The hypothesis of Holtzman et al. (2021) is that unresolved SFC results in an underestimate of model performance. They provide an example of how SFC may lead to incorrect predictions, which we extend and adapt in Fig. 2 (left): when LMs distribute probability mass across surface forms such that P_θ(y* | x) < P_θ(ℓ | x) for some incorrect ℓ ∈ L, the model's prediction will be considered incorrect even if the total probability placed by the model on the correct concept G_{y*} is higher than what it places on incorrect concepts.
To circumvent this issue, one may consider the notion of an "SFC-free" prediction: compute the most likely option among semantic equivalence classes, rather than among specific surface forms:

ŷ_SFC-free = argmax_{ℓ ∈ L} P_θ(G_ℓ | x),     (3)

where P_θ(G_ℓ | x) = Σ_{z ∈ G_ℓ} P_θ(z | x). A limitation of this formulation, however, is that it is only possible to compute ŷ_SFC-free if the full membership of each G_ℓ is known, which is rarely the case. LM vocabularies typically contain many tens of thousands of tokens, many of which may be partial synonyms. This motivates the need for practical workarounds.
One practical workaround is PMI DC scoring (Holtzman et al., 2021), which rescores each answer choice by how much more likely it is given the input x than given a task-neutral "domain context" d:

ŷ_PMI-DC = argmax_{ℓ ∈ L} P_θ(ℓ | x) / P_θ(ℓ | d).     (4)

Intuitively, PMI DC measures the causal effect of the input x on the probability assigned to each answer choice ℓ, and selects ŷ as the answer choice on which x has the largest effect. The method is an alternative scoring function; it doesn't change the underlying probabilities of the LM, P_θ. It is unclear when ŷ_PMI-DC = ŷ_SFC-free, i.e., when Eqs. (3) and (4) lead to the same, SFC-free prediction. Holtzman et al. note that PMI DC is mathematically equivalent to argmax_{ℓ ∈ L} P_θ(x | ℓ). This, in turn, should intuitively not be far from argmax_{ℓ ∈ L} P_θ(x | G_ℓ) when ℓ is not directly mentioned in the question (which is the setting used in PMI DC). In this case, the competition among surface forms within G_ℓ would be alleviated. However, there is still no a priori reason for either argmax_{ℓ ∈ L} P_θ(x | ℓ) or argmax_{ℓ ∈ L} P_θ(x | G_ℓ) to be the same as ŷ_SFC-free. Moreover, this view reveals a different competition, namely, among various questions x whose answer (according to the model) is ℓ. Specifically, a choice that the model thinks is more "popular" (i.e., the answer to many questions) will receive an artificially lower PMI DC score. Thus, now different questions (rather than different surface forms) compete for each answer choice.
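As an illustration (not the authors' released implementation), PMI DC can be applied as a rescoring step over probabilities that have already been computed. The numbers below are hypothetical and chosen only to show how a high-prior choice can win under sequence scoring but lose under PMI DC.

```python
import math

def pmi_dc_prediction(logp_given_x: dict[str, float],
                      logp_given_domain: dict[str, float]) -> str:
    """ŷ_PMI-DC = argmax_ℓ [log P_theta(ℓ | x) - log P_theta(ℓ | d)] (Eq. (4))."""
    return max(logp_given_x, key=lambda c: logp_given_x[c] - logp_given_domain[c])

# Hypothetical log-probabilities for two answer choices.
logp_x = {"bath": math.log(0.30), "puddle": math.log(0.20)}  # numerator: conditioned on x
logp_d = {"bath": math.log(0.10), "puddle": math.log(0.01)}  # denominator: domain context d

print(max(logp_x, key=logp_x.get))        # "bath": sequence scoring favors the a priori frequent string
print(pmi_dc_prediction(logp_x, logp_d))  # "puddle": the input x raised its probability the most
```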
How can we measure SFC?
Prior work has solely used the task accuracy metric to evaluate approaches geared towards resolving SFC. However, it is unclear whether task accuracy is an effective measure of the amount of SFC present. In fact, as we will show later, task accuracy is often not correlated with the amount of SFC.
While it is difficult to measure SFC (Eq. (2)) directly, we propose bounding it by considering the model M_θ's probability mass on answer choices, or PMA, defined as follows:

PMA_θ(L, x) = Σ_{ℓ ∈ L} P_θ(ℓ | x).     (5)

We assume no surface form in L is a prefix of another, in which case PMA_θ(L, x) ≤ 1 (see Appendix A.5 for a proof and empirical verification).
Intuitively, if a model is properly trained or instructed, it would place all probability mass on L, resulting in PMA_θ(L, x) = 1. However, if SFC exists, we would observe PMA_θ(L, x) < 1. In fact, since every "stolen" synonym in Eq. (2) lies outside L, the probability mass missing from L upper bounds the competition:

SFC_θ(L, x) ≤ 1 − PMA_θ(L, x).     (6)
The formulation of SFC as a measurable quantity enables quantifying the maximum amount by which it may impact a prediction. Specifically, the probability mass that does not fall on L cannot affect the model's final prediction if it is less than the difference in probability between the highest-probability answer choice, ŷ, and the second-highest-probability answer choice, y_2 ∈ L.
The right-hand side of Fig. 2 illustrates this principle. For example, if the probability of ŷ, "whirlpool bath", is 0.55 and the probability of y_2, "puddle", is 0.35, then PMA = 0.9 and the remaining probability mass is 0.1. Even if all of this remaining probability mass were on synonyms of "puddle", the probability of "puddle" would only increase to 0.45 should SFC be fully resolved, which would not flip the prediction since it is still less than 0.55.
Combining this observation with Eq. (6), SFC simply cannot affect the output of M_θ on x when:

1 − PMA_θ(L, x) < P_θ(ŷ | x) − P_θ(y_2 | x).     (7)

Thus, one can completely remove the impact of SFC on a model's accuracy (i.e., achieve ŷ_Seq-Sc = ŷ_SFC-free) by raising PMA high enough relative to the gap between the probabilities of ŷ and y_2; SFC doesn't have to be fully resolved (i.e., one need not push all the way to PMA = 1).
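The bound is simple to apply in code. The sketch below computes PMA (Eq. (5)) and checks the sufficient condition in Eq. (7); the probabilities reproduce the worked example above, and any remaining answer choices are assumed to carry negligible mass.

```python
def pma(choice_probs: dict[str, float]) -> float:
    """PMA_theta(L, x): total probability mass on the given answer choices (Eq. (5))."""
    return sum(choice_probs.values())

def sfc_cannot_flip(choice_probs: dict[str, float]) -> bool:
    """Eq. (7): the leftover mass 1 - PMA is smaller than the gap between the top two
    choices, so no reallocation of that mass to synonyms could change the argmax over L."""
    top, second = sorted(choice_probs.values(), reverse=True)[:2]
    return 1.0 - pma(choice_probs) < top - second

probs = {"whirlpool bath": 0.55, "puddle": 0.35}  # remaining choices assumed ~0 here
print(pma(probs))              # 0.9
print(sfc_cannot_flip(probs))  # True: leftover 0.1 < gap 0.2, so SFC cannot affect the prediction
```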
How can SFC be reduced?
The quantities used in PMI DC do not represent a valid probability distribution as they may exceed 1, making it difficult to compute our proposed metric PMA_θ(L, x). Is there a more straightforward way to equate ŷ_Seq-Sc and ŷ_SFC-free?

Using In-Context Examples
One path forward is to somehow directly constrain the model M_θ such that P_θ(G_ℓ | x) = P_θ(ℓ | x) for all ℓ ∈ L, i.e., ensure that the answer choice ℓ is the only synonym in G_ℓ to which M_θ assigns a nonzero probability mass. This, we posit, will occur naturally when LMs are properly constrained or instructed (see Fig. 1, left plot, right point).
One means to achieve this is to condition the predictions of M_θ not only on x but also on some in-context examples e_0, ..., e_k: ŷ_ICE = argmax_{ℓ ∈ L} P_θ(ℓ | x; e_0, ..., e_k). Given that in-context examples are already widely used in practice, this technique is simple and straightforward to implement. Additionally, it allows one to easily compute PMA_θ(L, x) for measuring the extent of SFC. In §6, we demonstrate empirically that with effective conditioning (prompt format and number of in-context examples), using in-context examples can significantly reduce SFC, and sometimes even completely resolve it by satisfying Eq. (7).
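A minimal sketch of ŷ_ICE follows, assuming the answer_logprob helper from the sequence-scoring sketch in the Preliminaries; the render function is a stand-in for any of the prompt formats in §6.3, and the "###" separator mirrors the templates in Appendix A.3.

```python
from typing import Callable, Optional

# render(question, choices, answer) -> one prompt block; answer=None leaves it blank.
Renderer = Callable[[str, list[str], Optional[str]], str]

def icl_prediction(render: Renderer,
                   demos: list[tuple[str, list[str], str]],
                   question: str,
                   choices: list[str]) -> str:
    """ŷ_ICE = argmax over ℓ in L of P_theta(ℓ | x; e_0, ..., e_k)."""
    blocks = [render(q, c, a) for q, c, a in demos]  # in-context examples e_0, ..., e_k
    blocks.append(render(question, choices, None))   # the test instance x
    prompt = "\n###\n".join(blocks)                  # "###" separator, as in Appendix A.3
    return max(choices, key=lambda c: answer_logprob(prompt, " " + c))
```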

Prompting With Answer Choices
A key design decision when choosing which format to use to specify x (and optionally in-context examples e_0, ..., e_k) is whether to provide the model only the question q or also the answer choices L.
Our PMA metric can be used to provide insight into this, by helping disentangle the contribution that each of q and L makes to the task accuracy as well as to reducing surface form competition.
Intuitively, conditioning the prediction on L makes the model aware of what's an answer choice and what's not. It can thus push the model towards the specific surface forms contained in L, without necessarily affecting model accuracy. This, by definition, directly increases the probability mass on answer choices. One can empirically quantify the effect of exposure to L by considering the gain one observes in PMA and in accuracy when going from P_θ(ℓ) to P_θ(ℓ | L).
On the other hand, one would expect that conditioning the prediction on q pushes the model towards the correct semantic concept, i.e., the semantic equivalence class G_* of the correct answer. However, not knowing which specific surface form ℓ* appears in both G_* and L, the model has no reason to prefer ℓ* over other equivalent surface forms ℓ ∈ G_* \ {ℓ*}. Thus, conditioning on q alone can increase accuracy by increasing the probability mass on G_*, but it does not resolve SFC within G_*. We can, again, measure this by considering the gain in PMA and accuracy when going from either P_θ(ℓ) to P_θ(ℓ | q) or from P_θ(ℓ | L) to P_θ(ℓ | q, L).

Models
We experiment with 6 models, described below.
Vanilla LMs These are models that are (to the best of publicly-available knowledge) only trained on the next-token prediction task. We experiment on two GPT-3 base models (Brown et al., 2020), curie (~6.7B parameters) and davinci (~175B parameters), and one model whose weights are publicly available, OPT 30B (Zhang et al., 2022).

LMs with Further Fine-Tuning We study two instruction-tuned (Mishra et al., 2022, i.a.) models: FLAN-T5 XXL (~11B parameters; Chung et al., 2022), and the "original" InstructGPT model, GPT-3 davinci-instruct-beta (~175B parameters; Ouyang et al., 2022). We additionally test one "state of the art" model, GPT-3 text-davinci-003 (unknown # parameters). FLAN-T5 is based on the T5 architecture (Raffel et al., 2020) and its weights are publicly available. It has demonstrated comparable performance to GPT-3 davinci despite being ~16x smaller. We include davinci-instruct-beta to study the effect of supervised instruction tuning on a model of identical scale to davinci-base that is also associated with a publicly-available research paper. text-davinci-003 is (along with text-davinci-002) a state-of-the-art model according to the HELM benchmark (Liang et al., 2022). See Appendix A.1 for further details.

Tasks
We test on three challenging multiple-choice tasks that are open-vocabulary (i.e., each instance has a unique set of answer choices). Examples of the tasks are given in Appendix A.3; see also A.2.
OpenbookQA (Mihaylov et al., 2018) is a 4-way multiple-choice elementary-level science question-answering task. Random accuracy is 25%. The test set has 500 instances.

CommonsenseQA v1.11 (Talmor et al., 2019) is a 5-way multiple-choice commonsense reasoning task. Random accuracy is 20%. The test set is not publicly available; we use the first 500 instances of the validation set. Both OpenbookQA and CommonsenseQA were explicitly included in the training data of FLAN-T5.

MMLU (Hendrycks et al., 2021), or the "Massive Multitask Language Understanding" benchmark, spans 57 different topical areas. The questions are 4-way multiple-choice spanning subjects in social sciences, STEM, and humanities that were manually scraped from practice materials available online for exams such as the GRE and the U.S. Medical Licensing Exam. Many state-of-the-art models perform poorly (random accuracy is 25%). We evaluate on the first 20 test questions from each category (1140 instances total).

Prompts
In-Context Examples We experiment with k = 0, 1, 2, 4 and 8 in-context demonstrations, which are the same for each instance, and selected as the first k examples from a fixed set of 8. For the curie, davinci, and davinci-instruct-beta models, we report the mean and standard error over 3 random seeds used to select the set of 8 demonstrations, since the choice of in-context demonstrations can significantly affect performance (Lu et al., 2022, i.a.). We select in-context examples from each dataset's associated training set (combined dev + validation sets for MMLU).
Prompt Format We experiment with three prompt formats, corresponding to the format of x in §5.2. The first, "q", only contains the question and is thus most similar to next-word prediction. This is identical to the prompt used by Brown et al. (2020). For example:

kinetics change stored energy into

The second format, "q + L_string", includes answer choices as a string list and is similar to formats included in PromptSource (Bach et al., 2022) used to train FLAN-T5 and other models:

question: kinetics change stored energy into
answer choices: snacks, naps, kites, or warmth
The correct answer is:

For both of the above prompts, models score or output the full string answer, e.g., warmth. The third format, "q + L_enum", includes enumerated newline answer choices, similar to that used for zero-shot evaluations in FLAN (Wei et al., 2022) and FLAN-T5:

Question: kinetics change stored energy into
Choices:
A: snacks
B: naps
C: kites
D: warmth
Answer:

Here, models score only the (single-token) symbols, e.g., D. The full prompts are given in Appendix A.3, and their PMI DC denominators in A.4.
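For concreteness, a sketch of how the three formats above could be assembled programmatically; the strings follow the examples shown in this section and Appendix A.3, but the helper names are ours.

```python
def format_q(question: str) -> str:
    """The "q" format: just the question, closest to next-word prediction."""
    return question

def format_q_L_string(question: str, choices: list[str]) -> str:
    """The "q + L_string" format: answer choices as a string list."""
    return (f"question: {question}\n"
            f"answer choices: {', '.join(choices[:-1])}, or {choices[-1]}\n"
            f"The correct answer is:")

def format_q_L_enum(question: str, choices: list[str]) -> str:
    """The "q + L_enum" format: enumerated choices; the model scores the symbol (e.g., D)."""
    labels = "ABCDE"
    lines = [f"Question: {question}", "Choices:"]
    lines += [f"{labels[i]}: {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

print(format_q_L_enum("kinetics change stored energy into",
                      ["snacks", "naps", "kites", "warmth"]))
```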

On Reducing Surface Form Competition
Fig. 1 (left) demonstrates the effect of choice of prompt format on PMA (and hence SFC, Eq. (6)) in the one-shot setting. Across datasets, showing answer choices in the "string" format leads to a substantial increase in PMA, which reaches near-100% for all models using the "q + L_enum" format. Zooming in on the role of in-context examples in Fig. 3 (dashed lines), we observe PMA increases significantly for all models after seeing only one in-context example that includes the answer choices (bottom plot), and stronger models such as text-davinci-003 and FLAN-T5 exhibit this behavior zero-shot. Trends also hold for CommonsenseQA and OpenbookQA (Figs. 9 and 10, Appendix). The number of instances for which the bound in Eq. (7) is satisfied and SFC is fully alleviated is in Table 8 (Appendix A.6).

Relationship between Surface Form Competition and Accuracy
Fig. 1 (right) demonstrates the effect of the choice of prompt format on accuracy in the one-shot setting. While gains in PMA are consistent across models, this is not the case for accuracy. Certain models (curie, OPT 30B) actually achieve their best task performance when their PMA is the lowest, perhaps due to the q prompt being the closest to the next-token prediction objective. For others (davinci, davinci-instruct-beta), accuracy is stable across prompts, even while PMA substantially increases. Seeing the answer choices in the prompt is crucial to achieving good accuracy with text-davinci-003 and FLAN-T5, likely due to their instruction tuning. Thus, showing answer choices does not guarantee improved accuracy, especially for vanilla LMs.
We can also observe this lack of positive correlation from the angle of in-context examples (Figs. 3, 9 and 10): while PMA increases with more in-context examples, accuracy is relatively stable across all models and prompt formats. Fig. 4 shows a shared scatterplot where each datapoint is a model result. The graph further illustrates the lack of correlation between increases in PMA (x-axis) and increases in accuracy (y-axis), especially in the bottom portion where PMA increases without any shift in y-axis position. In Table 1, we observe further evidence that PMA and accuracy are very negatively correlated in the case of curie and OPT 30B, and very positively correlated in the case of FLAN-T5 and text-davinci-003. davinci and davinci-instruct-beta exhibit highly variable correlation, indicating that the choice of LM modulates the PMA-accuracy relationship.

Role of Different Parts of the Input
In Fig. 5, we follow the methodology proposed in §5.2 and break down the zero-shot contributions to probability mass and accuracy of the question q vs. the answer choices L when included in the prompt. We find that conditioning P_θ(ℓ) on L (i.e., considering P_θ(ℓ | L)) substantially increases PMA (67.23% vs. 0% PMA on average for MMLU), while the accuracy of the two is similar (24.17% vs. 30.1%, resp., a 5.93% absolute gain). On the other hand, conditioning either of these probabilities further on q (i.e., considering P_θ(ℓ | q) or P_θ(ℓ | q, L)) provides a very small gain in PMA (2.97% absolute) as opposed to an 11.88% accuracy gain on average for MMLU. This indicates that conditioning on q is not an effective way to increase PMA (or decrease SFC). Overall, observing q plays a larger role in accuracy while observing L plays a larger role in increasing PMA. In other words, observing q appears to raise the relative probability of y* by redistributing mass among the members of L, while observing L helps to raise the absolute probability given to L (i.e., PMA). Results hold for other models and datasets (Figs. 11 to 13).

When does PMI DC improve accuracy?
Our experiments provide further insight into when normalization methods like PMI DC may succeed. Fig. 6 (also Fig. 8) illustrates how much PMI DC affects accuracy for each dataset.
Whether PMI DC improves accuracy for a model seems tied to the largest PMA achieved by some prompt for that model as well as the model's overall performance: lower PMA and lower accuracy imply higher gains from PMI DC. Indeed, PMI DC always improves accuracy when answer choices are not observed in the prompt (Figs. 6b and 8), and the extent of gain is fairly consistent for each dataset across number of in-context examples and models. However, as established earlier, prompting without answer choices often results in the worst accuracy for strong models. Fig. 6a plots the difference between the best accuracies using each method; gains are relatively muted, except for OpenbookQA. Additionally, PMI DC generally (though not always) leads to significant accuracy drops for the strongest models (text-davinci-003 and FLAN-T5).
Tabular results for all experiments are in Tables 9 to 11 (Appendix). For curie, davinci, and davinci-instruct-beta, we include standard error over 3 random seeds for example selection. The effects of random seed are generally negligible.

Conclusion
We take a novel approach to studying the effects of prompt format, in-context examples, and model type on probability assigned to answer choices and its relationship with end task performance, by proposing a new formalization of surface form competition and a quantifiable metric (PMA). This is an important step towards understanding and improving the use of LMs for discriminative tasks. Our findings shed light on the role of probability distributions in model performance. They also challenge intuitive assumptions, such as that showing answer choices for MC tasks is always beneficial, which is a common practice (Hendrycks et al., 2021; Rae et al., 2021; Hoffmann et al., 2022, i.a.).

Figure 6: Improvement of PMI DC (Eq. (4)) over standard sequence scoring (Eq. (1)). Top: differences in the best accuracy achieved by PMI DC (Eq. (4)) and the best achieved by sequence scoring (Eq. (1)) across prompt settings for each model (y-axis) and dataset (x-axis). Bottom: full detail results for MMLU; other datasets are in Fig. 8.
Practical Insights: We find that the best way to use vanilla LMs in multiple-choice settings is to provide a string prompt without answer choices and apply probability normalization. For instruction-tuned models, on the other hand, answer choices should be shown, and in an enumerated prompt format, and probability normalization should not be used. More generally, our results reveal that efforts to increase probability assigned to answer choices via prompting methods can have surprisingly negative effects, and that scoring methods can drastically affect the conclusions we reach about an underlying LM's fundamental capabilities. We advocate for future work to look into length normalization as another understudied scoring mechanism.

Limitations
As with all papers using GPT-3 models, there is some stochasticity on the backend of the OpenAI API that researchers cannot control (studied in more depth by Ruis et al. (2022)). This means that results may vary from run to run, hampering reproducibility. In our setting, we find the effects to be very small in practice.
Additionally, in this work we only investigate open-vocabulary multiple-choice QA tasks. Future work might consider a broader suite of tasks or tasks where the answer choices are shared across instances, as in-context examples may have a larger effect on PMA or accuracy in that setting. Furthermore, we do not consider any directly comparable models for reaching conclusions about instruction tuning (base model → instruction-tuned) due to a lack of publicly available ones at the time this research was conducted; such an experiment would allow more concrete claims about the effect of instruction tuning and its relationship with PMI DC to be made.
Finally, there are other probability normalization variants that differ from standard PMI DC in subtle ways (cited in §2). We only compare against the most straightforward (and common) implementation here.

Ethics and Broader Impacts
This paper investigates the interplay between probability mass on vocabulary items and accuracy in zero-shot- and few-shot-prompted autoregressive language models. Our efforts show that investigations into output scoring functions can change the conclusions drawn about the capabilities of models, which we believe is an important part of better understanding how to reliably and adequately use these systems. The existing NLP benchmarks we use have limitations both in their dissimilarity to real-world use cases of LMs (Raji et al., 2021) and in the means by which they were collected, for example by scraping (potentially copyrighted) material off of the internet in the case of MMLU. The use of copyrighted material in the training and testing of AI systems is currently unsettled (Levendowski, 2021; Callison-Burch et al., 2023).

A.2 Nature of Datasets for Each Model
For models trained only on the autoregressive next-token prediction objective (curie, davinci, and OPT 30B (Zhang et al., 2022)), in theory the OpenbookQA and CommonsenseQA datasets have not been seen during training. However, verifying this would require access to and indexing of the training corpora, which are not publicly available for the GPT-3 models. Additionally, because training data for these models was scraped up to and including 2019 (Brown et al., 2020), it is possible there is some leakage in the training corpus.
For the instruction-tuned models, the authors of FLAN-T5 (Chung et al., 2022) explicitly report the datasets which are used and not used during training, and we report these details in §6.2. As for InstructGPT davinci-instruct-beta (Ouyang et al., 2022), the following details are given about its supervised instruction tuning training dataset (emphasis ours): "...The SFT dataset contains about 13k training prompts (from the API and labeler-written)... To give a sense of the composition of our dataset, in Table 1 we show the distribution of use-case categories for our API prompts (specifically the RM [reward modeling] dataset) as labeled by our contractors. Most of the use-cases are generative, rather than classification or QA. These prompts are very diverse and include generation, question answering, dialog, summarization, extractions, and other natural language tasks (see Table 1)." In Table 1, generation makes up 45.6% of the dataset, followed by open QA at 12.4%. Closed QA is a relatively small percentage of the training set, at 2.6%, and classification 3.5%, providing some possibility that the tasks we study are out-of-domain/zero-shot (though these exact numbers are reported on the reward modeling dataset, not the one used for instruction tuning, and these are not guarantees due to the proprietary nature of the dataset). No details are given about the datasets used to train text-davinci-003 (OpenAI, 2022).

A.3 Prompt Details
Exemplar prompts containing 4 in-context demonstrations (for 1 of the 3 random seeds used) are given in Tables 2 to 4 for OpenbookQA and Tables 5 to 7 for CommonsenseQA. The last instance shown is the test instance, which the model completes with an answer prediction. For each random seed, 8 demonstrations are drawn from the training set of each dataset. When fewer demonstrations (0-4) are used, the first k are taken and the prompt otherwise stays the same.
Role of Different Parts of the Input In Figs. 5 and 11 to 13, when prompts do not include q, we use the same prompts as in §6.3, minus the question.
For example, when x = L_string:

answer choices: snacks, naps, kites, or warmth
The correct answer is:

When x = L_enum:

Choices:
A: snacks
B: naps
C: kites
D: warmth
Answer:

When x = None, the prompt is simply "? " to avoid an empty context, which the OpenAI API does not allow.

A.4 Computing PMI DC
In Holtzman et al. (2021), the denominator of Eq. (4) is actually computed as P_θ(ℓ | d), where d represents some "domain context" string. In their implementation, d is the phrase " the answer is:". A context is necessary practically when querying the OpenAI API as well, as it does not allow queries with empty contexts, presumably to avoid revealing model weights. In our setting, we follow the prompt format to determine d. For x = q, d = "?" (to avoid an empty context). Otherwise, d is the last line of the prompt: for x = q + L_string, d = "The correct answer is: ", and for x = q + L_enum, d = "Answer: ".

Bears will always have longer life cycles than a fox
If a river is rushing southwest on a sunny day, then it is safe to assume that the land gently inclines in that direction
After the moon phase where you can see nothing of the moon, what comes next? the first quarter
kinetics change stored energy into motion and warmth
A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to

Table 2: One of three "q" prompt templates used for OpenbookQA, containing 4 in-context demonstrations and one test instance.

Let's answer science questions.
question: Bears will always have longer life cycles than a
answer choices: tortoises, whales, elephants, or fox
The correct answer is: fox
###
question: If a river is rushing southwest on a sunny day, then it is safe to assume that
answer choices: southwest is a good place to be, the land gently inclines in that direction, the world is mostly land, or the land is supple
The correct answer is: the land gently inclines in that direction
###
question: After the moon phase where you can see nothing of the moon, what comes next?
answer choices: the full moon, the last quarter, the first quarter, or the half moon
The correct answer is: the first quarter
###
question: kinetics change stored energy into motion and
answer choices: snacks, naps, kites, or warmth
The correct answer is: warmth
###
question: A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to
answer choices: make more phone calls, quit eating lunch out, buy less with monopoly money, or have lunch with friends
The correct answer is:

Table 3: One of three "q + L_string" prompt templates used for OpenbookQA, containing 4 in-context demonstrations and one test instance.

The following are elementary-level multiple-choice questions about science. For the question below, select the most suitable answer from the 4 options given.
[...]
A: snacks
B: naps
C: kites
D: warmth
Answer: D
Question: A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to
Choices:
A: make more phone calls
B: quit eating lunch out
C: buy less with monopoly money
D: have lunch with friends
Answer:

Table 4: One of three "q + L_enum" prompt templates used for OpenbookQA, containing 4 in-context demonstrations and one test instance.
Following Holtzman et al. (2021), P_θ(ℓ | d) is always computed zero-shot, even when the numerator has in-context examples. We follow this design, as it is unintuitive to include in-context examples that do not contain a question, such as "The correct answer is: birds The correct answer is: dogs The correct answer is: ", and unclear how this would better calibrate a model's predictions.
We experimented with a "label-conditional" do-main context where we included answer choices in d when the prompt contained them, but found this version of PMI DC , ŷ = p(ℓ|q+L) p(ℓ|L) , to underperform the version without answer choices.
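In code form, the format-dependent domain contexts described above amount to a small lookup; this is our own summary of the text, not the authors' implementation.

```python
# Domain context d used in the PMI_DC denominator P_theta(ℓ | d), chosen per prompt format.
DOMAIN_CONTEXT = {
    "q": "? ",                              # avoid an empty context (OpenAI API requirement)
    "q + L_string": "The correct answer is: ",
    "q + L_enum": "Answer: ",
}
```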

A.5 Proofs
We say that a string y forms a prefix of a string y′ if y = y_1 ... y_k and y′ = y_1 ... y_k ... y_m for k < m. We call a set S of strings prefix-free if no string in S is a prefix of another string in S. We show below (using two alternative arguments) that the total probability mass assigned by a language model M_θ to a prefix-free set is at most 1. Note that the prefix-free condition is necessary for the upper bound of 1 to hold in general.

Proposition 1. For any prefix-free set S of strings and any x, Σ_{y ∈ S} P_θ(y | x) ≤ 1.
It follows that if the set L of answer choices is prefix-free, then PMA_θ(L, x) ≤ 1.
The idea behind the first proof of Proposition 1 is that if multiple strings in S share a common maximal prefix, the token that immediately follows that common prefix must be distinct across the strings (because S is prefix-free). It follows from this that the total probability of those strings sharing the prefix is no more than the probability of the common prefix itself. We can use this observation to repeatedly reduce S into strictly smaller sets that retain the invariant of being prefix-free and upper bound the total probability of the original S. The process ends when no two strings share a common prefix, at which point the upper bound of 1 follows immediately. Formally:

Proof via maximal common prefixes. Let y_1 y_2 ... y_k denote the k tokens comprising a string y ∈ S, where k = |y| is the length of y.
If it's the case that the first tokens of all strings y ∈ S are distinct, then Σ_{y ∈ S} P_θ(y_1 | x) ≤ 1 since P_θ is a probability distribution over tokens. It follows that Σ_{y ∈ S} P_θ(y | x) ≤ 1, finishing the proof.
If the first tokens are not all distinct, then there must exist at least two strings in S that share a common prefix. We can therefore identify a maximal prefix p and a subset S′ ⊆ S with |S′| ≥ 2 such that all strings in S′ begin with the prefix p, while none of the ones in S \ S′ do. Since p is a maximal prefix, the tokens y_{|p|+1} of strings y ∈ S′ that immediately follow p must all be distinct. Following the same argument as above, Σ_{y ∈ S′} P_θ(y | x) ≤ P_θ(p | x). Now consider the set T = (S \ S′) ∪ {p}. Combining the above, Σ_{y ∈ S} P_θ(y | x) ≤ Σ_{y ∈ T} P_θ(y | x); that is, the total probability mass over strings in S is upper bounded by that on strings in T. Further, T contains p and has size |S| − |S′| + 1 < |S|. If the first tokens of all strings in T are distinct, we immediately have Σ_{y ∈ T} P_θ(y | x) ≤ 1 and the proof is complete. Otherwise, we observe that T is also prefix-free just like S, so we can simplify S to be T and repeat the process of checking the distinctness of first tokens, identifying the maximal prefix, and further upper bounding the probability mass on S. Since each iteration of this process reduces the size of S by at least 1, the process must terminate with |S| = 1, at which point we conclude that the total probability mass on the reduced S (and hence on the original S) is at most 1, as claimed.
An alternative argument for proving Proposition 1 is to expand the set S into a larger set T such that (a) the total probability mass on S is the same as that on T and (b) all strings in T are of the same length, say k. We can then observe that the total probability mass on T is upper bounded by the probability mass on all strings of length k, and argue that the latter is exactly 1.
Proof using length normalization. Let k denote the length of the longest string in S and V denote the token vocabulary. For any string y of length at most k, let Z^k_y denote the set of all possible extensions of y to strings of length exactly k. We first argue that P_θ(y | x) = Σ_{y′ ∈ Z^k_y} P_θ(y′ | x):

Σ_{y′ ∈ Z^k_y} P_θ(y′ | x) = Σ_{z ∈ V^{k−|y|}} P_θ(y z | x) = P_θ(y | x) · Σ_{z ∈ V^{k−|y|}} P_θ(z | x, y) = P_θ(y | x) · 1 = P_θ(y | x),

where the second-to-last equality follows because P_θ is a probability distribution over tokens in V.
Now define an expanded set T as the set of all expansions of strings in S to strings of length exactly k. Since S is prefix-free, all of these expanded strings are distinct. Thus, it follows from the above argument that the total probability mass on S is the same as that on T. Lastly, the probability mass on T is clearly upper bounded by that on all length-k strings, which itself is exactly 1 (using the same argument as in the derivation above). Therefore, the total probability mass on S is upper bounded by 1.
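Proposition 1 can also be sanity-checked numerically. The toy model below assigns every next token probability 0.5 over a two-token vocabulary, so a string's probability is 0.5 raised to its length; this is only an illustrative check, not part of the paper's experiments.

```python
def string_prob(s: str) -> float:
    """P(s) under a toy autoregressive model: 2-token vocabulary, uniform (0.5) next-token probabilities."""
    return 0.5 ** len(s)

def total_mass(strings: list[str]) -> float:
    return sum(string_prob(s) for s in strings)

prefix_free = ["a", "ba", "bb"]     # no member is a prefix of another
not_prefix_free = ["a", "b", "aa"]  # "a" is a prefix of "aa"

print(total_mass(prefix_free))      # 1.0  -> consistent with Proposition 1 (<= 1)
print(total_mass(not_prefix_free))  # 1.25 -> exceeds 1: the prefix-free condition is necessary
```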
Empirical Verification of Prefix-Free Assumption Empirically, only 24/1140 MMLU instances contain an answer choice that is a prefix of another answer choice in L, 5/500 in CommonsenseQA, and 0/500 in OpenbookQA. In the case of MMLU, 24 instances is an upper bound since many of the prefix answers are numeric; whether the answer choices are true prefixes would depend on the model's tokenizer (e.g., whether "2" and "200" have the same first token after tokenization will vary).
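A sketch of the token-level prefix check behind these counts; the tokenizer here is a placeholder, since (as noted above) the outcome depends on the model's tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder; not necessarily a tokenizer studied in the paper

def has_prefix_pair(choices: list[str]) -> bool:
    """True if some answer choice's token sequence is a strict prefix of another's."""
    ids = [tokenizer(c).input_ids for c in choices]
    return any(len(a) < len(b) and b[:len(a)] == a
               for i, a in enumerate(ids) for j, b in enumerate(ids) if i != j)

print(has_prefix_pair(["snacks", "naps", "kites", "warmth"]))  # False
print(has_prefix_pair(["2", "200"]))  # depends on how the tokenizer splits "200"
```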

A.6 Additional Results
• Table 8 contains the tables of bound satisfaction (Eq. (7)) for all datasets.
• Table 9 contains tabular results for MMLU; Table 10 for CommonsenseQA; and Table 11 for OpenbookQA.

Figure 1: Higher probability mass on given answer choices (left; Eq. (5)) does not always translate to better accuracy (right; Eq. (1)), as shown here for three different prompt formats (x-axis; §6.3), each with one in-context example. Results are averaged across MMLU, OpenbookQA, and CommonsenseQA. Including answer choices in the prompt substantially increases probability mass on them. However, high probability mass is surprisingly not always associated with increased accuracy; in fact, it can lead to a substantial drop in performance (e.g., for OPT 30B and GPT-3 curie).
Figure 3: MMLU test set accuracy (Eq. (1); solid lines) and average PMA (Eq. (5); dashed lines) as a function of number (x-axis) and format (by graph) of in-context examples, for six pretrained LMs.

Figure 7: A scatterplot showing the relationship between average PMA and task accuracy for 0, 1, 2, 4 and 8 in-context examples. Note these datasets are explicitly in-domain for FLAN-T5. See Figure 4 for more info.

Figure 9: CommonsenseQA validation subset accuracy and average PMA as a function of number and format of in-context examples. Random accuracy is 20%. See caption of Fig. 3 for more details. Note this task is explicitly in-domain for FLAN-T5.

Figure 10: OpenbookQA test set accuracy and average PMA as a function of number and format of in-context examples. Random accuracy is 25%. See caption of Fig. 3 for more details. Note this task is explicitly in-domain for FLAN-T5.

Table 1: Per-model Spearman's correlations between avg. PMA and accuracy (as plotted in Figs. 4 and 7). Bold: results are statistically significant at p < 0.05 for a two-sided hypothesis test.
Figure 5: Zero-shot results on the MMLU test set: accuracy (Eq. (1); blue) and PMA (Eq. (5); orange) averaged over dataset instances. Observing answer choices in the prompt contributes far more to PMA than observing the question, confirming our hypothesis in §5.2. Even without observing the question, all models place a substantial amount of probability mass on answer choices after observing them in the prompt (see Fig. 11 for other models, and Appendix A.3 for prompt details).

The following are multiple-choice questions about everyday situations. For the question below, select the most suitable answer from the 5 options given.
[...]

Table 7: One of three "q + L_enum" prompt templates used for CommonsenseQA, containing 4 in-context demonstrations and one test instance.

Table 9: Full metrics for each model and prompt type on the MMLU test subset. Models are ordered by increasing performance. The mean and standard error of using 3 random seeds to select in-context demonstrations are reported for experiments with at least 1 demonstration for the curie, davinci, and davinci-instruct-beta models. For each model and each column, we bold the prompt format and scoring metric (accuracy or PMI DC accuracy) that results in the highest score, as well as any scores within 1 percentage point of it. We underline the prompt format with the largest average PMA.

Table 11: Full metrics for each model and prompt type on the OpenbookQA test set. See caption of Table 9 for more details. Numbers in parentheses are those reported in Holtzman et al. (2021), though the exact model used may not be the same.