POE: Process of Elimination for Multiple Choice Reasoning

Language models (LMs) are capable of conducting in-context learning for multiple choice reasoning tasks, but the options in these tasks are treated equally. As humans often first eliminate wrong options before picking the final correct answer, we argue that a similar two-step strategy can make LMs better at these tasks. To this end, we present the Process of Elimination (POE), a two-step scoring method. In the first step, POE scores each option and eliminates seemingly wrong options. In the second step, POE masks these wrong options and makes the final prediction from the remaining options. Zero-shot experiments on 8 reasoning tasks illustrate the effectiveness of POE, and a follow-up analysis finds our method to be especially performant on logical reasoning tasks. We further analyze the effect of masks, and show that POE applies to few-shot settings and large language models (LLMs) like ChatGPT.


Introduction
"How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?" (Doyle, 1890)

In natural language processing, many reasoning tasks are multiple choice-based, in which a model chooses the best option from several options, given a question. Current LMs exhibit remarkable performance on diverse reasoning tasks, like commonsense reasoning (Wang et al., 2023a; Holtzman et al., 2021), logical reasoning (Ye et al., 2023), and arithmetic reasoning (Shum et al., 2023).
There are two types of in-context learning methods for multiple choice reasoning: scoring and prompting. Given the question, scoring methods score each option and select the highest-scored one.

Figure 1: An illustration of the Process of Elimination (POE) for a multiple choice question. In the first step, POE eliminates some wrong options. In the second step, it enforces the masks and makes the final prediction.
One limitation of these two types of approaches is that they both treat each option equally, i.e., they either consider each option independently, or consider all options jointly. On the contrary, when humans solve a multiple choice reasoning task, they often eliminate some wrong options, then choose from the remaining ones. This so-called process of elimination is widely used in college exams and can be stronger than usual Gold-style learning (Freivalds et al., 2002). Therefore, we assume that language models can similarly benefit from this elimination process, i.e., eliminating wrong options and choosing the best option are two types of reasoning that can be disentangled into two steps.
To this end, we present the Process of Elimination (POE), a two-step scoring method for multiple choice reasoning, as shown in Figure 1. In the first step, POE scores each option, then eliminates some wrong options based on their scores. In the second step, it masks these wrong options, then chooses the best one from the remaining options. We conduct experiments on 8 reasoning tasks that cover diverse domains. POE achieves the best zero-shot performance on most tasks. A follow-up analysis shows that it favors logical reasoning tasks. We also measure the effect of masks, and find our method applicable to few-shot settings and compatible with LLMs like ChatGPT (Ouyang et al., 2022).
Our contributions are twofold: (1) We present POE, a two-step scoring method for multiple choice reasoning; (2) Through comprehensive experiments and analysis, we demonstrate the effectiveness and generalizability of POE.

Method
Problem Setting. A multiple choice reasoning task instance includes a question x, several options Y = {y_1, ..., y_n}, and the correct option y. There are two kinds of in-context learning approaches to this problem: scoring and prompting.
Scoring uses an LM to compute a score for each option y_i, and chooses the highest-scored option:

ŷ = argmax_{y_i ∈ Y} score(x, y_i). (1)

A commonly used score is the language modeling likelihood P(y_i | x) (Brown et al., 2020). More recent scores include average log probability (Brown et al., 2020), calibrated log probability (Holtzman et al., 2021), and channel (Min et al., 2022). As shown in Equation 1, most scoring methods consider each option in isolation, except multiple choice prompting (Robinson and Wingate, 2023).
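To make the difference between these scores concrete, here is a minimal sketch, assuming the LM exposes per-token log-probabilities for each option; the toy numbers are illustrative, not actual model outputs.

```python
def lm_score(token_logprobs):
    """Language modeling score: total log P(y_i | x),
    the sum of the option's per-token log-probabilities."""
    return sum(token_logprobs)

def avg_score(token_logprobs):
    """Average log probability: the same sum normalized by
    option length, removing the bias toward short options."""
    return sum(token_logprobs) / len(token_logprobs)

# Toy per-token log-probabilities for a short and a long option.
short = [-0.5, -0.7]              # 2 tokens
long = [-0.4, -0.5, -0.6, -0.5]   # 4 tokens
```

Under these toy numbers, `lm_score` prefers the short option while `avg_score` prefers the long one, which is exactly the length bias that average log probability is meant to correct.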
In contrast, prompting methods provide all the options in the input (Wei et al., 2022; Wang et al., 2023b; Kojima et al., 2022). The model then generates a raw output from the input, and these methods finally extract the predicted option from the raw output:

ŷ = extract(raw_output). (4)
POE. Our method is a two-step scoring method that considers all options but treats them differently. We also implement a prompting-based variation of POE for LLMs (Section 5).
The first step of POE is elimination, in which it eliminates some wrong options. POE starts by scoring each option. Then, unlike common scoring methods that choose the highest-scored option, it uses the scores to eliminate options with low scores. In particular, POE computes the average score of all options, and eliminates the options whose scores are below average, collecting them in Y_wrong:

Y_wrong = {y_i ∈ Y | score(x, y_i) < (1/n) Σ_j score(x, y_j)}. (5)

The intuition behind this elimination strategy ("Below Average") is that the scores of wrong options deviate from the others, and are thus easy to identify, which we verify in Appendix B.2. We compare other elimination strategies in Section 5.2.
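The elimination step above, together with the masked prediction that follows, can be sketched in plain Python. This is a sketch under assumptions: scores are a list of floats aligned with the options, and the instruction wording and `[MASK]` placement only approximate the paper's actual template T.

```python
def eliminate_below_average(scores):
    """Step 1: indices of options scoring below the average ("Below Average")."""
    avg = sum(scores) / len(scores)
    return {i for i, s in enumerate(scores) if s < avg}

def build_masked_prompt(question, options, eliminated):
    """Step 2 (input): wrap the question with lettered options, replace
    eliminated options with "[MASK]", and append an instruction to
    ignore them. The wording is an illustrative stand-in for T."""
    lines = [question]
    for i, opt in enumerate(options):
        text = "[MASK]" if i in eliminated else opt
        lines.append(f"{chr(ord('A') + i)}. {text}")
    lines.append("Ignore the [MASK] options and pick the best remaining option.")
    return "\n".join(lines)

def predict(masked_scores, eliminated):
    """Step 2 (output): argmax over options, with eliminated options
    forced to negative infinity."""
    adjusted = [float("-inf") if i in eliminated else s
                for i, s in enumerate(masked_scores)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)
```

For toy first-step scores [0.4, 0.1, 0.3, 0.05], options B and D fall below the average (0.2125) and are masked; the final prediction is then taken over A and C only.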
The second step of POE is prediction, which chooses the best answer that is not in Y_wrong. Specifically, POE computes a binary mask m_i for each option:

m_i = 1 if y_i ∉ Y_wrong, and m_i = 0 otherwise.

POE then enforces the masks. In particular (as shown in Figure 1), it uses a template T to wrap the question with all options and their symbols like "A", appends an instruction asking the model to neglect masked options, and replaces each eliminated option with the special text sequence "[MASK]":

x_mask = T(x, Y, m).

Finally, POE scores each option and chooses the highest-scored one, with the scores of eliminated options set to negative infinity:

ŷ = argmax_{y_i ∈ Y: m_i = 1} score(x_mask, y_i).

Experiment Setup

Data. We consider 8 multiple choice reasoning tasks that cover diverse domains. We include three traditional reasoning tasks: ANLI (Nie et al., 2020), CommonsenseQA (CQA, Talmor et al., 2019), and Social IQa (SIQA, Sap et al., 2019). We also include five BIG-bench tasks (Srivastava et al., 2023), the first two of which are from BIG-Bench Hard (Suzgun et al., 2023): Logical Deduction (LD), Disambiguation_QA (DQA), Conceptual Combinations (CC), Strange Stories (SS), and Symbol Interpretation Task (SIT). For all tasks, we use their test sets if available, otherwise their development sets. To reduce cost, we sample 100 instances from each task. We present more task information and preprocessing details in Appendix A.
Model. We use FLAN-T5-XL (3B) (Chung et al., 2022) in both steps of POE, because it is an economical and performant instruction-tuned model. In Section 5, we also apply POE to LLMs like ChatGPT (Ouyang et al., 2022).

Methods. We consider 5 scoring baselines: language modeling (LM, the baseline in Zhao et al., 2021), average language modeling (AVG, Brown et al., 2020), calibration (Holtzman et al., 2021), channel (Min et al., 2022), and multiple choice prompting (MCP, Robinson and Wingate, 2023). For POE, we use MCP in both steps. We discuss other possible scoring methods and elimination strategies for POE in Section 5.2, and present input and output samples for all methods in Appendix C.

Results
In Table 1, we compare POE with other baselines on 8 reasoning tasks, and present the accuracy and standard deviation. Our method achieves the best or second-best performance on all 8 tasks. We notice that multiple choice prompting (MCP) is a strong baseline that consistently outperforms the other baselines. Nevertheless, POE beats MCP on 5 tasks, and is comparable on the remaining three. Since POE uses MCP in both steps, this result demonstrates the effectiveness of POE itself, especially its elimination step. In addition, POE is exceptionally performant on CC, beating the second-best method by 12% absolute. This corroborates our hypothesis on two types of reasoning for multiple choice tasks: while MCP sometimes fails to simultaneously eliminate wrong options and choose the right option, POE disentangles the two jobs into two steps, and thus achieves better performance. We present a case study in Appendix B.2.

When Is POE Better than MCP?
In Table 1, we find MCP to be the best baseline, and it sometimes even beats POE. This finding leads us to wonder: when (i.e., on which tasks) can we expect POE to dramatically outperform MCP?
We do not compare with prompting baselines such as Chain-of-Thought Prompting (Wei et al., 2022), because they are tailored for very large (e.g., 175B) models. We start by calculating the performance gaps (POE − MCP) on all 8 tasks. We find the gaps on LD (13.8%) and CC (12%) are much larger than those of the other tasks (less than 3%). Furthermore, both LD and CC are logical reasoning tasks, while the other tasks rely more on commonsense or social reasoning. We thus hypothesize that POE is better than MCP on tasks that mainly require logical reasoning, while not being consistently performant on social and commonsense reasoning. One explanation is that logical reasoning is a general skill, whereas social or commonsense reasoning relies on specific knowledge, some of which may be absent in certain language models.
To verify this hypothesis, we compare POE and MCP in the zero-shot setting on 8 new tasks from BIG-bench, including 4 logical reasoning tasks: Logical Arguments (LA), Identify Math Theorems (IMT), Code Line Description (CLD), and Reasoning about Colored Objects (RACO); and 4 control tasks that are based on commonsense or social reasoning: Counterfactual Conditionals (CAI), The Essential, the Excessive, and the Extraneous (EIE), RiddleSense (RS), and Identify Odd Metaphor (IOM). We present task information in Appendix A.
As shown in Table 2, we find POE consistently and dramatically outperforms MCP on all 4 logical reasoning tasks (top 4 tasks). We find the largest performance gap on LA (18.8%), a task of GRE-style logical questions. Human test-takers commonly use elimination-based methods to solve such GRE questions, which further supports the motivation of POE. (SIT is another logical reasoning task, but it also requires visual reasoning, which is not ideal for language models.) We also find POE underperforms MCP on the 4 control tasks (bottom 4 tasks), which means our method is not consistent on commonsense or social reasoning tasks. These findings support our hypothesis that, compared to MCP, POE is most performant on logical reasoning tasks.

What Makes Good Masks?
As shown in Section 2, scoring methods and elimination strategies jointly determine which options to eliminate (Y_wrong), and create masks accordingly.
In this section, we analyze the effect of using different configurations of these two factors. We define masking accuracy (Acc_mask) as the ratio of instances that have their correct option kept after elimination. Acc_mask represents the upper bound of POE, which we aim to maximize to reduce error propagation (Du and Cardie, 2020). We measure two quantities: Acc_mask and final accuracy (Acc). We consider four scoring methods (Calibration, Channel, LM, MCP) and two elimination strategies ("Below Average" and "Lowest", the latter of which eliminates only the option with the lowest score). We average zero-shot results over all 8 tasks in Table 1.
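The metric itself is straightforward to compute from elimination outputs; a minimal sketch, assuming each instance is reduced to its correct-option index and the set of eliminated indices (this pair format is an illustrative assumption, not the paper's data structure):

```python
def masking_accuracy(instances):
    """Fraction of instances whose correct option survives elimination.

    `instances` is a list of (correct_index, eliminated_indices) pairs;
    the pair format is assumed for illustration.
    """
    kept = sum(1 for correct, eliminated in instances
               if correct not in eliminated)
    return kept / len(instances)
```

For example, if the correct option is eliminated in 1 of 3 instances, Acc_mask is 2/3, which caps the final Acc that prediction can reach.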
As shown in Figure 2, the best scoring method is MCP, and the two elimination strategies are comparable to each other. We find that higher Acc_mask leads to higher Acc. The best configuration of scoring method and elimination strategy (MCP, Lowest) reaches 91.9% Acc_mask. This means a moderate-sized LM (FLAN-T5-XL) can eliminate wrong options while keeping the correct ones most of the time. The corresponding 68.7% Acc leaves a large performance gap, which suggests using a more powerful LM like FLAN-T5-XXL for prediction. We also find that Y_wrong and the masks make POE more interpretable, because they provide intermediate reasoning outputs. Enforcing the masks in prediction also makes our method faithful and factual.

Does POE Work with LLMs?
To measure whether POE is compatible with LLMs, we implement a prompting-based variation of POE and apply it to ChatGPT (gpt-3.5-turbo-0613).
We then measure its zero-shot performance on the 8 original tasks in Table 1 and the 4 logical reasoning tasks in Table 2. For POE, we only use ChatGPT in the second step (prediction), as the base LM suffices for masking (Section 5.2). Concretely, we prompt ChatGPT to complete x_mask, and extract the last occurrence of any option symbol as the prediction. We compare POE to MCP, which is similarly modified to work with ChatGPT. We present the result in Table 3. The results from ChatGPT are consistent with those from FLAN-T5-XL (Table 1): POE beats MCP on 5 out of the 8 original tasks, and performs well on the 4 logical reasoning tasks. These findings suggest our method also works with LLMs. Nevertheless, we find ChatGPT underperforms FLAN-T5-XL on some tasks, which requires further experiments and is beyond the scope of this work.
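The symbol-extraction step can be sketched as a small regular-expression routine; the word-boundary pattern is an assumption about how option symbols appear in the raw output, and a stray capital (e.g., the article "A") could fool this heuristic.

```python
import re

def extract_last_symbol(raw_output, n_options):
    """Return the last standalone option symbol ("A", "B", ...) in the
    model's raw output, or None if no symbol occurs. A stray capital
    such as the article "A" would also match this heuristic."""
    symbols = [chr(ord("A") + i) for i in range(n_options)]
    matches = re.findall(r"\b(" + "|".join(symbols) + r")\b", raw_output)
    return matches[-1] if matches else None
```

Taking the last occurrence follows the intuition that a model's final stated choice supersedes symbols it mentions while deliberating.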

Does POE Work in Few-Shot Settings?
To measure POE in the few-shot setting, we compare its zero-shot and three-shot performance with MCP on ANLI, CC, and LD. We build three-shot demonstrations by randomly sampling from the training set of ANLI and the test sets of CC and LD. We present the result in Figure 3. For ANLI, POE underperforms MCP in both settings, but their performance gap drops from 5.3% (zero-shot) to 1.5% (three-shot). For LD and CC, POE outperforms MCP in both settings, but their performance gap similarly diminishes in the three-shot setting. Comparing POE with MCP, we find three-shot POE to be less performant than its zero-shot counterpart, and possible reasons include: 1) we did not optimize prompts or demonstrations for three-shot POE; 2) the instructions we use in zero-shot POE are already powerful, and three-shot demonstrations may introduce some noise. Still, our findings suggest POE is applicable to few-shot settings, which we will continue to explore in future work.

Conclusion
We present POE, a two-step scoring method for multiple choice reasoning. POE eliminates some wrong options in the first step, and chooses from the remaining options in the second step. POE performs well on reasoning (especially logical reasoning) tasks in the zero-shot setting, and also works with LLMs and in the few-shot setting. In the future, we will improve the generalizability of our method, and use fine-tuning to better enforce the masks.

A Task Information

... 5 options; for SS, we use the multiple choice subtask, and treat options with scores of 0.5 as wrong. We provide other details in Table 4. The first three tasks can be accessed on Hugging Face [9]. The last thirteen tasks can be accessed on BIG-bench [10].

B.1 Different Number of Options
Since our method eliminates wrong options in the first step, we assume the number of options affects performance. Consequently, we run experiments on three subtasks of logical deduction (LD (3), LD (5), LD (7)), which have 3, 5, and 7 options respectively. The other experiment settings are consistent with the main experiment (Section 3). We present the result in Table 5. We find most methods (including POE) perform worse as the number of options increases. This conforms to intuition, because more options require more reasoning steps, and the questions get harder. Nevertheless, POE is the best on all subtasks. This means that although POE does not, counterintuitively, perform better on harder questions, it works well on questions of different difficulties, and thus applies to a variety of multiple choice reasoning tasks.

[9] https://huggingface.co/
[10] https://github.com/google/BIG-bench

B.2 Case Study
In Figure 4, we present a case study to show why POE works. We compare POE with MCP on one instance from CC. The question invents the word "wajey" with a definition, and forms a surprising and uncommon conceptual combination between "wajey" and "grape". The correct option is A. We find MCP unable to solve this task; it chooses C. Taking a closer look at the MCP scores (lower is better) for each option, we find that MCP assigns high scores to options B and D, two seemingly wrong options. This verifies our hypothesis that some wrong options' scores deviate from the others, and these options can thus be eliminated. Nevertheless, MCP is distracted by option C, and we assume this is caused by the correlation between "grape" and "Muscat", the latter of which is a kind of grape. This instance shows that although MCP can eliminate some wrong options, it struggles to simultaneously choose the right option.
For POE, since it uses MCP in the first step, it masks options B and D. We examine the final POE scores for each option, and find that POE correctly chooses option A. In addition, we find the POE score for option A is lower (i.e., better) than its MCP score, and conversely for option C. This means our method is not distracted by option C. This case study verifies our assumption that eliminating wrong options and choosing the best option are two types of reasoning that can be disentangled into two steps, and that such disentanglement is beneficial.

C Input and Output Samples for All Methods
We do not optimize prompts, because our goal is to evaluate POE with a fixed prompt format. Therefore, we follow Holtzman et al. (2021) and Robinson and Wingate (2023) to construct task-agnostic prompts, and use them to wrap questions. We use one instance from SIQA. The question x is "Kendall was searching for ring with their eyes closed. They hit something. Why did Kendall do this?". The options Y are "kendall who has searching his ring" (y_1), "kendall who has wanted to close their eyes" (y_2), and "find the rings" (y_3). We present input and output samples for all methods for this instance in Table 6. Calibration computes two scores, log P(y_i | x) and log P(y_i | text), so we present samples for each of them. POE uses different prompts for each step, so we present them separately.
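As an illustration of such task-agnostic wrapping, here is a minimal sketch of an MCP-style prompt builder; the exact "Question:"/"Answer:" wording is an assumption, not the paper's verbatim template.

```python
def mcp_prompt(question, options):
    """Wrap a question and its lettered options into one multiple
    choice prompt. The "Question:"/"Answer:" wording is illustrative."""
    lines = ["Question: " + question]
    for i, opt in enumerate(options):
        lines.append(f"{chr(ord('A') + i)}. {opt}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The same builder can serve both POE steps if masked options are first replaced with "[MASK]" in the options list.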

Figure 2: Effect of different scoring methods and elimination strategies on masking accuracy (Acc_mask) and final accuracy (Acc). The best configuration is (MCP, Lowest), with a large gap between Acc_mask and Acc.

Figure 3: Three-shot and zero-shot accuracy scores on ANLI, CC, and LD. Although POE shows mixed performance, it is applicable to few-shot settings.

Table 1:
Accuracy scores (with standard deviation) on 8 tasks. The best scores are boldfaced, and the second-best scores are underlined. LM refers to language modeling, AVG refers to average language modeling, and MCP refers to multiple choice prompting. Our method (POE) achieves the best or comparable performance on all tasks.

Table 2:
Comparison of MCP and POE accuracy scores on 8 new tasks. The top 4 tasks are logical reasoning tasks. POE largely outperforms MCP on the 4 logical reasoning tasks, and underperforms MCP on the other 4 tasks.

Table 4:
Task information. "# Options" denotes the number of options for the task.

Table 5:
Accuracy scores on three subtasks of logical deduction, which have different numbers of options. Best scores are bold, and second-best scores are underlined. POE applies to tasks of varying difficulties.