Shramay Palta


2024

pdf bib
It’s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning
Nishant Balepur | Shramay Palta | Rachel Rudinger
Findings of the Association for Computational Linguistics: ACL 2024

Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This strategy of reasoning toward and eliminating incorrect answers, known as process of elimination (PoE), when used with COT, can enhance self-consistency, interpretability, and tasks such as medical diagnoses of exclusion. Thus, we propose PoE with COT, where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on a total of four commonsense and scientific reasoning datasets. We find that the strategy of PoE always underperforms the strategy of choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct error analyses and give suggestions for future work.
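
The sketch below illustrates the two prompting strategies the abstract contrasts: standard chain-of-thought toward the correct option versus PoE with COT, where the model reasons toward and eliminates the incorrect options. The prompt templates, the example question, and the structure of a model call are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of CoT vs. PoE-with-CoT prompting for a multiple-choice
# question. A real experiment would send these prompts to GPT-3.5, LLaMA-2,
# or Falcon and compare the accuracy and agreement of the two strategies.

QUESTION = "Which utensil is best for eating a salad?"
CHOICES = {"A": "a fork", "B": "a chainsaw", "C": "a paintbrush", "D": "a shovel"}


def format_choices(choices: dict) -> str:
    return "\n".join(f"{label}. {text}" for label, text in choices.items())


def cot_prompt(question: str, choices: dict) -> str:
    """Standard CoT: reason step by step toward the single correct option."""
    return (
        f"Question: {question}\n{format_choices(choices)}\n"
        "Think step by step, then answer with the letter of the correct option."
    )


def poe_prompt(question: str, choices: dict) -> str:
    """PoE with CoT: reason toward the incorrect options, eliminating them
    one by one; whatever option remains is the answer."""
    return (
        f"Question: {question}\n{format_choices(choices)}\n"
        "Think step by step about which options are incorrect and eliminate "
        "them one by one. Answer with the letter of the option that remains."
    )


if __name__ == "__main__":
    print(cot_prompt(QUESTION, CHOICES))
    print()
    print(poe_prompt(QUESTION, CHOICES))
```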

pdf bib
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
Shramay Palta | Nishant Balepur | Peter Rankel | Sarah Wiegreffe | Marine Carpuat | Rachel Rudinger
Findings of the Association for Computational Linguistics: EMNLP 2024

Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answer; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on this subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.
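
A rough sketch of the plausibility check described above: aggregate independent human plausibility ratings per answer choice and flag MCQ items whose most-plausible choice disagrees with the benchmark gold label. The toy data layout, the 1-5 rating scale, and the mean-rating aggregation are illustrative assumptions, not the paper's exact protocol.

```python
from collections import defaultdict
from statistics import mean

# (item_id, answer choice, plausibility rating on a 1-5 scale) -- toy judgments
judgments = [
    ("q1", "A", 5), ("q1", "A", 4), ("q1", "B", 2), ("q1", "B", 3),
    ("q2", "A", 2), ("q2", "A", 1), ("q2", "B", 4), ("q2", "B", 5),
]
gold = {"q1": "A", "q2": "A"}  # benchmark gold answers

# Average the independent ratings for each (item, choice) pair.
ratings = defaultdict(list)
for item, choice, score in judgments:
    ratings[(item, choice)].append(score)
avg = {key: mean(vals) for key, vals in ratings.items()}

# For each item, pick the choice humans rated most plausible.
most_plausible = {}
for (item, choice), score in avg.items():
    best = most_plausible.get(item)
    if best is None or score > best[1]:
        most_plausible[item] = (choice, score)

# Items where the most-plausible choice disagrees with gold are the
# "plausibly problematic" candidates for manual inspection.
flagged = [item for item, (choice, _) in most_plausible.items()
           if choice != gold[item]]
print("Flagged items:", flagged)  # -> ['q2'] in this toy example
```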

2023

pdf bib
FORK: A Bite-Sized Test Set for Probing Culinary Cultural Biases in Commonsense Reasoning Models
Shramay Palta | Rachel Rudinger
Findings of the Association for Computational Linguistics: ACL 2023

It is common sense that one should prefer to eat a salad with a fork rather than with a chainsaw. However, for eating a bowl of rice, the choice between a fork and a pair of chopsticks is culturally relative. We introduce FORK, a small, manually curated set of CommonsenseQA-style questions for probing cultural biases and assumptions present in commonsense reasoning systems, with a specific focus on food-related customs. We test several CommonsenseQA systems on FORK, and while they perform well on questions about US culture, their poor performance on questions about non-US cultures highlights systematic cultural assumptions aligned with US over non-US cultures.
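
A minimal sketch of the kind of analysis FORK enables: split a model's multiple-choice predictions by whether the question is framed around US or non-US food customs, then compare accuracy across the two subsets. The example records are invented for illustration; FORK itself is the manually curated test set, and the predictions would come from the CommonsenseQA systems being probed.

```python
from collections import defaultdict

# Toy records: (culture frame, gold answer, model prediction)
results = [
    ("US", "fork", "fork"),
    ("US", "fork", "fork"),
    ("non-US", "chopsticks", "fork"),
    ("non-US", "chopsticks", "chopsticks"),
]

correct = defaultdict(int)
total = defaultdict(int)
for culture, gold, pred in results:
    total[culture] += 1
    correct[culture] += int(pred == gold)

for culture in total:
    acc = correct[culture] / total[culture]
    print(f"{culture}: accuracy = {acc:.2f}")

# A large US vs. non-US accuracy gap on FORK points to culturally specific
# assumptions baked into the model's commonsense reasoning.
```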