Shramay Palta
2024
It’s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning
Nishant Balepur | Shramay Palta | Rachel Rudinger
Findings of the Association for Computational Linguistics: ACL 2024
Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This process of elimination (PoE), when used with COT, can enhance self-consistency, interpretability, and tasks such as medical diagnoses of exclusion. Thus, we propose PoE with COT, where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on a total of four commonsense and scientific reasoning datasets. We find that the strategy of PoE always underperforms the strategy of choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct error analyses and give suggestions for future work.
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
Shramay Palta | Nishant Balepur | Peter A. Rankel | Sarah Wiegreffe | Marine Carpuat | Rachel Rudinger
Findings of the Association for Computational Linguistics: EMNLP 2024
2023
FORK: A Bite-Sized Test Set for Probing Culinary Cultural Biases in Commonsense Reasoning Models
Shramay Palta | Rachel Rudinger
Findings of the Association for Computational Linguistics: ACL 2023
It is common sense that one should prefer to eat a salad with a fork rather than with a chainsaw. However, for eating a bowl of rice, the choice between a fork and a pair of chopsticks is culturally relative. We introduce FORK, a small, manually curated set of CommonsenseQA-style questions for probing cultural biases and assumptions present in commonsense reasoning systems, with a specific focus on food-related customs. We test several CommonsenseQA systems on FORK, and while we see high performance on questions about US culture, the poor performance of these systems on questions about non-US cultures highlights systematic cultural assumptions aligned with US over non-US cultures.