@inproceedings{takizawa-etal-2025-mcqformatbench,
title = "{MCQF}ormat{B}ench: Robustness Tests for Multiple-Choice Questions",
author = "Takizawa, Hiroo and
Sugawara, Saku and
Aizawa, Akiko",
editor = "Arviv, Ofir and
Clinciu, Miruna and
Dhole, Kaustubh and
Dror, Rotem and
Gehrmann, Sebastian and
Habba, Eliya and
Itzhak, Itay and
Mille, Simon and
Perlitz, Yotam and
Santus, Enrico and
Sedoc, Jo{\~a}o and
Shmueli Scheuer, Michal and
Stanovsky, Gabriel and
Tafjord, Oyvind",
booktitle = "Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM{\texttwosuperior})",
month = jul,
year = "2025",
address = "Vienna, Austria and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.gem-1.69/",
pages = "824--846",
ISBN = "979-8-89176-261-9",
abstract = "Multiple-choice questions (MCQs) are often used to evaluate large language models (LLMs). They measure LLMs' general common sense and reasoning abilities, as well as their knowledge in specific domains such as law and medicine. However, the robustness of LLMs to various question formats in MCQs has not been thoroughly evaluated. While there are studies on the sensitivity of LLMs to input variations, research into their responsiveness to different question formats is still limited. In this study, we propose a method to construct tasks to comprehensively evaluate the robustness against format changes of MCQs by decomposing the answering process into several steps. Using this dataset, we evaluate nine LLMs, such as Llama3-70B and Mixtral-8x7B. We find the lack of robustness to differences in the format of MCQs. It is crucial to consider whether the format of MCQs influences their evaluation scores when assessing LLMs using MCQ datasets."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="takizawa-etal-2025-mcqformatbench">
<titleInfo>
<title>MCQFormatBench: Robustness Tests for Multiple-Choice Questions</title>
</titleInfo>
<name type="personal">
<namePart type="given">Hiroo</namePart>
<namePart type="family">Takizawa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Saku</namePart>
<namePart type="family">Sugawara</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Akiko</namePart>
<namePart type="family">Aizawa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Ofir</namePart>
<namePart type="family">Arviv</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Miruna</namePart>
<namePart type="family">Clinciu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kaustubh</namePart>
<namePart type="family">Dhole</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rotem</namePart>
<namePart type="family">Dror</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sebastian</namePart>
<namePart type="family">Gehrmann</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Eliya</namePart>
<namePart type="family">Habba</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Itay</namePart>
<namePart type="family">Itzhak</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Simon</namePart>
<namePart type="family">Mille</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yotam</namePart>
<namePart type="family">Perlitz</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Enrico</namePart>
<namePart type="family">Santus</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">João</namePart>
<namePart type="family">Sedoc</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Michal</namePart>
<namePart type="family">Shmueli Scheuer</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Gabriel</namePart>
<namePart type="family">Stanovsky</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Oyvind</namePart>
<namePart type="family">Tafjord</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria and virtual meeting</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-261-9</identifier>
</relatedItem>
<abstract>Multiple-choice questions (MCQs) are often used to evaluate large language models (LLMs). They measure LLMs’ general common sense and reasoning abilities, as well as their knowledge in specific domains such as law and medicine. However, the robustness of LLMs to various question formats in MCQs has not been thoroughly evaluated. While there are studies on the sensitivity of LLMs to input variations, research into their responsiveness to different question formats is still limited. In this study, we propose a method to construct tasks to comprehensively evaluate the robustness against format changes of MCQs by decomposing the answering process into several steps. Using this dataset, we evaluate nine LLMs, such as Llama3-70B and Mixtral-8x7B. We find a lack of robustness to differences in the format of MCQs. It is crucial to consider whether the format of MCQs influences their evaluation scores when assessing LLMs using MCQ datasets.</abstract>
<identifier type="citekey">takizawa-etal-2025-mcqformatbench</identifier>
<location>
<url>https://aclanthology.org/2025.gem-1.69/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>824</start>
<end>846</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T MCQFormatBench: Robustness Tests for Multiple-Choice Questions
%A Takizawa, Hiroo
%A Sugawara, Saku
%A Aizawa, Akiko
%Y Arviv, Ofir
%Y Clinciu, Miruna
%Y Dhole, Kaustubh
%Y Dror, Rotem
%Y Gehrmann, Sebastian
%Y Habba, Eliya
%Y Itzhak, Itay
%Y Mille, Simon
%Y Perlitz, Yotam
%Y Santus, Enrico
%Y Sedoc, João
%Y Shmueli Scheuer, Michal
%Y Stanovsky, Gabriel
%Y Tafjord, Oyvind
%S Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria and virtual meeting
%@ 979-8-89176-261-9
%F takizawa-etal-2025-mcqformatbench
%X Multiple-choice questions (MCQs) are often used to evaluate large language models (LLMs). They measure LLMs’ general common sense and reasoning abilities, as well as their knowledge in specific domains such as law and medicine. However, the robustness of LLMs to various question formats in MCQs has not been thoroughly evaluated. While there are studies on the sensitivity of LLMs to input variations, research into their responsiveness to different question formats is still limited. In this study, we propose a method to construct tasks to comprehensively evaluate the robustness against format changes of MCQs by decomposing the answering process into several steps. Using this dataset, we evaluate nine LLMs, such as Llama3-70B and Mixtral-8x7B. We find a lack of robustness to differences in the format of MCQs. It is crucial to consider whether the format of MCQs influences their evaluation scores when assessing LLMs using MCQ datasets.
%U https://aclanthology.org/2025.gem-1.69/
%P 824-846
Markdown (Informal)
[MCQFormatBench: Robustness Tests for Multiple-Choice Questions](https://aclanthology.org/2025.gem-1.69/) (Takizawa et al., GEM 2025)
ACL
Hiroo Takizawa, Saku Sugawara, and Akiko Aizawa. 2025. MCQFormatBench: Robustness Tests for Multiple-Choice Questions. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 824–846, Vienna, Austria and virtual meeting. Association for Computational Linguistics.