Title: MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
Authors: Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang
Date: 2024-11
Type: text
Genre: conference publication
In: Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
Editors: Dieuwke Hupkes, Verna Dankers, Khuyagbaatar Batsuren, Amirhossein Kazemnejad, Christos Christodoulopoulos, Mario Giulianelli, Ryan Cotterell
Publisher: Association for Computational Linguistics
Place: Miami, Florida, USA
Pages: 69-85
Identifier: wang-etal-2024-mmlu
DOI: 10.18653/v1/2024.genbench-1.5
URL: https://aclanthology.org/2024.genbench-1.5/