Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

Guangxiang Zhao; Saier Hu; Xiaoqi Jian; Wu Jinzhu; Yuhan Wu; Lin Sun; Xiangzheng Zhang

doi:10.18653/v1/2025.emnlp-main.1362

Abstract

In this paper, we propose a “Generalization Stress Test” to assess the generalization ability of Large Language Models (LLMs) under slight, controlled perturbations: changes to option length, changes to problem type, and irrelevant noun replacements. We find that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., a preference for longer distractors) when faced with these minor, content-preserving modifications. For example, Qwen 2.5 1.5B’s MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT-4o experiences a 25-point accuracy loss when problem types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and shifts in irrelevant content.
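To make the option-length perturbation concrete, the following is a minimal sketch of how one might lengthen either the correct option or the distractors of a multiple-choice item while leaving the question untouched. It is an illustration only: the `Item` type, the `lengthen` and `point_change` helpers, and the `PADDING` filler are hypothetical names introduced here, and the abstract does not describe the authors' actual procedure.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Item:
    """A multiple-choice item (hypothetical structure for illustration)."""
    question: str
    options: tuple[str, ...]
    answer: int  # index of the correct option


# Hypothetical content-neutral filler; an assumption, not the paper's method.
PADDING = " (as stated in the question)"


def lengthen(item: Item, target: str) -> Item:
    """Pad the correct option ("correct") or every distractor ("distractors").

    The question text and the answer index are left unchanged.
    """
    if target not in {"correct", "distractors"}:
        raise ValueError("target must be 'correct' or 'distractors'")
    pad_correct = target == "correct"
    padded = tuple(
        opt + PADDING if (i == item.answer) == pad_correct else opt
        for i, opt in enumerate(item.options)
    )
    return replace(item, options=padded)


def point_change(before: float, after: float) -> float:
    """Accuracy change in points (positive means improvement)."""
    return after - before
```

Under this sketch, a model's accuracy would be measured on the original items and on each perturbed variant, then compared with `point_change`. Applied to the figures quoted in the abstract, a move from 60 to 89 is a change of +29 points and a move from 89 to 36 is a change of -53 points.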

Anthology ID: 2025.emnlp-main.1362
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 26837–26846
URL: https://aclanthology.org/2025.emnlp-main.1362/
DOI: 10.18653/v1/2025.emnlp-main.1362
Cite (ACL): Guangxiang Zhao, Saier Hu, Xiaoqi Jian, Wu Jinzhu, Yuhan Wu, Lin Sun, and Xiangzheng Zhang. 2025. Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26837–26846, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements (Zhao et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1362.pdf
Checklist: 2025.emnlp-main.1362.checklist.pdf
