Mike Zhou
2026
Program-of-Thought Reveals LLM Abstraction Ceilings
Mike Zhou | Fenil Bardoliya | Vivek Gupta | Dan Roth
Findings of the Association for Computational Linguistics: EACL 2026
Large language models (LLMs) are often claimed to exhibit reasoning ability when supervised with chain-of-thought (CoT) traces. True reasoning, however, requires invariance: isomorphic problems should yield identical solutions regardless of superficial variation. We test this property by evaluating base and reasoning-optimized models—including LLaMA, Mistral, Qwen, GPT-OSS, and DeepSeek—on isomorphic variants from GSM8K and MATH. All models exhibit substantial accuracy drops under perturbation. To assess whether training can induce invariance, we fine-tune models with Program-of-Thought (PoT) supervision under concrete and masked formulations. PoT fine-tuning increases behavioral cross-variant consistency but does not significantly reduce the accuracy gap, and these gains fail to transfer across prompting formats and domains. Our central finding is that models converge toward stable but systematically incorrect behaviors: consistency without correctness. This dissociation suggests that current reasoning supervision teaches models to reproduce solution templates rather than to abstract mathematical structure.
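The "consistency without correctness" dissociation can be made concrete with two simple metrics over a model's answers to isomorphic variants of the same problem. This is an illustrative sketch, not the paper's exact formulation: here consistency is taken to be agreement with the modal answer, and `cross_variant_consistency` and `accuracy` are hypothetical helper names.

```python
from collections import Counter

def cross_variant_consistency(answers):
    """Fraction of isomorphic variants whose answer matches the modal answer.
    1.0 means fully stable behavior, whether or not the answer is right."""
    counts = Counter(answers)
    return max(counts.values()) / len(answers)

def accuracy(answers, gold):
    """Fraction of variants answered correctly."""
    return sum(a == gold for a in answers) / len(answers)

# A model can be perfectly consistent yet always wrong:
answers = ["7", "7", "7", "7"]   # same answer on every isomorphic variant
gold = "9"
print(cross_variant_consistency(answers))  # 1.0
print(accuracy(answers, gold))             # 0.0
```

Under this toy definition, PoT fine-tuning would raise the first number without moving the second — exactly the pattern the abstract describes.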
2024
Can Transformers Learn n-gram Language Models?
Anej Svete | Nadav Borenstein | Mike Zhou | Isabelle Augenstein | Ryan Cotterell
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Much theoretical work has described the ability of transformers to represent formal languages. However, linking theoretical results to empirical performance is not straightforward due to the complex interplay between the architecture, the learning algorithm, and training data. To test whether theoretical lower bounds imply learnability of formal languages, we turn to recent work relating transformers to n-gram language models (LMs). We study transformers’ ability to learn random n-gram LMs of two kinds: ones with arbitrary next-symbol probabilities and ones whose next-symbol probabilities are defined with shared parameters. We find that classic estimation techniques for n-gram LMs such as add-λ smoothing outperform transformers on the former, while transformers perform better on the latter, outperforming methods specifically designed to learn n-gram LMs.
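For context on the classic baseline mentioned above, here is a minimal sketch of add-λ (Lidstone) smoothing for a bigram LM: p(y | x) = (count(x, y) + λ) / (count(x) + λ·|V|). The function name and toy corpus are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def add_lambda_bigram(corpus, vocab, lam=1.0):
    """Return a smoothed bigram probability function p(y | x).

    corpus: iterable of token sequences; vocab: set of symbols; lam: smoothing weight.
    """
    pair_counts = Counter()  # counts of (x, y) bigrams
    ctx_counts = Counter()   # counts of context symbol x
    for seq in corpus:
        for x, y in zip(seq, seq[1:]):
            pair_counts[(x, y)] += 1
            ctx_counts[x] += 1
    V = len(vocab)

    def prob(y, x):
        return (pair_counts[(x, y)] + lam) / (ctx_counts[x] + lam * V)

    return prob

prob = add_lambda_bigram([["a", "b", "a", "b"], ["a", "a"]], vocab={"a", "b"})
print(prob("b", "a"))  # (2 + 1) / (3 + 2) = 0.6
```

Because λ > 0 adds pseudo-counts to every bigram, unseen next symbols still get nonzero probability, and for each context the probabilities sum to one.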