Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models

Matteo Bortoletto; Constantin Ruhdorfer; Lei Shi; Andreas Bulling

doi:10.18653/v1/2025.findings-emnlp.1226

Abstract

Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. Understanding these internal mechanisms is critical, not only to move beyond surface-level performance but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts, using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs’ internal representations of others’ beliefs, which are structured (not mere by-products of spurious correlations) yet brittle to prompt variations. Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.

Anthology ID: 2025.findings-emnlp.1226
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 22521–22543
URL: https://aclanthology.org/2025.findings-emnlp.1226/
DOI: 10.18653/v1/2025.findings-emnlp.1226
Cite (ACL): Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, and Andreas Bulling. 2025. Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22521–22543, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models (Bortoletto et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.1226.pdf
Checklist: 2025.findings-emnlp.1226.checklist.pdf
