Koshiro Aoki


2026

Recent advances in Sparse Autoencoders (SAEs) have revealed interpretable features within large language models (LLMs), including features that are specific to individual languages. In prior work, these features have been used to steer a model’s output language. However, the impact of SAE-based language steering on output quality and task performance, as well as its relationship to simpler prompting-based approaches, remains unclear. In this work, we study the effects of language steering using SAE features across multiple tasks and models. We apply language-specific SAE feature steering to three LLMs from two model families and evaluate it on a translation task and a multilingual question-answering task. We compare SAE-based steering against prompting and language neuron-based steering, and examine a combined prompting-and-steering approach. On the translation task, SAE feature steering achieves an average target-language accuracy of 92% across models and languages, consistently outperforming language neuron-based steering but slightly underperforming prompting in language accuracy and output quality. In contrast, on the multilingual question-answering task, SAE-based steering enables stronger language control than prompting, and combining steering with prompting yields the best overall language control and task performance. These findings demonstrate the potential of SAE features as a tool for controllable multilingual generation.
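For readers unfamiliar with the mechanics, SAE feature steering is commonly implemented by adding a scaled SAE decoder direction to a layer's residual stream during generation. The sketch below illustrates that idea in PyTorch; the model, layer index, steering scale, and the random stand-in for the decoder direction are illustrative assumptions, not the setup used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative assumptions: the model, layer, and scale below are
# placeholders, not values reported in the paper.
MODEL_NAME = "gpt2"      # hypothetical stand-in model
LAYER_IDX = 6            # hypothetical layer whose residual stream we steer
STEERING_SCALE = 8.0     # hypothetical steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Stand-in for one column of a trained SAE's decoder (a language-specific
# feature direction); a real run would load this from the SAE's weights.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the scaled feature direction at every token position.
    hidden = output[0] + STEERING_SCALE * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)
try:
    ids = tokenizer("Hello, how are you today?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

With a real SAE decoder column for, say, a French-specific feature, the same hook would bias generation toward French; the combined prompting-and-steering condition in the paper would correspond to applying this hook while also instructing the target language in the prompt.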

2025

Theory of Mind (ToM) is the ability to understand others’ mental states, which is essential for human social interaction. Although recent studies suggest that large language models (LLMs) exhibit human-level ToM capabilities, the underlying mechanisms remain unclear. “Simulation Theory,” a view widely discussed in cognitive science, posits that we infer others’ mental states by simulating their cognitive processes. In this work, we propose a framework for investigating whether the ToM mechanism in LLMs is based on Simulation Theory by analyzing their internal representations. Following this framework, we successfully steered LLMs’ ToM reasoning through modeled perspective-taking and counterfactual interventions. Our results suggest that Simulation Theory may partially explain the ToM mechanism in state-of-the-art LLMs, indicating parallels between human and artificial social reasoning.
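The “counterfactual interventions” mentioned above belong to the broader family of activation patching on internal representations. The sketch below is a minimal illustration of that family, not the paper's actual method or stimuli: under assumed names (model, layer, and story texts are all hypothetical), it caches a hidden state from a counterfactual false-belief story and patches it into the matching position of the factual run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative assumptions: model, layer, and story texts are placeholders,
# not the paper's materials or setup.
MODEL_NAME = "gpt2"
LAYER_IDX = 6

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# A true-belief story vs. a counterfactual false-belief story, both ending
# in the same probe so the final token position is comparable across runs.
factual = "Anna put the ball in the box and watched Ben move it to the bag."
counterfactual = "Anna put the ball in the box and left before Ben moved it to the bag."
probe = " Anna will look for the ball in the"

def final_hidden(text):
    # Cache the hidden state of the last token at LAYER_IDX.
    cache = {}
    def grab(module, inputs, output):
        cache["h"] = output[0][:, -1, :].detach()
    handle = model.transformer.h[LAYER_IDX].register_forward_hook(grab)
    with torch.no_grad():
        model(**tokenizer(text, return_tensors="pt"))
    handle.remove()
    return cache["h"]

patch = final_hidden(counterfactual + probe)

def patch_hook(module, inputs, output):
    # Overwrite the final position with the counterfactual representation.
    hidden = output[0].clone()
    hidden[:, -1, :] = patch.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(patch_hook)
try:
    ids = tokenizer(factual + probe, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    print(tokenizer.decode([logits.argmax().item()]))
finally:
    handle.remove()
```

If the model's next-token prediction shifts from the factual answer ("bag") toward the false-belief answer ("box") under the patch, the patched representation causally carries belief-relevant information, which is the kind of evidence a simulation-based account would predict.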