Yoichi Aoki

2026

This study investigates the internal information flow of large language models (LLMs) while performing chain-of-thought (CoT) style reasoning.Specifically, with a particular interest in the faithfulness of the CoT explanation to LLMs’ final answer, we explore (i) when the LLMs’ answer is (pre)determined, especially before the CoT begins or after, and (ii) how strongly the information from CoT specifically has a causal effect on the final answer.Our experiments with controlled arithmetic tasks reveal a systematic internal reasoning mechanism of LLMs.They have not derived an answer at the moment when input was fed into the model.Instead, they compute (sub-)answers while generating the reasoning chain on the fly.Therefore, the generated reasoning chains can be regarded as faithful reflections of the model’s internal computation.

2024

pdf bib abs

Explicit multi-step reasoning, such as chain-of-thought, is widely adopted in the community to explore the better performance of language models (LMs). We report on the systematic strategy that LMs use in this process.Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely, their reliance on heuristics decreases as LMs progress closer to the final answer. This suggests that LMs track only a limited number of future steps and dynamically combine heuristic strategies with rational ones in solving tasks involving multi-step reasoning.

2023

pdf bib abs

Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.

pdf bib abs

Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models.

Co-authors

Shusaku Sone 2

Masaya Taniguchi 2

Masashi Yoshikawa 2

Venues

Fix author