Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

Michael Lan, Philip Torr, Fazl Barez


Abstract
While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B that detects sequence members and predicts the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit affects various math-related prompts, including intervaled sequences, Spanish number word and month continuation, and natural language word problems. Overall, documenting shared computational structures enables better prediction of model behavior, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.
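The circuit analysis the abstract refers to is commonly performed with activation patching: running the model on a corrupted prompt while splicing in cached activations from a clean prompt, and measuring how much each component restores the correct prediction. The sketch below is a minimal illustration of that technique using the open-source TransformerLens library, not the authors' released code; the choice of head (layer 9, head 1) and the prompts are hypothetical placeholders, not findings from the paper.

```python
# Minimal activation-patching sketch for a sequence-continuation task
# in GPT-2 Small, using TransformerLens. The patched head (9.1) and
# the prompts are illustrative assumptions, not results from the paper.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

clean_prompt = "1 2 3 4"    # increasing Arabic numerals
corrupt_prompt = "8 2 7 4"  # same token length, broken sequence
answer_id = model.to_single_token(" 5")  # correct clean continuation

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache activations from the clean run so they can be spliced in later.
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_head(value, hook, head_index):
    # value: [batch, pos, n_heads, d_head]; overwrite one head's output
    # with its activation from the clean run.
    value[:, :, head_index, :] = clean_cache[hook.name][:, :, head_index, :]
    return value

layer, head = 9, 1  # hypothetical head; the paper identifies its own set
hook_name = f"blocks.{layer}.attn.hook_z"

corrupt_logits = model(corrupt_tokens)
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(hook_name, lambda v, hook: patch_head(v, hook, head))],
)

# If the head matters for continuation, patching it should raise the
# logit of the correct next member on the corrupted prompt.
print("corrupt logit for ' 5':", corrupt_logits[0, -1, answer_id].item())
print("patched logit for ' 5':", patched_logits[0, -1, answer_id].item())
```

Sweeping this patch over every layer, head, and token position (and over MLP outputs) yields the importance map from which a circuit subgraph like the one the abstract describes is assembled.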
Anthology ID:
2024.emnlp-main.699
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12576–12601
URL:
https://aclanthology.org/2024.emnlp-main.699
Cite (ACL):
Michael Lan, Philip Torr, and Fazl Barez. 2024. Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12576–12601, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models (Lan et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.699.pdf