Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

Luca Modica; Filip Landin; Mehrdad Farahani; Livia Qian; Gabriel Skantze; Richard Johansson

Abstract

In recent years, several Speech Language Models (SLMs) that jointly represent speech and written text have been presented. This raises the question of how their internal mechanisms are similar or different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate the mechanisms behind the storage and recall of factual associations in SLMs, we use Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.

Anthology ID: 2026.starsem-conference.28
Volume: Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Saif M. Mohammad, Nedjma Ousidhoum
Venues: *SEM | WS
Publisher: Association for Computational Linguistics
Pages: 401–409
URL: https://aclanthology.org/2026.starsem-conference.28/
Cite (ACL): Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze, and Richard Johansson. 2026. Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models? In Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026), pages 401–409, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models? (Modica et al., *SEM 2026)
PDF: https://aclanthology.org/2026.starsem-conference.28.pdf
