TellWhisper: Tell Whisper Who Speaks When

Yifan Hu; Peiji Yang; Zhisheng Wang; Yicheng Zhong; Rui Liu

Abstract

Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling from speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal information within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to attend to "when" and "who" simultaneously. Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach. The project webpage is available at https://walker-hyf.github.io/TellWhisper.
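
To illustrate the general idea of a rotary encoding with two coordinate axes, the following is a minimal sketch, not the authors' implementation. It assumes that the feature vector is split into two equal regions, with the first region rotated by a time coordinate and the second by a speaker coordinate. The abstract does not specify how speaker coordinates are computed from activity and pause cues, so the integer `speaker_pos` values, the function names `rotate_pairs` and `two_axis_rope`, the equal split, and the frequency base are all hypothetical choices made for illustration.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Apply standard rotary encoding to an even-length vector at a scalar position."""
    out = []
    for i in range(len(vec) // 2):
        # Each consecutive pair (2i, 2i+1) gets its own rotation frequency.
        theta = position * base ** (-2.0 * i / len(vec))
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def two_axis_rope(vec, time_pos, speaker_pos):
    """Rotate the first half of vec by time_pos and the second half by speaker_pos.

    Assumption: an equal split between the two regions; the paper's actual
    region layout and angle schedule may differ.
    """
    split = len(vec) // 2
    return rotate_pairs(vec[:split], time_pos) + rotate_pairs(vec[split:], speaker_pos)


def attention_score(q, k):
    """Unscaled dot-product attention logit between one query and one key."""
    return sum(a * b for a, b in zip(q, k))


# Toy query and key vectors (arbitrary values, 8 dimensions).
q = [0.5, -1.0, 0.3, 0.8, 1.2, -0.4, 0.7, 0.1]
k = [1.0, 0.2, -0.6, 0.9, -0.3, 0.5, 0.4, -1.1]

# Same relative offsets (time +2, speaker +1) at two different absolute positions.
score_a = attention_score(two_axis_rope(q, 3, 1), two_axis_rope(k, 5, 2))
score_b = attention_score(two_axis_rope(q, 10, 4), two_axis_rope(k, 12, 5))
```

Under these assumptions, `score_a` and `score_b` agree up to floating-point error, because each rotated region contributes a term that depends only on the difference between the query and key coordinates along its axis. This is the property that lets a single attention logit reflect both relative timing and relative speaker state.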

Anthology ID: 2026.acl-long.861
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 18884–18898
URL: https://aclanthology.org/2026.acl-long.861/
Cite (ACL): Yifan Hu, Peiji Yang, Zhisheng Wang, Yicheng Zhong, and Rui Liu. 2026. TellWhisper: Tell Whisper Who Speaks When. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18884–18898, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): TellWhisper: Tell Whisper Who Speaks When (Hu et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.861.pdf
Checklist: 2026.acl-long.861.checklist.pdf
