Distilling an End-to-End Voice Assistant Without Instruction Training Data

William Held; Yanzhe Zhang; Minzhi Li; Weiyan Shi; Michael J. Ryan; Diyi Yang

doi:10.18653/v1/2025.acl-long.388

Abstract

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (speech-in, text-out) trained with supervised finetuning (SFT) have led to models “forgetting” capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, DiVA better matches user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.
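The training signal described in the abstract, where a text-only LLM's response to a transcript supervises the speech model, can be illustrated with a toy distillation loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so a generic KL divergence between next-token distributions is used here, and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over response positions.

    teacher_logits: text-only LLM outputs given the transcript (assumed frozen).
    student_logits: speech model outputs given the audio of the same utterance.
    No annotated response is needed; the teacher's output is the target.
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)

# Hypothetical toy logits: 2 response positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_matched = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_off = [[-1.0, 0.5, 2.0], [1.5, 0.1, 0.3]]

loss_matched = distillation_loss(teacher, student_matched)
loss_off = distillation_loss(teacher, student_off)
```

The loss is zero when the speech model reproduces the text model's distribution and grows as the two diverge, which is the sense in which the text-only LLM acts as self-supervision.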

Anthology ID: 2025.acl-long.388
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7876–7891
URL: https://aclanthology.org/2025.acl-long.388/
DOI: 10.18653/v1/2025.acl-long.388
Cite (ACL): William Held, Yanzhe Zhang, Minzhi Li, Weiyan Shi, Michael J Ryan, and Diyi Yang. 2025. Distilling an End-to-End Voice Assistant Without Instruction Training Data. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7876–7891, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Distilling an End-to-End Voice Assistant Without Instruction Training Data (Held et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.388.pdf
