VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering

Zackary Rackauckas; Julia Hirschberg

doi:10.18653/v1/2025.magmar-1.3

Abstract

We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system that bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries. VoxRAG employs silence-aware segmentation, speaker diarization, CLAP audio embeddings, and FAISS retrieval using L2-normalized cosine similarity. We construct a 50-query test set recorded as spoken input by a native English speaker. Retrieval quality was evaluated using LLM-as-a-judge annotations. For very relevant segments, cosine similarity achieved a Recall@10 of 0.34. For somewhat relevant segments, Recall@10 rose to 0.60 and nDCG@10 to 0.27, highlighting strong topical alignment. Answer quality was judged on a 0–2 scale across relevance, accuracy, completeness, and precision, with mean scores of 0.84, 0.58, 0.56, and 0.46 respectively. While precision and retrieval quality remain key limitations, VoxRAG shows that transcription-free speech-to-speech retrieval is feasible in RAG systems.
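As an illustration of the retrieval and scoring steps named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes embeddings are already available as NumPy arrays (the abstract uses CLAP embeddings and a FAISS index; here a brute-force inner product over L2-normalized vectors stands in for both). The function names, the linear-gain form of nDCG, and the definition of recall as the fraction of judged-relevant segments found in the top k are all assumptions made for the example; the paper may define them differently.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so that inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors


def top_k_cosine(query, segments, k=10):
    """Return indices and cosine scores of the k segments closest to the query.

    query:    1-D array, one embedding (hypothetical stand-in for a spoken-query embedding)
    segments: 2-D array, one embedding per audio segment
    """
    q = l2_normalize(np.asarray(query, dtype=float).reshape(1, -1))
    s = l2_normalize(np.asarray(segments, dtype=float))
    scores = (s @ q.T).ravel()                       # cosine similarity per segment
    order = np.argsort(-scores, kind="stable")[:k]   # highest score first
    return order, scores[order]


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant segments that appear in the top k (assumed definition)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    retrieved = {int(i) for i in ranked_ids[:k]}
    return len(retrieved & relevant) / len(relevant)


def ndcg_at_k(ranked_gains, all_gains, k=10):
    """nDCG@k with linear gains (assumed variant).

    ranked_gains: relevance grades in the order the system returned them
    all_gains:    relevance grades of every judged segment, used for the ideal ranking
    """
    def dcg(gains):
        g = np.asarray(list(gains)[:k], dtype=float)
        discounts = np.log2(np.arange(2, g.size + 2))  # log2(rank + 1)
        return float(np.sum(g / discounts))

    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0
```

Normalizing both sides before the inner product is what lets an inner-product index behave as a cosine-similarity index, which is the usual way this is set up with FAISS.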

Anthology ID: 2025.magmar-1.3
Volume: Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025)
Month: August
Year: 2025
Address: Vienna, Austria
Editors: Reno Kriz, Kenton Murray
Venues: MAGMaR | WS
Publisher: Association for Computational Linguistics
Pages: 40–46
URL: https://aclanthology.org/2025.magmar-1.3/
DOI: 10.18653/v1/2025.magmar-1.3
Cite (ACL): Zackary Rackauckas and Julia Hirschberg. 2025. VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering. In Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025), pages 40–46, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering (Rackauckas & Hirschberg, MAGMaR 2025)
PDF: https://aclanthology.org/2025.magmar-1.3.pdf
