Shashi Kumar
2024
TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR
Shashi Kumar, Srikanth Madikeri, Juan Pablo Zuluaga Gomez, Iuliia Thorbecke, Esaú Villatoro-Tello, Sergio Burdisso, Petr Motlicek, Karthik Pandia D S, Aravind Ganapathiraju
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Conversational intelligence from speech has traditionally relied on a cascaded pipeline involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on three different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach on individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
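
Below is a minimal sketch of the token-integration idea described in the abstract: task-specific tokens are inserted into the reference transcript so that a single Transducer model learns the auxiliary tasks alongside ASR. The token names (<sc>, <ep>, <ner>, </ner>) and the augment_reference helper are illustrative assumptions, not the paper's exact scheme; the repository linked above contains the authors' actual implementation.

# Sketch: augment a word-level reference transcript with hypothetical
# task tokens for speaker change (<sc>), endpointing (<ep>), and
# named-entity spans (<ner> ... </ner>).
def augment_reference(words, speaker_change_idx, endpoint_idx, ner_spans):
    """Insert task tokens into a reference transcript.

    words              -- list of reference words
    speaker_change_idx -- set of word indices followed by a speaker change
    endpoint_idx       -- set of word indices followed by a semantic endpoint
    ner_spans          -- list of (start, end) word-index spans of entities
    """
    starts = {s for s, _ in ner_spans}
    ends = {e for _, e in ner_spans}
    out = []
    for i, w in enumerate(words):
        if i in starts:
            out.append("<ner>")
        out.append(w)
        if i in ends:
            out.append("</ner>")
        if i in speaker_change_idx:
            out.append("<sc>")
        if i in endpoint_idx:
            out.append("<ep>")
    return " ".join(out)

if __name__ == "__main__":
    words = "hello this is john smith speaking".split()
    print(augment_reference(words,
                            speaker_change_idx={1},
                            endpoint_idx={5},
                            ner_spans=[(3, 4)]))
    # hello this <sc> is <ner> john smith </ner> speaking <ep>

The augmented text then serves as the training target of an otherwise standard Transducer ASR model, so the auxiliary tasks come at no extra inference cost.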
Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper
Iuliia Thorbecke, Juan Pablo Zuluaga Gomez, Esaú Villatoro-Tello, Shashi Kumar, Pradeep Rangappa, Sergio Burdisso, Petr Motlicek, Karthik Pandia D S, Aravind Ganapathiraju
Findings of the Association for Computational Linguistics: EMNLP 2024
Training automatic speech recognition (ASR) models with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch, in their entirety, on consumer-grade, accessible GPUs using pseudo-labeled (PL) speech from foundational speech models (FSM). This yields a robust ASR model in a single stage, without the large data and computational budget required by the two-step scenario of pre-training and fine-tuning. We perform a comprehensive ablation over different aspects of PL-based streaming TT models, including the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) overall TT performance as a function of FSM size. Our results demonstrate that TT models can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.
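
Below is a minimal sketch of the pseudo-labeling step, assuming the open-source openai-whisper package as the foundational speech model. The specific thresholds and the repetition check are illustrative assumptions in the spirit of the hallucination-filtering heuristics the abstract mentions, not the paper's exact filters.

import whisper

# The FSM size ("tiny" through "large-v2") is the variable ablated in (4).
model = whisper.load_model("large-v2")

def pseudo_label(audio_path, lang="en"):
    """Transcribe one utterance; return its text or None if it looks hallucinated."""
    result = model.transcribe(audio_path, language=lang)
    for seg in result["segments"]:
        # Low average log-probability often correlates with hallucinated text.
        if seg["avg_logprob"] < -1.0:  # assumed threshold
            return None
        # A high compression ratio signals degenerate, repetitive output.
        if seg["compression_ratio"] > 2.4:  # assumed threshold
            return None
    text = result["text"].strip()
    # Drop transcripts dominated by a single repeated word.
    words = text.lower().split()
    if words and max(words.count(w) for w in set(words)) > 0.5 * len(words):
        return None
    return text

Surviving (audio, pseudo-label) pairs then serve as the only training targets for the streaming Transformer-Transducer, which is trained from scratch in a single stage.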