Whisper–TAD: A General Model for Transcription, Alignment and Diarization of Speech

Camille Lavigne, Alex Stasica


Abstract
There is currently no straightforward implementation of diarization-augmented speech transcription (DAST), i.e., transcription, diarization, and alignment to the audio within a single model. These tasks typically require distinct models, which must be stacked together for complete processing. In this study, we advocate leveraging the advanced capabilities of the Whisper models, which already excel at automatic transcription and partial alignment. Our approach involves fine-tuning the model’s parameters on both transcription and diarization tasks in a SOT-FIFO (Serialized Output Training, First In First Out) manner. This framework produces orthographic transcriptions, speaker identification, and precise alignment, making audio processing workflows more efficient. While our work represents an initial step towards a unified transcription and diarization framework, developing such a model demands substantial high-quality data augmentation and computational resources beyond our current scope. Consequently, we narrow our focus to the English language. Despite these limitations, our method demonstrates promising performance in both transcription and diarization tasks. Comparative analysis between pre-trained models and fine-tuned TAD (Transcription, Alignment, Diarization) versions suggests that incorporating diarization into a Whisper model does not compromise transcription accuracy. Our findings hint that deploying our TAD framework on the largest Whisper model could potentially yield state-of-the-art performance across all mentioned tasks.
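As a rough illustration of the SOT-FIFO idea mentioned in the abstract, the sketch below serializes the utterances of several speakers into a single target string, ordered by start time (first in, first out). It is not the authors' implementation: the `Utterance` class, the `serialize_fifo` function, and the `<|spkN|>` speaker tokens are hypothetical names chosen for illustration, and the `<|0.00|>`-style timestamp markers merely imitate the look of Whisper's timestamp tokens.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str  # raw speaker label from the annotation (hypothetical field)
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # orthographic transcription


def serialize_fifo(utterances):
    """Serialize utterances into one training target, earliest start first.

    Speakers are renumbered in order of first appearance, so the target
    does not depend on the original label names.
    """
    ordered = sorted(utterances, key=lambda u: u.start)  # FIFO by start time
    speaker_ids = {}
    parts = []
    for u in ordered:
        # Assign the next free index the first time a speaker is seen.
        idx = speaker_ids.setdefault(u.speaker, len(speaker_ids) + 1)
        # Assumed token layout: speaker tag, start time, text, end time.
        parts.append(f"<|spk{idx}|><|{u.start:.2f}|> {u.text} <|{u.end:.2f}|>")
    return "".join(parts)


example = [
    Utterance("bob", 1.2, 2.5, "hi there"),
    Utterance("alice", 0.0, 1.5, "hello"),
]
target = serialize_fifo(example)
```

Here `alice` starts first and therefore becomes speaker 1, even though she is listed second and overlaps with `bob`. A single sequence of this kind is what lets one decoder emit text, speaker turns, and time alignment together.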
Anthology ID:
2024.clib-1.3
Volume:
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
Month:
September
Year:
2024
Address:
Sofia, Bulgaria
Venue:
CLIB
Publisher:
Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Pages:
33–38
URL:
https://aclanthology.org/2024.clib-1.3
Cite (ACL):
Camille Lavigne and Alex Stasica. 2024. Whisper–TAD: A General Model for Transcription, Alignment and Diarization of Speech. In Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024), pages 33–38, Sofia, Bulgaria. Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences.
Cite (Informal):
Whisper–TAD: A General Model for Transcription, Alignment and Diarization of Speech (Lavigne & Stasica, CLIB 2024)
PDF:
https://aclanthology.org/2024.clib-1.3.pdf