Alex Stasica

2024

Whisper–TAD: A General Model for Transcription, Alignment and Diarization of Speech
Camille Lavigne | Alex Stasica
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)

Currently, there is a lack of a straightforward implementation of diarization-augmented speech transcription (DAST), ie. implementation of transcription, diarization and alignment to the audio within one model. These tasks typically require distinct models, necessitating to stack them together for complete processing. In this study, we advocate for leveraging the advanced capabilities of the Whisper models, which already excels in automatic transcription and partial alignment. Our approach involves fine-tuning the model’s parameters on both transcription and diarization tasks in a SOT-FIFO (Serialized Output Training-First In First Out) manner. This comprehensive framework facilitates the creation of orthographic transcriptions, identification of speakers, and precise alignment, thus enhancing the efficiency of audio processing workflows. While our work represents an initial step towards a unified transcription and diarization framework, the development of such a model demands substantial high-quality data augmentation and computational resources beyond our current scope. Consequently, our focus is narrowed to the English language. Despite these limitations, our method demonstrates promising performance in both transcription and diarization tasks. Comparative analysis between pre-trained models and fine-tuned TAD (Transcription, Alignment, Diarization) versions suggests that incorporating diarization into a Whisper model doesn’t compromise transcription accuracy. Our findings hint that deploying our TAD framework on the largest Whisper model could potentially yield state-of-the-art performance across all mentioned tasks.

pdf bib abs

Optimisation des performances d’un système de reconnaissance automatique de la parole pour les commentaires sportifs: fine-tuning de Whisper
Camille Lavigne | Alex Stasica | Anna Kupsc
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Malgré les performances élevées des systèmes automatiques de reconnaissance de la parole (Automatic Speech Recognition ; ASR) sur des corpus généraux, leur efficacité est considérablement réduite lorsqu’ils sont confrontés à des corpus spécialisés. Ces corpus peuvent notamment contenir du lexique propre à des domaines spécifiques, des accents ou du bruit de fond rendant la transcription ardue. Cette étude vise à évaluer les avantages de l’optimisation d’une transcription automatique, par opposition à manuelle, après fine-tuning d’un modèle d’ASR de dernière génération, Whisper (Radford et al., 2023), sur un corpus spécialisé de commentaires sportifs de petite taille. Nos analyses quantitatives et qualitatives indiquent que Whisper est capable d’apprendre les particularités d’un corpus de spécialité, atteignant des performances égales où supérieures aux transcripteurs humains, avec cette quantité de données limitée. Cette recherche met en lumière le rôle que l’intelligence artificielle, notamment les larges modèles de langage, peut jouer pour faciliter la création de corpus spécialisés.

Co-authors

Camille Lavigne 2
Anna Kupść 1

Venues

Fix author