Semi-supervised Acoustic Modelling for Five-lingual Code-switched ASR using Automatically-segmented Soap Opera Speech

Nick Wilkinson, Astik Biswas, Emre Yilmaz, Febe De Wet, Ewald Van der westhuizen, Thomas Niesler


Abstract
This paper considers the impact of automatic segmentation on the fully-automatic, semi-supervised training of automatic speech recog-nition (ASR) systems for five-lingual code-switched (CS) speech. Four automatic segmentation techniques were evaluated in terms ofthe recognition performance of an ASR system trained on the resulting segments in a semi-supervised manner. For comparative purposesa semi-supervised syste Three of these use a newly proposed convolutional neural network (CNN) model for framewise classification,and include a novel form of HMM smoothing of the CNN outputs. Automatic segmentation was applied in combination with automaticspeaker diarization. The best-performing segmentation technique was also evaluated without speaker diarization. An evaluation basedon 248 unsegmented soap opera episodes indicated that voice activity detection (VAD) based on a CNN followed by Gaussian mixturemodel-hidden Markov model smoothing (CNN-GMM-HMM) yields the best ASR performance. The semi-supervised system trainedwith the best automatic segmentation achieved an overall WER improvement of 1.1% absolute over a semi-supervised system trainedwith manually created segments. Furthermore, we found that recognition rates improved even further when the automatic segmentationwas used in conjunction with speaker diarization.
Anthology ID:
2020.sltu-1.10
Volume:
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Dorothee Beermann, Laurent Besacier, Sakriani Sakti, Claudia Soria
Venue:
SLTU
SIG:
Publisher:
European Language Resources association
Note:
Pages:
70–78
Language:
English
URL:
https://aclanthology.org/2020.sltu-1.10
DOI:
Bibkey:
Cite (ACL):
Nick Wilkinson, Astik Biswas, Emre Yilmaz, Febe De Wet, Ewald Van der westhuizen, and Thomas Niesler. 2020. Semi-supervised Acoustic Modelling for Five-lingual Code-switched ASR using Automatically-segmented Soap Opera Speech. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 70–78, Marseille, France. European Language Resources association.
Cite (Informal):
Semi-supervised Acoustic Modelling for Five-lingual Code-switched ASR using Automatically-segmented Soap Opera Speech (Wilkinson et al., SLTU 2020)
Copy Citation:
PDF:
https://aclanthology.org/2020.sltu-1.10.pdf