Data Augmentation for the Post-Stroke Speech Transcription (PSST) Challenge: Sometimes Less Is More

Jiahong Yuan, Xingyu Cai, Kenneth Church


Abstract
We fine-tune wav2vec2.0 to recognize phonemes in aphasic speech. Our effort focuses on data augmentation: we supplement the training data with both in-domain and out-of-domain datasets. We found that although a modest amount of out-of-domain data may be helpful, the performance of the model degrades significantly when the amount of out-of-domain data is much larger than the amount of in-domain data. Our hypothesis is that fine-tuning wav2vec2.0 with a CTC loss learns not only bottom-up acoustic properties but also top-down constraints. Therefore, out-of-domain data augmentation is likely to degrade performance if there is a language model mismatch between the “in” and “out” domains. For in-domain audio without ground truth labels, we found that it is beneficial to exclude samples with less confident pseudo labels. Our final model achieves 16.7% PER (phoneme error rate) on the validation set, without using a language model for decoding. This result represents a relative error reduction of 14% over the baseline model trained without data augmentation. Finally, we found that “canonicalized” phonemes are much easier to recognize than manually transcribed phonemes.
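The quantities named in the abstract (PER, relative error reduction, and confidence-based filtering of pseudo labels) can be illustrated with a minimal sketch. It assumes that PER is the total phoneme-level edit distance divided by the total number of reference phonemes; the challenge's official scoring may differ. The function names, the `(utterance_id, phonemes, confidence)` tuple layout, and the threshold are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]],
                       hyps: List[Sequence[str]]) -> float:
    """Corpus-level PER: total edit errors / total reference phonemes (assumed definition)."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    return total_errors / total_ref


def relative_error_reduction(baseline_per: float, new_per: float) -> float:
    """Fraction of the baseline error removed by the new model."""
    return (baseline_per - new_per) / baseline_per


def filter_pseudo_labels(samples: List[Tuple[str, List[str], float]],
                         threshold: float) -> List[Tuple[str, List[str]]]:
    """Keep pseudo-labeled samples whose confidence is at least `threshold` (hypothetical)."""
    return [(uid, phonemes) for uid, phonemes, conf in samples if conf >= threshold]


# Toy data for illustration only; not from the PSST corpus.
refs = [["K", "AE", "T"], ["D", "AO", "G"]]
hyps = [["K", "AH", "T"], ["D", "AO"]]
toy_per = phoneme_error_rate(refs, hyps)  # one substitution + one deletion over 6 phonemes
kept = filter_pseudo_labels(
    [("utt1", ["K", "AE", "T"], 0.92), ("utt2", ["D", "AO"], 0.41)],
    threshold=0.8,
)
```

Under this definition, a 14% relative reduction means the final PER is 0.86 times the baseline PER; how confidence is scored for pseudo labels is left open here because the abstract does not specify it.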
Anthology ID:
2022.rapid-1.9
Volume:
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser
Venue:
RaPID
Publisher:
European Language Resources Association
Pages:
71–79
URL:
https://aclanthology.org/2022.rapid-1.9
Cite (ACL):
Jiahong Yuan, Xingyu Cai, and Kenneth Church. 2022. Data Augmentation for the Post-Stroke Speech Transcription (PSST) Challenge: Sometimes Less Is More. In Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference, pages 71–79, Marseille, France. European Language Resources Association.
Cite (Informal):
Data Augmentation for the Post-Stroke Speech Transcription (PSST) Challenge: Sometimes Less Is More (Yuan et al., RaPID 2022)
PDF:
https://aclanthology.org/2022.rapid-1.9.pdf
Data
LibriSpeech