Using ASR-Generated Text for Spoken Language Modeling

Nicolas Hervé, Valentin Pelloin, Benoit Favre, Franck Dary, Antoine Laurent, Sylvain Meignier, Laurent Besacier


Abstract
This paper aims at improving spoken language modeling (LM) using very large amounts of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19 GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or through training an LM from scratch. The new models (FlauBERT-Oral) will be shared with the community and are evaluated not only in terms of word prediction accuracy but also for two downstream tasks: classification of TV shows and syntactic parsing of speech. Experimental results show that FlauBERT-Oral is better than its initial FlauBERT version, demonstrating that, despite its inherently noisy nature, ASR-generated text can be useful to improve spoken language modeling.
Anthology ID:
2022.bigscience-1.2
Volume:
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
Month:
May
Year:
2022
Address:
virtual+Dublin
Venues:
ACL | BigScience
Publisher:
Association for Computational Linguistics
Pages:
17–25
URL:
https://aclanthology.org/2022.bigscience-1.2
DOI:
10.18653/v1/2022.bigscience-1.2
Cite (ACL):
Nicolas Hervé, Valentin Pelloin, Benoit Favre, Franck Dary, Antoine Laurent, Sylvain Meignier, and Laurent Besacier. 2022. Using ASR-Generated Text for Spoken Language Modeling. In Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models, pages 17–25, virtual+Dublin. Association for Computational Linguistics.
Cite (Informal):
Using ASR-Generated Text for Spoken Language Modeling (Hervé et al., BigScience 2022)
PDF:
https://aclanthology.org/2022.bigscience-1.2.pdf