Methods to Optimize Wav2Vec with Language Model for Automatic Speech Recognition in Resource Constrained Environment

Vaibhav Haswani, Padmapriya Mohankumar


Abstract
Automatic Speech Recognition (ASR) in resource-constrained environments is a complex task, since most state-of-the-art models combine multilayered convolutional neural networks (CNNs) with Transformer models, which themselves require substantial resources such as GPUs or TPUs for both training and inference. The accuracy of an ASR system, as a performance metric, depends on how efficiently the acoustic model translates phonemes to words and how well the language model corrects context. Inference, as a performance metric, is also an important aspect and depends mostly on the available resources. In addition, most ASR models use Transformers at their core, and one caveat of Transformers is that they can usually handle only a finite sequence length, either because they use position encodings or simply because the cost of attention is O(n²) in sequence length, so very long sequences explode in complexity and memory. As a result, such a system cannot be run on finite hardware, even a very high-end GPU: running inference on even a one-hour audio file with Wav2Vec will crash the system. In this paper, we use several state-of-the-art methods to optimize the Wav2Vec model for better prediction accuracy on resource-constrained systems. In addition, we performed tests with other SOTA models, such as Citrinet and QuartzNet, for comparative analysis.
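To make the O(n²) argument concrete, the following is a minimal sketch rather than the authors' implementation. It estimates the size of a single attention score matrix for a given audio length and shows one common workaround, splitting the waveform into overlapping fixed-length chunks. The frame rate of 50 frames per second (roughly one frame per 20 ms, as in Wav2Vec 2.0), the 4-byte float32 values, and the function names `attention_matrix_bytes` and `chunk_bounds` are all assumptions made for illustration; the abstract does not say which optimization methods the paper uses.

```python
def attention_matrix_bytes(audio_seconds, frames_per_second=50, bytes_per_value=4):
    """Size of ONE n x n attention score matrix (one head, one layer).

    frames_per_second=50 is an assumed, approximate encoder frame rate;
    bytes_per_value=4 assumes float32 scores.
    """
    n = int(audio_seconds * frames_per_second)  # sequence length in frames
    return n * n * bytes_per_value              # quadratic in n


def chunk_bounds(num_samples, chunk_samples, overlap_samples=0):
    """Return (start, end) sample indices of overlapping chunks covering the audio.

    Hypothetical helper: each chunk would be transcribed separately so that
    the sequence length seen by the Transformer stays bounded.
    """
    if chunk_samples <= overlap_samples:
        raise ValueError("chunk_samples must be larger than overlap_samples")
    step = chunk_samples - overlap_samples
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_samples, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds


if __name__ == "__main__":
    one_hour = attention_matrix_bytes(3600)   # 180,000 frames
    thirty_sec = attention_matrix_bytes(30)   # 1,500 frames
    print(f"1 hour : {one_hour / 1e9:.1f} GB per head per layer")
    print(f"30 sec : {thirty_sec / 1e6:.1f} MB per head per layer")

    # One hour of 16 kHz audio in 30 s chunks with 5 s of overlap.
    sr = 16_000
    chunks = chunk_bounds(3600 * sr, 30 * sr, 5 * sr)
    print(f"{len(chunks)} chunks")
```

Under these assumptions, a 120-fold increase in audio length (30 seconds to one hour) raises the size of each attention matrix by a factor of 14,400, which is why bounding the chunk length keeps memory use manageable. Overlapping the chunks is one way to reduce errors at chunk boundaries, at the cost of some redundant computation.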
Anthology ID:
2022.icon-main.20
Volume:
Proceedings of the 19th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2022
Address:
New Delhi, India
Editors:
Md. Shad Akhtar, Tanmoy Chakraborty
Venue:
ICON
Publisher:
Association for Computational Linguistics
Pages:
149–153
URL:
https://aclanthology.org/2022.icon-main.20
Cite (ACL):
Vaibhav Haswani and Padmapriya Mohankumar. 2022. Methods to Optimize Wav2Vec with Language Model for Automatic Speech Recognition in Resource Constrained Environment. In Proceedings of the 19th International Conference on Natural Language Processing (ICON), pages 149–153, New Delhi, India. Association for Computational Linguistics.
Cite (Informal):
Methods to Optimize Wav2Vec with Language Model for Automatic Speech Recognition in Resource Constrained Environment (Haswani & Mohankumar, ICON 2022)
PDF:
https://aclanthology.org/2022.icon-main.20.pdf