Leena G. Pillai
Also published as: Leena G Pillai
2024
What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations
Kavya Manohar | Leena G Pillai
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts. We investigate the text normalization routines employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and AssemblyAI's Conformer, and their unintended consequences on performance metrics. Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially inflated performance metrics for Indic languages. We conclude by proposing a shift towards text normalization routines that leverage native linguistic expertise, ensuring more robust and accurate evaluations of multilingual ASR models.
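The failure mode described in the abstract can be illustrated with a minimal Python sketch. This is an illustrative assumption, not the paper's code or any specific model's normalizer: it mimics the common practice of deleting Unicode combining marks, which only strips diacritics in Latin scripts but deletes vowel signs and other meaningful graphemes in Indic scripts.

```python
# Minimal sketch (illustrative assumption, not any model's actual code)
# of a normalizer that deletes Unicode combining marks (categories Mn/Mc/Me).
import unicodedata

def strip_marks(text: str) -> str:
    """Drop all combining marks after compatibility decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )

# Harmless on a Latin-script word: the accent is a true diacritic.
print(strip_marks("café"))      # -> cafe

# Destructive on Malayalam: the vowel sign and anusvara are Mark-category
# characters, so dropping them deletes real graphemes and changes the word.
print(strip_marks("മലയാളം"))   # -> മലയള  (no longer the word "Malayalam")
```

Because references and hypotheses are mangled in the same way, error rates computed on such normalized strings can look deceptively low, which is the inflation effect the paper documents.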
2023
Automatic Speech Recognition System for Malasar Language using Multilingual Transfer Learning
Basil K. Raju | Leena G. Pillai | Kavya Manohar | Elizabeth Sherly
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
This study pioneers the development of an automatic speech recognition (ASR) system for the Malasar language, an extremely low-resource ethnic language spoken by a tribal community in the Western Ghats of South India. Malasar is primarily an oral language with no native script, so it is commonly transcribed in Tamil script, the script of a closely related major language. This work presents the first effort to leverage multilingual transfer learning for recognising Malasar speech. By fine-tuning a pre-trained multilingual transformer model on a Malasar speech corpus, we brought the WER down from 99.08% (zero-shot baseline) to 48.00%. This work demonstrates the efficacy of multilingual transfer learning in addressing the challenges of ASR for extremely low-resource languages, contributing to the preservation of their linguistic and cultural heritage.
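For context, the word error rate (WER) reported above is the standard edit-distance metric over words. A hedged sketch using the jiwer library follows; this is a common tool choice, not necessarily the authors' pipeline, and the Tamil-script strings are hypothetical placeholders, not from the Malasar corpus.

```python
# Hedged sketch of WER computation with the jiwer library (pip install jiwer).
# The Tamil-script strings below are hypothetical, not from the Malasar corpus.
import jiwer

reference = "வணக்கம் உலகம்"    # ground-truth transcript (hypothetical)
hypothesis = "வணக்கம் உலகு"    # ASR output with one wrong word

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # -> WER: 50.00%
```

A zero-shot baseline corresponds to scoring the pre-trained model's raw transcripts this way before any fine-tuning; the fine-tuned model is then evaluated with the same metric for a direct comparison.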