Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model

Joanna Hong, Se Jin Park, Yong Man Ro


Abstract
We present a novel approach to multilingual audio-visual speech recognition by training a single model on a multilingual dataset. Motivated by the human cognitive system, in which people intuitively distinguish different languages without conscious effort or guidance, we propose a model that identifies which language is given as input speech by capturing the inherent similarities and differences between languages. To this end, we design a prompt fine-tuning technique for a large-scale pre-trained audio-visual representation model so that the network can recognize the language class as well as the speech in the corresponding language. Our work contributes to the development of robust and efficient multilingual audio-visual speech recognition systems, reducing the need for language-specific models.
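The abstract does not give implementation details, so the following is only a minimal sketch of the general prompt fine-tuning idea it describes: learnable prompt tokens are prepended to fused audio-visual features before a frozen pre-trained encoder, and lightweight heads predict the language class and the speech tokens. All module and parameter names (PromptTunedAVSR, num_prompts, d_model, the CTC-style head, etc.) are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class PromptTunedAVSR(nn.Module):
    """Generic prompt-tuning sketch (hypothetical, not the paper's code).

    Learnable prompt embeddings are prepended to fused audio-visual
    features; the pre-trained backbone stays frozen, and only the
    prompts plus two small heads are trained.
    """

    def __init__(self, pretrained_encoder: nn.Module, d_model: int = 768,
                 num_prompts: int = 16, num_languages: int = 4,
                 vocab_size: int = 1000):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():  # freeze the pre-trained backbone
            p.requires_grad = False
        # Learnable prompt embeddings, shared across languages.
        self.prompts = nn.Parameter(torch.randn(num_prompts, d_model) * 0.02)
        # Lightweight heads: language identification + speech recognition.
        self.lang_head = nn.Linear(d_model, num_languages)
        self.ctc_head = nn.Linear(d_model, vocab_size)

    def forward(self, av_features: torch.Tensor):
        # av_features: (batch, time, d_model) fused audio-visual features;
        # the encoder is assumed to map (B, T, D) -> (B, T, D).
        b = av_features.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, av_features], dim=1)  # prepend prompt tokens
        h = self.encoder(x)
        n = self.prompts.size(0)
        # Pool the prompt positions for language classification; the
        # remaining frames feed a per-frame token head.
        lang_logits = self.lang_head(h[:, :n].mean(dim=1))
        speech_logits = self.ctc_head(h[:, n:])
        return lang_logits, speech_logits


if __name__ == "__main__":
    # A small Transformer stands in for the pre-trained AV backbone.
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=2)
    model = PromptTunedAVSR(backbone)
    lang_logits, speech_logits = model(torch.randn(2, 100, 768))
    print(lang_logits.shape, speech_logits.shape)  # (2, 4) (2, 100, 1000)
```

Because the backbone is frozen, only the prompt embeddings and the two heads receive gradients, which is what makes this kind of multilingual adaptation parameter-efficient compared with training one model per language.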
Anthology ID:
2023.findings-emnlp.324
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4886–4890
URL:
https://aclanthology.org/2023.findings-emnlp.324
DOI:
10.18653/v1/2023.findings-emnlp.324
Cite (ACL):
Joanna Hong, Se Jin Park, and Yong Man Ro. 2023. Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4886–4890, Singapore. Association for Computational Linguistics.
Cite (Informal):
Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model (Hong et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.324.pdf