Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training

Pavel Denisov, Thang Vu


Abstract
Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.
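To make the described architecture concrete: the abstract couples a multilingual speech encoder with a multilingual LLM by mapping speech representations into the LLM's input embedding space, so that audio can be prepended to a text instruction. Below is a minimal sketch in PyTorch, assuming Hugging Face checkpoints (bigscience/bloomz-560m for the LLM, facebook/mms-300m for the encoder), a single linear projector, and an illustrative prompt; the paper's actual checkpoints, adapter design, and instruction format may differ.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, Wav2Vec2Model

# Illustrative checkpoints, not necessarily those used in the paper.
llm = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
encoder = Wav2Vec2Model.from_pretrained("facebook/mms-300m")

# Maps speech frames (d_enc) into the LLM's embedding space (d_llm).
projector = nn.Linear(encoder.config.hidden_size, llm.config.hidden_size)

waveform = torch.randn(1, 16000)  # placeholder: 1 s of 16 kHz audio

with torch.no_grad():
    speech_states = encoder(waveform).last_hidden_state  # (1, T, d_enc)
speech_embeds = projector(speech_states)                 # (1, T, d_llm)

prompt_ids = tokenizer("Transcribe the speech:", return_tensors="pt").input_ids
prompt_embeds = llm.get_input_embeddings()(prompt_ids)   # (1, L, d_llm)

# Prepend the projected speech to the instruction and decode with the LLM.
inputs_embeds = torch.cat([speech_embeds, prompt_embeds], dim=1)
out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))

With an untrained projector this naturally decodes noise; it is the multi-instructional training on transcribed speech that aligns the two spaces so that the LLM's text-side abilities transfer to speech input. Which components are trained, and on which instruction mixtures, is specified in the paper itself.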
Anthology ID:
2024.findings-naacl.52
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
814–834
URL:
https://aclanthology.org/2024.findings-naacl.52
DOI:
10.18653/v1/2024.findings-naacl.52
Cite (ACL):
Pavel Denisov and Thang Vu. 2024. Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 814–834, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training (Denisov & Vu, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.52.pdf