LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura


Abstract
We introduce LLaST, a framework for building high-performance speech-to-text translation systems based on Large Language Models (LLMs). We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. It demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of the LLM-based speech translation framework.
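The abstract only names the main components of the approach. As a concrete illustration of the dual-LoRA idea, below is a minimal PyTorch sketch of attaching two low-rank adapters to a frozen linear projection. The class name DualLoRALinear, the speech/text adapter split, and the hyperparameters (r, alpha) are illustrative assumptions, not the authors' released configuration.

```python
# Hypothetical sketch of a dual-LoRA linear layer (PyTorch). The split into a
# "speech" adapter and a "text" adapter is an illustrative assumption, not the
# paper's exact setup.
import torch
import torch.nn as nn


class DualLoRALinear(nn.Module):
    """A frozen linear layer augmented with two low-rank (LoRA) adapters."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # keep pretrained weights frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.scaling = alpha / r
        in_f, out_f = base.in_features, base.out_features
        # Adapter A: e.g. tuned on ASR-augmented / speech-conditioned inputs.
        self.speech_down = nn.Linear(in_f, r, bias=False)
        self.speech_up = nn.Linear(r, out_f, bias=False)
        # Adapter B: e.g. tuned on multilingual translation text.
        self.text_down = nn.Linear(in_f, r, bias=False)
        self.text_up = nn.Linear(r, out_f, bias=False)
        nn.init.zeros_(self.speech_up.weight)  # both adapters start as a no-op
        nn.init.zeros_(self.text_up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        out = out + self.scaling * self.speech_up(self.speech_down(x))
        out = out + self.scaling * self.text_up(self.text_down(x))
        return out


# Usage: wrap a projection inside an LLM block (dimensions are arbitrary here).
layer = DualLoRALinear(nn.Linear(4096, 4096), r=16, alpha=32.0)
y = layer(torch.randn(2, 10, 4096))
print(y.shape)  # torch.Size([2, 10, 4096])
```

In this sketch only the adapter weights receive gradients, which mirrors the general LoRA recipe of fine-tuning a small number of parameters on top of a frozen backbone; the paper should be consulted for how LLaST actually partitions and trains its two adapters.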
Anthology ID:
2024.findings-acl.416
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6976–6987
URL:
https://aclanthology.org/2024.findings-acl.416
Cite (ACL):
Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, and Satoshi Nakamura. 2024. LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models. In Findings of the Association for Computational Linguistics ACL 2024, pages 6976–6987, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models (Chen et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.416.pdf