STTATTS: Unified Speech-To-Text And Text-To-Speech Model

Hawau Toyin, Hao Li, Hanan Aldarmaki


Abstract
Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs (~50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to a shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.
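
The page does not detail the model architecture, but the abstract's core idea (one set of shared parameters trained with a joint ASR + TTS objective) can be illustrated with a minimal sketch. The code below is not the authors' implementation: the module names, the single shared Transformer encoder, the linear task heads, the loss types (cross-entropy for ASR, L1 mel regression for TTS), and the equal loss weights are all illustrative assumptions, and inputs are assumed to be pre-embedded to the model dimension.

```python
import torch.nn as nn


class SharedMultiTaskModel(nn.Module):
    """Toy joint ASR/TTS model: one shared encoder, two small task heads."""

    def __init__(self, d_model=512, vocab_size=1000, n_mels=80):
        super().__init__()
        # Parameters shared by both tasks -- the source of the parameter saving.
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # Task-specific heads are kept small.
        self.asr_head = nn.Linear(d_model, vocab_size)  # speech -> token logits
        self.tts_head = nn.Linear(d_model, n_mels)      # text -> mel frames

    def forward(self, feats, task):
        hidden = self.shared_encoder(feats)  # feats: (batch, time, d_model)
        return self.asr_head(hidden) if task == "asr" else self.tts_head(hidden)


def multitask_loss(model, speech_feats, text_feats, token_targets, mel_targets,
                   asr_weight=0.5, tts_weight=0.5):
    """Weighted sum of the two task losses, computed on one shared model."""
    # ASR branch: cross-entropy over predicted tokens.
    asr_logits = model(speech_feats, task="asr")        # (B, T, vocab)
    asr_loss = nn.functional.cross_entropy(
        asr_logits.transpose(1, 2), token_targets)      # targets: (B, T)
    # TTS branch: L1 regression on mel-spectrogram frames.
    mel_pred = model(text_feats, task="tts")            # (B, T, n_mels)
    tts_loss = nn.functional.l1_loss(mel_pred, mel_targets)
    return asr_weight * asr_loss + tts_weight * tts_loss
```

In a setup like this, the per-task loss weights and the split between shared and task-specific parameters are the main knobs for trading off ASR and TTS quality against the overall parameter budget.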
Anthology ID: 2024.findings-emnlp.401
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6853–6863
URL: https://aclanthology.org/2024.findings-emnlp.401
Cite (ACL): Hawau Toyin, Hao Li, and Hanan Aldarmaki. 2024. STTATTS: Unified Speech-To-Text And Text-To-Speech Model. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6853–6863, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): STTATTS: Unified Speech-To-Text And Text-To-Speech Model (Toyin et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.401.pdf