Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis

Jianqiao Lu, Wenyong Huang, Nianzu Zheng, Xingshan Zeng, Yu Yeung, Xiao Chen


Abstract
Training a high-performance end-to-end (E2E) speech processing model requires an enormous amount of labeled speech data, especially in the era of data-centric artificial intelligence. However, labeled speech data are usually scarcer and more expensive to collect than textual data. We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models. We train a latent synthesizer to convert textual data into an intermediate latent representation of a pre-trained speech model. These pseudo acoustic representations of textual data augment acoustic data for model training. We evaluate LaSyn on low-resource automatic speech recognition (ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an E2E baseline trained on LibriSpeech train-clean-100, with relative word error rate reductions of over 22.3% on different test sets. For SLU, LaSyn improves our E2E baseline by an absolute 4.1% for intent classification accuracy and 3.8% for slot filling SLU-F1 on SLURP, and by an absolute 4.49% and 2.25% for exact match (EM) and EM-Tree accuracies on STOP, respectively. With fewer parameters, the results of LaSyn are competitive with published state-of-the-art work. The results demonstrate the quality of the augmented training data.
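The abstract reports ASR gains as relative reductions and SLU gains as absolute improvements. The two conventions can be illustrated with a minimal Python sketch; the function names and the example error rates below are hypothetical and are not taken from the paper.

```python
def absolute_reduction(baseline: float, improved: float) -> float:
    """Difference in percentage points for a lower-is-better metric such as WER."""
    return baseline - improved


def relative_reduction(baseline: float, improved: float) -> float:
    """Reduction expressed as a percentage of the baseline value."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical word error rates (in %), chosen only for illustration.
baseline_wer = 10.0
improved_wer = 7.5

print(absolute_reduction(baseline_wer, improved_wer))  # 2.5 percentage points
print(relative_reduction(baseline_wer, improved_wer))  # 25.0 percent of the baseline
```

Under these assumed numbers, the same change reads as a 2.5-point absolute reduction or a 25% relative reduction, which is why the abstract states which convention each figure uses.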
Anthology ID:
2023.findings-emnlp.327
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4916–4928
URL:
https://aclanthology.org/2023.findings-emnlp.327
DOI:
10.18653/v1/2023.findings-emnlp.327
Cite (ACL):
Jianqiao Lu, Wenyong Huang, Nianzu Zheng, Xingshan Zeng, Yu Yeung, and Xiao Chen. 2023. Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4916–4928, Singapore. Association for Computational Linguistics.
Cite (Informal):
Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis (Lu et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.327.pdf