New Datasets and Controllable Iterative Data Augmentation Method for Code-switching ASR Error Correction

Zhaohong Wan; Xiaojun Wan; Wei Peng; Rongjun Li

doi:10.18653/v1/2023.findings-emnlp.543

Abstract

With the wide use of automatic speech recognition (ASR) systems, researchers are paying more attention to the ASR error correction task to improve the quality of recognition results. In particular, ASR in bilingual or multilingual settings, namely code-switching ASR, poses greater challenges and has greater research value. In this paper, we first present code-switching ASR correction datasets obtained from solid ASR systems and automatic annotators. The datasets contain Chinese-English code-switching dialogues of bilingual speakers in Singapore, Malaysia, and Hong Kong. Based on this task, we propose a controllable iterative (CI) data augmentation method for improving the performance of mainstream ASR error correction systems. With a small amount of training data, our proposed method can iteratively produce abundant pseudo-parallel data from a monolingual corpus for Chinese-English code-switching ASR correction. Experimental results show that our method outperforms rule-based and back-translation-based data augmentation methods as well as the large language model ChatGPT.

Anthology ID: 2023.findings-emnlp.543
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8075–8087
URL: https://aclanthology.org/2023.findings-emnlp.543
DOI: 10.18653/v1/2023.findings-emnlp.543
Cite (ACL): Zhaohong Wan, Xiaojun Wan, Wei Peng, and Rongjun Li. 2023. New Datasets and Controllable Iterative Data Augmentation Method for Code-switching ASR Error Correction. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8075–8087, Singapore. Association for Computational Linguistics.
Cite (Informal): New Datasets and Controllable Iterative Data Augmentation Method for Code-switching ASR Error Correction (Wan et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.543.pdf
