DialectMoE: An End-to-End Multi-Dialect Speech Recognition Model with Mixture-of-Experts

Zhou Jie, Gao Shengxiang, Yu Zhengtao, Dong Ling, Wang Wenjun


Abstract
Dialect speech recognition has long been a challenge for Automatic Speech Recognition (ASR) systems. While many ASR systems perform well on Mandarin, their performance drops significantly on dialect speech. This is mainly due to the marked differences in pronunciation between the dialects and Mandarin, and to the scarcity of dialect speech data. In this paper, we propose DialectMoE, a Chinese multi-dialect speech recognition model based on Mixture-of-Experts (MoE) for low-resource conditions. Specifically, DialectMoE assigns input sequences to a set of experts using a dynamic routing algorithm, with each expert potentially trained for a specific dialect. The outputs of these experts are then combined to derive the final output. Because dialects share similarities, an expert may also help recognize dialects other than its own. Experimental results on the public Datatang dialect dataset show that, compared with the baseline model, DialectMoE reduces Character Error Rate (CER) for the Sichuan, Yunnan, Hubei and Henan dialects by 23.6%, 32.6%, 39.2% and 35.09%, respectively. The proposed DialectMoE model demonstrates outstanding performance in multi-dialect speech recognition.
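The abstract describes routing inputs to a set of experts and combining their outputs, but does not specify the routing algorithm. The sketch below illustrates the general idea with generic softmax top-k gating, which is an assumption and not necessarily the authors' method. The names `route`, `moe_forward`, `gate_weights` and `top_k`, and the toy scaling experts, are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(x: Vector, gate_weights: List[Vector], top_k: int) -> List[float]:
    """Return one mixing weight per expert; experts outside the top-k get 0.

    Assumption: a linear gate followed by softmax and top-k selection.
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(scores)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept_mass = sum(probs[i] for i in keep)
    # Renormalise so the selected experts' weights sum to 1.
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def moe_forward(
    x: Vector,
    experts: List[Callable[[Vector], Vector]],
    gate_weights: List[Vector],
    top_k: int = 2,
) -> Vector:
    """Combine the selected experts' outputs as a weighted sum."""
    weights = route(x, gate_weights, top_k)
    out = [0.0] * len(x)
    for w, expert in zip(weights, experts):
        if w == 0.0:
            continue  # unselected experts are never evaluated
        y = expert(x)
        out = [o + w * y_i for o, y_i in zip(out, y)]
    return out


# Toy usage: three hypothetical experts that only scale their input.
experts = [
    lambda v: [1.0 * a for a in v],
    lambda v: [2.0 * a for a in v],
    lambda v: [3.0 * a for a in v],
]
gate_weights = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]  # one gate row per expert
frame = [1.0, 0.0]  # stand-in for one acoustic feature frame
print(route(frame, gate_weights, top_k=2))
print(moe_forward(frame, experts, gate_weights, top_k=2))
```

In this toy setting, a frame that scores highly for one expert still receives a contribution from the second-ranked expert, which mirrors the abstract's point that experts may assist with dialects other than their own.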
Anthology ID:
2024.ccl-1.89
Volume:
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Month:
July
Year:
2024
Address:
Taiyuan, China
Editors:
Maosong Sun, Jiye Liang, Xianpei Han, Zhiyuan Liu, Yulan He
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
1148–1159
Language:
English
URL:
https://aclanthology.org/2024.ccl-1.89/
Cite (ACL):
Zhou Jie, Gao Shengxiang, Yu Zhengtao, Dong Ling, and Wang Wenjun. 2024. DialectMoE: An End-to-End Multi-Dialect Speech Recognition Model with Mixture-of-Experts. In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference), pages 1148–1159, Taiyuan, China. Chinese Information Processing Society of China.
Cite (Informal):
DialectMoE: An End-to-End Multi-Dialect Speech Recognition Model with Mixture-of-Experts (Jie et al., CCL 2024)
PDF:
https://aclanthology.org/2024.ccl-1.89.pdf