基于多尺度建模的端到端自动语音识别方法(An End-to-End Automatic Speech Recognition Method Based on Multiscale Modeling)

Hao Chen (陈昊), Runlai Zhang (张润来), Yuhao Zhang (张裕浩), Chenghao Gao (高成浩), Chen Xu (许晨), Anxiang Ma (马安香), Tong Xiao (肖桐), Jingbo Zhu (朱靖波)


Abstract
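In recent years, end-to-end automatic speech recognition models based on deep learning, which model speech and text directly, have gradually become mainstream thanks to their simple structure and significant performance advantages. However, continuous speech signals and discrete text differ greatly in length and representation scale, and the resulting modality gap has been a persistent problem for this task. To address it, this paper proposes a multiscale modeling method for speech recognition. Starting from the use of fine-grained distributional knowledge, the method constructs text information at multiple scales and progressively aligns the feature sequence, beginning from a fine-grained low-level sequence, to predict the text sequence. This level-by-level prediction effectively reduces prediction difficulty and mitigates the impact of the modality gap. Fusing features at different scales further improves the richness and completeness of the corpus information and strengthens the model's inference ability. Experiments on the small-scale and large-scale LibriSpeech settings and on the TEDLIUM2 dataset show average word error rate reductions of 1.7, 0.45, and 0.76, respectively, relative to the baseline systems, validating the effectiveness of the method.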
Anthology ID:
2023.ccl-1.41
Volume:
Proceedings of the 22nd Chinese National Conference on Computational Linguistics
Month:
August
Year:
2023
Address:
Harbin, China
Editors:
Maosong Sun, Bing Qin, Xipeng Qiu, Jing Jiang, Xianpei Han
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
468–479
Language:
Chinese
URL:
https://aclanthology.org/2023.ccl-1.41
Cite (ACL):
Hao Chen, Runlai Zhang, Yuhao Zhang, Chenghao Gao, Chen Xu, Anxiang Ma, Tong Xiao, and Jingbo Zhu. 2023. 基于多尺度建模的端到端自动语音识别方法(An End-to-End Automatic Speech Recognition Method Based on Multiscale Modeling). In Proceedings of the 22nd Chinese National Conference on Computational Linguistics, pages 468–479, Harbin, China. Chinese Information Processing Society of China.
Cite (Informal):
基于多尺度建模的端到端自动语音识别方法(An End-to-End Automatic Speech Recognition Method Based on Multiscale Modeling) (Chen et al., CCL 2023)
PDF:
https://aclanthology.org/2023.ccl-1.41.pdf