基于改进Conformer的新闻领域端到端语音识别(End-to-End Speech Recognition in News Field based on Conformer)

Jimin Zhang (张济民), Kerekadeer Zao (早克热·卡德尔), Yunfei Shen (申云飞), Shanwumaier Ai (艾山·吾买尔), Liejun Wang (汪烈军)


Abstract
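Most open-source Chinese speech recognition datasets currently target the general domain, and open-source speech recognition corpora for the news domain are lacking. This paper therefore builds CHNEWSASR, a Chinese speech recognition dataset for the news domain, and validates it with RNN, Transformer, and Conformer models from the ESPNET-0.9.6 framework. On the best model, the corpus yields a CER of 4.8% and an SER of 39.4%. Because Xinwen Lianbo (新闻联播) anchors speak relatively fast, the transcripts in this dataset average 28 characters, twice the average transcript length in Aishell1. In addition, training objectives in previous work are usually defined at the character or word level and lack an explicit sentence-level relation. This paper therefore proposes a sentence-level consistency module that is combined with the Conformer model to directly reduce the representation gap between source speech and target text. On the open-source Aishell1 dataset, it lowers CER by 0.4% and SER by 2%; on CHNEWSASR, it lowers CER by 0.9% and SER by 3%. The results show that the method effectively improves speech recognition quality without increasing the number of model parameters.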
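The abstract reports results as CER (character error rate) and SER (sentence error rate). As a minimal sketch of how these two metrics are commonly computed, and not the scoring script used in the paper, the Python below derives CER from character-level Levenshtein distance and SER from the fraction of sentences with at least one error. The function names `edit_distance` and `cer_and_ser` and the sample sentences are hypothetical, chosen only for illustration.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_and_ser(refs, hyps):
    """Return (CER, SER) for parallel lists of reference and hypothesis strings.

    Assumes refs is non-empty and contains at least one character in total.
    """
    total_edits = 0
    total_chars = 0
    wrong_sentences = 0
    for ref, hyp in zip(refs, hyps):
        dist = edit_distance(ref, hyp)
        total_edits += dist
        total_chars += len(ref)
        if dist > 0:
            wrong_sentences += 1
    return total_edits / total_chars, wrong_sentences / len(refs)


# Illustrative data only: one substituted character in the first sentence.
refs = ["今天天气很好", "新闻联播"]
hyps = ["今天天汽很好", "新闻联播"]
cer, ser = cer_and_ser(refs, hyps)
print(f"CER={cer:.1%} SER={ser:.1%}")
```

Under this definition a single wrong character marks the whole sentence as an error, so longer transcripts tend to push SER up faster than CER, which fits the gap between the two figures reported for the 28-character average transcripts.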
Anthology ID:
2021.ccl-1.76
Volume:
Proceedings of the 20th Chinese National Conference on Computational Linguistics
Month:
August
Year:
2021
Address:
Huhhot, China
Editors:
Sheng Li (李生), Maosong Sun (孙茂松), Yang Liu (刘洋), Hua Wu (吴华), Kang Liu (刘康), Wanxiang Che (车万翔), Shizhu He (何世柱), Gaoqi Rao (饶高琦)
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
851–861
Language:
Chinese
URL:
https://aclanthology.org/2021.ccl-1.76
Cite (ACL):
Jimin Zhang, Kerekadeer Zao, Yunfei Shen, Shanwumaier Ai, and Liejun Wang. 2021. 基于改进Conformer的新闻领域端到端语音识别(End-to-End Speech Recognition in News Field based on Conformer). In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 851–861, Huhhot, China. Chinese Information Processing Society of China.
Cite (Informal):
基于改进Conformer的新闻领域端到端语音识别(End-to-End Speech Recognition in News Field based on Conformer) (Zhang et al., CCL 2021)
PDF:
https://aclanthology.org/2021.ccl-1.76.pdf
Data
AISHELL-1