Ling Dong

2025

"Self-supervised learning (SSL) speech models have achieved remarkable performance across various tasks, with learned representations that are often general enough to transfer to multiple downstream tasks. However, these representations contain both speech content and paralinguistic information, which can be redundant for content-focused tasks, and decoupling this redundant information is challenging. To address this issue, we propose a Self-Supervised Contrastive Representation Learning method (SSCRL), which effectively disentangles paralinguistic information from speech content by aligning representations of speech with similar content in the feature space, using self-supervised contrastive learning with pitch-perturbed and speaker-perturbed features. Experimental results demonstrate that the proposed method, when fine-tuned on the LibriSpeech 100-hour dataset, achieves superior performance across all content-related tasks in the SUPERB benchmark, generally outperforming prior approaches."
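The contrastive alignment described above can be sketched as an InfoNCE-style objective in which each utterance's perturbed view is its positive and the rest of the batch serves as negatives. A minimal numpy sketch under stated assumptions: the noise scale, batch size, and temperature are illustrative, not the paper's settings.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: each anchor's positive is the
    perturbed view of the same utterance; other rows in the batch act
    as negatives. A minimal sketch, not the paper's exact objective."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))            # correct pairs lie on the diagonal

rng = np.random.default_rng(0)
reps = rng.normal(size=(8, 16))                      # SSL representations of 8 utterances
views = reps + 0.01 * rng.normal(size=(8, 16))       # pitch/speaker-perturbed views
aligned_loss = info_nce(reps, views)
random_loss = info_nce(reps, rng.normal(size=(8, 16)))
```

Minimizing this loss pulls the perturbed views of the same content together, which is what pushes paralinguistic variation out of the content representation.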
"This paper addresses the challenges of data scarcity and limited speaker resources in Lao-English code-switched speech synthesis. We propose a neural encoder-decoder method for mixed-lingual speech synthesis. The method first extracts phoneme-level speech representations and employs a dot-product attention mechanism to map Lao and English phonemes into a shared latent space, enhancing the model's capability to represent cross-lingual phonetic information. In addition, a language ID embedding module explicitly indicates the language of each input token, helping the model distinguish and adapt to language-specific pronunciation characteristics. Experiments are conducted on the open-source English dataset LibriTTS and a proprietary Lao speech corpus. Both subjective evaluations (MOS, AB preference tests) and an objective metric (RMSE) demonstrate that the proposed approach significantly outperforms the baseline VALL-E X model in naturalness and language-switching fluency. Ablation studies further confirm that both the shared phoneme latent space and the language ID module play critical roles in improving synthesis quality. This approach offers a novel solution for integrating low-resource languages into mixed-lingual speech synthesis."
"The goal of this work is zero-shot visual voice cloning (ZS-V2C), which aims to generate speech with unseen speaker identity and prosody derived from a video clip and an acoustic reference. ZS-V2C poses two main challenges: 1) unseen speaker modeling; and 2) unseen prosody modeling. Unlike previous works, we propose a novel ZS-V2C framework that incorporates a hierarchical face-styled diffusion model (HFSD-V2C). Specifically, we first leverage cross-modal biometrics to predict unseen speaker embeddings from facial features. We then jointly model unseen prosodic features at the text, speech, and video levels. Finally, a diffusion model is constructed on the embeddings of the unseen speaker and the prosodic features, enabling the generation of expressive and diverse speech. Extensive experiments on the LRS2 and GRID benchmark datasets demonstrate the superior performance of our proposed method."
"Whisper is a powerful multilingual speech recognition model that performs well on high-resource languages such as English, but its performance on some low-resource languages such as Burmese remains limited by insufficient pretraining data. To address this, we propose a method for optimizing Whisper on low-resource speech recognition based on self-supervised representation distillation. Through a cross-model representation distillation mechanism, knowledge is transferred from a self-supervised model's representations to the Whisper encoder, improving its representation modeling capability for languages such as Burmese. Experimental results show that the method effectively reduces the character error rate on Burmese, Khmer, Uzbek, and Punjabi ASR tasks, verifying its effectiveness."
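The cross-model distillation described above can be sketched as regressing the student encoder's frames onto frozen teacher features. This is a minimal sketch assuming a learned linear projection and an MSE objective; the abstract does not specify the exact distillation loss, so these choices are illustrative.

```python
import numpy as np

def distill_loss(teacher_feats, student_feats, proj):
    """Representation-distillation sketch: the student encoder output
    (e.g. Whisper's) is linearly projected and regressed onto frozen
    SSL-model features (the teacher) with an MSE loss. The projection
    `proj` and the MSE choice are illustrative assumptions."""
    pred = student_feats @ proj                  # (T, d_student) -> (T, d_teacher)
    return np.mean((pred - teacher_feats) ** 2)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(50, 768))             # frame features from the SSL teacher
student = rng.normal(size=(50, 512))             # frame features from the student encoder
proj = rng.normal(size=(512, 768)) * 0.01        # learned projection (toy init)
loss = distill_loss(teacher, student, proj)
```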

2024

“Dialect speech recognition has always been one of the challenges for Automatic Speech Recognition (ASR) systems. While many ASR systems perform well on Mandarin, their performance drops significantly when handling dialect speech, mainly due to the marked differences between dialects and Mandarin in pronunciation and the scarcity of dialect speech data. In this paper, we propose DialectMoE, a Chinese multi-dialect speech recognition model based on Mixture-of-Experts (MoE) under low-resource conditions. Specifically, DialectMoE assigns input sequences to a set of experts using a dynamic routing algorithm, with each expert potentially trained for a specific dialect. The outputs of these experts are then combined to derive the final output. Due to the similarities among dialects, distinct experts may also help recognize other dialects. Experimental results on the Datatang dialect public dataset show that, compared with the baseline model, DialectMoE reduces the Character Error Rate (CER) for the Sichuan, Yunnan, Hubei and Henan dialects by 23.6%, 32.6%, 39.2% and 35.09%, respectively. The proposed DialectMoE model demonstrates outstanding performance in multi-dialect speech recognition.”
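The dynamic routing step above, where a gate assigns each frame to a subset of experts and mixes their outputs, can be sketched as follows. The top-k value, linear experts, and all names are illustrative assumptions; the paper's routing algorithm may differ in detail.

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Mixture-of-Experts sketch: a softmax gate routes each frame to
    its top-k experts and mixes their outputs by renormalised gate
    weight. Top-k and the linear experts are illustrative choices."""
    scores = x @ gate_w                                  # (T, E) routing logits
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)            # per-frame expert distribution
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]              # top-k experts for this frame
        weights = probs[t, top] / probs[t, top].sum()    # renormalise over selected experts
        for w, e in zip(weights, top):
            out[t] += w * (x[t] @ expert_ws[e])          # weighted expert outputs
    return out, probs

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))                         # 5 speech frames, dim 8
gate = rng.normal(size=(8, 4))                           # router over 4 dialect experts
experts = [rng.normal(size=(8, 8)) for _ in range(4)]
fused, routing = moe_layer(frames, gate, experts)
```

Because the gate is soft, similar dialects can share experts, which is the mechanism the abstract credits for cross-dialect assistance.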
“Phoneme segmentation is an important task in speech processing and is crucial for applications such as keyword spotting and automatic speech recognition. Traditional methods often predict independently whether each audio frame is a phoneme boundary, ignoring the intrinsic relationships between phoneme boundaries, the whole audio sequence, and neighboring frames, which hurts segmentation accuracy and coherence. This paper proposes a phoneme segmentation method based on a pretrained model and sequence modeling: on top of acoustic features extracted by a HuBERT model, a BiLSTM captures long-term dependencies and a CRF optimizes the output sequence, improving phoneme boundary detection. Experiments on the TIMIT and Buckeye datasets show that the proposed method outperforms existing approaches, demonstrating the effectiveness of sequence modeling for phoneme segmentation.”

2023

“Lao speech synthesis is of great significance for cooperation and exchange between China and Laos, but Lao pronunciation is complex, with tonal, syllabic, and phonemic characteristics, and existing speech synthesis methods perform unsatisfactorily on Lao. Autoregressive models built on attention struggle to fit complex Lao speech, generalize poorly, and are prone to catastrophic errors such as dropped or skipped words, so the synthesized audio lacks naturalness and fluency. This paper proposes a non-autoregressive Lao speech synthesis method enhanced with discretized self-supervised representations. Drawing on the linguistic and phonetic characteristics of Lao, we build a non-autoregressive acoustic model using phoneme-level annotated duration information, extract discretized representations of speech content and tone information with a self-supervised pretrained speech model, and fuse them into the acoustic model to strengthen its speech generation capability and improve the fluency and naturalness of the synthesized audio. Experiments show that the synthesized audio achieves a MOS of 4.03, and that the non-autoregressive modeling enhanced with discretized self-supervised representations better captures fine-grained characteristics of Lao speech such as tone, phoneme duration, and pitch.”
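The use of phoneme-level annotated durations in a non-autoregressive acoustic model typically amounts to a length-regulator step: each phoneme embedding is repeated for its annotated number of frames so the decoder can generate the whole utterance in parallel. A minimal sketch, with illustrative names rather than the paper's exact modules:

```python
import numpy as np

def length_regulate(phoneme_feats, durations):
    """Length-regulator sketch for a non-autoregressive acoustic model:
    each phoneme embedding is repeated for its annotated frame count,
    turning a (P, d) phoneme sequence into a (sum(durations), d)
    frame-level sequence that a parallel decoder can consume."""
    return np.repeat(phoneme_feats, durations, axis=0)

phonemes = np.arange(12, dtype=float).reshape(3, 4)   # 3 phonemes, dim-4 embeddings
durations = np.array([2, 5, 3])                       # annotated frames per phoneme
frames = length_regulate(phonemes, durations)
```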

2022

“The encoder of a speech translation model must encode both the acoustic and the semantic information in speech, and a single Fbank or Wav2vec2 feature has limited representational capacity. By analyzing the differences between hand-crafted Fbank features and self-supervised Wav2vec2 features, this paper proposes an acoustic feature fusion method based on cross-attention, and explores different self-supervised features and fusion strategies to strengthen the model's learning of acoustic and semantic information. Considering the phonetic characteristics of Vietnamese, Fbank features are encoded as the primary representation with Pitch features as an auxiliary one, building a multi-feature Vietnamese-English speech translation model. Experiments show that the multi-feature speech translation model outperforms single-feature translation and is more effective than simple feature concatenation; the proposed multi-feature fusion method improves Vietnamese-English speech translation by 1.97 BLEU.”
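The cross-attention fusion described above can be sketched with the Fbank stream forming the queries and the Wav2vec2 stream forming keys and values, the attended context being added back as a residual. All projection matrices and dimensions here are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(fbank, w2v, wq, wk, wv):
    """Cross-attention fusion sketch: Fbank frames attend over
    Wav2vec2 frames; the attended context is added back to the
    Fbank stream as a residual connection."""
    q, k, v = fbank @ wq, w2v @ wk, w2v @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)   # (T_fbank, T_w2v)
    return fbank + attn @ v                                  # residual fusion

rng = np.random.default_rng(0)
fbank = rng.normal(size=(20, 80))        # 20 Fbank frames, 80 mel bins
w2v = rng.normal(size=(10, 80))          # 10 Wav2vec2 frames (projected to dim 80)
wq, wk, wv = (rng.normal(size=(80, 80)) * 0.05 for _ in range(3))
fused = cross_attention_fuse(fbank, w2v, wq, wk, wv)
```

Unlike plain concatenation, the attention weights let each Fbank frame pick out the Wav2vec2 frames that are relevant to it, which is one plausible reading of why fusion beats concatenation in the experiments.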
“Vietnamese is a low-resource language whose training data is hard to obtain, and streaming end-to-end models struggle to learn the linguistic knowledge contained in large external text corpora during training; both issues limit the performance of streaming Vietnamese speech recognition. This paper therefore takes Vietnamese syllables as the modeling unit for both a language model and a streaming Vietnamese speech recognition model, and proposes a method that fuses a pretrained Vietnamese language model into the streaming recognition model during training. In the training stage, a new loss L_AED-LM is computed from the outputs of the pretrained Vietnamese language model and the decoder and minimized, helping the streaming Vietnamese recognition model learn Vietnamese linguistic knowledge and optimize its parameters; in the decoding stage, Shallow Fusion or WFST techniques fuse the pretrained language model again to further improve recognition accuracy. Experimental results show that on the VIVOS dataset, compared with the baseline model, fusing the language model during training improves the word error rate of the streaming Vietnamese recognition model by 2.45%; fusing the language model again at decoding time with Shallow Fusion or the WFST further improves the word error rate by 1.35% and 4.75%, respectively.”
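Decoding-stage shallow fusion is the standard log-linear combination of acoustic-model and language-model scores at each step. A minimal sketch with toy numbers; the fusion weight 0.3 and the token names are illustrative, not the paper's tuned values.

```python
import math

def shallow_fusion_pick(asr_logprobs, lm_logprobs, lm_weight=0.3):
    """Shallow-fusion sketch: at each decoding step the candidate score
    is log P_asr(token) + lambda * log P_lm(token), and the best-scoring
    token is kept. The weight is an illustrative assumption."""
    fused = {tok: asr_logprobs[tok] + lm_weight * lm_logprobs[tok]
             for tok in asr_logprobs}
    return max(fused, key=fused.get)

# Toy numbers: the acoustic model slightly prefers "ba", but the LM
# strongly prefers "bà", so fusion flips the choice.
asr = {"ba": math.log(0.45), "bà": math.log(0.40)}
lm = {"ba": math.log(0.05), "bà": math.log(0.90)}
no_lm = shallow_fusion_pick(asr, lm, lm_weight=0.0)
with_lm = shallow_fusion_pick(asr, lm, lm_weight=0.3)
```

This is exactly where external text knowledge enters at decoding time: the LM term rescues syllables that are acoustically ambiguous but linguistically likely.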