Minghui Fang


2025

VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation
Wenrui Liu | Jionghao Bai | Xize Cheng | Jialong Zuo | Ziyue Jiang | Shengpeng Ji | Minghui Fang | Xiaoda Yang | Qian Yang | Zhou Zhao
Proceedings of the 31st International Conference on Computational Linguistics

In recent years, the field of speech generation has advanced significantly, primarily due to improvements in large TTS (text-to-speech) systems and scalable TTS datasets. However, large-scale multilingual TTS datasets are still lacking, which limits the development of cross-language and multilingual TTS systems. Hence, we refine the Voxpopuli dataset and propose the VoxpopuliTTS dataset. This dataset comprises 30,000 hours of high-quality speech data across 3 languages, with multiple speakers and styles, and is suitable for various speech tasks such as TTS and ASR. To enhance the quality of speech data from Voxpopuli, we improve the existing processing pipeline by: 1) filtering out low-quality speech-text pairs based on ASR confidence scores, and 2) concatenating short transcripts into longer ones after checking that the result is semantically complete. Experimental results demonstrate the effectiveness of the VoxpopuliTTS dataset and the proposed processing pipeline.

2024

AudioVSR: Enhancing Video Speech Recognition with Audio Data
Xiaoda Yang | Xize Cheng | Jiaqi Duan | Hongshun Qiu | Minjie Hong | Minghui Fang | Shengpeng Ji | Jialong Zuo | Zhiqing Hong | Zhimeng Zhang | Tao Jin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Visual Speech Recognition (VSR) aims to predict spoken content by analyzing lip movements in videos. Recently reported state-of-the-art results in VSR often rely on increasingly large amounts of video data, while publicly available transcribed video datasets remain scarce compared with audio data. To further enhance the VSR model using audio data, we employ a generative model for data inflation, integrating the synthetic data with the authentic visual data. Essentially, the generative model incorporates another insight, which enhances the capabilities of the recognition model. On the cross-language issue, previous work has shown poor performance on non-Indo-European languages. We train a multi-language-family modal fusion model, AudioVSR. Leveraging the concept of modal transfer, we achieve significant results in downstream VSR tasks under conditions of data scarcity. To the best of our knowledge, AudioVSR is the first work on cross-language-family audio-lip alignment, achieving a new state of the art in the cross-language scenario.