Fan Bu


2025

Soundwave: Less is More for Speech-Text Alignment in LLMs
Yuhao Zhang | Zhiheng Liu | Fan Bu | Ruiyu Zhang | Benyou Wang | Haizhou Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms other advanced speech LLMs in speech translation and AIR-Bench speech tasks with only a fraction of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation.

2024

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models
Xiang Li | Fan Bu | Ambuj Mehrish | Yingting Li | Jiale Han | Bo Cheng | Soujanya Poria
Findings of the Association for Computational Linguistics: NAACL 2024

Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up inference by approximating denoising distributions, but this introduces issues with model convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We further design weighted samplers to incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS’s superiority over existing single-step speech synthesis systems, representing a significant advancement in the field.

2012

String Re-writing Kernel
Fan Bu | Hang Li | Xiaoyan Zhu
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2010

Measuring the Non-compositionality of Multiword Expressions
Fan Bu | Xiaoyan Zhu | Ming Li
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Function-Based Question Classification for General QA
Fan Bu | Xingwei Zhu | Yu Hao | Xiaoyan Zhu
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing