Meizhu Liu


2026

Automatic Speech Recognition (ASR) is increasingly integral to healthcare services, where medical conversations present unique transcription challenges due to specialized terminology and frequent introduction of new terms. Existing ASR models, including widely used systems like Whisper, struggle with high word error rates (WER) on clinical vocabulary, especially medication names, primarily due to the scarcity of annotated audio-transcript data in the medical domain. This paper proposes and evaluates a novel synthetic data generation pipeline that produces comprehensive doctor-patient dialogues in both text and audio forms, specifically targeting a curated set of over 124,000 medical terms. The pipeline generated over 1 billion audios with ground truth transcriptions. Fine-tuning ASR models with this synthetic corpus significantly reduced overall WER and improved transcription accuracy on medical terms, marking a significant advance in healthcare ASR accuracy. Data generation code, dataset, and training and evaluation scripts are released.
Automatic Speech Recognition (ASR) plays an important role in healthcare but faces unique challenges. Medical audio often contains specialized terminology, such as medication names, which existing ASR systems struggle to transcribe accurately. High error rates arise from pronunciation variability, the continual introduction of new terms, and the scarcity of high-quality labeled data—whose collection is costly and requires medical expertise. Although synthetic datasets partially alleviate this problem, they fail to capture the noise and variability of real-world recordings. Moreover, ASR models trained in controlled environments are highly sensitive to noise, leading to degraded performance in clinical settings. To address these limitations, we propose an unsupervised continual learning ASR framework that adapts to new data while preserving prior knowledge. This enables efficient domain adaptation without extensive retraining. Experiments on real-world medical audio demonstrate significant improvements over state-of-the-art baselines.