Jamal Uddin


2026

Cross-lingual voice cloning (CLVC) aims to synthesize speech in a target language while preserving the vocal identity of a source speaker who has no recorded speech in that language. Despite recent advances in multilingual text-to-speech systems, zero-shot CLVC remains challenging because of phonetic divergence across languages and the difficulty of maintaining speaker identity alongside linguistic intelligibility. In this work, we present a systematic evaluation of four state-of-the-art CLVC systems spanning autoregressive and diffusion-based architectures. Using English source speakers from the ACL-60/60 dataset, we evaluate zero-shot voice transfer across multiple target languages, including Arabic, Chinese, French, German, Russian, and Japanese. Systems are assessed using speaker similarity and content consistency metrics under a unified multilingual evaluation pipeline. We analyze how the two modeling approaches, autoregressive language modeling and diffusion-based flow matching, handle the tradeoff between speech accuracy and speaker identity preservation. We further observe substantial performance variation across languages, with Arabic remaining particularly challenging under zero-shot transfer settings.