KIT’s Submission to Cross-Lingual Voice Cloning in IWSLT 2026

Seymanur Akti, Alexander Waibel


Abstract
Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.
Anthology ID:
2026.iwslt-1.8
Volume:
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
Month:
July
Year:
2026
Address:
San Diego, USA (in-person and online)
Editors:
Elizabeth Salesky, Antonios Anastasopoulos, Matteo Negri, Marcello Federico
Venues:
IWSLT | WS
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
Pages:
78–83
URL:
https://aclanthology.org/2026.iwslt-1.8/
Cite (ACL):
Seymanur Akti and Alexander Waibel. 2026. KIT’s Submission to Cross-Lingual Voice Cloning in IWSLT 2026. In Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026), pages 78–83, San Diego, USA (in-person and online). Association for Computational Linguistics.
Cite (Informal):
KIT’s Submission to Cross-Lingual Voice Cloning in IWSLT 2026 (Akti & Waibel, IWSLT 2026)
PDF:
https://aclanthology.org/2026.iwslt-1.8.pdf