S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation

Yu Pan; Xiongfei Wu; Yang Yuguang; Jixun Yao; Maxime Cordy; Lei Ma; Jianjun Zhao

Abstract

Despite recent advances in speech-to-speech translation (S2ST), it remains difficult to achieve both high translation accuracy and practical flexibility. In this paper, we present S2ST-Omni, a compositional S2ST framework that integrates a high-accuracy speech-to-text translation (S2TT) frontend with a modular, plug-and-play text-to-speech (TTS) backend, enabling independent optimization of translation and synthesis. On the S2TT side, we introduce a hybrid adapter that follows a "local-then-global" strategy to bridge the pretrained Whisper encoder and Qwen3 LLM, yielding a hierarchical acoustic-to-semantic abstraction. Building on this bridge, we further propose a hierarchical language-aware architecture that injects source-language information at two complementary levels. At the acoustic level, Language-Aware Dual-CTC operates on intermediate adapter features and employs FiLM-style feature modulation with a learnable gate, encouraging the model to learn language-specific but content-faithful acoustic representations. At the linguistic level, Language-Aware Prompting dynamically constructs source-language-conditioned prompts that activate language-specific translation knowledge in the LLM. To enable efficient optimization, we design a task-specific progressive fine-tuning strategy that first stabilizes speech-text alignment and then improves translation via LoRA on top of this converged foundation. The TTS backend remains fully modular and can be instantiated with any state-of-the-art synthesizer without retraining the S2TT frontend. Experiments on CVSS-C show that S2ST-Omni consistently achieves the best BLEU and ASR-BLEU across French, German, and Spanish to English directions, outperforming strong recent S2ST baselines.
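
To illustrate the acoustic-level mechanism mentioned in the abstract, here is a minimal sketch of FiLM-style feature modulation with a gate. It is not the authors' implementation. The residual gating form, the names `gated_film`, `modulate`, and `LANG_FILM`, and all numeric values are assumptions made for illustration. In a real model the scale, shift, and gate would be learned, and the features would come from the intermediate adapter layers.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, used to squash the gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_film(features, gamma, beta, gate_logit):
    """Apply gated FiLM to a single frame's feature vector.

    Assumed form (hypothetical, not taken from the paper):
        out = x + g * ((gamma * x + beta) - x),  g = sigmoid(gate_logit)
    With g near 0 the features pass through unchanged; with g near 1
    the output is the fully modulated vector gamma * x + beta.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have equal length")
    g = sigmoid(gate_logit)
    return [x + g * (ga * x + be - x) for x, ga, be in zip(features, gamma, beta)]


# Hypothetical per-language FiLM parameters (learned in a real system).
LANG_FILM = {
    "fr": {"gamma": [1.5, 0.5], "beta": [0.0, 1.0]},
    "de": {"gamma": [1.0, 2.0], "beta": [-1.0, 0.0]},
}


def modulate(frames, lang, gate_logit):
    """Modulate every frame with the parameters of the given source language."""
    params = LANG_FILM[lang]
    return [gated_film(f, params["gamma"], params["beta"], gate_logit) for f in frames]
```

For example, `modulate([[2.0, 4.0]], "fr", 0.0)` mixes the original and modulated features equally, because a gate logit of 0 corresponds to a gate value of 0.5. A strongly negative gate logit leaves the features essentially untouched, which is one way a gate could keep the representation content-faithful.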

Anthology ID: 2026.findings-acl.1004
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 20114–20124
URL: https://aclanthology.org/2026.findings-acl.1004/
Cite (ACL): Yu Pan, Xiongfei Wu, Yang Yuguang, Jixun Yao, Maxime Cordy, Lei Ma, and Jianjun Zhao. 2026. S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 20114–20124, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation (Pan et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1004.pdf
Checklist: 2026.findings-acl.1004.checklist.pdf
