Boqi Huang
2026
HW-TSC’s Submissions to the IWSLT 2026 Offline Speech Translation Task
Boqi Huang | Daimeng Wei | Jiaxin GUO | Yuanchang Luo | Hengchao Shang | Zongyao Li | Zhiqiang Rao | Jinlong Yang | Zhanglin Wu | Yu He | Xiaoqing Lan
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
This paper describes HW-TSC’s submission to the IWSLT 2026 Offline Speech Translation Task, specifically for the English-to-Chinese and English-to-German unconstrained tracks. Our system adopts a robust cascade architecture optimized for long-form, unsegmented audio. To mitigate the hallucination and inconsistency issues common in long-sequence processing, we propose a two-pass transcription strategy: an initial streaming ASR pass with a 12-second context buffer for sentence-level coherence, followed by Qwen3-ForcedAligner for precise timestamping. Based on these alignments, the audio is re-segmented into 30-second chunks and a second-pass refinement is conducted with Qwen3-Omni to ensure high-fidelity transcriptions. For the translation module, we employ a context-aware segment merging strategy (up to 150 tokens) to give the Qwen3 LLM sufficient semantic context. Experimental results on the tst-2022 benchmark demonstrate the effectiveness of our pipeline, which achieves COMET scores of 0.8462 (En-Zh) and 0.7854 (En-De), significantly outperforming the standard cascade baselines.
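The context-aware segment merging step could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `merge_segments` is hypothetical, merging is assumed to be greedy over consecutive segments, and tokens are approximated by whitespace splitting because the abstract does not say which tokenizer defines the 150-token budget.

```python
from typing import List


def merge_segments(segments: List[str], max_tokens: int = 150) -> List[str]:
    """Greedily merge consecutive transcript segments up to a token budget.

    Assumption: tokens are counted by whitespace splitting. A real system
    would likely count tokens with the translation model's own tokenizer.
    """
    merged: List[str] = []
    current: List[str] = []
    current_len = 0
    for seg in segments:
        n = len(seg.split())
        # Flush the running group if adding this segment would exceed the budget.
        if current and current_len + n > max_tokens:
            merged.append(" ".join(current))
            current, current_len = [], 0
        # A single segment longer than the budget is kept as its own group.
        current.append(seg)
        current_len += n
    if current:
        merged.append(" ".join(current))
    return merged
```

Each merged group would then be passed to the translation model as one input, so that short ASR segments are translated with their neighbors as context.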
HW-TSC’s Submission to the IWSLT 2026 Cross-Lingual Voice Cloning Track
Yu He | Daimeng Wei | Jiaxin GUO | Yuanchang Luo | Hengchao Shang | Zongyao Li | Zhiqiang Rao | Jinlong Yang | Zhanglin Wu | Boqi Huang | Xiaoqing Lan
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
This paper presents HW-TSC’s submission to the IWSLT 2026 Cross-Lingual Voice Cloning Track. The track includes three target languages: Arabic, Chinese, and French. We participate in two of them, Chinese and French. We employ the Qwen3-TTS-12Hz-1.7B-Base multilingual model as the core voice cloning model. To address the excessive duration and scattered features of the original reference audio, we design a sliding-window audio segmentation preprocessing method, which splits long audio into standardized short segments with overlapping redundancy. This method avoids the feature attenuation caused by overly long audio and preserves as much timbre information as possible through the overlap between consecutive windows. To select the outputs with the highest timbre similarity from the many synthesized candidates, we perform voiceprint recognition with ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in TDNN), using cosine similarity as the core quantitative metric, and select the candidate with the highest similarity as the final output.
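The two mechanisms described above, overlapping sliding-window segmentation and similarity-based selection, can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the window and hop lengths, and the use of plain Python lists in place of real speaker embeddings (which in the described system would come from an ECAPA-TDNN model).

```python
import math
from typing import List, Sequence, Tuple


def sliding_windows(total_sec: float, win_sec: float, hop_sec: float) -> List[Tuple[float, float]]:
    """Return (start, end) windows covering [0, total_sec].

    Choosing hop_sec < win_sec makes consecutive windows overlap.
    """
    if win_sec <= 0 or hop_sec <= 0:
        raise ValueError("win_sec and hop_sec must be positive")
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_sec:
        end = min(start + win_sec, total_sec)
        windows.append((start, end))
        if end >= total_sec:
            break  # the last window already reaches the end of the audio
        start += hop_sec
    return windows


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best(reference: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate embedding most similar to the reference."""
    if not candidates:
        raise ValueError("no candidates to select from")
    scores = [cosine_similarity(reference, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, `sliding_windows(25.0, 10.0, 5.0)` yields four windows of 10 seconds each, with 5 seconds shared between neighbors. The selection step only compares embeddings, so it is independent of which speaker-verification model produces them.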
HW-TSC’s Submission to the IWSLT 2026 Subtitling Track
Xiaoqing Lan | Daimeng Wei | Jiaxin GUO | Yuanchang Luo | Hengchao Shang | Zongyao Li | Zhiqiang Rao | Jinlong Yang | Zhanglin Wu | Boqi Huang | Yu He
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
This paper introduces HW-TSC’s submission to the IWSLT 2026 Subtitling Track. For automatic subtitle generation, we employ a cascaded strategy under unconstrained conditions. First, we construct a large-model-based streaming speech recognition framework, which incorporates voice activity detection (VAD), sliding-window context caching, long-audio chunking, and the Qwen3 forced alignment model to produce high-precision transcripts and timestamps from English speech. Next, we translate the text using a Qwen3-based translation model. Finally, we check the translated segments against subtitle constraints such as characters per second (CPS) and characters per line (CPL), identify those that exceed the thresholds, and rewrite them with a large language model while preserving their core meaning, ultimately producing subtitle files that meet the required standards.
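The compliance check in the final step might look roughly like the sketch below. The `Subtitle` class, the function names, the default thresholds (21 CPS, 42 CPL), and the choice to count every character including spaces are all assumptions for illustration; the abstract does not state the limits or the counting convention used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtitle:
    start: float       # start time in seconds
    end: float         # end time in seconds
    lines: List[str]   # displayed lines of the subtitle block


def chars_per_second(sub: Subtitle) -> float:
    """Characters displayed per second; infinite for non-positive durations."""
    duration = sub.end - sub.start
    chars = sum(len(line) for line in sub.lines)
    return chars / duration if duration > 0 else float("inf")


def violations(sub: Subtitle, max_cps: float = 21.0, max_cpl: int = 42) -> List[str]:
    """Return the names of the constraints this subtitle exceeds.

    The default thresholds are assumed values, not taken from the paper.
    """
    found: List[str] = []
    if chars_per_second(sub) > max_cps:
        found.append("cps")
    if any(len(line) > max_cpl for line in sub.lines):
        found.append("cpl")
    return found
```

Subtitles for which `violations` returns a non-empty list would be the ones handed to the language model for rewriting, and the same check can be run again on the rewritten text.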