Zhihang Xie
2026
FBK’s Long-form SpeechLLMs for IWSLT 2026 Instruction Following
Zhihang Xie | Marco Gaido | Sara Papi | Matteo Negri | Luisa Bentivogli
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
This paper describes our submission to the IWSLT 2026 Instruction Following shared task. SpeechLLM systems are developed for both short-form and long-form speech instruction following under constrained settings. For the short track, strong performance is achieved on MCIF, with a SIFS score of 2.0708. For the long track, three speech segmentation strategies are investigated, and the HIFS score is introduced to account for unstable long-form generation. Experimental results show that fixed 30-second segmentation provides the most robust long-form performance, achieving the highest HIFS score of 2.0663. Further analysis shows that hallucination mainly manifests as repetitive insertions, substantially affecting ASR and SSUM, while short-form capabilities are largely retained after long-form extension.
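As an illustration of the fixed 30-second segmentation mentioned in the abstract, the following is a minimal sketch that computes window boundaries for a recording of known duration. It is not the authors' implementation: the function name `fixed_segments`, the `window` parameter, and the choice to keep a shorter final segment are all assumptions made for this example.

```python
def fixed_segments(total_seconds, window=30.0):
    """Return (start, end) boundaries, in seconds, of consecutive fixed-length windows.

    Hypothetical helper: the last segment is kept even if shorter than `window`.
    """
    if total_seconds < 0 or window <= 0:
        raise ValueError("total_seconds must be >= 0 and window must be > 0")
    bounds = []
    start = 0.0
    while start < total_seconds:
        # Clip the final window to the end of the recording.
        end = min(start + window, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # A 75-second recording yields two full windows and one 15-second remainder.
    print(fixed_segments(75.0))
```

Each boundary pair could then be used to slice the audio before passing it to the model; how the per-segment outputs are merged is left open here, since the abstract does not specify it.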
2023
The BIGAI Offline Speech Translation Systems for IWSLT 2023 Evaluation
Zhihang Xie
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper describes BIGAI’s submission to the IWSLT 2023 Offline Speech Translation task on three language tracks, from English to Chinese, German and Japanese. The end-to-end systems are built upon a Wav2Vec2 model for speech recognition and mBART50 models for machine translation. An adapter module is applied to bridge the speech module and the translation module. The CTC loss between speech features and the source token sequence is incorporated during training. Experiments show that the systems can generate reasonable translations for all three languages. The proposed models achieve BLEU scores of 22.3 for en→de, 10.7 for en→ja and 33.0 for en→zh on the tst2023 TED datasets. However, performance drops by a significant margin in complex scenarios such as presentations and interviews.