Yi Xing

2026

Test-Time Adaptation of an Offline Multimodal Foundation Model for Simultaneous Speech Translation
Yi Xing | Manli Yu | Pengfei Liu | Helen Meng
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)

End-to-end simultaneous speech-to-text translation (SimulST) systems typically rely on complex architectures and sophisticated training strategies. In contrast, we propose a simple approach that combines conventional pause-based segmentation for streaming audio input with a strong off-the-shelf multimodal foundation model adapted at test time for translation. To achieve simultaneity, we adopt a variant of the classic wait-k read-write policy to control the interaction between audio input and translation output, and use a multi-turn conversation format with response prefilling and key-value caching for coherent translation and computational efficiency. Experiments on the official development sets of the IWSLT 2026 SimulST shared task show that our system achieves a better quality–latency trade-off than the cascaded baseline across all language directions and latency regimes, highlighting the effectiveness of this simple yet powerful approach.
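For readers unfamiliar with the read-write policy named in the abstract, the following is a minimal sketch of the classic wait-k schedule only, not the authors' variant, which the abstract does not specify. It assumes the source arrives as discrete units (for example, pause-delimited segments) and that a hypothetical callback `translate_next` returns the next target unit given everything read and written so far. The names `wait_k_schedule`, `translate_next`, `end_token`, and `max_tail_writes` are illustrative assumptions.

```python
from typing import Callable, Iterable, Iterator, List


def wait_k_schedule(
    segments: Iterable[str],
    translate_next: Callable[[List[str], List[str]], str],
    k: int = 2,
    end_token: str = "<eos>",
    max_tail_writes: int = 50,
) -> Iterator[str]:
    """Classic wait-k: read k source units, then alternate one WRITE per READ.

    `translate_next(source, target)` is a hypothetical stand-in for the
    translation model; it returns one target unit or `end_token`.
    """
    source: List[str] = []  # source units read so far
    target: List[str] = []  # target units written so far

    for segment in segments:
        source.append(segment)  # READ action
        if len(source) < k:
            continue  # still inside the initial wait of k units
        unit = translate_next(source, target)  # WRITE action
        if unit != end_token:
            target.append(unit)
            yield unit

    # Source exhausted: keep writing until the model signals the end.
    # `max_tail_writes` is a safety cap against a model that never stops.
    for _ in range(max_tail_writes):
        unit = translate_next(source, target)
        if unit == end_token:
            break
        target.append(unit)
        yield unit
```

With k = 2, the first target unit is produced only after two source units have been read, and each later source unit triggers one further write. A larger k gives the model more context per write at the cost of higher latency, which is the quality–latency trade-off the abstract refers to.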

