Fabrício Carraro


2026

This paper describes the QUESPA team’s speech translation (ST) submissions for the Quechua to Spanish (QUE-SPA) track of the IWSLT 2026 Evaluation Campaign on dialectal and low-resource speech translation. The campaign supports a single submission category, namely unconstrained. This is our fourth consecutive participation in the IWSLT shared task, and our systems build on those of previous years. Our 2026 submission comprises three unconstrained systems. The best-performing system (contrastive 2) extends our strongest model from the previous year: it uses a pre-trained language model (PLM) for end-to-end speech translation without cascading, and is augmented with additional Quechua-Collao text, now available on the IWSLT GitHub. Fine-tuning Microsoft’s SpeechT5 model in an ST setting, combined with targeted data augmentation, yields a BLEU score of 27.2 on the official evaluation set. For the first time, we also evaluate prompt-based machine translation using Gemini, DeepSeek, GPT-5, Claude, and Qwen, and we introduce SIDON, an audio enhancement framework, to improve audio quality. We provide a comparative analysis across our current and three previous IWSLT submissions, examining in detail the impact of synthetic data, unconstrained external resources, and audio enhancement techniques on fine-tuning performance. Our results highlight the complementary roles of PLM-based ST, LLM prompting, and ASR enhancement in advancing low-resource speech translation.
This paper reports on the outcomes of the shared tasks organized as part of the 23rd International Workshop on Spoken Language Translation (IWSLT). The workshop covered ten major challenges in spoken language translation, including speech-to-text translation for both high-resource and low-resource language pairs, customized speech translation, speech generation, instruction-following speech processing, and the evaluation of speech translation systems. The shared tasks received strong participation, with more than 30 teams submitting runs. This year’s edition broadened the range of tasks, placing particular emphasis on speech generation and evaluation metrics.
We present the CATENG systems submitted to the IWSLT 2026 Dialectal and Low-Resource Speech Translation shared task for the Catalan–English (CA–EN) pair. Although Catalan is not strictly low-resource, its dialectal diversity and relative under-representation in speech technology make it a challenging setting. We evaluate three unconstrained systems: two cascaded approaches combining ASR and MT, and one end-to-end model. Our primary system combines a Mamba-based ASR model (ConMamba) with a fine-tuned NLLB-200 MT model, while a contrastive system replaces the ASR component with Whisper-v3; we also evaluate an end-to-end SpeechT5 model with data augmentation. Experiments are conducted on the IWSLT 2026 Catalan dataset (15 hours), complemented with large-scale parallel text. Results show that the cascaded systems outperform end-to-end ST, with Whisper-v3 + NLLB achieving 44.7 BLEU and 65.1 chrF. We find that performance is constrained primarily by ASR quality rather than MT capacity, and that Mamba-based ASR models provide competitive results, highlighting the importance of robust speech representations and dialectal coverage for Catalan–English speech translation.