Marc Casals

Also published as: Marc Casals-Salvador

2026

BSC’s Submission to the Instruction Following Track of IWSLT 2026
Oriol Pareras | Joan Llado | Pol Buitrago | Marc Casals-Salvador | Federico Costa | Cristina Espana-Bonet
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)

We present the Barcelona Supercomputing Center (BSC) submission to the Instruction Following (IF) track of IWSLT 2026, which evaluates unified spoken language systems capable of solving multiple tasks through natural language instructions. Our system consists of an end-to-end (E2E) architecture that combines a speech encoder with a translation-oriented Large Language Model. The model is trained on speech and text data, covering automatic speech recognition, translation, question answering, and instruction following. We investigate a Chain-of-Thought (CoT) generation strategy that explicitly decomposes tasks by producing an intermediate transcription before the final output, which enables effective reuse of text-only supervision and improves robustness across tasks. To further support generalization, we design diverse prompt formulations and align text-only and speech inputs under a shared inference pattern. Results on IWSLT 2025 evaluation data show that our approach achieves competitive and even state-of-the-art performance across tasks.

CATENG Submission for the IWSLT 2026: Dialectal and Low-resource Speech Translation Task
Rodolfo Joel Zevallos | Marc Casals | John E. Ortega | Fabrício Carraro | Pol Buitrago | Guillermo Cámbara
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)

We present the CATENG systems submitted to the IWSLT 2026 Dialectal and Low-Resource Speech Translation shared task for the Catalan–English (CA–EN) pair. Although Catalan is not strictly low-resource, its dialectal diversity and relative under-representation in speech technology make it a challenging setting. We evaluate three unconstrained systems: two cascaded approaches combining ASR and MT, and one end-to-end model. Our primary system uses a Mamba-based ASR (ConMamba) with a fine-tuned NLLB-200 MT model, while a contrastive system replaces the ASR with Whisper-v3; we also evaluate an end-to-end SpeechT5 model with data augmentation. Experiments are conducted on the IWSLT 2026 Catalan dataset (15 hours), complemented with large-scale parallel text. Results show that cascaded systems outperform end-to-end ST, with Whisper-v3 + NLLB achieving 44.7 BLEU and 65.1 chrF. We find that performance is primarily constrained by ASR quality rather than MT capacity, and that Mamba-based ASR models provide competitive results, highlighting the importance of robust speech representations and dialectal coverage for Catalan–English speech translation.

Co-authors

Joan Llado 1

John E. Ortega 1

Oriol Pareras Velasco 1

Rodolfo Zevallos 1

Venues

IWSLT 2
WS 2
