Seymanur Akti
2026
BOOM: Beyond Only One Modality KIT’s Multimodal Multilingual Lecture Companion
Sai Koneru | Fabian Retkowski | Christian Huber | Lukas Hilgert | Seymanur Akti | Enes Yavuz Ugan | Alexander Waibel | Jan Niehues
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it into the Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.
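The abstract describes a three-modality pipeline (text, slides, speech) driven by a single audio/slide pair. The following is a minimal, hypothetical sketch of that flow; every function here (`transcribe_with_slide_context`, `translate_text`, `localize_slide`, `synthesize_speech`) is an illustrative stand-in for the corresponding component, not the released implementation.

```python
"""Hypothetical sketch of a BOOM-style pipeline: one lecture segment
(audio + slide) is jointly turned into three synchronized outputs."""

from dataclasses import dataclass


@dataclass
class LectureSegment:
    audio_path: str      # segment of lecture audio
    slide_image: str     # slide shown during this segment


@dataclass
class TranslatedSegment:
    transcript: str      # translated text for reading
    slide_image: str     # localized slide with visuals preserved
    speech_path: str     # synthesized speech in the target language


def transcribe_with_slide_context(audio_path: str, slide_image: str) -> str:
    """Stand-in for slide-aware ASR: the slide supplies domain terms
    that help disambiguate the spoken audio."""
    return f"<transcript of {audio_path} conditioned on {slide_image}>"


def translate_text(text: str, target_lang: str) -> str:
    """Stand-in for the machine translation step."""
    return f"<{target_lang} translation of: {text}>"


def localize_slide(slide_image: str, target_lang: str) -> str:
    """Stand-in for slide translation: detect text regions, translate
    them, and re-render over the original layout so figures stay intact."""
    return f"<{target_lang} localized copy of {slide_image}>"


def synthesize_speech(text: str, target_lang: str) -> str:
    """Stand-in for TTS on the translated transcript."""
    return f"<{target_lang} audio for: {text}>"


def translate_segment(seg: LectureSegment, target_lang: str) -> TranslatedSegment:
    """Process one audio/slide pair into all three output modalities."""
    transcript = transcribe_with_slide_context(seg.audio_path, seg.slide_image)
    translated = translate_text(transcript, target_lang)
    return TranslatedSegment(
        transcript=translated,
        slide_image=localize_slide(seg.slide_image, target_lang),
        speech_path=synthesize_speech(translated, target_lang),
    )


if __name__ == "__main__":
    seg = LectureSegment(audio_path="lecture_01.wav", slide_image="slide_01.png")
    print(translate_segment(seg, target_lang="de"))
```

Note the design point the abstract emphasizes: the slide feeds into transcription before translation, which is why slide-aware transcripts can improve downstream summarization and question answering.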
2025
KIT’s Offline Speech Translation and Instruction Following Submission for IWSLT 2025
Sai Koneru | Maike Züfle | Thai Binh Nguyen | Seymanur Akti | Jan Niehues | Alexander Waibel
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
In this paper, we present our submissions to the Offline Speech Translation (ST) and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused by an LLM with document-level context. This is followed by a two-step translation process that incorporates an additional contextual refinement step to improve translation quality. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks. We complement it with a final document-level refinement stage that further enhances output quality by using contextual information.
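The cascade described here has three LLM-driven stages: hypothesis fusion, draft translation, and contextual refinement. Below is a hypothetical sketch of that structure, assuming a generic `llm_generate` completion call as a stand-in for whatever instruction-tuned model is used; the prompts are illustrative, not the submission's actual prompts.

```python
"""Hypothetical sketch of the cascaded Offline ST pipeline: fuse several
ASR hypotheses with document-level context, translate, then refine."""

def llm_generate(prompt: str) -> str:
    """Stand-in for an LLM completion call (local model or API)."""
    return f"<LLM output for prompt of {len(prompt)} chars>"


def fuse_hypotheses(hypotheses: list[str], document_context: str) -> str:
    """Merge transcripts from multiple ASR systems into one, letting the
    LLM use surrounding document context to pick the most plausible words."""
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "Given the document context and the candidate transcripts below, "
        "produce a single corrected transcript.\n\n"
        f"Context:\n{document_context}\n\nCandidates:\n{numbered}"
    )
    return llm_generate(prompt)


def translate_two_step(transcript: str, document_context: str,
                       target_lang: str) -> str:
    """Step 1: draft translation. Step 2: context-aware refinement."""
    draft = llm_generate(f"Translate into {target_lang}:\n{transcript}")
    refined = llm_generate(
        f"Refine this {target_lang} translation so it is consistent with "
        f"the document context.\n\nContext:\n{document_context}\n\n"
        f"Source:\n{transcript}\n\nDraft:\n{draft}"
    )
    return refined


if __name__ == "__main__":
    context = "Earlier lecture segments about neural machine translation."
    hyps = ["we use a transformer",
            "we used a transformer",
            "we use the transformer"]
    fused = fuse_hypotheses(hyps, context)
    print(translate_two_step(fused, context, target_lang="de"))
```

The same document-level context is threaded through both fusion and refinement, which is the key idea: each stage can correct errors the previous one could not see in isolation.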