NJUST-KMG at MedGenVidQA 2026: Cascade Multi-modal Alignment with Gaussian Soft Priors for Medical Visual Answer Localization

Jinglong Li; Yang Yang


Abstract

This paper describes the system developed for the Medical Visual Answer Localization (MVAL) task at MedGenVidQA 2026. Accurately locating surgical or instructional steps in medical videos is inherently challenging due to audio-visual asynchrony and the visual homogeneity of surgical scenes. We propose a Cascade Multi-modal Alignment Framework that integrates Large Language Models (LLMs) to bridge the semantic-temporal gap. Our pipeline uses WhisperX for word-level speech transcription to ensure precise textual anchoring. We then employ Gemini3 as a high-level semantic ranker to generate multi-scale textual priors. Crucially, we transform these discrete semantic scores into a continuous 1D Gaussian Soft Prior, which is injected as an attention bias into our cross-modal fusion network. This mechanism preserves global temporal context while guiding the model to focus on query-relevant frames. On the validation leaderboard, our system performs competitively, particularly under strict evaluation metrics, reaching 67.5% at IoU@0.7.
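The step from discrete semantic scores to a continuous attention bias can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `gaussian_soft_prior` and `biased_attention`, the rule tying each Gaussian's width to segment duration, the max-merge of overlapping bumps, and the choice to add the log of the prior to the attention logits are all assumptions made for illustration.

```python
import math


def gaussian_soft_prior(segments, num_frames, fps=1.0, eps=1e-6):
    """Turn scored (start_s, end_s, score) segments into a per-frame prior.

    Assumption: each segment becomes a Gaussian bump centred on its midpoint,
    with a standard deviation of half its duration (at least one frame).
    """
    prior = [0.0] * num_frames
    for start_s, end_s, score in segments:
        center = 0.5 * (start_s + end_s) * fps
        sigma = max((end_s - start_s) * fps / 2.0, 1.0)
        for t in range(num_frames):
            bump = score * math.exp(-0.5 * ((t - center) / sigma) ** 2)
            prior[t] = max(prior[t], bump)  # keep the strongest evidence per frame
    peak = max(prior) or 1.0  # avoid dividing by zero when nothing is scored
    # Normalise to a peak of 1 and floor at eps so log(prior) stays finite.
    return [max(p / peak, eps) for p in prior]


def biased_attention(logits, prior):
    """Add log(prior) to per-frame attention logits, then apply a softmax."""
    biased = [l + math.log(p) for l, p in zip(logits, prior)]
    m = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical ranker output: two transcript segments with relevance scores.
segments = [(10.0, 20.0, 0.9), (40.0, 44.0, 0.3)]
prior = gaussian_soft_prior(segments, num_frames=60)
attention = biased_attention([0.0] * 60, prior)
```

Because the prior is floored at a small positive value instead of being zeroed out, every frame keeps a nonzero attention weight. In this sketch that is how a soft prior keeps global temporal context available while still favouring frames near highly ranked segments; a hard mask would discard that context.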

Anthology ID: 2026.bionlp-2.30
Volume: Proceedings of the BioNLP 2026 (Shared Tasks)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Deepak Gupta, Dina Demner-Fushman
Venues: BioNLP | WS
Publisher: Association for Computational Linguistics
Pages: 229–232
URL: https://aclanthology.org/2026.bionlp-2.30/
Cite (ACL): Jinglong Li and Yang Yang. 2026. NJUST-KMG at MedGenVidQA 2026: Cascade Multi-modal Alignment with Gaussian Soft Priors for Medical Visual Answer Localization. In Proceedings of the BioNLP 2026 (Shared Tasks), pages 229–232, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): NJUST-KMG at MedGenVidQA 2026: Cascade Multi-modal Alignment with Gaussian Soft Priors for Medical Visual Answer Localization (Li & Yang, BioNLP 2026)
PDF: https://aclanthology.org/2026.bionlp-2.30.pdf
Supplementary material: 2026.bionlp-2.30.SupplementaryMaterial.txt
