LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting

Watcharitpol Sermsrisuwan; Nopporn Lekuthai; Seksan Yoadsanit; Titipat Achakulvisut

Abstract

This paper presents an approach to localizing visual answers within continuous medical videos using a multi-step multimodal generation pipeline, evaluated on the MedGenVidQA dataset. We frame visual answer localization as a multimodal fusion problem: raw video, timestamped ASR transcripts, and VLM-generated scene descriptions are integrated into structured contextual blocks, enabling the model to cross-reference spoken commentary against observable physical events. We show that targeted guidance, which directs the model to treat audio transcripts as supplementary hints alongside observable visual movements, significantly outperforms baseline approaches. Our approach achieves state-of-the-art performance on the test leaderboard, yielding an mIoU of 79.55, alongside IoU@0.3, IoU@0.5, and IoU@0.7 scores of 93.75, 90.00, and 77.50, respectively. Our findings highlight the effectiveness of combining multimodal context fusion with targeted guidance to overcome text bias, establishing a promising approach for achieving the micro-level precision required in the medical domain. We release our code on GitHub at https://github.com/biodatlab/medgenvidqa-lamar.
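For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how temporal IoU, mIoU, and IoU@k are commonly computed for answer spans in a video. It is not taken from the authors' repository or from the official scorer: the function names `temporal_iou` and `summarize`, the representation of a span as `(start_sec, end_sec)`, and the use of `>=` at each threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: an answer span as (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def summarize(
    preds: List[Span],
    golds: List[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Return mIoU and IoU@k as percentages over paired spans.

    Assumption: a prediction counts as a hit at threshold k when its
    IoU is >= k; an official scorer may use a strict inequality.
    """
    if not preds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal in length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    n = len(ious)
    report = {"mIoU": 100.0 * sum(ious) / n}
    for k in thresholds:
        report[f"IoU@{k}"] = 100.0 * sum(iou >= k for iou in ious) / n
    return report


if __name__ == "__main__":
    # Toy spans, not data from the paper.
    predictions = [(0.0, 10.0), (10.0, 20.0)]
    references = [(0.0, 10.0), (15.0, 25.0)]
    print(summarize(predictions, references))
```

Under this reading, mIoU averages the per-question overlap, while IoU@k reports the share of questions whose predicted span overlaps the reference by at least k.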

Anthology ID: 2026.bionlp-2.31
Volume: Proceedings of the BioNLP 2026 (Shared Tasks)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Deepak Gupta, Dina Demner-Fushman
Venues: BioNLP | WS
Publisher: Association for Computational Linguistics
Pages: 233–242
URL: https://aclanthology.org/2026.bionlp-2.31/
Cite (ACL): Watcharitpol Sermsrisuwan, Nopporn Lekuthai, Seksan Yoadsanit, and Titipat Achakulvisut. 2026. LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting. In Proceedings of the BioNLP 2026 (Shared Tasks), pages 233–242, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting (Sermsrisuwan et al., BioNLP 2026)
PDF: https://aclanthology.org/2026.bionlp-2.31.pdf
Supplementary material: 2026.bionlp-2.31.SupplementaryMaterial.zip
Supplementary material: 2026.bionlp-2.31.SupplementaryMaterial.txt
