VPAI_Lab at MedVidQA 2022: A Two-Stage Cross-modal Fusion Method for Medical Instructional Video Classification

Bin Li, Yixuan Weng, Fei Xia, Bin Sun, Shutao Li


Abstract
This paper describes the VPAI_Lab team’s approach to the BioNLP 2022 shared task 1, Medical Video Classification (MedVidCL). Given an input video, the MedVidCL task aims to classify it into one of the following three categories: Medical Instructional, Medical Non-instructional, and Non-medical. Inspired by the dataset construction process, we divide classification into two stages. The first stage classifies videos as medical or non-medical. In the second stage, the samples classified as medical are further classified as instructional or non-instructional. In addition, we propose a cross-modal fusion method for video classification that fuses text features (question and subtitles) from pre-trained language models with visual features from image frames. Specifically, we use the textual information to concatenate with and query the visual information to obtain a better feature representation. Extensive experiments show that the proposed method outperforms the official baseline method by 15.4% in F1 score, which demonstrates its effectiveness. Finally, the online results show that our method ranks first on the online unseen test set. All experimental code is open-sourced at https://github.com/Lireanstar/MedVidCL.
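The two ideas in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `two_stage_classify` and `fuse_text_and_frames` are hypothetical, the stage classifiers are passed in as plain callables, and scaled dot-product attention is assumed as a stand-in for the way the text features "query" the visual features.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def two_stage_classify(
    features: Vector,
    is_medical: Callable[[Vector], bool],
    is_instructional: Callable[[Vector], bool],
) -> str:
    """Cascade two binary decisions into one of the three MedVidCL labels.

    Both classifiers are hypothetical placeholders supplied by the caller.
    """
    # Stage 1: separate medical from non-medical videos.
    if not is_medical(features):
        return "Non-medical"
    # Stage 2: only videos judged medical reach the second classifier.
    if is_instructional(features):
        return "Medical Instructional"
    return "Medical Non-instructional"


def fuse_text_and_frames(text_vec: Vector, frame_vecs: Sequence[Vector]) -> List[float]:
    """Use the text vector as a query over frame vectors, then concatenate.

    Assumption: scaled dot-product attention with a softmax over frames.
    All vectors are assumed to share the same dimensionality.
    """
    scale = math.sqrt(len(text_vec))
    # One relevance score per frame: scaled dot product with the text vector.
    scores = [
        sum(t * f for t, f in zip(text_vec, frame)) / scale for frame in frame_vecs
    ]
    # Numerically stable softmax over the frame scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted average of the frame vectors.
    dim = len(frame_vecs[0])
    attended = [
        sum(w * frame[i] for w, frame in zip(weights, frame_vecs)) for i in range(dim)
    ]
    # Concatenate the text features with the text-conditioned visual features.
    return list(text_vec) + attended
```

In a setup like this, the fused vector would feed the stage classifiers, so the second stage never sees videos that the first stage rejected as non-medical.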
Anthology ID:
2022.bionlp-1.21
Volume:
Proceedings of the 21st Workshop on Biomedical Language Processing
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii
Venue:
BioNLP
Publisher:
Association for Computational Linguistics
Pages:
212–219
URL:
https://aclanthology.org/2022.bionlp-1.21
DOI:
10.18653/v1/2022.bionlp-1.21
Cite (ACL):
Bin Li, Yixuan Weng, Fei Xia, Bin Sun, and Shutao Li. 2022. VPAI_Lab at MedVidQA 2022: A Two-Stage Cross-modal Fusion Method for Medical Instructional Video Classification. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 212–219, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
VPAI_Lab at MedVidQA 2022: A Two-Stage Cross-modal Fusion Method for Medical Instructional Video Classification (Li et al., BioNLP 2022)
PDF:
https://aclanthology.org/2022.bionlp-1.21.pdf
Code:
lireanstar/medvidcl
Data:
Kinetics, MedVidQA