Shixin Jiang


2024

Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models
Shixin Jiang | Zerui Chen | Jiafeng Liang | Yanyan Zhao | Ming Liu | Bing Qin
Findings of the Association for Computational Linguistics: EMNLP 2024

Expanding the understanding capabilities of multi-modal large language models (MLLMs) to the infrared modality is challenging because of the single-modality nature and limited amount of training data. Existing methods typically construct a uniform embedding space for cross-modal alignment and leverage abundant visual image data to understand infrared images indirectly. However, they ignore the supervisory signals of infrared-modality-specific attributes, which may lead to a biased understanding of infrared images. To address this issue, we propose a debating multi-agent generation system that transfers knowledge from visible images to generate infrared image-text pairs and infrared instruction data. Moreover, we construct an infrared question-answering benchmark based on common infrared tasks. Experimental results from incremental fine-tuning of existing models, and from our Infrared-LLaVA-7B trained from scratch on infrared data, demonstrate the effectiveness of the generated data and the feasibility of the generation approach.