Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models
Shixin Jiang | Zerui Chen | Jiafeng Liang | Yanyan Zhao | Ming Liu | Bing Qin
Findings of the Association for Computational Linguistics: EMNLP 2024
Expanding the understanding capabilities of multi-modal large language models (MLLMs) to the infrared modality is challenging because of its single-modality nature and the limited amount of training data. Existing methods typically construct a uniform embedding space for cross-modal alignment and leverage abundant visible-image data to understand infrared images indirectly. However, they ignore the supervisory signals of infrared-modality-specific attributes, which may lead to a biased understanding of infrared images. To address this issue, we propose a debating multi-agent generation system that transfers knowledge from visible images to generate infrared image-text pairs and infrared instruction data. Moreover, we construct an infrared question-answering benchmark based on common infrared tasks. Experimental results from incremental fine-tuning of existing models and from our Infrared-LLaVA-7B trained from scratch on infrared data demonstrate the effectiveness of the generated data and the feasibility of the generation approach.
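To make the debate-based data generation idea concrete, the following is a minimal, hypothetical Python sketch of how proposer and judge agents might iteratively refine an infrared caption from knowledge transferred out of a paired visible-image description. The function and prompt names (`query_agent`, `generate_infrared_caption`, the role strings) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: debate-style generation of an infrared caption
# from a paired visible-image description. All names are placeholders.

from dataclasses import dataclass


@dataclass
class DebateTurn:
    agent: str
    caption: str
    critique: str


def query_agent(role: str, prompt: str) -> str:
    """Stub for a call to an underlying (M)LLM; returns a placeholder string."""
    return f"[{role} response to: {prompt[:40]}...]"


def generate_infrared_caption(visible_description: str, rounds: int = 2) -> str:
    """Two proposer agents draft infrared captions; a judge agent critiques
    each round, and the last revised draft is kept as the caption."""
    history: list[DebateTurn] = []
    draft = query_agent(
        "proposer_A",
        f"Describe the infrared counterpart of: {visible_description}")
    for _ in range(rounds):
        counter = query_agent(
            "proposer_B",
            f"Revise this infrared caption, focusing on thermal attributes "
            f"rather than color or texture: {draft}")
        critique = query_agent(
            "judge",
            f"Which caption better reflects infrared-specific attributes?\n"
            f"A: {draft}\nB: {counter}")
        history.append(DebateTurn("judge", counter, critique))
        draft = counter  # carry the revised caption into the next round
    return draft


if __name__ == "__main__":
    print(generate_infrared_caption("A pedestrian crossing a dimly lit street at night"))
```

The point the sketch illustrates is that supervision for infrared-specific attributes comes from agents debating over knowledge transferred from visible images, rather than from a shared embedding space alone; the actual system, prompts, and agent roles are described in the paper.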