Guo Wenyu


2024

Bridging the Gap between Authentic and Answer-Guided Images for Chinese Vision-Language Understanding Enhancement
Wang Feiyu | Guo Wenyu | Yu Dong | Kang Chen | Liu Pengyuan
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

“The objective of the Chinese Vision-Language Understanding Evaluation (CVLUE) is to comprehensively assess the performance of Chinese vision-language multimodal pre-trained models in multimodal modeling and understanding across four tasks: Image-Text Retrieval, Visual Question Answering, Visual Grounding, and Visual Dialog. To enhance the models’ performance across various multimodal tasks, this paper proposes a multimodal information understanding enhancement method based on answer-guided images. Firstly, we propose task-specific methods for answer-guided image generation. Secondly, the authentic and answer-guided images are fed into the model for multimodal fine-tuning, respectively. Finally, training objectives are set for different tasks to minimize the gap between the answer-guided images and authentic images, thereby supervising the results produced by the authentic images utilizing answer-guided images. The experimental results demonstrate the effectiveness of the proposed method.”