Bridging the Gap between Authentic and Answer-Guided Images for Chinese Vision-Language Understanding Enhancement

Wang Feiyu, Guo Wenyu, Yu Dong, Kang Chen, Liu Pengyuan


Abstract
The objective of the Chinese Vision-Language Understanding Evaluation (CVLUE) is to comprehensively assess the performance of Chinese vision-language multimodal pre-trained models in multimodal modeling and understanding across four tasks: Image-Text Retrieval, Visual Question Answering, Visual Grounding, and Visual Dialog. To enhance the models’ performance across these multimodal tasks, this paper proposes a multimodal information understanding enhancement method based on answer-guided images. First, we propose task-specific methods for answer-guided image generation. Second, the authentic and answer-guided images are each fed into the model for multimodal fine-tuning. Finally, training objectives are set for the different tasks to minimize the gap between the answer-guided images and the authentic images, so that the answer-guided images supervise the results produced from the authentic images. The experimental results demonstrate the effectiveness of the proposed method.
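The third step of the abstract (a training objective that narrows the gap between the two image views) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' formulation: the abstract does not say how the gap is measured, so the sketch uses a KL divergence between the two output distributions of a classification-style task, and the names `gap_bridging_loss` and `gap_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label under `probs`."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gap_bridging_loss(authentic_logits, guided_logits, label, gap_weight=1.0):
    """Task loss on both views plus a penalty on their disagreement.

    ASSUMPTION: the gap term is KL(guided || authentic); the paper's abstract
    only states that the gap between the two views is minimized.
    """
    p_auth = softmax(authentic_logits)
    p_guided = softmax(guided_logits)
    # Both views are fine-tuned on the task itself.
    task_loss = cross_entropy(p_auth, label) + cross_entropy(p_guided, label)
    # The answer-guided view acts as the reference for the authentic view.
    gap = kl_divergence(p_guided, p_auth)
    return task_loss + gap_weight * gap


if __name__ == "__main__":
    # Hypothetical logits for a 3-way answer classification example.
    authentic = [0.1, 0.2, 0.3]
    guided = [3.0, 0.0, -1.0]
    print(gap_bridging_loss(authentic, guided, label=0))
```

When the two views produce identical logits the gap term vanishes and only the task loss remains; the further the authentic-image prediction drifts from the answer-guided one, the larger the penalty. Retrieval or grounding tasks would need a different gap measure, which is presumably why the abstract speaks of task-specific objectives.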
Anthology ID:
2024.ccl-3.40
Volume:
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
Month:
July
Year:
2024
Address:
Taiyuan, China
Editors:
Hongfei Lin, Hongye Tan, Bin Li
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
353–362
Language:
English
URL:
https://aclanthology.org/2024.ccl-3.40/
Cite (ACL):
Wang Feiyu, Guo Wenyu, Yu Dong, Kang Chen, and Liu Pengyuan. 2024. Bridging the Gap between Authentic and Answer-Guided Images for Chinese Vision-Language Understanding Enhancement. In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations), pages 353–362, Taiyuan, China. Chinese Information Processing Society of China.
Cite (Informal):
Bridging the Gap between Authentic and Answer-Guided Images for Chinese Vision-Language Understanding Enhancement (Feiyu et al., CCL 2024)
PDF:
https://aclanthology.org/2024.ccl-3.40.pdf