Direct Metric Optimization for Image Captioning through Reward-Weighted Augmented Data Utilization

Takumi Takada, Yuma Suzuki, Hiroki Takushima, Hayato Tanoue, Haruki Sato, Aiswariya Kumar, Hiroki Nishihara, Takayuki Hori, Kazuya Ueki


Abstract
While image captioning is an essential task for vision language models (VLMs), a mismatch between the training objective and the final performance metrics of VLMs complicates their training and optimization. Reinforcement learning (RL) can optimize such metrics directly, but its significant computational cost makes it difficult to apply to recent large-scale VLMs. In this paper, we propose Direct Metric Optimization (DMO), a lightweight training method that optimizes the final metric. We replace the computationally expensive exploration process in RL with offline, diverse text data augmentation and show that self-supervised training on reward-weighted augmented data leads to direct and stable metric optimization. Our experiments demonstrate that DMO achieves performance comparable to that of the state-of-the-art RL method while requiring hundreds of times fewer model forward passes and substantially less computation time. This suggests that DMO is a promising alternative for metric optimization in the era of large-scale VLMs.
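The abstract does not give the loss used for reward-weighted training, so the following is only a minimal sketch of the general idea: several offline-augmented captions for one image are each scored by a metric, and the model's negative log-likelihood on each caption is weighted by that score. The softmax weighting, the `temperature` parameter, and the function names `reward_weights` and `reward_weighted_nll` are assumptions made for illustration, not the paper's formulation.

```python
import math


def reward_weights(rewards, temperature=1.0):
    """Turn per-caption metric scores into weights that sum to 1.

    Assumption: a softmax over rewards; the paper may weight differently.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [r / temperature for r in rewards]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_nll(log_probs, rewards, temperature=1.0):
    """Weighted negative log-likelihood over augmented captions.

    log_probs: model log-probability of each augmented caption.
    rewards:   metric score of each augmented caption (hypothetical values).
    """
    if len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must have the same length")
    weights = reward_weights(rewards, temperature)
    return -sum(w * lp for w, lp in zip(weights, log_probs))


if __name__ == "__main__":
    # Three augmented captions for one image, with made-up scores.
    example_log_probs = [-4.0, -2.5, -6.0]
    example_rewards = [0.3, 0.9, 0.1]
    print(reward_weighted_nll(example_log_probs, example_rewards, temperature=0.5))
```

In this sketch, no captions are sampled from the model during training: the candidates and their scores are fixed in advance, so each training step needs only the forward pass that produces the log-probabilities. A lower `temperature` concentrates the weight on the highest-scoring captions.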
Anthology ID:
2024.acl-long.453
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8333–8346
URL:
https://aclanthology.org/2024.acl-long.453
Cite (ACL):
Takumi Takada, Yuma Suzuki, Hiroki Takushima, Hayato Tanoue, Haruki Sato, Aiswariya Kumar, Hiroki Nishihara, Takayuki Hori, and Kazuya Ueki. 2024. Direct Metric Optimization for Image Captioning through Reward-Weighted Augmented Data Utilization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8333–8346, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Direct Metric Optimization for Image Captioning through Reward-Weighted Augmented Data Utilization (Takada et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.453.pdf