Language Resource Efficient Learning for Captioning

Jia Chen, Yike Wu, Shiwan Zhao, Qin Jin


Abstract
Because manually writing a caption for each image or video input requires complex cognitive and inferential effort, human annotation resources for captioning tasks are very limited. We define language resource efficiency as reaching the same performance with fewer annotated captions per input. We first study the performance degradation of caption models under different language resource settings. Our analysis of caption models trained with the self-critical (SC) loss shows that the degradation is caused by increasingly noisy estimates of the reward and baseline as language resources shrink. To mitigate this issue, we propose to reduce the variance of the noise in the baseline by generalizing the single pairwise comparison in the SC loss and using multiple generalized pairwise comparisons. A generalized pairwise comparison (GPC) measures the difference between the evaluation scores of two captions with respect to an input. Empirically, we show that a model trained with the proposed GPC loss is language resource efficient, achieving performance comparable to state-of-the-art models on MSCOCO while using only half the language resources. Furthermore, our model significantly outperforms state-of-the-art models on a video captioning dataset that has only one labeled caption per input in the training set.
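The core idea in the abstract can be illustrated with a minimal sketch. The function names and the averaging scheme below are illustrative assumptions, not the authors' implementation: the SC loss weights the policy gradient by a single pairwise comparison (sampled reward minus one baseline reward), while a GPC-style loss averages several such comparisons against multiple baseline captions, which reduces the variance of the noisy baseline estimate.

```python
def sc_weight(reward_sampled: float, reward_baseline: float) -> float:
    """Self-critical (SC) style weight: a single pairwise comparison
    between the sampled caption's reward and one baseline reward."""
    return reward_sampled - reward_baseline


def gpc_weight(reward_sampled: float, baseline_rewards: list[float]) -> float:
    """GPC-style weight (illustrative sketch): average multiple pairwise
    comparisons against several baseline captions, lowering the variance
    of the baseline estimate when few annotated captions are available."""
    assert baseline_rewards, "need at least one baseline comparison"
    return sum(reward_sampled - b for b in baseline_rewards) / len(baseline_rewards)
```

With a sampled-caption reward of 1.0 and baseline rewards [0.2, 0.4, 0.6], the averaged comparison equals the single comparison against the mean baseline (0.4), but each individual noisy baseline contributes less to the final weight.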
Anthology ID:
2021.findings-emnlp.162
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
1887–1895
URL:
https://aclanthology.org/2021.findings-emnlp.162
DOI:
10.18653/v1/2021.findings-emnlp.162
Cite (ACL):
Jia Chen, Yike Wu, Shiwan Zhao, and Qin Jin. 2021. Language Resource Efficient Learning for Captioning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1887–1895, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Language Resource Efficient Learning for Captioning (Chen et al., Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.162.pdf
Video:
https://aclanthology.org/2021.findings-emnlp.162.mp4
Data
TGIF