CNN-Transformer based Encoder-Decoder Model for Nepali Image Captioning

Bipesh Subedi, Bal Krishna Bal


Abstract
Image captioning has been studied extensively in recent years, with most of the work targeting the English language. A few research works have also addressed Hindi and Bengali. Nepali, however, has received little research attention in this area, and datasets with Nepali captions are not publicly available. This research aims to prepare a dataset with Nepali captions and to develop a deep learning model that combines a Convolutional Neural Network (CNN) with a Transformer to automatically generate image captions in Nepali. The dataset is prepared by applying different data preprocessing techniques to the Flickr8k dataset. The preprocessed data is then passed to the CNN-Transformer model to generate image captions. ResNet-101 and EfficientNetB0 are the two pre-trained CNN models employed in this work. We have achieved some promising results, which can be further improved in the future.
Anthology ID:
2022.icon-main.12
Volume:
Proceedings of the 19th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2022
Address:
New Delhi, India
Editors:
Md. Shad Akhtar, Tanmoy Chakraborty
Venue:
ICON
Publisher:
Association for Computational Linguistics
Pages:
86–91
URL:
https://aclanthology.org/2022.icon-main.12
Cite (ACL):
Bipesh Subedi and Bal Krishna Bal. 2022. CNN-Transformer based Encoder-Decoder Model for Nepali Image Captioning. In Proceedings of the 19th International Conference on Natural Language Processing (ICON), pages 86–91, New Delhi, India. Association for Computational Linguistics.
Cite (Informal):
CNN-Transformer based Encoder-Decoder Model for Nepali Image Captioning (Subedi & Krishna Bal, ICON 2022)
PDF:
https://aclanthology.org/2022.icon-main.12.pdf