Exploring the Impact of Vision Features in News Image Captioning

The task of news image captioning aims to generate a detailed caption which describes the specific information of an image in a news article. However, we find that recent state-of-the-art models can achieve competitive performance even without vision features. To investigate the impact of vision features in the news image captioning task, we conduct extensive experiments with mainstream models based on the encoder-decoder framework. From our exploration, we find that 1) vision features do contribute to the generation of news image captions; 2) vision features help models generate the entities of captions better when the entity information is sufficient in the input textual context of the given article; and 3) regions of specific objects in images contribute to the generation of related entities in captions.


Introduction
Image captioning is a multi-modal task which has developed rapidly in recent years (Wang et al., 2020; He et al., 2020; Sammani and Melas-Kyriazi, 2020). Image captioning models can generate captions which accurately describe the generic object categories and object relations in images on image captioning datasets such as COCO (Lin et al., 2014; Chen et al., 2015) or Flickr (Hodosh et al., 2013). However, these generic image captioning datasets often contain few details in captions, such as names, places, or other specific entity information, which are common in captions of news images.
News image captioning (Lu et al., 2018; Biten et al., 2019; Tran et al., 2020) aims to generate more specific descriptions of images by providing rich contextual information from associated news articles. Specifically, given a news image and the corresponding article, models need to generate a caption which not only describes the whole image in general but also contains specific information such as the names or places of objects in the image. Significant progress has been made with the introduction of transformer-based end-to-end captioning methods (Tran et al., 2020; Liu et al., 2021; Yang et al., 2021).
As a multi-modality task, news image captioning models usually generate captions from both textual and vision features. Note that a series of explorations (Elliott, 2018; Caglayan et al., 2019; Wu et al., 2021; Li et al., 2022) have been made for other multi-modality tasks such as Multimodal Machine Translation (MMT) to determine the impact of vision features. It is natural and important to explore the role of vision features in the news image captioning task, where all specific textual information is in the news articles. We raise the following questions: RQ1: Do vision features help or not? and RQ2: How do vision features help? In particular, RQ2 can be decomposed into two subquestions: RQ2-1: Which part of captions do vision features help to generate? and RQ2-2: Which part of images helps the generation?
In order to answer the above questions, we conduct a series of experiments using the most successful news image captioning models in recent studies on two major news image caption datasets, GoodNews (Biten et al., 2019) and NYTimes800k (Tran et al., 2020). We first evaluate models under incongruent vision features to preliminarily determine whether the models are sensitive to the vision features. Then we modify the vision features and textual features respectively to explore the specific contribution of vision features to the captioning process. We cover specific types of image regions to explore the relationship between the generation of entities and their related regions of images. Following Caglayan et al. (2019), we also conduct probing tasks to find how vision features affect caption generation under insufficient input textual features. Our main conclusions are:
• Vision features do contribute to generating news image captions. (Answer to RQ1)
• Vision features improve the model's ability to generate specific entities which appear in the textual context. Scarcity of textual entity information seriously reduces the impact of vision features. (Answer to RQ2-1)
• Specific regions of objects in images help models generate the related entity information in captions more accurately when the corresponding entity information is sufficient in the textual context. (Answer to RQ2-2)

Exploration Method
We introduce a series of methods to analyze whether and how vision features help. We start with the definition of Incongruent Decoding and then introduce the Degradation and Cover methods for textual and vision features, respectively.

Incongruent Decoding
Incongruent decoding is widely used as a method to evaluate whether visual information plays a role in MMT tasks (Elliott, 2018; Caglayan et al., 2019). We evaluate the performance of models under incongruent multi-modal information while keeping the congruence of multi-modal features during the training stage. Specifically, we randomly replace the input image with another image from the same dataset during evaluation. Generally, this method will decrease performance if the model is sensitive to the visual modality.
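As a minimal illustration of this evaluation protocol, the sketch below shuffles the vision features within a batch at test time so that every caption is decoded with an image from a different sample. The batch-level shuffling and the function name are our own simplifications; the experiments above sample the replacement image from the whole dataset rather than from the batch.

```python
import random

import torch


def incongruent_image_features(image_features: torch.Tensor) -> torch.Tensor:
    """Shuffle image features across the batch so that each caption is decoded
    with an image taken from a different sample (batch size > 1 assumed)."""
    batch_size = image_features.size(0)
    perm = list(range(batch_size))
    random.shuffle(perm)
    for i in range(batch_size):
        # Repair any fixed point so no sample keeps its own image.
        if perm[i] == i:
            j = (i + 1) % batch_size
            perm[i], perm[j] = perm[j], perm[i]
    return image_features[perm]
```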

Image Cover
In order to further study the mechanism of the visual modality in news image captioning, we explore the relationship between specific regions of images and the generation of corresponding entities in captions. Since 80% of images are detected with regions of people, and these regions account for approximately half of the image area on average, we choose regions of people and people's names to investigate this relationship in our image cover experiments. Following Tran et al. (2020), we first detect people regions in images with YOLOv3 (Redmon and Farhadi, 2018) and then cover them with blank regions, as shown in Figure 1.

We further analyze the occurrence rate of entities in captions. Specifically, we first apply SpaCy to recognize the entities in articles and captions, and then compare their strings exactly to obtain the match rate. From Table 1, we see that only 39% of entities in captions match entities in articles on GoodNews. Though NYTimes800k contains longer and more informative articles, 35% of entities still do not appear in the articles. Besides, we also employ YOLOv3 to detect regions of people in news images. On GoodNews and NYTimes800k, regions of people account for 49% and 45% of the whole image area respectively, and 80% of images are detected with regions of people on both datasets. Following Tran et al. (2020), we use the first 500 words as the limited input context on GoodNews, and location-aware paragraphs until they contain more than 512 sub-words as the limited input context on NYTimes800k.
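The sketch below illustrates the two operations described above: covering detected people regions with blank rectangles, and computing the entity match rate between a caption and its article. The `person_boxes` argument is assumed to come from an off-the-shelf YOLOv3 detector, and the spaCy model name is a placeholder; the actual pipeline follows Tran et al. (2020).

```python
import spacy
from PIL import Image

nlp = spacy.load("en_core_web_sm")  # placeholder spaCy model


def cover_people_regions(image: Image.Image, person_boxes) -> Image.Image:
    """Paint every detected person bounding box (x1, y1, x2, y2) with a blank patch."""
    covered = image.copy()
    for (x1, y1, x2, y2) in person_boxes:
        blank = Image.new("RGB", (x2 - x1, y2 - y1), color=(255, 255, 255))
        covered.paste(blank, (x1, y1))
    return covered


def entity_match_rate(caption: str, article: str) -> float:
    """Fraction of caption entities whose surface string also occurs among
    the article entities (exact string match)."""
    caption_ents = {ent.text for ent in nlp(caption).ents}
    article_ents = {ent.text for ent in nlp(article).ents}
    if not caption_ents:
        return 0.0
    return len(caption_ents & article_ents) / len(caption_ents)
```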

Multi-Modality Features
Vision Features We use the ResNet-152 (He et al., 2016) model pre-trained on ImageNet to obtain the representations of images. Following the settings in Tran et al. (2020), we use the same image preprocessing pipeline and take the output before the average pooling layer of ResNet-152 as our vision features.

Textual Features Following Tran et al. (2020), we obtain the textual features with the RoBERTa-large model (Liu et al., 2019), a pre-trained model consisting of 24 layers of bidirectional transformer blocks which encodes text into contextual embeddings. We use the weighted RoBERTa technique, which obtains the final textual features as a weighted sum of the outputs from each transformer layer and the initial uncontextualized embeddings. On the GoodNews dataset, which does not contain image location information, we use the first 512 sub-words obtained by the BPE (Byte Pair Encoding) technique in RoBERTa. On the NYTimes800k dataset, where the image location is provided, we use the location-aware 512 sub-word tokens as textual input features for both textual-modality-only and multi-modality models.
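A minimal sketch of both extractors is given below, assuming torchvision's ResNet-152 and the fairseq RoBERTa-large hub interface. The preprocessing details follow Tran et al. (2020) and are omitted, and the layer weights of the weighted-RoBERTa sum are assumed to be learned jointly with the captioning model.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class VisionFeatureExtractor(nn.Module):
    """Spatial ResNet-152 features taken before the average-pooling layer."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)            # (B, 2048, H', W')
        return feats.flatten(2).transpose(1, 2)  # (B, H'*W', 2048): one vector per patch


class WeightedRoBERTa(nn.Module):
    """Weighted sum over the embedding output and the 24 transformer layers."""

    def __init__(self, roberta):
        super().__init__()
        self.roberta = roberta  # e.g. torch.hub.load('pytorch/fairseq', 'roberta.large')
        self.layer_weights = nn.Parameter(torch.zeros(25))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # List of 25 hidden states (embeddings + 24 layers), each (B, T, 1024).
        all_layers = self.roberta.extract_features(tokens, return_all_hiddens=True)
        hiddens = torch.stack(all_layers, dim=0)                       # (25, B, T, 1024)
        weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        return (weights * hiddens).sum(dim=0)                          # (B, T, 1024)
```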

Models
Tell (Tran et al., 2020) is an open-source SOTA model which uses RoBERTa to encode input articles and ResNet-152 to extract features from input images. Tell uses a transformer decoder to generate a caption, with dynamic convolutions (Wu et al., 2019) to attend to the generated tokens and multi-head attention (Vaswani et al., 2017) to attend to the multi-modality features. Tell also uses the weighted RoBERTa technique and extracts location-aware text. We also implement Tell with these techniques when only textual features are provided. Tell with the weighted RoBERTa and location-aware techniques is referred to as Tell(full) in our experiments. We conduct experiments on extra models in Appendix A.
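As a schematic illustration of this decoder design, the block below combines causal self-attention over the already generated caption tokens with multi-head cross-attention over the concatenated article and image features (assumed to be projected to a shared model dimension), in the spirit of Tell(EG+SA). The released Tell implementation uses dynamic convolutions over the generated tokens, so this sketch is only an approximation, not the official architecture.

```python
import torch
import torch.nn as nn


class CaptionDecoderBlock(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, caption, article_feats, vision_feats, causal_mask):
        # Causal self-attention over previously generated caption tokens.
        x = self.norm1(caption + self.self_attn(caption, caption, caption,
                                                attn_mask=causal_mask)[0])
        # Cross-attention over the concatenated multi-modality memory.
        memory = torch.cat([article_feats, vision_feats], dim=1)
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        return self.norm3(x + self.ffn(x))
```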

Implementation Details
We follow Tran et al. (2020) and Yang et al. (2021) to conduct our experiments with all models.
The hidden sizes of the input textual features and vision features are 1024 and 2048, respectively. Tell(full) includes 4 transformer decoder layers and JoGANIC includes 8 transformer decoder layers. The number of heads in multi-modality attention is 16. Specifically, we implement the Tell(full) model without the face and object encoders to make sure the multi-modality features are the same for all models.
Our implementation is based on PyTorch (Paszke et al., 2017) and the AllenNLP framework (Gardner et al., 2018), and our training settings are the same as Tran et al. (2020). We use fairseq (Ott et al., 2019) for the RoBERTa model and dynamic convolution. We use a batch size of 16 and train all models for 16 epochs on GoodNews and 9 epochs on NYTimes800k. For training, we use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁶. We apply mixed precision to train models on a single 3080 Ti GPU for 3 to 4 days on both datasets.
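The fragment below condenses the optimization setup just described into a short mixed-precision training loop; the model, data loader, learning rate, and epoch count are placeholders, and the actual configuration follows Tran et al. (2020) and is driven by AllenNLP.

```python
import torch


def train_mixed_precision(model, train_loader, lr=1e-4, epochs=1):
    """Sketch of the training loop; lr and epochs are placeholders."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 betas=(0.9, 0.98), eps=1e-6)
    scaler = torch.cuda.amp.GradScaler()
    for _ in range(epochs):
        for batch in train_loader:              # batch size 16 in our setup
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():     # mixed-precision forward pass
                loss = model(**batch)           # model returns the captioning loss
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
```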
We use the following evaluation metrics: BLEU-4 (Papineni et al., 2002), ROUGE (Lin, 2004) and CIDEr (Vedantam et al., 2015), computed with the COCO caption evaluation toolkit. According to previous research (Tran et al., 2020; Liu et al., 2021), CIDEr is the most suitable metric for the news image captioning task. We use SpaCy to identify named entities in both generated and ground-truth captions, and then obtain the precision and recall by matching the entity strings exactly. We select entities with the PERSON label to obtain the scores for people's names.
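A sketch of the entity-level scoring is shown below: named entities are extracted from the generated and ground-truth captions with spaCy and matched by exact string comparison. The spaCy model name is a placeholder, and `label_filter="PERSON"` restricts the scores to people's names.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder spaCy model


def entity_precision_recall(generated: str, reference: str, label_filter=None):
    """Exact-match precision/recall of named entities between two captions."""
    def entities(text):
        return [ent.text for ent in nlp(text).ents
                if label_filter is None or ent.label_ == label_filter]

    gen_ents, ref_ents = entities(generated), entities(reference)
    precision = (sum(e in ref_ents for e in gen_ents) / len(gen_ents)) if gen_ents else 0.0
    recall = (sum(e in gen_ents for e in ref_ents) / len(ref_ents)) if ref_ents else 0.0
    return precision, recall


# Example usage: entity_precision_recall(hypothesis, gold, label_filter="PERSON")
```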

Helpfulness of Vision Features
We first train and evaluate news image captioning models on the standard news image captioning data. Table 2 shows the performance of multi-modality models and their text-only versions. From the table, we can see that models using multi-modality features achieve better results over all automatic metrics than models relying only on textual features. Figure 2 shows that both CIDEr and the Recall of named entities drop significantly for all models with incongruent vision features on GoodNews and NYTimes800k. The result of the incongruent decoding experiment indicates that models are sensitive to vision features while generating captions.
We also train and test Tell(full) with incongruent vision features or with vision features from blank images to exclude the influence of parameter size. Besides, we use vision features from a randomly initialized ResNet (Rd-ResNet) for training and testing to determine whether vision features merely contribute by reducing overfitting. From Table 3, we can see that under the Blind, Blank, and randomly initialized ResNet training settings, Tell(full) obtains CIDEr and Recall of named entities close to those of text-only models. This result indicates that vision features contribute to the generation instead of just providing more parameters or preventing the model from overfitting.
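The sketch below shows one way the Blank and Rd-ResNet control settings can be constructed (a uniformly white image and a randomly initialised ResNet-152 backbone, respectively); the preprocessing constants are the standard ImageNet values and the recent torchvision weights API is assumed. The Blind setting simply reuses the incongruent shuffling shown earlier at both training and test time.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Standard ImageNet preprocessing (assumed to match the usual pipeline).
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


def blank_image_tensor() -> torch.Tensor:
    """Input for the 'Blank' setting: a uniformly white image."""
    return preprocess(Image.new("RGB", (256, 256), color=(255, 255, 255)))


def random_resnet_backbone() -> torch.nn.Module:
    """Backbone for the 'Rd-ResNet' setting: same architecture, random weights."""
    resnet = models.resnet152(weights=None)  # no pre-trained weights
    return torch.nn.Sequential(*list(resnet.children())[:-2])
```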

Visually Sensitive Parts of Captions
We first conduct Article Degradation experiments, which show that vision features can hardly assist models in generating entities when the entities in the textual context are masked. Since many entities in captions are also not contained in the standard textual context, we further analyze the Recall of entities in and out of the textual context. Table 4 summarizes the results of the Article Degradation experiments. For Tell(EG+SA), we also mask the independent entity sequences. Compared to Table 2, we can see that in the Article Degradation experiments, models with text-only features and models with multi-modality features both perform badly on the automatic metrics and on the generation of entities. What's more, masking the entity information also shrinks the improvement of multi-modality models over text-only models.
To exclude the influence introduced by the degradation method itself, we randomly mask the words other than entities in the context at the same mask rate (called NEM). To explore the importance of entities in the context, we also conduct a degradation experiment with only the entity sequences provided as textual context (called EO). In Figure 3, Ori means the original articles, and EM means the entity-mask article degradation method. Models are trained and tested with the corresponding insufficient textual context. The results show that the performance of the model declines little when masking other, non-entity words, while the improvement from vision features shrinks when no entity information is provided by the textual context. Considering that many entities are not contained in the textual context, we further analyze the impact of vision features on entities in and out of the textual context. Table 5 indicates that vision features better help models to generate entities contained in the textual context and contribute little to generating entities out of the textual context. This result indicates that textual context insufficiency exists even without any degradation method and limits the contribution of vision features, since the entities which are not contained in the textual context are less sensitive to vision features.
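A sketch of the three textual degradation variants is given below: EM masks every entity span, NEM randomly masks non-entity tokens at the same rate, and EO keeps only the entity spans. spaCy is assumed as the entity recognizer and "<mask>" as the mask token, as in the example of Figure 1.

```python
import random

import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder spaCy model


def degrade_article(article: str, mode: str = "EM") -> str:
    """Return the article with entities masked (EM), non-entity words masked at the
    same rate (NEM), or only the entity spans kept (EO)."""
    doc = nlp(article)
    ent_token_ids = {t.i for ent in doc.ents for t in ent}

    if mode == "EM":
        words = ["<mask>" if t.i in ent_token_ids else t.text for t in doc]
    elif mode == "NEM":
        non_ent = [t.i for t in doc if t.i not in ent_token_ids]
        masked = set(random.sample(non_ent, min(len(ent_token_ids), len(non_ent))))
        words = ["<mask>" if t.i in masked else t.text for t in doc]
    elif mode == "EO":
        words = [t.text for t in doc if t.i in ent_token_ids]
    else:
        raise ValueError(f"unknown degradation mode: {mode}")
    return " ".join(words)
```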

Impact of Image Regions on Entity Generation
We divide the test dataset into two subsets according to whether the image is detected with regions of people by YOLOv3. We then conduct our image cover experiment with three models on the subsets and analyze the Recall of people's names. Table 6 shows that models with vision features generate people's names better on the subset where images contain regions of people, compared to the text-only models. On the other subset, where images do not contain regions of people, the performance of most multi-modality models is lower than that of text-only models. After the corresponding regions are covered, models achieve a lower Recall of people's names on both subsets. Following the discussion in Section 4.2, we further analyze people's names that do and do not occur in the input context on the subset where images contain people regions. Table 7 shows that the Recall of people's names improves more significantly for names which also appear in the context. These results indicate that the vision features extracted from regions of people contribute to the generation of people's names in captions.
However, with the regions of people in images covered, models can still obtain a higher Recall of people's names than text-only models. The difference between using original and covered images further shrinks for all people's names because of the insufficient entity information in articles. A possible reason is that models can infer the existence of people from the covered images, since we mask all regions with blank, and knowing the occurrence of people may be enough for models to generate better people's names in some samples. We analyze Tell(full) with the original and covered images on the NYTimes800k dataset.

Case study
Finally, we choose an example from NYTimes800k to analyze the impact of vision features under diverse multi-modality features. We use Tell(full) to generate captions with text-only or multi-modality features extracted under different exploration methods, as shown in Figure 4. We see that Tell(full) with the original multi-modality features successfully generates all people's names and achieves the highest Recall over all entities. Specifically, Tell(full) with images where regions of people are covered performs well on the generation of the other entities except for the people's names "Zachary Lara" and "Sonya Yu". This indicates that the specific regions are helpful for the generation of the related entities. Besides, when the entity information is degraded from the input context, Tell(full) cannot generate the correct entities with or without vision features.

Related Work
Image captioning has improved significantly in recent years. Early models (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Donahue et al., 2015) exploited convolutional neural networks and recurrent neural networks to encode images and decode captions, respectively. Xu et al. (2015) attended to different image patches when generating different tokens, and Wang et al. (2019) applied the attention mechanism to regions of the corresponding object. News image captioning takes news article-image pairs as input and generates image captions rich in named entity information, making it a challenging task. News image captioning models generate specific entity information by using the textual information from the associated articles. There are two main categories of news image captioning models: (1) template-based methods (Lu et al., 2018; Biten et al., 2019), which first generate intermediate templates with placeholders for entities and then extract the specific entity information from the associated articles; and (2) end-to-end methods (Hu et al., 2020; Tran et al., 2020; Liu et al., 2021), which generate the whole caption directly in one step. Tran et al. (2020) applied the pre-trained RoBERTa model as the text encoder and a transformer decoder to generate captions with byte-pair encoding (BPE) (Sennrich et al., 2016). Liu et al. (2021) utilized a transformer architecture with extra methods such as an Entity-Guide and a Visual-selective layer to obtain better textual and vision features. News image captioning is a multi-modal task with textual and vision features as input. Some previous models can obtain competitive performance with textual input only, which has also been observed in other multimodal tasks such as MMT (Multimodal Machine Translation). Elliott (2018) found that models with random unrelated images can also obtain competitive results; Caglayan et al. (2019) and Li et al. (2022) conducted probing mask experiments and pointed out that vision features contribute more when the input text is insufficient. Wu et al. (2021) found that vision features assist MMT models through regularization of training. We refer to some of these methods to analyze the contribution of vision features in the news image captioning task as well.

Conclusions
In this work, we design and conduct extensive experiments to explore the impact of vision features on news image captioning models. First, we determine that vision features do contribute to generating news captions. Second, from our degradation experiments, we find that vision features help models generate entities that appear both in the textual context and in the captions. The low coverage of caption entities in the textual context limits the improvement obtained from applying vision features. Finally, we show that specific regions of images help models better generate the related entities in captions, although models with these regions covered can still generate the corresponding entities reasonably well. We believe it is important for future research to improve the ability of models to generate entities outside the textual context and to make better use of the specific information in different regions of images.

Limitations
Our experiments are conducted on transformer-based models with the same multi-modality features. Considering the importance of entity information in the textual context and of specific regions of images, it is also important to investigate whether the performance of the model improves with different methods of extracting multi-modality features. We use the BPE technique to encode all entities in the input articles, which may separate a whole entity word into several sub-word tokens and may affect the impact of vision features. In our experiments, there are still many caption entities that do not appear in the articles of the datasets. Our conclusions would be more convincing if we could conduct experiments on other datasets which contain more of the information needed for captions.

A Appendix
We conduct extra experiments based on other models and vision features on the NYTimes800k dataset. We implement an LSTM-based model with RoBERTa and a Transformer model with GloVe following the settings from Tran et al. (2020) with location-aware textual context. We also test Tell(full) with vision features extracted from ViT-L/16 (Dosovitskiy et al., 2021). The performance on the original data is shown in Table 9, and the performance in the Article Degradation experiment is shown in Table 10. Table 11 shows the results for the generation of named entities by the extra models, and Table 12 shows the results of the extra models in the Image Cover experiment. The results of the extra experiments are consistent with our conclusions.

Figure 1 :
Figure 1: The original multi-modal information (upper) and the modified multi-modal information in Article Degradation and Image Cover experiments (below)

Figure 2 :
Figure 2: CIDEr and Recall of Named entities with congruent and incongruent multi-modality features; all models are trained with congruent multi-modality features.

Figure 3 :
Figure 3: CIDEr and Recall of Named entities for the four different Article Degradation methods with the Tell(full) model on NYTimes800k. NEM stands for masking non-entity words randomly, and EO for input entity sequences only.

Figure 4 :
Figure 4: An example of news image caption generation. (1) is the ground-truth caption and the others are generated by Tell(full) with: (2) standard multi-modality features; (3) textual features only; (4) the original article with the covered image; (5) the degraded article with the original image; (6) the degraded article only. We highlight the entities in captions in red.
<mask> <mask>, who is from <mask>, has opened <mask> <mask>, a gluten-free bakery there. He's a fitness trainer who went into the baking business because his son could not tolerate gluten. Mr. <mask> cakes, notably the handsome banana variety, and ......

In the Article Degradation experiments, the recognized entities are masked in the news articles of the training and test data. To exclude the influence of the degradation method itself, we also randomly mask words which are not entities at the same rate. Through the article degradation experiments, we can analyze the way images help to generate the specific entities: by generating them from scratch, or by assisting models to select the right entities appearing in the textual context.

……Now Mounir Jabrane, who is from Morocco, has opened Le Gourmand, a gluten-free bakery there. He's a fitness trainer who went into the baking business because his son could not tolerate gluten. Mr. Jabrane's cakes, notably the handsome banana variety, and ......

Table 2 :
Results of models on GoodNews and NYTimes800k: we implement Tell(full), JoGANIC and Tell(EG+SA) with or without vision features on the standard news image captioning data. We report BLEU-4, ROUGE, CIDEr, and the precision (P) and recall (R) of Named entities and People's names. We implement the JoGANIC model within the framework of Tell since no official code for JoGANIC has been released. We directly use the results of the Tell model from Tran et al. (2020).

JoGANIC (Yang et al., 2021) is a transformer-based model with component template guidance following journalistic guidelines. Oracle template vectors are generated from captions according to high-level component classes (Who, When, etc.) based on journalistic guidelines. JoGANIC uses a hybrid transformer decoder to generate captions with the template vector predicted by a multi-layer perceptron. JoGANIC also applies an extra Named Entity Embedding (NEE) and a Multi-Span Text Reading (MSTR) method to obtain better representations of entities and to read longer articles. In our experiments, we implement JoGANIC with the same multi-modality features obtained in Section 3.2.
Following Liu et al. (2021), we also implement Tell(full) with the entity guidance (EG) technique and replace Dynamic Convolution with the self-attention (SA) mechanism, which is referred to as Tell(EG+SA) in our experiments.

Table 3 :
CIDEr and Recall of Named entities with Tell(full) under other training settings on NYTimes800k. The Blind system uses incongruent vision features both to train and test models.

Table 4 :
The results of the Article Degradation experiments on the GoodNews and NYTimes800k datasets. We report BLEU-4, CIDEr, and the Recall (R) of Named entities.

Table 5 :
Recall of Named entities which are in or out of the input textual context with standard multi-modality features on GoodNews and NYTimes800k.

Table 6 :
Recall of People's names for models applying textual features and multi-modality features with original or covered images, on two subsets from GoodNews and NYTimes800k divided by whether images contain regions of people.

Table 8 indicates that vision features extracted from covered images still contribute substantially to samples where text-only Tell(full) can hardly produce any people's names. We leave the detailed mechanism to future work.

Table 7 :
Recall of People's names contained in input textual context with image cover experiment on the subset where the image is detected with regions of people from GoodNews and NYTimes800k.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048-2057, Lille, France. PMLR.

Xuewen Yang, Svebor Karaman, Joel Tetreault, and Alejandro Jaimes. 2021. Journalistic guidelines aware news image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5162-5175, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Mingyang Zhou, Grace Luo, Anna Rohrbach, and Zhou Yu. 2022. Focus! Relevant and sufficient context selection for news image captioning.