Visual News: Benchmark and Challenges in News Image Captioning

We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method achieves slightly better prediction results than competing methods while using far fewer parameters. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.


Introduction
Image captioning is a language and vision task that has received considerable attention and where important progress has been made in recent years (Vinyals et al., 2015; Xu et al., 2015; Lu et al., 2018b; Anderson et al., 2018). This field has been fueled by recent advances in both visual representation learning and text generation, and also by the availability of image-text parallel corpora such as the Common Objects in Context (COCO) Captions dataset (Chen et al., 2015).
While COCO contains enough images to train reasonably good captioning models, it was collected so that objects depicted in the images are biased toward a limited set of everyday objects. Moreover, while it provides high-quality human-annotated captions, these captions were written so that they are descriptive rather than interpretative, and referents to objects are generic rather than specific. For example, a caption such as "A bunch of people who are holding red umbrellas." properly describes the image at some level to the right in Figure 1, but it fails to capture the higher-level situation that is taking place in this picture, i.e. "why are people gathering with red umbrellas and what role do they play?" This type of language is typical in describing events in news text. While a lot of work has been done on news text corpora such as the influential Wall Street Journal Corpus (Paul and Baker, 1992), there have been considerably fewer resources of such news text in the language and vision domain.

[Figure 1: Examples from our Visual News dataset (left) and COCO (Chen et al., 2015) (right). Visual News provides more informative captions with named entities (e.g., "President Obama and Mitt Romney debate in Hempstead NY on Tuesday."; "Virginia Cavaliers fans celebrate on the court after the Cavaliers game against the Duke Blue Devils at John Paul Jones Arena."), whereas COCO contains more generic captions (e.g., "A baseball player hitting the ball during the game."; "A bunch of people who are holding red umbrellas.").]

* Work completed before joining Amazon.
In this paper, we introduce Visual News, a dataset and benchmark containing more than one million publicly available news images paired with both captions and news article text, collected from a diverse set of topics and news sources in English (The Guardian, BBC, USA TODAY, and The Washington Post). By leveraging this dataset, we focus on the task of News Image Captioning, which aims at generating captions from both input images and corresponding news articles. We further propose Visual News Captioner, a model that generates captions by attending to individual word tokens and named entities in the input news article, as well as to localized visual features.
News image captions are typically more complex than generic image captions and are thus harder to generate. News captions describe the contents of images at a higher degree of specificity and as such contain many named entities referring to specific people, places, and organizations. Such named entities convey key information regarding the events presented in the images, and conversely events can often be used to predict what types of entities are involved. For example, if the news article mentions a baseball game, then the picture might involve a baseball player or a coach; conversely, if the image contains someone wearing baseball gear, it might imply that a game of baseball is taking place. As such, our Visual News Captioner model jointly uses spatial-level visual feature attention and word-level textual feature attention.
More specifically, we adapt the existing Transformer (Vaswani et al., 2017) to news image datasets by integrating several critical components. To effectively attend to important named entities in news articles, we apply the Attention on Attention technique to the attention layers and introduce a new position encoding method to model the relative positions of words. We also propose a novel Visual Selective Layer to learn joint multimodal embeddings. To avoid missing rare named entities, we build our decoder upon the pointer-generator model. News captions also contain a significant number of words falling either in the long tail of the distribution or resulting in out-of-vocabulary words at test time. To alleviate this, we introduce a tag-cleaning post-processing step to further improve our model. Previous works (Lu et al., 2018a; Biten et al., 2019) have attempted news image captioning by adopting a two-stage pipeline. They first replace all specific named entities with entity-type tags to create templates and train a model to generate template captions with fillable placeholders. Then, these methods search the input news articles for entities to fill the placeholders. Such an approach reduces the vocabulary size and eases the burden on the template generator network. However, our extensive experiments suggest that template-based approaches might also prevent these models from leveraging contextual clues from the named entities themselves in their first stage.

[Example article excerpt (figure content): "Hillary Clinton is the Democratic Party's presumptive presidential nominee according to the Associated Press, securing enough support from superdelegates to push her over the top on the eve of the final round of state primaries. Both AP and NBC News reported Monday night that a sufficient number of superdelegates had indicated their support for Clinton to guarantee she will have the 2,383 delegates needed at the party's July convention in Philadelphia ..."]
Our main contributions can be summarized as:
• We introduce Visual News, the largest and most diverse news image captioning dataset and study to date, consisting of more than one million images with news articles, image captions, author information, and other metadata.
• We propose Visual News Captioner, a captioning method for news images, showing superior results on the GoodNews (Biten et al., 2019), NYTimes800k (Tran et al., 2020), and Visual News datasets with far fewer parameters than competing methods.
• We benchmark both template-based and end-to-end captioning methods on two large-scale news image datasets, revealing the challenges in the task of news image captioning.
Visual News text corpora, public links to download images, and further code and data are publicly available.1

Related Work
Image captioning has gained increased attention, with remarkable results on recent benchmarks. A popular paradigm (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Donahue et al., 2015) uses a convolutional neural network as the image encoder and generates captions using a recurrent neural network (RNN) as the decoder. The seminal work of Xu et al. (2015) proposed to attend to different image patches at different time steps, and Lu et al. (2017) improved this attention mechanism by adding an option to sometimes not attend to any image region. Other extensions include attending to semantic concept proposals (You et al., 2016), imposing local representations at the object level (Li et al., 2017), and a bottom-up and top-down attention mechanism to combine object and other salient image regions (Anderson et al., 2018).

[Table 2: Number of common named entities between different source agencies in the Visual News dataset. "PERSON", "GPE", "ORG", and "DATE" are the top 4 most frequent named entity types. BBC has more common named entities with The Guardian than with USA Today and The Washington Post.]
News image captioning is a challenging task because the captions often contain named entities. Prior work has attempted this task by drawing contextual information from the accompanying articles. Tariq and Foroosh (2016) select the most representative sentence from the article; Ramisa et al. (2017) encode news articles using pre-trained word embeddings and concatenate them with CNN visual features to feed into an LSTM (Hochreiter and Schmidhuber, 1997); Lu et al. (2018a) propose a template-based method to reduce the vocabulary size and later retrieve named entities from auxiliary data; Biten et al. (2019) also adopt a template-based method but extract named entities by attending to sentences from the associated articles.

Our Visual News Dataset
Visual News comprises news articles, images, captions, and other metadata from four news agencies: The Guardian, BBC, USA Today, and The Washington Post. To maintain quality, we first filter out images whose height or width is smaller than 180 pixels. We then keep examples with a caption length between 5 and 31 words. Figure 2 shows some examples from Visual News. Although only images, captions, and articles are used in our experiments, Visual News provides other metadata, such as article title, author, and geo-location. We summarize the differences between Visual News and other popular news image datasets in Table 1. Compared to other recent news captioning datasets, such as GoodNews (Biten et al., 2019) and NYTimes800k (Tran et al., 2020), Visual News has two advantages. First, Visual News has the largest number of images and articles: it contains over one million images and more than 600,000 articles. Second, Visual News is more diverse, since it contains articles from four news agencies. For example, the average caption length of BBC is only 14.2 while for The Guardian it is 22.5. In addition, only 18% of the tokens in The Guardian are named entities while for The Washington Post it is 33%. Figure 3 shows the average count of named entity types in captions from each agency. For instance, USA Today has on average 0.84 "PERSON" entities per caption while BBC has only 0.46. The Washington Post has 0.29 "DATE" entities whereas USA Today has 0.47. We also randomly select 50,000 captions from each agency and compute their unique named entities to see how many they have in common with each other (summarized in Table 2). For example, BBC has more common named entities with The Guardian than with USA Today and The Washington Post. USA Today shares more named entities of the same type with The Washington Post.
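The filtering rules described above can be sketched as a small predicate. This is a hypothetical illustration of the stated criteria (minimum 180-pixel sides, captions of 5 to 31 words); the record format and function name are not taken from the released dataset tools.

```python
def keep_example(width, height, caption):
    """Return True if an (image, caption) pair passes the Visual News filters:
    both image sides at least 180 pixels, caption between 5 and 31 words."""
    if width < 180 or height < 180:
        return False
    n_words = len(caption.split())
    return 5 <= n_words <= 31
```

Applied to a raw crawl, such a predicate would drop thumbnails and degenerate captions before the train/validation/test split is drawn.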
To further demonstrate the diversity in Visual News, we train a Show and Tell (Vinyals et al., 2015) captioning model on 100,000 examples from one agency and test it on 10,000 examples from each of the other agencies. We report CIDEr scores in Table 3. A model trained on USA Today achieves a score of 3.7 on the USA Today test set but only 0.6 on The Guardian test set.2 This gap indicates that Visual News is more diverse and also more challenging.


Visual News Captioner

Figure 4 presents an overview of Visual News Captioner. We first introduce the image encoder and the text encoder. We then explain the decoder in section 4.3. To address the out-of-vocabulary issue, we propose Tag-Cleaning in section 4.4.

Image Encoder
We use a ResNet152 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) to extract visual features. The output of the convolutional layer before the final pooling layer gives us a set of vectors corresponding to different patches in the image. Specifically, we obtain features V = {v_1, ..., v_K}, v_i ∈ R^D from every image I, where K = 49 and D = 2048. With these features, we can selectively attend to different regions at different time steps.
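The bookkeeping from feature map to patch vectors can be sketched as follows. A random array stands in for the (2048, 7, 7) map that ResNet152's last convolutional block produces for a 224 × 224 input; only the reshaping into K = 49 region vectors is shown.

```python
import numpy as np

# Stand-in for the (D=2048, 7, 7) feature map from ResNet152's last conv block;
# in the real model this comes from the pretrained network, not random numbers.
D, grid = 2048, 7
feature_map = np.random.rand(D, grid, grid)

# Flatten the 7x7 spatial grid into K = 49 patch vectors v_1..v_K, each in R^D,
# so the decoder can attend to individual image regions.
V = feature_map.reshape(D, grid * grid).T  # shape (49, 2048)
```

Each row of V corresponds to one spatial location of the original grid, which is what the spatial attention layers operate over.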

Text Encoder
As the associated articles can be very long, we focus on the first 300 tokens of each article, following See et al. (2017). We also use the spaCy (Honnibal and Montani, 2017) named entity recognizer to extract named entities from news articles, inspired by Li et al. (2018). We encode the first 300 tokens and the extracted named entities using the same encoder. Given the input text T = {t_1, ..., t_L}, where t_i denotes the i-th token in the text and L is the text length, we use the following layers to obtain textual features.
Word Embedding and Position Embedding. For each token t_i, we first obtain a word embedding w_i ∈ R^H and a position embedding p_i ∈ R^H through two embedding layers, where H is the hidden state size and is set to 512. To better model relative position relationships, we further feed the position embeddings into an LSTM (Hochreiter and Schmidhuber, 1997) to get updated position embeddings p^l_i ∈ R^H. We then add p^l_i and w_i to obtain the final input embedding.
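The LSTM-based position encoding above can be sketched in numpy. The single-cell parameterization (one stacked weight matrix for the four gates) and the toy sizes are our assumptions; the paper's layer would be a learned torch module with H = 512.

```python
import numpy as np

rng = np.random.default_rng(0)
H, L = 8, 5  # hidden size (512 in the paper) and text length, shrunk for the sketch

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy word and position embedding tables (learned layers in the real model).
word_emb = rng.normal(size=(100, H))
pos_emb = rng.normal(size=(L, H))

# One LSTM cell standing in for the paper's position-encoding LSTM.
W = rng.normal(scale=0.1, size=(4 * H, 2 * H))
b = np.zeros(4 * H)

def lstm_over_positions(P):
    """Run the LSTM over raw position embeddings p_1..p_L to get p^l_1..p^l_L."""
    h, c, out = np.zeros(H), np.zeros(H), []
    for p in P:
        z = W @ np.concatenate([p, h]) + b
        i, f = sigmoid(z[:H]), sigmoid(z[H:2 * H])       # input / forget gates
        o, g = sigmoid(z[2 * H:3 * H]), np.tanh(z[3 * H:])  # output gate / candidate
        c = f * c + i * g
        h = o * np.tanh(c)
        out.append(h)
    return np.stack(out)

tokens = [3, 14, 15, 9, 2]           # toy token ids
P_l = lstm_over_positions(pos_emb)   # updated position embeddings
inputs = word_emb[tokens] + P_l      # final input embeddings fed to the encoder
```

Because each p^l_i depends on all earlier positions, the encoder sees relative order information rather than independent absolute positions.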
Multi-Head Attention on Attention Layer. The Multi-Head Attention Layer (Vaswani et al., 2017) operates on three sets of vectors: queries Q, keys K, and values V, and takes a weighted sum of the value vectors according to a similarity distribution between Q and K. In our implementation, for each query w_i, the keys and values are all input embeddings of T. In addition, we apply the "Attention on Attention" (AoA) module (Huang et al., 2019) to refine the attended information:

AoA(Q, K, V) = σ(W_g [Q; Att(Q, K, V)]) ⊙ (W_a [Q; Att(Q, K, V)]),

where ⊙ represents element-wise multiplication and σ is the sigmoid function. W_g and W_a are trainable parameters.
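A single-head numpy sketch of the AoA module follows. The gate/information split follows Huang et al. (2019); the weight shapes and the toy self-attention setup are illustrative assumptions, not the paper's multi-head implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
H, L = 8, 5  # hidden size and sequence length, shrunk for the sketch

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(Q, K, V):
    """Standard scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

# Attention on Attention: combine the query with the attended vector, then
# gate an "information" vector with a sigmoid "attention gate".
W_a = rng.normal(scale=0.1, size=(2 * H, H))  # information projection
W_g = rng.normal(scale=0.1, size=(2 * H, H))  # gate projection

def aoa(Q, K, V):
    V_hat = attention(Q, K, V)                # attended result Att(Q, K, V)
    QV = np.concatenate([Q, V_hat], axis=-1)  # [Q; Att(Q, K, V)]
    info = QV @ W_a
    gate = sigmoid(QV @ W_g)
    return gate * info                        # element-wise product

T = rng.normal(size=(L, H))  # token embeddings act as queries, keys, and values
out = aoa(T, T, T)
```

The gate lets the layer suppress attended content that is irrelevant to the query, which is what makes AoA useful for picking out salient entities in long articles.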
Visual Selective Layer. One limitation of previous works (Tran et al., 2020; Biten et al., 2019) is that they encode the image and the article separately, ignoring the connection between them during encoding. To generate representations that capture contextual information from both images and articles, we propose a novel Visual Selective Layer, which updates the textual embeddings with a visual information gate built on the multi-head AoA attention above. To obtain a fixed-length article representation, we apply average pooling to get T̄, which is used as the query to attend to different regions of the image. FFN is a two-layer feed-forward network with ReLU as the activation function. w^a_i is the final output embedding of the text encoder. For simplicity, in the following text we use A = {a_1, ..., a_L}, a_i ∈ R^H to represent the final embeddings (w^a_i) of the article tokens, where H is the embedding size and L is the article length.
Similarly, E = {e_1, ..., e_M}, e_i ∈ R^H represents the final embeddings of the extracted named entities, where M is the number of named entities.
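One plausible reading of the Visual Selective Layer is sketched below: pool the article into a single query, attend over the image regions with it, and gate each token embedding by the attended visual context. The exact equations are not reproduced in this text, so this is an interpretation, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
H, L, K = 8, 5, 49  # hidden size, article length, number of image regions

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T = rng.normal(size=(L, H))  # token embeddings after the AoA layer
V = rng.normal(size=(K, H))  # image patch features projected to size H

t_bar = T.mean(axis=0)                        # average-pooled article query T-bar
attn = softmax(t_bar @ V.T / np.sqrt(H))      # weights over the K image regions
v_ctx = attn @ V                              # attended visual context
W_s = rng.normal(scale=0.1, size=(H, H))      # assumed gate projection
gate = sigmoid(v_ctx @ W_s)                   # visual information gate
T_out = T * gate                              # visually-gated token embeddings
```

The key idea the sketch preserves is that textual embeddings are modulated by visual evidence before decoding, rather than being fused only in the decoder.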

Decoder
Our decoder generates the next token conditioned on previously generated tokens and contextual information. We propose a Masked Multi-Head Attention on Attention Layer to flexibly attend to previous tokens and a Multi-Modal Attention on Attention Layer to fuse contextual information. We first use the encoder to obtain embeddings of the ground-truth caption X = {x_0, ..., x_N}, x_i ∈ R^H, where N is the caption length and H is the embedding size. Instead of the Masked Multi-Head Attention Layer used in Tran et al. (2020) to collect information from past tokens, we use the more efficient Masked Multi-Head Attention on Attention Layer. At time step t, the output embedding x^a_t is used as the query to attend over the context information.

Multi-Modal Attention on Attention Layer. Our Multi-Modal AoA Layer draws on three context sources: images V, articles A, and named entity sets E. We use a linear layer to resize the features in V into Ṽ, where ṽ_i ∈ R^512. At each step, x^a_t is the query that attends over each source separately. We combine the attended image feature V̂_t, the attended article feature Â_t, and the attended named entity feature Ê_t, and feed them into a residual connection, layer normalization, and a two-layer feed-forward network FFN.
The final output P_{s_t} will be used to predict the token s_t in the Multi-Head Pointer-Generator Module.
Multi-Head Pointer-Generator Module. To obtain more relevant named entities from the associated article and the extracted named entity set, we adapt the pointer-generator of See et al. (2017). Our pointer-generator has two copy sources: the article and the named entity set. We first generate attention distributions a^V and a^E over the source article tokens and the extracted named entities by averaging the attention distributions from the multiple heads of the Multi-Modal Attention on Attention layer in the last decoder layer. Next, p_gen and q_gen are calculated as two soft switches to choose between generating a word from the vocabulary distribution P_{s_t} and copying words from the attention distributions a^V or a^E, where Â_t, V̂_t, and Ê_t are attended context vectors, W_p and W_q are learnable parameters, and σ is the sigmoid function. P*_{s_t} provides the final distribution used to predict the next word.
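The two-switch copy mechanism can be sketched numerically: p_gen chooses between generating from the vocabulary and copying, and q_gen splits the copy mass between the article attention a^V and the entity attention a^E. The exact composition of the switches here is our reading of the text, not the paper's released code, and all values are toy numbers.

```python
import numpy as np

vocab = ["<unk>", "wins", "game", "Obama", "Clinton"]
P_vocab = np.array([0.1, 0.4, 0.4, 0.05, 0.05])  # decoder vocabulary distribution

article_tokens = ["Obama", "wins"]
a_V = np.array([0.7, 0.3])        # attention over the article tokens
entity_tokens = ["Clinton"]
a_E = np.array([1.0])             # attention over the extracted entities

p_gen, q_gen = 0.5, 0.6           # soft switches (sigmoid outputs in the model)

def final_distribution():
    """Mix the vocabulary distribution with the two copy distributions."""
    P = {w: p_gen * p for w, p in zip(vocab, P_vocab)}
    copy = 1.0 - p_gen
    for w, p in zip(article_tokens, a_V):      # copy mass from the article
        P[w] = P.get(w, 0.0) + copy * q_gen * p
    for w, p in zip(entity_tokens, a_E):       # copy mass from the entity set
        P[w] = P.get(w, 0.0) + copy * (1.0 - q_gen) * p
    return P

P_final = final_distribution()
```

Because copied tokens need not be in the vocabulary, rare names like "Clinton" can receive probability mass even when the generator alone would score them poorly.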
Finally, our loss is computed as the sum of the negative log-likelihood of the target word at each time step:

L = − Σ_{t=1}^{N} log P*_{s_t}(s_t).

Tag-Cleaning
To address the out-of-vocabulary (OOV) problem, we replace OOV named entities with named entity tags instead of a single "UNK" token; e.g., if "John Paul Jones Arena" is an OOV named entity, we replace it with "LOC_", which represents location entities. During testing, if the model predicts entity tags, we replace those tags with specific named entities. More specifically, we select a named entity with the same entity category and the highest frequency from the named entity set.
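The test-time substitution step can be sketched as a small post-processing function. The tag names follow spaCy's entity labels with a trailing underscore as described above; the entity-set format (a list of (text, type) pairs) is an assumption for illustration.

```python
from collections import Counter

def tag_clean(caption_tokens, entities):
    """Replace predicted entity-type tags (e.g. "LOC_") with the most frequent
    entity of that type found in the article's extracted entity set.

    entities: list of (text, type) pairs, e.g. ("Virginia", "LOC")."""
    by_type = {}
    for text, etype in entities:
        by_type.setdefault(etype, Counter())[text] += 1
    out = []
    for tok in caption_tokens:
        if tok.endswith("_") and tok[:-1] in by_type:
            # pick the most frequent same-type entity from the article
            out.append(by_type[tok[:-1]].most_common(1)[0][0])
        else:
            out.append(tok)
    return out

ents = [("John Paul Jones Arena", "LOC"), ("John Paul Jones Arena", "LOC"),
        ("Virginia", "LOC"), ("Tony Bennett", "PERSON")]
cleaned = tag_clean(["fans", "celebrate", "at", "LOC_"], ents)
# -> ['fans', 'celebrate', 'at', 'John Paul Jones Arena']
```

Frequency is a simple but effective proxy here: the entity mentioned most often in the article is the one most likely to appear in its caption.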

Experiments
In this section, we first introduce implementation details. We then discuss baselines and competing methods. Lastly, we present comprehensive experimental results on both the GoodNews dataset and our Visual News dataset.

[Table 6: Number of parameters of competing methods (Tran et al., 2020). Note that our proposed Visual News Captioner is much more lightweight.]

Implementation Details
Datasets. We conduct experiments on three large-scale news image datasets: GoodNews, NYTimes800k, and Visual News. For GoodNews and NYTimes800k, we follow the settings from the original papers. For Visual News, we randomly sample 100,000 images from each news agency, leading to a training set of 400,000 samples. Similarly, we build a 40,000-sample validation set and a 40,000-sample test set, both evenly sampled from the four news agencies. Throughout our experiments, we first resize images to 256 × 256 resolution and randomly crop patches of size 224 × 224 as input. To preprocess captions and articles, we remove noisy HTML labels, brackets, non-ASCII characters, and some special tokens. We use spaCy's named entity recognizer (Honnibal and Montani, 2017) to recognize named entities in both captions and articles.
Model Training. We set the embedding size H to 512 and the dropout rate to 0.1. Models are optimized using Adam (Kingma and Ba, 2015) with a warm-up learning rate of 0.0005. We use a batch size of 64 and stop training when the CIDEr score on the dev set has not improved for 20 epochs. Since we replace OOV named entities with tags, we add the 18 named entity tags provided by spaCy to our vocabulary, including "PERSON_", "LOC_", "ORG_", "EVENT_", etc.
Evaluation Metrics. Following previous literature, we evaluate model performance with two categories of metrics. To measure the overall similarity between generated captions and the ground truth, we report BLEU-4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), ROUGE (Ganesan, 2018), and CIDEr scores. Among these, CIDEr is the most suitable for measuring performance in news captioning since it down-weighs stop words and focuses more on uncommon words through a TF-IDF weighting mechanism. In addition, we compute precision and recall scores for named entities to evaluate a model's ability to predict named entities.
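The entity precision/recall computation can be sketched as follows. The matching rule (exact string match on unique entities) is our assumption of how these scores are counted; the text does not spell it out.

```python
def entity_pr(pred_entities, gold_entities):
    """Precision and recall of the named entities in a generated caption
    against those in the reference caption, under exact string matching."""
    pred, gold = set(pred_entities), set(gold_entities)
    tp = len(pred & gold)  # entities the model got right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = entity_pr({"Obama", "Hempstead"}, {"Obama", "Mitt Romney"})
# one of two predicted entities is correct, and one of two gold entities is recovered
```

In practice these counts would be accumulated over the whole test set before dividing, but the per-caption logic is the same.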

Competing Methods and Model Variants
We compare our proposed Visual News Captioner with various baselines and competing methods. TextRank (Barrios et al., 2016) is a graph-based extractive summarization algorithm; this baseline only takes the associated articles as input. Show Attend Tell (Xu et al., 2015) attends to certain image patches during caption generation; this baseline only takes images as input. Pooled Embeddings and Tough-to-beat (Arora et al., 2017) are two template-based models proposed in Biten et al. (2019). They encode articles at the sentence level and attend to certain sentences at different time steps. Pooled Embeddings computes sentence representations by averaging word embeddings and adopts context insertion in the second stage. Tough-to-beat obtains sentence representations from the tough-to-beat method introduced in Arora et al. (2017) and uses sentence-level attention weights (Biten et al., 2019) to insert named entities. Transform and Tell (Tran et al., 2020) is a transformer-based attention model, which uses a pretrained RoBERTa (Liu et al., 2019) model as the article encoder and a transformer as the decoder; it uses byte-pair encoding (BPE) to represent out-of-vocabulary named entities. Visual News Captioner is our proposed model, based on the Transformer (Vaswani et al., 2017) and equipped with Multi-Head Attention on Attention (AoA). EG (Entity-Guide) adds named entities as another text source to help predict named entities more accurately. VS (Visual Selective Layer) strengthens the connection between the image and the text. Pointer stands for the updated multi-head pointer-generator module. To overcome the limitation of a fixed-size vocabulary, we examine TC, the Tag-Cleaning operation handling the OOV problem. Table 4 and Table 5 summarize our quantitative results on the GoodNews, NYTimes800k, and Visual News datasets.
On GoodNews and NYTimes800k, our Visual News Captioner outperforms the state-of-the-art methods on all 6 metrics.

Results and Discussion
On our Visual News dataset, our model outperforms baseline methods by a large margin, from 13.2 to 50.5 in CIDEr score. In addition, as revealed by Table 6, our final model outperforms Transform and Tell (transformer) with far fewer parameters. This demonstrates that our proposed model generates better captions in a more efficient way.
Our Entity-Guide (EG) brings improvements on all datasets, demonstrating that the named entity set contains key information guiding the generation of news captions. In addition, our pointer-generator mechanism builds a stronger connection between the final distribution of the predicted tokens and the Multi-Modal AoA Layer. More importantly, our Visual Selective Layer (VS) improves the caption generation results by providing extra visual context to the text features.
Furthermore, our Tag-Cleaning (TC) method effectively retrieves uncommon named entities and thus improves the CIDEr score by 1.3% on the Visual News dataset. We present qualitative results of different models on both datasets in Figure 5. Our model generates more accurate named entities.
We also observe that the methods achieving the best performance, our models and Transform and Tell, are trained directly on raw captions rather than following a two-stage template-based approach. Although template-based methods handle a much smaller vocabulary, they also lose the rich contextual information carried by uncommon named entities.
Performance on the GoodNews and NYTimes800k datasets is better than on Visual News. This is because our Visual News dataset is more challenging in terms of diversity: it is collected from multiple news agencies and thus covers more topics and more diverse language styles.

Conclusion and Future Work
In this paper, we study the task of news image captioning. First, we construct Visual News, the largest news image captioning dataset, consisting of over one million images with accompanying articles, captions, and other metadata. Furthermore, we propose Visual News Captioner, an entity-aware captioning method leveraging both visual and textual information. We validate the effectiveness of our method on three datasets through extensive experiments. Visual News Captioner outperforms state-of-the-art methods across multiple metrics with fewer parameters. Moreover, our Visual News dataset can potentially be adapted to other NLP tasks, such as abstractive text summarization and fake news detection. We hope this work paves the way for future studies in news image captioning as well as other related research areas.

[Figure 5: Examples of captions generated by different models. The first three are from Visual News and the last one is from GoodNews. Correct named entities are highlighted in bold. Our Visual News Captioner predicts named entities more accurately and completely than the competing method.]