Journalistic Guidelines Aware News Image Captioning

The task of news article image captioning aims to generate descriptive and informative captions for news article images. Unlike conventional image captions that simply describe the content of the image in general terms, news image captions follow journalistic guidelines and rely heavily on named entities to describe the image content, often drawing context from the whole article they are associated with. In this work, we propose a new approach to this task, motivated by caption guidelines that journalists follow. Our approach, Journalistic Guidelines Aware News Image Captioning (JoGANIC), leverages the structure of captions to improve the generation quality and guide our representation design. Experimental results, including detailed ablation studies, on two large-scale publicly available datasets show that JoGANIC substantially outperforms state-of-the-art methods both on caption generation and named entity related metrics.


Introduction
Research on generating textual descriptions of images has made great progress in recent years with the introduction of encoder-decoder architectures (Xu et al., 2015; Johnson et al., 2016; Venugopalan et al., 2017; Karpathy and Fei-Fei, 2017; Anderson et al., 2018; Lu et al., 2018b; Aneja et al., 2018). Those models are generally trained and evaluated on image captioning datasets like COCO (Lin et al., 2014; Chen et al., 2015) and Flickr (Hodosh et al., 2013) that only contain generic object categories but no details such as names, locations, or dates. The captions generated by these methods are thus generic descriptions of the images.
The news image captioning problem (Feng and Lapata, 2013;Ramisa et al., 2018;Biten et al., 2019;Tran et al., 2020) can be seen as a multimodal extension of the image captioning task with additional context provided in the form of a news article. Specifically, given image-article pairs as input, the news captioning task aims to generate an informative caption that describes the image with proper named entities and context extracted from the article. The development of automatic news image caption generation methods can ease the process of adding images to articles and produce more engaging content. According to The News Manual 1 and International Journalists' Network 2 , a caption should help news readers understand six main components (who, when, where, what, why, how) related to the image and article. As shown in Fig. 1, different journalists can write captions to cover different components for the same image and article pair. Previous news image captioning work (Biten et al., 2019;Tran et al., 2020) has not directly addressed the challenge of generating a caption that follows those journalistic principles.
In this work, we tackle the news image captioning problem by introducing these guidelines in our modeling through a new concept called a 'caption template', which is composed of 5 key components, detailed in Section 3. We propose a Journalistic Guidelines Aware News Image Captioning (JoGANIC) model that, given an image-article pair, aims to predict the most likely active template components and, using component-specific decoding blocks, produces a caption following the provided template guidance. JoGANIC thus models the underlying structure of the captions, which helps to improve the generation quality.
Captions for images that accompany news articles often include named entities and rely heavily on context found throughout the article (making the text encoding process especially challenging). We propose two techniques to address these issues: (i) integration of features specifically to extract relevant named entities, and (ii) a multi-span text reading (MSTR) method, which first splits long articles into multiple text spans and then merges the extracted features of all spans together.
Our work has two main contributions: (i) the definition of the template components of a news caption based on journalistic guidelines, and their explicit integration in the caption generation process of our JoGANIC model; (ii) the design of encoding mechanisms to extract relevant information for the news image captioning task throughout the article, specifically a dedicated named entity representation and the ability to process longer articles. Experimental results show better performance than the state of the art on news image caption generation. We release the source code of our method at https://github.com/dataminr-ai/JoGANIC.

Generic Image Captioning
State-of-the-art approaches (Johnson et al., 2016; Wang et al., 2020; He et al., 2020; Sammani and Melas-Kyriazi, 2020) mainly use encoder-decoder frameworks with attention to generate captions for images. Xu et al. (2015) developed soft and hard attention mechanisms to focus on different regions in the image when generating different words. Similarly, Anderson et al. (2018) used a Faster R-CNN (Ren et al., 2015) to extract regions of interest that can be attended to. Yang et al. (2020) used self-critical sequence training for image captioning. Lu et al. (2018a) and Whitehead et al. (2018) introduced a knowledge aware captioning method where the knowledge comes from metadata associated with the datasets.
Our work differs from generic image captioning in three aspects: (i) our model's input consists of image-article pairs; (ii) our caption generation is a guided process following news image captioning journalistic guidelines; (iii) news captions contain named entities and additional context extracted from the article, making them more complex.

News Article Image Captioning
One of the earliest works in news article image captioning, Ramisa et al. (2018), proposed an encoder-decoder architecture with a deep convolutional model VGG (Simonyan and Zisserman, 2015) and Word2Vec (Mikolov et al., 2013) as the image and text feature encoders, and an LSTM as the decoder. Biten et al. (2019) introduced the GoodNews dataset, and proposed a two-step caption generation process using ResNet-152 (He et al., 2016) as the image representation and a sentence-level aggregated representation using GloVe embeddings (Pennington et al., 2014). First, a caption is generated with placeholders for the different types of named entities: PERSON, ORGANIZATION, etc., shown in the left column of Table 1. Then, the placeholders are filled in by matching entities from the best ranked sentences of the article. This two-step process aims to deal with rare named entities but prevents the captions from being linguistically rich and can induce error propagation between steps.
More recently, Liu et al. (2020), Hu et al. (2020) and Tran et al. (2020) proposed one-step, end-to-end methods. They all used ResNet-152 as the image encoder, while for the text encoder Hu et al. (2020) applied a BiLSTM, Liu et al. (2020) used BERT and Tran et al. (2020) used RoBERTa. Hu et al. (2020) and Liu et al. (2020) used an LSTM as the decoder. Tran et al. (2020) proposed a model that we refer to as Tell. This model exploits a Transformer decoder and byte-pair-encoding (BPE) (Sennrich et al., 2016), allowing it to generate captions with unseen or rare named entities from common tokens. As in other multimodal tasks, where studies (Shekhar et al., 2019; Caglayan et al., 2019; Li et al., 2020) have shown that exploiting both modalities is essential for achieving good performance, Tran et al. (2020) evaluated a text-only model, showing that it performs worse than the multimodal model. We will also evaluate single visual and text modality models in our experiments.
Our work differs from previous work in news image captioning in that JoGANIC is an end-to-end framework that (i) integrates journalistic guidelines through a template guided caption generation process; and (ii) exploits a dedicated named entity representation and a long text encoding mechanism. Our experiments show that our framework significantly outperforms the state of the art.

Defining Caption Templates
The objective of news image captioning is to give the reader a clear understanding of the main components who, when, where, what, why and how depicted in the image given the context of the article. We propose to exploit the idea of components in the caption generation process, but we first need to define components that can be automatically detected in the ground truth caption for training.
The who, when and where components can be retrieved via Named Entity Recognition (NER). As shown in the right column of Tab. 1, we define named entities with type 'PERSON', 'NORP' and 'ORG' as who, those with type 'DATE' and 'TIME' as when, and ones with type 'FAC', 'GPE' and 'LOC' as where. We define the component misc as the rest of the named entities. The what, why and how components are hard to define and can correspond to a wide range of elements, so we propose to merge them into a context component, which we assume is present if a verb is detected by a part-of-speech (POS) tagger 3 . In Fig. 1, captions 1 and 2 have a context component, but caption 3 does not contain a verb and thus has no context.
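As a minimal illustration of these rules, the component detection can be sketched as follows. This is a hypothetical Python sketch that assumes NER and POS tagging have already been run with an off-the-shelf tagger; the function and variable names are illustrative, not from our implementation:

```python
# Hypothetical mapping from OntoNotes-style NER types to template components.
NER_TO_COMPONENT = {
    "PERSON": "who", "NORP": "who", "ORG": "who",
    "DATE": "when", "TIME": "when",
    "FAC": "where", "GPE": "where", "LOC": "where",
}

def caption_components(entity_types, pos_tags):
    """Return the set of active template components for a caption.

    entity_types: NER labels found in the caption (e.g. ["PERSON", "DATE"]).
    pos_tags: coarse POS tags of the caption tokens (e.g. ["PROPN", "VERB"]).
    """
    components = set()
    for etype in entity_types:
        # Any entity type outside who/when/where falls into 'misc'.
        components.add(NER_TO_COMPONENT.get(etype, "misc"))
    if "VERB" in pos_tags:  # a verb signals the what/why/how 'context' component
        components.add("context")
    return components
```

For example, a caption whose entities are a PERSON and a DATE and which contains a verb would be assigned the components who, when and context.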
In summary, our proposed news caption template consists of at most five components: who, when, where, context and misc. We report the percentage of each component in the captions of the GoodNews and NYTimes800k datasets in Tab. 2. The who component is present in almost all the captions, and all components appear commonly in both datasets.

Template-Guided News Image Captioning
In this section, we formally define the news captioning task and introduce the idea of template guidance and our Journalistic Guidelines Aware News Image Captioning (JoGANIC) approach. We then propose two strategies to address the specific challenges of named entities and long articles.

News Captioning Problem Formulation
Given an image and article pair (X_I, X_A), the objective of news captioning is to generate a sentence y = {y_1, . . . , y_N} with a sequence of N tokens, y_i ∈ V_K being the i-th token and V_K being the vocabulary of K tokens. The problem can be solved by an encoder-decoder model. The decoder predicts the target sequence y conditioned on the source inputs X_I and X_A. The decoding probability P(y|X_I, X_A) is modeled using the probability of each target token y_n at time step n conditioned on the source inputs X_I and X_A and the current partial target sequence y_{<n}:

P(y | X_I, X_A; θ) = ∏_{n=1}^{N} P(y_n | X_I, X_A, y_{<n}; θ), (1)

where θ denotes the parameters of the model.
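This factorization corresponds to standard autoregressive decoding: tokens are generated one at a time, each conditioned on the tokens emitted so far. A toy greedy-decoding sketch is given below; the toy next-token distribution stands in for the real model and all names are illustrative:

```python
# Minimal sketch of the autoregressive decoding loop implied by Eq. 1.
def greedy_decode(next_token_probs, bos, eos, max_len=20):
    """next_token_probs(prefix) -> dict mapping token -> probability.

    Greedily appends the most likely next token until EOS or max_len.
    """
    y = [bos]
    for _ in range(max_len):
        probs = next_token_probs(y)          # P(y_n | X_I, X_A, y_<n)
        y.append(max(probs, key=probs.get))  # pick the most likely token
        if y[-1] == eos:
            break
    return y
```

In practice the model uses a learned softmax over the vocabulary rather than this toy distribution, but the conditioning structure is the same.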

Template Guidance
To make our model capable of generating captions following different templates, we introduce a new variable α for template guidance. The new decoding probability can be defined as:

P(y | X_I, X_A, α) = ∏_{n=1}^{N} P(y_n | X_I, X_A, α, y_{<n}), (2)

where we ignore θ for simplicity. Based on our definition of templates, we could see α as the high-level template class defined by the combination of the active components. As there are 5 template components, the total number of possible template classes is 2^5. However, this poses two challenges to train our model: (i) data imbalance, as the most frequent template corresponds to 15.2% of captions, while the least common ones appear less than 2% of the time (more details in Tab. 3 of the supplementary material), and (ii) different high-level templates may be similar (i.e. having a single component difference) but would be considered totally different classes.
In order to address these issues we define α as the set of active components of the template, α = {α_i}_{i=1}^{5}, with α_i being the probability of the template having component i. This formulation enables us to exploit the partial overlap in terms of components between the different templates. Note that the percentage of each component, in Tab. 2, is not as imbalanced as the full template classes. The template guidance α can be provided by the news writer ('oracle' setting in the experiments) or can be estimated ('auto' setting) through a multi-label classification task as detailed in the next section and illustrated in the top-left of Fig. 2(a).
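The difference between the 2^5 full-template class view and the per-component multi-hot view can be sketched as follows (illustrative Python; names are not from our implementation):

```python
COMPONENTS = ["who", "when", "where", "context", "misc"]

def template_to_alpha(active):
    """Encode a template as a 5-dim multi-hot vector alpha, one entry per component."""
    return [1.0 if c in active else 0.0 for c in COMPONENTS]

def alpha_to_class(alpha):
    """Collapse alpha into one of the 2**5 = 32 full-template class ids."""
    return sum(int(a) << i for i, a in enumerate(alpha))
```

Under the class view, the templates {who} and {who, when} are entirely distinct labels, whereas under the multi-hot view they share the who dimension, which is the overlap our formulation exploits.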
The template guidance variable α is static during the decoding process, but that does not prevent our method from generating fluent captions covering the whole set of components. Exploring ways of dynamically exploiting the component-specific representations during the caption generation process could be an interesting direction for future work.

Our Model Description
We propose a news image captioning model that generates captions through template guidance and can also generate accurate named entities and cover a larger extent of the article. Our JoGANIC model, illustrated in Fig. 2, is a transformer-based encoderdecoder, with an encoder extracting features from the image X I and the article X A , a prediction head estimating the probability of each component and a hybrid decoder to produce the caption.
The encoder consists of three parts: (i) a ResNet-152 pretrained on ImageNet extracting the image feature X I ∈ R d I ; (ii) RoBERTa producing the text features X T ∈ R d T from the article; and (iii) a Named Entity Embedder (NEE), detailed in Section 4.3.1, applied to obtain the features X E ∈ R d E of the named entities in the article. The components prediction head, taking as input the concatenation of the image, article and named entities features, is a multi-layer perceptron with a sigmoid layer trained (using the components detected in the ground truth caption as target) to output the probability of each component P (α|X I , X A ).
The hybrid decoder consists of an embedding layer to get the embeddings of the output generated thus far (i.e., the partial generation), followed by 4 blocks of 3 Multi-Head Attention (MHA) modules, denoted as MHA (image/text/NE), to compute the attention across the partial generation and the input image, text and named entities. The final representation u_i for each block is the concatenation of the 3 modules' outputs, as shown in Fig. 2(b). The first 3 blocks are shared for all components, while the 4-th block consists of 5 parallel component-specific blocks 4_1-4_5, where block 4_i outputs the representation u_4^i for component i. The final representation of the decoder is the average of the weighted sum of all components, u = (1/5) Σ_{i=1}^{5} α_i u_4^i. Then the output probability P(y_n|X_I, X_A, α, y_{<n}) is obtained by applying a feed-forward (FF) layer and a softmax over the target vocabulary. Note that our "template guided" generation does not limit the number of occurrences of one component in the output caption and does not explicitly constrain the generation of specific components; rather, the final representation u relies more on the component-specific representations corresponding to higher α_i values.
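The averaged weighted sum of the component-specific block outputs can be sketched as follows (a hypothetical NumPy sketch; shapes and names are illustrative, not from our code):

```python
import numpy as np

def combine_components(u4, alpha):
    """Combine component-specific decoder outputs into the final representation.

    u4: array of shape (5, d), the outputs of the 5 component-specific blocks.
    alpha: array of shape (5,), the predicted component probabilities.
    Returns u = (1/5) * sum_i alpha_i * u4_i, which is then fed to the
    output feed-forward layer and softmax.
    """
    u4 = np.asarray(u4, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (alpha[:, None] * u4).sum(axis=0) / len(alpha)
```

A component with α_i close to zero thus contributes almost nothing to u, while components with high probability dominate the representation, without hard-masking any block.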

Named Entity Embedding
With over 96% (see Tab. 1 in the supplementary material) of the news captions containing named entities, producing accurate named entities is essential to generating good news captions. However, text encoders like RoBERTa cannot properly represent named entities, and only handle them implicitly through BPE (Byte-Pair Encoding) subwords.
To deal explicitly with named entities, we learn entity embeddings from the Wikipedia knowledge base (KB), following Wikipedia2vec (Yamada et al., 2018), which embeds words and entities into a common space 4 . Given a vocabulary of words V_W and a set of entities V_E, it learns a lookup embedding function over V_W ∪ V_E. There are three components in Wikipedia2vec: (i) a skip-gram model for learning the word similarity in V_W, (ii) a KB graph model to learn the relatedness between pairs of entities (vertices V_E of the Wikipedia entity graph) and (iii) a version of Word2Vec where words are predicted from entities.
Figure 2: The architecture of our model. (a) The Encoder takes image+text+named entities as input and generates features. The Decoder consists of blocks 1-4, with blocks 1-3 shared for all template components who, when, where, context and misc. Block 4 consists of 5 component-specific subblocks (4_1-4_5). A prediction head on top of the encoder predicts the probabilities of the 5 components α_{1:5}, which then multiply the representations of the 5 subblocks u_4^{1:5}. The final representation u is obtained by averaging and is used to predict the output token probabilities. (b) Every block takes as input the representations from previous blocks as well as those from the Encoder via three Multi-Head Attention (MHA) modules designed for image, text and named entities separately.

Since predicting the correct named entities from context is very important for news captioning, we introduce a fourth component: (iv) a neural entity predictor (NEP). Given a text (sequence of words) t = {w_1, . . . , w_N}, we train Wikipedia2vec to predict the entities e_1, . . . , e_m that appear in the sequence. With E_KB being the set of all entities in the KB, and v_e and v_t (computed as the element-wise mean of all the word vectors in t followed by a fully connected layer) the vector representations of the entity e and the text t, respectively, the probability of an entity e appearing in text t is defined as

P(e|t) = exp(v_e · v_t) / Σ_{e' ∈ E_KB} exp(v_{e'} · v_t). (3)

We optimize the NEP model with a cross-entropy loss, but using Eq. 3 as is would be computationally expensive as it involves a summation over all entities in the KB. We address this by replacing E_KB in Eq. 3 with E*, the union of the positive entity e and 50 randomly chosen negative entities not in t. Through exploiting the Named Entity Embedding (NEE), our model can represent and thus generate more accurate entities. The NEE model is not jointly trained with the template components prediction and caption generation heads of JoGANIC, but pre-trained offline on the Wikipedia KB.
The Wikipedia KB contains a large set of named entities (NEs) but cannot cover all NEs that could appear in a news article (about 40% are not covered in our datasets). The embedding of a new NE cannot be obtained directly by lookup. To alleviate this problem, we set the embedding of any missing NE to v_t, which is reasonable since we trained the NEP to maximize the correlation between v_e and v_t in Eq. 3.
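The sampled-softmax approximation of Eq. 3 can be sketched as follows (illustrative Python; in practice v_e and v_t are learned embeddings and the negatives are 50 randomly sampled entities not in t):

```python
import math

def entity_probability(v_e, v_t, negatives):
    """Sampled-softmax probability of entity e given text t (cf. Eq. 3).

    v_e: embedding of the candidate entity; v_t: embedding of the text;
    negatives: embeddings of sampled negative entities standing in for the
    full knowledge-base sum E_KB.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(v_e, v_t)] + [dot(v_n, v_t) for v_n in negatives]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[0] / sum(exps)
```

Replacing the sum over the whole KB with one positive and a handful of negatives turns an intractable normalization into a constant-cost one, at the price of a biased but empirically effective estimate.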

Reading Longer Articles
Biten et al. (2019) use sentence-level features obtained by averaging, within each sentence, the word features of a pretrained GloVe (Pennington et al., 2014) model. While this method can embed the whole article, the averaging makes the features less informative. Tran et al. (2020) instead use RoBERTa as the text feature extractor, though this has the limitation of exploiting only 512 tokens.
However, processing only the first 512 tokens may ignore important contextual information appearing later in the news article. To alleviate this problem, we propose a Multi-Span Text Reading (MSTR) method to read more than 512 tokens from the article. MSTR splits the text into overlapping segments of 512 tokens and passes them to the RoBERTa encoder independently. The representation of any token overlapping two segments is the element-wise interpolation of its two representations.
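The span selection and overlap merging of MSTR can be sketched as follows. This is an illustrative Python sketch of the two-segment scheme described above (one window from the start, one ending at the truncation point), with a plain average as the interpolation; names are not from our code:

```python
def multi_span_windows(n_tokens, window=512, max_tokens=1000):
    """Return (start, end) token ranges for the overlapping spans to encode."""
    n = min(n_tokens, max_tokens)
    if n <= window:
        return [(0, n)]            # short article: a single window suffices
    return [(0, window), (n - window, n)]

def merge_overlap(first, second, start_second):
    """Merge per-token features of two encoded windows.

    first, second: per-token feature lists from each window.
    start_second: index (in the first window's coordinates) where the second begins.
    Overlapping tokens get the average of their two encodings.
    """
    merged = list(first[:start_second])
    overlap = len(first) - start_second
    for i in range(overlap):
        merged.append((first[start_second + i] + second[i]) / 2.0)
    merged.extend(second[overlap:])
    return merged
```

With a 1,000-token limit and 512-token windows, an article of 600 tokens yields windows (0, 512) and (88, 600), so the number of overlapping tokens naturally falls in the 24 to 511 range mentioned in the experiments.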

Experiments
We evaluate JoGANIC on two large-scale publicly available news captioning datasets: GoodNews (Biten et al., 2019) and NYTimes800k (Tran et al., 2020), both collected using The New York Times public API 5 , with the latter being larger and containing longer articles. We also train single-modality models with only an image encoder (JoGANIC image only) or a text encoder (JoGANIC text only). We compare against two types of baselines. (i) Two-step generation methods based on conventional image captioning models (Xu et al., 2015; Rennie et al., 2017; Anderson et al., 2018; Lu et al., 2017; Biten et al., 2019) that first generate captions with placeholders and then insert named entities into these placeholders. (ii) End-to-end models: VGG+LSTM (Ramisa et al., 2018); VisualNews (Liu et al., 2020), which uses ResNet as the image encoder, BERT as the article encoder and a bi-LSTM as the decoder; and Tell, with two variants: (a) Tell, which uses RoBERTa and ResNet-152 as the encoders and a Transformer as the decoder; it is equivalent to JoGANIC without template guidance as they use the same encoders and training settings. (b) Tell (full), which includes two additional visual encoders, YOLOv3 and MTCNN, and Location-Aware and Weighted RoBERTa for text encoding.
For the general caption generation quality evaluation, we use the BLEU-4 (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015) metrics. We also use named entity precision/recall to evaluate the named entity generation quality. To better understand how well the generated captions follow the ground truth templates, we calculate precision and recall for the five components who, when, where, context and misc and use the averaged precision and recall 6 as the final metric.

We optimize with Adam with β_1 = 0.9, β_2 = 0.98, ε = 10^-6. The number of tokens in the vocabulary is K = 50264 and d_Wiki = 300. We limit the text length in MSTR to 1,000 tokens, as preliminary studies have shown similar performance with longer text input but at the expense of significantly increased training time (Tab. 6 in the supplementary material). In practice, for an article longer than 512 tokens, we read two overlapping text segments of 512 tokens, one starting from the beginning and another from the end, and can thus have 24 to 511 overlapping tokens. The components prediction head in Fig. 2 is a linear layer followed by an output layer of 1024 dimensions.
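The named entity precision/recall metric can be sketched as follows (an illustrative multiset-based Python sketch; the exact matching criterion used in our evaluation may differ):

```python
from collections import Counter

def entity_precision_recall(generated, reference):
    """Precision/recall between generated and ground-truth caption entities.

    generated, reference: lists of named-entity strings extracted from the
    generated and ground-truth captions; counts are compared as multisets.
    """
    gen, ref = Counter(generated), Counter(reference)
    matched = sum((gen & ref).values())         # multiset intersection size
    precision = matched / max(sum(gen.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall
```

For example, a generated caption mentioning {Pedersen, Shelter Island} against a ground truth of {Pedersen, Havens House} scores 0.5 precision and 0.5 recall.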

Implementation and Training details
The training pipeline uses PyTorch (Paszke et al., 2017) and the AllenNLP framework (Gardner et al., 2018). The RoBERTa model and dynamic convolution code are adapted from fairseq (Ott et al., 2019). We use a maximum batch size of 16 and training is stopped after the model has seen 6.6 million examples, corresponding to 16 epochs on GoodNews and 9 epochs on NYTimes800k. Training is done with mixed precision to reduce the memory footprint and allow our full model to be trained on a single V-100 GPU for 4 to 6 days on both datasets.

General Caption Generation
We first discuss the results with the general caption generation metrics BLEU-4, ROUGE, METEOR and CIDEr reported in Tab. 8. We report the mean values of three runs; the maximum standard deviations of our variants on BLEU, ROUGE, METEOR and CIDEr are 0.013, 0.019, 0.016 and 0.069, which shows the stability of our results and that our method's improvements are notable. For the GoodNews dataset, JoGANIC (auto) provides an improvement of 0.89, 0.95, 1.04 and 10.69 points over Tell on the four metrics respectively, while the full model JoGANIC+MSTR+NEE (auto) achieves an even bigger improvement of 1.38, 2.35, 1.51 and 12.72. The improvement is especially impressive for the CIDEr score. JoGANIC performs much better than all the two-step captioning methods (first group of results) and VGG+LSTM. For the NYTimes800k dataset, we compare our models only to Tell since other models perform much worse.
Here, our full model achieves 6.79, 22.80, 10.93 and 59.42, with 1.78, 3.40, 1.88 and 19.12 points improvement over Tell. Our JoGANIC+MSTR+NEE (auto) outperforms Tell (full), which exploits additional visual features, on both datasets. This demonstrates the effectiveness of our model in generating good captions. By providing the oracle α, JoGANIC+MSTR+NEE (oracle) can achieve even higher performance on almost all metrics, showing the value of our template guidance process.
From the single modality evaluation, we observe that models that exploit the text only (JoGANIC (zero-out image) and JoGANIC (text only)) perform better than those relying on the image only (JoGANIC (zero-out text) and JoGANIC (image only)) but all have lower performance than multimodal models, confirming that both modalities are important for news image captioning.

Named Entity Generation
One of the main objectives of news captioning is to generate captions with accurate named entities. As shown in Tab. 8, compared to Tell, JoGANIC+MSTR+NEE (auto) increases the named entity precision and recall scores by 5.77% and 4.65% on GoodNews, and 8.63% and 6.39% on NYTimes800k. The oracle versions of our models attain even higher performance.
Figure 3: Example article (shortened): "In Mr. Pedersen's more than 60 years of architectural practice, the project that means the most is the modest one he just completed here, where he and his wife, Elizabeth Pedersen, have summered since 1975. 'I probably had the most pleasurable professional experience of my life,' Mr. Pedersen said during a recent walk through the Shelter Island History Center. The center is the new name for the reconfigured complex run by the Shelter Island Historical Society. Until recently, those treasures were housed in the attic of the historical society's decaying Havens House Museum, built in 1743. Over the last three years, Havens House has been renovated and expanded with a two-level addition designed by Mr. Pedersen to create more storage and a proper art gallery. The effort was initiated by Ms. Pedersen. Four years ago, she received a diagnosis of pancreatic cancer, just as the project was gaining steam. Her husband has finished it for her." Ground-truth caption: "The exterior of the Shelter Island History Center. The building was the former Havens House Museum, which has been renovated and expanded with a two-level addition." SAT caption: "A building with a clock on the side of it." JoGANIC captions are generated with template guidance (7) auto and (8) oracle, and with the manually specified templates (9) who + context, (10) who + when, (11) who + when + context, and (12) who. For the generated captions, we highlight wrong statements in red.

Template Components Evaluation
The average precision and recall of the template components of JoGANIC+MSTR+NEE (auto), reported in the two rightmost columns of Tab. 8, increase by 6.3% and 5.5% on the GoodNews dataset and 6.4% and 3.3% on the NYTimes800k dataset compared to Tell. By providing the oracle α, even better results are obtained, demonstrating that our model can exploit template guidance.

Qualitative & Human Evaluation
In Figure 3 we show the image, article (shortened for visualization) and the captions generated by a conventional image captioning model, SAT (Xu et al., 2015), by Tell (Tran et al., 2020) and by different JoGANIC variants. The captions generated by all JoGANIC variants are meaningful and closer to the ground truth than the baselines. Interestingly, most captions generated by JoGANIC variants include people's names, e.g. Mr. or Ms. Pedersen, in addition to the building names, probably because people's names are the most common type for the component who in the datasets (see Tab. 1 of the supplementary material). As MSTR can read longer text than Tell, JoGANIC+MSTR can exploit the end of the article and generates the text span effort initiated by Ms. Pedersen. The caption generated by JoGANIC+MSTR+NEE has all the key factors in the ground truth caption (the Havens House Museum, the Shelter Island History Center, been renovated and expanded), demonstrating the strengths of our model. The captions generated using the oracle α (8) as well as some other manually defined α (9-12) illustrate the benefits and flexibility of our "template components" modeling, showing how the caption generation process can be controlled by the template guidance in JoGANIC.

Finally, we conducted a human evaluation through crowd-sourcing on Amazon Mechanical Turk on 200 random image-article pairs sampled from the test set of the NYTimes800k dataset. For each image-article pair, three different raters were requested to rate the ground truth caption, the caption generated by Tell, and captions generated by 4 variants of our model, on a 4-point scale. Raters were asked to evaluate separately how well the caption described the image, how relevant it was to the article, and how easy the sentence was to understand. We report the average of the three ratings in Tab. 4, showing that all variants of our model produce captions that are rated better than Tell's and closer to the ground truth caption ratings on all three aspects. The ground truth captions have the highest sentence quality score but can have lower scores for image and article relatedness, as journalists sometimes do not follow guidelines and can write a caption describing the image independently of the article context or, on the contrary, one more related to the article than the image content. Details on the annotation instructions and results are given in the supplementary material.

Conclusion
News image captioning is a challenging task as it requires exploiting both image and text content to produce rich and well structured captions including relevant named entities and information gathered from the whole article. In this work, we presented Journalistic Guidelines Aware News Image Captioning, aiming to solve the news image captioning task by integrating domain specific knowledge in both the representation and the caption generation process. On the representation side, we introduced two techniques: named entity embedding (NEE) and multi-span text reading (MSTR). Our decoding process explicitly integrates the key components a journalist would seek to describe, improving the caption generation quality. Our method obtains remarkable gains on both the GoodNews and NYTimes800k datasets relative to the state of the art.

Ethical Considerations and Broader Impact
Our model is a multi-modality extension of general image captioning methods. It can further be applied to other applications, including but not limited to multi-modality machine translation, summarization, etc. By modeling the template components of the captions, our research could be used to explore the underlying structure of each task, improving understanding of the generation decisions or providing explanations. A potential risk of news article image captioning is generation bias, i.e., the model might tend to use the named entities that have high frequencies. We thus suggest using our model as a recommendation tool for generating captions; people can then modify the generated captions and control for potential bias. We would also encourage further work to understand the biases and limitations of the datasets used in this paper, including tools to analyze gender bias and other limitations.

A Appendices
The appendices provide more information about the two datasets GoodNews and NYTimes800k, template statistics and prediction results, implementation and training details, model differences between Tell and JoGANIC, details on the human evaluation, and ablation evaluations of another sequence-length-efficient Transformer, Longformer, as well as different sequence lengths for MSTR.

A.1 Datasets
We use two datasets: GoodNews (Biten et al., 2019) and NYTimes800k (Tran et al., 2020). The NYTimes800k dataset is 70% larger and a more complete dataset of New York Times articles, images, and captions. The numbers of train, validation and test examples are 763K, 8K and 22K respectively. Tab. 5 presents a detailed comparison between GoodNews and NYTimes800k in terms of article and caption length, and caption composition.
We also show the article length statistics in Tab. 6. With approximately 50% of the training articles having more than 512 tokens, the MSTR technique is necessary to handle these longer articles.

A.2 Template statistics and prediction results
We show the composition in terms of components and the percentage of the template classes of the whole GoodNews dataset in Tab. 7.
We also report in Tab. 8 detailed template components precision and recall scores for different variants of our model and the Tell baseline on the two datasets.

A.3 Implementation and Training details
Following Tran et al. (2020), we optimize with Adam with β_1 = 0.9, β_2 = 0.98, ε = 10^-6. The number of tokens in the vocabulary is K = 50264 and d_Wiki = 300. We use a maximum batch size of 16 and training is stopped after the model has seen 6.6 million examples, corresponding to 16 epochs on GoodNews and 9 epochs on NYTimes800k. The components prediction head in Fig. 2 of the main paper is a linear layer followed by an output layer with a hidden state dimension equal to 1024. The training pipeline is written in PyTorch (Paszke et al., 2017) using the AllenNLP framework (Gardner et al., 2018). The RoBERTa model and dynamic convolution code are adapted from fairseq (Ott et al., 2019). Training is done with mixed precision to reduce the memory footprint and allow our full model to be trained on a single GPU. The models take 4 to 6 days to train on one V-100 GPU on both datasets.

A.4 Model Difference Between Tell and JoGANIC
As shown in Tab. 9, our model shares some components with the baseline model Tell (Tran et al., 2020). JoGANIC and Tell both use an image and text encoder and a Transformer decoder. However, JoGANIC applies template guidance to model the journalistic guidelines for caption generation.

A.5 Human evaluation
We conducted a human evaluation on 200 article-image pairs. Below the article and the image, we displayed either the ground-truth caption or a caption generated by Tell or by one of our model variants. We asked the annotators to rate each caption on three questions:
• How well does the caption describe the IMAGE? Regardless of how fluent it is.
- 1 = Very bad (Does not describe the image)
- 2 = Somewhat bad (Describes the image, but contradicts or misses key information from the image)
- 3 = Somewhat good (Describes the image, no contradictions, but misses key information from the image)
- 4 = Very good (Describes the image, no contradictions, and contains the key information from the image)
• How well does the caption summarize the ARTICLE? Regardless of how fluent it is.
- 1 = Very bad (Not relevant to the topic)
- 2 = Somewhat bad (Covers the right topic, but contradicts the article or misses key facts)
- 3 = Somewhat good (Covers the right topic, no contradictions with the article, but misses key facts)
- 4 = Very good (Covers the right topic, no contradictions with the article, and contains the key facts)
• How easy or hard is it to understand the SENTENCE? Regardless of how well it describes the image or article.
- 1 = Very hard or does not make sense
- 2 = Somewhat hard
- 3 = Somewhat easy
- 4 = Very easy to understand
Each image-article pair is shown to three different annotators, so each caption is rated three times. We average the ratings for each caption, and then plot the image relevance, article relevance, and sentence quality rating statistics as violin plots in Figure 4, Figure 5, and Figure 6, respectively. In each of these plots, the median is reported as a large dashed line, the first and third quartiles as thinner dashed lines, and the mean score as a black diamond. The varying width of each violin represents the number of samples with the corresponding rating. We observe that all distributions are somewhat similar, but the Tell model generally produces the lowest-rated captions. The basic JoGANIC is slightly better, while the more advanced variants of our model produce captions that are rated higher and very similarly to the ground-truth captions.
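The per-caption averaging and the summary statistics shown in the violin plots can be sketched as follows. The function name and the sample ratings are illustrative, not the actual annotation data.

```python
import statistics

def caption_rating_stats(ratings_per_caption):
    """Summarize human-evaluation ratings as in the violin plots.

    `ratings_per_caption` holds one triple of annotator ratings
    (1-4 scale) per caption. Each caption's three ratings are
    averaged first; we then report the mean, median, and first and
    third quartiles over captions. Illustrative sketch only.
    """
    means = [sum(r) / len(r) for r in ratings_per_caption]
    q1, median, q3 = statistics.quantiles(means, n=4)
    return {"mean": statistics.mean(means),
            "median": median, "q1": q1, "q3": q3}
```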

A.6 Ablation Study
In addition to the Multi-Span Text Reading (MSTR) method proposed as an efficient technique to read long articles, we also try out Longformer (Beltagy et al., 2020), which was proposed to read long documents efficiently with an attention mechanism that scales linearly with sequence length. This attention mechanism is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention. In this experiment, we replace RoBERTa with Longformer as the text feature extractor. Results are shown in Tab. 10. Unexpectedly, the Longformer variant of JoGANIC underperforms the RoBERTa variant. A possible reason is that, to improve training efficiency, Longformer applies local windowed attention with only sparse global attention, whereas this task needs global attention at every token. One possible solution would be to redo the Longformer pretraining with fully global attention; however, this is a non-trivial task that we leave for future work.

Table 10: Results on GoodNews and NYTimes800k. We highlight the best model in bold. Note that we report the mean values of three runs; the maximum standard deviations of our variants on BLEU, ROUGE, METEOR, and CIDEr are 0.013, 0.019, 0.016 and 0.069, which shows the stability of our results and that our method's improvements are notable.
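To make the sparsity argument concrete, the attention pattern that distinguishes Longformer from full self-attention can be sketched as a boolean mask: every token attends only to a local window, plus a few designated global positions. The function and the toy sizes below are illustrative; the real model uses windows of hundreds of tokens.

```python
def longformer_style_mask(seq_len, window=2, global_positions=()):
    """Boolean attention mask sketching Longformer's pattern.

    Each token attends to neighbours within `window` positions
    (local windowed attention), plus a few `global_positions`
    that attend to, and are attended by, every token. Toy sizes;
    illustrative sketch only.
    """
    mask = [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]
    for g in global_positions:
        for j in range(seq_len):
            mask[g][j] = True  # global token attends everywhere
            mask[j][g] = True  # every token attends to the global one
    return mask
```

With only a handful of `True` entries per row, most token pairs never interact directly, which is consistent with the hypothesis above that this task needs global attention at every token.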
We also conduct experiments to determine the best number of tokens for MSTR, trying 512, 800, 1000, 1200, and 1400 tokens. We found 1000 to be the best choice, as it provides nearly the best performance while keeping the training time per epoch reasonable.