Enhancing Descriptive Image Captioning with Natural Language Inference

Generating descriptive sentences that convey non-trivial, detailed, and salient information about images is an important goal of image captioning. In this paper we propose a novel approach that encourages captioning models to produce more detailed captions using natural language inference, motivated by the observation that, among different captions of an image, descriptive captions are more likely to entail less descriptive captions. Specifically, we construct directed inference graphs for reference captions based on natural language inference. A PageRank algorithm is then employed to estimate the descriptiveness score of each node. Built on that, we use reference sampling and weighted designated rewards to guide captioning models to generate descriptive captions. The results on MSCOCO show that the proposed method significantly outperforms the baselines on a wide range of conventional and descriptiveness-related evaluation metrics.


Introduction
Automatically generating visually grounded descriptions for given images, a problem known as image captioning (Chen et al., 2015), has drawn extensive attention recently. In spite of significant improvements in image captioning performance (Lu et al., 2017; Anderson et al., 2018; Xu et al., 2015; Lu et al., 2018), existing models tend to play it safe and generate generic captions. However, generating descriptive captions that carry detailed and salient information is an important goal of image captioning. For example, recent work (Luo et al., 2018; Liu et al., 2018b, 2019a) leveraged cross-modal retrieval (Faghri et al., 2017; Feng et al., 2014) to address this problem, based on the observation that more descriptive captions often result in better discriminativity in retrieval.1

1 https://github.com/Gitsamshi/Nli-image-caption

In this paper, we explore developing more descriptive image captioning models from a novel perspective: considering that, among different captions of an image, descriptive captions are more likely to entail less descriptive ones, we develop descriptive image captioning models that leverage natural language inference (NLI, also known as recognizing textual entailment) (Dagan et al., 2005; MacCartney and Manning, 2009; Bowman et al., 2015), which can utilize multiple reference captions (Young et al., 2014; Lin et al., 2014) to guide the models to produce more descriptive captions.
Specifically, the proposed model first predicts NLI relations for all pairs of references, i.e., entailment or neutral.2 Built on that, we construct inference graphs and employ a PageRank algorithm to estimate descriptiveness scores for individual captions. We use reference sampling and weighted designated rewards to incorporate the descriptiveness signal into the Maximum Likelihood Estimation (MLE) and Reinforcement Learning (RL) phases, respectively, guiding captioning models to produce descriptive captions. Extensive experiments were conducted on the MSCOCO dataset using different benchmark baseline methods (Huang et al., 2019; Luo et al., 2018; Rennie et al., 2017).
We demonstrate that the proposed method outperforms the baselines, achieving better performance on various evaluation metrics. In summary, the major contributions of this paper are three-fold: (1) to the best of our knowledge, this is the first attempt to connect natural language inference to image captioning, which helps generate more descriptive captions; (2) we propose a reference sampling distribution and weighted designated rewards to guide captioning models to produce more descriptive captions; (3) the proposed method attains better performance on various evaluation metrics over the state-of-the-art baselines.

Related Work
The work of Luo et al. (2018) and Liu et al. (2018b), which uses a retrieval loss as a reward signal to encourage descriptive captioning, is most related to ours. Different from these approaches, our method explicitly explores the differences in descriptiveness among references using NLI models and incorporates this information into the training objectives to guide the model to generate more informative sentences. We build our method on top of the existing methods to verify its effectiveness.

Applications of NLI There are three major types of NLI applications: (1) direct application of trained NLI models, which are used in Fact Extraction and Verification (Thorne et al., 2018) to decide whether a piece of evidence supports a claim (Nie et al., 2019), and in the generation of longer texts, where a trained NLI model serves as a discriminator (Holtzman et al., 2018) to prevent a text decoder from contradicting itself; (2) NLI as a research and evaluation task for new methods; it is widely used as a major evaluation when developing novel language model pretraining (Devlin et al., 2018; Peters et al., 2018; Liu et al., 2019c); (3) NLI as a pre-training task in transfer learning; training neural network models on NLI corpora and then fine-tuning them on target tasks often yields substantial improvements in performance (Liu et al., 2019b; Phang et al., 2018).

Our Method
The goal of image captioning is to train a conditional model $p_\theta(c \mid x)$ on a dataset $\{(x_i, C_i)\}_{i=1}^{m}$ with reference captions $C_i = \{c_i^1, \cdots, c_i^n\}$, where $m$ is the number of training instances and $n$ is the number of reference captions for an image.
Typical models use a two-phase learning process to estimate $p_\theta(c \mid x)$: the first phase uses the MLE objective, which minimizes a cross-entropy loss with regard to the ground-truth captions:
$$\mathcal{L}_{ML}(\theta) = -\frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n} \log p_\theta(c_i^j \mid x_i). \qquad (1)$$
RL is then used to optimize models by maximizing the expected reward for generated captions:
$$\mathcal{L}_{RL}(\theta) = -\sum_{i=1}^{m} \mathbb{E}_{\hat{c} \sim p_\theta(\cdot \mid x_i)}\big[r(\hat{c}, x_i)\big], \qquad (2)$$
where $r(\hat{c}, x_i)$ can be the CIDEr reward ($r_{cd}$) (Rennie et al., 2017) or a combination of CIDEr ($r_{cd}$) and a discriminative loss ($l_{dis}$) (Luo et al., 2018).
In this work, we enhance these two basic learning objectives by considering the descriptiveness of the references $\{c_i^1, \cdots, c_i^n\}$.

Constructing Inference Graphs
NLI Matrix The SNLI corpus (Bowman et al., 2015) is widely used for training natural language inference models. To leverage the data for our task, we extract a subset of SNLI to fit our needs, e.g., removing contradiction sentence pairs (see Appendix B for details). Our NLI model is built upon BERT (Devlin et al., 2018), which achieves near state-of-the-art performance and is sufficient for our purpose. Given the reference captions $C_i = \{c_i^1, \cdots, c_i^n\}$ of an image, we obtain an NLI label for each ordered pair $(c_i^j, c_i^k)$, forming an NLI relation matrix, as shown in Figure 1. Note that an NLI relation matrix is not necessarily symmetric. For example, it is possible that $(c_i^j, c_i^k)$ has an entailment relation (i.e., $c_i^j$ entails $c_i^k$) while $(c_i^k, c_i^j)$ is neutral, by the definition of NLI (Bowman et al., 2015).

Inference Graphs Built on the NLI matrix, we construct the inference graphs. To construct a directed inference graph, the captions of a given image are added as vertices. We add a directed edge from $c_i^j$ to $c_i^k$ if $(c_i^j, c_i^k)$ is revEntail; i.e., the edge's head $c_i^k$ is expected to be more descriptive than its tail $c_i^j$, and the edge points towards $c_i^k$. If $(c_i^j, c_i^k)$ is fwdEntail, we add an edge from $c_i^k$ to $c_i^j$. We do not add edges for paraphrase and muNeutral (mutually neutral) pairs.

Descriptiveness Scorer PageRank (Page et al., 1999) is a link-analysis algorithm applied to collections of nodes connected by citations or references. We run PageRank on an inference graph to compute a descriptiveness score for each node/caption, which measures how likely a random walk over the graph is to stop at that node. Nodes assigned a higher score by PageRank can be viewed as more descriptive. We then normalize the scores to obtain a distribution $q(c \mid x_i)$, $c \in C_i$.
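To make the construction concrete, the following is a minimal sketch of building the inference graph for one image and scoring its nodes; it assumes a hypothetical `nli_label(premise, hypothesis)` function wrapping the fine-tuned BERT classifier and uses the NetworkX implementation of PageRank. The full method additionally weights edges by entailment probability (see Appendix B).

```python
import networkx as nx

def descriptiveness_scores(captions, nli_label, damping=0.85):
    """Build the directed inference graph for one image's reference captions
    and return a normalized descriptiveness distribution q(c | x_i)."""
    n = len(captions)
    # entails[j][k] is True iff caption j entails caption k.
    entails = [[j != k and nli_label(captions[j], captions[k]) == "entailment"
                for k in range(n)] for j in range(n)]
    g = nx.DiGraph()
    g.add_nodes_from(range(n))
    for j in range(n):
        for k in range(n):
            # fwdEntail for (c_j, c_k): c_j is more descriptive, so the edge
            # points from c_k towards c_j.  Paraphrase pairs (mutual
            # entailment) and mutually neutral pairs add no edge.
            if entails[j][k] and not entails[k][j]:
                g.add_edge(k, j)
    scores = nx.pagerank(g, alpha=damping)  # random-walk stopping probabilities
    total = sum(scores.values())
    return {i: s / total for i, s in scores.items()}
```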

Descriptiveness Regularized Learning
Reference sampling (Rs) for MLE We can verify that $\mathcal{L}_{ML}$ in Equation (1) is equivalent to the KL divergence between a uniform target reference distribution $U(c \mid x_i)$, with $U(c \mid x_i) = 1/n$ for $c \in C_i$, and the model distribution $p_\theta(c \mid x_i)$:
$$\mathcal{L}_{ML}(\theta) = \sum_{i=1}^{m} \mathrm{KL}\big(U(c \mid x_i)\,\|\,p_\theta(c \mid x_i)\big) + \mathrm{const}. \qquad (3)$$
Equation (3) indicates that every caption $c$ in the reference set $C_i$ is learned with equal weight, without considering its descriptiveness. To resolve this issue, for an image $x_i$ we instead use the probability distribution $q$ obtained from the graph nodes. This yields an enhanced MLE loss $\mathcal{L}'_{ML}$, which is equivalent to minimizing the KL divergence between the target reference sampling distribution $q$ and $p_\theta$:
$$\mathcal{L}'_{ML}(\theta) = -\sum_{i=1}^{m}\sum_{c \in C_i} q(c \mid x_i) \log p_\theta(c \mid x_i) = \sum_{i=1}^{m} \mathrm{KL}\big(q(c \mid x_i)\,\|\,p_\theta(c \mid x_i)\big) + \mathrm{const}. \qquad (4)$$

Weighted reward (Wr) for RL We modify the reward function in RL to integrate the descriptiveness scores, encouraging more contribution from descriptive references in the designated reward. Specifically, we change the CIDEr reward item $r_{cd}$ in $r(\hat{c}, x_i)$ of Equation (2) to
$$r'_{cd}(\hat{c}, x_i) = \sum_{c \in C_i} q(c \mid x_i)\, \mathrm{CD}(\hat{c}, c), \qquad (5)$$
where $\mathrm{CD}$ denotes the CIDEr similarity score.
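As a rough illustration of how the two signals could enter training, the sketch below samples the MLE target from $q$ and computes a descriptiveness-weighted CIDEr in the spirit of Equation (5); `cider_sim` is a hypothetical per-reference CIDEr scorer and `q` is the normalized PageRank distribution over an image's references.

```python
import numpy as np

def sample_reference(captions, q):
    """Reference sampling (Rs): draw the MLE training target from q(c | x_i)
    instead of uniformly, so more descriptive references are learned more often."""
    idx = np.random.choice(len(captions), p=q)
    return captions[idx]

def weighted_cider_reward(candidate, captions, q, cider_sim):
    """Weighted reward (Wr): descriptiveness-weighted CIDEr, cf. Equation (5)."""
    return sum(w * cider_sim(candidate, c) for w, c in zip(q, captions))
```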

Setup
Dataset and Evaluation Metrics We perform experiments on the Karpathy split of the MSCOCO dataset (Lin et al., 2014; Karpathy and Fei-Fei, 2015). We employ a wide range of conventional image caption evaluation metrics, i.e., SPICE (SP) (Anderson et al., 2016), CIDEr (CD) (Vedantam et al., 2015), METEOR (ME) (Denkowski and Lavie, 2014), ROUGE-L (RG) (Lin, 2004), and BLEU (Papineni et al., 2002), to evaluate the generated captions. Following Liu et al. (2019a), we also use the generated caption $\hat{c}$ to retrieve its image $x$ with a separately trained image-text matching model. The retrieval evaluation is based on 1K images from the Karpathy test set. Retrieval performance is measured by R@K (K = 1, 5), i.e., whether $x$ is retrieved within the top K retrieved images. We also perform human evaluation on descriptiveness, fluency, and fidelity.

Implementation Details To make a fair comparison, we use the same experimental setup as the compared baselines. See Appendix B for further implementation details of the NLI model, the retrieval model used in evaluation, and descriptiveness score normalization.

Compared Models We use AoANet, ATTN, and DISC (λ set to 1) as the baselines. ATTN (Rennie et al., 2017) is an LSTM-based decoder with a visual attention mechanism. AoANet (Huang et al., 2019) adopts the attention-on-attention module. We also leverage the discriminativity-enhanced model DISC (Luo et al., 2018), which is built upon ATTN.
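To make the R@K retrieval metric above explicit, a minimal sketch (assuming 1-based ranks of the paired image among the 1K candidates):

```python
def recall_at_k(ranks, k):
    """Fraction of captions whose paired image is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)
```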

Results and Analyses
Overall Performance Table 1 shows the overall performance of different models.
Results on conventional metrics. Our method consistently outperforms the baseline models on most conventional metrics, especially SPICE and CIDEr; e.g., the proposed model improves the AoANet baseline from 118.4 to 119.1 on CIDEr and from 21.5 to 21.7 on SPICE in the MLE phase, and improves the ATTN baseline on CIDEr from 117.4 to 120.1 and on SPICE from 20.5 to 21.0 in the RL phase. As CIDEr is based on tf-idf weighting, it helps to differentiate methods that generate more image-specific details, which occur less commonly across the dataset. As our method is designed to encourage models to generate sentences with more objects, attributes, or relations, this effect is also suggested by the improvement on SPICE.

Performance on descriptiveness-related metrics. Our method achieves consistently better results on R@1 and R@5 in both the MLE and RL optimization phases. Note that the proposed model can further boost the retrieval performance of the discriminativity-enhanced baseline (DISC), improving R@1 from 46.5 to 48.1 and R@5 from 83.6 to 87.9. Our weighted CIDEr reward is complementary to the discriminative loss term in DISC and further boosts the retrieval performance.

Labels between generated sentences. We use the externally trained NLI model (Section 3.1) to further investigate the NLI relationships between the captions generated by our method and by the baselines (AoA and DISC) on the test set. Figure 2 shows that our model generates more descriptive sentences. For example, comparing the generation results of AoA+RsWr and AoA on 5,000 test images, captions generated by AoA+RsWr forward-entail those generated by AoA on 1,591 images, and reverse-entail them on 341 images.

Ablation analysis. As shown in Figure 3, both reference sampling (Rs) and weighted reward (Wr) improve performance in their respective optimization phases, i.e., MLE to MLE(Rs) and MLE+RL to MLE+RL(Wr). There is also a marginal improvement when using MLE(Rs) instead of MLE before the RL(Wr) optimization phase, i.e., MLE+RL(Wr) to MLE(Rs)+RL(Wr), showing that MLE(Rs) has a positive impact even after RL(Wr) optimization.

Human Evaluation We further perform human evaluation on our method and two baselines (here, ATTN and DISC) using 100 images randomly sampled from the test set. Three human subjects rate captions on 1-5 Likert scales (higher is better) with respect to three criteria: fluency, descriptiveness, and fidelity. See Appendix A for rating details. Table 2 shows that ATTN+RsWr performs better than ATTN on descriptiveness. Moreover, DISC+RsWr can further improve descriptiveness over the baseline discriminativity-enhanced captioning model.

Case Study. Figure 4 includes three examples, in which our model produces captions with more attributes, objects, or relations.

Descriptiveness and Entailment
We perform a human analysis of the relation between descriptiveness and entailment. Specifically, we randomly sample 50 images from the MSCOCO training set. Each image has five references, constituting ten reference pairs, so we have 500 reference pairs in total. For each reference pair, we ask three subjects to annotate whether one sentence conveys more non-trivial, important, and detailed information than the other with respect to the described image.
If the majority of the three subjects annotate yes, they further annotate the NLI relation (entailment or neutral), with the more informative caption as the premise and the other as the hypothesis. As a result, out of the 500 reference pairs, we obtained 208 pairs that differ in descriptiveness. The annotated NLI relations show that 164 of the 208 collected pairs have the entailment relation; i.e., for around 80% of the 208 pairs, "descriptive captions entail less descriptive captions" holds in this randomly sampled subset of MSCOCO, a widely used multi-reference image captioning benchmark.

Pairwise similarity and Re-ranking
We apply a pairwise-similarity approach to AoA, in which we use the Jaccard similarity between pairs of sentences to build the graph and run PageRank to obtain scores. Table 3 shows that this pairwise-similarity baseline (AoA+Sim) does not further improve performance over the corresponding baseline, indicating that, unlike entailment, pairwise similarity does not signal descriptiveness.
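Jaccard similarity between two sentences is typically computed as the token-set intersection over union; a minimal sketch of that reading:

```python
def jaccard_similarity(sent_a, sent_b):
    """Token-set intersection over union between two captions."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```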
We also perform re-ranking on the ATTN baseline: we use beam search with a beam size of 3 and then rank the captions in the beam by descriptiveness scores calculated with the BERT-based NLI model. As shown in Table 3, the re-ranked sentences in the beam yield little improvement over the baseline. Sentences generated by beam search (cf. Appendix C) do not vary significantly in descriptiveness; these sentences are usually neutral to each other, and sentences ranked low in the beam may have fidelity or fluency issues.
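One plausible reading of this re-ranking step, sketched under the assumption that descriptiveness scores for beam candidates are obtained in the same way as for references (reusing the `descriptiveness_scores` sketch above):

```python
def rerank_beam(beam_captions, nli_label):
    """Re-rank beam-search candidates by NLI-derived descriptiveness scores."""
    q = descriptiveness_scores(beam_captions, nli_label)
    order = sorted(range(len(beam_captions)), key=lambda i: q[i], reverse=True)
    return [beam_captions[i] for i in order]
```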

Conclusions
We explore a novel approach to encouraging image captioning models to produce more descriptive sentences using natural language inference. We construct inference graphs and assign descriptiveness scores to their nodes using the PageRank algorithm. Built on that, we use reference sampling and weighted designated rewards to guide captioning models to generate descriptive captions. We demonstrate the effectiveness of the model on various evaluation metrics and perform detailed analyses.

A Human Evaluation Details
The human evaluation is performed with three non-author human subjects. We ask the subjects to rate on three 1-5 Likert scales, corresponding to fidelity (the sentences' fidelity to the corresponding images), fluency (the quality of captions in terms of grammatical correctness and fluency), and descriptiveness (how much detailed and faithful information the sentences convey about the images).

B More Implementation Details
NLI We exclude the training instances labeled with contradiction, since our task does not need to consider contradiction: reference captions for the same image are unlikely to contradict each other. We also subsample training instances of the SNLI dataset so that the subset's length distribution is similar to that of the caption references. We obtained a filtered dataset with around 250K sentence pairs as our training set, and 4K pairs each for the validation and test sets. We fine-tune BERT (Devlin et al., 2018), which is a basis for many state-of-the-art models and achieves near state-of-the-art performance, sufficient for our purpose. Training stabilizes after 3 epochs, reaching an accuracy of around 88% on the test set.
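The paper does not spell out the classifier code; the following is a minimal sketch of how such a two-class BERT NLI predictor could be set up with the HuggingFace transformers library (the model name and label order are assumptions, and the fine-tuning loop on the filtered SNLI subset is omitted). It also provides the `nli_label` function assumed in the earlier graph-construction sketch.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

LABELS = ["entailment", "neutral"]  # contradiction pairs are removed from SNLI

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))
model.eval()

def nli_label(premise, hypothesis):
    """Predict the NLI relation for an ordered (premise, hypothesis) pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]
```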

Retrieval Model in Evaluation
The model is trained with the published SCAN package. For the specific parameters, we followed the "SCAN t-i LSE" setting in their published report.
Descriptiveness Score We use the entailment probabilities as edge weights and perform PageRank using the NetworkX toolkit (Hagberg et al., 2008). We set the damping parameter to 0.95 for the descriptiveness scores used at the MLE training stage and to 0.1 for those used at the RL training stage, as we find that a peaked score distribution for MLE (cf. Equation 4) and a smooth score distribution for the reward (cf. Equation 5) lead to improved performance at the respective training stages.
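With NetworkX, the two settings amount to running PageRank with different damping factors over the weighted inference graph; a minimal self-contained sketch with a toy graph (edge weights stand in for entailment probabilities):

```python
import networkx as nx

# Toy inference graph: edges point towards more descriptive captions,
# with entailment probabilities as weights.
g = nx.DiGraph()
g.add_weighted_edges_from([(0, 1, 0.9), (2, 1, 0.8), (3, 1, 0.7), (3, 0, 0.6)])

# Peaked distribution for reference sampling at the MLE stage.
q_mle = nx.pagerank(g, alpha=0.95, weight="weight")
# Smooth (near-uniform) distribution for the weighted CIDEr reward at the RL stage.
q_rl = nx.pagerank(g, alpha=0.1, weight="weight")
```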

C Beam Search Generation
Example 1. {"image˙id": 247625, "caption": a man holding a snowboard in the snow, a man standing on a snowboard in the snow, a man is standing on a snowboard in the snow} {"image˙id": 131019, "caption": a group of zebras are standing in a field, a group of zebras are standing in a field with a zebra, a group of zebras are walking in a field} These are sentences generated by beam search by ATTN model after RL stage (before re-ranking).