L2C: Describing Visual Differences Needs Semantic Understanding of Individuals

Recent advances in language and vision push forward research from captioning a single image to describing the visual differences between image pairs. Given two images I_1 and I_2 and the task of generating a description W_{1,2} comparing them, existing methods directly model the I_1, I_2 -> W_{1,2} mapping without semantic understanding of the individual images. In this paper, we introduce a Learning-to-Compare (L2C) model, which learns to understand the semantic structures of the two images and compare them while learning to describe each one. We demonstrate that L2C benefits from comparing explicit semantic representations and from single-image captions, and generalizes better to new test image pairs. It outperforms the baseline on both automatic and human evaluation on the Birds-to-Words dataset.


Introduction
The task of generating textual descriptions of images tests a machine's ability to understand visual data and interpret it in natural language. It is a fundamental research problem lying at the intersection of natural language processing, computer vision, and cognitive science. For example, single-image captioning (Farhadi et al., 2010;Kulkarni et al., 2013;Vinyals et al., 2015;Xu et al., 2015) has been extensively studied.
Recently, an intriguing new task, visual comparison, along with several benchmarks (Jhamtani and Berg-Kirkpatrick, 2018; Tan et al., 2019; Park et al., 2019; Forbes et al., 2019), has drawn increasing attention in the community. To complete the task and generate comparative descriptions, a machine should understand the visual differences between a pair of images (see Figure 1). Previous methods (Jhamtani and Berg-Kirkpatrick, 2018) often treat the pair of pre-trained visual features, such as ResNet features (He et al., 2016), as a whole, and build end-to-end neural networks to predict the description of the visual comparison directly. In contrast, humans can easily reason about the visual components of a single image and describe the visual differences between two images based on their semantic understanding of each one. Humans do not need to look at thousands of image pairs to describe the differences between new image pairs, as they can leverage their understanding of single images for visual comparison. We therefore believe that visual differences should be learned by understanding and comparing each single image's semantic representation. A recent work (Zhang et al., 2020) conceptually supports this argument, showing that low-level ResNet visual features lead to poor generalization in vision-and-language navigation, whereas high-level semantic segmentation helps the agent generalize.

Figure 1: Overview of the visual comparison task and our motivation. The key is to understand both images and compare them. Explicit semantic structures can be compared between images and used to generate comparative descriptions aligned to the image saliency.
Motivated by humans, we propose a Learning-to-Compare (L2C) method that focuses on reasoning about the semantic structures of individual images and then compares the differences between the image pair. Our contributions are three-fold:
• We construct a structured image representation by leveraging image segmentation with a novel semantic pooling, and use graph convolutional networks to perform reasoning on these learned representations.
• We utilize single-image captioning data to boost semantic understanding of each image with its language counterpart.
• Our L2C model outperforms the baseline on both automatic and human evaluation, and generalizes better to new test image pairs.

L2C Model
We present a novel framework in Figure 2, which consists of three main components. First, a segmentation encoder is used to extract structured visual features with strong semantic priors. Then, a graph convolutional module performs reasoning on the learned semantic representations. To enhance the understanding of each image, we introduce a single-image captioning auxiliary loss to associate the single-image graph representation with the semantic meaning conveyed by its language counterpart. Finally, a decoder generates the visual descriptions comparing two images based on differences in graph representations. All parameters are shared for both images and both tasks.

Semantic Representation Construction
To extract semantic visual features, we utilize pre-trained fully convolutional networks (FCN) (Long et al., 2015) with ResNet-101 as the backbone. An image I is fed into the ResNet backbone to produce a feature map F ∈ R^{D×H×W}, which is then forwarded into an FCN head that generates a binary segmentation mask B for the bird class. However, the shapes of these masks vary across images, and simple pooling methods such as average pooling and max pooling would lose information about the spatial relations within the mask.
To address this issue and enable efficient aggregation over the area of interest (the masked region), we add a module after the ResNet that clusters each pixel within the mask into K classes. The feature map F is forwarded through this pooling module to obtain a confidence map C ∈ R^{K×H×W}, whose entry at each pixel is a K-dimensional vector representing a probability distribution over the K classes. A set of node features {v_k}_{k=1}^{K} is then computed by semantic pooling:

v_k = Σ_{i=1}^{H} Σ_{j=1}^{W} (C_k ⊙ F)_{i,j}, k = 1, ..., K,

where C_k is the k-th probability map and ⊙ denotes element-wise multiplication (with C_k broadcast over the feature dimension).
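As a rough sketch of this semantic pooling (in NumPy, with hypothetical shapes and names; the actual module is two convolutional layers producing the K-way logits), each pixel's class distribution weights its feature vector, and the weighted features are summed spatially into one node per class:

```python
import numpy as np

def semantic_pool(F, logits, mask):
    """Aggregate a (D, H, W) feature map into K node features.

    F:      (D, H, W) feature map from the backbone.
    logits: (K, H, W) unnormalized class scores from the pooling module.
    mask:   (H, W)    binary segmentation mask for the bird class.
    Returns V: (K, D), one pooled feature vector per semantic class.
    """
    # Softmax over the K classes at every pixel -> confidence map C.
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    C = e / e.sum(axis=0, keepdims=True)          # (K, H, W)
    C = C * mask[None]                            # restrict to the masked area
    # v_k = sum_{i,j} C_k[i, j] * F[:, i, j]
    V = np.einsum('khw,dhw->kd', C, F)
    return V

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 4, 4))
logits = rng.normal(size=(3, 4, 4))
mask = np.ones((4, 4))
V = semantic_pool(F, logits, mask)
print(V.shape)  # (3, 8)
```

Because the K class probabilities sum to one at each pixel, summing the K node vectors recovers the spatial sum of the (masked) feature map, so no information inside the mask is discarded the way hard max pooling would.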
To enforce local smoothness, i.e., that pixels in a neighborhood are more likely to belong to the same class, we employ the total variation norm as a regularization term:

L_tv = Σ_{k,i,j} ( |C_{k,i+1,j} − C_{k,i,j}| + |C_{k,i,j+1} − C_{k,i,j}| ).
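A minimal sketch of this regularizer (anisotropic total variation on a (K, H, W) confidence map; function name and shapes are illustrative):

```python
import numpy as np

def tv_norm(C):
    """Anisotropic total variation of a (K, H, W) confidence map:
    the summed absolute differences between vertically and horizontally
    adjacent pixels, which penalizes noisy class assignments."""
    dh = np.abs(C[:, 1:, :] - C[:, :-1, :]).sum()   # vertical neighbors
    dw = np.abs(C[:, :, 1:] - C[:, :, :-1]).sum()   # horizontal neighbors
    return dh + dw

# A spatially constant map has zero total variation.
smooth = np.full((2, 4, 4), 0.5)
print(tv_norm(smooth))  # 0.0
```

Minimizing this term pushes the confidence map toward piecewise-constant regions, so each of the K classes tends to cover a contiguous body part rather than scattered pixels.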

Comparative Relational Reasoning
Inspired by recent advances in visual reasoning and graph neural networks (Li et al., 2019), we introduce a relational reasoning module to enhance the semantic representation of each image. A fully connected visual semantic graph G = (V, E) is built, where V is the set of nodes, each containing a regional feature, and E is constructed by measuring the pairwise affinity between every two nodes v_i, v_j in a latent space:

A(v_i, v_j) = (W_i v_i)^T (W_j v_j),
where W_i and W_j are learnable matrices, and A is the constructed adjacency matrix. We apply a Graph Convolutional Network (GCN) (Kipf and Welling, 2016) to perform reasoning on the graph. After the GCN module, the output nodes form a relationship-enhanced representation of a bird. For the visual comparison task, we compute the difference between each pair of corresponding visual nodes from the two sets, denoted V_diff.
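A minimal sketch of this reasoning step (NumPy; a row-softmax normalization of A, a single residual GCN layer, and the node-wise differences — the exact normalization and layer count are assumptions, not the paper's specification):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reason(V, Wi, Wj, Wg):
    """One round of graph reasoning over node features V of shape (K, D).

    A[i, j] is the affinity between nodes i and j in a latent space;
    a GCN-style layer then propagates information along A."""
    A = softmax((V @ Wi) @ (V @ Wj).T, axis=-1)   # (K, K) adjacency
    return np.maximum(A @ V @ Wg, 0.0) + V        # residual ReLU update

rng = np.random.default_rng(0)
K, D = 5, 16
V1, V2 = rng.normal(size=(2, K, D))
Wi, Wj, Wg = rng.normal(size=(3, D, D)) * 0.1
V1o, V2o = reason(V1, Wi, Wj, Wg), reason(V2, Wi, Wj, Wg)
V_diff = V1o - V2o        # node-wise differences fed to the decoder
print(V_diff.shape)       # (5, 16)
```

Because the same weights process both images, corresponding nodes live in a shared latent space, which is what makes the simple node-wise subtraction meaningful.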

Learning to Compare while Learning to Describe
After obtaining relation-enhanced semantic features, we use a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to generate captions. As discussed in Section 1, semantic understanding of each image is key to solving the task. However, no single dataset contains both visual-comparison and single-image annotations. Hence, we leverage two datasets from similar domains to facilitate training: one for visual comparison and the other for single-image captioning. We adopt alternating training: at each iteration, two mini-batches of images are sampled independently from the two datasets and fed into the encoder to obtain visual representations V^o (for single-image captioning) or V^o_diff (for visual comparison).
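The alternating schedule can be sketched as follows (a toy illustration with placeholder data names; the real loop would compute the two losses on each pair of mini-batches):

```python
import itertools

# Toy stand-ins for the two data sources (names are illustrative).
comparison_pairs = [f"pair_{i}" for i in range(3)]   # visual-comparison batches
single_images = [f"image_{i}" for i in range(3)]     # single-image caption batches

def alternate_training(num_iters):
    """Each iteration draws one mini-batch from each dataset:
    the comparison batch trains the decoder on V_diff, and the
    captioning batch trains it on V, with all parameters shared."""
    comp = itertools.cycle(comparison_pairs)
    capt = itertools.cycle(single_images)
    schedule = []
    for _ in range(num_iters):
        schedule.append(("compare", next(comp)))   # loss on V_diff
        schedule.append(("caption", next(capt)))   # loss on V
    return schedule

print(alternate_training(2))
```

Since the encoder and decoder are shared across both tasks, every captioning batch also improves the representations used for comparison.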
The LSTM takes V^o or V^o_diff together with the previous output word embedding y_{t−1} as input, updates the hidden state from h_{t−1} to h_t, and predicts the word for the next time step. The generation process for bi-image comparison is learned by maximizing the log-likelihood of the predicted output sentence, i.e., minimizing

L_comp = − Σ_t log p(y_t | y_{1:t−1}, V^o_diff).

A similar loss L_single is applied for learning single-image captioning. Overall, the model is optimized with a mixture of the cross-entropy losses and the total variation loss:

L = L_comp + L_single + λ L_tv,

where λ is an adaptive factor that weighs the total variation loss.

Experiments

We evaluate generated descriptions with BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), and CIDEr-D (Vedantam et al., 2015). Each generated description is compared to all five reference paragraphs. Note that for this particular task, researchers have observed that CIDEr-D is susceptible to common patterns in the data (see Table 1), and ROUGE-L is anecdotally correlated with higher-quality descriptions, as noted in previous work (Forbes et al., 2019). Hence we consider ROUGE-L the main metric for evaluating performance. We also perform a human evaluation to further verify performance.

Implementation Details
We use Adam as the optimizer with an initial learning rate of 1e-4. The pooling module that produces the K classes is composed of two convolutional layers with batch normalization, with kernel sizes 3 and 1, respectively. We set K to 9 and λ to 1. The dimension of the graph representations is 512, as is the hidden size of the decoder. The batch sizes for B2W and CUB are 16 and 128, respectively. Following the advice of Forbes et al. (2019), we report results from the models with the highest ROUGE-L on the validation set, since it correlates better with high-quality outputs for this task.

Table 1: Automatic evaluation on the Birds-to-Words validation and test sets. Baseline results are from Forbes et al. (2019): Most Frequent produces only the most frequently observed description in the dataset ("the two animals appear to be exactly the same"); Text-Only samples captions from the training data according to their empirical distribution; Neural Naturalist is the transformer model of Forbes et al. (2019); CNN+LSTM is a commonly used CNN-encoder, LSTM-decoder model.

Automatic Evaluation
As shown in Table 1, L2C outperforms the baselines on both the validation and test sets.

Human Evaluation
To fully evaluate our model, we conduct a pairwise human evaluation on Amazon Mechanical Turk with 100 image pairs randomly sampled from the test set; each sample is assigned to 5 workers to reduce human variance. Following Wang et al. (2018), for each image pair, workers are presented with two paragraphs from different models and asked to choose the better one based on the text quality. As shown in Table 2, L2C outperforms CNN+LSTM, which is consistent with the automatic metrics.

Ablation Studies
Effect of Individual Components We perform ablation studies to show the effectiveness of semantic pooling, the total variation loss, and graph reasoning, as shown in Table 3. First, without semantic pooling, the model degrades to average pooling, and the results show that semantic pooling better preserves the spatial relations in the visual representations. Moreover, the total variation loss further boosts performance by injecting the local-smoothness prior. Finally, the results without the GCN are lower than those of the full L2C model, indicating that graph convolutions can efficiently model relations among visual regions.
Sensitivity Test We analyze model performance for varying K (the number of classes for the confidence map C), as shown in Figure 3. Empirically, we find the results are comparable when K is small.

Conclusion
In this paper, we present a learning-to-compare framework for generating visual comparisons. Our segmentation encoder with semantic pooling and graph reasoning could construct structured image representations. We also show that learning to describe visual differences benefits from understanding the semantics of each image.