Semantic Relation-aware Difference Representation Learning for Change Captioning

Change captioning aims to describe the difference between a pair of images with a natural language sentence. In this task, distractors such as illumination or viewpoint changes pose major challenges for learning the difference representation. In this paper, we propose a semantic relation-aware difference representation learning network to explicitly learn the difference representation in the presence of distractors. Specifically, we introduce a self-semantic relation embedding block to explore the underlying changed objects, and design a cross-semantic relation measuring block to localize the real change and learn a discriminative difference representation. Besides, relying on the part-of-speech (POS) of words, we devise an attention-based visual switch to dynamically use visual information for caption generation. Extensive experiments show that our method achieves state-of-the-art performance on the CLEVR-Change and Spot-the-Diff datasets 1 .


Introduction
Change Captioning aims to describe a semantic change between a pair of "before" and "after" images, which has many practical applications such as facility monitoring (Sakurada and Okatani, 2015), medical imaging (Patriarche and Erickson, 2004), and aerial photography (Gueguen and Hamid, 2015).
The previous work (Jhamtani and Berg-Kirkpatrick, 2018) introduced this task under the ideal assumption that there is a semantic change between a completely aligned image pair. However, illumination changes are ubiquitous in a dynamic world, and the same or similar scenes are prone to be shot under different viewpoints. Compared to semantic changes, both illumination and viewpoint changes are irrelevant distractors, so realistic change captioning requires a model to: 1) distinguish semantic changes (e.g., an object has moved) from distractors (e.g., a viewpoint change) and 2) convey the detected change in a logically and grammatically accurate sentence. To this end, recent works (Park et al., 2019; Shi et al., 2020) focused on addressing change captioning in the presence of distractors.

[Figure 1: Two example pairs of <Before> and <After> images with change captions: "The tiny cylinder has disappeared." and "A person on the far corner of the sidewalk is now gone."]

* This work was done when Yunbin Tu visited the VIPL research group, CAS, and was supervised by Prof. Liang Li.
† Corresponding author.
1 The code of this paper is publicly available at https://github.com/tuyunbin/SRDRL
Despite this progress, their approaches have two limitations. First, the semantic difference was modeled relying only on the semantic features of objects, while ignoring their self-semantic relations. Hence, the feature difference can hardly capture a tiny change. As shown in Figure 1, compared with the many unchanged objects, the dropped object is tiny and easy to overlook. In contrast, if one of the objects has changed, especially by a number or position change (e.g., "add", "drop", or "move"), the semantic relations surrounding it change as well, which is beneficial for exploring the underlying objects that have changed. Second, due to the existence of irrelevant distractors, the model may capture the semantic difference with noise and thus learn a wrong difference representation. However, both distractors are irrelevant to the semantics of the image contents. Therefore, the cross-semantic relation between the captured semantic difference and the image pair is beneficial for judging whether a semantic change has actually happened, and for further learning the difference representation in the "before" and "after" images.
Besides, during caption generation, previous works exploited visual information to generate every word, which is unnecessary or even misleading (Lu et al., 2017; Song et al., 2017). Words with different part-of-speech (POS) tags not only play different grammatical roles in a sentence, but also have different relationships with the visual information in an image. As shown in the first example of Figure 1, some words (e.g., "tiny", "cylinder", and "disappeared") are adjective, noun, and verb words denoting the size, category, and state of the visual object, while the word "the" is a determiner that has no canonical visual signal. Thus, it is useful to exploit the POS of words to switch visual information on and off during change caption generation.
In this paper, we propose a Semantic Relation-aware Difference Representation Learning (SRDRL) network to localize the semantic change in the presence of distractors, and introduce an Attention-based Visual Switch (AVS) to dynamically decide when to use visual information during change caption generation. Specifically, first, a Self-Semantic Relation Embedding (SSRE) block builds semantic relations of objects for each image in the "before"/"after" pair via the self-attention mechanism. The built relations are embedded into the image features for computing a relation-embedded feature difference. Second, a Cross-Semantic Relation Measuring (CSRM) block leverages the obtained difference to query the underlying "candidate change" in each image. Further, CSRM uses the difference to generate an attention gate measuring its cross-semantic relations with respect to each image. Subsequently, the attention gate is applied to the candidate change to distinguish the semantic change from the viewpoint/illumination change. Third, a change localizer is introduced to learn the accurate difference representation in the image pair under the guidance of this prior knowledge (the distinguished information above).
Finally, according to the POS information of words, an Attention-based Visual Switch (AVS) is devised and incorporated into the caption generator to dynamically control the visual information when predicting the next word. Extensive experiments show that our approach outperforms state-of-the-art change captioning models by a large margin.
In summary, the contributions of this work are threefold: (1) We propose SRDRL, which explicitly learns the semantic difference representation in the image pair by embedding self-semantic relations into the object features of each image and further measuring the cross-semantic relations between the image pair and their difference. (2) Both the SSRE and CSRM blocks are designed to help the change localizer accurately focus on the changed objects.
(3) An AVS is customized to dynamically utilize visual information for caption generation based on the POS information of words.

Related Work
Different from conventional image captioning (Liu et al., 2020; Yan et al., 2019, 2020a, 2021) or video captioning (Deng et al., 2021; Tu et al., 2017, 2020; Yan et al., 2020b), change captioning addresses two-image captioning, in particular describing the difference between the images. Jhamtani and Berg-Kirkpatrick (2018) presented the first work on change captioning. However, it is built upon the ideal assumption that there are no distractors (illumination/viewpoint changes) between a pair of images. To bring this task closer to our dynamic world, Park et al. (2019) and Shi et al. (2020) both aimed to address change captioning in the presence of distractors. On one hand, Park et al. directly concatenated the coarse feature difference with the image pair and applied spatial attention to localize the change. However, due to the existence of distractors, when the captured feature difference is not what the model really expects, the spatial attention module can be misled into giving fallacious results. On the other hand, Shi et al. first exploited a cross-attention mechanism to search for the most similar patches between the image pair, which are regarded as the unchanged representation; they then subtracted it from the original image to obtain the difference representation. However, as aforementioned, a changed object can be tiny and easy to overlook, so capturing the difference representation only at the feature level is insufficient. Different from the above state-of-the-art methods, we first use SSRE to improve the fine-grained representation ability of object features by embedding the self-semantic relations among them. Then, we exploit CSRM to distinguish the actual semantic change from irrelevant distractors by measuring the cross-semantic relations between the captured candidate difference and the original images.
Finally, we use POS information to devise an attention-based visual switch that dynamically determines not only when to use visual information, but also which to use (e.g., "before" or "after"). Compared to the aforementioned methods, our method can not only learn a discriminative difference representation, but also describe it with an accurate natural language sentence.

Methodology
We present a semantic relation-aware difference representation learning (SRDRL) network for change localization, and devise an attention-based visual switch (AVS) under the guidance of POS information for caption generation. Given a pair of "before" and "after" images (denoted as I_bef and I_aft), our SRDRL first detects what (position, number, attribute, or nothing) has changed in a scene, and further decides where to localize the change on both I_bef and I_aft. Then, during caption generation, the AVS dynamically decides when to use visual information and which to use (e.g., "before" or "after").

Self-Semantic Relation Embedding
Formally, given a pair of I_bef and I_aft, we first use a pre-trained CNN to extract object-level features, denoted as X_bef and X_aft, where X_i ∈ R^{C×H×W}; C, H, and W indicate the number of channels, height, and width. However, these original object features are independent of each other, while there exist semantic relations among them (Wu et al., 2019; Yin et al., 2020). Inspired by the self-attention mechanism (Vaswani et al., 2017) used in machine translation, the self-semantic relation embedding block (SSRE) relies on it to implicitly model the semantic relations among objects in each image. Specifically, we first reshape each X_i into R^{HW×C}. Then, given keys K and values V, SSRE applies scaled dot-product attention to the queries Q:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where, in our case, the queries, keys, and values are all projections of the object features X_i. Through the SSRE, the semantic relations are embedded into the original object features, updating both X_bef and X_aft. Finally, we subtract the updated X_bef from the updated X_aft to capture the semantic difference X_diff in terms of both object features and their relations.
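As a minimal sketch of the SSRE computation (the residual connection, head count, and feature sizes here are assumptions, not the paper's exact configuration), each image's flattened grid features are refined with multi-head self-attention so that pairwise relations are embedded, and the relation-embedded difference is a simple subtraction:

```python
# Sketch of SSRE: self-attention embeds object relations into each image's
# features, then "after" minus "before" gives the relation-aware difference.
import torch
import torch.nn as nn

class SSRE(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, H*W, C) reshaped features
        rel, _ = self.attn(x, x, x)        # queries = keys = values = x
        return x + rel                     # embed relations (residual assumed)

ssre = SSRE(dim=64, heads=4)               # small C=64 for illustration
x_bef = torch.randn(2, 196, 64)            # flattened 14x14 grid
x_aft = torch.randn(2, 196, 64)
x_bef_r, x_aft_r = ssre(x_bef), ssre(x_aft)
x_diff = x_aft_r - x_bef_r                 # relation-embedded difference
print(x_diff.shape)                        # torch.Size([2, 196, 64])
```

The same SSRE weights are applied to both images so that their refined features stay comparable before subtraction.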

Cross-Semantic Relation Measuring
Due to the existence of distractors, the resulting X_diff may include irrelevant information, which acts as noise for accurate difference representation learning on both X_bef and X_aft. Thus, we propose a cross-semantic relation measuring block (CSRM) to distinguish the semantic change from the irrelevant illumination or viewpoint change by measuring the cross-semantic relation between X_diff and X_bef (X_aft). Concretely, CSRM utilizes X_diff to first query the possible "candidate change" C_bef on X_bef, and then generates an "attention gate" A_bef measuring its semantic relations with respect to X_bef. These are defined by two separate non-linear transformations:

C_bef = φ(W_c [X_bef; X_diff] + b_c),
A_bef = σ(W_a [X_bef; X_diff] + b_a),

where W_c, W_a and b_c, b_a are learnable parameters, C is the dimension of X_diff and X_bef, and σ and φ denote the sigmoid and tanh functions, respectively. The values in the "attention gate" indicate the semantic relevance between the "candidate change" and the "before" image: the more information in the "candidate change" passes through the "attention gate", the more X_diff is relevant to X_bef.
Next, CSRM applies A_bef to C_bef to filter the underlying change information and focus only on the information about the semantic change via element-wise multiplication:

C_bef ← A_bef ⊙ C_bef.

Besides, the information about the semantic change C_aft is computed via the analogous operation between X_diff and X_aft.
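The two branches above can be sketched as follows (a hypothetical sketch: the linear projections over concatenated features and all layer names are assumptions consistent with the tanh/sigmoid description, not the paper's released code):

```python
# Sketch of CSRM: a tanh branch proposes a "candidate change", a sigmoid
# branch gates it by cross-semantic relevance; their element-wise product
# keeps only difference content that relates to the given image.
import torch
import torch.nn as nn

class CSRM(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.cand = nn.Linear(2 * dim, dim)   # candidate-change projection
        self.gate = nn.Linear(2 * dim, dim)   # cross-relation gate projection

    def forward(self, x_img, x_diff):         # both: (B, HW, C)
        z = torch.cat([x_img, x_diff], dim=-1)
        c = torch.tanh(self.cand(z))          # candidate change, in (-1, 1)
        a = torch.sigmoid(self.gate(z))       # attention gate, in (0, 1)
        return a * c                          # filtered semantic change

csrm = CSRM(dim=64)
x_bef = torch.randn(2, 196, 64)
x_diff = torch.randn(2, 196, 64)
c_bef = csrm(x_bef, x_diff)                   # same call with x_aft gives c_aft
```

Because the gate is shared in form but computed separately per image, the "before" and "after" branches can pass different amounts of the difference through.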

Prior Knowledge-guided Change Localizer
After obtaining C_bef and C_aft, we use them as prior knowledge to guide the change localizer in learning the difference representation. Specifically, the change localizer first predicts two separate attention maps under the guidance of C_bef and C_aft, respectively:

a_bef = σ(conv([X_bef; C_bef])),
a_aft = σ(conv([X_aft; C_aft])),

where [;], conv, and σ indicate concatenation, a convolutional layer, and the element-wise sigmoid function, respectively. After that, the difference representation features l_bef and l_aft are obtained by applying a_bef and a_aft to the input image features X_bef and X_aft.
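A minimal sketch of this localizer (the 1×1 convolution and sum-pooling are assumptions; the paper does not fully specify the pooling):

```python
# Sketch of the prior-knowledge-guided localizer: concatenate image features
# with the filtered change, predict a sigmoid attention map, and pool the
# attended features into a difference representation vector.
import torch
import torch.nn as nn

class ChangeLocalizer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(2 * dim, 1, kernel_size=1)

    def forward(self, x_img, change):          # both: (B, C, H, W)
        a = torch.sigmoid(self.conv(torch.cat([x_img, change], dim=1)))
        l = (a * x_img).flatten(2).sum(-1)     # attended, pooled: (B, C)
        return l, a                            # feature and attention map

loc = ChangeLocalizer(dim=64)
x_bef = torch.randn(2, 64, 14, 14)
c_bef = torch.randn(2, 64, 14, 14)
l_bef, a_bef = loc(x_bef, c_bef)               # same call on "after" gives l_aft
```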

Change Caption Generation

POS Predictor
Inspired by the POS used in machine translation (Yin et al., 2019), we dynamically predict the POS tags of target words based on the previous hidden state h_c^(t−1) of the caption generator. The predicted tags help the captioning model use visual information in a dynamic way.
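This dynamic prediction can be sketched as follows (a hedged illustration: the hidden size, the tag count, and the soft tag-embedding lookup for p_t are assumptions):

```python
# Sketch of the POS predictor: the previous decoder state passes through a
# ReLU hidden layer and a softmax over POS tags; the tag probabilities weight
# a tag-embedding table to produce a soft POS representation p_t.
import torch
import torch.nn as nn

class POSPredictor(nn.Module):
    def __init__(self, hidden=512, n_tags=16):
        super().__init__()
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, n_tags)
        self.tag_emb = nn.Parameter(torch.randn(n_tags, hidden))

    def forward(self, h_prev):                 # h_prev: (B, hidden)
        w = torch.softmax(self.fc2(torch.relu(self.fc1(h_prev))), dim=-1)
        p_t = w @ self.tag_emb                 # (B, hidden) soft POS feature
        return w, p_t

pos = POSPredictor()
w, p_t = pos(torch.randn(2, 512))              # tag distribution and p_t
```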
Specifically, at time step t, h_c^(t−1) is first fed into a single hidden layer with the ReLU activation function:

h_p^(t) = ReLU(W_p^(1) h_c^(t−1) + b_p^(1)),

where W_p^(1) ∈ R^{M×M} and b_p^(1) ∈ R^M, and M is the dimension of the hidden state of the caption generator. Then, a POS tag probability w_p^t is predicted by a linear transformation with a softmax function:

w_p^t = softmax(W_p^(2) h_p^(t) + b_p^(2)),

where W_p^(2) ∈ R^{M×n} and b_p^(2) ∈ R^n, and n is the number of POS tags. After obtaining w_p^t, we represent the POS tag of the target word w_t with a semantic representation p_t.

Attention-based Visual Switch

Visual Attention. We first use a visual attention module to select a candidate feature from l_bef, l_aft, or l_diff (l_aft − l_bef) that could be relevant to the target word:

l_dyn^(t) = Σ_i α_i^(t) l_i,  i ∈ {bef, diff, aft},

where the attention weights α_i^(t) are computed from the hidden states of the attention module LSTM_a and the caption generator LSTM_c.
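The visual attention step can be sketched like this (hypothetical: the scoring MLP over the concatenated state and candidate is an assumption, and the LSTM_a state is stubbed with a plain vector):

```python
# Sketch of visual attention: score each of the three candidate features
# (l_bef, l_diff, l_aft) against the attention state, softmax-normalize,
# and mix the candidates into one dynamically attended feature l_dyn.
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)     # one score per candidate

    def forward(self, h_a, feats):             # feats: (B, 3, dim)
        h = h_a.unsqueeze(1).expand_as(feats)
        alpha = torch.softmax(self.score(torch.cat([feats, h], -1)), dim=1)
        return (alpha * feats).sum(1), alpha   # l_dyn: (B, dim)

att = VisualAttention(dim=512)
feats = torch.stack([torch.randn(2, 512) for _ in range(3)], dim=1)
h_a = torch.randn(2, 512)                      # stand-in for LSTM_a state
l_dyn, alpha = att(h_a, feats)
```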
Visual Switch. Then, we exploit a visual switch to decide whether to rely on visual information to predict the next word, based on the predicted POS representation p_t. At time step t, the visual switch β_t is defined as:

β_t = σ(W_s1 p_t + W_s2 h_c^(t−1)),

where σ is the sigmoid function and W_s* are the learnable parameters. The range of β_t is [0, 1], and its value indicates how much visual information to use when predicting the target word. We then apply this switch to the attended visual feature l_dyn^(t) to control the use of visual information.
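A minimal sketch of the switch (the exact inputs to the gate are an assumption; here it sees the POS representation and the previous decoder state):

```python
# Sketch of the visual switch: a sigmoid gate in [0, 1], computed from the
# POS representation and the decoder state, scales the attended visual
# feature before it enters the caption generator.
import torch
import torch.nn as nn

class VisualSwitch(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.w = nn.Linear(2 * hidden, 1)

    def forward(self, p_t, h_prev, l_dyn):
        beta = torch.sigmoid(self.w(torch.cat([p_t, h_prev], dim=-1)))
        return beta * l_dyn                    # gated visual feature

switch = VisualSwitch()
p_t = torch.randn(2, 512)                      # soft POS representation
h = torch.randn(2, 512)                        # previous decoder state
l_dyn = torch.randn(2, 512)                    # attended visual feature
out = switch(p_t, h, l_dyn)
```

For a determiner like "the", the gate should learn a value near 0, effectively muting the visual input for that step.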

Caption Generator
After the proper visual information is obtained, we feed it together with the previous word w_{t−1} (the ground-truth word during training, the predicted word during inference) into the caption generator LSTM_c to predict a distribution over the next word:

h_c^(t) = LSTM_c([E w_{t−1}; l_dyn^(t)], h_c^(t−1)),
P(w_t) = softmax(W_c h_c^(t) + b_c),

where E is a word embedding matrix, and W_c and b_c are learnable parameters.
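One decoding step can be sketched as follows (vocabulary size, hidden sizes, and the concatenated input layout are illustrative assumptions):

```python
# Sketch of one caption-generator step: embed the previous word, concatenate
# it with the gated visual feature, advance an LSTM cell, and project the
# new hidden state to a softmax distribution over the vocabulary.
import torch
import torch.nn as nn

vocab, hidden = 1000, 512
embed = nn.Embedding(vocab, 300)               # word embedding matrix E
lstm_c = nn.LSTMCell(300 + hidden, hidden)     # caption generator cell
head = nn.Linear(hidden, vocab)                # output projection W_c, b_c

w_prev = torch.tensor([4, 7])                  # previous word ids (batch of 2)
l_vis = torch.randn(2, hidden)                 # gated visual feature
h, c = torch.zeros(2, hidden), torch.zeros(2, hidden)
h, c = lstm_c(torch.cat([embed(w_prev), l_vis], dim=-1), (h, c))
probs = torch.softmax(head(h), dim=-1)         # distribution over next word
```

At inference, greedy decoding simply takes `probs.argmax(-1)` at each step and feeds it back in as `w_prev`.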

Joint Training
We jointly train the POS predictor and the caption generator end-to-end by maximizing the likelihood of the observed POS and word sequences. For the POS predictor, given the target ground-truth POS tags (w_p^1, ..., w_p^m), we minimize its negative log-likelihood loss:

L_pos(θ_p) = − Σ_{t=1}^{m} log P(w_p^t; θ_p),

where θ_p are the parameters of the POS predictor and m is the length of the POS tag sequence. For the caption generator, given the target ground-truth caption words (w_c^1, ..., w_c^m), we minimize its negative log-likelihood loss:

L_cap(θ_c) = − Σ_{t=1}^{m} log P(w_c^t; θ_c),

where θ_c are the parameters of the caption generator and m is the length of the caption. Thus, the final loss combines the two terms.

Experiments

Datasets

CLEVR-Change. This dataset (Park et al., 2019) is a large-scale dataset with a set of basic geometric objects, which consists of 79,606 image pairs and 493,735 captions. The change types cover five cases, i.e., "Color", "Texture", "Add", "Drop", and "Move". We use the official split with 67,660 image pairs for training, 3,976 for validation, and 7,970 for testing.
Spot-the-Diff. This dataset (Jhamtani and Berg-Kirkpatrick, 2018) contains 13,192 well-aligned real image pairs, with one or more changes between the images (but no distractors). Similar to (Park et al., 2019), we only evaluate our model in the single-change setting and split the data into training, validation, and test sets with a ratio of 8:1:1.

Evaluation Metrics
We use five standard metrics to evaluate the quality of the generated sentences, i.e., BLEU-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). All results in this paper are obtained with the Microsoft COCO evaluation server (Chen et al., 2015).

Implementation Details
To extract image features, we use ResNet-101 (He et al., 2016) pre-trained on the ImageNet dataset (Russakovsky et al., 2015). We use features from the convolutional layer with a dimensionality of 1024 × 14 × 14. The hidden size is set to 512 and the number of attention heads in SSRE is set to 4. Words are represented by trainable 300-D word embedding features. POS tags are divided into 16 categories. In the training phase, on CLEVR-Change and Spot-the-Diff, we set the mini-batch size to 128 and 96, respectively, and use the Adam optimizer (Kingma and Ba, 2014) with learning rates of 1 × 10^−3 and 5 × 10^−4, respectively. At inference, a greedy decoding strategy is used to generate target captions. Both training and inference are implemented in PyTorch (Paszke et al., 2019) on a TITAN Xp GPU.

Ablation Studies
To figure out the contribution of each module, we carry out the following ablation studies on CLEVR-Change: (1) Baseline, which is based on DUDA (Park et al., 2019); (2) SSRE, which only embeds the self-semantic relations of objects into their representations; (3) CSRM, which only measures the cross-semantic relations between the captured candidate difference and the original images, where the learned discriminative difference representation is used as prior knowledge to guide the change localizer; (4) SRDRL, the combination of (2) and (3); (5) AVS, which only relies on the POS information to determine when to use visual information and which features to use; (6) SRDRL+AVS, the combination of (4) and (5).
Evaluation on Total Performance. We first study the total performance of each block of the proposed method on the whole dataset, including scene change and non-scene change. Experimental results are shown in Table 1. We observe that each module of the proposed method improves the total performance over the baseline. Moreover, the best performance is achieved when putting them together, which indicates that each block not only plays its unique role, but also supplements the others. This global statistical performance validates the generalization ability of the proposed method, i.e., it can not only explicitly judge whether there is a semantic change between a pair of unaligned images, but also describe the change with an accurate sentence.
Evaluation on Scene Change and Non-scene Change. The experimental results are shown in Table 2. In terms of scene change, we observe that: 1) SSRE, CSRM, and AVS all achieve improvements over the baseline; 2) compared with SSRE, the improvement is relatively small when using CSRM or AVS alone; 3) better performance is achieved by the two combinations (SRDRL and SRDRL+AVS). These results indicate 1) the effectiveness of the proposed SRDRL and its individual blocks, as well as of the AVS; and 2) that the priority of this task is to capture the semantic difference in the image pair: only if the semantic difference is captured sufficiently can the subsequent change localization and caption generation perform well. Besides, although each single block improves the baseline in the case of scene change, it falls below the baseline on one or more metrics in the case of non-scene change. Our conjecture is that the robustness of a single block is relatively weak, so it sometimes misidentifies an illumination or viewpoint change as an actual semantic change. Both combinations (SRDRL and SRDRL+AVS) improve the baseline on all metrics, which indicates that the robustness of our overall model is strong.

Results on CLEVR-Change
On this dataset, we compare with four state-of-the-art methods, Capt-Dual (Park et al., 2019), DUDA (Park et al., 2019), M-VAM (Shi et al., 2020), and M-VAM+RAF (Shi et al., 2020), along four dimensions: 1) the total performance over scene change and non-scene change; 2) scene change only; 3) non-scene change only; 4) specific types of scene change. The comparison results are shown in Tables 3, 4, and 5, respectively.
From Table 3, in terms of total performance, our method clearly achieves significant improvements over them on all evaluation metrics, in particular increases of 34.3% and 7.2% in SPICE, respectively. From Table 4, under both settings, our method outperforms DUDA by a large margin. Furthermore, since M-VAM+RAF did not report results on scene change, we only compare with it in the non-scene-change setting, where it outperforms us in METEOR and CIDEr. This superiority may derive from its reinforcement learning strategy; however, that strategy remarkably increases training time and computational complexity. Moreover, as reported in Table 3, our total performance, evaluated over both scene change and non-scene change, is much better. Hence, compared to them, our method is more robust thanks to the discriminative difference representation learning. Table 5 gives a detailed breakdown of the evaluation across the five change types: "Color" (C), "Texture" (T), "Add" (A), "Drop" (D), and "Move" (M). Specifically, compared to all SOTA methods, our method significantly raises the CIDEr scores on the "Color" and "Texture" types, which indicates that it can better distinguish an attribute change of objects from an illumination change. Besides, for number or position changes of objects ("Add", "Drop", and "Move"), our method outperforms them on most metrics. Especially for SPICE, our method achieves 64.9% and 12.9% improvements in the "Move" case, respectively, which also shows that our method can better distinguish object movement from viewpoint change. Notably, the most challenging change types in this dataset are "Texture" and "Move", because they are most often confused with illumination or viewpoint changes (Park et al., 2019).
These experiments show that our method is more robust than the SOTAs, which benefits from the fact that the CSRM block helps attend to the actual semantic change by measuring the cross-semantic relations between the image pair and their difference.

[Figure 3: A comparative example of the "Move" case from the test set of CLEVR-Change, with the captions generated by the baseline, SRDRL, and SRDRL+AVS. Localization results are visualized on "before" (blue) and "after" (red).
Ground Truth: The tiny blue cylinder changed its location.
Baseline: The small blue matte cylinder that is behind the big blue matte object is no longer there.
SRDRL: The small blue shiny cylinder that is to the left of the tiny green matte thing has been added.
SRDRL+AVS: The small blue metal cylinder that is behind the tiny green metallic object changed its location.]

Results on Spot-the-Diff
To validate the generalization ability of the proposed method, we conduct experiments on the recently published Spot-the-Diff dataset, where the image pairs are mostly well aligned and there is no viewpoint change. We compare with eight SOTA methods, most of which do not consider handling viewpoint changes: DDLA (Jhamtani and Berg-Kirkpatrick, 2018), DUDA (Park et al., 2019), SDCM (Oluwasanmi et al., 2019a), FCC (Oluwasanmi et al., 2019b), static rel-att / dynamic rel-att (Tan et al., 2019), and M-VAM / M-VAM+RAF (Shi et al., 2020).

[Figure 4: Qualitative examples of SRDRL+AVS. Left, a successful case in which SRDRL+AVS localizes the changed object accurately and generates a correct sentence (Ground Truth: "The small blue thing is in a different location." SRDRL+AVS: "The small blue metal cube that is to the right of the large gray matte thing is in a different location."). Right, a failure case in which a slight movement of the object is not detected (Ground Truth: "The large green matte sphere that is behind the purple cylinder is in a different location." SRDRL+AVS: "The scene is the same as before.")]
The results are reported in Table 6. Our method achieves the best performance in terms of METEOR and SPICE. Especially for SPICE, which was recently designed for evaluating image captioning, our method achieves 28.6% and 5.3% improvements over the current SOTA methods M-VAM and M-VAM+RAF, respectively. Hence, compared to the above methods, the captions generated by our method are more in line with the standards of human caption evaluation. This superiority results from the fact that the SSRE block can capture the relation-embedded feature difference and thereby better explore tiny changed objects.

Figure 3 shows a comparative example of the "Move" case from the CLEVR-Change dataset, which includes the change captions generated by humans, the baseline, SRDRL, and SRDRL+AVS. We also visualize the change detection results. The baseline is implemented based on DUDA (Park et al., 2019). We can clearly observe that it localizes a wrong region on the "after" image and thus misidentifies "Move" as "Drop". By contrast, both proposed methods (SRDRL and SRDRL+AVS) accurately localize the moved object on both the "before" and "after" images, which validates the effectiveness of the proposed SRDRL. Moreover, it is interesting to note that, although the change localization results of both proposed methods are accurate, SRDRL alone generates a wrong caption, which indicates that the POS tags of target words indeed guide and regularize change caption generation.

Figure 4 illustrates two examples with viewpoint changes on the CLEVR-Change dataset. The left example is a success in which SRDRL+AVS distinguishes the small blue changed cube from the irrelevant viewpoint change. This benefits from the fact that SRDRL can learn a discriminative difference representation and overcome viewpoint changes. The right example shows a failure, where SRDRL+AVS judges that there is no difference. Our conjecture is that the movement of this sphere is very slight and is thus confused with the viewpoint change. Hence, in future work we will improve our method to learn a more fine-grained difference representation.

Conclusion
In this paper, we propose a semantic relation-aware difference representation learning network (SRDRL) and an attention-based visual switch (AVS) to address change captioning in the presence of distractors. SRDRL explicitly learns the difference representation in the image pair, and AVS helps the caption generator convey the localized change in a logically and grammatically accurate sentence. Extensive experiments on both the CLEVR-Change and Spot-the-Diff datasets show that the proposed method achieves state-of-the-art results.