R^3Net: Relation-embedded Representation Reconstruction Network for Change Captioning

Change captioning aims to use a natural language sentence to describe the fine-grained disagreement between two similar images. Viewpoint change is the most typical distractor in this task, because it changes the scale and location of objects and overwhelms the representation of the real change. In this paper, we propose a Relation-embedded Representation Reconstruction Network (R^3Net) to explicitly distinguish the real change from a large amount of clutter and irrelevant changes. Specifically, a relation-embedded module is first devised to explore potential changed objects among the clutter. Then, based on the semantic similarities of corresponding locations in the two images, a representation reconstruction module (RRM) is designed to learn the reconstruction representation and further model the difference representation. Besides, we introduce a syntactic skeleton predictor (SSP) to enhance the semantic interaction between change localization and caption generation. Extensive experiments show that the proposed method achieves state-of-the-art results on two public datasets.


Introduction
Change captioning aims to generate a natural language sentence to detail what has changed in a pair of similar images. It has many practical applications, such as assisted surveillance, medical imaging, and computer-assisted tracking of changes in media assets (Jhamtani and Berg-Kirkpatrick, 2018; Tu et al., 2021).
Different from single-image captioning (Kim et al., 2019; Jiang et al., 2019; Fisch et al., 2020), change captioning addresses two-image captioning, which requires not only understanding the content of both images, but also describing their disagreement.

Figure 1: Two examples of change captioning about an object move. The first example shows that the viewpoint changes the scale and location of the objects in the "after" image; the second example shows a mostly well-aligned pair of images with underlying illumination changes from surveillance cameras.

As the pioneering work, Jhamtani et al. (Jhamtani and Berg-Kirkpatrick, 2018) described semantic changes between mostly well-aligned image pairs with underlying illumination changes from surveillance cameras. However, they did not consider viewpoint changes, which often happen in a dynamic world; image pairs cannot be well aligned in this case. Hence, the feature shift between two unaligned images adversely affects the learning of the difference representation. To make this task more practical, recent works (Park et al., 2019; Shi et al., 2020) proposed to address change captioning in the presence of viewpoint changes.
Despite this progress, the above state-of-the-art methods have some limitations when modeling the difference representation. First, the object information of each image is only learned at the feature level, which makes it difficult to discriminate the fine-grained difference when the changed object is tiny and surrounded by a large amount of clutter, as shown in Figure 1. In fact, when an object moves, its semantic relations with surrounding objects change as well, and this can help explore the fine-grained change. Thus, it is important to model the difference representation at both the feature and relation levels. Second, directly applying subtraction between a pair of unaligned images (Park et al., 2019) may learn a difference representation with much noise, because the viewpoint changes the scale and location of the objects. However, we can observe that the unchanged objects remain in approximately the same locations. Hence, it is beneficial to reveal the unchanged representation and further model the difference representation based on the semantic similarities in the corresponding locations of the two images.
In this paper, we propose a Relation-embedded Representation Reconstruction Network (R^3Net) to handle viewpoint changes and model the fine-grained difference representation between two images in the process of representation reconstruction. Concretely, for the "before" and "after" images, the relation-embedded module performs semantic relation reasoning among object features via the self-attention mechanism. This enhances the fine-grained representation ability of the original object features. To model the difference representation, a representation reconstruction module (RRM) is designed, where a "shadow" representation ("after" or "before") is used to reconstruct a "source" representation ("before" or "after"). The RRM first leverages every location in the "source" to stimulate the corresponding location in the "shadow" to judge their semantic similarities, i.e., "response signals". Further, under the guidance of these signals, the RRM picks out the unchanged features from the "shadow" as the "reconstruction" representation. The "difference" representation is then computed from the changed features between the "source" and "reconstruction". Next, a dual change localizer is devised to use the difference representation as the query to localize the changed object feature on the "before" and "after" images, respectively. Finally, the localized features are fed into an attention-based caption decoder for caption generation.
Besides, we introduce a Syntactic Skeleton Predictor (SSP) to enhance the semantic interaction between change localization and caption generation. As observed in Figure 1, a caption mainly consists of a set of nouns, adjectives, and verbs. These words convey the main information about the changed object and its surrounding references; we call them syntactic skeletons. The skeletons, which are predicted based on a global semantic representation derived from the R^3Net, can supervise the modeling of the difference representation and provide the decoder with high-level semantic cues about the change type. This makes the learned difference representation more relevant to the target words and enhances the quality of the generated sentences.
The main contributions of this paper are as follows: (1) We propose the R^3Net to learn the fine-grained change among a large amount of clutter and overcome viewpoint changes by embedding semantic relations into object features and performing representation reconstruction with respect to the two images. (2) The SSP is introduced to enhance the semantic interaction between change localization and caption generation by predicting a set of syntactic skeletons based on a global semantic representation derived from the R^3Net. (3) Extensive experiments show that the proposed method outperforms the state-of-the-art approaches by a large margin on two public datasets.

Related Work
Change Captioning. Captioning the change in the presence of viewpoint changes is a novel task in the vision-language community (Tu et al., 2017; Deng et al., 2021). As the first work, DUDA (Park et al., 2019) directly applied subtraction between two images to capture their semantic difference. However, due to viewpoint changes, direct subtraction between an unaligned image pair cannot reliably model the correct change (Shi et al., 2020). Later, M-VAM (Shi et al., 2020) proposed to measure the feature similarity across different regions in an image pair and find the most matched regions as unchanged parts. However, since there are many similar objects, cross-region searching faces the risk of matching the query region with a similar but incorrect region, impacting subsequent change localization. In contrast, in our representation reconstruction network, the prediction of unchanged and changed features is based on the semantic similarities of the corresponding locations in the two images. This avoids the risk of reconstructing the "source" with incorrect parts from the "shadow".
Skeleton Prediction in Captioning. Syntactic skeletons can provide high-level semantic cues (e.g., attribute, class) about objects, so they are widely used in image/video captioning works. These methods either used skeletons as the main information to generate captions (Gan et al., 2017; Dai et al., 2018) or leveraged them to bridge the semantic gap between vision and language (Tu et al., 2020). Although the skeletons played different roles in the above methods, the common point was that they only represented basic information about objects in images or videos. Different from them, besides basic information, we use skeletons to capture the changed information among objects.

Figure 2: The architecture of the proposed method, consisting of a relation-embedded representation reconstruction network, a syntactic skeleton predictor, a dual change localizer, and an attention-based caption decoder.

Methodology
As shown in Figure 2, the architecture of our method consists of four main parts: (1) a relation-embedded representation reconstruction network (R^3Net) to learn the fine-grained change in the presence of viewpoint changes; (2) a dual change localizer to focus on the specific change in a pair of images; (3) a syntactic skeleton predictor (SSP) to learn syntactic skeletons based on a global semantic representation derived from the R^3Net; (4) an attention-based caption decoder to describe the change under the guidance of the learned skeletons.

Relation-embedded Representation Reconstruction Network

Relation-embedded Module
We first exploit a pre-trained CNN model to extract the object-level features X_bef and X_aft for a pair of "before" and "after" images, where X_i ∈ R^{C×H×W} and C, H, W indicate the number of channels, height, and width. However, only utilizing these independent features makes it difficult to distinguish a fine-grained change from a large amount of clutter (similar objects). Related works (Wu et al., 2019; Yin et al., 2020) have shown that capturing semantic relations among objects is useful for a thorough understanding of an image. Motivated by this, we devise a relation-embedded (R_emb) module based on the self-attention mechanism (Vaswani et al., 2017) to implicitly learn semantic relations among objects in each image. Specifically, we first reshape X_i into R^{N×C}, where N = H×W is the number of spatial locations. Then, the semantic relations are embedded into the independent object features of each image based on the scaled dot-product attention:

    Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d) V_i,

where d is the dimension of the keys, and the queries, keys, and values are projections of the object features in X_i with i ∈ {bef, aft}:

    Q_i = X_i W_Q,  K_i = X_i W_K,  V_i = X_i W_V.

Thus, X_bef and X_aft are updated to X̃_bef and X̃_aft, respectively. When the model fully understands the content of each image, it can better capture the fine-grained difference between the image pair in the subsequent representation reconstruction.
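The relation-embedding step above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation (which uses 4 attention heads and learned parameters); all sizes and weight matrices are toy placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_embed(X, Wq, Wk, Wv):
    """Embed semantic relations among object features via
    scaled dot-product self-attention (single head, for clarity).
    X: (N, C) object features, one row per spatial location."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # (N, N) pairwise relation weights
    return A @ V                        # relation-embedded features

rng = np.random.default_rng(0)
C, H, W = 8, 2, 2                       # toy sizes; the paper projects C to 256 with H = W = 14
X = rng.normal(size=(C, H, W)).reshape(C, H * W).T   # reshape to (N, C), N = H*W
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))
X_tilde = relation_embed(X, Wq, Wk, Wv)
print(X_tilde.shape)                    # same shape as the input features
```

Each location now aggregates information from every other location, which is what lets a tiny moved object be represented relative to its neighbors.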

Representation Reconstruction Module
The state-of-the-art method (Park et al., 2019) applied direct subtraction between a pair of unaligned images, which is prone to capturing a noisy difference in the presence of viewpoint changes.
To distinguish the semantic change from viewpoint changes, a representation reconstruction module (RRM) is proposed, where the inputs are a "source" representation X̃_p ∈ R^{N×C} and a "shadow" representation X̃_s ∈ R^{N×C}. Concretely, first, we exploit each location of X̃_p to stimulate the corresponding location of X̃_s. The response degrees of all locations in X̃_s are regarded as the response signals α that measure the semantic similarities between corresponding locations in the two images:

    α = σ(X̃_p W_p ⊙ X̃_s W_s + b_s),

where ⊙ denotes element-wise multiplication, σ is the sigmoid function, W_p, W_s ∈ R^{C×C}, and b_s ∈ R^C. Second, we use X̃_s to reconstruct X̃_p under the guidance of the response signals α:

    X̂_p = α ⊙ X̃_s,

where X̂_p ∈ R^{N×C} is the "reconstruction" representation, which represents the unchanged features with respect to the "source". Finally, the "difference" representation is captured by subtracting the "reconstruction" X̂_p from the "source" X̃_p:

    X_diff^p = X̃_p − X̂_p.

Since the unchanged and changed features predicted in uni-directional reconstruction are only with respect to one kind of "source" representation (e.g., "before"), the model cannot predict the changed feature when it is not in the "source". An effective model should capture all underlying changes with respect to both images. To this end, we extend the RRM from uni-direction to bi-direction: we first use the "before" as the "source" to predict unchanged and changed features, and then use the "after" as the "source" to do so. Thus, the "reconstruction" and "difference" w.r.t. the "before" and "after" are formulated as:

    X̂_bef, X_diff^bef = RRM(X̃_bef, X̃_aft),   X̂_aft, X_diff^aft = RRM(X̃_aft, X̃_bef).

Finally, we obtain a bi-directional difference representation by a fully-connected layer:

    X_diff = FC([X_diff^bef ; X_diff^aft]).
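The bi-directional reconstruction can be sketched as below. This is an illustrative NumPy sketch consistent with the stated parameter shapes (the exact gating form of the response signal is an assumption here); weight matrices and sizes are toy placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rrm(Xp, Xs, Wp, Ws, bs):
    """One direction of representation reconstruction.
    Xp: (N, C) "source"; Xs: (N, C) "shadow".
    Returns (reconstruction, difference)."""
    alpha = sigmoid((Xp @ Wp) * (Xs @ Ws) + bs)  # per-location response signals
    X_rec = alpha * Xs                           # unchanged parts picked from the "shadow"
    X_dif = Xp - X_rec                           # changed parts w.r.t. the "source"
    return X_rec, X_dif

rng = np.random.default_rng(1)
N, C = 4, 8
Xb, Xa = rng.normal(size=(N, C)), rng.normal(size=(N, C))
Wp, Ws = rng.normal(size=(C, C)), rng.normal(size=(C, C))
bs = np.zeros(C)
# bi-direction: each image takes a turn as the "source"
_, dif_b = rrm(Xb, Xa, Wp, Ws, bs)
_, dif_a = rrm(Xa, Xb, Wp, Ws, bs)
Wd = rng.normal(size=(2 * C, C))                 # stands in for the FC fusion layer
X_diff = np.concatenate([dif_b, dif_a], axis=-1) @ Wd
print(X_diff.shape)                              # (N, C) bi-directional difference
```

Because the signal is computed per corresponding location, no cross-region search is performed, which is the design choice that avoids matching a region to a similar but incorrect region elsewhere.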

Dual Change Localizer
When the bi-directional difference representation X_diff is computed, we exploit it as the query to localize the changed feature in X̃_bef and X̃_aft, respectively. Specifically, the dual change localizer first predicts two separate attention maps a_bef and a_aft:

    a_bef = σ(conv([X̃_bef ; X_diff])),   a_aft = σ(conv([X̃_aft ; X_diff])),

where [;], conv, and σ denote concatenation, a convolutional layer, and the sigmoid activation function, respectively. Then, the changed features l_bef and l_aft are localized via applying a_bef and a_aft to X̃_bef and X̃_aft:

    l_i = Σ_n a_i,n ⊙ X̃_i,n,   i ∈ {bef, aft}.

Finally, we compute the local difference feature w.r.t. both l_bef and l_aft from the two directions:

    l_diff^{b↔a} = [l_aft − l_bef ; l_bef − l_aft].
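A minimal sketch of the localizer follows; a 1×1 convolution over spatial locations is equivalent to a per-location linear map, so it is written that way here. The weight shapes and the concatenation order of the two-direction local difference are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def localize(X, X_diff, Wc):
    """Predict a spatial attention map from [X; X_diff] and pool
    the changed feature. X, X_diff: (N, C); Wc: (2C, 1) stands in
    for a 1x1 convolution (a per-location linear map)."""
    a = sigmoid(np.concatenate([X, X_diff], axis=-1) @ Wc)  # (N, 1) attention map
    return (a * X).sum(axis=0)                              # (C,) localized feature

rng = np.random.default_rng(2)
N, C = 4, 8
Xb, Xa, Xd = (rng.normal(size=(N, C)) for _ in range(3))
Wc = rng.normal(size=(2 * C, 1))
l_bef, l_aft = localize(Xb, Xd, Wc), localize(Xa, Xd, Wc)
# local difference feature from both directions
l_diff = np.concatenate([l_aft - l_bef, l_bef - l_aft])
print(l_bef.shape, l_diff.shape)
```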

Syntactic Skeleton Predictor
A syntactic skeleton predictor (SSP) is introduced to learn a set of syntactic skeletons based on the outputs derived from the R 3 Net. The predicted skeletons can provide the caption decoder with high-level semantic cues about changed objects and supervise the modeling of difference representation. This aims to enhance the semantic interaction between change localization and caption generation.
Inspired by (Gan et al., 2017), we treat this problem as a multi-label classification task. Suppose there are N training image pairs, and y_j = [y_j1, ..., y_jK] ∈ {0, 1}^K is the label vector of the j-th image pair, where y_jk = 1 if the image pair is annotated with the skeleton k, and y_jk = 0 otherwise. Specifically, first, we apply a mean-pooling layer over the concatenated semantic representations of X̃_bef, X̃_aft, and X_diff to obtain a global semantic representation S_j:

    S_j = MeanPool([X̃_bef ; X̃_aft ; X_diff]).

Then, the probability scores p_j of all syntactic skeletons for the j-th image pair are computed by:

    p_j = σ(W_k S_j + b_k),

where p_j = [p_j1, ..., p_jK] denotes the probability scores of the K skeletons for the j-th image pair. To maximize the probability scores of the annotated syntactic skeletons, we use the multi-label loss to optimize the SSP:

    L_ssp = −(1/N) Σ_{j=1}^{N} Σ_{k=1}^{K} [ y_jk log p_jk + (1 − y_jk) log(1 − p_jk) ],

where N and K indicate the number of all training samples and the number of annotated skeletons of an image pair, respectively. The loss can be considered as a supervision signal that regularizes the learning of the difference representation in the R^3Net.
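The multi-label skeleton loss is standard binary cross-entropy over K independent labels; a small NumPy sketch with toy sizes (the paper uses K = 50 skeletons):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ssp_loss(S, y, Wk, bk):
    """Multi-label skeleton loss over a batch.
    S: (N, D) global semantic representations; y: (N, K) binary labels."""
    p = sigmoid(S @ Wk + bk)                 # (N, K) skeleton probabilities
    eps = 1e-12                              # numerical safety for log
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(3)
Nb, D, K = 2, 8, 5                           # toy batch, feature dim, skeleton count
S = rng.normal(size=(Nb, D))
y = (rng.random(size=(Nb, K)) > 0.5).astype(float)
loss = ssp_loss(S, y, rng.normal(size=(D, K)), np.zeros(K))
print(loss > 0)                              # a positive scalar loss
```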

Skeleton-guided Caption Generation
Since the predicted skeletons are explicit semantic concepts of the changed object and its surrounding references, the captions are generated under their guidance. Specifically, first, the predicted probability scores p_j are embedded as a skeleton feature E[p_j]:

    E[p_j] = (p_j E_q) W_q + b_q,

where E_q ∈ R^{K×M} is a skeleton embedding matrix and M is the dimension of the skeleton feature; W_q ∈ R^{M×M} and b_q ∈ R^M are the parameters to be learned. Then, we exploit a semantic attention module to focus on the key semantic feature from l_bef, l_aft, and l_diff^{b↔a} that is relevant to the target word:

    l^{(t)} = Σ_i β_i^{(t)} l_i,   i ∈ {bef, aft, diff},

where the weights β^{(t)} are computed by an attention LSTM^a under the guidance of the predicted skeleton feature E[p_j]:

    h_a^{(t)} = LSTM^a([E[p_j]; l_bef; l_aft; l_diff^{b↔a}], h_a^{(t−1)}),
    β^{(t)} = softmax(W_2^a tanh(W_1^a h_a^{(t)} + b_1^a) + b_2^a),

where W_1^a, b_1^a, W_2^a, and b_2^a are learnable parameters and h_a^{(t)} is the hidden state of the attention LSTM. Finally, the caption generation process is also guided by the predicted skeleton feature. We feed it, the attended visual feature, and the previous word embedding into the caption decoder LSTM^c to predict a series of distributions over the next word:

    h_c^{(t)} = LSTM^c([E[p_j]; l^{(t)}; E[w_{t−1}]], h_c^{(t−1)}),
    P(w_t | w_{<t}) = softmax(W_c h_c^{(t)} + b_c),

where E is a word embedding matrix; W_c and b_c are learnable parameters.
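The skeleton-guided semantic attention step can be sketched as below: a score is computed for each of the three localized features from the feature concatenated with a guidance vector, then the features are mixed by the softmax weights. The scoring network and all shapes are illustrative assumptions (the paper conditions the scores on an attention LSTM state).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def semantic_attend(feats, guide, W1, b1, W2, b2):
    """Weight l_bef, l_aft, l_diff by scores computed from each
    feature concatenated with a guidance vector (the skeleton feature).
    feats: list of (C,) vectors; guide: (G,)."""
    scores = np.array([
        (np.tanh(np.concatenate([f, guide]) @ W1 + b1) @ W2 + b2).item()
        for f in feats
    ])
    beta = softmax(scores)                       # attention over the 3 features
    return sum(b * f for b, f in zip(beta, feats)), beta

rng = np.random.default_rng(4)
C, G, Hd = 8, 6, 10                              # toy feature, guide, hidden sizes
feats = [rng.normal(size=C) for _ in range(3)]   # l_bef, l_aft, l_diff
guide = rng.normal(size=G)                       # stands in for E[p_j]
W1, b1 = rng.normal(size=(C + G, Hd)), np.zeros(Hd)
W2, b2 = rng.normal(size=(Hd, 1)), np.zeros(1)
l_t, beta = semantic_attend(feats, guide, W1, b1, W2, b2)
print(np.isclose(beta.sum(), 1.0))               # weights form a distribution
```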

Joint Training
We jointly train the caption decoder and the SSP in an end-to-end manner. For the SSP, the multi-label loss L_ssp is minimized. For the decoder, given the target ground-truth words (w_1, ..., w_m), we minimize its negative log-likelihood loss:

    L_cap(θ_c) = −Σ_{t=1}^{m} log P(w_t | w_{<t}; θ_c),

where θ_c are the parameters of the decoder and m is the length of the caption. The final loss function is:

    L = L_cap + λ L_ssp,

where the hyper-parameter λ seeks a trade-off between the decoder and the SSP.
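The joint objective is a simple weighted sum; a tiny worked example (the per-word probabilities are made up for illustration, and λ = 0.1 as in the implementation details):

```python
import numpy as np

def joint_loss(logp_words, ssp_loss_value, lam=0.1):
    """Joint objective: negative log-likelihood of the ground-truth
    caption words plus a lambda-weighted skeleton loss."""
    l_cap = -np.sum(logp_words)       # caption NLL
    return l_cap + lam * ssp_loss_value

# toy decoder probabilities for a 3-word caption
logp = np.log(np.array([0.5, 0.25, 0.8]))
total = joint_loss(logp, ssp_loss_value=2.0, lam=0.1)
print(round(total, 4))                # -ln(0.5*0.25*0.8) + 0.1*2.0 = ln(10) + 0.2
```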

Datasets and Evaluation Metrics
CLEVR-Change dataset (Park et al., 2019) is a large-scale dataset with a set of basic geometric objects, consisting of 79,606 image pairs and 493,735 captions. The change types can be categorized into six cases, i.e., "Color", "Texture", "Add", "Drop", "Move", and "Distractors" (e.g., viewpoint change). We use the official split with 67,660 pairs for training, 3,976 for validation, and 7,970 for testing.
Spot-the-Diff dataset (Jhamtani and Berg-Kirkpatrick, 2018) contains 13,192 well-aligned image pairs from surveillance cameras. Based on the official split, the dataset is split into training, validation, and testing with a ratio of 8:1:1.

Implementation Details
We use ResNet-101 (He et al., 2016) pre-trained on the ImageNet dataset (Russakovsky et al., 2015) to extract object features, with a dimension of 1024 × 14 × 14. We project these features into a lower dimension of 256. The hidden size of the overall model is set to 512 and the number of attention heads in the relation-embedded module is set to 4. The number of skeletons in an image pair is set to 50. The dimension of word embeddings is set to 300. For the hyper-parameter λ, we empirically set it to 0.1. In the training phase, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1 × 10^−3, and set the mini-batch size to 128 and 64 on CLEVR-Change and Spot-the-Diff, respectively. At inference, for fair comparison, we follow the pioneering works (Park et al., 2019; Jhamtani and Berg-Kirkpatrick, 2018) on the two datasets and use the greedy decoding strategy for caption generation. Both training and inference are implemented with PyTorch (Paszke et al., 2019) on a Tesla P100 GPU.

Ablation Studies
To figure out the contribution of each module of the proposed network, we conduct the following ablation studies on CLEVR-Change: (1) Baseline, which is based on DUDA (Park et al., 2019); (2) RRM, which is the representation reconstruction module; (3) R^3Net, which augments the RRM with a relation-embedded module; (4) R^3Net+SSP, which augments the R^3Net with a syntactic skeleton predictor.
Evaluation on Total Performance. Total performance simultaneously evaluates the model under both scene change and non-scene change. Experimental results are shown in Table 1. We can observe that each module and the full method improve the total performance over the Baseline. This indicates that our method not only can correctly judge whether there is a semantic change between a pair of images, but also can describe the change with an accurate natural language sentence.
Evaluation under the Scene Change and Non-scene Change Settings. In the setting of scene change, both object and viewpoint changes happen. In the setting of non-scene change, there are only distractors, such as viewpoint change and illumination change. The experimental results are shown in Table 2. Under the setting of scene change, we can observe that 1) the RRM, R^3Net, and R^3Net+SSP all significantly improve the Baseline; 2) the R^3Net is much better than the RRM; 3) the best performance is achieved when augmenting the R^3Net with the SSP. These observations indicate that 1) compared to direct subtraction between a pair of unaligned images, it is effective to capture the difference representation via the R^3Net, because it can overcome the distraction of viewpoint change; 2) learning semantic relations among object features is important, because these relations enrich the raw object features, which helps explore fine-grained changes; 3) the SSP can enhance the semantic interaction between change localization and caption generation, and thus further improve the quality of the generated sentences.
Besides, under the setting of non-scene change, we observe that the RRM is worse than the Baseline on some metrics. Our conjecture is that, on one hand, due to the large amount of clutter and representing the image pair only at the feature level, the RRM cannot learn the exact semantic similarities of corresponding locations in the two images, so it performs worse on some metrics. On the other hand, the Baseline learns a coarse difference representation between two unaligned images by direct subtraction, so it is prone to learning a wrong change type or simply judging that nothing has changed. This explains why the Baseline performs worse than the RRM on total performance and scene change, but achieves higher scores than the RRM on some metrics under non-scene change. In fact, when embedding semantic relations among object features, the R^3Net outperforms the Baseline in both settings. This further indicates that it is beneficial to thoroughly understand image content by modeling semantic relations among object features.
From Table 3 and Table 4, under the two settings and total performance, we observe that our method surpasses Capt-Dual and DUDA by a large margin. Compared to M-VAM+RAF, the total performance of our method is much better, which indicates that our method is more robust. As shown in Table 4, under the setting of non-scene change, M-VAM+RAF outperforms our method on METEOR and CIDEr. This could be a benefit of reinforcement learning, which, however, also sharply increases the training time and computational complexity. Table 5 reports the results on specific change types. Among the five change types, the most challenging are "Texture" and "Move", because they are easily confused with irrelevant illumination or viewpoint changes. Compared to the SOTA methods, our method achieves excellent performance under both change types. This shows that our method can better distinguish the attribute change or movement of objects from illumination or viewpoint changes.
Hence, compared to the current SOTA methods along different dimensions, the generalization ability of our method is much better. This benefits from two merits: 1) the R^3Net can learn the fine-grained change and overcome viewpoint changes in the process of representation reconstruction; 2) the SSP can enhance the semantic interaction between change localization and caption generation.

Results on Spot-the-Diff Dataset
The image pairs in this dataset are mostly well aligned. We compare with eight SOTA methods, most of which do not consider handling viewpoint changes: DDLA (Jhamtani and Berg-Kirkpatrick, 2018), DUDA (Park et al., 2019), SDCM (Oluwasanmi et al., 2019a), FCC (Oluwasanmi et al., 2019b), static rel-att / dynamic rel-att (Tan et al., 2019), and M-VAM / M-VAM+RAF (Shi et al., 2020).

Figure 3: An example of the "Move" case from the test set of CLEVR-Change, showing the captions generated by humans (Ground Truth: "The blue metal ball is in a different location."), DUDA, the current SOTA method ("The small blue metal ball that is behind the tiny yellow rubber thing has been newly placed."), and R^3Net+SSP ("The small blue metal sphere that is behind the small yellow rubber object is in a different location."). We also visualize the predicted syntactic skeletons and the localization results on the "before" (blue) and "after" (red) images.
The results are shown in Table 6. We can observe that when trained without reinforcement learning, our method achieves the best performance on METEOR, ROUGE-L, and SPICE. Compared to M-VAM+RAF, which is trained with the reinforcement learning strategy, our method still outperforms it on METEOR and SPICE. Since there is no viewpoint change in this dataset, the superiority mainly results from the facts that the relation-embedded module can enhance the fine-grained representation ability of object features, and the syntactic skeleton predictor can enhance the semantic interaction between change localization and caption generation.

Figure 3 shows an example of the "Move" case from the test set of CLEVR-Change. We can observe that DUDA localizes a wrong region on the "before" image and thus misidentifies "Move" as "Add". By contrast, R^3Net+SSP accurately locates the moved object on the "before" and "after" images, which benefits from two merits. First, the R^3Net is able to localize the fine-grained change in the presence of viewpoint changes. Second, the SSP can predict the key skeletons based on the representations of the image pair and their difference learned from the R^3Net. For instance, the skeletons "changed" and "location" have higher probability scores than "newly" and "placed". This provides the decoder with high-level semantic cues to generate the correct sentence.

Figure 4: Two cases about "Move". Ground Truth (left): "The rubber cylinder is in a different location." Ground Truth (right): "The grey cylinder changed its location." R^3Net+SSP: "The small grey matter cylinder that is behind the big gray shiny thing moved." The left is a successful case in which R^3Net+SSP localizes the accurate changed object and generates a correct sentence to describe the change. The right is a failure case in which a slight movement of the object is not correctly described.

Figure 4 illustrates two cases about "Move". In the left example, R^3Net+SSP successfully distinguishes the changed object (i.e., the small grey cylinder) and predicts accurate skeletons with high probability scores. The right example is a failure case. In general, we can observe that the grey cylinder is localized and the main skeletons are predicted, which indicates that the R^3Net learns a reliable difference representation. However, the decoder still generates a wrong sentence. The reason behind this failure may be that the movement of the cylinder is very slight, so the decoder receives only weak change information (including the skeletons).
In our opinion, a possible solution to this challenge is to model position information for object features, which would enhance their position representation ability and help localize slight movements.

Conclusion
In this paper, we propose a relation-embedded representation reconstruction network (R^3Net) and a syntactic skeleton predictor (SSP) to address change captioning in the presence of viewpoint changes, where the R^3Net explicitly distinguishes semantic changes from viewpoint changes and the SSP enhances the semantic interaction between change localization and caption generation. Extensive experiments show that state-of-the-art results are achieved on two public datasets, CLEVR-Change and Spot-the-Diff.