Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning

Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs, but only with images and sentences drawn from different sources and object labels detected from the images. In previous work, pseudo-captions, i.e., sentences that contain the detected object labels, were assigned to a given image. The focus of the previous work was on the alignment of input images and pseudo-captions at the sentence level. However, pseudo-captions contain many words that are irrelevant to a given image. In this work, we investigate how such mismatched words make this task difficult by removing them from image-sentence alignment. We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions: the detected object labels. The experimental results show that our proposed method outperforms the previous methods without introducing complex sentence-level learning objectives. Combined with the sentence-level alignment method of previous work, our method further improves its performance. These results confirm the importance of careful alignment in word-level details.


Introduction
Image captioning is the task of describing images in natural language. This is a fundamental challenge with regard to automatically retrieving and summarizing visual information in a human-readable form. Recently, considerable progress has been made (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018b) owing to the development of neural networks and a large number of annotated image-sentence pairs (Young et al., 2014; Lin et al., 2014; Krishna et al., 2017). However, these pairs are limited in their coverage of scenes, and scaling them is difficult owing to the cost of manual annotation.
Unsupervised image captioning (Feng et al., 2019) aims to describe scenes that have no corresponding image-sentence pairs, without requiring additional annotation of the pairs. The only available resources are images and sentences drawn from different sources and object labels detected from the images. Although it is highly challenging, unsupervised image captioning has the potential to cover a broad range of scenes by exploiting a large number of images and sentences that are not paired by expensive manual annotation.
To train a captioning model in this setting, previous work (Feng et al., 2019; Laina et al., 2019) employed sentences that contained the object labels detected from given images. We refer to these sentences as pseudo-captions. However, pseudo-captions are problematic in that they are likely to contain words that are irrelevant to the given images. Assume that an image contains two objects, cat and girl (Figure 1). This situation could give rise to various possible pseudo-captions, e.g., "a girl is holding a cat," "a cat is sleeping with a girl," "a girl is running with a cat." When the first sentence is the correct caption of the image, the words sleeping and running in the other sentences are irrelevant to the image. As the detected object labels provide insufficient information to judge which sentence corresponds to the image, many pseudo-captions containing such mismatched words can be produced.

Figure 1: Overview of our model. The input is listed on the left-hand side: an image, its detected object labels, and its pseudo-captions. The model learns to generate the pseudo-captions while considering the correspondence between the image and each word being generated. The detailed process is shown in the blue box on the right-hand side. The base encoder-decoder model output h_t, a gate value g_t, and a pseudo-label f_t on the gate are described in Sections 2.1, 2.2, and 2.3, respectively. The dashed arrows indicate the processes conducted only during training.
Despite this problem with pseudo-captions, previous work (Feng et al., 2019; Laina et al., 2019) did not explicitly remove word-level mismatches. Instead, they tried to align the features of images and their pseudo-captions at the sentence level. Although this approach could align the images and sentences correctly if there were sentences that exactly described each image, this is not likely to hold for images and sentences retrieved from different sources.
To shed light on the problem of word-level spurious alignment in the previous work, we focus on removing mismatched words from image-sentence alignment. To this end, we introduce a simple gating mechanism that is trained to exclude image features when generating words other than the most reliable words in pseudo-captions: the detected objects. The experimental results show that the proposed method outperforms previous methods without introducing complex sentence-level learning objectives. Combined with the sentence-level alignment method of previous work, our method further improves its performance. These results confirm the importance of careful alignment in word-level details.

Method
Our model comprises a sequential encoder-decoder model, a gating mechanism on the encoder-decoder model, a pseudo-label on the gating mechanism, and a decoding rule to avoid the repetition of object labels, as presented in Figure 1.

Base Encoder-Decoder Model
Typical supervised encoder-decoder captioning models maximize the following objective function during training:

$$\log p(y \mid I; \theta) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, I; \theta), \quad (1)$$

where θ are the parameters of the models, I is a given image, and y = y_1, ..., y_T is its corresponding caption; the last token y_T is a special end-of-sentence token. However, in unsupervised image captioning, the corresponding caption y is not available. Instead, object labels in given images are provided by pretrained object detectors. Previous work utilized the detected object labels to assign a roughly corresponding caption ŷ, i.e., a pseudo-caption, to the given image. Following the previous work, we define the pseudo-captions of an image as sentences containing the object labels detected from the image. Given the pseudo-caption ŷ, our base encoder-decoder model maximizes the following objective function:

$$\log p(\hat{y} \mid I; \theta) = \sum_{t=1}^{T} \log p(\hat{y}_t \mid \hat{y}_{<t}, I; \theta). \quad (2)$$

In encoder-decoder captioning models, the probability p(y | I) is auto-regressively factorized as p(y | I) = ∏_{t=1}^{T} p(y_t | y_{<t}, I), and each p(y_t | y_{<t}, I) is computed by a single step of a recurrent neural network (RNN). The encoder encodes the given image I into an image representation v ∈ R^d that is fed to the decoder as an initial input to generate a sequence of words auto-regressively. The detailed computation of p(ŷ_t | ŷ_{<t}, I) is as follows:

$$v = W_a \, \mathrm{Enc}(I), \quad (3)$$
$$x_1 = v, \qquad x_t = W_e \, \Pi(\hat{y}_{t-1}) \ (t > 1), \quad (4)$$
$$h_t = \mathrm{Dec}(h_{t-1}, x_t), \quad (5)$$
$$p(\hat{y}_t \mid \hat{y}_{<t}, I) = \mathrm{softmax}(W_o^{\top} h_t), \quad (6)$$

where Enc(·) is a pre-trained image encoder with a linear transformation matrix W_a ∈ R^{d×d} on top of it, Dec(·) is an RNN decoder, Π(·) is the one-hot encoding function, h_0 ∈ R^d is a zero vector, Y is the whole vocabulary, and W_e, W_o ∈ R^{d×|Y|} are the word embedding matrices. Details of the encoder and decoder are provided in Section 3.2.
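The computation above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: a simple tanh RNN stands in for the LSTM/GRU decoder, and the class and method names are our own.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum()

class BaseCaptioner:
    """Sketch of the base step: project the image, run one RNN step, emit word probabilities."""
    def __init__(self, d, vocab_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W_a = rng.normal(scale=0.1, size=(d, d))           # image projection W_a
        self.W_e = rng.normal(scale=0.1, size=(d, vocab_size))  # input word embeddings W_e
        self.W_o = rng.normal(scale=0.1, size=(d, vocab_size))  # output word embeddings W_o
        self.U = rng.normal(scale=0.1, size=(d, d))             # recurrent weights (tanh-RNN stand-in)
        self.V = rng.normal(scale=0.1, size=(d, d))             # input weights

    def image_repr(self, feat):
        return self.W_a @ feat                                  # v = W_a Enc(I)

    def embed(self, word_id):
        return self.W_e[:, word_id]                             # W_e Π(ŷ_{t-1})

    def step(self, h_prev, x_t):
        h_t = np.tanh(self.U @ h_prev + self.V @ x_t)           # h_t = Dec(h_{t-1}, x_t)
        p_t = softmax(self.W_o.T @ h_t)                         # p(ŷ_t | ŷ_<t, I)
        return h_t, p_t
```

At t = 1 the decoder receives x_1 = v with h_0 = 0; afterwards it receives the embedding of the previously generated word.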

Gating Mechanism to Consider Word-Level Correspondence
As indicated in Eq. 2, our base encoder-decoder model decodes all of the words in pseudo-captions from the images. However, pseudo-captions are highly likely to contain words that are irrelevant to the given images. Thus, forcing a model to decode the pseudo-captions in their entirety from the images might be more disadvantageous than beneficial for training precise captioning models. To enable our model to handle word-level mismatches, we introduce a simple gating mechanism. Our model, equipped with this gating mechanism, takes an image representation at each t-th time step. The gate is designed to control the amount of image representation used to generate the t-th word. In other words, the gate is expected to determine the extent to which the given image corresponds to the t-th word. With a slight modification to Eq. 6, our model with the gating mechanism is defined as follows:

$$g_t = \sigma\!\left( (W_k h_t)^{\top} \frac{W_v v}{\lVert W_v v \rVert_2} \right), \quad (7)$$
$$r_t = g_t \, \frac{W_v v}{\lVert W_v v \rVert_2} + (1 - g_t) \, h_t, \quad (8)$$
$$p(\hat{y}_t \mid \hat{y}_{<t}, I) = \mathrm{softmax}(W_o^{\top} r_t), \quad (9)$$

where W_k, W_v ∈ R^{d×d} are the linear transformation matrices for computing the gate value g_t ∈ [0, 1] and the output of the gate r_t ∈ R^d, and σ(·) is the sigmoid function. When g_t is close to one, the model uses more information from the image (v) to generate the t-th word; when g_t is close to zero, it does the opposite. The fed image representation W_v v is kept constant at every time step t. Thus, even when the t-th word is correctly pictured in the image I, W_v v itself cannot determine which specific object in the image should be generated according to the current context in the output caption. Therefore, we apply L2 normalization to the image representation in Eq. 8 to ensure that a relatively greater amount of the contextual information (h_t) is used.
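One plausible realization of the gated step is sketched below. The exact parameterization of g_t is our assumption (the text specifies only W_k, W_v and the L2 normalization), so this is a sketch rather than the authors' exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(h_t, v, W_k, W_v, W_o):
    """Gate the image representation against the decoder state before predicting a word."""
    u = W_v @ v
    u = u / np.linalg.norm(u)             # L2-normalize the image representation
    g_t = sigmoid((W_k @ h_t) @ u)        # scalar gate value in (0, 1)
    r_t = g_t * u + (1.0 - g_t) * h_t     # mix image and contextual information
    z = W_o.T @ r_t
    p_t = np.exp(z - z.max())
    p_t /= p_t.sum()                      # softmax over the vocabulary
    return g_t, r_t, p_t
```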
To train our model with the gating mechanism, we minimize the following cross-entropy loss for each pair of an image and its pseudo-caption:

$$L_{cap} = - \sum_{t=1}^{T} \log p(\hat{y}_t \mid \hat{y}_{<t}, I). \quad (10)$$

Pseudo-Labels on Gate to Remove Word-Level Spurious Alignment
The above gate is expected to reflect the correspondence between images and words in pseudo-captions. However, learning to reflect the correspondence correctly is difficult for the gate under the noisy and weak supervision of pseudo-captions. In this work, our focus is to remove the spurious alignment between images and words in pseudo-captions. Consequently, we apply the following rule, which largely suppresses the use of image representations: g_t should be close to one if the t-th word to generate is a detected object label; otherwise, it should be close to zero. This is based on the assumption that, given an image and its pseudo-caption, the only reliable words in the pseudo-caption are the detected object labels, and the others are likely to be irrelevant to the image.
We assign a pseudo-label f_t ∈ {0, 1} to the gate: f_t = 1 if the word ŷ_t corresponds to any of the object labels detected from a given image; otherwise, f_t = 0. The gate then learns the correspondence by minimizing the following loss function:

$$L_{gate} = - \sum_{t=1}^{T} \left( \alpha \, f_t \log g_t + (1 - f_t) \log (1 - g_t) \right), \quad (11)$$

where α is the weight to emphasize the loss when f_t = 1. A relatively large value is recommended for α to prevent g_t from always being zero because the number of detected object labels (where f_t = 1) in pseudo-captions is generally smaller than the number of the other words (where f_t = 0). Combined with the loss function of Eq. 10, the final loss function is defined as follows:

$$L = L_{cap} + L_{gate}. \quad (12)$$
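The pseudo-labeling and the weighted gate loss can be sketched as follows; function names are illustrative, and the default α = 16 follows the value reported in the experiments.

```python
import numpy as np

def pseudo_labels(caption_words, detected_labels):
    """f_t = 1 if the t-th word is a detected object label, else 0."""
    return np.array([1.0 if w in detected_labels else 0.0 for w in caption_words])

def gate_loss(g, f, alpha=16.0, eps=1e-8):
    """Weighted binary cross-entropy on the gate; alpha emphasizes the f_t = 1 terms."""
    g = np.clip(g, eps, 1.0 - eps)
    return -np.sum(alpha * f * np.log(g) + (1.0 - f) * np.log(1.0 - g))

def total_loss(word_probs, target_ids, g, f, alpha=16.0):
    """Cross-entropy over the pseudo-caption plus the gate loss."""
    cap = -np.sum(np.log([word_probs[t][i] for t, i in enumerate(target_ids)]))
    return cap + gate_loss(g, f, alpha)
```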

Unique-Object Decoding
An evaluation of our model revealed that it tends to repeat words in object categories. Although repetition is common in encoder-decoder models, this repetition arises from a different cause. As mentioned in Section 2.2, the image representation v cannot correctly predict the word y_t without the context representation h_t; if the gate value g_t is exactly one, the model always outputs the most salient object label in the given image.
To avoid ignoring contextual information, we applied a simple decoding rule during evaluation. When the model generates a word y_t at the t-th time step, our decoding rule checks whether y_t is in the predefined object categories, i.e., the object categories defined for the object detectors. If y_t is found in the object categories, the rule forces the probability of generating y_t to be zero in the subsequent time steps.
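A minimal sketch of this rule combined with greedy search; the `next_probs` interface and all names are our assumptions.

```python
import numpy as np

def unique_object_greedy_decode(next_probs, object_ids, max_len=20, eos_id=0):
    """Greedy decoding that forbids re-emitting any already-generated object-category word.

    next_probs(prefix) -> probability vector over the vocabulary for the next word.
    """
    out, banned = [], set()
    for _ in range(max_len):
        p = np.array(next_probs(out), dtype=float)
        for w in banned:
            p[w] = 0.0                       # zero out already-emitted object labels
        w = int(np.argmax(p))                # greedy search
        if w == eos_id:
            break
        out.append(w)
        if w in object_ids:
            banned.add(w)                    # ban this object label from now on
    return out
```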

Experiments
We ran the experiments under two different settings, those of Feng et al. (2019) and Laina et al. (2019), for a fair comparison with each. For brevity, we refer to the settings of Feng et al. (2019) and Laina et al. (2019) as setting A and setting B, respectively. The differences between the settings are clarified in Table 1.

Datasets
Evaluation Set. To evaluate our proposed method, we used the MS COCO dataset (Lin et al., 2014) with the validation/test split defined by Karpathy and Fei-Fei (2015). Each split has 5,000 images and five reference captions for each image.

Training Images. We used the images (without their captions) in the remaining training split of MS COCO (113,286 images) and a pre-trained object detector (Huang et al., 2017) to retrieve the object labels found in the images. The detector is a publicly available Faster-RCNN model (Ren et al., 2015). The training data of the object detector differs depending on the previous work; thus, we used the object detector trained on OpenImages-v2 (Krasin et al., 2017) to compare with Feng et al. (2019) and that trained on OpenImages-v4 (Kuznetsova et al., 2020) to compare with Laina et al. (2019). Note that these object detectors were not trained on MS COCO images. Following the previous work, we refrained from using the detected bounding boxes and their features.

Training Text. Following the previous work, we used the Shutterstock image description corpus (SS) (Feng et al., 2019) in setting A and the Conceptual Captions corpus (GCC) in setting B.

Implementation Details
Image Encoder. For a fair comparison with the previous work, we employed different image encoders depending on the compared method: Inception-v4 (Szegedy et al., 2017) in the setting of Feng et al. (2019) and ResNet-101 (He et al., 2016a,b) in the setting of Laina et al. (2019). Both image encoders were pre-trained on ImageNet (Russakovsky et al., 2015) and are publicly available. The parameters of the image encoder were fixed during training and prediction.

Text Decoder. Similar to the image encoder, we used a different RNN as our decoder: an LSTM (Hochreiter and Schmidhuber, 1997) and a GRU (Cho et al., 2014) to compare with Feng et al. (2019) and Laina et al. (2019), respectively. Following the previous work, the dimension of the hidden layer was set to 512 for the LSTM and 200 for the GRU. The number of RNN layers was set to one. Word embeddings were randomly initialized and had the same dimensions as the RNN hidden layer.

Pseudo-Captions. Captions tend to describe salient objects, not all detected objects. For example, the frequent object person often co-occurs with face and clothing in images, but these three are not always the salient objects to be described in a caption. To avoid collecting pseudo-captions that only contain these frequent objects, we used each detected object and each pair of detected objects to retrieve pseudo-captions, rather than using all detected objects at once. In this retrieval, we converted object labels to their plural forms using a dictionary used in Feng et al. (2019) so that the pseudo-captions could also cover the plural forms of the objects.

Pseudo-Caption Preprocessing. For each pair of objects, we selected sentences where fewer than four words existed between the objects. This is to pick up the sentences likely to describe the relations of the target objects. We then removed the sentences wherein the target objects were adjacent to avoid collecting the objects' compound words.
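The pair-based retrieval filter described above can be sketched as a token-level check. This is a simplification (the actual preprocessing also covers plural forms), with our own function names.

```python
def keep_for_pair(tokens, obj_a, obj_b):
    """Keep a sentence for an object pair if both objects occur, are not adjacent,
    and fewer than four words lie between them."""
    if obj_a not in tokens or obj_b not in tokens:
        return False
    gap = abs(tokens.index(obj_a) - tokens.index(obj_b)) - 1  # words in between
    return 0 < gap < 4
```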
For each object, we selected sentences wherein fewer than two words were between the object and its dependent adjective, to pick up the sentences likely to describe the object in detail. We used the spaCy (https://spacy.io) en_core_web_lg model for parsing.

Value of α. As described above, each pseudo-caption contains only one or two detected objects, which is very few compared with the average sentence lengths of the text corpora (12.0 in SS and 10.7 in GCC). To balance the label imbalance of f_t, we searched for the value of α (Eq. 11) over powers of 2 and found that α = 16, which roughly equals the quotient (sentence length) / (number of detected objects), worked well across the settings.

Training Iteration. After collecting the pseudo-captions, we created a set of the objects and object pairs that were used to collect the pseudo-captions. The training is iterated over the pairs in this set, rather than over each image, to avoid overfitting to the most frequent object labels. On each iteration over the object pairs, we randomly sampled an image and a pseudo-caption wherein both of the objects were contained. Likewise, we did the same sampling for each single object in the set. The number of object pairs was 11,607 and 10,612 in settings A and B, respectively. We set the batch size to eight and terminated the training when the best validation score (specifically, the CIDEr score) did not improve for 20 epochs. For the optimizer, we used Adam with the recommended hyperparameters (Kingma and Ba, 2015).

Evaluation. In the evaluation, we set the maximum decoding length to 20. Our model decoded captions by using greedy search and unique-object decoding, described in Section 2.4. The evaluation metrics we used were BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Denkowski and Lavie, 2014), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016).

Table 2: Comparison with the previous methods (Feng et al., 2019). Scores of our model and the combined model are the mean ± standard deviation of five runs.
We marked in bold the scores within the standard deviation of the best scores.

Ablation Study

Table 3 lists the results of our model obtained in the ablation studies. We tested the ablation of the gating mechanism (gate), the pseudo-labels on the gating mechanism (pseudoL), unique-object decoding (unique), and image features (image). The pseudo-labels cannot be implemented without the base gating mechanism; thus, the model "w/o gate w/ pseudoL" is not applicable. The model w/o image is the same as Ours (full) except that it only uses the word embeddings of the detected object labels, rather than image features: it encodes the detected object labels into word embeddings, takes their mean, and replaces the image feature v with it. (On average, the number of detected objects was 3.0 in setting A and 4.0 in setting B; thus, taking the mean does not significantly break the detected information.) All models here were trained in the same manner as described in Section 3.2.

The results show that the pseudo-labels on the gating mechanism contribute greatly to the performance; the score degrades significantly from Ours (full) to w/o pseudoL in all metrics. On the other hand, the base gating mechanism does not function well by itself; not all scores of w/o gate are lower than those of w/o pseudoL. These results demonstrate that explicitly removing the word-level spurious alignment contributes the most to the relatively high performance of our model. Although its contribution is lower than that of the pseudo-labels, unique-object decoding also enhances performance.
The degraded performance of w/o image suggests that object labels themselves are insufficient to describe images correctly. We observed that this model was vulnerable to errors propagated from the object detectors. See Section 3.8 for examples.

Combining with Previous Methods
Our method focuses on removing word-level spurious alignment between images and pseudo-captions, whereas the previous methods focus on aligning images and pseudo-captions at the sentence level. To utilize the strengths of each, we combined our method with the previous methods as a model initialization method.
We first trained our model on setting A and generated captions for the images in the training data. We then paired the captions with the images as their pseudo-captions. (To avoid assigning obviously incorrect pseudo-captions, we omitted the captions that contained fewer than one detected object for the images with more than two detected objects; for the images with fewer than one detected object, we omitted the captions that contained no detected objects.) With the pairs, the caption generator of Feng et al. (2019) was initialized by learning to generate the pseudo-captions from the images. After the initialization, we trained the previous model using their publicly available code (https://github.com/fengyang0317/unsupervised_captioning). We used the same hyperparameters as Feng et al. (2019), except for the learning rate, which was 10^−5 for the generator and 10^−8 for the discriminator.

Table 4 shows the results. The combined model further improves the performance of both our model and Feng et al. (2019). In particular, the improvement over Feng et al. (2019) is much larger than that over our model. These results suggest that removing the word-level spurious alignment is critical for the subsequent sentence-level alignment.

Table 5: Bag-of-words matching scores with respect to detected object labels and the other words.

Negative Effect of Spurious Alignment
To further investigate the effect of removing the spurious alignment, we evaluated our model on noisier words: words other than the detected object labels. Our method discourages aligning them with images because they are likely to be irrelevant to the given images, whereas the previous methods force the alignment. We tested the following bag-of-words matching on the MS COCO test set.
Let S be the bag of words of a caption generated from an image I, and let T^m be the m-th reference caption of I. Given a set of detected object labels O of I, we took S_det = S ∩ O and S_other = S \ O, and likewise for T^m. We define the precision (P), recall (R), and F1 score (F) of S against T^m as follows:

$$P = \frac{|S \cap T^m|}{|S|}, \qquad R = \frac{|S \cap T^m|}{|T^m|}, \qquad F = \frac{2 P R}{P + R}.$$

Based on this, we define the precision, recall, and F1 score of S_det against T^m_det by replacing S with S_det and T^m with T^m_det, and likewise for those of S_other against T^m_other. We calculated the above scores for each pair of a generated caption and its reference captions, and subsequently averaged them across the pairs. The pairs with empty T^m_* were excluded from the calculation.

Table 5 shows the results. Overall, the scores on detected object labels (Detected) are about two times higher than those on the other words (Others), indicating the difficulty of learning the alignment of the latter, noisier words. Our model performs better in predicting the noisier words, outperforming Feng et al. (2019) in all metrics. These results indicate that refraining from the alignment works better than forcing it for the noisier words.

Table 6: Analysis of generated captions with respect to object labels and the other words. Word Type is the number of unique words, and Frequency is the mean of the frequency of the words in the training text corpus.
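The bag-of-words matching can be sketched as follows (set-based, with our own function names):

```python
def prf(S, T):
    """Precision, recall, and F1 of predicted word set S against reference word set T."""
    S, T = set(S), set(T)
    inter = len(S & T)
    P = inter / len(S) if S else 0.0
    R = inter / len(T) if T else 0.0
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F

def split_by_detected(words, detected):
    """Split a caption's word set into S_det = S ∩ O and S_other = S \\ O."""
    S, O = set(words), set(detected)
    return S & O, S - O
```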
On the other hand, our model performs worse in predicting detected object labels. This is because our method trusts all detected object labels and aligns them with images without any of the constraints used in previous work. Combined with the previous method (Ours + Feng et al. (2019)), our model improves the prediction of detected object labels.

Positive Effect of Frequency
By assigning the pseudo-label f_t, our method encourages aligning detected object labels with the image representation v and the other words with the contextual representation h_t. Thus, our model is likely to predict the latter words mostly based on the previous output sequences, as language models do. If this is the case, then the predicted latter words tend to be the frequent words in the training text corpus.
To verify this tendency, we analyzed the frequency of output words in the training text corpus, separately for object labels and the other words. Table 6 presents the results. In contrast to object labels, our outputs' vocabulary for the other words is about five times smaller than that of Feng et al. (2019), and the words tend to be highly frequent in the training text corpus.
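This analysis can be sketched as below; whether Frequency averages over word types or tokens is our assumption (types here), and the function name is our own.

```python
def non_object_vocab_stats(captions, object_labels, corpus_counts):
    """Number of unique non-object word types in the captions, and the mean
    training-corpus frequency of those types."""
    labels = set(object_labels)
    types = {w for cap in captions for w in cap.split() if w not in labels}
    mean_freq = sum(corpus_counts.get(w, 0) for w in types) / len(types) if types else 0.0
    return len(types), mean_freq
```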
The results also show that a model performs better if it has a smaller and more frequent vocabulary for the words other than object labels. This correlation is convincing considering the coverage of frequent words. For example, a general caption such as "a man with a bike" can correctly cover various scenes in which a man is riding/sitting on/leaning on/standing near/... a bike. This positive effect of frequency suggests that first aligning the frequent words and gradually extending them can be a promising approach.

Qualitative Analysis on Outputs

Figure 2 shows the captions generated by our model, its ablated models, Feng et al. (2019), and the combined model trained on setting A. Our model generated correct captions for images (a) and (b). It successfully generated object labels that were not even detected by the object detector: bat in (a) and mirror in (b). On the other hand, errors of the object detector directly propagated to the output captions of the w/o image model: the model generated an incorrect object, a bottle of wine, owing to the missing object bat in (a).

Captions of the other images are negative results of our model. We observed that our model tended to repeat similar objects: cat and dog in (c), and elephant and elephants in (f). Without unique-object decoding, this tendency worsened: the w/o unique model repeated cat in (c) and (e), and elephant in (f). The Ours + Feng et al. (2019) model did not change much of our model's predictions, as we set the learning rate low (see Section 3.5). However, it allowed the partial correction seen in (c): the combined model modified dog to suitcase.
In our outputs, words other than object labels tended to be frequent words and composed short phrases. On the contrary, Feng et al. (2019) tended to generate less frequent words (savuti and kenya in (f)) and longer phrases (portrait of a happy young in (a) and young couple in love in (d)), which were incorrect predictions in these examples.

Figure 3 shows the output captions of our model and the gate values for each word. Overall, the gate values were high for object labels and low for the other words. Although our model was correct on the words other than object labels in these examples, these words were generated mostly from contextual features and thus relied heavily on contextual frequency. This heavy reliance on context resulted in generating the same words after an object label without considering images: is sitting on followed cat in both (c) and (e).

Related Work
There has been considerable research with different settings and approaches to describing scenes that have no image-sentence pairs. Novel object captioning (Hendricks et al., 2016; Venugopalan et al., 2017; Anderson et al., 2018a; Agrawal et al., 2019) attempted to describe unseen objects in captions.
They incorporated an image classifier or object detector trained on objects not included in image-sentence pairs. Lu et al. (2018) tested captioning models on the generation of unseen combinations of objects, and Nikolaus et al. (2019) extended this to unseen combinations of objects, attributes, and relations. In both settings, only the combinations were unseen; each word in the combinations appeared in the training data. Semi-supervised approaches utilized caption retrieval models to automatically collect corresponding captions for unannotated images to augment image-sentence pairs (Liu et al., 2018; Kim et al., 2019).
The above work was evaluated on scenes where correct descriptions partially overlapped with those in the training image-sentence pairs. However, there can be scenes with no such overlap owing to the limited coverage of the currently available image-sentence pairs. Taking a step further, unsupervised image captioning (Feng et al., 2019; Laina et al., 2019) aims to describe scenes that have no overlap with the image-sentence pairs, without the annotation of the pairs. To test in that situation, the task does not allow the use of any image-sentence pairs. The only available resources are images and sentences drawn from different sources and object labels detected from the images. Feng et al. (2019) first trained an encoder-decoder model that takes the object labels in a sentence as its input and outputs the original sentence. After training, this model took the object labels detected from each image and output a sentence to pair with the image as its pseudo-caption. These pairs were then used to initialize a caption generator for the subsequent image-sentence alignment: bi-directional (image-to-sentence and sentence-to-image) feature reconstruction and GAN training (Goodfellow et al., 2014) to ensure fluency in the generated captions. In the work of Laina et al. (2019), pseudo-captions were sentences that contained the object labels detected from a given image. They employed metric learning and GAN training to minimize the difference between images and pseudo-captions in their latent space, as well as to maximize the difference between images and sentences wherein no detected object label was included.
Our approach differs from theirs in that it focuses on removing the mismatched words of pseudo-captions to take only reliable supervision, rather than forcing the use of the entire pseudo-captions for image-sentence alignment. Although the previous work additionally ensured that detected object labels were aligned to images, it did not prevent the spurious alignment between images and words.
As an easier setting than unsupervised image captioning, unpaired image captioning has also been explored (Feng et al., 2019; Laina et al., 2019; Gu et al., 2019). The major difference from unsupervised image captioning is that the images and sentences are drawn from image-sentence pairs, rather than from different sources. That is, every image has completely matched captions among the pseudo-captions, which is not the case in unsupervised image captioning. As correct captions exist for each image, previous approaches focused on matching images and sentences at the sentence level. Contrary to these approaches, we focus on unsupervised image captioning and devise a method to remove word-level spurious alignment in the much noisier pseudo-captions. Another variation of unpaired image captioning is the generation of captions in one language that has no image-sentence pairs, using paired images and captions in another language (Gu et al., 2018; Song et al., 2019). However, this line of research is beyond the scope of our work, as it requires image-sentence pairs in at least one language.
Our gating mechanism borrows the idea of adaptive attention (Lu et al., 2017, 2018). Adaptive attention serves to control when to generate words from image representations. Although these methods assume that this control is automatically learned from image-sentence pairs, such learning is not possible in an unsupervised setting. Our method differs from theirs in that we add heuristic pseudo-labels to train the gate on when to use image representations.

Conclusion
We investigated the importance of removing word-level spurious alignment between images and pseudo-captions in the task of unsupervised image captioning. For this purpose, we introduced a simple gating mechanism trained to align image features with only the most reliable words in pseudo-captions. The experimental results showed that our proposed method outperformed the previous methods without the sentence-level learning objectives used in those methods. Moreover, our method further improved performance when combined with the previous methods. These results confirm the importance of careful alignment in word-level details.