Control Image Captioning Spatially and Temporally

Generating image captions with user intention is an emerging need. The recently published Localized Narratives dataset takes mouse traces as an additional input to the image captioning task, which is an intuitive and efficient way for a user to control what to describe in the image. However, how to effectively employ traces to improve generation quality and controllability is still under exploration. This paper aims to solve this problem by proposing a novel model called LoopCAG, which connects Contrastive constraints and Attention Guidance in a Loop manner, engaging explicit spatial and temporal constraints in the generation process. Specifically, each generated sentence is temporally aligned to the corresponding trace sequence through a contrastive learning strategy. Besides, each generated text token is supervised to attend to the correct visual objects under heuristic spatial attention guidance. Comprehensive experimental results demonstrate that our LoopCAG model learns better correspondence among the three modalities (vision, language, and traces) and achieves SOTA performance on the trace-controlled image captioning task. Moreover, the controllability and explainability of LoopCAG are validated by analyzing spatial and temporal sensitivity during the generation process.


Introduction
Image captioning is a fundamental task to examine whether an intelligent system can understand the visual world by letting the system describe it with natural language. Generating a reasonable caption requires the model to link linguistic tokens to objects, relationships, and scenes of the visual world in the input image. Thus, a great captioning model will help us better understand what characteristics make a good joint visual-linguistic representation.

* Contribution during internship at MSRA.
Figure 1: A showcase of Trace-Controlled Image Captioning. Given an image together with a mouse trace representing user intention, the task is to generate the corresponding captions aligned with each part of the trace. In this case, the trace and the caption marked with the same color correspond to each other. Example narrative: "In this picture there is a stand on a ground. On the backside there is a person. He is riding on a horse. He is wearing a cap. He is in between the fence. There is a flags on a wall. On the left side there is a score board on a table and flower plants. We can see in the background sky and trees."
Most previous attempts aim to describe the salient objects and relations in an image without considering user intention. To generate controllable and explainable captions, recent works have established a new controllable image captioning task in which captions are generated at will. The captioning process can be controlled by POS tagging (Deshpande et al., 2018), sentiment (You et al., 2018), length (Deng et al., 2020), bounding boxes (Cornia et al., 2019), and mouse traces (Pont-Tuset et al., 2020).
In this paper, we mainly investigate trace-controlled image captioning, since it is not only a more natural and interactive paradigm for real web applications, e.g., automatic presentation or helping people with visual difficulties, but also a new perspective for us to better understand how the long-pursued cross-modality alignment is performed in deep learning models. Figure 1 presents a showcase of the scenario. Given an image, users can easily draw a trace to ask the AI agent to describe the scene in the image along the trace automatically.
In the Localized Narratives dataset (Pont-Tuset et al., 2020), the annotators describe the image while drawing the traces of their attention movement, which presents a spatial alignment between visual objects and caption tokens as well as a temporal alignment between user intention (expressed by the trace) and caption sentences. From Figure 1, we see that the caption tokens, e.g., "person", "horse", "trees", can be grounded to visual objects spatially, and the order of caption sentences can be arranged to align with the order of traces temporally. Although it is easy for humans to recognize which visual object is indicated by the traces, it is a challenge for an agent to recognize, emphasize, and arrange visual semantics solely based on several tracepoints' coordinates. Therefore, we devote our effort mainly to the spatial grounding and temporal controllability of image captioning.
Inspired by the above observation, we design two novel approaches to tackle the above challenges. Specifically, we design sentence-level contrastive constraints to align the generated sentences to the corresponding trace sequences temporally. Besides, we design a type of heuristic spatial attention guidance to supervise each generated text token to attend to the correct visual objects. Composing the above together, we propose a novel trace-controlled image captioning model called LoopCAG and demonstrate its superior captioning quality and flexible controllability.
Our contributions can be summarized as: 1) We propose a novel model, LoopCAG, which learns the caption tokens' spatial grounding through attention guidance and the temporal localization between trace input and caption sentences through contrastive constraints, in an end-to-end loop manner among the three modalities (vision, language, and traces).
2) The quantitative results show that our LoopCAG model generates better trace-controlled captions and achieves SOTA performance on automatic criteria. The qualitative results show that our model generates highly relevant captions given users' trace inputs.
3) We intensively study the controllability and explainability of trace-controlled image captioning.

Task Definition
For image captioning, the task is to generate a text description y given an image I. We first apply a pre-trained visual object detector to the image and obtain an object-level visual feature set V = {v_1, ..., v_N}, in which v_i ∈ R^2048 is the i-th object visual feature and N is the number of visual objects. The text description sequence is y = {y_1, ..., y_l}, in which y_j is the j-th token and l is the text sequence length. The output is conditioned on model parameters θ, and the optimization can be formulated in the following maximum-likelihood form:

θ* = argmax_θ Σ_{j=1}^{l} log p_θ(y_j | y_{<j}, V)    (1)

For trace-controlled image captioning, the raw trace input is a sequence of tracepoint coordinates with timestamps. To reduce the tracepoints to an acceptable length given the limit of GPU memory, we segment the tracepoint sequence uniformly by a fixed time window τ, and then convert each trace segment to its minimal bounding rectangle. Every bounding rectangle is represented by a 5D vector containing the normalized coordinates of its top-left and bottom-right corners and its area ratio with respect to the whole image. We denote the trace input as T = {t_1, ..., t_M}, where t_i ∈ R^5. The trace-controlled captioning objective can then be formulated as:

θ* = argmax_θ Σ_{j=1}^{l} log p_θ(y_j | y_{<j}, V, T)    (2)
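The trace preprocessing described above can be sketched as follows. This is a hypothetical implementation, not the authors' code; the tuple layout of tracepoints and the time-bucketing details are assumptions.

```python
def traces_to_boxes(points, tau, img_w, img_h):
    """Convert raw tracepoints (x, y, t) into 5D trace vectors.

    Each time window of length `tau` is reduced to the minimal bounding
    rectangle of its points: (x1, y1, x2, y2, area_ratio), with
    coordinates normalized by the image size.
    """
    if not points:
        return []
    t0 = points[0][2]
    # Bucket points by time window index relative to the first timestamp.
    buckets = {}
    for x, y, t in points:
        buckets.setdefault(int((t - t0) / tau), []).append((x, y))
    vecs = []
    for k in sorted(buckets):
        xs = [p[0] for p in buckets[k]]
        ys = [p[1] for p in buckets[k]]
        x1, y1 = min(xs) / img_w, min(ys) / img_h
        x2, y2 = max(xs) / img_w, max(ys) / img_h
        vecs.append((x1, y1, x2, y2, (x2 - x1) * (y2 - y1)))
    return vecs
```

With τ = 0.4 s (the setting used later in the implementation details), a one-second trace yields roughly three such 5D vectors.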

Method
Our method consists of three components: the caption generation module with a transformer encoder-decoder backbone, the attention guidance for object-level spatial grounding, and the contrastive constraints for sentence-level temporal alignment. The overall model structure is illustrated in Figure 2. The model is trained by jointly optimizing the three objectives listed in the following subsections.

Caption Generation
The caption generation backbone is a transformer-based encoder-decoder proposed by Vaswani et al. (2017), which mainly employs the multi-head attention mechanism and achieves top-tier performance on many sequence-related tasks. Here, we highlight several task-oriented modifications.
Vision-Trace Encoder The visual embeddings V and trace embeddings T are encoded separately and then concatenated into a single input sequence that is fed into a transformer encoder.

Figure 2: Overview of LoopCAG. (a) Caption Generation: we directly concatenate the visual object embedding and the trace embedding as encoder input, and then employ a transformer decoder for caption generation. (b) Attention Guidance: we use a heuristic supervision attention score matrix to supervise the vision-linguistic cross-attention produced by the transformer backbone, grounding the caption tokens to visual objects spatially. (c) Contrastive Constraints: we split the hidden states of caption tokens and traces by sentence and apply a contrastive loss to pull the representations of the sentence and the trace segment with the same order index closer, thereby aligning caption sentences to trace segments temporally.
• Object visual embedding: We first represent the spatial information of each object proposal by a 5D vector (in the same way as the traces), then project it into a spatial embedding p_i ∈ R^d, where d is the embedding size used across the model. Each object visual feature v_i is projected into a lower-dimensional vector in R^d, and the final object embedding is the sum of the projected visual feature and the spatial embedding.

• Trace embedding: Each 5D trace vector t_i is likewise projected into R^d. We also generate sinusoidal positional embeddings (Vaswani et al., 2017) o_i to capture the temporal order of the traces; the final trace embedding T̂ is the sum of the projected trace vector and o_i.

Caption Decoder The caption decoder combines vision and trace information through cross-attention over the hidden states of the Vision-Trace Encoder's last layer. Using a causal mask to encode generated tokens progressively, the transformer decoder ensures that the prediction for position i can depend only on the known outputs at positions less than i. During training, the ground-truth caption tokens are shifted right, and a special token BOS (begin of sentence) is inserted at the first position. A cross-entropy generation loss L_gen is then computed between the logits transformed from the last decoder layer's hidden states and the un-shifted ground-truth caption token ids with a special token EOS (end of sentence) appended.
It is noted that ŷ is the masked version of the ground-truth caption y. For a fair comparison with the baseline (Pont-Tuset et al., 2020), we apply the same setting and do not employ common techniques such as label smoothing (Szegedy et al., 2016) or self-critical training (Rennie et al., 2017).
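The sinusoidal positional embeddings used to encode trace order follow Vaswani et al. (2017). A minimal sketch of the formula, assuming the standard interleaved sin/cos layout:

```python
import math

def sinusoidal_pe(pos, d):
    """Standard sinusoidal positional embedding for position `pos` with
    dimensionality `d`: even dimensions use sin, odd dimensions use cos,
    with geometrically increasing wavelengths (Vaswani et al., 2017)."""
    return [math.sin(pos / 10000 ** (i / d)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / d))
            for i in range(d)]
```

Each trace segment at sequence position i receives the vector o_i = sinusoidal_pe(i, d), which is added to its projected 5D trace feature.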

Attention Guidance for Spatial Grounding
Attention Supervision Construction To explicitly guide the attention for object-level spatial grounding, we align the semantic caption tokens with the visual object by taking trace as an intermediate bridge. In this way, we construct a supervision matrix to guide the attention between the caption tokens and visual objects by the two following steps.
1) Language-trace temporal alignment. In the Localized Narratives dataset, the caption utterances u and mouse traces are highly temporally aligned, i.e., every utterance u has a corresponding time window, and every tracepoint p has a timestamp. To leverage this information, we first assign each tracepoint p to the unique utterance u whose time window contains the tracepoint's timestamp. Thus, every utterance u is aligned to a series of tracepoints P_u = {p_1, ..., p_{k_u}}.

Figure 3: A showcase of spatial attention scoring ("The person is riding a horse.")
2) Language-vision spatial alignment. Given the utterance u and the corresponding P_u, we calculate an alignment score measuring the spatial overlap between the tracepoints P_u and each visual object v_i. Every visual object v_i has a corresponding spatial bounding box b_i = (x_i^1, y_i^1, x_i^2, y_i^2), whose elements are the top-left and bottom-right horizontal and vertical coordinates, respectively. We set the alignment score s_(u_j, b_i) between utterance u_j and bounding box b_i as

s_(u_j, b_i) = (1 / |P_{u_j}|) Σ_{p ∈ P_{u_j}} I(p, b_i),

where I is an indicator of whether point p falls in the bounding box b_i:

I(p, b_i) = 1 if x_i^1 ≤ x_p ≤ x_i^2 and y_i^1 ≤ y_p ≤ y_i^2, and 0 otherwise,

and x_p, y_p are the coordinates of each tracepoint in P_{u_j}. An example of the alignment score calculation is illustrated in Figure 3.
By calculating the alignment score, we establish spatial grounding supervision between caption tokens and auto-detected visual objects. For every word y_i in the same utterance u, we set s_(y_i, b_j) = s_(u, b_j). Eventually, we obtain the supervision score matrix S ∈ [0, 1]^{T×N} with S_ij = s_(y_i, b_j), where T is the number of caption tokens and N the number of visual objects.
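The alignment score above is simply the fraction of an utterance's tracepoints that fall inside an object's box. A small sketch (function and argument names are ours, not from the paper):

```python
def alignment_score(points, box):
    """Fraction of tracepoints `points` (list of (x, y) tuples assigned
    to one utterance) that fall inside `box` = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    if not points:
        return 0.0
    inside = sum(1 for x, y in points if x1 <= x <= x2 and y1 <= y <= y2)
    return inside / len(points)
```

Broadcasting this word-wise over all tokens of the utterance fills one block of rows of the supervision matrix S.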
Attention-guided Grounding A cross-attention tensor of shape (N, T, L, H) is generated during the transformer's decoding steps. Here N denotes the number of pre-detected visual objects, T the number of tokens in a caption sentence after padding, L the number of transformer layers, and H the number of attention heads per layer. Two linear projections followed by layer normalization (Ba et al., 2016) are applied on dimensions L and H, respectively, reducing each to 1. Thus, for a single instance, we eventually obtain an attention matrix A of the same shape as the supervision matrix S. To train the model, the spatial grounding goal is achieved by minimizing the attention guidance loss L_att, a weighted binary cross-entropy between A and S. Note that we also mask out the stop-word entries of A and S to avoid introducing too much annotation noise.
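A minimal, unweighted sketch of the attention guidance loss, assuming A and S are token-by-object matrices with values in [0, 1] and an optional binary mask flags stop-word rows (the paper's exact weighting scheme is not reproduced here):

```python
import math

def attention_guidance_loss(A, S, mask=None, eps=1e-8):
    """Mean binary cross-entropy between a predicted attention matrix A
    and the heuristic supervision matrix S (both T x N, entries in [0, 1]).
    Rows where `mask[i]` is falsy (e.g. stop-words) are skipped."""
    total, count = 0.0, 0
    for i, (a_row, s_row) in enumerate(zip(A, S)):
        if mask is not None and not mask[i]:
            continue  # skip stop-word tokens
        for a, s in zip(a_row, s_row):
            total += -(s * math.log(a + eps) + (1 - s) * math.log(1 - a + eps))
            count += 1
    return total / max(count, 1)
```

In the real model this would be computed with tensor operations on the reduced cross-attention, but the element-wise form above shows the supervision signal.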

Contrastive Constraints for Temporal Alignment
As illustrated on the left side of Figure 4, we first use a "split by sentence" procedure to build a sentence-level alignment between captions and traces, and then employ a contrastive loss to constrain the temporal order of the generation process.

Figure 4: A showcase of split by sentence and contrastive constraints for temporal alignment

Split by Sentence An annotated instance consists of an image, a tracepoint list, and a caption paragraph composed of a list of ordered caption sentences. Here, we define a caption sentence as a series of utterances segmented out by a period ('.').
In Section 3.2, we already maintain an alignment between utterances and tracepoints. Following this setting, we can unite an ordered list of utterances U = {u_1, ..., u_k} belonging to the same caption sentence, and then orderly unite the tracepoints corresponding to U's elements into a so-called trace segment. The alignment between caption sentences and trace segments is established by simply uniting the utterance-tracepoint associations with respect to the above sentence split. We call this procedure split by sentence.
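The split-by-sentence procedure can be sketched as follows, assuming utterances arrive in order together with their aligned tracepoints (all names are illustrative, not from the paper's code):

```python
def split_by_sentence(utterances, utt_points):
    """Group ordered utterances into sentences at '.' boundaries and
    unite each sentence's tracepoints into one trace segment.

    `utterances`: ordered utterance strings;
    `utt_points[i]`: tracepoints aligned to utterances[i].
    Returns parallel lists of sentences and trace segments.
    """
    sentences, segments = [], []
    cur_u, cur_p = [], []
    for u, pts in zip(utterances, utt_points):
        cur_u.append(u)
        cur_p.extend(pts)
        if u.rstrip().endswith('.'):  # sentence boundary
            sentences.append(' '.join(cur_u))
            segments.append(cur_p)
            cur_u, cur_p = [], []
    if cur_u:  # trailing utterances without a final period
        sentences.append(' '.join(cur_u))
        segments.append(cur_p)
    return sentences, segments
```

The i-th sentence and the i-th trace segment then form the positive pair used by the contrastive constraints below.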
Temporal Contrastive Constraints According to the split mentioned above, we aggregate the transformer's last-layer hidden states of trace segments and caption sentences respectively, and denote them as H_ts = {h_ts^1, ..., h_ts^n} and H_cs = {h_cs^1, ..., h_cs^n}. Here n is the number of caption sentences.
We adopt the NCE loss to learn to discriminate positive from negative trace-caption pairs. Positives are defined as the temporally aligned caption sentence and trace segment pairs, i.e., those with the same order indices; the pairs without temporal alignment in the same image serve as negatives. The contrastive loss function L_cts is defined as

L_cts = -(1/n) Σ_{i=1}^{n} log [ exp(s(h_ts^i, h_cs^i)) / Σ_{j=1}^{n} exp(s(h_ts^i, h_cs^j)) ],

where s(·, ·) applies two linear layers and an L2 normalization to each element and takes the dot product between the results. By minimizing L_cts, we force the model to learn a representation aware of sentence-level temporal ordering, which leads to more precise captioning.
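A minimal sketch of this sentence-level InfoNCE objective, simplified so that s(·, ·) is a plain dot product on already-normalized vectors (the paper applies learned linear projections first, which are omitted here):

```python
import math

def nce_loss(h_ts, h_cs):
    """InfoNCE over aligned (trace segment, caption sentence) pairs of
    one narrative. h_ts[i] / h_cs[i] are L2-normalized vectors; pairs
    with equal index are positives, all other in-image pairs negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n = len(h_ts)
    loss = 0.0
    for i in range(n):
        # Softmax over similarities to all sentences; positive is index i.
        exps = [math.exp(dot(h_ts[i], h_cs[j])) for j in range(n)]
        loss += -math.log(exps[i] / sum(exps))
    return loss / n
```

With perfectly aligned representations the positive similarity dominates and the loss approaches zero; shuffled sentence order drives it up, which is exactly the temporal signal the constraints exploit.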

Loss
Finally, the model is trained with three losses, L_gen, L_att, and L_cts, where L_gen is the caption generation loss, L_att is the spatial attention guidance loss, and L_cts is the temporal contrastive loss. We jointly optimize the model by minimizing their sum: L_all = L_gen + L_att + L_cts.

Dataset
We use the annotated COCO subset of Localized Narratives to evaluate our method, and call this dataset split LN-COCO for short. Each image has one or several pairs of a captioning paragraph and the corresponding mouse trace; every such pair is a so-called localized narrative. The training and validation splits are identical to Pont-Tuset et al. (2020)'s setting: there are 134,272 localized narratives in the training set and 8,573 in the validation set. We train on the whole training set and evaluate our model against the identical validation set.

Implementation Details
For visual features, we adopt Faster-RCNN (Ren et al., 2015) to extract 100 bounding box proposals. For the trace feature, we use τ = 0.4 s to extract trace segments. The embedding size d, the number of transformer layers, and the hidden size of the transformer feed-forward layer are 768, 2, and 768, respectively. The number of attention heads is 8, and the dropout rate is 0.1. We adopt the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 7e-4 (the best-performing setting of the baseline, adopted for all other trials) and momentum parameters β1 = 0.9 and β2 = 0.99. We set the batch size to 256. All models are trained on 4 Tesla V100 GPUs with 32GB memory for 10 to 12 hours.

Evaluation Metrics
This generation task adopts the traditional image captioning evaluation metrics, computed with the open-source evaluation tool with a minor modification to suit LN-COCO, including BLEU, ROUGE-L, and CIDEr-D.

Results
Baseline and +Trace methods The Baseline and +Trace methods are our re-implementations following the method description of Pont-Tuset et al. (2020).
The Baseline method takes only the image feature as input, while the +Trace model takes both the image feature and the trace as input. They employ the architecture of Changpinyo et al. (2019) with a few minor differences. First, they set the number of transformer layers for both the encoder and the decoder to 2 instead of 6. Second, their projection layers also include layer normalization (Ba et al., 2016). Third, they set the maximum number of iterations to 150k. Finally, they allow the maximum length of target captions to be as long as 225 tokens to account for the narrations' longer nature.

LoopCAG methods Our model comprises four components: 1) the transformer encoder-decoder framework; 2) the trace input; 3) the Attention Guidance grounding loss (+AG for short); 4) the Contrastive constraints (+C for short).
Main Results Table 1 shows the overall performance comparison on the LN-COCO dataset. To reduce the deviation caused by implementation details, we first present the performance of our implementations (marked with *), which score higher than Pont-Tuset et al. (2020) reported; we thus have a stricter baseline for evaluating the improvement coming purely from our method. Compared to the Baseline* method, performance on all metrics improves significantly when controlling captioning with the mouse trace (+Trace*), indicating that the mouse trace enables the system to better describe the user-intended parts of the image.
Most importantly, the results show that our LoopCAG method achieves state-of-the-art performance on all automatic criteria, outperforming the previous state-of-the-art model by 2.4 and 7.5 points on BLEU-4 and CIDEr-D, respectively. This demonstrates that the proposed Attention Guidance method helps the model generate better spatially grounded and more precise captions. Considering the 2.0-point rise in ROUGE-L, we can conclude that the Contrastive constraints help the model better align the order of generated sentences to the user intent, since ROUGE-L mainly relies on an order-sensitive longest-common-subsequence F-measure.
Ablations We perform three ablations to verify that the improvements indeed come from the Attention Guidance and the Contrastive constraints. Starting from standard captioning (Baseline*), we add the Attention Guidance to help the model spatially ground visual objects and caption tokens (Table 2, "+AG"). This improves performance, suggesting that the model benefits from knowing where to find the semantically related appearance features in the image. Next, we add the trace feature (Table 2, "+Trace"), which introduces user intention to the model; we also use this line to fairly measure the lift contributed by the Contrastive constraints. Then we add the contrastive module (Table 2, "+C") and see a clear improvement on almost all criteria, verifying the positive influence of the temporal contrastive constraints. Finally, the last line is our full LoopCAG model, showing that the two proposed methods are not exclusive of each other.

Controllability Analysis on Temporal Order
We also design an experiment to further demonstrate LoopCAG's superior controllability over the caption sentences' temporal order. Specifically, we split each localized narrative input by sentence as described in Section 3.3 and reverse the sequential order of the splits, i.e., the last sentence of a caption paragraph becomes the first; the same processing is applied to the trace segments. We evaluate on this sentence-and-segment reversed dataset, and the performance comparison is shown in Table 3. With the help of the Contrastive constraints, the LoopCAG model is much more robust to trace-input reversal, even competitive with a model trained on reversed data. In contrast, the base models all suffer a dramatic drop on almost all metrics when the input trace order is reversed. This also reveals certain biased habits of human annotators; for example, they tend to describe the salient objects first and end with a sentence about the background of the image.
Controllability Analysis on Temporal Frequency Next, we analyze the controllability of the temporal window τ to examine whether coarse-grained or fine-grained tracepoints (the sampling rate, in other words) affect generation performance. As Table 4 shows, we vary τ from 0.4 to 1.2 and observe a marked performance drop as τ grows. This experiment with various τ simulates users' trace-drawing speed in a real application scenario: a larger τ is equivalent to a faster drawing speed. As Deng et al. (2020) demonstrated, length is one of the critical factors impacting quantitative performance. This result implies we can further choose to generate either a coarse-grained or fine-grained caption by controlling the time window τ.
Controllability Analysis on Spatial Semantic Grounding One important purpose of using attention guidance is to introduce more interpretability to the model while improving captioning performance. When generating each token, the model is forced to reveal which visual elements are most responsible for the current prediction, and this attribution is supervised by our pseudo attention labels. In this way, we can hopefully obtain a better visual-linguistic joint representation. In Appendix A, we compare the attention values of models with and without attention guidance. We find that the AG model has a more diverse distribution across different types of tokens. A "neater" activation is observed in Appendix A (a) compared with (c); e.g., the activations of "who", "is", and "on" are clearly suppressed. We observe that these suppressions happen on most function words, so we add this illustration for further discussion and exploration by the research community.

Qualitative Case Study
We present a showcase of the captioning results of different methods in Figure 7. The Baseline captioning describes the image in a random order, while the +Trace and LoopCAG captionings follow almost the same order as the Ground Truth captioning. It is also notable that the Baseline and +Trace captionings both contain some preposterous descriptions, highlighted in red, whereas the LoopCAG captioning is entirely reasonable. This is evidence of the superior fact-grounding advantage brought by our Attention Guidance method.

Figure 7: Qualitative comparison of captioning results. Ground Truth Captioning: "In this picture there is a stand on a ground. On the backside there is a person. He is riding on a horse. He is wearing a cap. He is in between the fence. There is a flags on a wall. On the left side there is a score board on a table and flower plants. We can see in the background sky and trees."

Related Work
Controllable Image Captioning is an emerging research direction. Previous works aim to control the captioning by part-of-speech tagging (Deshpande et al., 2018), sentiment (You et al., 2018), length (Deng et al., 2020), bounding boxes (Cornia et al., 2019), etc. Some of those works pursue semantically guided captioning, while others rely on predefined categories, e.g., bounding boxes or sentiment classes. Similar works (Yu et al., 2018; Cornia et al., 2019) control the caption by a sequence of ordered topics and bounding boxes. However, those methods limit the captioning to the pre-defined or recognized objects inside the bounding boxes and are hard to scale. Besides, a trace is a more natural input modality than a bounding box. The most similar work (Pont-Tuset et al., 2020) proposed the trace-controlled image captioning task and designed a simple benchmark by directly concatenating the mouse trace coordinates and sizes into a self-attention module. Although the mouse trace is flexible and interactive, its semantic representation is easy for humans to understand but hard for AI agents. Unlike previous works, we propose a novel trace-controlled model that captures the semantic representation of the trace from both fine-grained and coarse-grained spatial and temporal characteristics.

Contrastive Learning Recently, contrastive learning has been widely studied in unsupervised representation learning for vision (e.g., Grill et al., 2020; Caron et al., 2020; Chen and He, 2020), language (Mikolov et al., 2013; Saunshi et al., 2019; Chi et al., 2020; Fang and Xie, 2020; Giorgi et al., 2020; Kong et al., 2020; Gunel et al., 2021), and multi-modal settings (Sun et al., 2019; Luo et al., 2020). The goal is to learn semantic representations across two views by making positive samples similar in the semantic space and negatives dissimilar. CLIP (Radford et al.)
and MIL-NCE (Miech et al., 2020) have demonstrated the effectiveness of learning semantic mappings between vision and language. Previous attempts mainly exploit the InfoNCE (Oord et al., 2018) objective to maximize a lower bound on mutual information. This paper extends multimodal contrastive learning to the trace in the image and the caption sentence: within the same image, they correspond to each other semantically, which motivates us to design a contrastive loss for better alignment between trace and language.

Conclusion
In this paper, we focus on the controlled image captioning task and find that mouse traces provide an intuitive and efficient way for a user to control the description. We propose a novel caption generation model with contrastive constraints and attention guidance, called LoopCAG, to control the captioning process spatially and temporally. The experimental results demonstrate our model's effectiveness, and we hope our work will inspire more future research on vision-linguistic understanding and generation.