Transforming Visual Scene Graphs to Image Captions

We propose to TransForm Scene Graphs into more descriptive Captions (TFSGC). In TFSGC, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) for embedding scene graphs. After embedding, different graph embeddings contain diverse specific knowledge for generating words with different parts of speech, e.g., object/attribute embeddings are suited to generating nouns/adjectives. Motivated by this, we design a Mixture-of-Experts (MOE)-based decoder, where each expert is built on MHA, to discriminate the graph embeddings when generating different kinds of words. Since both the encoder and decoder are built on MHA, we obtain a simple and homogeneous encoder-decoder, unlike previous heterogeneous ones, which usually combine a Fully-Connected-based GNN with an LSTM-based decoder. The homogeneous architecture lets us unify the training configuration of the whole model instead of specifying different training strategies for the various sub-networks, as the heterogeneous pipeline requires, which eases training. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of our TFSGC. The code is at: https://anonymous.4open.science/r/ACL23_TFSGC.


Introduction
Image captioning, which aims to generate a sentence describing multiple aspects of an image, has made huge progress since the proposal of the encoder-decoder framework (Vinyals et al., 2015; Xu et al., 2015). Such a framework contains a visual encoder that extracts a series of visual features from the image and a language decoder that generates captions from the extracted features. Since the visual encoder is usually well pre-trained on image classification and object detection, the extracted features contain abundant knowledge of object categories, which enables the captioning model to generate object-abundant captions.
However, object categories are not the only visual patterns that matter for high-quality captions (Anderson et al., 2018; Jiang et al., 2020); object attributes and relations also play significant roles in generating descriptive captions, i.e., captions covering multiple aspects of an image. Motivated by this, researchers propose to incorporate additional semantic knowledge, e.g., object categories, attributes, and relations, into captioning models by using the scene graph as a mediator (Yao et al., 2018; Yang et al., 2020). A scene graph assigns each object node certain attribute nodes and some pairwise objects certain relation nodes. These nodes are represented by the corresponding semantic tags, e.g., as shown in Figure 1, the object "dog" is assigned the attribute "black", and the object pair "dog" and "fish" has the relation "bite" between them. To exploit the scene graph, a Graph Neural Network (GNN) (Battaglia et al., 2018) is deployed to embed the graph, and the output embeddings are input to the decoder for captioning.
The top part of Figure 1 shows the pipeline of the previously popular GNN-based captioning models (Yao et al., 2018; Yang et al., 2020), which implement the GNN as a few Fully-Connected (FC) and non-linear activation layers. To update a node embedding, this GNN maps the concatenated neighbour embeddings into a new one (Xu et al., 2019). The updated graph embeddings are then input into a language decoder that contains a few LSTM layers and an attention module. The LSTM layers generate a context vector based on the partially generated caption. This context vector works as the query in the attention module for determining which graph embeddings should be used to generate the next word. Compared with models without a GNN, this GNN-LSTM pipeline usually achieves better performance.
However, this GNN-LSTM framework has two flaws that hinder further gains from applying scene graphs. First, the FC-based GNN and the LSTM do not share the same building blocks, so the constructed model is a heterogeneous structure, which requires well-chosen training strategies, e.g., different learning rates or optimizers for different sub-networks, to achieve the best performance (Yang et al., 2020). Finding such training configurations is a labour-intensive process. Second, the graph embeddings are selected indiscriminately during captioning (the grey embeddings in Figure 1 top (c) denote such indiscrimination), which causes less descriptive captions. Intuitively, different kinds of node embeddings should be used to generate words with different parts of speech (POS), e.g., object/attribute/relation embeddings should be more responsible for nouns/adjectives/verbs, respectively (Yang et al., 2019).
To alleviate the above-mentioned flaws, we propose a novel homogeneous captioning model that Transforms Scene Graphs (TSG) into captions. TSG is built on the Transformer (Vaswani et al., 2017), since it is more powerful than LSTM for image captioning (Herdade et al., 2019; Li et al., 2019; Cornia et al., 2020). TSG is homogeneous because we use multi-head attention (MHA) to design both the graph encoder that embeds the scene graphs and the language decoder that generates the caption.
Specifically, to design a GNN with MHA, we first linearize the scene graph into a token sequence and introduce a binary mask indicating which pairs of nodes are connected in the graph. We then apply the masked MHA operation to this linearized token sequence to obtain graph embeddings. In this process, a learnable type embedding is added to each token embedding to indicate the token type (object/attribute/relation), and we will show that such type embeddings help distinguish edge types during attention calculation.
After the graph operation, we obtain a series of object/attribute/relation embeddings, which are used in the decoder for captioning. To make the decoder discriminate between different embeddings when generating different words, we learn from MOE networks (Jacobs et al., 1991; Xue et al., 2022; Du et al., 2022) and revise the original Transformer decoder with two strategies. First, as Figure 1 bottom (d) shows, we use three encoder-decoder attention layers, all built on MHA, as three experts addressing the object/attribute/relation embeddings, respectively. Second, we incorporate an attention-based soft routing network that decides which kind of embedding should be more responsible for generating the next word. Both the MOE decoder and the type embeddings in the encoder help distinguish node embeddings for better captions. We carry out exhaustive ablation studies and comparisons to validate the effectiveness of TSG, which achieves 132.3/138.6/139.5 CIDEr scores with BUTD/ViT/VinVL features.
Related Work
Recently, the Transformer (Vaswani et al., 2017) has gradually substituted LSTM as the mainstream language decoder in image captioning (Herdade et al., 2019; Li et al., 2019), since it achieves better performance than LSTM-based models. Based on this backbone, researchers have developed more advanced strategies for further improvement, including designing more sophisticated attention mechanisms (Huang et al., 2019; Pan et al., 2020), introducing additional memory blocks (Cornia et al., 2020; Yang et al., 2021b), distilling knowledge from large-scale pre-trained models (Radford et al., 2021; Li et al., 2021; Xu et al., 2021), exploiting Transformer-based visual encoders (Wang et al., 2022; Fang et al., 2022), and modularized designs for large-scale multi-modal pretraining (Li et al., 2022; Xu et al., 2023; Ye et al., 2023). Since the recently proposed SOTA models use the Transformer as the backbone, we also build TSG on the Transformer for fair comparison.
Graph Neural Network (GNN). A scene graph abstracts the major visual patterns in a visual scene as a graph. It is usually used as a mediator to narrow the gap between the vision and language domains. To incorporate scene graphs into deep networks, a GNN (Battaglia et al., 2018) is used to embed the discrete node labels into dense embeddings. However, most previous GNNs are MLP-based (Yang et al., 2020; Yao et al., 2018; Xu et al., 2019), which may limit the effectiveness of embedding scene graphs in a Transformer architecture. In our research, we design an MHA-based GNN to remedy this limitation.

Mixture of Experts (MOE).
The major idea of MOE is to construct a network with many experts, where different experts deal with different samples (Jacobs et al., 1991; Shazeer et al., 2017). When a sample is input to an MOE network, a routing network decides which experts should be more responsible for it. MOE thus naturally fits our case, where we hope diverse experts can discriminate graph embeddings to generate words with different POS. Different from existing MOE-based Transformers (Lepikhin et al., 2020; Xue et al., 2022; Du et al., 2022), which apply various feed-forward networks as different experts, we set three encoder-decoder attention layers as different experts, where the query is the same context vector while the key and value are the object/attribute/relation embeddings.

Revisiting the Transformer
We first revisit the Transformer-based captioning model and then introduce how to revise it into our TSG in the next section. A Transformer-based model contains a visual encoder that computes the contexts of the extracted visual features, whose output embeddings are input into a language decoder for captioning. For both the encoder and the decoder, the most elementary building block is multi-head attention (MHA). Given the query, key, and value matrices Q, K, V:

A_i = Softmax(Q W_i^Q (K W_i^K)^T / sqrt(d_h)),  H_i = A_i V W_i^V,  MHA(Q, K, V) = LN([H_1, ..., H_h] W^H + Q),   (1)

where the W's are trainable projection matrices, [·] denotes concatenation over the h heads, and LN is Layer Normalization. Besides MHA, the other important module in the Transformer is the Feed-Forward Network (FFN):

FFN(X) = LN(FC(ReLU(FC(X))) + X),   (2)

where FC denotes a fully-connected layer and ReLU the rectified linear unit.
Given MHA and FFN, we can build a Transformer-based captioning model. The encoder stacks 6 identical blocks, each containing an MHA and an FFN. Given the output of the previous block as input X, the next block computes:

Y = MHA(X, X, X),   Z = FFN(Y).   (3)

Note that the variables X, Y, Z used here are "local variables" for conveniently introducing the workflow of the Transformer architecture; their values will be specified when introducing the concrete captioning model. In "Self-ATT", Q, K, and V are set to the same value, and this operation is named self-attention (Vaswani et al., 2017). Stacking 6 blocks defined in Eq. (3) yields the visual encoder. The input of the first block is the extracted visual feature set of the given image, and the output of the last block is input into the language decoder.
The decoder also stacks 6 identical blocks, each containing two MHAs and an FFN. Given the output of the previous decoder block X_D and the output of the visual encoder X_E, the next decoder block computes:

Y = MHA(X_D, X_D, X_D),   Z' = MHA(Y, X_E, X_E),   Z = FFN(Z').   (4)

Note that in "ED-ATT", Q is derived from the decoder side while K and V are set to the output of the visual encoder; this operation is called encoder-decoder attention (Vaswani et al., 2017). Stacking 6 blocks defined in Eq. (4) yields the language decoder.
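The encoder block of Eq. (3) and the decoder block of Eq. (4) can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: it omits the multi-head split, the learned projection matrices, and the causal mask over the partial caption; the toy width `d = 16` is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy model width (an assumption; real models use e.g. 512)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def att(Q, K, V):
    # single-head version of Eq. (1): A = Softmax(QK^T / sqrt(d)), output = AV
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

W1 = 0.1 * rng.normal(size=(d, 4 * d))
W2 = 0.1 * rng.normal(size=(4 * d, d))

def ffn(Y):
    # Eq. (2): FC -> ReLU -> FC
    return np.maximum(0.0, Y @ W1) @ W2

def encoder_block(X):
    # Eq. (3): Self-ATT (Q = K = V = X) then FFN, with residuals + LN
    Y = layer_norm(X + att(X, X, X))
    return layer_norm(Y + ffn(Y))

def decoder_block(X_D, X_E):
    # Eq. (4): Self-ATT over the partial caption, then ED-ATT where Q comes
    # from the decoder and K, V come from the encoder output, then FFN
    Y = layer_norm(X_D + att(X_D, X_D, X_D))
    Zp = layer_norm(Y + att(Y, X_E, X_E))
    return layer_norm(Zp + ffn(Zp))

X_E = encoder_block(rng.normal(size=(10, d)))   # 10 visual features
Z = decoder_block(rng.normal(size=(4, d)), X_E) # 4 words generated so far
```

Stacking six such blocks on each side reproduces the encoder-decoder workflow described above.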
For the first block in the decoder, X_D in Eq. (4) is the word embedding set of the partially generated caption S = {s_1, ..., s_t} at the t-th time step. For all decoder blocks, the input X_E is the same, namely the output of the visual encoder. The output of the last decoder block Z = {z_1, ..., z_t} is used to compute the distribution of the next word:

P(s_{t+1}) = Softmax(z_t).   (5)

Given the ground-truth caption S*, we can train this model by minimizing the cross-entropy loss:

L_XE = -log P(S*),   (6)

or by maximizing a reinforcement-learning (RL) based reward (Rennie et al., 2017):

R_RL = E_{S^s ~ P}[r(S^s; S*)],   (7)

where r is a sentence-level metric computed between the sampled sentence S^s and the ground truth S*, e.g., the CIDEr-D (Vedantam et al., 2015) metric.
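The two training signals can be made concrete with a toy NumPy example. Everything here is hypothetical illustration (the vocabulary, logits, and reward values are made up), not the paper's training code: Eq. (6) averages the negative log-probability of the ground-truth words, and the RL stage follows the SCST-style recipe of Rennie et al. (2017), where the log-probability of a sampled caption is weighted by (reward - baseline).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical vocabulary and decoder logits z_t over 3 time steps
vocab = ["a", "dog", "black", "bites", "fish", "<eos>"]
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, len(vocab)))
gt = [1, 3, 4]  # ground-truth indices: "dog bites fish"

probs = softmax(logits)  # P(s_{t+1}) of Eq. (5), per time step
# Eq. (6): cross-entropy = mean negative log-prob of ground-truth words
xe_loss = -np.mean([np.log(probs[t, w]) for t, w in enumerate(gt)])

# Eq. (7), SCST-style: weight the sampled caption's log-prob by the
# advantage (reward - baseline); rewards are hypothetical CIDEr-D values
reward_sampled, reward_baseline = 1.2, 0.9
advantage = reward_sampled - reward_baseline
```

In practice the baseline is the reward of the greedily decoded caption, so only samples that beat greedy decoding are reinforced.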

Transforming Scene Graphs
In this section, we introduce how to revise the Transformer into our TSG. We first show how to build an MHA-based GNN and then introduce how to design an MOE-based decoder.

MHA-GNN
A visual scene graph (Krishna et al., 2017) contains three kinds of node embeddings: object/attribute/relationship embeddings o/a/r. These nodes are connected by the following rules: if an object o_i has an attribute a_k, o_i and a_k are connected, e.g., o_1 connects to a_1 in Figure 2 (a); if two objects o_i and o_j have the relation r_k, we connect r_k with o_i and o_j, e.g., r_1 connects o_1 and o_2. Given an image, we extract a series of visual features as the object embeddings {o_1, ..., o_{N_o}}. To get the attribute/relation embeddings, we first use the attribute/relation annotations from VG (Krishna et al., 2017) to train attribute/relation classifiers that predict the labels. We then use two learnable embedding layers to transform these labels into dense attribute/relation embeddings {a_1, ..., a_{N_a}} / {r_1, ..., r_{N_r}}.
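The label-to-embedding step above amounts to a table lookup. A minimal NumPy sketch, under the assumption of the 103 attribute and 64 relation labels mentioned later in the implementation details (the label indices here are hypothetical, and the tables are random rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy embedding width (an assumption)

# learnable embedding tables, one row per label
# (random here; jointly learned with the captioner in practice)
attr_table = 0.1 * rng.normal(size=(103, d))
rel_table = 0.1 * rng.normal(size=(64, d))

attr_labels = [5, 42]  # hypothetical classifier outputs for one image
rel_labels = [7]

a = attr_table[attr_labels]  # dense attribute embeddings {a_1, ..., a_Na}
r = rel_table[rel_labels]    # dense relation embeddings {r_1, ..., r_Nr}
```

This mirrors what a learnable embedding layer (e.g. `nn.Embedding` in PyTorch) does: each discrete label indexes one trainable row.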
Given these original node embeddings, a GNN updates each one by aggregating its neighbour embeddings. In previous GNN-LSTM-based models, the GNN is usually implemented with FC layers, which aggregate context by mapping the concatenated neighbour embeddings to a new one (Yao et al., 2018; Yang et al., 2020). In our TSG, however, since the Transformer is the backbone, we prefer to implement the GNN with the Transformer's own basic building block. This design principle has two advantages. First, it reduces implementation effort, since we do not need to define additional GNN operations. Second, and more importantly, when the GNN and the Transformer architecture are homogeneous, the whole model is easier to train: we do not need to set different training strategies, such as learning rates or optimizers, for different sub-networks.
Since MHA (Eq. (1)) can learn the contextual relations between embeddings (Vaswani et al., 2017), it can naturally define the graph operation for aggregating knowledge. We do this in two steps. First, as shown in Figure 2, we linearize the object, attribute, and relation embeddings into one sequence and add learnable type embeddings, producing the linearized token set U = {u_1, ..., u_N}:

u_i = o_i + e_o,   u_{N_o+j} = a_j + e_a,   u_{N_o+N_a+k} = r_k + e_r,   (8)

where e_o/e_a/e_r are learnable type embeddings corresponding to the object/attribute/relation types and N = N_o + N_a + N_r. For example, in Figure 2, N_o/N_a/N_r is 3/1/2: the objects o_{1:3} become u_{1:3}, the attribute a_1 becomes u_4, and the relations r_{1:2} become u_{5:6}.
After linearization, the topological knowledge of the graph is lost, i.e., the token sequence does not show which nodes are connected. To restore this knowledge, we use a symmetric binary mask matrix M ∈ R^{N×N} to encode whether two nodes are connected: if nodes u_i and u_j are connected in the original scene graph, M_{i,j} = 1, and M_{i,j} = 0 otherwise. Concretely, the values of M are set as follows. 1) If o_i has an attribute a_j, we set M_{i, j+N_o} = 1, e.g., o_2 (u_2) and a_1 (u_4) in Figure 2 are connected, so M_{2,4} = 1. 2) If r_k connects o_i and o_j, we set M_{i, k+N_o+N_a} = M_{j, k+N_o+N_a} = 1, e.g., r_1 (u_5) connects o_1 (u_1) and o_2 (u_2), so M_{1,5} = M_{2,5} = 1. 3) All object nodes are connected with each other, since they are visual features whose contexts play a key role in captioning (Herdade et al., 2019); thus ∀ i, j ≤ N_o, M_{i,j} = 1. 4) Since the scene graph is an undirected graph, M is symmetric: M_{i,j} = M_{j,i}.
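The four rules above can be sketched directly, using the toy graph of Figure 2 (N_o/N_a/N_r = 3/1/2). The particular attribute and relation edges are assumptions matching the figure's example; the self-loops on the diagonal are an implementation detail we add so that a masked softmax row is never entirely zero, not a rule stated in the paper.

```python
import numpy as np

N_o, N_a, N_r = 3, 1, 2  # toy graph from Figure 2
N = N_o + N_a + N_r
M = np.zeros((N, N), dtype=int)

# rule 3: all object nodes attend to each other (visual context)
M[:N_o, :N_o] = 1

# rule 1: object o_i has attribute a_j  =>  M[i, j + N_o] = 1
attr_edges = [(1, 0)]  # o_2 has attribute a_1 (0-indexed)
for i, j in attr_edges:
    M[i, j + N_o] = 1

# rule 2: relation r_k connects o_i and o_j
rel_edges = [(0, 0, 1), (1, 1, 2)]  # assumed: r_1(o_1, o_2), r_2(o_2, o_3)
for k, i, j in rel_edges:
    M[i, k + N_o + N_a] = 1
    M[j, k + N_o + N_a] = 1

# self-loops (assumed implementation detail), then rule 4: symmetry
M = M | np.eye(N, dtype=int)
M = M | M.T
```

Indexing follows the linearization order of Eq. (8): objects occupy rows/columns 0..N_o-1, attributes the next N_a slots, relations the last N_r.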
Given U and M, we revise the Transformer encoder into our MHA-GNN. Specifically, we use U as the input of the encoder defined in Eq. (3) and revise the Att operation in Eq. (1) into the following masked Att operation:

Ã_i = M ⊙ A_i,   (9)

where ⊙ denotes the element-wise product. In this way, the graph operation is defined by MHA: each node embedding is updated by a weighted sum of its neighbour embeddings, where the weights come from the attention heads A_i calculated by the Att operation in Eq. (1), and the binary matrix M controls whether two nodes are connected. Note that the edge type is implicitly embedded in the Att operation thanks to the added node type embeddings. For example, after adding the type embeddings e_o and e_r to an object embedding o and a relation embedding r, respectively, the inner product becomes:

(o + e_o)(r + e_r)^T = o r^T + o e_r^T + e_o r^T + e_o e_r^T,   (10)

where the right three terms are affected by the node type embeddings. Thus, when the edge type changes (e.g., an object-relation edge becomes an object-attribute edge), the corresponding type embeddings also change (e.g., e_r changes to e_a), which means Eq. (10) encodes the knowledge of edge types into the embeddings.
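A minimal single-head sketch of the masked aggregation, not the authors' implementation: Eq. (9) zeroes the attention weights of disconnected pairs, and the row renormalization after masking is our assumption about how the masked weights are turned back into a proper weighted average.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_att(U, M):
    # A as in Eq. (1) with Q = K = V = U (single head, no projections)
    A = softmax(U @ U.T / np.sqrt(U.shape[-1]))
    A = A * M  # Eq. (9): element-wise product with the binary mask
    # renormalize rows so each node takes a weighted average of its
    # neighbours (assumed implementation detail)
    A = A / np.maximum(A.sum(-1, keepdims=True), 1e-9)
    return A @ U  # aggregate neighbour embeddings

rng = np.random.default_rng(0)
N, d = 6, 16
U = rng.normal(size=(N, d))     # linearized tokens u_1..u_N of Eq. (8)
M = np.eye(N, dtype=int)        # self-loops
M[:3, :3] = 1                   # the 3 object nodes are fully connected
G = masked_att(U, M)
```

A node whose mask row contains only its self-loop is left unchanged, which makes the connectivity constraint easy to verify.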
By stacking more such layers, the receptive field grows, so each node can be updated by aggregating more distant neighbour embeddings, which naturally follows the design principle of GNNs (Battaglia et al., 2018). The output graph embedding set G is input to the decoder for captioning.

MOE-decoder
As mentioned before, a caption contains different kinds of words describing diverse visual patterns, e.g., nouns/adjectives/verbs for objects/attributes/relations (Yang et al., 2019), which suggests that different experts should address diverse visual knowledge when generating the corresponding words. Motivated by this idea, we design an MOE-based (Jacobs et al., 1991; Du et al., 2022) language decoder. The graph embeddings output by the MHA-GNN can be divided by token type into the object/attribute/relation sets G_o/G_a/G_r, so we only need to input each set into the corresponding expert to discriminate them. Figure 3 sketches the designed MOE-based decoder, obtained by revising the decoder defined in Eq. (4) as:

X = MHA(X_D, X_D, X_D),
Z_o = EXP_O(X, G_o, G_o),   Z_a = EXP_A(X, G_a, G_a),   Z_r = EXP_R(X, G_r, G_r),
Z = SR(X, Z_o, Z_a, Z_r),   output = FFN(Z),   (11)

where EXP_O, EXP_A, and EXP_R denote three experts (encoder-decoder attention layers with Q set to X and K, V set to the corresponding graph embedding set) that address the object, attribute, and relation embeddings, respectively; they share the same structure but have different parameters. Note that the input X_D is the word embedding set of the partially generated caption, and at the t-th step X_D = {x_D^1, ..., x_D^t}; then X, Z_o, Z_a, and Z_r all contain t elements, e.g., Z_o = {z_o^1, ..., z_o^t}. The Soft Router (SR) computes an ensemble embedding z at each time step to construct the embedding set Z = {z^1, ..., z^t}. Specifically, for each element x/z_o/z_a/z_r in X/Z_o/Z_a/Z_r, the corresponding output z is:

(α_o, α_a, α_r) = ATT(x, [z_o, z_a, z_r]),   z = α_o z_o + α_a z_a + α_r z_r,   (12)

where the ATT operation computes the soft routing weights. Since x accumulates the context knowledge of the partially generated caption, it can help judge which kind of word should be generated next. For example, if the last word of the partial caption is the adjective "black", the next word is likely to be a noun, so α_o should be large to use more object embeddings instead of the other embeddings.
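The soft router of Eq. (12) can be sketched as a small attention over the three expert outputs. This is an illustrative NumPy version with assumed details: the projections `W_q` and `W_k` are hypothetical learnable parameters standing in for whatever the ATT operation uses internally, and the toy width is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_router(x, z_o, z_a, z_r, W_q, W_k):
    # score each expert output against the caption context x (Eq. 12);
    # W_q, W_k are hypothetical learnable projections
    experts = np.stack([z_o, z_a, z_r])                   # (3, d)
    scores = (x @ W_q) @ (experts @ W_k).T / np.sqrt(x.shape[-1])
    alpha = softmax(scores)                               # (α_o, α_a, α_r)
    return alpha @ experts, alpha                         # weighted sum z

rng = np.random.default_rng(0)
d = 16
W_q = 0.1 * rng.normal(size=(d, d))
W_k = 0.1 * rng.normal(size=(d, d))
x = rng.normal(size=d)                  # context of the partial caption
z_o, z_a, z_r = (rng.normal(size=d) for _ in range(3))  # expert outputs
z, alpha = soft_router(x, z_o, z_a, z_r, W_q, W_k)
```

Because the weights come from a softmax, they are positive and sum to one, so z is always a convex combination of the three expert outputs.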

Datasets, Metrics, and Implementation Details
Datasets. MSCOCO. We use MSCOCO (Lin et al., 2014) to validate our TSG. This dataset has 123,287 images, each labeled with 5 captions. We use two splits in the experiments: the offline Karpathy split (113,287/5,000/5,000 train/val/test images) and the official online split (82,783/40,504/40,775 train/val/test images).
Visual Genome. Visual Genome (Krishna et al., 2017) provides scene graph annotations for training the scene graph parser. We follow (Yang et al., 2020) to filter the noisy dataset (e.g., many labels appear only a few times) by removing attribute/relation labels appearing fewer than 2,000 times, and use the remaining 103/64 attribute/relation labels to train the attribute/relation classifiers.
Implementation Details. In the experiments, we use three kinds of visual features for exhaustive comparison with other SOTA models: BUTD (Anderson et al., 2018), ViT (Liu et al., 2021), and VinVL (Zhang et al., 2021). To parse scene graphs, we respectively follow (Yang et al., 2020) and VinVL's official parser, where the latter is more powerful. For all visual features, we set the batch size to 20 and use Adam (Kingma and Ba, 2014) as the optimizer. For BUTD/ViT/VinVL features, we sequentially train with the cross-entropy loss (Eq. (6)) for 20/20/30 epochs and then with the RL-based reward (Eq. (7)) for 30/30/30 epochs.

Ablation Studies
To confirm the effectiveness of the proposed MHA-GNN and MOE-decoder, we run exhaustive ablations as follows; BUTD features are used throughout this section. BASE: the classic Transformer architecture. SG: scene graphs are incorporated into the Transformer by using the node embeddings without any graph operation. MLP-GNN: an MLP-based Graph Neural Network (Xu et al., 2019) embeds the scene graphs. MHA-GNN w/o e: the proposed MHA-GNN without node type embeddings.

MHA-GNN: the proposed MHA-GNN, with the decoder unchanged from BASE. MOE: the proposed MOE-decoder without any GNN, i.e., the original node embeddings are input into the decoder. TSG: the full TSG.
Table 1 compares the similarity metrics of the ablation models. First, the full TSG achieves the highest scores, confirming its effectiveness. Next, we compare the ablation models to validate the proposed MHA-GNN and MOE-decoder separately. Comparing MLP-GNN, SG, and BASE shows that using a GNN yields more gains than using node embeddings alone. Furthermore, MHA-GNN achieves a higher CIDEr than MLP-GNN, which suggests that designing the GNN with MHA is more powerful than with MLP in a Transformer architecture. To see whether discriminating the graph embeddings is beneficial, we compare MHA-GNN with MHA-GNN w/o e and find that using node type embeddings performs better. Also, MOE and TSG respectively outperform SG and TSG-ED, which validates the effectiveness of the MOE-decoder.
Besides evaluating the ablation models by similarity metrics, Table 2 reports the recalls of words with different POS to evaluate descriptiveness. Table 2 shows that the captions generated by TSG have the highest recalls, suggesting that TSG generates the most descriptive captions. Both the proposed MHA-GNN (MHA-GNN vs. MLP-GNN) and the MOE-based decoder (MOE vs. SG) boost the recalls, suggesting that both improve descriptiveness. Figure 4 shows 4 example captions generated by the different models; TSG generates more descriptive captions, while BASE generates less descriptive ones since, without scene graphs, it loses semantic knowledge compared with SG. We also show which expert is most responsible for words with different POS, e.g., in (a), the adjective "green" is generated using more knowledge from the attribute expert.

Comparisons with SOTA
Recently, various SOTA captioning models with diverse settings have been proposed, differing in language decoders (LSTM, GRU, and Transformer), visual features, and whether knowledge is distilled from large-scale pretrained models (CLIP (Radford et al., 2021) or VinVL (Zhang et al., 2021)). For fairness, we compare our TSG with models that also use a Transformer-based decoder, under three features: BUTD (Anderson et al., 2018), ViT (Liu et al., 2021), and VinVL (Zhang et al., 2021). Note that we do not compare with extra-large-scale models trained on millions of image-text pairs.
From Table 3, we can find that TSG achieves the highest CIDEr-D scores in all settings, i.e., 132.3, 138.6, and 139.5 with BUTD, ViT, and VinVL features, respectively. Although the other SOTA methods do not use scene graphs, they usually carry other training burdens. For example, APN and X-Linear apply more complex attention mechanisms, which require more computational resources to train well, while our TSG applies only the simplest attention operation. Moreover, as detailed in Sec. 4.1 of ViTCAP, that model uses much more training data (9.9M image-text pairs from 4 datasets, including VG) to pre-train a concept network providing more powerful discrete tags, while we use only one dataset, VG, to get scene graphs and achieve better performance, which suggests that connecting discrete tags into a graph is a useful strategy when the tags alone are not very powerful. To sum up, the advantage of TSG is that it effectively embeds and discriminates the semantic knowledge of scene graphs, offsetting the (usually heavier) burden of using more training data or training more complex networks.
We also submit the single model TSG_S and the 4-model ensemble TSG_E, trained with region-based features, to the online server for testing; the results are shown in Table 4. Both TSG_S and TSG_E achieve the highest CIDEr-D scores, which further confirms the effectiveness of the proposed TSG.

Conclusion
We proposed TSG, a homogeneous Transformer-based model that transforms scene graphs into captions. Specifically, we use MHA to design the GNN by linearizing the scene graph and restoring the lost topological knowledge with a binary mask matrix. Furthermore, we add learnable type embeddings and design an MOE-based decoder to distinguish node embeddings for more descriptive captions. Finally, we compared TSG with various SOTA models and demonstrated that our model achieves comparable performance to some strong benchmarks.

Limitations
There are two major limitations of the proposed TSG. First, the effectiveness of TSG depends on the quality of the scene graph: if the scene graph quality is poor, TSG will not achieve good performance. In this paper, we use Visual Genome, which contains abundant and useful scene graph annotations for parsing effective scene graphs, and thus TSG is powerful.
The second limitation is that if the visual features already contain abundant attribute or relation knowledge, the improvement of TSG over the classic Transformer is weakened. For example, with the BUTD feature, the relative CIDEr-D improvement is 3.6 (TSG vs. BASE in Table 1); with the VinVL feature, which is more powerful since it is trained on far more samples with more semantic labels, the relative improvement drops to 2.2 (TSG vs. VinVL(Transformer) in Table 3).

Figure 1 :
Figure 1: Comparison between the traditional heterogeneous GNN-LSTM (top) and our homogeneous TSG model (bottom). GNN-LSTM uses an MLP-based GNN and does not discriminate the graph embeddings (the grey colour in (c) emphasizes such indiscrimination). TSG uses MHA to design both the GNN and the decoder and discriminates the diverse graph embeddings (the different colours in (c) emphasize such discrimination).
where W_i^Q, W_i^K, W_i^V, and W^H ∈ R^{d×d} are all trainable matrices; h is the number of attention heads (set to 8 in the experiments) and d_h = d/h; A_i denotes the i-th attention matrix used to calculate the i-th head; [·] denotes the concatenation operation; and LN is the Layer Normalization operation.

Figure 2 :
Figure 2: Sketch of the proposed MHA-GNN. In (a), square/triangle/circle denote the object/attribute/relation embeddings, and the dashed lines indicate that all object nodes are connected to capture visual context. (b) shows the linearized scene graph, where the top and bottom rows are the node feature and type embeddings, respectively. (c) sketches the binary mask matrix, where the top row shows the original graph embeddings (o/a/r); for convenience, the linearized indices (u) are also shown on the left and top. (d) details how MHA achieves the graph operation.
As shown in Figure 2 (c), the graph embeddings G = {g_1, ..., g_N} output from the MHA-GNN can be naturally divided according to the original token types in the scene graph into object/attribute/relation sets: G_o = {g_1, ..., g_{N_o}}, G_a = {g_{N_o+1}, ..., g_{N_o+N_a}}, and G_r = {g_{N_o+N_a+1}, ..., g_N}. The language decoder discriminates these diverse graph embeddings by setting three encoder-decoder attention layers as different experts.

Figure 4 :
Figure 4: Captions generated by BASE, MHA-ATT, and TSG. TSG generates more descriptive captions, e.g., it describes more attributes, like "green truck" and "brick building" in (a), and uses more fine-grained nouns, like "jacket" in (b). The colours in TSG's captions denote that those words use more knowledge from a particular expert: green/blue/orange corresponds to the object/attribute/relation expert, determined by checking which of α_o/α_a/α_r in Eq. (12) is largest.

Table 2 :
The recalls (%) of five part-of-speech words.

Table 4 :
The scores on the MS-COCO online test server.