Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination

In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup, inference-time image-free UMMT, where the model is trained with source-side text-image pairs and tested with only source-text inputs. First, we represent the input images and texts with visual and language scene graphs (SGs), where such fine-grained vision-language features ensure a holistic understanding of the semantics. To enable pure-text input during inference, we devise a visual scene hallucination mechanism that dynamically generates a pseudo visual SG from the given textual SG. Several SG-pivoting based learning objectives are introduced for unsupervised translation training. On the benchmark Multi30K data, our SG-based method outperforms the best-performing baseline by significant BLEU margins on this task and setup, yielding translations with better completeness, relevance and fluency without relying on paired images. Further in-depth analyses reveal how our model advances in this task setting.


Introduction
Current neural machine translation (NMT) has achieved great success (Sutskever et al., 2014; Bahdanau et al., 2015; Zhu et al., 2020), yet at the cost of creating large-scale parallel sentences, which obstructs the development of NMT for low-resource languages. Unsupervised NMT (UMT) has thus been proposed to relieve the reliance on parallel corpora (Artetxe et al., 2018; Chen et al., 2018). The core idea of UMT is to align the representation spaces of two languages with alternative pivot signals rather than parallel sentences, such as bilingual lexicons (Lample et al., 2018), multilingual language models (LMs) (Conneau and Lample, 2019) and the back-translation technique (Sennrich et al., 2016). Recent trends have considered the incorporation of visual information, i.e., multimodal machine translation (MMT) (Specia et al., 2016; Huang et al., 2016). Intuitively, the visual modality can serve as a language-agnostic signal, pivoting different languages by grounding the same textual semantics into a common visual space. Therefore, solving UMT with visual content as the pivot becomes a promising solution, a.k.a. unsupervised MMT (UMMT) (Huang et al., 2020; Su et al., 2019).
UMMT systems are trained with only text-image pairs (<text-img>), which are easier to collect than parallel source-target sentence pairs (<src-tgt>) (Huang et al., 2020). Although exempting parallel sentences for training, UMMT still requires such text-image pairs as inputs for testing. Yet this assumption can be unrealistic: in most real-world scenarios, such as online translation systems, paired images are not available during inference, and for some scarce languages the <text-img> pairs are especially hard to obtain. In other words, practical UMMT systems should avoid not only the parallel sentences during training, but also the text-image pairs during inference. As summarized in Table 1, although some existing MMT studies exempt the testing-time visual inputs (Zhang et al., 2020; Li et al., 2022b), they are unfortunately all supervised methods relying on large-scale parallel sentences for training. As emphasized above, visual information is vital to UMMT. However, both the existing supervised and unsupervised MMT studies may suffer from ineffective and insufficient modeling of visual pivot features. For example, most MMT models perform vision-language (VL) grounding over the whole image and text (Zhang et al., 2020), where such coarse-grained representation learning can cause mismatches and sacrifice subtle VL semantics. Fang and Feng (2022) recently introduced fine-grained VL alignment learning via phrase-level grounding, but without a holistic understanding of the visual scene, such a local-level method may lead to incomplete or missing alignments.
In this work, we present a novel UMMT method that addresses the aforementioned challenges. First of all, to better represent the visual (and also the textual) inputs, we incorporate the visual scene graph (VSG) (Johnson et al., 2015) and the language scene graph (LSG) (Wang et al., 2018). Scene graphs (SGs) excel at intrinsically depicting the semantic structures of texts or images with rich details (cf. Fig. 1), which offers a holistic viewpoint for more effective pivoting learning. We then build the UMMT framework illustrated in Fig. 2. The input src text and paired image are first transformed into an LSG and a VSG, which are further fused into a mixed SG and then mapped to the tgt-side LSG; the tgt sentence is finally produced conditioned on the tgt LSG. Several SG-based pivoting learning strategies are proposed for the unsupervised training of the UMMT system. In addition, to support pure-text (image-free) input during inference, we devise a novel visual scene hallucination module, which dynamically generates a hallucinated VSG from the LSG as compensation.
Our system is evaluated on the standard MMT Multi30K and NMT WMT data. Extensive experimental results verify that the proposed method outperforms strong baselines on unsupervised multimodal translation by over 5 BLEU points on average. We further reveal the efficacy of the visual scene hallucination mechanism in relieving the reliance on image inputs during inference. Our SG-pivoting based UMMT helps yield translations with higher completeness, relevance and fluency, and especially obtains improvements on longer sentences.
Overall, we make the following contributions: 1) We are the first to study inference-time image-free unsupervised multimodal machine translation, which we solve with a novel visual scene hallucination mechanism. 2) We leverage SGs to better represent the visual and language inputs, and design SG-based graph pivoting learning strategies for UMMT training. 3) Our model achieves large improvements over strong baselines on benchmark data. Code is available at https://github.com/scofield7419/UMMT-VSH.
Scene Graph-based Translation System

Problem Definition
In UMMT, no parallel translation pairs are available. This work considers an inference-time image-free UMMT. During training, the data availability is <x, z> ∈ <X, Z> together with the corresponding src-side LSG_x and VSG, where X are the src-side sentences and Z are the paired images. During inference, the model generates tgt-side sentences y ∈ Y based only on the input x ∈ X and the corresponding LSG_x, while the visual scene VSG′ is hallucinated from LSG_x. In both training and inference, y is generated from the intermediate tgt-side language scene graph LSG_y, which is produced from LSG_x and the VSG (or VSG′).

Framework
As shown in Fig. 2, the system first represents the src-side LSG_x and VSG features with two GCN graph encoders, respectively. Then the SG fusing&mapping module integrates and transforms the two SG representations into a unified one as the tgt-side LSG, i.e., LSG_y. Another GCN model further encodes LSG_y, whose representations are used to generate the tgt sentence (i.e., the translation).
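To make the overall flow concrete, here is a minimal, hypothetical sketch of the forward pipeline described above. Module names are illustrative placeholders, not the released implementation.

```python
import torch.nn as nn

class SGTranslationPipeline(nn.Module):
    """Sketch of the SG-pivoted translation flow (hypothetical module names)."""

    def __init__(self, lsg_encoder, vsg_encoder, fuse_map, tgt_sg_encoder, graph2text):
        super().__init__()
        self.lsg_encoder = lsg_encoder        # GCN over the source-language scene graph
        self.vsg_encoder = vsg_encoder        # GCN over the (possibly hallucinated) visual scene graph
        self.fuse_map = fuse_map              # SG fusing & mapping -> pseudo target-side LSG
        self.tgt_sg_encoder = tgt_sg_encoder  # GCN over the pseudo target-side LSG
        self.graph2text = graph2text          # graph-to-text generator

    def forward(self, lsg_x, vsg):
        r_lsg = self.lsg_encoder(lsg_x)       # node representations of LSG_x
        r_vsg = self.vsg_encoder(vsg)         # node representations of VSG (or VSG')
        lsg_y = self.fuse_map(r_lsg, r_vsg)   # merged, mixed-view scene graph
        r_y = self.tgt_sg_encoder(lsg_y)      # further propagation on the pseudo LSG_y
        return self.graph2text(r_y)           # target-side sentence tokens
```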

Scene Graph Generating and Encoding
We first employ two off-the-shelf SG parsers to obtain the LSG and VSG, separately (detailed in the experiment part). For simplicity, we unify the notations of LSG and VSG as SG here. We denote an SG as G = (V, E), where V are the nodes (including object o, attribute a and relation r types), and E are the edges e_{i,j} between any pair of nodes v_i, v_j ∈ V. We then encode both the VSG and LSG with two spatial Graph Convolutional Networks (GCN) (Marcheggiani and Titov, 2017), respectively, which is formulated as:

{r_1, · · · , r_n} = GCN(G) ,   (1)

where r_i is the representation of node v_i. We hereafter denote r^L_i as an LSG node representation, and r^V_i as a VSG node representation.
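As a rough illustration of this encoding step, the following sketch implements a simplified spatial GCN over SG node features and an adjacency matrix. The 1,024-dimensional, two-layer configuration from the experimental settings is assumed; this is not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class SceneGraphGCN(nn.Module):
    """A minimal spatial GCN over a scene graph (a simplified stand-in)."""

    def __init__(self, dim=1024, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, node_feats, adj):
        # node_feats: (n, dim) initial node embeddings (objects/attributes/relations)
        # adj: (n, n) adjacency matrix of the SG, with self-loops added
        h = node_feats
        for layer in self.layers:
            deg = adj.sum(-1, keepdim=True).clamp(min=1.0)  # simple degree normalization
            h = torch.relu(layer(adj @ h / deg))            # aggregate neighbors, then transform
        return h  # r_i for every node v_i
```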
Visual Scene Hallucinating During inference, the visual scene hallucination (VSH) module is activated to perform a two-step inference that generates the hallucinated VSG′, as illustrated in Fig. 3.
Step1: sketching skeleton aims to build the skeleton VSG. We copy all the nodes from the raw LSG_x to the target VSG, and transform the textual entity nodes into visual object nodes.
Step2: completing vision aims to enrich and augment the skeleton VSG into a more realistic one. It is indispensable to add new nodes and edges to the skeleton VSG, since in real scenarios visual scenes are much more concrete and vivid than textual scenes. Specifically, we develop a node augmentor and a relation augmentor, where the former decides whether to attach a new node to an existing one, and the latter decides whether to create an edge between two disjoint nodes. To ensure the fidelity of the hallucinated VSG′, the node augmentor and relation augmentor are updated during training (i.e., with the learning target L_VSH) under the supervision of the input LSG and VSG pairs. Appendix §A.1 details the VSH module.
SG Fusing&Mapping Now we fuse the heterogeneous LSG_x and VSG into one unified scene graph with a mixed view. The key idea is to merge the information from the two SGs that serves similar roles. In particular, we first measure the representation similarity of each pair of <text-img> nodes from the two GCNs. For pairs with high alignment scores, we merge the two nodes into one by averaging their representations; for the rest, we take the union structures of the two SGs. This results in a pseudo tgt-side LSG_y. We then use another GCN model for further representation propagation. Finally, we employ a graph-to-text generator to transform the LSG_y representations into the tgt sentence y. Appendix §A.2 presents all the technical details of this part.
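A minimal sketch of the fusing step under the assumptions above: cosine similarity between node representations, a pre-defined threshold α, averaging aligned pairs, and keeping the union otherwise. The threshold value here is only a placeholder.

```python
import torch
import torch.nn.functional as F

def fuse_scene_graphs(r_lsg, r_vsg, alpha=0.5):
    """Sketch of SG fusing: merge aligned <text, visual> node pairs by averaging,
    keep the rest as a union (alpha is an assumed threshold value)."""
    # cosine similarity between every LSG node and every VSG node
    sim = F.cosine_similarity(r_lsg.unsqueeze(1), r_vsg.unsqueeze(0), dim=-1)  # (n_l, n_v)
    merged, used_v = [], set()
    for i in range(r_lsg.size(0)):
        j = int(sim[i].argmax())
        if sim[i, j] > alpha and j not in used_v:
            merged.append((r_lsg[i] + r_vsg[j]) / 2)  # aligned pair -> averaged node
            used_v.add(j)
        else:
            merged.append(r_lsg[i])                   # unaligned text node kept as-is
    for j in range(r_vsg.size(0)):
        if j not in used_v:
            merged.append(r_vsg[j])                   # unaligned visual node kept as-is
    return torch.stack(merged)                        # nodes of the pseudo LSG_y
```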

Learning with Scene Graph Pivoting
In this part, based on the SG pivot we introduce several learning strategies to accomplish the unsupervised training of machine translation. We mainly consider 1) cross-SG visual-language learning, and 2) SG-pivoted back-translation training. Fig. 4 illustrates these learning strategies.

Cross-SG Visual-language Learning
The visual-language SG cross-learning aims to enhance the structural correspondence between the LSG and VSG. Via cross-learning we also teach the SG encoders to automatically highlight the shared visual-language information while deactivating the trivial substructures, i.e., denoising.

Cross-modal SG Aligning The idea is to encourage the text and visual nodes that serve similar roles in the VSG and LSG to be closer. To align the fine-grained structures between SGs, we adopt the contrastive learning (CL) technique (Logeswaran and Lee, 2018; Yan et al., 2021; Fei et al., 2022f; Huang et al., 2022). CL learns effective representations by pulling semantically close content pairs together while pushing apart different ones. Technically, we measure the similarity between each pair of nodes from the VSG and LSG:

s_{i,j} = cos(r^L_i, r^V_j) .   (2)

A threshold value α is pre-defined to decide the alignment confidence, i.e., pairs with s_{i,j} > α are considered aligned. Then we apply the CL loss:

L_CMA = − Σ_i log [ exp(s_{i,j*} / τ) / Σ_j exp(s_{i,j} / τ) ] ,   (3)

where τ > 0 is an annealing factor, and j* denotes a positive pair for i, i.e., s_{i,j*} > α.

Cross-modal Cross-reconstruction We further strengthen the correspondence between the VSG and LSG via cross-modal cross-reconstruction. Specifically, we try to reconstruct the input sentence from the VSG, and the image representations from the LSG. In this way we force both SGs to focus on the VL-shared parts. To realize VSG→x we employ the aforementioned graph-to-text generator; for LSG→z, we use a graph-to-image generator. The learning loss is denoted L_REC.
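The node-level contrastive alignment can be sketched as follows, using an InfoNCE-style formulation consistent with Eq. (2)-(3); the τ and α values are placeholders.

```python
import torch
import torch.nn.functional as F

def cross_sg_alignment_loss(r_lsg, r_vsg, tau=0.1, alpha=0.5):
    """Sketch of the cross-modal SG aligning objective: an InfoNCE-style loss over
    node pairs whose similarity exceeds the threshold alpha (tau, alpha assumed)."""
    sim = F.cosine_similarity(r_lsg.unsqueeze(1), r_vsg.unsqueeze(0), dim=-1)  # s_{i,j}
    loss, count = 0.0, 0
    for i in range(sim.size(0)):
        pos = (sim[i] > alpha).nonzero().flatten()   # positives j* for node i
        if len(pos) == 0:
            continue
        logits = sim[i] / tau
        # pull the positive pairs of node i together against all candidates
        loss = loss - torch.logsumexp(logits[pos], 0) + torch.logsumexp(logits, 0)
        count += 1
    return loss / max(count, 1)
```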

SG-pivoted Back-translation Training
Back-translation is a key method to realize unsupervised machine translation (Sennrich et al., 2016).
In this work, we further aid the back-translation with structural SG pivoting.

Visual-concomitant Back-translation
We perform back-translation with the SG pivoting. We denote the X→Y translation direction as ȳ = F_{xz→y}(x, z), and Y→X as x̄ = F_{yz→x}(y, z).
As we only have src-side sentences, the back-translation is uni-directional, i.e., x → ȳ → x̂, and the reconstruction of x yields the loss L_VCB.
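A minimal sketch of this uni-directional, image-conditioned back-translation loop. The function names and the nll helper are illustrative, not the authors' implementation.

```python
def visual_concomitant_bt_loss(x, z, f_xz2y, f_yz2x, nll):
    """Sketch of visual-concomitant back-translation (uni-directional, src-only data).
    f_xz2y / f_yz2x are the two translation directions; nll is a token-level
    negative log-likelihood (all names here are illustrative)."""
    y_hat = f_xz2y(x, z)      # translate the src sentence + image into a pseudo tgt sentence
    x_rec = f_yz2x(y_hat, z)  # translate it back, again conditioned on the image
    return nll(x_rec, x)      # reconstruction loss against the original src sentence
```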
Captioning-pivoted Back-translation Image captioning is partially similar to MMT, except that its input contains no text. Inspired by Huang et al. (2020), based on the SG pivoting, we incorporate two captioning procedures, Z→X and Z→Y, to generate pseudo parallel sentences <x̄-ȳ> for back-translation and to better align the language latent spaces. We denote Z→X as x̄ = C_{z→x}(z) and Z→Y as ȳ = C_{z→y}(z); the resulting pseudo pair is then used for back-translation training, yielding the loss L_CPB.

Remarks In the initial stage, each of the above learning objectives is executed separately, in a certain order, so as to maintain a stable and effective UMMT system. We first perform L_CMA and L_REC, because the cross-SG visual-language learning is responsible for aligning the VL SGs, based on which the high-level translation can happen. Then we perform the back-translation training L_VCB and L_CPB, together with the VSH updating L_VSH. Once the system tends to converge, we put all the objectives (L_CMA, L_REC, L_VCB, L_CPB and L_VSH) together for further fine-tuning.

Experimental Settings

We use CLIP (Radford et al., 2021) to retrieve images from Multi30K for sentences without paired images. Following prior research, we employ Faster-RCNN (Ren et al., 2015) as the object detector and MOTIFS (Zellers et al., 2018) as the relation classifier and attribute classifier; these three together form the VSG generator. For LSG generation, we convert the sentences into dependency trees with a parser (Anderson et al., 2018), which are then transformed into scene graphs based on certain rules (Schuster et al., 2015). For text preprocessing, we use Moses (Koehn et al., 2007) for tokenization and apply byte pair encoding (BPE). We use a Transformer (Vaswani et al., 2017) as the underlying text encoder to offer representations for the GCNs, and use Faster-RCNN to encode visual feature representations. All GCN encoders and other feature embeddings share the same dimension of 1,024, and all GCN encoders have two layers. We mainly compare with the existing UMMT models: Game-MMT (Chen et al., 2018), UMMT (Su et al., 2019) and PVP (Huang et al., 2020). For a fair comparison on the inference-time image-free setup, we also re-implement UMMT and PVP by integrating the phrase-level retrieval-based visual hallucination method (Fang and Feng, 2022). All models use the same configurations, and we do not use pre-trained LMs. On WMT we also test the supervised MMT setup against these baselines: UVR (Zhang et al., 2020), RMMT (Wu et al., 2021b), PUVR (Fang and Feng, 2022) and VALHALLA (Li et al., 2022b). We report BLEU and METEOR scores for model evaluation. Our results are averaged over the 5 latest checkpoints, with significance tests.

Main Results
Results on Multi30K In Table 2 we show the overall results on the Multi30K data. First, we inspect the performance when gold-paired images are given as inputs for testing. Our method (Ours#), by integrating the LSG and VSG information, shows clear superiority over the baselines on all translation jobs, while ablating the SGs makes the performance drop rapidly. This shows the importance of leveraging scene graphs for more effective multimodal feature representations. Then, we look at the results where no paired images are given, i.e., the inference-time image-free setup. Comparing UMMT/PVP with UMMT*/PVP*, we see that without images, unsupervised MMT fails dramatically. Notably, our system shows significant improvements over the best baseline PVP*, by 5.75 BLEU on average (=(3.9+6.5+6.6+6.0)/4). Although UMMT* and PVP* acquire visual signals via the phrase-level retrieval technique, our SG-based visual hallucination method succeeds much more prominently. Besides, the gaps between Ours and Ours# are comparatively small, which indicates that the proposed SG-based visual hallucination is highly effective. The above observations prove the efficacy of our overall system for UMMT.

In Table 3 we quantify the contribution of each scene graph pivoting learning objective via an ablation study. Each learning strategy has a considerable impact on the overall performance, with the captioning-pivoted back-translation having the largest influence, an average of 4.3 BLEU. Overall, the two SG-pivoted back-translation training targets show a much larger influence than the two cross-SG visual-language learning objectives. When removing both back-translation targets, we witness the most dramatic decrease, i.e., an average of -5.7 BLEU. This validates the long-standing finding that the back-translation mechanism is key to unsupervised translation (Sennrich et al., 2016; Huang et al., 2020).

Table 6: Vision-language aligning evaluation. For our models, we transform the hallucinated VSG into an image via a graph-to-image generator. We use CLIP to measure the VL relevance score.

Results on WMT
The supervised systems trained with parallel sentences are overall better than the unsupervised ones. However, our UMMT system effectively narrows the gap between supervised and unsupervised MMT: our unsupervised method falls within 1 BLEU point of supervised models such as UVR and PUVR.

Further Analyses and Discussions
In this part we dive deeper into the model, presenting in-depth analyses to reveal how and why our proposed method works and where it improves.
• Integration of the vision and language SGs helps gain a holistic understanding of the input. Both the VSG and LSG excel at comprehensively depicting the intrinsic structure of the content semantics, which ensures a holistic understanding of the input texts and images. By encoding the vision and language SGs, the model is expected to completely capture the key components of the src inputs and thus achieve better translations; without such structural features, some information may be lost during translation. The human evaluation in Table 5 shows that our system obtains significantly higher completeness scores than the baselines that do not consider SGs. Also, in Fig. 5 we find that the baseline system PVP* (PR), relying only on local phrase-level visual retrieval, frequently misses key entities during translation, e.g., the object 'tee' in case #2.
• SG-based multimodal feature modeling helps achieve more accurate alignment between vision and language. Another merit of integrating the SGs is that the fine-grained graph modeling of visual and language scenes aids more precise multimodal feature alignment. In this way, the translated texts have higher fidelity to the original texts, whereas inaccurate multimodal alignment without SG modeling leads to worse ambiguity. Observing the ambiguity scores in Table 5, we see that our model exhibits the lowest ambiguity. In Fig. 5, case #3, PVP* (PR) confuses 'saw' with the verb 'see', as it fails to link 'saw' to a lumbering tool, while ours gives a correct prediction. Besides, accurate multimodal alignment greatly enhances the utility of visual information. In Table 6 we compare the vision-language relevance achieved by different models; our model gives the highest performance on both the overall text-image matching and the regional phrase-object matching. In addition, the two proposed cross-SG learning targets display big impacts on the VL-aligning ability.
• The longer and more complex the sentences, the more the translation quality benefits from the SG features. In this work, we use SG structures to model the input texts. Graph modeling of texts has proven effective for resolving the long-range dependency issue (Marcheggiani and Titov, 2017; Li et al., 2022b). In Fig. 6 we group the translation performance by the length of the source sentences. Our SG-based model gives considerable gains over the two non-SG baselines, and the longer the sentences, the larger the improvements.
• Incorporating SGs into MMT leads to more fluent translations. Modeling the semantic scene graph of the input also contributes substantially to the language fluency of the translated texts. Looking at the Fluency item in Table 5, we find that our system gives the best fluency with the fewest grammar errors.
• The SG-based visual scene hallucination mechanism helps gain rich and correct visual features. Different from the baseline retrieval-based methods that directly obtain whole images (or local regions), our proposed VSH mechanism compensatively generates VSGs from the given LSGs. In this way, the hallucinated visual features enjoy two-fold advantages. On the one hand, the pseudo VSG has high correspondence with the textual one, which enhances the shared feature learning between the two modalities. On the other hand, the hallucinated VSG produces some vision-specific scene components and structures, providing additional clues that complement the textual features for an overall better semantic understanding. Fig. 7 illustrates the node-increase rate during visual scene graph hallucination. The numbers of all three node types increase, to different extents, with object nodes growing most rapidly. Also, during the two transition steps of the VSH mechanism we obtain two VSGs: the skeleton VSG and the hallucinated VSG. From Fig. 8 we see that after the two full hallucination steps we obtain high-fidelity vision features, demonstrating the necessity of the second completing-vision step.

Related Work
Within the scope of natural language processing (NLP), there has been a wide range of specific language understanding tasks (Fei et al., 2022a,b,c,d,e,f,g,h). Among them, neural machine translation aims at automating the process of translating text or speech from one language to another (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015). In the era of deep learning (Fei et al., 2020a,d,e,f, 2021d), NMT has achieved notable development. The construction of powerful neural models and training paradigms, as well as the collection of large-scale parallel corpora, are the driving forces behind NMT's success (Vaswani et al., 2017; Devlin et al., 2019). The key to NMT is to learn a good mapping between two (or more) languages. To reach this goal, rich external information or knowledge has been integrated (Li et al., 2022a; Wang et al., 2022; Cao et al., 2022; Chen et al., 2022; Fei et al., 2021a,b,c,e,f, 2023). In recent years, visual information has been introduced for stronger NMT (i.e., multimodal machine translation), by enhancing the alignment of language latent spaces with visual grounding (Specia et al., 2016; Huang et al., 2016). Intuitively, people speaking different languages actually refer to the same physical visual contents and conceptions.
Unsupervised machine translation aims to learn cross-lingual mappings without the use of large-scale parallel corpora. The setting is practically meaningful for minor languages with limited data accessibility. The basic idea is to leverage alternative pivoting content to compensate for the missing parallel signals on top of the back-translation method (Sennrich et al., 2016), such as third languages, bilingual lexicons (Lample et al., 2018) or multilingual LMs (Conneau and Lample, 2019). Visual information can also serve as the pivot signal for UMT, i.e., unsupervised multimodal machine translation. Compared to standard MMT that trains with <src-img-tgt> triples, UMMT takes as input only the <src-img> pairs. So far, few studies have explored the UMMT setting, most of which try to enhance back-translation with a multimodal alignment mechanism (Nakayama and Nishida, 2017; Chen et al., 2018; Su et al., 2019; Huang et al., 2020).
A scene graph describes the scene of an image or text as a structured layout, connecting discrete objects with attributes and with other objects via pairwise relations (Krishna et al., 2017; Wang et al., 2018). As SGs carry rich contextual and semantic information, they have been widely integrated into downstream tasks for enhancement, e.g., image retrieval (Johnson et al., 2015), image generation and image captioning (Yang et al., 2019). This work inherits this wisdom, incorporating both the visual scene graph and the language scene graph as pivots for UMMT.
All existing UMMT studies assume that the <src-img> pairs are required during inference, yet we note that this can be unrealistic. We thus propose a visual hallucination mechanism to achieve the inference-time image-free goal. There are related studies on supervised MMT that manage to avoid image inputs (text only) during inference. Visual retrieval-based methods (Zhang et al., 2020; Fang and Feng, 2022) maintain an image lookup table in advance, such that a text can retrieve the corresponding visual source from the table. Li et al. (2022b) directly build pseudo image representations from the input sentence. Differently, we consider generating the visual scene graph, which carries richer and more holistic visual structural information.

Conclusion
We investigate an inference-time image-free setup in unsupervised multimodal machine translation. Specifically, we integrate the visual and language scene graphs to learn fine-grained vision-language representations. Moreover, we present a visual scene hallucination mechanism to generate pseudo visual features during inference, and propose several SG-pivoting learning objectives for unsupervised translation training. Experiments demonstrate the effectiveness of our SG-pivoting based UMMT, and further analyses offer a deeper understanding of how our method advances the task and setup.

Limitations
Our paper has the following potential limitations. First of all, we take advantage of external scene graph structures to achieve inference-time visual hallucination and secure significant improvements on the target task, while this could be a double-edged sword: it makes our method subject to the quality of the external structure parsers. When the parsed visual and language scene graphs are noisy, our method will deteriorate. Fortunately, existing scene graph parsers have already achieved satisfactory performance for major languages (e.g., English), which meets our demands. Second, the effectiveness of our approach depends on the availability of good-quality images, which however shares the pitfalls associated with the standard unsupervised multimodal translation setup.

Figure 9: A detailed view of our model architecture. The tgt-side LSG_y is synthesized from the input LSG_x and VSG; it is a pseudo LSG, without a real LSG_y input from a parser.

A Appendix
In §2.2 we give a brief introduction to the overall model framework. Here we extend the details of each module of the scene graph-based multimodal translation backbone. Fig. 9 outlines our framework.

A.1 Visual Scene Hallucination Learning Module
First of all, we note that VSH will only be activated to produce the VSG hallucination at inference time. During the training phase, we construct VSG vocabularies for the different VSG node types: we denote the object vocabulary as D_o, which caches the object nodes from the parsed VSGs of the training images; the attribute vocabulary as D_a, which caches the attribute nodes; and the relation vocabulary as D_r, which caches the relation nodes. These vocabularies provide the basic ingredients for VSG hallucination. At inference time, VSH is activated to perform a two-step inference that generates the hallucinated VSG′. The process is illustrated in Fig. 3.
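A small sketch of how such vocabularies could be collected from the parsed training VSGs; the node-list format used here is an assumption of this sketch.

```python
from collections import defaultdict

def build_vsg_vocabularies(parsed_vsgs):
    """Sketch of collecting the node vocabularies D_o, D_a, D_r from the VSGs
    parsed off the training images (the input format is assumed)."""
    vocab = defaultdict(set)
    for vsg in parsed_vsgs:                 # each vsg: {"nodes": [(label, node_type), ...]}
        for label, node_type in vsg["nodes"]:
            vocab[node_type].add(label)     # node_type in {"object", "attribute", "relation"}
    return vocab["object"], vocab["attribute"], vocab["relation"]  # D_o, D_a, D_r
```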
Step1: Sketching Skeleton This step builds the skeleton VSG from the raw LSG. We keep the whole graph topology unchanged and only transform the textual entity nodes into visual object nodes. The attribute nodes and relation nodes are directly copied into the VSG, since they are text-based labels that remain applicable in the VSG. For each textual entity node in the LSG, we employ the CLIP tool (https://github.com/openai/CLIP) to search for the best matching visual node (proposal) in D_o as the counterpart visual object, resulting in the skeleton VSG. After this step, we obtain the sketch structure of the target VSG.
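One plausible reading of this matching step, sketched with the public CLIP text encoder; the actual matching may instead rely on visual proposal features, so treat this as an assumption.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def match_entity_to_object(entity_label, object_vocab):
    """Pick the cached object node in D_o closest to a textual entity label (a sketch
    using CLIP text embeddings only)."""
    objects = list(object_vocab)
    tokens = clip.tokenize([entity_label] + objects).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
    scores = emb[0] @ emb[1:].T          # similarity of the entity to every cached object
    return objects[int(scores.argmax())]
```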
Step2: Completing Vision This step completes the skeleton VSG into a more realistic one, i.e., the final hallucinated VSG′. With the skeleton VSG at hand, we aim to further enrich it, because in the actual world visual scenes are always much more concrete and vivid than textual scenes. For example, given the caption 'boys are playing baseball on playground', the LSG only mentions the 'boys', 'baseball' and 'playground' objects. But imaginably, there must be a 'baseball bat' in the visual scene, and both the pairs 'boys'-'playground' and 'baseball'-'playground' have an 'on' relation. Thus it is indispensable to add new nodes and edges, i.e., scene graph augmentation. To reach this goal, we propose a node augmentor and a relation augmentor, as shown in Fig. 10. First of all, we downgrade all the relation nodes into the edges themselves, i.e., edges with relation labels. By this, we obtain a VSG that only contains object and attribute nodes and labeled edges, as illustrated in Fig. 11.

Figure 11: Degeneration of a relation node into a labeled edge.
For the node augmentor, we first traverse all the object nodes in the skeleton VSG. For each object node v_i, we perform k-order routing over its neighbor nodes, denoted V^na_i = {· · · , v_k, · · · }. We then use attention to learn the influence of each neighbor on v_i, and aggregate the neighbor representations into a k-order feature representation h^na_i of v_i, where r_i and r_k are the node representations of v_i and v_k obtained from the GCN encoder. A classifier then makes a prediction over the joint vocabulary of D_o and D_a to determine which node v̂_i (either an object or an attribute node) should be attached to v_i, if any:

v̂_i ← Softmax_{D_o ∪ D_a ∪ {ε}}( FFN([h^na_i ; r_i]) ) ,

where ε is an additional dummy token indicating that no new node should be attached to v_i. If the predicted node is an object node, an additional relation classifier determines the relation label ê between v̂_i and v_i:

ê ← Softmax_{D_r}( FFN([h^na_i ; r_i]) ) .

For the relation augmentor, we first traverse all the node pairs (object or attribute nodes, excluding relation nodes) in the VSG, i.e., v_i & v_j. Then, for each pair we use a triaffine attention (Wang et al., 2019; Wu et al., 2021a) to directly determine which new relation type ê_{i,j} should be built between them, if any:

ê_{i,j} ← Softmax_{D^pa}( TriAff(r_i, h^pa_{i−j}, r_j) ) ,

where D^pa = D_r ∪ {ε}, with the dummy token ε indicating that no new edge should be created between the two nodes; the new edge ê_{i,j} carries a relation label. h^pa_{i−j} is the representation of the path from v_i to v_j, obtained by pooling over all the nodes on the path: h^pa_{i−j} = Pool(r_i, · · · , r_j). Note that the triaffine scorer is effective in modeling high-order ternary relations, which provides a precise decision on whether to add a new edge.
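For concreteness, the two augmentor heads can be sketched as follows. This is a simplified stand-in: a plain FFN replaces the triaffine scorer, and the vocabulary sizes and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class NodeAugmentor(nn.Module):
    """Sketch of the node augmentor head: given a node representation r_i and its
    neighbor-aggregated feature h_i, predict which node from D_o ∪ D_a (or a dummy
    'no new node' token) to attach."""

    def __init__(self, dim, num_obj, num_attr):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_obj + num_attr + 1))  # +1 dummy token

    def forward(self, h_i, r_i):
        return torch.softmax(self.ffn(torch.cat([h_i, r_i], dim=-1)), dim=-1)

class RelationAugmentor(nn.Module):
    """Sketch of the relation augmentor head: score a new relation label (or dummy)
    for a disjoint node pair, using the pair and the pooled path representation."""

    def __init__(self, dim, num_rel):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_rel + 1))  # +1 dummy: no new edge

    def forward(self, r_i, h_path, r_j):
        # a plain FFN stand-in for the triaffine scorer used in the paper
        return torch.softmax(self.ffn(torch.cat([r_i, h_path, r_j], dim=-1)), dim=-1)
```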
During training, the node augmentor and the relation augmentor are trained and updated based on the gold LSG and VSG pairs, so as to learn the correct mapping between LSG and VSG. Such supervised learning is also important for ensuring that the final hallucinated visual scenes are basically coincident with the caption text, instead of being random or groundless visual scenes.

A.2 SG Fusing&Mapping Module
Here we extend the contents in §2.2. As shown in Fig. 9, the SG fusing module first merges LSG_x and the VSG into a mixed cross-modal scene graph, such that the merged scene graph is highly compact with little redundancy. Before merging, we measure the similarity of each pair of <text-img> node representations via the cosine distance:

s_{i,j} = cos(r^L_i, r^V_j) ,

which is the same process as in Eq. (2). For pairs with high alignment scores, i.e., s_{i,j} > α (we use the same pre-defined threshold as in the cross-modal alignment learning), we consider the two nodes as serving similar roles. Since we perform the cross-modal SG aligning learning L_CMA, the accuracy of the alignment between LSG_x and the VSG can be guaranteed. We then average the representations of each aligned image-text node pair from their GCNs, and for the rest of the nodes in LSG_x and the VSG we take their union structures. The resulting mixed SG fully inherits the semantic-rich scene nodes from both the textual SG and the visual SG, which benefits the following text generation. We now treat the mixed SG as a pseudo tgt-side LSG_y, and use another GCN to model LSG_y for further feature propagation:

{r^y_1, · · · , r^y_m} = GCN(LSG_y) .

The initial node representations of this GCN come from the GCNs of the VSG and LSG_x, i.e., r^L and r^V as in Eq. (1). Based on the node representations r^y_i of LSG_y, we finally employ a graph-to-text model to generate the final tgt-side sentence. Specifically, all the node representations are first summarized into one unified graph-level feature via pooling:

r^y = Pool(r^y_1, · · · , r^y_m) .

Then, an autoregressive sequential decoder (SeqDec) takes r^y to generate a tgt-side token over the tgt-side vocabulary at each step:

e_i = SeqDec(e_{<i}, r^y) ,  y_i ← Softmax(e_i) .
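A minimal sketch of this pooling-plus-autoregressive-decoding step; a GRU stands in for the actual sequential decoder, and the vocabulary size and dimension are assumptions.

```python
import torch
import torch.nn as nn

class GraphToTextDecoder(nn.Module):
    """Sketch of the graph-to-text step: pool the LSG_y node representations into one
    graph feature and decode target tokens autoregressively (teacher forcing)."""

    def __init__(self, dim=1024, vocab_size=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, node_reprs, tgt_tokens):
        # node_reprs: (m, dim) representations r^y_1..r^y_m of the pseudo LSG_y
        r_y = node_reprs.mean(dim=0, keepdim=True)    # Pool(r^y_1, ..., r^y_m)
        emb = self.embed(tgt_tokens).unsqueeze(0)     # (1, T, dim) previous-token embeddings
        out, _ = self.decoder(emb, r_y.unsqueeze(0))  # graph feature as the initial state
        return self.proj(out)                         # per-step logits over the tgt vocabulary
```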