Visual Semantic Parsing: From Images to Abstract Meaning Representation

The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., an image) into a structured representation, in which entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames; moreover, these formalisms remain limited in the nature of the entities and relations they can capture. In this paper, we propose to leverage a widely-used meaning representation from the field of natural language processing, the Abstract Meaning Representation (AMR), to address these shortcomings. Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from the visual input. Moreover, they allow us to generate meta-AMR graphs that unify the information contained in multiple image descriptions under one representation. Through extensive experimentation and analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.


Introduction
The ability to understand and describe a scene is fundamental for the development of truly intelligent systems, including autonomous vehicles, robots navigating an environment, or even simpler applications such as language-based image retrieval. Much work in computer vision has focused on two key aspects of scene understanding: recognizing entities, including object detection (Liu et al., 2016; Ren et al., 2015; Carion et al., 2020; Liu et al., 2020a) and activity recognition (Herath et al., 2017; Kong and Fu, 2022; Li et al., 2018; Gao et al., 2018), as well as understanding how entities are related to each other, e.g., human-object interaction (Hou et al., 2020; Zou et al., 2021) and relation detection (Lu et al., 2016; Zhang et al., 2017; Zellers et al., 2018). A natural way of representing scene entities and their relations is in graph form, so it is perhaps unsurprising that a lot of work has focused on graph-based scene representations and especially on scene graphs (Johnson et al., 2015a). Scene graphs encode the salient regions in an image (mainly, objects) as nodes, and the relations among these (mostly spatial in nature) as edges, both labelled via natural language tags; see Fig. 1(b) for an example scene graph. Along the same lines, Yatskar et al. (2016) propose to represent a scene as a semantic-role-labelled frame drawn from FrameNet (Ruppenhofer et al., 2016), a linguistically-motivated approach that draws on the semantic role labelling literature.

*Work done during an internship at Samsung AI Centre - Toronto. †Work done while at Samsung AI Centre - Toronto.
Scene graphs and situation frames can capture important aspects of an image, yet they are limited in important ways. They both require expensive manual annotation in the form of images paired with their corresponding scene graphs or frames. Scene graphs in particular also suffer from being limited in the nature of entities and relations that they capture (see Section 2 for a detailed analysis). Ideally, we would like to capture event-level semantics (same as in situation recognition) but as a structured graph that captures a diverse set of relations and goes beyond low-level visual semantics.
Inspired by linguistically-motivated image understanding research, we propose to represent images using a well-known graph formalism for language understanding, i.e., Abstract Meaning Representation (AMR; Banarescu et al., 2013). Similarly to (visual) semantic role labeling, AMRs also represent "who did what to whom, where, when, and how?" (Màrquez et al., 2008), but in a more structured way, by transforming an image into a graph representation.

[Figure 1: example image captions — "Someone riding a wave on their surfboard." / "A man riding a wave on top of a surfboard." / "A surfer is on a surfboard riding a large wave." / "A man surfing a wave in front of a cliff." / "A man surfing with the waves in the sea near mountain side."]

AMRs not only encode the main events, their participants and arguments, as well as their relations (as in semantic role labelling/situation recognition), but also relations among various other participants and arguments; see Fig. 1(a). Importantly, AMR is a broadly-adopted and dynamically evolving formalism (e.g., Bonial et al., 2020; Bonn et al., 2020; Naseem et al., 2021), and AMR parsing is an active and successful area of research (e.g., Zhang et al., 2019b; Bevilacqua et al., 2021; Xia et al., 2021; Drozdov et al., 2022). Finally, given the high quality of existing AMR parsers (for language), we do not need manual AMR annotations for images, and can rely on existing image-caption datasets to create high-quality silver data for image-to-AMR parsing.

In summary, we make the following contributions:

• We introduce the novel problem of parsing images into Abstract Meaning Representations, a widely-adopted, linguistically-motivated graph formalism; and propose the first image-to-AMR parser model for the task.
• We present a detailed analysis and comparison between scene graphs and AMRs with respect to the nature of the entities and relations they capture, the results of which further motivate research into the use of AMRs for better image understanding.
• Inspired by work on multi-sentence AMR, we propose a graph-to-graph transformation algorithm that combines the meanings of several image captions into image-level meta-AMR graphs. The motivation behind generating meta-AMRs is to build a graph that covers most of the entities, predicates, and semantic relations contained in the individual caption AMRs.
Our analyses suggest that AMRs encode aspects of an image content that are not captured by the commonly-used scene graphs. Our initial results on re-purposing a text-to-AMR parser for image-to-AMR parsing, as well as on creating image-level meta-AMRs, point to exciting future research directions for improved scene understanding.
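To make the representation concrete, an AMR like those discussed above can be viewed as a set of labelled triples. The following minimal Python sketch (not from the paper's code; the example graph, variable names, and helpers are illustrative) encodes an AMR for a caption such as "A man riding a wave on a surfboard":

```python
# Hypothetical sketch: an AMR stored as (source, role, target) triples in
# PENMAN style. ":instance" triples name concepts; other triples are
# labelled edges between graph variables.
triples = [
    ("r", ":instance", "ride-01"),   # main predicate
    ("m", ":instance", "man"),
    ("w", ":instance", "wave"),
    ("s", ":instance", "surfboard"),
    ("r", ":ARG0", "m"),             # who is riding
    ("r", ":ARG1", "w"),             # what is ridden
    ("r", ":location", "s"),         # where the riding happens
]

def nodes(triples):
    """Concept labels of all instance triples."""
    return {tgt for _, role, tgt in triples if role == ":instance"}

def edges(triples):
    """Non-instance triples, i.e., the labelled relations."""
    return [(s, r, t) for s, r, t in triples if r != ":instance"]
```

Unlike a scene graph, the predicate ride-01 is a first-class node with numbered semantic roles (ARG0, ARG1), rather than an edge label between two regions.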

Motivation: AMRs vs. Scene Graphs
Scene graphs (SGs) are a widely-adopted graph formalism for representing the semantic content of an image. Scene graphs have been shown useful for various downstream tasks, such as image captioning (Yang et al., 2019; Li and Jiang, 2019; Zhong et al., 2020), visual question answering (Zhang et al., 2019a; Hildebrandt et al., 2020; Damodaran et al., 2021), and image retrieval (Johnson et al., 2015b; Schuster et al., 2015; Schroeder and Tripathi, 2020). However, learning to automatically generate SGs requires expensive manual annotations (object bounding boxes and their relations). SGs were also shown to be highly biased in the entity and relation types that they capture. For example, an analysis by Zellers et al. (2018) reveals that clothing (e.g., dress) and object/body parts (e.g., eyes, wheel) make up over one-third of entity instances in the SGs corresponding to the Visual Genome images, and that more than 90% of all relation instances belong to the two categories of geometric (e.g., behind) and possessive (e.g., have).
One advantage of AMR graphs is that we can draw on supervision through captions associated with images. Nonetheless, the question remains as to what types of entities and relations are encoded by AMR graphs, and how these differ from SGs. To answer this question, we follow an approach similar to Zellers et al. (2018), and categorize entities and relations in SG and AMR graphs corresponding to a sample of 50K images. We use the same categories as Zellers et al., but add a few new ones to capture relation types specific to AMRs, namely, Attribute (small), Quantifier (few), Event (soccer), and AMR specific (date-entity). Details of our categorization process are provided in Appendix A. Figure 2 shows the distribution of instances for each Entity and Relation category, compared across SG and AMR graphs. AMRs tend to encode a more diverse set of relations, and in particular capture more of the abstract semantic relations that are missing from SGs. This is expected because our caption-generated AMRs by design capture the essential meaning of the image descriptions and, as such, encode how people perceive and describe scenes. In contrast, SGs are designed to capture the content of an image, including regions representing objects and (mainly spatial/geometric) visually-observable relations; see Fig. 1 for SG and AMR graphs corresponding to an image. In the context of Entities, and a major departure from SGs, (object/body) parts are less frequently encoded in AMRs, pointing to the well-known whole-object bias in how people perceive and describe scenes (Markman, 1990;Fei-Fei et al., 2007). In contrast, location is more frequent in AMRs.
The focus of AMRs on abstract content suggests that they have the potential for improving downstream tasks, especially when the task requires an understanding of the higher level semantics of an image. Interestingly, a recent study showed that using AMRs as an intermediate representation for textual SG parsing helps improve the quality of the parsed SGs (Choi et al., 2022), even though AMRs and SGs encode qualitatively different information. Since AMRs tend to capture higher level semantics, we propose to use them as the final image representation. The question remains as to how difficult it is to directly learn such representations from images. The rest of the paper focuses on answering this question.

Parsing Images into AMR Graphs
We develop image-to-AMR parsers based on a state-of-the-art seq2seq text-to-AMR parser, SPRING (Bevilacqua et al., 2021), and a multimodal model, VL-BART (Cho et al., 2021). Both are transformer-based architectures with a bidirectional encoder and an auto-regressive decoder. SPRING extends a pre-trained seq2seq model, BART (Lewis et al., 2020), by fine-tuning it on AMR parsing and generation. Next, we describe our models, input representation, and training.
Models. We build two variants of our image-to-AMR parser, as depicted in Fig. 3.

Figure 3: Model architecture for our two image-to-AMR models: (a) IMG2AMR direct : A direct model that uses a single seq2seq encoder-decoder to generate linearized AMRs from input images; and (b) IMG2AMR 2stage : A two-stage model containing two independent seq2seq components. g and r stand for global and region features, q for tag embeddings, and n for the embeddings of the predicted nodes. The input and output space of the decoders come from the AMR vocabulary.

• Our first model uses a single seq2seq encoder-decoder to directly generate linearized AMRs from input images. We refer to this model as IMG2AMR direct .
• Our second model, inspired by text-to-graph AMR parsers (e.g., Zhang et al., 2019b;Xia et al., 2021), generates linearized AMRs in two stages by first predicting the nodes, and then the relations. Specifically, we first predict the nodes of the linearized AMR for a given image. These predicted nodes are then fed (along with the image) as input into a second seq2seq model that generates a linearized AMR (effectively adding the relations). We refer to this model as IMG2AMR 2stage .
Input Representation. To represent images, we follow VL-BART, which takes the output of Faster R-CNN (Ren et al., 2015) (i.e., region features and coordinates for 36 regions) and projects them onto d = 768-dimensional vectors via two separate fully-connected layers. Faster R-CNN region features are obtained via training for visual object and attribute classification (Anderson et al., 2018) on Visual Genome. The visual input to our model is composed of position-aware embeddings for the 36 regions, plus a global image-level feature (r and g in Fig. 3). To get the position-aware embeddings for the regions, we add together the projected region and coordinate embeddings. To get the global image feature, we use the output of the final hidden layer in ResNet-101 (He et al., 2016), which is passed through the same fully-connected layer as the regions to obtain a 768-dimensional vector.
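The input construction described above can be sketched as follows; the random matrices stand in for the learned fully-connected layers and the pre-extracted features, so all values here are illustrative, not the actual model weights:

```python
import numpy as np

# Sketch of the visual input: 36 position-aware region embeddings plus one
# global image embedding, all projected to d = 768 dimensions.
d = 768
rng = np.random.default_rng(0)

region_feats = rng.normal(size=(36, 2048))   # stand-in Faster R-CNN features
region_boxes = rng.uniform(size=(36, 4))     # normalized box coordinates
global_feat = rng.normal(size=(2048,))       # stand-in ResNet-101 feature

W_feat = rng.normal(size=(2048, d)) * 0.01   # FC layer for features
W_box = rng.normal(size=(4, d)) * 0.01       # FC layer for coordinates

# Position-aware region embeddings: projected features + projected boxes.
r = region_feats @ W_feat + region_boxes @ W_box     # (36, 768)
# The global feature g goes through the same feature FC layer as the regions.
g = (global_feat @ W_feat)[None, :]                  # (1, 768)

visual_input = np.concatenate([g, r], axis=0)        # (37, 768) input tokens
```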
Training. To benefit from transfer learning, we initialize the encoder and decoder weights of both our models from the pre-trained VL-BART. This is a reasonable initialization strategy, given that VL-BART has been pre-trained on input similar to ours. Moreover, a large number of AMR labels are drawn from the English vocabulary, and thus the pre-training of VL-BART should also be appropriate for AMR generation. We fine-tune our models on the task of image-to-AMR generation, using images paired with their automatically-generated AMR graphs. We consider two alternative AMR representations: (a) caption AMRs, created directly from captions associated with images (see Section 4 for details); and (b) image-level meta-AMRs, constructed through an algorithm we describe below in Section 3.2. We perform experiments with either caption or meta-AMRs, where we train and test on the same type of AMRs. For the various stages of training, we use the cross-entropy loss between the model predictions and the ground-truth labels for each token, where the model predictions are obtained greedily, i.e., choosing the token with the maximum score at each step of the sequence generation.
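As a toy illustration of the training objective described above (not the paper's code), per-token cross-entropy against ground-truth AMR tokens, together with greedy (argmax) predictions, can be computed as:

```python
import numpy as np

# Toy example: a 5-token linearized-AMR target over a made-up vocabulary,
# scored with per-token cross-entropy; logits are random stand-ins for
# decoder outputs.
vocab = ["(", "ride-01", ":ARG0", "man", ")"]
gold = [0, 1, 2, 3, 4]                             # ground-truth token ids

rng = np.random.default_rng(1)
logits = rng.normal(size=(len(gold), len(vocab)))  # one row per decoding step

# Softmax over the vocabulary at each step, then mean negative log-likelihood
# of the gold tokens.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss = -np.mean(np.log(probs[np.arange(len(gold)), gold]))

# Greedy decoding: the maximum-score token at each step.
greedy = probs.argmax(axis=1)
```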

Learning per-Image meta-AMR Graphs
Recall that, in order to collect a data set of images paired with their AMR graphs, we rely on image-caption datasets such as MSCOCO. Specifically, we use a pre-trained AMR parser to generate AMR graphs from each caption of an image. Images can be described in many different ways, e.g., each image in MSCOCO comes with five different human-generated captions. We hypothesize that these captions collectively represent the content of the image they are describing, and as such propose to also combine the caption AMRs into image-level meta-AMR graphs through a merge and refine process that we explain next.
Prior work has used graph-to-graph transformations for merging sentence-level AMRs into document-level AMRs for abstractive and multi-document summarization (e.g., Liu et al., 2015; Liao et al., 2018; Naseem et al., 2021). Unlike in a summarization task, captions do not form a coherent document, but instead collectively describe an image. Inspired by prior work, we propose a graph-to-graph transformation algorithm that learns a unified meta-AMR graph from caption graphs; see Algorithm 1. Specifically, we first merge the nodes and edges from the original set of k caption-level AMRs, only including a predefined set of relation/edge labels. We then select the largest connected component of this merged graph, which we further refine by replacing non-predicate nodes by their more frequent hypernyms, when available.

Algorithm 1: META-AMR Graph Construction
1: Input: k human-generated image descriptions {c_i}, i = 1..k, for a given image; a set of pre-defined AMR relation types R
2: Output: A meta-AMR graph g_meta
3: Initialize: Generate AMR graphs {g_i} for the k descriptions using a pre-trained AMR semantic parser; initialize g_m = (N, E) to be the null graph
4: N = ∪_{i=1..k} N_i
5: for i = 1 to k do
6:     E_i = getEdges(g_i)
7:     for (n_s, n_t) : r ∈ E_i do            ▷ (n_s, n_t) is a pair of nodes connected via an edge labelled r
8:         if neither (n_s, n_t) nor (n_t, n_s) is already in E, and r ∈ R then
9:             add (n_s, n_t) : r to E        ▷ add a new edge only for unseen node pairs and pre-selected relation types
10: G_m = weaklyConnectedComponents(g_m)      ▷ get all connected components as g_meta candidates, since an AMR must be connected by definition
11: g_meta = getLargestComponent(G_m)         ▷ the candidate with the most nodes covers the most entities and predicates in the image
12: g_meta = refineNodes(g_meta)              ▷ replace node types by their frequent hypernym, if available
13: return g_meta
The motivation behind this refinement process is to reduce the complexity of the meta-AMR graphs (in terms of their size), which would potentially improve parsing performance. An example of a meta-AMR graph generated from caption AMRs is given in Appendix C.
AMR graphs of the MSCOCO training captions contain more than 90 types of semantic relations and more than 21K node types, with long-tailed distributions; see Fig. 6 in Appendix B. To refine meta-AMR graphs, we only maintain the top-20 most frequent relation types, which include core roles, such as ARG0, ARG1, etc., as well as high-frequency non-core roles, such as mod and location. To further refine the graphs, we replace each non-predicate node (e.g., salmon) with its most frequent hypernym (e.g., fish) according to WordNet (Fellbaum, 1998). This results in roughly a 30% reduction in the number of node types (to 15K). The average complexity of the graphs is also reduced from 19 nodes and 23 relations to 16 and 18, respectively.
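Under the simplifying assumptions that caption AMRs are given as edge-triple lists and that hypernym lookup is a plain table (the paper uses WordNet and the top-20 relation types), the merge-and-refine process of Algorithm 1 can be sketched as:

```python
from collections import defaultdict

# Illustrative sketch of Algorithm 1; KEPT_ROLES and HYPERNYMS are toy
# stand-ins, and nodes are identified by their concept labels for brevity.
KEPT_ROLES = {":ARG0", ":ARG1", ":mod", ":location"}
HYPERNYMS = {"salmon": "fish"}   # stand-in for a WordNet hypernym lookup

def build_meta_amr(caption_graphs):
    """caption_graphs: list of edge lists [(src_node, role, tgt_node), ...]."""
    edges, seen_pairs = [], set()
    for g in caption_graphs:
        for s, r, t in g:
            # Keep an edge only for allowed roles and unseen node pairs
            # (in either direction), as in lines 7-9 of Algorithm 1.
            if r in KEPT_ROLES and (s, t) not in seen_pairs and (t, s) not in seen_pairs:
                seen_pairs.add((s, t))
                edges.append((s, r, t))
    # Largest weakly connected component, via BFS over the undirected graph.
    adj = defaultdict(set)
    for s, _, t in edges:
        adj[s].add(t)
        adj[t].add(s)
    comps, visited = [], set()
    for start in adj:
        if start in visited:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        visited |= comp
        comps.append(comp)
    largest = max(comps, key=len) if comps else set()
    # Refine: replace nodes by their frequent hypernym when available.
    rename = lambda n: HYPERNYMS.get(n, n)
    return [(rename(s), r, rename(t))
            for s, r, t in edges if s in largest and t in largest]
```

For example, merging two caption AMRs that both mention salmon yields a single fish node after refinement, and an edge shared by both captions is added only once.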

Experimental Setup
Data. For our task of AMR generation from images, we use an augmented version of the standard MSCOCO image-caption dataset, which is composed of images paired with their captions, automatically-generated caption-level linearized AMR graphs, and an image-level linearized meta-AMR graph. We use the splits established in previous work (Karpathy and Fei-Fei, 2015), containing 113,287 training, 5,000 validation (VAL), and 5,000 test (TEST) images, where each image is associated with five manually-annotated captions. Following the cross-modal retrieval work involving MSCOCO (e.g., Lee et al., 2018), we use a subset of the VAL and TEST sets, containing 1,000 images each. AMR graphs of the captions are obtained by running the SPRING text-to-AMR parser (Bevilacqua et al., 2021), trained on the AMR2.0 dataset. The meta-AMR graph is created from the individual AMRs through our merge and refine process described in Algorithm 1 of Section 3.
Parser implementation details. We initialize our IMG2AMR models from VL-BART, which is based on BART Base . BART uses a sub-word tokenizer with a vocabulary size of 50,265. Following SPRING, we expand the vocabulary to include frequent AMR-specific tokens and symbols (e.g., :OP, ARG1, temporal-entity), resulting in a vocabulary size of 53,587. Adding AMR-specific symbols to the vocabulary improves efficiency by avoiding extensive sub-token splitting. The embeddings of these additional tokens are initialized by taking the average of the embeddings of their sub-word constituents. The IMG2AMR direct models are trained for 60 epochs, while the IMG2AMR 2stage models are trained for 30 epochs per stage. We use a batch size of 10 with gradients accumulated over 10 batches (hence an effective batch size of 100); the batch size was limited by the length of the linearized meta-AMRs. We optimize with RAdam (Liu et al., 2020b), using a learning rate of 10⁻⁵ and a dropout rate of 0.25. Each experiment is run on one Nvidia V100-32G GPU. Model selection is based on the best SEMBLEU-1 score.
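The sub-word averaging initialization mentioned above can be sketched as follows; the sub-word split of the new token is made up for illustration and does not reflect BART's actual tokenizer:

```python
import numpy as np

# Sketch of the vocabulary-expansion trick: a new AMR-specific token's
# embedding is initialized as the mean of its sub-word constituents'
# embeddings (here, a hypothetical split of "temporal-entity").
rng = np.random.default_rng(0)
subword_emb = {"temp": rng.normal(size=768),
               "oral": rng.normal(size=768),
               "-entity": rng.normal(size=768)}

new_token_emb = np.mean([subword_emb[s] for s in ("temp", "oral", "-entity")],
                        axis=0)
```

This gives the expanded vocabulary a sensible starting point instead of a random vector, while still letting fine-tuning specialize the embedding.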

Image-to-AMR Parsing Performance
We use the standard measures of SMATCH (Cai and Knight, 2013) and SEMBLEU (Song and Gildea, 2019) to evaluate our various image-to-AMR models. SMATCH compares two AMR graphs by calculating the F1-score between the nodes and edges of the two graphs. This score is calculated after applying a one-to-one mapping between the nodes of the two AMRs, chosen so as to maximize the F1-score between the graphs. Since finding the best exact mapping is NP-complete, a greedy hill-climbing algorithm with multiple random initializations is used to approximate it. SEMBLEU extends the BLEU (Papineni et al., 2002) metric to AMR graphs, where each AMR node is considered a unigram (used in SEMBLEU-1), and each pair of connected nodes along with their connecting edge is considered a bigram (used in SEMBLEU-2). These metrics are calculated between the model predictions and the noisy AMR ground-truth. We report results on generating caption AMRs (when the models are trained and tested on these AMRs), as well as meta-AMRs. When evaluating caption AMR generation, we compare the model output to the five reference AMRs, and report the maximum of these five scores; the intuition is to compare the predicted AMR to the most similar of the five references. Table 1 (top two rows) shows the performance of the models on the task of generating meta-AMRs from TEST images. We perform ablations of the model input combinations on the VAL set (see Appendix D), and report TEST results for the best setting, which uses all the input features for both models. The 2stage model does slightly better on this task when looking at the SMATCH and SEMBLEU-2 metrics, which take the structure of AMRs into account. Note that SEMBLEU-1 only compares the nodes of the predicted and ground-truth graphs.
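As a rough illustration of node-level scoring with the max-over-references protocol described above (a deliberate simplification — clipped unigram precision over nodes, without the brevity penalty or structure matching of the official SEMBLEU/SMATCH implementations):

```python
from collections import Counter

# Simplified node-overlap score in the spirit of SEMBLEU-1: clipped
# unigram precision over AMR node labels. Illustrative only; NOT the
# official SEMBLEU implementation.
def node_precision(pred_nodes, gold_nodes):
    pred, gold = Counter(pred_nodes), Counter(gold_nodes)
    matched = sum(min(c, gold[n]) for n, c in pred.items())
    return matched / max(1, sum(pred.values()))

def best_vs_references(pred_nodes, reference_node_lists):
    """Score against each of the reference AMRs and keep the maximum,
    mirroring the max-over-five-references protocol."""
    return max(node_precision(pred_nodes, ref) for ref in reference_node_lists)
```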
Meta-AMR graphs tend to, on average, be longer than individual caption AMRs (∼34 vs ∼12 nodes and relations). We thus expect the generation of meta-AMRs to be harder than that of caption AMRs. Moreover, although we hypothesize that meta-AMRs capture a holistic meaning for an image, the caption AMRs still capture some (possibly salient) aspect of an image content, and as such are useful to predict, especially if they can be generated with higher accuracy. We thus report the performance of our direct model on generating caption AMRs (when trained on caption AMR graphs); see the final row of Table 1. We can see that, as expected, performance is much higher on generating caption AMRs vs. meta-AMRs.
Given that AMRs are, by design, much closer to natural language in semantic space than they are to images, it is not unexpected that the results for our image-to-AMR task are not comparable to those of SoTA text-to-AMR parsers, including SPRING. Our results highlight challenges similar to those of general image-to-graph parsing techniques, including visual scene graph generation (Zhu et al., 2022), where a large gap in predictive model performance still exists.

Image-to-AMR for Caption Generation
To better understand the quality of our generated AMRs, we use them to automatically generate sentences from caption AMRs (using an existing AMR-to-text model), and evaluate the quality of these generated sentences against the reference captions of their corresponding images. Specifically, we use the SPRING AMR-to-text model, which we train from scratch on a dataset composed of AMR2.0 plus the training MSCOCO captions paired with their (automatically-generated) AMRs. We evaluate the quality of our AMR-generated captions using standard metrics commonly used in the image captioning community, i.e., CIDEr (Vedantam et al., 2015), METEOR (Denkowski and Lavie, 2014), BLEU-4 (Papineni et al., 2002), and SPICE (Anderson et al., 2016), and compare against VL-BART's best captioning performance as reported in the original paper (Cho et al., 2021). The results, reported in Table 2, clearly show that the quality of the generated AMRs is such that reasonably good captions can be generated from them, suggesting that AMRs can serve as intermediate representations for such downstream tasks. Future work will need to explore the possibility of further adapting the AMR formalism to the visual domain, as well as enriching image AMRs with additional linguistic or commonsense knowledge, which could potentially result in better quality captions.

Performance per Concept Category
The analysis presented in Section 2 suggests that many concepts in AMR graphs tend to be on the more abstract (less perceptual) side. We thus ask the following question: What are some of the categories that are harder to predict? To answer this question, we look into the node prediction performance of our two-stage model for the different entity and relation categories of Section 2. Note that this categorization is available for a subset of nodes only.
To get the per-category recall and precision values, we take the node predictions of the first stage of the IMG2AMR 2stage model (trained to predict meta-AMR nodes) on the VAL set. For each VAL image, we have a set of predicted nodes, which we compare to the set of nodes in the ground-truth meta-AMR associated with the image. When calculating per-category recall/precision values, we only consider nodes that belong to that category. We calculate per-image true positive, false positive, and false negative counts, which are used to obtain the recall and precision using micro-averaging. Fig. 4 presents the per-category (as well as overall) recall and precision values over the VAL set. Interestingly, events (e.g., festival, baseball, tennis) have the highest precision and recall. These are abstract concepts that are largely absent from SGs, suggesting that relying on a linguistically-motivated formalism is beneficial in capturing such abstract aspects of an image's content. The event category contains 14 different types, many referring to sports that have a very distinctive setup, e.g., people wearing specific clothes, holding specific objects, etc. The possibility of encoding such abstract concepts in the training AMRs (generated from human-written descriptions likely to mention the event) helps the model learn to generate them for the relevant images during inference. The next group with high precision and recall are entities (which are likely to be more closely tied to the image regions), and possessives (containing a small number of high-frequency relations, e.g., have and wear). Semantic relations perform decently, but contain a diverse set of types, and need to be further analyzed to disentangle the effect of category vs. frequency. Quantifiers (many of which are related to counting), geometric relations, and attributes seem to be particularly hard to predict. Counting is known to be hard for deep learning models.
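The per-category micro-averaging described above can be sketched as follows; the category table and per-image node sets are toy stand-ins for the paper's annotation:

```python
from collections import defaultdict

# Illustrative sketch of per-category micro-averaged precision/recall over
# predicted vs. ground-truth meta-AMR node sets. Uncategorized nodes are
# skipped, since the categorization covers only a subset of nodes.
CATEGORY = {"baseball": "event", "man": "entity", "many": "quantifier"}

def per_category_pr(per_image_pred, per_image_gold):
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, gold in zip(per_image_pred, per_image_gold):
        for n in pred | gold:
            cat = CATEGORY.get(n)
            if cat is None:
                continue
            if n in pred and n in gold:
                tp[cat] += 1
            elif n in pred:
                fp[cat] += 1
            else:
                fn[cat] += 1
    return {c: (tp[c] / max(1, tp[c] + fp[c]),     # micro precision
                tp[c] / max(1, tp[c] + fn[c]))     # micro recall
            for c in set(tp) | set(fp) | set(fn)}
```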
Geometric relations are much less frequent in AMRs, compared to SGs. Perhaps, we do need to rely on special features (e.g., relative position of bounding boxes) to improve performance on these relations. Attributes (such as young, old, small) require the model to learn subtle visual cues. In addition to understanding what input features may help improve performance on these categories, we need to further adapt the AMR formalism to the visual domain.

Qualitative Samples: Generating Descriptive Captions from meta-AMRs

In Section 5.2, we showed that caption AMRs produced by our IMG2AMR model can be used to generate reasonably good quality captions via an AMR-to-text model. Here, we provide samples of how meta-AMRs can be used as rich intermediate representations for generating descriptive captions; see Fig. 5 and Section E. To get these captions, we apply the same AMR-to-text model, trained as described in Section 5.2, to the meta-AMRs predicted by our IMG2AMR direct model. Captions generated from meta-AMRs tend to be longer than the original human-generated captions, and contain much more detail about the scene. These captions, however, sometimes contain repetitions of the same underlying concept/relation (though with different wordings), e.g., caption (a) contains both in grass and in a grassy area. We also see that our hypernym replacement sometimes results in a more general term being used in place of a more specific but more appropriate one, e.g., woman instead of girl in (d).
Nonetheless, these results generally point to the usefulness of AMRs and especially meta-AMRs for scene representation and caption generation.

Discussion and Outlook
In this paper, we proposed to use a well-known linguistic semantic formalism, i.e., Abstract Meaning Representation (AMR) for scene understanding. We showed through extensive analysis the advantages of AMR vs. the commonly-used visual scene graphs, and proposed to re-purpose existing text-to-AMR parsers for image-to-AMR parsing.
Additionally, we proposed a graph transformation algorithm that merges several caption-level AMR graphs into a more descriptive meta-AMR graph. Our quantitative (intrinsic and extrinsic) and qualitative evaluations demonstrate the usefulness of (meta-)AMRs as a scene representation formalism.
Our findings point to a few exciting future research directions. Our image-to-AMR parsers can be improved by incorporating richer visual features, by better understanding the entity and relation categories that are particularly hard for our current models to predict, and by drawing on methods used for scene graph generation (e.g., Zellers et al., 2018; Zhu et al., 2022). Our meta-AMR generation algorithm can be further tuned to capture visually-salient information (e.g., quantifiers are too hard to learn from images, and perhaps can be dropped from a visual AMR formalism).
Our qualitative samples of captions generated from meta-AMRs show their potential for generating descriptive and/or controlled captions. Controllable image captioning has received a great deal of attention lately (e.g., Cornia et al., 2019; Chen et al., 2021). It focuses on the use of subjective control, including personalization and style-focused caption generation, as well as objective control over content (controlling what the caption is about, e.g., focusing on a set of regions) or over the structure of the output sentence (e.g., controlling sentence length). We believe that by using AMRs as intermediate scene representations, we can bring together the work on these various types of control, as well as draw on the literature on controllable natural language generation, for advancing research on rich caption generation.
(a) A couple of giraffe standing next to each other in a field near rocks walking in grass in a grassy area.
(b) A yellow and blue fire hydrant on a city street in front at an intersection sitting on the side of the road near a traffic position.
(c) A large long passenger train going across a wooden beach plate, traveling and passing by water.
(d) A woman sitting at a table eating a sandwich and holding a hot dog in a building smiling while eating.
(e) A white area filled with lots of different kinds of donuts with various toppings sitting on them.
(f) A group of people sitting around at a dining table with water posing for a picture.
(g) A person in a red jacket cross country skiing down a snow covered ski slope with a couple of people riding skis and walking on the side of the snowy mountain.
(h) A person in black shirt sitting at a table in a building with a plate of food with and smiling while having meal.

Figure 5: A sample of images, along with descriptive captions automatically generated from the meta-AMRs predicted by our IMG2AMR direct model. Refer to Section E for the generated meta-AMRs. The url and license information for each of these images is available in Section E. Faces were blurred for privacy.

A AMR vs. SG: Entity and Relation Categorization Details
The analysis provided in Section 2 requires us to annotate the entities and relations of a sample of AMRs and SGs into a pre-defined set of categories. We first select all images that appear in both MSCOCO (Lin et al., 2014) and Visual Genome, so that we have access to ground-truth scene graphs, as well as captions from which we can generate AMR graphs for the same set of images. We use a single AMR per image, generated from the longest caption, but include all SGs associated with an image in our analysis. For each SG and AMR graph, we consider the entities and relations corresponding to the ∼900 most frequent types (around 1.3M entity and 1M relation instances for SGs; and around 130K entity and 150K relation instances for AMRs). We annotate these into a pre-defined set of entity and relation categories, including those defined by Zellers et al. (2018), plus a few we add to cover new AMR relations. Table 5 provides a breakdown of the categories, as well as examples of word types we considered to belong to each category. The table also provides the total number of word types per category and the percentage of instances across all types for each category. Next, we describe our annotation process. SG nodes (entities) come with their most common WordNet sense annotations, which we use to identify their categories. For SG relations, we manually annotate their categories. To annotate AMR entities and relations, we follow a similar procedure, by automatically finding the most common WordNet sense for non-predicate AMR nodes (assuming most of these will be entities) and correcting them if needed. For example, the automatically-identified most common sense of mouse is the Animal sense, whereas in our captions, almost all instances of the word point to the computer mouse (Artifact). For any remaining concepts, including predicate nodes (e.g., eat, stand) and entities for which a category cannot be assigned automatically, we manually identify their categories.
Figure 6: Frequency of the 90 AMR role/edge types prior to the refinement process, which exhibits the characteristics of a long-tailed distribution.

C Meta-AMR Construction Example

Fig. 7 shows an example of how a meta-AMR is constructed from five caption-level AMRs. The corresponding captions are provided in red, and the AMR graphs are given in PENMAN notation.

D Ablations
Effect of input on node prediction performance. Table 3 presents the performance of meta-AMR node prediction (the first stage of IMG2AMR 2stage ) with different input combinations, in terms of precision and recall (when predicted and ground-truth nodes are taken as sets), and BLEU-1 (when the order of nodes in the final linearized AMR is taken into consideration). These results suggest that the overall best performance is achieved by using all input features, namely regions, tags, and the global image feature.

Effect of input on parsing performance. We train our IMG2AMR models with different inputs to the encoders, and evaluate on the VAL set. Specifically, the input to the model may contain the global image feature g, region embeddings r, tag embeddings q (for the first encoder), and node embeddings n (for the second encoder of IMG2AMR 2stage ). Table 4 reports the VAL results of our two models (trained and tested with meta-AMRs) with different input combinations: (region embeddings, tag embeddings, global image features) for the direct model, and (node embeddings, global image features, region embeddings) for the second encoder of the 2stage model. For IMG2AMR 2stage , we fix the input of the first encoder to the best combination according to Table 3 above, and ablate over the input of the second encoder. As we can see, richer input generally results in better performance. We also see a big drop in the performance of IMG2AMR direct when only region features are used as input, suggesting that tags can help associate mappings between regions and AMR concepts.

Figure 7 (excerpt): two of the caption AMRs in PENMAN notation, with the corresponding captions:

(z0 / bicycle :ARG1-of (z1 / park-01 :ARG2 (z2 / kitchen :location (z3 / stove))))

(z0 / and :op1 (z1 / bicycle :ARG1-of (z2 / lean-01 :ARG2 (z3 / stove))) :op2 (z4 / cabinet :location (z5 / inside :op1 (z6 / kitchen))))

Captions: "A bicycle parked in a kitchen with a stove and cabinets"; "A black bicycle leaning against the kitchen cabinets."; "A bicycle leaning on the stove and cabinets located inside the kitchen."; "A bicycle parked in a kitchen by the stove."