HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales

Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. “people eating food in a park”. Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict (“people at a holiday resort”) and the actions they perform (“people having a picnic”). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically by combining the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.


Introduction
Conceptual grounding broadly refers to the idea that symbols (e.g. language) are grounded in perception (Barsalou et al., 2008). Perceptually grounded communication is made possible by the fact that perceptual experiences are largely shared. However, individual experience can also license subjective inferences which inform not just what we express through language, but also what we choose to assume and leave unexpressed (Bisk et al., 2020). Among the many modalities available in the perceptual spectrum, visual grounding has always been of primary interest as it provides a relatively straightforward way to link linguistic expressions to physical objects. Consistent with this claim, a glance at many widely used datasets and models in image captioning reveals a bias towards 'object-centric' descriptions, whereby models are trained on image-text pairs where the text consists of explicit mentions of objects visible in the scene. However, experience and perception also motivate other, non-object-centric ways of talking about the world, for example, when we talk about scenes, or when we describe actions or their underlying rationales. While such 'high-level' descriptions are also perceptually grounded, they incorporate world knowledge and subjective experience.
1 huggingface.co/datasets/michelecafagna26/hl and github.com/michelecafagna26/HL-dataset
For example, the object-centric description in Table 1 certainly describes the visual content, though it is based mainly on the recognition of objects in the scene. By contrast, the three high-level captions (scene, action, rationale, from the HL-Dataset described below), provide three different perspectives of the scene among the many possible ones, which are triggered by expectations and assumptions based on subjective experience and world knowledge.
In this work, we tackle the issue of grounding high-level linguistic descriptions in the visual modality, proposing the High-Level (HL) Dataset: a resource for Vision and Language (V&L) modeling which aligns existing object-centric captions with human-collected high-level descriptions of images along three different axes: scenes, actions and rationales. The high-level captions capture human interpretations of the scene, which are complementary to the object-centric captions used in current V&L datasets, e.g. in COCO (Lin et al., 2014). We take a step further, and we collect confidence scores from independent annotators, which serve to shed light on the extent to which the high-level captions in the dataset correspond to widely-shared assumptions, or to idiosyncratic interpretations. Finally, we consider the task of generating captions that incorporate these different axes, yielding a more narrative-like description of images.

Table 1 (image not reproduced):
Axis                     Caption
scene                    the picture is shot in a ski resort
action                   they are just relaxing after a round of skiing
rationale                they want to have a good time together
object-centric (COCO)    a woman and a boy sitting in the snow outside of a cabin.

arXiv:2302.12189v2 [cs.CL] 1 Aug 2023

Our contributions are:
• We present and release the HL Dataset, a new V&L resource, grounding high-level captions in images along three different axes and aligned with existing object-centric captions;
• We describe the collection protocol and provide an in-depth analysis of the data;
• We present baselines for the High-Level Captioning task and describe further potential uses for our data.

Related work
Hodosh et al. (2013), in their influential work, argue that image captioning is mostly interested in 'conceptual descriptions', which focus on what is actually in the image and differ from the so-called non-visual descriptions, which provide additional background information. This line of thought has been broadly followed in the field, resulting in datasets emphasizing object-centric content in V&L tasks involving text generation, like image captioning (Lin et al., 2014; Sharma et al., 2018; Agrawal et al., 2019) and visual question answering (Antol et al., 2015; Zhu et al., 2016). For instance, in the instructions used to collect COCO (Lin et al., 2014), the annotators are explicitly asked to mention entities visible in the image. This is beneficial to enhance cross-modal interactions: Zhang et al. (2021) show that improving the visual backbone on object recognition tasks improves the performance of visio-linguistic models in downstream tasks. Li et al. (2020) show that using object labels to bridge the two modalities improves the grounding capabilities of V&L models. Object-centricity is also a feature of widely-used web-scraped datasets: in the Conceptual Captions dataset, for instance, Sharma et al. (2018) filtered out all captions which did not overlap with object labels automatically identified by a computer vision model in the corresponding image.
Some efforts have been made to understand how low-level concepts improve generalization capabilities and connect to high-level concepts. Object-centric captions help to improve the generalization over unseen objects (Hu et al., 2021) and play a role in the model understanding of abstract concepts (Cafagna et al., 2022; Wang et al., 2022b). In our work, we are interested in the relations between what Hodosh et al. (2013) refer to as 'conceptual' and 'non-visual' descriptions, which we re-frame as a distinction between low-level (object-centric) and high-level descriptions in multimodal learning. We release a novel dataset to foster research in this direction.
Motivation for the present work is also provided by recent research exploring the visual correlates of inferences, temporal and causal relationships (e.g., Park et al., 2020), which also have implications for generation. In visual storytelling, for instance, a model has to understand actions and interactions among the visually depicted entities (Huang et al., 2016; Hu et al., 2020; Lukin et al., 2018; Hong et al., 2023). Identifying actions is a prerequisite for predicting their motivations or rationales as well as explaining automatically generated descriptions of images (Hendricks et al., 2018). Actions and intentions are paramount to performing commonsense and temporal reasoning on visual inputs. Along these lines, Park et al. (2020) create dynamic stories on top of static images, where the task is to predict prior and subsequent actions and rationales. Our work is similar in spirit, as we align high-level descriptions of actions and rationales with low-level descriptions of static images. Some work has also been done to test the grounding capabilities of multimodal models from a more linguistic perspective. Parcalabescu et al. (2022) build a benchmark to test models on a variety of linguistic phenomena, like spatial relations, counting, existence, etc. Pezzelle et al. (2020) assess the integration of complementary information of V&L models across modalities, while Thrush et al. (2022) test multimodal models on compositional reasoning. In this context, the HL Dataset proposed here can offer another benchmark for V&L models' understanding of high-level descriptions of images. Such descriptions are licensed by the entities depicted in the visual modality and the relationships between them, but they do not mention them explicitly.

Data
In this section, we describe the protocol used to collect annotations for scenes, actions and rationales and the subsequent collection of confidence scores through crowdsourcing. Differently from previous works, such as COCO, where human annotators are instructed to be objective and to mention only the objects clearly visible in the picture, we elicit high-level concepts in the form of captions by encouraging the annotators to rely on their subjective interpretation of the image.

Data collection
The task of collecting high-level descriptions is by nature hard to define and requires a clear and careful formulation, therefore we run a pilot study with the double goal of collecting feedback and fine-tuning the task instructions. Full details of the pilot are reported in Appendix D.
Procedure The participants are shown an image containing at least one human subject and three questions regarding three aspects or axes: scene, actions and rationales, i.e. Where is the picture taken?, What is the subject doing?, and Why is the subject doing it? We explicitly ask the participants to rely on their personal interpretation of the scene and add examples and suggestions in the instructions to further guide the annotators. Moreover, differently from other VQA datasets like Antol et al. (2015) and Zhu et al. (2016), where each question can refer to different entities in the image, we systematically ask the same three questions about the same subject for each image. See Appendix D for the full instructions and Appendix C for details regarding the annotation costs.
Images As mentioned in Section 1, the COCO dataset has a very explicit object-centric orientation; it therefore provides a good starting point to select images, such that we can couple object-centric and high-level captions in a resource-lean approach. Moreover, the alignment of object-centric and high-level captions permits an investigation of the relationship between them.
We randomly select 14,997 images from the COCO 2014 train-val split. In order to answer questions related to actions and rationales we need to ensure the presence of a (human) subject in the image. Therefore, we leverage the entity annotation provided in COCO to select images containing at least one person.
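The selection step described above can be illustrated with a minimal sketch. It assumes the annotations are available as a dictionary in the standard COCO instances layout (`categories`, `annotations` with `image_id` and `category_id`); the function name `images_with_person`, the seed and the toy data are hypothetical and are not taken from the released code.

```python
import random


def images_with_person(coco, sample_size=None, seed=0):
    """Return ids of images annotated with at least one 'person' instance.

    `coco` is a dict in the COCO instances format. If `sample_size` is
    given, a random sample of that size is drawn (seeded, so repeatable).
    """
    person_ids = {c["id"] for c in coco["categories"] if c["name"] == "person"}
    image_ids = sorted(
        {a["image_id"] for a in coco["annotations"] if a["category_id"] in person_ids}
    )
    if sample_size is not None:
        image_ids = random.Random(seed).sample(image_ids, sample_size)
    return image_ids


# Toy annotation file: images 10 and 12 contain a person, image 11 does not.
toy_coco = {
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "annotations": [
        {"image_id": 10, "category_id": 1},
        {"image_id": 11, "category_id": 18},
        {"image_id": 12, "category_id": 1},
        {"image_id": 12, "category_id": 18},
    ],
}
selected = images_with_person(toy_coco)
```

With the real annotation files, `sample_size` would be set to the number of images to annotate.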
The whole annotation is conducted on Amazon Mechanical Turk (AMT). We split the workload into batches in order to ease the monitoring of the quality of the data collected. Each image is annotated by three different annotators, therefore we collect three annotations per axis.

Confidence Scores
The high-level descriptions are collected by asking the participants to interpret the scene leveraging their personal experience. The element of subjectivity leads us to expect some variation in the resulting descriptions, especially where annotators need to infer actions and rationales. In order to distinguish what can confidently be considered widely shared, or 'commonsense', descriptions from more idiosyncratic interpretations, we conduct a separate study where we crowd-source confidence scores for each high-level caption. We ask independent participants to score the likelihood of a high-level description given the image and the corresponding question on a Likert scale from 1 to 5. For a detailed example of the form see Figure 23 in Appendix D.
Agreement-based worker selection The confidence scores are collected following the same protocol used to collect the high-level descriptions. Using the data from our pilot study, which was carried out with participants who had been thoroughly briefed on the task, we ran a preliminary qualification task where we employed an automatic worker selection method to hire qualified annotators from the crowd-sourcing platform.
Let us consider the participants of the pilot as gold annotators (as they were trained on the task) and their annotations as reference annotations. The inter-annotator agreement computed on the reference annotations can be considered the gold inter-annotator agreement $\alpha_{gold}$ of the task. We run the qualification task using the same set of items used in the pilot; then, for each worker $w$, we re-compute the inter-annotator agreement (Hayes and Krippendorff, 2007) on the combination of the worker's and the reference annotations, obtaining $\alpha_w$. We compute an agreement ratio
$$r = \frac{\alpha_w}{\alpha_{gold}}$$
Then, we select the worker $w$ if $r > t$, where $t$ is a threshold empirically set to 0.5. This is equivalent to choosing workers such that their contribution does not negatively affect $\alpha_{gold}$ by a factor greater than $t$. In other words, the workers are selected if they are relatively compliant with the gold annotators.
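The selection rule can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Krippendorff's alpha with the interval distance metric and no missing ratings (the text does not specify the metric), and the names `krippendorff_alpha_interval` and `select_worker` are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with squared-difference (interval) distance.

    `units` is a list of lists; each inner list holds the ratings that
    different annotators gave to the same item.
    """
    units = [u for u in units if len(u) >= 2]  # only pairable items count
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: pairs of ratings within the same item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: pairs of ratings drawn from the whole pool.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e


def select_worker(gold_units, worker_ratings, t=0.5):
    """Keep the worker if r = alpha_w / alpha_gold exceeds the threshold t."""
    alpha_gold = krippendorff_alpha_interval(gold_units)
    if alpha_gold <= 0:
        raise ValueError("the ratio is only meaningful for a positive gold agreement")
    combined = [list(g) + [w] for g, w in zip(gold_units, worker_ratings)]
    alpha_w = krippendorff_alpha_interval(combined)
    return alpha_w / alpha_gold > t


# Toy qualification set: four items, each rated by two gold annotators.
gold = [[1, 1], [5, 5], [3, 3], [2, 2]]
compliant = select_worker(gold, [1, 5, 3, 2])      # agrees with the gold ratings
non_compliant = select_worker(gold, [5, 1, 3, 5])  # mostly contradicts them
```

A worker who reproduces the gold ratings leaves the agreement unchanged and passes, while one who contradicts them drives $\alpha_w$ below the threshold and is rejected.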

Dataset Analysis
In this section, we analyse the captions collected in the High-Level Dataset. To provide insights on the kind of captions collected, we analyse the distribution of the captions across different axes, also comparing them with the object-centric COCO captions. Furthermore, we perform a grammatical error analysis, which we report in Appendix A.1.

High-Level descriptions
We collected 3 annotations per axis over a set of 14,997 images for a total of 134,973 captions. An example of high-level descriptions aligned with the original object-centric caption from COCO is shown in Table 1. We expect to observe shorter texts in the high-level captions, as annotators were not giving the highly descriptive details typical of object-centric captions. This is visible in Figure 1, which shows that the length of the high-level captions is roughly half that of the object-centric COCO captions. Though shorter, they have a comparable number of unique tokens over all the axes (as reported in Table 2); this suggests that the high-level captions are not repetitive and contain a fair amount of lexical variability. A more detailed comparison of the statistics is reported in Table 2.
Moreover, as already mentioned, the COCO captions are object-centric, that is, these captions are collected to objectively represent the visual content. Although this is convenient in recognition-oriented tasks, they lack the situational knowledge required to contextualize scenes; knowledge that is instead an essential part of the cognitive processes underlying the grounding of language in vision. Indeed, as shown in Figure 2, the most frequent lemmas in the original COCO captions for the images used in the HL Dataset denote mostly objects visible in the picture. The high-level captions represent the same visual content with the addition of situational knowledge coming from the three axes, and this is also visible in different lexico-semantic choices in the texts. For example, Figure 3 shows the most frequent lemmas found in the scene axis. Because we align them to the same images, the dataset gives us a clean way to explore the relationship between objects and high-level axes.
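The length and vocabulary comparison discussed in this subsection can be reproduced with a very small script. The sketch below assumes lowercased whitespace tokenization, which may differ from the tokenization behind Table 2; the function name `caption_stats` is hypothetical, and the example captions are the ones shown in Table 1.

```python
def caption_stats(captions):
    """Average caption length (in tokens) and number of unique tokens."""
    tokenized = [c.lower().split() for c in captions]
    n_tokens = sum(len(tokens) for tokens in tokenized)
    vocabulary = {tok for tokens in tokenized for tok in tokens}
    return {
        "avg_length": n_tokens / len(tokenized),
        "unique_tokens": len(vocabulary),
    }


high_level = [
    "the picture is shot in a ski resort",
    "they are just relaxing after a round of skiing",
    "they want to have a good time together",
]
object_centric = ["a woman and a boy sitting in the snow outside of a cabin."]

hl_stats = caption_stats(high_level)
coco_stats = caption_stats(object_centric)
```

On this single image the high-level captions are shorter on average than the COCO caption, in line with the trend reported for the whole dataset.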
Disentangling the content across the axes Asking the same three questions about the same subject for each image allows us to consistently compare the content of our captions across three well-defined axes. We analyse the most frequent nouns in the scene axis in order to characterize the kind of scenes mentioned in the captions collected. The most frequent scenes include street, room and road. These are scene types which can encompass a very broad variety of objects. However, we can also identify scenes for which a narrower range of objects would be diagnostic, for example those related to sport activities like baseball, tennis, ski, ground and court, or domestic environments like house, kitchen and living (referring to 'living rooms'). For a more complete view see Figure 3, where we report the top 20 most frequent scenes in the HL dataset.
[Figure 2, labels recovered from the plot: person, tennis, street, group, baseball, table, ball, boy, bus, snow, player, girl, field, beach, game, skateboard, water, train]
Similarly, we can also characterize the action and the rationale axes. We identify the action distribution by analysing the verbs contained in the captions. In Figure 4 we observe that the most frequent actions are related to sports activities, consistent with what was observed in the scene axis distribution. The most frequent verbs are play, ski, surf, skateboard, but we can also find generic actions like hold, walk, sit and eat.
In the rationale axis we analyse both nouns and verbs. In this axis we expect to observe more subjectivity and content variability, with more lemmas denoting intents, mental states and events, including psych verbs. Our hypothesis is that the annotators leverage their personal experience to infer these answers to a greater extent than they do for scene descriptions.
The majority of the rationales express intentions; in fact, want is by far the most frequent term in the lemma distribution. As observed with the other two axes, terms related to sports activities are frequent (play, game, tennis, practice), as are terms related to leisure (enjoy, fun, vacation, love, family) and to generic activities (work, wait, try, eat). For more details see Figure 5.
The systematic disentanglement of the content along three axes can serve as a filter to identify or analyse sub-samples of the data with specific characteristics. For instance, as observed so far, we can confidently say that sports-related activities are predominant in the dataset.
Connecting high- and low-level concepts One of the main goals of this resource is to enable the discovery of connections between high- and low-level captions, that is, descriptions of the same images at different levels of abstraction. By construction, the alignment provided by the HL Dataset allows us to identify concrete objects in images which provide 'support' to infer high-level concepts such as scenes, actions and rationales.
We dive deeper into our analysis and study the connection between high-level concepts related to scene, action and rationale and the low-level objects present in the aligned COCO captions. We ask: 'What are the most informative objects for a high-level concept (e.g. enjoy) found in a specific axis (e.g. rationale)?' We leverage the Point-wise Mutual Information (PMI) (Church and Hanks, 1990) to find the most informative objects linked to a high-level concept. This is helpful to discover connections between concepts across different levels of abstraction but also gives clues on the content distributions within the axes. We filter out object mentions which have a frequency less than 100 in the low-level captions. This leaves 475 object-denoting lemmas. Then, we compute the PMI between content words in the high-level captions and all these lemmas. For example, Figure 6 shows the nouns in the object-centric captions which have the strongest PMI with the verb 'enjoy' in the rationale axis.
We can observe that high-level captions can express different nuances of the same abstract concept. To take another example, love (in Figure 7) can refer to the love between an animal and its owner, between two partners (e.g. wedding) or the love for sports (e.g. skate, snowboard). In the same way, as shown in Figure 6, a general concept like enjoy can be characterized by object-level concepts leaning toward a specific nuance of meaning, like sports activities (e.g. kite, snowboarder, skier) or places (e.g. sandy shore, ocean, lake). More examples are provided in Appendix A.2.
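The PMI analysis described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes probabilities are estimated from image-level co-occurrence (a high-level lemma and an object lemma co-occur when they describe the same image), and it applies the frequency filter to image counts rather than caption counts. The function name `pmi_table` and the toy data are hypothetical.

```python
import math
from collections import Counter


def pmi_table(pairs, min_object_freq=100):
    """PMI between high-level lemmas and object lemmas.

    `pairs` holds one (high_level_lemmas, object_lemmas) tuple per image.
    PMI(h, o) = log2( p(h, o) / (p(h) * p(o)) ), with probabilities
    estimated as relative frequencies over images.
    """
    pairs = [(set(h), set(o)) for h, o in pairs]
    n = len(pairs)
    high_counts, obj_counts, joint_counts = Counter(), Counter(), Counter()
    for high, objs in pairs:
        high_counts.update(high)
        obj_counts.update(objs)
        joint_counts.update((h, o) for h in high for o in objs)
    # Drop rare object lemmas, whose PMI estimates would be unreliable.
    kept = {o for o, c in obj_counts.items() if c >= min_object_freq}
    return {
        (h, o): math.log2(c * n / (high_counts[h] * obj_counts[o]))
        for (h, o), c in joint_counts.items()
        if o in kept
    }


# Toy data: rationale lemmas paired with object lemmas from COCO captions.
toy_pairs = [
    ({"enjoy"}, {"kite", "beach"}),
    ({"enjoy"}, {"kite"}),
    ({"work"}, {"laptop"}),
    ({"work"}, {"laptop", "kite"}),
]
table = pmi_table(toy_pairs, min_object_freq=2)
```

Sorting the entries for a fixed high-level lemma by decreasing PMI gives a ranking of the kind plotted in Figures 6 and 7.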

Confidence scores analysis
Our confidence scores are similar in spirit to the self-confidence scores collected in the VQA dataset (Antol et al., 2015). However, they differ insofar as our scores are not self-reported by the authors of the captions, but collected from independent annotators. The inclusion of an external judgment plays an important role in determining the reliability of interpretation operated by the annotators in the caption collection and therefore, in shedding light on the extent to which an annotator's interpretation of a scene relies on 'shared' or 'commonsense' knowledge, or is entirely idiosyncratic.
We observe an average confidence score of 4.47 on a Likert scale from 1 to 5 (with a standard deviation of 0.78 and a median of 5) over all the axes. This suggests that, overall, according to independent judges, our high-level captions succeeded in capturing shared or 'commonsense' high-level interpretations of the scene.
Furthermore, the confidence scores provide an additional perspective under which our data can be characterized: by performing an axis-wise analysis of the confidence scores distribution (see Figure 8), we observe that the scene and action captions feature the highest overall confidence, while the rationale axis lags behind by a small margin. We expect such differences, since determining the rationale of an action depicted in a static image is challenging, in particular because annotators can leverage significant visual cues, but have no access either to temporal information or to the subject's stated intentions. Therefore, they need to resort to their own priors and expectations, which can also lead to idiosyncratic interpretations that independent judges, as in our confidence score analysis, would find relatively unlikely. One important use of confidence scores is to provide a measure of uncertainty of the data, which can be used, for instance, to identify hard samples; an example is shown in Figure 9. The scene is hard to interpret even for humans and the scene captions display more variability and have low confidence scores. A detailed analysis of lexical and semantic variability in the presence of high-confidence scores is reported in Appendix A.3.
Table 3: Automatic metrics for baselines (GIT, BLIP, and ClipCap) fine-tuned along the three axes (scene, action, and rationales) of the HL dataset. The results are the average of 5 evaluation runs, by keeping the same decoding strategy and parameters for all the models.
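As an illustration of how confidence scores could be used to flag hard samples, the sketch below averages the scores per image and axis and returns the pairs falling under a cut-off. The record layout (`image_id`, `axis`, `confidence`), the function names and the threshold of 3.0 are assumptions made for this example and do not describe the released data format.

```python
from collections import defaultdict
from statistics import mean


def mean_confidence(records):
    """Average the confidence scores for each (image_id, axis) pair."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["image_id"], rec["axis"])].append(rec["confidence"])
    return {key: mean(scores) for key, scores in grouped.items()}


def hard_samples(records, threshold=3.0):
    """Return the (image_id, axis) pairs whose mean confidence is below threshold."""
    averages = mean_confidence(records)
    return sorted(key for key, avg in averages.items() if avg < threshold)


# Toy records: three scene captions per image, each with a 1-5 Likert score.
toy_records = [
    {"image_id": 1, "axis": "scene", "confidence": 5},
    {"image_id": 1, "axis": "scene", "confidence": 5},
    {"image_id": 1, "axis": "scene", "confidence": 4},
    {"image_id": 2, "axis": "scene", "confidence": 2},
    {"image_id": 2, "axis": "scene", "confidence": 3},
    {"image_id": 2, "axis": "scene", "confidence": 1},
]
flagged = hard_samples(toy_records)
```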

Baselines and results
In this section, we show how the dataset can be used to fine-tune models to generate high-level, aspect-specific descriptions, e.g. image-to-scene or image-to-action. Below, in Section 6, we also describe a data augmentation and generation experiment, to merge the three axes into more 'narrative-like' descriptions of images.
We provide baselines for this task by fine-tuning three models, namely GIT (Wang et al., 2022a), BLIP (Li et al., 2022), and ClipCap (Mokady et al., 2021) on each separate axis. All the baselines were trained for a maximum of 10 epochs using a learning rate of 5e-5, Adam optimizer, and half-precision (fp16). Table 3 displays automatic evaluation results for the three models, on each axis. The first observation is that ClipCap outperforms the other models by far on each separate axis. Differently from the other models, which are natively multimodal, ClipCap leverages an LLM to generate captions, conditioning the text generation on a prefix representing the visual information, which is obtained by a mapping network trained to generate the prefix from CLIP's (Radford et al., 2021) image embeddings.
A second observation, consistent with the analysis presented in earlier sections, is that on all metrics, models fine-tuned to generate rationale-based descriptions receive lower scores. We hypothesise that this is due in part to the greater variability in this axis, and to its inherent difficulty, as reflected in lower confidence scores. Future work could leverage these scores as additional signal in fine-tuning models on captions that require more inference, compared to more descriptive ones.

We now describe how we extend the dataset to combine the three axes to compose a short 'narrative', which describes the scene, action and rationale in tandem. We call this new dataset HL Narratives. To do this, we leverage the individual axes and synthesise this part of the data using a pre-trained language model. Since scenes, actions, and rationales were elicited individually in a visually grounded and controlled setting, a synthesised version of the three individual captions should also be true of the image to the same extent (modulo the variations in confidence that we observe).

Data generation process
We frame the synthesis of narrative captions as a paraphrasing task. We follow a human-in-the-loop approach consisting of four steps: (i) we manually annotate a small sample of gold data; (ii) we fine-tune a large pre-trained language model (LPLM); (iii) we use the fine-tuned model to generate a sample of data, which is manually corrected and then (iv) added to the gold annotations before fine-tuning again. This procedure allows us to quickly annotate a considerable amount of data in only a few iterations, because the model progressively improves the quality of the generated data, making manual correction easier.
We use a version of T5 (Raffel et al., 2020) already fine-tuned on paraphrase generation (see footnote 3) as LPLM data generator. We initialise the process with manually paraphrased annotations for 50 images (3 × 50 = 150), fine-tune the model for 2 epochs, and generate 150 captions for another 50 images, which are manually corrected and added to the original 150. The model is then fine-tuned for a further two epochs. In each iteration, we reserve 10% as validation data. After two epochs, we observe that the validation loss does not improve further. Finally, in the last iteration, we use all gold data to fine-tune the model and generate synthetic high-level captions for the whole HL dataset, obtaining 14,997 synthetic captions for training and 1,499 for testing. In addition to the T5 paraphrase model, we also experimented with LLaMA (Touvron et al., 2023) in a few-shot setting; however, we find that T5 outperforms LLaMA in this task.
3 Details about the T5 fine-tuned on paraphrase generation are available at https://huggingface.co/Vamsi/T5_Paraphrase_Paws.
Table 4: Results of the narrative generation task, averaged over 5 runs using the same decoding parameters for all models. PRE: pretrained models; FT: finetuned on the synthetic data.
See Appendix B for full details.
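The human-in-the-loop procedure described in this section can be summarised as a short control loop. The sketch below is schematic: `fine_tune`, `generate` and `correct` are hypothetical callables standing in for model training, caption generation and manual correction, and the way the validation split is taken is an assumption. It does not use or reproduce any real training API.

```python
def human_in_the_loop(seed_gold, batches, fine_tune, generate, correct, val_fraction=0.1):
    """Grow a gold set by alternating fine-tuning and manual correction.

    seed_gold: manually written gold examples used to start the process.
    batches:   inputs to be annotated, one list per iteration.
    fine_tune(model, train, val) -> model
    generate(model, inputs)      -> draft captions
    correct(drafts)              -> human-corrected captions
    """
    gold = list(seed_gold)
    model = None
    for batch in batches:
        # Hold out a fraction of the current gold data for validation.
        n_val = max(1, int(len(gold) * val_fraction))
        model = fine_tune(model, gold[n_val:], gold[:n_val])
        drafts = generate(model, batch)
        gold.extend(correct(drafts))  # corrected drafts become gold data
    # Final pass: fine-tune on all gold data before generating at scale.
    model = fine_tune(model, gold, [])
    return model, gold


# Dummy stand-ins so that the control flow can be run end to end.
calls = []

def fake_fine_tune(model, train, val):
    calls.append((len(train), len(val)))
    return "model-v%d" % len(calls)

def fake_generate(model, inputs):
    return ["draft: " + x for x in inputs]

def fake_correct(drafts):
    return [d.replace("draft: ", "gold: ") for d in drafts]

seed = ["gold: seed %d" % i for i in range(150)]
batch = ["caption %d" % i for i in range(150)]
final_model, gold_set = human_in_the_loop(
    seed, [batch], fake_fine_tune, fake_generate, fake_correct
)
```

With 150 seed annotations and one batch of 150 corrected captions, the gold set grows to 300 examples before the final fine-tuning pass.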

Results
We build three baselines by fine-tuning the same three large pre-trained models used in Section 5: GIT, BLIP, and ClipCap on our synthetic narrative captions. We fine-tune for 3 epochs with batch size 8, learning rate 5e-5, and Adam optimizer with weight decay (Loshchilov and Hutter, 2017). We test on our gold human-annotated data. As shown in Table 4, where we report results for automatic metrics, overall the models achieve worse results than in the aspect-specific caption generation task (reported in Table 3). This further highlights the difficulty of generating narrative captions of this kind for models trained on object-centric captions.
Notably, the best-performing model in the aspect-specific caption generation task, namely ClipCap, is the worst in the narrative caption generation, though by a small margin (Table 4). This suggests that although a conditioned LLM can greatly adapt to generate high-level descriptions of specific aspects of the scene, it struggles in generating comprehensive high-level descriptions involving multiple high-level aspects of the scene. Ultimately, this suggests that the multimodal representations learned by multimodal models are more robust and effective in generating natural captions than conditioned unimodal models such as ClipCap.
However, the exposure to a small amount of synthetic high-level captions is sufficient to drive the models' generated text toward more narrative-like outputs. See Appendix F for more examples from all models. Further progress can be made in this direction, for example by incorporating confidence scores during fine-tuning.

Further uses of the HL Dataset
We envision a wide set of use cases and tasks enabled by the HL Dataset.
V&L generative tasks Our captions support image captioning generation tasks which encompass a broader range of visually grounded linguistic descriptions than the highly object-centric, 'conceptual' descriptions which dominate the captioning literature (Hodosh et al., 2013). Moreover, the decomposition along three axes can be exploited to compose narratives of the image, as in image paragraph generation (Wang et al., 2019) and visual storytelling (Huang et al., 2016; Hu et al., 2020). They can be used in combination with the question each axis corresponds to, in order to generate micro-dialog scenarios.
We would also argue that the high-level captions are more natural and human-like, since they were collected without enforcing any restriction on the content to be described. Given that the images are also aligned with object-centric captions, it is possible to envisage a scenario in which a model is trained to generate high-level captions, which are 'explained' or justified with reference to low-level, object-centric properties (see Hendricks et al., 2016, 2018, for some work in this direction). In this way, the dataset can be leveraged to provide captions and explanations. Furthermore, the confidence scores serve for the identification of hard samples in the data, both for evaluation purposes and to provide additional training signals, as recently shown by Ouyang et al. (2022).
Multimodal Grounding The HL Dataset is also a useful resource to benchmark the grounding capabilities of large pre-trained V&L models. Along these lines, Cafagna et al. (2021) study the capability of V&L models to understand scene descriptions in zero-shot settings, finding that only large-scale pre-trained V&L models have enough generalization capabilities to handle unseen high-level scene descriptions. Cafagna et al. (2022) analyse the impact of exposure to high-level scene descriptions on multimodal representations in models pre-trained on object-centric captions. They show that exposure to high-level concepts mainly affects the model's attentional resource allocation over the visual input, even though the low-level concepts learned during pre-training provide enough signal to support and easily adapt to scene descriptions during fine-tuning. This is also supported by Wang et al. (2022b), who find that low-level concepts are needed to learn higher-level concepts, though this does not hold in the other direction.

Conclusions
In this paper, we introduced the High-Level (HL) Dataset. We extended 14,997 images from the popular COCO dataset with 134,973 human-annotated high-level descriptions systematically collected over three axes: scene, action, and rationale. We aligned high-level captions with object-centric captions and we provided human-collected confidence scores to measure the degree of commonsense expressed in the high-level captions. We also provided baseline results on generating captions for individual axes, as well as synthesised narrative captions by combining these three high-level axes of description.
Differently from current V&L captioning datasets, the high-level captions capture the human interpretation of the scene allowing for inference and expectations. We discussed how they can be used also in combination with low-level captions to improve research in visual commonsense reasoning and multimodal grounding of visual concepts into linguistic expressions and for generative tasks, hoping to foster future research in this direction.

Ethical Considerations
The data collection received ethical approval from the University of Malta Research Ethics Committee. This data is intended to be used for training, fine-tuning, and performing experimental evaluations of machine learning models. The dataset from which the images were originally sourced is a widely-studied, publicly available resource. As far as we are aware, the data does not contain harmful or offensive content. However, we acknowledge that any biases in the collection of images and/or captions in the original dataset will also be present in the HL Dataset.

Supplementary Materials Availability Statement:
The HL Dataset is publicly released on GitHub 4 and Huggingface 5 . The synthetic HL Narratives Dataset described in Section 6 is publicly released on Huggingface 6 . All the baselines described in Sections 5 and 6 are available on Huggingface 7 .

A.1 Quantifying grammatical errors
We ask two postgraduate students with expertise in linguistics to correct grammatical errors in a sample of 9900 captions, 900 of which are shared between the two experts. They are shown the image-caption pairs and are asked to edit the caption whenever they identify a grammatical error. The most common errors reported by the annotators are:
• Misuse of prepositions;
• Wrong verb conjugation;
• Pronoun omissions.
In order to quantify the extent to which the corrected captions differ from the original ones, we compute the Levenshtein distance (Levenshtein, 1966) between them.
We observe that 22.5% of the sample was edited, and only 5% was edited with a Levenshtein distance greater than 10. This suggests a reasonable level of grammatical quality overall, with no substantial grammatical issues. This can also be observed from the Levenshtein distance distribution reported in Figure 11. Moreover, the human evaluation is quite reliable, as we observe a moderate inter-annotator agreement (Krippendorff's α = 0.507; Krippendorff, 2018) computed over the shared sample.
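The edit statistics above can be reproduced along the following lines. This is a minimal sketch, assuming a character-level Levenshtein distance and a list of (original, corrected) caption pairs; the function names and the example captions are hypothetical and are not taken from our annotation scripts.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def edit_statistics(pairs, threshold=10):
    """Return (share of edited captions, share edited beyond `threshold`)."""
    distances = [levenshtein(original, corrected) for original, corrected in pairs]
    n = len(distances)
    edited = sum(d > 0 for d in distances) / n
    heavily_edited = sum(d > threshold for d in distances) / n
    return edited, heavily_edited


if __name__ == "__main__":
    # Hypothetical (original, corrected) pairs, for illustration only.
    sample = [("they is skiing", "they are skiing"), ("in a park", "in a park")]
    print(edit_statistics(sample))
```

The threshold of 10 mirrors the cut-off used in the text; any other value can be passed through the `threshold` argument.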

A.2 PMI analysis examples
The PMI analysis can provide interesting insight into the connection between object-level and high-level captions on all three axes.
On the scene axis, for instance, the PMI gives some clues on the extent to which an object can be considered diagnostic for a scene. For instance, two semantically similar scenes like restaurant (see Figure 12) and kitchen (see Figure 14) share several diagnostic objects, as we would expect. However, we can identify important semantic nuances: the scene restaurant contains objects related to food (e.g. pizza, cheese, wine, sandwich) whereas kitchen contains objects related to the preparation of food (e.g. stove, oven, tray, refrigerator). Another example is shown in Figure 13, where the most relevant objects for the action look encompass a wide variety of contexts, like looking at a screen or a device (e.g. device, screen, cellphone) or entertainment (e.g. zoo, zebra, giraffe). For more examples see Table 5, which shows the most relevant objects for the top three lemmas in the scene, action and rationale axes.
These semantic differences, while quite easy for humans to interpret, are not usually present in object-centric V&L datasets. They are made explicit and easy to identify in the HL dataset, where captions with different levels of abstraction are aligned with the same image.
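To make the notion of a diagnostic object concrete, the sketch below computes PMI between object words and high-level lemmas using the standard definition, PMI(o, l) = log2 [p(o, l) / (p(o) p(l))]. It assumes that probabilities are estimated from image-level co-occurrence counts; this estimation choice, the function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_table(items):
    """items: iterable of (objects, lemmas) pairs, one pair per image.

    Returns {(object, lemma): PMI} for every pair that co-occurs at least once.
    """
    n = 0
    object_counts, lemma_counts, joint_counts = Counter(), Counter(), Counter()
    for objects, lemmas in items:
        n += 1
        objects, lemmas = set(objects), set(lemmas)
        object_counts.update(objects)
        lemma_counts.update(lemmas)
        joint_counts.update((o, l) for o in objects for l in lemmas)
    # p(o, l) / (p(o) p(l)) simplifies to joint * n / (count_o * count_l).
    return {
        (o, l): math.log2(joint * n / (object_counts[o] * lemma_counts[l]))
        for (o, l), joint in joint_counts.items()
    }


def top_objects(pmi, lemma, k=3):
    """Objects with the highest PMI for a lemma (ties broken alphabetically)."""
    ranked = sorted(
        ((o, score) for (o, l), score in pmi.items() if l == lemma),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [o for o, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy data, for illustration only.
    toy_items = [
        ({"pizza", "wine", "table"}, {"restaurant"}),
        ({"pizza", "table"}, {"restaurant"}),
        ({"stove", "oven", "table"}, {"kitchen"}),
        ({"oven", "refrigerator"}, {"kitchen"}),
    ]
    print(top_objects(pmi_table(toy_items), "restaurant"))
```

In the toy data, table appears with both scenes and therefore receives a lower PMI for restaurant than pizza, which appears only with restaurant.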

A.3 Quantifying Lexical and Semantic Diversity
In Section 4.2, we showed that in the presence of low confidence, there can be variation or disagreement among high-level captions given by different annotators for the same axis. In such cases, the captions focus on different aspects or refer to different interpretations. Although this phenomenon has been observed for captions with a low confidence score, it is conceivable that it might also happen with high-confidence captions. For example, two captions written by different annotators could differ in their interpretation of an image and nevertheless both be considered highly likely. To quantify this phenomenon, in this section we further expand our analysis by studying the lexical and semantic diversity of our captions.
Purity score We leverage the BLEURT score (Sellam et al., 2020), a trainable metric used to evaluate semantic differences in Natural Language Generation, to compute a score measuring the semantic diversity among the high-level captions associated with an image. To do so, we first compute such scores across each axis, and then we combine them to obtain a final score for the item. In this way, we can unpack the semantic diversity item-wise and axis-wise. Let $C$ be the set of high-level captions of a given axis (e.g. scenes) for a given image. For simplicity, we do not report the index of the image and the axis in the following notation. We compute the BLEURT score of the caption as follows:

$$s_i = \mathrm{BLEURT}(c_i, \mathit{ref}) \qquad (2)$$

where $s_i$ is the resulting BLEURT score, $c_i$ is a high-level caption, and $\mathit{ref}$ is the set of reference captions defined as follows:

$$\mathit{ref} = \{c_j \in C : j \neq i\} \qquad (3)$$

In other words, $\mathit{ref}$ is the set of remaining captions along the axis and therefore $s_i$ measures the semantic diversity of the caption with respect to the other captions along the same axis.
By averaging the caption-wise scores across a single axis and across all the axes we obtain a purity score measuring the semantic consistency both axis-wise and item-wise.
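The leave-one-out averaging described above can be sketched as follows. The sketch assumes a generic `scorer(candidate, references)` callable; how the scores against several references are aggregated is left to that callable. Since BLEURT requires a trained checkpoint, a toy token-overlap scorer (`jaccard_scorer`, hypothetical) stands in for it here, so the numbers it produces are not BLEURT scores.

```python
from statistics import mean


def leave_one_out_scores(captions, scorer):
    """Score each caption against the remaining captions of the same axis."""
    scores = []
    for i, caption in enumerate(captions):
        references = captions[:i] + captions[i + 1:]
        scores.append(scorer(caption, references))
    return scores


def purity(item, scorer):
    """item: {axis name: list of captions}. Returns (axis-wise, item-wise) purity."""
    axis_scores = {
        axis: mean(leave_one_out_scores(captions, scorer))
        for axis, captions in item.items()
    }
    return axis_scores, mean(list(axis_scores.values()))


def jaccard_scorer(candidate, references):
    """Toy stand-in for BLEURT: mean token-set overlap with each reference."""
    cand = set(candidate.lower().split())
    overlaps = []
    for reference in references:
        ref = set(reference.lower().split())
        overlaps.append(len(cand & ref) / len(cand | ref))
    return mean(overlaps)


if __name__ == "__main__":
    # Hypothetical item with two axes, for illustration only.
    toy_item = {
        "scene": ["in a park", "in a park", "in a park"],
        "action": ["they are eating", "they are running", "they are eating"],
    }
    print(purity(toy_item, jaccard_scorer))
```

Replacing `jaccard_scorer` with a function that wraps a BLEURT checkpoint would follow the procedure in the text more closely; the averaging logic stays the same.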
Diversity score Along the same lines, we propose the diversity score to measure the lexical diversity of the captions. The diversity score follows the same logic used to compute the purity score introduced in the previous paragraph, but the BLEURT score in Eq. 2 is replaced by the BLEU score (Papineni et al., 2002), which is then normalized between 0 (similar) and 1 (very different). Our score is similar in spirit to self-BLEU (Zhu et al., 2018), as it measures the similarity of the captions within their own distribution. However, it is computed only over axis-wise and item-wise captions.
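A minimal sketch of the diversity score follows. It makes several assumptions that are not specified in the text: a simplified sentence-level BLEU with n-grams up to order 2 and add-one smoothing (short captions rarely share longer n-grams), whitespace tokenization, and normalization as 1 - BLEU. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=2):
    """Simplified BLEU in [0, 1]: clipped n-gram precision with add-one smoothing."""
    cand = candidate.lower().split()
    refs = [reference.lower().split() for reference in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # Brevity penalty against the reference length closest to the candidate length.
    closest = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    brevity = 1.0 if len(cand) >= closest else math.exp(1 - closest / max(len(cand), 1))
    return brevity * math.exp(sum(log_precisions) / max_n)


def diversity(captions, max_n=2):
    """Mean of 1 - BLEU, each caption scored against the others: 0 similar, 1 different."""
    scores = []
    for i, caption in enumerate(captions):
        references = captions[:i] + captions[i + 1:]
        scores.append(1.0 - sentence_bleu(caption, references, max_n))
    return mean(scores)


if __name__ == "__main__":
    # Hypothetical captions, for illustration only.
    print(diversity(["in a park", "in a park", "in a park"]))
    print(diversity(["in a park", "at the beach", "on a ski slope"]))
```

Because of the smoothing, captions with no n-gram in common obtain a score close to, but below, 1. A standard BLEU implementation with a different smoothing method would give slightly different values.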

A.3.1 Results and discussion
As shown in Figure 15, the purity scores obtained are mostly negative. This is due to lexical variations, to which the BLEURT score is known to be sensitive (Sellam et al., 2020). However, BLEURT is not bounded to any specific interval and is therefore usually hard to interpret (Sellam et al., 2020) unless considered in relative terms. For this reason, we use it to compare the semantic purity across items and axes within our dataset. As shown in Figure 15, action and scene share similar purity score distributions, whereas the rationale is more skewed to the left than the other axes. This shows that the rationales feature a higher semantic diversity (lower overall BLEURT) than the other axes.
The rationale axis is also the one featuring the highest lexical diversity, whereas the scene and the action have similar distributions. This is shown in Figure 16, where the rationale density estimate (in green) has a higher peak, skewed towards the right-hand side, than the scene and action density estimates (in orange and blue, respectively).
We make similar observations for both the purity and the diversity scores. This confirms what was observed in the confidence score analysis in Section 4.2, namely that the task of determining the rationale of an action from a static image produces more variation and divergent interpretations, leading to higher semantic and lexical diversity. Moreover, we find that both the diversity and the purity scores positively correlate with the confidence scores (see Figure 17).

A.3.2 Item-based analysis
An item in the HL dataset is an image along with all the high-level captions of all the axes. For instance, Figures 18 and 19 show the item-wise diversity score and purity score distribution respectively, along with their average value across the whole dataset. An item on the right-hand side of the distribution is systematically more consistent across its axes with respect to the measure considered (purity or diversity). This information can be combined with confidence scores to perform a more fine-grained sample selection. For example, in zero-shot testing we might want to test our model on hard samples; to this end, we can select items with similar lexicons, low semantic purity, and low confidence scores.
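Such a selection amounts to a simple filter over item-wise scores, as in the sketch below. The record layout, the function name and the threshold values are hypothetical; in practice the thresholds would be chosen from the score distributions (e.g. relative to the dataset averages).

```python
from dataclasses import dataclass


@dataclass
class ItemScores:
    image_id: str
    diversity: float   # item-wise lexical diversity (low = similar lexicon)
    purity: float      # item-wise semantic purity
    confidence: float  # average confidence score of the item's captions


def select_hard_items(items, max_diversity, max_purity, max_confidence):
    """Keep items with similar lexicons, low semantic purity and low confidence."""
    return [
        item for item in items
        if item.diversity <= max_diversity
        and item.purity <= max_purity
        and item.confidence <= max_confidence
    ]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    scored = [
        ItemScores("img_a", diversity=0.30, purity=-1.2, confidence=2.1),
        ItemScores("img_b", diversity=0.85, purity=-0.3, confidence=4.6),
    ]
    hard = select_hard_items(scored, max_diversity=0.5, max_purity=-1.0, max_confidence=3.0)
    print([item.image_id for item in hard])
```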

B.1 Few-shot Prompting Data Generation
We test an alternative data generation pipeline by leveraging the in-context learning capabilities featured by the most recent large language models (LLMs) (Brown et al., 2020; Maeng et al., 2017; Touvron et al., 2023). This data generation approach has the advantage of not requiring any model fine-tuning. We design a prompt for our task and we use it to generate data from the recently developed LLaMA model (Touvron et al., 2023). The prompt consists of the task description, followed by an example and the inputs of the task written in natural language. The full prompt is shown in Figure 20. The resulting output is then post-processed to extract the generated high-level caption.

Given three sentences merge them into one sentence, and make sure that the sentence is grammatically correct. Here is an example:'in a beach',' holding an umbrella',' so they won't get a sunburn' <holding an umbrella in the beach so that they won't get a sunburn.>\n The three sentences are: 'scene','action','rationale' <
Figure 20: Prompt used for the data generation. The parts in bold are replaced with the corresponding high-level descriptions for the given sample.
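The prompt construction and post-processing steps can be sketched as follows. The template reproduces the text of Figure 20 with the three placeholders made explicit. The extraction rule, which keeps the text generated before the first closing angle bracket, is an assumption about the post-processing and not a description of the exact procedure; the function names are hypothetical, and no model call is shown.

```python
PROMPT_TEMPLATE = (
    "Given three sentences merge them into one sentence, and make sure that "
    "the sentence is grammatically correct. Here is an example:"
    "'in a beach',' holding an umbrella',' so they won't get a sunburn' "
    "<holding an umbrella in the beach so that they won't get a sunburn.>\n "
    "The three sentences are: '{scene}','{action}','{rationale}' <"
)


def build_prompt(scene: str, action: str, rationale: str) -> str:
    """Fill the Figure 20 template with the three high-level captions of an item."""
    return PROMPT_TEMPLATE.format(scene=scene, action=action, rationale=rationale)


def extract_caption(completion: str) -> str:
    """Assumed post-processing: keep what the model wrote before the first '>'.

    The prompt ends with '<', so the continuation is expected to be the merged
    caption followed by '>'. If no '>' is produced, the first line is kept.
    """
    head = completion.split(">", 1)[0]
    lines = head.strip().splitlines()
    return lines[0].strip() if lines else ""


if __name__ == "__main__":
    # Hypothetical inputs and model continuation, for illustration only.
    print(build_prompt("in a ski resort", "they are skiing", "they are on a family holiday"))
    print(extract_caption("they are skiing in a ski resort on a family holiday.>\n The three"))
```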
Discussion As described in Section 6, we build baseline image captioning models starting from GIT-base and fine-tuning on the LLaMA- and T5-generated synthetic data. The best model is chosen based on a combination of qualitative inspection of the models' outputs and automatic metrics (SacreBLEU (Post, 2018), ROUGE-L (Lin, 2004) and CIDEr (Vedantam et al., 2015)) computed over the gold data.
In Table 6 we show the results of the evaluation based on the automatic metrics. First, we observe that the performance of the pre-trained model (PRE) is extremely poor in the high-level caption generation task, highlighting the substantial difference between captions of this kind and the traditional object-centric captions the pre-trained model is trained on.
Second, focusing on the fine-tuned models, we observe that GIT fine-tuned on T5-generated data performs better than the LLaMA-based counterpart on the automatic metrics. We argue that the model trained on T5-generated synthetic data benefits from the exposure of the data generator to the gold data distribution. However, we point out that the few-shot data generation pipeline remains a valid alternative, as it achieves comparable performance without requiring any further fine-tuning.

C Annotation Costs
In this section, we report the costs related to the data collection.
High-level caption collection Overall, 1033 participants took part in the caption data collection. They were paid $0.04 per item, corresponding to the hourly minimum wage in the United Kingdom. In total, the data collection cost $1938.

Table 6: Automatic metrics computed over the gold annotated high-level captions; the scores are the average results of 5 runs using the same decoding parameters for all models. We compare the pre-trained model (PRE) with the model fine-tuned on T5-generated (T5) and LLaMA-generated (LLaMA) data.
Confidence Scores collection The qualification task for confidence scores led to the recruitment of 53 annotators. We found that this task was harder than the high-level caption annotation in terms of complexity, but not in terms of execution time, which was in fact shorter. Therefore, in order to encourage good quality annotations, we pay $0.04 per item. Considering the time needed to perform the task, this corresponds to 4 times the hourly minimum wage in the United Kingdom. The qualification task and the data collection cost $93 and $1938, respectively.

D.1 Pilot
We run a pilot study with the twofold goal of collecting feedback and defining the task instructions. The pilot is run with 9 participants who were trained on the task, with high proficiency in English and a background in computer science and linguistics. With the results from the pilot we design a beta version of the task and we run a small batch of cases on the crowd-sourcing platform. We manually inspect the results and we further refine the instructions and the formulation of the task before finally proceeding with the annotation in bulk. The final annotation form is shown in Figure 22. It is important to note that the instructions, shown in Figure 21, are always visible to the workers. Figure 23 shows the annotation form used for the confidence score collection. In this case too, the instructions are always visible to the worker and each image is presented along with the original question and the answer.

E Additional Data Examples
In Table 7 we show further examples of images and their corresponding captions in the HL Dataset.

Instructions:
You are going to see some pictures. Each picture involves one or more people ('the subject'). You will be asked some questions about the picture. Don't think too much, feel free to give your personal interpretation using your knowledge or common sense. Try to answer using full English sentences. If you're not sure what the answer could be, give your best guess. Avoid using expressions like "I think" or "I suppose" or "Maybe". Do not propose options or possibilities saying for instance: something "or" something else. Make your best guess and state the one you choose. Write a statement, don't write a one-word answer, avoid acronyms or slangs and write a full sentence.

1. Where is the picture taken: give your best guess about the type of place where the action is happening (for example, "in a ski resort");
2. What is the subject doing: Try to describe what the people are doing as concisely as possible. If there is more than one person, try to choose a description that captures what all of them are doing (for example, "They are skiing")
3. Why is the subject doing it: here, write your best guess about why the person or persons are doing the action (for example, "They are on a family holiday")
The What question and the Why question cannot have the same answer.
The answers must be written correctly in English, check the spell and most importantly don't forget the subject of the sentence in your answer (he, she, it, they)

F Examples of Narrative Caption generations
In Figure 24 we show examples of narrative caption generations from our fine-tuned baselines.