FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval. However, existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors. First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness. Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations. To address these challenges, we propose a novel dataset, which involves re-annotating the captions in Visual Genome (VG) using a new intermediate representation called FACTUAL-MR. FACTUAL-MR can be directly converted into faithful and consistent scene graph annotations. Our experimental results clearly demonstrate that the parser trained on our dataset outperforms existing approaches in terms of faithfulness and consistency. This improvement leads to a significant performance boost in both image caption evaluation and zero-shot image retrieval tasks. Furthermore, we introduce a novel metric for measuring scene graph similarity, which, when combined with the improved scene graph parser, achieves state-of-the-art (SOTA) results on multiple benchmark datasets for the aforementioned tasks. The code and dataset are available at https://github.com/zhuang-li/FACTUAL .


Introduction
A scene graph is a representation that describes the contents of a visual scene, including objects, their attributes, and the relationships between them. The grounding of a scene graph with an image or a text can provide significant benefits for various vision-language tasks, such as image caption evaluation (Anderson et al., 2016) and image retrieval (Johnson et al., 2015). Therefore, transduction of image descriptions into scene graphs through textual scene graph parsing has been a crucial vision-language research area.
* The two authors contributed equally to this work.
Accurately generating scene graphs that capture intersected information from images and their corresponding descriptions is crucial for a successful textual parser. However, current baseline parsers often generate unfaithful scene graphs that fail to represent the complete intersected information or to produce semantically correct graphs, as shown in Figure 1. Furthermore, inconsistencies exist in the outputs of scene graph parsers, as depicted in the same figure, where "tennis" is interpreted as an attribute in one graph and as a part of an object in another graph. Such inconsistencies can severely impact downstream tasks of textual scene graph parsers, especially when they produce different graphs for a semantic unit, such as a phrase, across various captions, even though the phrases carry the same semantic meaning.
Upon inspection, we hypothesize that the issues of unfaithfulness and inconsistency arise due to the inherent shortcomings of scene graph parsing algorithms and limitations within the datasets. One widely utilized parser, SPICE-Parser (Anderson et al., 2016), is known for converting caption dependency graphs into scene graphs using predefined rules, which can result in error propagation. Furthermore, the dependency graphs may not adequately capture the semantic characteristics of scene graphs, as dependency graphs primarily focus on syntactical relationships. Additionally, the limitations of the datasets contribute to the problems as well. As demonstrated in Figure 1, the largest scene graph dataset, VG (Krishna et al., 2017), includes notable annotation issues regarding faithfulness and inconsistency.
To address the aforementioned issues, we create a high-quality scene graph dataset for training parsers. We firmly believe that the problems of unfaithfulness and inconsistency within the dataset can be effectively resolved by incorporating two key measures: i) employing rigorous definitions for the literals and ii) implementing strict quality control during the annotation process. Therefore, we propose a novel intermediate meaning representation (MR) coined as FACTUAL-MR, which ensures FAithful and Consistent texTUAL scene graph parsing. FACTUAL-MR is a semantic representation that can be deterministically mapped to the scene graph, thereby avoiding the issues that arise from converting syntactical graphs into scene graphs. The annotation of FACTUAL-MRs can be divided into manageable sub-tasks, allowing us to easily control the quality of annotations in each sub-task and ensure their faithfulness. Furthermore, the literals within the FACTUAL-MRs are precisely defined to ensure consistency in textual scene graph parsing annotations. As a result, we re-annotate captions sampled from the VG dataset with FACTUAL-MRs, enabling us to leverage the existing scene graph annotations from VG. Additionally, in order to further enhance the advantages provided by the scene graph parsing for its downstream tasks, we propose a simple yet effective metric called SoftSPICE. This metric calculates graph similarity and significantly improves the performance of vision-language tasks that leverage scene graphs.
Overall, the key contributions are as follows:
• We propose a novel intermediate representation, FACTUAL-MR, which can be easily annotated and converted into scene graphs. The annotation process of FACTUAL-MR could ensure the faithfulness and consistency of the scene graphs converted from FACTUAL-MR.
• We construct a large-scale benchmark, FACTUAL, consisting of 40,369 parallel examples. We conduct thorough intrinsic and extrinsic evaluations to demonstrate that FACTUAL significantly improves the performance of textual scene graph parsing.
• We propose a simple graph similarity metric, SoftSPICE, that achieves new SOTA results in image caption evaluation and zero-shot image retrieval tasks, when combined with a scene graph parser trained with FACTUAL.

Related Work
Grounding a scene graph with an image or image description can be beneficial for a variety of downstream tasks, such as image retrieval (Andrews et al., 2019; Johnson et al., 2015), image caption evaluation (Anderson et al., 2016) and image captioning (Zhong et al., 2020). Currently, there are three main research directions in scene graph parsing: those that focus on parsing images (Zellers et al., 2018; Tang et al., 2020; Xu et al., 2017; Zhang et al., 2019a; Cong et al., 2022; Li et al., 2022), text (Anderson et al., 2016; Schuster et al., 2015; Wang et al., 2018; Choi et al., 2022; Andrews et al., 2019; Sharifzadeh et al., 2022), or both modalities (Zhong et al., 2021; Sharifzadeh et al., 2022) into scene graphs. Parsing images involves utilizing an object detection model to identify the location and class of objects, as well as classifiers to determine the relationships and attributes of the objects. Textual scene graph parsing employs techniques such as the Sequence-to-Sequence model (Sutskever et al., 2014) to parse image descriptions into linearized scene graphs (Sharifzadeh et al., 2022) or generate intermediate representations, such as dependency graphs or Abstract Meaning Representation (AMR) (Banarescu et al., 2013), which are then mapped into scene graphs using deterministic rules or machine learning models. However, directly utilizing intermediate representations like dependency graphs or AMR often leads to subpar performance in downstream tasks, as emphasized by Anderson et al. (2016), and may even be infeasible for multi-modal tasks requiring annotations for both modalities, given that the intermediate representations only annotate the text. Recent studies in parsing both modalities (Zhong et al., 2021; Sharifzadeh et al., 2022) have primarily utilized textual parsing models to enhance the performance of visual scene graph parsing. Our work primarily focuses on textual scene graph parsing.

Textual Scene Graph Parsing
A scene graph, as introduced by  et al. (2022). Therefore, the textual scene parsing aims to learn a mapping π_θ : 𝒳 → 𝒢, which translates a textual image description X ∈ 𝒳 into a scene graph G ∈ 𝒢.

Challenges
Unfaithfulness. The faithfulness of a scene graph is determined by its completeness and correctness.
Completeness is defined as the extent to which the graph conveys the complete semantic meaning of the intersected information from both the caption and the image. For example, Figure 1 demonstrates that the output of VG-T5 (Sharifzadeh et al., 2022) lacks the facts (tennis player, hold, tennis racket) and (tennis balls, rest on, tennis racket), indicating an incomplete graph. This incompleteness issue of parsing outputs can be caused by the noisy training set from VG, which was generated without
Correctness refers to the semantic accuracy of the graph with respect to the intersected information from the caption and the image. The annotation errors of VG contribute significantly to the correctness issues. As in Figure 1, the presence of the predicate "rest balls on ten" highlights a significant annotation mistake. Dependency-based parsing methods, such as SPICE-Parser, produce graphs that lack correctness primarily due to error propagation. As shown in Figure 1, the term "rest" is incorrectly considered an attribute of "racket" due to the parsing errors from the Stanford dependency parser (Manning et al., 2014). Another issue with dependency-based methods is that they focus on capturing syntactic relationships among words rather than semantic relationships among objects. Phrases such as "without leaves" or "without a shirt" indicate the absence of objects like "leaves" or "shirt" in the scene, but dependency-based methods still interpret them as objects.
Inconsistency. The inconsistency in the dataset is primarily the result of linguistic variations. The objects, attributes, and relations are all extracted from texts, but the same semantics can be expressed in multiple ways. For instance, (tennis player, hold, tennis racket) and (tennis racket, held by, tennis player) are semantically equivalent, even though the orders of the subjects and objects differ. Differing understandings of the tasks among crowd workers are also a serious issue. Some may consider "stone wall" as a composite object, while others may consider "stone" as an attribute and "wall" as an object. To measure the consistency of the annotations, we have calculated diversity metrics for the objects, attributes, and predicates within a set of examples encompassing various types of annotations. We assume that the diversity scores indicate the annotations' consistency. As in Table 1, the results of the three diversity metrics indicate that the annotations in the VG and CDP datasets have a higher degree of diversity regarding their objects, attributes, and predicates than the ones in the FACTUAL dataset.
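To make the link between label diversity and consistency concrete, the following is a minimal sketch of one standard lexical diversity score, the type-token ratio (distinct labels divided by total label occurrences). The function name and the toy predicate lists are illustrative assumptions, not data or code from the benchmark.

```python
from typing import Sequence


def type_token_ratio(labels: Sequence[str]) -> float:
    """Number of distinct labels divided by the number of label occurrences."""
    if not labels:
        return 0.0
    return len(set(labels)) / len(labels)


# Hypothetical predicate annotations for four captions with the same meaning.
inconsistent = ["hold", "held by", "holding", "hold"]
consistent = ["hold", "hold", "hold", "hold"]

print(type_token_ratio(inconsistent))  # 3 distinct labels over 4 occurrences
print(type_token_ratio(consistent))    # 1 distinct label over 4 occurrences
```

Under this reading, a lower score on semantically equivalent annotations suggests that annotators converge on the same surface form.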

Meaning Representation
We propose a novel intermediate semantic representation, FACTUAL-MR, in which elements are clearly defined to eliminate confusion among annotators. The task of annotating captions and their associated images with FACTUAL-MRs can be broken down into manageable sub-tasks, and each FACTUAL-MR can be deterministically mapped into a conventional scene graph, enabling the utilization of FACTUAL parser outputs in a wide range of multi-modal applications that rely on scene graphs. Specifically, the template of each fact in FACTUAL-MR is presented in one of two formats: {Object, Attribute} or {Quantifier_sub, Object_sub, Verb, Preposition, Quantifier_obj, Object_obj}.
Object. An object in a scene graph is essentially defined as a grouping of concepts. This results from the widely accepted notion in vision tasks that an image object typically encompasses a collection of homogeneous concepts within a bounding box (Krishna et al., 2017). Therefore, a common source of inconsistency in VG-SG is the various methods used to represent the quantity of objects. This can be attributed to the varying understandings of tasks among annotators. For example, as depicted in Figure 1, three trees may be represented as a single collective object contained within a large bounding box on an image, with the attribute of "three" (trees, has_attribute, three), or as three distinct objects of tree distributed throughout three facts in the visual scene. These different representations of object quantity can lead to inconsistencies. To address this, we propose defining each object in FACTUAL-MR as a grouping of collective concepts. To differentiate between two collective objects with identical names, unique suffix identifiers are utilized. For instance, the phrase "men watch men" would be represented as (men, watch, men:1).
Attribute. The attribute definition in FACTUAL-MR is similar to the original scene graph, with one notable distinction. In FACTUAL-MR, attributes are used to describe all individual concepts within each collective object. For example, in the case of (3, tennis balls, has_attribute, white), it implies that all the tennis balls are white.
Quantifier. The quantifier indicates the quantity of concepts within a collective object if the quantity is explicitly mentioned in the text. Additionally, a quantifier modifier may be used to specify the unit of measurement when explicit quantifier modifiers are present in the text. For instance, the phrase "both men" is expressed as "2, men" while "both groups of men" would be represented as "2g, men" and "both pairs of" as "2p". To avoid annotation inconsistencies, a limited set of pre-defined modifiers is provided. In cases where the quantity of objects cannot be expressed by the predefined set, two special quantities, "many" and "unaccountable", are offered as placeholders for annotators.
Verb and Preposition. Given the linguistic variations present in VG, the number of relations exceeds 36,000. Through analysis, we have determined that the semantics of each relation can be composed of both a verb and a preposition or either one alone. To this end, we have decomposed these relations into their respective verbs and prepositions. In order to ensure consistency in annotation, a fixed list of verbs and prepositions with exclusive semantics is provided for the annotators to select from. To further facilitate consistency, all verbs are lemmatized to their original forms. The benefits of this decomposition method will be further explained in Section 4.3. Additionally, the verb's voice plays a crucial role in the semantics of a fact. For example, the phrases "cup covered with blanket" and "cup covers blanket" possess distinct semantic meanings. To prevent ambiguity during annotation, an indicator, "p:", is used as a prefix to the verb to indicate whether it is in a passive voice.

Connection to Scene Graph
To map a FACTUAL-MR into the original scene graph, we first combine the verb and preposition into a predicate. The voice of the verb is altered based on whether it is passive or active. However, as the object in our annotation is collective, a collective-distributive ambiguity is present in the sentence, as also highlighted by Schuster et al. (2015). For instance, given an image describing "three men reading books", we can know which man is reading which book according to the image, while in the image caption, the information is insufficient to determine this. Previous approaches, such as the SPICE (Anderson et al., 2016) and Stanford (Schuster et al., 2015) parsers, address this issue using heuristic rules. The SPICE-Parser considers all relations between two collective objects as collective, leading to the phrase being expressed as (men, reading, books), (men, has_attribute, 3). However, this annotation type is not commonly used, as annotators tend to annotate relations distributively in the VG-SG annotations. Another option, adopted by the Stanford parser, is to consider all these cases as distributive behaviours, resulting in the phrase being expressed as "(man, reading, book), (man:1, reading, book), (man:2, reading, book)". This may also be incorrect, as three men might read two books. Therefore, in such cases, we improve this heuristic by utilizing our annotated quantifiers. We annotate the implicit quantifiers for the "books" according to the image content. If FACTUAL-MR annotates the number of books as three, we know that each man is distributively reading one book. Otherwise, they are collectively engaging in the activity.
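The mapping described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the released conversion code: the class `RelationFact`, its field names, the toy participle table, the choice to emit quantifiers as has_attribute triples (mirroring the SPICE-style example in the text), and the equal-count test for the distributive reading are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy participle table (assumption): a real pipeline would need a
# morphological resource to restore the passive form of a lemmatized verb.
PAST_PARTICIPLES = {"fill": "filled", "cover": "covered", "hold": "held"}


@dataclass
class RelationFact:
    """One relational fact; field names are illustrative, not the dataset schema."""
    subject: str
    obj: str
    verb: Optional[str] = None              # lemma, prefixed with "p:" if passive
    preposition: Optional[str] = None
    subject_quantifier: Optional[str] = None
    object_quantifier: Optional[str] = None


def to_predicate(verb: Optional[str], preposition: Optional[str]) -> str:
    """Combine verb and preposition into one predicate string."""
    parts: List[str] = []
    if verb:
        if verb.startswith("p:"):
            lemma = verb[2:]
            # Fall back to a naive "-ed" suffix for verbs missing from the table.
            parts.append(PAST_PARTICIPLES.get(lemma, lemma + "ed"))
        else:
            parts.append(verb)
    if preposition:
        parts.append(preposition)
    return " ".join(parts)


def to_collective_triples(fact: RelationFact) -> List[Triple]:
    """Collective reading: one relation triple plus quantifiers as attributes."""
    triples = [(fact.subject, to_predicate(fact.verb, fact.preposition), fact.obj)]
    for name, quantifier in ((fact.subject, fact.subject_quantifier),
                             (fact.obj, fact.object_quantifier)):
        if quantifier is not None:
            triples.append((name, "has_attribute", quantifier))
    return triples


def reading_mode(fact: RelationFact) -> str:
    """Assumed heuristic: equal plain counts above one suggest a distributive reading."""
    s, o = fact.subject_quantifier, fact.object_quantifier
    if s is not None and s.isdigit() and int(s) > 1 and s == o:
        return "distributive"
    return "collective"


cup = RelationFact(subject="cup", obj="water", verb="p:fill", preposition="with")
men = RelationFact(subject="men", obj="books", verb="read",
                   subject_quantifier="3", object_quantifier="3")
print(to_collective_triples(cup))
print(to_collective_triples(men), reading_mode(men))
```

Keeping the verb, preposition, and quantifiers as separate fields until the final step is what lets the predicate string and the collective or distributive choice be derived deterministically.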

Annotation
Our annotation process consists of two stages. In the first stage, we carefully selected approximately 44,000 captions, with each caption aligned to a distinct image, to ensure diversity in our FACTUAL dataset derived from the VG dataset. We hired 25 annotators with diverse backgrounds, either through Amazon Mechanical Turk (Paolacci et al., 2010) or from local undergraduate students, and provided them with one-hour training sessions to ensure consistent annotation practices. Throughout the annotation process, both the images and captions were presented to the annotators to ensure the faithfulness of the annotations to both modalities. Each annotator was reimbursed at a rate of 0.25 USD per task. In the second stage, three expert annotators with a high level of agreement in their annotations performed post-processing and verification steps to ensure the quality of the data. After undergoing the quality check, we retained 40,369 examples in the dataset.
Object and Attribute. The annotation process for objects and attributes involved extracting information from the captions to ensure faithfulness to the text while utilizing the image to resolve any linguistic ambiguities. For example, in the caption "the picture depicts a car", it is unclear without the context of the image whether the image includes an object labelled as "picture" or whether the caption is referring to the image itself as a "picture". Furthermore, during the training, the annotators were also instructed to extract the objects for the coreferences, such as the pronoun "it" mentioned in the captions.
Quantifier. Regarding quantifiers, the annotators could only select from the pre-determined sets of quantities and quantity modifiers. If an exact match of a modifier was not found, the annotators were instructed to choose the modifier with the equivalent semantic meaning to the modifier in the text. In most cases, only the quantity was annotated when the number of objects was explicitly mentioned. However, exceptions were made for cases involving collective-distributive ambiguity, requiring the annotations of implicit quantities.
Verb and Preposition. To ensure consistency in the predicate annotations, the annotators were instructed to select from a pre-determined set of predicates rather than writing them on their own. However, the predicates in the VG dataset were not mutually exclusive in semantics. Therefore, we implemented a process of partitioning them into 1000 clusters using K-means, followed by manually selecting around 2000 predicates by observing the clusters. Despite this pruning, the large number of remaining predicates still posed a challenge for annotators to make selections. Therefore, the predicates were further decomposed into around 400 verbs and 100 prepositions. For each selection slot, verbs and prepositions were ranked using an information retrieval method, and the annotators were asked to select from the 20 most probable candidates. Annotators were specifically instructed to annotate verbs in the active voice whenever possible. For example, if both active and passive voices were possible for annotation, as seen in the phrases "blanket covering cup" and "cup covered with a blanket", both should be annotated as (blanket, cover, cup). However, in cases where only the passive voice construction was syntactically and semantically valid, such as in the example "cup filled with water," it should be annotated as (cup, p:fill, with, water) since (water, fill, cup) would not be appropriate.
Table 2: The statistics about the number of distinct labels and occurrence (occ.) of the various elements in the 40,369 FACTUAL-MRs. For simplicity, we omit their suffixes when calculating the occurrence of quantifiers.
Post-processing and Verification. In the second stage, three expert annotators conducted a thorough examination of all cases to verify and rectify annotation errors. Particular attention was paid to identifying and correcting any incorrect annotations related to passive and active voice, as well as quantifiers and their modifiers. Furthermore, in cases where captions did not include specific name phrases for objects but only pronouns, those pronouns were converted into object names. For example, in the sentence "he is walking" where "he" was annotated as an object, it was resolved to "man." Additionally, any annotations that were entirely unrelated to the text and images were discarded.

Statistical Analysis of Dataset
We present a statistical overview of the FACTUAL dataset, which comprises 40,369 distinct captions and includes over 4,000 unique object labels with a total occurrence of 116,712. On average, each object label appears approximately 28 times throughout the dataset. Notably, prepositions occur more frequently than verbs, although there are four times as many distinct verb labels as distinct preposition labels. Furthermore, each fact within the dataset tends to be unique within a single caption, with an average occurrence of fewer than two times. At the scene level, we find that, on average, at least two distinct objects are present in each scene. However, there are far fewer distinct verbs, prepositions, and attributes. It is worth highlighting that quantifiers play a relatively minor role in the dataset, as most collective objects described in the image captions consist of only one individual object.

Experiments
We evaluate the effectiveness of our new scene graph benchmark through one intrinsic evaluation and two extrinsic evaluation tasks.

Textual Scene Graph Parsing
Datasets. In terms of datasets, our evaluations are conducted on the VG (Krishna et al., 2017), CDP (Wang et al., 2018), and FACTUAL datasets. The VG dataset comprises 108,077 images and 5.4 million region captions. The CDP dataset converts all scene graphs in VG into a customized dependency graph, which has a one-to-one mapping to the original scene graphs.
We report the performance of the parsers on two data splits for each dataset representation. For the FACTUAL dataset, we consider a random split (Random), which includes 37,861 training, 1,000 validation, and 1,508 test examples. Additionally, we also evaluate a more challenging split (Length) to assess the parsers' compositional generalization abilities.
Baselines. In this study, we evaluated the performance of five parsers: SPICE-Parser (Anderson et al., 2016), AMR-SG-T5 (Choi et al., 2022), CDP-T5 (Choi et al., 2022), VG-T5 (Sharifzadeh et al., 2022), and FACTUAL-T5. SPICE-Parser utilizes a set of rules to convert dependency graphs of captions into scene graphs. AMR-SG-T5 converts captions into AMRs through the use of AMR-BART (Bai et al., 2022), and subsequently converts the AMRs into CDP-SG format by using a T5 (Raffel et al., 2020) model. CDP-T5 directly converts captions into CDP-SGs without the intermediate steps. In contrast to the original CDP-to-SG parser (Wang et al., 2018), which relies on intermediate representation, CDP-T5 demonstrates significantly better performance (Choi et al., 2022). VG-T5, trained on the VG, parses captions into VG-SGs. FACTUAL-T5 parses captions into FACTUAL-SGs and maps them into scene graphs in a collective way. FACTUAL-T5 (pre) was first pre-trained on the VG dataset and then fine-tuned on FACTUAL. As different datasets use different annotations, SPICE-Parser, AMR-SG-T5 and CDP-T5 are evaluated against the ground truth of the CDP dataset, while VG-T5 and FACTUAL-T5 are evaluated against the ground truth VG-SGs and FACTUAL-SGs.
Evaluation. Following Schuster et al. (2015); Wang et al. (2018); Choi et al. (2022), we evaluate scene graph parsers utilizing the SPICE metric (Anderson et al., 2016). The SPICE F-score measures the similarity between the candidate and ground truth graph representations extracted from captions by the parsers. In addition, we also employ the Exact Set Match metric (Yu et al., 2019), which assesses the accuracy of the parsers by determining whether the strings of the parsed facts match the ground truth facts while disregarding the order of the facts. During the evaluation, all intermediate representations are converted into scene graphs.
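The two intrinsic metrics can be illustrated with a small sketch. It is a simplification under stated assumptions: graphs are given as plain tuples (objects as 1-tuples, attributes as 2-tuples, relations as 3-tuples), matching is by exact string equality only, whereas the original SPICE implementation also matches WordNet synonyms, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

GraphTuple = Tuple[str, ...]


def tuple_f_score(candidate: Iterable[GraphTuple],
                  reference: Iterable[GraphTuple]) -> float:
    """F1 over the sets of graph tuples, using exact string matching."""
    cand, ref = set(candidate), set(reference)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_set_match(candidate: Iterable[GraphTuple],
                    reference: Iterable[GraphTuple]) -> bool:
    """True when both graphs contain exactly the same facts, in any order."""
    return set(candidate) == set(reference)


reference = [("player",), ("racket",), ("racket", "white"),
             ("player", "hold", "racket")]
candidate = [("player", "hold", "racket"), ("racket",), ("player",)]

print(tuple_f_score(candidate, reference))   # precision 1.0, recall 0.75
print(exact_set_match(candidate, reference))
```

The F-score gives partial credit for the three matched tuples, while the set-match criterion rejects the parse because one attribute fact is missing.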
We also evaluate the faithfulness and consistency of parser outputs by human evaluation and automatic lexical diversity metrics, respectively. Specifically, three students manually examine the rates of correctness and completeness of the parsing outputs, and we report the average scores. We employ Yule's I (Yule, 2014), TTR (Templin, 1957), and MTLD (Koehn, 2005).
Discussion. As shown in Table 3, the FACTUAL-T5 and FACTUAL-T5 (pre) models demonstrate a clear superiority over other parsers regarding Set Match and SPICE scores. Notably, the FACTUAL-T5 model, which utilizes the T5 architecture, outperforms other T5-based baselines trained on millions of data points with different annotations. This highlights the effectiveness of the FACTUAL benchmark in generating outputs that are well-aligned with ground truth annotations. In the more challenging Length setting, all parsers experience a decline regarding parsing text into ground truth scene graphs. However, the FACTUAL-T5 model has the smallest drop among all parsers. Furthermore, pre-training the FACTUAL-T5 model on millions of VG data points only results in a slight improvement in the Length split. This indicates that a dataset as small as 40,000 high-quality examples is sufficient to yield a competent parser.
The SPICE-Parser has become the most frequently utilized parser in vision-language tasks. However, as shown in Table 3, it is unable to align with the CDP-SG in either of the two settings. This does not necessarily imply that the SPICE-Parser is the worst among the parsers, as the oracle CDP-SGs have a high degree of noise as well, as demonstrated in Table 1. Our human evaluation of the faithfulness of the parsing results, as presented in Table 4, indicates that the SPICE-Parser can perform comparably with the VG-T5 model and outperform the CDP-T5 model in terms of completeness. Furthermore, our subsequent extrinsic evaluation also shows that the SPICE-Parser is the second-best parser among the parsers evaluated. Table 4 also illustrates that our parser performs much better than the other baselines in terms of faithfulness while ranking second in terms of consistency. Interestingly, the VG-T5 model exhibits the best performance in consistency, even though its ORACLE annotations are more inconsistent than ours. Our analysis reveals that the VG-T5 prioritizes predicting scene graphs with simple lexicons and discards more complex patterns, resulting in its strong performance in consistency but much weaker performance in faithfulness metrics.

Image Caption Evaluation
Task Setting. To assess the quality of the model-generated captions regarding a set of reference captions and an image, we adopt the SPICE and SoftSPICE metrics to calculate a graph similarity between graphs extracted from the candidate and reference captions. As these metrics are based on the parser outputs, a better parser will result in scores that more closely align with human judgment.
Evaluation. Following Hessel et al. (2021), we employ two evaluation settings. The first setting involves calculating the correlation of the scores with human judgment utilizing Kendall's τ and Pearson correlation on the Flickr8K dataset (Hodosh et al., 2013). The Flickr8K dataset includes 17k "expert" human judgments for 5664 images, with each caption being rated on a scale of 1 to 4 against five reference captions. In the second setting, we utilize one (1-ref) or four (4-ref) reference captions sourced from the FOIL dataset (Shekhar et al., 2017). This dataset consists of 32k pairs of true captions and their corresponding corrupted versions, where a single word is replaced with an incorrect one. The objective is to assess the accuracy of each image caption evaluation metric in identifying and assigning higher scores to the uncorrupted captions. This setting aims to evaluate the metric's ability to detect instances of sentence hallucination effectively.
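The second setting amounts to a pairwise accuracy over (true caption, corrupted caption) score pairs. The sketch below is one plausible way to compute it; the function name is hypothetical, the scores are made-up numbers, and counting a tie as a failure is an assumption that the text does not specify.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the true caption scores strictly higher.

    Each pair is (score_of_true_caption, score_of_corrupted_caption).
    Ties are counted as failures (assumption).
    """
    if not score_pairs:
        return 0.0
    wins = sum(1 for true_score, foil_score in score_pairs
               if true_score > foil_score)
    return wins / len(score_pairs)


# Made-up metric scores for four caption pairs; the third pair is tied.
pairs = [(0.80, 0.55), (0.62, 0.40), (0.50, 0.50), (0.30, 0.45)]
print(pairwise_accuracy(pairs))
```

A metric based on exact string matching can produce many tied pairs, which caps the accuracy it can reach under this counting rule.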
SoftSPICE. SPICE calculates the similarity between two graphs by matching strings of sub-components within the graphs. These sub-components include objects, tuples {object, attribute} and triples {object, predicate, object}. To improve SPICE, we propose an alternative method that utilizes embedding-based techniques to calculate string similarity. This approach involves decomposing each graph into the aforementioned sub-components and encoding the text of each component using Sentence-BERT (Reimers and Gurevych, 2019). The resulting similarity score, coined SoftSPICE, is as follows:
where e denotes the embedding of each component, and V_c and V_r denote the sets of embeddings encoding components within the candidate and reference graphs, respectively. Additionally, we can also use the image I to compute a SoftSPICE(img) score, denoted as φ_i(G_c, I). This score is computed by combining the embeddings of the graph components and the image:
where e_c and e_I are obtained by encoding the sub-components and the images with CLIP.
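Since the displayed equations are not reproduced in this text, the following sketch shows only one plausible embedding-based graph similarity of the kind described: each candidate component is matched to its most similar reference component by cosine similarity, and the matches are averaged. The averaging scheme, the function names, and the toy two-dimensional vectors are assumptions; in practice the vectors would come from a sentence encoder such as Sentence-BERT.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_graph_similarity(candidate: List[Vector],
                          reference: List[Vector]) -> float:
    """Mean over candidate components of the best cosine match in the reference.

    This is an assumed form, not necessarily the paper's exact definition.
    """
    if not candidate or not reference:
        return 0.0
    best_matches = [max(cosine(c, r) for r in reference) for c in candidate]
    return sum(best_matches) / len(best_matches)


# Toy embeddings standing in for encoded objects, attributes, and relations.
candidate_embeddings = [[1.0, 0.0], [0.0, 1.0]]
reference_embeddings = [[1.0, 0.0]]
print(soft_graph_similarity(candidate_embeddings, reference_embeddings))
```

Unlike exact string matching, a score of this kind changes smoothly with the wording of a component, so near-synonyms receive partial credit instead of zero.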
Discussion. Table 5 illustrates that FACTUAL-T5 demonstrates improvement over other parsers in terms of enhancing the correlation of SPICE and SoftSPICE scores with human judgments. However, when using SPICE to detect hallucinated instances, our parser performs comparably to the SPICE-Parser. We attribute this to the fact that approximately one-third of the pairs will have tied SPICE scores due to the use of exact string matching. On the other hand, when using the embedding-based metric, SoftSPICE, the superiority of our parser on FOIL is revealed. Currently, SPICE with the SPICE-Parser is a common standard in image caption evaluation settings. We are confident that our parser can be a suitable replacement for SPICE-Parser. We also compare SoftSPICE with current SOTA image evaluation metrics, namely BERTScore (Zhang et al., 2019b), CLIPScore, and RefCLIPScore. These metrics calculate the similarity between the embeddings of the candidate caption with the embeddings of the reference captions, the image, and both reference captions and images, respectively. As in Table 6, SoftSPICE performs comparably with all the SOTA methods when there are over four reference captions, and with the inclusion of image information, SoftSPICE(img) can even outperform SOTA results on Flickr8K. We also observed that the scene graph feature could be a useful supplement to caption-level features. By taking the harmonic mean of SoftSPICE(img) with BERTScore and RefCLIPScore, both metrics achieve new SOTA results.

Zero-shot Image Retrieval
Task Setting. The goal of image retrieval is to identify and retrieve an image that precisely corresponds to a given textual query description. This is typically accomplished by allocating scores to images based on their relevance to the query and selecting the top k images. Following the setting from Johnson et al. (2015); Wang et al. (2018), we have selected 456 captions and their corresponding images from the Random and Length test sets, initially prepared for intrinsic evaluation. These captions serve as queries to retrieve their associated images, forming the basis for evaluating the performance of our image retrieval system. We proceed under the assumption that an oracle scene graph corresponding to each selected image is available. Furthermore, we introduce a 'Local' setting, which provides access to the coordinates of a bounding box within each image that corresponds to each caption and the ground truth scene graph aligned with this bounding box region.
Evaluation. During the evaluation, the scene graph of the captions is generated using various baseline parsing methods. The 456 images are ranked according to the similarity scores computed using either the SoftSPICE or CLIPScore between each image and the caption. Notably, the representation encoders employed in both similarity measurements are not fine-tuned on the in-domain dataset. The performance of different methods is assessed using the Recall@k metric, which indicates the percentage of caption queries where the top k retrieved images, given a specific query, include the ground truth.
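As a concrete reading of Recall@k, the sketch below ranks images for each caption query by similarity score and checks whether the paired image appears among the top k. The square score-matrix layout (query i is paired with image i), the function name, and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose ground-truth image is ranked in the top k.

    scores[i][j] is the similarity between query i and image j, and the
    ground-truth image for query i is assumed to be image i.
    """
    if not scores:
        return 0.0
    hits = 0
    for query_index, row in enumerate(scores):
        # Sort image indices by descending similarity to this query.
        ranking = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_index in ranking[:k]:
            hits += 1
    return hits / len(scores)


# Toy similarity matrix for three caption queries and three images.
toy_scores = [
    [0.9, 0.1, 0.2],   # query 0 ranks its own image first
    [0.8, 0.3, 0.5],   # query 1 ranks its own image last
    [0.1, 0.2, 0.7],   # query 2 ranks its own image first
]
print(recall_at_k(toy_scores, 1))
print(recall_at_k(toy_scores, 3))
```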

Discussion. As observed in Table 7, FACTUAL-T5 consistently outperforms the other baselines in zero-shot image retrieval tasks, highlighting the superiority of our dataset and parser. The performance of both SoftSPICE and CLIPScore is generally enhanced by incorporating the location information of the bounding boxes, suggesting that more accurate information can boost image retrieval. Moreover, when combined with any of the available parsers, SoftSPICE demonstrates significantly superior performance compared to CLIPScore, emphasizing the substantial potential benefits of utilizing structured information for image retrieval.

Conclusion
We introduce a new intermediate representation, coined FACTUAL-MR, which aims to address the issues of faithfulness and consistency for textual scene graph parsers. By utilizing a rigorous annotation process, it is possible to create a large-scale dataset based on FACTUAL-MR. Our experiments demonstrate that FACTUAL-T5, trained on this dataset, is capable of generating consistent scene graphs that are highly faithful to the corresponding images and captions. Utilizing a novel graph similarity metric, SoftSPICE, FACTUAL-T5 significantly improves performance in both image caption evaluation and zero-shot image retrieval.

Limitations
Despite the significant advancements made by the proposed FACTUAL-MR representation in addressing the limitations of current scene graph parsing datasets, there remain several areas for future research.
First, FACTUAL-MR currently relies on heuristic rules to resolve the collective-distributive ambiguity, as introduced in Section 4.2. However, some limitations remain because of the inherent ambiguity of language. To obtain a perfect parser, rich world knowledge from multiple modalities or textual context (Li et al., 2020) is required, which we leave as future work.
Second, there is currently no explicit alignment between the objects represented within FACTUAL-MR and the corresponding bounding boxes in the image. To fully utilize multi-modal information, collecting such alignments may be necessary.
Third, the proposed method utilizes ORACLE scene graphs of the images. In practical applications, however, extracting a scene graph from an image remains a challenging problem. Further research is required to determine whether utilizing a visual scene graph parsing model to extract scene graphs from images would negatively impact image retrieval performance.
Lastly, our current approach utilizes a large pre-trained language model to train the parser. However, the robustness of parsers (Huang et al., 2021; Zhuo et al., 2023) has always been a significant concern. The captions in the VG dataset mainly consist of short sentences with simple patterns. It remains unclear whether the parser is robust enough to handle sentences with more complex linguistic variations, which calls for further investigation.

Figure 1: The intermediate representations and scene graphs produced by various parsers are compared with the ORACLE annotations when provided with an image and a caption.
The benchmark test set for this split comprises 1,053 examples. The caption of each example includes more than ten caption tokens and three facts in the corresponding scene graphs. The remaining examples are split into 38,316 training and 1,000 validation examples. The test examples for VG and CDP consist of captions from the Random and Length splits of FACTUAL, while the remaining examples are divided into a validation set of 1,000 and a training set of over 2 million.
A scene graph, following Johnson et al. (2015), is a formal representation of the objects, their attributes, and the relationships between objects in a visual scene. Given a set of object classes $C$, a set of attribute types $A$, and a set of predicate types $R$, a scene graph $G$ is defined as a tuple $(O, E)$, where $O = \{o_1, \ldots, o_n\}$ is a set of objects and $E \subseteq O \times R \times O$ is the set of edges connecting the objects. Each object $o_i = \{c_i, a_i\}$ is associated with an object class $c_i \in C$ and an attribute $a_i \in A$.
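The scene graph definition can be made concrete with a small data structure. The sketch below is illustrative only: the class names `SceneObject`, `Edge`, and `SceneGraph`, the `index` field used to tell apart objects of the same class, the optional attribute, and the helper `is_well_formed` are all assumptions rather than part of FACTUAL or its released code.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class SceneObject:
    index: int                       # distinguishes o_1, ..., o_n (assumed field)
    object_class: str                # c_i, expected to be in C
    attribute: Optional[str] = None  # a_i, expected to be in A; None if absent (assumption)


@dataclass(frozen=True)
class Edge:
    subject: SceneObject  # first component of an element of O x R x O
    predicate: str        # expected to be in R
    target: SceneObject   # third component


@dataclass
class SceneGraph:
    objects: List[SceneObject]  # O
    edges: List[Edge]           # E


def is_well_formed(
    graph: SceneGraph,
    classes: Set[str],
    attributes: Set[str],
    predicates: Set[str],
) -> bool:
    """Check that every object and edge respects the vocabularies C, A, and R."""
    known_objects = set(graph.objects)
    for obj in graph.objects:
        if obj.object_class not in classes:
            return False
        if obj.attribute is not None and obj.attribute not in attributes:
            return False
    for edge in graph.edges:
        if edge.subject not in known_objects or edge.target not in known_objects:
            return False
        if edge.predicate not in predicates:
            return False
    return True


# Toy vocabularies and a graph for the caption "a brown dog chases a red ball".
C = {"dog", "ball"}
A = {"brown", "red"}
R = {"chase"}
dog = SceneObject(0, "dog", "brown")
ball = SceneObject(1, "ball", "red")
good_graph = SceneGraph([dog, ball], [Edge(dog, "chase", ball)])
bad_graph = SceneGraph([dog, ball], [Edge(dog, "eat", ball)])  # "eat" is not in R

print(is_well_formed(good_graph, C, A, R))
print(is_well_formed(bad_graph, C, A, R))
```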

to evaluate the lexical diversity of objects, attributes, and predicates, which indicate consistency of the output scene graphs.
Table 4: Evaluation of faithfulness and consistency across outputs from various scene graph parsers.

Table 7: Zero-shot image retrieval evaluation on two sets of image-caption pairs, with and without the use of localization information during image retrieval.