Semi-supervised multimodal coreference resolution in image narrations

In this paper, we study multimodal coreference resolution, specifically where a longer descriptive text, i.e., a narration, is paired with an image. This poses significant challenges due to the need for fine-grained image-text alignment, the inherent ambiguity of narrative language, and the unavailability of large annotated training sets. To tackle these challenges, we present a data-efficient semi-supervised approach that utilizes image-narration pairs to resolve coreferences and ground narratives in a multimodal context. Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework. Our evaluation shows that the proposed approach outperforms strong baselines both quantitatively and qualitatively on the tasks of coreference resolution and narrative grounding.


Introduction
In linguistic processing, coreference resolution is a standard task that aims to identify referring expressions, such as noun phrases and pronouns, that refer to the same entity. It is fundamental to many standard problems including question answering (Kwiatkowski et al., 2019; Das et al., 2017), sentiment analysis (Cambria et al., 2017; Medhat et al., 2014), summarization (Gupta and Lehal, 2010; Shi et al., 2021) and machine translation (Lopez, 2008; Bahdanau et al., 2014; Wu et al., 2016). In this work, we focus on a multimodal coreference resolution (MCR) scenario where the coreferences occur in a narration paired with an image and also link to an image region, as shown in Figure 1. Here resolving coreferences is challenging, as mentions referring to different entities can be very similar when encoded by a language model, e.g., one boy, the other boy, the boy. Hence it demands a fine-grained understanding of each modality as well as across them. In particular, it requires simultaneously grounding instances by identifying fine-grained visual details (e.g., disambiguating them by recognizing the action 'crying', spotting 'white color t-shirt and cream color short' or 'a white color sticker on the head'), and capturing long-range dependencies across sentences (e.g., two small boys and their).

Figure 1: An example image-narration pair from the task. The narration reads: "we can see two small boys are sitting on a white color mat and in that one boy is crying and he is wearing a yellow color t-shirt and grey color short. The other boy is wearing white color t-shirt and cream color short and he is also holding some object in his hand. On the head of the boy we can see a white color sticker and on their t-shirts we can see some text and designs also." Phrases marked in the same color corefer to the same entity, which are also grounded in the image. We do not show singletons for brevity.
MCR has recently gained increasing attention, with several notable studies (Ramanathan et al., 2014; Huang et al., 2018; Cui et al., 2021; Parcalabescu et al., 2021; Das et al., 2017; Guo et al., 2022; Goel et al., 2022; Hong et al., 2023). However, many of them focus on images with simple short sentences, such as 'A woman is driving a motorcycle. Is she wearing a helmet?' (Das et al., 2017; Parcalabescu et al., 2021), or are limited to identifying movie characters or people (Ramanathan et al., 2014; Cui et al., 2021). More recently, Goel et al. (2022) introduced a challenging and unconstrained MCR problem (see Figure 1), including a dataset, Coreferenced Image Narratives (CIN), with both people and objects as referents in long textual descriptions (narrations). As manually annotating a large dataset with coreference and grounding labels is expensive, the authors provide annotations only for evaluation purposes. They also propose a weakly supervised method that learns to jointly ground mentions in images and use them as anchors, along with prior linguistic rules (Lee et al., 2011), to group coreferring mentions from image and narration pairs alone, without the annotations. The method has multiple shortcomings: (1) weakly supervised grounding fails to disambiguate multiple instances of the same object class, e.g., boy (one boy, the other boy); (2) language rules such as exact match of phrases are either too strict or too generic, e.g., pronoun match, which links pronouns to one antecedent (one boy, he, he, his); and (3) it requires an additional modality, mouse traces, to learn coreferences, which can be expensive to obtain.
Motivated by these limitations, we argue that it is difficult to successfully resolve coreferences from only image-narration pairs when multiple instances of the same object category are present; this situation is common and each such instance often coincides with a mention in the narration. Since full manual annotation of coreference chains and bounding boxes is expensive, we propose to resolve coreferences and ground mentions in a semi-supervised setting where only a few data samples are labeled. Our approach involves a customized multimodal fusion model that combines image region features and mention features from narrations through cross-attention (Vaswani et al., 2017; Li et al., 2021). We investigate different task-specific losses for training on labeled and unlabeled data, and show that naively combining training on the labeled and pseudo-labeled data suffers from severe overfitting (Arazo et al., 2020). Hence, we propose a robust loss function and a thresholding-based training scheme to effectively learn from the unlabeled set. This approach yields consistent performance improvements with the inclusion of unlabeled data during training.
Our main contributions are (1) a vision-language framework for MCR trained on a small labeled and an unlabeled dataset, (2) novel task-specific losses (on both labeled and pseudo-labeled data) for learning joint multi-modal embeddings for coreference resolution while simultaneously improving narrative grounding, (3) extensive evaluation of our proposed method on the CIN dataset and ablation studies to validate our design choices, showing consistent performance gains compared to baselines on coreference resolution and narrative grounding.

Related work
Multimodal coreference resolution. MCR involves comprehending the contextual information in language and establishing connections with specific regions in an image. Recently, considerable effort has been dedicated to developing datasets that can effectively address this intricate task. Parcalabescu et al. (2021) introduced the VALSE dataset, which encompasses various coreference scenarios. However, this dataset focuses on the downstream task of visual question answering without evaluating coreference resolution or grounding. Hence, we evaluate our method on the CIN dataset (Goel et al., 2022), which contains coreference chains and grounding annotations. Another line of MCR datasets involves linking people's names mentioned in the text to corresponding images and resolving pronouns that connect to those specific names (Ramanathan et al., 2014; Cui et al., 2021; Hong et al., 2023). However, our main focus is to resolve coreferences in a generic scenario (with visual complexity), unlike others that are either limited to people's names/characters (Ramanathan et al., 2014; Cui et al., 2021; Hong et al., 2023) or have simple sentences (Das et al., 2017; Parcalabescu et al., 2021).
Vision-language learning. Existing work on vision and language understanding employs as an image encoder either pre-trained object detector features (He et al., 2017; Ren et al., 2015), a ViT (Dosovitskiy et al., 2020), or a CNN (Simonyan and Zisserman, 2014), combined with a transformer-based text encoder (Devlin et al., 2018). To model cross-modal interaction between the image and text encoders, UNITER (Chen et al., 2020), ALBEF (Li et al., 2021) and VinVL (Zhang et al., 2021b) employ a multimodal encoder. They are pre-trained on large-scale image-caption pairs such as COCO (Lin et al., 2014), Conceptual Captions (Sharma et al., 2018; Changpinyo et al., 2021), and Visual Genome (Krishna et al., 2017). The pre-training objectives include an image-text contrastive loss, masked language modeling, and an image-text matching loss. Our method is inspired by these architectures and is trained using a set of self-supervised and task-based objectives in a semi-supervised learning fashion.
Semi-supervised learning. There is a large body of work in semi-supervised learning (Zhai et al., 2019; Van Engelen and Hoos, 2020; Ouali et al., 2020). These methods typically exploit unlabeled data either via pseudo-labeling with small amounts of labeled data (Lee et al., 2013; Arazo et al., 2020; Rizve et al., 2021; Sohn et al., 2020; Zhang et al., 2021a) or by enforcing consistency regularization (Berthelot et al., 2019; Abuduweili et al., 2021) on the unlabeled data, producing consistent predictions over various perturbations of the same input through several augmentation strategies (Zhang et al., 2017; Cubuk et al., 2018, 2020). Our method draws inspiration from the pseudo-labeling literature and uses a robust loss function and thresholding to counter overfitting to pseudo-labels.

Task Overview
Our goal is (1) to group mentions (i.e., referential words or phrases) in the narration that corefer to the same entity and (2) to ground each mention to a region in an image. Formally, let N = {m_1, m_2, ..., m_|N|} denote a narration with |N| mentions for an image I with |I| regions, where I = {r_1, r_2, ..., r_|I|}. We wish to learn an embedding function f that takes in an image I and its narration N, parsed to contain a set of mentions, and outputs a score for a mention pair (m, m'):

s(m, m') = cos(f(m), f(m'))    (1)

The mention pair m and m' corefers if the score in Equation (1) is high; otherwise they do not.
For grounding the mention m on the image region r, we also learn a function g that outputs a score for the mention m being located at region r in image I. Next, we describe in detail our methodology to learn the two functions f and g.
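As a minimal sketch of the two scoring functions: Equation (1) is cosine similarity over mention embeddings (also used at training and inference time later in the paper), while the region scorer here is a simplified dot-product stand-in for the attention-based g described in the Model Architecture section.

```python
import numpy as np

def coref_score(f_m: np.ndarray, f_m2: np.ndarray) -> float:
    """Pairwise coreference score s(m, m'): cosine similarity
    between the joint mention embeddings f(m) and f(m')."""
    return float(f_m @ f_m2 /
                 (np.linalg.norm(f_m) * np.linalg.norm(f_m2) + 1e-8))

def ground(mention_emb: np.ndarray, region_embs: np.ndarray) -> int:
    """Simplified grounding g: pick the image region whose embedding
    best matches the mention (argmax over per-region scores)."""
    scores = region_embs @ mention_emb
    return int(np.argmax(scores))
```

In the full model, both f and g are produced by the multimodal encoder; this sketch only fixes the interface of the two functions.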

Model Architecture
In Figure 2, we illustrate our model architecture. Each image is parsed into a set of regions by a pre-trained object detector (Ren et al., 2015), where each region r is represented by a d-dimensional joint embedding v_r ∈ R^d comprising its visual, semantic and spatial features. In particular, the visual encoder f_v is instantiated as a transformer block that takes in the joint feature embedding v_r for the object region r and outputs a D-dimensional embedding, i.e., f_v(v_r): R^d → R^D.
Furthermore, we encode the words in each narration N using a tokenizer (Devlin et al., 2018) to get a set of tokens for the words w ∈ R^V, where V is the vocabulary size. The text encoder f_t is also a transformer block that takes in a word token w and outputs a D-dimensional embedding, i.e., f_t(w): R^V → R^D. The mention embeddings are computed by averaging the corresponding word representations:

f_t(m) = (1 / |m|) Σ_{w ∈ m} f_t(w)

where |m| indicates the mention length in words, and the embeddings f_t(m) have the same dimensionality as the visual features.

Figure 2: Illustration of our model architecture and training methodology. The pre-extracted image regions are fed into the visual encoder, the narrations are fed into the text encoder, and both modalities are fused using a multimodal encoder. The model is optimized using self-supervised objectives (ITC, MLM, in grey) and specialized task-based losses on both the labeled data (CR, GD, BBR, in yellow boxes) and the pseudo-labeled data (PCR, PGD, in green boxes).
Next, the multimodal encoder f fuses the visual features from the visual encoder f_v(v_r) with the mention features from the text encoder f_t(m). Similar to cross-modal architectures (Li et al., 2021; Zhang et al., 2021b), the embeddings from the text encoder are first encoded using self-attention layers (Vaswani et al., 2017). Then, a multi-head cross-attention module integrates the textual and visual features. In the cross-attention module, the self-attended mention embeddings f_t(m) are treated as the query, while the image representations f_v(v_r) are treated as keys and values. The attention weights between the mention m and the region r are given as:

g(m, r) = softmax_r( f_t(m)ᵀ f_v(v_r) / √D )    (2)

where the softmax is computed over the image regions for each mention. This attention matrix (the grounding function) g from the multi-head cross-attention learns fine-grained mention-to-region alignment scores. Finally, the vision-aware mention embedding is represented as:

f(m) = Σ_r g(m, r) f_v(v_r)

where f(m) ∈ R^D. This weighted embedding is then passed to a feed-forward module (Li et al., 2021) with an MLP and layer normalization. All the transformer encoders/blocks follow the architecture proposed by Li et al. (2021). It is important to note that, unlike Goel et al. (2022), our model fuses vision and text features with a multimodal encoder.
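The fusion step above can be sketched with a minimal single-head cross-attention in numpy (no learned query/key/value projections, unlike the actual multi-head module): it shows how the attention matrix doubles as the grounding function g and how the vision-aware mention embedding is a g-weighted sum of region features.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(mentions: np.ndarray, regions: np.ndarray):
    """Single-head cross-attention sketch.

    mentions: (num_mentions, D) self-attended mention embeddings (queries)
    regions:  (num_regions, D)  visual region embeddings (keys and values)
    Returns the mention-to-region attention matrix g (grounding scores,
    softmax over regions) and the vision-aware mention embeddings
    f(m) = sum_r g(m, r) * f_v(v_r).
    """
    D = mentions.shape[-1]
    g = softmax(mentions @ regions.T / np.sqrt(D), axis=-1)
    fused = g @ regions
    return g, fused
```

The real model wraps this in multi-head attention with learned projections, a feed-forward module, and layer normalization.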

Semi-supervised Learning
Concretely, we aim to learn the parameters of the modules f_v, f_t and f given a training dataset D with |D| samples of image-narration pairs. Specifically, we use a small labeled set D_s = {(x_i, y_i)}, where x_i = {I, N} is the image-narration input pair and y_i is the corresponding label (coreference and bounding-box annotations; see the appendix for details), together with the unlabeled remainder of the dataset. Our overall training objective is

L = L_s + L_u    (4)

where L_s is the supervised loss computed on D_s and L_u is the unsupervised loss. First, we discuss how to formulate task-based supervised losses on the dataset D_s.
(S1) Coreference loss (CR). We propose to learn the similarity between the mention embeddings using a supervised contrastive loss (Khosla et al., 2020), defined as:

L_cr = Σ_m (-1 / |P(m)|) Σ_{p ∈ P(m)} log [ exp(s(m, p)/τ) / Σ_{m' ≠ m} exp(s(m, m')/τ) ]    (5)

where τ is the temperature and P(m) is the set of mentions coreferring with m. This loss helps to cluster embeddings of coreferring mentions together and push the embeddings of non-coreferring mentions away from each other.

(S2) Grounding loss (GD). To align the mention m and region r, we use the grounding function g defined in Equation (2). In particular, we first define the ground-truth binary alignment on the labeled training set D_s. For the ground-truth bounding box b_m of a mention m, we compute the intersection over union (IoU) between this bounding box and the R pre-extracted image regions. This is crucial because we do not have an exact region-mention match for the detections from the object detector. From this, we obtain the binary alignment function h(m, r), which is 1 if the detected image region r has the maximum IoU overlap with the ground-truth bounding box b_m, and 0 otherwise. Given the ground-truth alignment h(m, r), we compute the cross-entropy loss:

L_gd = - Σ_m Σ_r h(m, r) log g(m, r)    (6)

(S3) Bounding box regression loss (BBR). We further add supervision to refine the object proposals from the detector for each mention. For each mention m, the ground-truth bounding box is represented as b_m = (x, y, w, h). To learn refinements, the model predicts box deltas δ_m = (δ_x, δ_y, δ_w, δ_h) for each mention m. We then take the highest scoring region for a given mention m as:

r_m = argmax_r g(m, r)

Our goal is to learn a transformation that maps the proposed box r_m to the ground-truth box b_m. We then apply the smooth-L1 loss following Ren et al.
(2015), denoted as L_bbr. Further details about this loss are given in the appendix. Next, we discuss how to train on the unlabeled subset of the dataset by generating pseudo-labels for the coreference and grounding tasks.

(U1) Pseudo coreference loss (PCR). Given the unlabeled dataset D_u, we compute pseudo coreferring pairs for the mentions in N. More specifically, we compute pseudo-positives P̂(m) and pseudo-negatives Â(m) for a mention m by computing the cosine similarity between the embeddings as in Equation (1). For each mention m, if the similarity with another mention m' is greater than a threshold, we label the pair as positive; otherwise, as negative. Finally, we compute the triplet loss:

L_pcr = Σ_m max(0, ||f(m) - μ_P̂(m)|| - ||f(m) - μ_Â(m)|| + α)

where α is the margin, f(m) is the embedding of the query mention m, μ_P̂(m) = (1/|P̂(m)|) Σ_{p ∈ P̂(m)} f(p) is the mean of the embeddings of the pseudo-positive labels P̂(m), and μ_Â(m) = (1/|Â(m)|) Σ_{a ∈ Â(m)} f(a) is the mean of the embeddings of the pseudo-negative labels Â(m).
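The mean-anchored triplet loss can be sketched in a few lines of numpy (a minimal per-mention version; the Euclidean distance and margin value here are illustrative choices, and the real loss is summed over all mentions in a batch):

```python
import numpy as np

def pseudo_triplet_loss(f_m: np.ndarray,
                        pos_embs: np.ndarray,
                        neg_embs: np.ndarray,
                        alpha: float = 0.2) -> float:
    """Robust pseudo-coreference (PCR) loss for one query mention:
    anchor against the MEAN of pseudo-positive and pseudo-negative
    embeddings, which dampens the effect of individual noisy pseudo
    labels compared to sampling a single positive/negative."""
    mu_p = pos_embs.mean(axis=0)  # mean of pseudo-positives P̂(m)
    mu_n = neg_embs.mean(axis=0)  # mean of pseudo-negatives Â(m)
    d_pos = np.linalg.norm(f_m - mu_p)
    d_neg = np.linalg.norm(f_m - mu_n)
    return max(0.0, d_pos - d_neg + alpha)
```

When the query mention sits close to the positive mean and far from the negative mean, the hinge is inactive and the loss is zero, so confident clusters are left untouched.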
The key intuition behind using the mean in a triplet loss formulation is to reduce overfitting to the noise in the pseudo-labels. In practice, this works better than the contrastive loss formulation in Equation (5) or mining a random positive/negative label for the standard triplet loss, especially when dealing with pseudo-labels.

(U2) Pseudo grounding loss (PGD). Furthermore, we compute a pseudo grounding loss on the unlabeled training data. Specifically, we impute the pseudo-labels from the grounding function g(m, r). We only consider samples whose grounding score is greater than a confidence threshold t, which is set to 0.9 in our experiments. The high threshold ensures that we consider only confident samples in the unlabeled set and avoids learning from noisy samples. We denote the label after binary thresholding as ĥ(m, r). The pseudo grounding alignment loss is:

L_pgd = - Σ_m Σ_r ĥ(m, r) log g(m, r)

Apart from the task-based losses above, we also use the standard image-text pre-training losses (Vaswani et al., 2017; Li et al., 2021), which help to learn better unimodal representations before fusion.

(U3) Image-text contrastive loss (ITC). Following Goel et al. (2022), we incorporate a contrastive loss, L_itc, to align image and narration pairs and learn better representations before fusion. The loss contrasts the mention-region matching score computed from the visual and text representations before fusion in the multimodal encoder, where v_r is the raw feature of the highest scoring region for a mention m.

(U4) Masked language modeling loss (MLM). To fine-tune the pre-trained BERT model (Devlin et al., 2018) on the image-narration data, we also use the pre-training task of masked language modeling. In particular, input word tokens are randomly masked and replaced by a special masking token, and the model must predict the masked token based on the unmasked words. This task is trained with a cross-entropy loss, L_mlm.
Hence, our overall training objective in Equation (4) is a combination of the specialized task losses on the labeled training set D_s (L_cr, L_gd and L_bbr), the pseudo-label losses on the unlabeled training set D_u (L_pcr and L_pgd), and the global pre-training objectives on the entire training dataset D (L_itc and L_mlm).

Inference
To obtain the coreference scores, we form chains by measuring the cosine similarity between mentions as described in Equation (1), considering pairs with similarity higher than a predefined threshold as positives. For narrative grounding, we extract the cross-attention scores from the last layer of the multimodal encoder. For each mention, we identify the region with the highest softmax score as the referred region.
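The chain-forming step can be sketched as follows, assuming chains are obtained as connected components (transitive closure) over the thresholded pairwise-similarity graph, which is one natural reading of "considering pairs above the threshold as positives":

```python
import numpy as np

def coref_chains(embs: np.ndarray, thresh: float = 0.5):
    """Group mention embeddings into coreference chains: threshold
    pairwise cosine similarity, then take connected components
    (union-find) so that transitively linked mentions share a chain."""
    n = len(embs)
    normed = embs / (np.linalg.norm(embs, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > thresh:
                parent[find(i)] = find(j)  # union the two chains

    chains = {}
    for i in range(n):
        chains.setdefault(find(i), []).append(i)
    return list(chains.values())
```

Singleton mentions simply end up in chains of length one.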

Experiments
Datasets. We evaluate our proposed method on the CIN dataset (Goel et al., 2022), which consists of 1000 test and 880 validation image-narration pairs from the Flickr30k split of the Localized Narratives dataset (Pont-Tuset et al., 2020), annotated with coreference chains and bounding boxes. We use the test split of the CIN dataset to report performance on coreference resolution and narrative grounding. The annotations from the validation split are used as the small labeled set during training. The unlabeled dataset is the Flickr30k training subset of the Localized Narratives dataset, which consists of 50k image-narration pairs without bounding box or coreference chain annotations.

Implementation details. For the image regions, we extract bounding boxes, visual features and object class labels using the Faster R-CNN object detector (Ren et al., 2015), as in Goel et al. (2022). We use a 4-layer transformer architecture for the text encoder and the multimodal encoder, similar to the ALBEF (Li et al., 2021) framework. The weights of the transformer encoders are initialized with the first four layers of BERT (Devlin et al., 2018). The visual encoder is a stack of two transformer encoder layers, each including a multi-head self-attention layer with two heads and an FFN of two FC layers followed by ReLU. For grounding, we follow Goel et al. (2022) and report the grounding accuracy for both noun phrases and pronouns: if the overlap between the ground-truth box and the predicted box is greater than 0.5, the prediction is considered correct.

Coreference Resolution
Table 1 reports the coreference resolution performance on the CIN dataset (Goel et al., 2022) for our method and the baselines. Further details about the baselines are given in the appendix. The text-based baselines Neural Coref (Lee et al., 2017) and longdoc (Toshniwal et al., 2021) are evaluated zero-shot on the task. Their low CoNLL F1 scores indicate the models' inability to generalize to new domains, in line with extensive findings in the coreference literature (Toshniwal et al., 2021; Xia and Van Durme, 2021; Gandhi et al., 2023). We further compare to strong multimodal baselines by directly evaluating VLMs zero-shot on the CIN dataset. Interestingly, all three methods, VisualBERT (Su et al., 2019), UNITER (Chen et al., 2020) and VinVL (Zhang et al., 2021b), perform better on MUC and B³ than the text-based baseline longdoc (Toshniwal et al., 2021), but drop in performance on the average CoNLL F1 score. These results show the inability of these models to effectively find singletons, leading to poor precision scores. Moreover, we can conclude that vision-and-language pre-trained models fail to generalize to MCR.
We also compare to two weakly supervised methods trained on the CIN dataset: MAF (Wang et al., 2020) and WS-MCR (Goel et al., 2022). Goel et al. (2022) present results for the MAF model as a baseline and for their proposed method, WS-MCR. MAF is a weakly supervised grounding method trained with ITC and evaluated for coreference resolution, while WS-MCR learns weakly supervised grounding and coreference resolution by combining the ITC loss with prior linguistic rules. Both methods improve significantly on MUC scores compared to the zero-shot unimodal and multimodal baselines.
Finally, we compare with the text-only variant (without any images) of our method, which improves over the baselines on the CoNLL F1 score. The significant performance gains of our final method, using both text and image combined with label supervision, show the importance of carefully tuning the model with a small amount of labeled data and large amounts of pseudo-labeled data.

Narrative Grounding
In Table 2, we present a comprehensive comparison between the baselines and our proposed approach on the task of narrative grounding. This task is both challenging and crucial, as it evaluates the precise alignment between image regions and phrases in the text. Notably, our method goes beyond the traditional alignment of noun phrases and also addresses the grounding of pronouns, which is vital for multimodal coreference resolution. We measure noun phrase grounding, pronoun grounding, and overall accuracy (Goel et al., 2022).
Our proposed method outperforms the weakly supervised baselines by approximately 2% and 2.5% in noun phrase and pronoun grounding accuracy, respectively. Furthermore, compared to our unsupervised baseline, "Ours (ITC + MLM)", the inclusion of labeled and pseudo-labeled data yields a significant performance boost of approximately 6%. These results demonstrate the benefit of training with both the grounding alignment and coreference resolution losses, highlighting the mutual benefits of this joint approach.

Ablation Study
Varying labeled and unlabeled data. We study the impact of labeled data on the learning process. In Table 3, we measure the model's CoNLL F1 score at different proportions of labeled data (20% and 50%). Despite the limited number of labeled samples, the model demonstrates consistently high performance without any significant drop, highlighting its ability to learn effectively from a small labeled set without relying on a large number of annotated training samples. Furthermore, to validate the efficacy of our method, we investigate the influence of the number of unlabeled samples during training. Following the same data split as in the supervised experiments, we report the results in row 2 of Table 3. As the quantity of unlabeled samples increases, the model exhibits improved coreference resolution performance, reinforcing its ability to effectively learn from pseudo-labeled data. Detailed results are in the appendix.

Impact of different loss functions. In Table 4, we assess coreference resolution performance when incorporating the various losses proposed in Section 3. Throughout training, the model always includes the self-supervised objectives ITC and MLM (first row of Table 4).
Integrating the supervised contrastive coreference loss CR in addition to ITC and MLM results in a significant performance drop. Due to the limited availability of labeled data, the model struggles to generalize for coreference resolution, leading to overfitting and consequently lower F1 scores. However, by progressively incorporating the bounding box regression loss BBR and the grounding alignment loss GD, we obtain a much stronger training signal even with a small labeled set. This multi-task training objective contributes an improvement of approximately 1.5% in the CoNLL F1 score.
Subsequently, we investigate the impact of incorporating losses on pseudo-labeled data. Introducing the pseudo coreference loss PCR yields a remarkable improvement of approximately 2% in the CoNLL F1 score. This highlights the value of pseudo clusters and the effectiveness of our robust triplet loss, which computes the triplet loss using the means of the positive and negative embeddings. Notably, this approach successfully incorporates pseudo-labeled data without overfitting while achieving substantial performance gains. Our final method, which additionally integrates the pseudo grounding loss PGD, exhibits the best overall performance, validating the potency of pseudo-labels for both coreference resolution and grounding.

Choice of coreference resolution loss. In Table 5, we examine the impact of different coreference resolution losses, comparing a binary cross-entropy (BCE) loss against the supervised contrastive loss. We observed a significant performance drop when training with the BCE loss. We hypothesize that the supervised contrastive loss yields better clustering of mentions by contrasting them directly in the embedding space, making the embeddings more robust for coreference resolution and contributing to improved performance.
Interestingly, when applying the supervised contrastive loss to D_u (row 2), we observed a drop in performance. Our hypothesis is that the contrastive loss tends to overfit in the presence of noisy pseudo-labels, degrading performance. In contrast, our pseudo triplet loss formulation PCR penalizes noisy pseudo-labels more softly, allowing the model to gradually adapt and become more resilient to such noise, resulting in more effective clustering of mentions. We also compare to an ablation, abbreviated RTC (row 3), where instead of taking the mean of the embeddings of the pseudo-positive and pseudo-negative labels, we sample a random positive and negative label. Random sampling generalizes better than the other ablations, but the mean cluster embeddings outperform randomly selected samples.

Qualitative Results
In Figure 3, we qualitatively visualize the performance of our method and compare it with the weakly supervised baseline from Goel et al. (2022).
Our model correctly separates the mentions the front man and the man both during coreference resolution and grounding, whereas WS-MCR (Goel et al., 2022) incorrectly assigns the mention the man to the front man and also grounds it incorrectly (denoted by the blue dotted line). Hence, our method can effectively learn to disambiguate instances based on visual details, which also helps coreference resolution.

Conclusion
In conclusion, this paper addresses the challenging task of multimodal coreference resolution, where an image is accompanied by a longer descriptive text. We propose a data-efficient semi-supervised approach that incorporates task-based losses for both labeled and unlabeled data within a cross-modal framework. Our method achieves strong results on coreference resolution and narrative grounding on the CIN dataset, showcasing its effectiveness in handling the complexities of MCR. In the future, we plan to investigate how the power of pre-training combined with semi-supervised fine-tuning can be fully utilized for MCR.

Limitations
Here, we outline limitations that are important considerations for future work.
First, the current model's performance in coreference resolution and grounding is limited by the use of a pre-trained object detector. Detectors pre-trained for object detection have a limited object category vocabulary and lack fine-grained properties such as adjectives, human actions and the open vocabulary found in narrations. This forces the model to rely on a predetermined set of regions and object classes, preventing it from directly learning region coordinates for a mention in an image. To improve performance, we envision the development of an end-to-end approach that eliminates the reliance on pre-defined regions.
Second, our model currently depends on ground-truth mentions to resolve coreferences and ground them. A promising future direction would be to detect mentions jointly with coreference resolution and grounding. This would significantly improve the applicability of our method and reduce its dependence on off-the-shelf mention detectors or ground-truth annotations.

Ethics Statement
All datasets used in this work have been previously released. Our use of the CIN dataset (Goel et al., 2022) is consistent with its intended use; details of the dataset are described in Goel et al. (2022). Multimodal datasets frequently include social biases, and we expect models trained on them to reflect these biases. It is important to note that multimodal models have both beneficial and harmful applications. Beneficial applications include advanced image and video retrieval, visual description systems to assist the visually impaired, and user interfaces that enhance interaction with smart home devices. However, harmful applications, such as non-consensual surveillance or fine-tuning models to retrieve inappropriate content, must be carefully addressed and mitigated.
Figure 4: Coreference resolution for an image-narration pair. We break the full narration into individual sentences in the columns for simplicity. The rows from top to bottom show the ground-truth annotations, the predictions from the WS-MCR method (Goel et al., 2022), and ours. Mentions in the same color form part of the same coreference chain. The narration reads: "in this image we can see a man is dancing on a road, and the man is wearing white color shirt and black color pant. we can also see a red and white color cloth in his hand, and the man is also wearing a orange and black color belt around his chest on the road we can see a grey color car is parked, and we can also see some other people also standing on the road by holding flags on the footpath we can see two women are sitting and talking with each other, and in that one woman is holding a baby, and we can also see some other people are also standing on the footpath, and in the background we can see a big orange color building with white color glass windows, and we can also see the entrance door of the building."

… b_m} is the label for the input pair. In particular, the label for each mention m in the narration is given as: P(m) and A(m), the sets of positive and negative mentions respectively for the mention m, and b_m, the bounding-box coordinates of the region corresponding to the mention m. Due to the unavailability of a large labeled training set, we leverage the unlabeled data D_u = D \ D_s, where D_u = {x_i}_{i=1}^{|D_u|} contains only image-narration pairs as inputs. Our overall training objective is the joint loss:

L = L_cr + L_gd + L_bbr + L_pcr + L_pgd + L_itc + L_mlm,

which is a combination of specialized task losses on the labeled training set D_s (L_cr, L_gd and L_bbr) and the unlabeled training set D_u (L_pcr and L_pgd), and global pre-training objectives on the entire training dataset D (L_itc and L_mlm).
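The joint objective above combines per-objective losses computed on the labeled set D_s, the unlabeled set D_u, and the full set D. A minimal sketch of this combination follows; the loss names mirror the text, while the numeric values and the unweighted (unit-coefficient) sum are illustrative assumptions, as the paper may weight the terms differently:

```python
def joint_loss(labeled_losses, unlabeled_losses, global_losses):
    """Combine per-objective scalar losses into one training objective.

    Each argument maps a loss name to its scalar value for the current
    batch; the total is their (here unweighted) sum.
    """
    return (sum(labeled_losses.values())
            + sum(unlabeled_losses.values())
            + sum(global_losses.values()))

# Illustrative per-batch values (not from the paper):
total = joint_loss(
    labeled_losses={"L_cr": 0.8, "L_gd": 0.5, "L_bbr": 0.3},   # on D_s
    unlabeled_losses={"L_pcr": 0.6, "L_pgd": 0.4},             # on D_u
    global_losses={"L_itc": 0.9, "L_mlm": 1.1},                # on D
)
```

In practice each dictionary entry would be a differentiable tensor produced by the corresponding head, but the combination step is the same scalar sum.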
cross-entropy loss (BCE) applied to both D_s and D_u, (2) supervised contrastive loss (CR) applied to both D_s and D_u, and (3) supervised contrastive loss (CR) on D_s and random triplet mining loss (RTC) on D_u.
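The ablation contrasts a pairwise BCE objective with a supervised contrastive objective over mentions. The toy sketches below illustrate the two forms, assuming scalar pair scores and a positive-index set standing in for P(m); the function names and inputs are illustrative, not the paper's implementation:

```python
import math

def bce_pair_loss(scores, labels):
    """Binary cross-entropy over sigmoid mention-pair scores.

    labels are 1 for coreferent pairs and 0 otherwise.
    """
    eps = 1e-12
    loss = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))  # sigmoid
        loss += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return loss / len(scores)

def supcon_loss(sim_to_all, positive_idx, tau=0.1):
    """Supervised contrastive loss for one mention.

    sim_to_all: similarities to all candidate mentions;
    positive_idx: indices of coreferent (positive) candidates.
    """
    logits = [s / tau for s in sim_to_all]
    m = max(logits)  # max-subtraction for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    pos = sum(math.exp(logits[i] - m) for i in positive_idx)
    return -math.log(pos / denom)
```

The contrastive form pulls a mention's embedding toward its whole positive set at once, which is one intuition for why it can outperform independent pairwise BCE decisions.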
Figure 3: Visualization of grounding and coreference resolution. The colored boxes in the image correspond to the mentions of the same color in the sentence.

Figure 5: Visualization of grounding and CR performance. Zoom in for a better view of the bounding boxes.

Table 1: Coreference resolution results on the CIN dataset (Goel et al., 2022) from our proposed method and other state-of-the-art unimodal and multimodal baselines. † indicates the use of predicted mentions, while the other results rely on ground-truth mentions during inference. * denotes zero-shot performance.
activation layers in the FFN. Training details are in the appendix. Evaluation. We report results for coreference resolution and narrative grounding. For the former, we use the standard CoNLL F1 score, which is the average of three coreference-based metrics: MUC, B^3 and CEAF_φ4. For the latter, we follow Goel et al. (2022).
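Concretely, the CoNLL F1 reported in the tables is the unweighted mean of the three metrics' F1 scores. A minimal sketch (the function name is ours):

```python
def conll_f1(muc_f1, b3_f1, ceaf_f1):
    """CoNLL F1: unweighted average of MUC, B^3 and CEAF_phi4 F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3.0

# e.g. MUC=0.6, B^3=0.5, CEAF_phi4=0.4 average to a CoNLL F1 of 0.5
score = conll_f1(0.6, 0.5, 0.4)
```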

Table 2: Comparison of narrative grounding performance on the CIN dataset.

Table 3: CR performance when varying the amount of labeled and unlabeled data during training.

Table 4: Ablation study of our proposed method with combinations of the proposed losses.

Table 5: Performance comparison for the choice of coreference resolution loss on the labeled dataset D_s and the unlabeled dataset D_u.