EDIS: Entity-Driven Image Search over Multimodal Web Content

Making image retrieval methods practical for real-world search applications requires significant progress in dataset scale, entity comprehension, and multimodal information fusion. In this work, we introduce \textbf{E}ntity-\textbf{D}riven \textbf{I}mage \textbf{S}earch (EDIS), a challenging dataset for cross-modal image search in the news domain. EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description. Unlike datasets that assume a small set of single-modality candidates, EDIS reflects real-world web image search scenarios by including a million multimodal image-text pairs as candidates. EDIS encourages the development of retrieval models that simultaneously address cross-modal information fusion and matching. To achieve accurate ranking results, a model must: 1) understand named entities and events from text queries, 2) ground entities onto images or text descriptions, and 3) effectively fuse textual and visual representations. Our experimental results show that EDIS challenges state-of-the-art methods with dense entities and a large-scale candidate set. The ablation study also shows that fusing textual features with visual features is critical for improving retrieval results.


Introduction
Image search, also known as text-to-image retrieval, aims to retrieve matching images from a candidate set given a text query. Despite the advancements in large-scale vision-and-language models (Wang et al., 2021; Zhang et al., 2021; Chen et al., 2020; Li et al., 2020b), accurately retrieving images from a large web-scale corpus remains a challenging problem. Several critical issues remain: 1) Lack of large-scale datasets: existing image retrieval datasets typically contain 30K-100K images (Plummer et al., 2015; Lin et al., 2014), which is far fewer than the number of images that search engines must deal with in real applications. 2) Insufficient entity-specific content: existing datasets focus on generic objects without specific identities. Specific entities ("Statue of Liberty") in web images and text may be recognized as general objects ("building"). 3) Modality mismatch: existing image retrieval methods usually measure image-text similarity. However, for web image search, the surrounding text of an image also plays a crucial part in this fast and robust retrieval process.

Figure 1: EDIS contains entity-rich queries and multimodal candidates. EDIS requires models to recognize subtle differences across different modalities to identify the correct candidates. For instance, the last three sample candidates either miss entities in the image or describe a different event.
Recently, there has been continued interest in event-centric tasks and methods in the news domain (Reddy et al., 2021; Varab and Schluter, 2021; Spangher et al., 2022). For instance, NYTimes800K (Tran et al., 2020) and Visual News (Liu et al., 2021) are large-scale entity-aware benchmarks for news image captioning. TARA (Fu et al., 2022) addresses time and location reasoning over news images. NewsStories (Tan et al., 2022) aims at illustrating events from news articles using visual summarization. While many of these tasks require accurate web image search results as a premise, a large-scale image retrieval dataset addressing the challenges of understanding entities and events is still lacking.
Therefore, to tackle the three key challenges above, we introduce a large-scale dataset named Entity-Driven Image Search (EDIS) in the news domain. As shown in Fig. 1, EDIS has a much larger candidate set and more entities in both the image and text modalities. In addition to images, the text segments surrounding an image are another important information source for retrieval. In news articles, headlines efficiently summarize events and are the first thing to catch readers' attention (Panthaplackel et al., 2022; Gabriel et al., 2022). Hence, to simulate web image search with multi-modal information, we pair each image with its news headline as a textual summarization of the event. As a result, EDIS requires models to retrieve over image-headline candidates, which is a novel setup compared to existing datasets.
Given a text query, existing models can only measure query-image or query-text similarity alone. BM25 (Robertson et al., 2009) and DPR (Karpukhin et al., 2020) fail to utilize visual features, while vision-language models like VisualBERT (Li et al., 2019) and Oscar (Li et al., 2020b) cannot be adopted directly for image-headline candidates and are infeasible for large-scale retrieval. Dual-stream encoder designs like CLIP (Radford et al., 2021) are efficient for large-scale retrieval and can compute a weighted sum of query-image and query-text similarities to utilize both modalities. However, as shown later, such multi-modal fusion is sub-optimal for EDIS. In this work, we evaluate image retrieval models on EDIS and reveal that information from images and headlines cannot be effectively utilized with score-level fusion. Therefore, we further propose a feature-level fusion method to utilize information from both images and headlines effectively. Our contribution is three-fold:
• We collect and annotate EDIS for large-scale image search, which is characterized by single-modality queries and multi-modal candidates. EDIS is curated to include images and text segments from open sources that depict a significant number of entities and events.
• We propose a feature-level fusion method that fuses multi-modal inputs before measuring alignment with query features. We show that images and headlines are both indispensable sources for accurate retrieval results, yet cannot be exploited by naive reranking.
• We evaluate existing approaches on EDIS and demonstrate that EDIS is more challenging than previous datasets due to its large scale and entity-rich characteristics.

Related Work
Cross-Modal Retrieval Datasets Given a query sample, cross-modal retrieval aims to retrieve matching candidates from another modality (Bain et al., 2021; Hu and Lee, 2022; Wang et al., 2022; Sangkloy et al., 2022). Several datasets have been proposed or repurposed for text-to-image retrieval.
For instance, MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015) are two widely used datasets that consist of Flickr images of common objects. CxC (Parekh et al., 2021) extends MSCOCO image-caption pairs with continuous similarity scores for better retrieval evaluation. Changpinyo et al. (2021) repurpose ADE20K (Pont-Tuset et al., 2020) for image retrieval with localized narratives and mouse traces. WebQA (Chang et al., 2022) has a similar scale to EDIS and defines source retrieval as a prerequisite step for answering questions; its source candidates are either text snippets or images paired with a short description.
In contrast, EDIS is a large-scale entity-rich dataset with multi-modal candidates that aligns better with realistic image search scenarios.

Task Formulation
EDIS contains a set of text queries $Q = \{q_1, q_2, \dots\}$ and a set of candidates $B = \{c_1 = (i_1, h_1), c_2 = (i_2, h_2), \dots\}$, where each candidate $c_n$ is an image-headline pair: $i_n$ denotes an image and $h_n$ denotes the associated headline. For a query $q_m$, a retrieval model needs to rank the top-$k$ most relevant image-headline pairs from $B$. As shown in Fig. 1, both images and headlines contain entities that are useful for matching with the query. We evaluate approaches with both distractor and full candidate sets. The distractor setup is similar to conventional text-to-image retrieval on MSCOCO (Lin et al., 2014) or Flickr30K (Plummer et al., 2015), where images are retrieved from a limited subset of $B$ with $\sim$25K (image, headline) pairs. The full setting requires the model to retrieve images from the entire candidate set $B$.
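Concretely, once query and candidate embeddings are available, the ranking step reduces to a nearest-neighbor search. The sketch below is illustrative only (the function name and toy data are ours, not part of EDIS); it ranks candidates by cosine similarity to a query embedding:

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs, k=5):
    """Rank candidates by cosine similarity to the query (illustrative sketch)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q                # similarity s_{m,n} for one query m
    order = np.argsort(-scores)   # descending by score
    return order[:k], scores[order[:k]]

# Toy example: 4 candidates in a 3-d embedding space; the query is a scaled
# copy of candidate 2, so candidate 2 must rank first.
rng = np.random.default_rng(0)
cands = rng.normal(size=(4, 3))
query = 2.0 * cands[2]
top, top_scores = rank_candidates(query, cands, k=2)
```

In the full EDIS setting the candidate matrix has ~1M rows, so in practice this brute-force dot product would be replaced by an approximate nearest-neighbor index.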

Entity-Driven Image Search (EDIS)
We select queries and candidates from human-written news articles and scraped web pages through several stages of filtering. Then we employ human annotators to label relevance scores. Fig. 2 illustrates the overall dataset collection pipeline.

Query Collection
We extract queries and ground truth images from the VisualNews (Liu et al., 2021) and TARA (Fu et al., 2022) datasets. These datasets contain news articles, each with a headline, an image, and an image caption. We adopt the captions as text queries and use the image-headline pairs as retrieval candidates.
We design a series of four filters to select high-quality, entity-rich queries. 1) Query complexity: we first evaluate the complexity of queries and remove simple ones with fewer than ten tokens. 2) Query entity count: we use spaCy to estimate average entity counts in the remaining queries and remove the 20% of queries with the lowest entity counts; the resulting query set has an average entity count above 4.0. 3) Query-image similarity: to ensure a strong correlation between queries and the corresponding ground truth images, we calculate query-image similarity scores using CLIP (Radford et al., 2021) and remove the 15% of samples with the lowest scores. 4) Query-text similarity: we calculate query-text similarity using Sentence-BERT (Reimers and Gurevych, 2019) and remove the top 10% most similar samples to force the retrieval model to rely on visual representations.
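The four filters above can be sketched as a simple pipeline. This is a minimal illustration of the filtering logic only: the `entity_count`, `clip_sim`, and `text_sim` arguments are stand-ins for the actual spaCy NER, CLIP, and Sentence-BERT scorers, and the function name is ours:

```python
def filter_queries(queries, entity_count, clip_sim, text_sim,
                   min_tokens=10, ent_drop=0.2, img_drop=0.15, txt_drop=0.10):
    """Sketch of the four-stage query filter; scoring functions are stand-ins."""
    # 1) Complexity: drop queries with fewer than `min_tokens` tokens.
    kept = [q for q in queries if len(q.split()) >= min_tokens]
    # 2) Entity count: drop the bottom `ent_drop` fraction.
    kept = sorted(kept, key=entity_count, reverse=True)
    kept = kept[: int(len(kept) * (1 - ent_drop))]
    # 3) Query-image similarity: drop the bottom `img_drop` fraction.
    kept = sorted(kept, key=clip_sim, reverse=True)
    kept = kept[: int(len(kept) * (1 - img_drop))]
    # 4) Query-text similarity: drop the TOP `txt_drop` fraction
    #    (most similar last after ascending sort).
    kept = sorted(kept, key=text_sim)
    kept = kept[: int(len(kept) * (1 - txt_drop))]
    return kept
```
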
To avoid duplicate queries, we compare each query to all other queries using BM25 (Robertson et al., 2009). We remove queries with high similarity scores, as they potentially describe the same news event and would lead to the same retrieved images.
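The deduplication step can be sketched with a minimal BM25 re-implementation (illustrative only; the paper's exact BM25 configuration and similarity threshold are not specified here, so the threshold below is a made-up toy value):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Minimal Okapi BM25 scoring of one query against a list of token lists."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def dedup(queries, threshold):
    """Greedily drop a query if its BM25 similarity to any kept query is high."""
    kept = []
    for q in queries:
        toks = q.lower().split()
        if kept and max(bm25_scores(toks, [k.lower().split() for k in kept])) > threshold:
            continue
        kept.append(q)
    return kept
```
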

Candidate Collection
In real web image search, multiple relevant images usually exist for a single query. Therefore, we expand the candidate pool so that each query corresponds to multiple image-headline pairs. Additional candidates are collected from Google Image Search and the rest of the VisualNews dataset.
For each query from VisualNews, we select seven image-headline pairs from Google Search; for each query from TARA, we select five image-headline pairs from Google Search and two image-headline pairs from VisualNews. Then, we ask annotators to label the relevance score for each candidate on a three-point Likert scale, where score 1 means "not relevant" and score 3 means "highly relevant". Formally, denoting $E(\cdot)$ as the entity set and $V(\cdot)$ as the event of a query $q_m$ or a candidate $c_n = (i_n, h_n)$, we define the ground truth relevance score as:

$$r(q_m, c_n) = \begin{cases} 3, & V(c_n) = V(q_m) \text{ and } E(q_m) \subseteq E(c_n), \\ 2, & V(c_n) = V(q_m) \text{ or } E(q_m) \cap E(c_n) \neq \emptyset, \\ 1, & \text{otherwise.} \end{cases}$$

Each candidate is annotated by at least three workers, and it is selected only when all workers reach a consensus. Controversial candidates that workers cannot agree upon after two rounds of annotation are discarded from the candidate pool. Additionally, one negative candidate is added to each annotation task to verify workers' attention. The final conformity rate among all annotations is over 91.5%.
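The relevance rule can be written as a small scoring function. Note this is a reconstruction of the annotation guideline for illustration; the released scores are human-labeled, and the function name and arguments are ours:

```python
def relevance(query_entities, query_event, cand_entities, cand_event):
    """Three-point relevance from entity/event overlap (reconstructed sketch).
    Entity arguments are sets; events are compared for identity."""
    same_event = cand_event == query_event
    if same_event and query_entities <= cand_entities:
        return 3  # same event, all query entities covered
    if same_event or (query_entities & cand_entities):
        return 2  # same event with missing entities, or entity overlap only
    return 1      # different event, no shared entities
```
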

Hard Negative Mining
We discover that EDIS queries can be challenging even for Google Image Search. Among the 200K images from Google Search, 29K (∼15%) are annotated with a score of 1, and 124K (∼62%) are annotated with a score of 2. These candidates are hard negatives that require retrieval models to understand and ground visual entities in the images. As for candidates from VisualNews, there are 9.7K (∼41%) with a score of 1 and 9.5K (∼40%) with a score of 2. We refer to these samples as in-domain hard negatives, as their headlines share some entities with the query but refer to different events with discrepant visual representations.
Soft Negative Mining

Lastly, we utilize the rest of the image-headline pairs from VisualNews and TARA to augment the candidate pool. These candidates are natural negatives with a relevance score of 1 because of their unique article contents and the extensive diversity of topics. In total, our dataset consists of 1,040,919 image-headline candidates.

Dataset Statistics
We summarize the major advantages of EDIS over existing datasets in Table 1. EDIS has the largest candidate set with a consistent candidate modality. Because we collect images from a real search engine, our images are not restricted to a specific source. Compared to datasets with general objects (e.g., MSCOCO), EDIS queries are entity-rich.
In Fig. 3 (left), we show the score distribution of human annotations. Candidates mined from VisualNews are mostly in-domain hard negatives whose images miss entities or depict different events; these candidates are mostly annotated with scores of 1 or 2. As for Google Search candidates, many images depict the same event but with missing entities, so the annotations concentrate on score 2. In Fig. 3 (right), we show that most queries have at least one hard negative, usually with more score 2 negatives than score 1 negatives. About half of the queries have more than one positive candidate (score 3). We show more examples of EDIS candidates in Fig. 8-11.

Multi-Modal Retrieval Method
Given a text query $q$, a model should be able to encode both images $i_n$ and headlines $h_n$ to match the query encoding. Therefore, the model should include a multi-modal candidate encoder $f_C$ and a query encoder $f_Q$. Within $f_C$, there is a branch $f_I$ for the image input and a branch $f_H$ for the headline. We formalize the matching process between a query $q_m$ and a candidate $c_n = (i_n, h_n)$ as:

$$s_{m,n} = \langle f_Q(q_m), f_C(i_n, h_n) \rangle,$$

where $s_{m,n}$ is the similarity score between $q_m$ and $c_n$. Based on the design of $f_C$, we categorize methods into score-level fusion and feature-level fusion.
Score-level Fusion These methods encode the image and headline independently, i.e., $f_C(i_n, h_n) = (f_I(i_n), f_H(h_n))$, and combine the resulting similarities. Therefore, $s_{m,n}$ is equivalent to a weighted sum of the query-image similarity $s^i_{m,n}$ and the query-headline similarity $s^h_{m,n}$:

$$s_{m,n} = \alpha \, s^i_{m,n} + \beta \, s^h_{m,n}.$$

Specifically, CLIP (Radford et al., 2021), BLIP (Li et al., 2022), and combinations of models like CLIP and BM25 (Robertson et al., 2009) belong to this category.
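Score-level fusion can be sketched in a few lines over precomputed embeddings. The weights below are placeholders; in the experiments they are found by grid search on a validation split, and the function name is ours:

```python
import numpy as np

def score_level_fusion(q_vec, img_vecs, head_vecs, alpha=0.7, beta=0.3):
    """Weighted sum of query-image and query-headline cosine similarities."""
    def cos(a, B):
        a = a / np.linalg.norm(a)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return B @ a
    return alpha * cos(q_vec, img_vecs) + beta * cos(q_vec, head_vecs)
```

Because the two similarity terms are combined only at the end, a candidate whose evidence is split across its image and its headline can never score as high as one whose single modality matches well, which is the weakness the next method targets.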
Feature-level Fusion In Sec. 6, we show that score-level fusion is a compromised choice for encoding multi-modal candidates. Therefore, we propose a modified version of BLIP (mBLIP) that fuses features throughout the encoding process. The overall fusion process can be abstracted as:

$$v_{i,h} = f_H(h_n, f_I(i_n)),$$

where the image features enter the headline encoder through cross-attention. As shown in Fig. 4, we first extract image embeddings $f_I(i_n)$ using the image encoder and then feed them into the cross-attention layers of $f_H$. The output of $f_H$ is a feature vector $v_{i,h}$ that fuses information from both the image and text modalities. We separately obtain the query feature $v_q = f_Q(q_m)$, where $f_Q$ shares the same architecture and weights as $f_H$ except that the cross-attention layers are not used. We adopt the Image-Text Contrastive (ITC) loss (Li et al., 2021) between $v_{i,h}$ and $v_q$ to align the fused candidate features with the query features.
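The core cross-attention operation can be illustrated with a bare-bones, single-head sketch in numpy. This is not BLIP's actual layer (no learned projections, layer norm, or multiple heads), only the mechanism by which headline tokens attend to image patches:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_tokens):
    """Single-head cross-attention: headline tokens (queries) attend to
    image patches (keys/values), with a residual connection."""
    d = text_tokens.shape[-1]
    attn = softmax(text_tokens @ image_tokens.T / np.sqrt(d))  # (T_text, T_img)
    return text_tokens + attn @ image_tokens                    # fused tokens

# Toy shapes: 4 headline tokens attending over 9 image patches, 8-dim features.
rng = np.random.default_rng(0)
fused = cross_attention(rng.normal(size=(4, 8)), rng.normal(size=(9, 8)))
```

The fused output keeps the text sequence length, so a pooled text token can serve as the single candidate vector $v_{i,h}$ matched against the query feature.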
Experiment Setup

Baselines
For score-level fusion mentioned in Sec. 5, we consider CLIP, BLIP, fine-tuned BLIP, and BM25+CLIP reranking to utilize both modalities of the candidates. In addition, we benchmark existing text-to-image retrieval and text document retrieval methods, including VinVL (Zhang et al., 2021), ALBEF (Li et al., 2021), and BM25 (Robertson et al., 2009). Although they are not designed for multi-modal candidates, benchmarking these methods facilitates our understanding of the importance of each single modality in the retrieval process. We do not consider single-stream approaches like UNITER (Chen et al., 2020), as they are not efficient for large-scale retrieval and result in extremely long execution times (see Appendix A).

Evaluation Metrics
We evaluate retrieval models with the standard metric Recall@k (R@k), which computes the recall rate over the top-$k$ retrieved candidates; $k$ is set to 1, 5, and 10. We also report mean Average Precision (mAP) to reflect retrieval precision while accounting for the ranking positions of all relevant documents. Formally,

$$\mathrm{mAP} = \frac{1}{|Q|} \sum_{m=1}^{|Q|} \frac{1}{|R_m|} \sum_{n \in R_m} P(m, n),$$

where $R_m$ is the set of ranks at which positive candidates of query $q_m$ appear and $P(m, n)$ is the Precision@$n$ of query $q_m$. For R@k and mAP, candidates with relevance score 3 are positive, while candidates with scores 2 or 1 are (hard) negative samples. These two metrics reflect the model's ability to retrieve the most relevant candidates, which aligns with the definition in Fig. 2.
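The mAP computation is standard and can be sketched as follows, with relevance binarized as described above (score 3 positive, scores 1-2 negative):

```python
def average_precision(relevance):
    """AP for one ranked list; `relevance` holds 1 for positive candidates
    and 0 otherwise, in retrieved order."""
    hits, precisions = 0, []
    for n, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / n)  # Precision@n at each positive hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_ap(ranked_lists):
    """mAP over the ranked lists of all queries."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)
```
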
To give merit to candidates with a score of 2, we also report Normalized Discounted Cumulative Gain (NDCG). NDCG assigns importance weights proportional to the relevance score, so that ranking score 2 candidates before score 1 candidates leads to a higher metric value:

$$\mathrm{NDCG}(m) = \frac{\mathrm{DCG}(m)}{\mathrm{IDCG}(m)}, \qquad \mathrm{DCG}(m) = \sum_{n=1}^{k} \frac{r_{m,n}}{\log_2(n+1)},$$

where $r_{m,n}$ is the relevance score of the candidate ranked at position $n$ for query $q_m$, and $\mathrm{IDCG}(m)$ is the DCG value of $q_m$ under the ideal candidate ranking.
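A minimal NDCG implementation with linear gains reads as follows (a sketch; an exponential gain function $2^{r}-1$ is also common, and the paper's exact choice is not restated here):

```python
import math

def dcg(scores):
    """Discounted cumulative gain over relevance scores in ranked order."""
    return sum(s / math.log2(n + 1) for n, s in enumerate(scores, start=1))

def ndcg(scores):
    """NDCG: DCG normalized by the DCG of the ideal (descending) ranking."""
    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0
```
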

Implementation Details
For BLIP fine-tuning, we adopt the same loss and hyperparameters as reported in the original implementation.We increase the learning rate to 1e-5 for optimal validation results.We directly rank the candidates by computing the cosine similarity of query features and candidate features and do not use any linear regression heads for reranking.Therefore, we abandon the image-text matching (ITM) loss in mBLIP fine-tuning and increase the learning rate to 5e-5 for optimal performance.More details can be found in Appendix A.
Experimental Results

BLIP-Based Fusion Methods
We first investigate the performance difference between score-level and feature-level fusion as described in Sec. 5. We implement these two approaches on BLIP (Li et al., 2022). Table 2 compares the results under two different setups, where "BLIP" denotes score-level fusion using the original BLIP architecture and "mBLIP" denotes our proposed feature-level fusion. For score-level fusion, we obtain the weights from a grid search on the validation set under the distractor setup.

Table 2: Retrieval performance of BLIP and mBLIP under the distractor and full settings. We use grid search on the validation split to find the best score fusion weights (see Eq. 4) for zero-shot and fine-tuned BLIP.
Distractor Set Pre-trained BLIP achieves 18.4 R@1 and 46.6 R@5, which means that almost one-third of the queries have a positive candidate in the top-1 results, and around half of the positive candidates are retrieved in the top-5 results. After fine-tuning, BLIP doubles R@1 to 32.6 and achieves significant gains in the other metrics. The improvement shows that entities in EDIS are out-of-domain concepts for zero-shot BLIP, and that the EDIS training split helps models adapt to the news domain. mBLIP outperforms BLIP in all metrics except R@1. The overall improvement indicates that feature-level fusion is superior to score-level fusion because it utilizes headlines more effectively. The degradation in R@1 can be attributed to the image-query alignment already being accurate enough for a small number of queries; for those, headlines slightly hurt the results, as they only provide a high-level summarization.
Full Set Retrieving from the full candidate set significantly degrades performance, by over 50% in all metrics. Though the distractor setup was widely adopted in previous work, we show that a larger candidate set imposes remarkable challenges on SOTA models. We observe similar trends when comparing the three variants of BLIP. mBLIP achieves over 17% relative improvement across all metrics except R@1, even larger than the 4-12% relative improvement under the distractor setting. The degradation in R@1 is also much less severe. Therefore, feature-level fusion is a more effective way to encode multi-modal candidates, considering that users usually receive more than one searched image in reality.

Additional Baselines
In Table 3, the poor recall rates of BM25 and the CLIP text encoder imply that headlines alone are insufficient for accurate retrieval. However, text-based retrieval achieves promising NDCG values, indicating that headlines are useful for ranking score 2 candidates in higher positions.
Score-level Fusion "BM25+CLIP" first ranks candidates using BM25 and then reranks the top 50 or 200 candidates with CLIP to utilize the images. Despite the improvement over text-based methods, it underperforms zero-shot CLIP or BLIP, which implies that ranking with query-headline similarity imposes a bottleneck on the reranking process. CLIP achieves the best R@1/5/10 and mAP among these methods. We hypothesize that the "CLIP filtering" step in Sec. 4.1 eliminates hard negative query-image pairs for CLIP and thus introduces a performance bias towards CLIP. Fine-tuned CLIP does not show apparent improvement and is therefore not shown in Table 3. EDIS thus remains challenging for SOTA retrieval models. Image-based methods, however, struggle to rank score 2 candidates higher: we conjecture that many score 2 images have insufficient entities, resulting in lower query-image similarity scores. Hence, models must rely on headlines to simultaneously recognize entities from multiple modalities.

Ablation Study
Component Analysis Table 4 shows the performance of the two fusion approaches without either the image or the headline branch. BLIP achieves much lower performance when relying solely on query-headline alignment (6.6 R@1, 29.7 mAP) than when utilizing images only (33.9 R@1, 54.0 mAP). When using images and headlines together for score fusion, BLIP achieves only comparable or slightly degraded performance. Therefore, score-level fusion cannot easily handle the multi-modal candidates in EDIS.
In contrast, mBLIP shows improved performance with the headline encoder but decreased performance with only the image encoder. This is intuitive, as BLIP fine-tuning only utilizes images without headlines, whereas mBLIP utilizes both. More interestingly, when using both the image and headline encoders, mBLIP demonstrates an over 20% relative increase in all metrics. These results imply that feature-level fusion is the more effective method for combining candidate features from multiple modalities.

Case Study
Success Case We show one success case and one failure case of mBLIP in Fig. 5. In the success case (top), mBLIP manages to retrieve all four relevant images, while BLIP retrieves five false positives. Since all ten images contain a "cruise", we conjecture that entities in the headlines (e.g., "Cambodia", "Sihanoukville") play a critical role in mBLIP outperforming BLIP here. This case shows that feature-level fusion is much more effective at utilizing headline features than score-level fusion.
Failure Case As for the failure case in Fig. 5 (bottom), both BLIP and mBLIP fail to retrieve the positive candidates in the top-5 results. Both methods fail to recognize "John Podesta" and align the text with the visual representation. For example, the top-2 candidates retrieved by mBLIP depict a different person from a different event. "Hillary Clinton" becomes a distracting entity in the query, and a model must understand the event, rather than just matching entities, to achieve accurate retrieval results. The third candidate of mBLIP shows an image with the correct person but from a different event. This further shows that EDIS is a challenging dataset that requires specific knowledge of entities, cross-modal entity matching, and event understanding.

Conclusion
Training and evaluating on large-scale image retrieval datasets is an inevitable step toward real image search applications. To mitigate the gap between existing datasets and real-world image search challenges, we propose EDIS, a large-scale dataset with a novel retrieval setting and one million candidates. EDIS queries and candidates are collected from the news domain and describe abundant entities and events. EDIS candidates are image-headline pairs, since realistic image search utilizes the text surrounding an image to facilitate accurate search results. As a primary step towards handling multi-modal candidates in EDIS, we review two primary fusion approaches and propose a feature-level fusion method that effectively utilizes information from both images and headlines. Our experimental results show ample space for improvement on EDIS. Future work should consider more principled solutions involving knowledge graphs and entity linking.

Limitations

In this study, we only cover image retrieval with English queries. Queries and headlines in other languages may exhibit different types of ambiguity or underspecification, so expanding our dataset to multi-lingual image retrieval is important. Secondly, we only consider the news domain to collect entity-rich queries and images. We plan to expand our dataset to the open domain, where other entities, such as iconic landmarks, will be included. In addition, we only consider headlines as the text information used in the retrieval process. In real image search scenarios, however, search engines usually utilize multiple paragraphs of surrounding text to determine the relevance of an image. In the future, we will expand the text of the multimodal candidates with news articles or article segments. Moreover, our dataset, and models trained on it, could cause harm if a model is not accurate enough: it may return completely incorrect candidates and cause users to confuse persons or objects with incorrect identities. We will provide all ground truth annotations with visualization code to help users inspect the ground truth candidates. Last but not least, we do not consider the phenomenon of underspecification in the image search experience. Users search with phrases or incomplete sentences to save typing effort, so more realistic queries can be underspecified and grammatically incorrect. However, this problem is universal to all existing image retrieval datasets, as collecting real human search queries is challenging. We plan to make our dataset more realistic in the future by utilizing powerful tools such as large language models to generate underspecified, near-realistic queries.

Ethics Consideration
We will release our dataset EDIS for academic purposes only; it should not be used outside of research. We strictly follow the licenses stated in the datasets that we newly annotated. As introduced in Sec. 4.2, we annotated the data with crowd-workers through Amazon Mechanical Turk. The data annotation part of the project is classified as exempt by our Human Subject Committee via IRB protocols. We required the workers to be in English-speaking regions (Australia, Canada, New Zealand, the United Kingdom, and the United States). We keep the identity of workers anonymized throughout the collection and post-processing stages. We also require the workers to have a HIT approval rating of at least 96%. We pay $0.20 per completed HIT, and each HIT takes around 30-40 seconds to complete on average, resulting in an estimated hourly wage of $18-$24. Example screenshots of our annotation interface can be found in Fig. 6-7 in Appendix A.
to obtain similarity scores for all query-candidate pairs. Since it takes around 3.5 minutes for BLIP to evaluate 3.2K queries against 25K candidates, it would take more than 5 days for a single-stream encoder model to complete retrieval under the distractor setting, and more than a year under the full setting.

Figure 3: Left: Annotated candidate distribution by relevance score. Right: Query distribution by the scores of annotated candidates.

Figure 4: Score-level fusion encodes each modality into a single feature vector, while feature-level fusion outputs a single feature vector for multi-modal candidates by adopting cross-attention layers.

Figure 5: A success case (top) and a failure case (bottom) of mBLIP compared to BLIP.

Figure 6: Amazon Mechanical Turk interface with annotation examples for crowd-workers to read.

Figure 7: Amazon Mechanical Turk interface and annotation task.

Table 1: Statistics of EDIS and existing image retrieval datasets. EDIS has a larger set of multi-modal candidates, unrestricted image sources, multi-scale annotations, and entity-rich queries compared to previous datasets.

Table 3: Evaluation results on additional baselines.

Table 4: Ablation study on the effectiveness of feature fusion in BLIP and mBLIP.