Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search

Internet search affects people's cognition of the world, so mitigating biases in search results and learning fair models is imperative for social good. In this work, we study a unique gender bias in image search: the retrieved images are often gender-imbalanced even for gender-neutral natural language queries. We diagnose two typical families of image search models: specialized models trained on in-domain datasets, and generalized representation models pre-trained on massive image and text data from across the internet. Both types of models suffer from severe gender bias. We therefore introduce two novel debiasing approaches: an in-processing fair sampling method that addresses the gender imbalance issue when training models, and a post-processing feature clipping method based on mutual information that debiases the multimodal representations of pre-trained models. Extensive experiments on the MS-COCO and Flickr30K benchmarks show that our methods significantly reduce the gender bias in image search models.


Introduction
Internet information is shaping people's minds. The algorithmic processes behind modern search engines, with extensive use of machine learning, have great power to determine users' access to information (Eslami et al., 2015). These information systems are biased when results are systematically slanted in unfair discrimination against protected groups (Friedman and Nissenbaum, 1996).
Gender bias is a severe fairness issue in image search. Figure 1 shows an example: given the gender-neutral natural language query "a person is cooking", only 2 out of 10 images retrieved by an image search model (Radford et al., 2021) depict females, while equalized exposure for males and females is expected. Such gender-biased search results are harmful to society, as they change people's cognition and reinforce gender stereotypes (Kay et al., 2015). Mitigating gender bias in image search is imperative for social good.
In this paper, we formally develop a framework for quantifying gender bias in image search results, where text queries in English 1 are made gender-neutral and models are expected to retrieve gender-balanced search images. To evaluate model fairness, we use the normalized difference between masculine and feminine images in the retrieved results to represent gender bias. We diagnose the gender bias of two primary families of multimodal models for image search: (1) specialized models that are trained on in-domain datasets to perform text-image retrieval, and (2) general-purpose representation models that are pre-trained on massive image and text data available online and can be applied to image search. Our analysis on the MS-COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014) datasets reveals that both types of models lead to serious gender bias issues (e.g., nearly 70% of the retrieved images are masculine images).
To mitigate gender bias in image search, we propose two novel debiasing solutions for both model families. The specialized in-domain training methods such as SCAN (Lee et al., 2018) often adopt contrastive learning to enforce image-text matching by maximizing the margin between positive and negative image-text pairs. However, the gender distribution in the training data is typically imbalanced, which results in unfair model training. Thus we introduce a fair sampling (FairSample) method to alleviate the gender imbalance during training without modifying the training data.
Our second solution aims at debiasing the large, pre-trained multimodal representation models, which learn effective pre-trained image and text representations for down-stream applications (Bachman et al., 2019; Chen et al., 2020a,c; Gan et al., 2020; Chen et al., 2020d; Radford et al., 2021). We examine whether the representative CLIP model (Radford et al., 2021) embeds human biases into multimodal representations when applied to the task of image search. Furthermore, we propose a novel post-processing feature clipping approach, clip, that effectively prunes out the features highly correlated with gender, based on their mutual information, to reduce the gender bias induced by multimodal representations. The clip method does not require any training and is compatible with various pre-trained models.

Figure 1: Gender bias in image search. We show the top-10 retrieved images for searching "a person is cooking" on the Flickr30K (Young et al., 2014) test set using a state-of-the-art model (Radford et al., 2021). Despite the gender-neutral query, only 2 out of 10 images depict a female cooking.
We evaluate both debiasing approaches on MS-COCO and Flickr30K and find that, on both benchmarks, the proposed approaches significantly reduce the gender bias exhibited by SCAN and CLIP models when evaluated on the gender-neutral corpora, yielding fairer and more gender-balanced search results. In addition, we evaluate the similarity bias of the CLIP model in realistic image search results for occupations on the internet, and observe that the post-processing methods mitigate the discrepancy between gender groups by a large margin.
Our contributions are four-fold: (1) we diagnose a unique gender bias in image search, especially for gender-neutral text queries; (2) we introduce a fair sampling method to mitigate gender bias during model training; (3) we also propose a novel post-processing clip method to debias pre-trained multimodal representation models; (4) we conduct extensive experiments to analyze the prevalent bias in existing models and demonstrate the effectiveness of our debiasing methods.

Gender Bias in Image Search
In an image search system, text queries may be either gender-neutral or gender-specific. Intuitively, when we search for a gender-neutral query like "a person is cooking", we expect a fair model to return approximately equal proportions of images depicting men and women. For gender-specific queries, an unbiased image search system is instead expected to exclude images with misspecified gender information; that goal aligns with seeking more accurate search results and is distinct from measuring gender bias in the gender-neutral case. Therefore, we focus on identifying and quantifying gender bias only when searching for gender-neutral text queries.

Problem Statement
Given a text query provided by the user, the goal of an image search system is to retrieve the matching images from a curated image collection. In the multimodal setting, given a dataset {(v_n, c_n)}_{n=1}^N with N image-text pairs, the task of image search aims at matching every image v to its provided text c. We use V = {v_n}_{n=1}^N to denote the image set and C = {c_n}_{n=1}^N to denote the text set. Given a text query c ∈ C and an image v ∈ V, image retrieval models predict a similarity score S(v, c) between the image and the text. One general solution is to embed the image and text into a high-dimensional representation space and compute a proper distance metric between the vectors, such as Euclidean distance or cosine similarity (Wang et al., 2014). Taking cosine similarity as an example:

S(v, c) = (⃗v ⋅ ⃗c) / (∥⃗v∥ ∥⃗c∥),   (1)

where ⃗v and ⃗c are the embedding vectors of the image and the text. The image search system outputs the set R_K(c) of top-K retrieved images with the highest similarity scores. In this work, we assume that when evaluating on test data, every text query c ∈ C is written in gender-neutral language.
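As a concrete illustration, the scoring and retrieval steps above can be sketched in a few lines (a minimal sketch with our own function names, not from any specific system):

```python
import numpy as np

def cosine_similarity(v, c):
    """Similarity score S(v, c) between an image vector v and a text vector c."""
    return float(np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c)))

def retrieve_top_k(image_vecs, text_vec, k):
    """Return indices of the top-K images R_K(c) with the highest similarity scores."""
    scores = [cosine_similarity(v, text_vec) for v in image_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```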

Measuring Gender Bias
The situations in image search results are complex: there may be no people, one person, or more than one person in an image. Let g(v) ∈ {male, female, neutral} represent the gender attribute of an image v. Note that in this study gender refers to biological sex (Larson, 2017). We use the following rules to determine g(v): g(v) = male when there are only men in the image, g(v) = female when there are only women in the image, and g(v) = neutral otherwise. Portraits with different gender attributes often receive unequal exposure in image search results. Inspired by Kay et al. (2015) and Zhao et al. (2017), we measure gender bias in image search by comparing the proportions of masculine and feminine images in the search results. Given the set of retrieved images R_K(c), let N_male and N_female denote the numbers of retrieved images depicting males and females, and define the gender bias metric as:

∆_K(c) = (N_male − N_female) / (N_male + N_female) if N_male + N_female > 0, and ∆_K(c) = 0 otherwise.   (2)

We do not take absolute values, so the metric captures the direction of skewness: ∆_K(c) > 0 means the results skew towards males. Note that the similar definition of gender bias N_male / (N_male + N_female) in Zhao et al. (2017) is equivalent to (1 + ∆_K(c)) / 2, but our definition additionally covers the special case in which none of the retrieved images is gender-specific, i.e., N_male + N_female = 0. For the whole test set, we measure the mean difference over all the text queries:

Bias@K = (1 / |C|) Σ_{c∈C} ∆_K(c).   (3)
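The two metrics follow directly from their definitions (a minimal sketch with our own function names; gender attributes are assumed to be given as strings):

```python
def delta_k(genders):
    """Gender bias of one query's top-K results.
    `genders` is the list [g(v) for v in R_K(c)] with values
    'male', 'female', or 'neutral'."""
    n_male = genders.count('male')
    n_female = genders.count('female')
    if n_male + n_female == 0:      # no gender-specific images retrieved
        return 0.0
    return (n_male - n_female) / (n_male + n_female)

def bias_at_k(results_per_query):
    """Bias@K: mean of delta_k over all gender-neutral text queries."""
    return sum(delta_k(g) for g in results_per_query) / len(results_per_query)
```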

Mitigating Gender Bias in Image Search
There are two families of multimodal models for the image search task. One is to build a specialized model that embeds image and text into representation vectors with measurable similarity scores. The other is to use general-purpose image-text representations pre-trained on sufficiently large data and compute a particular distance metric. We focus on two representative models, SCAN (Lee et al., 2018) and CLIP (Radford et al., 2021), one for each family. For the first family, we propose an in-processing learning approach to ameliorate the unfairness caused by the imbalanced gender distribution in training examples. This approach builds on contrastive learning but extends it with a fair sampling step; it requires full training on in-domain data examples. For the second family, we propose a post-processing feature clipping technique that mitigates bias from an information-theoretic perspective. This approach is compatible with pre-trained models and is lightweight to implement, requiring no additional training.

In-processing Debiasing: Fair Sampling
Image search models in the first family are often trained under the contrastive learning framework (Le-Khac et al., 2020). We now explain the two primary components of our in-processing debiasing approach, contrastive learning and fair sampling, within this context.

Contrastive Learning
We start by formally introducing the standard contrastive learning framework commonly used in previous works (Lee et al., 2018; Chen et al., 2020b) for image-text retrieval. Given a batch of N image-text pairs B = {(v_n, c_n)}_{n=1}^N, the model aims to maximize the similarity scores of matched image-text pairs (positive pairs) while minimizing those of mismatched pairs (negative pairs). The representative SCAN model (Lee et al., 2018), which outputs a similarity score S(v, c) between image and text, is optimized with a standard hinge-based triplet loss:

L_{i−t} = Σ_{(v,c)∈B} [γ − S(v, c) + S(v, c̃)]_+ ,
L_{t−i} = Σ_{(v,c)∈B} [γ − S(v, c) + S(ṽ, c)]_+ ,

where γ is the margin, ṽ and c̃ are negative examples, and [x]_+ = max(x, 0) denotes the ramp function. L_{i−t} corresponds to image-to-text retrieval, while L_{t−i} corresponds to text-to-image retrieval (i.e., image search). Common negative sampling strategies include selecting all negatives (Huang et al., 2017), selecting the hard negatives with the highest similarity scores in the mini-batch (Faghri et al., 2018), and selecting hard negatives from the whole training data (Chen et al., 2020b). Minimizing the margin-based triplet loss makes positive image-text pairs closer to each other than to negative samples in the joint embedding space.
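A minimal sketch of the text-to-image triplet loss with in-batch hard negatives (Faghri et al., 2018), assuming the similarity scores have already been computed into an N×N matrix; the SCAN similarity function itself and batching are omitted:

```python
import numpy as np

def triplet_loss_t2i(S, gamma=0.2):
    """Hinge-based triplet loss L_{t-i} for text-to-image retrieval.
    S is an (N, N) matrix where S[i, j] = S(v_i, c_j); diagonal entries
    are the positive pairs. For each text c_j we use the hardest negative
    image in the batch."""
    n = S.shape[0]
    loss = 0.0
    for j in range(n):                      # for each text query c_j
        pos = S[j, j]                       # positive pair score S(v_j, c_j)
        negs = [S[i, j] for i in range(n) if i != j]
        hard_neg = max(negs)                # hardest negative image in batch
        loss += max(0.0, gamma - pos + hard_neg)
    return loss
```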
Fair Sampling One major issue in the contrastive learning framework is that the gender distribution in a batch of image-text pairs is typically imbalanced. Hence, the negative samples will slant towards the majority group, leading to systematic discrimination. To address this problem, we propose a fair sampling strategy. We split the batch of image-text pairs into masculine and feminine pairs based on the image's gender attribute:

B_male = {(v, c) ∈ B : g(v) = male},  B_female = {(v, c) ∈ B : g(v) = female}.

For every positive image-text pair (v, c) ∈ B, we identify the gender information contained in the query c. If the natural language query is gender-neutral, we sample the negative image from B_male and B_female with probability 1/2 each. Otherwise, we keep the primitive negative sampling strategy to retain the model's generalization on gender-specific queries. Let B* be the batch of gender-neutral image-text pairs; the image search loss with fair sampling is:

L_fair = Σ_{(v,c)∈B*} [γ − S(v, c) + S(ṽ, c)]_+ ,  ṽ ∼ FairSample(B).

Empirically, we find that if we apply the fair sampling strategy throughout, the recall performance drops too much. To obtain a better tradeoff, we use a weight α to combine the objectives as the final text-to-image loss:

L = α L_fair + (1 − α) L_{t−i}.

We do not alter the sentence retrieval loss L_{i−t} during training, so as to preserve generalization.
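The fair negative-sampling step for a gender-neutral query can be sketched as follows (a simplified illustration with our own function name; the real training loop applies this inside the batched loss computation):

```python
import random

def sample_negative(batch, query_is_neutral, pos_index, rng=random):
    """Fair negative sampling. `batch` is a list of (image, gender) pairs;
    for a gender-neutral query, draw the negative image from the male and
    female subsets with probability 1/2 each. For gendered queries (or if a
    subset is empty) return None so the caller falls back to the primitive
    negative sampling strategy."""
    males = [img for i, (img, g) in enumerate(batch) if g == 'male' and i != pos_index]
    females = [img for i, (img, g) in enumerate(batch) if g == 'female' and i != pos_index]
    if not query_is_neutral or not males or not females:
        return None  # fall back to the usual hard-negative selection
    pool = males if rng.random() < 0.5 else females
    return rng.choice(pool)
```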
Post-processing Debiasing: Feature Clipping

In this work, we find that the pre-trained CLIP model reaches state-of-the-art performance but exhibits large gender bias due to training on uncurated image-text pairs collected from the internet. Although Radford et al. (2021) released the pre-trained CLIP model, the training process is almost irreproducible given its computational cost and massive training data. To avoid re-training the CLIP model, we introduce a novel post-processing mechanism to mitigate the representation bias in CLIP. We propose to "clip" the dimensions of the feature embeddings that are highly correlated with gender information. This idea is motivated by the fact that an unbiased retrieval implies independence between the covariates (active features) and the sensitive attribute (gender) (Barocas et al., 2019). Clipping the highly correlated covariates returns a relatively independent and neutral set of features that does not encode hidden gender bias.
The proposed clip algorithm is summarized in Algorithm 1, and we explain its key steps below.
Let Ω = {1, ..., d} be the full index set. We use V = V_Ω = [V_1, V_2, ..., V_d] to represent the variable of d-dimensional image encoding vectors and g(V) ∈ {male, female, neutral} to represent the corresponding gender attribute. The goal is to output an index set Z of clipped covariates such that the remaining representations V_{Ω∖Z} have reduced dependence on the gender attribute g(V). We measure the correlation between each dimension V_i and the gender attribute g(V) by estimating their mutual information (Gao et al., 2017):

I(V_i; g(V)) = D_KL( P_(V_i, g(V)) ∥ P_{V_i} ⊗ P_{g(V)} ),

where D_KL is the KL divergence (Kullback and Leibler, 1951), P_(V_i, g(V)) is the joint distribution, and P_{V_i} and P_{g(V)} are the marginals. Next, we greedily clip the m covariates with the highest mutual information and construct (d − m)-dimensional embedding vectors V_{Ω∖Z}. Here m is a hyper-parameter chosen experimentally to best trade off accuracy against the reduction in gender bias; we show how the selection of m affects performance in Section 5.3. To project the text representations, denoted by the variable C, into the same embedding space, we apply the same index set Z to obtain the clipped text embedding vectors C_{Ω∖Z}.
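The clipping step can be sketched as follows. For illustration we use a simple histogram-based plug-in estimate of the mutual information rather than the estimator of Gao et al. (2017), and the function names are our own:

```python
import numpy as np

def mutual_information(x, labels, bins=8):
    """Plug-in estimate of I(V_i; g(V)): discretize feature x into bins and
    compute the KL divergence between the joint distribution and the
    product of the marginals."""
    edges = np.histogram_bin_edges(x, bins=bins)[1:-1]   # inner bin edges
    x_bin = np.digitize(x, edges)
    mi = 0.0
    for b in np.unique(x_bin):
        for g in np.unique(labels):
            p_joint = np.mean((x_bin == b) & (labels == g))
            if p_joint > 0:
                p_b = np.mean(x_bin == b)
                p_g = np.mean(labels == g)
                mi += p_joint * np.log(p_joint / (p_b * p_g))
    return mi

def clip_indices(V, genders, m):
    """Indices Ω \\ Z kept after clipping the m dimensions of V most
    correlated with the (integer-coded) gender attribute."""
    mi = np.array([mutual_information(V[:, i], genders) for i in range(V.shape[1])])
    clipped = set(np.argsort(mi)[-m:].tolist())          # Z: top-m MI dimensions
    return [i for i in range(V.shape[1]) if i not in clipped]
```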
The clipped image and text representations, denoted ⃗v* and ⃗c*, have a relatively low correlation with the gender attribute due to the "loss" of mutual information. We then compute the cosine similarity between image and text by substituting ⃗v* and ⃗c* into Equation (1):

S(v, c) = (⃗v* ⋅ ⃗c*) / (∥⃗v*∥ ∥⃗c*∥).

Finally, we rank the images by the cosine similarity between the clipped representations.

Datasets
We evaluate our approaches on the standard MS-COCO (Chen et al., 2015) and Flickr30K (Young et al., 2014) benchmarks.

Identifying Gender Attributes of Images Sensitive attributes such as gender are often not explicitly annotated in large-scale datasets such as MS-COCO and Flickr30K, but we observe that implicit gender attributes of images can be extracted from their associated human-annotated captions. We therefore pre-define a set of masculine words and a set of feminine words. 4 Following Zhao et al. (2017) and Burns et al. (2018), we use the ground-truth annotated captions to identify the gender attributes of images. An image is labeled "male" if at least one of its captions contains masculine words and no caption contains feminine words. Similarly, an image is labeled "female" if at least one of its captions contains feminine words and no caption contains masculine words. Otherwise, the image is labeled "gender-neutral".
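The labeling rule can be sketched as follows, using the word lists from Table 3 (a simplified illustration that matches whole lowercase whitespace-separated tokens and ignores punctuation handling):

```python
MASCULINE = {"man", "men", "male", "boy", "gentleman", "father",
             "brother", "son", "husband", "boyfriend"}
FEMININE = {"woman", "women", "female", "girl", "lady", "mother",
            "mom", "sister", "daughter", "wife", "girlfriend"}

def image_gender(captions):
    """Label an image from its captions: 'male' if at least one caption
    contains a masculine word and none contains a feminine word,
    symmetrically for 'female', otherwise 'neutral'."""
    has_masc = any(w in MASCULINE for c in captions for w in c.lower().split())
    has_fem = any(w in FEMININE for c in captions for w in c.lower().split())
    if has_masc and not has_fem:
        return "male"
    if has_fem and not has_masc:
        return "female"
    return "neutral"
```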

Models
We compare the fairness performance of the following approaches:
• SCAN (Lee et al., 2018): we use the official implementation for training and evaluation. 5
• FairSample: we apply the fair sampling method proposed in Section 3.1 to the SCAN framework and adopt the same hyper-parameters suggested by Lee et al. (2018) for training.
• CLIP (Radford et al., 2021): we use the pre-trained CLIP model released by OpenAI. 6 The model uses a Vision Transformer (Dosovitskiy et al., 2021) as the image encoder and a masked self-attention Transformer (Vaswani et al., 2017) as the text encoder. The original model produces 512-dimensional image and text vectors.
• CLIP-clip: we apply the feature pruning algorithm in Section 3.2 to the image and text features generated by the CLIP model. We set m = 100 and clip the image and text representations to 412-dimensional vectors. Note that SCAN and FairSample are trained and tested on the in-domain MS-COCO and Flickr30K datasets, while the pre-trained CLIP model is directly tested on the MS-COCO and Flickr30K test sets without fine-tuning on their training sets (the same holds for CLIP-clip, as it simply drops CLIP features).
Before Pre-processing | After Pre-processing
A man with a red helmet on a small moped on a dirt road. | A person with a red helmet on a small moped on a dirt road.
A little girl is getting ready to blow out a candle on a small dessert. | A little child is getting ready to blow out a candle on a small dessert.
A female surfboarder dressed in black holding a white surfboard. | A surfboarder dressed in black holding a white surfboard.
A group of young men and women sitting at a table. | A group of young people sitting at a table.
Table 1: Examples of pre-processing gender-specific captions into gender-neutral ones.

Evaluation
Gender-Neutral Text Queries In this study, we focus on equalizing the search results of gender-neutral text queries. In addition to the existing gender-neutral captions in the test sets, we pre-process the gender-specific captions to construct a purely gender-neutral test corpus, guaranteeing a fair and large-scale evaluation. For every caption, we identify all gender-specific words and remove them or replace them with corresponding gender-neutral words. We show some pre-processing examples in Table 1.
Metrics As introduced in Section 2.2, we employ the fairness metric in Equation (3), Bias@K, to measure the gender bias among the top-K images. In addition, following standard practice, we measure the retrieval performance by Recall@K, defined as the fraction of queries for which the correct image is retrieved among the top-K images.
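Recall@K follows directly from its definition (a minimal sketch with our own function name; each query's ranked list and the index of its gold image are assumed to be given):

```python
def recall_at_k(ranked_lists, correct_indices, k):
    """Recall@K: fraction of queries whose correct image appears in the
    top-K retrieved images."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, correct_indices)
               if gold in ranked[:k])
    return hits / len(correct_indices)
```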

Main Results on MS-COCO & Flickr30K
We report the results comparing our debiasing methods with the baseline methods in Table 2.

Table 2: Results on the MS-COCO (1K and 5K) and Flickr30K test sets. We compare the baseline models (SCAN (Lee et al., 2018) and CLIP (Radford et al., 2021)) and our debiasing methods (FairSample and CLIP-clip) on both the gender bias metric Bias@K and the retrieval metric Recall@K.

Gender Bias at Different Top-K Results
We plot how gender bias varies across different values of K (from 1 to 10) for all the compared methods in Figure 2. We observe that when K < 5, the gender bias has a higher variance because few images are retrieved. When K ≥ 5, the curves tend to flatten. This result indicates that Bias@10 is preferable to Bias@1 for measuring gender bias, as it is more stable. It is also noticeable that CLIP-clip consistently achieves the best fairness performance in terms of Bias@10 on all three test sets compared to the other models.

Tradeoff between Recall and Bias
There is an inherent tradeoff between fairness and accuracy in fair machine learning (Zhao and Gordon, 2019). To achieve the best recall-bias tradeoff in our methods, we further examine the effect of the controlling hyper-parameters: the weight α in FairSample and the number of clipped dimensions m in CLIP-clip. Figure 3 shows the recall-bias curve as the fair sampling weight α varies in [0, 1]. Models with higher recall often suffer higher gender bias, but for FairSample models the fairness improvement outweighs the drop in recall. For example, the model trained fully with fair sampling (α = 1) has the lowest bias and the largest recall drop: it reduces Bias@10 by a relative 22.5% while decreasing Recall@10 by only 10.9% on Flickr30K. We choose α = 0.4 for the final model, which better retains the recall performance.
As shown in Figure 4, we vary the number of clipped dimensions m between 100 and 400 on MS-COCO 1K. We find that clipping too many covariates (1) harms the expressiveness of the image and text representations (Recall@1 drops from 46.1% to 11.3%, Recall@5 from 75.2% to 25.4%, and Recall@10 from 86.0% to 34.2%), and (2) causes a high standard deviation in gender bias.
In light of the harm to expressiveness, we select m = 100 for general use.

Evaluation on Internet Image Search
The aforementioned evaluation results on the MS-COCO and Flickr30K datasets are limited in that they rely on gender labels extracted from human-annotated captions. It is therefore important to measure gender bias on a benchmark where the gender labels are identified by crowd annotators. To this end, we further evaluate on the occupation dataset (Kay et al., 2015), which collects the top 100 Google Image Search results for each gender-neutral occupation search term. 7 Each image is associated with the crowd-sourced gender attribute of the person portrayed in the image. Inspired by Burns et al. (2018) and Tang et al. (2020), we measure the gender bias by computing the difference in expected cosine similarity between male and female occupational images. Given an occupation o with text query c_o, the similarity bias is formulated as:

Bias_sim(o) = E_{v∈V_o^male}[S(v, c_o)] − E_{v∈V_o^female}[S(v, c_o)],

where V_o^male and V_o^female are the sets of images for occupation o labeled "male" and "female", respectively. Figure 5 shows the absolute similarity bias of CLIP and CLIP-clip on the occupation dataset for 18 occupations. We observe that the CLIP model exhibits a severe similarity discrepancy for some occupations, including telemarketer, chemist, and housekeeper, while the clip algorithm alleviates this problem effectively. Note that for doctor and police officer, the CLIP-clip model exaggerates the similarity discrepancy, but the similarity bias remains below 0.01. In general, CLIP-clip is effective at mitigating similarity bias and obtains a 42.3% lower mean absolute bias over the 100 occupations than the CLIP model (0.0064 vs. 0.0111).
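The similarity bias and its aggregate can be sketched as follows (our own function names; the per-image cosine similarities S(v, c_o) to the occupation query are assumed to be precomputed):

```python
import numpy as np

def similarity_bias(sims_male, sims_female):
    """Similarity bias for one occupation o: difference of the expected
    cosine similarity between the query and male vs. female images."""
    return float(np.mean(sims_male) - np.mean(sims_female))

def mean_absolute_bias(per_occupation):
    """Mean absolute similarity bias over occupations, where each entry is
    a (male_similarities, female_similarities) pair."""
    return float(np.mean([abs(similarity_bias(m, f)) for m, f in per_occupation]))
```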

Related Work
Fairness in Machine Learning A number of unfair treatments by machine learning models have been reported recently (Angwin et al., 2016; Buolamwini and Gebru, 2018; Bolukbasi et al., 2016; Otterbacher et al., 2017), and the literature has seen growing interest in proposing defenses, including regularizing disparate impact (Zafar et al., 2015) and disparate treatment (Hardt et al., 2016), promoting fairness through causal inference (Kusner et al., 2017), and adding fairness guarantees to recommendation and information retrieval systems (Beutel et al., 2019; Biega et al., 2018; Morik et al., 2020). Existing fair machine learning solutions can be broadly categorized into pre-processing, in-processing, and post-processing approaches. Pre-processing algorithms typically re-weight or repair the training data that captures label bias or historical discrimination (Kamiran and Calders, 2012; Feldman et al., 2015; Calmon et al., 2017). In-processing algorithms modify the training objective with additional fairness constraints or regularization terms (Zafar et al., 2017; Agarwal et al., 2018; Cotter et al., 2019). Post-processing algorithms enforce fairness constraints through a post hoc correction of a (pre-)trained classifier (Hardt et al., 2016; Calmon et al., 2017). In this work, the fair sampling strategy designed for the contrastive learning framework can be considered an in-processing treatment, while the clip algorithm is in the post-processing regime, featuring an information-theoretic clipping procedure. Our contribution highlights new challenges in reducing gender bias in a multimodal task and specializes new in-processing and post-processing ideas to the domain of image search.
Social Bias in Multi-modality Implicit social bias related to gender and race has been discussed in multimodal tasks including image captioning (Burns et al., 2018; Tang et al., 2020), visual question answering (Manjunatha et al., 2019), face recognition (Buolamwini and Gebru, 2018), and unsupervised image representation learning (Steed and Caliskan, 2021). For example, Zhao et al. (2017) show that models trained on unbalanced data can amplify bias, and that injecting corpus-level Lagrangian constraints can calibrate the bias amplification. Caliskan et al. (2017) demonstrate that the association between the word embeddings of occupations and gendered concepts correlates with the imbalanced gender distribution in text corpora. There is also a series of debiasing techniques in this area. Bolukbasi et al. (2016) propose to surgically alter the embedding space by identifying the gender subspace from gendered word pairs. Manzini et al. (2019) extend the bias-component removal approach to settings where the sensitive attribute is non-binary. Data augmentation approaches remove the implicit bias in the training corpora and train the models on the balanced datasets (Zhao et al., 2018). Our work complements this line of research by examining the gender bias induced by multimodal models in image search results. Our focus on gender bias under gender-neutral language offers new insights on a less explored topic for the community.
Gender Bias in Online Search Systems Our work is also closely connected to studies in the HCI community showing gender inequality in online image search results. Kay et al. (2015) compare gender proportions in occupational image search results and discuss how the bias affects people's perceptions of the prevalence of men and women in each occupation. Other studies examine the prevalence of gender stereotypes on various digital media platforms. Otterbacher et al. (2017) identify gender bias associated with character traits. Nonetheless, these works do not attempt to mitigate gender bias in search algorithms. Our work extends these studies toward understanding how gender biases enter search algorithms and provides novel solutions for mitigating gender bias in two typical model families for image search.

Conclusion
In this paper, we examine gender bias in image search models when the search queries are gender-neutral. As an initial attempt to study this critical problem, we formally identify and quantify gender bias in image search. To mitigate the gender bias perpetuated by the two representative families of image search models, we propose two novel debiasing algorithms, one in-processing and one post-processing. When training a new image search model, the in-processing FairSample method can be used to learn a fairer model from scratch. Meanwhile, the clip algorithm enables lightweight deployment of pre-trained representation models when gender information is accessible.

Broader Impact
The algorithmic processes behind modern search engines, with their extensive use of machine learning algorithms, have great power to determine users' access to information (Eslami et al., 2015). Our research provides evidence that unintentionally using image search models, trained either on in-domain image retrieval datasets or on massive corpora from across the internet, may lead to unequal inclusion of males and females in image search results, even when the search terms are gender-neutral. This inequity can and does have significant impact on shaping and exaggerating gender stereotypes in people's minds (Kay et al., 2015).
This work offers new methods for mitigating gender bias in multimodal models, and we believe the algorithms proposed in this paper have the potential to be deployed in real-world systems. We conjecture that our methods may contribute to the development of responsible image search engines that address other fairness issues as well. For instance, we encourage future work to understand and mitigate the risks arising from other social biases, such as racial bias, in image search results. We also encourage researchers to explore whether the methodology presented in this work can be generalized to quantify and mitigate other bias measures.
Our work has limitations. The gender bias measures and the debiasing methods proposed in this study require acquiring the gender labels of images. Our method for identifying the gender attributes of the people portrayed in images is limited: we rely on contextual cues in the human-annotated captions from the image datasets. The accuracy of such a proxy-based method depends heavily on the coverage of gendered nouns and the inclusiveness of gendered language in the original human annotations. Corruption of the gender labels, due to missing gendered words or inappropriate text pre-processing steps, may introduce biases we have not foreseen into the evaluated metrics. Additionally, the gendered word lists are collected from English corpora and may differ across languages and cultures. Blind application of our methods with improperly acquired gender labels may create image search models that produce even greater inequality, which we strongly discourage. This limitation arises from the unavailability of such sensitive attributes in the source datasets. The lack of relevant data for studying gender bias in image search, and the question of how to acquire gender attributes while preserving the privacy of the people concerned, are themselves important problems in this area. We believe this research will benefit as richer datasets become available.

A Gender Word Lists
We show the word lists for identifying the gender attributes of a caption in Table 3.

feminine words: woman, women, female, girl, lady, mother, mom, sister, daughter, wife, girlfriend
masculine words: man, men, male, boy, gentleman, father, brother, son, husband, boyfriend
gender-neutral words: person, people, human, adult, baby, child, kid, children, guy, teenage, crowd

Table 3: Gender word lists. We identify the gender attributes of captions based on the occurrence of gender-specific words appearing in the sentences.

B.1 Computing Infrastructure
We use a GPU server with 4 NVIDIA RTX 2080 Ti GPUs for training and evaluation.

B.2 Computational Time Costs
We find that SCAN (Lee et al., 2018) and SCAN with fair sampling need about 20 hours to train for 30 epochs on MS-COCO and 8-10 minutes for testing on the 1K test set. In comparison, the pre-trained CLIP (Radford et al., 2021) and CLIP-clip can be evaluated within 1 minute on the MS-COCO 1K test set.

C Qualitative Examples
We conduct a qualitative study of the image search results. Figure 6 shows the results of searching "a person riding a bike": the four rows present the top-5 retrieved images for SCAN, SCAN+FairSample, CLIP, and CLIP-clip, respectively. While all the models retrieve relevant images, we find that FairSample ranks images depicting females higher.

Figure 6: Qualitative analysis of gender bias in image search results. The text query is "a person riding a bike". The rows present the top-5 retrieved images for SCAN, SCAN+FairSample, CLIP, and CLIP-clip, respectively.