T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation

Warning: This paper contains content that may be toxic, harmful, or offensive. In the last few years, text-to-image generative models have achieved remarkable success in generating images of unprecedented quality, accompanied by breakthroughs in inference speed. Despite this rapid progress, human biases that manifest in the training examples, particularly common stereotypical biases such as those regarding gender and skin tone, have still been found in these generative models. In this work, we seek to measure more complex human biases that exist in text-to-image generation. Inspired by the well-known Implicit Association Test (IAT) from social psychology, we propose a novel Text-to-Image Association Test (T2IAT) framework that quantifies the implicit associations between concepts and valence, and the stereotypes reflected in generated images. We replicate previously documented bias tests on generative models, including morally neutral tests on flowers and insects as well as demographic stereotype tests on diverse social attributes. The results of these experiments demonstrate the presence of complex stereotypical behaviors in image generation.


Introduction
Recent progress on generative image models has centered around utilizing text prompts to produce high-quality images that closely align with the provided natural language descriptions (Ramesh et al., 2022; Nichol et al., 2022; Saharia et al., 2022; Yu et al., 2022; Chang et al., 2023). Easy access to these models, notably the open-sourced Stable Diffusion model (Rombach et al., 2022), has made it possible to develop them for a wide range of downstream applications at scale, such as generating stock photos (Raemont, 2022) and creating creative prototypes and digital assets (OpenAI, 2022).

The success of text-to-image generation was enabled by the availability and accessibility of massive image-text paired datasets scraped from the web (Schuhmann et al., 2022). However, it has been shown that data obtained by such curation may contain human biases in various ways (Birhane et al., 2021). Selection bias occurs when the data is not collected from a diverse set of sources, or when the sources themselves do not properly represent the populations of interest. For example, it is reported that nearly half of the data samples in ImageNet came from the United States, while China and India, the two most populous countries in the world, contributed only a small portion of the images (Shankar et al., 2017). It is important to be aware that generative models trained on such datasets may replicate and perpetuate these biases in the generated images (Wolfe et al., 2022).
Our work seeks to quantify the implicit human biases in text-to-image generative models. A large body of literature has identified social biases pertaining to gender and skin tone by analyzing the distribution of generated images across different social groups (Bansal et al., 2022; Cho et al., 2022). These bias metrics build on the assumption that each generated image is associated with a single protected group of interest. In reality, however, an image might not belong to any protected group when there is no discernible human subject, or when the appearance of the detectable human subjects is blurred and unclear. Moreover, an image may belong to multiple demographic groups when more than one human subject is present. Therefore, these bias measures can easily fail to detect subtle differences between the visual concepts reified in the images and the attributes they are associated with.
Unlike previous studies, our work aims to provide a nuanced understanding of stereotypical biases in image generation that are more complex than straightforward demographic biases. Examples of such complex stereotypes include the belief that boys are inherently more talented at math while girls are more adept at language (Nosek et al., 2009), and the observation that people with lighter skin tones are more likely to appear in home or hotel scenes, while people with darker skin tones are more likely to co-occur with object groups like vehicles (Wang et al., 2020). We investigate how these biases are reified and can be quantified in machine-generated images, with a special focus on valence (association with negative or unpleasant vs. positive or pleasant concepts) and stereotypical biases.
In this paper, we propose the Text-to-Image Association Test (T2IAT), a systematic approach to measure the implicit biases of image generation between target concepts and attributes (see Figure 1). One benefit of our bias test procedure is that it is not limited to specific demographic attributes. Rather, the test can be applied to a wide range of concepts and attributes, as long as the observed discrepancy between them can be justified as stereotyping bias by the model owners and users. As use cases, we conduct 8 image generation bias tests, and the results exhibit various human-like biases at different significance levels, as previously documented in social psychology.
Our contributions are twofold: first, we provide a generic test procedure to detect valence and stereotypical biases in image generation models. Second, we conduct a variety of bias tests that provide evidence for the existence of such complex biases, along with their significance levels.

Related Work
Text-to-Image Generative Models aim to synthesize images from natural language descriptions. Image generation has a long history, and much work has been done in this area. Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) and Variational Autoencoders (VAEs) (Van Den Oord et al., 2017), as well as their variants, have shown excellent capability in understanding both natural language and visual concepts and in generating high-quality images. More recently, diffusion models (Ho et al., 2020), such as DALL-E 2, Stable Diffusion (Rombach et al., 2022), and Imagen (Saharia et al., 2022), have gained a surge of attention due to their significant improvements in generating high-resolution, photo-realistic images. Moreover, due to the development of multi-modal alignment (Radford et al., 2021), text-to-image generation has proven a promising intersection between representation learning and generative learning. Although several existing works (Ramesh et al., 2022; Nichol et al., 2022; Saharia et al., 2022; Yu et al., 2022; Chang et al., 2023) aim to improve the quality of image generation, it remains uncertain whether these generative models contain more complex human-like biases.
However, ethical concerns have accompanied the development of text-to-image models. Cultural biases can be induced by the replacement of homoglyphs (Struppek et al., 2023).
There are examples of inappropriate content generated by the Stable Diffusion model (Schramowski et al., 2022), and of fake images generated by text-to-image models that can be misused in real life (Sha et al., 2023). Moreover, membership leakage problems have been found in typical text-to-image generation models (Wu et al., 2022), followed by several works (Hu and Pang, 2023; Duan et al., 2023) that study this issue specifically for diffusion-based image generation models. These concerns all indicate that text-to-image models require a thorough examination with respect to fairness, privacy, and security.
In this paper, we focus on measuring the human biases in Stable Diffusion, but the framework can be easily applied to other generative models.
Biases in Vision and Language Recent studies have examined a wide range of ethical considerations related to vision and language models (Burns et al., 2018; Wang et al., 2022b). Large language models are trained on massive amounts of text. Although this abundance of data can improve model performance in language understanding, generation, and other tasks, the data very likely contains biases, which in turn cause the language model to be biased (Zhao et al., 2017). A variety of systematic methods have been proposed to measure such stereotypical biases (Bolukbasi et al., 2016). The Sentence Encoder Association Test (SEAT) (May et al., 2019) extends the Word Embedding Association Test (WEAT) (Caliskan et al., 2017) to sentence-level representations by substituting the attribute and target words from WEAT into synthetic sentence templates. Another useful measurement is StereoSet (Nadeem et al., 2020), a crowdsourced dataset for measuring four types of stereotypical bias in language models. In addition, Crowdsourced Stereotype Pairs (CrowS-Pairs) (Nangia et al., 2020) is a crowdsourced dataset consisting of pairs of minimally distant sentences, i.e., sentences that differ only in a limited number of tokens. Meade et al. (2021) and Bansal (2022) propose to measure biases in language models by counting how frequently the model prefers the stereotypical sentence in each pair over the anti-stereotypical sentence.
In addition to language models, many prior works have quantified biases in various computer vision tasks and illustrated that pre-trained computer vision models contain biases along different axes (Buolamwini and Gebru, 2018; Wilson et al., 2019; Kim et al., 2021; Wang et al., 2022a; Zhu et al., 2022). It has been demonstrated that such pre-trained models may carry complex human biases into downstream applications, such as image search systems (Wang et al., 2021) and satellite segmentation (Zhang and Chunara, 2022). In particular, Steed and Caliskan (2021) show that self-supervised image encoders, such as iGPT (Chen et al., 2020a) and SimCLR (Chen et al., 2020b), may perpetuate stereotypes among intersectional demographic groups. Our work complements these studies by measuring complex biases in image generation.

Approach
In this work, we adapt the Implicit Association Test (IAT) from social psychology to the task of text-to-image generation. We first review the history of association tests; since existing bias tests focus primarily on word embeddings, we then present the Text-to-Image Association Test (T2IAT), which quantifies the human biases in images generated by text-to-image models.

Implicit Association Test
In social psychology, the Implicit Association Test (IAT) introduced by Greenwald et al. (1998) assesses implicit attitudes and stereotypes that test subjects hold unconsciously, such as associations between concepts (e.g., people with light/dark skin color) and evaluations (e.g., pleasant/unpleasant) or stereotypes. In general, IATs can be categorized into valence IATs, in which concepts are tested for association with positive or negative valence, and stereotype IATs, in which concepts are tested for association with stereotypical attributes (e.g., "male" vs. "female"). During a typical IAT procedure, participants are presented with a series of stimuli (e.g., pictures of Black and White faces, or words related to gay and straight people) and are asked to categorize them as quickly and accurately as possible using a set of response keys (e.g., "pleasant" or "unpleasant" for valence evaluations, "family" or "career" for stereotypes). The IAT score is derived from the difference in response times across a series of categorization tasks with different stimuli and attributes; higher scores indicate stronger implicit biases. For example, the Gender-Career IAT indicates that people are more likely to associate women with family and men with careers.
The IAT was adapted to natural language processing by measuring the associations between different words or concepts in language models (Caliskan et al., 2017). Specifically, the Word Embedding Association Test (WEAT) measures a wide range of human-like biases by comparing the cosine similarity of word embeddings between verbal stimuli and attributes. More recently, WEAT was extended to compare the similarity between embedding vectors for text prompts instead of words (May et al., 2019; Bommasani et al., 2020; Guo and Caliskan, 2021).

Text-to-Image Association Test
We borrow the terminology of the association test from Caliskan et al. (2017) to describe our proposed bias test procedure. Consider two sets of target concepts X and Y, like science and art, and two sets of attribute concepts A and B, like men and women. The null hypothesis is that, regardless of the attributes, there is no difference in the association between the sets of images generated with the target concepts. In the context of the Gender-Science bias test, the null hypothesis says that no matter whether the text prompts describe science or art, the generative model should output images that are equally associated with women and men. We note that in such a gender stereotype setting, a naïve way to measure association is to count the numbers of men and women appearing in the generated images. This simplified measure reduces the fairness criterion to ensuring that the generated images contain equal numbers of pictures depicting women and men, which has been adopted in many prior works (Tan et al., 2020; Bansal et al., 2022).
To test the null hypothesis, we design a standard statistical hypothesis test procedure, as shown in Figure 1. The key challenge is how to measure the association of one target concept X with the attributes A and B, respectively. Our strategy is first to compose neutral text prompts about X that mention neither A nor B. The idea is that the images generated with these neutral prompts should not be affected by the attributes, yet may be skewed towards one of them due to implicit stereotypical biases in the generative model. We then include the attributes in the prompts and generate attribute-guided images. The distance between the neutral and attribute-guided images can then be used to measure the association between the concepts and the attributes.
More specifically, we construct text prompts based on the target concepts, with or without the attributes. Let X and Y denote the neutral prompts related to the target concepts X and Y, respectively. Similarly, we use X_A to represent the set of text prompts created by editing X with a set of attribute modifiers corresponding to the attribute A. We feed these text prompts into the text-to-image generative model and use G(·) to denote the set of images generated from a set of input prompts. For ease of notation, we use lowercase letters to represent image samples and accent them with right arrows to represent the vector representations of the images. We consider the following test statistics:

• Differential association measures the difference of the association between the target concepts and the attributes:
$$S(X, Y, A, B) = \frac{1}{|G(X)|}\sum_{x \in G(X)} \mathrm{Asc}(\vec{x}, X_A, X_B) \;-\; \frac{1}{|G(Y)|}\sum_{y \in G(Y)} \mathrm{Asc}(\vec{y}, Y_A, Y_B) \quad (1)$$

Here Asc(x, X_A, X_B) is the association of one sample image with the attributes, i.e.,

$$\mathrm{Asc}(\vec{x}, X_A, X_B) = \frac{1}{|G(X_A)|}\sum_{a \in G(X_A)} \cos(\vec{x}, \vec{a}) \;-\; \frac{1}{|G(X_B)|}\sum_{b \in G(X_B)} \cos(\vec{x}, \vec{b}) \quad (2)$$

In Eq. (2), cos(·, ·) is the similarity measure between images. While there are several different methods for measuring the distance between images, we choose to compute the cosine similarity between image embedding vectors generated with pre-trained vision encoders. In our experimental evaluation, we use the vision encoder of the CLIP model (Radford et al., 2021) for convenience.
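As a concrete illustration, the two statistics above can be computed directly from image embeddings. The following is a minimal sketch, assuming the generated images have already been embedded (e.g., with a CLIP vision encoder); the function names and array layout are ours, not from a released implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(x, emb_a, emb_b):
    """Asc(x, X_A, X_B): mean similarity of image embedding x to the
    attribute-A images minus its mean similarity to the attribute-B
    images (Eq. 2)."""
    return (np.mean([cosine(x, a) for a in emb_a])
            - np.mean([cosine(x, b) for b in emb_b]))

def differential_association(emb_x, emb_y, emb_xa, emb_xb, emb_ya, emb_yb):
    """S(X, Y, A, B): mean association of concept-X images minus mean
    association of concept-Y images (Eq. 1)."""
    asc_x = [association(x, emb_xa, emb_xb) for x in emb_x]
    asc_y = [association(y, emb_ya, emb_yb) for y in emb_y]
    return float(np.mean(asc_x) - np.mean(asc_y))
```

In practice, each `emb_*` argument would hold the embeddings of the images generated from the corresponding neutral or attribute-guided prompts.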
• p-value measures the likelihood that a random permutation of the target concepts would produce a greater differential association than the observed one. To perform the permutation test, we randomly split the set X ∪ Y into two partitions X̂ and Ŷ of equal size; note that the prompts in X̂ might be related to concept Y and those in Ŷ might be related to concept X. The p-value of the permutation test is given by

$$p = \Pr\left[\, S(\hat{X}, \hat{Y}, A, B) > S(X, Y, A, B) \,\right] \quad (3)$$

The p-value represents the degree to which the differential association is statistically significant. In practice, we simulate 1,000 random permutations to estimate the p-value for the sake of efficiency.
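The permutation test described above can be sketched as follows, assuming the per-image association scores for the two concepts have already been computed; the function name and the fixed seed are illustrative choices of ours.

```python
import numpy as np

def permutation_p_value(asc_x, asc_y, n_perm=1000, seed=0):
    """Estimate the p-value (Eq. 3): pool the per-image association
    scores, randomly re-partition them into two equal-size groups, and
    count how often the permuted statistic exceeds the observed one."""
    rng = np.random.default_rng(seed)
    observed = np.mean(asc_x) - np.mean(asc_y)
    pooled = np.concatenate([asc_x, asc_y])
    n = len(asc_x)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if np.mean(perm[:n]) - np.mean(perm[n:]) > observed:
            count += 1
    return count / n_perm
```

When no permutation exceeds the observed statistic, the estimate is 0, which is reported as p < 10⁻³ given 1,000 simulated runs.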
• Effect size d is a normalized measure of how separated the distributions of the associations of the two target concepts are. We adopt Cohen's d to compute the effect size:

$$d = \frac{\operatorname{mean}_{x \in G(X)} \mathrm{Asc}(\vec{x}, X_A, X_B) - \operatorname{mean}_{y \in G(Y)} \mathrm{Asc}(\vec{y}, Y_A, Y_B)}{s} \quad (4)$$

where s is the pooled standard deviation of the samples Asc(x, X_A, X_B) and Asc(y, Y_A, Y_B). Following Cohen's convention, an effect size is classified as small (d = 0.2), medium (d = 0.5), or large (d ≥ 0.8).
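A small sketch of the effect size computation, under the standard pooled-standard-deviation form of Cohen's d (with sample variances); the function name is ours.

```python
import numpy as np

def cohens_d(asc_x, asc_y):
    """Effect size d (Eq. 4): difference of group means divided by the
    pooled standard deviation of the two association-score samples."""
    asc_x, asc_y = np.asarray(asc_x, float), np.asarray(asc_y, float)
    nx, ny = len(asc_x), len(asc_y)
    pooled_var = (((nx - 1) * asc_x.var(ddof=1) + (ny - 1) * asc_y.var(ddof=1))
                  / (nx + ny - 2))
    return float((asc_x.mean() - asc_y.mean()) / np.sqrt(pooled_var))
```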
We present the whole bias test procedure in Algorithm 1. The defined bias measures the degree to which the generations of the target concepts exhibit a preference towards one attribute over another. One qualitative example is provided in the first column of Figure 2: although the prompt for those images does not specify gender, almost all of the generated images for science and career depict boys.

Experimental Setup

Concepts and Text Prompts
We replicate 8 bias tests for text-to-image generative models, including 6 valence tests: Flowers vs. Insects, Musical Instruments vs. Weapons, Judaism vs. Christianity, European American vs. African American, Light Skin vs. Dark Skin, and Straight vs. Gay; and 2 stereotype tests: Science vs. Arts and Career vs. Family. Each bias test includes two target concepts and two valence or stereotypical attributes. Following Greenwald et al. (1998), we adopt the same set of verbal stimuli for each of the concepts and attributes; the verbal stimuli for the selected concepts are presented in Table 3. For valence tests, the evaluation attributes are pleasant and unpleasant. For stereotype tests, the stereotyping attributes are male and female.
We systematically compose a set of representative text prompts from the collection of verbal stimuli for each pair of compared target concepts and attributes. The constructed text prompts are fed into the diffusion model to generate images. We give the specific text prompts for each bias test in Section 5.
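The prompt composition step can be sketched as simple template filling. The template strings and word lists below are illustrative placeholders, not the paper's exact stimuli.

```python
def compose_prompts(template, stimuli, attribute_words=None):
    """Fill a prompt template with each verbal stimulus; if attribute
    words are given (e.g., pleasant/unpleasant terms), append each one
    to produce attribute-guided prompts instead of neutral ones."""
    prompts = []
    for stimulus in stimuli:
        base = template.format(stimulus)
        if attribute_words is None:
            prompts.append(base)  # neutral prompt
        else:
            prompts.extend(f"{base}, {w}" for w in attribute_words)
    return prompts

# Hypothetical stimuli for the Flowers vs. Insects test:
neutral = compose_prompts("a photo of {}", ["rose", "daffodil"])
pleasant = compose_prompts("a photo of {}", ["rose"], ["joy", "love"])
```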

Generative Models
For our initial evaluation, we use the Stable Diffusion model stable-diffusion-2-1 (Rombach et al., 2022). We adopt the standard parameters provided in the Hugging Face API and generate 10 images of size 512 × 512 for each text prompt, yielding hundreds of images for each concept. Through practical testing, we determined that this number of generations produces accurate estimates of the evaluated metrics with a high level of confidence. The number of denoising steps is set to 50 and the guidance scale to 7.5. The model uses OpenCLIP-ViT/H (Radford et al., 2021) to encode text descriptions.

Valence Tests
Flowers and Insects We begin by exploring non-offensive stereotypes about flowers and insects, as these do not involve any demographic groups. The original IAT found that most people take less time to associate flowers with words that have pleasant meanings and insects with words that have unpleasant meanings (Greenwald et al., 1998). To replicate this test, we use the same set of verbal stimuli for the flower and insect categories that were used in the IAT, as described in Table 3. We construct the text prompt "a photo of {flower/insect}" to generate images without any valence intervention. In parallel, we append words expressing pleasant or unpleasant attitudes to the constructed prompt to generate images with positive or negative valence. Examples of generated images are shown in Figure 2. We report the evaluated differential association S(X, Y, A, B), p-value, and effect size d in Table 1. To estimate the p-value, we perform the permutation test for 1,000 runs and find that no permutation of the images yields a higher association score, indicating that the p-value is less than 10⁻³. We note that an effect size of 0.8 generally indicates a strong association between concepts; the effect size of 1.492 found in this test suggests that flowers are significantly more strongly associated with positive valence, while insects are more strongly associated with negative valence. Our observation demonstrates that human-like biases arise in image generation models even when the concepts involved are not associated with any social concerns.

Musical Instruments and Weapons
To further understand the implicit biases present in text-prompt-generated images for non-offensive stereotypes, we perform the test on another pair of non-offensive concepts, musical instruments and weapons, using the verbal stimuli from the original IAT. Similar to the test on flowers and insects, we first generate images of the objects themselves with the text prompt "a picture of {musical instrument/weapon}"; we then modify the text prompts to include pleasant and unpleasant attitudes and generate images with positive or negative valence. We report the evaluated differential association S(X, Y, A, B), p-value, and effect size d in Table 1. The differential association score of 0.015 indicates a small absolute difference in the association between the target concepts of musical instruments and weapons and the attributes of pleasant and unpleasant. However, the effect size of 0.528 is medium, implying that musical instruments have a stronger association with positive valence, while weapons show a stronger association with negative valence.

Judaism and Christianity
We also perform the valence test on concepts concerning religion, particularly Judaism and Christianity. Consistent with the previous tests, we construct two sets of text prompts from the verbal stimuli used in the IAT for Judaism and Christianity and for pleasant and unpleasant. The first set comes without valence intervention, using only the provided verbal stimuli for Judaism and Christianity. The second set incorporates terms linked to pleasant and unpleasant attitudes. We generate images from both sets of prompts.

The valence test for this pair of concepts yields a very small effect size, −0.099, suggesting a rather neutral attitude towards Judaism and Christianity, with only a slight pleasantness towards Christianity and a slight unpleasantness towards Judaism. The differential association score of −0.003 likewise shows a tiny difference in the association between the two religions and the two valence attributes. This finding contrasts with the religion stereotype previously documented in IAT studies.
European American and African American In this valence test, we explore implicit racial stereotypes, beyond non-harmful ones, concerning European Americans and African Americans. The original IAT paper provides two sets of common European American and African American names and reports that it is much easier for participants to associate European American names with words suggesting a pleasant attitude and African American names with words implying an unpleasant attitude. In our test, we use the verbal stimuli for European American and African American names retrieved from Tzioumis (2018) to construct text prompts. For the images that are not valence-related, we use the text prompt "a portrait of {European American name/African American name}". Meanwhile, we create valence-related text prompts by including terms that embody pleasant and unpleasant attitudes. From the effect size of 0.323, we observe a modest association between European American names and pleasant terms, and between African American names and unpleasant terms. The differential association score of 0.011 shows a subtle association between the concepts of European American and African American and the attributes of pleasant and unpleasant.
Light Skin and Dark Skin This valence test examines biases towards people with light skin and dark skin within the same racial group. We use the verbal stimuli collected by Project Implicit, a project initiated by Nosek et al. (2007) that aims to educate people about biases. Following the pattern of our proposed test, we create a set of text prompts without valence for both light skin and dark skin, and another set of text prompts that incorporate the valence attributes of pleasant and unpleasant. We calculate the differential association S(X, Y, A, B), p-value, and effect size d for the images generated from the constructed prompts. We obtain a considerably large effect size of −1.237, indicating that light skin is much more closely associated with the unpleasant attribute, while dark skin has a strong association with the pleasant attribute. In addition, the p-value of 0.019 is below the conventional significance threshold of 0.05.

Straight and Gay
We examine implicit bias towards sexuality in this valence test, targeting the concepts of straight and gay. As in the other valence tests, we create text prompts without the factor of valence, along with prompts composed with pleasant and unpleasant attitudes, and produce the corresponding images with the text-to-image generative model. We obtain an effect size of 1.113, well above the 0.8 threshold for a large effect. This suggests a significantly strong association between the concept of straight and the attribute of pleasant, and between the concept of gay and the attribute of unpleasant. We also note that the p-value is 0.003, which is below 0.005.
These valence tests show that not only non-harmful human biases but also hostile stereotypical biases, such as inter-racial, intra-racial, and sexual-orientation biases, exist in text-to-image generative models.

Stereotype Tests
We conduct two gender-related stereotypical tests: gender-science and gender-career tests.
Science and Art We use the text prompt "a person studying {science/art}" for image generation. To generate images associated with male and female attributes, we modify "person" with gender-specific words, such as "woman", "girl", "man", "boy", etc. The evaluated effect size of 0.193 is small, demonstrating that the distributions of the association scores do not differ much. In addition, the p-value of 0.200 is relatively large. This bias test demonstrates that the evaluated generative model does not exhibit the bias towards science and art that is documented in human biases.

Figure 3: For each occupation, we compare the association score with gender and plot its distribution. The x-axis represents the extent to which the generated images are associated with male or female. Our analysis suggests that computer programmers and pharmacists are more strongly associated with men, while elementary school teachers, librarians, and announcers are more strongly associated with women.

Career and Family
The original IAT found that females are more associated with family and males with career (Nosek et al., 2002). To replicate this test with image generation, we use the text prompt template "a person focusing on {career/family}" to generate images. We find that the effect size of 0.639 is relatively large and the p-value is less than 10⁻³, indicating that career is significantly more strongly associated with male than with female.

Gender Stereotype in Occupations
Prior work has demonstrated that text prompts pertaining to occupations may lead the model to reconstruct social disparities regarding gender and racial groups, even though they make no mention of such demographic attributes (Bianchi et al., 2022). We are also interested in how the generated images are skewed towards women and men, as assessed by their association scores with gender. We collect a list of common occupation titles from the U.S. Bureau of Labor Statistics. For each occupation title, we construct the gender-neutral text prompt "A photo of a {occupation}", and gender-specific versions by adding gendered descriptions. For each occupation, we use Stable Diffusion to generate 100 gender-neutral images, 100 masculine images, and 100 feminine images. We use Eq. (2) to calculate the association score between occupation and gender attributes.
We plot the distribution of association scores, with quartiles, for eight different occupations in Figure 3. The figure shows that the 0.75 quantiles of the association scores for computer programmers and pharmacists are higher than the others by a large margin, indicating that these occupations are more strongly associated with men. Conversely, the mean association scores for elementary school teachers, librarians, announcers, and chemists are negative, indicating that these occupations are more strongly associated with women. The association scores for chef and police are near neutral, suggesting that there is insufficient evidence to establish a stereotype.

Stereotype Amplification
Do images generated by the diffusion model amplify the implicit stereotypes in the textual representations used to guide image generation? Specifically, we examine occupational images and calculate association scores between the text prompts by substituting the CLIP text embeddings into Eq. (2) and Eq. (1). We then compare these associations for text prompts to the associations for the generated images to investigate whether the biases are amplified. Figure 4 shows the stereotype amplification between text prompts and generated images. For each occupation, we use an arrow to represent the change of association along the gender axis. We observe that the associations are amplified substantially for most occupations. In particular, the textual association between computer programmer and gender is only −0.0039 but is enlarged to 0.0186 for images. Similar amplifications are observed for elementary school teachers, librarians, and chemists. For the occupation of chef, the association of the text prompts is skewed towards female, while the association of the images is skewed towards male.

Table 2: For each pair of concepts and attributes, we report the fraction of images chosen as being more closely associated with pleasant or male attributes. We find that the machine-rated association scores properly represent human perceptions.

Comparison to Human Evaluation
We recruit university students to evaluate the generated images and compare how human perceptions differ from the machine-evaluated association scores. Specifically, for each set of concepts, we ask three student participants to view 20 images generated with neutral prompts and choose which valence or stereotypical attribute is more closely associated with each image. We report the fraction of images chosen as being more closely associated with pleasant or male attributes. As shown in Table 2, the humans' preferred associations align with the strength of our association scores. For flowers vs. insects and musical instruments vs. weapons, humans mostly prefer to associate flowers and musical instruments with pleasant, and insects and weapons with unpleasant. For science vs. arts and career vs. family, we find that the significance of the bias is reduced. The Kendall's τ coefficient between the machine-evaluated and human-rated scores is 0.55, indicating that the association scores properly represent human perceptions.
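The rank agreement above can be computed with Kendall's τ. Below is a small self-contained sketch of the simple (τ-a) variant, counting concordant and discordant pairs; the score lists in the test are illustrative, not the paper's data.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two paired score lists:
    (concordant pairs - discordant pairs) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs
```

Note that with ties present, variants such as τ-b adjust the denominator; `scipy.stats.kendalltau` provides a tested implementation.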

Discussion
We applied our bias test to images generated by a state-of-the-art text-to-image generative model, measuring valence and gender associations across a variety of concepts such as careers, religions, and skin tone. In the valence test for images generated for the Straight vs. Gay concepts, we observed a significant bias: pleasant attitudes towards people with a straight sexual orientation and unpleasant attitudes towards people with a gay sexual orientation; these findings mirror well-documented human biases. Although we use Stable Diffusion in our experiments, the proposed bias test can be applied to other generative models with an analogous experimental setup to quantify their implicit biases.
The proposed Text-to-Image Association Test is a principled approach for measuring complex implicit biases in image generation. The primary results illustrate valence and stereotypical biases across various dimensions, ranging from morally neutral to demographically sensitive, in a state-of-the-art generative model at different scales. This research adds to the growing literature on AI ethics by highlighting the complex biases present in AI-generated images and serves as a caution for practitioners to be aware of these biases.

Limitations
Our work has some limitations. Although we use the same verbal stimuli as previous IAT tests to create text prompts, some stimuli that could represent the concepts are likely underrepresented. The approach we adopted for measuring the distance between images may also be biased: the current bias test procedure applies the visual encoder of OpenAI's CLIP model, and it is unclear whether this image encoder injects additional biases into the latent visual representations.

Figure 1:
Figure 1: Text-to-Image Association Test (T2IAT) procedure. We instantiate the proposed bias test on Gender-Science. We use the text prompt "A photo of a child studying astronomy" to generate neutral images. Then we substitute "child" with feminine and masculine words and generate attribute-specific images. We calculate the average difference in the distance between the neutral and attribute-specific images as a measure of association.

Algorithm 1
Bias test procedure
Input: concepts X and Y, attributes A and B.
Output: S(X, Y, A, B), p, d.
1: Construct a set of neutral prompts related to the concepts X and Y. Then construct attribute-guided prompts for attributes A and B, respectively.
2: For Z ∈ {X, Y}, generate the sets of images G(Z), G(Z_A), and G(Z_B) from the text prompts.
3: Compute S(X, Y, A, B) using Eq. (1).
4: Run the permutation test to compute the p-value by Eq. (3).
5: Compute the effect size d by Eq. (4).
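Steps 4 and 5 of the procedure can be sketched as follows. Since Eq. (3) and Eq. (4) are not reproduced in this section, the snippet is a hedged, WEAT-style reconstruction: a one-sided permutation test over re-partitions of the per-image associations, and a standardized-mean-difference effect size. Function names and the exact test statistic are illustrative assumptions.

```python
import random
import statistics

def effect_size(assoc_x, assoc_y):
    """WEAT-style effect size d (in the spirit of Eq. (4)): difference of
    mean per-image associations, standardized by the pooled sample stdev."""
    pooled = statistics.stdev(assoc_x + assoc_y)
    return (statistics.mean(assoc_x) - statistics.mean(assoc_y)) / pooled

def permutation_p_value(assoc_x, assoc_y, n_perm=10_000, seed=0):
    """One-sided permutation test (in the spirit of Eq. (3)): fraction of
    random re-partitions of X ∪ Y whose test statistic is at least as
    large as the observed one."""
    rng = random.Random(seed)
    observed = statistics.mean(assoc_x) - statistics.mean(assoc_y)
    combined = list(assoc_x) + list(assoc_y)
    n_x = len(assoc_x)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(combined)
        stat = (statistics.mean(combined[:n_x])
                - statistics.mean(combined[n_x:]))
        if stat >= observed:
            count += 1
    return count / n_perm
```

A small p-value together with a large |d| indicates that the measured association is unlikely to arise from a random partition of the images.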

Figure 2:
Figure 2: Examples of generated images. Images in the first row are generated with text prompts describing science or career, while images in the second row are generated with text prompts describing arts or family. The first column of images is generated with neutral prompts, without adding any gender-specific words. The second and third columns are generated with gender-specific prompts, formed by appending gendered words to the corresponding neutral prompts.

Figure 3:
Figure 3: Gender stereotype in occupations. For each occupation, we compute the association score with gender and plot their distribution. The x-axis represents the extent to which the generated images are associated with male or female. Our analysis suggests that computer programmers and pharmacists are more strongly associated with men, while elementary school teachers, librarians, and announcers are more strongly associated with women.

Figure 4:
Figure 4: Stereotype amplification. For each occupation, we compare the association scores for generated images to the association scores for the text prompts. The association scores for the text prompts are represented by the tails of the arrows, and those for the images by the heads of the arrows.

Table 1:
Evaluated association scores, p-values, and effect sizes for 8 bias tests. Larger absolute values of the association score and effect size indicate a larger bias; a smaller p-value indicates a more significant test result.

Table 2:
Human evaluation results.