How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?

Text-to-image generative models have achieved unprecedented success in generating high-quality images based on natural language descriptions. However, these models have been shown to favor specific social groups when prompted with neutral text descriptions (e.g., 'a photo of a lawyer'). Following Zhao et al. (2021), we study the effect on the diversity of the generated images when an ethical intervention that supports equitable judgment (e.g., 'if all individuals can be a lawyer irrespective of their gender') is added to the input prompts. To this end, we introduce the Ethical NaTural Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset to evaluate the change in image generations conditional on ethical interventions across three social axes: gender, skin color, and culture. Through CLIP-based and human evaluation on minDALL·E, DALL·E-mini and Stable Diffusion, we find that the model generations cover diverse social groups while preserving the image quality. In some cases, the generations even become anti-stereotypical in the presence of an ethical intervention (e.g., models tend to create images of individuals perceived as men when fed prompts about makeup). Preliminary studies indicate that a large change in the model predictions is triggered by certain phrases in the ethical interventions, such as 'irrespective of gender' in the context of gender bias. We release code and annotated data at https://github.com/Hritikbansal/entigen_emnlp.


Introduction
Recent text-to-image generative models (Ramesh et al., 2021, 2022; Ding et al., 2021; Saharia et al., 2022; Nichol et al., 2021; Rombach et al., 2022) can synthesize high-quality, photo-realistic images conditional on natural language text descriptions in a zero-shot fashion. For instance, they can generate an image of 'an armchair in the shape of an avocado', which rarely appears in the real world. However, despite the unprecedented zero-shot abilities of text-to-image generative models, recent experiments with small-scale instantiations (such as minDALL·E) have shown that prompting the model with neutral text ('a photo of a lawyer'), devoid of any cues towards a social group, still generates images that are biased towards white males (Cho et al., 2022).
In our work, we consider three bias axes: 1) the {man, woman} grouping across the gender axis, 2) the {light-skinned, dark-skinned} grouping across the skin color axis, and 3) the {Western, Non-Western} grouping across the cultural axis. The existence of any gender or skin color bias (see the Ethics Statement for more discussion) causes potential harm to underrepresented groups by amplifying biases present in the dataset (Birhane et al., 2021; Barocas et al., 2018). Hence, it is essential for a text-to-image system to generate a diverse set of images.
To this end, we study whether the presence of additional knowledge that supports equitable judgment helps diversify model generations. Being part of the text input, this knowledge acts as an ethical intervention over the original neutral prompt (Zhao et al., 2021). Ethical interventions provide models with ethical advice and do not provide any visual cues towards a specific social group. For instance, in the context of generating 'a photo of a lawyer', which tends to be biased towards 'light-skinned man', we study whether prompting the model with an ethically intervened prompt (e.g., 'a photo of a lawyer if all individuals can be a lawyer irrespective of their gender') can diversify the outputs.
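The construction above, a neutral template plus an intervention appended as plain text, can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the released ENTIGEN code; the gender and culture intervention strings are taken from examples in this paper.

```python
from typing import Optional

# Neutral template and example ethical interventions, per the paper's examples.
NEUTRAL_TEMPLATE = "a photo of a {attribute}"
INTERVENTIONS = {
    "gender": "if all individuals can be a {attribute} irrespective of their gender",
    "skin color": "if all individuals can be a {attribute} irrespective of their skin color",
    "culture": "from diverse cultures",
}

def build_prompt(attribute: str, axis: Optional[str] = None) -> str:
    """Return the neutral prompt, optionally extended with an ethical intervention."""
    prompt = NEUTRAL_TEMPLATE.format(attribute=attribute)
    if axis is not None:
        prompt += " " + INTERVENTIONS[axis].format(attribute=attribute)
    return prompt
```

For example, `build_prompt("lawyer", "gender")` yields the intervened prompt used in the lawyer example above.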
We introduce the Ethical NaTural Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset to study the change in the perceived societal bias of text-to-image generative models in the presence of ethical interventions. ENTIGEN covers prompts to study bias across three axes: gender, skin color, and culture. The neutral prompts in the ENTIGEN dataset are intervened with corresponding ethical knowledge, as illustrated in Figure 1. We evaluate ENTIGEN on the publicly available minDALL·E (Kim et al., 2021), DALL·E-mini (Dayma et al., 2021), and Stable Diffusion (Rombach et al., 2022) models, automatically with the CLIP model (Radford et al., 2021) and manually with human annotators from MTurk.
Through our experiments, (1) we show that a few ethical interventions lead to the diversification of image generations across different groups while preserving the image generation quality. Interestingly, in some cases, we observe that the bias can be flipped towards the originally underrepresented groups with ethical interventions (Appendix Figure 6). (2) Moreover, we find that interventions containing keywords such as 'irrespective of gender' and 'culture' tend to trigger a large change in model generations. We further analyze the pre-training data to understand the contexts in which these keywords are used and how they may affect the diversity of the generations.

Dataset and Evaluation Methods
In this section, we introduce the process of building the ethical intervention benchmark ENTIGEN and of evaluating the images generated by text-to-image generative models.

ENTIGEN Benchmark Construction
Initially, we determine three axes of societal bias to study: gender, skin color, and culture. Specifically, to investigate gender and skin color bias, ENTIGEN consists of prompts belonging to the profession and object categories. To assess cultural bias, it consists of prompts surrounding wedding ceremonies, as they are ubiquitous and diverse across different regions (Bell et al., 1997; Xu and Xu, 2018; Acharya et al., 2020).
In total, we create 246 prompts based on an attribute set containing diverse professions, objects, and cultural scenarios.

Image Generation.
Each prompt in ENTIGEN is used to generate nine images from each text-to-image generative model. We choose the publicly available models minDALL·E, DALL·E-mini, and Stable Diffusion for analysis, mainly because these three models can generate high-quality images efficiently. We provide more details in Appendix B.

Evaluation Metrics.
We evaluate the diversity among the images generated by the models. We focus on the gap between the numbers of images associated with the different groups (mentioned in §1), which measures the demographic disparity across the various social axes. Specifically, for a prompt (e.g., 'a photo of a [profession] if all genders can be a [profession]') filled with each attribute k (e.g., police officer) in category P (e.g., profession), we count s_{k,a}^g (the number of images with a man) and s_{k,b}^g (the number of images with a woman), associated with the two groups a (man) and b (woman) across a specific social axis g (gender). The diversity score for axis g towards its groups for category P is then:

DS_P^g = (1 / |P|) * Σ_{k ∈ P} |s_{k,a}^g - s_{k,b}^g| / (s_{k,a}^g + s_{k,b}^g)    (1)

where g is one of {gender, skin color, culture}, P is one of {profession, object, wedding}, and k can be any attribute in the selected category P. Generations that could not be assigned a gender or skin color, due to uncertainty in the judgments of the agents, are not included in this metric. Smaller scores represent more diverse outputs. The normalization factor in the denominator of Eq. (1) allows us to compare model generations from two different prompts, original and ethically intervened, as they can have different numbers of image generations belonging to either of the two social groups. To quantify the bias and its direction, given a specific attribute k, we directly compute the normalized difference of the two counts belonging to groups a and b:

bias_k^g = (s_{k,a}^g - s_{k,b}^g) / (s_{k,a}^g + s_{k,b}^g)    (2)

A greater absolute value of bias_k^g indicates greater bias, and vice versa. Built upon these metrics, CLIP-based and human evaluations are used to assess output diversity and bias. Due to a limited budget, we select a subset of the professions and objects for human annotators to evaluate. For the entire set of images, we use automatic CLIP-based evaluation as a complementary method. Appendix C provides more details about our evaluations.
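The two metrics above can be sketched in a few lines. This is a minimal illustration of Eq. (1) and Eq. (2) from per-attribute group counts, not the authors' released evaluation code.

```python
def bias_score(count_a: int, count_b: int) -> float:
    """Eq. (2): normalized difference between the two group counts for one attribute.

    Positive values indicate bias towards group a, negative towards group b.
    """
    return (count_a - count_b) / (count_a + count_b)

def diversity_score(counts: dict) -> float:
    """Eq. (1): average absolute normalized gap over all attributes in a category.

    `counts` maps an attribute (e.g., 'police officer') to a pair
    (images assigned to group a, images assigned to group b).
    Smaller scores indicate more diverse generations.
    """
    gaps = [abs(a - b) / (a + b) for a, b in counts.values()]
    return sum(gaps) / len(gaps)
```

For example, a model that generates nine men and zero women for 'police officer' but six men and three women for 'doctor' would have a profession diversity score of (1.0 + 1/3) / 2 ≈ 0.67 over those two attributes.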
Note that we are aware that the CLIP model may itself be biased towards certain groups (Zhang et al., 2022). We measure the consistency between the gender and skin color determined by the CLIP model and by human annotators in the images generated for a subset of attributes. We find that CLIP-based determinations agree with the human annotations at a rate of 78-85% for gender recognition, while for skin color the rate drops to 67-78%. We therefore apply CLIP-based evaluation on the gender axis only, as the predictions on gender are more consistent with the humans'.
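Schematically, CLIP-based group assignment compares each generated image's embedding against text embeddings for candidate group descriptions and picks the closest one. The sketch below illustrates this mechanism with stand-in embedding vectors; in practice the vectors would come from a CLIP image/text encoder, and the exact prompts used for the groups are an assumption here.

```python
import numpy as np

def assign_group(image_emb: np.ndarray, text_embs: dict) -> str:
    """Return the group whose text embedding has the highest cosine similarity
    with the image embedding (CLIP-style zero-shot classification)."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(text_embs, key=lambda g: cos(image_emb, text_embs[g]))
```

The agreement rate reported above is then simply the fraction of images for which this assignment matches the human annotator's label.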

CLIP-based Results
We investigate the effect of the ethical interventions on the gender diversity score, Eq. (1), for the profession category in Table 1 (Columns 3-5). We observe that the gender-specific ethical intervention promotes gender diversity (Rows 2-3) for all the models. We also find that the prompt with 'irrespective of their gender' improves the gender diversity score much more than the prompt simply stating that 'all genders can be [profession]'. Additionally, we observe that an ethical intervention with respect to skin color does not have a significant effect on the gender diversity of the model generations (Rows 4-5). Even though irrelevant interventions should not change the diversity scores, we observe that the scores are affected by their presence (Rows 6-7). We present the gender diversity scores evaluated through CLIP for the object category in Appendix Table 6. To ensure the reliability of our evaluation, we also perform human annotation for a better assessment.

Human Evaluation Results
We present the human evaluation results for the profession category in Table 1 (Columns 5-8). We observe that axis-specific ethical instructions with 'irrespective of {gender, skin color}' produce better diversity scores (Rows 2 and 4). We also find that, in most cases, adding irrelevant instructions does not improve the diversity scores the way ethical interventions do. We can draw similar conclusions for the object category from Appendix Table 6.
We also present the human evaluation results along the cultural axis in Table 2. We observe that the generations of all the models become more diverse in the presence of the cultural intervention. Additionally, the cultural diversity is not influenced by irrelevant instructions.
So far, we have focused on the effect on the diversity scores. However, diversity only captures the uniformity of image generations across groups and does not indicate the direction of the bias. Hence, we also calculate the bias score, Eq. (2). Our results reveal that the presence of ethical interventions may flip the direction of a model's bias. For instance, DALL·E-mini generates men and dark-skinned individuals with makeup (Appendix Fig. 6). Similarly, Stable Diffusion generates more images of women than of men for the police profession when prompted with the gender ethical intervention.
Further visual inspection of Figure 4 suggests that the Stable Diffusion model synthesizes multiple humans in a single image, which prevents the human annotators from assigning a particular gender or skin color to them. Such model generations are disregarded during the diversity score computation, preventing us from making a reliable estimate of the Stable Diffusion generations through the diversity score alone. We believe that our work motivates further studies on the sensitivity of text-to-image model generations to ethical instructions.

Quality of Image Generation
Do these abstract interventions have side effects, such as hurting the quality of the generations? We ask human annotators to judge whether the generated images are of good quality, conditional on the original prompt and the ethical intervention. We present our analysis in Table 3 for the same subset of five attributes (police, doctor, makeup, suit, scarf) for the gender and skin color bias study, and three attributes (bride, groom, wedding) for the cultural bias study (§3.2). Compared to generating with the original prompts, and except for DALL·E-mini and Stable Diffusion on the profession category, the number of good-quality generations reduces only slightly (0-1.5 images per attribute) in the presence of the ethical interventions. This presents a positive case for using ethical interventions to diversify model generations while preserving their quality.
How important are the phrases present in an ethical intervention?
In §3, we observed that ethical interventions can elicit large changes in the diversity scores in some cases. However, it is still unclear which phrases in an ethical intervention lead to such changes in the model's behaviour. To this end, we perform a preliminary analysis of the generations of DALL·E-mini for the prompt 'a photo of a {person wearing a makeup/police officer} if all individuals can {wear a makeup/be a police officer} irrespective of their gender'.
We find that removing the phrase 'irrespective of their gender' from the ethical intervention leads to generations biased towards 'woman' and 'man' for the 'makeup' and 'police officer' attributes, respectively. This trend is identical to what we observe for the original prompts without intervention. It suggests that the model may have captured the semantics of the phrase based on its usage in the pre-training dataset. Further analyzing the pre-training data (Sharma et al., 2018), we observe that the phrase 'irrespective of' is used 37 times to elicit equitable judgment based on the context in the captions (Table 7), but the entire phrase 'irrespective of their gender' appears only once.
There is also the possibility that captions containing the words 'gender' and 'makeup' are associated with images of men in the pre-training datasets (Changpinyo et al., 2021; Sharma et al., 2018) and thus contribute to generating more men. However, we find that the six images with the words 'gender' and 'makeup' in their captions only contain people who are perceived as women by humans. We also find that there is only one image, without any person clearly visible, with 'gender' and 'police' in its caption. Hence, we further verify the effect of the phrase 'irrespective of their gender' on generating diverse images despite its near absence from the pre-training data. Why DALL·E-mini can generate anti-stereotypical images with such ethical interventions needs further exploration in future work.
Additionally, further analysis of the co-occurrence of the word 'culture' with 'Western' (75 times), 'Indian' (394 times), and 'Chinese' (322 times) explains the generation of images belonging to these Non-Western cultures when the original prompts are intervened with ethical interventions containing the 'culture' keyword (Appendix Figs. 7, 8).
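The co-occurrence analysis above amounts to counting captions in which a keyword and a group-related word appear together. A minimal sketch, using toy captions rather than the actual pre-training corpus and naive substring matching:

```python
def cooccurrence_count(captions: list, keyword: str, word: str) -> int:
    """Number of captions in which both `keyword` and `word` occur
    (case-insensitive substring match; a real analysis might tokenize)."""
    return sum(
        keyword in caption.lower() and word in caption.lower()
        for caption in captions
    )
```

Running this over the pre-training captions with keyword 'culture' and the words 'western', 'indian', and 'chinese' would reproduce counts of the kind reported above.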

Discussion and Conclusion
We present a framework, along with the associated ENTIGEN dataset, to evaluate the change in the diversity of text-to-image generations in the presence of ethical interventions. We observe that, without any fine-tuning, models can generate images of diverse groups with prompts containing ethical interventions. Our preliminary study finds evidence that a large change in image generation can be caused by certain key phrases, such as 'irrespective of gender' in the context of gender bias and 'culture' in the context of cultural bias.
Our analysis of the pre-training data also suggests that the models can produce anti-stereotype generations beyond the association between words and images. Our work motivates further studies for developing more inclusive and reliable text-to-image systems.
The creation of a large number of ethical interventions and their human evaluation is a current limitation and an important future direction. Additionally, we consider a binary categorization of the model generations, which has technical as well as ethical limitations. It would be important to study mechanisms for assigning non-binary labels to model generations and to develop diversity metrics beyond binary groups in future work.
Our work is also limited by the perceptual bias of the human annotators, who are from the US and UK, as well as of the CLIP model. To obtain more reliable evaluation results, we plan to involve annotators from diverse regions in the human evaluation and less biased computer vision models in the automatic evaluation.

Ethics Statement
ENTIGEN is proposed for evaluating the change in model generations in the presence of ethical interventions. We limit our work to selected categories (such as professions and objects) within the gender and skin color axes, even though there might be other categories, such as politics, where equal representation is desired. Even though there is a wide range of groups within the gender and skin color axes, we only consider categorizing individuals into {man, woman} and {light-skinned, dark-skinned}.
We are aware of the negative impact of limited binary categories: it is offensive to underrepresented groups and possibly contributes to the cyclical erasure of non-binary gender identities. However, assessing any individual's gender identity or sex based on their appearance is impossible; hence, we limit our work to classifying individuals as man/woman based on the perception and gender assumptions of the human annotators and the CLIP model. We also emphasize that our analysis is based on generated images, not images of real individuals.
We also understand that there are numerous skin colors, but we limit our study to classifying individuals as light-skinned or dark-skinned. Additionally, we do not instruct the annotators to use the Fitzpatrick scale (Fitzpatrick, 1986) to determine skin color; rather, the decision is left to their own perception.
Imperfect text-to-image generative modeling can run into the hazard of missing certain data modes, eventually compounding the social biases present in the pre-training dataset (Saharia et al., 2022). There are also harms associated with a model's ability to change its predictions drastically based on the prompt, as this can lead to the generation of objectionable content. We encourage the practice of applying sophisticated Not Safe For Work (NSFW) filters before image generation. The CLIP-based filter used by Stable Diffusion implementations is a positive step in this direction.
Extensions of our work can focus on increasing the representation of more groups as well as designing text-to-image generative models that output images of people belonging to diverse groups conditional on the neutral prompt.
As we annotate a new dataset, ENTIGEN, we compensate annotators at a fair rate. We recruit annotators from Amazon MTurk, provide a fair compensation rate of $10 per hour, and spent around $60 in total on human evaluation. Each HIT takes several seconds according to the statistics in Amazon MTurk.

F More on Bias Results
We present the formulation of bias along a social axis g in Eq. (2). Bias results based on human evaluations are shown in Figure 3. We first observe that, in most cases, adding ethical interventions helps reduce the bias, as the absolute value of the bias score becomes smaller. We further find that in some cases, for example, outputting a person with makeup by DALL·E-mini, the bias direction is flipped towards persons who look like men.

G Case Study
Figure 4 to Figure 8 showcase the generated images based on different prompt variants. From Figure 8, we observe that the original prompts about a bride only generate brides in Western weddings, but the generations are diversified with the ethical intervention 'from diverse cultures'.

Figure 1: We study the change in the model generations across various groups (man/woman, light-skinned/dark-skinned, Western/Non-Western) before and after adding ethical interventions (in purple) during text-to-image generation. We use CLIP and human annotations to assign a social group to the model generations. We present a few output generations in Appendix Fig. 4-8.

Figure 2: Screenshot of the annotation interface for collecting human evaluation results.

Figure 4: Model generations from Stable Diffusion for the doctor attribute from the profession category, conditional on various prompts.

Figure 5: Model generations from DALL·E-mini for the police officer attribute from the profession category, conditional on various prompts.

Figure 6: Model generations from DALL·E-mini for the makeup attribute, conditional on various prompts.

Figure 7: Model generations from DALL·E-mini for the bride attribute, conditional on various prompts.

Figure 8: Model generations from minDALL·E for the bride attribute, conditional on various prompts.

Table 1: CLIP-based and human evaluation results for the profession category. We abbreviate Diversity Score as DS, Ethical Intervention as EI, Humans as H, minDALL·E as minD, DALL·E-mini as D-mini, and Stable Diffusion as SD.

Table 2: Human evaluation results for cultural bias. We abbreviate Diversity Score as DS, minDALL·E as minD, DALL·E-mini as D-mini, and Stable Diffusion as SD.

Table 3: Average number of good-quality image generations per attribute that accurately depict the prompts, as determined by human annotators. The gender and skin color EIs append 'irrespective of [X]', and the culture EI appends 'from diverse cultures' to the prompts.

Table 5: Names of the attributes belonging to each category used in the CLIP-based evaluation. The attributes marked with * are considered for human evaluation by the annotators.

Table 7: List of contexts in which the phrase 'irrespective of' is used in the pre-training datasets. Example captions include:
- 'A great team is all the time humble and have the ability to listen to everyone, facilitating freedom to communicate each member's thoughts and perspectives irrespective of hierarchies, which in turn...'
- 'Faux Leather Toddler Jacket - Leather jackets irrespective of the colour, style and material...'
- 'According to the ornithologists, the parrots would help out irrespective of whether the other individual was their friend or not.'
- 'Banquet Outfits for Women: For Banquet events, irrespective of whether it will be a formal or informal occasion, you need to appear regal and elegant...'
- 'Secure pipes to prevent movement irrespective of slope of surface, e.g., sand bags, star pickets, place against fixed objects which will prevent the movement of pipes.'
- 'Advertising is one of the most important parts of marketing irrespective of brands, companies and products...'
- 'The starter relay switch will be replaced free of cost in the identified units irrespective of the warranty status of the vehicle across Honda's India network.'
- 'The Leh-Karakoram road is also a part of this project. It has 37 bridges and is motorable all through the year irrespective of weather conditions.'
- 'Air pollution is one such form that refers to the contamination of the air, irrespective of indoors or outside...'
- 'The Salish Sea joins together more than 7 million inhabitants, which work together on a wide range of issues irregardless and irrespective of national border.'
- 'PERSON's Vases: The fluid levels are the same in each tube irrespective of their shape.'
- 'East Or West India is the best. These fans continue to cheer for India irrespective of any state at the IPL 6 match between Kings XI Punjab and Kolkata Knight Riders in Mohali. (PTI)'
- 'Energy moving through a side facing female human form within toroidal geometric space. The toroidal field has perfect symmetry irrespective of perspective.'
- 'Students are selected based on merit, irrespective of their ability to pay...'
- 'Men have always flaunted caps irrespective of the season...'
- '...served to more than 10,000 people every day. It is now a tradition followed by more than 30 million PERSON worldwide. Nearly every gurdwara in the world, irrespective of size, has a kitchen and serves langar...'
- 'Material Risk Willmott Dixon appeal - Any work with asbestos presents a material risk irrespective of the number of fibres released (if any) or the length of exposure.'
- 'Total muscle mass in all parts of the body is greater in men than in women irrespective of age...'
- '...chemotherapy is added. Avastin therapy should be continued until disease progression, irrespective of any modification to the concomitant chemotherapy regimen...'
- 'This project is designed to replace the defective control board with a new Control Board in Microwave Oven irrespective of brand and capacity...'
- 'Short Stubble Beard is a female magnet and also one of the beard styles that every man can flaunt irrespective of the scanty and patchy growth issues!'