Societal Biases in Language Generation: Progress and Challenges

Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.


Introduction
Natural language generation (NLG) is a suite of techniques that enables the generation of human-readable language for different goals. These techniques are the core components of applications such as virtual assistants, chat bots, automatic translators, summarizers, and creative language composers. Recent advances in techniques for language generation (e.g., GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), TransformerXL (Dai et al., 2019), XLNet (Yang et al., 2019)) powered by Transformers (Vaswani et al., 2017) and an increasing repository of available data have created more capable applications. This has, in turn, channeled more interest and effort into developing NLG techniques.
We emphasize the importance of better understanding how societal biases manifest in NLG techniques, because NLG applications directly interact with many different users to generate novel content in various domains (e.g., chat bots for health, education, and customer support). However, when techniques are less effective or detrimental for marginalized populations, these techniques can inadvertently become gatekeepers of those populations for generation and associated language technologies. For example, an educational chat bot that produces more negative responses for topics about a specific ethnicity will discourage users of that ethnicity from interacting with the chat bot. While it is generally important to study the societal impact of NLP and AI techniques, we argue that the direct user impact of NLG techniques makes it especially important to carefully quantify the impact.
Motivated by the importance of fairness in language generation, we present the first comprehensive survey on societal biases in language generation. By enumerating how NLG techniques contribute to biases and examining progress towards bias analysis and mitigation, we contextualize the discussion of broader trends and challenges. Specifically, we focus on techniques for NLG tasks, i.e., tasks that generate a sequence of text. Finding a lack of studies on biases from decoding techniques, we additionally present an experimental study to quantify the effects of various decoding techniques.
Before we delve into the details of biases in language generation, we first position our survey in the context of other relevant surveys and position papers. Sun et al. (2019) survey methods for mitigating gender biases and Shah et al. (2020) categorize sources of biases; both largely focus on natural language understanding (NLU) tasks, while we examine biases in NLG tasks. Additionally, Blodgett et al. (2020) urge for more explicitly tying "biases" in NLP to societal normative definitions of biases and social hierarchies; with their recommendations in mind, we discuss the negative impacts of biases in NLG techniques. Our contributions are a comprehensive survey on societal biases in language generation and an experimental study on biases from decoding techniques. To start, we describe classes of NLG tasks (Sec. 2) and subsequently examine examples of biases and harms in NLG (Sec. 3). We then discuss NLG techniques that facilitate biases, including a study of decoding techniques (Sec. 4). Sec. 5 highlights progress and challenges, and Sec. 6 presents open problems and proposals. We hope this survey brings more visibility to the importance of carefully considering different components of NLG pipelines for potential biases and mitigation methods.

Language Generation Tasks
To begin, we categorize generation tasks and introduce existing bias studies relevant to each task. NLG tasks broadly fall into two categories: those that generate text continuations conditioned on some prompt and those that transform text from one form to another. Table 1 organizes various bias-related works for NLG tasks.

Continuation Generation Tasks
The continuation class includes autocomplete and dialogue generation, where the goal is to generate text that is coherent and relevant to a prompt.

Autocomplete Generation We use the term autocomplete generation to refer to conditional generation directly from language models. Language models are the core components for many NLG and NLU tasks, and this task enables directly quantifying biases in large, pre-trained language models (Bordia and Bowman, 2019; Sheng et al., 2019; Solaiman et al., 2019; Brown et al., 2020). Existing works analyzing biases in autocomplete generation have mostly examined Transformer-based models, including GPT (Shwartz et al., 2020), GPT-2 (Solaiman et al., 2019; Sheng et al., 2019; Shwartz et al., 2020; Vig et al., 2020; Yeo and Chen, 2020; Huang et al., 2020; Dhamala et al., 2021; Schick et al., 2021), GPT-3 (Brown et al., 2020), CTRL (Dhamala et al., 2021), TransformerXL (Shwartz et al., 2020; Vig et al., 2020; Huang et al., 2020), and XLNet (Shwartz et al., 2020; Vig et al., 2020; Yeo and Chen, 2020), though Bordia and Bowman (2019) and Qian et al. (2019) also look at LSTM-based models.

Dialogue Generation Dialogue generation is conditioned on user inputs and can be for specific domains (e.g., health, customer service) and tasks (e.g., behavior intervention, booking flights) or general chit-chat. These dialogue applications directly interact with users, and any propagated biases directly affect user behavior and actions. In terms of recurrent dialogue models, Henderson et al. (2018) analyze biases in hierarchical recurrent encoder-decoder architectures, and Liu et al. (2020a,b) analyze LSTM-based encoder-decoder models. Other works on dialogue biases (Dinan et al., 2020a; Sheng et al., 2021b) focus on Transformer-based models such as DialoGPT (Zhang et al., 2020) and other custom architectures.

Transformation Generation Tasks
The transformation class includes machine translation and various formulations of text re-writing. The general goal of these tasks is to transform text into a form with targeted properties.

Machine Translation Translation is the task of transforming text between languages while preserving the meaning. Existing works on biases in machine translation have almost exclusively focused on issues of gender biases in a variety of academic and commercial systems. The use of grammatical gender in some languages and not in others can expose unwanted gender associations (e.g., for different occupations) through translation (Prates et al., 2019). Earlier works by Vanmassenhove et al. (2018) and Elaraby et al. (2018) study LSTM-based encoder-decoder translation systems, and more recent works examine Transformer-based architectures (Escudé Font and Costa-jussà, 2019; Stanovsky et al., 2019; Costa-jussà and de Jorge, 2020; Basta et al., 2020; Stafanovičs et al., 2020; Renduchintala and Williams, 2021; Choubey et al., 2021; Tomalin et al., 2021). While Google Translate has been the most popular commercial system to analyze for gender biases (Prates et al., 2019; Moryossef et al., 2019; Stanovsky et al., 2019; Cho et al., 2019; Farkas and Németh, 2020), Stanovsky et al. (2019) also study Microsoft Translator, among other commercial systems.

Re-writing We use the term re-writing to refer to tasks of revising specific words and phrases in the original text to be more aligned with a targeted attribute. Specifically, there have been studies on re-inflection (Habash et al., 2019; Zmigrod et al., 2019; Alhafni et al., 2020) and re-writing text to use neutral viewpoints (Pryzant et al., 2020), gender-neutral English (Sun et al., 2021), or more agency (Ma et al., 2020). These tasks typically rely on custom encoder-decoder models.

Other Tasks
There are other NLG tasks, such as the continuation tasks of story and poetry generation, and the transformation tasks of abstractive summarization and paraphrase generation. However, these other NLG tasks are not yet well-studied in the context of societal biases.

Biases and their Negative Impacts
In this section, we introduce how existing studies of biases in NLG tasks commonly quantify biases and their negative impacts.

Bias Definitions and Metrics
In the context of AI fairness, the term "bias" commonly refers to skews that result in undesirable impacts (Crawford, 2017) and is quantifiable with some metric. There are relatively more existing studies on biases in NLU tasks, where it is arguably simpler to define bias metrics, since we can intuitively compare the accuracy of the task (e.g., coreference resolution, hate speech detection) for different demographics. Language generation tasks often involve stochastic generation of open-ended and lengthy texts, traits that are not directly compatible with traditional algorithmic bias definitions (e.g., equalized odds, equal opportunity, demographic parity (Dwork et al., 2012; Hardt et al., 2016)).
Because of the difficulty in defining metrics, existing works define bias loosely as demographic inequality and use intermediate proxy metrics to comparatively measure bias. Examples include:
• Regard Ratio: negative-neutral-positive regard score ratios of text generated from bias-inducing prompts (Sheng et al., 2019)
• Sentiment Ratio: negative-neutral-positive sentiment score ratios of text generated from African American English (AAE) versus White-Aligned English (WAE) prompts (Groenwold et al., 2020)
• Individual and Group Fairness through Sentiment: comparisons of the sentiment distributions of generated text across demographics and prompts (Huang et al., 2020)
• Gendered Word Co-occurrence Score: mean and standard deviations of the absolute log ratio of probabilities P(word|female terms) to P(word|male terms) across all words in generated text (Bordia and Bowman, 2019)
There are also metrics for other bias evaluation setups in continuation generation tasks involving sentiment (Shwartz et al., 2020), the ratio of gendered words (Solaiman et al., 2019; Vig et al., 2020; Dinan et al., 2020a), and other novel metrics (Yeo and Chen, 2020). Studies of biases in transformation generation tasks favor metrics of accuracy in terms of successfully transforming text to have a desired property. We present a more thorough comparison of metrics in Section 5.4.
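To make the proxy-metric idea concrete, the following is a minimal sketch of a gendered word co-occurrence score in the spirit of Bordia and Bowman (2019). The term lists, context window, and add-one smoothing are simplifying assumptions for illustration, not the authors' exact setup.

```python
import math
from collections import Counter

# Hypothetical, abbreviated term lists; real metrics use larger curated sets.
FEMALE_TERMS = {"she", "her", "woman", "women"}
MALE_TERMS = {"he", "his", "man", "men"}

def cooccurrence_counts(texts, terms, window=10):
    """Count words appearing within `window` tokens of any term in `terms`."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            if tok in terms:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i and tokens[j] not in FEMALE_TERMS | MALE_TERMS:
                        counts[tokens[j]] += 1
    return counts

def gendered_cooccurrence_scores(texts, smoothing=1.0):
    """Mean and std of |log P(word|female terms) / P(word|male terms)|."""
    f = cooccurrence_counts(texts, FEMALE_TERMS)
    m = cooccurrence_counts(texts, MALE_TERMS)
    vocab = set(f) | set(m)
    f_total = sum(f.values()) + smoothing * len(vocab)
    m_total = sum(m.values()) + smoothing * len(vocab)
    scores = []
    for w in vocab:
        p_f = (f[w] + smoothing) / f_total
        p_m = (m[w] + smoothing) / m_total
        scores.append(abs(math.log(p_f / p_m)))
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean, std
```

A perfectly balanced corpus yields a mean of zero; skewed co-occurrences push the mean up.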
Bias metrics can also be categorized by how they define associations between demographic group attributes and text. Biases can be towards people described in text, people who produce the text, or people to whom the text is addressed (Dinan et al., 2020b). Most existing works define bias metrics through the first association-these biases are relatively easier to analyze, since both the demographic and the textual signals of bias are encapsulated within the text. There are also works that define biases towards people who produce the text (Groenwold et al., 2020) or people to whom the text is addressed (Sheng et al., 2021b), though there are relatively fewer works that study these latter associations.

Negative Impacts
Biases in NLG techniques are important to study because they can result in harmful, negative impacts. We survey detrimental representational and allocational impacts (Crawford, 2017; Barocas et al., 2017; Blodgett et al., 2020) used to motivate existing studies of bias in NLG tasks, finding limited examples. While representational impacts are sometimes cited, it is difficult to measure the extent of the impacts. Additionally, techniques for effective NLG are relatively new, and existing studies have limited knowledge of potential allocational impacts. Finally, biases in NLG tasks give rise to a third type of negative impact, which we call vulnerability impacts.
Representational Impacts The works in Table 1 motivate (to varying degrees) studying biases in NLG through potential negative representational impacts, in the form of propagating stereotypes, misrepresentations, or denigrations of social groups. However, it is difficult to quantify the effects of representational impacts; while such impacts may be measured indirectly (e.g., by analyzing allocational impacts), we suggest long-term, interdisciplinary collaborations to explore the direct effects of these representational impacts.
Allocational Impacts Harmful allocational impacts result from an unequal allocation of resources across groups. Since effective NLG techniques based on large Transformer models (Vaswani et al., 2017) are relatively new, most of the existing works on biases in NLG that list possible impacts only analyze direct representational consequences. A real example of a negative allocational impact is when machine translation errors lead to arrests (Ong, 2017). In general, technologies that are less effective or detrimental for certain populations become barriers that actively prevent those populations from using the technology, leading to diminished opportunities in jobs, education, health, etc. We discuss more details in Section 4.5. With continuous technological advances, more organizations will turn to effective NLG techniques, making it imperative to start setting norms to reduce harmful allocational impacts (Tamkin et al., 2021).

Vulnerability Impacts Open-domain generation tasks can amplify a group's vulnerability to manipulation and harm, which is an intermediate impact that makes a group more susceptible to representational and allocational impacts. For example, privacy-related issues (Carlini et al., 2020), misinformation (Levy et al., 2021), or radicalizing views in generated text could make a group more likely to be attributed to specific stereotypes (e.g., through action guided by misinformation) or end up with diminished opportunities (e.g., by having personal data exposed and misused). Separately identifying vulnerability impacts could help facilitate recognition of other negative impacts.

Contributors to NLG Biases
In a pipeline from data collection to evaluation for an NLG task, each component could propagate biases. We emphasize the ways in which data, model architecture, decoding, evaluation, and deployment uniquely exacerbate biases in generation tasks. Additionally, we present an empirical study to show how measured biases in generated text can vary based on decoding technique.

Biases from Data
Modern NLP models often rely on large pre-trained language models, which in turn rely on a large collection of data to learn explicit and implicit associations. Several recent pre-trained language models used for NLG tasks, e.g., T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), are trained on the largest datasets used for any models. These large models for generation are commonly trained on web data, which is known to contain biased language (e.g., Ferrer et al. (2021) discover gender, religion, and ethnic biases in Reddit communities). While preprocessing is often included to filter out malformatted data and explicitly negative content (e.g., bad words and offensive phrases), those are generally the only efforts to reduce biases and associated impacts. Furthermore, Bender et al. (2021) warn that by filtering out all words deemed "bad", we also remove the discourse of marginalized populations. Paullada et al. (2020), Bender and Friedman (2018), and Gebru et al. (2018) provide more comprehensive surveys and frameworks that focus on aspects of data creation and management that could lead to biases, and we refer readers to their works for more discussion. In the context of translation, Cho et al. (2021) find that more data can increase translation fluency but may also make the system more biased.

Biases from Model Architecture
There are relatively few studies that examine model architectural properties that could lead to biases. We discuss the few efforts towards understanding model biases in NLG tasks and emphasize the need for more such work. For autocomplete generation, Vig et al. (2020) analyze GPT-2 variants through a causal mediation analysis, finding that larger models contain more gender bias, and that bias tends to be concentrated in a small number of neurons and attention heads. Silva et al. (2021) observe amplified biases in distilled versus original models. For machine translation, Costa-jussà et al. (2020) note that language-specific architectures are less biased than shared language encoder-decoder architectures because they encode more gender information. Studies like the aforementioned are useful for designing targeted bias mitigation methods (e.g., controlled generation to target specific attention heads or regularization to retain gender information). However, more evidence would be needed to generalize findings across models.

Biases from Decoding
While NLU and NLG models have structural similarities, NLG tasks uniquely use search or sampling techniques at inference time to generate text. Popular techniques include:
• Greedy Search: at each time step, choose the word with the highest probability.
• Beam Search: at each time step, keep the b most probable partial sequences and ultimately return the highest-scoring complete sequence.
• Top-k Sampling: at each time step, sample from the k words with the highest probabilities after renormalizing (Fan et al., 2018).
• Nucleus Sampling: at each time step, sample from the smallest set of words whose cumulative probability exceeds a threshold p, after renormalizing (Holtzman et al., 2020).

A Study on Biases from Decoding To study how decoding techniques affect biases in generation, we use existing NLG bias metrics to evaluate text generated from different decoding methods. We examine autocomplete generations from GPT, GPT-2, and XLNet, using the decoding techniques from Section 4.3. We evaluate with the following bias metrics: regard ratios (Sheng et al., 2019), sentiment ratios (Groenwold et al., 2020), individual and group fairness through sentiment scores (Huang et al., 2020), and gendered word co-occurrence scores (Bordia and Bowman, 2019) (as introduced in Section 3). More experimental details can be found in the Appendix.
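The single-step techniques can be sketched over a toy next-token distribution (beam search is omitted because it scores whole sequences rather than single steps). The token names and probabilities below are purely illustrative.

```python
import random

def greedy(probs):
    """Greedy search step: pick the highest-probability token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k=2, rng=random):
    """Top-k sampling: renormalize over the k most probable tokens, then sample."""
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in top)
    return rng.choices(top, weights=[probs[t] / total for t in top])[0]

def nucleus_sample(probs, p=0.9, rng=random):
    """Nucleus (top-p) sampling: sample from the smallest set of tokens
    whose cumulative probability reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    nucleus, cum = [], 0.0
    for t in ranked:
        nucleus.append(t)
        cum += probs[t]
        if cum >= p:
            break
    total = sum(probs[t] for t in nucleus)
    return rng.choices(nucleus, weights=[probs[t] / total for t in nucleus])[0]
```

In a real model these distributions come from a softmax over the vocabulary at each decoding step, and the chosen token is fed back as input for the next step.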
In Section 5.4, we distinguish between relative and absolute score metrics to examine evaluation differences between NLG tasks. Here, we organize our results into these categories to generalize trends about decoding techniques. The ratio-based metrics are relative score metrics, since evaluation relies on comparing ratios between demographics. The latter three metrics are absolute score metrics that have target values of zero indicating no bias.
For the relative score metrics, search and sampling techniques generate similar outcomes. An interesting result between sampling techniques for the regard metric is that nucleus sampling is less biased yet more negative than top-k sampling. For the absolute score metrics, we find that beam search is the most unbiased technique, closely followed by greedy search and then top-k and nucleus sampling. Through our study, we discover that text diversity is not accounted for in any of the bias metrics, yet diversity can be a confounding factor. Specifically, beam search is the least diverse, followed by greedy search, top-k sampling, then nucleus sampling. Results indicate that the less diverse search techniques lead to better scores for individual fairness, group fairness, and gendered word co-occurrence ratios.
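The diversity proxies we use (average generated length and vocabulary size) are simple to compute. The sketch below also includes distinct-n, a standard diversity metric not used in our experiments, for comparison; whitespace tokenization is a simplification.

```python
def diversity_proxies(generations):
    """Average generated length and overall vocabulary size."""
    tokens_per_text = [g.split() for g in generations]
    avg_len = sum(len(t) for t in tokens_per_text) / len(tokens_per_text)
    vocab_size = len({tok for toks in tokens_per_text for tok in toks})
    return avg_len, vocab_size

def distinct_n(generations, n=2):
    """Distinct-n: unique n-grams divided by total n-grams across generations."""
    ngrams, total = set(), 0
    for g in generations:
        toks = g.split()
        for i in range(len(toks) - n + 1):
            ngrams.add(tuple(toks[i:i + n]))
            total += 1
    return len(ngrams) / total if total else 0.0
```

Low vocabulary size and low distinct-n both signal the kind of repetitive output typical of beam and greedy search.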
We hope these experimental results will encourage researchers to document sampling techniques, consider how metrics can be formulated to evaluate both bias and other factors of generation quality, and inspire more comprehensive studies. (Code at https://github.com/ewsheng/decoding-biases; average generated text length and vocabulary sizes, used to estimate diversity, are reported in Appendix Table 4.)

Biases from Evaluation
Biases can arise from both general evaluations and bias evaluations for NLG tasks.

General Evaluations Current standards for NLG evaluation can reinforce certain types of language and penalize others. For example, using perplexity as measured by models pre-trained on datasets largely containing non-AAE text leads to an unfair evaluation of AAE text. Additionally, the subjectivity of generation tasks means that much of NLG evaluation depends on human labels. Since humans from different backgrounds are accustomed to different societal norms and linguistic variations, the choice of human annotators could drastically influence the evaluation standards for generated text.
Bias Evaluations It is difficult to evaluate societal biases in NLG tasks because NLG can be open-domain, and there are many different notions of biases from various backgrounds and cultures (Sambasivan et al., 2021). These factors lead to the use of a variety of metrics to evaluate biases (Section 3). To avoid experimental bias in evaluation, we recommend using multiple metrics to cover many types of biases at various granularities. We identify three points to emphasize the need for more comprehensive evaluations. First, most existing works on biases in generation center around one demographic dimension (often gender and from a Western perspective, e.g., using standard Western occupations). While there has been no comprehensive study on whether mitigating biases for one demographic dimension (e.g., gender) may exacerbate biases for others (e.g., race, intersectional identities), this is a possibility we must consider. Second, most works only evaluate bias through a single intermediate proxy; however, different metrics are defined at different granularities (e.g., sentiment is sentence-level, gendered word ratio is word-level). Finally, different evaluation datasets test for specific types of biases and are influenced by the backgrounds of the curators. Collectively evaluating biases across demographic dimensions and granularities can thus help reduce experimentally-biased evaluations.

Biases from Deploying Systems
In terms of deploying NLG systems, there is a feedback loop that benefits some communities and further disadvantages others. While this feedback loop is not unique to NLG systems, NLG systems that directly interact with users make particularly salient cautionary examples.
First, many deployed language technologies require internet access both to use and contribute feedback, thus favoring the views and languages of those privileged with this access. For example, anyone can contribute feedback to Google Translate, but if contributions and subsequent improvements are focused on high-resource languages, this further increases the accuracy gap between the high and low resource languages, diminishing opportunities for speakers of the low resource languages, i.e., representation disparity (Hashimoto et al., 2018).
Second, those who are unable to achieve their goals from using these language technologies (e.g., unsuccessful translation, unhelpful or offensive chat bot) are less likely to continue using the technology. This means that there is less feedback and data to improve the technologies, reinforcing the decreased effectiveness for certain populations, i.e., disparity amplification (Hashimoto et al., 2018).
One way we might intervene is to follow a more targeted approach for data and feedback collection, e.g., from excluded populations. However, we acknowledge that this remains a difficult task and that it is also necessary to be aware of "community goals" and other factors in order to co-design language technologies without inflicting additional harm on marginalized populations (Bird, 2020).

Progress, Trends, and Challenges
Following the discussion of contributors to biases, we survey trends and challenges for reducing biases in NLG.

Data Methods
Data-based methods for both bias analysis and mitigation use the general idea of counterfactual data augmentation (CDA) (Lu et al., 2020) to curate sets of counterfactual prompts. A common method for analysis is using targeted prompts to induce NLG models to reveal biases. For data-based mitigation, existing works focus on fine-tuning large models or training smaller models with datasets that are balanced with respect to targeted demographics.
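As a sketch of how CDA produces counterfactual training data, the following swaps a small list of gendered terms. The word list is illustrative only; real implementations (e.g., Lu et al., 2020) must also handle names, casing, and ambiguous mappings such as "her" → "him"/"his".

```python
# Minimal (and intentionally incomplete) swap table; note that mapping
# "her" to "his" ignores the her/him ambiguity, a known difficulty for CDA.
SWAP = {"he": "she", "she": "he", "his": "her", "her": "his",
        "man": "woman", "woman": "man",
        "himself": "herself", "herself": "himself"}

def counterfactual(sentence):
    """Gender-swap a whitespace-tokenized sentence."""
    return " ".join(SWAP.get(tok, tok) for tok in sentence.split())

def augment(corpus):
    """Return the original corpus plus its gender-swapped counterfactuals."""
    return corpus + [counterfactual(s) for s in corpus]
```

Fine-tuning on the augmented corpus balances gendered contexts so the model is less able to rely on spurious gender correlations.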
Curated Datasets Existing datasets to study biases in translation include parallel sentences tagged with speaker or subject gender information (Vanmassenhove et al., 2018; Habash et al., 2019) and datasets to study gender biases when translating from neutral references of a person (e.g., nurse in English, gender-neutral pronouns) to gendered instances (e.g., enfermera or enfermero in Spanish, gendered pronouns) (Cho et al., 2019; Stanovsky et al., 2019; Gonen and Webster, 2020; Kocmi et al., 2020). Renduchintala and Williams (2021) additionally provide a dataset to study translation of neutral references in unambiguous contexts. Other works present parallel corpora of biased versus unbiased framings and presuppositions (Pryzant et al., 2020). For bias analysis, other studies compare pronoun gender biases in translations (induced with prompts) to real-world statistics.

Bias Mitigation Methods can broadly be classified into two categories based on the type of data applied. The first category encompasses methods that fine-tune or train on a balanced dataset to lessen the effects of the model relying on spurious correlations between imbalanced data and task performance. CDA has been applied to datasets used for continued or fresh training in dialogue generation (Dinan et al., 2020a; Liu et al., 2020a) as well as machine translation (Costa-jussà and de Jorge, 2020; Stafanovičs et al., 2020). The second category is methods that attach a short prefix at training time (Vanmassenhove et al., 2018; Basta et al., 2020; Alhafni et al., 2020) or inference time (Moryossef et al., 2019).

Challenges The size of state-of-the-art pre-trained models and varying definitions of biases in generation present difficulties for creating standardized datasets that are generally effective across biases and demographics. Moreover, it remains to be seen whether data-based mitigation is as effective for open-domain NLG tasks as it is for more constrained settings.

Training Methods
In addition to data-based mitigation, training-based mitigation is another popular class of methods to reduce biases in generation.

Bias Mitigation Several works that use training-based mitigation techniques rely on regularization (Bordia and Bowman, 2019; Qian et al., 2019; Huang et al., 2020; Liu et al., 2020a). There are also works that induce control by incorporating a bias control code through conditional training (Dinan et al., 2020a), by appending a target value to inputs during training (Ma et al., 2020), by using a normative classifier to produce reward values for backpropagation, or through adversarial training (Liu et al., 2020b). Other techniques include using debiased word embeddings (Escudé Font and Costa-jussà, 2019), identifying and editing out subjective words (Pryzant et al., 2020), and using Markov random fields to preserve morpho-syntactic agreement during reinflection (Zmigrod et al., 2019).

Challenges The main challenge of bias mitigation through training methods is that it is costly and impractical to re-train models for new biases encountered. In fact, most of the techniques that rely on training from scratch use smaller architectures (exceptions are from larger institutions).
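As a sketch of regularization-based mitigation, the following penalizes the projection of gender-neutral word embeddings onto a learned gender direction, a term that would be added to the training loss. The toy embeddings and word pairs are illustrative, and this simplifies the actual regularization terms used in prior work.

```python
import math

def gender_direction(emb, pairs):
    """Average difference vector between gendered word pairs (e.g., he/she)."""
    dims = len(next(iter(emb.values())))
    d = [0.0] * dims
    for a, b in pairs:
        for i in range(dims):
            d[i] += (emb[a][i] - emb[b][i]) / len(pairs)
    return d

def bias_regularizer(emb, neutral_words, direction):
    """Sum of squared projections of neutral-word embeddings onto the gender
    direction; adding this to the loss pushes neutral words toward being
    orthogonal to that direction."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(dot(direction, direction))
    unit = [x / norm for x in direction]
    return sum(dot(emb[w], unit) ** 2 for w in neutral_words)
```

During training, the total loss would be the task loss plus a weighted copy of this penalty; the weight trades off fluency against debiasing.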

Inference Methods
While the existing literature on inference-time methods for bias mitigation is sparse, decoding-based methods are a promising alternative to data- and training-based methods. Specifically, these methods are compatible with any pre-trained language model for generation without additional training. Given recent development of inference-time methods for control that can reduce toxicity (e.g., PPLM (Dathathri et al., 2019), GeDi (Krause et al., 2020), DExperts (Liu et al., 2021)), there is potential for extending these methods to bias mitigation.

Bias Mitigation For autocomplete and dialogue generation, some works formulate bias triggers using the gradient-based methods of Wallace et al. (2019). These triggers are appended to prompts during inference time to control text generation to be more equalized towards different demographics. For translation, Saunders and Byrne (2020) present a lattice rescoring procedure that creates gender-inflected search spaces to rescore text for more accurate translations, and follow-up work subsequently uses this lattice structure to present more gendered options during beam search and rerank translation hypotheses according to gender criteria. For dialogue generation, Sheng et al. (2021b) introduce a constrained decoding method that uses n-gram similarity to guide generation away from ad hominems towards marginalized groups. For autocomplete generation, Schick et al. (2021) present a self-debiasing scheme that re-weights word probabilities to generate less undesirable words.

Challenges Control methods at inference time could potentially steer the model into degenerate spaces, so it is important to also evaluate these methods for coherence, fluency, and task relevance.
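As a schematic of decoding-time re-weighting, the sketch below down-weights the logits of tokens on a hypothetical undesirable-word list before the softmax. This is a generic illustration only, not the actual mechanism of Schick et al. (2021), who instead use the model's own estimate of a token's undesirability to rescale probabilities.

```python
import math

def reweight_logits(logits, undesirable, penalty=5.0):
    """Subtract a fixed penalty from the logits of flagged tokens.
    `undesirable` is a hypothetical set supplied by the caller."""
    return {tok: (lp - penalty if tok in undesirable else lp)
            for tok, lp in logits.items()}

def softmax(logits):
    """Numerically stable softmax over a {token: logit} dict."""
    m = max(logits.values())
    exps = {t: math.exp(l - m) for t, l in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}
```

Because the adjustment happens purely at inference time, it composes with any pre-trained model, which is the main appeal of this class of methods.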

Evaluation Methods
There are two types of evaluations: those that rely on absolute scores and those that rely on relative scores. Absolute score evaluations use an accumulated score to summarize inequalities between demographics, whereas relative evaluations explicitly report inequalities between all demographics. While it is possible to convert between relative and absolute scores, distinguishing between how existing works choose to portray evaluations allows us to examine differences between generation tasks.

Absolute Evaluations We find that the transformation class of generation tasks favors bias evaluation through absolute metrics, which is possible because these tasks involve relatively more constrained forms of generation.

Challenges A trade-off between framing biases as a relative or absolute metric is that relative metrics can be more flexibly aligned to normative concerns like social perception. Absolute metrics that look for ratios of gendered words or other indicator words assume that there is a set of words that captures all the differences between demographic groups, regardless of whether these differences are related to normative definitions of harm. There are also absolute metrics such as those of Huang et al. (2020) that can incorporate intermediate metrics that are more aligned with normative behavior, though these metrics reduce the notion of biases to a single value, which could erase historical inequalities between groups.

Open Problems and Proposals
As a fairly nascent area of exploration, the study of biases in language generation still poses many challenges. Throughout this paper, we discuss challenges associated with different components in a generation pipeline. With a heightened awareness of the relevant body of work, we conclude with recommendations for open problems.

Bias-Aware Data Curation Many works have highlighted the harms and problems when collecting training datasets with limited awareness of potential harms. Since effective models for NLG tasks are correlated with increasing training data sizes, biases in data collection (e.g., English-centric, drawn from popular Western media) remain a major contributor to biases that manifest in generation. Additionally, datasets used to study biases in generation can also be limited (e.g., only for binary gender classes). For more bias-aware data curation, we suggest diversifying datasets to include more viewpoints from various groups.
Understanding Trade-Offs Different methods for analysis, mitigation, and evaluation have unique trade-offs. Existing works have been relatively small-scale and limited to a small number of biases for specific tasks. Some useful questions to consider when developing methods to study generation biases are whether we can generalize methods to a diverse set of biases and a wide range of contexts. It is also important to consider formulating metrics that would jointly mitigate biases and preserve other desired text qualities (e.g., diversity, fluency).

Interactive and Continuous Learning
The difficulties of measuring and mitigating biases in generation can be reduced with a general framework for interactive and continuous learning. Over time, such a system could learn from diverse opinions of what constitutes "fair" versus "unfair" generations across tasks. A unified framework would centralize and highlight the importance of studying biases in generation, as well as fuel the development of a more comprehensive set of evaluations that may be useful for large-scale studies of impact.
Focusing on Negative Impacts Section 3 discusses how very few existing works on biases explicitly and meaningfully engage with the resulting negative impacts, even though these impacts are what motivate reducing biases. By reframing efforts around reducing negative impacts rather than biases in the abstract, we may be able to define metrics and measures of progress that better correlate with reducing harm. For example, relative framings of bias metrics could be better aligned with reducing harms for particularly impacted groups.

Ethics and Broader Implications
In this work, we present a survey and commentary on the progress and challenges of studying societal biases in language generation.

Data We do not check the quality of the datasets used to train popular language generation models (due to limited availability and size), though we do briefly mention problems that other works have found regarding the use of large, minimally filtered datasets. Some of the surveyed datasets and metrics used for evaluating biases approximate binary genders using names typical of specific genders, and may be better re-formulated to avoid harms and to curate a more accurate representation of different genders. On the subject of genders, the majority of bias evaluation data also only evaluate for binary genders; we point out this issue in our survey as well.

Techniques Most of the techniques surveyed in this work are trained with or bias-tested with data drawn from Western sources or cultures, since that is largely the focus of the existing body of work. We also refer to studies that point out how techniques for measuring bias do not always transfer across cultures. Our decoding experiments could potentially fuel misuse by giving those with adversarial interests a better understanding of how decoding algorithms could thwart bias metrics, though we believe the transparency around these results outweighs the potential for misuse.

Individual and Group Fairness Scores Huang et al. (2020) define the individual fairness metric by "...averaging the Wasserstein-1 distance between the sentiment score distribution of every evaluation sentence and each of its counterfactual sentences across all templates." For example, we would compute the distance between the sentiment distributions of the text generated from the template "People from [BLANK] are" for each of the country choices for [BLANK], and sum up the distance scores for all pairs across all templates.
For group fairness, the authors calculate the average of the "Wasserstein-1 distance between the sentiment distributions of all generated sentences of inputs from [a] subgroup, and that over the entire evaluation set". Here, a subgroup means each country, occupation, or binary gender. For example, we compare the distance between the sentiment distribution of text generated for Syria (across all templates) and the sentiment distribution of text generated for all countries.
We use Huang et al. (2020)'s prefix templates and fairness metrics exactly as defined in the original work, so we refer readers to the original work for more details.
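As a concrete illustration, the two fairness scores described above can be sketched in a few lines of Python. This is a minimal sketch that assumes sentiment scores have already been computed and grouped; the exact grouping of counterfactual sentences per template and the averaging conventions are assumptions here, so refer to Huang et al. (2020) for the authoritative formulation. The `wasserstein1` helper computes the Wasserstein-1 distance between two 1-D empirical samples (the same quantity `scipy.stats.wasserstein_distance` returns).

```python
import bisect
from itertools import combinations

def wasserstein1(u, v):
    """Wasserstein-1 distance between two 1-D empirical samples,
    computed as the integral of |F_u(x) - F_v(x)| over x."""
    u, v = sorted(u), sorted(v)
    points = sorted(set(u) | set(v))
    total = 0.0
    for a, b in zip(points, points[1:]):
        f_u = bisect.bisect_right(u, a) / len(u)  # empirical CDF of u at a
        f_v = bisect.bisect_right(v, a) / len(v)  # empirical CDF of v at a
        total += abs(f_u - f_v) * (b - a)
    return total

def individual_fairness(template_groups):
    """Sum of W1 distances between the sentiment-score distributions of
    all counterfactual pairs within a template, averaged over templates.

    template_groups: list of templates; each template is a list of
    sentiment-score samples, one per demographic filling of [BLANK].
    """
    total = 0.0
    for group in template_groups:
        for d1, d2 in combinations(group, 2):
            total += wasserstein1(d1, d2)
    return total / len(template_groups)

def group_fairness(subgroup_scores):
    """Average W1 distance between each subgroup's sentiment distribution
    and the sentiment distribution over the entire evaluation set.

    subgroup_scores: dict mapping a subgroup (e.g., a country) to the
    sentiment scores of all text generated for that subgroup.
    """
    all_scores = [s for scores in subgroup_scores.values() for s in scores]
    distances = [wasserstein1(scores, all_scores)
                 for scores in subgroup_scores.values()]
    return sum(distances) / len(distances)
```

Under this formulation, identical sentiment distributions across demographics yield scores of zero, and larger scores indicate larger sentiment disparities.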
Gendered Word Co-occurrence Scores This score is based on the one proposed by Bordia and Bowman (2019), though we use different gendered word lists and evaluate over all text generated for the other bias metrics, downsampling if necessary so that the amount and sources of generated text are consistent across decoding techniques. First, we obtain the lists of female words and male words from Zhao et al. (2018) and add gendered pronouns (he, she, his, him, her) to the respective lists. For each word in the aggregated sample set, we calculate the probability of the word given any of the female words (in a context window of 20 words before and after a word), and similarly the probability of the word given any of the male words. We then take the absolute value of the log ratio of the first probability to the second, and report the average and standard deviation across all non-gendered words. More concretely, given the set of female gendered words f, the set of male gendered words m, unique non-gendered words w ∈ W in a dataset, and the probability of a word given any of the set g of gendered words P(w|g), we calculate the mean µ = avg_{w∈W} |log(P(w|f) / P(w|m))| and the standard deviation σ = stdev_{w∈W} |log(P(w|f) / P(w|m))|.
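The computation above can be sketched as follows. The tiny word lists and the additive smoothing for unseen co-occurrences are illustrative assumptions: the survey uses the full lists from Zhao et al. (2018), and it does not specify how zero counts are handled.

```python
import math
from collections import Counter

# Hypothetical miniature word lists for illustration only; the survey uses
# the full female/male word lists from Zhao et al. (2018) plus pronouns.
FEMALE_WORDS = {"she", "her", "woman", "mother"}
MALE_WORDS = {"he", "his", "him", "man", "father"}

def gendered_cooccurrence_score(tokens, window=20, smoothing=1.0):
    """Mean and stdev of |log(P(w|f) / P(w|m))| over non-gendered words w.

    P(w|g) is estimated from counts of w occurring within a `window`-token
    context before or after any gendered word in g, with additive
    smoothing (an assumption; not specified in the survey).
    """
    counts_f, counts_m = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok in FEMALE_WORDS or tok in MALE_WORDS:
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            target = counts_f if tok in FEMALE_WORDS else counts_m
            for w in ctx:
                if w not in FEMALE_WORDS and w not in MALE_WORDS:
                    target[w] += 1  # only count non-gendered context words
    vocab = set(counts_f) | set(counts_m)
    total_f = sum(counts_f.values()) + smoothing * len(vocab)
    total_m = sum(counts_m.values()) + smoothing * len(vocab)
    scores = []
    for w in vocab:
        p_f = (counts_f[w] + smoothing) / total_f
        p_m = (counts_m[w] + smoothing) / total_m
        scores.append(abs(math.log(p_f / p_m)))
    mu = sum(scores) / len(scores)
    sigma = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
    return mu, sigma
```

A score of zero means every non-gendered word co-occurs equally often with female and male words; larger means indicate stronger gendered skew in co-occurrence.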

Supplementary Results
Supplementary to the experimental results described in the main text, Table 2 presents quantitative results. Table 3 shows regard ratios for the other demographic groups originally included in the evaluation by Sheng et al. (2019). Additionally, Table 4 presents the average lengths and vocabulary sizes of the samples used in the IF/GF evaluations to estimate text diversity. These results, combined with the examples of generated text in Table 5, provide evidence that the decoding techniques differ in terms of generated text diversity, and that diversity is very much correlated with the bias metrics IF, GF, and the gendered word co-occurrence scores. Although this correlation is to be expected from the metric formulations, this study raises relevant questions of whether bias metrics should be correlated with text diversity, and whether bias evaluations should use more comprehensive metrics.

Table 2: Bias evaluations for various decoding algorithms, models, and metrics. Regard scores (Sheng et al., 2019) and sentiment scores (Groenwold et al., 2020) are reported as negative-neutral-positive distribution percentages (avg value). Individual fairness (IF) and group fairness (GF) scores (Huang et al., 2020) compare sentiment distributions of generated text across demographics. Gendered (word co-occurrence) scores are reported as the mean ± stdev of the absolute log ratio of the probabilities P(word|female terms) to P(word|male terms) (Bordia and Bowman, 2019). Search-based results for regard are omitted due to a lack of enough prompts to generate from. Results indicate that 1) nucleus sampling generates more text with negative regard, 2) decoding choices are similar for AAE/WAE sentiments, though sampling generates more positive sentiment overall, and 3) beam search has relatively lower bias as measured by IF, GF, and gendered word co-occurrence scores, followed closely by greedy search, and then top-k and nucleus sampling.
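The text-diversity proxies reported in Table 4 (average sample length and vocabulary size) can be computed with a simple helper; whitespace tokenization and lowercasing are simplifying assumptions here, not choices stated in the survey.

```python
def diversity_stats(samples):
    """Average token length and vocabulary size of a list of generated
    text samples, used as simple proxies for text diversity."""
    token_lists = [s.split() for s in samples]  # naive whitespace tokenization
    avg_len = sum(len(toks) for toks in token_lists) / len(token_lists)
    vocab_size = len({tok.lower() for toks in token_lists for tok in toks})
    return avg_len, vocab_size
```

Higher vocabulary size at comparable average length suggests more diverse generations, which is why sampling-based decoding tends to score higher on these proxies than search-based decoding.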