A Survey of Race, Racism, and Anti-Racism in NLP

Despite inextricable ties between race and language, little work has considered race in NLP research and development. In this work, we survey 79 papers from the ACL anthology that mention race. These papers reveal various types of race-related bias in all stages of NLP model development, highlighting the need for proactive consideration of how NLP systems can uphold racial hierarchies. However, persistent gaps in research on race and NLP remain: race has been siloed as a niche topic and remains ignored in many NLP tasks; most work operationalizes race as a fixed single-dimensional variable with a ground-truth label, which risks reinforcing differences produced by historical racism; and the voices of historically marginalized people are nearly absent in NLP literature. By identifying where and how NLP literature has and has not considered race, especially in comparison to related fields, our work calls for inclusion and racial justice in NLP research practices.


Introduction
Race and language are tied in complicated ways. Raciolinguistics scholars have studied how they are mutually constructed: historically, colonial powers constructed linguistic and racial hierarchies to justify violence, and currently, beliefs about the inferiority of racialized people's language practices continue to justify social and economic exclusion (Rosa and Flores, 2017). Furthermore, language is the primary means through which stereotypes and prejudices are communicated and perpetuated (Hamilton and Trolier, 1986; Bar-Tal et al., 2013).
However, questions of race and racial bias have been minimally explored in NLP literature.
While researchers and activists have increasingly drawn attention to racism in computer science and academia, frequently-cited examples of racial bias in AI are often drawn from disciplines other than NLP, such as computer vision (facial recognition) (Buolamwini and Gebru, 2018) or machine learning (recidivism risk prediction) (Angwin et al., 2016). Even the presence of racial biases in search engines like Google (Sweeney, 2013;Noble, 2018) has prompted little investigation in the ACL community. Work on NLP and race remains sparse, particularly in contrast to concerns about gender bias, which have led to surveys, workshops, and shared tasks (Sun et al., 2019;Webster et al., 2019).
In this work, we conduct a comprehensive survey of how NLP literature and research practices engage with race. We first examine 79 papers from the ACL Anthology that mention the words 'race', 'racial', or 'racism' and highlight examples of how racial biases manifest at all stages of NLP model pipelines (§3). We then describe some of the limitations of current work (§4), specifically showing that NLP research has only examined race in a narrow range of tasks with limited or no social context. Finally, in §5, we revisit the NLP pipeline with a focus on how people generate data, build models, and are affected by deployed systems, and we highlight current failures to engage with people traditionally underrepresented in STEM and academia.
While little work has examined the role of race in NLP specifically, prior work has discussed race in related fields, including human-computer interaction (HCI) (Ogbonnaya-Ogburu et al., 2020;Rankin and Thomas, 2019;Schlesinger et al., 2017), fairness in machine learning (Hanna et al., 2020), and linguistics (Hudley et al., 2020;Motha, 2020). We draw comparisons and guidance from this work and show its relevance to NLP research. Our work differs from NLP-focused related work on gender bias (Sun et al., 2019), 'bias' generally (Blodgett et al., 2020), and the adverse impacts of language models (Bender et al., 2021) in its explicit focus on race and racism.
In surveying research in NLP and related fields, we ultimately find that NLP systems and research practices produce differences along racialized lines. Our work calls for NLP researchers to consider the social hierarchies upheld and exacerbated by NLP research and to shift the field toward "greater inclusion and racial justice" (Hudley et al., 2020).

What is race?
It has been widely accepted by social scientists that race is a social construct, meaning it "was brought into existence or shaped by historical events, social forces, political power, and/or colonial conquest" rather than reflecting biological or 'natural' differences (Hanna et al., 2020). More recent work has criticized the "social construction" theory as circular and rooted in academic discourse, and instead referred to race as "colonial constituted practices", including "an inherited western, modern-colonial practice of violence, assemblage, superordination, exploitation and segregation" (Saucier et al., 2016).
The term race is also multi-dimensional and can refer to a variety of different perspectives, including racial identity (how you self-identify), observed race (the race others perceive you to be), and reflected race (the race you believe others perceive you to be) (Roth, 2016; Hanna et al., 2020; Ogbonnaya-Ogburu et al., 2020). Racial categorizations often differ across dimensions and depend on the defined categorization schema. For example, the United States census considers Hispanic an ethnicity, not a race, but surveys suggest that 2/3 of people who identify as Hispanic consider it a part of their racial background. Similarly, the census does not consider 'Jewish' a race, but some NLP work considers anti-Semitism a form of racism (Hasanuzzaman et al., 2017). Race depends on historical and social context: there are no 'ground truth' labels or categories (Roth, 2016).
As the work we survey primarily focuses on the United States, our analysis similarly focuses on the U.S. However, as race and racism are global constructs, some aspects of our analysis are applicable to other contexts. We suggest that future studies on racialization in NLP ground their analysis in the appropriate geo-cultural context, which may result [...]

Survey of NLP literature on race

ACL Anthology papers about race
In this section, we introduce our primary survey data, papers from the ACL Anthology, and we describe some of their major findings to emphasize that NLP systems encode racial biases. The ACL Anthology includes papers from all official ACL venues and some non-ACL events listed in Appendix A; as of December 2020 it included 6,200 papers. We searched the anthology for papers containing the terms 'racial', 'racism', or 'race', discarding ones that only mentioned race in the references section or in data examples and adding related papers cited by the initial set if they were also in the ACL Anthology. Because we use keyword searches, we focus on papers that explicitly mention race, and we treat papers that use only euphemistic terms as lacking substantial engagement with this topic. As our focus is on NLP and the ACL community, we do not include NLP-related papers published in other venues in the reported metrics (e.g. Table 1), but we do draw from them throughout our analysis.
Our initial search identified 165 papers. However, reviewing all of them revealed that many do not deeply engage with the topic. For example, 37 papers mention 'racism' as a form of abusive language or use 'racist' as an offensive/hate speech label without further engagement. Another 30 papers mention race only as future work, related work, or motivation, e.g. in a survey about gender bias, "Nonbinary genders as well as racial biases have largely been ignored in NLP" (Sun et al., 2019). After discarding these types of papers, our final analysis set consists of 79 papers.
Table 1 provides an overview of the 79 papers, manually coded for each paper's primary NLP task and its focal goal or contribution. We determined task/application labels through an iterative process: listing the main focus of each paper and then collapsing similar categories. In cases where papers could rightfully be included in multiple categories, we assign them to the best-matching one based on stated contributions and the percentage of the paper devoted to each possible category. In the Appendix we provide additional categorizations of the papers according to publication year, venue, and racial categories used, as well as the full list of 79 papers.
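The keyword search described above can be sketched in a few lines. This is only an illustration under our own assumptions, not the procedure actually used for the survey: the `papers` records, their field names, and the helper functions are hypothetical, and we assume each paper's body text is already separated from its reference list. Matching on word boundaries avoids hits on words such as 'trace', but a term like 'race' can still appear in unrelated senses, so manual review of the matches remains necessary.

```python
import re

# Whole-word, case-insensitive match, so 'trace' or 'embrace' do not count.
RACE_TERMS = re.compile(r"\b(race|racial|racism)\b", re.IGNORECASE)

def mentions_race(body_text: str) -> bool:
    """Return True if the text contains any of the three search terms."""
    return RACE_TERMS.search(body_text) is not None

def filter_papers(papers):
    """Keep the ids of papers whose body (not reference list) matches.

    `papers` is a hypothetical list of dicts with 'id', 'body', 'references'.
    """
    return [p["id"] for p in papers if mentions_race(p["body"])]

# Toy records for illustration only.
papers = [
    {"id": "A", "body": "We study racial bias in embeddings.", "references": ""},
    {"id": "B", "body": "We trace errors in a parser.", "references": "Race and NLP."},
    {"id": "C", "body": "Racism is one hate speech label.", "references": ""},
]
selected = filter_papers(papers)
print(selected)
```

Paper B is excluded because its only match is in the reference list, mirroring the discard rule stated above.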

NLP systems encode racial bias
Next, we present examples that identify racial bias in NLP models, focusing on 5 parts of a standard NLP pipeline: data, data labels, models, model outputs, and social analyses of outputs. We include papers described in Table 1 and also relevant literature beyond the ACL Anthology (e.g. NeurIPS, PNAS, Science). These examples are not intended to be exhaustive, and in §4 we describe some of the ways that NLP literature has failed to engage with race, but nevertheless, we present them as evidence that NLP systems perpetuate harmful biases along racialized lines.
Data
A substantial amount of prior work has already shown how NLP systems, especially word embeddings and language models, can absorb and amplify social biases in data sets (Bolukbasi et al., 2016; Zhao et al., 2017). While most work focuses on gender bias, some work has made similar observations about racial bias (Garg et al., 2018; Kurita et al., 2019). These studies focus on how training data might describe racial minorities in biased ways, for example, by examining words associated with terms like 'black' or traditionally European/African American names (Caliskan et al., 2017; Manzini et al., 2019). Some studies additionally capture who is described, revealing under-representation in training data, sometimes tangentially to primary research questions: Rudinger et al. (2017) suggest that gender bias may be easier to identify than racial or ethnic bias in Natural Language Inference data sets because of data sparsity, and Caliskan et al. (2017) alter the Implicit Association Test stimuli that they use to measure biases in word embeddings because some African American names were not frequent enough in their corpora.
An equally important consideration, in addition to whom the data describes, is who authored the data. For example, Blodgett et al. (2018) show that parsing systems trained on White Mainstream American English perform poorly on African American English (AAE). (We note that conceptualizations of AAE and the accompanying terminology for the variety have shifted considerably in the last half century; see King (2020) for an overview.) In a more general example, Wikipedia has become a popular data source for many NLP tasks. However, surveys suggest that Wikipedia editors are primarily from white-majority countries (https://bit.ly/2Yv07IL), and several initiatives have pointed out systemic racial biases in Wikipedia coverage (Adams et al., 2019; Field et al., 2021; https://bit.ly/3j2weZA). Models trained on these data only learn to process the type of text generated by these users, and further, only learn information about the topics these users are interested in.
The representativeness of data sets is a well-discussed issue in social-oriented tasks, like inferring public opinion (Olteanu et al., 2019), but this issue is also an important consideration in 'neutral' tasks like parsing (Waseem et al., 2021). The type of data that researchers choose to train their models on does not just affect what data the models perform well for; it also affects what people the models work for. NLP researchers cannot assume models will be useful or function for marginalized people unless they are trained on data generated by them.
Data Labels
Although model biases are often blamed on raw data, several of the papers we survey identify biases in the way researchers categorize or obtain data annotations. For example:
• Annotation schema: Returning to Blodgett et al. (2018), this work defines new parsing standards for formalisms common in AAE, demonstrating how parsing labels themselves were not designed for racialized language varieties.
• Annotation instructions: Sap et al. (2019) show that annotators are less likely to label tweets using AAE as offensive if they are told the likely language varieties of the tweets. Thus, how annotation schemes are designed (e.g. what contextual information is provided) can impact annotators' decisions, and failing to provide sufficient context can result in racial biases.
• Annotator selection: Waseem (2016) shows that feminist/anti-racist activists assign different offensive language labels to tweets than Figure Eight workers, demonstrating that annotators' lived experiences affect data annotations.
Models
Some papers have found evidence that model instances or architectures can change the racial biases of outputs produced by the model. Sommerauer and Fokkens (2019) find that the word embedding associations around words like 'race' and 'racial' change not only depending on the model architecture used to train embeddings, but also on the specific model instance used to extract them, perhaps because of differing random seeds. Kiritchenko and Mohammad (2018) examine gender and race biases in 200 sentiment analysis systems submitted to a shared task and find different levels of bias in different systems. As the training data for the shared task was standardized, all models were trained on the same data. However, participants could have used external training data or pre-trained embeddings, so a more detailed investigation of results is needed to ascertain which factors most contribute to disparate performance.
Model Outputs
Several papers focus on model outcomes, and how NLP systems could perpetuate and amplify bias if they are deployed:
• Classifiers trained on common abusive language data sets are more likely to label tweets containing characteristics of AAE as offensive (Davidson et al., 2019; Sap et al., 2019).
• Classifiers for abusive language are more likely to label text containing identity terms like 'black' as offensive (Dixon et al., 2018).
• GPT outputs text with more negative sentiment when prompted with AAE-like inputs (Groenwold et al., 2020).

Social Analyses of Outputs
While the examples in this section primarily focus on racial biases in trained NLP systems, other work (e.g. included in 'Social Science/Social Media' in Table 1) uses NLP tools to analyze race in society. Examples include examining how commentators describe football players of different races (Merullo et al., 2019) or how words like 'prejudice' have changed meaning over time (Vylomova et al., 2019). While differing in goals, this work is often susceptible to the same pitfalls as other NLP tasks. One area requiring particular caution is in the interpretation of results produced by analysis models. For example, while word embeddings have become a common way to measure semantic change or estimate word meanings (Garg et al., 2018), Joseph and Morgan (2020) show that embedding associations do not always correlate with human opinions; in particular, correlations are stronger for beliefs about gender than race. Relatedly, in HCI, the recognition that authors' own biases can affect their interpretations of results has caused some authors to provide self-disclosures (Schlesinger et al., 2017), but this practice is uncommon in NLP.
We conclude this section by observing that when researchers have looked for racial biases in NLP systems, they have usually found them. This literature calls for proactive approaches in considering how data is collected, annotated, used, and interpreted to prevent NLP systems from exacerbating historical racial hierarchies.

[...] and broader application scope in future work (Blodgett et al., 2020; Hanna et al., 2020).

Common data sets are narrow in scope
The papers we surveyed suggest that research on race in NLP has used a very limited range of data sets, which fails to account for the multidimensionality of race and simplifications inherent in classification. We identified 3 common data sources:
• 9 papers use a set of tweets with inferred probabilistic topic labels based on alignment with U.S. census race/ethnicity groups (or the provided inference model) (Blodgett et al., 2016).
• 11 papers use lists of names drawn from Sweeney (2013), Caliskan et al. (2017), or Garg et al. (2018). Most commonly, 6 papers use African/European American names from the Word Embedding Association Test (WEAT) (Caliskan et al., 2017), which in turn draws data from Greenwald et al. (1998) and Bertrand and Mullainathan (2004).
• 10 papers use explicit keywords like 'Black woman', often placed in templates like "I am a ___" to test if model performance remains the same for different identity terms.
While these commonly-used data sets can identify performance disparities, they only capture a narrow subset of the multiple dimensions of race (§2). For example, none of them capture self-identified race. While observed race is often appropriate for examining discrimination and some types of disparities, it is impossible to assess potential harms and benefits of NLP systems without assessing their performance over text generated by and directed to people of different races. The corpus from Blodgett et al. (2016) does serve as a starting point and forms the basis of most current work assessing performance gaps in NLP models (Sap et al., 2019; Blodgett et al., 2018; Xia et al., 2020; Xu et al., 2019; Groenwold et al., 2020), but even this corpus is explicitly not intended to infer race.
Furthermore, names and hand-selected identity terms are not sufficient for uncovering model bias. De-Arteaga et al. (2019) show this in examining gender bias in occupation classification: when overt indicators like names and pronouns are scrubbed from the data, performance gaps and potential allocational harms still remain. Names also generalize poorly. While identity terms can be examined across languages (van Miltenburg et al., 2017), differences in naming conventions often do not translate, leading some studies to omit examining racial bias in non-English languages (Lauscher and Glavaš, 2019). Even within English, names often fail to generalize across domains, geographies, and time. For example, names drawn from the U.S. census generalize poorly to Twitter (Wood-Doughty et al., 2018), and names common among Black and white children were not distinctly different prior to the 1970s (Fryer Jr and Levitt, 2004;Sweeney, 2013).
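To make the identity-term template approach concrete, the following minimal sketch fills a template with different terms and measures how much a scorer's output varies across otherwise identical sentences. It is our own illustration, not code from any surveyed paper: the `{}` slot syntax, the helper names, and in particular `toy_scorer` are assumptions, and the toy scorer is deliberately skewed so that the probe has something to detect. As argued above, a small gap on such a probe does not by itself show that a model is free of bias.

```python
from typing import Callable, Dict, List

# Template shape quoted in the text; the "{}" slot syntax is our own choice.
TEMPLATE = "I am a {}."

def fill_template(terms: List[str], template: str = TEMPLATE) -> Dict[str, str]:
    """Map each identity term to the sentence produced by the template."""
    return {term: template.format(term) for term in terms}

def score_gap(sentences: Dict[str, str], scorer: Callable[[str], float]) -> float:
    """Largest difference in scorer output across the filled sentences."""
    scores = [scorer(s) for s in sentences.values()]
    return max(scores) - min(scores)

def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for a real classifier's score.

    It is skewed on purpose; swap in a real model's scoring function.
    """
    return 0.8 if "black" in text.lower() else 0.1

sentences = fill_template(["Black woman", "white woman"])
gap = score_gap(sentences, toy_scorer)
print(sentences)
print(f"score gap: {gap:.2f}")
```

A scorer that ignores the identity term would produce a gap of zero on this probe.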
We focus on these 3 data sets as they were most common in the papers we surveyed, but we note that others exist. Preoţiuc-Pietro and Ungar (2018) provide a data set of tweets with self-identified race of their authors, though it is little used in subsequent work and focused on demographic prediction, rather than evaluating model performance gaps. Two recently-released data sets (Nadeem et al., 2020; Nangia et al., 2020) provide crowd-sourced pairs of more- and less-stereotypical text. More work is needed to understand any privacy concerns and the strengths and limitations of these data (Blodgett et al., 2021). Additionally, some papers collect domain-specific data, such as self-reported race in an online community (Loveys et al., 2018), or crowd-sourced annotations of perceived race of football players (Merullo et al., 2019). While these works offer clear contextualization, it is difficult to use these data sets to address other research questions.

Classification schemes operationalize race as a fixed, single-dimensional U.S.-census label
Work that uses the same few data sets inevitably also uses the same few classification schemes, often without justification. The most common explicitly stated source of racial categories is the U.S. census, which reflects the general trend of U.S.-centrism in NLP research (the vast majority of work we surveyed also focused on English). While census categories are sometimes appropriate, repeated use of classification schemes and accompanying data sets without considering who defined these schemes and whether or not they are appropriate for the current context risks perpetuating the misconception that race is 'natural' across geo-cultural contexts. We refer to Hanna et al. (2020) for a more thorough overview of the harms of "widespread uncritical adoption of racial categories," which "can in turn re-entrench systems of racial stratification which give rise to real health and social inequalities." At best, the way race has been operationalized in NLP research is only capable of examining a narrow subset of potential harms. At worst, it risks reinforcing racism by presenting racial divisions as natural, rather than the product of social and historical context (Bowker and Star, 2000).
As an example of questioning who devised racial categories and for what purpose, we consider the pattern of re-using names from Greenwald et al. (1998), who describe their data as sets of names "judged by introductory psychology students to be more likely to belong to White Americans than to Black Americans" or vice versa. When incorporating this data into WEAT, Caliskan et al. (2017) discard some judged African American names as too infrequent in their embedding data. Work subsequently drawing from WEAT makes no mention of the discarded names nor contains much discussion of how the data was generated and whether or not names judged to be white or Black by introductory psychology students in 1998 are an appropriate benchmark for the studied task. While gathering data to examine race in NLP is challenging, and in this work we ourselves draw from examples that use Greenwald et al. (1998), it is difficult to interpret what implications arise when models exhibit disparities over this data and to what extent models without disparities can be considered 'debiased'.
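For readers unfamiliar with the statistic, the sketch below computes a WEAT-style effect size as it is commonly defined following Caliskan et al. (2017): each target word's association is its mean cosine similarity to attribute set A minus its mean cosine similarity to attribute set B, and the effect size is the difference in mean association between target sets X and Y, divided by the standard deviation of associations over both target sets. The vectors here are invented two-dimensional toys, the function names are ours, and the use of the sample standard deviation is an assumption (implementations differ on this point). The number it produces is only as meaningful as the name lists fed into it, which is the concern raised above.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w: Vector, A: List[Vector], B: List[Vector]) -> float:
    """Mean similarity of w to attribute set A minus mean similarity to B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)

def weat_effect_size(X: List[Vector], Y: List[Vector],
                     A: List[Vector], B: List[Vector]) -> float:
    """Difference in mean association of target sets X and Y, scaled by the
    sample standard deviation over all targets (an assumption; see lead-in)."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)

# Invented toy vectors: X lies near A, Y lies near B.
X = [(1.0, 0.0), (0.9, 0.1)]
Y = [(0.0, 1.0), (0.1, 0.9)]
A = [(1.0, 0.1)]
B = [(0.1, 1.0)]

effect = weat_effect_size(X, Y, A, B)
print(f"effect size: {effect:.3f}")
```

Swapping the two target sets flips the sign of the result, which is a quick sanity check on any implementation.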
Finally, almost all of the work we examined conducts single-dimensional analyses, e.g. focusing on race or gender but not both simultaneously. This focus contrasts with the concept of intersectionality, which has shown that examining discrimination along a single axis fails to capture the experiences of people who face marginalization along multiple axes. For example, consideration of race often emphasizes the experience of gender-privileged people (e.g. Black men), while consideration of gender emphasizes the experience of race-privileged people (e.g. white women). Neither reflects the experience of people who face discrimination along both axes (e.g. Black women) (Crenshaw, 1989). A small selection of papers have examined intersectional biases in embeddings or word co-occurrences (Herbelot et al., 2012; May et al., 2019; Tan and Celis, 2019; Lepori, 2020), but we did not identify mentions of intersectionality in any other NLP research areas. Further, several of these papers use NLP technology to examine or validate theories on intersectionality; they do not draw from theory on intersectionality to critically examine NLP models. These omissions can mask harms: Jiang and Fellbaum (2020) provide an example using word embeddings of how failing to consider intersectionality can render invisible people marginalized in multiple ways. Numerous directions remain for exploration, such as how 'debiasing' models along one social dimension affects other dimensions. Surveys in HCI offer further frameworks on how to incorporate identity and intersectionality into computational research (Schlesinger et al., 2017; Rankin and Thomas, 2019).

NLP research on race is restricted to specific tasks and applications
Finally, Table 1 reveals many common NLP applications where race has not been examined, such as machine translation, summarization, or question answering. While some tasks seem inherently more relevant to social context than others (a claim we dispute in this work, particularly in §5), research on race is compartmentalized to limited areas of NLP even in comparison with work on 'bias'. For example, Blodgett et al. (2020) identify 20 papers that examine bias in co-reference resolution systems and 8 in machine translation, whereas we identify 0 papers in either that consider race. Instead, race is most often mentioned in NLP papers in the context of abusive language, and work on detecting or removing bias in NLP models has focused on word embeddings. Overall, our survey identifies a need for the examination of race in a broader range of NLP tasks, the development of multi-dimensional data sets, and careful consideration of context and appropriateness of racial categories. In general, race is difficult to operationalize, but NLP researchers do not need to start from scratch, and can instead draw from relevant work in other fields.

NLP propagates marginalization of racialized people
While in §4 we primarily discuss race as a topic or a construct, in this section, we consider the role, or more pointedly, the absence, of traditionally underrepresented people in NLP research.

People create data
As discussed in §3.2, data and annotations are generated by people, and failure to consider who created data can lead to harms. In §3.2 we identify a need for diverse training data in order to ensure models work for a diverse set of people, and in §4 we describe a similar need for diversity in data that is used to assess algorithmic fairness. However, gathering this type of data without consideration of the people who generated it can introduce privacy violations and risks of demographic profiling.
As an example, in 2019, partially in response to research showing that facial recognition algorithms perform worse on darker-skinned than lighter-skinned people (Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019), researchers at IBM created the "Diversity in Faces" data set, which consists of 1 million photos sampled from the publicly available YFCC-100M data set and annotated with "craniofacial distances, areas and ratios, facial symmetry and contrast, skin color, age and gender predictions" (Merler et al., 2019). While this data set aimed to improve the fairness of facial recognition technology, it included photos collected from Flickr, a photo-sharing website whose users did not explicitly consent to this use of their photos. Some of these users filed a lawsuit against IBM, in part for "subjecting them to increased surveillance, stalking, identity theft, and other invasions of privacy and fraud." NLP researchers could easily repeat this incident, for example, by using demographic profiling of social media users to create more diverse data sets. While obtaining diverse, representative, real-world data sets is important for building models, data must be collected with consideration for the people who generated it, such as obtaining informed consent, setting limits of uses, and preserving privacy, as well as recognizing that some communities may not want their data used for NLP at all (Paullada, 2020).

People build models
Research is additionally carried out by people who determine what projects to pursue and how to approach them. While statistics on ACL conferences and publications have focused on geographic representation rather than race, they do highlight under-representation. Out of 2,695 author affiliations associated with papers in the ACL Anthology for 5 major conferences held in 2018, only 5 (0.2%) were from Africa, compared with 1,114 from North America (41.3%) (http://www.marekrei.com/blog/geographic-diversity-of-nlp-conferences/). Statistics published for 2017 conference attendees and ACL fellows similarly reveal a much higher percentage of people from "North, Central and South America" (55% attendees / 74% fellows) than from "Europe, Middle East and Africa" (19%/13%) or "Asia-Pacific" (23%/13%) (https://www.aclweb.org/portal/content/acl-diversity-statistics). These broad regional categories likely mask further under-representation, e.g. percentage of attendees and fellows from Africa as compared to Europe. According to an NSF report that includes racial statistics rather than nationality, 14% of doctorate degrees in Computer Science awarded by U.S. institutions to U.S. citizens and permanent residents were awarded to Asian students, < 4% to Black or African American students, and 0% to American Indian or Alaska Native students (National Center for Science and Engineering Statistics, 2019); these results exclude respondents who did not report race or ethnicity or were Native Hawaiian or Other Pacific Islander.
It is difficult to envision reducing or eliminating racial differences in NLP systems without changes in the researchers building these systems. One theory that exemplifies this challenge is interest convergence, which suggests that people in positions of power only take action against systematic problems like racism when it also advances their own interests (Bell Jr, 1980). Ogbonnaya-Ogburu et al. (2020) identify instances of interest convergence in the HCI community, primarily in diversity initiatives that benefit institutions' images rather than underrepresented people. In a research setting, interest convergence can encourage studies of incremental and surface-level biases while discouraging research that might be perceived as controversial and force fundamental changes in the field.
Demographic statistics are not sufficient for avoiding pitfalls like interest convergence, as they fail to capture the lived experiences of researchers. Ogbonnaya-Ogburu et al. (2020) provide several examples of challenges that non-white HCI researchers have faced, including the invisible labor of representing 'diversity', everyday microaggressions, and altering their research directions in accordance with their advisors' interests. Rankin and Thomas (2019) further discuss how research conducted by people of different races is perceived differently: "Black women in academia who conduct research about the intersections of race, gender, class, and so on are perceived as 'doing service,' whereas white colleagues who conduct the same research are perceived as doing cutting-edge research that demands attention and recognition." While we draw examples about race from HCI in the absence of published work on these topics in NLP, the lack of linguistic diversity in NLP research similarly demonstrates how representation does not necessarily imply inclusion. Although researchers from various parts of the world (Asia, in particular) do have some numerical representation among ACL authors, attendees, and fellows, NLP research overwhelmingly favors a small set of languages, with a heavy skew towards European languages (Joshi et al., 2020) and 'standard' language varieties (Kumar et al., 2021).

People use models
Finally, NLP research produces technology that is used by people, and even work without direct applications is typically intended for incorporation into application-based systems. With the recognition that technology ultimately affects people, researchers on ethics in NLP have increasingly called for considerations of whom technology might harm and suggested that there are some NLP technologies that should not be built at all. In the context of perpetuating racism, examples include criticism of tools for predicting demographic information (Tatman, 2020) and automatic prison term prediction (Leins et al., 2020), motivated by the history of using technology to police racial minorities and related criticism in other fields (Browne, 2015; Buolamwini and Gebru, 2018; McIlwain, 2019). In cases where potential harms are less direct, they are often unaddressed entirely. For example, while low-resource NLP is a large area of research, a paper on machine translation of white American and European languages is unlikely to discuss how continual model improvements in these settings increase technological inequality. Little work on low-resource NLP has focused on the realities of structural racism or differences in lived experience and how they might affect the way technology should be designed.
Detection of abusive language offers an informative case study on the danger of failing to consider people affected by technology. Work on abusive language often aims to detect racism for content moderation (Waseem and Hovy, 2016). However, more recent work has shown that existing hate speech classifiers are likely to falsely label text containing identity terms like 'black' or text containing linguistic markers of AAE as toxic (Dixon et al., 2018; Sap et al., 2019; Davidson et al., 2019; Xia et al., 2020). Deploying these models could censor the posts of the very people they purport to help.
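One simple way to surface the kind of disparity described above is to compare false positive rates across groups of texts, i.e. how often non-abusive text is flagged as abusive. The sketch below is our own illustration rather than the evaluation protocol of any cited paper: the group names, labels, and predictions are invented toy values, and in practice the group labels themselves (for example, inferred dialect alignment) carry the limitations discussed in §4.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (group, gold_label, predicted_label); 1 = abusive, 0 = not abusive.
Example = Tuple[str, int, int]

def fpr_by_group(examples: Iterable[Example]) -> Dict[str, float]:
    """False positive rate per group: flagged non-abusive / all non-abusive.

    Groups with no non-abusive examples are omitted from the result.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, gold, pred in examples:
        if gold == 0:
            negatives[group] += 1
            if pred == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / n for g, n in negatives.items()}

# Invented toy data for illustration only.
toy = [
    ("aae-aligned", 0, 1), ("aae-aligned", 0, 1),
    ("aae-aligned", 0, 0), ("aae-aligned", 0, 0), ("aae-aligned", 1, 1),
    ("other", 0, 1), ("other", 0, 0),
    ("other", 0, 0), ("other", 0, 0), ("other", 1, 1),
]
rates = fpr_by_group(toy)
print(rates)
```

Overall accuracy can look acceptable while these per-group rates diverge, which is why a disaggregated view is useful before deployment.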
In other areas of statistics and machine learning, focus on participatory design has sought to amplify the voices of people affected by technology and its development. An ICML 2020 workshop titled "Participatory Approaches to Machine Learning" highlights a number of papers in this area (Kulynych et al., 2020;Brown et al., 2019). A few related examples exist in NLP; for instance, Gupta et al. (2020) gather data for an interactive dialogue agent intended to provide more accessible information about heart failure to Hispanic/Latinx and African American patients. The authors engage with healthcare providers and doctors, though they leave focus groups with patients for future work. While NLP researchers may not be best situated to examine how people interact with deployed technology, they could instead draw motivation from fields that have stronger histories of participatory design, such as HCI. However, we did not identify citing participatory design studies conducted by others as common practice in the work we surveyed. As in the case of researcher demographics, participatory design is not an end-all solution. Sloane et al. (2020) provide a discussion of how participatory design can collapse to 'participation-washing' and how such work must be context-specific, long-term, and genuine.

Discussion
We conclude by synthesizing some of the observations made in the preceding sections into more actionable items. First, NLP research needs to explicitly incorporate race. We quote Benjamin (2019): "[technical systems and social codes] operate within powerful systems of meaning that render some things visible, others invisible, and create a vast array of distortions and dangers." In the context of NLP research, this philosophy implies that all technology we build works in service of some ideas or relations, either by upholding them or dismantling them. Any research that is not actively combating prevalent social systems like racism risks perpetuating or exacerbating them. Our work identifies several ways in which NLP research upholds racism:
• Systems contain representational harms and performance gaps throughout NLP pipelines
• Research on race is restricted to a narrow subset of tasks and definitions of race, which can mask harms and falsely reify race as 'natural'
• Traditionally underrepresented people are excluded from the research process, both as consumers and producers of technology
Furthermore, while we focus on race, which we note has received substantially less attention than gender, many of the observations in this work hold for social characteristics that have received even less attention in NLP research, such as socioeconomic class, disability, or sexual orientation (Mendelsohn et al., 2020;Hutchinson et al., 2020).
Nevertheless, none of these challenges can be addressed without direct engagement with marginalized communities of color. NLP researchers can draw on precedents for this type of engagement from other fields, such as participatory design and value sensitive design models (Friedman et al., 2013). Additionally, numerous organizations already exist that serve as starting points for partnerships, such as Black in AI, Masakhane, Data for Black Lives, and the Algorithmic Justice League.
Finally, race and language are complicated, and while readers may look for clearer recommendations, no one data set, model, or set of guidelines can 'solve' racism in NLP. For instance, while we draw from linguistics, Hudley et al. (2020) in turn call on linguists to draw models of racial justice from anthropology, sociology, and psychology. Relatedly, there are numerous racialized effects that NLP research can have that we do not address in this work; for example, Bender et al. (2021) and Strubell et al. (2019) discuss the environmental costs of training large language models, and how global warming disproportionately affects marginalized communities. We suggest that readers use our work as one starting point for bringing inclusion and racial justice into NLP.

been supported in part by the Canada 150 Research Chair program and the UK-Canada Artificial Intelligence Initiative. A.F. has been supported in part by a Google PhD Fellowship and a GRFP under Grant No. DGE1745016. This material is based upon work supported in part by the National Science Foundation under Grants No. IIS2040926 and IIS2007960. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

Ethical Considerations
We, the authors of this work, are situated in the cultural contexts of the United States of America and the United Kingdom/Europe, and some of us identify as people of color. We all identify as NLP researchers, and we acknowledge that we are situated within the traditionally exclusionary practices of academic research. These perspectives have impacted our work, and there are viewpoints outside of our institutions and experiences that our work may not fully represent.

B Additional Survey Metrics
We show three additional breakdowns of the data set: Figure 1 shows the number of papers published each year, Figure 2 shows the number of papers published in each venue, and Table 2 shows how papers have operationalized race. As expected, given the growth of NLP research in general and the increasing focus on social issues (e.g. an "Ethics and NLP" track was added to ACL in 2020), more work has been published on race in more recent years (2019, 2020).

In Figure 2, we consider whether work on race has been siloed into or out of specific venues. The majority of papers were published in workshops, which is consistent with the large number of workshop papers overall. In 2019, approximately 2,038 papers were published in workshops and 1,680 papers were published in conferences (ACL, EMNLP, NAACL, CoNLL, CICLing), meaning 54.8% were published in workshops. In our data set, 46.8% of papers surveyed were published in workshops. The largest number of papers were published in the largest conferences: ACL and EMNLP. Thus, while Table 1 suggests that discussions of race have been siloed to particular NLP applications, Figure 2 does not show evidence that they have been siloed to particular venues.

In Table 2, for all papers that use a categorization scheme to classify race, we show what racial categories they use. If a paper uses multiple schemes (e.g. collects crowd-sourced annotations of stereotypes associated with different races and also asks annotators to self-report their race), we report each scheme as a separate data point. This table does not include papers that do not specify racial categories (e.g. examine "racist language" without specifying targeted people or analyze semantic change of topics like "racism" and "prejudice"). Finally, we map terms used by papers to the ones in Table 2, e.g. papers examining African American vs. European American names are included in BW.
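As a check on the arithmetic, the 2019 workshop share can be recomputed from the two counts reported above, and the rule that each scheme counts as a separate data point can be sketched as a simple tally. This is a minimal illustration, not the procedure used to build Figure 2 or Table 2: the paper identifiers and the scheme label `scheme_x` are hypothetical placeholders, and only `BW` is a label taken from the text.

```python
from collections import Counter


def percent(part, whole):
    """Share of `part` in `whole`, expressed as a percentage."""
    return 100.0 * part / whole


# 2019 counts as reported in the text above.
workshop_papers = 2038
conference_papers = 1680
workshop_share = percent(workshop_papers, workshop_papers + conference_papers)
print(f"Workshop share of 2019 papers: {workshop_share:.1f}%")

# Hypothetical papers mapped to the categorization schemes they use.
# A paper using two schemes contributes two data points; a paper that
# specifies no racial categories contributes none.
papers = {
    "paper_a": ["BW"],
    "paper_b": ["BW", "scheme_x"],
    "paper_c": [],
}
scheme_counts = Counter(
    scheme for schemes in papers.values() for scheme in schemes
)
print(scheme_counts)
```

Under this counting rule the total number of data points can exceed the number of papers that specify racial categories, so per-scheme counts are not directly comparable to paper counts.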
The majority of papers focus on binary