Evaluation of African American Language Bias in Natural Language Generation

We evaluate how well LLMs understand African American Language (AAL) in comparison to their performance on White Mainstream English (WME), the encouraged "standard" form of English taught in American classrooms. We measure LLM performance using automatic metrics and human judgments for two tasks: a counterpart generation task, where a model generates AAL (or WME) given WME (or AAL), and a masked span prediction (MSP) task, where models predict a phrase that was removed from their input. Our contributions include: (1) evaluation of six pre-trained, large language models on the two language generation tasks; (2) a novel dataset of AAL text from multiple contexts (social media, hip-hop lyrics, focus groups, and linguistic interviews) with human-annotated counterparts in WME; and (3) documentation of model performance gaps that suggest bias, and identification of trends in models' lack of understanding of AAL features.


Introduction
Task-specific models proposed for speech recognition, toxicity detection, and language identification have previously been documented to present biases for certain language varieties, particularly for African American Language (AAL) (Sap et al., 2021; Koenecke et al., 2020; Meyer et al., 2020; Blodgett and O'Connor, 2017). There has been little investigation, however, of possible language variety biases in Large Language Models (LLMs) (Dong et al., 2019; Brown et al., 2020; Raffel et al., 2019), which have unified multiple tasks through language generation.
While there are largely beneficial and socially relevant applications of LLMs, such as alleviating barriers to accessing mental health counseling and medical healthcare (Hsu and Yu, 2022), there is also potential for biased models to exacerbate existing societal inequalities (Kordzadeh and Ghasemaghaei, 2022; Chang et al., 2019; Bender et al., 2021). Past algorithms and popular word embedding methods used in psychiatry and medicine have been shown to be racially biased, in some cases leading, for example, to underestimating patient risk and denial of patient care (Obermeyer et al., 2019; Straw and Callison-Burch, 2020). Furthermore, LLMs capable of understanding AAL and other language varieties also raise important ethical implications, such as enabling increased police surveillance of minority groups (see Patton et al. 2020 and Section 8 for further discussion). It is therefore necessary to investigate the language variety biases of language generation models, both to increase the accessibility of applications with high social impact and to anticipate possible harms when such models are deployed.
Moreover, prior work (Grieser, 2022) has shown that African American speakers talking about race-related issues use language in ways that may draw on morphosyntactic features of AAL to subtly foreground the race aspect of the discussion topic without explicit mention. Without the ability to interpret these subtler meanings, LLMs will undoubtedly exacerbate such misunderstandings. White Mainstream English (WME) is defined (Baker-Bell, 2020; Alim and Smitherman, 2012) as the dialect of English reflecting the linguistic norms of white Americans, which has become the encouraged "standard" form of English taught in American classrooms. While previous linguistic literature commonly uses the term "Standard American English," we employ WME instead to avoid the implication that AAL and other language varieties are "non-standard" and to more precisely identify the demographics of the speakers of WME.
Examples of AAL and WME are shown in Table 1.
We evaluate understanding of AAL by LLMs through production of language in each language variety, using automatic metrics and human judgments for two tasks: a counterpart generation (CG) task akin to dialect translation (Wan et al., 2020; Harrat et al., 2019) (see examples in Table 2) and a masked span prediction (MSP) task in which models predict a phrase that was removed from their input, similar to Groenwold et al. (2020). We summarize our contributions as follows: (1) we evaluate six pre-trained, large language models on two language generation tasks, counterpart generation between language varieties and masked span prediction; (2) we use a novel dataset of AAL text from multiple contexts (social media, hip-hop lyrics, focus groups, and linguistic interviews) with human-annotated counterparts in WME; and (3) we document model performance gaps as evidence of bias and identify trends in models' understanding, or lack of understanding, of AAL features.

Training Data Biases
Biases in pre-trained language models can often be attributed to common training datasets, and the demographic makeup of these sources can provide insight into the diversity of language present (Hovy and Prabhumoye, 2021; Sap et al., 2021). Though few estimates of AAL prevalence in datasets exist, one study estimates that in the Colossal Clean Crawled Corpus (C4), only 0.07% of documents reflect AAL, while 97.8% are composed of WME (Dodge et al., 2021).
While digital divides between racial groups may be slowly closing, one study reports a significant gap in consistent internet access between African American (47.1%) and White (73.6%) users (Choi et al., 2022), with many others showing similar disparities (Dolcini et al., 2021; Rogers; Jackson et al., 2008; Gonzales, 2016; Wilson et al., 2003). Additionally, while 77% of news reporters are White, White workers make up only 65% of the US workforce (Zickuhr and Smith, 2012), and Wikipedia reports that only 0.05% of its contributors are African American, far lower than the national percentage (13%).
Additionally, the biases present in training data can be reinforced by human bias in annotations and dataset creation. As models learn from gold-standard outputs provided by annotators, they learn to reflect the culture and values of those annotators as well. This phenomenon of perpetuating biases from annotated data has been discovered and investigated in tasks such as hate speech detection (Sap et al., 2021; Xia et al., 2020; Harris et al., 2022; Davidson et al., 2019) and sentiment analysis (Kiritchenko and Mohammad, 2018; Garg et al., 2022), where individual perspectives can cause large variability in annotations (Sachdeva et al., 2022). Taken together, these concerns mean that models not only lack access to language resembling AAL during training, but also, when AAL is present in datasets, amplify the biases of annotators.

Data
There is significant variation in the use of features of AAL depending on, for example, the region or context of the speech or text (Washington et al., 1998; Hinton and Pollock, 2000). Accordingly, we collect a novel dataset of AAL from different contexts, using existing corpora of AAL where possible.
Data Sources. Data used to evaluate AAL biases are collected from six different sources. We draw texts from two existing datasets, the TwitterAAE corpus (Blodgett et al., 2018) and transcripts from the Corpus of Regional African American Language (CORAAL; Kendall and Farrington 2021), as well as four datasets collected specifically for this work. We collect all posts and comments from r/BlackPeopleTwitter belonging to "Country Club Threads," which designate threads where only Black Redditors and other people of color may contribute. Given the influence of AAL on hip-hop music, we collect hip-hop lyrics from 27 songs, 3 from each of 9 Black artists drawn from Morgan (2001) and Billboard's 2022 Top Hip-Hop Artists; each line of lyrics is treated as a text sample in AAL. Finally, we use the transcripts of 10 focus groups concerning grief and loss in the Harlem African American community. These focus groups were conducted as part of related work by the authors to better understand the impacts of police brutality, the COVID-19 pandemic, and other events on the grief experiences of African Americans, helping to automatically identify Black grief online. We find that the focus group dialogues exhibit the fewest AAL features of all the corpora.
For evaluation in the CG and MSP tasks, 50 texts are sampled from each dataset, resulting in 300 candidate texts in total. We use a set of surface- and syntactic-level patterns to approximately weight each sample by the density of AAL-like language within the text (see Appendix A); a sketch of this weighted sampling appears below. Twelve additional texts are also sampled from each dataset to provide in-context examples for models like GPT-3. Throughout the experiments, 4 pairs of AAL texts and corresponding WME counterparts are randomly sampled from this full set of 72 for each prediction; for smaller models like Flan-T5, these pairs are also used for fine-tuning.
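As an illustration, the following is a minimal sketch of density-weighted sampling. The regular expressions are hypothetical stand-ins for the actual surface and syntactic patterns in Appendix A, and the smoothing constant is our own assumption:

```python
import re
import numpy as np

# Hypothetical surface-level patterns; the real set is given in Appendix A.
AAL_PATTERNS = [r"\bain't\b", r"\bfinna\b", r"\btryna\b", r"\bbe\s+\w+in'?\b"]

def aal_density(text: str) -> float:
    """Approximate density of AAL-like language: pattern hits per word."""
    hits = sum(len(re.findall(p, text.lower())) for p in AAL_PATTERNS)
    return hits / max(1, len(text.split()))

def sample_by_density(texts: list[str], k: int, seed: int = 0) -> list[str]:
    """Draw k distinct texts, weighted by approximate AAL feature density."""
    rng = np.random.default_rng(seed)
    weights = np.array([aal_density(t) for t in texts]) + 1e-6  # smooth zeros
    idx = rng.choice(len(texts), size=k, replace=False,
                     p=weights / weights.sum())
    return [texts[i] for i in idx]
```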
Data Annotations. Our interdisciplinary team includes computer scientists, linguists, and social work scientists, and we were thus able to recruit knowledgeable annotators to construct intent-equivalent re-writings of AAL texts into WME, referred to as counterparts. The four human annotators included three linguistics students and a social work scientist, all of whom are native AAL speakers and thus have knowledge of the linguistic and societal context of AAL and racial biases. Annotators were asked to rewrite each AAL text in WME, ensuring the counterparts preserve the original meaning as closely as possible (see Appendix B.1).
To compute inter-annotator agreement, we asked each annotator to label the 72 prompts; each annotator also shared a distinct 10% of the remainder of the dataset with each other annotator. We compute agreement using Krippendorff's alpha with Levenshtein distance (Braylan et al., 2022). For the AAL interpretations, annotators showed 80% agreement (α = .8000). After removing pairs from the dataset where annotators determined that no counterpart exists, the final dataset consists of 346 AAL-WME text pairs, including the 72 prompts.
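To make this agreement measure concrete, the following is a simplified sketch of Krippendorff's alpha with a length-normalized Levenshtein distance; Braylan et al. (2022) define the full procedure, and the normalization by the longer string's length and the pooling scheme for expected disagreement are our own assumptions:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def norm_distance(a: str, b: str) -> float:
    """Levenshtein distance normalized by the longer string's length."""
    return levenshtein(a, b) / max(1, len(a), len(b))

def krippendorff_alpha(units: list[list[str]]) -> float:
    """Simplified alpha = 1 - D_o / D_e: observed disagreement D_o averages
    distances between annotations of the same unit; expected disagreement
    D_e averages distances over all pairs of pooled annotations."""
    within = [norm_distance(x, y)
              for unit in units
              for i, x in enumerate(unit) for y in unit[i + 1:]]
    pooled = [s for unit in units for s in unit]
    across = [norm_distance(x, y)
              for i, x in enumerate(pooled) for y in pooled[i + 1:]]
    return 1.0 - (sum(within) / len(within)) / (sum(across) / len(across))
```

Here each inner list holds the annotators' rewrites of one AAL text; the published measure may weight units and annotators differently than this pooled estimate does.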

Methods
We evaluate multiple language generation models using two tasks. The first task mimics dialect translation (Wan et al., 2020; Harrat et al., 2019): we evaluate models on producing near semantically-equivalent WME text given AAL text, and vice versa. This task enables measuring the ability to interpret and understand AAL. The second task, masked span prediction, requires models to predict tokens to replace words and phrases hidden, or masked, from the input. This task mimics auto-completion as in Groenwold et al. (2020), but spans vary in length and position in the input. Much like BART pre-training (Lewis et al., 2019), span lengths are drawn from a Poisson distribution, but with λ = 2 given that texts in our dataset are shorter than BART's pre-training data; span locations are sampled uniformly across words in the original text. We mask noun phrases, verb phrases, and random spans from the text for more fine-grained analysis.
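A minimal sketch of the random-span variant of this masking procedure (noun and verb phrases would instead be located with a parser, which we omit; the mask token and the handling of zero-length draws are assumptions):

```python
import numpy as np

def mask_random_span(text: str, mask_token: str = "<mask>",
                     lam: float = 2.0, seed: int | None = None):
    """Hide one span: length ~ Poisson(lam) (as in BART pre-training, but
    with lam = 2 for our shorter texts); start uniform over word positions."""
    rng = np.random.default_rng(seed)
    words = text.split()
    length = max(1, int(rng.poisson(lam)))              # avoid empty spans
    start = int(rng.integers(0, max(1, len(words) - length + 1)))
    reference = " ".join(words[start:start + length])   # gold span to predict
    masked = words[:start] + [mask_token] + words[start + length:]
    return " ".join(masked), reference
```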
Flan-T5 (Chung et al., 2022) is a variation of the T5 model optimized for instruction-finetuning, similarly to the aforementioned InstructGPT models (Ouyang et al., 2022), and is evaluated only on the counterpart generation task; we use the flan-t5-large checkpoint. T5, a model pre-trained on multiple text-to-text tasks simultaneously, is evaluated using the t5-large checkpoint (Raffel et al., 2019). Finally, BART, a language model pre-trained on denoising tasks, is evaluated using the bart-large checkpoint (Lewis et al., 2019). Flan-T5, GPT-3, ChatGPT, and GPT-4 are evaluated on the CG task, while GPT-3, BART, and T5 are evaluated on the MSP task. We note that the GPT models besides GPT-3 were not included in the MSP task because the OpenAI API does not provide token probabilities for chat-based models.
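For reference, the open checkpoints can be loaded with Hugging Face Transformers roughly as follows; the hub identifiers for the Flan-T5 and BART checkpoints named above are our assumptions:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Seq2seq checkpoints named in the text; hub ids are assumed.
CHECKPOINTS = {
    "flan-t5": "google/flan-t5-large",
    "t5": "t5-large",
    "bart": "facebook/bart-large",
}

def load(name: str):
    """Load a tokenizer/model pair for one of the evaluated checkpoints."""
    model_id = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model
```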
In the counterpart generation task, GPT models are provided four in-context examples for each counterpart. An example of the instruction and 4 randomly sampled prompts from the AAL dataset is provided in Figure 1. Notably, the instruction text uses "African American Vernacular English" and "Standard American English" because these are the most widely used terms in the literature. Prompts with these terms were also assigned lower perplexity than "African American Language" and "White Mainstream English" by all GPT models, and lower-perplexity prompts have been shown to improve task performance (Gonen et al., 2022). Additionally, models are simply asked to translate, with no additional instructions, in order to examine their natural tendencies and behavior on tasks involving AAL text. Alternatively, Flan-T5 is fine-tuned on prompt pairs for five epochs with learning rate 3e-5.
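A sketch of how such a few-shot prompt might be assembled; the instruction string and the `=>` separator follow Figure 1, while the exact line layout is an assumption:

```python
import random

INSTRUCTION = ("Translate the following African American Vernacular English "
               "into Standard American English:")

def build_cg_prompt(pairs: list[tuple[str, str]], query: str,
                    n_examples: int = 4, seed: int | None = None) -> str:
    """Build a counterpart-generation prompt: the instruction, n_examples
    randomly sampled AAL => WME pairs, and the final AAL text to translate."""
    rng = random.Random(seed)
    lines = [INSTRUCTION]
    for aal, wme in rng.sample(pairs, n_examples):  # drawn from the 72 prompts
        lines.append(f"{aal} => {wme}")
    lines.append(f"{query} =>")                     # model completes the WME side
    return "\n".join(lines)
```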

Metrics
Counterpart Generation. We use both automatic and human evaluation metrics for the counterpart generation task. As with most generation tasks, we first measure n-gram overlap between the model generations and gold-standard reference texts; in our experiments, we use the Rouge metric. In addition, to account for the weaknesses of word-overlap measures, we also measure coverage of gold-standard references with BERTScore (Zhang et al., 2019) using the microsoft/deberta-large-mnli checkpoint. Specifically, the original AAL text is the gold standard for model-generated AAL, and the human-annotated WME counterpart is the gold standard for model-generated WME. In some experiments, Rouge-1, Rouge-L, and BERTScore are presented as gaps, where scores for generating WME are subtracted from those for generating AAL. Due to the tendency of models to avoid toxic outputs and neutralize text, we also consider the percentage of toxic terms removed in moving from model inputs to outputs. Toxicity scores are computed as the number of words appearing in the bias word list of Zhou et al. (2021), and the percent change between inputs and outputs is calculated as (Tox_in - Tox_out) / Tox_in; a sketch of this computation follows the human-evaluation details below.

Human evaluation is also conducted on the generated counterparts. The linguistics students and social work scientist who were involved in creating the dataset of aligned counterparts were also asked to judge model generations. As a baseline, human-generated counterparts are included in the human evaluation. 100 WME and AAL texts, along with their associated model generations and annotations, are randomly sampled from the dataset for human evaluation. Because the same annotators both generate and judge counterparts, we ensure that annotators do not rate human- or model-generated counterparts for texts whose WME counterpart they initially wrote. All annotators are asked to rate each counterpart using four 5-point Likert scales on the dimensions detailed below.
Human-likeness measures whether the annotator believes the text was generated by a human or a language model. Linguistic Match measures how well the language of the counterpart is consistent with the intended English variety (i.e., AAL or WME). Meaning Preservation measures how accurately the counterpart conveys the meaning of the original text. Finally, Tone Preservation measures how accurately the counterpart conveys the tone, and other aspects beyond meaning, of the original text. Additional details on the annotation instructions are included in Appendix B.2.
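Returning to the automatic toxicity measure above, a minimal sketch under the stated definition; whitespace tokenization and the convention for inputs with no toxic terms are assumptions:

```python
def toxicity(text: str, bias_words: set[str]) -> int:
    """Toxicity score: count of tokens in the bias word list (Zhou et al., 2021)."""
    return sum(tok in bias_words for tok in text.lower().split())

def pct_toxicity_removed(model_input: str, model_output: str,
                         bias_words: set[str]) -> float:
    """Percent change in toxicity: (Tox_in - Tox_out) / Tox_in."""
    tox_in = toxicity(model_input, bias_words)
    tox_out = toxicity(model_output, bias_words)
    return 0.0 if tox_in == 0 else (tox_in - tox_out) / tox_in  # assumed convention
```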
Masked Span Prediction. Span predictions are evaluated using only automated metrics. First, we calculate the model perplexity of the reference span. Additionally, we measure the entropy of the top 10 most probable tokens in model predictions, averaged across tokens in the span. Experiments are repeated 5 times, randomly sampling the spans to mask in each trial, with the exception of GPT-3. Metrics are reported as the percent change in perplexity between WME and AAL. Both computations are sketched below.
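Both metrics can be computed from per-token log-probabilities; a sketch, where renormalizing the entropy over the top k tokens is our own assumption:

```python
import math

def span_perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of the reference span: exp of the mean negative
    token log-probability under the model."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def mean_topk_entropy(topk_logprobs_per_token: list[list[float]]) -> float:
    """Entropy over the top-k most probable tokens at each span position
    (renormalized over the k tokens), averaged across the span."""
    entropies = []
    for logprobs in topk_logprobs_per_token:
        probs = [math.exp(lp) for lp in logprobs]
        z = sum(probs)                                  # renormalize over top k
        entropies.append(-sum((p / z) * math.log(p / z) for p in probs))
    return sum(entropies) / len(entropies)
```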

Counterpart Generation
Figure 2 shows results using automatic coverage metrics on counterpart generations in AAL and WME. Rouge-1, Rouge-L, and BERTScore (the coverage scores) for model output are computed over the generated AAL or WME in comparison to the corresponding gold standards. We note that the models consistently perform better when generating WME, indicating that it is harder for models to reproduce content and wording similar to the gold standard when generating AAL. ChatGPT is the worst model at producing WME from AAL, and ChatGPT and GPT-3 are nearly equally bad at producing AAL from WME. Flan-T5 does best for both language varieties, likely because it was directly fine-tuned for the task. Additionally, compared to the coverage between ground-truth references, models generally generate text that is further from the gold-standard reference than the input is ("Human" purple bars in Figure 2). This suggests that it is difficult for models to generate counterparts in either direction.
Figure 3 shows human judgments of model-generated WME and model-generated AAL. With the exception of Flan-T5, we see that model-generated WME is judged as more human-like and closer to the intended language variety than model-generated AAL. These results confirm the results from automatic metrics showing that models have an easier time generating WME than AAL. In contrast, for meaning and tone, the reverse is true, indicating that models generate WME that does not match the meaning of the human counterpart WME. The differences between scores on AAL and WME were significant on all metrics for at least two of the models, as determined by a two-tailed t-test of the means (see * in Figure 3 for models with significant differences). We also see that, relative to judgments of human-generated WME, the drop in meaning and tone scores is larger than the drop in human-likeness and dialect scores. These observations suggest that models have a hard time interpreting AAL.
Toxicity scores (Figure 6) show that models tend to remove toxic words when generating both AAL and WME. To test whether removal of toxic words contributed to the inability to preserve meaning, we computed both coverage scores and human evaluation scores on two subsets of the data: toxic posts (containing toxic words) and non-toxic posts (those without any toxic words). We display these results as the difference in coverage scores and in human judgment scores between model-generated AAL and WME, as shown in Figure 4 and Figure 5; positive scores indicate that AAL performs better. Some neutralization of toxic language when generating either dialect is expected. There are notable differences, however, in the extent to which this neutralization occurs: the results show that a significantly higher proportion of toxic language is removed when generating WME from AAL than in the reverse direction.
Here we see that human judgments on meaning and tone show that generated WME is worse than generated AAL for both the toxic and non-toxic subsets. Thus, differences in the use of toxic words between input and output cannot be the sole cause of lower scores on meaning and tone. This confirms that models have difficulty interpreting features of AAL. We note furthermore that gaps in coverage are consistently larger for the non-toxic subsets of the data, demonstrating that the use of profanity and toxic language is also not the primary cause of gaps in coverage metrics.

Discussion: How well do models interpret AAL?

We discussed earlier how Figure 5 demonstrates that models have difficulty interpreting AAL when generating WME. Figure 7 supports this finding as well: models generally output higher perplexities for masked spans in AAL compared to aligned WME, with the largest gaps in perplexity between the two language varieties assigned to masked verb phrases. One set of distinct features characterizing AAL is aspectual verbs, such as the habitual be, so this result may suggest that models struggle with the use of aspectual verbs in particular over other AAL features. A similar trend is found in the entropy metric, suggesting that AAL text also lowers model confidence: models place higher probability on, and are more confident in, their top predictions for WME than for AAL sentences. These results for AAL support similar findings by Groenwold et al. (2020) for GPT-2 in an auto-completion setting.
Models seem to correctly interpret certain features of AAL, namely the use of ain't, double negation, and habitual be. They struggle, however, with many other features. Examples of misinterpretation are shown in Table 4, illustrating difficulty with several aspects of AAL. Several mistakes involve lexical interpretation, such as in example 1, where the model is not able to interpret the meaning of "he faded" as 'he's high', and example 2, where the model inserts shorty, apparently intending the meaning 'youth' instead of its more common meaning of 'girlfriend'. The models also struggle with features that are grammatically camouflaged; that is, they appear on the surface to be the same as their WME counterparts but in fact have a slightly different meaning in AAL. These include remote past been (ex. 2), which is incorrectly interpreted as past perfect (have been), and existential it (ex. 3), which in WME is closest in meaning to "there" as in "there are ..." and is not correctly interpreted by any model, all of which fail to change the term in the input. We also include an example (ex. 4) where GPT-4 misinterprets the phrase "a nigga" as referencing another person, when in the provided context the use most closely resembles referencing oneself. The word nigga is one of the N-words, a set of lexical items well-documented by linguists as being misunderstood by white speakers, who tend to assume they are one word instead of separate lemmas, underestimate the degree to which Black speakers find their uses offensive, and fail to understand their syntactic complexities (Grieser, 2019). In particular, here, while the model removes the word in generating the WME counterpart, it does not correctly understand the use of the impostor pronoun N-word in the input text, wherein the word refers back to the subject of the sentence. In short, the model behaves exactly as we might expect a white speaker to: removing it as a "bad word" but doing so devoid of understanding of its linguistic function.
Without this ability to understand the structures where the N-words appear, it is probable that models will misinterpret input containing them as overly toxic, and also possible that models will insert them in places where speakers of AAL find them deeply offensive; both failure modes pose risks to the AAL-speaking communities whose language is evaluated by such models. It is important to approach the use of the N-words within the context of predicting AAL with sensitivity, history, context, and a broader understanding of the potential harm caused by the varied uses of the N-words in society, and these mistakes signal that such sensitivity is out of scope for computational models.
In additional analysis of counterpart generations, we examined model performance gaps on each subset of the AAL dataset. Among subsets of the data, gaps between Rouge-1 metrics for AAL and WME counterparts vary significantly. GPT-4, for example, presents the largest performance gap for the TwitterAAE corpus (Blodgett and O'Connor, 2017) and the smallest gaps for the hip-hop and focus group subsets, as shown in Figure 8. Manual inspection reveals that this aligns with trends in AAL use among the subsets: distinct features of AAL appear to be more frequent in the TwitterAAE dataset, while AAL features in the focus group transcripts appear to be more sparse. This pattern may be due to the makeup and context of the focus groups, as most participants were college-educated working professionals selected specifically to describe grief experiences, possibly affecting the use of AAL in the discussions. Additionally, though the hip-hop dataset has a high average edit distance from its WME counterparts, the large differences can be attributed in part to stylistic choices unrelated to the dialect (e.g., breaking grammar rules to create rhymes) rather than to AAL features. These results suggest, as would be expected, that a higher density of AAL features leads to larger performance gaps.

Related Work
While few have specifically focused on bias against African American Language (AAL) in language generation, related work has extensively investigated societal biases in language tasks. One large-scale study (Ziems et al., 2022) investigates performance on the standard GLUE benchmark (Wang et al., 2018) using a synthetically constructed AAL version of GLUE for fine-tuning. They show performance drops on a small human-written AAL test set unless the RoBERTa model is fine-tuned.
Racial Bias in Generation. Mitigating and evaluating social biases in language generation models is a challenging problem due to the apparent trade-offs between task performance and bias mitigation, the many possible sources of bias, and the variety of biases and perspectives to examine (Sheng et al., 2021b; Akyürek et al., 2022). A number of studies have proposed bias evaluation measures, often using prompts crafted to reveal biased associations of, for example, occupation and gender (e.g., "The woman worked as") (Sheng et al., 2019, 2020; Kiritchenko and Mohammad, 2018; Dhamala et al., 2021; Shen et al., 2022), and in other cases graph representations to detect subjective bias in summarization (Li et al., 2021) and personas for dialogue generation (Sheng et al., 2021a). However, the bias measurements in many of these approaches are not directly applicable to language in a natural setting, where the real-life harmful impacts of bias in language generation would be more prevalent.
AAL Feature Extraction. Past work makes progress in lowering performance gaps between AAL and WME by focusing on linguistic feature extraction tasks. Given that some features of AAL, such as the aspectual verbs (e.g., habitual be, remote past been), do not have semantically equivalent forms in WME (Green, 2009), standard part-of-speech (POS) taggers and dependency parsers cannot maintain performance on AAL text. Studies have attempted to lessen this gap by creating a POS tagger specifically for AAL through domain adaptation (Jørgensen et al., 2016) and a dependency parser for AAL in Tweets (Blodgett et al., 2018). Beyond these tasks, considerable attention has been given to developing tools for features specific to AAL and other language varieties, such as detecting dialect-specific constructions (Masis et al., 2022; Demszky et al., 2021; Santiago et al., 2022; Johnson et al., 2022), to aid in bias mitigation strategies.
AAL in Language Tasks. Bias has also been measured specifically with respect to AAL in downstream, user-facing tasks. Given the phonological differences between AAL and WME, automatic speech recognition (ASR) systems have shown large performance drops when transcribing speech from African American speakers (Koenecke et al., 2020; Martin and Tang, 2020; Mengesha et al., 2021). Toxicity detection and offensive language classification models have likewise been shown to label AAL text as toxic or offensive with higher probability than WME text (Zhou et al., 2021; Rios, 2020; Sap et al., 2021). Most closely related to this work, one study evaluated bias against AAL in transformer generation models, showing that in a sentence auto-completion setting, GPT-2 generates AAL text with more negative sentiment than aligned WME text (Groenwold et al., 2020). Further investigation of both a larger set of language generation models and a broader set of generation tasks would provide a clearer picture of model biases against AAL.

Conclusion
We demonstrate, through investigation of two tasks, counterpart generation and masked span prediction, that current LLMs have difficulty both generating and interpreting AAL. Our results show that LLMs do better at matching the wording of gold-standard references when generating WME than when generating AAL, as measured by Rouge and BERTScore. Human evaluation shows that LLM output is more likely to be judged human-like and to match the intended dialect when generating WME than when generating AAL. Notably, however, LLMs show difficulty in generating WME that matches the meaning and tone of the gold standard, indicating difficulty in interpreting AAL. Our results suggest that more work is needed to develop LLMs that can interact appropriately with those who use AAL, a capability that is important as LLMs are deployed in socially impactful contexts (e.g., medical or crisis settings).

Ethics Statement
We recognize that measuring and identifying potential bias against AAL in LLMs should also include a critically reflexive analysis of what happens if language models become better at detecting and understanding AAL, and the extent to which that impacts AAL speakers across social systems. In prior research, Patton et al. (2020) note that researchers engaged in qualitative analysis of data through language processing should understand the context of the data and how algorithmic systems will transform behavior for individual, community, and system-level audiences. Critical Race Theory posits that racism exists across language practices and interactions (Delgado and Stefancic, 2023). Including African American representation in language models could potentially benefit AAL speakers in some instances (e.g., patient notes that fully convey the pain African American patients are experiencing in the emergency room; Booker et al. 2015) but could inadvertently be harmful in contexts where African Americans continue to be surveilled (e.g., social media analysis for policing).
While datasets are typically publicly released for use in further work, the ethical concerns detailed above require careful consideration. We will make the dataset accessible to those who have signed a research license and intend to use it for ethical research purposes. Moreover, our focus is on model capabilities in interpreting AAL, not necessarily their capabilities in generating AAL. Despite this, there is potential for the findings to be used to train a language model to generate AAL, which could result in the aforementioned ethical issues. Alternatively, given the paucity of work on language variety biases in LLMs, there is an opportunity for our findings to inform educational modules and frameworks for considering ethical concerns in this line of work and in LLMs in general.

Figure 1: Example prompt provided to GPT models in the counterpart generation task, including the instruction, four prompt pairs, and a final AAL text for which the model generates its WME counterpart.

Figure 2: Automatic coverage metrics of model-generated AAL and WME counterparts. "Human" (purple) scores represent coverage metrics between the original AAL text and human-annotated WME counterparts. Significant differences between scores in the WME → AAL direction and in the AAL → WME direction are denoted by * (p ≤ .05).

Figure 3: Human judgments for model-generated AAL and WME counterparts. "Human" (purple) scores represent judgments of original AAL text and human-annotated WME counterparts. Significant differences between scores in the WME → AAL direction and in the AAL → WME direction are denoted by * (p ≤ .05).

Figure 4: Gaps in the Rouge-1 metric on non-toxic (no toxic terms) and toxic (at least one toxic term) subsets of the AAL dataset. Negative Rouge gaps indicate greater WME performance than AAL.

Figure 5: Gaps in human judgments of human- and model-generated counterparts, broken down into toxic and non-toxic subsets. Negative scores indicate better WME performance.

Figure 6: Percentage of toxicity removed for AAL and the respective aligned WME counterparts.

Figure 7: Percent difference in perplexity and top-5 entropy between AAL and aligned WME texts. Negative percentages indicate lower WME perplexity/entropy than AAL perplexity/entropy.

Figure 8: Breakdown of Rouge score gaps and percent changes in toxicity by data source in counterpart generation. Negative Rouge values indicate higher WME Rouge scores, and positive ∆T scores indicate the model-generated counterpart is more toxic than the gold standard.

Table 4 :
Examples of source AAL texts with annotated WME counterparts and model generations, illustrating misinterpreted AAL features (see the discussion of how well models interpret AAL).

Ex 1:
Source AAL Text: shit, i'm tryna get my thing goin', too. just bring my car back. shit, he faded.
Annotated WME Text: Shit, I'm trying to do my own thing, too. Just bring my car back, shit, he's high.
Flan-T5 Generated WME: Hey, I'm trying to get my thing going, too. Just bring my car back. Hey, he faded.
Flan-T5 Generated AAL: Shit, I'm trying to do my own thing, too. Just bring my car back, shit, he's high.

Ex 2:
Source AAL Text: been wilding since a juvi, she was a good girl.
Annotated WME Text: I was wild since I was a juvenile; she was a good girl.
ChatGPT Generated WME: I have been behaving recklessly since my time in juvenile detention, but she was a well-behaved girl.
ChatGPT Generated AAL: I been wildin' since I was a shorty; she was a straight up good girl.

Ex 3:
Source AAL Text: It used to be broken controllers.
Annotated WME Text: There used to be broken controllers.
GPT-4 Generated WME: It used to be that controllers would break.

Ex 4:
Source AAL Text: and they ain't gonna really keep it one thousand when niggas-let a nigga know what the fuck really going on,.
Human WME Counterpart: And they are not really going to tell the truth when people-let me know what the fuck is really going on.
GPT-4 Generated WME: And they aren't going to be completely honest when people-inform someone about what is truly happening.

Table 1 :
Example sentences in WME and AAL. Distinct features of AAL are highlighted in blue.

Table 3 :
Characterization of the novel AAL dataset by text source, including the number of text samples, length (in words) of aligned AAL and WME texts, Rouge-1 between aligned texts, edit distance between aligned texts, and the average toxicity scores among dialects.