Sociolectal Analysis of Pretrained Language Models

Using data from English cloze tests, in which subjects also self-reported their gender, age, education, and race, we examine performance differences of pretrained language models across demographic groups defined by these (protected) attributes. We demonstrate wide performance gaps across demographic groups and show that pretrained language models systematically disfavor young non-white male speakers; i.e., pretrained language models not only learn social biases (stereotypical associations) – they also learn sociolectal biases, learning to speak more like some speakers than like others. We show, however, that, with the exception of BERT models, larger pretrained language models reduce some of the performance gaps between majority and minority groups.


Introduction
While speakers of English generally understand each other, our linguistic preferences may differ, depending on the linguistic varieties we were exposed to, and how susceptible we were to them at the time (Tagliamonte and D'Arcy, 2009). Linguistic varieties form a multi-dimensional continuum of dialects and sociolects (McCormack et al., 2011). 1 Such linguistic variation presents a real challenge: The linguistic preferences of English pretrained language models may align better with the linguistic preferences of some groups in society than with those of others. Group disparities constitute a fairness problem (Sokol et al., 2020): If our technologies provide end users with new opportunities, group disparities mean unequal opportunities across groups. Moreover, if the groups are defined in terms of protected attributes, our technologies may discriminate between end users in ways that violate regulations.
After waiting three hours, Cal whined and started to ___.

Table 1: An example of a cloze (fill-in-the-gap) task with human and model predictions.
We evaluate the sociolectal biases of a range of pretrained English language models. Unlike previous work on biases in pretrained language models, we do not consider representational biases (Sun et al., 2019), but performance disparities; moreover, we do not consider downstream performance differences after fine-tuning for tasks such as coreference resolution (Rudinger et al., 2018) or machine translation (Stanovsky et al., 2019), but performance differences across demographics of the language models themselves on cloze (fill-in-the-gap) problems. Since the cloze task is how these pretrained language models are trained, we can evaluate models directly, without introducing biases from probes or downstream tasks.
Note that some sociolinguistic variables are salient (e.g., this→dis), while others are not (Jaeger and Weatherholtz, 2016). One strategy for evaluating the robustness of pretrained language models across groups would be to identify salient variables and evaluate language models in the context of those (Demszky et al., 2021). While such evaluations are easier to interpret than evaluations of performance parity, they typically cover only a small set of variables or lectal features and therefore run the risk of only scratching the surface of sociolectal variation. In contrast, we focus not on salient lectal features but on error rates across demographics (precision at k and mean reciprocal rank).

Contributions
We align the lexical preferences of 17 commonly used pretrained language models for English with fill-in-the-gap experimental data across 16 demographics (defined by four binary variables for gender, age, race, and education). The language models systematically disfavor young non-white male speakers. Other groups that are poorly aligned with language models include older white speakers. For ELECTRA and GPT-2, bigger models are more fair; for BERT, ALBERT, and RoBERTa, bigger models are less fair.

Experimental Setup
Dataset We use a publicly available cloze-style word prediction dataset 2 for our experiments. The dataset consists of fill-in-the-gap (cloze-style) example sentences in which the last word has been removed; Table 1 presents an example sentence. The sentences are generally narrative and open-ended and do not have standard answers. The data collectors asked the annotators to complete each sentence with what was, from their own experience, the most likely one-word continuation. At the same time, the annotators were asked (on a voluntary basis) to provide demographic information (age, gender, race, and educational background). The data has been anonymized by replacing unique user IDs with generated IDs. The dataset consists of 3,085 different sentences, with an average sentence length of 10 words. Each sentence is annotated by 104 different annotators on average, providing 35 different continuations on average. In 40% of the sentences, the most common continuation is provided by more than half of the annotators.
Statistics of the annotators are shown in Table 2. In total, the dataset includes 307 annotators. We focus on four protected attributes: age, gender, education, and race. We binarize each attribute to obtain roughly balanced groups, binning the annotators into a total of 16 different demographic groups. For brevity, we use emojis to represent the 16 groups. Note that for each attribute, the two group sizes never sum to 307; this is because a few annotators did not report this information. The number of annotators in each of the 16 groups can be found in Table 3.
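The binning into 16 groups can be sketched as follows. This is a minimal illustration in plain Python; the attribute value labels and the annotator record format are our assumptions, not the paper's, and the paper's exact binarization thresholds are not reproduced here.

```python
from itertools import product

# Hypothetical binarized values for the four protected attributes; each
# attribute is split into two roughly balanced groups, yielding 2**4 = 16
# demographic groups in total.
ATTRIBUTES = {
    "age": ("young", "old"),
    "gender": ("male", "female"),
    "education": ("low", "high"),
    "race": ("white", "non-white"),
}

def group_key(annotator):
    """Map an annotator record (a dict of binarized attribute values) to
    one of the 16 demographic group keys; return None if any attribute is
    missing, since a few annotators did not self-report."""
    try:
        return tuple(annotator[a] for a in ATTRIBUTES)
    except KeyError:
        return None

# Enumerate all 16 possible groups:
all_groups = list(product(*ATTRIBUTES.values()))
assert len(all_groups) == 16

annotator = {"age": "young", "gender": "male",
             "education": "low", "race": "non-white"}
print(group_key(annotator))  # ('young', 'male', 'low', 'non-white')
```

Annotators with a missing attribute fall outside all 16 groups, which is why the per-attribute group sizes never sum to 307.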
Pre-processing Each annotator was presented with 1,049 sentences on average; some examples were left unanswered. The data collectors manually corrected typos and agreement errors. We ignore multi-word completions.
1. BERT (Devlin et al., 2019) is a bidirectional Transformer encoder pretrained with masked language modeling. We use bert-base-cased, bert-base-uncased, bert-base-multilingual-cased, bert-large-cased, and bert-large-uncased.
2. DistilBERT (Sanh et al., 2019) (distilbert-base-uncased) is distilled from the original BERT model using knowledge distillation. The model is 40% smaller and 60% faster than a BERT model.
3. ALBERT (Lan et al., 2020) also reduces the number of parameters in the BERT architecture, using embedding matrix factorization and cross-layer parameter sharing. We use albert-base-v2, albert-large-v2, and albert-xxlarge-v2 below.
4. Liu et al. (2019) found that the BERT model is undertrained. They improved the pre-training by removing the next sentence prediction task and obtained better results by adjusting the hyperparameters. The resulting model, RoBERTa, achieves better performance on downstream tasks. We use two sizes of RoBERTa, roberta-base and roberta-large.
5. ELECTRA (Clark et al., 2020) uses a jointly trained discriminator network to distinguish the masked tokens from candidates suggested by the generator, avoiding costly inference over the full vocabulary. We use the generator models, google/electra-small-generator and google/electra-large-generator, which are suitable for the cloze-style word prediction.

Metrics
We follow Shin et al. (2020) in using precision at 1 (P@1) and mean reciprocal rank (MRR) to evaluate the extent to which pretrained language models are aligned with annotator preferences. Given an incomplete sentence $v_1 \ldots v_n$ __, let $W = [w_1, w_2, \cdots, w_r]$ be the $r$ most frequent continuations of $v_1 \ldots v_n$ (within a group of human annotators), and let $C = [c_1, c_2, \cdots, c_p]$ be the candidate words ranked by model likelihood. The P@1 of the language model is then defined as

$$\mathrm{P@1} = \mathbb{1}[c_1 = w_1]$$

where $\mathbb{1}[\cdot]$ is the indicator function, and MRR as

$$\mathrm{MRR} = \frac{1}{p}\sum_{i=1}^{p} \frac{1}{\mathrm{Rank}_W(c_i)}$$

where $\mathrm{Rank}_W(c_i)$ is the rank of $c_i$ in $W$ and equals $\infty$ if $c_i$ is not in $W$. We report average P@1 and MRR scores.

Table 3: Statistics of P@1 scores in different demographic groups. All floating point numbers are expressed as percentages.
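Under one plausible reading of the P@1 and MRR definitions above, the two metrics can be sketched as follows; the function names and the toy example are ours, not the paper's.

```python
def p_at_1(candidates, human_ranking):
    """P@1: 1 if the model's top candidate equals the most frequent
    human continuation, else 0."""
    return int(candidates[0] == human_ranking[0])

def mrr(candidates, human_ranking):
    """Mean reciprocal rank of the model's candidates in the human
    frequency ranking W; a candidate absent from W has rank infinity
    and so contributes 0."""
    total = 0.0
    for c in candidates:
        if c in human_ranking:
            total += 1.0 / (human_ranking.index(c) + 1)
    return total / len(candidates)

W = ["cry", "bark", "whine"]  # human continuations, most frequent first
C = ["cry", "eat", "bark"]    # model candidates, most likely first
print(p_at_1(C, W))  # 1
print(mrr(C, W))     # (1/1 + 0 + 1/2) / 3 = 0.5
```

Scores are then averaged over all cloze sentences to obtain the reported P@1 and MRR values.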

Q1: Outlier demographics?
Before comparing pretrained language models with human annotations, we first consider how continuations differ across demographic groups. We compare demographic groups by computing the average P@1 scores for individuals in each group relative to the overall majority vote (across all groups). Note that we can also compute the variance within groups by computing the P@1 scores for each annotator. Table 3 shows group-level P@1 scores for each demographic group (avg) of n annotators, as well as the variance across annotators in each group: max is the highest average annotator P@1 within the group, and min the lowest. We also report the range (max − min) and the standard deviation (std). The gap in group P@1 values is about 17%, and we observe several outlier groups: less-educated young non-white male annotators ( ), and educated non-white annotators ( , , and ).
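The per-group statistics reported in Table 3 can be computed along these lines. This is a sketch with hypothetical annotator IDs and group keys; the input is each annotator's average P@1 against the overall majority vote.

```python
from collections import defaultdict
from statistics import mean, pstdev

def group_p1_stats(annotator_scores, annotator_group):
    """annotator_scores: {annotator_id: average P@1 vs. the overall
    majority vote}; annotator_group: {annotator_id: group key}.
    Returns per-group n/avg/max/min/range/std, as in Table 3."""
    by_group = defaultdict(list)
    for aid, score in annotator_scores.items():
        by_group[annotator_group[aid]].append(score)
    stats = {}
    for g, scores in by_group.items():
        stats[g] = {
            "n": len(scores),
            "avg": mean(scores),
            "max": max(scores),
            "min": min(scores),
            "range": max(scores) - min(scores),
            "std": pstdev(scores),
        }
    return stats

scores = {"a1": 0.30, "a2": 0.20, "a3": 0.25}
groups = {"a1": "G1", "a2": "G1", "a3": "G2"}
print(round(group_p1_stats(scores, groups)["G1"]["range"], 2))  # 0.1
```

The "range" column here is the within-group spread across annotators; the 17% gap discussed above is the spread of the group-level averages.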

Q2: Unfair language models?
We evaluate the pretrained language models on the cloze examples: we obtain the logits vector of the final layer at the masked token (gap), apply softmax normalization to obtain the top-10 most likely candidate words, and compute the fairness of the pretrained language models based on these predictions. Our fairness metric is an instance of multi-group ε-fairness (Donini et al., 2018), sometimes referred to as min-max Rawlsian fairness (Zafar et al., 2017), and says a model is ε-fair if the risk across any two groups differs by at most ε. To this end, we compare the pretrained language models' range of P@1 and MRR scores across groups. For each pretrained language model, we compute the maximum performance difference across any two groups (ε, or range). If a language model m is ε-fair for some value of ε, and no other language model is ε-fair, m is the most fair language model in our batch. 4

Table 4 lists the performance of our pretrained language models across the 16 groups. We see that roberta-large has the best overall performance across groups (on both P@1 and MRR). Using P@1 as our performance metric, gpt2-xl is most fair; using MRR, bert-base-multilingual-cased is most fair. 5 We make the following general observations: Larger models perform better, but are not necessarily more fair. For ELECTRA and GPT-2, fairness increases with model size. For BERT, ALBERT, and RoBERTa, the opposite is true: Group disparity increases with model size. Generally, though, the ELECTRA models are significantly more sensitive to protected attributes. google/electra-small-generator is, across both metrics, the least fair model in our batch.
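The candidate extraction and the range-based fairness metric can be sketched as follows; this is a minimal plain-Python illustration with a toy vocabulary and toy group scores of our own, not the paper's data.

```python
import math

def top_k_candidates(logits, vocab, k=10):
    """Softmax-normalize the logits at the masked position and return
    the k most likely candidate words with their probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda wp: -wp[1])
    return ranked[:k]

def fairness_range(group_scores):
    """Maximum performance difference (epsilon) across any two groups;
    smaller is fairer under min-max (Rawlsian) fairness."""
    vals = list(group_scores.values())
    return max(vals) - min(vals)

vocab = ["cry", "bark", "eat", "sleep"]
logits = [3.2, 2.9, 0.1, -1.0]
print([w for w, _ in top_k_candidates(logits, vocab, k=2)])  # ['cry', 'bark']
print(round(fairness_range({"G1": 0.31, "G2": 0.24, "G3": 0.28}), 2))  # 0.07
```

Because the range is the max pairwise gap, the model with the smallest range is ε-fair for the smallest ε in the batch, i.e., the most fair model under this metric.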

Q3: Demographics of models?
Finally, we explore how specific pretrained language models align with group-level preferences: are different pretrained language models aligned with different demographics, or are they all biased in similar ways? Table 5 illustrates the alignment between pretrained language models and demographic groups, and its lower part shows the mean rank of each group across all models. The correlations between different models' rankings are shown as a heat map in Figure 1 in the Appendix. We make the following observation: The language models systematically disfavor young non-white male speakers. Other groups that are poorly aligned with language models include older white speakers.
google/electra-small-generator, which was, across both metrics, the least fair model in our batch, favors . is generally the demographic that our pretrained language models align best with. This is also a demographic known to contribute the most to crowdsourced resources such as Wikipedia and social media (Hargittai and Shaw, 2015; Barthel et al., 2016), which pretrained language models are often trained on. Interestingly, the second most aligned demographic is . The two most fair models favor (GPT-2) and (mBERT).

Conclusion
We compared pretrained language models to human cloze completions and showed that models align much better with some groups of participants than with others. For ELECTRA and GPT-2, larger models are more fair; for BERT, ALBERT, and RoBERTa, the opposite is true. ELECTRA models are the least fair. Generally, models disfavor young non-white men the most. Previous work has explored social biases in how pretrained language models represent concepts (May et al., 2019; Kurita et al., 2019), but this is, to the best of our knowledge, the first work on whose language pretrained language models reflect, i.e., which sociolects models align best with. Work on personalized language modeling (Garimella et al., 2017; Welch et al., 2020) is loosely related; e.g., Stoop and van den Bosch (2014) present interesting work on making word prediction sociolect-aware, showing significant keystroke savings by conditioning on sociolect.

Ethics Statement
Our paper considers the sensitivity of pretrained language models to protected attributes. The data used in our experiments was collected using Amazon Mechanical Turk by researchers in psycholinguistics. The protected attributes are self-reported on a voluntary basis, and annotators were paid equally regardless of whether they reported this information. We find that pretrained language models are sensitive to protected attributes and hence biased toward some groups. For practical reasons, we use emojis as a short-hand to represent these groups.
We consulted with two experts on emojis and cultural identity to make sure our use of emojis did not reinforce stereotypes, and both assessed that it would not.

Appendix

Table 6 shows detailed information about the 17 models used in this paper: the name of the model, the number of parameters, the vocabulary size, the tokenizer used for segmentation, and the pretraining task adopted. For more detail, we refer to the corresponding papers.

Figure 1 shows the correlations between the alignment rankings of different models, obtained by computing Kendall's Tau correlation coefficients pairwise over the ranking sequences in Table 5 and displayed as a heat map. The blue-outlined squares mark correlations among models of the same type. Red blocks indicate that two models' rankings are highly correlated; blue blocks indicate that they are weakly or even negatively correlated. Models of the same type (within the blue squares) usually show higher similarity. In addition, albert-large-v2 and gpt2 have relatively high similarities with other models, while gpt2-large and roberta-base are less correlated with other models.
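The pairwise rank correlation underlying Figure 1 can be sketched in plain Python. This toy version assumes no tied ranks (scipy.stats.kendalltau offers an equivalent that also handles ties); the example rankings are illustrative, not from Table 5.

```python
from itertools import combinations

def kendall_tau(ranks_a, ranks_b):
    """Kendall's Tau between two rankings of the same items: the
    normalized difference between concordant and discordant pairs."""
    assert len(ranks_a) == len(ranks_b)
    concordant = discordant = 0
    for i, j in combinations(range(len(ranks_a)), 2):
        s = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(ranks_a) * (len(ranks_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Identical rankings correlate at 1, fully reversed rankings at -1:
print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```

For each pair of models, the two rank sequences over the 16 demographic groups are compared this way, and the resulting coefficients fill the heat map.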