A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models

Various types of social biases have been reported for pretrained Masked Language Models (MLMs) in prior work. However, multiple underlying factors are associated with an MLM, such as its model size, the size of the training data, its training objectives, the domain from which the pretraining data is sampled, its tokenization, and the languages present in the pretraining corpora. It remains unclear which of these factors influence the social biases learned by MLMs. To study the relationship between model factors and the social biases learned by an MLM, as well as the downstream task performance of the model, we conduct a comprehensive study over 39 pretrained MLMs covering different model sizes, training objectives, tokenization methods, training data domains and languages. Our results shed light on important factors often neglected in prior literature, such as tokenization and model objectives.


Introduction
Masked Language Models (MLMs) have achieved promising performance in many NLP tasks (Devlin et al., 2019; Liu et al., 2019; Liang et al., 2023). However, MLMs trained on massive amounts of textual data have also been found to encode concerning levels of social biases, such as gender and racial biases (Kaneko and Bollegala, 2019; May et al., 2019; Dev et al., 2020; Silva et al., 2021; Kaneko et al., 2022). In spite of the overall success of MLMs across NLP tasks, such biases raise ethical concerns and underscore the need for debiasing methods to ensure fair and unbiased MLMs.
On the other hand, MLMs are trained by considering and optimising various underlying factors that contribute to their performance on downstream tasks. These factors include, but are not limited to, parameter size, tokenization methods, training objectives and training corpora. The performance of MLMs is affected by the interplay of such factors.
Nevertheless, it remains unclear as to how these factors influence social biases in MLMs and their downstream task performance.
Evaluating the impact of these factors is challenging for three main reasons: (a) the factors we consider within a model are not independent; rather, they exhibit complicated interdependencies and affect the performance of models simultaneously; (b) MLMs are diverse, with different architectures, configurations and parameters, and this diversity requires generalisation and abstraction when encoding the values of factors; (c) many recent works have proposed debiasing methods to mitigate social biases in MLMs (Webster et al., 2020; Lauscher et al., 2021; Schick et al., 2021; Guo et al., 2022), yet most debiasing methods tend to worsen the performance of MLMs on downstream tasks (Meade et al., 2022). Therefore, it is crucial to consider the trade-off between social bias and downstream task performance when comparing MLMs.
To address the non-independence of factors, we propose a method using gradient boosting (Freund and Schapire, 1997) that considers dependencies among factors. Moreover, we use the coefficient of determination (R²; Nagelkerke et al., 1991) as a measure to analyse the importance of factors. Regarding the diversity of MLMs, we gather a broad set of MLMs across multiple languages and 4 domains, resulting in 39 MLMs for evaluation in total. Moreover, we use TweetEval and GLUE to evaluate the downstream task performance of MLMs, while evaluating their social biases intrinsically using All Unmasked Likelihood with Attention weights (AULA; Kaneko and Bollegala, 2022) on two benchmarks: StereoSet (SS; Nadeem et al., 2021) and the Crowdsourced Stereotype Pairs benchmark (CP; Nangia et al., 2020a). Note that we are not proposing novel bias evaluation measures in this paper; instead, we use existing metrics such as AULA.
Our experimental results indicate that model size, training objectives and tokenization are the three most important categories of factors affecting the social bias and downstream task performance of MLMs. Interestingly, we observe that models using Byte-Pair Encoding (BPE; Sennrich et al., 2016) exhibit lower levels of social biases while achieving the best downstream performance compared to models using other tokenization methods. Overall, multilingual models tend to be less biased than their monolingual counterparts.

Related Work
As MLMs have been successfully applied to diverse NLP tasks, it is important to study the factors that determine their social biases. Rogers et al. (2020) reviewed the current state of knowledge regarding how BERT works, what kinds of information it learns and how that information is encoded, common alterations to its training objectives and architecture, the overparameterization issue, and approaches to compression. Xia et al. (2020) studied contextualised encoders in various aspects and discussed the trade-off between task performance and the potential harms contained in the pretraining data. Later, Pérez-Mayos et al. (2021) investigated the effect of pretraining data size on the syntactic capabilities of RoBERTa, showing that models pretrained with more data tend to encode better syntactic information and generalise better across different syntactic structures. However, these studies focus on the downstream performance of MLMs, and none consider the social biases in MLMs.
On the other hand, models trained for different downstream tasks have been found to exhibit social biases. Kiritchenko and Mohammad (2018) evaluated gender and racial biases across 219 automatic sentiment analysis systems and discovered statistically significant biases occurring in several systems. Díaz et al. (2018) studied age-related biases in sentiment classification and discovered that significant age bias is encoded in the output of many sentiment analysis systems as well as word embeddings. Zhao et al. (2020) focused on gender bias in multilingual embeddings and its influence on the process of transfer learning for NLP applications. They showed that the level of bias in multilingual representations varies depending on how the embeddings are aligned to different target spaces, and that the alignment direction can also affect the bias in transfer learning. Choenni et al. (2021) investigated the types of stereotypical information that are captured by pretrained language models and showed the variability of attitudes towards various social groups among models and the rapid changes in emotions and stereotypes that occur during the fine-tuning phase.
Existing bias evaluation methods use different strategies, such as pseudo-likelihood (Kaneko and Bollegala, 2022), cosine similarity (Caliskan et al., 2017; May et al., 2019) and inner products (Ethayarajh et al., 2019), to name a few. Independently of any downstream task, intrinsic bias evaluation measures (Nangia et al., 2020b; Nadeem et al., 2021; Kaneko and Bollegala, 2022) evaluate social biases in MLMs in a standalone manner. However, given that MLMs are used to represent input texts in a variety of downstream tasks, multiple previous works have proposed that social biases should be evaluated with respect to those tasks (De-Arteaga et al., 2019; Webster et al., 2020). Kaneko and Bollegala (2021) showed that there exists only a weak correlation between intrinsic and extrinsic social bias evaluation measures. In this paper, we use an intrinsic bias evaluation measure, AULA, which has been shown to be the most reliable bias evaluation measure (Kaneko and Bollegala, 2022).
Although we specifically focus on MLMs in this paper, evaluating performance predictors has also been conducted for Neural Architecture Search (NAS) (White et al., 2021; Elsken et al., 2019) when designing high-performance neural networks. Generalisability of the identified factors is important in NAS because the selected neural architecture will be trained on different datasets to obtain different models. Although it would be ideal to perform a controlled training of MLMs, where we experimentally fix all factors except for a single target factor and then analyse the trained MLM, this is an expensive undertaking given the computational cost of training MLMs on large datasets. Instead, we consider already pretrained MLMs that are publicly available and do not train any MLMs in this work.

Analysis of Factors
In order to study the impact of different factors in MLMs, we consider 30 factors and split them into 5 categories.The details of each individual factor are provided in Appendix C.

Model Size
Models with smaller sizes are generally more lightweight and require fewer computational resources, making them suitable for deployment on resource-constrained devices or in environments with limited processing power. Larger models tend to perform better on downstream tasks but demand more memory, more computational power, and longer inference times. Moreover, different model architectures have varying numbers of layers and training parameters. Recently, MLMs have achieved impressive results on downstream tasks by scaling model size or training on larger datasets (Conneau et al., 2020; Goyal et al., 2021; Liang et al., 2023). To investigate the impact of model size, we consider 3 factors: (1) parameter size, (2) number of layers and (3) vocabulary size.
The parameter size is considered as a categorical feature, in which we divide the parameter size of an MLM into 3 categories according to pre-defined ranges.Specifically, we assign S if the parameter size of an MLM is less than 100M, M if the size is within 100M-300M, and L if the size is greater than 300M.Similarly, we convert the vocabulary size into the same three categories for the models with vocabulary sizes less than 50K, within 50K-100K and greater than 100K, respectively.For the number of layers, we use the number as a feature: 6, 12 and 24 layers.
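The bucketing described above can be sketched as follows. This is a minimal illustration: the helper names and the example model are ours, with the S/M/L thresholds taken from the text.

```python
def bucket_parameter_size(n_params: int) -> str:
    """Map a raw parameter count to the S/M/L categories from the paper."""
    if n_params < 100_000_000:
        return "S"
    elif n_params <= 300_000_000:
        return "M"
    return "L"


def bucket_vocab_size(n_vocab: int) -> str:
    """Same three buckets, with the 50K/100K thresholds for vocabulary size."""
    if n_vocab < 50_000:
        return "S"
    elif n_vocab <= 100_000:
        return "M"
    return "L"


# Illustrative example: a BERT-base-like model with ~110M parameters,
# a ~30K vocabulary and 12 layers.
features = {
    "parameter_size": bucket_parameter_size(110_000_000),  # "M"
    "vocab_size": bucket_vocab_size(30_522),               # "S"
    "num_layers": 12,                                      # used as-is
}
```

The number of layers is kept as a raw value rather than bucketed, matching the description above.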

Training Methods and Objectives
In this category, we consider the methods used during model pretraining as well as the training objectives of MLMs. First, we take into account different masking techniques, starting with the one initially proposed for training BERT (Devlin et al., 2019). Masked language modelling is a pretraining objective in which a certain percentage of words in the input text are randomly selected and replaced with a special [MASK] token, and the model is trained to predict the original word from its context. The BERT authors later proposed whole word masking, which aims to improve the handling of individual words within a context. Rather than randomly selecting WordPiece-produced (Wu et al., 2016) subtokens to mask, whole word masking always masks entire words at once, which has been shown to reduce ambiguity and enable better word-level comprehension and contextual understanding for MLMs (Cui et al., 2021). Apart from these two masking techniques, we consider three other training objectives: (a) next sentence prediction, (b) sentence ordering prediction, and (c) mention reference prediction. We treat each training objective as binary and assign 1 if it is used and 0 otherwise.
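The difference between masking individual subtokens and whole word masking can be illustrated with a small sketch. The toy tokens and helper name below are our own; we only assume the WordPiece convention of marking word-internal subtokens with a "##" prefix.

```python
def whole_word_mask(tokens, word_index, mask_token="[MASK]"):
    """Mask all subtokens belonging to the word at `word_index`.

    `word_index` counts whole words (0-based), not subtokens; a subtoken
    starting with "##" belongs to the same word as the token before it.
    """
    out, current_word = [], -1
    for tok in tokens:
        if not tok.startswith("##"):
            current_word += 1  # a new word starts here
        out.append(mask_token if current_word == word_index else tok)
    return out


tokens = ["the", "un", "##bear", "##able", "light", "##ness"]
# Masking word 1 ("unbearable") replaces all three of its subtokens,
# whereas random subtoken masking might mask only "##bear".
print(whole_word_mask(tokens, 1))
# ['the', '[MASK]', '[MASK]', '[MASK]', 'light', '##ness']
```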
Model distillation is a training technique that trains a small student model to transfer the knowledge and capabilities of a larger teacher model (Hinton et al., 2015). This technique has been shown to effectively compress a large language model while retaining performance comparable to the original model (Sanh et al., 2019; Wu et al., 2022). Model distillation is regarded as a binary factor, which is assigned 1 if an MLM uses model distillation and 0 otherwise.

Training Corpora
Training corpora are the collections of texts used to train MLMs. According to Kalyan et al. (2021), training corpora can be classified into four types: (1) general, (2) social media, (3) language-specific and (4) domain-specific (the individual corpora we cover are listed in Appendix C). Owing to the domain of an MLM being associated with the training corpora sampled from that domain, we additionally consider domain as a separate factor. This domain factor is included as a categorical factor, with 4 different domains as categories: general domain, social media, legal domain and biomedical domain. Finally, we take into account continuous training as a binary factor, which takes 1 if an MLM is continuously trained on training corpora from different domains and 0 if the model is trained from scratch.

Tokenization
Tokenization is the essential process of breaking a sequence of text down into smaller units, converting unstructured text into a format that can be processed by MLMs. Prior work studies different tokenization methods for MLMs in different languages (Park et al., 2020; Rust et al., 2021; Toraman et al., 2023); however, the impact of tokenization methods on social bias in MLMs and on different downstream tasks remains unexplored. Therefore, we consider the three most commonly used tokenization methods as the categories of a categorical factor: BPE, WordPiece and SentencePiece.

Language
In this category, we consider both monolingual and multilingual MLMs. Specifically, we regard language as a categorical factor and use English and Multilingual to represent whether an MLM is monolingual or multilingual, respectively. In addition, the number of languages is considered as a separate categorical factor, which takes the actual number of languages an MLM is trained on.

Masked Language Models
To conduct a comprehensive study of the different factors of MLMs affecting social bias and task performance, we evaluate 39 pretrained MLMs, 1 which we divide into four categories as follows. Table 1 summarizes all models.
1 In our initial selection of models, albert-xlarge-v2, nielsr/coref-roberta-large, vinai/bertweet-large, xlm-roberta-base and facebook/xlm-v-base attained unusually poor performance on the Corpus of Linguistic Acceptability (CoLA; a subtask of GLUE) and TweetEval. This is likely caused by an implementation issue that would require some modification. Therefore, to avoid potentially false outliers, we omitted these five models when evaluating TweetEval and GLUE.

Multilingual and general domain models
We take into account multilingual MLMs pretrained in the general domain, covering different settings of multilingual BERT (Devlin et al., 2019), DistilBERT (Sanh et al., 2019) and XLM-R (Conneau et al., 2020).
Multilingual and domain-specific models

We select multilingual MLMs pretrained on domain-specific corpora, namely XLM-T (Barbieri et al., 2022).

MLM Evaluation Metrics and Tasks
Our goal in this paper is to study the relationship between model factors and social bias as well as downstream task performance in pretrained MLMs.
For this purpose, we conduct our evaluations using three different types of tasks.

Social Bias
For an MLM under evaluation, we compare the pseudo-likelihood scores returned by the model for stereotypical and anti-stereotypical sentences using AULA (Kaneko and Bollegala, 2022).AULA evaluates social biases by taking into account the MLM attention weights as the importance of tokens.This approach is shown to be robust against frequency biases in words and offers more reliable estimates compared to alternative metrics used to evaluate social biases in MLMs.
Following the standard evaluation protocol, we provide the complete sentence S = t_1, ..., t_{|S|}, a sequence of |S| tokens t_i, to an MLM with pretrained parameters θ. We compute the Pseudo Log-Likelihood, denoted by PLL(S), of predicting all tokens in S excluding the begin and end of sentence tokens. The PLL(S) score of a sentence S, given by (1), can be used to evaluate the preference expressed by an MLM for S:

PLL(S) = (1/|S|) Σ_{i=1}^{|S|} α_i log P(t_i | S; θ)   (1)

Here α_i is the average of all multi-head attention weights associated with t_i, while P(t_i | S; θ) is the probability assigned by the MLM to token t_i conditioned on S.
Given a sentence pair, the percentage of stereotypical sentences (S_st) preferred by the MLM over anti-stereotypical ones (S_at) is considered as the AULA bias score of the MLM and is given by (2):

(100/N) Σ I(PLL(S_st) > PLL(S_at))   (2)

Here, N is the total number of text instances and I is the indicator function, which returns 1 if its argument is True and 0 otherwise. The AULA score given by (2) falls within the range [0, 100], and an unbiased model would return bias scores close to 50, whereas bias scores less than or greater than 50 indicate a bias towards the anti-stereotypical or stereotypical group, respectively.
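The scoring scheme above can be sketched in a few lines, with toy attention weights and token probabilities standing in for actual MLM outputs (in practice both come from a forward pass of the model; the function names here are our own):

```python
import math


def pll(attn_weights, token_probs):
    """Attention-weighted pseudo log-likelihood of a sentence, as in Eq. (1)."""
    n = len(token_probs)
    return sum(a * math.log(p) for a, p in zip(attn_weights, token_probs)) / n


def aula_bias_score(pairs):
    """Percentage of pairs where the stereotypical sentence is preferred,
    as in Eq. (2). `pairs` holds ((attn, probs) stereo, (attn, probs) anti)."""
    preferred = sum(
        1
        for (st_a, st_p), (at_a, at_p) in pairs
        if pll(st_a, st_p) > pll(at_a, at_p)
    )
    return 100.0 * preferred / len(pairs)


# Two toy pairs: in the first, the stereotypical sentence gets higher token
# probabilities; in the second, the anti-stereotypical one does.
pairs = [
    (([1.0, 1.0], [0.9, 0.8]), ([1.0, 1.0], [0.5, 0.4])),
    (([1.0, 1.0], [0.2, 0.3]), ([1.0, 1.0], [0.7, 0.6])),
]
print(aula_bias_score(pairs))  # 50.0, i.e. unbiased under this toy data
```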

Social Bias Benchmarks
We conduct experiments on the two most commonly used social bias evaluation datasets for MLMs: StereoSet (SS) and the Crowdsourced Stereotype Pairs benchmark (CP). SS contains associative contexts covering four types of social biases: race, gender, religion, and profession. CP is crowdsourced and annotated by workers in the United States, and consists of nine types of social biases: race, gender, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status/occupation. We follow Kaneko and Bollegala (2022) 2 and use the default setting for evaluation. We denote the AULA computed on the CP and SS datasets by A-CP and A-SS, respectively, in the remainder of the paper.

Downstream Performance
To further investigate the impact of the factors of an MLM on its downstream task performance, we evaluate MLMs on two additional benchmarks: the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) 3 and the social media TweetEval benchmark (Barbieri et al., 2020).

GLUE GLUE comprises 9 tasks for evaluating natural language understanding systems, including the Corpus of Linguistic Acceptability (CoLA), the Stanford Sentiment Treebank (SST-2), the Microsoft Research Paraphrase Corpus (MRPC), Quora Question Pairs (QQP), the Semantic Textual Similarity Benchmark (STS-B), Multi-Genre NLI (MNLI), Question NLI (QNLI), Recognizing Textual Entailment (RTE; Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) and Winograd NLI (WNLI; Levesque et al., 2012). These tasks are framed as classification tasks over either single sentences or pairs of sentences. We follow the fine-tuning procedures from prior work (Devlin et al., 2019; Liu et al., 2019) and report results on the development sets.
TweetEval TweetEval is a unified Twitter benchmark composed of seven heterogeneous tweet classification tasks: emoji prediction (Barbieri et al., 2018), emotion recognition (Mohammad et al., 2018), hate speech detection (Basile et al., 2019), irony detection (Van Hee et al., 2018), offensive language identification (Zampieri et al., 2019), sentiment analysis (Rosenthal et al., 2017) and stance detection (Mohammad et al., 2016). We use the same default setting as the original TweetEval baselines to fine-tune pretrained MLMs on the corresponding training set, and report average results on the test set over 3 runs.

Correlation between tasks
Table 2 shows the Pearson and Spearman correlation coefficients between each pair of task performances over the 39 MLMs. We observe a moderate correlation between A-SS and GLUE, which indicates that better performance on GLUE entails a higher stereotypical bias in SS (this also holds for CP, but to a much lesser extent). In contrast, there is no significant correlation between the models' performance on the downstream tasks themselves, i.e., TweetEval vs. GLUE.

Regression Analysis
To investigate the importance of the different factors on social bias and the task performance of MLMs, we train a regression model.

Experimental setting
We generate the features for each MLM according to its factors as described in §3. An example of the features of an individual model is given in Table 3. These features are fed into a regression model as input for training, using both social bias and task performance as outputs.
To select the best regression model for this purpose, we compare the performance of 6 different regression models on each prediction task: gradient boosting, support vector machine, Gaussian process regression, decision tree, random forest and linear regression. We measure the performance of the regression models by means of their coefficient of determination (R²) and root mean squared error (RMSE) for each prediction task. We use the regression models implemented in sklearn with the default parameters and take averages over three independent runs.
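For reference, the two selection metrics can be written out directly. These are plain-Python versions of the quantities sklearn's `r2_score` and (root) mean squared error compute; the toy score vectors are purely illustrative.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)          # total variance
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    )


# Toy example: four "true" task scores and a regressor's predictions.
y_true = [50.0, 55.0, 60.0, 65.0]
y_pred = [51.0, 54.0, 61.0, 64.0]
print(round(r2_score(y_true, y_pred), 3))  # 0.968
print(rmse(y_true, y_pred))                # 1.0
```

Note that R² can be negative when the model fits worse than simply predicting the mean, which is exactly the situation reported below for GLUE.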

Regression model comparison
The comparison of different regression models trained using the features from the monolingual MLMs in the general domain is shown in Table 4, while the performance of regression models trained using the features from all of the MLMs is shown in §A.1.
From Table 4, we observe that gradient boosting obtains the best performance in terms of both R² and RMSE on A-CP and TweetEval, while the decision tree obtains the best performance on A-SS. Almost all of the regression models return negative R² scores for GLUE, which proves hard to predict from the analysed factors. An error analysis of each GLUE subtask for gradient boosting in terms of R² scores is given in Appendix B. In contrast, the three remaining evaluations (A-CP, A-SS and TweetEval) can be predicted to a reasonable extent. The linear regression model obtains the lowest R² scores for both A-CP and GLUE, which indicates that a linear model may not be suitable for predicting the performance of MLMs. Given these results, we use gradient boosting for the rest of the experiments in this paper.
Feature importance In addition, to investigate the influence of the factors on the performance of MLMs beyond R² and RMSE, we compute the importance score of each factor after training the regression model, using the Gini importance implemented in sklearn. The Gini importance is computed as the normalised total reduction of the criterion brought by a feature. It captures the global contribution of each feature by aggregating the losses reduced in each split made on that feature (Delgado-Panadero et al., 2022).
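As an illustration of this procedure, the sketch below fits a gradient boosting regressor on a synthetic feature matrix and reads off the impurity-based (Gini) importances via sklearn's `feature_importances_` attribute. The data here is synthetic; in the paper, the matrix is built from the 30 MLM factors and the targets are the bias/task scores.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((39, 5))                     # 39 MLMs x 5 illustrative factors
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 39)  # target driven mainly by factor 0

model = GradientBoostingRegressor(random_state=0).fit(X, y)
importances = model.feature_importances_    # impurity-based, sums to 1
print(importances.argmax())                 # factor 0 should dominate
```

Since the target is constructed almost entirely from factor 0, its importance score dominates the others, mirroring how parameter size and tokenization stand out in the real analysis.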

Results
Feature importance Table 5 shows the Gini importance scores of each factor. Due to space constraints, we have omitted the factors that received an importance score of 0 (the full table is shown in §A.2). The parameter size obtains the largest importance score for A-CP, while it is the second most important factor for TweetEval and GLUE. Tokenization is the most important factor for A-SS and the second most important for A-CP. As a complementary analysis, we remove one of the important factors (i.e., the ones that obtain non-zero importance scores) at a time and track the R² score returned by the gradient boosting models trained on different tasks. Table 6 shows the results; a lower R² score after removal indicates a more important factor. Consistent with the results in Table 5, sentence ordering prediction and parameter size are the most and second most important factors for TweetEval, respectively, and parameter size is the most important factor for A-CP. For A-SS, uncased and tokenization are the most and second most important factors, respectively. In addition, we show the corresponding importance scores for each factor when training decision tree models in §A.3 and observe largely similar conclusions to those presented in Table 5.
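The leave-one-out analysis can be sketched as follows, again on synthetic data (factor names and dimensions are illustrative): drop one factor at a time, re-train, and compare cross-validated R² against the full feature set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((39, 4))                       # 39 MLMs x 4 illustrative factors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.05, 39)


def cv_r2(features, targets):
    """Cross-validated R² of a gradient boosting model on these features."""
    model = GradientBoostingRegressor(random_state=0)
    return cross_val_score(model, features, targets, cv=3, scoring="r2").mean()


baseline = cv_r2(X, y)
for j in range(X.shape[1]):
    score = cv_r2(np.delete(X, j, axis=1), y)  # retrain without factor j
    print(f"without factor {j}: R2 = {score:.3f} (full set: {baseline:.3f})")
```

Removing the dominant factor (factor 0 here) causes the largest drop in R², which is the same signal the paper reads off Table 6.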

Types of features
To further explore the impact of groups of factors, we consider the 5 categories from §3. To investigate the effect of a certain group, we conduct ablations both by removing one group of factors at a time and by retaining only one group at a time. Tables 7 and 8 show the corresponding results. The training objectives group of factors is paramount for social bias evaluation (i.e., A-CP and A-SS) in Table 7, as its removal leads to a substantial decrease in the R² scores. Meanwhile, Table 8 shows much lower R² scores across all cases. This is due to the discrepancy between removing a group of factors and retaining only one category, as the latter entails using far fewer features during training. Despite this limitation, we observe that by keeping tokenization only, the regression model can still make relatively accurate predictions for social bias evaluation, indicating that tokenization is important for social bias evaluation. In contrast, training corpora and training objectives are the most important groups for the downstream tasks.

Qualitative Analysis
With the knowledge from §6.2 that model size, tokenization, and training objectives are the three primary categories of factors influencing social bias and task performance in MLMs, we further investigate the factors within each category to discern their individual contributions. For this purpose, we calculate the average performance of MLMs when considering a specific feature associated with a factor for each task. To capture the overall trend, we extend our analysis to include not only monolingual MLMs in the general domain but also domain-specific and multilingual MLMs. Table 9 shows the average scores of the top-5 performing MLMs for each category of the important factors. Note that the number of models associated with each category within a specific factor differs. For example, there are 21 MLMs in the medium and 12 in the large categories, whereas there are only 6 in the small category for the parameter size factor. Because some outlier cases affect the overall averages, in Table 9 we compute the top-5 performing models in each category, whereas the averages over all models are shown in §A.4.

For the model size, we observe that the MLMs trained with a parameter size greater than 100M tend to have better downstream task performance, while reporting higher levels of social biases compared to the ones with a smaller parameter size. The models trained with 6 layers demonstrate the least amount of social bias (i.e., A-CP and A-SS close to 50); however, the ones trained with 24 layers obtain better downstream task performance. As for the vocabulary size, the models trained with small- and medium-sized vocabularies obtain the best performance on GLUE and TweetEval, respectively, whereas the ones with a large vocabulary size contain a lesser degree of social biases. Regarding the tokenization methods, we see that the models using SentencePiece contain the lowest level of social biases, while the models using BPE obtain better downstream task performance. With respect to training objectives, the models trained with both masked language modelling and mention reference prediction contain the fewest degrees of social biases. On the other hand, models trained with masked language modelling only, and models trained with both masked language modelling and sentence ordering prediction, return the best downstream task performance on TweetEval and GLUE, respectively.

Discussion
Model size. Larger MLMs have a higher capacity to achieve better performance on downstream tasks, while smaller models are computationally efficient and require fewer hardware resources. From Table 9, we see that models with a larger number of parameters and more layers report better performance on GLUE and TweetEval compared to the smaller and shallower models, while at the same time demonstrating lesser social biases in terms of A-CP and A-SS scores. Because the learning capacity of MLMs increases with the number of parameters and the depth (i.e., number of layers), it is not surprising that larger models outperform the smaller ones on downstream tasks. However, it is interesting to observe that social biases do not necessarily increase with this extra capacity. In the case of gender-related biases, Tal et al. (2022) showed that even if the gender bias scores measured on Winogender (Rudinger et al., 2018) are smaller for larger MLMs, these models make more stereotypical errors with respect to gender. Whether this observation generalises to all types of social biases remains an open question.
Mono- vs. Multi-lingual. Table 10 shows the performance of monolingual vs. multilingual models and general domain vs. domain-specific models on social bias and downstream tasks. Multilingual models are trained on corpora covering a larger number of languages, thus incorporating a greater spectrum of cultural diversity (Liang et al., 2020; Ahn and Oh, 2021). Consequently, the presence of these divergent viewpoints within the model can potentially mitigate social biases. Conversely, monolingual models obtain better performance on TweetEval and GLUE, which are limited to English.
General vs. Domain-specific. Table 10 (bottom) shows the performance of models from different domains. Recall from §4 that we include domain-specific models for the social media, legal and biomedical domains. As TweetEval is a social media benchmark, we additionally include the performance of models in the social media domain. Models in the social media domain contain the least bias according to A-CP and achieve the best performance on TweetEval. Conversely, models in the general domain perform better than domain-specific ones on GLUE and obtain the lowest bias score for A-SS. This result is not surprising, given that domain-specific models focus on improving tasks in their own domain rather than general tasks.

Conclusion
Despite extensive prior work evaluating social bias in MLMs, the relationship between social biases and the factors associated with MLMs remains unclear. To address this gap, we conducted the first comprehensive study of this type, a predictive factor analysis investigating the impact of different factors on social bias and task performance in pretrained MLMs, considering 39 models and 30 factors. We found that model size, tokenization and training objectives are the three most important factors across tasks. In terms of social biases, domain-specific models do not appear to be more or less biased than general-domain ones, while multilingual models, which are trained on corpora covering different languages and cultures, appear to be less socially biased than monolingual ones.

Limitations
This paper studies the impact of underlying factors of MLMs on social bias and downstream task performance. In this section, we highlight some of the important limitations of this work. We hope this will be useful when extending our work in the future by addressing these limitations.
As described in §6, some of the regression models considered in this paper could not be properly trained to predict GLUE performance from the features we considered. Extending the set of factors and further exploring the reasons for the poor performance of regression models on GLUE is one future direction.
We limit our work in this paper to evaluating the intrinsic social biases captured by MLMs. However, numerous extrinsic bias evaluation datasets exist, such as BiasBios (De-Arteaga et al., 2019), STS-bias (Webster et al., 2020) and NLI-bias (Dev et al., 2020). Extending our work to evaluate the extrinsic biases in MLMs is a natural line of future work.
Furthermore, our analysis focuses on MLMs and does not consider generative language models such as GPT-2 (Radford et al., 2019), Transformer-XL (Dai et al., 2019), XLNet (Yang et al., 2019) and GPT-3 (Brown et al., 2020). Investigating the relationship between model factors and social bias as well as downstream performance for such models is deferred to future work.
Finally, although we tried to collect as many MLMs as possible, the final number may be in some cases insufficient to draw conclusive numerical conclusions in the regression analysis.

Ethical Considerations
In this paper, we aim to investigate which factors affect the social biases captured by MLMs. We did not annotate nor release new datasets as part of this research; instead, we used multiple corpora and benchmark datasets that have been collected, annotated and repeatedly used for evaluation in prior work. To the best of our knowledge, no ethical issues have been reported concerning these datasets.
The bias evaluation datasets used in this paper cover only binary gender (Dev et al., 2021). Non-binary genders are severely underrepresented in the textual data used to train MLMs, and are often associated with derogatory adjectives. Evaluating social biases with respect to non-binary genders is therefore important.
Furthermore, biases are not limited to word representations but also appear in sense representations (Zhou et al., 2022). However, our analysis did not include any sense embedding models.

A Effect of Vocabulary Size and Tokenization
In the tables, MLM, WWM, NSP, SOP and MRP denote masked language modelling, whole word masking, next sentence prediction, sentence order prediction and mention reference prediction, respectively. Models with a small vocabulary size (i.e., less than 100K) obtain the best performance on both TweetEval and GLUE, while those with a large vocabulary size (i.e., greater than 100K) contain lesser degrees of social bias. Regarding tokenization methods, we see that models using BPE contain the lowest level of social bias (i.e., A-CP close to 50) and the best performance on both TweetEval and GLUE, while WordPiece is better for A-SS. Models using SentencePiece tokenization perform consistently worse across the board, although we note the limited number of language models analysed with this tokenization method.

B Error Analysis on GLUE
From Table 15, we observe that CoLA, MRPC and RTE obtain negative R² scores. This indicates that gradient boosting cannot be properly trained to predict the performance of models on CoLA, MRPC and RTE given the considered features. However, the GLUE score is computed by averaging the performance over the 9 subtasks. The non-predictive subtasks, especially RTE, therefore result in a negative R² score when predicting the overall performance of the MLM on GLUE.
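To illustrate why these scores can fall below zero, the following pure-Python sketch (using hypothetical accuracies, not the paper's actual scores) computes R², which compares a model's squared error against that of always predicting the mean; a regressor that fits the features poorly scores negative.

```python
# R^2 = 1 - SS_res / SS_tot; it is negative whenever the regressor's squared
# error exceeds that of the trivial mean predictor.

def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical held-out RTE accuracies for five MLMs, and the predictions
# of a poorly fitted regressor.
y_true = [0.66, 0.70, 0.55, 0.72, 0.61]
y_bad = [0.72, 0.55, 0.70, 0.58, 0.73]           # systematically wrong
y_mean = [sum(y_true) / len(y_true)] * 5         # the trivial mean predictor

print(r2_score(y_true, y_bad))   # negative: worse than predicting the mean
print(r2_score(y_true, y_mean))  # exactly 0.0
```

This also shows why averaging over subtasks does not rescue the GLUE prediction: a large negative R² on one subtask reflects errors that remain in the averaged target.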

C Details of Factors
To conduct a comprehensive study of MLMs trained using different types of training corpora, we cover four different domains (including the general domain): (1) General Domain: BookCorpus (Books), Wikipedia (Wiki), Common Crawl-News (CCNews), OpenWebText (OWT), and Stories; (2) Social Media: Tweets and the Reddit Abusive Language English dataset (RALE); (3) Legal Domain: Patent Litigations, Caselaw Access Project (Caselaw) and Google Patents Public Data (GooglePatents); (4) Biomedical Domain: full-text biomedical papers from the Semantic Scholar Open Research Corpus (S2ORC), PubMed Abstracts (PMed), PMC Full-text articles and the Medical Information Mart for Intensive Care III (MIMIC3). Finally, we also consider a multilingual corpus: the CommonCrawl Corpus in 100 languages (CC100). Each of the training corpora is considered as an individual binary factor.

Table 1 :
Summary of the MLMs in the analysis.

Table 2 :
Pearson and Spearman correlations between the models' performance on pairs of tasks, where † denotes statistically significant (p < 0.05) correlations.
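A short pure-Python sketch of the two correlation measures behind Table 2. The score lists are hypothetical, not the paper's actual task results; the significance marker (†) would additionally require p-values, such as those returned by `scipy.stats.pearsonr` and `spearmanr`.

```python
# Pearson measures linear association between two score lists; Spearman is
# Pearson computed on the rank-transformed scores, so it captures any
# monotonic relationship.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # Average ranks, so tied scores share the same rank.
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Hypothetical GLUE and TweetEval scores for six MLMs.
glue = [79.1, 82.4, 76.0, 84.8, 80.2, 77.5]
tweeteval = [61.3, 64.0, 58.9, 66.2, 62.7, 60.1]
print(round(pearson(glue, tweeteval), 3), round(spearman(glue, tweeteval), 3))
```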

Table 3 :
Example of the features of MLMs given to a regression model for training, according to the considered factors.

Table 4 :
Comparison of different regression models using the features from the monolingual MLMs in the general domain. The highest R² (top) and the lowest RMSE (bottom) are shown in bold.
Factor analysis. To further study the effects of factors on MLMs, we eliminate each of the important factors.

Table 8 :
R² scores when keeping features from one category only. The most important factors for each predicted task are shown in bold.
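The protocol behind this table can be sketched as follows. The category and feature names below are illustrative, and `keep_category` only prepares the per-category feature subset; the actual study fits a gradient-boosting regressor on the result.

```python
# Category ablation: retain only the features belonging to one category at a
# time, then retrain and score the regressor on that subset.

CATEGORIES = {
    "size": ["parameter_size", "num_layers", "vocab_size"],
    "objective": ["mlm", "wwm", "nsp", "sop", "mrp"],
    "corpus": ["train_books", "train_wiki", "train_tweets"],
}

def keep_category(features, category):
    """Keep only the features belonging to the chosen category."""
    keep = set(CATEGORIES[category])
    return {name: value for name, value in features.items() if name in keep}

# Hypothetical feature row for one MLM.
row = {"parameter_size": "M", "num_layers": 12, "vocab_size": "S",
       "mlm": 1, "wwm": 0, "nsp": 1, "train_books": 1, "train_wiki": 1}

print(keep_category(row, "objective"))  # {'mlm': 1, 'wwm': 0, 'nsp': 1}
```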
The top of Table 10 compares the performance of monolingual vs. multilingual MLMs. We see that multilingual models demonstrate lower levels of social bias than their monolingual counterparts. This aligns with prior work conjecturing that multilingual language models benefit from training over a larger number of languages.

Table 14 :
Comparison of social bias and task performance of MLMs on different tasks, according to different features associated with the important factors.

Table 15 :
R² scores returned by gradient boosting on each of the GLUE subtasks.

Table 16 :
The description of the features of each factor that we consider in this paper.
Table 16 shows the description of each factor, including its feature type and possible values. As shown in the table, most of the factors are binary, while parameter size, domain, language, number of languages, vocabulary size, tokenization and number of layers are categorical. Table 17 provides the details of the factors corresponding to each language model. The factors are as follows:

Parameter size: categorical; three categories: S (x < 100M), M (100M ≤ x ≤ 300M), L (x > 300M)
Distill: binary; 1 if the model uses model distillation, otherwise 0
Train-Books: binary; 1 if BookCorpus is used as a training corpus, otherwise 0
Train-Wiki: binary; 1 if Wikipedia is used as a training corpus, otherwise 0
Train-CCNews: binary; 1 if CC-News is used as a training corpus, otherwise 0
Train-OWT: binary; 1 if OpenWebText is used as a training corpus, otherwise 0
Train-Stories: binary; 1 if Stories is used as a training corpus, otherwise 0
Train-Tweets: binary; 1 if Tweets is used as a training corpus, otherwise 0
Train-RALE: binary; 1 if RALE is used as a training corpus, otherwise 0
Train-Patent: binary; 1 if Patent Litigations is used as a training corpus, otherwise 0
Train-Caselaw: binary; 1 if the Caselaw Access Project is used as a training corpus, otherwise 0
Train-GooglePatents: binary; 1 if Google Patents Public Data is used as a training corpus, otherwise 0
Train-S2ORC: binary; 1 if S2ORC is used as a training corpus, otherwise 0
Train-MIMIC3: binary; 1 if MIMIC3 is used as a training corpus, otherwise 0
Train-PMed: binary; 1 if PubMed Abstracts is used as a training corpus, otherwise 0
Train-PMC: binary; 1 if PMC Full-text articles are used as a training corpus, otherwise 0
Train-CC100: binary; 1 if CommonCrawl data in 100 languages (CC100) is used as a training corpus, otherwise 0
Cont-train: binary; 1 if the model is further trained starting from a pretrained MLM, otherwise 0
Uncased: binary; 1 if the model is uncased, otherwise 0
Domain: categorical; the domain of the model, based on the training corpora used during pretraining. Four domains are considered: general domain, social media, legal domain and biomedical domain
Language: categorical; English-only or multilingual
Number of Languages: categorical; the actual number of languages the model is trained on
Vocabulary size: categorical; three categories: S (x < 50K), M (50K ≤ x ≤ 100K), L (x > 100K)
Tokenization: categorical; three categories: BPE, WordPiece, SentencePiece
Number of Layers: categorical; three categories: 6, 12 or 24 layers
Masked Language Modelling: binary; 1 if the model uses the masking technique, otherwise 0
Whole Word Masking: binary; 1 if the model uses whole word masking, otherwise 0
Next Sentence Prediction: binary; 1 if the model uses the next sentence prediction objective, otherwise 0
Sentence Order Prediction: binary; 1 if the model uses the sentence order prediction objective, otherwise 0
Mention Reference Prediction: binary; 1 if the model uses the mention reference prediction objective, otherwise 0
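A minimal sketch of how factors like these could be turned into regression features: sizes are bucketed into the S/M/L categories and each training corpus becomes one binary indicator. The example model entry is illustrative, not an actual row of Table 17.

```python
# Bucket thresholds follow the factor descriptions above:
# parameter size: S (x < 100M), M (100M <= x <= 300M), L (x > 300M)
# vocabulary size: S (x < 50K),  M (50K <= x <= 100K),  L (x > 100K)

def param_bucket(n_params):
    if n_params < 100_000_000:
        return "S"
    return "M" if n_params <= 300_000_000 else "L"

def vocab_bucket(n_tokens):
    if n_tokens < 50_000:
        return "S"
    return "M" if n_tokens <= 100_000 else "L"

CORPORA = ["books", "wiki", "ccnews", "owt", "tweets", "cc100"]

def encode(model):
    feats = {
        "parameter_size": param_bucket(model["parameters"]),
        "vocab_size": vocab_bucket(model["vocab"]),
        "tokenization": model["tokenization"],
        "uncased": int(model["uncased"]),
    }
    # One binary factor per training corpus.
    for corpus in CORPORA:
        feats[f"train_{corpus}"] = int(corpus in model["corpora"])
    return feats

example = {"parameters": 110_000_000, "vocab": 30_522,
           "tokenization": "WordPiece", "uncased": True,
           "corpora": {"books", "wiki"}}
print(encode(example))
```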