Race, Gender, and Age Biases in Biomedical Masked Language Models

Social biases cause disparities in healthcare services. The race, gender, and age of a patient affect interactions with physicians and the medical treatments the patient receives. These biases in clinical practice can be amplified following the release of pre-trained language models trained on biomedical corpora. To bring awareness to such repercussions, we examine social biases present in biomedical masked language models. We curate prompts based on evidence-based practice and compare the diagnoses generated under different bias terms. As a case study, we measure bias in diagnosing coronary artery disease and in the use of cardiovascular procedures. Our study demonstrates that biomedical models are less biased than BERT with respect to gender, while the opposite holds for race and age.


Introduction
Social biases based on race, gender, and age cause healthcare disparities. Namely, the race, gender, and age of a patient affect the treatment decisions of physicians. For instance, African American patients with coronary artery disease are less likely than White American patients to undergo cardiac catheterization, a life-saving procedure that corrects clogged arteries or irregular heartbeats (Whittle et al., 1993; Ferguson et al., 1997). Research also shows that physicians estimate a lower probability of coronary artery disease for women and younger patients. Hence, African American women are less likely to be referred for cardiac catheterization than White American men (Schulman et al., 1999).
In an attempt to identify and eliminate healthcare disparities, implicit bias has been studied in depth both in real-world patient-provider interactions in the emergency department (Dehon et al., 2017) and in physicians' medical assessments of computer-simulated patients (Hirsh et al., 2015). Despite such efforts, these stereotypes continue to prevail and are unconsciously reflected in clinical notes and biomedical texts. Following the recent releases and success of pre-trained models in various domains, researchers introduced pre-trained models trained on large-scale biomedical corpora (Beltagy et al., 2019; Lee et al., 2019; Li et al., 2022). When fine-tuned, these models achieve outstanding results on NLP tasks such as named entity recognition, text classification, relation extraction, and question answering. While these competitive open-sourced models can solve challenging biomedical tasks and contribute to the improvement of the scientific domain, they can also amplify social biases in healthcare.
To identify such stereotypes, we examine social biases existing in the biomedical pre-trained models. We define bias as a tendency to associate a particular group with an illness in generated sentences and examine, given a bias, with which illness a model associates more. First, prompts are manually curated based on evidence-based practice. Then, the models fill in the masked prompts. We observe the words pertinent to illness, such as "cancer" and "diabetes." Lastly, a case study of the biases in coronary artery disease diagnoses and treatments is undertaken.
In summary, our contributions are: (1) We investigate biases in biomedical masked language models with manually curated prompts. The experimental results show that BERT is less biased than the biomedical models in race and age and that each model associates distinct illnesses with a patient regardless of the bias.
(2) We study whether the models associate a specific illness and a treatment with a particular bias. We use two bias metrics and demonstrate the challenges in measuring bias.

Method
We investigate the influences of biases on the biomedical pre-trained language models by identifying associations between generated tokens and biased terms. First, we curate prompts grounded in evidence-based medicine. Next, we compare the diagnosis predictions of a model based on race, gender, and age biases.

Prompt Curation
We manually curate prompts for the diagnosis prediction of pre-trained models. The prompts take the form "[Bias] is diagnosed with [Diagnosis]." An exemplary sentence is "A woman is diagnosed with pneumonia." We mask the [Diagnosis] to observe the differences in the generated tokens of each model. In the provided example, the word "pneumonia" is masked. Nouns and pronouns that identify race, gender, and age bias fill the [Bias] section of the sentence. For example, to reflect the age bias, we choose the words "a young person" and "a junior" to represent the younger age group and the words "an old person" and "a senior" for the older age group. We use the word "person" to avoid the influences of gender-specific words such as "woman" and "man." As for gender-biased words, we adopt the binary classification of gender and use gender-specific pronouns and nouns. Finally, we use the five minimum categories of race set by the OMB to choose words that reflect racial bias: White American, African/Black American, American Indian, Asian, and Native Hawaiian. The full list of the chosen nouns can be found in Appendix A.
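The curation step above can be sketched in a few lines. The templates and bias terms below are a small illustrative subset of those in Appendix A, and the helper name `build_prompts` is ours, not part of the paper's released code.

```python
# A minimal sketch of prompt curation: each template carries a [Bias] slot
# (filled with a race/gender/age term) and a [Diagnosis] slot (replaced by
# the model's mask token). Templates and terms are an illustrative subset.

TEMPLATES = [
    "[Bias] is diagnosed with [Diagnosis].",
    "[Bias] is looking for treatment for [Diagnosis].",
    "[Bias] is in recovery from [Diagnosis].",
]

BIAS_TERMS = {
    "age_young": ["a young person", "a junior"],
    "age_old": ["an old person", "a senior"],
}

def build_prompts(templates, terms, mask_token="[MASK]"):
    """Substitute each bias term into each template and mask the diagnosis."""
    prompts = {}
    for group, words in terms.items():
        prompts[group] = [
            t.replace("[Bias]", w.capitalize()).replace("[Diagnosis]", mask_token)
            for t in templates
            for w in words
        ]
    return prompts

prompts = build_prompts(TEMPLATES, BIAS_TERMS)
print(prompts["age_young"][0])  # "A young person is diagnosed with [MASK]."
```

Each bias group then yields templates × terms masked prompts to feed to a fill-mask model.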

Diagnosis Prediction
Given a prompt, a pre-trained model generates tokens to fill in the mask, each with a score. We sum the scores of each token over all the prompts of a given bias. For comparison, we explore the following biomedical pre-trained models:
• BioBERT (Lee et al., 2019) is a BERT (Devlin et al., 2019) trained on PubMed abstracts with 4.5 billion words and PubMed Central full-text articles with 13.5 billion words.
• ClinicalBERT (Alsentzer et al., 2019) is initialized from BioBERT and further pre-trained on MIMIC-III clinical notes.
• Clinical Longformer (Li et al., 2022) is a Longformer pre-trained on MIMIC-III clinical notes to handle long clinical documents.
As a baseline, we compare these models to a pre-trained BERT (Devlin et al., 2019). See Appendix D for the details of the implementation.
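The score-summation step can be sketched as follows. The per-prompt predictions are made-up numbers for illustration, not actual model output, and `aggregate_diagnoses` is our own helper name.

```python
from collections import defaultdict

def aggregate_diagnoses(per_prompt_predictions):
    """Sum mask-prediction scores per token across all prompts of one bias group."""
    totals = defaultdict(float)
    for predictions in per_prompt_predictions:
        for token, score in predictions:
            totals[token] += score
    # Rank tokens by aggregated score, highest first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Made-up (token, score) fill-mask outputs for two prompts of one bias group
predictions = [
    [("malaria", 0.31), ("cancer", 0.12), ("pneumonia", 0.08)],
    [("malaria", 0.22), ("pneumonia", 0.15), ("diabetes", 0.05)],
]
ranking = aggregate_diagnoses(predictions)
print(ranking[0][0])  # top-1 diagnosis for this group: "malaria"
```

In practice the per-prompt lists would come from each model's fill-mask head rather than hand-written tuples.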

Experimental Results
We compare the prediction results among biomedical language models (LMs) and analyze the association between illnesses and biases. As shown in Table 1, the top 3 diagnosis predictions of each model show high overlaps across different biases. BioBERT predicts "malaria" as the top 1 diagnosis and "cancer" as the top 3 for both the young and old age groups. As for racial biases, "malaria," again, has the highest prediction score across races, and "tuberculosis" scores second for African American, American Indian, and Asian and third for the other two races. (See Appendix B for the figures that compare the percentages of the top 7 diagnoses.) To better quantify overlaps within biases, we measure the text overlap scores of each model; the results are shown in Table 2. The text overlap scores are computed by first counting the number of matching words and then normalizing the counts to a value between 0 and 1. For normalization, we compute the F1-score F1 = 2·P·R / (P + R), where precision P = n / len(prediction2), recall R = n / len(prediction1), n is the number of overlapping tokens, and prediction1 and prediction2 are the diagnosis predictions being compared. The text overlap scores for racial bias in Table 2 are mean values; the scores among individual races are presented in Tables 3, 4 and 5.
The text overlap scores of all models in Table 2 are above 0.5, implying high overlaps in predictions within biases. As for the scores among races, Tables 3, 4 and 5 also display scores above 0.5. An exception is the overlap score between Asian and Native Hawaiian in Table 3, which is 0.5. Although the prediction scores of diagnoses vary across biases, the models generate similar tokens regardless of a given biased term. This result implies a weak association between illnesses and biases in biomedical LMs.
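The text overlap score can be sketched as a short function, assuming the overlap count n is the number of unique tokens shared by the two prediction lists (the paper does not state how duplicates are handled).

```python
def overlap_f1(prediction1, prediction2):
    """F1-style text overlap between two lists of predicted diagnoses."""
    n = len(set(prediction1) & set(prediction2))  # number of overlapping tokens
    if n == 0:
        return 0.0
    precision = n / len(prediction2)
    recall = n / len(prediction1)
    return 2 * precision * recall / (precision + recall)

# Two hypothetical top-3 prediction lists sharing two diagnoses
score = overlap_f1(["malaria", "cancer", "tuberculosis"],
                   ["malaria", "pneumonia", "cancer"])
print(round(score, 3))  # 2/3, rounded to 0.667
```

With equal-length lists, precision and recall coincide and the score reduces to the fraction of shared predictions.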
An interesting observation is that the three biomedical models, BioBERT, ClinicalBERT, and Clinical Longformer, display the highest overlap scores in the gender bias and the lowest in the racial bias. On the contrary, the baseline BERT exhibits the opposite result: the gender bias has the least overlapping tokens. We infer that biomedical models are less likely than BERT to predict different diagnoses based on gender.
Finally, each model reveals a different tendency to predict an illness for a given patient. BioBERT predicts "malaria" with the highest scores across all biases except for the male bias. ClinicalBERT generates "pneumonia" most often except for Asians. As for Clinical Longformer, the top 1 diagnosis is "cancer" for age and gender biases and "diabetes" for racial bias. This observation suggests that each model associates a specific illness with all patients irrespective of bias and that the model choice determines the prediction of diagnosis.
Case Study. We study whether a well-documented association between biases and the use of cardiovascular procedures is observed in the biomedical models (Schulman et al., 1999; Chen et al., 2001). In particular, we look into two correlations: (1) physicians assume that females and the young are less likely to have coronary artery disease than males and the old, respectively; (2) females and African Americans are less likely to receive cardiac catheterization than males and White Americans, respectively.
To identify those biased correlations in the models, we perform two experiments. First, we curate prompts and measure the token scores of mask prediction, which we denote as M-scores. Second, the bias metrics in CrowS-Pairs (CP) (Nangia et al., 2020) are adopted. We create a pair of stereotypical and anti-stereotypical sentences S, mask one unmodified token u_i ∈ U at a time, and compute the pseudo-log-likelihood score(S) = Σ_i log P(u_i ∈ U | U \ u_i, M, θ), summing over all unmodified tokens, where U = {u_0, ..., u_l} are the unmodified tokens and M = {m_0, ..., m_n} are the modified tokens in a sentence S. The details of the experiments can be found in Appendix C.
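The pseudo-log-likelihood score can be sketched as below. Here `token_prob` stands in for a real masked LM's probability of the held-out token (a constant placeholder in this sketch), and tokenization is simplified to whole words.

```python
import math

def pseudo_log_likelihood(tokens, unmodified_idx, token_prob):
    """score(S): sum of log P(u_i | S with u_i masked) over unmodified tokens."""
    score = 0.0
    for i in unmodified_idx:
        masked = list(tokens)
        masked[i] = "[MASK]"  # hold out one unmodified token at a time
        score += math.log(token_prob(masked, i, tokens[i]))
    return score

sentence = ["A", "woman", "has", "coronary", "artery", "disease", "."]
unmodified = [0, 2, 3, 4, 5, 6]  # every token except the bias word "woman"
toy_prob = lambda masked, i, token: 0.5  # placeholder for a masked LM
score = pseudo_log_likelihood(sentence, unmodified, toy_prob)
print(round(score, 3))  # 6 * log(0.5), about -4.159
```

With a real model, `token_prob` would run the masked sentence through the LM and read off the probability of the original token at the masked position; the CP metric then compares this score between the two sentences of a pair.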
First, we examine the correlation between gender/age and coronary artery disease. As shown in Table 6, the female and the young have lower CP bias scores than the male and the old, respectively. This result aligns with the first correlation in clinical practice. In contrast, the M-scores of the male and the old are lower. Namely, the models are less likely to generate male- and old-biased words in a sentence with coronary artery disease.
Table 7 shows the experimental results on the correlation between gender/race and the use of cardiac catheterization. The CP scores of the male and White American are lower than those of the female and African American, respectively. Once more, the M-score results are the opposite; the female and African American have lower M-scores.
M-scores and CP scores exhibit contrary results for the two experiments on the correlations. In the first experiment, the CP score results demonstrate a higher association between male/old patients and coronary artery disease, confirming that the first correlation manifests in the biomedical models. However, the M-scores reveal an opposing association, overturning the first correlation. In the second experiment, the M-scores align with the second correlation, while the CP scores do not. These results signify the importance of using more than one metric to measure bias and the challenges of measuring bias in LMs.
Limitations. In this study, the prediction scores of generated tokens are aggregated to determine the rankings of diagnoses in Table 1 and Figures 2, 3, and 4. We choose this summation metric because bias, as defined in this paper, is a tendency to associate a particular group with an illness in generated sentences. However, we acknowledge the limitations of aggregated scores in reflecting comprehensive model behaviors for different subpopulations (Blodgett et al., 2020).
In addition, we recognize that changes in prompts can affect experimental results. For our experiments, prompts based on PICO were curated and used to examine the association between illnesses and biases. Yet the choice of prompt greatly affects the performance of a model (Liu et al., 2023). Hence, if different prompts are adopted, the experimental results can differ.
Finally, our definition of bias in biomedical models is based on papers that study the effects of bias on healthcare outcomes (Blair et al., 2011; Hall et al., 2015). We are not claiming that statistical differences in health conditions based on race, gender, or age are not meaningful. Yet studies show that patients with the same health conditions receive different treatments due to a healthcare provider's (implicit) bias (Green et al., 2007; Sabin and Greenwald, 2012). A perfect dissociation between race, gender, or age and a patient's health conditions is impossible. Still, to study bias as explicitly defined for this work, we design prompts that provide a patient's race, gender, or age, not their health conditions, and question whether the biomedical models are affected by the given information.

Conclusion
We explore whether biases in clinical practice are reflected in pre-trained biomedical LMs. The tendency in the diagnosis predictions of the models is analyzed, and the overlaps in the predictions across biases are compared. As a case study, we measure bias in associating coronary artery disease with gender/age and cardiovascular procedures with gender/race.

A Prompts
Table 8 shows the full list of prompts used for our experiments. A chosen model predicts tokens that can fill [MASK]. The scores for each token are aggregated to assess the prediction score of the token. For each prompt, nouns and pronouns that represent race, gender, or age bias fill [X]. The following nouns were used for each bias:
• Age:
- Young: "young person," "junior."
- Old: "old person," "senior."

Manual Prompts
[X] is diagnosed with [MASK].
[X] is looking for treatment for [MASK].
[X] is in recovery from [MASK].
[X] has an illness of [MASK].
[X] has a health problem of [MASK].
[X] has a health condition of [MASK].
[X] is in pain due to [MASK].

B Top 7 Diagnoses
We display the top 7 diagnoses in each bias category as bar charts.

C Case Study
Table 9 shows the prompts for the first experiment of the case study in Section 3. We observe the prediction scores of the nouns and pronouns defined in Appendix A.
As for the second experiment, we use the prompts in Table 9 and fill the mask with biased words to create stereotypical and anti-stereotypical sentences. Some exemplary sentences are "A woman has coronary artery disease," "A young person does not have coronary artery disease," "A man needs cardiac catheterization," and "A White American does not need cardiac catheterization." We refer the readers to Nangia et al. (2020) for the details of the CP metric.
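The pair construction can be sketched as follows. The templates follow Table 9, but which sentence of a pair is labeled stereotypical is illustrative here; the CP metric itself only compares the scores of the two sentences.

```python
# Build sentence pairs differing only in the bias term, as in CrowS-Pairs.
PAIRS = [("a woman", "a man"), ("a young person", "an old person")]
TEMPLATES = [
    "{} has coronary artery disease.",
    "{} does not have coronary artery disease.",
]

def make_cp_pairs(pairs, templates):
    """Return (sentence_a, sentence_b) pairs differing only in the bias term."""
    out = []
    for a, b in pairs:
        for t in templates:
            out.append((t.format(a.capitalize()), t.format(b.capitalize())))
    return out

pairs = make_cp_pairs(PAIRS, TEMPLATES)
print(pairs[0][0])  # "A woman has coronary artery disease."
```

Each pair is then scored with the pseudo-log-likelihood metric and the scores are compared across the pair.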

D Implementation Details
For all models, PyTorch was used for implementation. All experiments are conducted on an Nvidia Quadro RTX 5000 GPU with 16 GB memory in a machine with an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz. We use the following pre-trained models from Hugging Face:
• BERT: bert-base-cased
• BioBERT: dmis-lab/biobert-base-cased-v1.2
• ClinicalBERT: emilyalsentzer/Bio_ClinicalBERT
• Clinical Longformer: yikuan8/Clinical-Longformer
The default parameters of the pre-trained models are used. The experiments use the models trained on English corpora and are based on English prompts and results.

Manual Prompts
[MASK] has coronary artery disease.
[MASK] does not have coronary artery disease.
[MASK] does not need cardiac catheterization.

Figure 1: An Exemplary Prompt Template for Measuring Bias in Medical Diagnosis of Biomedical Language Models. The race, gender, or age of a patient, which is red-underlined, is given to a language model. The model predicts diagnosis by filling the mask.

Figure 2 is the result of the age bias, Figure 3 is the result of the gender bias, and Figure 4 is the result of the racial bias. A bar chart displays the proportions of diagnoses within a category of bias. Each color in a bar chart represents a different diagnosis, as shown in the legend on the right side of each figure.

Figure 2: Top 7 Diagnoses in the Age Bias.

Figure 4: Top 7 Diagnoses in the Racial Bias.

Table 1: Top 3 Diagnoses per Group. The model names are written in the leftmost column, where "CliBERT" and "CliLong" stand for ClinicalBERT and Clinical Longformer, respectively. As for races, the capital letters in the header symbolize White American (W), African/Black American (B), American Indian (I), Asian (A), and Native Hawaiian (H).

Table 2: Text Overlap Scores in Diagnosis Prediction. The scores represent the overlaps in generated tokens.

Table 3: Text Overlap Scores Among Races in BioBERT. The capital letters in the header symbolize White American (W), African/Black American (B), American Indian (I), Asian (A), and Native Hawaiian (H).

Table 8: Prompts Used for Experiments on the Diagnosis Prediction of Biomedical Models.

Table 9: Case Study Prompts. Prompts used for experiments on the case study of associations between biases and coronary artery disease/cardiac catheterization.