StereoSet: Measuring stereotypical bias in pretrained language models

A stereotype is an over-generalized belief about a particular group of people, e.g., Asians are good at math or Asians are bad drivers. Such beliefs (biases) are known to hurt target groups. Since pretrained language models are trained on large real world data, they are known to capture stereotypical biases. In order to assess the adverse effects of these models, it is important to quantify the bias captured in them. Existing literature on quantifying bias evaluates pretrained language models on a small set of artificially constructed bias-assessing sentences. We present StereoSet, a large-scale natural dataset in English to measure stereotypical biases in four domains: gender, profession, race, and religion. We evaluate popular models like BERT, GPT-2, RoBERTa, and XLNet on our dataset and show that these models exhibit strong stereotypical biases. We also present a leaderboard with a hidden test set to track the bias of future language models at https://stereoset.mit.edu


Introduction
A key idea behind the current success of neural network models for language is pretrained representations such as word embeddings (Mikolov et al., 2013;Pennington et al., 2014) and pretrained language models (Peters et al., 2018;Howard and Ruder, 2018;Devlin et al., 2019;Radford et al., 2019;Liu et al., 2019). These are widely used to initialize neural models, which are then fine-tuned to perform a task at hand. Typically, these are learned from massive text corpora using variants of language modeling objective, i.e., correctly predicting a word given its surrounding context. In the recent years, these repre- * Work completed in part during an internship at Intel AI.
Choose the appropriate word:

a) The Intrasentence Context Association Test
Choose the appropriate sentence: Domain: Race Target: Arab Context: He is an Arab from the Middle East.
Option 1: He is probably a terrorist with bombs.
(unrelated) (b) The Intersentence Context Association Test Figure 1: Context Association Tests (CATs) to measure the bias and language modeling ability of language models.
sentations empowered neural models to attain unprecedented levels of performance gains on multiple language tasks. The resulting models are being deployed widely as services on platforms like Google Cloud and Amazon AWS to serve millions of users.
While this growth is commendable, there are concerns about the fairness of these models. Since pretrained representations are obtained from learning on massive text corpora, there is a danger that stereotypical biases in the real world are reflected in these models. For example, GPT2 (Radford et al., 2019), a pretrained language model, has shown to generate unpleasant stereotypical text when prompted with context containing certain races such as African-Americans (Sheng et al., 2019). In this work, we assess the stereotypical biases of popular pretrained language models.
The seminal works of Bolukbasi et al. (2016) and Caliskan et al. (2017) show that word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) contain stereotypical biases using diagnostic methods like word analogies and association tests. For example, Caliskan et al. show that male names are more likely to be associated with career terms than female names where the association between two terms is measured using embedding similarity, and similarly African-American names are likely to be associated with unpleasant terms than European-American names.
Recently, such studies have been attempted to evaluate bias in contextual word embeddings obtained from pretrained language models where a word is provided with artificial context (May et al., 2019;Kurita et al., 2019), e.g., the contextual embedding of man is obtained from the embedding of man in the sentence This is a man. However, these have a few drawbacks. First, the context is artificial, which does not reflect the natural usage of a word. Second, they require stereotypical attribute terms to be predefined (e.g., pleasant and unpleasant terms). Third, they focus on single word target terms (and attributes) and ignore multiword terms like construction worker.
In this work, we propose methods to evaluate bias of pretrained language models. These methods do not have the aforementioned limitations. Specifically, we design two different association tests, one for measuring bias at sentence level (intrasentence), and the other at discourse level (intersentence). In these tests, each target term (e.g., tennis player) is provided with a natural context in which it appears, along with three possible associative contexts. The associative contexts help us test how good a model is at language modeling as well as evaluate the biases of the model. We crowdsource StereoSet, a dataset for associative contexts in English containing 4 target domains, 321 target terms and 16,995 test instances (triplets). A few instances are shown in Figure 1.

Task Formulation
What are the desiderata of an idealistic language model? First, it should be able to perform the task of language modeling, i.e., it should rank meaningful contexts higher than meaningless contexts. For example, it should tell us that Our housekeeper is a Mexican is more probable than Our housekeeper is a round. Second, it should not exhibit stereotypical bias, i.e., it should avoid ranking stereotypical contexts higher than antistereotypical contexts, e.g., Our housekeeper is a Mexican and Our housekeeper is an American should be equally possible. If the model consistently prefers stereotypes over anti-stereotypes, we can say that the model exhibits stereotypical bias. Based on these observations, we develop the Context Association Test (CAT), a test that measures the language modeling ability as well as the stereotypical bias of pretrained language models.
In CAT, given a context containing a target group (e.g., housekeeper), we provide three different ways to instantiate this context. Each instantiation corresponds to either a stereotypical, a anti-stereotypical, or an unrelated association. The stereotypical and anti-stereotypical associations are used to measure stereotypical bias, and the unrelated association is used to measure language modeling ability.
Specifically, we design two types of association tests, intrasentence and intersentence CATs, to assess language modeling and stereotypical bias at sentence level and discourse level. Figure 1 shows an example for each.

Intrasentence
Our intrasentence task measures the bias and the language modeling ability for sentence-level reasoning. We create a fill-in-the-blank style context sentence describing the target group, and a set of three attributes, which correspond to a stereotype, an anti-stereotype, and an unrelated option ( Figure  1a). In order to measure language modeling and stereotypical bias, we determine which attribute has the greatest likelihood of filling the blank, in other words, which of the instantiated contexts is more likely.

Intersentence
Our intersentence task measures the bias and the language modeling ability for discourse-level reasoning. The first sentence contains the target group, and the second sentence contains an attribute of the target group. Figure 1b shows the intersentence task. We create a context sentence with a target group that can be succeeded with three attribute sentences corresponding to a stereotype, an anti-stereotype and an unrelated option. We measure the bias and language modeling abil-ity based on which attribute sentence is likely to follow the context sentence.

Related Work
Our work is inspired from several related attempts that aim to measure bias is pretrained representations such as word embeddings and language models.

Bias in word embeddings
The two popular methods of testing bias in word embeddings are word analogy tests and word association tests. In word analogy tests, given two words in a certain syntactic or semantic relation (man → king), the goal is generate a word that is in similar relation to a given word (woman → queen). Mikolov et al. (2013) showed that word embeddings capture syntactic and semantic word analogies, e.g., gender, morphology etc. Bolukbasi et al. (2016) build on this observation to study gender bias. They show that word embeddings capture several undesired gender biases (semantic relations) e.g. doctor : man :: woman : nurse. Manzini et al. (2019) extend this to show that word embeddings capture several stereotypical biases such as racial and religious biases.
In the word embedding association test (WEAT, Caliskan et al. 2017), the association of two complementary classes of words, e.g., European names and African names, with two other complementary classes of attributes that indicate bias, e.g., pleasant and unpleasant attributes, are studied to quantify the bias. The bias is defined as the difference in the degree with which European names are associated with pleasant and unpleasant attributes in comparison with African names being associated with pleasant and unpleasant attributes.
Here the association is defined as the similarity between the word embeddings of the names and the attributes. This is the first large scale study that showed word embeddings exhibit several stereotypical biases and not just gender bias. Our inspiration for CAT comes from WEAT.

Bias in pretrained language models
May et al. (2019) extend WEAT to sentence encoders, calling it the Sentence Encoder Association Test (SEAT). For a target term and its attribute, they create artificial sentences using generic context of the form "This is [target]." and "They are [attribute]." and obtain contextual word embeddings of the target and the attribute terms. They repeat Caliskan et al. (2017)'s study using these embeddings and cosine similarity as the association metric but their study was inconclusive. Later, Kurita et al. (2019) show that cosine similarity is not the best association metric and define a new association metric based on the probability of predicting an attribute given the target in generic sentential context, e.g., [target] is [mask], where [mask] is the attribute. They show that similar observations of Caliskan et al. (2017) are observed on contextual word embeddings too. Our intrasentence CAT is similar to their setting but with natural context. We also go beyond intrasentence to propose intersentence CATs, since language modeling is not limited at sentence level.

Measuring bias through extrinsic tasks
Another popular method to evaluate bias of pretrained representations is to measure bias on extrinsic applications like coreference resolution (Rudinger et al., 2018;Zhao et al., 2018) and sentiment analysis (Kiritchenko and Mohammad, 2018). In this method, neural models for downstream tasks are initialized with pretrained representations, and then fine-tuned on the target task. The bias in pretrained representations is estimated based on the performance on the target task. However, it is hard to segregate the bias of task-specific training data from the pretrained representations. Our CATs are an intrinsic way to evaluate bias in pretrained models.

Dataset Creation
We select four domains as the target domains of interest for measuring bias: gender, profession, race and religion. For each domain, we select terms (e.g., Asian) that represent a social group. For collecting target term contexts and their associative contexts, we employ crowdworkers via Amazon Mechanical Turk. 1 We restrict ourselves to crowdworkers in USA since stereotypes could change based on the country they live in.

Target terms
We curate diverse set of target terms for the target domains using Wikidata relation triples (Vrandečić and Krötzsch, 2014). A Wikidata triple is of the form <subject, relation, object> (e.g., <Brad Pitt, P106, Actor>). We collect all objects occurring with the relations P106 (profession), P172 (race), and P140 (religion) as the target terms. We manually filter terms that are either infrequent or too fine-grained (assistant producer is merged with producer). We collect gender terms from Nosek et al. (2002). A list of target terms is available in Appendix A.3. A target term can contain multiple words (e.g., software developer).

CATs collection
In the intrasentence CAT, for each target term, a crowdworker writes attribute terms that correspond to stereotypical, anti-stereotypical and unrelated associations of the target term. Then they provide a context sentence containing the target term. The context is a fill-in-the-blank sentence, where the blank can be filled either by the stereotype term or the anti-stereotype term but not the unrelated term.
In the intersentence CAT, first they provide a sentence containing the target term. Then they provide three associative sentences corresponding to stereotypical, anti-stereotypical and unrelated associations. These associative sentences are such that the stereotypical and the anti-stereotypical sentences can follow the target term sentence but the unrelated sentence cannot follow the target term sentence.
Moreover, we ask annotators to only provide stereotypical and anti-stereotypical associations that are realistic (e.g., for the target term receptionist, the anti-stereotypical instantiation You have to be violent to be a receptionist is unrealistic since being violent is not a requirement for being a receptionist).

CATs validation
In order to ensure, stereotypes were not simply the opinion of one particular crowdworker, we validate the data collected in the above step with additional workers. For each context and its associations, we ask five validators to classify each association into a stereotype, an anti-stereotype or an unrelated association. We only retain CATs where at least three validators agree on the classification labels. This filtering results in selecting 83% of the CATs, indicating that there is regularity in stereotypical views among the workers.

Dataset Analysis
Are people prone to associate stereotypes with negative associations? To answer this question, we classify stereotypes into positive and negative sentiment classes using a two-class sentiment classifier (details in Appendix A.5). The classifier also classifies neutral sentiment such as My housekeeper is a Mexican as positive. Table 2 shows the results. As evident, people do not always associate stereotypes with negative associations (e.g., Asians are good at math is a stereotype with positive sentiment). However, people associate stereotypes with relatively more negative associations than anti-stereotypes (41% vs. 33%).
We also extract keywords in StereoSet to analyze which words are most commonly associated with the target groups. We define a keyword as a word that is relatively frequent in StereoSet compared to the natural distribution of words in large general purpose corpora (Kilgarriff, 2009). Table 3 shows the top keywords of each domain when compared against TenTen, a 10 billion word web corpus (Jakubicek et al., 2013). We remove the target terms from keywords (since these terms are given by us to annotators). The resulting keywords turn out to be attribute terms associated with the target groups, an indication that multiple annotators are using similar attribute terms. While the target terms in gender and race are associated with physical attributes such as beautiful, feminine, masculine, etc., professional terms are asso-

Experimental Setup
In this section, we describe the data splits, evaluation metrics and the baselines.

Development and test sets
We split StereoSet into two sets based on the target terms: 25% of the target terms and their instances for the development set and 75% for the hidden test set. We ensure terms in the development set and test set are disjoint. We do not have a training set since this defeats the purpose of StereoSet, which is to measure the biases of pretrained language models (and not the models fine-tuned on StereoSet).

Evaluation Metrics
Our desiderata of an idealistic language model is that it excels at language modeling while not exhibiting stereotypical biases. In order to determine success at both these goals, we evaluate both language modeling and stereotypical bias of a given model. We pose both problems as ranking problems.
Language Modeling Score (lms) In the language modeling case, given a target term context and two possible associations of the context, one meaningful and the other meaningless, the model has to rank the meaningful association higher than meaningless association. The meaningless association corresponds to the unrelated option in Stere-oSet and the meaningful association corresponds to either the stereotype or the anti-stereotype options. We define the language modeling score (lms) of a target term as the percentage of instances in which a language model prefers the meaningful over meaningless association. We define the overall lms of a dataset as the average lms of the target terms in the split. The lms of an ideal language model will be 100, i.e., for every target term in a dataset, the model always prefers the meaningful associations of the target term.
Stereotype Score (ss) Similarly, we define the stereotype score (ss) of a target term as the percentage of examples in which a model prefers a stereotypical association over an anti-stereotypical association. We define the overall ss of a dataset as the average ss of the target terms in the dataset. The ss of an ideal language model will be 50, i.e., for every target term in a dataset, the model prefers neither stereotypical associations nor antistereotypical associations; another interpretation is that the model prefers an equal number of stereotypes and anti-stereotypes.
Idealized CAT Score (icat) We combine both lms and ss into a single metric called the idealized CAT (icat) score based on the following axioms: 1. An ideal model must have an icat score of 100, i.e., when its lms is 100 and ss is 50, its icat score is 100.
3. A random model must have an icat score of 50, i.e., when its lms is 50 and ss is 50, its icat score must be 50.
Therefore, we define the icat score as icat = lms * min(ss, 100 − ss) 50 This equation satisfies all the axioms. Here min(ss,100−ss) 50 ∈ [0, 1] is maximized when the model neither prefers stereotypes nor antistereotypes for each target term and is minimized when the model favours one over the other. We scale this value using the language modeling score. An interpretation of icat is that it represents the language modeling ability of a model to behave in an unbiased manner while excelling at language modeling.

Baselines
IDEALLM We define this model as the one that always picks correct associations for a given target term context. It also picks equal number of stereotypical and anti-stereotypical associations over all the target terms. So the resulting lms, ss and icat scores are 100, 50 and 100 respectively.
STEREOTYPEDLM We define this model as the one that always picks a stereotypical association over an anti-stereotypical association. So its ss is 100. As a result, its icat score is 0 for any value of lms.
RANDOMLM We define this model as the one that picks associations randomly, and therefore its lms, ss and icat scores are 50, 50, 50 respectively.
SENTIMENTLM In Section 5, we saw that stereotypical instantiations are more frequently associated with negative sentiment than antistereotypes. In this baseline, for a given a pair of context associations, the model always pick the association with the most negative sentiment.

BERT
In the intrasentence CAT (Figure 1a), the goal is to fill the blank of a target term's context sentence with an attribute term. This is a natural task for BERT since it is originally trained in a similar fashion (a masked language modeling objective). We leverage pretrained BERT to compute the log probability of an attribute term filling the blank. If the term consists of multiple subword units, we compute the average log probability over all the subwords. We rank a given pair of attribute terms based on these probabilities (the one with higher probability is preferred).
For intersentence CAT (Figure 1b), the goal is to select a follow-up attribute sentence given target term sentence. This is similar to the next sentence prediction (NSP) task of BERT. We use BERT pre-trained NSP head to compute the probability of an attribute sentence to follow a target term sentence. Finally, given a pair of attribute sentences, we rank them based on these probabilities.

ROBERTA
Given that ROBERTA is based off of BERT, the corresponding scoring mechanism remains remarkably similar. However, ROBERTA does not contain a pretrained NSP classification head. So we train one ourselves on 9.5 million sentence pairs from Wikipedia (details in Appendix A.4). Our NSP classification head achieves a 94.6% accuracy with ROBERTA-base, and a 97.1% accuracy with ROBERTA-large on a held-out set containing 3.5M Wikipedia sentence pairs. 2 We follow the same ranking procedure as BERT for both intrasentence and intersentence CATs.

XLNET
XLNET can be used in either in an auto-regressive setting or bidirectional setting.
We use bidirectional setting, in order to mimic the evaluation setting of BERT and ROBERTA. For the intrasentence CAT, we use the pretrained XLNET model. For the intersentence CAT, we train an NSP head (Appendix A.4) which obtains a 93.4% accuracy with XLNET-base and 94.1% accuracy with XLNET-large.

GPT2
Unlike the above models, GPT2 is a generative model in an auto-regressive setting, i.e., it estimates the probability of a current word based on its left context. For the intrasentence CAT, we instantiate the blank with an attribute term and com-pute the probability of the full sentence. In order to avoid penalizing attribute terms with multiple subwords, we compute the average log probability of each subword. Formally, if a sentence is composed of subword units x 0 , x 1 , ..., x N , then we compute N i=1 log(P (x i |x 0 ,...,x i−1 )) N . Given a pair of associations, we rank each association using this score. For the intersentence CAT, we can use a similar method, however we found that it performed poorly. 3 Instead, we trained a NSP classification head on the mean-pooled representation of the subword units (Appendix A.4). Our NSP classifier obtains a 92.5% accuracy on GPT2-small, 94.2% on GPT2-medium, and 96.1% on GPT2large. Table 4 shows the overall results of baselines and models on StereoSet. Table 4, all pretrained models have higher lms values than RANDOMLM indicating that pretrained models are better language models. Among different architectures, GPT2-large is the best performing language model (88.9 on development) followed by GPT2-medium (87.1). We take a linear weighted combination of BERT-large, GPT2medium, and GPT2-large to build the ENSEMBLE model, which achieves the highest language modeling performance (90.7). We use icat to measure how close the models are to an idealistic language model. All pretrained models perform better on icat than the baselines. While GPT2-small is the most idealistic model of all pretrained models (71.9 on development), XLNET-base is the weakest model (61.6). The icat scores of SEN-TIMENTLM are close to RANDOMLM indicating that sentiment is not a strong indicator for building an idealistic language model. The overall results exhibit similar trends on the development and test sets.

Baselines vs. Models As seen in
Relation between lms and ss All models exhibit a strong correlation between lms and ss scores. As the language model becomes stronger, so its stereotypical bias (ss) too. This is unfortunate and perhaps unavoidable as long as we rely on real world distribution of corpora to train language models since these corpora are likely to reflect  stereotypes (unless carefully selected). Among the models, GPT2 variants have a good balance between lms and ss in order to achieve high icat scores.
Impact of model size For a given architecture, all of its pretrained models are trained on the same corpora but with different number of parameters. For example, both BERT-base and BERT-large are trained on Wikipedia and BookCorpus (Zhu et al., 2015) with 110M and 340M parameters respectively. As the model size increases, we see that its language modeling ability (lms) increases, and correspondingly its stereotypical score. However, this is not always the case with icat. Until the language model reaches a certain performance, the model does not seem to exhibit a strong stereotypical behavior. For example, the icat scores of  ROBERTA and XLNET increase with model size, but not BERT and GPT2, which are strong language models to start with.
Impact of pretraining corpora BERT, ROBERTA, XLNET and GPT2 are trained on 16GB, 160GB, 158GB and 40GB of text corpora. Surprisingly, the size of the corpus does not correlate with either lms or icat. This could be due to the difference in architectures and the type of corpora these models are trained on. A better way to verify this would be to train a same model on increasing amounts of corpora. Due to lack of computing resources, we leave this work for community. We conjecture that high performance of GPT2 (on lms and icat) is due to the nature of its training data. GPT2 is trained on documents linked from Reddit. Since Reddit has several subreddits related to target terms in StereoSet (e.g., relationships, religion), GPT2 is likely to be exposed to correct contextual associations. Also, since Reddit is moderated in these niche subreddits (ie. /r/feminism), it could be the case that both stereotypical and anti-stereotypical associations are learned. Table 5 shows domain-wise results of the ENSEMBLE model on the test set. The model is relatively less biased on race than on others (icat score of 69.7). We also show the high and low biased target terms for each domain from the development set. We conjecture that the high biased terms are the ones that have well established stereotypes in society and are also frequent in language. This is the case with mother (attributes: caring, cooking), software developer (at-  tributes: geek, nerd), and Africa (attributes: poor, dark). The low biased are the ones that do not have well established stereotypes, for example, producer and Crimean. The outlier to this observation is Muslim. Although the target term muslim has strong stereotypical associations in StereoSet (e.g., the attribute term terrorist appears in 20% of stereotypes and other hatred terms in 43% examples), the model exhibits idealistic behavior (with a slight preference towards anti-stereotypes). We could not explain this behavior, and leave explainability of such bias to future work. Table 6 shows the results of intrasentence and intersentence CATs on the test set. Since intersentence tasks has more number of words per instance, we expect intersentence language modeling task to be harder than intrasentence. This is the case with most models (except BERT).

Limitations
StereoSet is crowdsourced on Mechanical Turk using worker pool from the US. Since this pool is predominantly (80%) below the age of 50 (Difallah et al., 2018), StereoSet may not reflect the stereotypes of the wider US population. Although stereotypes are subjective opinions, we often find them to collide with objective facts. For instance, "I'm going to visit Iraq. Everyone is a Muslim there" is a stereotype of Iraq in our dataset, however this is objectively incorrect since 2% of Iraqis are Christians. Similar is the case with anti-stereotypes (e.g., "I travelled to France recently. Everyone there spoke English" is an antistereotype since French are stereotyped to hate English). Even in these cases, our assumption is that an idealistic language model shows no favor to one over the other. Although we explicitly asked the annotators to avoid objective statements and the validators to filter them, some false statements still remain in the data.
In some cases, it is probably useful to favor stereotypes over anti-stereotypes. Consider the CAT, "The chef made ___ food", and its stereotypical and anti-stereotypical associations, delicious and disgusting respectively. Although the chance that a chef makes delicious food is high, we still assume that an idealistic language model shows no preference to one over the other. This could be problematic. We leave this for future work.

Conclusion
In this work, we develop the Context Association Test (CAT) to measure the stereotypical biases of pretrained language models with respect to their language modeling ability. We introduce a new evaluation metric, the Idealized CAT (ICAT) score, that measures how close a model is to an idealistic language model. We crowdsource Stere-oSet, a dataset containing 16,995 CATs to test biases in four domains: gender, race, religion and professions. We show that current pretrained language model exhibit strong stereotypical biases, and that the best model is 27.0 ICAT points behind the idealistic language model. We find that the GPT2 family of models exhibit relatively more idealistic behavior than other pretrained models like BERT, ROBERTA and XLNET. Finally, we release our dataset to the public, and present a leaderboard with a hidden test set to track the bias of future language models. We hope that Stere-oSet will spur further research in evaluating and mitigating bias in language models.

A Appendix
A.1 Detailed Results Table 7 and Table 8 show detailed results on the Context Association Test for the development and test sets respectively.

A.2 Mechanical Turk Task
Our crowdworkers were required to have a 95% HIT acceptance rate, and be located in the United States. In total, 475 and 803 annotators completed the intrasentence and intersentence tasks respectively. Restricting crowdworkers to the United States helps account for differing definitions of stereotypes based on regional social expectations, though limitations in the dataset remain as discussed in Section 9. Screenshots of our Mechanical Turk interface are available in Figure 2 and 3. Table 9 list our target terms used in the dataset collection task.

A.4 General Methods for Training a Next Sentence Prediction Head
Given some context c, and some sentence s, our intersentence task requires calculating the likelihood p(s|c), for some sentence s and context sentence c. While BERT has been trained with a Next Sentence Prediction classification head to provide p(s|c), the other models have not. In this section, we detail our creation of a Next Sentence Prediction classification head as a downstream task.
For some sentences A and B, our task is simply determining if Sentence A follows Sentence B, or if Sentence B follows Sentence A. We trivially generate this corpus from Wikipedia by sampling some i th sentence, i + 1 th sentence, and a randomly chosen negative sentence from any other article. We maintain a maximum sequence length of 256 tokens, and our training set consists of 9.5 million examples.
We train with a batch size of 80 sequences until convergence (80 sequences / batch * 256 tokens / sequence = 20,480 tokens/batch) for 10 epochs over the corpus. For BERT, We use BertAdam as the optimizer, with a learning rate of 1e-5, a linear warmup schedule from 50 steps to 500 steps, and minimize cross entropy for our loss function. Our results are comparable to Devlin et al. (2019), with each model obtaining 93-98% accuracy against the test set of 3.5 million examples.
Additional models maintain the same experimental details. Our NSP classifier achieves an 94.6% accuracy with roberta-base, a 97.1% accuracy with roberta-large, a93.4% accuracy with xlnet-base and 94.1% accuracy with xlnet-large.
In order to evaluate GPT-2 on intersentence tasks, we feed the mean-pooled representations across the entire sequence length into the classification head. Our NSP classifier obtains a 92.5% accuracy on gpt2-small, 94.2% on gpt2-medium, and 96.1% on gpt2-large. In order to fine-tune gpt2-large on our machines, we utilized gradient accumulation with a step size of 10, and mixed precision training from Apex.

A.5 Fine-Tuning BERT for Sentiment Analysis
In order to evaluate sentiment, we fine-tune BERT (Devlin et al., 2019) on movie reviews (Maas et al., 2011)