Geographical Erasure in Language Generation



Introduction
Large pretrained models serve as base models for many downstream NLP applications, including question answering, dialogue, common-sense reasoning, classification, tagging, translation, summarisation, and generation (Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022). Despite their increasing utility, there are concerns about how they reflect and amplify biases in the training data. For instance, unfiltered data originating from the internet is known to be rife with toxic, misogynistic, and stereotyping content. Many studies highlight biases in model outputs, primarily concerning representational harms (Barocas et al., 2017), where a section of society (e.g., women, LGBTQ+ communities) is represented in a poor light, or is ignored by the system (Bolukbasi et al., 2016; Guo and Caliskan, 2021; May et al., 2019; Tan and Celis, 2019). While important, such studies predominantly examine biases related to race, gender, occupation and sexual orientation.
An important, and often overlooked, aspect of inclusive model development is geographical inclusion. This is particularly important at a time when most large-scale model training efforts come from a small set of regions. Further, these models are trained using internet data, whose access, in the first place, is unequally distributed (Blank, 2017; Center, 2021). Minimising cultural and geographic identities is referred to as erasure (Roche, 2019) and is studied by linguists and social scientists in the context of imperialism and colonialism, where "people are silenced in the historical record [...], their contemporary presence rendered invisible, and their existence written out of the future" (Roche, 2019). Automated systems and their developers exclude certain groups unintentionally, but the risk of being "written out of the future" remains pressing: the produced content is fed back into the internet. In lifelong learning setups, the generated content becomes the training data of tomorrow's models, closing the vicious circle of reinforcing social hierarchies (see §4.5 of Sheng et al. (2021)).
In this paper, we reveal instances of geographical erasure, wherein models underpredict geographical regions (see Fig. 1). For instance, GPT2 assigns nine times higher likelihood to "I live in Canada" than to "I live in Pakistan", whereas Pakistan's English-speaking population is almost four times that of Canada. By comparing model outputs with population statistics, we operationalise geographical erasure (§3). Using this measure, we first demonstrate the existence of erasure for several countries across different prompt formulations (§4.1). Studying the consistency across a range of language models from the GPT and LLaMA model families, we find that several countries (Nigeria, Pakistan, Eswatini, Uganda and Madagascar) are affected by erasure under all models (§4.2). Following related work (Lin et al., 2021; Rae et al., 2021; Nadeem et al., 2020), we study the impact of model size on the extent of erasure, and find it to be small: erasure occurs across all sizes (§4.3). To identify the causes of erasure, we compute the unigram frequencies of countries in the training corpus (§4.4). These closely match our model predictions, indicating that the composition of the training data is a main source of erasure. Lastly, we alleviate erasure via supervised finetuning. We study the impact of mitigation on generation quality as measured by perplexity on Wikitext-2-v1. Our finetuning strategy proves to be an effective mitigation mechanism which generalises and has little impact on generation quality (§4.5).

Related Work
The literature on fairness in machine learning distinguishes between representational and allocational harms (Barocas et al., 2017; Crawford, 2017; Blodgett et al., 2020). Allocational harms concern the unfair distribution of resources, e.g. when a group is disproportionately denied bank loans by an automated system. Allocational harms tend to be more easily measured through standard fairness metrics like demographic parity (Dwork et al., 2012) and equality of opportunity (Hardt et al., 2016). These do not directly apply to open-ended generation tasks, where we instead study representational harms, which arise when a system "represents some social groups in a less favourable light than others, demeans them, or fails to recognise their existence altogether" (Blodgett et al., 2020); the last case is the focus of our work.
Fairness measures for language generation usually define bias as differences between demographic groups (Sheng et al., 2021). For example, Dhamala et al. (2021) find that female pronouns are more likely to elicit positive text from an LLM than male pronouns. Similarly, Huang et al. (2019) compare the sentiments produced for different occupations, names and countries. Nangia et al. (2020) and Nadeem et al. (2020) compare the probability of stereotypical and non-stereotypical sentences under a model in order to measure whether it encodes stereotypes against different demographic groups. Along the same lines, the WinoGender test (Rudinger et al., 2018) measures gender biases in co-reference resolution tasks. Taking a distributional view similar to our work, Rae et al. (2021) investigate biases in the context of occupation; however, they again compare predictions for different genders with each other. In general, such comparative bias tests are widely adopted by authors proposing new models (Touvron et al., 2023; Rae et al., 2021; Hoffmann et al., 2022; Scao et al., 2022). Instead of comparing model predictions against each other, we compare model predictions to real-world ground truth distributions in order to quantify bias.
Ground-truth-based measures are not commonly used as fairness metrics but are important when evaluating a model's truthfulness. Petroni et al. (2019) and Lin et al. (2021) provide datasets of real-world facts against which to benchmark LLMs' knowledge. Similar to our work, Zhou et al. (2022) measure the frequency of country predictions, and how underprediction correlates with a country's GDP. Contrary to their count-based approach, we propose a more fine-grained metric for erasure and extend the analysis to autoregressive models. Unlike theirs, our erasure metric can be employed as a loss function for finetuning, to specifically mitigate erasure. Liang et al. (2022) propose a similar metric for erasure in the domains of gender and race. Like us, they compare model distributions to ground truth distributions, though they measure a total variation distance where we use a KL-divergence-based approach (see §3.3). The authors assume a uniform ground truth whereas we construct a domain-specific distribution (see §3.2). Lastly, unlike ours, their analysis does not cover any mitigation efforts. Similar in spirit, geographical representativeness has been studied for text-to-image generation models (Rojas et al., 2022; Basu et al., 2023).

Method
Our goal is to measure, and later mitigate, the extent to which large pretrained models underpredict some countries when generating language. We formalise this notion here. Note that while we study autoregressive models in this work, the methodology extends straightforwardly to masked models. Similarly, we are interested in measuring and reducing geographical erasure, but the analysis can be applied to other attributes where ground truth is available. For example, one could measure erasure with respect to age, ethnicity, religion or gender using the same formalism.

Obtaining Model Predictions
Let p be our language model over vocabulary Ω. We consider open-ended generation tasks for autoregressive models. Such models predict the next token given previous ones, i.e. for a sequence of L tokens x^{1:L} ⊂ Ω the probabilities factorise as

$$p(x^{1:L}) = \prod_{l=1}^{L} p(x^{l} \mid x^{1:l-1}). \quad (1)$$

We use pretrained models and condition on a short prompt, or context, of variable length L: c = x^{1:L}. Given this prompt, we compute the predictive distribution over a set of M candidates {x_i}_{i=1}^{M} = X ⊂ Ω; see §3.2 for how these M countries are chosen. For a candidate country x_i ∈ X we compute p(x_i|c) as

$$p(x_i \mid c) = \frac{p(x_i \mid x^{1:L} = c)}{\sum_{j=1}^{M} p(x_j \mid x^{1:L} = c)}, \quad (2)$$

i.e., we compute p("I live in x_i") for all candidate countries x_i and normalise. If a country is tokenised into multiple tokens, x_i = x_i^{1:J}, we multiply the probabilities of the J subtokens according to (1). As before, superscript indicates position and subscript indicates the country name, e.g., x_7 = "Uganda" is tokenised into x_7^0 = "U", x_7^1 = "g", x_7^2 = "anda". As a consequence, p(x_i|c) tends to be smaller for multi-token country names. Concerningly, Zhou et al. (2022) show that this issue predominantly impacts low-GDP countries.
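The chain-rule product in (1) and the normalisation in (2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the probabilities in `TOY_LM` are invented toy values standing in for real model outputs, and the two-candidate set is hypothetical.

```python
# Toy stand-in for an autoregressive LM: maps (context tokens, next token) to a
# probability. All numbers are invented for illustration; a real implementation
# would read these off the model's softmax outputs.
TOY_LM = {
    (("I live in",), "Canada"): 0.020,
    (("I live in",), "U"): 0.004,
    (("I live in", "U"), "g"): 0.5,
    (("I live in", "U", "g"), "anda"): 0.9,
}

def sequence_prob(context, subtokens):
    """Chain-rule product over sub-tokens, as in Eq. (1)."""
    prob, ctx = 1.0, tuple(context)
    for tok in subtokens:
        prob *= TOY_LM[(ctx, tok)]
        ctx += (tok,)
    return prob

# p(x_i | c): score every candidate, then normalise over the candidate set (Eq. 2).
candidates = {"Canada": ["Canada"], "Uganda": ["U", "g", "anda"]}
raw = {x: sequence_prob(("I live in",), toks) for x, toks in candidates.items()}
total = sum(raw.values())
p_given_c = {x: p / total for x, p in raw.items()}
# Multi-token "Uganda" is penalised: 0.004 * 0.5 * 0.9 = 0.0018 vs 0.020 for "Canada".
```

Note how the product over three subtokens drives "Uganda"'s score down, illustrating the multi-token penalty discussed above.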
Some countries in X are referred to by more than one name, e.g., "UK" and "United Kingdom". We disambiguate the countries using a list of alternative names 2 to obtain the final

$$p(x_i \mid c) = \sum_{a \in A} p(x_i^{a} \mid c)$$

for all alternative names x_i^a. In the following sections, we sometimes write p(x_i|c) = p_i, omitting the dependency on the prompt unless ambiguous. Note that we work directly on the model probabilities and discuss the impact on generated language in §6.
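A minimal sketch of this alias aggregation, with invented per-name probabilities and a hypothetical alias table (not the list used in the paper):

```python
# Hypothetical per-name probabilities p(x^a | c) under some prompt, and a toy
# alias table; both are illustrative only.
name_probs = {"UK": 0.05, "United Kingdom": 0.08, "Britain": 0.02, "Nigeria": 0.01}
ALIASES = {
    "United Kingdom": ["UK", "United Kingdom", "Britain"],
    "Nigeria": ["Nigeria"],
}

def country_prob(country):
    """Sum probability mass over all alternative names of a country."""
    return sum(name_probs.get(alias, 0.0) for alias in ALIASES[country])
```

Here `country_prob("United Kingdom")` pools the mass of all three surface forms before any comparison to ground truth.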

Obtaining Ground Truth
To measure erasure, we compare the generation distribution (given by Equation (2)) to a ground truth distribution p_true over the candidate countries, writing p_true(x_i) = p_true_i as before. The ground truth is given by real-world data, i.e., we compare our predictions to the actual population of country x_i. We adjust for the fact that our models are trained on English texts only by considering English-speaking populations as ground truth (see §6 for limitations of this approach). The number of English speakers per country is obtained from a Wikipedia list containing data for M = 127 countries at the time of writing; we use these 127 countries for our analysis. 3 Unlike the model predictions p(x_i|c), the ground truth p_true(x_i) is prompt-independent. We will generalise model predictions to be prompt-independent as well by marginalising over prompts in §3.5. See Figure 2 (left) for an example of model predictions and ground truth.

Measuring Erasure
With these prerequisites in place, we can now formalise erasure using the relationship of predictive distribution and ground truth.
Definition 1 (Erasure Set). For a ratio threshold r > 1 we define the erasure set under model p, ground truth p_true and prompt c as

$$S_r = \left\{ x_i \in X : \frac{p^{\text{true}}(x_i)}{p(x_i \mid c)} > r \right\}. \quad (3)$$
In the example in Figure 2 (left), we prompt the OpenLLaMA model with different versions of the prompt "I live in" and aggregate the predictions (see §3.5 for rephrasing and aggregation). We then compute the erasure set for r = 3, i.e. countries that are three times more prevalent in the ground truth than in our predictions. We obtain S_3 = {Pakistan, Nigeria, Uganda, Bangladesh, Iraq, Madagascar, Eswatini}. A simple metric of erasure is the size of the erasure set |S_r| for a user-specified r; here, |S_3| = 7.
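Definition 1 translates directly into code. The distributions below are invented toy numbers (the real S_3 above comes from OpenLLaMA predictions and English-speaker counts):

```python
def erasure_set(p_true, p_model, r=3.0):
    """S_r of Eq. (3): countries whose ground-truth share exceeds the
    predicted share by more than a factor r."""
    return {x for x in p_true if p_true[x] / p_model[x] > r}

# Toy distributions over four countries (illustrative values only).
p_true  = {"Canada": 0.10, "Pakistan": 0.35, "Nigeria": 0.30, "UK": 0.25}
p_model = {"Canada": 0.40, "Pakistan": 0.05, "Nigeria": 0.15, "UK": 0.40}

s3 = erasure_set(p_true, p_model, r=3.0)  # Pakistan: 0.35 / 0.05 = 7 > 3
```

On this toy pair only Pakistan crosses the threshold; Nigeria is underpredicted by a factor 2 and therefore stays outside S_3.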
|S_r| measures how many countries are "erased" (underrepresented by at least a factor r). To obtain a more fine-grained numerical evaluation, we measure by how much they are underrepresented compared to the ground truth by reporting the following metric.
Definition 2 (Erasure). Erasure w.r.t. ground truth p_true at threshold r is defined as

$$\mathrm{ER}_r(p^{\text{true}}, p) = \sum_{x_i \in S_r} p^{\text{true}}_i \log \frac{p^{\text{true}}_i}{p_i}. \quad (4)$$

Properties of ER r
A careful conceptualisation of any proposed fairness metric is crucial (Schwöbel and Remmers, 2022; Blodgett et al., 2020). We motivate our definition of ER_r here. Firstly, if p = p_true then ER_r(p_true, p) = 0 for all r; i.e., no erasure occurs when the distributions match. Secondly, unlike the total variation distance suggested in Liang et al. (2022), we want our metric to be sensitive to relative rather than absolute errors, so that countries with small populations are also taken into account.
Hence we report (log-)ratios in the definition of ER_r (4). On the other hand, while we believe this sensitivity to less-populated countries is important, we do acknowledge that underpredicting large ground truth populations is particularly harmful as it impacts many users. Thus, we weight the log-ratios by the ground truth probabilities p_true_i. A third factor in our choice of metric is the close relation of (4) and the KL-divergence KL(p_true || p). ER_r is an additive component of the KL-divergence:

$$\mathrm{KL}(p^{\text{true}} \,\|\, p) = \sum_{i=1}^{M} p^{\text{true}}_i \log \frac{p^{\text{true}}_i}{p_i} = \mathrm{ER}_r(p^{\text{true}}, p) + \sum_{x_i \notin S_r} p^{\text{true}}_i \log \frac{p^{\text{true}}_i}{p_i}. \quad (5)$$

This close relation to a well-defined divergence measure allows for theoretical analysis and helps practitioners build on existing intuitions.
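The decomposition in (5) is easy to verify numerically. The sketch below uses invented toy distributions; with r = 0 every country enters the sum, so ER_0 recovers the full KL-divergence:

```python
import math

def erasure(p_true, p_model, r):
    """ER_r of Eq. (4): ground-truth-weighted log-ratios over the erasure set."""
    return sum(pt * math.log(pt / p_model[x])
               for x, pt in p_true.items() if pt / p_model[x] > r)

def kl(p_true, p_model):
    """Full KL-divergence KL(p_true || p_model)."""
    return sum(pt * math.log(pt / p_model[x]) for x, pt in p_true.items())

# Toy distributions (illustrative values only).
p_true  = {"Canada": 0.10, "Pakistan": 0.35, "Nigeria": 0.30, "UK": 0.25}
p_model = {"Canada": 0.40, "Pakistan": 0.05, "Nigeria": 0.15, "UK": 0.40}

er3 = erasure(p_true, p_model, r=3.0)   # only Pakistan's term (ratio 7 > 3)
# Eq. (5): KL = ER_r + the remaining (possibly negative) terms.
remainder = kl(p_true, p_model) - er3
```

The `remainder` collects the terms outside S_r; unlike the ER_r terms it can be negative, which is why ER_r is not simply bounded by KL.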
The choice of r is a crucial hyperparameter, as |S_r| and ER_r are defined in terms of r. We discuss its impact here and visualise it in Figure 2 (middle and right). For small values of r, we include all terms in (5), i.e., lim_{r→0} S_r = X and lim_{r→0} ER_r(p_true, p) = KL(p_true || p). For larger values of r, we instead have lim_{r→∞} S_r = ∅ and lim_{r→∞} ER_r(p_true, p) = 0. See Figure 2 (right) for this relationship. Since we want to measure erasure or underprediction, we study cases where p_true > p, i.e., values r > 1. 4 We pick r to be an integer such that ER_r(p_true, p) ≈ KL(p_true || p), that is r = 3 in the experiment in Fig. 2 (right). We find that this value is the same across all our models (see Appendix A), so we choose r = 3 globally. This choice of r is based on a mathematical heuristic. An alternative way of choosing this parameter might be implied by legal or ethical constraints. For example, a guideline on adverse impact by the US Equal Employment Opportunity Commission (1979) defines "a substantially different rate of selection" at 80%. In this labour market use case, r = 1/0.8 = 1.25 would be the corresponding hyperparameter.
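The heuristic for picking r (the integer whose ER_r best matches the KL-divergence) can be sketched as a simple sweep. The distributions here are again toy values of our own; on the real model outputs this procedure yields r = 3 for every model in the paper.

```python
import math

def erasure(p_true, p_model, r):
    """ER_r of Eq. (4)."""
    return sum(pt * math.log(pt / p_model[x])
               for x, pt in p_true.items() if pt / p_model[x] > r)

def kl(p_true, p_model):
    return sum(pt * math.log(pt / p_model[x]) for x, pt in p_true.items())

# Toy distributions (illustrative values only).
p_true  = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}
p_model = {"A": 0.70, "B": 0.20, "C": 0.06, "D": 0.04}

kl_val = kl(p_true, p_model)
# Pick the integer r > 1 whose ER_r is closest to the KL-divergence.
best_r = min(range(2, 10), key=lambda r: abs(erasure(p_true, p_model, r) - kl_val))
```

On this toy pair the sweep also happens to return r = 3: ER_2 overshoots KL (it adds a large term for country D), while ER_4 and beyond drop to zero.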
Differentiability is an important property of our metric since we want to use it for finetuning LLMs in §4.5. For fixed r, ER_r is differentiable almost everywhere (with respect to the network weights). Singularities occur at those points that add new countries to the erasure set S_r in (3), i.e., weights such that p_true_k = r · p_k for any country k.

Prompt Rephrasing
The erasure set definition in (3), and consequently the notion of erasure in (4), are prompt-dependent. However, we are more generally interested in the model's world knowledge rather than its completion of a specific prompt. Hence, we would like to aggregate the effect over all prompts encoding the meaning M = "home country", by using the following marginal distribution:

$$p(x_i \mid \mathcal{M}) = \sum_{c} p(x_i \mid c)\, p(c \mid \mathcal{M}). \quad (6)$$

The relationship between a prompt c and its meaning M is complex, hence computing (6) is intractable. Here, we rely on simple, pragmatic techniques to semi-automatically construct a set of sample prompts D ∼ P(c|M) from a seed prompt c. We rephrase c while preserving its meaning to generate additional prompts. This is common practice: Jiang et al. (2020) use mining- and translation-based paraphrasing methods, while Romano et al. (2006) rely on templates for paraphrasing. In light of recent advances in LLMs, another way to automatically rephrase prompts is to use a model that has been finetuned for paraphrasing (Niu et al., 2020). Even simpler, we use an off-the-shelf model by prompting ChatGPT to rephrase the c = "I am from" seed prompt. 5 After manually removing irrelevant prompts we obtain 16 base formulations. We further expand the set of prompts by replacing sentence subjects. For example, we expand "I live in" into {"You live in", "He lives in", "She lives in", ...}, producing a total of |D| = 955 prompts. Details and a list of all prompts can be found in Appendix B. We use the dataset of 955 prompts to approximate the marginal in (6) assuming different prior probabilities p(c|M) as follows:

(1) Uniform prompt distribution:

$$p(c \mid \mathcal{M}) = \frac{1}{|D|}, \qquad p_{\text{uni\_agg}}(x_i) = \frac{1}{|D|} \sum_{c \in D} p(x_i \mid c). \quad (8)$$

(2) Model-induced prompt distribution:

$$p(c \mid \mathcal{M}) = \frac{p(c)}{\sum_{c' \in D} p(c')},$$

where p(c) is the probability given by the autoregressive language model (1). In this case,

$$p_{\text{model\_agg}}(x_i) = \frac{\sum_{c \in D} p(x_i \mid c)\, p(c)}{\sum_{c \in D} p(c)}. \quad (10)$$
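The two aggregation schemes can be sketched as follows. The per-prompt predictions and prompt probabilities are invented toy values, not model outputs:

```python
# Toy per-prompt candidate distributions p(x|c) and prompt probabilities p(c)
# (all numbers invented for illustration).
preds = {
    "I live in":     {"Canada": 0.6, "Pakistan": 0.4},
    "I reside in":   {"Canada": 0.8, "Pakistan": 0.2},
    "My home is in": {"Canada": 0.5, "Pakistan": 0.5},
}
prompt_prob = {"I live in": 0.7, "I reside in": 0.1, "My home is in": 0.2}
countries = ["Canada", "Pakistan"]

# Uniform aggregate (8): a plain average over prompts.
p_uni_agg = {x: sum(d[x] for d in preds.values()) / len(preds) for x in countries}

# Model-induced aggregate (10): weight each prompt by p(c), then renormalise.
z = sum(prompt_prob.values())
p_model_agg = {x: sum(preds[c][x] * prompt_prob[c] for c in preds) / z
               for x in countries}
```

The model-induced aggregate leans towards the phrasings the model itself finds likely, here dominated by "I live in".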

Experiments
In this section, we show the existence of geographical erasure across different LLMs and different prompt wordings (§4.1). We highlight the consistency of erased countries across models (§4.2) and investigate the impact of model size on erasure (§4.3). We identify possible causes of erasure (§4.4) and explore a mitigation strategy (§4.5).
The models under consideration are GPT2 (Radford et al., 2019) in its 117M, 345M, 774M and 1.6B weight versions, GPT-Neo (Black et al., 2021) in its 125M, 1.3B and 2.7B weight versions, GPT-NeoX with 20B weights (Black et al., 2022), and open-source reproductions of the LLaMA model (Touvron et al., 2023; Geng and Liu, 2023; Computer, 2023) with 3B and 7B weights. We obtain all implementations from HuggingFace. 6

Figure 3: Geographical erasure occurs for all prompt rephrasings, and many countries experience erasure consistently under all models. Left: OpenLLaMA, 7B results for 955 individual prompts (blue dots) along the x-axis, with some example prompts as axis labels. We also plot ER_r in aggregate: the blue line is the average over individual prompts (1/C) Σ_c ER_3(p_true, p(·|c)), green is the uniform aggregate ER_3(p_true, p_uni_agg) (8) and red is the model-induced aggregate ER_3(p_true, p_model_agg) (10). The size of the dots corresponds to the probability assigned to the respective prompt under the model. The gap between the blue and red/green aggregations is explained in §3.5. Right: Of the M = 127 countries, 105 do not experience erasure at r = 3 for any of the models. For the remaining 22, we plot model counts here. Bars are coloured according to counts and sorted by GDP per capita (decreasing from left to right). We use aggregated predictions according to Equation (10).

Impact of Prompt Wording
We start by investigating how dependent erasure is on the exact phrasing of the prompt. We prompt the models with rephrased versions of "I live in" (see §3.5) and compute erasure ER_3(p_true, p(·|c)) for each prompt c. In Figure 3 (left), we plot (i) the erasure for individual prompts (dots); (ii) the average erasure (1/C) Σ_c ER_3(p_true, p(·|c)), denoted by the blue dotted line; (iii) the erasure for the uniform marginal distribution from (8), as a green dash-dotted line; and (iv) the erasure for the model-induced marginal distribution from (10), as a red dashed line. The size of the blue dots indicates p(c|M).
The magnitude of erasure ER_3(p_true, p(·|c)) differs across the phrasings c; however, erasure exists in all versions (that is, ER_3 > 0 with p-value ≪ 0.01). We note that erasure under the aggregate distribution is smaller than the average erasure, i.e., ER_3(p_true, p_uni_agg) < (1/C) Σ_c ER_3(p_true, p(·|c)) in Figure 3 (left). This follows from Jensen's inequality (see Appendix C for details). Throughout the remainder of the paper, we will report the aggregates from (8) and (10) along with boxplots of ER_3 to account for the variance due to rephrasings.

Who is Experiencing Erasure?
We evaluate whether the same countries experience erasure under all 10 examined models, and what characterises these countries. Out of the M = 127 countries under analysis, 105 do not experience erasure at r = 3 for any of the models. For the remaining 22 nations, Figure 3 (right) shows the number of models by which they are erased. Worryingly, Eswatini, Nigeria, Pakistan, Uganda and Madagascar experience erasure under all 10 analysed models. The x-axis in Figure 3 (right) is ordered by GDP per capita, in decreasing order from left to right. 7

Impact of Model Size
Related work (Lin et al., 2021; Rae et al., 2021; Nadeem et al., 2020) reports mixed results on the relationship between model size and bias. On the one hand, Lin et al. (2021) report that on the TruthfulQA benchmark, "[l]arger models are less truthful". This is because large models surface the common human misconceptions that the questions are designed to elicit. Such misconceptions are likely present in the training data which the larger models learn. Unlike Lin et al. (2021), we do not find model size to have a big impact. We hypothesise that even the smaller models closely mimic the frequency distribution (of country mentions) in the training corpus, similar to Rae et al. (2021)'s experiment. We believe that this is not the case in the tests by Lin et al. (2021) and Nadeem et al. (2020), because their tests go much beyond unigram frequencies, and smaller models do not exhibit such subtle biases. We explore the relationship of data bias and model bias below.

Impact of Training Data
We hypothesise that training data is an important factor for erasure: models underpredict countries which appear in the data infrequently compared to their population. To study the relationship between training data bias and model bias, we extract the distribution of country mentions in the training data. We consider the Pile dataset (Gao et al., 2021) used to pre-train the GPT-Neo LLMs analysed in this study. To determine the probability of occurrence in the training data p_train(x), we compute the number of times each country x is mentioned in the dataset, i.e., p_train(x) ∝ # mentions of x. These mention counts are weighted by the number of training epochs each document was included while training (dataset weights w_d from Gao et al. (2021)). We account for alternative country names as described in §3.1. Thus, the final formula becomes

$$p^{\text{train}}(x) \propto \sum_{a \in A} \sum_{d} w_d \cdot \#\,\text{mentions of } x^{a} \text{ in } d,$$

where A represents the set of alternative names of a given country and # represents counts. Once all the counts are gathered, the results are normalised to determine the final values of p_train(x), which we compare to the outputs of LLMs.
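The weighted, alias-aware counting can be sketched on a toy corpus. The documents, weights and alias table below are all invented; the real computation runs over the Pile with its dataset weights w_d, and would need word-boundary matching rather than plain substring counts:

```python
# Toy corpus: (text, dataset weight w_d) pairs, plus a hypothetical alias table.
docs = [
    ("Canada is cold. I visited Canada.", 2.0),
    ("Lahore is in Pakistan.", 1.0),
    ("The UK and the United Kingdom are one country.", 1.0),
]
ALIASES = {
    "Canada": ["Canada"],
    "Pakistan": ["Pakistan"],
    "United Kingdom": ["UK", "United Kingdom"],
}

# Weighted mention counts per country, summed over all alternative names.
counts = {x: 0.0 for x in ALIASES}
for text, w in docs:
    for country, names in ALIASES.items():
        counts[country] += w * sum(text.count(a) for a in names)

# Normalise to obtain p_train(x).
total = sum(counts.values())
p_train = {x: n / total for x, n in counts.items()}
```

A production version would tokenise and match on word boundaries to avoid, e.g., counting an alias that appears inside a longer word.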
Specifically, we compare ground truth p_true(x), training data p_train(x) and GPT-NeoX predictions p(x) for countries x ∈ S_3 (Figure 4, right). We see that countries experiencing erasure are indeed underrepresented in the training data, and that the prediction probabilities of these countries are similar to their frequency distribution in the training corpus (p_true(x) ≫ p(x) ≈ p_train(x)).
We then compute erasure against the training data, ER_r(p_train, p), i.e., considering the ground truth to be p_train (Figure 4, middle). We find that erasure values in this case are considerably lower. For instance, ER_r(p_train, p_model_agg) = 0.08 for GPT-NeoX, compared to ER_r(p_true, p_model_agg) = 0.46, i.e., erasure using the world population (Figure 4, left). This indicates that the GPT-NeoX family of LLMs mimics the training distribution (of country mentions). Furthermore, we find that the erasure score of the training data compared to ground truth, ER_r(p_true, p_train), is itself 0.46, which closely matches the erasure for models trained on this data. The high correlation between data bias and model bias suggests the composition of training data is a key source of erasure in the investigated LLMs.

Mitigation
In this section, we explore finetuning as a strategy to mitigate erasure. We perform gradient updates on a pretrained GPT2 model to minimise the erasure loss ER_3(p_true, p) on the training data given by the prompt dataset D. We note that our finetuning strategy differs from the related approach in §6.2 of Zhou et al. (2022): we have formulated a loss function which allows us to perform supervised finetuning, whereas Zhou et al. (2022) continue training the model using the standard masked language modelling loss with augmented data related to underpredicted countries. We use the AdamW optimiser (Loshchilov and Hutter, 2017) with learning rate 3e-5 and train for an additional 5 epochs (including one epoch of warmup under a linear schedule). We find that due to our loss function's direct dependency on the logits and the re-normalisation of probabilities over X (2), our finetuning strategy works best for deterministic models; hence we set dropout rates to 0 for the embedding, encoder, pooling and attention layers. For finetuning, we update the bias terms only, following Zaken et al. (2021). This is a memory-efficient strategy that is expected to work particularly well in settings with constant outputs (we want the generated distribution for all our prompts to match the ground truth), while not impacting the general language modelling abilities. We evaluate whether the language modelling abilities deteriorate by measuring perplexity on Wikitext-2-v1 (Merity et al., 2016) before and after every epoch of finetuning.
To measure how well our finetuning strategy generalises, we compare three different ways of performing train-test splits of our 955 prompts. These are Random partitioning: we randomly split the prompts into 75% training and 25% test data; Pronouns: we split the prompts based on the pronouns they contain, e.g. all prompts containing "she", "you", "we" and "they" are in the training set, and those with "I" and "he" in the test set; Verbs: we divide along verb groups, e.g. prompts containing "to live in" and "to be a citizen of" are in the training set, and those with "to reside in" in the test set. These three setups require increasing levels of generalisation.
For all three setups, we repeat the experiment on 5 different folds and plot the results in Figure 5. We find that our finetuning strategy is effective: the average erasure (1/|D|) Σ_{c∈D} ER_3(p_true, p(·|c)) is small after 5 epochs of finetuning, both on training (blue) and test data (red). The model generalises well in the random case (Figure 5, left) and to new pronouns (Figure 5, middle). As expected, verb splits are the most challenging for our model: the erasure values decrease, but not by as much as in the other splits (Figure 5, right). In all cases, we see only a small deterioration in language modelling performance, as indicated by an approximate 5% increase in perplexity (the green lines in Figure 5). We compare this successful mitigation strategy to alternatives in Appendix D.

Conclusion
We motivated the need for large language models to be more geographically inclusive, which remains an overlooked aspect of inclusive model development. Specifically, we studied and formalised a notion of geographical erasure, which captures which countries are underpredicted and the extent to which they are underpredicted. We discussed how our formulation captures many desirable properties. In our experiments, we found clear instances of geographical erasure, which were consistently observed across 10 different language models. Perhaps unsurprisingly, the output probabilities of language models closely follow the frequencies of country mentions in the training corpora, a likely cause of erasure. We examined a finetuning-based mitigation strategy and found it to be effective in alleviating erasure.
Limitations

Languages considered. We limit our analysis to models trained on English texts, and hence we prompt them in English only. Our methodology extends to other languages straightforwardly. For example, to replicate the geographical experiment with a Spanish language model, one would auto-generate Spanish prompts (or translate the English ones from Appendix B).
The language (of prompts) used to analyse erasure should be accounted for when collecting ground truth data: for instance, English-speaking countries are expected to have higher probability conditioned on "I live in", and similarly Spanish-speaking countries are likely to have higher probabilities conditioned on "Vivo en". In our work, we account for this by considering English-speaking populations as ground truth in §4 (and one would proceed accordingly for a model in a different language).

Difficulty in obtaining ground truth. Language-specific ground truth data is less reliable and harder to obtain than raw population counts. Such statistics are often self-reported, and the level of proficiency differs dramatically across regions, especially since the numbers include second-language speakers. 2 Since we only measure erasure for countries where p_true_i is available, the availability of language-specific ground truth data is itself a biasing factor. This is evident from Figure 1, which depicts how the lack of ground truth data predominantly affects central African regions.

Knowledge encoding vs. language generation. Our erasure metric is based on country probabilities given a prompt, p(x_i|c). These probabilities can be interpreted as knowledge encoded by the model. When generating text, the model probabilities are used to sample next tokens. Sampling (or decoding) can be performed using different strategies, e.g. greedy or beam search (Klein et al., 2017) to maximise probabilities, or top-k (Fan et al., 2018) and top-p (Holtzman et al., 2019) sampling strategies to generate more diverse outputs.
Our work does not analyse the effect of the decoding mechanism, since we work directly on p(x_i|c) instead of generated text. This is not uncommon in prior work: likelihood-based methods such as perplexity or cross-entropy are a customary way to evaluate language modelling abilities, also in modern LLMs (Radford et al., 2019).
Compared to evaluating the full generation pipeline, erasure (as defined in Equation (4)) can be thought of as a lower bound to erasure under sampling: instead of considering the full predictive distribution, the above sampling mechanisms only consider high-probability candidates, erasing low-probability countries to even larger degrees.
Causes of erasure. Our analysis covers two potential sources of erasure: training data and model size. Model bias is commonly explained by data bias (e.g. Bender et al. (2021); Schwöbel (2022); Buolamwini and Gebru (2018)). In our work, we have not experimentally established the cause; instead, our experiments indicate a high correlation between model biases and data biases in §4.4, suggesting that data is a likely source of erasure. Data, however, is not the only biasing factor. Model architecture and training paradigm determine how the data is used by the model. Hence, they determine whether data bias is mitigated or exacerbated (Hooker, 2021). We examine the impact of model size and find that it has little to no impact on geographical erasure (§4.3). Examining the impact of other factors on erasure is left to future work.

A Choosing r -additional Models
Section 3.4 compares ER_r for different values of r to the KL-divergence. We pick r = 3 in this experiment such that ER_r(p_true, p) ≈ KL(p_true || p). Figure 6 contains the same experiment for all models under consideration. The optimal choice according to this heuristic is r = 3 for all of them.

C Variability across Prompts
Recall from (8) that p_uni_agg(x_i) = (1/|D|) Σ_{c∈D} p(x_i|c). Since the logarithm is concave, Jensen's inequality gives

$$\log \frac{p^{\text{true}}_i}{\frac{1}{|D|}\sum_{c \in D} p(x_i \mid c)} \;\le\; \frac{1}{|D|} \sum_{c \in D} \log \frac{p^{\text{true}}_i}{p(x_i \mid c)}.$$

Thus, erasure under the aggregate distribution ER_r(p_true, p_uni_agg) is a lower bound to the average erasure (1/|D|) Σ_{c∈D} ER_r(p_true, p(·|c)).
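This bound is easy to check numerically on toy distributions (all values below are invented for illustration):

```python
import math

def erasure(p_true, p_model, r=3.0):
    """ER_r of Eq. (4)."""
    return sum(pt * math.log(pt / p_model[x])
               for x, pt in p_true.items() if pt / p_model[x] > r)

# Toy ground truth and two per-prompt predictive distributions p(x|c).
p_true = {"A": 0.5, "B": 0.5}
per_prompt = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}]

# Average erasure over prompts vs. erasure of the averaged (uniform-aggregate)
# distribution, as in (8).
avg_er = sum(erasure(p_true, q) for q in per_prompt) / len(per_prompt)
p_uni_agg = {x: sum(q[x] for q in per_prompt) / len(per_prompt) for x in p_true}
agg_er = erasure(p_true, p_uni_agg)
```

Averaging the prompts first smooths out prompt-specific underprediction, so `agg_er` never exceeds `avg_er` on such examples.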

D Alternative Mitigation Strategies
Finetuning for other values of r: In §4.5 we mitigate erasure by finetuning, employing ER_3 as a loss function (Figure 5). This choice corresponds to a minimal intervention where we only modify the distributions for countries underpredicted at a rate above r = 3; we do not address underprediction by a smaller degree.
Finetuning with r = 0 (Fig. 7) is a stronger intervention, matching the distributions for all countries (since ER_0(p_true, p) = KL(p_true || p), see §3.4). As before, we can match the full distributions and achieve ER_0(p_true, p) ≈ 0 after only 5 epochs of finetuning. However, due to the more drastic intervention into the model distribution, the drop in language modelling performance is larger: perplexity increases by almost 20%, compared to 5% in Figure 5. Note the different y-axis scales between Figures 5 and 7.

Mitigation via Temperature Softmax: A simple way to modify the model distribution p is via the softmax temperature parameter τ of the model. We have used τ = 1 in all previous experiments. Here, we experiment with modifying τ to mitigate erasure by choosing

$$\tau^{\star} = \arg\min_{\tau} \mathrm{ER}_r(p^{\text{true}}, p_{\tau}). \quad (12)$$

Figure 8 shows ER_r and perplexity as a function of τ. The optimal value (minimising ER_r w.r.t. τ) is 0.948 (dashed line). This mitigation method is compared to the finetuning of the neural network parameters from earlier experiments in Table 1. The two middle columns correspond to the finetuning results from Figure 5 and Figure 7; the rightmost column contains the results for the varying temperature parameter τ.
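The temperature search in (12) can be sketched as a grid search over τ. The logits, ground truth and grid below are toy values of our own; the reported optimum of 0.948 comes from the actual GPT2 experiment, and the optimal τ for this toy example will differ.

```python
import math

def softmax(logits, tau):
    """Temperature-scaled softmax (numerically stabilised)."""
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def erasure(p_true, p_model, r=3.0):
    """ER_r of Eq. (4), over parallel lists."""
    return sum(pt * math.log(pt / pm)
               for pt, pm in zip(p_true, p_model) if pt / pm > r)

# Toy ground truth and candidate logits (invented values).
p_true = [0.4, 0.3, 0.2, 0.1]
logits = [3.0, 0.5, 0.0, -1.0]

# Grid search for the tau minimising ER_r, as in Eq. (12).
grid = [0.5 + 0.01 * i for i in range(200)]
best_tau = min(grid, key=lambda t: erasure(p_true, softmax(logits, t)))
```

A single scalar τ rescales the whole distribution at once, which is why it cannot target individual erased countries the way full finetuning can.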
Perhaps unsurprisingly, mitigation with the single parameter τ is much less successful than full finetuning (only a small drop in ER_r; see the first row of Table 1). Perplexity, however, improves slightly over the original model.
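The temperature search in (12) amounts to a one-dimensional scan: rescale the logits by 1/τ, renormalise with a softmax, and keep the τ with the lowest ER_r. A minimal sketch with hypothetical country logits and ground-truth shares (the ER_r definition is again the assumed threshold form from §3):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def erasure(p_true, p, r):
    ratio = p_true / p
    mask = ratio >= r
    return float(np.sum(p_true[mask] * np.log(ratio[mask])))

# hypothetical country logits and ground-truth shares, for illustration only
logits = np.array([3.0, 1.0, 0.2, -0.5])
p_true = np.array([0.45, 0.25, 0.2, 0.1])

# grid search over the temperature, as in Eq. (12)
taus = np.linspace(0.5, 2.0, 151)
best_tau = min(taus, key=lambda t: erasure(p_true, softmax(logits / t), 3))
```

Only a single scalar is tuned here, which is why the achievable reduction in ER_r is limited compared to finetuning all network parameters.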

Figure 1: Some countries are vastly underpredicted compared to their English-speaking populations. Top: Country probabilities assigned by GPT-NeoX when prompted with "I live in". Middle: English-speaking populations per country. Bottom: Countries experiencing erasure, i.e. underprediction compared to their population by at least a factor of 3 (see §3). Data is missing for grey countries (see §6).

Figure 2: Understanding erasure. Left: OpenLLaMA, 7B vastly underpredicts the occurrence of Pakistan, Nigeria and Uganda. We plot country predictions given prompts, p(x_i|c), for different rephrasings of the prompt "I live in" (boxplots) and the ground truth (barplot, grey). Country names experiencing erasure (x_i ∈ S_3, see §3.3) are in red. We show the 12 countries with the largest English-speaking populations (in decreasing order). Middle: Erasure set size |S_r| as a function of r for OpenLLaMA, 7B. We plot the median (solid line) and 25th-75th percentiles (blue shaded area) over different rephrasings (see §3.5) of the same prompt. The dashed line marks r = 3, the threshold we use in the experiments. This choice is further motivated in §3.4. Right: Comparing ER_r (blue) for different r to the KL-divergence (red). We pick r = 3, the integer value for which KL and ER_r are the most similar.
In the limit, lim_{r→0} S_r = X and lim_{r→0} ER_r(p_true, p) = KL(p_true||p). For larger values of r, we instead have lim_{r→∞} S_r = ∅ and lim_{r→∞} ER_r(p_true, p) = 0.
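Both limits can be verified numerically for the assumed threshold-form ER_r (toy distributions below are hypothetical): at r = 0 the erasure set contains every country and ER_0 equals the KL-divergence, while a very large r empties the set and drives ER_r to zero.

```python
import numpy as np

def erasure(p_true, p, r):
    """ER_r with erasure set S_r = {x : p_true(x)/p(x) >= r} (assumed definition)."""
    ratio = p_true / p
    mask = ratio >= r
    return float(np.sum(p_true[mask] * np.log(ratio[mask])))

p_true = np.array([0.5, 0.3, 0.2])
p      = np.array([0.6, 0.3, 0.1])

kl = float(np.sum(p_true * np.log(p_true / p)))
er_zero = erasure(p_true, p, 0.0)   # r -> 0: S_r = X, so ER_0 = KL
er_huge = erasure(p_true, p, 1e9)   # r -> inf: S_r empty, so ER_r = 0
```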

Figure 5: Finetuning effectively alleviates erasure. We plot the average ER_r on training (blue) and test (red) set prompts during 5 epochs of finetuning of the GPT2-small model. Error bars indicate minima/maxima over 5 folds.
…was born and raised in ___.
7. I am a citizen of ___.
8. I originate from ___.
9. My roots are in ___.
10. I grew up in ___.
11. I was brought up in ___.
12. I was raised in ___.
13. I was born in ___.
14. My place of origin is ___.
15. I reside in ___.
16. My home country is ___.

8 Accessed via https://chat.openai.com/.

Figure 8: Mitigating ER_r using the temperature parameter τ is less successful than full finetuning. ER_r and perplexity are plotted as a function of τ. The optimal value (minimising ER_r w.r.t. τ) is 0.948 (dashed line).
[Legend: p(x_i|c); ER_3(p_true, p_agg_model); ER_3(p_true, p_agg_uni).] Erasure in models closely matches the distribution of country mentions in the training data. Left: Geographical erasure for GPT-type models of different sizes. Size is on the x-axis (axis not to scale). Blue models are GPT2, red models are GPT-Neo, yellow are OpenLLaMA, and the green model is GPT-NeoX; crosses below each box plot are aggregated results over different prompts (see Eq. 10). The dashed line indicates ER_r(p_true, p_train). Middle: Geographical erasure for GPT-type models of different sizes, assuming training frequency as the ground truth (instead of world population). Right: Ground truth (red) and Pile training data (green) distributions compared to GPT-NeoX predictions (blue) on countries in the erasure set S_3 (of GPT-NeoX w.r.t. the ground truth).