Geographic and Geopolitical Biases of Language Models

Pretrained language models (PLMs) often fail to fairly represent target users from certain world regions because of the under-representation of those regions in training datasets. With recent PLMs trained on enormous data sources, quantifying their potential biases is difficult, due to their black-box nature and the sheer scale of the data sources. In this work, we devise an approach to study the geographic bias (and knowledge) present in PLMs, proposing a Geographic-Representation Probing Framework that adopts a self-conditioning method coupled with entity-country mappings. Our findings suggest that PLMs' representations map surprisingly well to the physical world in terms of country-to-country associations, but this knowledge is unequally shared across languages. Finally, we explain how large PLMs, despite exhibiting notions of geographical proximity, over-amplify geopolitical favouritism at inference time.


Introduction
Large pretrained language models (PLMs) are capable of generating meaningful texts beyond English, and very likely, models like GPT-3 (Brown et al., 2020; Shliazhko et al., 2022; Zhang et al., 2022; BLOOM, 2022) will form the go-to base model for automating tasks like summarizing texts, generating datasets given certain instructions (Schick and Schütze, 2021), or perhaps even evaluating the generated texts (Yuan et al., 2021). While these PLMs continue to expand their utility, it is crucial that one also examines the potential biases that these PLMs exhibit. Moreover, the utility of these PLMs should be equitable to their target users so that they perform evenly for all speakers of the languages they are primarily trained on. Otherwise, the disparity that lies in the model (if any) will propagate further.

Figure 1 (caption): Connected countries are either geographically or culturally close (e.g., the South American cluster in light blue, African countries in yellow, South-East Asian countries in dark blue). Node size is proportional to its degree in the graph.

To better illustrate these dynamics, consider an L1 Spanish speaker from Peru, who is using a prompt-based PLM (like that of Wang et al. (2021, 2022)) to generate a localized synthetic dataset for some downstream task. They may use Spanish as used in the local context to form their seed data/prefixes/prompts. Now, if this language model has already skewed preferences towards geopolitically important countries, it is likely the generated texts will reflect this skewness, thus not appropriately reflecting the local, Peruvian context that the practitioner is interested in. However, the quantification of this presumed geographic disparity in PLMs is not yet explored. Though given the well-documented western-country bias (or Global North bias) exhibited in most NLP benchmarks and datasets (Faisal et al., 2022, inter alia), we hypothesize that text generation models might also suffer from a similar pitfall. On top of this, how language variety impacts the distribution of geographic knowledge encoded in PLMs (e.g., the sense of country-country association in physical or geopolitical space) is also under-explored.
Herein, we perform an evidence-based study to unfold the underlying geographic distribution of multilingual PLMs. We propose a pipeline to probe text-generative PLMs using prompt-based inference for geographic knowledge as well as existing domain-variant disparity (geography, in our case). Our research questions and key findings are:
• RQ1: To what extent is geographic proximity encoded in PLMs? Finding: PLMs can infer geographic proximity surprisingly well in terms of country-country association (see Figure 1). However, we also observe over-representation of certain countries during text generation.
• RQ2: What is the influence of multilinguality on a PLM's knowledge distribution of geographic proximity? Finding: The shared multilingual representation space of PLMs has an uneven distribution of knowledge across languages.
• RQ3: What is the effect of prompting using a geographic identifier (e.g., "In Colombia" <generate text>) on multilingual text generation? Finding: Prompting with certain geographic identifiers can even alter the language of free-form generated text.

Background and Related Work
A substantial amount of work has investigated existing social biases (e.g., gender, racial, ethnic, occupational) and their identification and mitigation in PLMs, including reducing token sensitivity during text generation (Liang et al., 2021), investigating model sensitivity (Immer et al., 2022), prompting using natural sentences (Alnegheimish et al., 2022), and probing via embedding lookup (Ahn and Oh, 2021). On the other hand, a number of studies experimented with the behavior different PLMs exhibit when probed with geographic context as well as cultural commonsense (Yin et al., 2022; Ghosh et al., 2021). However, we need to extract the specific model weights responsible for this observable polarity. Then, using those weights in a controlled setting, we might be able to unfold how PLMs encode geographic knowledge as well as explain the exhibition of geographic bias during inference. Suau et al. (2022) refer to these model weights as expert units.

Self Conditioning Pretrained Models
With these expert units identified, they can be further prioritized during text generation simply by setting their expected values as if the concept word (e.g., "doctor") were present in the generated text. This allows the PLM to generate texts relevant to the concept word without it being explicitly mentioned. In the work of Suau et al. (2022), by comparing the generated texts, they easily quantify the presence of gender-specific words, thus evaluating the presence of gender bias in the PLM (for example, consider the number of sentences where the context relates to the word "doctor" and mentions male-gender words compared to female-gender words). This approach serves two main purposes: (1) identifying expert units: model parameters responsible for generating text related to the target concept (i.e., "doctor"); (2) triggering specific behaviour in text generation without explicit mention of the target context, which inadvertently influences the behaviour of the model.
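As a rough illustration of the intervention (not the authors' implementation), the sketch below clamps a few hidden units of a toy two-layer network to their expected "on-concept" values during the forward pass; all names, shapes, and values here are hypothetical.

```python
import numpy as np

def forward_with_conditioning(x, W1, W2, expert_idx, expert_vals):
    """Toy 2-layer MLP forward pass where selected hidden units
    ("expert units") are clamped to their expected on-concept values,
    mimicking the self-conditioning intervention."""
    h = np.maximum(0, x @ W1)          # hidden activations (ReLU)
    h[..., expert_idx] = expert_vals   # clamp the chosen expert units
    return h @ W2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
x = rng.normal(size=(1, 4))
y = forward_with_conditioning(x, W1, W2, expert_idx=[1, 5], expert_vals=[2.0, 1.5])
print(y.shape)  # (1, 3)
```

In a real PLM the clamping would be applied inside the transformer layers during sequential decoding; the toy network only shows the mechanics of overriding unit activations.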

Geographic Representation Probing
In our study, we use the idea of self-conditioning pretrained models to first extract expert units (i.e., model weights) which encode geographic knowledge, and then we use those units during prompt-based text generation with different geographic identifier mentions. An example: using some sentences with the mention as well as absence of the word "China" to extract expert units, and then prioritizing these units during text generation with the prompt "In USA ...". The aim here is to simulate an environment where we evaluate the model knowledge (Country/Concept-specific Expert Units) by asking what it knows about other countries (i.e., Country/Prefix). This allows us to quantify existing geographic bias and favouritism towards certain attributes present in the representation space of multilingual PLMs.

Figure 2 (caption): We extract Expert Units from the base PLM; then we use similarity measurement to prepare our Geographic-Representation Network to perform Intrinsic Probing. In parallel, we prompt the self-conditioned PLM with Geographic Identifiers (i.e., Country/Prefix). Finally, we map the generated-text entities to countries to perform Extrinsic Probing.

Our probing framework contains five steps (see Figure 2).

Concept Dataset Construction
First of all, we prepare our concept dataset in a binary classification fashion, using which we later self-condition a PLM on geographic concepts. To make it quantifiable, we define country to be our main unit of reference and construct concept datasets where each "concept" is loosely centered around a country. An additional requirement for these datasets is that the data have not been used as part of the pretraining data of the PLMs. Hence, we turn to recent news articles (scraped using the Google News API): as we can control the date on which these data became public, we can be sure that they were not used in any pretraining process (so far). Such a dataset should also allow us to get a reasonable representation of current geopolitical affairs. Depending on the news-source country and language, we build several such Country/Concept datasets. A Country/Concept dataset {C}-{l} contains news about several (c_1, c_2, ..., c_i, ..., c_n) countries in language {l}, where the news source is {C}.

Expert Unit Extraction Using the self-conditioning framework, we identify high-performing Expert Units for each Country/Concept. For example, consider the Country/Concept India from the dataset USA-eng. The Expert Units are the model weights which provide higher scores for the presence of a concept (i.e., positive examples mentioning "India"). Observing the average precision scores, we select the top-k (e.g., 10, 50) Expert Units from each PLM layer.
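A minimal sketch of this selection step, assuming per-sentence unit activations are available as a matrix; the average-precision ranking below mirrors the selection criterion, while the toy data and unit count are invented for illustration:

```python
import numpy as np

def average_precision(labels, scores):
    """AP of `scores` as a ranker of binary `labels` (1 = concept present)."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    prec_at_pos = hits / (np.arange(len(labels)) + 1)
    return float(prec_at_pos[labels == 1].mean())

def top_k_experts(unit_scores, labels, k):
    """unit_scores: (n_sentences, n_units) activations; return the indices of
    the k units whose activations best predict concept presence."""
    aps = [average_precision(labels, unit_scores[:, j])
           for j in range(unit_scores.shape[1])]
    return list(np.argsort(aps)[::-1][:k])

labels = np.array([1, 1, 0, 0, 0])          # positives mention the concept
scores = np.array([[0.9, 0.1],
                   [0.8, 0.7],
                   [0.2, 0.9],
                   [0.1, 0.3],
                   [0.0, 0.2]])             # activations of 2 candidate units
print(top_k_experts(scores, labels, k=1))   # unit 0 ranks positives highest
```

In the paper this ranking is computed per layer so that the top-k units of each PLM layer are retained.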
Geographic-Representation Network Now, utilizing all these model Expert Units, we construct our Geographic-Representation Networks. We use Jaccard similarity to measure the similarity between any given Country/Concept pair c_i and c_j via their corresponding Expert Units. Then, utilizing these similarity scores as edges in a graph (the countries being the nodes), we prepare a PLM-specific Geographic-Representation Network for each of our Expert Unit sets. This network is a Minimum-Spanning-Tree graph highlighting the internal country-country associations. We further make it easier to digest by identifying the community clusters of countries using the Louvain Community Detection method (Blondel et al., 2008). For example, in Figure 1, we show the network obtained with the USA-eng dataset from the BLOOM (BLOOM, 2022) Expert Units. Effectively, we can recover a very good geographical representation of the countries straight from the weights of the network.
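The construction can be sketched with networkx as follows. The expert-unit sets are invented for illustration, and we convert Jaccard similarity into a distance so that the minimum spanning tree retains the strongest associations; networkx's `louvain_communities` stands in for the Louvain method of Blondel et al. (2008):

```python
import networkx as nx

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Hypothetical Expert Unit sets (layer, index) for five Country/Concepts.
experts = {
    "Peru":      {(0, 1), (0, 2), (1, 5)},
    "Chile":     {(0, 1), (0, 2), (1, 7)},
    "Argentina": {(0, 2), (1, 5), (1, 7)},
    "Korea":     {(2, 0), (2, 1), (3, 4)},
    "Japan":     {(2, 0), (2, 1), (3, 9)},
}

# Complete graph with distance = 1 - Jaccard similarity, so the minimum
# spanning tree keeps the most similar country pairs connected.
G = nx.Graph()
countries = sorted(experts)
for i, c1 in enumerate(countries):
    for c2 in countries[i + 1:]:
        G.add_edge(c1, c2, weight=1.0 - jaccard(experts[c1], experts[c2]))

mst = nx.minimum_spanning_tree(G)

# Re-weight the tree by similarity before community detection, so that
# strongly associated countries cluster together.
sim_tree = nx.Graph()
for u, v, d in mst.edges(data=True):
    sim_tree.add_edge(u, v, weight=1.0 - d["weight"])
communities = nx.community.louvain_communities(sim_tree, seed=0)
print([sorted(c) for c in communities])
```

With real expert-unit sets this yields the kind of regional clusters shown in Figure 1.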

Prompt-based Text Generation
With the Country/Concept-specific Expert Units at hand, we can now investigate what happens when we use the PLM for text generation. The self-conditioning method (Suau et al., 2022) uses sequential decoding and prioritizes the Expert Units by approximating their scores from the average precision values predicted for a certain Country/Concept. This allows us to artificially simulate the presence of a country name and its related context during text generation. Now we perform text generation with one more twist: we provide one country mention as part of the prefix/prompt (i.e., Country/Prefix). The idea here is to simulate an environment where we evaluate the model knowledge (Country/Concept-specific Expert Units) by asking what it knows about other countries (i.e., Country/Prefix). We generate several template-based multilingual prompts (the prefix construction process is depicted in Table 3) where we replace the <country> tag with different country names.
Entity Country Mapping Finally, to investigate the existence of geopolitical favouritism, we quantify the geographic biases of the generated texts by mapping any entities appearing in the text to their corresponding countries. We use the Dataset Geography framework of Faisal et al. (2022), which uses the multilingual entity linker mGENRE (De Cao et al., 2022) for linking entities to Wikidata entries, which are then mapped to countries.
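This last step can be illustrated with a toy lookup standing in for the mGENRE-to-Wikidata pipeline; the entity-to-country table below is invented for the sketch:

```python
from collections import Counter

# Hypothetical stand-in for the mGENRE -> Wikidata -> country pipeline:
# a toy lookup from a linked entity to its associated country.
ENTITY_COUNTRY = {
    "Machu Picchu": "Peru", "Lima": "Peru",
    "NASA": "USA", "Hollywood": "USA", "Seine": "France",
}

def country_distribution(linked_entities):
    """Map linked entities to countries and return per-country proportions."""
    counts = Counter(ENTITY_COUNTRY[e] for e in linked_entities if e in ENTITY_COUNTRY)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

dist = country_distribution(["Lima", "NASA", "Hollywood", "Machu Picchu", "Seine"])
print(dist)  # {'Peru': 0.4, 'USA': 0.4, 'France': 0.2}
```

Aggregating such distributions over all generated sentences gives the country-entity counts analyzed in the extrinsic probing.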

Experimental Settings
Models and Languages We use GPT2-medium (Radford et al., 2019), mGPT (Shliazhko et al., 2022) and BLOOM-560m (BLOOM, 2022), all available through Hugging Face. For the English dataset sourced from the US news platform (USA-eng), we extract Expert Units from all three models. For non-English datasets, we perform Expert Unit extraction on BLOOM and mGPT. For the generation-level analysis step, we use BLOOM and GPT2 (focusing on English) expert units and report results for conditioning on Country/Concept datasets in 8 languages: (ara, ben, eng, fra, hin, kor, rus, zho).
Datasets As mentioned before, each concept in our dataset contains 100 positive and 300 negative examples. In some cases, we use up-sampling by repeating the example sentences multiple times when we do not have 100 distinct examples mentioning the Country/Concept name. In total, we prepare 31 Country/Concept datasets (22 country news sources, 13 languages) and extract expert units conditioned on these datasets. We report the detailed dataset statistics in Appendix C and in Table 3.
Generative Scheme On average, we generate 112,225 sentences for a given model and Country/Concept dataset. For 67 Country/Concept Expert Unit sets, we randomly choose 5 prefix templates, fill them with all 67 country names, and generate the 5 sentences with the lowest perplexity per Country/Prefix; thus 67 x 5 x 67 x 5 = 112,225 sentences.
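The sentence budget works out as follows (a quick check of the arithmetic above):

```python
# Sentences generated per model/dataset: for each of the 67 Country/Concept
# expert-unit sets, 5 prefix templates are filled with all 67 country names,
# and the 5 lowest-perplexity generations are kept per Country/Prefix.
n_concepts, n_templates, n_countries, n_kept = 67, 5, 67, 5
total = n_concepts * n_templates * n_countries * n_kept
print(total)  # 112225
```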

Probing Metrics
We analyze both the Geographic-Representation Networks (intrinsic/parameter probing) and the generated texts (extrinsic/generation probing) to answer our research questions, with the aid of visualization and three additional quantitative metrics as follows:

1. Neighbourhood Score: We propose a proximity-based metric to quantify the inherent encoding of geographic proximity inside an LM by looking at the country-country associations and comparing them with the physical world. For example, in Figure 1, South American neighbouring countries are clustered together, thus preserving a factually consistent representation. To capture this, we compute the number of neighbours a country node is connected with within a 2-hop distance in a Geographic-Representation Network. To better illustrate, consider a Geographic-Representation Network G where country node c_5 ∈ G is connected with 4 other country nodes {c_1, c_2, c_3, c_4}; the score then reflects how many of these connected nodes are also the country's real-world neighbours.

2. Representation Score: We quantify the overall command of prefix, concept, or top-represented countries at the language level (i.e., for all generated text in a language). Consider that we have Expert Units already computed for Country/Concept c_i. We use these units to generate text while providing a Country/Prefix p_j. Later, we map the entities of the generated text to countries. So, if we have a total of L = {l_1, l_2, ..., l_k, ..., l_n} countries with respective entity counts, we can get the top-represented countries T(c_i, p_j) for each concept-prefix pair (c_i, p_j). Having this set of highly represented countries for each concept-prefix pair at hand, we can now compute in how many cases a Country/Concept, a Country/Prefix, or the top-10 most represented countries are present in the set T(c_i, p_j), for all c_i ∈ N. The intuition here is to quantify how much the influence of Country/Concept, Country/Prefix, or overly represented countries varies across languages. For example, if we observe that the score for Country/Prefix is higher than the score for Country/Concept across all settings, it means Country/Prefix is a more influential factor than Country/Concept in the geographical relatedness of the text generation. For comparative analysis, we consider the top-3 represented countries, instead of just one, while computing T(c_i, p_j).

3. Skewness: We compare the symmetry of the generated country-entity distribution for both the generated texts and the concept dataset texts. The more skewed distributions are the ones containing amplified bias towards entities of certain country origin.

Figure 4 (caption): Geographical closeness present in model units. (a) The variation of the neighbourhood score for different sets of expert units. Notice at (a.1) we get the best score for USA-eng, and it decreases when we translate the concept dataset. This also varies across languages and models (a.2), and the precise identification of expert units using a high-quality concept dataset also matters (a.3).

RQ1: To what extent is geographic proximity encoded in the PLMs?

Extrinsic Findings: Next, we investigate whether the encoded geographic proximity gets modified due to geopolitical favouritism by performing entity-country mapping on a large pool of generated texts in eight languages (112,225 avg. sentences per language). Evidently, we observe a strong presence of geopolitical favouritism, which we define as the over-amplification of certain country representation (e.g., countries with higher GDP, geopolitical stability, military strength, etc.). For comparison, we use the distribution of the Country/Concept dataset, as it contains the actual news text reflecting real-world affairs.
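A sketch of the Neighbourhood Score described above; the exact aggregation below (the fraction of countries with at least one real-world neighbour within 2 hops of them in the network) is our reading of the metric, and the tiny graph is invented:

```python
import networkx as nx

def neighbourhood_score(rep_net, real_neighbours):
    """Fraction of countries whose real-world neighbours appear within
    2 hops of them in the Geographic-Representation Network. The exact
    aggregation is an assumption for this sketch."""
    hits = 0
    for country, true_nbrs in real_neighbours.items():
        # nodes reachable within 2 hops (excluding the country itself)
        within2 = set(nx.single_source_shortest_path_length(rep_net, country, cutoff=2))
        within2.discard(country)
        if within2 & set(true_nbrs):
            hits += 1
    return hits / len(real_neighbours)

rep_net = nx.Graph([("Peru", "Chile"), ("Chile", "Argentina"), ("Peru", "Korea")])
real = {"Peru": ["Chile"], "Argentina": ["Chile"], "Korea": ["Japan"]}
print(neighbourhood_score(rep_net, real))  # 2/3: Korea's real neighbour Japan is absent
```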
In Table 1 (two left sections), we contrast the top-represented countries, aggregating the counts from all Country/Concept datasets, to the ones in the generated text. All top-10 most represented countries in the generated texts are present within the top-16 ranks of geopolitically significant countries. This resemblance to the distribution of geopolitically powerful countries is visible across all forms (generated-text country maps in Appendix F). However, when we compare these top-10 country representations (%) in the generated text with the ones from the concept dataset, we observe geopolitical favouritism. The result is presented in Figure 6, where in all language country-entity distributions, the top-10 country percentage is always higher compared to real-world news (Figure 6(a)). A similar pattern is apparent for the other 7 languages (except Korean) in terms of data skewness (Figure 6(b)). Last, we performed Kolmogorov-Smirnov and Shapiro statistical significance tests to ensure that the generated-text country distribution follows a log-normal distribution. The striking fact here is that, though this distribution contains entity mentions from 246 countries in total, around 11.5% of all generated entities are from the USA alone. This phenomenon can be further quantified using the neighbourhood score reported in Figure 4. For example, as shown in Figure 4(a), we find that for all 3 models (GPT2, BLOOM, mGPT), the Geographic-Representation Networks built from the English-dataset-conditioned Expert Units have around 50% of the countries connected with their real-world 2-hop neighbours.
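The goodness-of-fit check can be sketched with SciPy; the counts below are synthetic draws standing in for the per-country entity counts, and fitting the log-normal parameters before the KS test is a simplification of a proper Lilliefors-style procedure:

```python
import numpy as np
from scipy import stats

# Synthetic per-country entity counts drawn from a log-normal distribution,
# standing in for the country-entity distribution of the generated text.
rng = np.random.default_rng(0)
counts = rng.lognormal(mean=3.0, sigma=1.0, size=246)

# Fit a log-normal (location fixed at 0) and test goodness of fit with
# Kolmogorov-Smirnov, mirroring the paper's log-normality check.
shape, loc, scale = stats.lognorm.fit(counts, floc=0)
stat, pvalue = stats.kstest(counts, "lognorm", args=(shape, loc, scale))
print(f"KS statistic={stat:.3f}, p={pvalue:.3f}")  # a high p cannot reject log-normality
```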
RQ2: What is the influence of multilinguality in PLM's knowledge distribution of geographic proximity?
Intrinsic Findings: By now, we have evidence that geographic proximity is directly encoded in PLMs in the form of shared expert units. So how does this knowledge differ across languages? Ideally, multilingual PLMs should provide equitable utility for their intended users by being cross-lingually consistent. To evaluate this, we automatically translate our USA-eng dataset, to avoid any confounders from news-content discrepancies across the world. This way, the content used for identifying the expert units is thematically and semantically the same across languages. The result, in Figure 4(a), shows noticeable disparities in Neighbourhood Scores across languages. When we find Expert Units using Latin-script-based Country/Concept datasets (English, French), the Expert Units capture the most associations among closely related neighbours, while the scores are less than half for Russian, Greek, or Korean in models like mGPT or BLOOM.
RQ3: What is the effect of prompting with a geographic identifier (e.g., "In Colombie" <generate text>) on multilingual text generation?
Extrinsic Findings: To answer this question, we look into the language of the generated texts using the spaCy language identifier. On average, BLOOM generates around 5.85% of sentences (52k out of our 898k generated sentences) in a language different from that of the prefix. This anomaly happens at a larger rate for Russian, Chinese, and French (Figure 5). We observe that every language has a specific second-language preference (i.e., rank 1 in Figure 5) for which the model can ignore the given prefix and generate a sentence in that language (e.g., kor → jap, ben → ara, eng → spa, ara → far, zho → kor, rus → bgr, etc.). This language preference is not symmetric (e.g., kor → jap whereas zho → kor).
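The per-language preference ranking can be computed from (prefix language, detected language) pairs as below; the pairs shown are fabricated for illustration:

```python
from collections import Counter

def second_language_preference(pairs):
    """Given (prefix_language, detected_language) pairs for generated
    sentences, return each prefix language's most frequent *other*
    generation language, i.e., its second-language preference."""
    by_prefix = {}
    for prefix_lang, gen_lang in pairs:
        if gen_lang != prefix_lang:  # only count language switches
            by_prefix.setdefault(prefix_lang, Counter())[gen_lang] += 1
    return {p: c.most_common(1)[0][0] for p, c in by_prefix.items()}

pairs = [("kor", "kor"), ("kor", "jap"), ("kor", "jap"), ("kor", "eng"),
         ("eng", "spa"), ("eng", "spa"), ("eng", "fra"), ("zho", "zho")]
print(second_language_preference(pairs))  # {'kor': 'jap', 'eng': 'spa'}
```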
Observing the amount of text generated in different languages, it might seem insignificant at first sight. However, we need to keep in mind that there is one geographic identifier in the prefix (Country/Prefix) as well as the given Country/Concept units. So, when we look into which concept-prefix pairs usually change the language direction, we observe interesting cultural correlations. In Table 2, given a Country/Prefix, we show how certain country mentions instigate text generation in a different direction (up to 50% of the total generated text, given a prefix-concept pair). This happens frequently when a prefix token is shared among those languages ("in" exists both in English and Spanish; detailed examples in Appendix G) and when the country is closely tied with the language. For example, the fra → spa and eng → spa directions (French/English prefixes continued in Spanish) include country mentions of Cuba, Argentina, Colombia, or Chile, which are all Spanish-speaking countries. We hypothesize that the shared representation space of a multilingual decoder often ties language with geographic entities, thus changing the favoured generation language.

Further Analysis
Data Origin Because we are experimenting with real-world multilingual news data without going through any extensive data cleaning process, we also need to quantify the dataset-level significance: how does Country/Concept data quality impact the identification of Expert Units?

Table 2 (caption): Given a prefix in one language, the LM generates in a different language, influenced by the concept and prefix countries. These are the cases for which the percentage of language change is more than 50%.
The scraping method we use for dataset construction returns localized news depending on the source location. For example, a USA news source provides a higher amount of global news with many country mentions. On the other hand, a news source from Bangladesh provides news mostly about its close geopolitical neighbours (e.g., India and China). Thus, the entity frequency distributions of USA-eng and BGD-ben would not be similar.
In addition, we have variations in the amount of upsampling and the negative-instance domain. So, in Figures 4(b) and 4(c), we report Neighbourhood Scores for varied geographic sources on non-English and English datasets, respectively. Like before, the association knowledge for the USA-eng-sourced Geographic-Representation Network remains the most truthful. For Spanish news sourced from different locations (Cuba, Mexico, Peru), scores are rather similar. Interestingly, the score drops significantly for CHN-zho compared to the translated USA-zho from Figure 4(a). For the English dataset sourced from different geographic locations (Figure 4(c)), we get poor association scores for any locale other than the USA, confirming the fact that the in-domain distance between positive and negative examples matters given a fixed language. To dig in further, we perform an ablation study by creating one additional augmented English dataset, eng-[M], by masking Country, Name, and Organization entities in the USA-eng dataset using spaCy NER. Surprisingly, eng-[M] shows the highest percentage of geographic associations, even surpassing the original USA-eng one for mGPT. We conclude that small semantic incoherence does not hurt the Expert Unit extraction and that a more contrastive positive-negative class difference (absence of other entity types) helps.
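The eng-[M] construction can be sketched as follows, assuming entity character spans have already been produced by a NER tagger (the sentence and spans here are invented; in the paper spaCy NER supplies them):

```python
def mask_entities(text, spans, mask="[MASK]"):
    """Replace entity character spans (start, end) with a mask token.
    Spans are hypothetical stand-ins for spaCy NER output on the
    USA-eng data used to build the eng-[M] ablation dataset."""
    for start, end in sorted(spans, reverse=True):  # right-to-left keeps offsets valid
        text = text[:start] + mask + text[end:]
    return text

sent = "Google opened a new office in Lima, said Maria Lopez."
spans = [(0, 6), (30, 34), (41, 52)]  # Google, Lima, Maria Lopez
print(mask_entities(sent, spans))
# [MASK] opened a new office in [MASK], said [MASK].
```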

Model Comparison
In terms of Neighbourhood Score, mGPT Expert Units encode 23.5% more geographic expertise than the BLOOM-560m model on translation datasets (similar text, different language). This improvement increases to 30% when we consider the multilingual datasets (both text and language different). GPT-2 units perform similarly on the English dataset.
We conduct another ablation study to quantify how prone these models are to randomness and semantic incoherence. We prepare another augmented English dataset, eng-[R], by inserting random, semantically incoherent texts while maintaining the positive-negative class difference. The corresponding Neighbourhood Scores are shown in Figure 4(c). Now, BLOOM Expert Units are almost as good as before, whereas mGPT Expert Units are far worse; only in 3 other cases do BLOOM-560m units represent better associations in total. This reveals that these models contain different distributions even though they were trained with similar objectives, showing different magnitudes of response to data attribute variations, including noise, semantic coherence, data quantity, and language.
Influence of Country/Concept and Country/Prefix We simulate an environment where we provide Expert Units about one geographic entity (Country/Concept) and ask a PLM about another geographic entity (Country/Prefix). By now, we have shown that the PLM encodes geographic proximity but also exhibits geopolitical favouritism during inference. The question we ask at this point is: given that the PLM is biased, how do the Country/Concept and Country/Prefix influence text generation?
To answer this question, we compute the Representation Score on generated texts, varying the language (Figure 6(c)). As always, the top-10 country Representation Score dominates in all languages, while the second most influential factor is Country/Prefix. In Hindi, Country/Concept has the highest influence among geographic mentions in prompt-based generation. However, this scenario does not hold for Korean, Bengali, and Russian. On the other hand, Country/Concept plays the part of a subtle representative but fails to compete with Country/Prefix and geopolitically significant countries. One fact to note here is that our experiment conditions on a small number of examples while generating a large pool of texts. Nevertheless, we believe that it will require intensive data creation efforts to mitigate the biases that coexist with the geographic knowledge in PLMs.
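Our reading of the Representation Score can be sketched as below; the generation counts are fabricated, and `k=3` mirrors the top-3 variant used for comparative analysis:

```python
from collections import Counter

def representation_score(runs, attribute, k=3):
    """runs: dict mapping (concept, prefix) -> Counter of country entity
    counts in the generated text. Scores how often `attribute` ('concept'
    or 'prefix') selects a country inside the top-k represented set
    T(c_i, p_j). A sketch of our reading of the metric."""
    hits = 0
    for (concept, prefix), counts in runs.items():
        top_k = {c for c, _ in counts.most_common(k)}
        target = concept if attribute == "concept" else prefix
        hits += target in top_k
    return hits / len(runs)

runs = {
    ("India", "Peru"):  Counter({"USA": 9, "Peru": 4, "India": 3, "Chile": 1}),
    ("India", "Chile"): Counter({"USA": 8, "China": 5, "Chile": 2, "India": 1}),
}
print(representation_score(runs, "prefix"))   # both prefixes land in their top-3
print(representation_score(runs, "concept"))  # the concept lands in the top-3 only once
```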

Conclusion and Future Work
In this study, we perform an experimental analysis identifying the inherent geographic knowledge and inference bias of prompt-based decoder models. Our experiments strongly suggest that current PLMs are able to encode geographic proximity quite well. However, geopolitical favouritism almost always overshadows the encoded proximity during inference. This finding raises concerns as well as the need to perform bias-mitigation steps if we want to generate geo-specific texts. Our additional findings on the impact of multilinguality on prompting point out how encoded geographic proximity is unevenly distributed across languages and how even just a mention of a geographic identifier may influence the language of free-form text generation.
We believe these findings still leave issues to be addressed in current practice and that there should be a fundamental multilingual-bias mitigation step included in any NLP task workflow. Keeping this in mind, we want to expand the domain of our proposed probing framework and assess its applicability beyond geography. In addition, we aim to perform contrastive training to efficiently extract expert units, thus stepping forward in the effort of reducing the inequality inherent in multilingual language models.

Limitations
First of all, selecting countries as geographic entities is inherently lossy, and ideally we would be able to perform the experiments at further granularity. We rely on Wikidata for entity linking, which is already somewhat biased towards western countries. In addition, our experiments are limited to 69 countries and 13 languages (8 for generating text) by necessity and due to computing costs, ignoring other countries as well as languages, especially low-resource ones. In the future, we want to further expand our study to include many more languages and cultures.

A Terminologies
Based on our framework description, let us list some terminologies that we use for the remainder of the paper to describe the experimental settings and results.
1. Country/Concept: These are the countries for which we collect news.
2. Source Country: These are the countries from which the news data comes.
3. Prefix: This is the text that we use to prompt the model, which may include a country mention. This country is the Country/Prefix.
4. Language: The language that both the concept dataset and the generated text are in.
5. Expert Units: The units that are specific to a country concept c_i and are extracted from the language models.

B Frequently Asked Questions

B.1 What do you mean by "geographic bias" and "geographic fairness"?

In general, geographic bias means the over-representation of certain geographic attributes. In this study, we use "geographic bias" and "geographic favouritism" interchangeably for the over-amplification of certain country representation (e.g., countries with higher GDP, geopolitical stability, military strength, etc.) during PLM prediction or text generation. We believe the overall system utility of a language model should be equitable according to the needs of its intended users of different demographic and geographic origins. Ensuring their geographic characteristics are well represented and not overshadowed because of geographic favouritism is what we define as "geographic fairness" in this study.
B.2 What is the reason for using the self-conditioning approach of Suau et al. (2022) for studying biases? There had been many other bias measures in NLP before Suau et al. (2022). Are they not suitable for the study of geographic and geopolitical biases?
A number of previous studies experimented with the behavior different PLMs exhibit when probed with geographic context as well as cultural commonsense (Yin et al., 2022; Ghosh et al., 2021). However, we need to extract the specific model weights responsible for this observable polarity. Then, using those weights in a controlled setting, we might be able to unfold how PLMs encode geographic knowledge as well as explain the exhibition of geographic bias during inference. The self-conditioning model proposed by Suau et al. (2022) is one such study that fits our intended needs perfectly. This approach serves two main purposes: (1) identifying expert units: model parameters responsible for generating text related to the target concept (i.e., "doctor"); (2) triggering specific behaviour in text generation without explicit mention of, or fine-tuning on, the target context, which inadvertently influences the behaviour of the model by utilizing the encoded knowledge of the PLM.
B.3 What are the practical takeaways from this? Yes, different models encode geographic knowledge, so what? Should we be concerned, and should we do something about it?
We recall the example presented earlier: consider an L1 Spanish speaker from Peru, who is using a prompt-based PLM (like that of Wang et al. (2021, 2022)) to generate a localized synthetic dataset for some downstream task. They may use Spanish as used in the local context to form their seed data/prefixes/prompts. Now, if this language model has already skewed preferences towards geopolitically important countries, it is likely the generated texts will reflect this skewness, thus not appropriately reflecting the local, Peruvian context that the practitioner is interested in. In this study, we address this concern, geographic bias being one of the most significant yet ignored attributes in practice. Moreover, we show how this is further amplified when we go beyond English and similar languages. Basically, we need an effective bias-mitigation module as part of the regular NLP workflow, which is currently non-existent.
B.4 Why do we need to extract the Expert Units, and how does Country/Concept help in this regard?
One of our aims is to unfold the geographic representation using relevant PLM units without external fine-tuning. So, we need to find or extract these relevant units, which are basically model parameters. We can use our Country/Concept datasets as binary classification datasets (the positive class contains sentences mentioning a certain Country/Concept) to find the weights that are highly responsive to a certain Country/Concept (i.e., the Expert Units). Then we perform self-conditioning on the PLMs using these Expert Units to generate texts under the influence of these Country/Concepts.
B.5 Explain the Country/Concept dataset creation process.
We scrape news using a Google News API to capture current affairs. Importantly, we can select news not just from a given date range, but also news originating in a specific country and a language. Such a dataset should allow us to get a reasonable representation of current geopolitical affairs. As such, each of the concept datasets we create reflects "current news about a country reported by the mainstream platforms from another country".

Each dataset provides positive and negative examples for a Country/Concept (e.g., India) which we can use to identify the model's Expert Units. These units are the neurons which can be used as predictors to identify the presence of a concept (i.e., positive examples mentioning "India"). The self-conditioning framework computes these neurons and uses the average-precision score to rank their predictive expertise, thus allowing us to select the top-k (e.g., 10, 50) Expert Units from each layer.

B.7 What does the Geographic Representation Network actually represent?
Note that these networks are produced using the uncovered original PLM Expert Units, without any fine-tuning on external data or prompting. Hence, they provide a view of the geographic knowledge inherent in the PLM parameter space.
B.8 Why do we need to use Expert Units during text generation?
We have a setting where we can provide a certain Country/Concept as part of the generation condition, and the corresponding Expert Units from the model itself should be capable of influencing the generated text. Our aim is to evaluate the geographic knowledge of these specific model weights (Expert Units) by prompting them with other Country/Prefix values. This reveals whether geopolitical favouritism occurs for geopolitically important countries, whether geographical proximity (e.g. neighbouring countries) takes precedence, or whether no such pattern exists. We did experiment with n-hop scoring, and the results follow similar trends. We choose 2-hop as it is less complex to score and, at the same time, sufficient to point out the disparity across multiple languages.
B.11 Comparison to news: although these models are trained on web text, which contains news articles, they are not guaranteed to generate text like a news article. Thus the distribution of entities within the text will be different.
Yes, that is correct, but our aim is to capture the learned distribution and evaluate (1) whether that distribution is skewed, and (2) whether it resembles the real-world scenario. We believe this assessment is important for a PLM that will be used to solve real-world practical tasks, and news text might be the closest viable comparison source available in a limited-resource setting.

C Datasets
In Table 3 we present the concept dataset details. Each dataset contains 43 to 69 country concept files (the complete list of countries is presented in Table 4). The Type-2 datasets are translated versions of the USA-eng dataset. In Type-3, we mask the USA-eng entities using a NER tagger, and Type-4 is constructed using random English texts.
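The Type-3 masking step can be sketched as below. The paper uses a NER tagger (spaCy, per the table footnote) to find the entities; here the character spans are hand-written stand-ins for tagger output, and the placeholder token is an assumption.

```python
# Sketch of the Type-3 masking step: replace each tagged entity mention
# with a placeholder token. In the real pipeline the (start, end) spans
# would come from a NER tagger such as spaCy.

def mask_entities(text, spans, placeholder="[ENT]"):
    """spans: list of (start, end) character offsets, assumed non-overlapping."""
    out, last = [], 0
    for start, end in sorted(spans):
        out.append(text[last:start])
        out.append(placeholder)
        last = end
    out.append(text[last:])
    return "".join(out)

masked = mask_entities("India signed a deal with the USA.", [(0, 5), (29, 32)])
# masked == "[ENT] signed a deal with the [ENT]."
```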

D Prefix Templates
For each of the eight languages, we generate prefixes by replacing the <country> slot in templates with Country/Prefix names. Per language, we have six prefix templates. The complete list is presented in Table 5.
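The slot-filling step is simple string substitution; a minimal sketch follows. The template strings and the country-name lookup below are illustrative stand-ins (the paper uses the six templates of Table 5 and a multilingual country-name dataset).

```python
# Sketch of multilingual prefix construction: fill the <country> slot of
# each language's templates with the country name in that language.
# TEMPLATES and COUNTRY_NAMES are hypothetical examples, not the paper's data.
TEMPLATES = {
    "eng": ["In <country>,", "The people of <country>"],
    "zho": ["在<country>，"],
}
COUNTRY_NAMES = {("USA", "eng"): "the USA", ("USA", "zho"): "美国"}

def build_prefixes(country_code, lang):
    """All prefixes for one country in one language."""
    name = COUNTRY_NAMES[(country_code, lang)]
    return [t.replace("<country>", name) for t in TEMPLATES[lang]]

# e.g. build_prefixes("USA", "zho") -> ["在美国，"]
```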

F Geography Maps on generated text
We present Country Maps for the generated outputs in eight languages. The maps are presented in Figure 11.

G Geographic Identifier and Language Direction
See Table 6.

eng-[R]1
We use random text in place of the original text of the USA-eng dataset, maintaining the positive/negative class distinction but without any semantic coherence.
[1] https://translate.google.com/ [2] https://spacy.io/
Table 3: Country Concept Datasets sourced from Google News texts. We extracted expert units from the language models GPT-2 (English only), BLOOM and mGPT for all of these. Among these, we perform text generation using the expert units sourced from 8 datasets (the underlined ones).
Table 5: Prefix templates we use for multilingual text generation. We replace <country> with the corresponding country name in the generator language. For example, to construct one USA-mention Chinese prefix, we replace <country> with 美国. We use a multilingual country-name dataset (cna, 2021) to query country names.

Geographic Representation Networks and Corresponding Community Maps
Figure 11: Graphs prepared using entity-country mapping on generated texts using BLOOM. Here we take the log-frequency distribution of entity counts. In all cases, the most frequent country remains the geopolitically favoured one, with the addition of the Country/Concept dataset news source-country (the darker red ones).
Example generated sentences (cf. Table 6):
In Colombia, a 0.70 por ciento de la población de niños mueren prematuros de gripe por sobrepeso ha sido diagnosticada. El representante del tamaño real de
eng→spa In Colombia, PDOT, que hace más de 10 años había significado cerca de 160 actividades laborales para sus miembros, al día e instalaciones de 14 mili 300 personas
eng→spa In Colombia, made del Derecho penal, es la máxima parte de la violación a través de los notaria Núcleo de medidas contra la descripción de la Justicia y
eng→spa In Colombia, Cristina Kirchner -la vicepresidenta del fallecido expresidente Néstor Kirchner- ha confesado que "en las últimas horas pasó todo como una enfermedad que no se registró su mujer
eng→spa In Colombia, el Código Penal declaró cierto grado de subordinación de la salud mental de las víctimas de trabajadores a responsables funcionalistas, no profesionales por el Estado como se
eng→eng In Colombia, the majority of women are Catholic. But in the country is still refuses to accept the Catholic counseling school, and, penalizes women after to leave
eng→eng In Colombia, for example, we observed a significantly lower prevalence of chronic bronchoalveolar or peritonitis, bronchobronchial hypertrophy than mon
eng→spa In Colombia, un importante sector de las diezañeras vuelve a poner en valor de la importancia el anonimato de las producciones francesas cuando, una mezcla que habían obtenido a
eng→eng In Colombia, the EMA has regular royalties on a $27,800 per fee,800 day to $39,000 protein products at the expert. The fair
eng→eng In Colombia, in turn, the mass distributions represent very low prevalence, being around 4. The USA around 35 40-47% and in the usual, and 45%
eng→spa In Colombia, el gobierno presentó este miércoles un proyecto de ley en la primera lectura online para eximir controles y renegociación internacional e internacional de suscripto de divisas con

Figure 1 :
Figure 1: Example of a Geographic Representation Network and its corresponding location clusters (colored), recovered from the top-50 country-"expert" neurons of BLOOM. Notice that connected countries are either geographically or culturally close (e.g. the South American cluster in light blue, African countries in yellow, South-East Asian countries in dark blue). Note: node size is proportional to its degree in the graph.

Figure 2 :
Figure 2: Geographic Representation Probing Framework. We start by constructing the Country/Concept dataset. Then we extract Expert Units from the base PLM. After that, we use a similarity measurement to prepare our Geographic Representation Network and perform intrinsic probing. In parallel, we prompt the self-conditioned PLM with Geographic Identifiers (i.e. Country/Prefix). Finally, we map the generated-text entities to countries to perform extrinsic probing.
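The final mapping step of the framework, from generated-text entities to country counts, can be sketched as follows. The alias table is an illustrative stand-in; in practice the entities would come from a NER tagger and the lookup would cover the full country list of Table 4.

```python
# Sketch of the extrinsic-probing mapping step: map entities found in the
# generated text to countries via a name/alias lookup, then count hits to
# obtain the per-country distribution. ALIASES is a hypothetical example.
from collections import Counter

ALIASES = {"usa": "USA", "united states": "USA", "peru": "PER", "lima": "PER"}

def entity_country_counts(entities):
    """Count how often each country is referenced by the extracted entities."""
    counts = Counter()
    for ent in entities:
        country = ALIASES.get(ent.lower())
        if country:  # entities with no known country mapping are skipped
            counts[country] += 1
    return counts

counts = entity_country_counts(["USA", "Lima", "Peru", "Atlantis"])
# counts == Counter({"PER": 2, "USA": 1})
```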
Here, each Country/Concept c_i has 100 positive example sentences (mentioning c_i) and 300 negative example sentences (no mention). For example, the USA-eng Country/Concept dataset contains data from US sources, in English, which either mention other countries (there are 100 positive examples for each country c_i) or are random sentences not mentioning any countries (negative examples).

Figure 3 :
Figure 3: Prefix construction using multilingual prefix templates. Here we replace the <country> position with "Spain" in the given language. See Appendix D for the complete list of multilingual prefix templates.
∈ G. Among these 4 connected nodes, c_5 shares sea or land borders with only 2 countries, N_5 = {c_2, c_3}, in the real world, thus making |N_5| = 2. Similarly, we can compute |N_2| and |N_3| for countries c_2 and c_3 respectively. So the Neighbourhood Score is n_s(c_5) = |N_5| + |N_2| + |N_3|, which we can generalize and aggregate at the network level as follows:
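The 2-hop score described above can be sketched in code. This is our reading of the definition, not the paper's implementation: N_c is the set of graph-neighbours of c that are also real-world border neighbours, and the 2-hop score adds |N_c| for each verified neighbour of the start node.

```python
# Sketch of the 2-hop Neighbourhood Score n_s(c) = |N_c| + sum of |N_c'|
# over verified neighbours c' of c.
# `graph`: country -> set of countries connected in the representation network.
# `borders`: country -> set of real-world land/sea neighbours.

def neighbourhood_score(node, graph, borders):
    def verified(c):
        # graph-neighbours of c that are also real-world neighbours of c
        return graph.get(c, set()) & borders.get(c, set())
    n_node = verified(node)
    return len(n_node) + sum(len(verified(c)) for c in n_node)

# Toy example mirroring the text: c5 connects to 4 nodes, borders only c2, c3.
graph = {"c5": {"c1", "c2", "c3", "c4"}, "c2": {"c5", "c3"}, "c3": {"c5"}}
borders = {"c5": {"c2", "c3"}, "c2": {"c5", "c3"}, "c3": {"c5", "c2"}}
score = neighbourhood_score("c5", graph, borders)
# |N_5| = 2, |N_2| = 2, |N_3| = 1, so score == 5
```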

Figure 5 :
Figure 5: Percentage of generated text (top-3) in each language, given the prefix being in another language.

Figure 6 :
Figure 6: (a) Compared to the concept dataset, which is real-world news text, the generated text always over-represents the top-represented countries (e.g. USA). (b) This is also true for skewness (except Korean). In (c) we plot the representation scores depicting the overall influence of prefixes, concepts and top countries. Top countries are over-amplified, irrespective of language. The next dominating factor is the prefix, but it varies across languages.
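The skewness comparison in (b) can be illustrated with a short sketch. The paper does not specify the estimator; below we use the simple moment-based (Fisher-Pearson) skewness over hypothetical per-country entity counts.

```python
# Sketch: moment-based skewness of an entity-count distribution.
# A strongly positive value means a few countries dominate the counts.
import math

def skewness(xs):
    """Fisher-Pearson skewness: mean of cubed standardized deviations."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

# Hypothetical per-country entity counts from generated text: one country
# (e.g. the USA) dominating yields a right-skewed log-count distribution.
log_counts = [math.log(c) for c in [900, 40, 30, 20, 10]]
```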
Dataset: A Country/Concept dataset named C-l contains news from source country C in language l. It contains a concept set CS = {c_1, c_2, ..., c_n}, where each c_i is one Country/Concept. Each c_i has 100 positive example sentences (mentioning c_i) and 300 negative example sentences (no mention).

B.1 What do the terms geographic bias and geographic favouritism mean, and what is their relationship with fairness?

B.9 What factors are considered while constructing the Country/Concept datasets?
There are two relevant factors: (1) For the negative examples in the USA-eng Country/Concept dataset, we use news from completely different domains (e.g. automobile, sport), whereas for the other geographic-sourced datasets, negative examples come from randomly sampled news from different locations. (2) The intensity of text noise and the amount of positive-example up-sampling vary across the different news-sourced Country/Concept datasets.
B.10 Why a 2-hop distance while calculating the neighbourhood score?

Figure 7 :
Figure 7: Geographic Representation Network and corresponding Community Map for different Expert Unit set associations. The language models we use are GPT-2 (English only), mGPT and BLOOM.

Figure 8, Figure 9, Figure 10:
Figures 8-10: Geographic Representation Network and corresponding Community Map for different Expert Unit set associations. The language models we use are GPT-2 (English only), mGPT and BLOOM.
Suau et al. (2022) propose one such approach, which extracts PLM weights exhibiting a certain polarity and then prioritizes those weights during text generation. Based on the generated text, they can quantify the gender and occupation bias encoded by the PLM.
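The conditioning step of this family of methods can be illustrated in miniature. Real implementations register forward hooks on the PLM's layers; the sketch below only shows the core operation, in the spirit of Suau et al. (2022), of fixing the selected expert units' activations to target values (e.g. their mean positive-class activation) so that generation is steered toward the concept. All names here are illustrative.

```python
# Sketch of the self-conditioning operation: clamp the activations of the
# selected expert units to fixed target values, leaving other units untouched.
def condition(activations, expert_units, target_values):
    """Return a copy of `activations` with expert units set to target values."""
    out = list(activations)
    for idx, val in zip(expert_units, target_values):
        out[idx] = val
    return out

# Toy activation vector with one expert unit (index 2) clamped to 5.0.
conditioned = condition([0.1, 0.2, 0.3], expert_units=[2], target_values=[5.0])
# conditioned == [0.1, 0.2, 5.0]
```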
Based on our analysis of the Geographic Representation Networks, it is evident that model parameters respond similarly for closely related (culturally or geographically) countries. For example, consider the network in Figure 1, built from BLOOM Expert Units conditioned on the USA-eng Country/Concept dataset. The Latin American, African and European blocks are fairly clear. The Indian Subcontinent countries (BGD, PAK, IND) and countries of the British Commonwealth (AUS, NZL, CAN) are also clustered together. In addition, from the communities identified with the Louvain community detection algorithm, as visualized in the world map plot, we observe that community clusters mainly form around countries in close proximity. We prepare similar Geographic Representation Networks for all sets of Expert Units conditioned on the different Country/Concept datasets (see Appendix E).
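A minimal sketch of the network construction and clustering follows. The similarity measure is an assumption on our part (Jaccard overlap of expert-unit index sets), and a simple connected-components pass stands in for the Louvain algorithm the paper uses; the country codes and unit sets are toy data.

```python
# Sketch: connect two countries when their expert-unit sets overlap strongly,
# then cluster the resulting network. Jaccard similarity and the 0.3 threshold
# are illustrative assumptions, not the paper's exact measure.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def build_network(expert_units, threshold=0.3):
    """expert_units: country -> set of expert-unit indices."""
    countries = list(expert_units)
    edges = {c: set() for c in countries}
    for i, a in enumerate(countries):
        for b in countries[i + 1:]:
            if jaccard(expert_units[a], expert_units[b]) >= threshold:
                edges[a].add(b)
                edges[b].add(a)
    return edges

def components(edges):
    """Connected components, standing in for Louvain community detection."""
    seen, comps = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(edges[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Toy data: two overlapping pairs form two clusters.
units = {"ARG": {1, 2, 3}, "BRA": {2, 3, 4}, "IND": {7, 8, 9}, "PAK": {8, 9, 10}}
comps = components(build_network(units))
```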
Hence, a Country/Concept dataset {C}-{l} contains news about several countries (c_1, c_2, ..., c_n) in language {l}, where the news source is country {C}. For example, USA-eng contains data from US sources, in English, which either mention other countries (there are 100 positive examples for each country c_i) or are random sentences not mentioning any countries (negative examples).
B.6 Explain the Expert Units extraction process.
Consider the Country/Concept India from the dataset USA-eng. Essentially, we have positive examples (text mentioning India or relevant entities) and negative examples (random other sentences not mentioning India).

Type 4: {USA-eng}-{Random Text}
for examples of generated text given the prefix "In Cuba" with Country/Concept Argentina. Though the Country/Concept dataset we use here is in English, the model generates Spanish sentences far more frequently than English ones.

Table 4 :
List of Countries we conducted experiments on.
Ley de Pesca y Tierra Naranja tiene como una estrategia de Economía Indígena presenta como Ley de Conservación y Desarrollo Agrícola
eng→eng In Colombia, patients with PO are routinely referred to the Pediatric Critical Care Units (PC from 1996) because they are mostly after peak twice a los to participating in
eng→spa In Colombia, donde está en etapa de vacunación las primeras etapas las personas que llegan en el jueves (figuana para el millón y ultimaniano casos y el
eng→spa In Colombia, la noticia odia a Dios. Es una religión que no santifica. Esta seccionalizada del 4Chanuto para algunos países, a sociedad que
eng→spa In Colombia, el mercado de la carne, considerado el segundo mayor productor de cortes de carne bovina en la región, es de caña de insumo a nivel
eng→spa In Colombia, el partido del "9-3" ha sido en la decisión del colombiano, la celebración de Luis Zubeldense Humberto Bloom (peruano, quien abrió
eng→spa In Colombia, afloró por las fronteras de Argentina. Entre 1985 y 1993, de la República Dominicana, Bolivia, después llegó a Colombia y Ecuador. El entrenador
eng→spa In Colombia, execuções entre elites, o Partido Comunista y sindicatos de esos países vecinos elites a partiran llevan la denuncia que derrochales. Las
eng→spa In Colombia, una estrecha relación entre Washington y Venezuela tiene un mensaje claro sobre Bolsonaro. Así mismo, aunque no ve la necesidad de revisar lo que de no hacerlo de
eng→spa

Table 6 :
Example generated sentences with the prefix "In Colombia" and Country/Concept Argentina.