GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models

Recent work has shown that Pre-trained Language Models (PLMs) store the relational knowledge learned from data and utilize it for performing downstream tasks. However, commonsense knowledge across different regions may vary. For instance, the color of the bridal dress is white in American weddings whereas it is red in Chinese weddings. In this paper, we introduce a benchmark dataset, Geo-diverse Commonsense Multilingual Language Models Analysis (GeoMLAMA), for probing the diversity of the relational knowledge in multilingual PLMs. GeoMLAMA contains 3,125 prompts in English, Chinese, Hindi, Persian, and Swahili, with a wide coverage of concepts shared by people from American, Chinese, Indian, Iranian, and Kenyan cultures. We benchmark 11 standard multilingual PLMs on GeoMLAMA. Interestingly, we find that 1) larger multilingual PLM variants do not necessarily store geo-diverse concepts better than their smaller counterparts; 2) multilingual PLMs are not intrinsically biased towards knowledge from Western countries (the United States); 3) the native language of a country may not be the best language to probe its knowledge; and 4) a language may better probe knowledge about a non-native country than about its native country.


Introduction
Pre-trained Language Models (PLMs) (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2019; Brown et al., 2020) are increasingly used in various Natural Language Processing (NLP) applications. Pre-trained on large-scale text corpora, they have been shown to store relational knowledge (Petroni et al., 2019; Jiang et al., 2020b; Kassner et al., 2021), e.g., commonsense knowledge (Zhou et al., 2020; Lin et al., 2020; Nguyen et al., 2021; Zhou et al., 2021). They have been used to construct knowledge bases while requiring limited human effort for rule creation and validation (Bosselut et al., 2019; Zhou et al., 2022). However, do PLMs store geo-diverse commonsense knowledge? Geo-diverse commonsense (Yin et al., 2021) is a collection of commonsense locally shared by people from certain regions that may not apply in other regions due to cultural and geographic differences. For instance, the color of the bridal outfit in American weddings is white, while it is normally red in traditional Chinese and Indian weddings. PLMs that are unaware of geo-diverse knowledge may show performance disparities on test data associated with different regions. This may disadvantage users in certain regions and further amplify bias in AI applications, for example by eventually producing Western-centric knowledge bases.
In this paper, we concentrate on evaluating multilingual PLMs (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020). Studying geo-diversity naturally involves multilinguality. People in different regions may speak different languages, and it is natural to assume that geo-specific knowledge is better represented in its native language. Moreover, pre-trained on a collection of multilingual corpora, multilingual PLMs accumulate knowledge from various languages. Therefore, we posit that knowledge in multilingual PLMs is more diverse than that in models trained on a single language.
Centered around multilingual PLMs, we follow the original knowledge probing task LAnguage Model Analysis (LAMA) (Petroni et al., 2019) and introduce a new geo-diverse probing benchmark, GEOMLAMA. As shown in Figure 1, given a masked geo-diverse prompt with a particular country name [X], such as "In traditional [X] weddings, the color of the wedding dress is usually [MASK].", and a corresponding candidate answer list, {"red", "white", "black", "blue", ...}, multilingual PLMs are required to predict the masked word [MASK] from the candidate list.
The characteristics of GEOMLAMA are summarized as follows. 1) Diverse answers across countries: Each prompt is designed based on a geo-diverse concept (e.g., the color of the traditional wedding dress in Figure 1), and the gold answers for the masked word differ across countries. 2) Broad coverage of geo-diverse concepts: GEOMLAMA encompasses comprehensive geo-diverse topics including habits and personal choices, cultures and customs, policies and regulations, and geography. 3) Coverage of multiple countries and languages: GEOMLAMA involves knowledge about the United States, China, India, Iran, and Kenya, and is constructed in the native languages of the five countries: English, Chinese, Hindi, Persian, and Swahili. Overall, there are 3,125 prompts in our benchmark.
We first study the correlation between model performance and model size. Contrary to our intuition, we notice that the largest models do not necessarily have the best performance on our benchmark. We further study the best language to probe the knowledge about a particular country. Surprisingly, we find that the best language is often not the native language of the given country (e.g., English is not the best language to probe knowledge about the US). We also explore the knowledge that can be most accurately probed by a particular language. Similarly, we find that the most accurately probed knowledge is not necessarily that of the indigenous country of the language (e.g., the country for which Chinese prompts provide the most accurate predictions is not always China). Lastly, we find evidence of reporting bias that might explain these observations.

Related Works
Knowledge Probing on PLMs. Petroni et al. (2019) first explore whether PLMs have the capacity to store factual knowledge about entities. Based on this observation, prior works on knowledge probing focus primarily on creating more effective probing methods to elicit factual knowledge (Jiang et al., 2020b,a; Shin et al., 2020; Zhong et al., 2021) or analyzing whether other types of knowledge are stored in PLMs (Talmor et al., 2020; Zhou et al., 2020; Kassner et al., 2021; Sung et al., 2021). In the second line of work, a great variety of commonsense knowledge has been explored, including social (Zhou et al., 2020), numerical (Lin et al., 2020), and spatial (Zhang et al., 2020; Liu et al., 2022) commonsense. GEOMLAMA focuses on probing a new commonsense type, geo-diverse commonsense, on multilingual PLMs.
Multilingual Knowledge Probing and Multilingual Commonsense. MLAMA (Kassner et al., 2021) and Prix-LM (Zhou et al., 2022) focus on capturing multilingual factual knowledge about entities. XCOPA (Ponti et al., 2020) and X-CSR (Lin et al., 2021a) are two multilingual commonsense benchmarks, but both are built by translation from English commonsense benchmarks, without any consideration of region-specific commonsense. Different from prior works, we value geo-diversity and quantify the extent to which multilingual PLMs master such geo-diverse commonsense.
Geo-Diverse Commonsense. Geo-diverse commonsense is strongly correlated with cultures and geographic locations. A few works (Acharya et al., 2020; Yin et al., 2021; Liu et al., 2021; Shwartz, 2022) have emerged studying geo-diverse commonsense. Specifically, by collecting responses to a questionnaire, Acharya et al. (2020) analyze the cultural differences between the US and India in scenarios including weddings and funerals. Yin et al. (2021) and Liu et al. (2021) propose geo-diverse multimodal benchmarks, GD-VCR and MaRVL. They find that, due to a lack of geo-diverse knowledge, a large performance disparity appears when multimodal models are applied to tasks requiring knowledge about Western and non-Western regions. Shwartz (2022) proposes a culture-specific time expression grounding task to acquire country-specific temporal commonsense from multilingual corpora and models.
Inclusion in NLP. Enhancing the inclusivity of language processing technology and ensuring it works for everyone is essential. Several studies have focused on improving language inclusion (Joshi et al., 2020; Faisal et al., 2022), gender inclusion (Cao and Daumé III, 2021; Dev et al., 2021; Lauscher et al., 2022), and race inclusion (Field et al., 2021). We hope that GEOMLAMA can enable future work on improving the diversity of knowledge embedded in pre-trained language models.

GEOMLAMA Benchmark Construction
To build a geo-diverse commonsense probing benchmark, we recruit annotators from five different countries, the United States, China, India, Iran, and Kenya, to participate in annotation. The annotation process is separated into four stages. 1) We first ask the annotators to list geo-diverse concepts. 2) Based on the collected concepts, we then require annotators to design masked geo-diverse prompt templates in English. 3) After specifying prompts with country names, we request annotators to provide correct answers and form an answer candidate list for each prompt. 4) We translate the English prompts into other languages and paraphrase them. An overview of the annotation pipeline is illustrated in Figure 2.

Geo-Diverse Concept Collection
Geo-diverse concepts are the foundation of designing geo-diverse prompts. The criteria for selecting geo-diverse concepts are as follows: Universality and Diversity across Cultures. We require the scenarios regarding the collected concepts to be universal yet diverse across cultures. "Color of wedding dress" meets our criteria, as a wedding dress is a universally understood entity whose color is diverse across different cultures.
Avoiding Concepts Involving Region-Specific Terms. We avoid probing models about region-specific factual knowledge, e.g., festival names and president names of the countries, as these concepts usually involve uncommonly used tokens in certain languages and would thus introduce another layer of complexity to inference.
Finally, we consider topics that cover habits and personal choices, cultures and customs, policies and regulations, and geography for subsequent annotations. Details are shown in Appendix A.

Geo-Diverse Prompt Template Design
Centered on the collected geo-diverse concepts, annotators design English versions of geo-diverse prompt templates that are later paraphrased and translated into multilingual prompts. Given one geo-diverse concept, e.g., "color of wedding dress", the corresponding prompt template would be a masked sentence that asks for the missing color information, e.g., "The color of the wedding dress is usually [MASK]." Since we intend to probe knowledge about different countries using these prompts, we further insert phrases such as "In [X], " or "In traditional [X] weddings, " to indicate the country whose knowledge is to be probed. Here [X] is either one of the country names (the United States, China, India, Iran, and Kenya) or one of the corresponding modifiers (American, Chinese, Indian, Iranian, and Kenyan).
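As a small illustration, the template instantiation step above can be sketched in Python. The template and the country/modifier lists follow the paper's examples; the function name and mapping structure are our own:

```python
# Sketch of country-specific prompt instantiation: [X] is replaced with
# either a country name or its adjective modifier, per the five studied
# countries. This is an illustration, not the authors' actual tooling.
MODIFIERS = {
    "the United States": "American",
    "China": "Chinese",
    "India": "Indian",
    "Iran": "Iranian",
    "Kenya": "Kenyan",
}

def instantiate(template, country, use_modifier=False):
    """Fill the [X] slot with a country name or its adjective form."""
    x = MODIFIERS[country] if use_modifier else country
    return template.replace("[X]", x)

template = ("In traditional [X] weddings, the color of the wedding dress "
            "is usually [MASK].")
print(instantiate(template, "China", use_modifier=True))
# In traditional Chinese weddings, the color of the wedding dress is usually [MASK].
```

Each instantiated prompt keeps the [MASK] slot, which is what the PLM is later asked to fill from the candidate list.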

Answer and Answer Candidate List Annotation
For each masked geo-diverse prompt with a specified country name, we request the annotators to provide correct answers for the masked words. For instance, given a prompt about bridal outfit color in traditional Chinese weddings, "In traditional Chinese weddings, the color of the wedding dress is usually [MASK]", annotators are required to provide the answer "red" for [MASK]. The answers are all provided by annotators who are familiar with the culture of one of our studied countries. Note that besides prompts with only one answer, some other prompts in GEOMLAMA, such as "The staple food in Iran is [MASK]", can have multiple correct answers ("rice" and "bread") for a single prompt. To further validate the correctness of answers, we distributed a survey to collect responses about respondents' own countries. We collected 33 responses from the five countries and retained the answers with majority support.
In this work, we focus on investigating whether PLMs are capable of predicting correct answers among all the possibilities across different countries. For example, we wonder if PLMs can predict that the dress color at a Chinese wedding is "red" over other possibilities, such as "white". Therefore, we pair each prompt with an answer candidate list composed of the probable choices, and multilingual PLMs are constrained to make predictions from the list. Specifically, each list contains the union of all correct answers for the five countries plus additional confounding candidates sharing the same word types as those correct answers. For the prompts about the color of the wedding dress, the union of correct answers is {"red", "white"}. Other than these two colors, as illustrated in Figure 2, we also append confounders such as "yellow", "black", and "blue" to the list (the orange letters in the grids titled "Answer Candidate List"). The final answer candidate list for prompts about the color of the wedding dress is thus {"red", "white", "yellow", "black", "blue", ...}. Note that the contents and lengths of answer candidate lists vary greatly across concepts.
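The candidate-list construction described above (union of per-country gold answers plus same-type confounders) can be sketched as follows; the gold answers and confounders below are the wedding-dress example from the text, and the function is purely illustrative:

```python
# Sketch of answer-candidate-list construction: take the union of all
# countries' gold answers, then append confounders of the same word type.
def build_candidate_list(gold_answers_by_country, confounders):
    """Return a deduplicated list: gold answers first, then confounders."""
    candidates = []
    for answers in gold_answers_by_country.values():
        for a in answers:
            if a not in candidates:
                candidates.append(a)
    for c in confounders:
        if c not in candidates:
            candidates.append(c)
    return candidates

gold = {"US": ["white"], "China": ["red"], "India": ["red"]}
confounders = ["yellow", "black", "blue"]
print(build_candidate_list(gold, confounders))
# ['white', 'red', 'yellow', 'black', 'blue']
```

Because the list is shared across all five countries for a given concept, a model cannot succeed by memorizing one country's answer; it must rank the country-appropriate answer above the others.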

Prompt Translation and Paraphrase
We then obtain multilingual geo-diverse prompts by translating the annotated English prompts into the four other languages: Chinese, Hindi, Persian, and Swahili. We leverage the Google Translation API to translate English prompts, and each translated prompt is manually checked and corrected by annotators familiar with both English and one of the four studied languages. Besides, since probing results are known to be sensitive to small perturbations of the prompts (Jiang et al., 2020b), we further generate four paraphrases for each prompt to obtain more robust probing results. Specifically, we paraphrase English prompts via a round of backtranslation in which we first translate English prompts into German and then translate them back to English. For prompts in other languages, paraphrases are generated by backtranslation that translates texts to English and back to the original languages. The paraphrases in a particular language are validated and modified by native speakers.
In total, we annotate 3,125 prompts with answers and corresponding candidates in GEOMLAMA. All the prompts are designed based on the 16 geo-diverse concepts listed in Appendix A, and there are 625 prompts for each of the five languages. More details are described in Appendix B.
GEOMLAMA Probing Methods

The LAMA probe (Petroni et al., 2019) evaluates the knowledge stored in pre-trained language models using masked templates. Without any additional fine-tuning, given a masked prompt, models are required to recover the masked tokens with the entities that have the highest probability in the prompt context. Following the LAMA probe, on GEOMLAMA we study whether models are capable of selecting the most appropriate answers from the answer candidate list according to the given geo-diverse prompts. Kassner et al. (2021) follow the LAMA probe to investigate entity knowledge in multilingual BERT only. In this work, we probe a diverse set of language models on geo-diverse commonsense knowledge by scoring answer candidates and calibrating the score of each candidate.

Scoring Answer Candidates
We score answer candidates based on the log likelihood of generating the answer candidates given the prompts. Different model families have their own inference methods for obtaining the scores. In the following, we introduce the probing method for masked language models. Details of the probing methods for autoregressive and encoder-decoder language models are given in Appendix C.

Masked Language Models (mBERT, XLM, XLM-R family). Given an answer candidate e (e.g., "chopsticks") that is tokenized into subtokens e_1, e_2, ..., e_L (e.g., "chop", "stic", "ks") such that e_i ∈ V, where V is the vocabulary, and a prompt t (e.g., "In China, people usually eat food with [MASK_1] ... [MASK_L]."), we assign a score l_e based on the log probability of recovering the answer candidate e in the masked prompt. Formally, l_e is defined as

    l_e = (1/L) * Σ_{i=1}^{L} log p([MASK_i] = e_i | t).    (1)

According to Eq. (1), we perform L forward passes, each of which obtains the conditional probability of generating one subtoken. To illustrate, the i-th forward pass computes p([MASK_i] = e_i | "In China, people usually eat food with e_1 e_2 ... e_{i-1} [MASK_i] ... [MASK_L].").
Here we normalize the sum of log likelihoods by the number of subtokens L to reduce the effect of candidate length. The other model families discussed in Appendix C also adopt this normalization strategy.
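A minimal sketch of this length-normalized scoring, assuming the per-subtoken probabilities p([MASK_i] = e_i | t) have already been obtained from a model's forward passes (the numbers below are made up for illustration, not model outputs):

```python
import math

# Sketch of length-normalized masked-LM scoring:
#   l_e = (1/L) * sum_i log p([MASK_i] = e_i | t)
# where subtoken_probs[i] stands for the model's probability of the i-th
# gold subtoken at its masked position.
def score_candidate(subtoken_probs):
    """Average log probability over the candidate's L subtokens."""
    L = len(subtoken_probs)
    return sum(math.log(p) for p in subtoken_probs) / L

# e.g., "chopsticks" tokenized as ["chop", "stic", "ks"]: one probability
# per forward pass (illustrative values).
probs = [0.6, 0.5, 0.9]
l_e = score_candidate(probs)
```

Dividing by L keeps multi-subtoken candidates (e.g., "chopsticks") comparable with single-subtoken ones (e.g., "hands"), since summed log probabilities alone would systematically penalize longer candidates.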

Calibrating Answer Candidates
The way to score answer candidates e ∈ E (e.g., "chopsticks" ∈ {"chopsticks", "hands", "spoons", "knives"}) given the prompt t for a country C (e.g., "In China, people usually eat food with [MASK].") is illustrated in §4.1. However, this scoring mechanism is likely to be biased towards statistical correlations learned during pre-training (Zhao et al., 2021) while ignoring the country-specific information present in the prompt. For instance, the model might choose "knives" over "chopsticks" because "knives" may occur more often than "chopsticks" in the pre-training corpora. Hence, we calibrate models with the prior probability of answer predictions in the absence of any country information. The final score given to each answer in the answer candidate set is

    score(e) = l_e − l'_e,    (2)

where l'_e is obtained using the same approach as l_e, but the input prompt for calculating l'_e is the one without country information (e.g., "People usually eat food with [MASK]." without "In China,").
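The calibration step can be sketched as follows; the log-probability values are invented for illustration, chosen so that the uncalibrated and calibrated rankings differ:

```python
# Sketch of calibrated scoring: subtract the score from the country-free
# prompt (the prior) from the score with country information.
# All log-probability values below are made up for illustration.
def calibrated_scores(scores_with_country, scores_without_country):
    """score(e) = l_e - l'_e for every candidate e."""
    return {e: scores_with_country[e] - scores_without_country[e]
            for e in scores_with_country}

l = {"chopsticks": -2.1, "knives": -1.0}        # l_e  ("In China, ...")
l_prior = {"chopsticks": -3.0, "knives": -0.5}  # l'_e (country removed)

final = calibrated_scores(l, l_prior)
best = max(final, key=final.get)
# "knives" wins on raw score, but after subtracting its high prior,
# calibration favors "chopsticks".
```

Intuitively, a candidate that the model would predict anyway, with or without the country cue, gains little from calibration; a candidate whose score rises specifically when the country is mentioned is rewarded.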

Evaluation Metric
We use the ratio of the total number of the model's correct predictions to the total number of gold answers as model performance on GEOMLAMA. Specifically, given a prompt t_i with g_i gold answers, we count as c_i the number of top-g_i model predictions, based on the final score in Eq. (2), that also appear in the gold answer list. For example, since there are two gold answers for the prompt "The staple food in Iran is [MASK]", "rice" and "bread", g_i = 2. In total, there are eight candidates in the answer candidate list {"bread", "noodles", "rice", "meat", "maize", ...} for this prompt. Assume one multilingual PLM assigns the highest g_i scores to the candidates "noodles" and "rice". Then c_i = 1, since only one of "noodles" and "rice" is a gold answer of the prompt. We then sum up all c_i and g_i to calculate the ratio Σ_{i=1}^{n} c_i / Σ_{i=1}^{n} g_i, where n is the total number of prompts in GEOMLAMA.

Figure 4: Multilingual PLMs' performance averaged over countries when using multilingual prompts. "en", "zh", "hi", "fa", and "sw" denote English, Chinese, Hindi, Persian, and Swahili. Complete results are shown in Appendix E.
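The evaluation metric can be sketched in a few lines of Python, using the staple-food example from the text (candidate scores are made up for illustration):

```python
# Sketch of the GEOMLAMA metric: sum(c_i) / sum(g_i), where c_i counts how
# many of the top-g_i scored candidates are gold answers for prompt i.
def geo_mlama_accuracy(prompts):
    """prompts: list of (candidate_scores: dict, gold_answers: set)."""
    total_correct, total_gold = 0, 0
    for scores, gold in prompts:
        g = len(gold)
        top_g = sorted(scores, key=scores.get, reverse=True)[:g]
        total_correct += sum(1 for a in top_g if a in gold)
        total_gold += g
    return total_correct / total_gold

# Staple-food example: gold answers {"rice", "bread"} (g_i = 2); the model
# ranks "noodles" and "rice" highest, so c_i = 1 and the ratio is 1/2.
prompts = [({"noodles": 0.9, "rice": 0.8, "bread": 0.3, "meat": 0.1},
            {"rice", "bread"})]
print(geo_mlama_accuracy(prompts))  # 0.5
```

Taking the top g_i predictions per prompt lets multi-answer prompts receive partial credit instead of being graded all-or-nothing.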
Experiments

We design experiments to answer four questions: 1) How well do multilingual PLMs store geo-diverse commonsense knowledge overall? 2) Do larger multilingual PLMs store geo-diverse knowledge better than smaller ones? 3) Does a country's native language probe the knowledge about that country best? 4) Given a particular language, can the corresponding country's knowledge be most accurately probed by that language? To this end, we experiment with 11 multilingual PLMs including mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), the XLM-R family (Conneau et al., 2020), the mT5 family (Xue et al., 2021), and the XGLM family (Lin et al., 2021b). We freeze the pre-trained model parameters provided by HuggingFace Transformers (Wolf et al., 2020) and do not fine-tune the models during probing.

Overview of Model Performance
Results are shown in Figures 3 and 4. Figure 3 focuses on comparing the performance of probing the knowledge about particular countries, while Figure 4 compares the performance of using prompts in different languages.
In Figure 3, we find that the performance of nearly all the multilingual PLMs lies in the range of 30% to 40% when probing each country's knowledge. Further, these multilingual PLMs significantly outperform random guessing by 2-15%. This implies that multilingual PLMs can store geo-diverse commonsense knowledge, and some stored knowledge can be accurately elicited even if we merely change the country names in the prompt. As illustrated in Figure 4, we observe that the performance of using prompts in different languages is generally from 30% to 40% and higher than random guessing by 2-15% as well. Moreover, we find that English and Hindi prompts are the most effective for probing geo-diverse knowledge, while Persian and Swahili prompts cannot achieve comparable results. In particular, from Figure 4c, using Persian prompts to probe XGLM-1.7B leads to worse performance than random guessing.

(We also experiment with GPT-3, as it is also pre-trained on multilingual corpora. However, the results are not included in the main paper because the GPT-3 probing convention does not adopt cloze statements as the other 11 multilingual PLMs do. More setup details and results can be found in Appendix D.)

Effect of Model Size
According to Petroni et al. (2019) and Roberts et al. (2020), bigger models can generally store more knowledge and achieve better performance on downstream NLP tasks such as open-domain QA (Joshi et al., 2017; Kwiatkowski et al., 2019). We therefore investigate whether larger models indeed perform better than smaller ones on GEOMLAMA. For a fair comparison, we only compare models within the same model family. This avoids comparing models with different pre-training corpora and learning objectives.
The comparison results over the three model families are shown in Figures 3 and 4. We observe that the larger models only perform marginally better than their smaller counterparts on GEOMLAMA. For the three model families, XLM-R, mT5, and XGLM, the performance gap between the largest and smallest models on all the prompts in GEOMLAMA is merely 2.23%, 2.42%, and 1.46%, respectively. In specific cases (e.g., probing the XGLM family using Persian prompts), the largest model can even be worse than its smallest variant. This demonstrates that even though large models have nearly an order of magnitude more parameters than small models, they do not store geo-diverse commonsense significantly better. It highlights that GEOMLAMA is a challenging task and that performing well on standard multilingual NLP tasks does not guarantee good performance on it.

Intrinsic Model Bias without Country Information
Each prompt in GEOMLAMA contains country information. However, it is still unclear what knowledge is probed innately when we query multilingual PLMs without any country information. To study this, we further probe multilingual PLMs with prompts from which the country token is removed. For example, instead of "In traditional Kenyan weddings, the color of the wedding dress is usually [MASK]", we run a new round of probing with the pruned prompt "In traditional weddings, the color of the wedding dress is usually [MASK]". The new prompts elicit the knowledge that multilingual PLMs are intrinsically inclined towards predicting. As shown in Figure 5, we find that for most multilingual PLMs, knowledge about India is captured frequently in the absence of any country information, whereas knowledge about the United States is not well probed. This shows that, at least originally, multilingual PLMs are not biased towards knowledge about Western countries like the US.
We conduct a quantitative case study to further explain this phenomenon, taking the geo-diverse concept "staple food" as an example. Rice and bread are the staple foods in China and the United States, respectively. According to Table 2, in the English, Chinese, and Swahili Wikipedia, the co-occurrence of "staple food" and "rice" is comparable to or even far higher than that of "staple food" and "bread". This demonstrates that the popularity of Western knowledge across the world does not necessarily mean a higher frequency in knowledge sources like Wikipedia, which may lead models to predict non-Western knowledge more precisely.

Best Languages to Probe Knowledge about Countries
In GEOMLAMA, prompts in different languages are used to probe knowledge about different countries. It is natural to ask whether we elicit the most knowledge about a country when we query the PLM in its native language. From Table 1, contrary to our intuition, the native language is not the best language to query for most of the countries. In particular, Iran is the only country for which its native language, Persian, helps draw out the maximum knowledge about it. For the United States and Kenya, the best probing language is Persian, and for China and India, the best language is English. We speculate that these observations might be attributed to the reporting bias phenomenon (Grice, 1975; Gordon and Van Durme, 2013), characterized by people rarely stating explicitly in text the obvious knowledge shared by everyone (commonsense). For instance, the fact that humans can murder is disproportionately over-reported in English text compared to the fact that humans can breathe. This unbalanced frequency can bias PLMs towards acquiring uncommon event knowledge instead of commonsense knowledge (Shwartz and Choi, 2020). In our setting, we believe that reporting bias is a key ingredient in explaining the observed trends: an indigenous population is less likely to record obvious facts about its own culture in its native-language texts than facts about other cultures. For example, compared with people living in other countries, Indian people rarely mention the driver seat side in India, because it is too trivial for them. We again seek quantitative evidence in the context of the staple food concept to support this claim. Throughout the English and Chinese Wikipedia corpora, we count the co-occurrences of the words "China", "rice", and "staple food", and of "the United States", "bread", and "staple food", in the respective languages. The counting results are shown in Table 4. We notice that when China is mentioned, the English words "rice" and "staple food" co-occur 25 times, whereas the corresponding words co-occur merely 3 times in the Chinese Wikipedia. Furthermore, in the context of the US, the English words "bread" and "staple food" appear together 7 times, while the Chinese words "面包 (bread)" and "主食 (staple food)" co-occur 3 times. Although the number of co-occurrences is higher in the English Wikipedia, the frequency rate of the Chinese word co-occurrence is 3.2 times higher, since the Chinese Wikipedia corpus is 7.6 times smaller than the English corpus. In summary, commonsense knowledge about a country is not necessarily mentioned more frequently in its native-language corpus, but may have higher occurrences in some other languages.

Countries Best Probed with Prompts in Different Languages
Apart from the best languages to probe knowledge about countries, we can conversely study the countries best probed with prompts in different languages. Specifically, we focus on the following question: given one studied language X, is the country best probed the same as the indigenous country of language X? We present our results in Table 3. We observe that, except for Hindi, the countries best probed are distinct from the corresponding countries of the languages. For example, Swahili prompts probe Indian knowledge best instead of Kenyan knowledge, and Persian prompts probe US knowledge best instead of Iranian knowledge. This is counter-intuitive, because it is natural to imagine that the best-probed country should be the one where a particular language is spoken most commonly.
We can also ascribe this phenomenon to reporting bias. To analyze the observation, we compare the occurrence of knowledge about different countries in the same language corpus. We find that the English words "bread", "staple food", and "the United States" co-occur much less frequently than "rice", "staple food", and "China". Besides, the Chinese words "面包 (bread)", "主食 (staple food)", and "美国 (the United States)" co-occur 3 times, which is the same as the co-occurrence of "米饭 (rice)", "主食 (staple food)", and "中国 (China)". These comparisons indicate that, given one language, the local country's knowledge may not appear the most often compared with knowledge about other countries.

Conclusions
We propose a knowledge probing benchmark, GEOMLAMA, to evaluate the extent to which multilingual PLMs store geo-diverse commonsense. Results show that multilingual PLMs achieve significantly higher performance than random guessing, suggesting that they are capable of storing geo-diverse knowledge. We also find that, when fed prompts without any country cues, multilingual PLMs are not intrinsically biased towards knowledge about the United States. We further investigate the best language to probe the knowledge about a particular country, and the country best probed with prompts in a certain language. Surprisingly, we notice that the best language is not the country's native language, and the best-probed country is not the indigenous country of the language. We connect this to the reporting bias issue in the geo-diverse context: one country's commonsense is seldom recorded in text by people living in that country, as it is too trivial and not worth mentioning for them.

Limitations
GEOMLAMA is proposed for evaluating the degree of potential geographic bias in multilingual PLMs. However, due to the limited coverage of countries, languages, and geo-diverse concepts, GEOMLAMA may introduce unwanted bias. In GEOMLAMA, we only consider five countries and their native languages, which occupy merely a tiny portion of all the countries in the world and the thousands of languages. Also, in countries like India there are multiple commonly used languages; we limit our study to Hindi and will extend it to more languages in future work. Besides, we design prompts based on only 16 general geo-diverse concepts. Extending GEOMLAMA can help in obtaining more solid results and mitigating bias against uncovered countries and languages.
In this work, we mainly focus on evaluating multilingual PLMs on GEOMLAMA without studying how the multilingual pre-training process affects model performance on geo-diverse commonsense probing. We intend to explore the effect of this process on a model's geo-diversity in future work. Specifically, we aim to examine whether pre-training on multilingual corpora really brings more geo-diversity than pre-training on monolingual corpora does. Besides, we do not cover how to improve model performance on GEOMLAMA and other related tasks. We also expect, in future work, to seek approaches to improving a model's geo-diversity while maintaining multilingual PLMs' performance on various multilingual benchmarks.

Ethical Consideration
As we propose a new benchmark in this paper, we provide details about the compensation rate for annotators. We recruit college students from the five countries as well as annotators from Amazon MTurk. We provide a fair compensation rate of $12 per hour, in total around $150, to the annotators for prompt design, translation, and evaluation. Note that part of the annotations are done by the authors of this work.

Appendix A Geo-Diverse Concept List
The general geo-diverse concepts are shown in Table 6. We summarize all the concepts into 16 general ones, covering rules, policies, geography, customs, personal choices, and habits. Multiple prompts can be designed for each geo-diverse concept. For example, measurement units can involve units measuring height, weight, and temperature, and thus annotators can create multiple prompts about various types of measurement units.

B Statistics of GEOMLAMA
Table 5 shows the statistics of GEOMLAMA. In total, there are 3,125 prompts in GEOMLAMA, with 625 prompts about each country's knowledge. We also report the average numbers of gold answers and corresponding answer candidates for prompts regarding each country. Overall, there are 1.20 gold answers per prompt, with an answer candidate list of average length 4.76. Note that for prompts under the same topic (e.g., "In traditional [X] weddings, the color of the wedding dress is usually [MASK]."), regardless of the exact country filling [X], the answer candidate lists are the same for all five countries. Therefore, the average length of answer candidates is identical across the studied countries.

C Details of Evaluation Methods on Autoregressive and Encoder-Decoder Language Models

Autoregressive Language Models (XGLM family). For autoregressive language models such as XGLM, we first replace the masked token in the prompt with the answer candidate tokens (e.g., "In China, people usually eat food with [MASK]." -> "In China, people usually eat food with chopsticks."). The joint probability of generating all the tokens in the complete sentence is used for scoring answer candidates. Given a prompt template t filled with an answer candidate e, t is tokenized into K tokens t_1, t_2, ..., t_K. We assign a score l_e to the answer candidate as

    l_e = (1/K) * Σ_{k=1}^{K} log p(t_k | t_1, ..., t_{k-1}).

Here, we perform K forward passes to the autoregressive language model to obtain the log probability of generating each token.
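The autoregressive scoring procedure can be sketched as follows; the helper names and the per-token probabilities are our own illustration, standing in for actual model forward passes:

```python
import math

# Sketch of candidate scoring for autoregressive models: fill [MASK] with
# the candidate, then average the per-token conditional log-probabilities
# p(t_k | t_1, ..., t_{k-1}) over the filled sentence.
def fill_prompt(template, candidate):
    """Replace the [MASK] slot with a concrete answer candidate."""
    return template.replace("[MASK]", candidate)

def autoregressive_score(token_probs):
    """l_e = (1/K) * sum_k log p(t_k | t_<k), length-normalized as before."""
    K = len(token_probs)
    return sum(math.log(p) for p in token_probs) / K

sentence = fill_prompt("In China, people usually eat food with [MASK].",
                       "chopsticks")
# token_probs would come from running the model on `sentence`; the values
# below are made up for illustration.
token_probs = [0.2, 0.4, 0.5, 0.6, 0.3]
l_e = autoregressive_score(token_probs)
```

The same length normalization used for masked language models applies here, so candidates that tokenize into different numbers of tokens remain comparable.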

D Evaluating GPT-3 on GEOMLAMA
The approach to probing GPT-3 differs from the methods described in §4. Instead of feeding declarative prompt sentences, we leverage the Question Answering (QA) API powered by GPT-3 and input questions to query the knowledge. For example, instead of using "In traditional Chinese weddings, the color of wedding dress is usually [MASK]", we first convert it to a question such as "What is the color of the wedding dress in a traditional Chinese wedding?" and query GPT-3 with the converted question. During evaluation, rather than scoring answers from a given answer candidate list, GPT-3 generates open-ended answers, and we evaluate its predictions using the same metric as in §4. Considering the huge time cost of manually inputting questions by annotators to the GPT-3 API, we do not convert the paraphrased prompts to questions or perform analysis on them. In other words, the number of tested questions is only 1/5 of the total number of prompts in GEOMLAMA, i.e., 625. We probe GPT-3 with the converted questions in five languages, each of which asks knowledge about the five studied countries. Final results are shown in Table 7. One notable result is that English prompts achieve nearly 60% performance, while Swahili prompts cannot solve any questions correctly. For Hindi and Persian prompts, the results are also extremely low, ranging from 0% to 25%. This exposes a strong bias in terms of language usage. When looking at the performance of probing knowledge about the respective countries, the disparity is not large: the best-probed country is the United States, and the worst-probed country underperforms the United States by only 6.9%.
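Scoring open-ended generations can be sketched as follows. This is a simplified stand-in, assuming a prediction counts as correct when any gold answer string appears in it (case-insensitive); the paper's actual §4 metric may normalize answers differently.

```python
def is_correct(prediction, gold_answers):
    """Count an open-ended generation as correct if any gold answer
    string appears in it, case-insensitively (simplified metric)."""
    pred = prediction.lower()
    return any(gold.lower() in pred for gold in gold_answers)

def accuracy(predictions, gold_lists):
    """Fraction of predictions matching their gold answer lists."""
    hits = sum(is_correct(p, g) for p, g in zip(predictions, gold_lists))
    return hits / len(predictions)
```

Unlike the candidate-scoring setup for masked and autoregressive PLMs, this substring match is needed because GPT-3's answers are free-form text rather than selections from a fixed candidate list.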

Figure 1 :
Figure 1: Examples of prompts and gold answers in GEOMLAMA. For each concept (e.g., color of wedding dress), there are multiple masked multilingual prompts (English, Hindi, Swahili, etc.) with specified country information [X] querying geo-diverse knowledge about the concept. We test multilingual PLMs by examining the extent to which masked word predictions align with the gold answers in the [MASK] columns.

Figure 2 :
Figure 2: Overall annotation pipeline. It is divided into four stages: Stage 1 is to collect geo-diverse concepts; Stage 2 is to design English prompt templates; Stage 3 is to annotate answers for each country and construct the answer candidate list; Stage 4 is to translate the English prompts and paraphrase the translated multilingual prompts. Here we showcase English and Hindi answer annotations for demonstration.

Figure 3 :
Figure 3: Multilingual PLMs' performance on probing knowledge about the studied countries, averaged over all languages. Complete results are shown in Appendix E.

Table 1 :
Best languages to probe each country's knowledge. Each language in the last row "Best Languages" is the one appearing most often in its column.

Table 2 :
Word co-occurrence of "rice", "bread" and "staple food" in English, Chinese and Swahili Wikipedia, respectively.
Average performance of multilingual PLMs when fed with prompts without any specified country names. Complete results are shown in Appendix F.

Table 3 :
Countries best probed with prompts in different languages. Each country in the last row "Best Countries" is the one appearing most often in its column.

Table 4 :
Word co-occurrence and frequency in English and Chinese Wikipedia. English Wikipedia has 72,484,142 sentences, 7.6 times more than Chinese Wikipedia's 9,502,859 sentences. 'nx' denotes that the frequency rate is n times higher than the lowest one.

Table 5 :
Detailed statistics of GEOMLAMA.

Table 6 :
Geo-diverse concept list with categorization.

Table 12 :
Results (%) of models in mT5 family probed with prompts without country tokens on GEOMLAMA.

Table 13 :
Results (%) of models in XGLM family probed with prompts without country tokens on GEOMLAMA.