Probing Pre-Trained Language Models for Cross-Cultural Differences in Values

Language embeds information about social, cultural, and political values people hold. Prior work has explored potentially harmful social biases encoded in Pre-trained Language Models (PLMs). However, there has been no systematic study investigating how values embedded in these models vary across cultures. In this paper, we introduce probes to study which cross-cultural values are embedded in these models, and whether they align with existing theories and cross-cultural values surveys. We find that PLMs capture differences in values across cultures, but those only weakly align with established values surveys. We discuss implications of using mis-aligned models in cross-cultural settings, as well as ways of aligning PLMs with values surveys.


Introduction
A person's identity, values and stances are often reflected in the linguistic choices one makes (Jaffe, 2009; Norton, 1997). This is why, when language models are trained on large text corpora, they not only learn to understand language, but also pick up on a variety of societal and cultural biases (Stanczak et al., 2021). While biases picked up by PLMs have the potential to cause harm when used in downstream applications, they may also serve as tools which provide insights into understanding cultural phenomena. Further, while studying ways of surfacing and mitigating potentially harmful biases is an active area of research, cultural biases and values picked up by PLMs remain understudied. Here, we investigate cultural values, and differences among them, picked up by PLMs through their pre-training on Web text.
In a wide range of social science research fields, values are a crucial tool for understanding cross-cultural differences. As defined by Rokeach (2008), values are the "core conceptions of the desirable within every individual and society", i.e., the foundation for the beliefs guiding a person's actions and, on a societal level, the base for its guiding principles. We would like to highlight the difference we make between values and morals. The former, as conceptualised in this work, is concerned with fundamental beliefs an individual or a group holds towards socio-cultural topics, whereas the latter entails making a judgement towards individual or collective right or wrong. For a discussion around the intersection of morality and PLMs, we point the reader to Talat et al. (2021). In this paper, we base our understanding of values across cultures on two studies: Hofstede (2005), which defines 6 dimensions to describe cross-cultural differences in values, and the World Values Survey (WVS) (Haerpfer et al., 2022). Both surveys provide numerical value scores for several categories on a population level across different countries and regions, and are widely used to understand cross-cultural differences in values.
PLMs are trained on large amounts of text from the Web and have been shown to pick up on semantic, syntactic, factual and other forms of knowledge which allow them to perform well across several Natural Language Processing (NLP) tasks. Since multilingual PLMs are trained on text in many languages, they have the potential to pick up cultural values through word associations expressed in those languages, which are embedded in the pre-training texts. We therefore measure whether cultural values embedded in multilingual PLMs are correlated with the ones provided by the surveys. In Wikipedia, which is one of the primary sources of training data for multilingual PLMs, cross-cultural differences have been established (Miquel-Ribé and Laniado, 2019), and analysed by Hara et al. (2010) based on Hofstede's theory.
In this paper, we explore the novel research question of whether PLMs capture cultural differences in terms of values. We probe PLMs using questions from the values surveys of both Hofstede's cultural dimensions theory and the World Values Survey. We reformulate the survey questions into probes and extract the answers to evaluate whether language models capture cultural differences based on their training data. We focus on 13 languages, each of which is primarily geographically restricted to one country or region, to compare the results of the language models to the values surveys. The overall experimental setting for the paper is outlined in Figure 1.
Our work makes the following contributions:

Related Work
Expression and Norms Analysis of the expression of identity and attitudes through language, and of its change, has a long history in sociolinguistics (Labov, 1963; Trudgill, 2002). More recently, studies have used NLP to computationally analyse this change on social media data (Eisenstein et al., 2014; Hovy et al., 2015) and link it to external factors like socio-economic status (Abitbol et al., 2018) and demographics (Jurgens et al., 2017). This has also been done to analyse broader societal trends like temporal change in attitudes towards sexuality (CH-Wang and Jurgens, 2021) and gender bias (Sap et al., 2017; Stanczak and Augenstein, 2021). Further, there has been work on creating resources to analyse social norms and commonsense reasoning around them (Forbes et al., 2020; Emelin et al., 2021; Sap et al., 2020). While there has been work on investigating and embedding social and moral norms, understanding values and their variation in a cross-cultural context remains understudied in the literature. Kiesel et al. (2022) provide a taxonomy of 54 values based on Schwartz et al. (2012), along with a dataset and baselines for automatic value classification within the context of argument mining. The closest setup to ours is that of Johnson et al. (2022). They qualitatively assess text generated by GPT-3, an autoregressive language model, by prompting it with English texts with a clear embedded value. They find that the embedded values in the generated texts were altered to be more in line with the dominant values of US citizens, possibly due to its training data. Our setup instead quantitatively measures whether cross-cultural differences in these values are preserved in multilingual language models when fed with the language spoken predominantly by people belonging to that culture.
Probing Probing has been extensively used as a tool to study a variety of knowledge and biases picked up by PLMs. This can be syntactic (Hewitt and Manning, 2019), semantic (Vulić et al., 2020), numerical (Wallace et al., 2019), relational (Petroni et al., 2019) or factual knowledge (Jiang et al., 2020). Probes can be created at both the word and the sentence level (Mosbach et al., 2020).
Following work (Caliskan et al., 2017; Garg et al., 2018) on studying gender bias in word embeddings, a number of studies have built on it to similarly probe for social biases embedded in PLMs (May et al., 2019; Guo and Caliskan, 2021; Stańczak et al., 2021; Ousidhoum et al., 2021; de Vassimon Manela et al., 2021; Stanczak et al., 2021). This can be done using cloze-style probing for measuring at an intra-sentence level (Nadeem et al., 2021) or using pseudo-log-likelihood (Salazar et al., 2020) based scoring (Nangia et al., 2020). Both approaches have downsides: the former potentially introduces unintended bias through the tokens in the input probe, while the latter assumes that all masked tokens are statistically independent (Kaneko and Bollegala, 2022). We choose the former, since the probes in our case are carefully worded by social scientists with the explicit aim of extracting bias towards a certain set of values.
To the best of our knowledge, there is no existing systematic study of the cross-cultural values embedded in PLMs.

Value Probing
In this paper, we explore how PLMs capture differences in values across cultures, and whether those differences reflect the ones found in values across cultures at large. Hofstede's cultural dimensions theory has its origins in survey research at IBM (Hofstede, 1984), while the WVS was developed in the field of political science (Inglehart, 2006). Both studies have since been widely used across fields.

Hofstede's Cultural Dimensions Theory
Hofstede started his surveys of cross-cultural differences in values in 1980. This first survey (Hofstede, 1984) included 116,000 participants from 40 countries (extended to 111 countries and regions in the 2015 version) working with IBM, and defined 4 cultural dimensions, which were subsequently extended to the 6 cultural dimensions also used in this paper. These 6 dimensions are: Power Distance (pdi), Individualism (idv), Uncertainty Avoidance (uai), Masculinity (mas), Long-term Orientation (lto), and Indulgence (ivr). The full survey contains 24 questions. Each dimension is calculated using a formula defined by Hofstede from 4 of the questions in the survey, see Appendix F. Hofstede shows the influence that culture has on values by deriving distinctly different numerical scores on those 6 dimensions for the cultures observed. Critics of Hofstede's cultural dimensions theory point out, among others, the simplicity of the approach of mapping cultures to countries, and question the timeliness of the approach (Nasif et al., 1991). We exclude categories (4) and (8) for the experiments in this study. This was done due to the nature of the questions asked in these categories, for which it was not straightforward to design mask probes without loss of information. Inglehart (2006), who established the WVS, further defines the Inglehart-Welzel cultural map, which processes the surveys and relates two dimensions to each other: traditional versus secular-rational values and survival versus self-expression values, summarising values for countries on a scatter plot along these dimensions. In the following, we only use the previously mentioned 11 categories and leave an analysis based on the Inglehart-Welzel cultural map for future work.

Probe Generation
In order to make the surveys compatible with language models, we reformulate the survey questions into cloze-style question probes (Taylor, 1953; Hermann et al., 2015), on which we can then perform masked language modelling inference. Since this is the task the PLMs were trained on, we argue it is a suitable methodology for measuring embedded cultural biases in these models.
Hofstede's Cultural Dimensions Based on the English survey, the questions are manually reformulated into question probes (QPs). This is done analogously to iterative categorisation: for each question, a set of possible labels (y_i^+, y_i^-) corresponding to the two ends of the response options available in the survey is defined; these are the words the language models are probed for. The sentences are then reformulated into probes, and the labels masked. The labels are based on the answers of the original survey. For instance, the original question "have sufficient time for personal or home life", whose answer options consist of different degrees of importance, is reformulated to "Having sufficient time for personal or home life is [MASK].", where [MASK] should be replaced by important or unimportant.
Each probe is thus a pair (W_i, (y_i^+, y_i^-)), where W_i is the masked probe and y_i^+ and y_i^- are the set of labels. There are a total of 24 questions with repeating labels.
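As an illustration of this reformulation step, the sketch below builds a probe and its label pair from a declarative template. The function and variable names are ours, not the paper's; only the example sentence and labels come from the paper.

```python
# A minimal sketch of survey-question reformulation (names are our own):
# each survey question becomes a masked probe plus a pair of antonymous
# labels (y_plus, y_minus) taken from the two ends of the response scale.

def make_probe(statement: str, y_plus: str, y_minus: str, mask_token: str = "[MASK]"):
    """Turn a declarative template with a {label} slot into a cloze probe.

    The template is written so that substituting either label yields a
    grammatical sentence; for probing, the slot holds the mask token.
    """
    probe = statement.format(label=mask_token)
    return probe, (y_plus, y_minus)

# Example adapted from the paper's own illustration:
probe, labels = make_probe(
    "Having sufficient time for personal or home life is {label}.",
    y_plus="important",
    y_minus="unimportant",
)
```

Substituting either label back into the template gives the two candidate completions whose model scores are later compared.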
World Values Survey Analogous to the probes created from the Hofstede survey, we create probes from the English questionnaire of the WVS. As there are more questions than for Hofstede (238 in total), there are also more labels to replace and a wider variety of question types.
Multilingual Probes To probe across several languages, we follow a semi-automatic methodology for translating the probes created in English into the target languages. We use a translation API that covers all target languages. We translate each QP from English into the target language with the [MASK] token replaced by the label words [y_i^+, y_i^-], in order to maintain grammatical structure and aid the translation API. One challenge of cross-cultural research is information loss when translating survey questions (Nasif et al., 1991; Hofstede, 1984). We therefore opted for this approach rather than reformulating the survey questions already translated by Hofstede. However, we would like to highlight the shortcomings of machine translation, which performs poorly on low-resource languages and has the potential to introduce additional biases. For the purposes of these experiments, however, since the question probes are relatively simple sentences, we found the machine translations to be of high quality. We conducted an evaluation of our machine-translated probes, the details of which can be found in Appendix B. The target labels [y_i^+, y_i^-] for each QP are then translated individually as single words (e.g. important is translated from English to the German wichtig), followed by lowercased string matching to check whether the translated label can be found and replaced in the translated probe. If the target label cannot be found directly in the translated probe due to differences in word choice, we use a cross-lingual word aligner (Dou and Neubig, 2021) to align the English probe and its translated version. With this approach, we identify the label word to be replaced with the mask token. If both approaches yield no result, the token is manually replaced in the target sentence based on the authors' language understanding and using online translators.
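The string-matching part of this pipeline can be sketched as follows. Function names and the German example rendering are ours; the word-alignment and manual fallbacks from the paper are only indicated in a comment, not implemented.

```python
# Sketch of the label-masking step for translated probes (names are ours).
# First attempt: lowercased substring matching of the individually
# translated label; the aligner fallback (Dou and Neubig, 2021) is out of
# scope here.

def mask_label(translated_probe, translated_label, mask_token="[MASK]"):
    """Replace the translated label word with the mask token, if found.

    Returns None when string matching fails, signalling that the
    cross-lingual word-alignment (or manual) fallback is needed.
    """
    lowered = translated_probe.lower()
    needle = translated_label.lower()
    idx = lowered.find(needle)
    if idx == -1:
        return None  # fall back to word alignment / manual replacement
    return (translated_probe[:idx]
            + mask_token
            + translated_probe[idx + len(needle):])

# Rough German rendering (ours) of the paper's example probe,
# with "important" -> "wichtig":
masked = mask_label(
    "Genügend Zeit für das Privat- oder Familienleben zu haben ist wichtig.",
    "wichtig",
)
```

Matching on the lowercased sentence keeps the step robust to capitalisation differences introduced by the translation API.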
Language Selection In total, we investigate 13 languages, each mapped to one country as outlined in Table 1, according to criteria further detailed below. One limitation of this one-to-one mapping is that the languages are spoken in wider regions and not exclusively in one country (disregarding also, e.g., diaspora communities). However, it allows for the closest match to the values theories we work with, which operate on a country level. The definition of culture by country has been criticised by, e.g., Nasif et al. (1991).
We select the languages as follows: we first include the countries covered in both the WVS and Hofstede surveys. We limit the selection to official languages of the countries observed in both studies. We further select languages for which the distribution of speakers is primarily localised to one country or a relatively narrow geographical region. Finally, to ensure the language models will (potentially) have seen a sufficient amount of training data, we only select languages with at least 10,000 articles on Wikipedia.
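The selection criteria above amount to a simple conjunctive filter. The candidate data below is made up for illustration and does not reproduce the figures from Table 1.

```python
# Illustrative filter for the language selection criteria; the flags and
# article counts are placeholders, not the paper's actual data.

candidates = {
    "German":    {"in_both_surveys": True,  "localized": True,  "wiki_articles": 2_700_000},
    "Esperanto": {"in_both_surveys": False, "localized": False, "wiki_articles": 340_000},
    "TinyLang":  {"in_both_surveys": True,  "localized": True,  "wiki_articles": 4_000},
}

selected = [
    lang for lang, c in candidates.items()
    if c["in_both_surveys"]           # covered by both WVS and Hofstede
    and c["localized"]                # speakers mostly in one country/region
    and c["wiki_articles"] >= 10_000  # minimum Wikipedia size threshold
]
```

Only "German" survives all three criteria in this toy example; "TinyLang" is dropped by the Wikipedia-size threshold despite satisfying the survey criteria.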

Models
We conduct the probing experiments on three widely used multilingual PLMs: the multilingual, uncased version of BERT base (mBERT) (Devlin et al., 2018); the 100-language, MLM version of XLM (Conneau and Lample, 2019); and the base version of XLM-RoBERTa (XLM-R) (Conneau et al., 2020). XLM-R shows strong multilingual performance across a range of benchmarks and is commonly used for extracting multilingual sentence encodings.

Mask Probing
For each model M, we run inference on the created cloze-style question probes (QPs, described in Section 5) using an MLM head, producing the log probabilities for the [MASK] token in each QP over the entire vocabulary V of the respective model: logP_M(w, t | W_i^{\t}, Θ_M) for w ∈ V, where t is the position of the [MASK] token in the text W_i ∈ QP, and Θ_M are the parameters of the corresponding language model M. Since the survey respondents have to answer the questions with a choice on a range of values, for instance 1-10 with 1 representing democratic and 10 representing effective, in order to replicate a similar setting with PLMs, we subtract the predicted logit for the response label with the lowest score w_i^- from the predicted logit for the response label with the highest score w_i^+. This normalises the predicted logits for the responses on opposing ends of the survey question, and the difference is then used as a score for that question.
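A minimal sketch of this scoring step, assuming the vocabulary log-probabilities at the mask position have already been obtained from an MLM head; the names and the toy distribution are ours.

```python
import numpy as np

# Sketch of the probe scoring: given MLM log-probabilities over the
# vocabulary at the [MASK] position, the probe score is the difference
# between the logits of the high- and low-response labels. Obtaining
# `log_probs` from an actual MLM head is omitted here.

def probe_score(log_probs: np.ndarray, pos_id: int, neg_id: int) -> float:
    """log P(y+ | probe) - log P(y- | probe) at the mask position."""
    return float(log_probs[pos_id] - log_probs[neg_id])

# Toy vocabulary of size 5; ids 2 and 4 stand in for y+ and y-.
log_probs = np.log(np.array([0.1, 0.2, 0.4, 0.1, 0.2]))
score = probe_score(log_probs, pos_id=2, neg_id=4)  # log(0.4) - log(0.2)
```

A positive score means the model prefers the high-response label for that probe; a negative score means it prefers the low-response label.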
Finally, in order to collapse the World Values Survey responses per category, within which many questions have different scales, we normalise the aggregate survey responses by the corresponding question scale, so that y_{i,c} ∈ [0, 1], c ∈ C. We then take the mean of the responses across all questions of the category to arrive at the aggregated score of the category for each country: ȳ_c = (1/|Q_c|) Σ_{i ∈ Q_c} y_{i,c}, where Q_c denotes the set of questions in category c.
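The per-category aggregation can be sketched as follows, with made-up responses on two different answer scales; the function and variable names are ours.

```python
# Sketch of the WVS per-category aggregation: each question's aggregate
# response is min-max normalised by its own answer scale, then the
# normalised values are averaged within the category.

def category_score(responses, scales):
    """responses: {question: aggregate response};
    scales: {question: (low, high)} giving each question's answer scale."""
    normalised = [
        (responses[q] - lo) / (hi - lo)   # maps the response into [0, 1]
        for q, (lo, hi) in scales.items()
    ]
    return sum(normalised) / len(normalised)

# Two toy questions on different scales (1-10 and 1-4); both responses
# sit exactly at the midpoint of their scales.
score = category_score(
    {"Q1": 5.5, "Q2": 2.5},
    {"Q1": (1, 10), "Q2": (1, 4)},
)
```

Normalising per scale before averaging prevents wide-scale questions (e.g. 1-10) from dominating narrow-scale ones (e.g. 1-4) in the category mean.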

Evaluation
We calculate Spearman's ρ, a rank correlation coefficient, between the values predicted by the language models and the values calculated through the surveys: ρ(logP_M(w_i, t | W_i^{\t}, Θ_M), y_i). For the World Values Survey, we do this per question as well as per category. For Hofstede, we limit this calculation to value-level correlations due to lack of access to individual or aggregate survey response data per question (we calculate the scores for the values based on the formula provided at https://www.laits.utexas.edu/orkelm/kelmpub/VSM2013_Manual.pdf, see Appendix F). We further calculate correlations per country. Spearman's ρ operates on the relative predicted ranks of each variable, ignoring the individual predicted values. Our choice of a rank correlation was motivated by the fact that we are working with population-level aggregate responses, and by our aim of assessing whether language models pick up on relative differences in values across cultures, rather than on exact values.
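To make the "relative ranks" point concrete, here is a minimal Spearman's ρ without tie handling (in practice a library routine such as scipy.stats.spearmanr, which handles ties, would be used). The scores are synthetic.

```python
import numpy as np

# Minimal Spearman's rho (no tie handling): Pearson correlation of the
# ranks, so only the relative ordering of the scores matters, not their
# magnitudes. All numbers below are synthetic.

def spearman_rho(x, y):
    """Rank-correlate two score vectors (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # double argsort -> ranks
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

model_scores = [0.3, -1.2, 0.9, 0.1, -0.5]  # e.g. logit differences per country
survey_scores = [55, 20, 80, 40, 30]        # e.g. survey value scores per country

rho = spearman_rho(model_scores, survey_scores)
```

Because only ranks enter the computation, shifting or rescaling the model scores leaves ρ unchanged, which is exactly the property the paper relies on when comparing raw logit differences to population-level survey scores.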

RQ1: Model Predictions
We show the predicted scores for the XLM-R model in Figure 2. As is clear from the figure, there are substantial differences in the predicted scores for the cultural dimensions across cultures. On average, scores for power distance (pdi) are high, whereas those for masculinity (mas) and indulgence (ivr) are relatively low. The predicted logits suggest a bias towards Greece and South Korea as places with high power distance, and Pakistan and Germany as more masculine. Indulgence (ivr) has the lowest scores across all values, with only the Philippines and Malaysia having positive values, indicating high restraint in these cultures according to the model predictions.
To understand whether LMs preserve cross-cultural differences in values, we plot the results of the probing for Hofstede's survey and the WVS in Figures 3 and 4, respectively. As is visible in these plots, there is variety in the values, i.e., the models seem to place different importance on different values across cultures, displaying cross-cultural differences in the values. We quantify these differences among the prediction scores by testing for statistical significance between the models' predictions by culture, assessing how well they capture cross-cultural differences. For XLM-R's predictions for the WVS, 42.31% of the country pairs have a statistically significant difference, meaning the model preserves cross-cultural differences. For the other two models, the share of significantly different country pairs is 51.28% for mBERT and 46.15% for XLM. For XLM-R's predictions on Hofstede's survey, only 10.26% of country pairs have p ≤ 0.05. For the other two models, the share of significantly different country pairs is none for mBERT and 6.41% for XLM. We attribute these low percentages to the fact that this test is conducted over the six value dimensions only, while it is over more than 200 questions for the WVS.

Figure 4: Scatter plots with quartiles of predicted value scores on WVS questions for each of the three models.
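The paper does not state which significance test underlies these percentages; purely as an illustration, the share of significantly different country pairs can be computed with a generic two-sided permutation test on synthetic per-country scores.

```python
import numpy as np
from itertools import combinations

# Illustration only: the paper does not name its significance test, so we
# use a two-sided permutation test on the difference of means. All scores
# are synthetic stand-ins for per-question probe scores by country.

rng = np.random.default_rng(0)

def perm_test(a, b, n_perm=2000):
    """Two-sided permutation p-value for the difference of means."""
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])  # copy; a and b are untouched
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled[:len(a)].mean() - pooled[len(a):].mean()) >= observed:
            hits += 1
    return hits / n_perm

scores = {
    "A": rng.normal(0.0, 1.0, 50),
    "B": rng.normal(0.1, 1.0, 50),  # nearly identical to A
    "C": rng.normal(2.0, 1.0, 50),  # clearly different from both
}

pairs = list(combinations(scores, 2))
n_sig = sum(perm_test(scores[x], scores[y]) <= 0.05 for x, y in pairs)
share = n_sig / len(pairs)  # share of significantly different country pairs
```

With many per-question scores per country (the WVS case), clearly separated countries like "C" are reliably flagged; with only six dimensions per country (the Hofstede case), far fewer pairs reach significance, matching the pattern reported above.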

RQ2: Model Agreement
To further study whether scores across values and categories are consistent across the three models, we check for correlations between the models' predicted scores and outline them in Tables 2 and 3. We can see that predictions are inconsistent across the models, indicating differences in the embedded cross-cultural values. mBERT and XLM share the same architecture and are both trained on Wikipedia, yet the correlations across values are low, indicating the large effect that relatively minor changes to model training can have on the cultural values picked up by a model.

RQ3: Alignment with Surveys
Finally, we investigate whether the models' predictions for the values questionnaires are consistent with existing values survey scores.

Hofstede We outline the correlations between each model's mask probing predictions and the survey scores per value in Table 4. We find no statistically significant alignment between the models' predictions and the survey value scores provided by Hofstede, but given the low sample size, this is to be expected (Sullivan and Feinn, 2012). We find only weak correlations for some of the values between the models' predicted scores and the values survey, suggesting a disparity between the cultural values outlined by Hofstede and the ones picked up by PLMs.
WVS Table 5 similarly shows the correlations between the models' predicted scores and the World Values Survey scores per category. Here too, we find no statistically significant correlation between the predicted and the survey scores, underlining the difference between the values picked up by the language models and those quantified in the surveys. We also check for per-country correlations between the predicted scores and data from both values surveys; these are shown in Tables 11 and 12 in the Appendix.

Discussion
Our experiments show that there are sizable differences in the cultural values picked up by the different multilingual models which are widely used for a number of language tasks, even when they are trained on data from the same source. This is in line with previous results (Stanczak et al., 2021) and hints at the sensitivity of model biases to design and training choices, and their downstream effects. While the values picked up by the models vary across cultures, the bias in the models is not in line with the values outlined in existing large-scale values surveys. This is an unexpected result, since PLMs are known to pick up on biases present in the language data they are trained on (Rogers et al., 2020; Stanczak and Augenstein, 2021). Further, values are known to be expressed in language (Norton, 1997). Hence, language models should pick up on and reflect cultural differences in values expressed in different languages based on their training text. A lack of such reflection points to possible shortcomings in representation learning when it comes to multilingual language models. There could be a number of reasons for this. One possible reason is a lack of diversity in the multilingual training data. Wikipedia articles in different languages are written by a small subset of editors who are not representative of the populations of those countries. Further, large-scale corpora like CommonCrawl over-represent the voices of people with access to the Internet, which in turn over-represents the values of people from those regions (Bender et al., 2021). Such a bias being present in GPT-3 was explored by Johnson et al. (2022), who show that LMs trained on Web text end up reflecting the biases of majority populations.
Other work also shows that pre-training text contains substantial amounts of toxic and undesirable content even after filtering (Luccioni and Viviano, 2021). This highlights the need to include more diverse and carefully curated sources of data which are culturally sensitive and representative, in order for the models to better reflect the cultural values of those populations. Joseph et al. (2021) suggest that people express themselves differently on Twitter compared to survey responses; this is another potential reason for the mis-alignment.
PLMs are used for a variety of NLP tasks in different countries; hence, to accommodate usage by people from diverse backgrounds and cultures, it is important to have not just linguistic and typological diversity in training data, but also cultural diversity (Hershcovich et al., 2022). Such cultural knowledge is desirable for a number of real-world tasks, including QA systems, dialogue systems, and information retrieval. Further, a lack of such faithful representation could lead to unintended consequences when deploying these models, such as a model imposing a set form of normative ethics on a diverse population that may not subscribe to it (Talat et al., 2021; Johnson et al., 2022). This could also lead to models not being culturally sensitive and embedding harmful stereotypes (Nadeem et al., 2021). Recently, work has been done on aligning models with human values (Hendrycks et al., 2021; Solaiman and Dennison, 2021). While this may seem like a good idea at first glance, also in light of the arguments presented above, some cultural values are harmful to portions of society, e.g. high levels of masculinity, which are connected to misogynistic language and the perpetuation of gender biases. Thus, when working with cultural values, an auditing system (Raji et al., 2020) designed with these value systems in mind and taking the downstream use case into account should be employed.

Conclusion
In this study, we propose a methodology for probing cultural values embedded in multilingual Pre-trained Language Models and for assessing differences among them. We measure the alignment of these values among the models and with existing values surveys. We find that PLMs capture marked differences in values between cultures, though these are in turn only weakly correlated with the values surveys. Alongside training data, we discuss the impact that training and modelling choices can have on the cultural bias picked up by the models. We further discuss the importance of this alignment when developing models in a cross-cultural context and offer suggestions for more inclusive ways of diversifying training data to incorporate these values.

Ethical Considerations
The ethical considerations for our work mostly relate to its limitations; there are a variety of unintended implications of equating a language with a country, such as misrepresenting communities and disregarding minority and diaspora communities. However, we believe it is the closest approximation possible when comparing the surveys used in this work with LMs. Further, the surveys themselves have been criticised; particularly Hofstede's cultural dimensions theory has been deemed too simplistic (Jackson, 2020). This could also lead to simplistic assumptions when probing an LM. We address these problems by including the WVS, another widely used survey, in our study. Due to these limitations, we believe that further studies and applications of our approach should be conducted with them in mind. In particular, the simplification of cultural representation by both our approach and the original surveys might impact communities negatively. Such misrepresentation can have a disproportionate impact and exacerbate the marginalisation of minority communities or subcultures.

A Limitations
There are several limitations of our approach to assessing the cultural diversity and alignment of the values picked up by PLMs. While our methodology of probing models using cloze-style questions gives us some insight into the token-level biases picked up by the language models, it is limited to showing static, extrinsic biases at inference time via output probabilities. There are intrinsic measures for quantifying bias, but those do not always correlate with extrinsic measures (Goldfarb-Tarrant et al., 2021). In order to make the experimental setting more robust and more clearly demonstrate signs of embedded cultural bias, we experimented with an extended set of synonyms for each label word. However, this turned out to be non-trivial for a number of reasons. First, replacing synonyms in place of the original words rarely results in grammatical sentences. Second, it is not always possible to find multiple synonyms in the same sense as the label words across the languages used in our study. Third, even when synonyms do exist, they are often multi-word expressions, which makes them incompatible with our experimental setting, in which a single word needs to be masked. As discussed earlier, a major limitation that comes with quantifying cultural values is the mapping of countries to cultures and, in our case, also to languages. Since this is an imperfect mapping, it is difficult to accurately quantify and assess the cultural bias and values embedded in the models. We partially addressed this by restricting our study to languages which are mostly geographically restricted to one country. This is a limitation faced by cross-cultural research in general, where countries are often used as surrogates for cultures (Nasif et al., 1991). Finally, surveys and aggregate responses are also imperfect tools to evaluate and quantify cultural disparity, though the best ones currently in use. They are tasked with collapsing individual values into a set of
questions. Individuals from different backgrounds answering those questions may perceive them differently. Further, there are several confounding factors affecting the survey responses, and problems relating to seeing populations as a monolithic, homogeneous whole. While these limitations pose important questions about how carefully one should interpret these values, we believe our study makes important contributions and provides a first step in assessing alignment between PLMs and cultural values, which we argue is necessary for models to work faithfully in a cross-cultural context.

B Translation quality
To assess the quality of the translated probes, we conduct a human evaluation on a sample of the machine translator's output. We randomly select 3 probe questions from the Hofstede values survey and 23 probe questions from the World Values Survey, representing 10% of the total probes. We then provide the original probe questions in English as well as their translations to annotators, and assess the following two characteristics of the translations:

• Grammaticality: the correctness of the sentence standing alone, independent of the English sentence, in terms of obeying grammatical rules.

• Meaning: how adequate the translation is for further reuse. We specifically want to know how correct the sentence is in relation to the English sentence. This can also be understood as the overall quality of the translation.
For each of the 26 probe questions, we ask the annotators to rate the sentence on the characteristics listed above on a 1-5 Likert scale. All annotators had at least a university-level education and working proficiency in English, and were native speakers of the corresponding languages. We perform this annotation for 6 of the 13 languages due to resource constraints. We provide the averaged scores for both characteristics for each language in Table 6. Across languages, the annotators on average rate the meaning characteristic of the machine-translated probes at 4.73. This indicates the high degree to which the translations preserve the meaning of the English probes. The grammaticality of the probes was on average rated at 4.64. While lower than the score for preserved meaning, the sentences were found to have very good grammar as well. The very high scores on the meaning characteristic suggest that for most of the probes, the translations were of high quality.

C Models and Compute
All models were run in Python using PyTorch (Paszke et al., 2019) and the Transformers library (Wolf et al., 2020). When speaking about XLM-R, mBERT, and XLM, we refer to the models named xlm-roberta-base, bert-base-multilingual-uncased, and xlm-mlm-100-1280, respectively. Since only inference was performed for probing the models, the experiments were run on a single NVIDIA Titan RTX GPU for less than 1 hour.

D Ablations

D.1 Label logit subtraction
To eliminate the possibility that the lack of correlation is due to subtracting the logit of the label token with the lower response score from that of the label token with the higher response score (Section 6.2), we calculate correlations with just the high response label token y_i^+. We report our results for Hofstede in Table 7 and for the WVS in Table 8. Similarly, we calculate value correlations for just the low response label and report them in Table 10 and Table 9 for Hofstede and the WVS respectively.

E Example probes
In Table 13, we provide a sample of the question probes in English that are then translated to the different languages outlined in Section 5.

F Hofstede Value Calculation
We calculate the value results for the probes based on Hofstede (1984) by using the formulas from the original survey. The numbers following m represent the index of the survey questions; m stands for the mean of the responses to that survey question.
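As an illustration, and assuming we read the publicly available VSM 2013 manual (linked in the Evaluation section) correctly, the Power Distance Index combines the mean responses to four questions plus a shift constant. Treat the exact coefficients below as an assumption to be checked against the manual.

```python
# Illustration only: our reading of the VSM 2013 manual gives the Power
# Distance Index as
#     PDI = 35(m07 - m02) + 25(m20 - m23) + C(pd)
# where m.. are mean question responses and C(pd) is a constant that only
# shifts the scale. The mean values below are made up.

def pdi(m02, m07, m20, m23, c_pd=0.0):
    """Power Distance Index, per our reading of the VSM 2013 manual."""
    return 35 * (m07 - m02) + 25 * (m20 - m23) + c_pd

score = pdi(m02=2.5, m07=3.0, m20=3.5, m23=2.0)  # 35*0.5 + 25*1.5
```

The other five dimensions follow the same shape: a weighted difference of two pairs of mean question responses plus a dimension-specific constant.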

Figure 1: Outline of the experimental setting for the paper. We take the original survey questions (Section 4), convert them into Question Probes, translate these into the target languages (Section 5), and run inference on the mask probes (Section 6.2).

Figure 2: Heatmap of scores predicted per value for XLM-R mask probing on Hofstede's survey questions

Figure 3: Scatter plots with quartiles of predicted value scores on Hofstede's survey questions for each of the three models.
This way of representing values is now a foundation for a large body of work on cross-cultural differences in values (Jones, 2007).
World Values Survey (WVS) The World Values Survey (WVS; Haerpfer et al., 2022) collects data on people's values across cultures in a more detailed way than Hofstede's cultural dimensions theory. The survey started in 1981 and is conducted by a nonprofit organisation, which includes a network of international researchers. It is conducted in waves, to collect data on how values change over time. The latest wave, wave 7, ran from 2017 to 2020. Compared to the European Values Study (https://europeanvaluesstudy.eu/), the WVS targets all countries and regions, and includes 57 countries. While Hofstede's cultural dimensions theory aggregates the findings of its survey into the 6 cultural dimensions, the WVS publishes the results of its survey per question. These are organised in 13 categories: (1) Social Values, Attitudes and Stereotypes, (2) Happiness and Well-being, (3) Social Capital, Trust and Organisational Membership, (4) Economic Values, (5) Corruption, (6) Migration, (7) Security, (8) Postmaterialist Index, (9) Science and Technology, (10) Religious Values, (11) Ethical Values and Norms, (12) Political Interest and Political Participation, (13) Political Culture and Regimes.

Table 1 :
Mapping of countries (cultures) to languages used throughout this paper, including the number of articles per Wikipedia language as of March 2022.

Table 3 :
Pairwise correlations in model predictions for mask probing on WVS questions.Statistically significant values with p ≤ 0.05 are marked with *

Table 5 :
Correlation per question between masked prediction scores and the WVS. Statistically significant values with p ≤ 0.05 are marked with *

Table 6 :
Averaged human evaluation scores on a 1-5 Likert scale for grammaticality and preserved meaning of the machine-translated probes for a sample of languages used in this study

Table 7 :
Correlation per dimension between mask prediction scores for the high response score label y+ and Hofstede's values survey. Statistically significant values with p ≤ 0.05 are marked with *

Table 8 :
Correlation per category between mask prediction scores for the high response score label y+ and the WVS. Statistically significant values with p ≤ 0.05 are marked with *

Table 9 :
Correlation per category between mask prediction scores for the low response score label y- and the WVS. Statistically significant values with p ≤ 0.05 are marked with *