Is a Prestigious Job the same as a Prestigious Country? A Case Study on Multilingual Sentence Embeddings and European Countries

We study how multilingual sentence representations capture European countries and occupations and how this differs across European languages. We prompt the models with templated sentences that we machine-translate into 12 European languages and analyze the most prominent dimensions in the embeddings. Our analysis reveals that the most prominent feature in the embeddings is the geopolitical distinction between Eastern and Western Europe, together with the country's economic strength in terms of GDP. When prompted specifically for job prestige, the embedding space clearly distinguishes high- and low-prestige jobs. The occupational dimension is uncorrelated with the most dominant country dimensions in three of the four studied models. The exception is a small distilled model that exhibits a connection between occupational prestige and country of origin, which is a potential source of nationality-based discrimination. Our findings are consistent across languages.


Introduction
Language models and pre-trained representations in Natural Language Processing (NLP) are known to manifest biases against groups of people, most importantly, negative stereotypes connected to ethnicity or gender (Nangia et al., 2020; Nadeem et al., 2021). These have been extensively studied in monolingual models. Multilingual models, often used for model transfer between languages, introduce another type of potential issue: stereotypes about speakers of one of the languages can be imposed on the other languages covered by the model.
In this case study, we try to determine the most prominent biases connected to European countries in multilingual sentence representation models. We adopt an unsupervised methodology ( § 2) based on hand-crafted prompt templates and principal component analysis (PCA), originally developed to extract moral sentiments from sentence representations (Schramowski et al., 2022).
Our exploration encompasses four sentence representation models across 13 languages ( § 3). We find only minor differences between languages in the models. The results ( § 4) show that the strongest dimension in all models correlates with the political and economic distinction between Western and Eastern Europe and with the Gross Domestic Product (GDP). Prompting for country prestige leads to similar results. When prompted for occupations, the models can distinguish between low- and high-prestige jobs. In most cases, the extracted job-prestige dimension only loosely correlates with the country-prestige dimension. This result suggests that the models do not connect individual social prestige with the country of origin, except for a small model distilled from the Multilingual Universal Sentence Encoder (Yang et al., 2020), which seems to mix the two.
The source code for the experiments is available on GitHub.1

Methodology
We analyze sentence representation models ( § 2.1) using a generalization of the Moral Direction framework ( § 2.2). We represent concepts (countries, jobs) using sets of templated sentences ( § 2.3), for which we extract the embeddings. Then, we compute the principal components of the embeddings and analyze what factors statistically explain the principal components ( § 2.4).
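The pipeline described above can be sketched as follows. The embeddings here are random stand-ins for encoder outputs, and the shapes (40 countries, 5 templates per country, 768 dimensions) are illustrative assumptions rather than the study's exact configuration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_countries, n_templates, dim = 40, 5, 768

# One embedding per (country, template) pair; in the actual study these would
# come from a multilingual sentence encoder.
template_embeddings = rng.normal(size=(n_countries, n_templates, dim))

# Average the template embeddings for each country ...
country_vectors = template_embeddings.mean(axis=1)   # shape (40, 768)

# ... and extract principal components of the averaged vectors.
pca = PCA(n_components=4)
projected = pca.fit_transform(country_vectors)       # shape (40, 4)

# The first column is the candidate "dominant dimension" that is later
# interpreted against country labels and GDP.
dominant_dimension = projected[:, 0]
```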

Sentence Embeddings Models
Sentence-embedding models are trained to produce a single vector capturing the semantics of an entire sentence. Contextual embeddings trained via the masked-language-modeling objective (Devlin et al., 2019) capture subwords well in context; however, they fail to provide a sentence representation directly comparable across sentences. Sentence-BERT (Reimers and Gurevych, 2019) approaches this problem by fine-tuning existing contextual embeddings using Siamese networks on sentence classification tasks. As a result, sentences with similar meanings receive similar vector representations.
The issue of sentence representation also applies to multilingual contextual embeddings such as XLM-R (Conneau et al., 2020). In the multilingual setup, the additional requirement is that similar sentences receive similar representations regardless of the language. This is typically achieved using parallel data via knowledge distillation (Reimers and Gurevych, 2020; Heffernan et al., 2022) or more directly in a dual encoder setup (Feng et al., 2022).

Embedding Analysis Method
We base our methodology on an unsupervised method for extracting semantic dimensions from sentence embeddings, originally introduced in the context of moral intuitions (Schramowski et al., 2022). The study attempts to extract the moral sentiment of English verb phrases using Sentence-BERT.
The method consists of three steps: First, they generate templated sentences associating verbs with morality (e.g., "You should smile.", "It is good to smile."). Second, the sentences are processed with Sentence-BERT, and the representations are averaged for each phrase. Third, they apply PCA over the representations. The results show that the most significant dimension roughly corresponds to the moral sentiment of the phrases.
We adopt this method in a more explorative setup. We use a similar set of template sentences and average their embeddings. Then, we analyze what the most prominent PCA dimension correlates with when the models are prompted with different templates.

Templating Sentences
Similar to Hämmerl et al. (2022), who extended the framework to multilingual models, we use templates in English and machine-translate the sentences into other languages after materializing the templates. We use three types of templates with the following meanings:

1. They come from [COUNTRY].
2. Being from [COUNTRY] is considered prestigious.
3. Working as [JOB] is considered prestigious.

See the Appendix for a complete list.
In the first set of sentences, we search for the general trend in how countries are represented. In the second set, we specifically prompt the model for country prestige to compare how the general country representation correlates with assumed prestige. In the third case, we fit the PCA with templates containing job titles, i.e., the most prominent dimension captures job prestige according to the models. We then apply the same projection to the country-prestige template representations from the second set.
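The projection reuse described above can be sketched with random arrays standing in for the averaged template embeddings (the counts and dimensionality are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic stand-ins for averaged embeddings of job templates and of
# country-prestige templates, respectively.
job_vectors = rng.normal(size=(60, 768))
country_vectors = rng.normal(size=(40, 768))

# Fit the PCA on job templates only: its first component serves as the
# job-prestige axis.
pca = PCA(n_components=1).fit(job_vectors)

# Apply the *same* projection to country-prestige sentences to test whether
# countries are ordered along the job-prestige axis.
countries_on_job_axis = pca.transform(country_vectors)[:, 0]
```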
Countries. We include all European countries at least as large as Luxembourg and use their short names (e.g., Germany instead of the Federal Republic of Germany), which totals 40 countries. The list of countries is in Appendix A.3.
Low- and high-prestige jobs. We base our list of low- and high-prestige jobs on a sociological study conducted in 2012 in the U.S. (Smith and Son, 2014). We manually selected 30 jobs for each category to avoid repetitions and to exclude US-specific positions. By using this survey, we also bring in the assumption that the European countries have an approximately similar cultural distance from the US. The complete list of used job titles is in Appendix A.2.

Evaluation
Interpreting the dominant dimension. For the analysis, we assign labels to the countries (abstractions over the countries) based on geographical (location, mountains, seas), political (international organization membership, common history), and linguistic features (see Table 5 in the Appendix for a detailed list). The labels are not part of the templates.
We compute the correlation of the indicator vector of each country label with the extracted dominant dimension to provide an interpretation of the dimension. Finally, we manually annotate whether the most positively and negatively correlated labels correspond to the economic and political distinction between Eastern and Western Europe.
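The label-correlation step can be illustrated as follows. All values here are synthetic, and the label names are only examples of the kind of labels used in the analysis:

```python
import numpy as np

rng = np.random.default_rng(2)

# First PCA component per country (synthetic stand-in).
dominant_dimension = rng.normal(size=40)

# One 0/1 indicator vector per qualitative label.
labels = {
    "EU member": rng.integers(0, 2, size=40),
    "former Eastern Bloc": rng.integers(0, 2, size=40),
}

# Pearson correlation of each indicator with the dimension; the most positively
# and negatively correlated labels are then inspected manually.
correlations = {
    name: float(np.corrcoef(indicator, dominant_dimension)[0, 1])
    for name, indicator in labels.items()
}
```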
In addition, we compute the country dimension's correlation with the respective countries' gross domestic product (GDP) based on purchasing power parity in 2019, according to the World Bank.2

Cross-lingual comparison. We measure how the extracted dimensions correlate across languages.
To explain where the differences across languages come from, we compute how the differences correlate with the geographical distance of the countries where the languages are spoken, the GDP of those countries, and the lexical similarity of the languages (Bella et al., 2021).3

Experimental Setup

Evaluated Sentence Embeddings
We experimented with a diverse set of sentence embedding models trained using different methods: models available in the SentenceBERT repository and an additional model trained with monolingual data only. An overview of the models is in Table 1.
Multilingual MPNet. The first model we experimented with was created by multilingual distillation from the monolingual English MPNet Base model (Song et al., 2020), fine-tuned for sentence representation using paraphrasing (Reimers and Gurevych, 2019). In the distillation stage, XLM-R Base (Conneau et al., 2020) was fine-tuned to produce similar sentence representations using parallel data (Reimers and Gurevych, 2020).
Distilled mUSE. The second model we evaluate is a distilled version of the Multilingual Universal Sentence Encoder (Yang et al., 2020), distilled into DistilmBERT (Sanh et al., 2019). This model was both trained and distilled multilingually.
LaBSE. Further, we explore LaBSE (Feng et al., 2022). It was trained on parallel data with a max-margin objective for better parallel sentence mining, combined with masked language modeling.

Translating Templates
To evaluate the multilingual representations in more languages, we machine-translate the templated text into 12 European languages: Bulgarian, Czech, German, Greek, Spanish, Finnish, French, Hungarian, Italian, Portuguese, Romanian, and Russian (and keep the English original). We selected languages for which high-quality machine translation systems are available on the Huggingface Hub. The models are listed in Appendix B.

Results
Aggregated results. The results aggregated over languages are presented in Table 2. The detailed results per language are in the Appendix in Tables 6 and 7.
When prompting the models for countries, the most prominent dimensions almost always separate the countries according to the political east-west axis, consistently across languages. This is further stressed by the high correlation of the country dimension with the country's GDP, which is particularly strong in Multilingual MPNet and Distilled mUSE. When we prompt the models specifically for country prestige, the correlation with the country's GDP slightly increases.
When we prompt the models for job prestige, they are able to distinguish high- and low-prestige jobs well (accuracy 85-93%). When we apply the same projection to prompts about countries, in most cases, the ordering of the countries is random.
The only exception is Distilled mUSE, where the job-prestige dimension applied to countries still highly correlates with the country's GDP and the east-west axis.
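The accuracy computation for separating high- and low-prestige jobs by the first principal component can be illustrated on synthetic, linearly separated data (the shift construction and all values are assumptions for the sketch):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
dim = 768

# A random unit direction along which the two job groups are separated.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

# 30 high-prestige (1) and 30 low-prestige (0) jobs.
labels = np.array([1] * 30 + [0] * 30)

# High-prestige vectors are shifted to +5, low-prestige to -5 along the
# direction, plus isotropic noise.
vectors = rng.normal(size=(60, dim)) + np.outer(10 * labels - 5, direction)

# First principal component as the job-prestige score.
scores = PCA(n_components=1).fit_transform(vectors)[:, 0]
predictions = (scores > 0).astype(int)

# The sign of a principal component is arbitrary, so take the better of the
# two orientations.
accuracy = max(np.mean(predictions == labels), np.mean(predictions != labels))
```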
Differences between languages. Further, we evaluate how languages differ from each other.
Although, for all models, the first PCA dimension from the job-prestige prompts separates low- and high-prestige jobs almost perfectly, Multilingual MPNet and Distilled mUSE show a relatively low correlation of the actual dimension values across languages (see Figure 1).
We assess what the differences between languages might be attributed to. In this second-order evaluation, we try to explain the correlations between languages by connecting them to the countries where the languages are spoken. We measure how the correlation between languages correlates with the geographical distance of the (largest) countries speaking the language, the difference in their GDP, and the lexical similarity of the languages. The results are presented in Table 3. For all models except XLM-R-NLI, the lexical similarity of the languages is the strongest predictor. For XLM-R-NLI, where the differences between languages are relatively low, the differences correlate better with geographical distances.

Table 3: Correlation of the language similarities (in terms of cross-language correlation of the job-prestige dimension) with the geographical distance of the countries, language similarity, and GDP.
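The second-order analysis can be sketched as follows, with random values standing in for both the extracted per-language dimensions and the lexical-similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n_langs, n_jobs = 13, 60

# Extracted job-prestige dimension per language (synthetic stand-in).
dim_per_lang = rng.normal(size=(n_langs, n_jobs))

# Pairwise correlation of the dimension between languages.
lang_corr = np.corrcoef(dim_per_lang)                # shape (13, 13)

# A symmetric lexical-similarity matrix (random here; the study uses
# similarities from Bella et al., 2021).
lex_sim = rng.uniform(size=(n_langs, n_langs))
lex_sim = (lex_sim + lex_sim.T) / 2

# Compare only distinct language pairs: upper triangle without the diagonal.
iu = np.triu_indices(n_langs, k=1)
second_order_r = float(np.corrcoef(lang_corr[iu], lex_sim[iu])[0, 1])
```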

Related Work
Societal biases of various types in neural NLP models are widely studied, especially those related to gender and ethnicity. These efforts have already been summarized in comprehensive overviews (Blodgett et al., 2020; Delobelle et al., 2022).
Nationality bias has also been studied. Venkit et al. (2023) show that GPT-2 associates countries of the Global South with negative-sentiment adjectives. However, only a few studies focus on biases in how multilingual models treat different languages. Papadimitriou et al. (2023) showed that in Spanish and Greek, mBERT prefers syntactic structures prevalent in English. Arora et al. (2022) and Hämmerl et al. (2022) studied differences in moral biases in multilingual language models, concluding that there are differences but no systematic trends. Yin et al. (2022) created a dataset focused on culturally dependent factual knowledge (e.g., the color of a wedding dress) and concluded that it is not the case that Western culture propagates across languages.

Conclusions
We showed that all studied sentence representation models carry a bias that the most prominent feature of European countries is their economic power and affiliation with the former Western and Eastern Blocs. In the models we studied, this presumed country prestige does not correlate with how the models represent the occupational status of people. The exception is Distilled mUSE, where the two correlate, which might lead to discrimination based on nationality.

Limitations & Ethical Considerations
Validity for different cultures. The "ground truth" for job prestige was taken from a study conducted in the USA. It might not be representative of the other countries included in this case study. Given that all countries considered in this case study are part of the so-called Global North, we can assume a certain degree of cultural similarity, which supports the validity of our results. However, our methodology does not generalize beyond the Western world.
Unintended use. Some methods we use in the study might create a false impression that we have developed a scorer of job or country prestige. The correlations that we show as our results do not guarantee the reliability of the scoring beyond the intended use in the study, which is an assessment of multilingual sentence representation models.

High-profile jobs. a surgeon, a university professor, an architect, a lawyer, a priest, a banker, a school principal, an airline pilot, an economist, a network administrator, an air traffic controller, an author, a nuclear plant operator, a computer scientist, a psychologist, a pharmacist, a colonel in the army, a mayor of a city, a university president, a dentist, a fire department lieutenant, a high school teacher, a policeman, a software developer, an actor, a fashion model, a journalist, a musician in a symphony orchestra, a psychiatrist, a chemical engineer
The qualitative group labels that we assign to the countries and use in the further analysis are in Table 5. The values reflect the world as it was in the training data for the models and therefore do not reflect recent events (i.e., Croatia is not listed among the countries paying with the Euro, and Finland is considered neutral).

B Machine Translation Models
The machine translation models that we used are listed in Table 4. We kept the default values for all decoding parameters.

C Detailed per-language results
The detailed per-language results are presented in Tables 6 and 7.

Table 1 :
Basic features of the studied models.

Table 2 :
Quantitative results averaged over languages showing the average correlation of the dominant dimension with the country's GDP and the proportion of languages where the dominant dimension corresponds to the political division of eastern and western countries. The detailed per-language results are presented in Tables 6 and 7 in the Appendix.
Figure 1: Cross-language correlation of the job-prestige dimension. Languages are coded using ISO 639-1 codes.

Table 4 :
Huggingface Hub Identifier of the machine translation models used for our experiments.

Table 6 :
Detailed per-language results for Multilingual MPNet and Distilled Multilingual Sentence Encoder.

Table 7 :
Detailed per-language results for LaBSE and XLM-R finetuned on NLI.