Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models

Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structured knowledge-base queries, masked sentences such as "Paris is the capital of [MASK]" are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English. Extending research to multiple languages is important for diversity and accessibility. (ii) Is mBERT's performance as a knowledge base language-independent, or does it vary from language to language? (iii) A multilingual model is trained on more text, e.g., mBERT is trained on 104 Wikipedias. Can mBERT leverage this for better performance? We find that using mBERT as a knowledge base yields varying performance across languages and that pooling predictions across languages improves performance. Conversely, mBERT exhibits a language bias; e.g., when queried in Italian, it tends to predict Italy as the country of origin.


Introduction
Pretrained language models (LMs) (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019) can be finetuned for a variety of natural language processing (NLP) tasks and generally yield high performance. Increasingly, these models and their generative variants are used to solve tasks by simple text generation, without any finetuning (Brown et al., 2020). This motivated research on how much knowledge is contained in LMs: Petroni et al. (2019) used models pretrained with masked language modeling to answer fill-in-the-blank templates such as "Paris is the capital of [MASK]."

* Equal contribution, random order.

Table 1: Language bias when querying mBERT (TyQ). Top: for an Italian cloze question, Italy is favored as country of origin. Bottom: there is no overlap between the top-ranked predictions, demonstrating the influence of the query language, even though the facts are the same: the same set of triples is evaluated across languages. Table 3 shows that pooling predictions across languages addresses this bias and improves performance. WW = "Wirtschaftswissenschaftler".

This research so far has been exclusively on English. In this paper, we focus on using multilingual pretrained LMs as knowledge bases. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English. Extending research to multiple languages is important for diversity and accessibility. (ii) Is mBERT's performance as a knowledge base language-independent, or does it vary from language to language? To answer these questions, we translate English datasets and analyze mBERT for 53 languages. (iii) A multilingual model is trained on more text, e.g., BERT's training data contains the English Wikipedia, but mBERT is trained on 104 Wikipedias. Can mBERT leverage this fact? Indeed, we show that pooling across languages helps performance.
In summary, our contributions are: (i) We automatically create a multilingual version of TREx and GoogleRE covering 53 languages. (ii) We use an alternative to fill-in-the-blank querying, ranking entities of the type required by the template (e.g., cities), and show that it is a better tool for investigating the knowledge captured by pretrained LMs. (iii) We show that mBERT answers queries across languages with varying performance: it works reasonably well for 21 languages and worse for 32. (iv) We give evidence that the query language affects results: a query formulated in Italian is more likely to produce Italian entities (see Table 1). (v) Pooling predictions across languages improves performance by large margins and even outperforms monolingual English BERT. Code and data are available online (https://github.com/norakassner/mlama).

LAMA
We follow the LAMA setup introduced by Petroni et al. (2019). More specifically, we use data from TREx (Elsahar et al., 2018) and GoogleRE. 1 Both consist of triples of the form (subject, relation, object). The underlying idea of LAMA is to query knowledge from pretrained LMs using templates, without any finetuning: the triple (Paris, capital-of, France) is queried with the template "Paris is the capital of [MASK]." In LAMA, TREx has 34,039 triples across 41 relations; GoogleRE has 5,528 triples across 3 relations. Templates for each relation were manually created by Petroni et al. (2019). We refer to all triples from TREx and GoogleRE together as LAMA.
LAMA has been found to contain many "easy-to-guess" triples; e.g., it is easy to guess that a person with an Italian-sounding name was born in Italy. LAMA-UHN, introduced by Poerner et al. (2020), is a subset of LAMA containing only triples that are hard to guess.

Translation
We translate both entities and templates. We use Google Translate to translate templates of the form "[X] is the capital of [Y]". After translation, all templates were checked for validity (i.e., whether they contain "[X]" and "[Y]" exactly once) and corrected if necessary. In addition, German, Hindi and Japanese templates were checked by native speakers to assess translation quality (see Table 2). To translate the entity names, we used Wikidata and the Google Knowledge Graph.
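The validity check on translated templates can be sketched as follows (a minimal illustration of the check described above, not the authors' actual code; the function name is ours):

```python
def is_valid_template(template: str) -> bool:
    """A translated template is valid only if it contains the subject
    placeholder [X] and the object placeholder [Y] exactly once each,
    as required by the LAMA-style querying setup."""
    return template.count("[X]") == 1 and template.count("[Y]") == 1
```

Templates failing this check (e.g., because the translator dropped or duplicated a placeholder) would be flagged for manual correction.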
mBERT covers 104 languages; Google Translate covers 77 of these. Wikidata and the Google Knowledge Graph do not provide entity translations for all languages, and not all entities are contained in the knowledge graphs. For English, we can find a total of 37,498 triples, which we use from now on. On average, 34% of triples could be translated (macro average over languages). We only consider languages with a coverage above 20%, resulting in the final number of languages we include in our study: 53. The macro average of translated triples in these 53 languages is 43%. Figure 1 gives statistics. We call the translated dataset mLAMA.

Model
We work with mBERT (Devlin et al., 2019), a model pretrained on the 104 largest Wikipedias. We denote mBERT queried in language x as mBERT [x]. As comparison we use the English BERT-Base model and refer to it as BERT. In initial experiments with XLM-R (Conneau et al., 2020) we observed worse performance, similar to Jiang et al. (2020a). Thus, for simplicity we only report results on mBERT.

Typed and Untyped Querying
Petroni et al. (2019) use templates like "Paris is the capital of [MASK]" and return argmax_{w ∈ V} p(w|t) as the answer, where V is the vocabulary of the LM and p(w|t) is the (log-)probability that word w is predicted in template t. Thus the object of a triple must be contained in the vocabulary of the language model. This has two drawbacks: it drastically reduces the number of triples that can be considered, and it hinders performance comparisons across LMs with different vocabularies. We refer to this procedure as UnTyQ (untyped querying).
We propose typed querying, TyQ: for each relation, a candidate set C is created and the prediction becomes argmax_{c ∈ C} p(c|t). For templates like "[X] was born in [MASK]", we know which entity type to expect, in this case cities. We observed that (English-only) BERT-base predicts city names for MASK, whereas mBERT predicts years for the same template. TyQ prevents this.
We choose as C the set of objects across all triples of a relation. The candidate set could also be obtained from an entity typing system (e.g., Yaghoobzadeh and Schütze, 2016), but this is beyond the scope of this paper. Variants of TyQ have been used before (Xiong et al., 2020).
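The difference between the two querying modes can be sketched as follows (a minimal illustration with hypothetical score dictionaries, not the authors' implementation; it assumes every candidate in C has a score):

```python
def untyped_query(scores: dict) -> str:
    """UnTyQ: argmax over the full vocabulary.
    `scores` maps every vocabulary word w to log p(w | t)."""
    return max(scores, key=scores.get)

def typed_query(scores: dict, candidates: set) -> str:
    """TyQ: restrict the argmax to the relation's candidate set C
    (here, the objects observed across all triples of the relation).
    Assumes each candidate appears as a key in `scores`."""
    return max(candidates, key=lambda c: scores[c])
```

For a template like "[X] was born in [MASK]", UnTyQ may return a year if the model assigns it the highest probability, while TyQ is forced to return an entity of the expected type.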

Singletoken vs. Multitoken Objects
Assuming that objects are in the vocabulary (Petroni et al., 2019) is restrictive, even more so in the multilingual case: e.g., "Hamburg" is in the mBERT vocabulary, but French "Hambourg" is tokenized into ["Ham", "##bourg"]. We handle multitoken objects by including multiple [MASK] tokens in the templates. For both TyQ and UnTyQ, we compute the score that a multitoken object is predicted by averaging the log probabilities of its individual tokens.
Given a template t (e.g., "[X] was born in [Y].") let t_1 be the template with one mask token (i.e., "[X] was born in [MASK].") and t_k the template with k mask tokens (i.e., "[X] was born in [MASK] [MASK] ... [MASK]."). We denote the log probability that token w ∈ V is predicted at the i-th mask position as p(m_i = w | t_k), where V is the vocabulary of the LM. To compute p(e|t) for an entity e that is tokenized into l tokens w_1, w_2, ..., w_l, we average the log probabilities across its tokens:

p(e|t_l) = (1/l) Σ_{i=1}^{l} p(m_i = w_i | t_l).

If k is the maximum number of tokens that any entity e ∈ C is split into, where C is the candidate set, we consider all templates t_1, ..., t_k. The prediction is then the entity with the highest average log probability across the templates t_1, ..., t_k. Note that for UnTyQ the space of possible predictions is V × V × ... × V, whereas for TyQ it is the candidate set C.
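The multitoken scoring rule can be sketched as follows (a minimal illustration; `token_log_probs` and `tokenize` are hypothetical stand-ins for the model's per-mask log probabilities and the LM's tokenizer, not the authors' actual code):

```python
def score_entity(token_log_probs, entity_tokens):
    """Average the per-token log probabilities p(m_i = w_i | t_l)
    for an entity tokenized into l tokens, queried with the template
    that has exactly l mask slots.
    `token_log_probs[l]` maps mask position i (0-based) to a dict
    {token: log probability} under the template with l masks."""
    l = len(entity_tokens)
    probs = token_log_probs[l]
    return sum(probs[i][tok] for i, tok in enumerate(entity_tokens)) / l

def predict(token_log_probs, candidates, tokenize):
    """TyQ prediction: the candidate entity with the highest
    average log probability across the templates t_1 ... t_k."""
    return max(candidates,
               key=lambda c: score_entity(token_log_probs, tokenize(c)))
```

Note that averaging (rather than summing) log probabilities avoids systematically penalizing entities that are split into more tokens.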

Evaluation
We compute precision at one for each relation, i.e., (1/|T|) Σ_{t ∈ T} 1{t_object = t̂_object}, where T is the set of all triples and t̂_object is the object predicted by TyQ or UnTyQ. Note that T is different for each language. Our final measure (p1) is the precision at one averaged over relations (i.e., the macro average). Results for multiple languages are the macro average of p1 across languages.
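The macro-averaged metric can be sketched as follows (a minimal illustration of the evaluation described above; the function and argument names are ours, and `predict` is any querying procedure such as TyQ):

```python
from collections import defaultdict

def p1(triples, predict):
    """Precision at one per relation, then macro-averaged over relations.
    `triples` is a list of (subject, relation, gold_object);
    `predict(subject, relation)` returns the model's top-ranked object."""
    hits, totals = defaultdict(int), defaultdict(int)
    for subj, rel, obj in triples:
        totals[rel] += 1
        hits[rel] += predict(subj, rel) == obj
    # macro average: each relation contributes equally,
    # regardless of how many triples it has
    return sum(hits[r] / totals[r] for r in totals) / len(totals)
```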

Results and Discussion
We first investigate TyQ and UnTyQ and find that TyQ is better suited for investigating knowledge in LMs. After exploring translation quality, we use TyQ on mLAMA and observe fairly stable performance for 21 languages and poor performance for 32. When investigating the languages more closely, we find that predictions highly depend on the query language. Finally, we validate our initial hypothesis that mBERT can leverage its multilinguality by pooling predictions: pooling indeed performs better.

Figure 2 shows the distribution of p1 scores for singletoken and multitoken objects. As expected, TyQ works better for both. With UnTyQ, performance depends not only on the model's knowledge, but on at least three extraneous factors. (i) Does the model understand the type constraint of the template (e.g., in "X is the capital of Y", Y must be a country)? (ii) How fluent a substitution is the object under linguistic constraints (e.g., morphology) that are orthogonal to knowledge? Many English templates cannot be translated into a single template in other languages; e.g., "in X" (with X a country) has different translations in French: "à Chypre", "au Mexique", "en Inde". But the LAMA setup requires a single template. By enforcing the type, we reduce the number of errors that are due to surface fluency. (iii) The original LAMA setup is inadequate for multitoken answers: Figure 2 (right) shows that UnTyQ struggles with multitoken objects (mean p1 .03 vs. .17 for TyQ).

UnTyQ vs. TyQ
Overall, TyQ allows us to focus the evaluation on the core question: what knowledge is contained in LMs? From now on, we report numbers in the TyQ setting.
Manual template tuning and automatic template mining (Jiang et al., 2020b) have been investigated in the literature as approaches to the typing problem. We had native speakers check the templates for German, Hindi and Japanese, correct mistakes in the automatic translation, and paraphrase templates to obtain predictions with the correct type. Table 2 shows that these corrections do not yield strong improvements. We conclude that template modifications are not an effective solution to the typing problem.

Translation Quality
Contemporaneous work by Jiang et al. (2020a) provides manual translations of LAMA templates for 23 languages, respecting grammatical gender and inflection constraints. We evaluate our machine-translated templates by comparing performance on a common subset of 14 languages, using TyQ querying on the TREx subset. Surprisingly, we find a performance difference of one percentage point (0.23 vs. 0.24, p1 averaged over languages) in favor of the machine-translated templates. This indicates that machine-translated templates in combination with TyQ yield comparable performance while covering more languages (53 vs. 23).

Multilingual Performance
In mLAMA, not all triples are available in all languages. Thus absolute numbers are not comparable across languages, and we adopt a relative performance comparison: we report p1 of a model-language combination divided by p1 of mBERT queried in English (mBERT[en]) on the exact same set of triples, and call this rel-p1. A rel-p1 score of 0.5 for mBERT[fi] means that p1 of mBERT on Finnish is half of mBERT[en]'s performance on the same triples. rel-p1 of English BERT is usually greater than 1, as monolingual BERT tends to outperform mBERT[en]. Figure 3 shows that mBERT performs reasonably well for 21 languages, but for 32 languages rel-p1 is less than 0.6 (i.e., their p1 is less than 60% of mBERT[en]'s). We conclude that mBERT does not exhibit stable performance across languages. The variable performance (from 20% to almost 100% rel-p1) indicates that mBERT has no common representation for, say, "Paris" across languages, i.e., mBERT representations are language-dependent.

Bias
If mBERT captured knowledge independent of language, we should get similar answers across languages for the same relation. However, Table 1 shows that mBERT exhibits language-specific biases; e.g., when queried in Italian, it tends to predict Italy as the country of origin. This effect occurs for several relations: Table 4 in the supplementary presents data for ten relations and four languages.

Pooling
We investigate pooling of predictions across languages by picking the object predicted by the majority of languages. Table 3 shows that pooled mBERT outperforms mBERT[en] by 6 percentage points on LAMA, presumably in part because the language-specific bias is eliminated. mBERT[pooled] even outperforms BERT by 3 percentage points on LAMA-UHN. This indicates that mBERT can leverage the fact that it is trained on 104 Wikipedias vs. just one and even outperforms the much stronger model BERT.
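The majority-vote pooling described above can be sketched as follows (a minimal illustration; the function name is ours, and tie-breaking is left unspecified, which the actual experiments would need to handle):

```python
from collections import Counter

def pool_predictions(per_language_predictions: dict) -> str:
    """Majority-vote pooling: for one query, pick the object predicted
    by the most languages. `per_language_predictions` maps a language
    code to that language's top prediction for the same triple."""
    counts = Counter(per_language_predictions.values())
    return counts.most_common(1)[0][0]
```

Because language-specific biases (e.g., Italian queries favoring Italy) pull predictions in different directions per language, a majority vote across languages tends to cancel them out.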

Related Work
Petroni et al. (2019) first asked the question: can pretrained LMs function as knowledge bases? Subsequent analyses focused on different aspects, such as negation, easy-to-guess names (Poerner et al., 2020), integrating adapters, or finding alternatives to a "fill-in-the-blank" approach with singletoken answers (Bouraoui et al., 2020; Heinzerling and Inui, 2020; Jiang et al., 2020b). Other work combines pretrained LMs with information retrieval (Guu et al., 2020; Lewis et al., 2020a; Izacard and Grave, 2020; Petroni et al., 2020). Multilingual models like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) perform well for zero-shot crosslingual transfer (Hu et al., 2020). However, we are not aware of any prior work analyzing to what degree pretrained multilingual models can be used as knowledge bases. There are many multilingual question answering datasets, such as XQuAD (Artetxe et al., 2020), TyDi (Clark et al., 2020), MKQA (Longpre et al., 2020) and MLQA (Lewis et al., 2020b). Usually, multilingual models are finetuned to solve such tasks. Our goal is not to improve question answering or create an alternative multilingual question answering dataset, but to investigate what knowledge is contained in pretrained multilingual LMs without any kind of supervised finetuning.
There is a range of alternative multilingual knowledge bases that could be used for evaluation, including ConceptNet (Speer et al., 2017) and BabelNet (Navigli and Ponzetto, 2010). We decided to provide translated versions of TREx and GoogleRE for the sake of comparability across languages: by translating manually created templates and entities, we can ensure that the same facts are queried in every language. This is not possible for crowd-sourced databases like ConceptNet.
In contemporaneous work, Jiang et al. (2020a) create and investigate a multilingual version of LAMA. They provide human template translations for 23 languages, propose several methods for multitoken decoding and code-switching, and experiment with a number of pretrained LMs. In contrast to their work, we investigate typed querying, focus on comparability and pooling across languages, and explore language biases.

Conclusion
We presented mLAMA, a dataset for investigating knowledge in language models (LMs) in a multilingual setting, covering 53 languages. While our results suggest that correct entities can be retrieved for many languages, there is a clear performance gap between English and, e.g., Japanese and Thai. This suggests that mBERT does not store entity knowledge in a language-independent way. Experiments investigating language bias confirm this finding. We hope that this paper and the dataset we publish will stimulate research on investigating knowledge in LMs multilingually rather than just in English.

Table 4 shows the language bias for 10 relations. For each relation, we aggregated the predictions across all triples and show the two most common predicted entities together with their counts (in brackets). The querying language clearly affects results. The effect is drastic for relations that ask for a country (e.g., P495 or P1001). P39 yields very different results without exhibiting a clear pattern. Other relations, such as P463 or P178, are rather stable. Figures 4 and 5 show randomly sampled entries from the data.

C Pretraining Data
We investigate whether performance across languages is correlated with the amount of pretraining data per language. To this end, we compare the number of Wikipedia articles per language as of January 2021 2 with p1 for TyQ in Figure 6. We do not have access to the original pretraining data of mBERT, so the article counts we use in this analysis may differ from the data actually used to train mBERT.
2 https://meta.wikimedia.org/wiki/List_of_Wikipedias

Figure 4: Three randomly sampled data entries from mLAMA per language. Due to the automatic generation of the dataset, not all of them are fully correct.

Table 4: Most frequent object predictions (TyQ) in different languages. Some relations exhibit language-specific biases. WW = "Wirtschaftswissenschaftler".