Factual Consistency of Multilingual Pretrained Language Models

Pretrained language models can be queried for factual knowledge, with potential applications in knowledge base acquisition and tasks that require inference. However, for that, we need to know how reliable this knowledge is, and recent work has shown that monolingual English language models lack consistency when predicting factual knowledge, that is, they fill in the blank differently for paraphrases describing the same fact. In this paper, we extend the analysis of consistency to a multilingual setting. We introduce a resource, mParaRel, and investigate (i) whether multilingual language models such as mBERT and XLM-R are more consistent than their monolingual counterparts, and (ii) whether such models are equally consistent across languages. We find that mBERT is as inconsistent as English BERT on English paraphrases, but that both mBERT and XLM-R exhibit a high degree of inconsistency in English and even more so for all the other 45 languages.


Introduction
Pretrained Language Models (PLMs) enable high-quality sentence and document representations (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019) and encode world knowledge that can be useful for downstream tasks, e.g., closed-book QA and commonsense reasoning (Zellers et al., 2019; Talmor et al., 2019), to name a few. Recent work has used language models as knowledge bases (Petroni et al., 2019; Kassner et al., 2021a) and as the basis of neural databases (Thorne et al., 2021). Such usage of PLMs relies on the assumption that we can generally trust the world knowledge induced from these models.
Consistency is a core quality that we would like models to have when we use their stored factual knowledge. We want models to behave consistently on semantically equivalent inputs (Elazar et al., 2021) and to be consistent in their beliefs (Kassner et al., 2021b). Moreover, we want them to be fair across languages or, in other words, to exhibit consistent behaviour across languages (Choudhury and Deshpande, 2021). Nonetheless, recent work on consistency in PLMs has shown that models are brittle in their predictions when faced with irrelevant changes in the input (Gan and Ng, 2019; Ribeiro et al., 2020; Elazar et al., 2021; Ravichander et al., 2020). These works only considered English PLMs, while Jang et al. (2021) studied the consistency of Korean PLMs. To the best of our knowledge, there are no resources available to measure the consistency of multilingual PLMs.

Contributions
In this paper, we present mParaRel (available at https://github.com/coastalcph/mpararel), a multilingual version of the ParaRel dataset (Elazar et al., 2021), which we construct by automatically translating the English data into 45 languages and performing a human review for 11 of these. We then evaluate how consistent mBERT is in comparison to its monolingual counterpart, and we study how the consistency of mBERT and XLM-R varies across languages. Following previous work, we do this by querying the model with cloze-style paraphrases, e.g., "Albert Einstein was born in [MASK]" and "Albert Einstein is originally from [MASK]". We find that mBERT and XLM-R are about as consistent as English BERT on English paraphrases, but that consistency numbers are considerably lower for other languages. In other words, while consistency is a serious problem in PLMs for English (Elazar et al., 2021), it is a much bigger problem for other languages.

Probing Consistency
We use the same probing framework as defined by Petroni et al. (2019) and refined by Elazar et al. (2021), and query PLMs with cloze-test statements created from subject-relation-object Wikidata triples (Elsahar et al., 2018). That is, we have a set of different relations $\{r\}$, and each $r$ has a set of templates or patterns $\{t\}$ and a set of subject-object tuples $\{(s, o)\}$. Each template $t$ describes its corresponding relation $r$ between the pairs $(s, o)$. For example, a relation $r$ can be born-in, and two of its patterns could be $t_1$ = "[X] was born in [Y]" and $t_2$ = "[X] is originally from [Y]" (where [X] is the subject and [Y] the object to be replaced). The corresponding subject-object tuples $\{(s, o)\}$ are then used to query and evaluate the model by replacing the subject and masking the object. We study the consistency of a PLM by querying it with cloze-test paraphrases and measuring how many of the predictions of the paraphrases are the same (details in §4).
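To make the setup concrete, here is a minimal Python sketch of how a template is instantiated into cloze queries; the relation, tuple, and helper function are hypothetical illustrations, not part of a released implementation.

```python
# Hypothetical relation data used only for illustration.
relation = "born-in"
templates = ["[X] was born in [Y]", "[X] is originally from [Y]"]
tuples = [("Albert Einstein", "Ulm")]  # (subject, object) pairs

def instantiate(template: str, subject: str, mask: str = "[MASK]") -> str:
    """Fill the subject slot and mask the object slot of a template."""
    return template.replace("[X]", subject).replace("[Y]", mask)

for subject, obj in tuples:
    queries = [instantiate(t, subject) for t in templates]
    # The model is consistent on this tuple if it predicts the same
    # object for every paraphrase in `queries`.
    print(queries)
```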

mParaRel
We used the paraphrases in the ParaRel dataset (Elazar et al., 2021), which has 38 relations in total and an average of 8.6 English templates per relation. We translated these using the procedure below, obtaining paraphrases for 46 languages.

Translations
We relied on five different machine translation models: Google Translate, Microsoft Translator, a pretrained mBART model that translates between 50 languages (Tang et al., 2020), a pretrained mixture of Transformers that translates between 100 languages (Fan et al., 2021), and OPUS-MT (Tiedemann and Thottingal, 2020). We fed the models with templates, e.g., "[X] died in [Y]", automatically checking that the translation still contained [X] and [Y]. We considered as valid: (1) translated paraphrases that two or more different models agreed upon, and (2) the translations from the Microsoft Translator, as manual inspection by native speakers found them to be of good quality in several languages. So for languages that Microsoft Translator supports, we have a template t from it, as well as any other translation agreed upon by two or more other translators (a code sketch of these selection rules is given at the end of this section). Finally, we also include the templates from the mLAMA dataset (Kassner et al., 2021a). Translations of subject-object entities were obtained from Wikidata, using the entity identifiers. We kept only the languages that (i) covered at least 60% of the 38 relations, and (ii) covered at least 20% of the original English phrases.

Human Evaluation

To assess the quality of the translated paraphrases, we carried out a human review. We had 14 native speakers review 11 different languages. Each person reviewed a 50% random sample of the total templates of the language. We asked whether each template was a correct paraphrase of the given relation, requested corrections, and optionally asked for new template suggestions. On average, 16%±8% of the reviewed templates were considered wrong, 20%±10% were amended, and the rest were considered correct. The statistics of the dataset after removing the wrong templates and including the corrections and suggestions can be found in Table 1. The total number of different phrases (templates with the subject and object replaced) per language is shown in Figure 1.
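As referenced above, the following is a minimal Python sketch of the template selection rules from the translation step; the function names, the "microsoft" system key, and the simplified agreement check are our own illustrative assumptions, not the actual pipeline.

```python
from collections import Counter

def placeholders_intact(translation: str) -> bool:
    # A translated template is only usable if both slots survived translation.
    return "[X]" in translation and "[Y]" in translation

def select_templates(translations_by_system: dict) -> set:
    """Keep (1) any translation produced by two or more systems and
    (2) the Microsoft Translator output, when available."""
    valid = {s: t for s, t in translations_by_system.items()
             if placeholders_intact(t)}
    counts = Counter(valid.values())
    selected = {t for t, n in counts.items() if n >= 2}
    if "microsoft" in valid:
        selected.add(valid["microsoft"])
    return selected

# Toy example: three systems translate "[X] died in [Y]" into Spanish.
print(select_templates({
    "microsoft": "[X] murió en [Y]",
    "google": "[X] murió en [Y]",
    "mbart50": "[X] falleció en",  # dropped [Y]: filtered out
}))
```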

Experiments
We ran experiments with mBERT (Devlin et al., 2019), a multilingual BERT model of 110M parameters trained on 104 languages using Wikipedia, and XLM-RoBERTa (Conneau et al., 2020), a multilingual RoBERTa model of 560M parameters trained on 100 languages using 2.5TB of CommonCrawl data.
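For reference, both models are available on the Hugging Face Hub; a minimal loading sketch, assuming the standard transformers API and the usual checkpoint names (bert-base-multilingual-cased and xlm-roberta-large), could look as follows:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Checkpoints assumed to correspond to the models described above:
# mBERT (110M parameters) and XLM-R large (560M parameters).
models = {}
for name in ["bert-base-multilingual-cased", "xlm-roberta-large"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)
    models[name] = (tokenizer, model)
```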
Querying Language Models The prediction of a PLM for a cloze statement $t$ is normally $\arg\max_{w \in V} P(w \mid t)$ (Petroni et al., 2019; Ravichander et al., 2020), that is, the top-1 token prediction over the vocabulary $V$. However, Kassner et al. (2021a) and Elazar et al. (2021) used typed queries, where the prediction is $\arg\max_{w \in C} P(w \mid t)$, with $C$ a set of candidates that meets the type criteria of the pattern (e.g., cities, professions). In our case, $C$ is the set of all possible objects in the relation. The motivation is that restricting the output reduces errors due to surface fluency, since small grammatical errors can occur when populating a template with different tuples (Kassner et al., 2021a). It is common to only consider (subject, object) tuples for which the to-be-masked object is a single token in the model's vocabulary (Petroni et al., 2019; Elazar et al., 2021). However, this severely reduces the number of valid tuples, and even more so when dealing with multilingual vocabularies. Therefore, we follow the multi-token prediction approach of Kassner et al. (2021a) and query the model with multiple masked tokens. The probability of an object instantiation is then the average probability of its tokens, i.e., for a given object $o$ consisting of $l$ tokens,

$$P(o \mid t_l) = \frac{1}{l} \sum_{i=1}^{l} P(m_i = w_i \mid t_l),$$

where $w_i$ is the $i$-th token of the word $o$, $m_i$ is the $i$-th mask token, and $t_l$ is the template with $l$ mask tokens.
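To illustrate, the sketch below scores a candidate object by averaging the probabilities of its tokens over the masked positions and takes the arg max over the candidate set $C$; it is a simplified reading of the scheme above (using mBERT and a hypothetical candidate set), not the authors' exact implementation.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

def score_candidate(template: str, subject: str, candidate: str) -> float:
    # Tokenize the candidate object; it may span several subword tokens.
    obj_ids = tokenizer(candidate, add_special_tokens=False)["input_ids"]
    # Insert one mask token per object token into the template.
    masks = " ".join([tokenizer.mask_token] * len(obj_ids))
    text = template.replace("[X]", subject).replace("[Y]", masks)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits[0].softmax(dim=-1)
    # Average the probability of each object token at its mask position.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return sum(probs[p, w].item() for p, w in zip(mask_pos, obj_ids)) / len(obj_ids)

# Typed query: restrict the prediction to the relation's candidate objects.
candidates = ["Ulm", "Paris", "London"]  # hypothetical candidate set C
template = "[X] was born in [Y]"
prediction = max(candidates, key=lambda c: score_candidate(template, "Albert Einstein", c))
print(prediction)
```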
Evaluation For a given relation $r$, consistency is the percentage of pairs of templates that yield the same prediction for every subject-object tuple (Elazar et al., 2021), i.e., the consistency of a given relation $r$ is:

$$\text{Consistency}(r) = \frac{1}{|D| \binom{|T|}{2}} \sum_{d \in D} \sum_{i=1}^{|T|-1} \sum_{j=i+1}^{|T|} \mathbb{1}\left[f(t_i^d) = f(t_j^d)\right], \quad (1)$$

where $t$ is a template, $T$ the set of templates in the relation, $d$ a subject-object tuple, $D$ the set of all tuples, $t_i^d$ the $i$-th template populated with the subject-object data $d$, and $f(\cdot)$ the prediction of the model. Next, accuracy measures the factual correctness of the predictions and is defined as the percentage of correct predictions over all templates and data, i.e., $\frac{1}{|D||T|} \sum_{d \in D} \sum_{t \in T} \mathbb{1}\left[f(t^d) = o_d\right]$, where $o_d$ is the object of the tuple $d$. Finally, consistency-accuracy is the subset of the accurate predictions that is also consistent. It is thus computed similarly to Equation (1), but the indicator additionally requires the predictions to be correct. This metric is useful to account for trivial cases of consistency: a model can be very bad in a language and predict the same token regardless of the input, and thus be perfectly consistent. For all metrics, we report the macro average across relations (a code sketch of these metrics is given at the end of this subsection).

Table 2 compares the consistency of BERT and mBERT on English data, showing little to no difference, depending on whether we use sentence-final punctuation or not. Sentence-final punctuation is not fully consistent in the machine translation output, so we ran experiments comparing performance with and without it. Since languages vary in how they use punctuation, and sentence-final punctuation causes variance in consistency (e.g., Japanese +3%, but Simplified Chinese -5%), we decided to remove all sentence-final punctuation for the cross-lingual consistency results.
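As referenced above, here is a minimal Python sketch of the consistency and consistency-accuracy metrics; the data layout (predictions keyed by tuple and template) and function names are hypothetical.

```python
from itertools import combinations

def consistency(predictions: dict) -> float:
    """Equation (1): fraction of template pairs with identical predictions,
    summed over all subject-object tuples of the relation.
    predictions[d][t] is the model prediction for tuple d under template t."""
    agree = total = 0
    for preds in predictions.values():
        for t_i, t_j in combinations(sorted(preds), 2):
            agree += preds[t_i] == preds[t_j]
            total += 1
    return agree / total

def consistency_accuracy(predictions: dict, gold: dict) -> float:
    """Like Equation (1), but a pair only counts if both predictions
    also match the gold object gold[d] of the tuple."""
    agree = total = 0
    for d, preds in predictions.items():
        for t_i, t_j in combinations(sorted(preds), 2):
            agree += preds[t_i] == preds[t_j] == gold[d]
            total += 1
    return agree / total

# Toy example: two templates, one tuple.
preds = {"d1": {"t1": "Ulm", "t2": "Ulm"}}
print(consistency(preds), consistency_accuracy(preds, {"d1": "Ulm"}))
```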

Consistency across languages
The consistency results on the mParaRel dataset are presented in Figure 2. First of all, we can see that the manual corrections do not change the results much (as also observed by Kassner et al. (2021a)). Nevertheless, they do improve the consistency and accuracy by 1%-2% in a couple of languages, probably because some noise was reduced when correcting and adding new templates. Consistency numbers remain very low, however, especially for languages other than English and Vietnamese. XLM-R is much more consistent than mBERT in some languages (e.g., Greek ('el')), yet their average consistency is the same (0.43). The standard deviation of XLM-R's consistency is 8% lower than that of mBERT, i.e., XLM-R's consistency is more evenly distributed across languages. Somewhat surprisingly, the accuracy of mBERT is superior to XLM-R's; nevertheless, this aligns with the findings of Elazar et al. (2021), where English base BERT obtained higher accuracy than a large English RoBERTa model. We note the importance of controlling for accuracy in our consistency results (reported as consistency-accuracy): Japanese, for example, has high consistency, but partly because the model wrongly predicts the same (frequent) token across paraphrases; consistency-accuracy reranks Japanese as one of the most inconsistently encoded languages in both mBERT and XLM-R.

Related Work
Consistency in PLMs has mainly been studied in English. Gan and Ng (2019) created a paraphrased version of SQuAD and showed that state-of-the-art models suffer a significant decrease in performance. Ribeiro et al. (2020) proposed a framework to test the robustness of predictions when faced with irrelevant changes in the input. Elazar et al. (2021) and Ravichander et al. (2020) showed that monolingual English PLMs are inconsistent in fill-in-the-blank phrases, and Newman et al. (2021) proposed using adapters to better handle this inconsistency.

There are paraphrase datasets available in English (Dolan and Brockett, 2005; Quora, 2012) and in multiple languages (Ganitkevitch and Callison-Burch, 2014), but they cannot be easily linked to subject-object tuples in order to measure consistency.

Conclusion
In this work, we measured the consistency of multilingual Pretrained Language Models when queried to extract factual knowledge. We constructed a high-quality multilingual dataset covering 46 languages to assess the consistency of model predictions in the face of language variability. Finally, we experimented with mBERT and XLM-R and concluded that their consistency is poor in English, but even worse in other languages.