Multilingual BERT has an Accent: Evaluating English Influences on Fluency in Multilingual Models

While multilingual language models can improve NLP performance on low-resource languages by leveraging higher-resource languages, they also reduce average performance on all languages (the ‘curse of multilinguality’). Here we show another problem with multilingual models: grammatical structures in higher-resource languages bleed into lower-resource languages, a phenomenon we call grammatical structure bias. We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models: testing their preference for two carefully-chosen variable grammatical structures (optional pronoun-drop in Spanish and optional Subject-Verb ordering in Greek). We find that multilingual BERT is biased toward the English-like setting (explicit pronouns and Subject-Verb-Object ordering) and against the default Spanish and Greek settings, as compared to our monolingual control language models. With our case studies, we hope to bring to light the fine-grained ways in which multilingual models can be biased, and encourage more linguistically-aware fluency evaluation.


Introduction
Multilingual language models share a single set of parameters between many languages, opening new pathways for multilingual and low-resource NLP. However, not all training languages have an equal amount, or a comparable quality (Kreutzer et al., 2022), of training data in these models. In this paper, we investigate whether the hegemonic status of English influences other languages in multilingual language models. We propose a novel method for evaluation, whereby we ask if model predictions for lower-resource languages exhibit the structural features of English. This is similar to asking whether the model has learned some languages with an "English accent", or an English grammatical structure bias.
We demonstrate this bias effect in Spanish and Greek, comparing the monolingual models BETO (Cañete et al., 2020) and GreekBERT (Koutsikakis et al., 2020) to multilingual BERT (mBERT), in which English is the most frequent language in the training data. We show that mBERT prefers English-like sentence structure in Spanish and Greek compared to the monolingual models. Our case studies focus on Spanish pronoun drop (pro-drop) and Greek subject-verb order, two structural grammatical features. We show that multilingual BERT is structurally biased towards explicit pronouns rather than pro-drop in Spanish, and subject-before-verb order in Greek: the structural forms parallel to English.
The effect we showcase here demonstrates the type of fluency that can be lost with multilingual training, something that current evaluation methods miss. Our proposed method can be expanded, without the need for manual data collection, to any language with a syntactic treebank and a monolingual model. Since our method focuses on fine-grained linguistic features, some expert knowledge of the target language is necessary for evaluation.
Our work builds off of a long literature on multilingual evaluation which has until now mostly focused on downstream classification tasks (Conneau et al., 2018; Ebrahimi et al., 2022; Clark et al., 2020; Liang et al., 2020; Hu et al., 2020; Raganato et al., 2020; Li et al., 2021). With the help of these evaluation methods, research has pointed out the problems for both high- and low-resource languages that come with adding many languages to a single model (Wang et al., 2020; Turc et al., 2021; Lauscher et al., 2020, inter alia), and proposed methods for more equitable models (Ansell et al., 2022; Pfeiffer et al., 2022; Ogueji et al., 2021; Ògúnrẹ̀mí and Manning, 2023; Virtanen et al., 2019; Liang et al., 2023, inter alia). We hope that our work can add to these analyses and methodologies by pointing out issues beyond downstream classification performance that can arise with multilingual training, and aid towards building and evaluating more equitable multilingual models.

Method
Our method relies on finding a variable construction in the target language which can take two structural surface forms: one which is parallel to English (S_parallel) and one which is not (S_different). Surface forms parallel to English are those which mirror English structure.
Once we have identified such a construction in our target language, we can ask: are multilingual models biased towards S_parallel? We can use syntactic treebank annotations to pick out sentences that exhibit the structures S_parallel or S_different, and put these extracted sentences into two corpora, C_parallel and C_different. We then calculate a ratio r_model for each model: the average probability of a sentence in C_parallel divided by the average probability of a sentence in C_different, according to the model. Our experimental question then boils down to asking whether r_multi is significantly larger than r_mono. To estimate P_model(x), we can extract the probability of one word w in each sentence that best represents the construction, and approximate the probability of x with P(w_x | x). Using a carefully chosen word as a proxy for the probability of a construction is a methodological choice also made in reading-time psycholinguistics experiments (Levy and Keller, 2013).
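The ratio computation above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `score_word` is an assumed stand-in for a model-specific function that returns P(w | context) (in practice, masking position i and reading the masked-LM probability from mBERT, BETO, or GreekBERT), and `toy_scorer` is a hypothetical scorer used only to make the sketch runnable.

```python
def construction_ratio(score_word, parallel, different):
    """Compute r_model: the mean probability the model assigns to the
    proxy word w over C_parallel, divided by the mean over C_different.

    `score_word` maps (sentence_tokens, index_of_w) -> P(w | context);
    each corpus is a list of (tokens, w_index) pairs.
    """
    def mean_prob(corpus):
        return sum(score_word(toks, i) for toks, i in corpus) / len(corpus)

    return mean_prob(parallel) / mean_prob(different)


# Toy scorer purely for illustration: a real experiment would mask
# position i and read P(w_i) from a masked language model.
def toy_scorer(tokens, i):
    return 1.0 / len(tokens)  # uniform stand-in probability
```

Computing r_multi and r_mono then amounts to calling `construction_ratio` once per model with the same two corpora.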

Case Study: Spanish Pro-drop
For our Spanish case study, we examine whether the subject pronoun is realized. In Spanish, the subject pronoun is often dropped: person and number are mostly reflected in verb conjugation, so the pronoun is realized or dropped depending on semantic and discourse factors. English, on the other hand, does not allow null subjects except in rare cases, even adding expletive syntactic subjects as in "it is raining". We extract C_parallel (with subject pronoun) and C_different (dropped subject pronoun) from the Spanish GSD treebank (De Marneffe et al., 2021). We take all sentences with a pronoun dependent of the root verb and add them to C_parallel (283 sentences), and all sentences where there is no nsubj relation to the root verb and add them to C_different (2,656 sentences), ignoring some confounding constructions. We always pick the main root verb of the sentence as our logit word w.
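The treebank extraction step can be illustrated with a simplified sketch over raw CoNLL-U text. This is an assumption-laden approximation of the procedure, not the paper's extraction code: it keeps only the pronoun-subject-of-root and no-nsubj-on-root conditions and omits the confound filtering mentioned above.

```python
def split_prodrop(conllu_text):
    """Partition CoNLL-U sentences into C_parallel (explicit pronoun
    subject of the root verb) and C_different (no nsubj on the root).
    Simplified sketch: confounding constructions are not filtered out.
    """
    parallel, different = [], []
    for block in conllu_text.strip().split("\n\n"):
        # Keep plain token rows (skip comments and multiword-token ranges).
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")
                and line.split("\t")[0].isdigit()]
        # CoNLL-U columns: 0=ID, 1=FORM, 3=UPOS, 6=HEAD, 7=DEPREL.
        root_id = next(r[0] for r in rows if r[7] == "root")
        subjects = [r for r in rows if r[6] == root_id and r[7] == "nsubj"]
        if subjects and subjects[0][3] == "PRON":
            parallel.append(block)   # explicit pronoun subject
        elif not subjects:
            different.append(block)  # pro-drop: no overt subject
    return parallel, different
```

For each retained sentence, the root verb's position would then be recorded as the index of the proxy word w.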

Case Study: Greek Subject-Verb order
For our Greek case study, we examine the feature of Subject-Verb order. English is a fixed word order language: with few exceptions, the order of a verb and its arguments is Subject-Verb-Object. Greek, on the other hand, has mostly free word order (Mackridge, 1985), meaning that the verb and its arguments can appear in whichever order is most appropriate given the discourse context. For our experiment, we define S_parallel to be the cases in Greek where the subject precedes the verb, as is the rule in English. S_different is then the cases where the verb precedes the subject, which almost never happens in English. We extract C_parallel (Subject-Verb order, 1,446 sentences) and C_different (Verb-Subject order, 425 sentences) from the Greek Dependency Treebank (Prokopidis and Papageorgiou, 2017). We define w to be whichever of the subject and the verb appears first: this first element is closer to the preceding context, and so gives us a word-order-sensitive measurement of how the subject-verb construction is processed within the context.
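The order classification and choice of proxy word can be sketched as a small function over parsed CoNLL-U rows. As with the Spanish sketch, this is an illustrative approximation under the same column-layout assumption, not the paper's code; the transliterated Greek tokens in the usage below are invented examples.

```python
def order_and_proxy(rows):
    """Given CoNLL-U token rows for one sentence (0=ID, 1=FORM,
    6=HEAD, 7=DEPREL), classify Subject-Verb vs Verb-Subject order
    relative to the root verb, and return the proxy word w: whichever
    of the subject and the verb appears first in the sentence."""
    root = next(r for r in rows if r[7] == "root")
    subj = next((r for r in rows if r[6] == root[0] and r[7] == "nsubj"),
                None)
    if subj is None:
        return None  # no overt subject: not part of either corpus
    label = "SV" if int(subj[0]) < int(root[0]) else "VS"
    w = subj[1] if label == "SV" else root[1]
    return label, w
```

Sentences labeled "SV" would go into C_parallel and "VS" into C_different, each paired with its proxy word w.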

Results
Results are shown in Figures 1 and 2, showing for both of our case studies that multilingual BERT has a greater propensity for preferring English-like sentences which exhibit S_parallel. Multilingual BERT significantly prefers pronoun sentences over pro-drop sentences compared with monolingual BETO (bootstrap sampling, p < 0.05), and significantly prefers subject-verb sentences over verb-subject sentences compared with GreekBERT (bootstrap sampling, p < 0.05).
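A bootstrap comparison of the two ratios can be sketched as follows. This is one plausible form of the test, shown under stated assumptions rather than as the paper's exact procedure: each argument is a list of per-sentence probabilities P(w_x | x) for one model and one corpus, and the returned value is a one-sided p-value for r_multi > r_mono.

```python
import random


def bootstrap_ratio_diff(multi_par, multi_diff, mono_par, mono_diff,
                         n_boot=10_000, seed=0):
    """Bootstrap test for whether r_multi exceeds r_mono: resample each
    corpus's per-sentence probabilities with replacement, recompute both
    ratios, and count the fraction of resamples in which the multilingual
    ratio fails to exceed the monolingual one (a one-sided p-value)."""
    rng = random.Random(seed)

    def ratio(par, diff):
        resample = lambda xs: [rng.choice(xs) for _ in xs]
        p, d = resample(par), resample(diff)
        return (sum(p) / len(p)) / (sum(d) / len(d))

    hits = sum(ratio(multi_par, multi_diff) <= ratio(mono_par, mono_diff)
               for _ in range(n_boot))
    return hits / n_boot
```

The same resampling machinery also yields the 95% confidence intervals shown as error bars in the figures, by taking percentiles of the resampled ratios instead of a p-value.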

Figure 1: Results from our experiment on the Spanish GSD treebank, along with two examples from the treebank to illustrate S_parallel (with pronoun) and S_different (pro-drop). Error bars represent 95% bootstrap confidence intervals.

Figure 2: Results from our experiment on the Greek Dependency Treebank, along with two examples from the treebank to illustrate S_parallel (Subject-Verb) and S_different (Verb-Subject). Error bars represent 95% bootstrap confidence intervals.