A Simple and Effective Method To Eliminate the Self Language Bias in Multilingual Representations

Language-agnostic representation and the isolation of semantic from language-specific information are an emerging research direction for multilingual representation models. We explore this problem from a novel angle of geometric algebra and semantic space. A simple but highly effective method, "Language Information Removal (LIR)", factors out language identity information from semantically related components in multilingual representations pre-trained on multi-monolingual data. A post-training and model-agnostic method, LIR uses only simple linear operations, e.g. matrix factorization and orthogonal projection. LIR reveals that for weak-alignment multilingual systems, the principal components of the semantic space primarily encode language identity information. We first evaluate LIR on a cross-lingual question-answer retrieval task (LAReQA), which requires strong alignment of the multilingual embedding space. Experiments show that LIR is highly effective on this task, yielding almost 100% relative improvement in MAP for weak-alignment models. We then evaluate LIR on the Amazon Reviews and XEVAL datasets, observing that removing language information improves cross-lingual transfer performance.


Introduction
Recently, large-scale language modeling has expanded from English to the multilingual setting (i.a., Devlin et al. (2019); Conneau and Lample (2019); Conneau et al. (2020)). Although these models are trained with language modeling objectives on monolingual data, i.e. without cross-lingual information, these multilingual systems exhibit impressive zero-shot cross-lingual ability (Hu et al., 2020b). These observations raise many questions and provide insight into multilingual representation learning. First, how are language identity information and semantic information expressed in the representation? Understanding their relation and underlying geometric structure is crucial for designing more effective multilingual embedding systems. Second, how can we factor out the language identity information from the semantic components of representations? In many applications, e.g. cross-lingual semantic retrieval, we wish to keep only the semantic information. Third, what is the geometric relation between different languages? Efforts have been made to answer these questions, e.g. Artetxe et al. (2020); Chung et al. (2020); Lauscher et al. (2020). Such prior work has addressed the problem at training time. In this work, we systematically explore a post-training method that can be readily applied to existing multilingual models.

* Work done during internship at Google Research.
One of the first attempts in this research area, Roy et al. (2020), proposed two concepts for language-agnostic models: weak alignment vs. strong alignment. In a multilingual system with weak alignment, for any item in language L1, its nearest neighbor within language L2 is the most semantically "relevant" item. In the case of strong alignment, for any representation, all semantically relevant items are closer than all irrelevant items, regardless of their language. Roy et al. (2020) show that sentence representations from the same language tend to cluster in weak-alignment systems. Similar phenomena can be observed in other pre-trained multilingual models such as mBERT, XLM-R (Conneau et al., 2020) and CMLM. Roy et al. (2020) provide carefully designed training strategies for retrieval-style models to mitigate this issue and obtain language-agnostic multilingual systems.
We systematically explore a simple post-training method, which we refer to as Language Information Removal (LIR), to effectively facilitate language agnosticism in multilingual embedding systems. First introduced in prior work to reduce same-language bias in retrieval tasks, the method uses only linear-algebra factorization as a post-training operation. LIR can be conveniently applied to any multilingual model. We show LIR yields surprisingly large improvements on several downstream tasks, including LAReQA, a cross-lingual QA retrieval dataset (Roy et al., 2020); Amazon Reviews, a zero-shot cross-lingual evaluation dataset; and XEVAL, a collection of multilingual sentence embedding tasks. Our results suggest that the principal components of a multilingual system with self-language bias primarily encode language identification information. An implementation of LIR is available at https://github.com/ziyi-yang/LIR.

Language Information Removal for Self Language Bias Elimination
In this section we describe Language Information Removal (LIR) to address the self-language bias in multilingual embeddings. The first step is to extract the language identity information for each language space. Given a multilingual embedding system E, e.g. multilingual BERT, and a collection of multilingual texts {t_L^i}, where t_L^i denotes the i-th phrase in the collection for language L, we construct a language matrix M_L ∈ R^{n×d} for language L, where n denotes the number of sentences in language L and d denotes the dimension of the representation. Row i of M_L is the representation of t_L^i computed by E. Second, we extract the language identification components for each language. One observation in multilingual systems is that representations from the same language tend to cluster together (relative to representations in other languages), even when these representations have different semantic meanings. This phenomenon is also known as "weak alignment" (Roy et al., 2020). The mathematical explanation for this clustering is that representations in the same language share vector space components. We propose that these shared components essentially represent the language identification information; removing them should leave the semantic-related information in the representations.
To remove the shared components, i.e. the language identification information, from the representations, we leverage the singular value decomposition (SVD), which identifies the principal directions of a space. We use SVD instead of PCA since SVD is more numerically stable (e.g. for the Läuchli matrix). Specifically, the SVD of a language matrix is M_L = U_L Σ_L V_L^T, where the columns of V_L ∈ R^{d×d} are the right singular vectors of M_L. We take the first r columns of V_L as the language identification components, denoted c_L ∈ R^{d×r}. Different values of r are explored in the experiments section. Language identification components are removed as follows. Given a multilingual representation e_L in language L, we subtract the projection of e_L onto c_L from e_L, i.e.

e_L' = e_L − c_L c_L^T e_L.    (1)
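As a concrete illustration, the two steps above (SVD of the language matrix, then projecting out the top-r right singular vectors) can be sketched in a few lines of NumPy. This is a minimal sketch with our own function names, not the released implementation:

```python
import numpy as np

def language_components(M, r=1):
    """Extract the top-r right singular vectors of a language matrix.

    M has shape (n, d): each row is the embedding of one sentence
    sampled from a single language.
    """
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:r].T  # c_L with shape (d, r), orthonormal columns

def remove_language(e, c):
    """Subtract the projection of embedding e onto the components c."""
    return e - c @ (c.T @ e)
```

Because the columns of c_L are orthonormal, the projection reduces to c_L c_L^T e_L, and the resulting embedding is orthogonal to every extracted language direction.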

Experiments
In the following experiments, the sentences used for extracting principal components are sampled from Wiki-40B (Guo et al., 2020). We use 10,000 sentences per language. We notice that performance initially increases as more sentences are used, but is almost unchanged beyond n > 10,000. We tried different samplings of {t_L^i} and text resources other than Wiki-40B, e.g., Tatoeba (Artetxe and Schwenk, 2019). The minimal differences in performance suggest that the language components are stable across domains.

Cross-lingual Answer Retrieval
We first examine LIR on LAReQA, a cross-lingual answer retrieval dataset covering 11 languages (Roy et al., 2020). LAReQA consists of two retrieval sub-datasets: XQuAD-R and MLQA-R. XQuAD-R is built by translating 240 paragraphs in the SQuAD v1.1 dev set into 10 languages and converting them to retrieval tasks following the procedure from ReQA (Ahmad et al., 2019). Similarly, MLQA-R is constructed by converting MLQA (Lewis et al., 2020) to QA retrieval. In other words, each question in LAReQA has 11 relevant answers, one in each language. Two retrieval models with self-language bias are presented in the original LAReQA paper, "En-En" and "X-X". Specifically, the multilingual model "En-En" fine-tunes mBERT for QA retrieval on the 80,000 English QA pairs from the SQuAD v1.1 train set using a ranking loss. The model "X-X" trains on the translations (into 11 languages) of the SQuAD train set; in each training example, the question and answer are in the same language. Since all positive examples for a given question query are within-language, "En-En" and "X-X" exhibit strong self-language bias and the weak-alignment property.
For evaluation, we first compute the language identification components with the "En-En" and "X-X" models released by LAReQA. At test time, language identification components are removed from question and answer embeddings following Eq. (1). Results are shown in Table 1; the evaluation metric is mean average precision (MAP) of retrieval. Detailed results for each language are provided in the appendix (Table 5). Simply applying LIR results in significant improvements, almost 100% relative for the "X-X" model on XQuAD-R. This large boost reveals the algebraic structure of the multilingual representation space: in a weak-alignment multilingual system, the principal components primarily encode language information. In LAReQA, each language contains one of the relevant answers, so the performance improvement itself already indicates less language bias. To further illustrate the effect of LIR, we plot the 2D PCA projection of questions and candidates in Chinese and English for the XQuAD-R dataset. Without LIR, as plotted on the left of Fig. 2, embeddings cluster by language; this property is especially prominent for model "X-X". After applying LIR, the separation between the two languages vanishes. Question and candidate embeddings group together regardless of language, and both models "En-En" and "X-X" now exhibit strong cross-lingual alignment.
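For readers implementing the evaluation, mean average precision over a question-candidate similarity matrix can be computed as below. This is a generic MAP sketch with our own helper names, not the LAReQA evaluation code:

```python
import numpy as np

def average_precision(relevant, ranking):
    """AP for one query: `ranking` lists candidate ids from best to worst,
    `relevant` is the set of relevant candidate ids."""
    hits, ap = 0, 0.0
    for rank, cand in enumerate(ranking, start=1):
        if cand in relevant:
            hits += 1
            ap += hits / rank
    return ap / len(relevant)

def mean_average_precision(sims, relevance):
    """sims: (num_queries, num_candidates) similarity matrix;
    relevance: one set of relevant candidate ids per query."""
    rankings = np.argsort(-sims, axis=1)  # sort candidates by descending score
    return float(np.mean([average_precision(rel, ranking)
                          for rel, ranking in zip(relevance, rankings)]))
```

In the LAReQA setting, each question's relevance set contains its 11 answers, one per language, so a model with self-language bias loses precision on the 10 cross-lingual answers.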

Amazon Reviews
We further evaluate LIR on zero-shot transfer learning with the Amazon Reviews dataset (Prettenhofer and Stein, 2010). In this subsection, we use multilingual BERT (Devlin et al., 2019) as the embedding model. Following Chidambaram et al. (2019), the original dataset is converted to a classification benchmark by treating reviews with more than 3 stars as positive and the rest as negative. We split the 6000 English reviews in the original training set into 90% for training and 10% for development. A logistic classifier is trained on the English training set and then evaluated on the English, French, German and Japanese test sets (each with 6000 examples) using the same trained model, i.e. the evaluation is zero-shot. The weights of mBERT are fixed. The representation of a sentence/phrase is computed as the average pooling of the transformer encoder outputs. LIR is applied in both the training and evaluation stages using the corresponding language components. Results presented in Table 2 show that removing the language components from multilingual representations is beneficial for cross-lingual zero-shot transfer learning with mBERT. LIR is expected to leave only semantic-related information in the representation, so that the logistic classifier trained on English transfers conveniently to other languages. Another interesting observation is that, unlike in semantic retrieval, the peak performance usually occurs at r > 1.
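The zero-shot protocol can be sketched as follows: English language components are removed from the training embeddings, target-language components from the test embeddings, and a linear probe is fit on English data only. The function name and the ridge-regularized least-squares probe (standing in for the paper's logistic classifier) are illustrative assumptions, not the actual experimental code:

```python
import numpy as np

def lir_zero_shot(train_emb, train_y, test_emb, c_train, c_test, reg=1e-3):
    """Train on English, predict on another language, applying LIR with
    each language's own components c_* of shape (d, r)."""
    Xtr = train_emb - train_emb @ c_train @ c_train.T  # remove English components
    Xte = test_emb - test_emb @ c_test @ c_test.T      # remove target-language components
    d = Xtr.shape[1]
    # ridge least-squares probe on {-1, +1} targets (stand-in for logistic regression)
    w = np.linalg.solve(Xtr.T @ Xtr + reg * np.eye(d), Xtr.T @ (2 * train_y - 1))
    return (Xte @ w > 0).astype(int)
```

The key point is that the probe sees only language-free directions at both training and test time, so the English-only decision boundary remains meaningful for the target language.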

XEVAL
We have tested LIR on cross-lingual benchmarks in the previous sections. In this section, we apply LIR to XEVAL, a collection of multilingual sentence representation benchmarks. The training set and test set of XEVAL are in the same language (i.e. the evaluation is not cross-lingual). Benchmarks in XEVAL include Movie Reviews (Pang and Lee, 2005), binary SST (sentiment analysis, Socher et al. (2013)), MPQA (opinion polarity, Wiebe et al. (2005)), TREC (question type, Voorhees and Tice (2000)), CR (product reviews, Hu and Liu (2004)), SUBJ (subjectivity/objectivity, Pang and Lee (2004)) and SICK (both entailment and relatedness, Marelli et al. (2014)). For this evaluation, we use mBERT as the base multilingual encoder. Again, the weights of mBERT are fixed during training and only the downstream neural structures are trained. Training, cross-validation and evaluation use the SentEval toolkit (Conneau and Kiela, 2018). Results are presented in Table 3. The metric is the average performance across the 9 datasets mentioned above. Introducing LIR is beneficial on German, Spanish, French and Chinese. We also notice that for the English datasets, removing principal components actually hurts performance. This observation echoes findings in previous English sentence embedding work, e.g. Yang et al. (2019b). We speculate this is because English data dominate mBERT's training data, so mBERT representations behave similarly to monolingual English sentence embeddings.

Application to Models without Self-Language Bias
In the previous sections, we have shown the effectiveness of LIR on weak-alignment systems. As an additional analysis, we examine LIR on multilingual models without self-language bias, i.e. the models "X-X-mono" and "X-Y" introduced in the original LAReQA paper. Model "X-X-mono" is modified from "X-X" by ensuring that each training batch is monolingual, so that in-batch negative and positive examples are in the same language. In model "X-Y", questions and answers may be translated into different languages, which directly encourages the model to regard answers in a different language from the question as correct. With these training designs, "X-X-mono" and "X-Y" are shown to be free of self-language bias, i.e. semantically relevant representations are closer than all irrelevant items, regardless of language. The evaluation process is similar to that in Section 3.1. Results are presented in Table 4. Applying LIR leads to a slight performance decrease for "X-X-mono". The drop for "X-Y" is notable; we suspect this is because the training process for "X-Y" avoids, by design, self-language bias, so the principal components of "X-Y" instead contain semantic information essential for the retrieval task. This result is not negative and actually supports our argument: for "strong alignment" multilingual systems, the principal components contain both semantic and language-related information, so removing them hinders semantic retrieval. For weak-alignment models, removing just the first component is adequate for cross-lingual retrieval (Table 1). For tasks like classification and sentiment analysis (Tables 2 and 3), the optimal number of components to remove varies across datasets.

Related Work & Our Novelty
Different training methods have been proposed to obtain language-agnostic representations. LASER (Artetxe and Schwenk, 2019) leverages translation pairs and a BiLSTM encoder for multilingual sentence representation learning. Multilingual USE (Yang et al., 2019a) uses training data such as translated SNLI, mined multilingual QA and translation pairs to learn a multilingual sentence encoder. AMBER (Hu et al., 2020a) aligns contextualized representations of multilingual encoders at different granularities. LaBSE (Feng et al., 2020) fine-tunes a pre-trained language model on a bitext retrieval task with mined cross-lingual parallel data to obtain language-agnostic sentence representations. In contrast, LIR does not require any parallel data for semantic alignment. Faruqui and Dyer (2014) propose a canonical correlation analysis (CCA) based method to add multilingual context to monolingual embeddings. The method is a post-processing step but requires bilingual word translation pairs to determine the projection vectors. In contrast, LIR does not require labeled data. Mrkšić et al. (2017) build semantically specialized cross-lingual vector spaces. Like the CCA approach, their method requires additional training to adjust the original embeddings using supervised data: cross-lingual synonyms and antonyms. Libovický et al. (2019) propose that the language-specific information of mBERT is the centroid of each language space (the mean of its embeddings). Zhao et al. (2021) propose several training techniques to obtain language-agnostic representations, including segmenting orthographic tokens in the training data and aligning monolingual spaces during training. In contrast, LIR is post-training and model-agnostic; critically, this means LIR can be conveniently applied to any trained multilingual system without further training.

Table 4: Mean average precision (MAP) of "X-X-mono" and "X-Y", models without self-language bias.
Previous explorations of the principal components of the semantic space for sentence embeddings include Arora et al. (2017) and Yang et al. (2019b), where principal component removal is investigated for monolingual models and evaluation is conducted only on semantic similarity benchmarks. In contrast, our work investigates the multilingual case with a more diverse evaluation, e.g. cross-lingual transfer learning. Mu and Viswanath (2018) explore removing top components from English representations. However, prior to our work it was unclear what purpose removing principal components serves in multilingual and cross-lingual settings. We demonstrate that these principal components represent language information for weak-alignment multilingual models.
Compared with the prior work that introduced the method, the novelty of this work is two-fold. First, it was previously unclear whether the assumption that principal components contain language information holds for both weak- and strong-alignment multilingual models. In this work we clearly show that it is valid for weak-alignment models (Section 3.1), whereas for strong-alignment systems the assumption does not quite hold (Table 4). Second, the earlier evaluation was conducted only on Tatoeba, a semantic retrieval dataset, while in this work the evaluations are more comprehensive. Besides the cross-lingual retrieval dataset LAReQA, our experiments include cross-lingual zero-shot learning (Section 3.2) and monolingual transfer learning (Section 3.3). These additional results establish the effectiveness of LIR beyond the domain of semantic retrieval.

Conclusion
In this paper, we investigate the self-language bias in multilingual systems. We explore a simple method, "Language Information Removal (LIR)", which identifies and removes the language information in multilingual semantic spaces via singular value decomposition and orthogonal projection. Despite being a simple, linear-algebra-only method, LIR is highly effective on several downstream tasks, including zero-shot transfer learning and sentiment analysis. For cross-lingual retrieval in particular, introducing LIR improves the performance of weak-alignment multilingual systems by almost 100% in relative MAP.
A Experimental results for each language of model "X-X" on LAReQA

Here we provide the detailed experimental results for each language on the XQuAD-R dataset. The multilingual encoder is model "X-X".

Table 5: Experimental results for each language of model "X-X" on the XQuAD-R dataset.