A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text Classification

We present a multilingual bag-of-entities model that effectively boosts the performance of zero-shot cross-lingual text classification by extending a multilingual pre-trained language model (e.g., M-BERT). It leverages the multilingual nature of Wikidata: entities in multiple languages representing the same concept are defined with a unique identifier. This enables entities described in multiple languages to be represented using shared embeddings. A model trained on entity features in a resource-rich language can thus be directly applied to other languages. Our experimental results on cross-lingual topic classification (using the MLDoc and TED-CLDC datasets) and entity typing (using the SHINRA2020-ML dataset) show that the proposed model consistently outperforms state-of-the-art models.


Introduction
In the zero-shot approach to cross-lingual transfer learning, models are trained on annotated data in a resource-rich language (the source language) and then applied to another language (the target language) without any training. Substantial progress in cross-lingual transfer learning has been made using multilingual pre-trained language models (PLMs), such as multilingual BERT (M-BERT), jointly trained on massive corpora in multiple languages (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020a). However, recent empirical studies have found that cross-lingual transfer learning with PLMs does not work well for languages with insufficient pre-training data or between distant languages (Conneau et al., 2020b; Lauscher et al., 2020), which suggests the difficulty of cross-lingual transfer based solely on textual information.
We propose a multilingual bag-of-entities (M-BoE) model that boosts the performance of zero-shot cross-lingual text classification by injecting features of language-agnostic knowledge base (KB) entities into PLMs. KB entities, unlike words, can capture unambiguous semantics in documents and be effectively used to address text classification tasks (Gabrilovich and Markovitch, 2006; Chang et al., 2008; Negi and Rosner, 2013; Song et al., 2016; Yamada and Shindo, 2019). In particular, our model extends PLMs by using Wikidata entities as input features (see Figure 1). A key idea behind our model is to leverage the multilingual nature of Wikidata: entities in multiple languages representing the same concept (e.g., Apple Inc., 애플, アップル) are assigned a unique identifier across languages (e.g., Q312). Given a document to be classified, our model extracts Wikipedia entities from the document, converts them into the corresponding Wikidata entities, and computes the entity-based document representation as the weighted average of the embeddings of the extracted entities. Inspired by previous work (Yamada and Shindo, 2019; Peters et al., 2019), we compute the weights using an attention mechanism that selects the entities relevant to the given document. We then compute the sum of the entity-based document representation and the text-based document representation computed using the PLM and feed it into a linear classifier. Since the entity vocabulary and entity embeddings are shared across languages, a model trained on entity features in the source language can be directly transferred to multiple target languages.
We evaluated the performance of the M-BoE model on three cross-lingual text classification tasks: topic classification on the MLDoc (Schwenk and Li, 2018) and TED-CLDC (Hermann and Blunsom, 2014) datasets and entity typing on the SHINRA2020-ML (Sekine et al., 2020) dataset. We trained the model using training data in the source language (English) and then evaluated it on the target languages. It outperformed our base PLMs (i.e., M-BERT (Devlin et al., 2019) and the XLM-R model (Conneau et al., 2020a)) for all target languages on all three tasks, thereby demonstrating the effectiveness of the entity-based representation. Furthermore, our model performed better than state-of-the-art models on the MLDoc dataset.

Figure 1: Architecture of M-BoE. Given a document, the model extracts Wikipedia entities, converts them into corresponding Wikidata entities, and calculates the entity-based document representation by using the weighted average of the embeddings of the entities selected by an attention mechanism. The sum of the entity-based representation and the representation computed using a multilingual PLM is used to perform linear classification for the task.
Our contributions are as follows:

• We present a method for boosting the performance of cross-lingual text classification by extending multilingual PLMs to leverage the multilingual nature of Wikidata entities. Our method successfully improves the performance on multiple target languages simultaneously without expensive pre-training or additional text data in the target languages.
• Inspired by previous work (Yamada and Shindo, 2019;Peters et al., 2019), we introduce an attention mechanism that enables entity-based representations to be effectively transferred from the source language to the target languages. The mechanism selects entities that are relevant to address the task.
• We present experimental results for three cross-lingual text classification tasks demonstrating that our method outperformed our base PLMs (i.e., M-BERT and XLM-R) for all languages on the three tasks and outperformed state-of-the-art methods on the MLDoc dataset.

Related Work
Cross-lingual PLMs Zero-shot cross-lingual transfer learning approaches have relied on parallel corpora (Xu and Wan, 2017), and subsequent work presented data augmentation methods in which pseudo-labels are assigned to an unlabeled corpus in the target language. Conneau and Lample (2019) additionally pre-trained BERT-based models using a parallel corpus. However, these methods require extra training on additional text data for each target language, and the resulting models work well only on a single target language. Unlike these methods, our method does not require extra training and improves performance simultaneously for all target languages with only a single PLM. Furthermore, our method can easily be applied to these models since it is a simple extension of a PLM and does not modify its internal architecture.
Enhancing monolingual PLMs using entities Several methods have been proposed for improving the performance of PLMs through pre-training using entities. ERNIE (Zhang et al., 2019) and KnowBert (Peters et al., 2019) enrich PLMs by using pre-trained entity embeddings. LUKE (Yamada et al., 2020b) and EaE (Févry et al., 2020) train entity embeddings from scratch during pre-training. However, all of these methods are aimed at improving the performance of monolingual tasks and require pre-training with a large corpus, which is computationally expensive. Our method dynamically injects entity information into PLMs during fine-tuning without expensive pre-training. Several studies have attempted to incorporate entity information into PLMs after pre-training to enhance the performance of monolingual tasks. Ostendorff et al. (2019) concatenated contextualized representations with knowledge graph embeddings to represent author entities and used them as features for a book classification task. E-BERT (Poerner et al., 2020) inserts KB entities next to the entity names in the input sequence to improve BERT's performance on entity-centric tasks. Verlinden et al. (2021) introduced a mechanism for combining span representations and KB entity representations within a BiLSTM-based end-to-end information extraction model. Unlike these methods, our method aims to improve cross-lingual text classification by combining PLMs with language-agnostic entity embeddings.
Text classification models using entities Several methods have been commonly used to address text classification using entities. Explicit semantic analysis (ESA) is a representative example; it represents a document as a bag of entities, which is a sparse vector in which each dimension is a score reflecting the relevance of the text to each entity (Gabrilovich and Markovitch, 2006;Chang et al., 2008;Negi and Rosner, 2013). More recently, Song et al. (2016) proposed cross-lingual explicit semantic analysis (CLESA), an extension of ESA, to address cross-lingual text classification. CLESA computes sparse vectors from the intersection of Wikipedia entities in the source and target languages using Wikipedia language links. Unlike CLESA's approach, we address cross-lingual text classification by extending state-of-the-art PLMs with a language-agnostic entity-based document representation based on Wikidata.
The work most relevant to our proposed approach is the neural attentive bag-of-entities (NABoE) model proposed by Yamada and Shindo (2019). It addresses monolingual text classification using entities as inputs and uses an attention mechanism to detect relevant entities in the input document. Our model can be regarded as an extension of NABoE that (1) represents documents using an entity embedding shared across languages and (2) combines the entity-based representation and attention mechanism with state-of-the-art PLMs.
Proposed Method
Figure 1 shows the architecture of our model. The model extracts Wikipedia entities, converts them into Wikidata entities, and computes the entity-based document representation using an attention mechanism. The sum of the entity-based document representation and the text-based document representation computed using the PLM is fed into a linear classifier to perform classification tasks.

Entity detection
To detect entities in the input document, we use two dictionaries that can be easily constructed from the KB: (1) a mention-entity dictionary, which maps an entity name (e.g., "Apple") to its possible referent KB entities (e.g., Apple Inc. and Apple (food)) by using the internal anchor links in Wikipedia (Guo et al., 2013), and (2) an inter-language entity dictionary, which links multilingual entities (e.g., Tokyo, 도쿄, 東京) to the corresponding Wikidata identifier (e.g., Q7473516).
All words and phrases are extracted from the given document in accordance with the mention-entity dictionary, and all possible referent entities are detected if they are included as entity names in the dictionary. Note that all possible referent entities are detected for each entity name rather than a single resolved entity; for example, we detect both Apple Inc. and Apple (food) for the entity name "Apple". Next, the detected entities are converted into Wikidata entities if they are included in the inter-language entity dictionary.
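To make the lookup concrete, the following is a minimal sketch of the dictionary-based detection step. The two in-memory dictionaries, the simple token-level matching, and all example values are our own illustrations, not the authors' implementation (which matches multi-word phrases against the full dictionaries).

```python
# Minimal sketch of dictionary-based entity detection (illustrative toy dictionaries).
mention_to_entities = {
    "apple": ["Apple Inc.", "Apple (food)"],   # mention-entity dictionary (from Wikipedia anchors)
    "tokyo": ["Tokyo"],
}
wikipedia_to_wikidata = {
    "Apple Inc.": "Q312",                       # inter-language entity dictionary (Wikidata IDs)
    "Apple (food)": "Q89",
    "Tokyo": "Q7473516",
}

def detect_entities(document: str) -> list[str]:
    """Return Wikidata IDs of all possible referent entities whose names appear in the document."""
    detected = []
    # Toy tokenization; the actual method also matches multi-word phrases.
    for token in document.lower().split():
        for wikipedia_title in mention_to_entities.get(token, []):
            qid = wikipedia_to_wikidata.get(wikipedia_title)
            if qid is not None:
                detected.append(qid)
    return detected

print(detect_entities("Apple opened a new store in Tokyo"))  # ['Q312', 'Q89', 'Q7473516']
```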

Model
Each Wikidata entity $e_i$ is assigned an embedding $\mathbf{v}_{e_i} \in \mathbb{R}^d$. Since our method extracts all possible referent entities rather than a single resolved entity, it often extracts entities that are not related to the document. Therefore, we introduce an attention mechanism inspired by previous work (Yamada and Shindo, 2019; Peters et al., 2019) to prioritize entities related to the document. Given a document $d$ with $K$ detected entities, our method computes the entity-based document representation $\mathbf{z} \in \mathbb{R}^d$ as the weighted average of the entity embeddings:

$$\mathbf{z} = \sum_{i=1}^{K} a_{e_i} \mathbf{v}_{e_i},$$

where $a_{e_i} \in \mathbb{R}$ is the attention weight corresponding to entity $e_i$, calculated as

$$\mathbf{a} = \mathrm{softmax}\left(\mathbf{W}_a^{\top} \boldsymbol{\phi}\right),$$

where $\mathbf{a} = [a_{e_1}, a_{e_2}, \cdots, a_{e_K}]$ are the attention weights; $\mathbf{W}_a \in \mathbb{R}^2$ is a weight vector; $\boldsymbol{\phi} = [\phi(e_1, d), \phi(e_2, d), \cdots, \phi(e_K, d)] \in \mathbb{R}^{2 \times K}$ represents the degree to which each entity $e_i$ is related to document $d$; and $\phi(e_i, d)$ is computed by concatenating the commonness $p_{e_i}$ (the prior probability that the entity name refers to $e_i$, estimated from Wikipedia anchor links) with the cosine similarity between the document representation computed using the PLM, $\mathbf{h} \in \mathbb{R}^d$ (e.g., the final hidden state of the [CLS] token), and the entity embedding $\mathbf{v}_{e_i}$. The sum of the entity-based document representation $\mathbf{z}$ and the text-based document representation $\mathbf{h}$ is fed into a linear classifier to predict the probability of label $c$:

$$\hat{y}_c = \sigma\big(\mathbf{W}(\mathbf{z} + \mathbf{h}) + \mathbf{b}\big)_c,$$

where $\sigma$ is the output activation (softmax for single-label classification, sigmoid for multi-label classification; see Section 4), and $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias vector of the classifier.
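For illustration, the computation above can be written in a few lines of PyTorch. The module below is a minimal sketch under our reading of the formulas; the class name, tensor shapes, and the omission of entity padding/masking are our simplifications, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfEntities(nn.Module):
    """Entity-based document representation with the two-feature attention (illustrative sketch)."""

    def __init__(self, num_entities: int, dim: int, num_labels: int):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)  # shared Wikidata entity embeddings
        self.w_a = nn.Linear(2, 1, bias=False)             # W_a in R^2
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, h, entity_ids, commonness):
        # h: (batch, dim) text-based representation from the PLM (e.g., [CLS] hidden state)
        # entity_ids: (batch, K) indices of detected Wikidata entities
        # commonness: (batch, K) commonness p_e of each detected entity
        v = self.entity_emb(entity_ids)                          # (batch, K, dim)
        cos = F.cosine_similarity(h.unsqueeze(1), v, dim=-1)     # (batch, K)
        phi = torch.stack([commonness, cos], dim=-1)             # (batch, K, 2)
        a = torch.softmax(self.w_a(phi).squeeze(-1), dim=-1)     # attention weights over K entities
        z = torch.einsum("bk,bkd->bd", a, v)                     # weighted average of entity embeddings
        return self.classifier(h + z)                            # logits over labels

model = BagOfEntities(num_entities=1000, dim=768, num_labels=4)
logits = model(torch.randn(2, 768), torch.randint(0, 1000, (2, 5)), torch.rand(2, 5))
```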

Experimental Setup
In this section, we describe the experimental setup we used for the three cross-lingual text classification tasks.
MLDoc is a dataset for multi-class text classification, i.e., classifying news articles into four categories. We used the english.train.1000 and english.dev sets, which each contain 1000 documents, as the training and validation data. As in previous work (Schwenk and Li, 2018; Keung et al., 2020), we used accuracy as the metric.
TED-CLDC is a multi-label topic classification dataset covering 15 topics. It is similar to MLDoc, but the classification task is more difficult because of the colloquial nature of the TED talk transcripts and the small amount of training data. Following previous work (Hermann and Blunsom, 2014), we used micro-average F1 as the metric.
SHINRA2020-ML is an entity typing dataset that assigns fine-grained entity labels (e.g., Person, Country, Government) to Wikipedia pages. We treated it as a multi-label classification task and used the data in all 30 languages other than English as the test data. Following the original work (Sekine et al., 2020), we used micro-average F1 as the metric.
We created a validation set by randomly selecting 5% of the training data in both TED-CLDC and SHINRA2020-ML. We used English as the source language in all experiments. A summary of the datasets is shown in Table 1.

Entity preprocessing
We constructed a mention-entity dictionary from the January 2019 version of the Wikipedia dump and an inter-language entity dictionary from the March 2020 version of the Wikidata dump, which contains 45,412,720 Wikidata entities (e.g., Q312). We computed the commonness values from the same versions of the Wikipedia dumps in the corresponding language, following the work of Yamada and Shindo (2019). We initialized the Wikidata entity embeddings using pre-trained English entity embeddings trained on the KB. To train these embeddings, we used the open-source Wikipedia2Vec tool (Yamada et al., 2020a). We used the January 2019 English Wikipedia dump mentioned above and set the dimension to 768 and the other parameters to the default values. We initialized an entity embedding using a random vector if the entity did not exist in the Wikipedia2Vec embeddings. Note that we used only English Wikipedia to train the entity embeddings.
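As a sketch of this initialization step, the snippet below loads a Wikipedia2Vec model and builds an embedding matrix from a mapping between Wikidata IDs and English Wikipedia titles; the file name and the toy mapping are illustrative assumptions, not the actual resources.

```python
import numpy as np
from wikipedia2vec import Wikipedia2Vec

# Hypothetical mapping from Wikidata IDs to English Wikipedia titles
# (in practice this comes from the inter-language entity dictionary).
qid_to_en_title = {"Q312": "Apple Inc.", "Q7473516": "Tokyo"}

# Embeddings trained with the Wikipedia2Vec tool on the January 2019 English dump
# (the file name here is illustrative).
wiki2vec = Wikipedia2Vec.load("enwiki_20190120_768d.pkl")

dim = 768
rng = np.random.default_rng(0)
emb_matrix = np.empty((len(qid_to_en_title), dim), dtype=np.float32)
for row, title in enumerate(qid_to_en_title.values()):
    if wiki2vec.get_entity(title) is not None:
        emb_matrix[row] = wiki2vec.get_entity_vector(title)
    else:
        # Entities missing from the Wikipedia2Vec vocabulary are initialized randomly.
        emb_matrix[row] = rng.normal(scale=0.02, size=dim)
```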

Models
We used M-BERT (Devlin et al., 2019) and XLM-R base (Conneau et al., 2020a) as the baseline multilingual PLMs to evaluate the proposed method. We added a single fully-connected layer on top of the PLMs and used the final hidden state h of the first [CLS] token as the text-based document representation. For the MLDoc dataset, we trained the model by minimizing the cross-entropy loss with softmax activation. For the TED-CLDC and SHINRA2020-ML datasets, we trained the model by minimizing the binary cross-entropy loss with sigmoid activation. For these two tasks, we regarded each label as positive if its corresponding predicted probability was greater than 0.5 during inference.
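For illustration, the two loss setups and the 0.5 decision threshold can be written as follows; this is a minimal PyTorch sketch with dummy tensors, not the actual training code.

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 4)                       # classifier outputs for a batch of 8 documents

# MLDoc (single-label, four topics): softmax activation via the cross-entropy loss
class_ids = torch.randint(0, 4, (8,))
loss_single = nn.CrossEntropyLoss()(logits, class_ids)

# TED-CLDC / SHINRA2020-ML (multi-label): sigmoid activation via the binary cross-entropy loss
multi_hot = torch.randint(0, 2, (8, 4)).float()
loss_multi = nn.BCEWithLogitsLoss()(logits, multi_hot)

# Inference for the multi-label tasks: a label is positive if its probability exceeds 0.5
predictions = torch.sigmoid(logits) > 0.5
```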
For topic classification using MLDoc, we compared the performance of the proposed model with those of two state-of-the-art cross-lingual models: LASER (Artetxe and Schwenk, 2019) (see Section 2), and MultiCCA (Schwenk and Li, 2018), which is based on a convolutional neural network with multilingual word embeddings. To ensure a fair comparison, we did not include models that use additional unlabeled text data or a parallel corpus to train models for each target language.
For entity typing, we tested a model that uses oracle entity annotations (i.e., hyperlinks) contained in the Wikipedia page to be classified instead of entities detected using the entity detection method described in Section 3.1. Note that this model also uses attention mechanisms and pre-trained entity embeddings.

Detailed settings
The hyper-parameters used in our experiments are shown in Table 2; we tuned them on the English validation set. We trained the model using the AdamW optimizer with gradient clipping at 1.0.
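The optimization setup can be sketched as follows. The stand-in linear model and the learning rate are illustrative placeholders; the actual hyper-parameter values are those in Table 2.

```python
import torch
from torch import nn
from torch.optim import AdamW

model = nn.Linear(768, 4)                          # toy stand-in for the full M-BoE classifier
optimizer = AdamW(model.parameters(), lr=2e-5)     # learning rate is illustrative; see Table 2

x, y = torch.randn(16, 768), torch.randint(0, 4, (16,))
for _ in range(3):                                 # in practice, train until English validation converges
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
    optimizer.step()
    optimizer.zero_grad()
```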
In all experiments, we trained the models until the performance on the English validation set converged. We conducted all experiments ten times with different random seeds and recorded the average scores and 95% confidence intervals.

Results
Tables 3, 4, and 5 show the results of our experiments. Overall, the M-BoE models outperformed their baselines (i.e., M-BERT and XLM-R) for all target languages on all three datasets, and the differences in mean scores over the target languages were statistically significant according to a paired t-test (p < 0.05). In particular, our model exceeded the M-BERT baseline by 2.7% in accuracy on MLDoc, 2.5% in F1 on TED-CLDC, and 2.1% in F1 on SHINRA2020-ML.
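The significance test mentioned above corresponds to a paired t-test over per-language scores; a minimal SciPy sketch is shown below with placeholder values (not the paper's numbers).

```python
from scipy.stats import ttest_rel

# Per-target-language scores for the baseline and the M-BoE model (placeholder values).
baseline_scores = [80.1, 74.3, 69.8, 72.5, 66.0]
mboe_scores = [82.4, 76.9, 71.5, 75.0, 68.8]

t_stat, p_value = ttest_rel(mboe_scores, baseline_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # significant if p < 0.05
```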
For entity typing, using the entities detected with our simple dictionary-based approach achieved performance comparable to using gold entity annotations (Table 5: Oracle M-BoE) on the SHINRA2020-ML dataset, which clearly demonstrates the effectiveness of our dictionary-based entity detection combined with the attention mechanism.

Analysis
We conducted a series of experiments to analyze the performance of our model on the MLDoc dataset (Table 6). We first analyzed the impact of each component of the M-BoE model on performance, including the attention mechanism, the pre-trained entity embeddings, and the entity detection method. We then evaluated the sensitivity of the model's performance to differences in the number of detected entities across languages. Finally, we conducted a qualitative analysis by visualizing important entities.

Attention mechanism
We examined the effect of the attention mechanism on performance. When the attention mechanism was removed (Table 6: Attention mechanism), the performance was substantially lower than with the proposed model. This indicates that the attention mechanism selects the entities that are effective in solving the classification task. Next, we examined the effectiveness of the two features (i.e., cosine and commonness) in the attention mechanism by excluding them one at a time from the M-BoE model. Table 6 shows that there was a slight drop in performance when either of them was not used, indicating that both features are effective.

Entity embeddings
To investigate the effect of entity embedding initialization, we replaced Wikipedia2Vec with (1) random vectors and (2) knowledge graph (KG) embeddings (Table 6: Entity embeddings). For the KG embeddings, we used ComplEx (Trouillon et al., 2016), a state-of-the-art KG embedding method. We trained the ComplEx embeddings on the wikidata5m dataset (Wang et al., 2021) using the kge tool (https://github.com/uma-pi1/kge). We set the dimension to 768 and otherwise used the default hyper-parameters of the tool's wikidata5m-complex configuration. The results show that using Wikipedia2Vec was the most effective, although using KG embeddings was better than using random vectors.

Entity detection method
To verify the effectiveness of our dictionary-based entity detection method, we replaced it with a commercial multilingual entity linking system, the Google Cloud Natural Language API (Table 6: Entity detection method). All entities were detected with the API and converted into Wikidata entities, as explained in Section 3.1. Note that, unlike our dictionary-based method, the entity linking system detects a single disambiguated entity for each entity name.
The results show that our entity detection method outperformed the API. We attribute this to the number of detected entities: as shown in Table 7, the entity linking system detected substantially fewer entities than our method because, unlike our method, it detects only disambiguated entities and does not detect non-named entities. We therefore attribute the better performance of our method to (1) non-named entities also being important features and (2) disambiguation errors made by the entity linking system preventing the correct entity from being used.
Furthermore, as described in Section 5, our entity detection method performed competitively with the human-labeled entity annotations on the SHINRA2020-ML dataset.
Next, we examined how the number of detected Wikidata entities affects performance. For the full model and the model without the attention mechanism, we measured the change in performance when a given percentage of the entities was randomly removed during training and inference. Figure 2 shows that the higher the entity detection rate, the better the performance of the full model. When the attention mechanism was removed, however, there was no consistent trend: the performance remained the same or even dropped. These results suggest that the more entities detected, the better the performance, and that the attention mechanism is important for this consistent improvement.
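The ablation amounts to randomly subsampling the detected entities before they are fed to the model; below is a minimal sketch in which the function name and example IDs are ours, not the authors' code.

```python
import random

def subsample_entities(entity_ids, keep_rate, seed=0):
    """Randomly keep a fraction of the detected entities (used to vary the detection rate)."""
    rng = random.Random(seed)
    return [e for e in entity_ids if rng.random() < keep_rate]

detected = ["Q312", "Q89", "Q7473516", "Q95"]
print(subsample_entities(detected, keep_rate=0.5))
```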

Performance sensitivity to language differences
In our method, the number of Wikidata entities detected during inference differs across target languages. We investigated how this affects performance. For each dataset, we computed Pearson's correlation coefficient between the number of detected entities and the rate of improvement in performance for each language (see Table 8 in the Appendix). There was no clear trend in the correlation coefficients, which ranged from -0.3 to 0.2. These results indicate that performance improved consistently even for languages with a small number of detected entities. We attribute this to the ability of our method to detect a sufficient number of entities even for languages with relatively few entity detections.
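This analysis reduces to a single call to SciPy's pearsonr over per-language statistics; the values below are placeholders for illustration, not the numbers in Table 8.

```python
from scipy.stats import pearsonr

# One value per target language (placeholder numbers).
entities_per_doc = [35.2, 28.4, 19.7, 12.1, 8.3]   # average detected entities per document
improvement_rate = [2.1, 1.8, 2.4, 1.6, 2.0]       # % improvement over the baseline

r, p_value = pearsonr(entities_per_doc, improvement_rate)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```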

Qualitative analysis
To further investigate how the M-BoE model improved performance, we took the MLDoc documents that our model classified correctly but M-BERT did not and examined the influential entities, i.e., those assigned the largest attention weights by the M-BoE model. Figure 3 shows examples of these influential entities.

Conclusions
Our proposed M-BoE model is a simple extension of multilingual PLMs: language-independent Wikidata entities are used as input features for zero-shot cross-lingual text classification. Since the Wikidata entity embeddings are shared across languages, and the entities associated with a document are further selected by the attention mechanism, a model trained on these features in one language can efficiently be applied to multiple target languages. We achieved state-of-the-art results on three cross-lingual text classification tasks, which clearly shows the effectiveness of our method. As future work, we plan to evaluate our model on a variety of natural language processing tasks, such as cross-lingual document retrieval. We would also like to investigate whether our method can be combined with other methods, such as using additional textual data in the target language.

Appendix

Details of performance sensitivity to language differences
As described in Section 6.4, we tested the sensitivity of performance to the number of entities detected in the target languages. Specifically, for each target language, we computed (1) the ratio of performance improvement over the baseline and (2) the average number of detected entities per document, and we computed the Pearson correlation coefficient between the two variables on the MLDoc and TED-CLDC datasets. The experimental results (Table 8) do not show any clear trend in the correlation coefficients, indicating that the number of entities detected during inference does not substantially affect the model's performance. For example, even for Chinese on the MLDoc dataset, for which the number of detected entities was the lowest, the performance was consistently higher than that of the baseline, as it was for the other languages.