Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph

In cross-lingual text classification, it is commonly assumed that task-specific training data are available in high-resource source languages, where the task is identical to that of the low-resource target language. However, collecting such training data can be infeasible because of labeling costs, task characteristics, and privacy concerns. This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries. First, we construct a dictionary-based heterogeneous graph (DHG) from bilingual dictionaries. This opens the possibility of using graph neural networks for cross-lingual transfer. The remaining challenge is the heterogeneity of DHG because multiple languages are considered. To address this challenge, we propose the dictionary-based heterogeneous graph neural network (DHGNet), which effectively handles the heterogeneity of DHG with two-step aggregation: word-level and language-level aggregation. Experimental results demonstrate that our method outperforms pretrained models even though it does not have access to large corpora. Furthermore, it performs well even when dictionaries contain many incorrect translations. This robustness allows the use of a wider range of dictionaries, such as automatically constructed and crowdsourced dictionaries, which are convenient for real-world applications.


Introduction
Modern machine learning methods typically require a large amount of data to achieve desirable performance (LeCun et al., 2015;Schmidhuber, 2015;Deng and Liu, 2018). While such a requirement can be feasible for languages such as English (Singh, 2008) (i.e., high-resource language), it is not the case for low-resource languages that lack sufficiently large corpora to build reliable statistical models (Cieri et al., 2016;Haffari et al., 2018). Because there are more than six thousand languages in the world (Nettle, 1998), and only a few of them are high-resource, it is important to enable the use of machine learning in low-resource languages. Cross-lingual text classification (CLTC) is a transfer learning paradigm that aims to incorporate the training data in high-resource languages (i.e., source languages) to solve the classification task in a low-resource language (i.e., target language) more effectively (Karamanolakis et al., 2020;Xu et al., 2016;Bel et al., 2003;Ruder, 2019).
In CLTC, it is common to assume that the source and target tasks are identical and that labeled training data in high-resource languages (i.e., source data) are available. Such assumptions can be restrictive, as the following four examples show. First, source data may not be usable owing to data privacy, e.g., when customers decline to disclose their opinions publicly (Chidlovskii et al., 2016; Liang et al., 2020; Kundu et al., 2020). Second, we may not be able to keep the source data, which is also a motivation of sequential transfer learning (Ruder, 2019). The colossal data used to train BERT (Devlin et al., 2019) are a good example: data of this caliber hardly fit in household PC storage. Third, the target task may be quite specific to the target language, which makes it difficult to find source data in high-resource languages. One example is fake news classification: news content is highly specific to each region and may not be reported in a high-resource language. Fourth, collecting labeled source data usually requires prior knowledge of the source language. For example, it is difficult for people who cannot speak Chinese to reliably collect data in Chinese. For these reasons, it is beneficial to consider cross-lingual transfer that does not require task-specific source data in high-resource languages.
Our goal is to overcome the unavailability of task-specific source data by enabling any low-resource language to utilize high-quality and widely available resources from any high-resource language. To achieve this, we design a method that requires only task-independent word embeddings of a source language (e.g., French) and a word-level bilingual dictionary (e.g., a French-Malayalam dictionary) to solve a classification task in the target language. Word embeddings can be easily obtained from many sources (Pennington et al., 2014; Mikolov et al., 2013a; Bojanowski et al., 2017). Likewise, bilingual dictionaries are available as man-made commercial products, free lexical databases (Kamholz et al., 2014), or results of dictionary induction algorithms (Choe et al., 2020; Lample et al., 2018).
The main challenge of our problem is how to utilize source word embeddings and bilingual dictionaries, which is not straightforward for the following reasons. First, a given word has many candidate translations. Even worse, dictionaries may contain wrong translations, which is often the case for automatically constructed dictionaries. Moreover, the compatibility of the source and target languages can differ depending on the context. Finally, the quality of source embeddings and bilingual dictionaries can be diverse. This paper aims to design a machine learning method that effectively transfers task-independent embeddings of source languages into task-specific embeddings of the target language. Under such circumstances, the method should be able to determine the appropriate transfer from each word in the source languages to the translated words in the target language.
To solve this problem, we convert bilingual dictionaries into a dictionary-based heterogeneous graph (DHG) that represents words and translations as nodes and edges, respectively. This reduces the problem to graph representation learning, which allows us to apply powerful methods such as graph neural networks (GNNs). DHG is heterogeneous because there are many node types (languages) and edge types (language pairs), which are often ignored by GNNs in general (Wang et al., 2019; Chairatanakul et al., 2021). Then, to effectively address the heterogeneity of DHG, we propose dictionary-based heterogeneous graph neural networks (DHGNets), which first aggregate word translations for each language pair (word-level aggregation) and then aggregate the results from all languages (language-level aggregation).
Our contributions can be summarized as follows. First, we propose an alternative solution that enables cross-lingual transfer for text classification using only 1) task-independent word embeddings of high-resource languages and 2) bilingual dictionaries. Second, we propose DHG, which can be utilized for cross-lingual transfer by any learning algorithm operating on graphs, such as GNNs. Third, we propose DHGNets, which effectively use DHG to solve text classification in a low-resource language. Fourth, we provide experimental results that analyze and show the usefulness of the proposed solution, with extensive analysis of the choice of dictionary, high-resource language, word embeddings, and GNNs. The code and resources are available at https://github.com/nutcrtnk/DHGNet.

Related Work
Transfer learning - The goal of transfer learning is to solve a target task with limited target data by incorporating source knowledge from other domains (Pan and Yang, 2009; Ruder, 2019). The challenge is how to make use of source knowledge while avoiding negative transfer, a phenomenon where using source knowledge worsens performance (Rosenstein et al., 2005). Our problem can be categorized as transfer learning where the source knowledge includes no source labeled data but only bilingual dictionaries and task-independent word embeddings of high-resource languages. Recently, the transfer learning problem where no source labeled data are provided has been called source-free domain adaptation and has been studied extensively in computer vision because of its practicality (Vongkulbhisal et al., 2019; Liang et al., 2020; Kundu et al., 2020). However, to the best of our knowledge, the study of this problem for cross-lingual transfer is limited.
Note that our problem is significantly different from zero-shot cross-lingual transfer. In that problem, although target data are not available, it is often assumed that source labeled data and additional target task information are available (Farhadi et al., 2009; Romera-Paredes and Torr, 2015; Phang et al., 2020). Furthermore, most work in NLP assumes that source and target domains share either the same task (Upadhyay et al., 2018; Liu et al., 2019) or the same language (Veeranna et al., 2016; Zhang et al., 2019).
Cross-lingual text classification -According to the transfer learning taxonomy proposed by Ruder (2019), CLTC is the most related problem to ours. However, CLTC assumes that source labeled data are available, and the source and target tasks are identical (Upadhyay et al., 2016;Conneau et al., 2018;Karamanolakis et al., 2020;Xu et al., 2016;Bel et al., 2003). Since the data requirement of CLTC can be restrictive, there exist recent methods for weakly supervised CLTC where target labels are not required (Karamanolakis et al., 2020;Xu et al., 2016;Zhang et al., 2020a). Note that source labeled data are still required for such methods. To enable the use of cross-lingual transfer for more applications, our work explores a different direction where source labeled data are unavailable.

Cross-lingual word embedding (CLWE) - CLWEs are word representations typically learned by mapping the monolingual word embeddings of each language into a shared embedding space, using resources such as aligned corpora and dictionaries (Zhang et al., 2020b). Such representations are highly useful for comparing the meanings of words across languages (Xu et al., 2018; Grave et al., 2019). Learning CLWEs can also improve classification performance in a low-resource language (Duong et al., 2016; Zhang et al., 2020b).
Pretrained model -Pretraining methods have demonstrated their effectiveness in transfer learning for many NLP tasks. Pretraining typically requires a large amount of data, which can be unlabeled data, to effectively learn a good pretrained model for a general NLP task. Examples of methods in this family are BERT (Devlin et al., 2019), CoVe (McCann et al., 2017), ULMFiT (Howard and Ruder, 2018), and USE (Yang et al., 2020).
Bilingual dictionary - A bilingual dictionary maps words from a source language to their translations in a target language. Its usage is found across cross-lingual tasks. Most obviously, in machine translation (Nießen and Ney, 2004; Duan et al., 2020; Zoph et al., 2016), dictionaries provide ground truth for word-level translation. The quality of word embeddings of low-resource languages can also be enhanced with a dictionary as a bridge connecting two languages (Duong et al., 2016; Mikolov et al., 2013b; Artetxe et al., 2018). Cross-lingual named-entity recognition for low-resource languages can also be improved with the aid of dictionaries (Mayhew et al., 2017; Xie et al., 2018).

Heterogeneous graph neural network - GNNs encode each node in a graph into a vector by considering its attributes and the graph structure (Gori et al., 2005; Kipf and Welling, 2016; Veličković et al., 2017). They can be understood as message passing between nodes guided by edges to update the node states (Gilmer et al., 2017; Hamilton et al., 2017). GNNs have been applied to many real-world problems, e.g., knowledge graphs (Schlichtkrull et al., 2018), natural science (Sanchez-Gonzalez et al., 2018), and NLP (Yao et al., 2019). A heterogeneous GNN (HGNN) (Wang et al., 2019) is a type of GNN that considers the types of nodes and edges in a heterogeneous graph, i.e., a graph containing multiple types of nodes or edges.

Problem Formulation
In this section, we define our problem formulation. Let 𝒳 be an input space and 𝒴 be an output space. Without loss of generality, we consider a document classification problem where 𝒳 is a document space and 𝒴 = {1, 2, …, c} is a set of classes, with c the number of classes. In this problem, we have S + 1 languages ℒ = {ℓ₀, ℓ₁, …, ℓ_S}, where ℓ₀ is the target language and ℓ₁, …, ℓ_S are source languages.
To define a word embedding function, let 𝒱_ℓ be a vocabulary space and d_ℓ be the dimensionality of the word embeddings of a language ℓ. Then, 𝒇_ℓ : 𝒱_ℓ → ℝ^{d_ℓ} is the word embedding function for a language ℓ. Next, we define a bilingual dictionary D_{ℓ→ℓ′} : 𝒱_ℓ → 2^{𝒱_{ℓ′}} as a mapping from a known word in a language ℓ to a set of known words in a language ℓ′.
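For concreteness, such a dictionary D_{ℓ→ℓ′} is simply a map from each word to a set of candidate translations; the sketch below uses a small hypothetical French-English fragment (the entries are illustrative, not from the paper's data):

```python
# A bilingual dictionary maps each word in language l to a SET of
# candidate words in language l' (one word can have many translations).
fr_to_en = {
    "chat": {"cat", "chat"},      # ambiguous: the animal vs. a conversation
    "chien": {"dog"},
    "magasin": {"shop", "store"},
}

def translations(dictionary, word):
    """Return the (possibly empty) set of known translations of `word`."""
    return dictionary.get(word, set())
```

Returning the empty set for out-of-dictionary words mirrors the fact that D_{ℓ→ℓ′} is only defined on known words.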
The following data are provided in this problem setting: labeled data in the target language ℓ₀, the word embeddings 𝒇_ℓ of the source language(s), and bilingual dictionaries between the target and source languages. The goal is to learn a classifier h : 𝒳 → 𝒴 that optimizes an evaluation metric of interest, such as accuracy or F₁-measure, with respect to the target distribution.

Figure 1: Overview of DHGNet. DHGNet takes DHG and word embeddings 𝒇_ℓ of source language(s) as inputs to produce word embeddings 𝒇_{ℓ₀} of the target language ℓ₀. Then 𝒇_{ℓ₀} is used by a prediction function g_{ℓ₀} to predict labels and optimize the loss of the target task. The bold and dashed arrows indicate forward and backward passes, respectively. Θ_DHG, the parameters of DHGNet, are trainable. g_{ℓ₀} can be a neural network. Colors indicate languages.

Proposed Method
In this section, we propose a novel method for cross-lingual transfer using source word embeddings and bilingual dictionaries. Figure 1 illustrates an overview of our proposed method.

Dictionary-based Heterogeneous Graph (DHG)
First, we use bilingual dictionaries and the words in the target labeled data to construct a heterogeneous graph. Let 𝒱_{ℓ₀} be the vocabulary set of the target language, 𝒱 = ⋃_{s=0}^{S} 𝒱_{ℓ_s} be the vocabulary set of all languages of interest, and 𝜙 : 𝒱 → ℒ be the word-to-language mapping¹. DHG can be defined as follows.
Definition 1 (Dictionary-based Heterogeneous Graph). A dictionary-based heterogeneous graph is a directed graph 𝒢 = (𝒱, ℰ, ℒ, 𝜙), where 𝒱 is the set of nodes (words), ℰ is the set of edges (translations given by the bilingual dictionaries), ℒ is the set of languages (node types), and 𝜙 is the word-to-language mapping.

An example of DHG can be observed in Figure 1. It is straightforward to see that DHG can be built using words from the target labeled data and their translations. Common words in the target language found in dictionaries can also be added.

¹ For notational simplicity, we assume words do not overlap among languages. In the implementation, identical words in different languages are treated as different nodes in DHG.
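Assembling a DHG from bilingual dictionaries can be sketched in a few lines. The example below is hypothetical (a three-word English/Chinese/Thai fragment); nodes are stored as (language, word) pairs so that identical strings in different languages remain distinct nodes, as the footnote requires:

```python
def build_dhg(dictionaries):
    """Build a dictionary-based heterogeneous graph.

    `dictionaries` maps a language pair (src, tgt) to a bilingual
    dictionary {word: set_of_translations}.  Nodes are (language, word)
    tuples; directed edges point from source words to their translations.
    """
    nodes, edges = set(), set()
    for (src_lang, tgt_lang), d in dictionaries.items():
        for word, trans in d.items():
            nodes.add((src_lang, word))
            for t in trans:
                nodes.add((tgt_lang, t))
                edges.add(((src_lang, word), (tgt_lang, t)))
    return nodes, edges

dicts = {("en", "th"): {"shop": {"ร้าน"}, "store": {"ร้าน"}},
         ("zh", "th"): {"店": {"ร้าน"}}}
nodes, edges = build_dhg(dicts)
```

Here the Thai node ("th", "ร้าน") ends up with three in-neighbors, one per translation edge, which is exactly the neighborhood a GNN would later aggregate over.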

Dictionary-based Heterogeneous Graph Neural Network (DHGNet)
We propose DHGNets to combine the different semantic information of both word embeddings and DHG to obtain d-dimensional target word embeddings 𝒇_{ℓ₀} : 𝒱_{ℓ₀} → ℝ^d. Intuitively, DHGNet works by exploiting the common structure between languages encoded in word embeddings. For example, the word "cat" in English links to many co-occurring words that help in understanding the characteristics of real cats, and this knowledge can be shared with a low-resource language for the classification task. The GNN in DHGNet (Figure 1) is vital for allowing words of different languages to share information. It lets words of a low-resource language inherit useful information from high-resource languages, weighted by an attention mechanism, which will be introduced later.
Given the current target word embeddings 𝒇_{ℓ₀}, the DHG 𝒢, and the source word embeddings 𝒇_ℓ, we update the target word embeddings in two steps: cross-lingual transformation and propagation with a multi-source HGNN.

Cross-lingual Transformation
Pretrained word embeddings of different source languages are usually trained separately. Thus, we must assume that they belong to different spaces and can have different dimensionalities d_ℓ. As the primary step, we transform them into a common space. To achieve that, we map each source word w in a language ℓ to the target word embedding space:

𝒉_w = 𝑾^{(ℓ)} 𝒇_ℓ(w),

where 𝑾^{(ℓ)} ∈ ℝ^{d×d_ℓ} is a trainable cross-lingual linear transformation.
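This transformation is a single matrix multiply per language; a minimal numpy sketch (the dimensions 4 and 6 and the French example are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_fr = 4, 6                       # target dim and hypothetical French dim
W_fr = rng.normal(size=(d, d_fr))    # trainable transformation W^(fr)
f_chat = rng.normal(size=d_fr)       # stand-in for a pretrained FastText vector

h_chat = W_fr @ f_chat               # "chat" now lives in the shared d-dim space
```

In training, W_fr would be a learned parameter updated by backpropagation; one such matrix exists per source language.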

Propagation with Multi-source HGNN
GNNs effectively learn node embeddings by propagating and aggregating node features (embeddings) based on the graph structure. Therefore, we can utilize a GNN to learn target word embeddings by aggregating and synthesizing the embeddings of semantically related source words (neighboring nodes): 𝒇 ← GNN(𝒇, 𝒢). One simple choice is the graph attention network (GAT) (Veličković et al., 2017), which can learn the importance of each node in a graph. Particularly, for a target node v ∈ 𝒱 in any language, a GAT layer performs the following operations:

α_{uv} = softmax_{u ∈ 𝒩(v)} ( LeakyReLU( 𝐚ᵀ [𝑾𝒉_u ‖ 𝑾𝒉_v] ) ),
𝒉̄_v = σ( Σ_{u ∈ 𝒩(v)} α_{uv} 𝑾𝒉_u ),

where u ∈ 𝒱 and α_{uv} ∈ ℝ denote a source node and its attention score, respectively; 𝑾 ∈ ℝ^{d′×d} and 𝐚 ∈ ℝ^{2d′} are the trainable weight matrix and vector of the layer, respectively; 𝒩(v) denotes the in-neighbors of a node v; σ denotes an activation function; and ‖ is the concatenation operation. GATs are usually used with multi-head attention by performing message passing with K independent heads and then concatenating the outputs: 𝒉̄_v = ∥_{k=1}^{K} σ( Σ_{u ∈ 𝒩(v)} α^k_{uv} 𝑾^k 𝒉_u ). Finally, 𝒉̄_v is used to update 𝒉_v. In this paper, we use GELU as the activation function and set the output dimension of each head to d/K. The number of GNN layers is set to 2.
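A single-head version of this attention can be sketched in numpy (dimensions are hypothetical, and tanh stands in for the activation σ; the paper's GELU and multi-head concatenation are omitted for brevity):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_aggregate(h_v, h_neighbors, W, a):
    """Single-head GAT update for one target node v.

    h_v: (d,) target features; h_neighbors: (n, d) in-neighbor features;
    W: (d_out, d) weight matrix; a: (2 * d_out,) attention vector.
    """
    d_out = W.shape[0]
    Wh_v = W @ h_v                    # (d_out,)
    Wh_u = h_neighbors @ W.T          # (n, d_out)
    # attention logit on each edge u -> v: a^T [W h_u || W h_v]
    logits = leaky_relu(Wh_u @ a[:d_out] + Wh_v @ a[d_out:])
    alpha = softmax(logits)           # scores over in-neighbors sum to 1
    return alpha, np.tanh(alpha @ Wh_u)

rng = np.random.default_rng(0)
alpha, h_new = gat_aggregate(rng.normal(size=3),       # node v
                             rng.normal(size=(5, 3)),  # 5 in-neighbors
                             rng.normal(size=(2, 3)),  # W: 3 -> 2 dims
                             rng.normal(size=4))       # a in R^(2*2)
```

The softmax over in-neighbors is what later lets the model downweight wrong translations: a noisy neighbor simply receives a small α.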
Despite the ability of GAT to learn the saliency of nodes, the model lacks the awareness of language differences. This is because GAT is a homogeneous GNN that treats all nodes and edges identically regardless of their types, which may lead to suboptimal performance.
To effectively utilize DHG, we propose a multi-source HGNN consisting of two steps. The first step is to aggregate the relevant translations for each language pair (word-level aggregation); the second is to aggregate the knowledge over all language pairs (language-level aggregation). For example, the Thai word "ร้าน" can be translated to "shop" and "store" in English, and to "店" (shop) and "餐厅" (restaurant) in Chinese. If the task is sentiment analysis of restaurant reviews, word-level aggregation puts more weight on "餐厅" than on "店", while language-level aggregation puts more weight on the Chinese-Thai language pair than on the English-Thai pair because "餐厅" is the meaning most relevant to the task. Formally, a homogeneous bilingual subgraph is defined as 𝒢_{ℓ,ℓ′} = (𝒱_ℓ ∪ 𝒱_{ℓ′}, ℰ_{ℓ,ℓ′}), where either ℓ or ℓ′ is the target language ℓ₀.

Figure 2: Example of aggregation by the multi-source HGNN for the target node "ฉัน" in the Thai language.

Word-level aggregation: For any target node v ∈ 𝒱 in any language, the bilingual-specific node features 𝒉̄^{ℓ,ℓ′}_v are calculated by applying the GAT operation described above to the bilingual subgraph 𝒢_{ℓ,ℓ′}.

Language-level aggregation: Given 𝒉̄^{ℓ,ℓ′}_v from word-level aggregation, let ℓ_v = 𝜙(v) be the language of v. With trainable parameters 𝑾₁, 𝑾₂, and 𝐚₁, the target output for v is calculated as

β_{v,p} = softmax_{p ∈ 𝒫_v} ( 𝐚₁ᵀ σ( 𝑾₁ 𝒉̄^p_v ) ),
𝒉̄_v = 𝑾₂ Σ_{p ∈ 𝒫_v} β_{v,p} 𝒉̄^p_v,

where p denotes a pair of languages, including the pair of the same language, and 𝒫_v is the set of language pairs involving ℓ_v. Figure 2 illustrates an example of aggregation by the multi-source HGNN. Note that 𝑾₁ is shared across language pairs and 𝑾₂ is shared across all languages. Finally, the multi-head attention mechanism can also be applied to the target output 𝒉̄_v. To avoid oversmoothing when training the GNN (Li et al., 2018), we also add a residual connection (Chen et al., 2020) before updating the embedding: 𝒉_v ← 𝒉̄_v + 𝒉_v for each layer.
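The language-level step can be sketched as attention over the per-pair summaries. The form below is a simplified single-head sketch under the assumption that pairs are scored by 𝐚₁ and 𝑾₁ and mixed by 𝑾₂ (dimensions and the language pairs are hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def language_level_aggregate(per_pair_feats, W1, W2, a1):
    """Attention over word-level results, one weight per language pair.

    per_pair_feats: {language_pair: (d,) vector} -- the output of
    word-level (per-bilingual-subgraph) aggregation for one target node.
    W1 and a1 score each pair; W2 mixes the attended summary.
    """
    pairs = sorted(per_pair_feats)
    H = np.stack([per_pair_feats[p] for p in pairs])   # (P, d)
    beta = softmax(np.tanh(H @ W1.T) @ a1)             # one weight per pair
    return dict(zip(pairs, beta)), W2 @ (beta @ H)

rng = np.random.default_rng(0)
feats = {("en", "th"): rng.normal(size=4),
         ("zh", "th"): rng.normal(size=4),
         ("th", "th"): rng.normal(size=4)}   # same-language pair included
beta, h_out = language_level_aggregate(feats,
                                       rng.normal(size=(4, 4)),
                                       rng.normal(size=(4, 4)),
                                       rng.normal(size=4))
```

In the restaurant-review example, a well-trained model would assign the ("zh", "th") pair a larger β than ("en", "th") when the Chinese translation is the most task-relevant.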

End-to-end Optimization
To simultaneously train our DHGNet and a prediction function, we use the target word embeddings 𝒇_{ℓ₀} as the input of the prediction function and backpropagate the loss to update the trainable parameters of DHGNet, as illustrated in Figure 1.
Let us define g_{ℓ₀} as a prediction function that uses the target word embeddings to extract features from the input space 𝒳. Also, we define Θ_DHG as the trainable parameters of DHGNet and 𝓛 as a loss function for training the classifier; for example, 𝓛 can be the average cross-entropy loss over the training data. Then, the gradient information for DHGNet is obtained by the simple chain rule

∂𝓛/∂Θ_DHG = (∂𝓛/∂𝒇_{ℓ₀}) (∂𝒇_{ℓ₀}/∂Θ_DHG).

Therefore, 𝓛 is used to simultaneously train the prediction function g_{ℓ₀} and DHGNet to produce good task-specific word embeddings 𝒇_{ℓ₀} for solving the target task.
It is worth pointing out that in the cross-lingual transformation step, one can learn 𝑾^{(ℓ)} to control the behavior of the embedding function. For example, using contrastive learning (Lazaridou et al., 2015; Joulin et al., 2018) can encourage the model to produce similar outputs for pairs of nodes that are translations of each other.

Experiments
In this section, we conducted experiments to compare our method with baselines and analyze it under different conditions.

Dataset - We conducted experiments on four datasets in five different languages. For the Thai language (th), we used the Truevoice (Thai-T) and Wongnai (Thai-W) datasets in the PyThaiNLP library (Phatthiyaphaibun et al., 2016). We used news articles in the Bengali (bn), Malayalam (ml), and Tamil (ta) languages from IndicNLPSuite (Kakwani et al., 2020). For European languages, we used Twitter sentiment (Mozetič et al., 2016) in the Bosnian (bs) language. In total, we have six different settings. We used the average accuracy and macro-average F₁-measure over five runs as evaluation metrics. We provide the statistics of each setting in Table 1.

Dictionary - We used word2word (Choe et al., 2020). The bilingual dictionaries were automatically constructed from the OpenSubtitles2018 dataset (Lison et al., 2018).

Pretrained word embeddings of DHGNet - We used publicly available word vectors from FastText (Bojanowski et al., 2017). For source languages, we used English (en), Arabic (ar), Chinese (zh), French (fr), Persian (fa), and Spanish (es).

Comparison with baselines
We conducted experiments with multiple baseline methods that can potentially apply to our setting, including a statistical method, task-independent word embeddings, and pretrained models: FastText, RCSLS, USE, mBERT, and XLM-R. We used AWD-LSTM (Merity et al., 2018) as the prediction function. DHGNet_en and DHGNet_multi denote DHGNet with English and with multiple source languages (ar, en, es, fa, fr, zh), respectively. We also included no-DHGNet, an AWD-LSTM without DHGNet, for comparison. For FastText, RCSLS, and USE, we used SVM, random forest, and logistic regression as classification methods and, to save space, report only the result of the one with the best validation performance. For a better comparison, we also used AWD-LSTM as the classifier for FastText, RCSLS, and mBERT, named FastText-LSTM, RCSLS-LSTM, and mBERT-LSTM. All pretrained models are finetuned on the target data. For implementation details and hyperparameter settings, please see Appendix A.
The results are listed in Table 2. We observe that DHGNet outperformed no-DHGNet in most settings, with absolute accuracy gains ranging from 1.18% on Thai-W to 8.90% on Bosnian.
Furthermore, apart from XLM-R, DHGNet also achieved higher performance than the other pretrained baselines, even though it did not have access to any large corpus of the target languages, such as Wikipedia data. Note that XLM-R requires a much higher computational cost (it required 40GB of GPU memory to finetune in our experiments, compared with at most 16GB used by the others) and has different input sources (Wikipedia data in 100 languages and more) compared with DHGNet (source word embeddings and bilingual dictionaries). Figure 3 shows the results with varying training sizes. It can be observed that DHGNet still performs relatively well under high data scarcity. The gap between DHGNet and no-DHGNet widens as the training size decreases. Moreover, on Bosnian, DHGNet_multi consistently outperformed DHGNet_en and the other methods.
The performance of all methods also depends on the data provided to them. To the best of our knowledge, no existing method uses the same input sources as our proposed DHGNets. As a result, it is difficult to provide a fair comparison across all methods due to their different data accessibility. For example, Wikipedia data of the target language were used by all baselines for pretraining or producing word embeddings. On the other hand, DHGNets never observe such data but use bilingual dictionaries to construct DHG. Nevertheless, our experiments show that using bilingual dictionaries and source word embeddings can yield a classifier that is competitive with baselines that access extremely large corpora. Thus, we believe that DHGNet is currently the best solution when parallel corpora cannot be obtained but bilingual dictionaries are available.

Analysis
We conducted further experiments to analyze the effect of (1) the quality of the dictionary, (2) the source language, (3) the source word embeddings, and (4) the choice of graph neural network on the classification performance of DHGNet.

Figure 4: Both DHGNets maintain their performance and beat the baselines even when the dictionary is heavily tampered with.

Quality of Dictionary
In practice, auto-generated dictionaries contain certain types of noise, such as high co-occurrence words and wrong word segmentation, as we found in word2word. However, such noise is difficult to synthesize. To analyze the effect of dictionary quality, we injected noise by randomly adding incorrect translations to worsen the dictionary. For example, "猫" (cat) might be added to the translations of "dog". Random noise should worsen the quality of dictionaries more than practical noise does, because the added words are likely to be highly dissimilar to the correct translations. Figure 4 shows the effect of dictionary quality. Both DHGNets maintained their performance even in the presence of heavy noise and surpassed mBERT. Even at a 50% noise rate, the DHGNets did not suffer from negative transfer, as they still achieved performance superior to that of no-DHGNet.
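The noise-injection procedure described above can be sketched as follows. This is a simplified illustration (the function name, rates, and vocabulary are hypothetical, not the paper's exact protocol):

```python
import random

def add_noise(dictionary, vocab, noise_rate, seed=0):
    """Corrupt a bilingual dictionary by adding random wrong translations.

    With probability `noise_rate`, each entry gains one extra word drawn
    uniformly from `vocab` -- so "dog" might gain "猫" as a 'translation'.
    The original dictionary is left unmodified.
    """
    rng = random.Random(seed)
    noisy = {}
    for word, trans in dictionary.items():
        trans = set(trans)                 # copy, so the input stays intact
        if rng.random() < noise_rate:
            trans.add(rng.choice(vocab))
        noisy[word] = trans
    return noisy

clean = {"dog": {"狗"}, "cat": {"猫"}}
noisy = add_noise(clean, vocab=["狗", "猫", "店", "餐厅"], noise_rate=1.0)
```

Because the correct translations are never removed, the attention mechanism still has the chance to recover them by downweighting the injected words.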
The DHGNets achieved a high level of robustness to dictionary noise owing to their ability to identify the suitable translations of each target word. Translations that benefit the task were given more attention, while neutral and harmful ones were downweighted. Knyazev et al. (2019) showed that the attention mechanism in GNNs can improve robustness to noisy graphs by attending to important parts and avoiding noisy ones. By having more sources, DHGNet_multi outperformed DHGNet_en in most cases.

Choice of Source Languages
We investigated the effect of choosing different source languages. The results are listed in Table 3. We observe that the choice of source language can improve absolute accuracy by up to 3%. There is no single best source language for all settings; English, which is the most resource-rich, gave inferior performance in some cases. Even the worst performance of each setting in Table 3 still outperformed no-DHGNet in Table 2, indicating no negative transfer.

Quality of Source Word Embeddings
To analyze the effect of the quality of source word embeddings, we changed the source word embeddings of DHGNet to GloVe (Pennington et al., 2014), RCSLS, and Wiki2vec (Yamada et al., 2016). We also included a variant that does not use any pretrained word embeddings, replacing them with trainable randomly initialized word vectors (RandInit). Table 4 shows the performance of DHGNet with different source word embeddings. RandInit performed the worst, which indicates the benefit of using pretrained source word embeddings.

Choice of Graph Neural Networks
One core component of DHGNet is the multi-source HGNN described in Section 4.2.2. To analyze its effect in the multi-source scenario, we replaced our proposed multi-source HGNN in DHGNet with other GNNs. We included the homogeneous GNNs GCN (Kipf and Welling, 2016) and GAT (Veličković et al., 2017), and the heterogeneous GNN RGCN (Schlichtkrull et al., 2018). RGCN employs distinct weights for each relation (language pair) with mean aggregation. Note that to use a different GNN to solve the problem, we must still use our proposed DHG (Section 4.1), cross-lingual transformation technique (Section 4.2.1), and end-to-end optimization (Section 4.2.3).

Discussion and Conclusions
We proposed a method based on heterogeneous graph neural networks, called DHGNet, to transfer source word embeddings through bilingual dictionaries. Without task-specific source data, DHGNet demonstrates that it can outperform models pretrained on extremely large corpora. Furthermore, our results revealed that DHGNet can perform well even when dictionaries contain many incorrect translations. Its robustness opens the possibility of using a wider range of dictionaries, such as automatically constructed and crowdsourced dictionaries.

Limitation - Because our method operates at the word level, some words may not be found in bilingual dictionaries even though the dictionaries contain them in a different form. This may occur in languages that have declensions, e.g., Malayalam, Japanese, and Sanskrit. Due to computational limitations, DHGNet can be trained with a DHG of up to 30K target words, 110K source words, and 1M edges for long documents, e.g., news articles, on a single NVIDIA Tesla V100-16GB.