Cross-lingual Text Classification with Heterogeneous Graph Neural Network

Cross-lingual text classification aims to train a classifier on a source language and transfer the knowledge to target languages, which is very useful for low-resource languages. Recent multilingual pretrained language models (mPLM) achieve impressive results on cross-lingual classification tasks, but rarely consider factors beyond semantic similarity, causing performance degradation between some language pairs. In this paper we propose a simple yet effective method that incorporates heterogeneous information within and across languages for cross-lingual text classification using graph convolutional networks (GCN). In particular, we construct a heterogeneous graph by treating documents and words as nodes, and linking nodes with different relations, which include part-of-speech roles, semantic similarity, and document translations. Extensive experiments show that our graph-based method significantly outperforms state-of-the-art models on all tasks, and also achieves consistent performance gains over baselines in low-resource settings where external tools like translators are unavailable.


Introduction
The success of recent deep-learning-based models on text classification relies on the availability of massive labeled data (Conneau et al., 2017; Tian et al., 2020; Guo et al., 2020). However, labeled data are usually unavailable for many languages, and hence researchers have developed the setting where a classifier is trained only on a resource-rich language and applied to target languages without annotated data (Xu et al., 2016; Chen and Qian, 2019; Fei and Li, 2020). The biggest challenge is to bridge the semantic and syntactic gap between languages. Most existing methods explore the semantic similarity among languages, and learn a language-agnostic representation for documents from different languages (Chen et al., 2018; Zhang et al., 2020a). This includes recent state-of-the-art multilingual pretrained language models (mPLM) (Devlin et al., 2019; Conneau and Lample, 2019), which pretrain transformer-based neural networks on large-scale multilingual corpora. The mPLM methods show superior cross-lingual transfer ability on many tasks (Wu and Dredze, 2019). However, they do not explicitly consider the syntactic discrepancy between languages, which may lead to degraded generalization performance on target languages (Ahmad et al., 2019; Hu et al., 2020).
On the other hand, there usually exist sufficient unlabeled target-language documents that come naturally with rich information about the language and the task. However, only a handful of previous studies have taken advantage of these unlabeled data (Wan, 2009; Dong and de Melo, 2019).
To integrate both semantic and syntactic information within and across languages, we propose a graph-based framework named Cross-Lingual Heterogeneous GCN (CLHG). Following TextGCN (Yao et al., 2019), we represent all documents and words as graph nodes, and add different types of information into the graph. We utilize an mPLM to compute the representations of all nodes, and connect document nodes with semantically similar ones to extract the knowledge in the mPLM. Words are connected with documents based on co-occurrence, as in previous works. However, we separate the word-doc edges by the part-of-speech (POS) tags of the words to inject shallow syntactic information into the graph, since POS taggers are among the most widely accessible NLP tools, especially for low-resource languages. In-domain unlabeled documents are added to the graph if available. To further absorb in-domain language alignment knowledge, we utilize machine translation to create translated text nodes. The text classification task is then formalized as node classification in the graph and solved with a heterogeneous version of Graph Convolutional Networks (Kipf and Welling, 2017).

Figure 1: Illustration of our Cross-Lingual Heterogeneous GCN (CLHG) framework. For simplicity, only some POS tags are plotted in this graph. We recommend viewing this figure in color, as different colors indicate different languages and edge types.

Our contributions are summarized as follows: (1) We propose a graph-based framework that easily incorporates heterogeneous information for cross-lingual text classification, and design multiple types of edges to integrate all of this information. To the best of our knowledge, this is the first study to use heterogeneous graph neural networks for cross-lingual classification tasks. (2) We conduct extensive experiments on 15 tasks from 3 different datasets involving 6 language pairs. Results show that our model consistently outperforms state-of-the-art methods on all tasks without any external tool, and achieves further improvements with the help of part-of-speech tags and translations.
In the past few years, graph neural networks (GNN) have attracted wide attention and become increasingly popular in text classification (Yao et al., 2019; Hu et al., 2019; Ding et al., 2020; Zhang et al., 2020b). These existing works mainly focus on monolingual text classification, except for a recent work (Li et al., 2020) that uses meta-learning and graph neural networks for cross-lingual sentiment classification, which nevertheless only uses the GNN as a tool for meta-learning.

Method
In this section, we introduce our CLHG framework, including how to construct the graph and how to solve cross-lingual text classification with a heterogeneous GCN. In general, we first construct a cross-lingual heterogeneous graph based on the corpus and selected features. Next, we encode all the texts with a multilingual pretrained language model and pass the encoded nodes to the heterogeneous GCN, each layer of which performs graph convolution on different subgraphs separated by edge type and then aggregates the information together. Finally, the graph neural network outputs predictions for the doc nodes, which are compared with ground-truth labels during training. Figure 1 shows the overall structure of the framework.

Graph Construction
Inspired by previous works on GNN-based text classification (Yao et al., 2019; Hu et al., 2019), we construct the graph by representing both documents and words from the corpus in both languages as graph nodes, and augment the corpus with unlabeled in-domain documents from the target language. To extract more language alignment information, we further use a publicly available machine translation API 1 to translate the documents in both directions. Two categories of edges are then defined in the graph.
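As a concrete illustration, the node-collection step described above can be sketched as follows. This is not the paper's code: `tokenize` is a hypothetical word-segmentation helper, and the node-id scheme (documents first, then words) is an illustrative implementation choice.

```python
def build_nodes(src_docs, tgt_docs, unlabeled_tgt_docs, tokenize):
    """Collect document and word nodes for the cross-lingual graph.

    `tokenize` is a hypothetical helper mapping a document string to a
    list of words (e.g. whitespace splitting for space-delimited languages).
    """
    # Documents from both languages, plus in-domain unlabeled target docs.
    docs = list(src_docs) + list(tgt_docs) + list(unlabeled_tgt_docs)
    # Build a shared vocabulary over all documents.
    vocab = {}
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab[word] = len(vocab)
    # Assign node ids: documents first, then words.
    doc_ids = {i: i for i in range(len(docs))}
    word_ids = {w: len(docs) + i for w, i in vocab.items()}
    return docs, doc_ids, word_ids
```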
Doc-word Edges. Like TextGCN (Yao et al., 2019), documents and words are connected by their co-occurrences. To inject syntactic information beyond mere co-occurrence, we add part-of-speech (POS) tags to the edges, since different POS roles have different importance in classification tasks: adjectives and adverbs are mostly decisive in sentiment classification, while nouns may play a more significant role in news classification. Therefore, we use POS taggers to tag each sentence and create different types of edges based on the POS roles of the words in the document, which helps the GNN learn different propagation patterns for each POS role.
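The POS-typed edge construction can be sketched as below; the input format (each document as a list of (word, UPOS) pairs) is an assumption for illustration, not the paper's exact data structure.

```python
def pos_typed_edges(tagged_docs):
    """Create doc-word edges typed by the word's POS tag.

    `tagged_docs` is a list of documents, each a list of (word, upos)
    pairs, e.g. produced by an off-the-shelf tagger whose output has been
    mapped to the UD tagset. Returns {pos_tag: {(doc_index, word), ...}},
    i.e. one edge set per edge type.
    """
    edges = {}
    for d, tokens in enumerate(tagged_docs):
        for word, pos in tokens:
            # The same word can connect to a doc under several POS types
            # if it occurs with different roles.
            edges.setdefault(pos, set()).add((d, word))
    return edges
```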
Doc-doc Edges. To add more direct connections between documents, we include two types of document-level edges. Firstly, we link each document to the K documents with the largest cosine similarity, where document embeddings are calculated with the mPLM. Secondly, we connect nodes created by machine translation with their original texts.
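A minimal NumPy sketch of the similarity-edge construction, assuming the mPLM document embeddings are already available as a matrix (`k=3` in our experiments; here it is a parameter):

```python
import numpy as np

def knn_similarity_edges(embeddings, k=3):
    """Link each document to its k most cosine-similar documents.

    `embeddings` is an (n_docs, dim) array of mPLM document vectors.
    Returns a list of directed (i, j) edges.
    """
    # Row-normalize so that dot products equal cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-links
    edges = []
    for i in range(sim.shape[0]):
        for j in np.argsort(-sim[i])[:k]:  # top-k neighbors of doc i
            edges.append((i, int(j)))
    return edges
```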

Heterogeneous Graph Convolution
After building the heterogeneous cross-lingual graph, we first encode all the nodes with the mPLM by directly feeding the text into the model and taking the hidden state of the first token. The encoded node features are fixed during training. Next we apply a heterogeneous graph convolutional network (Hetero-GCN) (Hu et al., 2019) on the graph to calculate higher-order representations of each node with aggregated information.
Heterogeneous GCN applies a traditional GCN on different sub-graphs separated by edge type and aggregates the information into an implicit common space:

H^{(l+1)} = \sigma\Big( \sum_{\tau \in \mathcal{T}} \tilde{A}_\tau H^{(l)}_\tau W^{(l)}_\tau \Big)

where \tilde{A}_\tau is the submatrix of the symmetric normalized adjacency matrix that only contains edges of type \tau, H^{(l)}_\tau is the feature matrix of each node's neighboring nodes of type \tau, and W^{(l)}_\tau is a trainable parameter. \sigma(\cdot) denotes a non-linear activation function, for which we use leaky ReLU. Initially, H^{(0)}_\tau is the node feature calculated by the mPLM.

1 https://cloud.tencent.com/document/api/551/15619
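The per-layer update (a sum of edge-type-specific graph convolutions followed by a leaky-ReLU nonlinearity) can be sketched in NumPy. As a simplification, all edge types here share one feature matrix `H` rather than per-type neighbor features; this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def hetero_gcn_layer(adjs, H, Ws, leak=0.01):
    """One heterogeneous graph convolution step.

    adjs: {edge_type: normalized adjacency submatrix A_tau, (n, n)}
    H:    node feature matrix, (n, d_in)
    Ws:   {edge_type: weight matrix W_tau, (d_in, d_out)}

    Computes LeakyReLU( sum_tau  A_tau @ H @ W_tau ).
    """
    Z = sum(adjs[t] @ H @ Ws[t] for t in adjs)
    return np.where(Z > 0, Z, leak * Z)  # leaky ReLU
```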
Empirically, we use two graph convolution layers to aggregate information from second-order neighbors. A linear transformation is then applied to the document nodes to obtain the predictions.
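Putting the pieces together, the forward pass (two such layers followed by a linear classifier over doc nodes) might look as follows; the parameter layout and shapes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def clhg_forward(adjs, H0, params):
    """Two hetero-GCN layers followed by a linear head on all nodes.

    adjs:   {edge_type: normalized adjacency submatrix, (n, n)}
    H0:     initial mPLM node features, (n, d)
    params: {"W1": {type: W}, "W2": {type: W}, "out": (W_out, b_out)}
    Document-node rows of the returned logits are the predictions.
    """
    def layer(H, Ws):
        Z = sum(adjs[t] @ H @ Ws[t] for t in adjs)
        return np.where(Z > 0, Z, 0.01 * Z)  # leaky ReLU

    H1 = layer(H0, params["W1"])
    H2 = layer(H1, params["W2"])
    W_out, b_out = params["out"]
    return H2 @ W_out + b_out  # per-node class logits
```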

Experiments
We evaluate our framework on three different classification tasks: Amazon Review sentiment classification (Prettenhofer and Stein, 2010), news category classification from XGLUE (Liang et al., 2020), and intent classification on a multilingual spoken language understanding (SLU) dataset (Schuster et al., 2019). More details of each dataset are provided in the appendix. For all tasks, we use only the English samples for training and evaluate on the other 6 languages: German (DE), French (FR), Russian (RU), Spanish (ES), Japanese (JA), and Thai (TH).

Experiment Setting
In all our experiments, we use a two-layer GCN with hidden size 512 and output size 768. Each document is connected to its 3 most similar documents. The model is trained with the AdamW optimizer (Loshchilov and Hutter, 2019), a learning rate of 2 × 10^-5, and batch size 256. We train the GCN for at most 15 epochs and evaluate the model with the best performance on the validation set. XLM-RoBERTa (Conneau et al., 2020) is used to encode all the documents and words; it is finetuned on the English training set of each task for 2 epochs with batch size 32 and learning rate 4 × 10^-5. We set the max length to 128 for intent classification and 512 for the other two tasks. Each experiment is repeated 3 times and the average accuracy is reported. All experiments are conducted on an NVIDIA V100 GPU 2 .
For part-of-speech tagging, we adopt a different tagger for each language 3 and map all the tags to the Universal Dependencies (UD) tagset 4 .
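Such tag normalization typically reduces to a lookup table per tagger. The sketch below shows the idea with a few Penn-Treebank-style tags; the actual taggers and tag inventories vary per language, so this mapping is an illustrative assumption.

```python
# Illustrative mapping from a language-specific tagset (here, a few
# Penn Treebank tags) to UD universal POS tags; real taggers each need
# their own table.
PTB_TO_UD = {"JJ": "ADJ", "NN": "NOUN", "RB": "ADV", "VB": "VERB"}

def to_ud(tags):
    """Map tagger-specific tags to UD UPOS; unknown tags fall back to 'X'."""
    return [PTB_TO_UD.get(t, "X") for t in tags]
```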

Baselines
Our method is compared with different multilingual pretrained models finetuned for each task, including multilingual BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020). Multilingual SLU. CoSDA-ML (Qin et al., 2020) is a data augmentation framework that automatically generates code-switched documents using a bilingual dictionary, applied when finetuning language models on downstream tasks.

Results and Analysis
The results for each dataset are provided in Tables 1, 2 and 3. Our method significantly outperforms state-of-the-art baselines and achieves consistent improvements over the XLM-R model. The largest performance gain is achieved on the multilingual SLU dataset. Unlike the other two, this dataset consists of short texts, so the resulting graph is much cleaner and more suitable for the GCN to model. To verify that the improvement does not come merely from the external data, we conduct another experiment that adds the translated data to the training set and finetunes the baseline XLM-R model on the Amazon Review dataset. The results show only a very slight improvement (0.09% on average), indicating that XLM-R cannot directly benefit from the external training data.
Ablation studies are performed on the Amazon Review dataset to analyze the effectiveness of different graph structures. From the results in Table 4, variant 1, a homogeneous graph with only word-doc edges (the same as TextGCN), performs the worst, while adding more information generally leads to better performance. Comparing variants 4-7, similarity edges prove to be the most important of the added information. They also help the model converge faster and learn better, since they provide a "shortcut" between documents that are highly likely to be in the same category. Variant 7 shows that the unlabeled corpus plays an important role in the EN→JA setting, but is less effective when transferring between similar languages, since unlabeled data inevitably contain some noise and do not provide much help for linguistically closer languages. Variant 4 also shows that POS tags are more helpful for distant language pairs like EN→JA, and our added experiment on EN→TH shows an even greater impact of POS tags (89.71→88.06 when removing POS tags). Additionally, we test a variant without any external tool that requires training resources in the target language: variant 3 does not rely on a POS tagger or translation service, and still outperforms the XLM-R baseline by a large margin. This demonstrates that our method can be adopted for truly low-resource languages that lack a good tagger or translator.

Conclusion
In this study, we propose a novel graph-based method termed CLHG to capture various kinds of information within and across languages for cross-lingual text classification. Extensive experiments illustrate that our framework effectively extracts and integrates heterogeneous information from multilingual corpora, and that these heterogeneous relations can enhance existing models and are instrumental in cross-lingual tasks. There may exist better semantic or syntactic features and combinations of features, which we leave as future work to explore. We also wish to extend our GNN-based framework to other NLP tasks requiring knowledge transfer and adaptation.

A Datasets
Here we introduce the three datasets used in our experiments. A summary of the statistics of the three datasets is provided in Table 5.
Amazon Review 5 This is a multilingual sentiment classification dataset covering 4 languages (English, German, French, Japanese) and three domains (Books, DVD, and Music). The original dataset contains ratings on a 5-point scale. Following previous works (Xu and Yang, 2017; Fei and Li, 2020), we convert the ratings to binary labels with a threshold at 3 points. Since English is not used for testing, we follow previous works and re-construct the training and validation sets by combining the English training and test sets. The new training set contains 3,200 randomly sampled documents. We use the training sets of the target languages for validation. This dataset also provides a large amount of unlabeled data for each language, which is used in our framework.
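The thresholded conversion can be written as a one-liner; note that the treatment of ratings exactly at the threshold (dropped here, returning `None`) is an assumption for illustration, since the text only specifies a threshold of 3 points.

```python
def to_binary(rating, threshold=3.0):
    """Convert a 5-point rating to a binary sentiment label.

    Assumption: reviews exactly at the threshold are treated as neutral
    and discarded (returned as None); above is positive (1), below
    negative (0).
    """
    if rating == threshold:
        return None
    return 1 if rating > threshold else 0
```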
XGLUE News Classification 6 This is a subtask of the recently released cross-lingual benchmark XGLUE. It aims at classifying the category of a news article, covering 10 categories in 5 languages: English, Spanish, French, German, and Russian. This dataset does not contain unlabeled data for the target languages, so we do not add unlabeled documents to our graph.
Multilingual SLU 7 This dataset contains short task-oriented utterances in English, Spanish and Thai across weather, alarm and reminder domains. We evaluate our framework on the intent classification subtask, which has 12 intent types in total. For each target language, we use the original training set as unlabeled data added in the graph.