Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification

Short text classification is a fundamental task in natural language processing. It is hard due to the lack of context information and labeled data in practice. In this paper, we propose a new method called SHINE, based on graph neural networks (GNNs), for short text classification. First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs, which introduce more semantic and syntactic information. Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts. Thus, compared with existing GNN-based methods, SHINE can better exploit interactions between nodes of the same types and capture similarities between short texts. Extensive experiments on various benchmark short text datasets show that SHINE consistently outperforms state-of-the-art methods, especially with fewer labels.


Introduction
Short texts such as tweets, news feeds and web search snippets appear daily in our life (Pang and Lee, 2005; Phan et al., 2008). To understand these short texts, short text classification (STC) is a fundamental task which can be found in many applications such as sentiment analysis (Chen et al., 2019), news classification (Yao et al., 2019) and query intent classification (Wang et al., 2017).
STC is particularly hard in comparison to long text classification due to two key issues. The first is that short texts contain only one or a few sentences, so they lack enough context information and strict syntactic structure to understand the meaning of the text (Tang et al., 2015; Wang et al., 2017). For example, it is hard to get the meaning of "Birthday girl is an amusing ride" without knowing that "Birthday girl" is a 2001 movie. A harder case is to understand a web search snippet such as "how much Tesla", which usually contains neither word order nor function words (Phan et al., 2008). In addition, real STC tasks usually have only a limited number of labeled data compared to the abundant unlabeled short texts emerging every day. Therefore, auxiliary knowledge is required to understand short texts; examples include concepts that can be found in common-sense knowledge graphs (Wang et al., 2017; Chen et al., 2019), latent topics extracted from the short text dataset, and entities residing in knowledge graphs. However, simply enriching auxiliary knowledge cannot solve the shortage of labeled data, which is another key issue commonly faced by real STC tasks (Pang and Lee, 2005; Phan et al., 2008). Yet the popularly used deep models require large-scale labeled data to train well (Kim, 2014; Liu et al., 2016).

Code is available at https://github.com/tata1661/SHINE-EMNLP21.
Currently, graph neural networks (GNNs) designed for STC obtain the state-of-the-art performance (Ye et al., 2020). They both take STC as a node classification problem on a graph with mixed nodes of different types: HGAT builds a corpus-level graph modeling latent topics, entities and documents, while STGCN (Ye et al., 2020) operates on a corpus-level graph of latent topics, documents and words. In both works, each document is connected to its nodes of a different type such as entities and latent topics, but not to other documents. However, they do not fully exploit interactions between nodes of the same type. They also fail to capture the similarities between short documents, which are both useful for understanding short texts (Zhu et al., 2003; Kenter and De Rijke, 2015; Wang et al., 2017) and important for propagating few labels on graphs (Kipf and Welling, 2016). Besides, both works have large parameter sizes: HGAT is a GNN with dual-level attention, and STGCN (Ye et al., 2020) merges the node representations with word embeddings obtained via a pretrained BERT (Devlin et al., 2019) using a bidirectional LSTM (Liu et al., 2016).
To address the aforementioned problems, we propose a novel HIerarchical heterogeNEous graph representation learning method for STC called SHINE, which is able to fully exploit interactions between nodes of the same types and capture similarity between short texts. SHINE operates on a hierarchically organized heterogeneous corpus-level graph, which consists of the following graphs at different levels: (i) word-level component graphs, which model interactions between words, part-of-speech (POS) tags and entities that can be easily extracted and carry additional semantic and syntactic information to compensate for the lack of context information; and (ii) a short document graph, which is dynamically learned and optimized to encode similarities between short documents and thus allows more effective label propagation among connected similar short documents. We conduct extensive experiments on a number of benchmark STC datasets including news, tweets, document titles and short reviews. Results show that the proposed SHINE consistently outperforms the state-of-the-art with a much smaller parameter size.

Text Classification
Text classification assigns predefined labels to documents of variable lengths, which may consist of a single sentence or multiple sentences. Traditional methods adopt a two-step strategy: first extract human-designed features such as bag-of-words (Blei et al., 2003) and term frequency-inverse document frequency (TF-IDF) (Aggarwal and Zhai, 2012) from documents, then learn classifiers such as support vector machines (SVMs) (Cortes and Vapnik, 1995). Deep neural networks such as convolutional neural networks (CNNs) (Kim, 2014) and long short-term memory (LSTM) (Liu et al., 2016) can directly obtain expressive representations from raw texts and conduct classification in an end-to-end manner.
Recently, graph neural networks (GNNs) (Defferrard et al., 2016; Kipf and Welling, 2016) have obtained state-of-the-art performance on text classification. They can be divided into two types. The first type constructs document-level graphs, where each document is modeled as a graph of word nodes, and formulates text classification as a whole-graph classification problem (Defferrard et al., 2016). Examples are TLGNN, TextING and HyperGAT (Ding et al., 2020), which establish word-word edges differently. In particular, some methods (Chen et al., 2020) propose to estimate the graph structure of the document-level graphs during learning. However, if only a few documents are labeled, these GNNs cannot work due to the lack of labeled graphs.
As is known, GNNs such as the graph convolutional network (GCN) (Kipf and Welling, 2016) can conduct semi-supervised learning to solve the node classification task on a graph where only a small number of nodes are labeled. Therefore, the second type of GNNs instead operates on a heterogeneous corpus-level graph which takes both texts and words as nodes, and classifies unlabeled texts by node classification. Examples include TextGCN (Yao et al., 2019), TensorGCN, HeteGCN (Ragesh et al., 2021) and TG-Transformer, with different strategies to construct and handle heterogeneous nodes and edges. However, these methods cannot work well for short texts of limited length.

Short Text Classification (STC)
Short text classification (STC) is particularly challenging (Aggarwal and Zhai, 2012). Due to their limited length, short texts lack the context information and strict syntactic structure that are vital to text understanding (Wang et al., 2017). Therefore, methods tailored for STC strive to incorporate various auxiliary information to enrich short text representations. Popularly used examples are concepts existing in external knowledge bases such as Probase (Wang et al., 2017; Chen et al., 2019) and latent topics discovered in the corpus (Zeng et al., 2018). However, simply enriching semantic information cannot compensate for the shortage of labeled data, which is a common problem for real short texts such as queries and online reviews (Pang and Lee, 2005; Phan et al., 2008). Thus, GNN-based methods which perform node classification for semi-supervised STC are utilized. HGAT applies a GNN with dual-level attention to forward messages on a corpus-level graph modeling topics, entities and documents jointly, where the entities are words linked to knowledge graphs. STGCN (Ye et al., 2020) operates on a corpus-level graph of topics, documents and words, and merges the node representations with word embeddings obtained via a pretrained BERT (Devlin et al., 2019) using a bidirectional LSTM (Liu et al., 2016). Currently, the state-of-the-art method on STC is HGAT (Yang et al., 2021).

Proposed Method
As mentioned in Section 2.2, GNN-based methods, i.e., HGAT and STGCN, can classify short texts, with HGAT performing better. However, both works build a graph with mixed nodes of different types without fully exploiting interactions between nodes of the same type. Besides, they fail to capture the similarities between short documents, which can be important for propagating few labels on graphs. Here, we present the proposed SHINE, which addresses the above limitations and is thus able to better compensate for the shortage of context information and labeled data in STC.
Given a short text dataset S containing short documents, we model S as a hierarchically organized heterogeneous graph consisting of: (i) word-level component graphs, which model word-level semantic and syntactic information to compensate for the lack of context information; and (ii) a short document graph, which is dynamically learned via hierarchically pooling over the word-level component graphs, such that the limited label information can be effectively propagated among similar short texts. A high-level illustration of SHINE is shown in Figure 1.
In the sequel, vectors are denoted by lowercase boldface and matrices by uppercase boldface. For a vector x, [x]_i denotes the ith element of x. For a matrix X, x_i denotes its ith row and [X]_ij denotes its (i, j)th entry. For a set S, |S| denotes the number of elements in S.

Word-Level Component Graphs
To compensate for the lack of context information and syntactic structure in short documents, we leverage various word-level components which can bring in more syntactic and semantic information.
Particularly, we consider the following three types of word-level components τ ∈ {w, p, e} in this paper: (i) word (w), which makes up short documents and carries semantic meaning; (ii) POS tag (p), which marks the syntactic role, such as noun or verb, of each word in the short text and is helpful for discriminating ambiguous words; and (iii) entity (e), which corresponds to words that can be found in auxiliary knowledge bases, such that additional knowledge can be incorporated. SHINE can easily be extended with other components, such as adding a topic graph at the first level. We use these three word-level components as they are well-known, easy to obtain at a low cost, and already surpass the state-of-the-art HGAT, which uses topics.
We first provide a general strategy to obtain node embeddings from different types of wordlevel component graphs, then describe in detail how to construct these graphs via common natural language processing techniques including tokenization, entity linking and POS tagging. In this way, we can fully exploit interactions between nodes of the same type.

Node Embedding Learning
Denote the word-level component graph of type τ as G_τ = {V_τ, A_τ}, where V_τ is the node set and A_τ ∈ R^{|V_τ|×|V_τ|} is the adjacency matrix. Each node v_τ^i ∈ V_τ is provided with a node feature x_τ^i ∈ R^{d_τ}. For simplicity, the node features are collectively denoted X_τ ∈ R^{|V_τ|×d_τ}, with the ith row corresponding to x_τ^i. These G_τ's capture the pairwise relationships between nodes of the same type, without being influenced by other types.
Provided with G_τ and X_τ, we use the classic 2-layer graph convolutional network (GCN) (Kipf and Welling, 2016) to obtain node embeddings H_τ. Formally, H_τ is updated as

H_τ = Ã_τ ReLU(Ã_τ X_τ W_τ^1) W_τ^2,   (1)

where Ã_τ = D_τ^{-1/2}(A_τ + I)D_τ^{-1/2} is the symmetrically normalized adjacency matrix with self-loops, D_τ is the corresponding degree matrix, and W_τ^1, W_τ^2 are trainable parameters.
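For concreteness, the classic 2-layer GCN update of (1) can be sketched as follows. This is a minimal NumPy illustration with toy data, not the authors' released implementation; the graph and weight matrices here are arbitrary.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalize an adjacency matrix with self-loops:
    A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_two_layer(A, X, W1, W2):
    """Classic 2-layer GCN (Kipf & Welling): H = A_hat ReLU(A_hat X W1) W2."""
    A_hat = normalize_adj(A)
    H1 = np.maximum(A_hat @ X @ W1, 0.0)  # ReLU after the first layer
    return A_hat @ H1 @ W2

# toy component graph: 4 nodes on a path, one-hot node features
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)
H = gcn_two_layer(A, X, rng.standard_normal((4, 8)), rng.standard_normal((8, 3)))
```

The output H has one embedding row per node, with dimensionality set by the second weight matrix.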

Graph Construction
Next, we present the details of how to construct each G τ from S.
Figure 1: A high-level illustration of (a) how we construct a heterogeneous corpus-level graph from a short text dataset using well-known natural language processing techniques; and (b) the framework of the proposed SHINE, which hierarchically pools over word-level component graphs to obtain a short document graph on which node classification is conducted to classify the unlabeled nodes. SHINE is trained end-to-end on the complete two-level graph with respect to the classification loss. The plotted examples of short texts are taken from the movie review (MR) dataset (Pang and Lee, 2005).

Word Graph G_w. We construct a word graph G_w = {V_w, A_w} where word nodes are connected based on local co-occurrence relationships; other types of relationships such as syntactic dependencies can also be used. Following (Yao et al., 2019), we compute edge weights as the positive point-wise mutual information (PMI) of word pairs within a sliding window over the corpus. We initialize the node feature x_w^i ∈ R^{|V_w|} of each v_w^i ∈ V_w as a one-hot vector. Once learned by (1), H_w encodes the topological structure of G_w, which is specific to S. We can also leverage generic semantic information by concatenating H_w with pretrained word embeddings Ĥ_w extracted from a large text corpus such as Wikidata (Vrandečić and Krötzsch, 2014).
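The PMI-based edge weighting can be sketched as follows. This is a simplified illustration assuming TextGCN-style positive PMI over sliding windows (window size 5 in the experiments); the whitespace tokenization and window handling are our own simplifications.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window=5):
    """Estimate word-word PMI from sliding windows over the corpus;
    keep only pairs with positive PMI as weighted edges."""
    windows = []
    for doc in docs:
        toks = doc.split()
        if len(toks) <= window:
            windows.append(toks)
        else:
            windows += [toks[i:i + window] for i in range(len(toks) - window + 1)]
    n = len(windows)
    word_cnt, pair_cnt = Counter(), Counter()
    for w in windows:
        uniq = set(w)
        word_cnt.update(uniq)
        pair_cnt.update(frozenset(p) for p in combinations(sorted(uniq), 2))
    edges = {}
    for pair, c in pair_cnt.items():
        a, b = tuple(pair)
        # PMI = log p(a, b) / (p(a) p(b)), probabilities estimated over windows
        pmi = math.log((c / n) / ((word_cnt[a] / n) * (word_cnt[b] / n)))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges

edges = pmi_edges(["the movie is an amusing ride", "the movie is a bore"], window=5)
```

Only positively associated word pairs receive an edge, which keeps A_w sparse.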
POS Tag Graph G_p. We use the default POS tag set of NLTK to obtain the POS tag of each word of each short text in S, which forms the POS tag node set V_p. Similar to G_w, we construct a co-occurrence graph with PMI-based edge weights, where the inputs are the POS tags of all words. We again initialize each node feature x_p^i ∈ R^{|V_p|} as a one-hot vector.
Entity Graph G_e. We obtain the entity node set V_e by recognizing entities present in the NELL knowledge base (Carlson et al., 2010). In contrast to words and POS tags, which are abundant in the documents, the number of entities is much smaller. Most short documents contain only one entity, which makes it infeasible to calculate co-occurrence statistics between entities. Instead, we first learn the entity feature x_e^i ∈ R^{d_e} of each v_e^i ∈ V_e from NELL, using the classic knowledge graph embedding method TransE (Bordes et al., 2013). Then, we measure the cosine similarity c(v_e^i, v_e^j) between each entity pair v_e^i, v_e^j ∈ V_e and set [A_e]_ij = max(c(x_e^i, x_e^j), 0).
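The cosine-similarity adjacency for the entity graph can be sketched as follows. This is an illustrative sketch in which the TransE-learned entity embeddings are replaced by toy vectors.

```python
import numpy as np

def entity_adjacency(E):
    """Build A_e from entity embeddings: pairwise cosine similarity,
    clipped at zero so only positively similar entities are linked."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    U = E / np.clip(norms, 1e-12, None)   # unit-normalize each embedding
    A = np.maximum(U @ U.T, 0.0)          # cosine similarity, negatives dropped
    np.fill_diagonal(A, 0.0)              # no self-loops; GCN adds them later
    return A

# toy entity embeddings standing in for TransE outputs
E = np.array([[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]])
A_e = entity_adjacency(E)
```

Here the first two entities are positively correlated while the third points the opposite way, so its similarities are clipped to zero.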

Short Document Graph
As discussed in Section 2.1, the reason why GNN-based methods, which cast short text classification as a node classification task, can deal with few labels is their use of an adjacency matrix that models the similarities between documents. However, STGCN and HGAT do not consider such similarities between short documents.
Here, to effectively propagate the limited label information, we dynamically learn the short document graph G_s = {V_s, A_s} based on embeddings pooled over the word-level component graphs, so as to encode the similarity between short documents, where v_s^i ∈ V_s corresponds to one short document in S and A_s is the learned adjacency matrix. As shown in Figure 1, we propose to obtain G_s via hierarchically pooling over the word-level component graphs; hence G_s is dynamically learned and optimized during training. The learned G_s then facilitates effective label propagation among connected similar short texts.

Hierarchical Pooling over G τ s
In this section, we propose to learn A_s via text-specific hierarchical pooling over the multiple word-level component graphs (G_τ's).
With H_τ obtained from (1), we represent each v_s^i ∈ V_s by pooling over the node embeddings of each G_τ. Each s_τ^i summarizes the ith document from the perspective of G_τ and is computed as follows:

• When τ = w or p:

s_τ^i = u((t_τ^i)^⊤ H_τ),   (2)

where superscript ⊤ denotes the transpose operation, u(x) = x/||x||_2 normalizes x to unit norm, and t_τ^i ∈ R^{|V_τ|} weights the nodes of G_τ that occur in the ith document.

• When τ = e: s_e^i is obtained analogously by pooling the embeddings of the entities contained in the ith document and normalizing by u.

The short document feature is then

x_s^i = s_w^i ⊕ s_p^i ⊕ s_e^i,   (3)

where ⊕ means concatenating vectors along the last dimension. Please note that concatenation is just an instantiation which already obtains good performance; it can be replaced by a more complex aggregation function such as a weighted average or an LSTM.
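The hierarchical pooling step can be sketched as follows. This is a sketch under the assumption that each per-document weight vector t_τ selects and weights the nodes occurring in that document (e.g., TF-IDF-style weights for words); all names and toy values here are ours.

```python
import numpy as np

def unit(x):
    # normalize to unit L2 norm; guard against an all-zero vector
    return x / max(np.linalg.norm(x), 1e-12)

def pool_document(weights_by_type, H_by_type):
    """Represent one short document: pool each component graph's node
    embeddings H_tau with document-specific weights t_tau, normalize each
    pooled vector to unit norm, then concatenate across component types."""
    parts = [unit(t @ H) for t, H in zip(weights_by_type, H_by_type)]
    return np.concatenate(parts)

# toy setup: 3 word nodes (dim 4), 2 POS-tag nodes (dim 4), 2 entities (dim 2)
H_w = np.arange(12, dtype=float).reshape(3, 4) + 1
H_p = np.ones((2, 4))
H_e = np.array([[1.0, 0.0], [0.0, 1.0]])
t_w = np.array([1.0, 0.0, 2.0])   # hypothetical TF-IDF-style word weights
t_p = np.array([1.0, 1.0])
t_e = np.array([0.0, 1.0])        # the document mentions only the second entity
x_s = pool_document([t_w, t_p, t_e], [H_w, H_p, H_e])
```

Because each of the three pooled parts is unit-normalized, the concatenated document feature always has norm √3, which is what makes a fixed similarity threshold on dot products meaningful later.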

Dynamic Graph Learning
Now, we obtain A_s on the fly using the learned short document features x_s^i:

[A_s]_ij = x_s^i (x_s^j)^⊤ if x_s^i (x_s^j)^⊤ ≥ δ_s, and [A_s]_ij = 0 otherwise,   (4)

where δ_s is a threshold used to sparsify A_s such that short documents are connected only if they are similar enough viewed from the perspective of the G_τ's. Note that the resultant G_s changes dynamically along the optimization process, where H_τ, x_s^i and A_s are all optimized and improved.
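The thresholded similarity graph of (4) can be sketched as follows. This is a simplified sketch assuming the adjacency keeps the dot-product value itself above the threshold; since each document feature concatenates three unit-norm parts, dot products lie in [-3, 3], consistent with a threshold such as δ_s = 2.5.

```python
import numpy as np

def document_adjacency(X_s, delta=2.5):
    """Build A_s on the fly: pairwise dot products of document features,
    kept only when at least delta, so that only sufficiently similar
    short documents are connected for label propagation."""
    S = X_s @ X_s.T
    A = np.where(S >= delta, S, 0.0)
    np.fill_diagonal(A, 0.0)  # drop trivial self-similarity
    return A

# toy features: rows 0 and 1 are identical, row 2 is orthogonal to both
X_s = np.array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]])
A_s = document_adjacency(X_s, delta=2.5)
```

Identical documents get the maximal weight of 3, while dissimilar pairs fall below the threshold and stay disconnected.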
Upon this G_s, we propagate label information among similar short documents via a 2-layer GCN. Let X_s collectively record the short document features, with x_s^i on the ith row. The class predictions of all short documents in S with respect to C classes are obtained as

Ŷ = softmax(Ã_s ReLU(Ã_s X_s W_s^1) W_s^2),   (5)

where the softmax is applied to each row, Ã_s is the normalized A_s with self-loops as in (1), and W_s^1 and W_s^2 are trainable parameters.
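The row-wise softmax and the labeled-only cross-entropy that follows it can be sketched as follows. This is an illustrative sketch with toy logits; in SHINE the logits come from the 2-layer GCN over G_s.

```python
import numpy as np

def row_softmax(Z):
    """Row-wise softmax over the class logits of every short document."""
    Z = Z - Z.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def masked_cross_entropy(Y_hat, y_true, labeled_idx):
    """Cross-entropy averaged over labeled documents only; unlabeled nodes
    still shape the predictions through propagation but add no loss term."""
    eps = 1e-12
    return float(-np.log(Y_hat[labeled_idx, y_true[labeled_idx]] + eps).mean())

logits = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
Y_hat = row_softmax(logits)
# only the first two documents are labeled; the third is a test document
loss = masked_cross_entropy(Y_hat, np.array([0, 1, 0]), np.array([0, 1]))
```

Restricting the loss to labeled indices is what makes the setup semi-supervised: gradients flow to all nodes through the shared graph, but only labeled nodes supply supervision.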
We train the complete model by minimizing the cross-entropy loss in an end-to-end manner:

L = -Σ_{i ∈ I_l} (y_s^i)^⊤ log ŷ^i,   (6)

where I_l records the indices of the labeled short documents, ŷ^i is the ith row of Ŷ, and y_s^i ∈ R^C is a one-hot vector whose only nonzero entry marks the ground-truth class c ∈ {1, . . . , C}. Since all components are jointly optimized with respect to this single objective, the different types of graphs can influence each other. During learning, the node embeddings of G_τ for all τ ∈ {w, p, e, s} and A_s are all updated. The complete procedure of SHINE is shown in Algorithm 1.

Experiments

All results are averaged over five runs and are obtained on a PC with 32GB memory, an Intel-i8 CPU, and a 32GB NVIDIA Tesla V100 GPU.

Datasets
We perform experiments on a variety of publicly accessible benchmark short text datasets (Table 1): (i) Ohsumed: a subset of the bibliographic Ohsumed dataset (Hersh et al., 1994), where the title is taken as the short text to classify.
(ii) Twitter: a sentiment dataset in which each tweet expresses a positive or negative attitude towards some content.
(iii) MR: a movie review dataset for sentiment analysis (Pang and Lee, 2005).
(iv) Snippets: a dataset of web search snippets returned by Google Search (Phan et al., 2008).
(v) TagMyNews: a dataset of English news titles collected from Really Simple Syndication (RSS) feeds.
We tokenize each sentence and remove stop words and low-frequency words appearing fewer than five times in the corpus, as suggested in (Yao et al., 2019).
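The preprocessing pipeline can be sketched as follows. This is a simplified sketch with whitespace tokenization and a caller-supplied stop-word list; the actual tokenizer and stop-word list used in the experiments are not specified here.

```python
from collections import Counter

def preprocess(docs, stop_words, min_freq=5):
    """Tokenize on whitespace, drop stop words, then drop words that appear
    fewer than min_freq times across the whole corpus."""
    tokenized = [[w for w in d.lower().split() if w not in stop_words]
                 for d in docs]
    freq = Counter(w for toks in tokenized for w in toks)
    return [[w for w in toks if freq[w] >= min_freq] for toks in tokenized]

docs = ["A good movie"] * 5 + ["a rare gem"]
processed = preprocess(docs, stop_words={"a"}, min_freq=5)
```

Frequency filtering is applied corpus-wide, so a word surviving in one document survives in all of them.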
Following prior work, we randomly sample 40 labeled short documents from each class, where half of them form the training set and the other half forms the validation set for hyperparameter tuning. The remaining short documents are taken as the test set and are unlabeled during training.

Compared Methods
The proposed SHINE is compared with the following methods.
• Group (A). Two-step feature extraction and classification methods include (i) TF-IDF+SVM and (ii) LDA+SVM (Cortes and Vapnik, 1995), which use a support vector machine to classify documents represented by TF-IDF features and LDA features respectively; and (iii) PTE (Tang et al., 2015), which learns a linear classifier upon documents represented as the average of word embeddings pretrained from bipartite word-word, word-document and word-label graphs.
• Group (B). BERT (Devlin et al., 2019), which is pretrained on a large corpus and fine-tuned together with a linear classifier for the short text classification task. Each document is represented as the averaged word embeddings (denoted as -avg) or the embedding of the CLS token (denoted as -CLS).

• Group (C). Deep text classification methods include (i) CNN (Kim, 2014) and (ii) LSTM (Liu et al., 2016), where the input word embeddings are either randomly initialized (denoted as -rand) or pretrained from a large text corpus (denoted as -pre); GNNs which perform graph classification on document-level graphs, including (iii) TLGNN, (iv) TextING, and (v) HyperGAT (Ding et al., 2020); and GNNs which perform node classification on corpus-level graphs, including (vi) TextGCN (Yao et al., 2019), (vii) TensorGCN, (viii) HeteGCN (Ragesh et al., 2021) and (ix) TG-Transformer.

• Group (D). GNN-based methods designed for STC include HGAT, which applies a GNN with dual-level attention, and STGCN (Ye et al., 2020), which merges the node embeddings learned by a GNN with word embeddings produced by a pretrained BERT.

Table 2: Test performance (%) measured on short text datasets. The best results (according to the pairwise t-test with 95% confidence) are highlighted in bold. The second best results are marked in italic. The last row records the relative improvement (%) of SHINE over the second best result.
For these baseline methods, we either report the results published in previous research (Yang et al., 2021) or run the public code provided by the authors. For fairness, we use the public 300-dimensional GloVe word embeddings (Pennington et al., 2014) in all methods that require pretrained word embeddings.
Hyperparameter Setting. For all methods, we tune hyperparameters on the validation set via grid search. For SHINE, we set the entity embedding dimension d_e as 100. For all datasets, we set the sliding window size of PMI as 5 for both G_w and G_p, set the embedding size of all GCN layers used in SHINE as 200, and set the threshold δ_s for G_s as 2.5. We implement SHINE in PyTorch and train the model for a maximum of 1000 epochs using Adam (Kingma and Ba, 2014) with learning rate 10^-3. We early stop training if the validation loss does not decrease for 10 consecutive epochs. The dropout rate is set as 0.5. The GloVe embeddings are available at http://nlp.stanford.edu/data/glove.6B.zip.

Evaluation Metrics. We evaluate the classification performance using test accuracy (denoted as ACC) and the macro-averaged F1 score (denoted as F1), following (Tang et al., 2015; Yang et al., 2021).
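The macro-averaged F1 score can be sketched as follows. This is an illustrative from-scratch version; in practice a library implementation such as scikit-learn's `f1_score(average='macro')` would typically be used.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: compute per-class F1 one-vs-rest, then average
    with equal weight per class, so rare classes count as much as
    frequent ones (unlike accuracy or micro-F1)."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp > 0 else 0.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0)
    return float(np.mean(f1s))

score = macro_f1(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), n_classes=2)
```

Here class 0 has F1 = 2/3 and class 1 has F1 = 4/5, so the macro average is 11/15 even though the two classes are equally frequent.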

Benchmark Comparison
Performance Comparison. Table 2 shows the performance. As can be seen, GNN-based methods in group (D) obtain better classification results in general, and the proposed SHINE consistently obtains the state-of-the-art test accuracy and macro-F1 score. This can be attributed to the effective fusion of semantic and syntactic information and to the modeling of the short document graph.
In addition, if we order the datasets by increasing average text length (i.e., Twitter, TagMyNews, Ohsumed, MR and Snippets), we find that SHINE generally obtains a larger relative improvement over the second best method on shorter documents, as shown in the last row of Table 2. This validates the efficacy of label propagation in SHINE, which can be attributed to the dynamic learning of the short document graph. As shown, GNNs that perform node classification on the corpus-level graph outperform GNNs that perform graph classification on short text datasets with few labeled short documents. Another common observation is that incorporating pretrained word embeddings consistently improves accuracy, as can be seen by comparing CNN-pre to CNN-rand, LSTM-pre to LSTM-rand, and BiLSTM-pre to BiLSTM-rand. CNN and LSTM can even perform worse than the traditional methods in group (A), e.g., on Ohsumed. The fine-tuned BERT encodes generic semantic information from a large corpus, but it cannot beat SHINE, which is particularly designed to handle short text datasets.
Model Size Comparison. Table 3 presents the parameter sizes of SHINE and the two most relevant GNN-based methods, i.e., HGAT and STGCN. As can be seen, SHINE has a much smaller parameter size. The reason is that, instead of organizing different types of nodes in the same graph like HGAT and STGCN, SHINE constructs a separate graph for each type of node and pools from them to represent short documents. Thus, the graphs used in SHINE can be much smaller than those of HGAT and STGCN, which leads to a reduction in the number of parameters. We also observe that SHINE takes less training time per epoch.

Ablation Study
We compare different variants of SHINE to evaluate the contribution of each part: (i) w/o G_w, w/o G_p and w/o G_e: remove a single G_τ from SHINE while keeping the other parts unchanged.
(ii) w/o pre: do not concatenate H_w with the pretrained word embeddings Ĥ_w.
(iii) w/ pre X_w: initialize the node features X_w of G_w directly as the pretrained word embeddings Ĥ_w.
(iv) w/o word GNN: fix H_τ as the input node features X_τ for each G_τ, so that the node embeddings of G_s are simply weighted averages of the corresponding word-level features.
(v) w/o doc GNN: use label propagation (Zhou et al., 2004) to directly obtain class predictions using A_s learned by (4) and x_s^i learned by (3).
(vi) w/ a single GNN: run a single GNN on a heterogeneous corpus-level graph containing the same set of words, entities, POS tags and documents as ours; we modify TextGCN (Yao et al., 2019) to handle this case.

The results show that concatenating pretrained word embeddings slightly improves the performance. However, "w/ pre X_w" is worse than SHINE. This shows the benefit of separating corpus-specific and general semantic information: G_w with one-hot initialized features captures the corpus-specific topology among words, while pretrained word embeddings bring in general semantic information extracted from an external large corpus. The performance gain of SHINE over "w/o word GNN" validates the necessity of (i) passing messages among nodes of the same type and updating node embeddings accordingly and (ii) updating the G_τ's with respect to the STC task, while the improvement of SHINE upon "w/o doc GNN" shows that refining short document embeddings by a GNN is useful. Finally, SHINE defeats "w/ a single GNN", which uses the same amount of information; this reveals that SHINE's advantage comes from its model design. Figure 2 further plots the influence of incrementally adding more word-level component graphs and the short document graph, which again validates the effectiveness of the SHINE framework.

Model Sensitivity
We further examine the impact of the labeled training data proportion for GNN-based methods that perform node classification on the corpus-level graph, including TextGCN, TensorGCN, HGAT, STGCN and SHINE. Figure 3(a) plots the results. As shown, SHINE consistently outperforms the other methods, and the performance gap increases with fewer labeled training data. Figure 3(b) plots the impact of the threshold δ_s in (4). At first, performance increases with a larger δ_s, which leads to a sparser G_s where only sufficiently similar short documents are connected to propagate information. However, when δ_s is too large, G_s loses its functionality and reduces to "w/o G_τ" in Table 4. Finally, recall that we set the embedding size of all GCN layers used in SHINE equally. Figure 3(c) plots the effect of varying this embedding size. As observed, a small embedding size cannot capture enough information, while an overly large embedding size may not improve the performance and is more computationally costly.

Conclusion
In this paper, we propose SHINE, a novel hierarchical heterogeneous graph representation learning method for short text classification. It is particularly useful for compensating for the lack of context information and for propagating the limited number of labels efficiently. Specifically, SHINE effectively learns from a hierarchical graph modeling different perspectives of the short text dataset: word-level component graphs are used to understand short texts from the semantic and syntactic perspectives, and the dynamically learned short document graph allows efficient and effective label propagation among similar short documents. Extensive experiments show that SHINE consistently outperforms the other methods.