HiTIN: Hierarchy-aware Tree Isomorphism Network for Hierarchical Text Classification

Hierarchical text classification (HTC) is a challenging subtask of multi-label classification as the labels form a complex hierarchical structure. Existing dual-encoder methods in HTC achieve weak performance gains with huge memory overheads, and their structure encoders heavily rely on domain knowledge. Motivated by these observations, we investigate the feasibility of a memory-friendly model with strong generalization capability that could boost the performance of HTC without prior statistics or label semantics. In this paper, we propose the Hierarchy-aware Tree Isomorphism Network (HiTIN) to enhance text representations with only the syntactic information of the label hierarchy. Specifically, we convert the label hierarchy into an unweighted tree structure, termed a coding tree, with the guidance of structural entropy. Then we design a structure encoder to incorporate hierarchy-aware information from the coding tree into text representations. Besides the text encoder, HiTIN contains only a few multi-layer perceptrons and linear transformations, which greatly saves memory. We conduct experiments on three commonly used datasets, and the results demonstrate that HiTIN achieves better test performance and lower memory consumption than state-of-the-art (SOTA) methods.


Introduction
Hierarchical text classification is a sub-task of multi-label text classification, which is commonly applied in scenarios such as news document classification (Lewis et al., 2004; Sandhaus, 2008), academic paper classification (Kowsari et al., 2017), and so on. Unlike traditional classification tasks, the labels in HTC have parent-child relationships that form a hierarchical structure. Due to the complex structure of the label hierarchy and the imbalanced frequency of labels, HTC is a challenging task in natural language processing. Recent studies in HTC typically utilize a dual-encoder framework (Zhou et al., 2020), which consists of a text encoder for text representations and a structure encoder to inject label information into the text. The text encoder could be a traditional backbone for text classification, for instance, TextRCNN (Lai et al., 2015) or BERT (Devlin et al., 2019). The structure encoder is a Graph Neural Network (GNN) that treats the label hierarchy as a Directed Acyclic Graph (DAG) and propagates information among labels. To maximize the propagation ability of the structure encoder, Zhou et al. (2020) learn textual features of labels and count the prior probabilities between parent and child labels. Based on the dual-encoder framework, researchers have further complicated the model by adding complementary networks and loss functions from different aspects, such as treating HTC as a matching problem (Chen et al., 2021) or introducing mutual information maximization (Deng et al., 2021). However, more complementary components result in more memory consumption, as shown in Figure 1. On the other hand, their structure encoders still rely on prior statistics (Zhou et al., 2020; Chen et al., 2021) or representations of labels (Zhou et al., 2020; Deng et al., 2021). That is, these models require a mass of domain knowledge, which greatly reduces their generalization ability.

* Equal Contribution. † Correspondence to: Junran Wu.
To this end, we intend to design a more effective structure encoder with fewer parameters for HTC. Instead of introducing domain knowledge, we try to take full advantage of the structural information embedded in label hierarchies. Inspired by Li and Pan (2016), we decode the essential structure of label hierarchies into coding trees with the guidance of structural entropy, which measures the structural complexity of a graph. The coding tree is unweighted and reflects the hierarchical organization of the original graph, which provides us with another view of the label hierarchy. To construct coding trees, we design an algorithm, termed the CodIng tRee Construction Algorithm (CIRCA), by minimizing the structural entropy of label hierarchies. Based on the hierarchical structure of coding trees, we propose the Hierarchy-aware Tree Isomorphism Network (HiTIN). The document representations produced by the text encoder are fed into a structure encoder, in which we iteratively update the node embeddings of the coding tree with a few multi-layer perceptrons. Finally, we produce a feature vector of the entire coding tree as the final representation of the document. Compared with SOTA dual-encoder methods on HTC tasks (Zhou et al., 2020; Chen et al., 2021; Deng et al., 2021; Wang et al., 2022a), HiTIN shows superior performance gains with less memory consumption. Overall, the contributions of our work can be summarized as follows:
• To improve the generalization capability of dual-encoder models in HTC, we decode the essential structure of label hierarchies with the guidance of structural entropy.
• We propose HiTIN, which has fewer learnable parameters and requires less domain knowledge, to fuse the structural information of label hierarchies into text representations.
• Numerous experiments are conducted on three benchmark datasets to demonstrate the superiority of our model. For reproducibility, our code is available at https://github.com/Rooooyy/HiTIN.

Related Work
Hierarchical Text Classification. Existing works on HTC can be categorized into local and global approaches (Zhou et al., 2020). Local approaches build classifiers for a single label or for labels at the same level of the hierarchy, while global approaches treat HTC as a flat classification task and build only one classifier for the entire taxonomy. Previous local studies mainly focus on transferring knowledge from models at the upper levels to models at the lower levels. Kowsari et al. (2017) first feed the whole corpus into a parent model and then input the documents with the same label, as marked by the parent model, into a child model. In the following years, researchers tried different techniques to deliver knowledge from high-level models to low-level models (Shimura et al., 2018; Huang et al., 2019; Banerjee et al., 2019).
Global studies in HTC try to improve flat multi-label classification by introducing various information from the hierarchy. Gopal and Yang (2013) propose a recursive regularization function to make the parameters of adjacent categories take similar values. Peng et al. (2018) propose a regularized graph-CNN model to capture non-consecutive semantics from texts. Besides, various deep learning techniques, such as sequence-to-sequence models (Rojas et al., 2020), attention mechanisms (You et al., 2019), capsule networks (Aly et al., 2019), reinforcement learning (Mao et al., 2019), and meta-learning (Wu et al., 2019), are also applied in global HTC. Recently, Zhou et al. (2020) specially designed an encoder for label hierarchies that significantly improves performance. Chen et al. (2020) learn word and label embeddings jointly in hyperbolic space. Chen et al. (2021) formulate the text-label relationship as a semantic matching problem. Deng et al. (2021) introduce information maximization, which models the interaction between text and label while filtering out irrelevant information. With the development of Pretrained Language Models (PLMs), BERT-based (Devlin et al., 2019) contrastive learning (Wang et al., 2022a), prompt tuning (Wang et al., 2022b), and other methods (Jiang et al., 2022) have also been applied to HTC.

Figure 2: An example of HiTIN with K = 2. As shown in Section 4.1, the input document is first fed into the text encoder to generate text representations. Next, the label hierarchy is transformed into a coding tree via the Coding Tree Construction Algorithm proposed in Section 4.2. The text representations are mapped onto the leaf nodes of the coding tree, and we iteratively update the non-leaf node embeddings as described in Section 4.2. Finally, we produce a feature vector of the entire coding tree and calculate the classification probabilities in Section 4.3. Besides, HiTIN is supervised by binary cross-entropy loss and recursive regularization (Gopal and Yang, 2013).
Structural Entropy. Li and Pan (2016) extend information entropy (Shannon, 1948) to graphs as structural entropy, which could measure the structural complexity of a graph. The structural entropy of a graph is defined as the average length of the codewords obtained by a random walk under a specific coding scheme. The coding scheme, termed a coding tree (Li and Pan, 2016), is a tree structure that encodes and decodes the essential structure of the graph. In other words, minimizing structural entropy removes the noisy information from the graph. In the past few years, structural entropy has been successfully applied in network security (Li et al., 2016a), medicine (Li et al., 2016b), bioinformatics, graph classification (Wu et al., 2022b,a), text classification (Zhang et al., 2022), and graph contrastive learning (Wu et al., 2023).

Problem Definition
Given a document D = {w_1, w_2, . . . , w_n}, where w_i is a word and n denotes the document length, hierarchical text classification aims to predict a subset Y of the holistic label set Y. Besides, every label in Y corresponds to a unique node in a directed acyclic graph, i.e., the label hierarchy. The label hierarchy is predefined and usually simplified into a tree structure. In the ground-truth label set, a non-root label y_i always co-occurs with its parent nodes; that is, for any y_i ∈ Y, the parent node of y_i is also in Y.
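The co-occurrence constraint above says that every ground-truth label set is closed under taking parents. A minimal sketch of this check, using a hypothetical toy hierarchy that is not from the paper:

```python
# Sketch of the HTC label-consistency constraint: in any ground-truth
# label set, every non-root label must co-occur with its parent.
# The toy hierarchy below is a hypothetical example, not from the paper.

# parent[label] = its parent in the hierarchy (root has parent None)
parent = {"CS": None, "ML": "CS", "NLP": "CS", "HTC": "NLP"}

def is_consistent(labels, parent):
    """Check that the label set is closed under taking parents."""
    return all(parent[y] is None or parent[y] in labels for y in labels)

print(is_consistent({"CS", "NLP", "HTC"}, parent))  # True
print(is_consistent({"HTC"}, parent))               # False: parent "NLP" missing
```

Datasets such as WOS or RCV1-v2 enforce exactly this closure when ground-truth label sets are built.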

Methodology
Following the dual-encoder scheme in HTC, the architecture of HiTIN that consists of a text encoder and a structure encoder is shown in Figure 2. The text encoder aims to capture textual information from the input document while the structure encoder could model the label correlations in the hierarchy and inject the information from labels into text representations.

Text Encoder
In HTC, text encoder generally has two choices, that is, TextRCNN encoder and BERT encoder. TextRCNN (Lai et al., 2015) is a traditional method in text classification, while BERT (Devlin et al., 2019) has shown its powerful ability in sequence feature extraction and has been widely applied in natural language processing in the past few years.
TextRCNN Encoder. The given document D = {w_1, w_2, . . . , w_n}, which is a sequence of word embeddings, is first fed into a bidirectional GRU layer to extract sequential information. Then, multiple CNN blocks along with max pooling over time are adopted to capture n-gram features. Formally,

H_RCNN = MaxPool(Φ_CNN(Φ_GRU(D))),

where Φ_CNN(·) and Φ_GRU(·) respectively denote a CNN and a GRU layer, while MaxPool(·) denotes the max-pooling-over-time operation. Besides, H_RCNN ∈ R^{n_C×d_C}, where n_C denotes the number of CNN kernels and d_C denotes the number of output channels of each CNN kernel.
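The CNN-plus-max-pooling step of this encoder can be sketched in plain numpy. This is a minimal illustration only: the BiGRU is replaced by random stand-in outputs, and all shapes and weights are assumptions rather than the paper's implementation.

```python
import numpy as np

# Minimal sketch of the CNN + max-pooling-over-time step of the
# TextRCNN encoder (the preceding BiGRU is replaced by random features).
# Shapes follow the paper's notation: n_C kernels, d_C channels each.
rng = np.random.default_rng(0)

n, d = 10, 8                       # document length, contextual embedding dim
H_gru = rng.normal(size=(n, d))    # stand-in for BiGRU outputs

def conv1d_maxpool(H, kernel_size, d_C, rng):
    """One CNN kernel over time followed by max pooling over time."""
    W = rng.normal(size=(kernel_size * H.shape[1], d_C))
    windows = np.stack([H[i:i + kernel_size].ravel()
                        for i in range(H.shape[0] - kernel_size + 1)])
    feats = np.maximum(windows @ W, 0.0)      # ReLU over (n-k+1, d_C)
    return feats.max(axis=0)                  # max pooling over time -> (d_C,)

kernel_sizes, d_C = [2, 3, 4], 5              # n_C = 3 kernels
H_RCNN = np.stack([conv1d_maxpool(H_gru, k, d_C, rng) for k in kernel_sizes])
print(H_RCNN.shape)   # (3, 5): (n_C, d_C), flattened to d_H = n_C * d_C
```

Flattening H_RCNN gives the d_H-dimensional document vector that is passed to the structure encoder.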
BERT Encoder. Recent works in HTC also utilize BERT to learn textual features (Chen et al., 2021; Wang et al., 2022a). Since we make few changes to the vanilla BERT, we only introduce the workflow of our model and omit the details of BERT. Given an input document D = {w_1, w_2, . . . , w_n}, we pad the document with two special tokens:

D̃ = {[CLS], w_1, w_2, . . . , w_n, [SEP]},

where [CLS] and [SEP] respectively denote the beginning and the end of the document. After padding and truncating, the document D̃ is fed into BERT, which generates an embedding for each token:

H_BERT = Φ_BERT(D̃),

where H_BERT ∈ R^{(n+2)×d_B} and Φ_BERT(·) denotes the BERT model. We adopt the [CLS] embedding as the representation of the entire text sequence. Thus, the final representation H of document D is:

H = H_BERT[0],

where d_B is the hidden dimension.

Structure Encoder
The semantic information provided by text encoder is then input into the structure encoder. Unlike previous works, we do not utilize the prior statistics or learn representations of the label hierarchy. Instead, we design a suite of methods guided by structural entropy (Li and Pan, 2016) to effectively incorporate the information of text and labels.
Structural Entropy. Inspired by Li and Pan (2016), we try to simplify the original structure of the label hierarchy by minimizing its structural entropy. The structural entropy of a graph is defined as the average length of the codewords obtained by a random walk under a specific coding pattern named a coding tree (Li and Pan, 2016). Given a graph G = (V_G, E_G), the structural entropy of G on coding tree T is defined as:

H^T(G) = − Σ_{α∈T, α≠v_r} (g_α / vol(G)) log (vol(α) / vol(α⁻)),

where α is a non-root node of coding tree T that represents a subset of V_G, and α⁻ is the parent node of α on the coding tree. g_α represents the number of edges with only one endpoint in α and the other endpoint outside α, that is, the out degree of α. vol(G) denotes the volume of graph G, while vol(α) and vol(α⁻) are the sums of the degrees of the nodes partitioned by α and α⁻, respectively. The structural entropy of G is defined as H(G) = min_T H^T(G). For a certain coding pattern, the height of the coding tree should be fixed. Therefore, the K-dimensional structural entropy of graph G, determined by a coding tree T with a certain height K, can be computed as:

H^(K)(G) = min_{T: height(T)≤K} H^T(G).

Coding Tree Construction Algorithm. To minimize the structural entropy of graph G, we design the CodIng tRee Construction Algorithm (CIRCA) to heuristically construct a coding tree T with a height no greater than K. That is,

T = argmin_{T: height(T)≤K} H^T(G).

To better illustrate CIRCA, we make some definitions as follows.

Definition 1 Given a coding tree T and a node v_i ∈ T, define member variables v_i.parent and v_i.children; if v_j is the parent node of v_i on T, then v_j is equivalent to v_i.parent.
Definition 2 Following Definition 1, given any two …

Definition 3 Following Definition 1, given a node v_i, define a member function delete(v_i) of T. T.delete(v_i) deletes v_i from T and attaches the child nodes of v_i to its parent node. Formally, v_i.parent.children ← v_i.parent.children ∪ v_i.children \ {v_i}.

Definition 4 Following Definition 1, given any two nodes …

Based on the above definitions, the pseudocode of CIRCA can be found in Algorithm 1. More details about coding trees and CIRCA are shown in Appendix A.
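Given a graph and a fixed coding tree, the structural entropy H^T(G) defined above can be computed directly from degrees and cuts. The sketch below uses a hypothetical six-node graph and a hand-built height-2 coding tree, neither of which comes from the paper.

```python
import math

# Sketch: computing the structural entropy H^T(G) of a toy graph under a
# given coding tree, following the definition in this section. The graph
# and tree here are hypothetical examples, not taken from the paper.

edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
deg = {}
for u, v in edges:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1
vol_G = sum(deg.values())                     # vol(G) = 2|E|

def vol(S):                                   # vol(alpha): degrees inside S
    return sum(deg[u] for u in S)

def cut(S):                                   # g_alpha: edges leaving S
    return sum((u in S) != (v in S) for u, v in edges)

# Coding tree of height 2: root -> two clusters -> singleton leaves.
root = frozenset(range(6))
clusters = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
leaves = [frozenset({u}) for u in range(6)]
parent = {c: root for c in clusters}
for leaf in leaves:
    parent[leaf] = clusters[0] if next(iter(leaf)) < 3 else clusters[1]

H_T = -sum(cut(a) / vol_G * math.log2(vol(a) / vol(parent[a]))
           for a in clusters + leaves)
print(round(H_T, 4))
```

Grouping the two triangles into separate clusters keeps the cut terms small, which is exactly what CIRCA's greedy merging exploits.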

Algorithm 1 Coding Tree Construction Algorithm
The label hierarchy is formulated as a graph G_L = (V_{G_L}, E_{G_L}, X_{G_L}), where V_{G_L} and E_{G_L} respectively denote the node set and the edge set of G_L; V_{G_L} = Y, while E_{G_L} is predefined in the corpus. In our work, V_{G_L} and E_{G_L} are represented by the unweighted adjacency matrix of G_L, and X_{G_L} is the node embedding matrix of G_L. Instead of learning the concept of labels, we directly broadcast the text representation to the label structure. Specifically, X_{G_L} is transformed from the text representation H by duplication and projection. Formally,

X_{G_L} = (W_d H) W_p + B_H,

where W_d ∈ R^{|Y|×1} and W_p ∈ R^{d_H×d_V} are learnable weights for the duplication and projection, and |Y| is the size of the label set. d_H and d_V respectively denote the dimensions of the text and node features, and B_H ∈ R^{|Y|×d_V} is a learnable bias. Next, we simplify the structure of the label hierarchy into a coding tree with the guidance of structural entropy. Given a certain height K, the coding tree T_L = (V_{T_L}, E_{T_L}, X_{T_L}) of the label hierarchy can be constructed by CIRCA, where V_{T_L} = {V^i_{T_L} | i ∈ [0, K]} and the leaf-node set V^0_{T_L} corresponds to V_{G_L}. The coding tree T_L encodes and decodes the essential structure of G_L, which provides multi-granularity partitions for G_L. The root node v_r is the coarsest partition, which represents the whole node set of G_L, so V^K_{T_L} = {v_r}. For every node v and its child nodes {v_1, v_2, . . . , v_z}, the nodes v_1, v_2, . . . , v_z formulate a partition of v. Moreover, the leaf nodes of T_L form an element-wise partition of G_L. The tree structure {V^i_{T_L} | i ∈ [1, K]} is given by CIRCA, while the node embeddings {X^i_{T_L} | i ∈ [1, K]} remain empty till now. Thus, we intend to update the unfetched node representations of coding tree T_L. Following the message-passing mechanism of the Graph Isomorphism Network (GIN) (Xu et al., 2019), we design the Hierarchy-aware Tree Isomorphism Network (HiTIN) according to the structure of coding trees. For x^i_v ∈ X^i_{T_L} in the i-th layer,

x^i_v = Φ^i_MLP( Σ_{c∈C(v)} x^{i−1}_c ),

where v ∈ V^i_{T_L}, x^i_v ∈ R^{d_V} is the feature vector of node v, and C(v) represents the child nodes of v in coding tree T_L.
Φ^i_MLP(·) denotes a two-layer multi-layer perceptron with BatchNorm (Ioffe and Szegedy, 2015) and the ReLU function. The learning stage starts from the leaf nodes (layer 0) and learns the representation of each node layer by layer until reaching the root node (layer K). Finally, a read-out function is applied to compute a representation of the entire coding tree T_L:

H_T = Concat( Pool(X^0_{T_L}), Pool(X^1_{T_L}), . . . , Pool(X^K_{T_L}) ),    (11)

where Concat(·) indicates the concatenation operation. Pool(·) in Eq. 11 can be a summation, averaging, or maximization function. H_T ∈ R^{d_T} denotes the final representation of T_L.
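The bottom-up update and layer-wise read-out described above can be sketched in plain numpy. The tree, dimensions, and random weights below are illustrative assumptions; the MLP omits BatchNorm for brevity.

```python
import numpy as np

# Minimal numpy sketch of the HiTIN update and read-out: each non-leaf
# node sums its children's vectors and passes the result through a
# per-layer MLP; layer-wise pooled features are then concatenated.
rng = np.random.default_rng(0)
d_V = 4

# Toy coding tree of height K = 2: 4 leaves -> 2 internal nodes -> root.
children = {"a": [0, 1], "b": [2, 3], "root": ["a", "b"]}
layers = [[0, 1, 2, 3], ["a", "b"], ["root"]]

X = {v: rng.normal(size=d_V) for v in layers[0]}   # leaf features (from text)

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2             # 2-layer MLP with ReLU

pooled = [np.sum([X[v] for v in layers[0]], axis=0)]   # Pool over layer 0
for layer in layers[1:]:                               # layers 1 .. K
    W1, W2 = rng.normal(size=(d_V, d_V)), rng.normal(size=(d_V, d_V))
    for v in layer:
        X[v] = mlp(np.sum([X[c] for c in children[v]], axis=0), W1, W2)
    pooled.append(np.sum([X[v] for v in layer], axis=0))

H_T = np.concatenate(pooled)        # final coding-tree representation
print(H_T.shape)   # (d_V * (K + 1),) = (12,)
```

Because every node has a unique parent, each layer's update touches each child exactly once, which is what keeps the structure encoder cheap.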

Classification and Loss Function
Similar to previous studies (Zhou et al., 2020; Wang et al., 2022a), we flatten the hierarchy by attaching a unique multi-label classifier. H_T is fed into a linear layer along with a sigmoid function to generate classification probabilities:

P = σ(H_T W_c + b_c),    (12)

where W_c ∈ R^{d_T×|Y|} and b_c ∈ R^{|Y|} are the weights and bias of the linear layer, and |Y| is the size of the label set. For multi-label classification, we adopt the binary cross-entropy loss as the classification loss:

L_C = − Σ_{j=1}^{|Y|} ( y_j log p_j + (1 − y_j) log(1 − p_j) ),    (13)

where y_j is the ground truth of the j-th label and p_j is the j-th element of P. Considering hierarchical classification, we use recursive regularization (Gopal and Yang, 2013) to constrain the weights of adjacent classes to be in similar distributions, as formulated in Eq. 14:

L_R = Σ_p Σ_{q∈child(p)} (1/2) ||w_p − w_q||²,    (14)

where p is a non-leaf label in Y, q is a child of p, and w_p, w_q ∈ W_c. We use a hyper-parameter λ to control the strength of the recursive regularization. Thus, the final loss function can be formulated as:

L = L_C + λ L_R.    (15)

Evaluation Metrics. We measure the experimental results with Micro-F1 and Macro-F1 (Gopal and Yang, 2013). Micro-F1 is the harmonic mean of the overall precision and recall of all the test instances, while Macro-F1 is the average F1-score of each category. Thus, Micro-F1 reflects the performance on more frequent labels, while Macro-F1 treats labels equally.
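The classification loss and recursive regularizer of this section can be sketched in a few lines of numpy. The toy hierarchy, dimensions, weights, and label vector below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Sketch of the training objective: binary cross-entropy over sigmoid
# scores plus a recursive regularizer pulling each child classifier's
# weights toward its parent's.
rng = np.random.default_rng(0)

num_labels, d_T = 4, 6
parent = {1: 0, 2: 0, 3: 1}                  # toy hierarchy; label 0 is root
W_c = rng.normal(size=(d_T, num_labels))
b_c = rng.normal(size=num_labels)
H_T = rng.normal(size=d_T)                   # coding-tree representation
y = np.array([1.0, 1.0, 0.0, 1.0])           # ground-truth label vector

p = 1.0 / (1.0 + np.exp(-(H_T @ W_c + b_c)))            # Eq. 12
L_C = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # Eq. 13
L_R = 0.5 * sum(np.sum((W_c[:, q] - W_c[:, pa]) ** 2)   # Eq. 14
                for q, pa in parent.items())
lam = 1e-6
L = L_C + lam * L_R                                     # Eq. 15
print(L_C > 0, L_R > 0)
```

With λ as small as 1e-6, the regularizer acts as a gentle prior on adjacent classifier weights rather than a hard constraint.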
Implementation Details. The text embeddings fed into the TextRCNN encoder are initialized with GloVe (Pennington et al., 2014). The TextRCNN encoder consists of a two-layer BiGRU with hidden dimension 128 and CNN layers with kernel sizes [2, 3, 4] and d_C = 100. Thus, the hidden dimension of the final text representation is d_H = n_C * d_C = 3 * 100 = 300. The height K of the coding tree is 2 for all three datasets. The hidden dimension d_V of the node embeddings X_{G_L} is set to 512 for RCV1-v2 and 300 for WOS and NYTimes. Pool(·) in Eq. 11 is summation for all the datasets. The balance factor λ for L_R is set to 1e-6. The batch size is set to 16 for RCV1-v2 and 64 for WOS and NYTimes. The model is optimized by Adam (Kingma and Ba, 2014) with a learning rate of 1e-4.
For the BERT text encoder, we use the BertModel of bert-base-uncased, with some negligible changes to make it compatible with our method; d_B = d_H = d_V = 768. The height K of the coding tree is 2, and Pool(·) in Eq. 11 is averaging. The batch size is set to 12, and the BertModel is fine-tuned by Adam (Kingma and Ba, 2014) with a learning rate of 2e-5.

Table 2: Main Experimental Results with TextRCNN encoders. All baselines above and our method utilize GloVe embeddings (Pennington et al., 2014) to initialize documents and encode them with TextRCNN (Lai et al., 2015).

Baselines. We compare HiTIN with SOTA methods including HiAGM (Zhou et al., 2020), HTCInfoMax (Deng et al., 2021), HiMatch (Chen et al., 2021), and HGCLR (Wang et al., 2022a). HiAGM, HTCInfoMax, and HiMatch use different fusion strategies to model text-hierarchy correlations. Specifically, HiAGM proposes a multi-label attention mechanism and a text feature propagation technique to obtain hierarchy-aware representations. HTCInfoMax enhances HiAGM-LA with information maximization to model the interaction between text and hierarchy. HiMatch treats HTC as a matching problem by mapping text and labels into a joint embedding space. HGCLR directly incorporates the hierarchy into BERT with contrastive learning.

Experimental Results
The experimental results with different types of text encoders are shown in Table 2 and Table 3. HiAGM is the first method to apply the dual-encoder framework and outperforms TextRCNN on all the datasets. HTCInfoMax improves HiAGM-LA (Zhou et al., 2020) by introducing mutual information maximization but is still weaker than HiAGM-TP. HiMatch treats HTC as a matching problem and surpasses HiAGM-TP (Zhou et al., 2020) on WOS and RCV1-v2. Different from these methods, HiTIN can further extract the information in the text without counting the prior probabilities between parent and child labels or building feature vectors for labels. As shown in Table 2, when using TextRCNN as the text encoder, our model outperforms all baselines on the three datasets. Based on TextRCNN, HiTIN brings 3.55% and 4.72% improvements in Micro-F1 and Macro-F1 on average.
As for pretrained models in Table 3, our model also beats existing methods on all three datasets. Compared with vanilla BERT, our model significantly refines the text representations, achieving 1.2% and 3.1% average improvements in Micro-F1 and Macro-F1, respectively, on the three datasets. In addition, our method achieves a 3.69% improvement in Macro-F1 on NYT, which has the deepest label hierarchy of the three datasets, demonstrating the superiority of our model on datasets with complex hierarchies. Compared with BERT-based HTC methods, our model achieves a 1.12% average improvement in Macro-F1 over HGCLR. On RCV1-v2, the performance boost in Macro-F1 even reaches 1.64%. The improvement in Macro-F1 shows that our model can effectively capture the correlation between parent and child labels even without their prior probabilities.
Figure 3: Test performance of HiTIN with different heights K of the coding tree on three datasets.

The Necessity of CIRCA
In this subsection, we illustrate the effectiveness of CIRCA by comparing it to a random algorithm. Like CIRCA, the random algorithm generates a coding tree of the original graph G with a certain height K. First, the random algorithm also takes all nodes of graph G as leaf nodes of the tree. But unlike CIRCA, in each layer, every two nodes are randomly paired and then connected to a newly created parent node. Finally, all nodes in the (K−1)-th layer are connected to a root node. We generate coding trees with the random algorithm and then feed them into our model. As shown in Table 4, the results demonstrate that the random algorithm has a negative impact, as it destroys the original semantic information, making it difficult for the downstream model to extract useful features. On the contrary, the coding tree constructed by CIRCA retains the essential structure of the label hierarchy and makes the learning procedure more effective. Besides, our model achieves good performance without Eq. 14, which suggests that CIRCA retains the information of low-frequency labels while minimizing the structural entropy of label hierarchies.
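The random baseline just described is simple to sketch. The node-naming scheme below is an illustrative assumption; only the pairing-and-attach procedure follows the description above.

```python
import random

# Sketch of the random baseline: leaves are paired at random layer by
# layer until layer K-1, then all remaining nodes are attached to a
# single root. Node naming here is an illustrative convention.
def random_coding_tree(num_leaves, K, seed=0):
    rng = random.Random(seed)
    tree = {}                                  # child -> parent
    layer = [("leaf", i) for i in range(num_leaves)]
    for level in range(1, K):                  # build layers 1 .. K-1
        rng.shuffle(layer)
        nxt = []
        for i in range(0, len(layer), 2):
            parent = ("node", level, i // 2)
            for child in layer[i:i + 2]:       # pair (or lone last node)
                tree[child] = parent
            nxt.append(parent)
        layer = nxt
    for node in layer:                         # connect layer K-1 to root
        tree[node] = ("root",)
    return tree

tree = random_coding_tree(num_leaves=6, K=2)
print(len(tree))   # 9: six leaves plus three level-1 nodes
```

Because the pairing ignores the graph entirely, labels from unrelated subtrees get merged, which is why Table 4 shows this baseline degrading performance.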

The Height of Coding Tree
The height of the coding tree directly affects the performance of our model: the higher the coding tree, the more information is compressed. To investigate the impact of K, we run HiTIN with different heights K of the coding tree while keeping the other settings the same. Figure 3 shows the test performance with coding trees of different heights on WOS, RCV1-v2, and NYTimes. As K grows, the performance of HiTIN degrades severely. Despite the different depths of the label hierarchies, the optimal height of the coding tree for all three datasets is always 2. A probable reason is that the 2-dimensional structural entropy roughly corresponds to objects in 2-dimensional space, as the text and labels are both represented with 2-D tensors. On the other hand, as K grows, more noisy information is eliminated, but more useful information is also compressed.

The Memory-saving Feature of HiTIN
In this subsection, we compare the number of learnable parameters of HiTIN with those of the baselines. We set K to 2 and run these models on WOS while keeping the other hyper-parameters the same. The numbers of trainable parameters are counted by the numel(·) function in PyTorch (Paszke et al., 2019). As shown in Figure 4, the parameter count of our model is slightly greater than that of TextRCNN (Zhou et al., 2020) but significantly smaller than those of HiAGM (Zhou et al., 2020), HiMatch (Chen et al., 2021), and HTCInfoMax (Deng et al., 2021). One important reason is the simple and efficient architecture of HiTIN, which contains only a few MLPs and linear transformations. On the contrary, HiAGM-LA (Zhou et al., 2020) needs extra memory for label representations, HiAGM-TP uses a space-consuming method for text-to-label transformation, and both of them utilize a gated network as the structure encoder, which further aggravates memory usage. HiMatch (Chen et al., 2021) and HTCInfoMax (Deng et al., 2021) respectively introduce auxiliary neural networks on top of HiAGM-TP and HiAGM-LA, so their memory usage is even larger.

Conclusion
In this paper, we propose a suite of methods to address the limitations of existing approaches to HTC. In particular, aiming to minimize structural entropy, we design CIRCA to construct coding trees for the label hierarchy. To further extract textual information, we propose HiTIN, which updates the node embeddings of the coding tree iteratively. Experimental results demonstrate that HiTIN can enhance text representations with only the structural information of the label hierarchy. Our model outperforms existing methods while greatly reducing memory consumption.

Limitations
For text classification tasks, the text encoder is more important than other components. Due to the lack of label semantic information and simplified learning procedure, the robustness of text encoders directly affects the performance of our model. From Table 2

A Analysis of CIRCA
In this section, we first present the definition of the coding tree following Li and Pan (2016). Second, we present the detailed workflow of CIRCA, in particular how each stage in Algorithm 1 works and the purpose of these steps. Finally, we analyze the time complexity of CIRCA.
Coding Tree. A coding tree T of a graph G = (V_G, E_G) is defined as a tree with the following properties:

i. Every node v ∈ T is associated with a nonempty subset V of V_G. Denote T_v = V, in which v is called the codeword of V while V (or T_v) is termed the marker of v.

ii. The coding tree has a unique root node v_r that stands for the vertex set V_G of G. That is, T_{v_r} = V_G.

iii. For every node v ∈ T, if v_1, v_2, . . . , v_z are all the children of v, then {T_{v_1}, T_{v_2}, . . . , T_{v_z}} is a partition of T_v. That is, T_v = T_{v_1} ∪ T_{v_2} ∪ · · · ∪ T_{v_z} and T_{v_i} ∩ T_{v_j} = ∅ for i ≠ j.

iv. For each leaf node v_γ ∈ T, T_{v_γ} is a singleton, i.e., v_γ corresponds to a unique vertex in V_G, and for any vertex v ∈ V_G there is exactly one leaf node v_τ ∈ T satisfying T_{v_τ} = {v}.
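The four properties above are mechanical to verify for a concrete tree. The sketch below checks them on a hypothetical four-vertex example; the tree and helper are assumptions for illustration, not part of the paper's code.

```python
# Sketch: checking the coding-tree properties above on a toy tree.
# Each tree node is marked by the subset of graph vertices it represents
# (frozensets below); the example tree is a hypothetical assumption.

V_G = frozenset(range(4))
root = V_G
children = {root: [frozenset({0, 1}), frozenset({2, 3})],
            frozenset({0, 1}): [frozenset({0}), frozenset({1})],
            frozenset({2, 3}): [frozenset({2}), frozenset({3})]}

def check_coding_tree(root, children, V_G):
    ok = root == V_G                                  # property ii
    stack, leaves = [root], []
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        if not kids:
            leaves.append(v)
            continue
        ok &= set().union(*kids) == set(v)            # property iii: cover
        ok &= sum(len(k) for k in kids) == len(v)     #   and disjointness
        stack.extend(kids)
    ok &= all(len(l) == 1 for l in leaves)            # property iv: singletons
    ok &= frozenset().union(*leaves) == V_G           #   every vertex covered
    return ok

print(check_coding_tree(root, children, V_G))  # True
```

Property i holds by construction here, since every tree node is represented directly by its (nonempty) marker set.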
The workflow of CIRCA. In the initial state, the original graph G = (V_G, E_G) is fed into CIRCA, and each node in V_G is treated as a leaf node of coding tree T_L and directly linked with the root node v_r. The height of the initial coding tree is 1, which reflects the one-dimensional structural entropy of graph G. In other words, there are only two kinds of partitions of V_G: one is the graph-level partition (T_{v_r} = V_G), and the other is the node-level partition (T_{v_τ} = {v}). We tend to find multi-granularity partitions of G, which can be provided by the K-dimensional optimal coding tree, since a coding tree with height K encodes and decodes K + 1 partitions of graph G at different levels.
In Stage 1, we merge the leaf nodes of the initial coding tree pair by pair until the root node v_r has only two children. Merging leaf nodes essentially compresses structural information, which reduces the structural entropy of graph G. When selecting the node pairs to be merged, we give priority to the pairs whose merging reduces the structural entropy of graph G the most.
After Stage 1, the coding tree T becomes a binary tree whose height is much greater than K and close to log|V_G| in practical applications. In Stage 2, we compress the coding tree T to height K by erasing its intermediate nodes. Note that removing nodes from the highly compressed coding tree increases the structural entropy of graph G. Thus, we preferentially erase the nodes that cause the minimal increase in structural entropy.
The result of Stage 2 might be an unbalanced tree that does not conform to the definition of coding trees. In Stage 3, we post-process the coding tree to make all leaf nodes the same height.
Complexity analysis. The time complexity of CIRCA is O(h_max(|E_G| log|V_G| + |V_G|)), where h_max is the maximum height of coding tree T_L during Stage 1. Since CIRCA tends to construct balanced coding trees, h_max is no greater than log(|V_G|).