Exploiting Global and Local Hierarchies for Hierarchical Text Classification

Hierarchical text classification aims to leverage the label hierarchy in multi-label text classification. Existing methods encode the label hierarchy in a global view, where it is treated as a static hierarchical structure containing all labels. Since global hierarchy is static and irrelevant to text samples, these methods struggle to exploit hierarchical information. Contrary to global hierarchy, local hierarchy is the structured hierarchy of labels corresponding to each text sample. It is dynamic and relevant to text samples, yet ignored by previous methods. To exploit both global and local hierarchies, we propose Hierarchy-guided BERT with Global and Local hierarchies (HBGL), which utilizes the large-scale parameters and prior language knowledge of BERT to model both. Moreover, HBGL avoids the intentional fusion of semantic and hierarchical modules by directly modeling semantic and hierarchical information with BERT. Compared with the state-of-the-art method HGCLR, our method achieves significant improvement on three benchmark datasets.


Introduction
Hierarchical text classification (HTC) focuses on assigning one or more labels from the label hierarchy to a text sample (Sun and Lim, 2001). As a special case of multi-label text classification, HTC has various applications such as news categorization (Kowsari et al., 2017) and scientific paper classification (Lewis et al., 2004b). Methods in HTC aim to improve prediction accuracy by modeling the large-scale, imbalanced, and structured label hierarchy (Mao et al., 2019a).
To model the label hierarchy, recent methods (Zhou et al., 2020; Chen et al., 2021; Wang et al., 2022) view the hierarchy as a directed acyclic graph and model it with graph encoders. However, the input of these graph encoders is static: all HTC text samples share the same hierarchical structure, so the graph encoders model the same graph redundantly. To address this, Wang et al. (2022) discard the graph encoder during prediction, but the method still suffers from the same problem during training. Moreover, since the target labels corresponding to each text sample can form either a single path or multiple paths in HTC (Zhou et al., 2020), recent methods only consider the graph of the entire label hierarchy and ignore the subgraph corresponding to each text sample. This subgraph contains structured label co-occurrence information. For instance, a news report about travel in France is labeled "European" under the parent label "World" and "France" under a different parent label "Travel Destinations". There is a strong correlation between the labels "France" and "European", but these labels are far apart on the graph, making it difficult for graph encoders to model their relationship.
Under this observation, we divide the label hierarchy into global and local hierarchies to take full advantage of hierarchical information in HTC. We define global hierarchy as the whole hierarchical structure, referred to as hierarchical information in previous methods. We then define local hierarchy as the structured label hierarchy corresponding to each text sample, which is a subgraph of the global hierarchy. Global hierarchy is static and irrelevant to text samples, while local hierarchy is dynamic and relevant to them. Considering the characteristics of the two hierarchies, our method models them separately to avoid redundantly modeling the static global hierarchy and to fully exploit hierarchical information with the dynamic local hierarchy.
To model semantic information along with hierarchical information, Zhou et al. (2020) propose hierarchy-aware multi-label attention. Chen et al. (2021) reformulate the task as a matching problem by encouraging the text representation to be similar to its label representation. Although these methods improve the performance of text encoders by injecting the label hierarchy with a graph encoder, the improvement on the pretrained language model BERT (Devlin et al., 2018) is limited (Wang et al., 2022). Compared to previous text encoders such as CNNs or RNNs, BERT has large-scale parameters and prior language knowledge, which enables it to roughly grasp hierarchical information through multi-label text classification alone. Therefore, HGCLR (Wang et al., 2022) was proposed to improve BERT's performance on HTC by embedding the hierarchy into BERT via contrastive learning during training. For prediction, HGCLR uses BERT directly as a multi-label classifier. Specifically, the hierarchy in HGCLR is represented by positive samples in contrastive learning, implemented by scaling the BERT token embeddings with a graph encoder. However, representing the hierarchy by simply scaling token embeddings is inefficient and may also lead to a gap between training and prediction.
To efficiently exploit BERT in HTC, we leverage its prior knowledge by transforming both global and local hierarchy modeling into mask prediction tasks. Moreover, we discard the auxiliary graph encoder and use BERT itself to model hierarchical information, avoiding the intentional fusion of BERT and a graph encoder. For the global hierarchy, we propose a label mask prediction task that recovers masked labels based on the label relationships in the global hierarchy. Since the global hierarchy is irrelevant to text samples, we only fine-tune the label embeddings and keep BERT frozen. For the local hierarchy, we combine text samples and labels as the input of BERT to directly fuse semantic and hierarchical information through BERT's attention mechanism.
In summary, the contributions of this paper are as follows: • We propose HBGL to take full advantage of BERT in HTC. HBGL does not require auxiliary modules such as graph encoders to model hierarchical information, which avoids the intentional fusion of semantic and hierarchical modules.
• We propose corresponding methods to model the global and local hierarchies based on their characteristics in order to further exploit the information of the hierarchy.
• Experiments show that the proposed model achieves significant improvements on three datasets. Our code will be made public to ensure reproducibility.

Related Work
Hierarchical text classification (HTC) is a special multi-label text classification problem that requires constructing one or more paths through the taxonomic hierarchy in a top-down manner (Sun and Lim, 2001). Compared to general multi-label text classification, HTC focuses on leveraging hierarchical information to achieve better results. Existing HTC methods fall into two groups based on how they treat the label hierarchy: local and global approaches.
The local approaches leverage hierarchical information by constructing one or more classifiers at each level or each node of the hierarchy. Generally speaking, a text sample is classified top-down along its hierarchy. Shimura et al. (2018) apply a CNN with a fine-tuning technique to utilize the data in the upper levels. Banerjee et al. (2019) initialize the parameters of a child classifier with those of its fine-tuned parent classifier.
The global approaches leverage hierarchical information by directly treating HTC as multi-label text classification with the hierarchy as additional input. Many techniques, such as recursive regularization (Gopal and Yang, 2013), reinforcement learning (Mao et al., 2019b), capsule networks (Peng et al., 2021), and meta-learning (Wu et al., 2019), have been proposed to capture hierarchical information. To better represent it, Zhou et al. (2020) formulate the hierarchy as a directed graph and introduce hierarchy-aware structure encoders. Chen et al. (2021) formulate the text-label semantic relationship as a semantic matching problem.
With the development of pretrained language models (PLMs), PLMs outperform previous methods even without using hierarchical information. Compared to text encoders like RNNs or CNNs, a PLM is strong enough to learn hierarchical information without hierarchy-aware structure encoders. Based on this observation, Wang et al. (2022) propose HGCLR to embed hierarchical information into the PLM directly. However, HGCLR still requires a hierarchy-aware structure encoder such as Graphormer (Ying et al., 2021) to incorporate hierarchical information during training.

Problem Definition
Given a training set {(x_i, y_i)}_{i=1}^N, where x_i is raw text and y_i ∈ {0, 1}^L is the label of x_i represented as an L-dimensional multi-hot vector, the goal of Hierarchical Text Classification (HTC) is to predict a subset of labels for x_i with the help of hierarchical information. The hierarchy can be organized as a Directed Acyclic Graph (DAG) G = (V, →E ∪ ←E), where V = {v_1, ..., v_L} is the set of label nodes, →E is the top-down hierarchy path, and ←E is the bottom-up hierarchy path. Although the labels in y_i follow the label hierarchy, HTC can be either a single-path or a multi-path problem (Zhou et al., 2020).
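As a minimal illustration of this setup, the sketch below builds a tiny hierarchy around the France-travel example from the introduction (the extra "Sports" label is hypothetical) and encodes a sample's multi-path label set as a multi-hot vector:

```python
# Minimal sketch of the HTC problem setup. The label hierarchy is a DAG of
# top-down (parent, child) edges; each sample's target y is an L-dimensional
# multi-hot vector that may cover multiple paths.

labels = ["World", "Travel Destinations", "European", "France", "Sports"]
edges = [("World", "European"), ("Travel Destinations", "France")]  # top-down

index = {name: i for i, name in enumerate(labels)}

def to_multi_hot(target_labels, index):
    """Encode a sample's label set as a multi-hot vector y in {0,1}^L."""
    y = [0] * len(index)
    for name in target_labels:
        y[index[name]] = 1
    return y

# A France-travel news report is labeled along two paths of the hierarchy.
y = to_multi_hot(["World", "European", "Travel Destinations", "France"], index)
```

Here the sample's local hierarchy is the two-path subgraph World→European and Travel Destinations→France, while the global hierarchy is the full edge set shared by all samples.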

Methodology
In this section, we provide the technical details of the proposed HBGL. Figure 1 shows the overall framework of the model. The left part corresponds to the global hierarchy-aware label embeddings, while the right part corresponds to the local hierarchy-aware text encoder. We first inject the global hierarchy into the label embeddings, then leverage these label embeddings with our local hierarchy-aware text encoder.

Global Hierarchy-aware Label Embeddings
Global hierarchy-aware label embeddings aim to initialize the label embeddings based on label semantics and the hierarchy in HTC, allowing BERT to directly exploit the global hierarchy without auxiliary label encoders. Contrary to previous methods (Zhou et al., 2020; Chen et al., 2021; Wang et al., 2022), we use BERT itself as a graph encoder, initializing the label embeddings via gradient descent and adapting the global hierarchy to the label embeddings by formulating it as mask prediction. This leverages the large-scale pretraining knowledge of BERT to generate label embeddings that are both hierarchy-aware and semantics-aware.
Following Wang et al. (2022), we first initialize label embeddings Ŷ = [ŷ_1, ..., ŷ_L] ∈ R^{L×d} with the averaged BERT token embeddings of each label name, where d is the hidden size of BERT and ŷ_i corresponds to label node v_i in G. Since Ŷ is initialized with BERT token embeddings, it takes advantage of the prior knowledge of BERT to merge label semantic and hierarchical information. Specifically, the input embeddings e ∈ R^{L×d} of the BERT encoder are defined as:

e_i = ŷ_i + p_{HLevel(v_i)} + t_1,    (1)

where p ∈ R^{512×d} and t ∈ R^{2×d} are the position embeddings and segment embeddings in BERT (Devlin et al., 2018). To exploit the position embeddings p, we use the hierarchy level HLevel(v_i) as the position id for label v_i: low position ids represent coarse-grained labels, while high position ids represent fine-grained labels. To exploit the segment embeddings t, we use segment id 1 for labels, which makes it easy for BERT to distinguish labels from text in HTC.
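To make Eq. 1 concrete, here is a minimal pure-Python sketch of the label input embeddings, using toy 3-dimensional vectors in place of BERT's hidden size d (all numbers are illustrative, not from the actual model):

```python
# Sketch of Eq. 1: e_i = y_hat_i + p[HLevel(v_i)] + t[1].
# HLevel gives each label's depth in the hierarchy; segment id 1 marks labels.

def add_vecs(*vecs):
    """Element-wise sum of equal-length vectors."""
    return [sum(xs) for xs in zip(*vecs)]

def label_input_embeddings(label_emb, levels, pos_emb, seg_emb):
    """e_i = y_hat_i + p[HLevel(v_i)] + t[1] for every label i."""
    return [add_vecs(y, pos_emb[lvl], seg_emb[1])
            for y, lvl in zip(label_emb, levels)]

label_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # y_hat for two labels
levels    = [0, 1]                                  # parent level 0, child level 1
pos_emb   = [[0.25, 0.25, 0.25], [0.5, 0.5, 0.5]]   # position embedding per level
seg_emb   = [[0.0, 0.0, 0.0], [0.125, 0.125, 0.125]]  # segment 0 = text, 1 = label

e = label_input_embeddings(label_emb, levels, pos_emb, seg_emb)
```

In the real model these would be the learned BERT embedding tables; the sketch only shows how level-based position ids and segment id 1 are added onto the label embeddings.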
To feed BERT with the label graph, we add an attention mask A ∈ {0, 1}^{L×L} to each self-attention layer. Formally, A is defined as:

A_ij = 1 if (v_i, v_j) ∈ →E ∪ ←E or i = j, and A_ij = 0 otherwise,    (2)

where →E is the top-down hierarchy path and ←E is the bottom-up hierarchy path of the DAG G. Each label can thus attend to its parent and child labels. As an example, we show the attention mask for a four-level hierarchy in Figure 1.
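The mask of Eq. 2 can be sketched directly from the edge list; a minimal pure-Python version (nested lists standing in for the L×L matrix):

```python
# Sketch of Eq. 2: label i may attend to label j iff j is i's parent or child
# in the hierarchy, or i == j.

def hierarchy_attention_mask(num_labels, edges):
    """edges: top-down (parent, child) index pairs; returns A as nested lists."""
    # self-attention is always allowed (i == j)
    A = [[1 if i == j else 0 for j in range(num_labels)]
         for i in range(num_labels)]
    for parent, child in edges:
        A[parent][child] = 1   # top-down edge (in E→)
        A[child][parent] = 1   # bottom-up edge (in E←)
    return A

# Toy hierarchy: label 0 is the parent of labels 1 and 2.
A = hierarchy_attention_mask(3, [(0, 1), (0, 2)])
```

Note that the two sibling leaves (1 and 2) cannot attend to each other; only their shared parent connects them, which is exactly the ambiguity the masked-label objective below must resolve.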
Based on the input embeddings e and attention mask A, we can use the masked LM task (Devlin et al., 2018) to inject the hierarchy into Ŷ. However, directly following the masked LM task in BERT leaves the model unable to distinguish between masked leaf labels under the same parent label. Consider two leaf labels, "Baseball" and "Football", under the same parent label "Sport": if both leaf labels are masked, the model will output the same result for them, since both have the same position and segment embeddings and attend only to the "Sport" label under A. To solve this problem, we treat the masked label prediction task as multi-label classification, which requires a masked leaf label to predict itself and any other masked sibling leaf labels according to G.
Formally, we first randomly mask several labels by replacing ŷ_i with the mask token embedding in Eq. 1 to get masked input embeddings e′. Second, we calculate the hidden state representations h ∈ R^{L×d} and the scores of each label s ∈ R^{L×L} as follows:

h = BERTEncoder(e′, A),
s = h Ŷ^T,    (3)

where BERTEncoder is the encoder part of BERT and A is applied to each layer of BERTEncoder.
Finally, the problem of injecting the hierarchy into Ŷ can be reformulated as solving the following optimization problem:

min_Ŷ − Σ_{i∈V_m} Σ_{j=1}^{L} [ȳ_ij log σ(s_ij) + (1 − ȳ_ij) log(1 − σ(s_ij))],    (4)

where σ is the sigmoid function, V_m is the set of masked labels, and ȳ_ij is the target for s_ij. We set ȳ_ij = 1 when i = j or when labels i and j are masked sibling leaf nodes in G. To avoid overfitting on the static graph G, we keep all parameters of BERT frozen and only fine-tune the label embeddings Ŷ in Eq. 4. Moreover, we gradually increase the label mask ratio during training.
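A rough sketch of the scoring and loss of Eqs. 3-4 with toy 2-d vectors in place of the BERT hidden states (the values and the two-label hierarchy are purely illustrative):

```python
import math

# Sketch of the masked-label objective (Eqs. 3-4): scores s_ij come from the
# hidden state of masked position i against every label embedding, and a
# masked leaf must predict itself and any masked sibling leaves.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def masked_label_loss(h, label_emb, masked, sibling_leaves):
    """Binary cross-entropy over masked positions; target y_bar_ij = 1 when
    j == i or (i, j) are masked sibling leaf labels."""
    loss = 0.0
    for i in masked:
        s_i = [dot(h[i], y) for y in label_emb]   # s_ij = h_i . y_hat_j
        for j in range(len(label_emb)):
            target = 1.0 if (j == i or (i, j) in sibling_leaves) else 0.0
            p = sigmoid(s_i[j])
            loss -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return loss

siblings = {(0, 1), (1, 0)}                 # labels 0 and 1 are sibling leaves
label_emb = [[1.0, 0.0], [0.0, 1.0]]
# hidden states aligned with both sibling targets give a small loss ...
loss_good = masked_label_loss([[3.0, 3.0], [3.0, 3.0]], label_emb, [0, 1], siblings)
# ... while anti-aligned hidden states give a large loss
loss_bad = masked_label_loss([[-3.0, -3.0], [-3.0, -3.0]], label_emb, [0, 1], siblings)
```

In HBGL only the label embeddings receive gradients from this loss; BERT stays frozen, which the sketch does not model.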
The whole procedure of global hierarchy-aware label embeddings is shown in Algorithm 1.
Algorithm 1 Global Hierarchy-aware Label Embeddings
input: Label hierarchy G and label names
output: Label embeddings Ŷ
initialize: Ŷ using the averaged BERT token embeddings of each label name
1: Set mask ratio r_m;
2: Set mask ratio upper bound r_M;
3: Set learning rate lr, batch size bsz and training steps T_train;
4: Get attention mask A according to Eq. 2;
5: for t = 1, ..., T_train do
6:   Get input embeddings e according to Eq. 1;
7:   Mask e with mask ratio r_m^t to get e′;
8:   Get h and s according to Eq. 3;
9:   Get ȳ based on e′ and G: ȳ_ij = 1 when j = i or labels i and j are masked sibling leaf nodes;
10:  Compute loss L in Eq. 4;
11:  Backward and compute the gradients ∂L/∂Ŷ_t;
12:  r_m^{t+1} = r_m^t + (r_M − r_m)/T_train;
13:  Ŷ_{t+1} = UpdateParameter(Ŷ_t, ∂L/∂Ŷ_t, lr);
14: end for

Local Hierarchy-aware Text Encoder

Local hierarchy is the structured label hierarchy corresponding to each text sample, which is ignored in previous methods (Zhou et al., 2020; Chen et al., 2021; Wang et al., 2022). In contrast to global hierarchy, local hierarchy is dynamic and related to semantic information. To leverage local hierarchy in HTC, several issues must be examined before introducing our methods. First, although local hierarchy contains the hierarchical information related to the target labels, using it naively leads to label leakage during training. Second, we must pay attention to the gap between training and prediction, since local hierarchy is only available during training. To this end, we propose the local hierarchy-aware text encoder to exploit local hierarchy while avoiding the above issues.

Local Hierarchy Representation
We first discuss the representation of local hierarchy in BERT before introducing the local hierarchy-aware text encoder. Following the method in global hierarchy-aware label embeddings, we could represent local hierarchy as a subgraph of the global hierarchy through BERT's attention mechanism. However, it is hard for BERT to combine a label graph and a text sample as input while avoiding label leakage and the gap between training and prediction. Therefore, we propose another method to efficiently represent local hierarchy, which allows BERT to combine local hierarchies with text samples easily. Since a local hierarchy is a single path or multiple paths in the global hierarchy, in the single-path case we can simply treat it as a sequence. If we can also transform the multi-path case into a sequence, we can represent any local hierarchy as a sequence, making it easy to model with BERT. Under this observation, we transform the multi-path local hierarchy into a sequence as follows:

u_h = Σ_{y_j ∈ y^h} ŷ_j,  h = 1, ..., D,
u = [u_1, ..., u_D],    (5)

where u_h ∈ R^d is the h-th level of the hierarchy, y^h is the set of target labels in the h-th level, ŷ_j is the global hierarchy-aware label embedding, D is the maximum level of the hierarchy, and u is the local hierarchy sequence.
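Eq. 5 can be sketched in a few lines; the toy 2-d embeddings below mirror the 1a/1b/2a/2b example discussed next:

```python
# Sketch of Eq. 5: flatten a (possibly multi-path) local hierarchy into a
# sequence by summing, at each level h, the embeddings of that level's target
# labels: u_h = sum over j in y^h of y_hat_j.

def local_hierarchy_sequence(level_targets, label_emb):
    """level_targets: per level, a list of target label indices;
    returns u = [u_1, ..., u_D]."""
    dim = len(label_emb[0])
    u = []
    for targets in level_targets:
        u_h = [0.0] * dim
        for j in targets:
            u_h = [a + b for a, b in zip(u_h, label_emb[j])]
        u.append(u_h)
    return u

# Multi-path example: labels 1a, 1b at level 1; their children 2a, 2b at level 2.
label_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]  # 1a, 1b, 2a, 2b
u = local_hierarchy_sequence([[0, 1], [2, 3]], label_emb)
```

The multi-path structure is thus collapsed to one vector per level, so the whole local hierarchy becomes a D-step sequence that BERT can consume.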
For example, consider a multi-path local hierarchy with four labels 1a, 1b, 2a, and 2b, where 2a and 2b are the child labels of 1a and 1b, respectively. According to Eq. 5, we get u_1 = ŷ_1a + ŷ_1b and u_2 = ŷ_2a + ŷ_2b. The attention score α_21 in BERT can be calculated as:

α_21 ∝ u_2 W^Q (u_1 W^K)^T = ŷ_2a W^Q (ŷ_1a W^K)^T + ŷ_2b W^Q (ŷ_1b W^K)^T + ŷ_2a W^Q (ŷ_1b W^K)^T + ŷ_2b W^Q (ŷ_1a W^K)^T,    (6)

where W^Q, W^K ∈ R^{d×d_z} are parameter matrices in BERT. As shown in Eq. 6, α_21 contains ŷ_2a W^Q (ŷ_1a W^K)^T and ŷ_2b W^Q (ŷ_1b W^K)^T, which represent the local hierarchy graph, while it also contains the cross terms ŷ_2a W^Q (ŷ_1b W^K)^T and ŷ_2b W^Q (ŷ_1a W^K)^T. Since we have injected the global hierarchy into the label embeddings ŷ_1a, ŷ_1b, ŷ_2a, and ŷ_2b according to Algorithm 1, which leverages the first part of α_21 to predict masked labels, α_21 is able to hold the hierarchical information of the local hierarchy.
In addition, the second part of α 21 is also relevant for modeling the local hierarchy, as the labels in it correspond to the same text sample.

Fusing Local Hierarchy into Text Encoder
It is hard to implement BERT directly as a multi-label classifier while further exploiting local hierarchy and avoiding label leakage and the gap between training and prediction. Inspired by s2s-ft (Bao et al., 2021), which adapts PLMs like BERT for sequence-to-sequence learning, we propose a novel method that adapts BERT to generate the local hierarchy sequence. Note that since elements of the local hierarchy sequence may contain multiple labels, we cannot use sequence-to-sequence methods directly.
To fuse local hierarchy and text in a sequence-to-sequence fashion, our model aims to generate the local hierarchy sequence u:

p(u | x) = Π_{h=1}^{D} p(u_h | u_{<h}, x),    (7)

where u_{<h} = u_1, ..., u_{h−1} and x is the input text corresponding to u. Modeling HTC as Eq. 7 has several advantages. First, the structure of local hierarchy is included: since u_h represents the labels at the h-th level of the hierarchy, it only depends on the labels above the h-th level, which is captured by p(u_h | u_{<h}, x). Second, we can leverage teacher forcing to fuse local hierarchy and text while avoiding label leakage during training. Specifically, the input of BERT consists of three parts during training: the input text, the local hierarchy, and the masked local hierarchy, as shown in Figure 1. The end-of-sequence token [SEP] divides these three parts. Based on s2s-ft, we implement the attention mask shown in Figure 1. The attention mask prevents the input text from attending to the local hierarchy and the masked local hierarchy, which guarantees that labels do not influence the input text tokens. It also enforces the top-down manner in the local hierarchy by allowing each label to attend only to upper-level labels. The attention mask between the local hierarchy and the masked local hierarchy allows masked labels to be predicted from upper-level target labels, following teacher forcing. We use segment ids 0 and 1 to distinguish text and labels. For position ids, we assign the same position ids to the local hierarchy and the masked local hierarchy, accumulating them from the text position ids. Feeding BERT with the above inputs, we use a binary cross-entropy loss to predict the labels at each level separately. The optimization problem is:

min_Θ − Σ_i Σ_{h=1}^{D} Σ_{j∈V_h} [y_ij log σ(s^t_ihj) + (1 − y_ij) log(1 − σ(s^t_ihj))],    (8)

where Θ are the parameters of BERT, V_h = {j | HLevel(v_j) = h} is the set of labels at the h-th level, y_ij is the j-th label corresponding to text x_i, and s^t_ihj is the score of the j-th label, calculated as:

s^t_ih = h^t_ih Ŷ^T,    (9)

where h^t_ih ∈ R^d is the hidden state representation of the h-th masked local hierarchy position corresponding to x_i, and s^t_ihj is the j-th element of s^t_ih ∈ R^L. For prediction, we mimic the use of local hierarchy during training by making BERT predict the labels at each level in an autoregressive manner. The scores s^p_h for the h-th level labels are computed as:

h^p_h = BERTEncoder([e_text; u^p_1; ...; u^p_{h−1}; e_mask], A^p_h),
s^p_h = h^p_h Ŷ^T,    (10)

where h^p_h ∈ R^d is the hidden state representation of the h-th level, e_text and e_mask are the text embeddings and the mask token embedding, and A^p_h is the attention mask for the h-th level, which is a submatrix of the attention mask between text and local hierarchy used in training, and

u^p_h = Σ_j 1(s^p_hj) ŷ_j,    (11)

where 1(s^p_hj) indicates the predicted label corresponding to s^p_hj. Specifically, we sum the predicted label embeddings of the h-th level to generate u^p_h, and replace u^p_h with the [SEP] token embedding e_sep when no label is predicted at the h-th level.
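A toy sketch of the level-by-level decoding loop described above (sigmoid scores are given directly here rather than computed by BERT, and the label names are hypothetical):

```python
# Sketch of the autoregressive prediction loop: at each level h, sigmoid
# scores over that level's labels are thresholded; the predicted labels of
# level h would then form u^p_h for conditioning the next level (or the [SEP]
# embedding if no label survives the threshold).

def decode_levels(level_scores, threshold=0.5):
    """level_scores: per level, a dict {label: sigmoid score};
    returns the union of labels predicted across all levels."""
    predicted = set()
    for scores in level_scores:
        level_pred = {label for label, p in scores.items() if p > threshold}
        if not level_pred:
            # empty level: the model would feed e_sep as u^p_h and continue
            continue
        predicted |= level_pred
    return predicted

pred = decode_levels([
    {"World": 0.9, "Sports": 0.2},        # level 1 scores
    {"European": 0.8, "France": 0.6},     # level 2 scores
])
```

The real model recomputes BERT hidden states at each level with the cached text representation; the sketch only captures the thresholding and level-wise accumulation.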
Finally, the predicted label set y^p is:

y^p = {v_j | 1(s^p_hj) = 1, h = 1, ..., D}.    (12)

Experiments

Datasets and Evaluation Metrics

We select three widely used HTC benchmark datasets in our experiments: Web-of-Science (WOS) (Kowsari et al., 2017), NYTimes (NYT) (Shimura et al., 2018), and RCV1-V2 (Lewis et al., 2004a). Detailed statistics for each dataset are shown in Table 1. We follow the data processing of previous works (Zhou et al., 2020; Wang et al., 2022) and use the same evaluation metrics to measure the experimental results: Macro-F1 and Micro-F1.
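The two metrics differ in how they aggregate: Micro-F1 pools true/false positives over all labels, while Macro-F1 averages per-label F1 scores. A minimal pure-Python sketch over multi-hot vectors:

```python
# Sketch of Micro-F1 and Macro-F1 for multi-label classification.
# y_true, y_pred: lists of multi-hot vectors of equal length L.

def f1(tp, fp, fn):
    """F1 from counts; 0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def micro_macro_f1(y_true, y_pred):
    L = len(y_true[0])
    tp, fp, fn = [0] * L, [0] * L, [0] * L
    for t, p in zip(y_true, y_pred):
        for j in range(L):
            if p[j] and t[j]:
                tp[j] += 1
            elif p[j]:
                fp[j] += 1
            elif t[j]:
                fn[j] += 1
    micro = f1(sum(tp), sum(fp), sum(fn))          # pool counts over labels
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(L)) / L
    return micro, macro

micro, macro = micro_macro_f1([[1, 0], [1, 1]], [[1, 0], [1, 0]])
```

Because Macro-F1 weights every label equally, it is the more sensitive metric on imbalanced hierarchies, which is why the improvements in Table 2 are discussed mainly in Macro-F1.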

Implementation Details
Following HGCLR (Wang et al., 2022), we use bert-base-uncased as both the text and graph encoder. The graph-structured input of BERT is implemented via the attention mask of Hugging Face Transformers (Wolf et al., 2020). We introduce the implementation details of the global hierarchy-aware label embeddings and the local hierarchy-aware text encoder in turn.
For global hierarchy-aware label embeddings, we first initialize the label embeddings by averaging their label-name token embeddings in BERT. To embed the global hierarchy into the label embeddings, we train the initialized label embeddings with frozen bert-base-uncased according to Algorithm 1. The initial mask ratio and the mask ratio upper bound are 0.15 and 0.45, respectively. Considering the different maximum hierarchy levels and numbers of labels in each dataset, we grid-search the label embedding learning rate among {1e-3, 1e-4} and the training steps among {300, 500, 1000}.
For the local hierarchy-aware text encoder, we follow the settings of HGCLR. The batch size is set to 12 and the optimizer is Adam with a learning rate of 3e-5. We use the global hierarchy-aware label embeddings to initialize the label embeddings, and the input label embeddings share weights with the label embeddings in the classification head. The input label length of each dataset is set to the depth in Table 1. Compared to the 512 maximum input tokens in HGCLR, we use fewer input tokens to achieve similar prediction-time performance. According to the maximum hierarchy level, the maximum input tokens for WOS, NYT, and RCV1-V2 are 509, 472, and 492, respectively. Moreover, we cache the attention query and key values from previous steps to make prediction more efficient.

Baselines
We compare with the state-of-the-art and most enlightening methods, including HiAGM (Zhou et al., 2020), HTCInfoMax (Deng et al., 2021), HiMatch (Chen et al., 2021), and HGCLR (Wang et al., 2022). HiAGM, HTCInfoMax, and HiMatch use different fusion strategies to mix text and hierarchy representations. Specifically, HiAGM proposes hierarchy-aware multi-label attention to obtain the text-hierarchy representation. HTCInfoMax introduces information maximization to model the interaction between text and hierarchy. HiMatch reformulates the task as a matching problem by encouraging the text representation to be similar to its hierarchical label representation. Contrary to the above methods, HGCLR achieves state-of-the-art results by directly incorporating the hierarchy into BERT via contrastive learning.

Experimental Results
Table 2 shows Micro-F1 and Macro-F1 on the three datasets. Our method significantly outperforms all baselines by further exploiting the hierarchical information of HTC and the prior knowledge of BERT.
Compared to BERT, the proposed HBGL leverages hierarchical information effectively, achieving Macro-F1 improvements of 2.99%, 4.59%, and 4.05% on WOS, NYT, and RCV1-V2, respectively. Notably, our method performs even better on datasets with complex hierarchies: it achieves a 4.59% Macro-F1 improvement on NYT, which has the largest label depth and number of labels among the three datasets.
HBGL also shows impressive performance compared to BERT-based HTC methods. The current state-of-the-art method HGCLR, which relies on contrastive learning to embed the hierarchy into BERT, offers only negligible improvement over earlier methods such as HiMatch. Furthermore, although these methods incorporate semantic and hierarchical information in different ways, they perform similarly across the three datasets, revealing a common limitation: they merge BERT and a graph encoder while ignoring local hierarchy and the prior knowledge of BERT. By overcoming this limitation, our method achieves 2.23% and 2.76% Macro-F1 improvements on NYT and RCV1-V2 over HGCLR. WOS is the simplest of the three datasets, with a two-level label hierarchy and single-path labels for each document, so the impact of leveraging hierarchy is smaller than on the other two datasets; nevertheless, our method still achieves a reasonable improvement over HGCLR.

Effect of Global Hierarchy-aware Label Embeddings
To analyze the effect of global hierarchy-aware label embeddings, we compare them with alternative label embedding methods on NYT and RCV1-V2; the results are shown in Table 3.

Effect of Local Hierarchy-aware Text Encoder
We also analyze the importance of the local hierarchy-aware text encoder by comparing it with two alternatives: multi-label and seq2seq. For multi-label, we fine-tune BERT as a multi-label classifier. For seq2seq, we fine-tune BERT as a sequence-to-sequence model following s2s-ft, where target labels are sorted according to their levels in the global hierarchy. The local hierarchy-aware text encoder achieves the best performance in Table 4. Compared to existing methods, HBGL achieves significant improvements on all three datasets, while requiring no parameters beyond BERT except those of the label embeddings.

Limitation
While HBGL exploits global and local hierarchies and achieves improvements on three HTC datasets, one limitation is that it requires additional iterations to predict labels: upper-level labels must be predicted before current-level labels. To alleviate this, we cache the BERT attention query and key values from previous iterations and use a smaller source length than HGCLR, which allows HBGL to achieve similar inference speed. Specifically, HGCLR achieves only 1.02× to 1.10× inference speedups over HBGL on the three datasets.

Figure 1: The overall framework of our model under a label hierarchy with four maximum levels. The left part is the global hierarchy-aware label embeddings module; the right part is the local hierarchy-aware text encoder module. We use different colors to identify labels, with darker colors indicating lower levels. Gray squares in the attention masks indicate that attention between tokens is prevented, while white squares indicate that it is allowed. The label embeddings share the same weights in both modules and are initialized by the global hierarchy-aware label embeddings module. Note that special tokens like [CLS] and [SEP], as well as the position and segment tokens of BERT, are omitted for simplicity; we discuss them in the methodology. (Best viewed in color.)

Table 1: Data statistics. L is the number of classes. D is the maximum level of hierarchy. Avg(|L_i|) is the average number of classes per sample.

Table 3: Impact of different label embeddings on NYT and RCV1-V2.

As shown in Table 3, our method outperforms the other three label embedding methods. Although all methods can leverage hierarchical information through the local hierarchy-aware text encoder, the remaining methods still perform worse than global hierarchy BERT, which shows the importance of global hierarchy. As for global hierarchy GAT, GAT cannot leverage the global hierarchy by predicting masked labels the way BERT does.

Table 4: Impact of different fine-tuning methods on NYT and RCV1-V2.

Conclusion

In this paper, we proposed HBGL, a BERT-based HTC framework. HBGL avoids the intentional fusion of semantic and hierarchical modules by utilizing BERT to model both semantic and hierarchical information. Moreover, HBGL takes full advantage of hierarchical information by modeling global and local hierarchies separately. Considering that global hierarchy is static and irrelevant to text samples, we propose global hierarchy-aware label embeddings to inject global hierarchy directly into label embeddings. Considering that local hierarchy is dynamic and relevant to text samples, we propose the local hierarchy-aware text encoder to deeply combine semantic and hierarchical information through the attention mechanism in BERT.