Instances and Labels: Hierarchy-aware Joint Supervised Contrastive Learning for Hierarchical Multi-Label Text Classification

Hierarchical multi-label text classification (HMTC) aims at utilizing a label hierarchy in multi-label classification. Recent approaches to HMTC deal with the problem of imposing an overconstrained premise on the output space by using contrastive learning on generated samples in a semi-supervised manner to bring text and label embeddings closer. However, the generation of samples tends to introduce noise, as it ignores the correlation between similar samples in the same batch. One solution to this issue is supervised contrastive learning, but it remains an underexplored topic in HMTC due to its complex structured labels. To overcome this challenge, we propose HJCL, a Hierarchy-aware Joint Supervised Contrastive Learning method that bridges the gap between supervised contrastive learning and HMTC. Specifically, we employ both instance-wise and label-wise contrastive learning techniques and carefully construct batches to fulfill the contrastive learning objective. Extensive experiments on four multi-path HMTC datasets demonstrate that HJCL achieves promising results and show the effectiveness of contrastive learning on HMTC. Code and data are available at https://github.com/simonucl/HJCL.


Introduction
Text classification is a fundamental problem in natural language processing (NLP), which aims to assign one or multiple categories to a given document based on its content. The task is essential in many NLP applications, e.g. in discourse relation recognition (Chan et al., 2023), scientific document classification (Sadat and Caragea, 2022), or e-commerce product categorization (Shen et al., 2021). In practice, documents might be tagged with multiple categories that can be organized in a concept hierarchy, such as the taxonomy of a knowledge graph (Pan et al., 2017b,a), cf. Figure 1. The task of assigning multiple hierarchically structured categories to documents is known as hierarchical multi-label text classification (HMTC).

Figure 1: Example of an input sample and its annotated labels from the New York Times dataset (Sandhaus, 2008). The label taxonomy shown is a subgraph of the actual hierarchy.
A major challenge for HMTC is how to semantically relate the input sentence and the labels in the taxonomy to perform classification based on the hierarchy. Recent approaches to HMTC handle the hierarchy in a global way by using graph neural networks to incorporate the hierarchical information into the input text, pulling together related input embeddings and label embeddings in the same latent space (Zhou et al., 2020; Deng et al., 2021; Chen et al., 2021; Wang et al., 2022b; Jiang et al., 2022). At the inference stage, most global methods reduce the learned representation into level-wise embeddings and perform prediction in a top-down fashion to retain hierarchical consistency. However, these methods ignore the correlation between labels on different paths (of varying lengths) and at different levels of abstraction.
To overcome these challenges, we develop a method based on contrastive learning (CL) (Chen et al., 2020). So far, the application of contrastive learning to hierarchical multi-label classification has received very little attention. This is because it is difficult to create meaningful positive and negative pairs: given the dependency of labels on the hierarchical structure, each sample can be characterized by multiple labels, which makes it hard to find samples with the exact same labels (Zheng et al., 2021). Previous endeavors in text classification with hierarchically structured labels employ data augmentation methods to construct positive pairs (Wang et al., 2022a; Long and Webber, 2022). However, these approaches primarily focus on pushing apart inter-class labels within the same sample but do not fully utilize the intra-class labels across samples. A notable exception is the work by Zhang et al. (2022a), in which CL is performed across hierarchical samples, leading to considerable performance improvements. However, this method is restricted by the assumption of a fixed depth of the hierarchy, i.e., it assumes all paths in the hierarchy have the same length.
To tackle the above challenges, we introduce HJCL, a supervised contrastive learning method that utilizes in-batch sample information to establish label correlations between samples while retaining the hierarchical structure. Technically, HJCL aims at achieving two main goals: (1) for instance pairs, intra-class representations should obtain higher similarity scores than inter-class pairs, and intra-class pairs at deeper levels should receive more weight than pairs at higher levels; (2) for label pairs, their representations should be pulled close if their original samples are similar. This requires careful choices between positive and negative samples to adjust the contrastive learning based on the hierarchical structure and label similarity. To achieve these goals, we first adopt a text encoder and a label encoder that map the input embeddings and the hierarchy labels into a shared representation space. Then, we utilize a multi-head mechanism to capture the different aspects of the semantics relevant to each label and acquire label-specific embeddings. Finally, we introduce two contrastive learning objectives that operate at the instance level and at the label level. These two losses allow HJCL to learn good semantic representations by fully exploiting information from in-batch instances and labels. We note that the proposed contrastive learning objectives are aligned with two key properties of CL: uniformity and alignment (Wang and Isola, 2020). Uniformity favors feature distributions that preserve maximal mutual information between the representations and the task output, i.e., the hierarchical relation between labels. Alignment refers to the encoder assigning similar features to closely related samples/labels. We also emphasize that, unlike previous methods (Zhang et al., 2022a), our approach makes no assumption on the depth of the hierarchy.
Our main contributions are as follows:
• We propose HJCL, a representation learning approach that bridges the gap between supervised contrastive learning and hierarchical multi-label text classification.
• We propose a novel supervised contrastive loss on hierarchically structured labels that weighs pairs based on both the hierarchy and sample similarity, which resolves the difficulty of applying vanilla contrastive learning to HMTC and fully utilizes the label information between samples.
• We evaluate HJCL on four multi-path datasets. Experimental results show its effectiveness. We also carry out extensive ablation studies.

Related Work

Hierarchical Multi-label Text Classification
Existing HMTC methods can be divided into two groups based on how they utilize the label hierarchy: local and global approaches. The local approach (Kowsari et al., 2017; Banerjee et al., 2019) reuses ideas from flat multi-label classification tasks and trains a separate model for each level of the hierarchy. In contrast, global methods treat the hierarchy as a whole and train a single model for classification. The main objective is to exploit the semantic relationship between the input and the hierarchical labels. Existing methods commonly use reinforcement learning (Mao et al., 2019), meta-learning (Wu et al., 2019), attention mechanisms (Zhou et al., 2020), information maximization (Deng et al., 2021), and matching networks (Chen et al., 2021). However, these methods learn the input text and label representations separately. Recent works have chosen to incorporate stronger graph encoders (Wang et al., 2022a), to recast the hierarchy into different representations, e.g. text sequences (Yu et al., 2022), or to directly incorporate the hierarchy into the text encoder (Jiang et al., 2022; Wang et al., 2022b). To the best of our knowledge, HJCL is the first work to utilize supervised contrastive learning for the HMTC task.
Contrastive Learning In HMTC, two major constraints make it challenging for supervised contrastive learning (SCL) (Gunel et al., 2020) to be effective: multi-label samples and hierarchical labels. Indeed, SCL was originally proposed for samples with single labels, and determining positive and negative sets becomes difficult otherwise. Previous methods resolved this issue mainly by reweighting the contrastive loss based on the similarity to positive and negative samples (Suresh and Ong, 2021; Zheng et al., 2021). The presence of a hierarchy exacerbates this problem. ContrastiveIDRR (Long and Webber, 2022) performed semi-supervised contrastive learning on hierarchy-structured labels by contrasting the set of all other samples against pairs generated via data augmentation. Su et al. (2022b) addressed the sampling issue using a kNN strategy on the trained samples.
In contrast to previous methods, HJCL makes further progress by directly performing supervised contrastive learning on in-batch samples. In a recent study in computer vision, HiMulConE (Zhang et al., 2022a) proposed a method similar to ours that focused on hierarchical multi-label classification with a hierarchy of fixed depth. HJCL does not impose constraints on the depth of the hierarchy and achieves this by utilizing a multi-headed attention mechanism.

Background
Task Formulation Let Y = {y_1, . . ., y_n} be a set of labels. A hierarchy H = (T, τ) is a labelled tree with T = (V, E) a tree and τ : V → Y a labelling function. For simplicity, we will not distinguish between a node and its label, i.e. a label y_i will also denote the corresponding node. Given an input text X = {x_1, . . ., x_m}, the goal of HMTC is to predict the subset of Y that is relevant to X, consistently with the hierarchy H.

Multi-Head Attention We use multi-head attention (Luong et al., 2015) to allow the model to jointly attend to information from different representation subspaces at different positions. Instead of computing a single attention function, this method first projects the query Q, key K and value V onto h different heads, and an attention function is applied individually to these projections. The output is a linear transformation of the concatenation of all attention outputs (Lee et al., 2018):

MHA(Q, K, V) = (O_1 ∥ · · · ∥ O_h) W^o,   (1)

where O_j = Attention(QW_j^q, KW_j^k, VW_j^v), and W_j^q, W_j^k, W_j^v and W^o ∈ R^{h d_v × d} are learnable parameters of the multi-head attention; ∥ represents the concatenation operation.
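To make the computation concrete, here is a minimal PyTorch sketch of Eq. 1 (the scaled dot-product form of Attention is assumed, and all tensor sizes are illustrative):

```python
import torch

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Eq. 1: project Q, K, V onto h heads, attend per head, concatenate,
    then apply the output projection W_o."""
    heads = []
    for Wq_j, Wk_j, Wv_j in zip(W_q, W_k, W_v):
        q, k, v = Q @ Wq_j, K @ Wk_j, V @ Wv_j
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        heads.append(attn @ v)             # O_j = Attention(QW_j^q, KW_j^k, VW_j^v)
    return torch.cat(heads, dim=-1) @ W_o  # (O_1 || ... || O_h) W^o

# Toy shapes: n query rows, m key/value rows, model width d, h heads.
h, d, d_k, d_v, n, m = 4, 64, 16, 16, 5, 12
W_q = [torch.randn(d, d_k) for _ in range(h)]
W_k = [torch.randn(d, d_k) for _ in range(h)]
W_v = [torch.randn(d, d_v) for _ in range(h)]
W_o = torch.randn(h * d_v, d)
out = multi_head_attention(torch.randn(n, d), torch.randn(m, d),
                           torch.randn(m, d), W_q, W_k, W_v, W_o)  # -> (n, d)
```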

Supervised Contrastive Learning
Given a mini-batch with m samples and n labels, we define the set of label embeddings as Z = {z_ij | 1 ≤ i ≤ m, 1 ≤ j ≤ n}, where z_ij is the embedding of the j-th label for the i-th sample. Each label embedding can be seen as an independent instance and can be associated with a label, {(z_ij, y_ij)}_ij. We further define I = {z_ij ∈ Z | y_ij = 1} as the gold label set. Given an anchor sample z_ij from I, we define its positive set as P_ij = {z_kj ∈ I | y_kj = y_ij = 1} and its negative set as N_ij = I \ ({z_ij} ∪ P_ij). The supervised contrastive learning loss (SupCon) (Khosla et al., 2020) is formulated as follows:

L_SupCon = Σ_{z_ij ∈ I} −1/|P_ij| Σ_{z_p ∈ P_ij} log ( f(z_ij, z_p) / Σ_{z_a ∈ P_ij ∪ N_ij} f(z_ij, z_a) ),   (2)

where f(z, z′) = exp(sim(z, z′)/τ) is the temperature-scaled exponential cosine similarity between two embeddings.
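For concreteness, a minimal PyTorch sketch of the SupCon objective in a flattened single-label view (the P_ij/N_ij sets above generalize this to per-label embeddings across samples):

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, y, tau=0.1):
    """SupCon sketch (Khosla et al., 2020): z is (N, d) embeddings, y is (N,)
    class ids; each anchor is pulled towards same-class embeddings (its
    positive set) and contrasted against everything else."""
    z = F.normalize(z, dim=-1)
    logits = z @ z.T / tau
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()  # stability
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (y[:, None] == y[None, :]) & ~self_mask
    exp = torch.exp(logits).masked_fill(self_mask, 0.0)    # drop self-similarity
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True))
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(pos_mask * log_prob).sum(dim=1).div(pos_count).mean()
```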

Methodology
The overall architecture of HJCL is shown in Fig. 2. In a nutshell, HJCL first extracts a label-aware embedding for each label from the tokens of the input text in the embedding space. It then combines two distinct types of supervised contrastive learning to jointly leverage the hierarchical information and the label information of in-batch samples: (i) Instance-level Contrastive Learning and (ii) Hierarchy-aware Label-enhanced Contrastive Learning (HiLeCon).

Label-Aware Embedding
In the context of HMTC, a major challenge is that different parts of the text could contain information related to different paths in the hierarchy. To overcome this problem, we first design and extract label-aware embeddings from the input texts, with the objective of learning the unique label embeddings between the labels and the sentences in the input text. Following previous work (Wang et al., 2022a; Jiang et al., 2022), we use BERT (Devlin et al., 2019) as text encoder, which maps the input tokens into the embedding space: H = {h_1, . . ., h_m}, where h_i is the hidden representation of each input token x_i and H ∈ R^{m×d}. For the label embeddings, we initialise them with the average of the BERT embeddings of their text descriptions. To learn the hierarchical information, a graph attention network (GAT) (Velickovic et al., 2018) is used to propagate the hierarchical information between the nodes, yielding hierarchy-aware label embeddings Y′.
After mapping them into the same representation space, we perform multi-head attention as defined in Eq. 1, by setting the i-th label embedding y′_i as the query Q, and the input token representations H as both the key and value. The label-aware embedding g_i is defined as follows:

g_i = MHA(y′_i, H, H).

Each g_i is computed from the attention weights between the label y_i and each input token in H, which are then multiplied by the input tokens in H to obtain the label-aware embedding. The label-aware embedding g_i can thus be seen as a pooled representation of the input tokens in H, weighted by their semantic relatedness to the label y_i.
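A minimal sketch of this step, assuming GAT-propagated label vectors as queries over the BERT token states (module and argument names here are ours):

```python
import torch
import torch.nn as nn

class LabelAwareEmbedding(nn.Module):
    """Sketch: each label queries the token representations H, yielding one
    label-aware embedding g_i per label."""
    def __init__(self, d_model=768, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, label_emb, H, pad_mask=None):
        # label_emb: (batch, n_labels, d) -- e.g. GAT-propagated label vectors
        # H:         (batch, seq_len, d)  -- BERT token representations
        g, _ = self.attn(query=label_emb, key=H, value=H,
                         key_padding_mask=pad_mask)
        return g  # (batch, n_labels, d): tokens pooled by label relatedness
```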

Integrating with Contrastive Learning
Following the general paradigm for contrastive learning (Khosla et al., 2020; Wang et al., 2022a), the learned embedding g_i has to be projected into a new subspace, in which contrastive learning takes place. Taking inspiration from Wang et al. (2018) and Liu et al. (2022), we fuse the label representations and the learned embeddings to strengthen the label information in the embeddings used by contrastive learning.

Instance-level Contrastive Learning For instance-wise contrastive learning, the objective is simple: the anchor instances should be closer to instances with a similar label structure than to instances with unrelated labels, cf. Fig. 2. Moreover, the anchor nodes should be closer to positive instance pairs at deeper levels of the hierarchy than to positive instance pairs at higher levels. Following this objective, we define a distance inequality: dist_pos^{ℓ_1} < dist_pos^{ℓ_2} < dist_neg, where 1 ≤ ℓ_2 < ℓ_1 ≤ L and dist_pos^{ℓ} is the distance between the anchor instance X_i and an instance X_ℓ that has the same labels at level ℓ.
Given a mini-batch of instances, where G_i contains the label-aware embeddings for sample i, we define their subsets at each level ℓ and compute a SupCon-style contrastive term per level, where L = {1, . . ., ℓ_h} is the set of levels in the taxonomy, |L| is the maximum depth, and the term exp(1/(|L| − ℓ)) is a penalty applied to pairs constructed from deeper levels in the hierarchy, forcing them to be closer than pairs constructed from shallower levels.
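As a rough sketch of our reading of the (partially garbled) equation above, the level penalty and its aggregation could look as follows; the handling of the deepest level and the exact per-level positive sets are assumptions:

```python
import torch

def level_penalty(level: int, max_depth: int) -> float:
    """exp(1 / (|L| - level)): deeper agreement -> larger weight, so positive
    pairs sharing labels at deeper levels are pulled closer. Assumes
    level < max_depth; how the deepest level is clamped is elided above."""
    return torch.exp(torch.tensor(1.0 / (max_depth - level))).item()

def instance_loss(per_level_supcon, max_depth):
    """Aggregate per-level SupCon-style terms (dict: level -> loss tensor),
    reweighted by the level penalty."""
    return sum(level_penalty(l, max_depth) * loss
               for l, loss in per_level_supcon.items())
```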
Label-level Contrastive Learning We also introduce label-wise contrastive learning. This is possible due to our extraction of label-aware embeddings in Section 4.1, which allows us to learn each label embedding independently. Although Equation 2 performs well in multi-class classification (Zhang et al., 2022b), this is not the case for multi-label classification with a hierarchy: (1) it ignores the semantic relation between the original samples {X_i, X_k}; (2) N_ij contains label embeddings z_ik from the same sample but with different classes, and pushing apart labels that are connected in the hierarchy could damage classification performance. To bridge this gap, we propose a Hierarchy-aware Label-Enhanced Contrastive loss function (HiLeCon), which carefully weighs the contrastive strength based on the relatedness of the positive and negative labels to the anchor labels. The basic idea is to weigh the degree of contrast between two label embeddings z_i, z_j by the label similarity of their samples, Y_i, Y_j ∈ {0, 1}^n. In particular, in supervised contrastive learning, the gold labels of the samples from which a label pair comes can be used to measure their similarity. We use a variant of the Hamming metric that treats labels occurring at different levels of the hierarchy differently, such that pairs of labels differing at higher levels have a larger semantic difference than pairs differing at deeper levels. Our metric between Y_i and Y_j is defined as:

ρ(Y_i, Y_j) = Σ_{t=1}^{n} 1[Y_i^t ≠ Y_j^t] · (ℓ_h − ℓ_t + 1),   (3)

where ℓ_t is the level of the t-th label in the hierarchy and ℓ_h the maximum depth. For example, the distance between News and Classifieds in Figure 1 is 4, while the distance between United Kingdom and France is only 1. Intuitively, this is the case because United Kingdom and France are both under Countries, and samples with these two labels could still share similar contexts relating to World News.
We can now use our metric to set the weights of positive pairs z_p ∈ P_ij and negative pairs z_n ∈ N_ij in Eq. 2:

σ_ip = (C − ρ(Y_i, Y_p)) / C,   σ_in = ρ(Y_i, Y_n) / C,

where C = ρ(0_n, 1_n) is used to normalize the σ values. HiLeCon is then defined as:

L_HiLeCon = Σ_{z_ij ∈ I} −1/|P_ij| Σ_{z_p ∈ P_ij} log ( σ_ip f(z_ij, z_p) / ( Σ_{z_p ∈ P_ij} σ_ip f(z_ij, z_p) + Σ_{z_n ∈ N_ij} σ_in f(z_ij, z_n) ) ),   (4)

where n is the number of labels and f(·, ·) is the exponential cosine similarity measure between two embeddings. Intuitively, in L_HiLeCon the label embeddings of samples with similar gold label sets should be close to each other in the latent space, and the magnitude of the pull is determined by how similar their gold labels are. Conversely, dissimilar labels are pushed apart with a strength proportional to how dissimilar their gold label sets are.
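A sketch of the weighting machinery; the metric ρ and the direction of the σ weights follow our reconstruction above, and levels, max_depth and the tensor layout are illustrative:

```python
import torch

def rho(Y_i, Y_j, levels, max_depth):
    """Level-weighted Hamming distance (sketch): labels that disagree at
    higher levels cost more than disagreements deep in the hierarchy.
    Y_i, Y_j: (n,) 0/1 label vectors; levels: (n,) level of each label."""
    diff = (Y_i != Y_j).float()
    return (diff * (max_depth - levels + 1)).sum()

def hile_weights(Y_anchor, Y_other, levels, max_depth):
    """sigma weights: similar gold sets -> pull positives harder;
    dissimilar gold sets -> push negatives harder."""
    C = rho(torch.zeros_like(Y_anchor), torch.ones_like(Y_anchor),
            levels, max_depth)                 # max possible distance
    d = rho(Y_anchor, Y_other, levels, max_depth) / C
    return 1.0 - d, d                          # (sigma_pos, sigma_neg)
```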

Classification and Objective Function
At the inference stage, we flatten the label-aware embeddings and pass them through a linear layer to obtain the logit s_i for each label i (Eq. 5). For classification we adopt the ZLPR loss (Su et al., 2022a), computed over the logits s_i, s_j ∈ R output by Equation (5); the final prediction assigns a label i iff s_i > 0 (see Appendix B.3 for a comparison with BCE). Finally, we define our overall training loss function as

L = L_cls + λ_1 L_inst + λ_2 L_HiLeCon,   (7)

where L_cls is the ZLPR classification loss, and λ_1 and λ_2 are the weighting factors for the instance-wise contrastive loss and HiLeCon.

Table 1: Experimental results on the four HMTC datasets. The best results are in bold and the second-best are underlined. We report mean results across 5 runs with random seeds. Models marked with △ use contrastive learning. HiAGM, HTCInfoMax and HiMatch originally used TextRCNN (Zhou et al., 2020) as text encoder; we replicate their results by replacing it with BERT. ↑ denotes the improvement over the second-best model; ± denotes the standard deviation across runs.

Experiments
Datasets and Evaluation Metrics We conduct experiments on four widely used HMTC benchmark datasets, all of them consisting of multi-path labels: the Blurb Genre Collection (BGC), the Arxiv Academic Papers Dataset (AAPD) (Yang et al., 2018), NYTimes (NYT) (Shimura et al., 2018), and RCV1-V2 (Lewis et al.). Details for each dataset are shown in Table 5. We adopt the data processing method introduced in Chen et al. (2021) to remove stopwords, and we use the same evaluation metrics: Macro-F1 and Micro-F1.
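Both metrics can be computed directly with scikit-learn over binary indicator matrices; a minimal example:

```python
import numpy as np
from sklearn.metrics import f1_score

# y_true / y_pred: binary indicator matrices of shape (n_samples, n_labels)
y_true = np.array([[1, 0, 1, 0], [0, 1, 1, 0]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 1, 1]])
micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools all label decisions
macro_f1 = f1_score(y_true, y_pred, average="macro")  # averages per-label F1
```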

Baselines
We compare HJCL with a variety of strong hierarchical text classification baselines: HiAGM (Zhou et al., 2020), HTCInfoMax (Deng et al., 2021), HiMatch (Chen et al., 2021), Seq2Tree (Yu et al., 2022), and HGCLR (Wang et al., 2022a). Notably, HiMulConE (Zhang et al., 2022a) also uses contrastive learning on the hierarchical graph. More details about their implementation are given in Appendix A.2. Given the recent advancement of Large Language Models (LLMs), we also consider ChatGPT (gpt-3.5-turbo; Brown et al., 2020) with zero-shot prompting as a baseline. The prompts and example answers from ChatGPT can be found in Appendix C.

Main Results
Table 1 presents the results on hierarchical multi-label text classification. More details can be found in Appendix A. From Table 1, one can observe that HJCL significantly outperforms the baselines. This shows the effectiveness of incorporating supervised contrastive learning into the semantic and hierarchical information. Note that although HGCLR (Wang et al., 2022a) introduces a stronger graph encoder and performs contrastive learning on generated samples, it inevitably introduces noise into these samples and overlooks the label correlation between them. In contrast, HJCL uses a simpler graph network (GAT) and performs contrastive learning on in-batch samples only, yielding significant improvements of 2.73% and 2.06% in Macro-F1 on BGC and NYT. Despite Seq2Tree's use of a more powerful encoder, T5, HJCL still shows promising improvements of 2.01% and 0.48% in Macro-F1 on AAPD and RCV1-V2, respectively. This demonstrates that the use of contrastive learning better exploits the power of the BERT encoder. HiMulConE shows a drop in Macro-F1 even compared to the BERT baseline, especially on NYT, which has the most complex hierarchical structure. This demonstrates that our approach of extracting label-aware embeddings is an important step in contrastive learning for HMTC. As for the instruction-tuned model, ChatGPT performs poorly, particularly on minority classes. This shows that handling complex hierarchical information remains challenging for LLMs, and that representation learning is still necessary.

Ablation Study

To better understand the impact of the different components of HJCL on performance, we conducted an ablation study on both the RCV1-V2 and NYT datasets. The RCV1-V2 dataset has a substantial testing set, which helps to minimise experimental noise, while the NYT dataset has the largest depth. One can observe in Table 2 that without label contrastive learning the Macro-F1 drops notably in both datasets, by 1.23% and 0.87%. The removal of HiLeCon reduces the potential for label clustering and mainly affects the minority labels. Conversely, Micro-F1 is primarily affected by the omission of the sample contrast, which prevents the model from considering the global hierarchy and from learning label features from training instances of other classes based on their hierarchical interdependencies. When both loss functions are removed, the performance declines drastically. This demonstrates the effectiveness of our dual-loss approach.

Additionally, replacing the ZLPR loss with the BCE loss results in a slight performance drop, showcasing the importance of considering label correlation during the prediction stage. A further comparison between the BCE and ZLPR losses is given in Appendix B.3. Finally, as shown in the last row of Table 2, the removal of the graph label fusion has a significant impact on performance; this is consistent with the observation that contrastive learning loses generalization power without a projection head (Gupta et al., 2022). Ablation results on the other datasets can be found in Appendix B.1.

Effects of the Coefficients λ1 and λ2
As shown in Equation (7), the coefficients λ1 and λ2 control the importance of the instance-wise and label-wise contrastive losses, respectively. Figure 3 illustrates the changes in Macro-F1 when varying the values of λ1 and λ2. The left part of Figure 3 shows that performance peaks at small λ1 values and drops rapidly as λ1 continues to increase. Intuitively, assigning too much weight to the instance-level CL pushes apart similar samples that have slightly different label sets, preventing the model from fully utilizing samples that share similar topics. For λ2, the F1 score peaks at 0.6 and 1.6 for NYT and RCV1-V2, respectively. We attribute this difference to the complexity of the NYT hierarchy, which is deeper. Even with the assistance of the hierarchy-aware weighting function (Eq. 3), increasing λ2 excessively may result in overwhelmingly high semantic similarities among label embeddings (Gao et al., 2019). The remaining results are provided in Appendix B.1.

Effect of the Hierarchy-Aware Label Contrastive Loss
To further evaluate the effectiveness of HiLeCon in Eq. 4, we conduct experiments replacing it with the traditional SupCon (Khosla et al., 2020), or dropping the hierarchy weighting by replacing ρ(·, ·) in Eq. 3 with the plain Hamming distance (LeCon). Figure 4 presents the obtained results, with detailed numbers in Table 3. HiLeCon outperforms the other two methods on all four datasets by a substantial margin. Specifically, HiLeCon outperforms the traditional SupCon across the four datasets by an absolute gain of 2.06% and 3.27% in Micro-F1 and Macro-F1, respectively. As shown in Table 3, the difference in both metrics is statistically significant, with p-values of 8.3e-3 and 2.8e-3 under a two-tailed t-test. Moreover, the improvement over LeCon from considering the hierarchy is 0.56% and 1.19% in the two F1 scores, which is also statistically significant (p-values = 0.033, 0.040). This shows the importance of considering label granularity with depth information.

Results on Multi-Path Consistency
One of the key challenges in hierarchical multi-label classification is that the input texts can be categorized into more than one path in the hierarchy. In this section, we analyze how HJCL leverages contrastive learning to improve the coverage of all the meanings of the input sentence. For HMTC, multi-path consistency can be viewed from two perspectives: first, some paths from the gold labels may be missing from the prediction, meaning that the model failed to attribute the semantic information of that path to the sentence; second, even if all paths are predicted, the model may only predict the coarse-grained labels at upper levels while missing the more fine-grained labels at lower levels. To compare performance on these problems, we measure path accuracy (Acc_P) and depth accuracy (Acc_D), the ratios of testing samples whose number of paths and whose depths are correctly predicted, respectively (formal definitions are given in Appendix B.4). As shown in Table 4, HJCL (and its variants) outperform the baselines, with an offset of 2.4% on average compared with the second-best model, HGCLR. Specifically, the Acc_P of HJCL outperforms HGCLR by an absolute gain of 5.5% on NYT, in which the majority of samples are multi-path (cf. Table 9 in the Appendix). HJCL shows performance boosts for multi-path samples, demonstrating the effectiveness of contrastive learning. HJCL better exploits the correlation between labels on different paths of the hierarchy through contrastive learning; for an intuition, see the visualization in Figure 9, Appendix B.5. For example, the F1 score of Top/Features/Travel/Guides/Destinations/North America/United States is only 0.3350 for the HGCLR method (Wang et al., 2022a), whereas our method, which fully utilises the label correlation information, improves the F1 score to 0.8176. Figure 5 shows a case study of the prediction results from different models. Although HGCLR is able to classify U.S. under News (the middle path), it fails to exploit label similarity information to identify the United States label under the Features path (the left path). In contrast, our model correctly identifies U.S. and Washington while avoiding the false positive for Sports under the News category.

Conclusion
We introduce HJCL, a combination of two novel contrastive methods that learn better representations for hierarchical multi-label text classification (HMTC). Our method has the following features: (1) it demonstrates that contrastive learning can help retain the hierarchy information between samples; (2) by weighting both label similarity and depth information, applying supervised contrastive learning directly at the label level yields promising improvements; (3) evaluation on four multi-path HMTC datasets demonstrates that HJCL significantly outperforms previous baselines and shows that in-batch contrastive learning notably enhances performance. Overall, HJCL bridges the gap between supervised contrastive learning and hierarchically structured label classification tasks in general, and demonstrates that better representation learning is feasible for improving HMTC performance.
In the future, we plan to look into applying our approach to special kinds of texts, such as arguments (Saadat-Yazdi et al., 2022, 2023; Chausson et al., 2023), news (Pan et al., 2018; Liu et al., 2021; Long et al., 2020a,b) and events (Guan et al., 2023). Furthermore, we will also develop our approach in the setting of multi-modal classification (Kiela et al., 2018; Chen et al., 2022b; Huang et al., 2023; Chen et al., 2022a), involving both texts and images.

Limitations
Our method is based on the extraction of a label-aware embedding for each label in the given taxonomy through multi-head attention, and performs contrastive learning on the learned embeddings. Although our method shows significant improvements, the number of label-aware embeddings scales with the number of labels in the taxonomy. Thus, our method may not be applicable to HMTC datasets with a very large number of labels. Recent studies (Ni et al., 2023) point to possible improvements of multi-headed attention (MHA) that reduce the over-parametrization it introduces. Further work should focus on reducing the number of label-aware embeddings while retaining comparable performance.

A Appendix for Experiment Settings
A.1 Implementation Details

We implement our model using PyTorch Lightning, since it is suitable for the large batches used in contrastive learning. For a fair comparison, we employ the bert-base-uncased model used by other HMTC models to implement HJCL. The batch size is set to 80 for all datasets. Unless noted otherwise, λ1 and λ2 in Eq. 7 are fixed to 0.1 and 0.5 for all datasets, without any hyperparameter search. The temperature τ is fixed at 0.1. The number of heads for multi-head attention is set to 4. We use 2 layers of GAT for hierarchy injection on BGC, AAPD and RCV1-V2, and 4 layers for NYT due to its depth. The optimizer is AdamW (Loshchilov and Hutter, 2017) with a learning rate of 3e-5. Early stopping suspends training when the Macro-F1 on the validation set has not increased for 10 epochs. Since contrastive learning introduces stochasticity, we performed each experiment with 5 random seeds. For the baseline models, we use the hyperparameters from the original papers to replicate their results.
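For reference, the stated settings gathered into a single, illustrative configuration sketch; the model object and the monitored metric name are placeholders:

```python
import torch
from torch.optim import AdamW
from pytorch_lightning.callbacks import EarlyStopping

# Hyperparameters stated in A.1 (the HJCL module itself is a placeholder here):
config = dict(
    encoder="bert-base-uncased", batch_size=80,
    lambda1=0.1, lambda2=0.5,   # Eq. 7 weights
    temperature=0.1, attn_heads=4,
    gat_layers=2,               # 4 for NYT due to its depth
    lr=3e-5,
)
model = torch.nn.Linear(768, 10)  # stand-in for the HJCL network
optimizer = AdamW(model.parameters(), lr=config["lr"])
early_stop = EarlyStopping(monitor="val_macro_f1", patience=10, mode="max")
```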
For HiMulConE (Zhang et al., 2022a), as the model was designed for the image domain, we replaced its ResNet-50 feature encoder with BERT and replicated its experiments by first training the encoder with the proposed loss and then the classifier with the BCE loss, with a 5e-5 learning rate.

A.2 Baseline Models
To show the effectiveness of our proposed method, HJCL, we compared it with previous HMTC works.
In this section, we mainly describe baselines in recent work with strong performance.
• HTCInfoMax (Deng et al., 2021) utilises information maximization to model the interactions between text and hierarchy.
• HiMatch (Chen et al., 2021) turns the problem into a matching problem by grouping the text representation with its hierarchical label representation.
• Seq2Tree (Yu et al., 2022) introduces a sequence-to-tree framework and turns the problem into a sequence generation task using the T5 Model (Raffel et al., 2019).
• HiMulConE (Zhang et al., 2022a) is the closest to our work; it also performs contrastive learning on hierarchical labels, but its hierarchy has a fixed height and its labels are single-path only.
• HGCLR (Wang et al., 2022a) incorporates the hierarchy directly into BERT and performs contrastive learning on the generated positive samples.

B Appendix for Evaluation Result and Analysis
B.1 Ablation Study for BGC and AAPD

The ablation results for BGC and AAPD are presented in Table 6. It is worth noting that, in the case of AAPD, the removal of the label contrastive loss significantly affects the Micro-F1 and Macro-F1 scores in both datasets. Conversely, when the instance contrastive loss is removed, only minor changes are observed in comparison to the other three datasets. This can be primarily attributed to the shallow hierarchy of AAPD, which consists of only two levels, resulting in smaller differences between instances. Furthermore, the results in Table 1 demonstrate that the substantial improvement in Macro-F1 for AAPD can be attributed to HiLeCon, further highlighting the effectiveness of our hierarchy-aware label contrastive method. On the other hand, the results for BGC follow a similar trend to RCV1-V2, which has a similar hierarchy structure (cf. Table 5): the removal of either loss leads to a comparable drop in performance. The findings presented in the last two rows of Table 6 are consistent with the performance observed in the ablation study for NYT and RCV1-V2, underscoring the importance of both the ZLPR loss and the graph label fusion.

B.2 Appendix for Hyperparameter Analysis
The hyperparameter analysis of the Micro-F1 scores for NYT and RCV1-V2 is shown in Figure 6. The results are aligned with the observations for Macro-F1. Moreover, the hyperparameter analysis for BGC and AAPD regarding λ1 and λ2 is presented in Figure 7. Consistent with the observations from the ablation study, the instance loss has a minor influence on AAPD, with performance peaking at λ1 = 0.2 and subsequently dropping. Conversely, for any value of λ2, the performance outperforms the baseline at λ2 = 0, highlighting its effectiveness for shallow hierarchy labels. Additionally, the changes on BGC are consistent with those observed on RCV1-V2, as depicted in Figure 3.

B.3 BCE vs. ZLPR
In this paper, we replaced the commonly used BCE loss with a new loss function, ZLPR (Su et al., 2022a), as it provides a more balanced loss for the multi-label classification task: it leverages the softmax function and considers the correlations between labels, in contrast to the sigmoid + BCE approach proposed in the original paper (Zhang et al., 2022a). We consider this characteristic fundamental, as it aligns with our approach of emphasizing label correlations across different paths of the hierarchy.
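For reference, a minimal sketch of the ZLPR objective as described by Su et al. (2022a); tensor names are ours:

```python
import torch

def zlpr_loss(logits, targets):
    """ZLPR sketch: drives every positive logit above 0 and every negative
    logit below 0, coupling all labels through the log-sum-exp.
    logits: (B, n) real scores; targets: (B, n) in {0, 1}."""
    neg_inf = torch.full_like(logits, float("-inf"))
    pos = torch.where(targets.bool(), -logits, neg_inf)  # positives enter as -s_i
    neg = torch.where(targets.bool(), neg_inf, logits)   # negatives enter as  s_j
    zero = torch.zeros_like(logits[:, :1])               # the "1 +" inside each log
    return (torch.logsumexp(torch.cat([zero, pos], dim=1), dim=1)
            + torch.logsumexp(torch.cat([zero, neg], dim=1), dim=1)).mean()

# Inference under ZLPR: predict label i iff its logit is above the 0 threshold.
# preds = (logits > 0).long()
```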
To provide a fairer comparison, we conducted additional experiments on strong baselines, in line with our ablation study settings, replacing their BCE loss with the ZLPR loss on the NYT and RCV1-V2 datasets. As shown in Tables 7 and 8, ZLPR consistently yields improvements across different methods, further highlighting its effectiveness in multi-label classification. On the other hand, even with the integration of the ZLPR loss, our method continues to outperform the other baseline models. This shows that it is not only the adoption of the ZLPR loss, but the overall design, that allows our model to outperform the state of the art.

B.4 Performance on Multi-Path Samples
Statistics for the number of path distributions on the four multi-path HMTC datasets are shown in Table 9. Figure 8 presents the results of the performance on samples with different paths in NYT dataset.
Before formalizing Acc_P and Acc_D, we define some auxiliary notions. Given the testing dataset D = {(X_i, ŷ_i)}_{i=1}^{N} and the prediction results y_i for all i ≤ N, where ŷ_i, y_i ⊆ Y, the set of true positive labels for each sample is defined as y_i^Pos = y_i ∩ ŷ_i. We then decompose the label sets ŷ_i and y_i^Pos into disjoint sets, each containing the labels of a single path, written Path(ŷ_i). We say that the gold label set ŷ_i and the prediction y_i are path consistent when every path in Path(ŷ_i) also appears in Path(y_i^Pos). Acc_P measures the ratio of predictions that have all their paths correctly predicted; Acc_D measures the ratio of paths that the prediction got entirely correct, down to their deepest label. The results on multi-path consistency for BGC and AAPD are shown in Table 10.
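A sketch of one possible reading of Acc_P, decomposing each label set into root-to-leaf paths via parent pointers; the exact formal definition above is partially elided in the extraction:

```python
def split_into_paths(labels, parent):
    """Decompose a label set into root-to-leaf paths using parent pointers.
    parent: dict mapping each label to its parent (the root maps to None)."""
    labels = set(labels)
    non_leaves = {parent.get(l) for l in labels if parent.get(l) in labels}
    paths = []
    for leaf in labels - non_leaves:          # leaves: no in-set child
        path, node = [], leaf
        while node is not None and node in labels:
            path.append(node)
            node = parent.get(node)
        paths.append(frozenset(path))
    return set(paths)

def path_accuracy(gold_sets, pred_sets, parent):
    """Acc_P sketch: the fraction of samples whose predicted paths coincide
    exactly with the gold paths."""
    hits = sum(split_into_paths(g, parent) == split_into_paths(p, parent)
               for g, p in zip(gold_sets, pred_sets))
    return hits / len(gold_sets)
```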

B.5 t-SNE Visualisation

To qualitatively analyse HiLeCon, we plot the t-SNE visualisation of the learned label embeddings across paths, as shown in Figure 9.

B.6 Case study details
The complete news report from the NYT dataset used for the case study is shown in Figure 10. The complete set of labels for the four hierarchy plots (Figure 5) is shown in Table 11. Note that, to save space, the ancestors of leaf labels are omitted, since they are already self-contained within the names of the leaf labels themselves.

C Discussion and Case Example for ChatGPT
For each prompt, the LLM is presented with the input text and the label words structured in a hierarchical format, and is asked to return all applicable labels.

Figure 2: The model architecture of HJCL. The model is split into three parts: (a) shows the multi-headed attention and the extraction of label-aware embeddings; parts (b) and (c) show the instance-wise and label-wise contrastive learning. The legend on the lower left of (a) shows the labels corresponding to each color. We use different colors to indicate the strength of contrast: the lighter the color, the weaker the pushing/pulling between two instances/labels.

Figure 3: Effects of λ1 (left) and λ2 (right) on NYT and RCV1. The step size for λ1 is 0.1 and for λ2 is 0.2; λ1 has a smaller step size since it is more sensitive to changes.

Figure 4: F1 scores on the four datasets with different contrastive methods. The text above each bar shows the offset between the model and the SupCon model.

Figure 5: Case study on a sample from the NYT dataset. Orange represents true positive labels; green represents false negative labels; red represents false positive labels; blue represents true negative labels. The ellipsis indicates that two nodes are skipped. Part of the input text is shown at the bottom; the full text and prediction results are in Appendix B.6.
Figure 8: (a) Micro-F1 and (b) Macro-F1 scores on the NYT testing data, grouped by the number of paths in the hierarchy. HiLeCon is our proposed method, and HiLeCon (w/o) drops the contrastive learning function.
Figure 10: The complete New York Times report used in the case study.

Table 2: Ablation study when removing components on the RCV1-V2 and NYT datasets. "r.m." stands for removing the component; "r.p." stands for replacing it with.

Table 3: Comparison of different contrastive learning approaches on the label embeddings, performed on the 4 datasets. HiLeCon denotes our proposed method. The p-values are calculated by two-tailed t-tests.

Table 4: Measurement of Acc_P and Acc_D on NYT and RCV1. The best scores are in bold and the second best are underlined. The formulas and the results on BGC and AAPD are shown in Appendix B.4.

Table 5: Dataset statistics. L is the number of classes, D is the maximum level of the hierarchy, and Avg(L_i) is the average number of classes per sample. Note that the commonly used WOS dataset (Kowsari et al., 2017) was not used, as its labels are single-path only.

Table 6: Ablation study when removing components on the AAPD and BGC datasets. "r.m." stands for removing the component; "r.p." stands for replacing it with.

Table 7: Experimental results on the NYT dataset with the traditional BCE loss and the ZLPR loss. Best results are in bold.

Table 8: Experimental results on the RCV1-V2 dataset with the traditional BCE loss and the ZLPR loss. Best results are in bold.

Table 10: Measurement of path accuracy and depth accuracy on BGC and AAPD.