Hierarchy-aware Label Semantics Matching Network for Hierarchical Text Classification

Hierarchical text classification is an important yet challenging task due to the complex structure of the label hierarchy. Existing methods ignore the semantic relationship between text and labels, so they cannot make full use of the hierarchical information. To this end, we formulate the text-label semantics relationship as a semantic matching problem and thus propose a hierarchy-aware label semantics matching network (HiMatch). First, we project text semantics and label semantics into a joint embedding space. We then introduce a joint embedding loss and a matching learning loss to model the matching relationship between the text semantics and the label semantics. Our model captures the text-label semantics matching relationship among coarse-grained labels and fine-grained labels in a hierarchy-aware manner. The experimental results on various benchmark datasets verify that our model achieves state-of-the-art results.


Introduction
Hierarchical text classification (HTC) is widely used in Natural Language Processing (NLP), for tasks such as news categorization (Lewis et al., 2004) and scientific paper classification (Kowsari et al., 2017). HTC is a particular multi-label text classification problem which introduces hierarchies to organize the label structure. As depicted in Figure 1, HTC models predict multiple labels in a given label hierarchy, which generally constitute one or multiple paths from coarse-grained labels to fine-grained labels in a top-down manner (Sun and Lim, 2001). Generally speaking, fine-grained labels are the most appropriate labels for describing the input text, while coarse-grained labels are the parent nodes of coarse- or fine-grained labels and express a more general concept. The key challenges of HTC are to model the large-scale, imbalanced, and structured label hierarchy (Mao et al., 2019). Existing work in HTC has introduced various methods to use hierarchical information in a holistic way. To capture holistic label correlation features, some researchers proposed a hierarchy-aware global model to exploit the prior probability of label dependencies through Graph Convolution Networks (GCN) and TreeLSTM (Zhou et al., 2020). Others introduced additional label correlation features such as label semantic similarity and label co-occurrence (Lu et al., 2020). These methods follow the traditional way of transforming HTC into multiple binary classifiers, one for every label (Fürnkranz et al., 2008). However, they ignore the interaction between text semantics and label semantics, which is highly useful for classification (Chen et al., 2020). Hence, such models may not be sufficient to model complex label dependencies or to provide comparable text-label classification scores.
A natural strategy for modeling the interaction between text semantics and label semantics is to introduce a text-label joint embedding via label attention (Xiao et al., 2019) or autoencoders (Yeh et al., 2017). Label attention-based methods adopt a self-attention mechanism to identify label-specific information (Xiao et al., 2019). Autoencoder-based methods extend the vanilla Canonical Correlated Autoencoder (Yeh et al., 2017) to a ranking-based autoencoder architecture to produce comparable text-label scores. However, these methods assume all labels are independent, without fully considering the correlation between coarse-grained labels and fine-grained labels, so they cannot simply be transferred to HTC models (Zhou et al., 2020).
In this paper, we formulate the interaction between text and labels as a semantic matching problem and propose a Hierarchy-aware Label Semantics Matching Network (HiMatch). The principal idea is that the text representation should be semantically similar to the target label representations (especially the fine-grained ones), while being semantically far away from incorrect label representations. First, we adopt a text encoder and a label encoder (shown in Figure 2) to extract text semantics and label semantics, respectively. Second, inspired by methods for learning common embeddings, we project both text semantics and label semantics into a text-label joint embedding space in which the correlations between text and labels are exploited. In this joint embedding space, we introduce a joint embedding loss between text semantics and target label semantics to learn a text-label joint embedding. After that, we apply a matching learning loss to capture text-label matching relationships in a hierarchy-aware manner: the fine-grained labels are semantically closest to the text semantics, followed by the coarse-grained labels, while the incorrect labels should be semantically far away from the text semantics. Hence, we propose a hierarchy-aware matching learning method that captures these different matching relationships through different penalty margins on semantic distances. Finally, we employ the text representations, guided by the joint embedding loss and the matching learning loss, to perform hierarchical text classification.
The major contributions of this paper are: 1. By considering the text-label semantics matching relationship, we are the first to formulate HTC as a semantic matching problem rather than merely multiple binary classification tasks.
2. We propose a hierarchy-aware label semantics matching network (HiMatch), in which we introduce a joint embedding loss and a matching learning loss to learn the text-label semantics matching relationship in a hierarchy-aware manner.
3. Extensive experiments (with/without BERT) on various datasets show that our model achieves state-of-the-art results.
Related Work

Hierarchical Text Classification
Hierarchical text classification is a particular multi-label text classification problem in which the classification results are assigned to one or more nodes of a taxonomic hierarchy. Existing state-of-the-art methods focus on encoding the hierarchy constraint in a global view, such as a directed graph or tree structure. Zhou et al. (2020) proposed a hierarchy-aware global model to exploit the prior probability of label dependencies. Lu et al. (2020) introduced three kinds of label knowledge graphs, i.e., a taxonomy graph, a semantic similarity graph, and a co-occurrence graph, to benefit hierarchical text classification. These works regard hierarchical text classification as multiple binary classification tasks (Fürnkranz et al., 2008). The limitation is that such models do not consider the interaction of label semantics and text semantics. Therefore, they fail to capture complex label dependencies and cannot provide comparable text-label classification scores, which leads to restricted performance (Chen et al., 2020). Hence, it is crucial to exploit the relationship between text and label semantics and to help the model distinguish target labels from incorrect labels in a comparable and hierarchy-aware manner. In this work, we perform matching learning in a joint embedding of text and labels to address these problems.

Exploiting a Joint Embedding of Text and Labels
To determine the correlation between text and labels, researchers have proposed various methods to exploit a text-label joint embedding, such as label attention (Xiao et al., 2019) or autoencoders (Yeh et al., 2017). In the field of multi-label text classification, label attention-based methods adopt a self-attention mechanism to identify label-specific information (Xiao et al., 2019), while autoencoder-based methods extend the vanilla Canonical Correlated Autoencoder (Yeh et al., 2017) to a ranking-based autoencoder architecture to produce comparable label scores. However, these methods do not fully consider label semantics and holistic label correlations among fine-grained labels, coarse-grained labels, and incorrect labels. In addition, we cannot simply transfer these multi-label classification methods to HTC due to the constraint of the hierarchy (Zhou et al., 2020).

Figure 2: The overall architecture of the proposed model. First, the text encoder and label encoder extract the text semantics and label semantics, respectively. Then text semantics and label semantics are projected into a joint embedding space. The joint embedding loss encourages the text semantics to be similar to the target label semantics. By introducing the matching learning loss, the fine-grained label semantics (Debt) are semantically closest to the text semantics, followed by the coarse-grained label semantics (Economics), while the semantics of incorrect labels (Revenue, Society) are semantically far away from the text semantics. The relative order is d_1 < d_2 < d_3 < d_4, where d represents the metric distance in the joint embedding.

Proposed Method
In this section, we describe the details of our Hierarchy-aware Label Semantics Matching Network. Figure 2 shows the overall architecture of our proposed model.

Text Encoder
In the HTC task, given an input sequence x_seq = {x_1, ..., x_n}, the model predicts labels y = {y_1, ..., y_k}, where n is the number of words and k is the size of the label set. Labels with a probability higher than a fixed threshold (0.5) are regarded as the prediction results. The sequence of token embeddings is first fed into a bidirectional GRU layer to extract contextual features H = {h_1, ..., h_n}. Then, CNN layers with top-k max-pooling are adopted to generate key n-gram features T ∈ R^(k×d_cnn), where d_cnn indicates the output dimension of the CNN layer. Following previous work (Zhou et al., 2020), we further introduce a hierarchy-aware text feature propagation module to encode label hierarchy information. We define the hierarchical label structure as a directed graph G = (V, ←E, →E), where V indicates the set of hierarchy structure nodes. ←E is built from the top-down hierarchy paths and represents the prior statistical probabilities from parent nodes to child nodes, while →E is built from the bottom-up hierarchy paths and represents the connection relationships from child nodes to parent nodes. The graph adjacency matrices ←E and →E are both of size R^(k×k), where k is the size of the label set. The text feature propagation module first projects the text features T to node inputs V_t by a linear transformation W_proj ∈ R^(k×d_cnn×d_t), where d_t is the dimension of a hierarchy structure node derived from the text features. Then a Graph Convolution Network (GCN) is adopted to explicitly combine text semantics with the prior hierarchical information ←E and →E:

S_t = σ(←E · V_t · W_g1) + σ(→E · V_t · W_g2)

where σ is the ReLU activation function and W_g1, W_g2 ∈ R^(d_t×d_t) are the weight matrices of the GCN. S_t is the text representation aware of the prior hierarchy paths.
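As a rough sketch (not the authors' released code), the propagation step above can be written in a few lines of numpy. The sum-of-two-ReLU-branches form, the toy adjacency matrices, and all variable names below are our own assumptions; the label encoder in the next section reuses the same form with its own weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def propagate(V, E_td, E_bu, W_a, W_b):
    """One hierarchy-aware GCN step: propagate node features V along the
    top-down (E_td) and bottom-up (E_bu) hierarchy adjacency matrices."""
    return relu(E_td @ V @ W_a) + relu(E_bu @ V @ W_b)

# Toy hierarchy with k = 3 label nodes (one root, two children) and d_t = 4.
rng = np.random.default_rng(0)
k, d_t = 3, 4
E_td = np.array([[0.0, 0.5, 0.5],   # prior probabilities parent -> children
                 [0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0]])
E_bu = (E_td > 0).T.astype(float)   # connection relationships children -> parent
V_t = rng.normal(size=(k, d_t))     # node inputs projected from text features T
W_a = rng.normal(size=(d_t, d_t))
W_b = rng.normal(size=(d_t, d_t))

S_t = propagate(V_t, E_td, E_bu, W_a, W_b)
assert S_t.shape == (k, d_t)
assert (S_t >= 0.0).all()           # sum of two ReLU branches is non-negative
```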

Label Encoder
In the HTC task, the hierarchical label structure can also be regarded as a directed graph G_l = (V_l, ←E, →E), where V_l indicates the set of hierarchy structure nodes with label representations. The graph in the label encoder shares the same structure ←E and →E with the graph in the text encoder. Given the total label set y = {y_1, ..., y_k} as input, we first create the label embeddings V_l ∈ R^(k×d_l) by averaging pre-trained label embeddings. Then a GCN is utilized as the label encoder:

S_l = σ(←E · V_l · W_g3) + σ(→E · V_l · W_g4)

where σ is the ReLU activation function and W_g3, W_g4 ∈ R^(d_l×d_l) are the weight matrices of the GCN. S_l is the label representation aware of the prior hierarchy paths. Note that the weight matrices and input representations of the label encoder differ from those of the text encoder.
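The averaging step that builds V_l can be sketched as follows; this is a simplified illustration under our own assumptions (the function name, the token-per-label dictionaries, and the zero fallback for out-of-vocabulary label names are ours, not the paper's).

```python
import numpy as np

def label_node_inputs(label_tokens, embedding, dim):
    """V_l: average the pre-trained embeddings of each label's tokens."""
    V_l = np.zeros((len(label_tokens), dim))
    for i, tokens in enumerate(label_tokens):
        vecs = [embedding[t] for t in tokens if t in embedding]
        if vecs:
            V_l[i] = np.mean(vecs, axis=0)
    return V_l

# Toy 2-D pre-trained embeddings for two label-name tokens.
emb = {"economic": np.array([1.0, 0.0]), "debt": np.array([0.0, 1.0])}
V_l = label_node_inputs([["economic"], ["economic", "debt"], ["unknown"]], emb, 2)
assert np.allclose(V_l[1], [0.5, 0.5])    # mean of the two token embeddings
assert np.allclose(V_l[2], [0.0, 0.0])    # unseen label name falls back to zeros
```

The resulting V_l is then propagated through the shared hierarchy graph exactly as in the text encoder, with its own weights W_g3 and W_g4.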

Joint Embedding Learning
In this section, we introduce the methods for learning a text-label joint embedding and the hierarchy-aware matching relationship. For joint embedding learning, we first project the text semantics S_t and the label semantics S_l into a common latent space:

Φ_t = FFN_t(S_t),  Φ_l = FFN_l(S_l)

where FFN_t and FFN_l are independent two-layer feed-forward neural networks, and Φ_t, Φ_l ∈ R^(d_Φ) represent the text semantics and label semantics in the joint embedding space, with d_Φ the dimension of the joint embedding. To align the two independent semantic representations in the latent space, we employ a mean squared loss between the text semantics and the target label semantics:

L_joint = (1 / |P(y)|) Σ_{i ∈ P(y)} ||Φ_t − Φ_l^i||^2

where P(y) is the target label set. L_joint aims to minimize the common embedding distance between the input text and its target labels.
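A minimal numpy sketch of the projection and of L_joint, under assumed toy shapes (a single text vector and a small matrix of label embeddings; all names are illustrative):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Independent two-layer feed-forward projection (FFN_t / FFN_l)."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def joint_embedding_loss(phi_t, phi_l, target_ids):
    """Mean squared distance between the text embedding Phi_t and the
    embeddings of the target labels in P(y) (L_joint)."""
    diffs = phi_l[target_ids] - phi_t       # broadcast phi_t over the targets
    return float(np.mean(np.sum(diffs ** 2, axis=-1)))

# Toy joint space (d_phi = 2) with three label embeddings.
phi_l = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
phi_t = np.array([0.0, 1.0])
assert joint_embedding_loss(phi_t, phi_l, [1]) == 0.0   # text matches its target
assert joint_embedding_loss(phi_t, phi_l, [0]) == 2.0   # far from a wrong label
```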

Hierarchy-aware Matching Learning
With the text-label joint embedding loss alone, the model only captures the correlations between text semantics and target label semantics, while the correlations among labels of different granularity are ignored.

Figure 3: Illustration of the hierarchy-aware margin. Target labels are colored yellow. Each colored line represents the matching operation between the text and a label. The two vertical axes on the right denote the semantic matching distance and the penalty margin. The semantic matching distances are ordered as d_1 (fine-grained target labels) < d_2 (coarse-grained target labels) < d_3 (incorrect sibling labels) < d_4 (other incorrect labels). We introduce penalty margins γ to model these relative matching relationships.
In the HTC task, the matching relationship between text semantics and fine-grained labels is expected to be the closest, followed by coarse-grained labels, while text semantics and incorrect label semantics should not be related.
In light of this, we propose a hierarchy-aware matching loss L_match to incorporate the correlations between text semantics and the semantics of different labels. L_match penalizes a small semantic distance between text semantics and incorrect label semantics with a margin γ:

L_match = Σ_{p ∈ P(y)} Σ_{n ∈ N(y)} max(0, D(Φ_t, Φ_l^p) − D(Φ_t, Φ_l^n) + γ)

where Φ_l^p denotes target label semantics and Φ_l^n denotes incorrect label semantics. We use the L2-normalized Euclidean distance as the metric D, and γ is the margin constant of the margin-based triplet loss. We take the average of the losses over all label pairs as the margin loss.
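The margin-based triplet loss above can be sketched in a few lines (a simplified numpy version with illustrative names; the averaging over (target, incorrect) pairs follows the description in the text):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x)

def dist(a, b):
    """Metric D: Euclidean distance between L2-normalized embeddings."""
    return float(np.linalg.norm(l2_normalize(a) - l2_normalize(b)))

def matching_loss(phi_t, positives, negatives, gamma=0.2):
    """Margin-based triplet loss averaged over all (target, incorrect) pairs."""
    losses = [max(0.0, dist(phi_t, p) - dist(phi_t, n) + gamma)
              for p in positives for n in negatives]
    return sum(losses) / len(losses)

phi_t = np.array([1.0, 0.0])
pos = [np.array([1.0, 0.0])]    # target label: distance 0 to the text
neg = [np.array([-1.0, 0.0])]   # incorrect label: distance 2 after normalization
assert matching_loss(phi_t, pos, neg) == 0.0   # margin already satisfied
```

When an incorrect label sits as close to the text as the target does, the loss equals the margin γ, pushing the incorrect label away.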
Hierarchy-aware Sampling Due to the large label sets in the HTC task, it is time-consuming to compute the matching loss for every label. We therefore propose hierarchy-aware sampling to alleviate this problem. Specifically, for every fine-grained target label we sample all of its parent labels (coarse-grained labels), one sibling label, and one random incorrect label to obtain its negative label set n ∈ N(y).
Hierarchy-aware Margin It is also unreasonable to assign the same margin to different label pairs, since label semantic similarity varies considerably in a large structured label hierarchy. Our basic idea is that the semantic relationship between two labels should be closer if they are closer in the hierarchical structure. First, the text semantics should match the fine-grained labels the most, which is exploited in joint embedding learning. We therefore regard the pair with the smallest semantic distance (d_1) as the positive pair and the other text-label matching pairs as negative pairs. As depicted in Figure 3, compared with the positive pair, the semantic matching distance between the text and the coarse-grained target labels (d_2) should be larger. The incorrect sibling labels still have a certain semantic relationship with the target labels, so the semantic matching distance between the text and the incorrect siblings of the fine-grained labels (d_3) should be larger still, while the distance between the text and other incorrect labels (d_4) should be the largest. We introduce hierarchy-aware penalty margins γ_1, γ_2, γ_3, γ_4 to model these comparable relationships: the penalty margin is smaller when we expect the semantic matching distance to be smaller. We neglect γ_1 because the matching relationship between text semantics and fine-grained labels is already exploited in joint embedding learning. γ_2, γ_3, γ_4 are penalty margins relative to the matching relationship between text semantics and fine-grained label semantics.
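The hierarchy-aware sampling described above can be sketched as follows, using the toy hierarchy from Figure 2 (Economics → {Debt, Revenue}, plus an unrelated Society label). The dictionaries and function name are our own illustrative assumptions:

```python
import random

def sample_negatives(fine_label, parents, siblings, label_set, targets, rng):
    """Build N(y) for one fine-grained target label: all of its parent
    (coarse-grained) labels, one incorrect sibling, and one random
    incorrect label outside the sibling set."""
    negs = list(parents.get(fine_label, []))
    wrong_siblings = [s for s in siblings.get(fine_label, []) if s not in targets]
    if wrong_siblings:
        negs.append(rng.choice(wrong_siblings))       # one incorrect sibling
    others = [l for l in label_set
              if l not in targets and l not in negs
              and l not in siblings.get(fine_label, [])]
    if others:
        negs.append(rng.choice(others))               # one random incorrect label
    return negs

parents = {"Debt": ["Economics"]}
siblings = {"Debt": ["Revenue"]}
labels = ["Economics", "Debt", "Revenue", "Society"]
negs = sample_negatives("Debt", parents, siblings, labels,
                        targets={"Economics", "Debt"}, rng=random.Random(0))
assert set(negs) == {"Economics", "Revenue", "Society"}
```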
We introduce two hyperparameters α and β to relate the different margins: γ_2 = αγ, γ_3 = βγ, γ_4 = γ, where 0 < α < β < 1. The proposed loss captures the relative semantic similarity rankings among target labels and incorrect labels in a hierarchy-aware manner.
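Under this setting, a per-negative margin lookup might look like the following pure-Python sketch (function and argument names are illustrative, with the paper's default α = 0.01, β = 0.5, γ = 0.2):

```python
def penalty_margin(neg_label, parent_labels, sibling_labels,
                   gamma=0.2, alpha=0.01, beta=0.5):
    """Hierarchy-aware margin for a sampled negative label:
    gamma_2 = alpha*gamma for coarse-grained target parents,
    gamma_3 = beta*gamma  for incorrect siblings,
    gamma_4 = gamma       for other incorrect labels (0 < alpha < beta < 1)."""
    if neg_label in parent_labels:
        return alpha * gamma
    if neg_label in sibling_labels:
        return beta * gamma
    return gamma

# Margins grow with hierarchical distance from the fine-grained target.
assert penalty_margin("Economics", {"Economics"}, {"Revenue"}) == 0.01 * 0.2
assert penalty_margin("Revenue", {"Economics"}, {"Revenue"}) == 0.5 * 0.2
assert penalty_margin("Society", {"Economics"}, {"Revenue"}) == 0.2
```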

Classification Learning and Objective Function
We find that classification learning tends to overfit if it is performed directly in the text-label joint embedding. Hence, we use the text semantic representation S_t, guided by the joint embedding loss and the matching learning loss, to perform classification learning. S_t is fed into a fully connected layer to obtain the label probabilities ŷ for prediction. The overall objective function combines a cross-entropy classification loss L_C, the joint embedding loss, and the hierarchy-aware matching loss:

L = L_C(y, ŷ) + λ_1 L_joint + λ_2 L_match

where y and ŷ are the ground-truth labels and output probabilities, respectively, and λ_1, λ_2 are hyperparameters balancing the joint embedding loss and the hierarchy-aware matching loss.

Experiments

Datasets We conduct experiments on the RCV1-V2, WOS, and EURLEX-57K datasets. The label sets are split into zero-shot labels, few-shot labels, and frequent labels: few-shot labels are those whose frequency in the training set is at most 50, and frequent labels are those whose frequency is greater than 50. This label setting is the same as in previous work (Lu et al., 2020). In EURLEX-57K, the corpora are tagged only with fine-grained labels; the parent labels of fine-grained labels are not tagged as target labels.

Evaluation Metrics On the RCV1-V2 and WOS datasets, we measure the experimental results by Micro-F1 and Macro-F1. Micro-F1 takes the overall precision and recall of all instances into account, while Macro-F1 equals the average F1-score over labels. On the large hierarchical multi-label text classification dataset EURLEX-57K, we report two ranking metrics, Recall@5 and nDCG@5. Ranking metrics are preferable for EURLEX-57K since they do not introduce a significant bias towards frequent labels (Lu et al., 2020).
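As a reference for how the two F1 scores differ, here is a minimal pure-Python sketch of Micro-F1 and Macro-F1 computed from per-label counts over an arbitrary label subset (all names are our own; real evaluations would use a library implementation):

```python
def micro_macro_f1(gold, pred, labels):
    """Micro-F1 pools TP/FP/FN over all labels; Macro-F1 averages per-label
    F1 scores, so rare labels weigh as much as frequent ones."""
    tp = {l: 0 for l in labels}
    fp = {l: 0 for l in labels}
    fn = {l: 0 for l in labels}
    for g, p in zip(gold, pred):
        for l in labels:
            if l in p and l in g:
                tp[l] += 1
            elif l in p:
                fp[l] += 1
            elif l in g:
                fn[l] += 1
    def f1(t, false_p, false_n):
        return 2 * t / (2 * t + false_p + false_n) if t else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

gold = [{"a"}, {"a", "b"}]
pred = [{"a"}, {"a"}]          # misses the rare label "b" once
micro, macro = micro_macro_f1(gold, pred, ["a", "b"])
assert micro == 0.8            # 2*2 / (2*2 + 0 + 1)
assert macro == 0.5            # (1.0 + 0.0) / 2
```

The gap between the two scores on this toy example illustrates why Macro-F1 is the more sensitive metric for imbalanced label hierarchies.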
Implementation Details We initialize the word embeddings with 300-dimensional pre-trained GloVe vectors (Pennington et al., 2014). We use a one-layer BiGRU with hidden dimension 100, and the CNNs use 100 filters with kernel sizes [2, 3, 4]. The dimensions of the text propagation features and the graph convolution weight matrices are both 300. The hidden size of the joint embedding is 200. The matching margin γ is set to 0.2 on the RCV1-V2 and WOS datasets, and to 0.5 on the EURLEX-57K dataset. We set the hierarchy-aware penalty hyperparameters α and β to 0.01 and 0.5, respectively. The loss balancing factors λ_1 and λ_2 are both set to 1. For fair comparisons with previous work (Lu et al., 2020; Chalkidis et al., 2019) on the EURLEX-57K dataset, we do not use the CNN layer or the text feature propagation module, and to adapt to the zero-shot setting, predictions are generated by the dot-product similarity between text semantics and label semantics. Our model is optimized by Adam with a learning rate of 1e-4. For the pre-trained language model BERT (Devlin et al., 2018), we use the top-level representation h_CLS of BERT's special [CLS] token to perform classification. To combine our model with BERT, we replace the text encoder of HiMatch with BERT, and the label representations are initialized with pre-trained BERT embeddings. In this setting, the batch size is 16 and the learning rate is 2e-5.
On the EURLEX-57K dataset, we compare our model with strong baselines with and without zero-shot settings, such as BIGRU-ATT and BIGRU-LWAN (Chalkidis et al., 2019), which introduced label-wise attention. Models whose names start with "ZERO" make predictions by calculating similarity scores between text and label semantics for zero-shot settings. AGRU-KAMG (Lu et al., 2020) is a state-of-the-art model that introduces various kinds of label knowledge.

Experimental Results

Table 2: Experimental results on RCV1-V2 (Micro-F1 / Macro-F1); baselines include TextRCNN (81.57 / 59.25) and TextRCNN-LA (Zhou et al., 2020).

Tables 2, 3 and 4 report the performance of our approach against other methods. HiAGM is a strong baseline on RCV1-V2 and WOS due to its introduction of holistic label information, but it ignores the semantic relationship between text and labels. Our model achieves the best results by capturing the matching relationships between text and labels in a hierarchy-aware manner, with especially strong gains on Macro-F1. These improvements show that our model makes better use of structural information to help with imbalanced HTC classification. The pre-trained language model BERT is an effective method when fine-tuned on downstream tasks. Compared with the results that regard HTC as multiple binary classification tasks, our results show that full use of the structured label hierarchy brings large improvements to the BERT model on the RCV1-V2 and WOS datasets. On the EURLEX-57K dataset, our model achieves the best results on all metrics except for zero-shot labels, with the largest improvements on few-shot labels. AGRU-KAMG achieves the best results on zero-shot labels by fusing various kinds of knowledge such as label semantic similarities and label co-occurrence; in contrast, our model performs semantic matching among seen labels based on the training corpora and is not designed specifically for zero-shot learning.

Ablation Study
In this section, we study the independent effect of each component of our proposed model. First, we validate the influence of the two proposed losses and of hierarchy-aware sampling; the results are reported in Table 5. F1 decreases when either the joint embedding loss or the matching learning loss is removed. The joint embedding loss has a large influence, since label semantics matching relies on the joint embedding. Besides, as described in the hierarchy-aware margin subsection, we perform hierarchy-aware sampling of coarse-grained labels, incorrect sibling labels, and other incorrect labels as negative label sets. When we replace hierarchy-aware sampling with random sampling, the results drop, which shows the effectiveness of hierarchy-aware sampling.

Hyperparameters Study
To study the influence of the hyperparameters γ, α, and β, we conduct seven experiments on the RCV1-V2 dataset; the results are reported in Table 6. The first experiment uses the best hyperparameters of our model. In experiments two and three we vary the matching margin γ, and find that a moderate margin γ = 0.2 is more beneficial for matching learning than a larger or smaller one. We then validate the effectiveness of the hierarchy-aware margin. In experiment four, performance decreases when we violate the hierarchical structure by setting a large penalty margin for coarse-grained labels and a small penalty margin for incorrect sibling labels. In experiment five, performance drops even more when we set α = 1 and β = 1, which ignores the hierarchical structure completely. We speculate that penalty margins that violate the hierarchical structure hurt the results, since the semantic relationship should be closer when labels are closer in the hierarchical structure. Moreover, we validate the effectiveness of using different penalty margins for labels of different granularity. In experiments six and seven, the results degrade when we ignore the distinction between coarse-grained target labels and incorrect sibling labels by setting the same value for α and β. Therefore, it is necessary to set a small penalty margin for coarse-grained target labels and a larger penalty margin for incorrect sibling labels.

T-SNE Visualization of Joint Embedding
We plot the t-SNE projections of the text representations and label representations in the joint embedding in Figure 4. Figure 4(a) shows a part of the hierarchical label structure of RCV1-V2: labels C171 and C172 are fine-grained labels, label C17 is their coarse-grained parent, and GWELF and E61 are labels whose semantics differ from those of C17, C171, and C172. In Figure 4(b), with the joint embedding loss, the text representations are close to their corresponding label representations; furthermore, the text representations of labels C171 and C172 are close to the representation of their coarse-grained label C17. However, the text representations of different labels may overlap, since the matching relationships among different labels are ignored. The t-SNE visualization shows that our model can capture the semantic relationships among texts, coarse-grained labels, fine-grained labels, and unrelated labels.

Performance Study on Label Granularity
We analyze the performance across different label granularities based on their hierarchical levels.
We compute level-based Micro-F1 and Macro-F1 scores on the RCV1-V2 dataset for TextRCNN, HiAGM, and our model in Figure 5. On RCV1-V2, both the second and third hierarchical levels contain fine-grained labels (leaf nodes). The second level has the largest number of labels and contains confusing labels with similar concepts, so its Micro-F1 is relatively low. Both the second and third levels contain long-tailed labels, so their Macro-F1 scores are relatively low. Figure 5 shows that our model outperforms the other models on all levels, especially the deeper ones.
The results demonstrate that our model better captures hierarchical label semantics, especially for fine-grained labels within a complex hierarchical structure.

Computational Complexity
In this part, we compare the computational complexity of HiAGM and our model. For time complexity, training HiMatch takes 1.11 times as long as HiAGM with batch size 64. For space complexity during training, HiMatch has 37.4M parameters while HiAGM has 27.8M; the increase mainly comes from the label encoder over a large label set. During testing, however, the time and space complexity of HiMatch is the same as HiAGM's, because only the classification results are needed and the joint embedding can be removed. HiMatch achieves new state-of-the-art results, and we believe this increase in computational complexity is acceptable.

Conclusion
Here we present a novel hierarchical text classification model called HiMatch that can capture semantic relationships among texts and labels at different abstraction levels. Instead of treating HTC as multiple binary classification tasks, we consider the text-label semantics matching relationship and formulate it as a semantic matching problem. We learn a joint semantic embedding between text and labels. Finally, we propose a hierarchy-aware matching strategy to model different matching relationships among coarse-grained labels, fine-grained labels and incorrect labels. In future work, we plan to extend our model to the zero-shot learning scenario.