HTCInfoMax: A Global Model for Hierarchical Text Classification via Information Maximization

The current state-of-the-art model for hierarchical text classification, HiAGM, has two limitations. First, it correlates each text sample with all labels in the dataset, which introduces irrelevant label information. Second, it places no statistical constraint on the label representations learned by the structure encoder, although constraints on representation learning have proved helpful in previous work. In this paper, we propose HTCInfoMax to address these issues by introducing information maximization, which comprises two modules: text-label mutual information maximization and label prior matching. The first module explicitly models the interaction between each text sample and its ground-truth labels, filtering out irrelevant information. The second encourages the structure encoder to learn representations with desired characteristics for all labels, which better handles label imbalance in hierarchical text classification. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed HTCInfoMax.


Introduction
Hierarchical text classification (HTC) is a particular subtask of multi-label text classification (Li et al., 2020). Many datasets have been proposed over the decades to study HTC, such as RCV1 (Lewis et al., 2004) and NYTimes (Sandhaus, 2008), which categorize a news article into several categories/labels. The labels in each dataset are usually organized as a tree or a directed acyclic graph, so each dataset comes with a label taxonomic hierarchy. The goal of HTC is to predict multiple labels in a given label hierarchy for a given text.
There are two groups of existing methods for HTC: local approaches and global approaches. Local approaches usually build a classifier for each label/node (Banerjee et al., 2019), for each parent node, or for each level of the label hierarchy (Wehrmann et al., 2018; Huang et al., 2019; Chang et al., 2020). Global approaches build a single classifier to simultaneously predict multiple labels of a given text. Earlier global approaches ignore the hierarchical structure of labels and assume there is no dependency among labels, leading to flat models such as (Johnson and Zhang, 2015). Later works exploit the label taxonomic hierarchy to improve performance through different strategies, such as recursively regularized Graph-CNN (Peng et al., 2018), reinforcement learning (Mao et al., 2019), attentional capsule networks (Peng et al., 2019), meta-learning (Wu et al., 2019), and structure encoders (Zhou et al., 2020). Many attention-based models have also been proposed to learn more refined text features for text classification, such as (You et al., 2019; Deng et al., 2020). Among these methods, HiAGM (Zhou et al., 2020) is the state-of-the-art model for HTC: it designs a structure encoder that integrates prior knowledge of the label hierarchy to learn label representations, and builds two variants on top of this encoder (HiAGM-LA and HiAGM-TP) to capture the interactions between text features and label representations. However, HiAGM has some limitations. First, it feeds the same label hierarchy information to every text sample and therefore cannot distinguish labels relevant to a specific sample from irrelevant ones. Although HiAGM-LA can implicitly relate each text to its corresponding labels through soft attention weights, irrelevant and noisy information remains.
Second, for HiAGM-LA, there is no statistical constraint on the label embeddings generated by the structure encoder, while statistical constraints for representation learning were shown to be helpful by Hjelm et al. (2019).
To address these two limitations of HiAGM-LA, we propose HTCInfoMax, which introduces information maximization consisting of two new modules, text-label mutual information maximization and label prior matching, on top of HiAGM-LA. Specifically, the first module explicitly connects each text sample with its corresponding labels by maximizing the mutual information between them, and thus can filter out irrelevant label information for a specific text sample. The label prior matching module imposes constraints on the learned representation of each label, forcing the structure encoder to learn representations with desirable properties for all labels; this also improves the quality of representations for low-frequency labels, which helps handle the label imbalance issue better.
In summary, our main contributions are: 1) We propose a novel global model HTCInfoMax for HTC by introducing information maximization, which includes two modules: text-label mutual information maximization and label prior matching. 2) To the best of our knowledge, this is the first work to utilize text-label mutual information maximization for HTC, which enables each text to capture its corresponding labels' information in an effective way. 3) To the best of our knowledge, this is also the first work to introduce label prior matching for HTC, which encourages the structure encoder to learn desired representations for all labels and thus better handles the inherent label imbalance issue in HTC. 4) Experimental results demonstrate the effectiveness of our proposed model for HTC. 5) We release our code to enable replication at https://github.com/RingBDStack/HTCInfoMax.

Our approach
The overall architecture of our model is shown in Figure 1. The major part of HTCInfoMax is the "Information Maximization" component shown in the dashed box, which has two new modules: text-label mutual information maximization and label prior matching, introduced in the following sections. The remaining parts, i.e., the text encoder, structure encoder, and predictor, are kept the same as in HiAGM-LA (Zhou et al., 2020).

Text-label mutual information estimation and maximization
A good text representation is critical for predicting the corresponding labels, so fusing label information into text features can improve prediction performance. HiAGM-LA utilizes multi-label attention to implicitly bridge the text feature of each sample with all labels' information, which helps each text obtain some label information. However, soft attention weights also inject irrelevant label information into the text feature. Therefore, we design a text-label mutual information maximization module that removes irrelevant label information for each text while helping each text capture the information of its corresponding labels. In this way, the learned representation of each text incorporates useful label information for predicting its labels.
To implement text-label mutual information maximization, we first select the ground-truth labels of each text sample during training, and then apply a discriminator to estimate the mutual information between a text and its labels, a technique also known as negative-sampling estimation. Let P_T and P_Y denote the distribution of text features output by the text encoder and the distribution of label representations produced by the structure encoder, respectively, and let P_TY = P_{Y|T} P_T denote their joint distribution. The positive samples are pairs (t, y) of a text t and its corresponding labels y; in other words, positive samples are drawn from the joint distribution of text and label. For the negative samples, we pair y with another text sample t' from the same batch, denoted (t', y); these can be deemed draws from the product of the marginal distributions P_T and P_Y. Both positive and negative samples are fed to the discriminator D_MI, which classifies them and estimates the mutual information I(T; Y) between text and label as shown in Eq. (1), where D_MI(t, y) and D_MI(t', y) denote the probability scores assigned to a positive and a negative sample, respectively. The goal of this module is to maximize I(T; Y), so its loss is given in Eq. (2).
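Given this description, Eq. (1) and Eq. (2) take the standard GAN-style negative-sampling form. The sketch below is one plausible writing consistent with the surrounding text; the exact estimator used in the original (e.g., a softplus-based Jensen-Shannon form as in DIM) may differ:

```latex
I(T;Y) \;\approx\; \mathbb{E}_{(t,y)\sim P_{TY}}\!\left[\log D_{MI}(t,y)\right]
\;+\; \mathbb{E}_{t'\sim P_T,\, y\sim P_Y}\!\left[\log\!\left(1 - D_{MI}(t',y)\right)\right] \quad (1)

\mathcal{L}_{MI} \;=\; -\,I(T;Y) \quad (2)
```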
This module is inspired by Deep InfoMax (DIM) (Hjelm et al., 2019), which utilizes local and global mutual information maximization to help the encoder learn a high-level representation of an image. The structure of the discriminator D_MI in this module can be found in Appendix A.1.
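As a concrete illustration, the DIM-style estimator can be sketched in plain Python from discriminator scores alone. This is a minimal sketch under our own naming (the function names below are ours, not from the released code), using the softplus-based Jensen-Shannon bound that DIM employs:

```python
import math

def softplus(x):
    # Numerically stable log(1 + e^x).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mi_jsd_lower_bound(pos_scores, neg_scores):
    """DIM-style Jensen-Shannon estimate of I(T;Y) from raw discriminator scores.

    pos_scores: D_MI logits on (text, ground-truth-label) pairs from P_TY.
    neg_scores: D_MI logits on (shuffled text, label) pairs from P_T x P_Y.
    """
    e_pos = sum(-softplus(-s) for s in pos_scores) / len(pos_scores)
    e_neg = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return e_pos - e_neg

def mi_loss(pos_scores, neg_scores):
    # Maximizing I(T;Y) is implemented by minimizing its negative.
    return -mi_jsd_lower_bound(pos_scores, neg_scores)
```

When the discriminator separates positives from negatives well (high positive scores, low negative scores), the estimate rises and the loss falls, which is the training signal that pushes each text representation towards its ground-truth labels' information.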

Label prior matching
There is an inherent label imbalance issue in HTC, so the label embeddings learned for low-frequency labels are poor due to underfitting caused by fewer training examples. Label prior matching imposes statistical constraints on the learned representation of each label, which helps the structure encoder learn better label representations with desirable characteristics for all labels. This also improves the quality of representations for low-frequency labels, which helps handle label imbalance better, as reflected in the improvement of the Macro-F1 score.
To implement the label prior matching mechanism, we use a method similar to the adversarial training in adversarial autoencoders (Makhzani et al., 2015), but without a generator, to force the learned label representations to match a prior distribution. We denote the prior as Q and the distribution of label representations learned by the structure encoder as P. Specifically, a discriminator network D_pr is employed to distinguish representations drawn from the prior (i.e., real samples, denoted ỹ) from the label embeddings produced by the structure encoder (i.e., fake samples, denoted y). For each label, we use D_pr to calculate its prior matching loss l_pr, shown in Eq. (3).
This loss pushes the distribution P of a label's learned representation towards its prior distribution Q. The final label prior matching loss is the average of the losses over all labels, shown in Eq. (4), where N is the number of labels.
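Given the adversarial-autoencoder setup described above, Eq. (3) and Eq. (4) can plausibly be written as follows; this is a reconstruction consistent with the surrounding description, not necessarily the paper's exact notation:

```latex
l_{pr} \;=\; \mathbb{E}_{\tilde{y}\sim Q}\!\left[\log D_{pr}(\tilde{y})\right]
\;+\; \mathbb{E}_{y\sim P}\!\left[\log\!\left(1 - D_{pr}(y)\right)\right] \quad (3)

\mathcal{L}_{prior} \;=\; \frac{1}{N}\sum_{i=1}^{N} l_{pr}^{(i)} \quad (4)
```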
This idea is inspired by DIM, which matches the representation of an image to a prior; differently from DIM, our module trains the structure encoder to learn desired representations for all labels by imposing the constraint on each label's representation.
A uniform distribution on the interval [0, 1) is adopted as the label prior Q in the label prior matching module. We choose the uniform distribution because it works well as a prior in DIM for generating image representations, and the improvement of the Macro-F1 score in our hierarchical text classification experiments further verifies its suitability as the label prior. The detailed structure of the discriminator D_pr can be found in Appendix A.2.
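The prior matching computation can be sketched in plain Python. This is a toy illustration under our own naming (the function names and the `disc` callable are ours); it draws "real" samples from the Uniform[0, 1) prior and scores both real and fake samples with the discriminator:

```python
import math
import random

def prior_matching_loss(disc, label_repr, n_prior=64, rng=random):
    """Adversarial prior-matching loss for a single label (a sketch).

    disc: callable mapping a vector to the probability it came from the prior.
    label_repr: the 'fake' sample, i.e., this label's structure-encoder embedding.
    'Real' samples are drawn from the Uniform[0, 1) prior with matching size.
    """
    dim = len(label_repr)
    real = [[rng.random() for _ in range(dim)] for _ in range(n_prior)]
    e_real = sum(math.log(disc(v)) for v in real) / n_prior
    e_fake = math.log(1.0 - disc(label_repr))
    # The discriminator maximizes e_real + e_fake (hence the negation here),
    # while the structure encoder is updated adversarially so that its label
    # embeddings become indistinguishable from prior samples.
    return -(e_real + e_fake)

def total_prior_loss(disc, label_reprs, n_prior=64, rng=random):
    # Average the per-label losses over all N labels, as in the text above.
    return sum(prior_matching_loss(disc, y, n_prior, rng)
               for y in label_reprs) / len(label_reprs)
```

A discriminator that cannot tell real from fake outputs 0.5 everywhere, giving the familiar equilibrium loss of 2·log 2; that is the point at which the label embeddings match the uniform prior.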

Final loss of HTCInfoMax
A loss weight estimator is adopted to learn the weights of the text-label mutual information loss and the label prior matching loss from the learned text features t and all labels' representations y, as shown in Eq. (5), where both W_1 and W_2 are trainable parameters.
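One plausible form of the weight estimator and the combined objective, consistent with the description here, is sketched below; the sigmoid gating, the exact inputs to each weight, and the numbering of the final-loss equation are our assumptions:

```latex
w_1 = \sigma\!\left(W_1\, t\right), \qquad w_2 = \sigma\!\left(W_2\, y\right) \quad (5)

\mathcal{L} \;=\; \mathcal{L}_c \;+\; w_1\,\mathcal{L}_{MI} \;+\; w_2\,\mathcal{L}_{prior} \quad (6)
```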
The loss from the predictor is the standard binary cross-entropy loss L_c (Zhou et al., 2020). The final objective function of HTCInfoMax is then the combination of all three losses.

Experiment

Datasets and evaluation metrics
Following HiAGM (Zhou et al., 2020), we use RCV1-V2 (Lewis et al., 2004) and Web of Science (WOS) (Kowsari et al., 2017) as benchmark datasets. Standard evaluation metrics, Micro-F1 (Mi-F1) and Macro-F1 (Ma-F1), are employed to evaluate our model. Under label imbalance, Ma-F1 better reflects a model's performance because it weights all labels equally rather than favoring frequent labels.
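The difference between the two metrics is easy to see in code. The sketch below (our own helper names, plain Python) computes both from per-label confusion counts; a model that only serves the frequent label keeps a high Mi-F1 while Ma-F1 exposes its failure on the rare label:

```python
def f1(tp, fp, fn):
    # Standard F1 from a single label's confusion counts.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(per_label_counts):
    """per_label_counts: list of (tp, fp, fn) tuples, one per label.

    Micro-F1 pools counts across labels (dominated by frequent labels);
    Macro-F1 averages per-label F1 scores (every label weighted equally).
    """
    tp = sum(c[0] for c in per_label_counts)
    fp = sum(c[1] for c in per_label_counts)
    fn = sum(c[2] for c in per_label_counts)
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in per_label_counts) / len(per_label_counts)
    return micro, macro

# A frequent label predicted well vs. a rare label predicted poorly:
mi, ma = micro_macro_f1([(90, 5, 5), (1, 4, 4)])
```

Here Mi-F1 stays at 0.91 while Ma-F1 drops below 0.58, which is why Ma-F1 gains are the relevant signal for the label imbalance issue discussed above.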

Experimental setup
To make a fair comparison between our model and HiAGM, we use the same parameter settings as HiAGM and follow its implementation details, which can be found in (Zhou et al., 2020).

Experimental results
The experimental results of our model are shown in Table 2; each score is the average of 8 runs. The results of HiAGM are taken from (Zhou et al., 2020). HiAGM has two variants, HiAGM-LA and HiAGM-TP. As stated before, our model is built on top of HiAGM-LA to address its limitations. From Table 2, one can see that our model outperforms HiAGM-LA with either GCN or TreeLSTM as the structure encoder on both datasets, which demonstrates that the introduced information maximization addresses the limitations of HiAGM-LA and improves performance. This is because label prior matching drives the structure encoder to learn good, desired label representations that encode more useful information about the labels, and the text-label mutual information maximization module learns a better representation of each text for prediction by fusing in the representations of its ground-truth labels while ignoring irrelevant labels' information. It is also worth noting that the improvement of Ma-F1 on RCV1-V2 is larger than that on WOS, which indicates that our model works better on datasets with a more complicated label hierarchy, as RCV1-V2 has a deeper label hierarchy than WOS. Although our model does not outperform all the results of HiAGM-TP, it reaches similar performance. This indicates that information maximization is an effective alternative way to fuse text features and label information to boost performance. In addition, apart from generating text representations, our model also produces refined label representations via information maximization that can be utilized for inference, whereas HiAGM-TP cannot produce such label embeddings for the inference phase because it directly feeds the text feature into the structure encoder to obtain the final text representation for prediction.
In other words, HiAGM-TP encodes text and label information into only one feature space. However, obtaining separate text features and label features such as the ones generated by our model can help encode more semantic information of labels, which may be helpful for HTC especially when there is a large label hierarchy in the dataset.
We do not report the results of other baselines such as HFT(M) (Shimura et al., 2018), SGM, HiLAP-RL (Mao et al., 2019), etc., as they can be found in (Zhou et al., 2020); our model performs better than these baselines.

Ablation study
To demonstrate the effectiveness of the two modules of information maximization, we conduct an ablation study; the results are shown in Table 3, where every score is also the average of 8 runs. Our full model outperforms the variant without text-label mutual information maximization (i.e., HTCInfoMax w/o MI) by 0.09 and 0.92 points on RCV1-V2 and by 0.12 and 0.11 points on WOS in terms of Mi-F1 and Ma-F1 respectively, which indicates that the text-label mutual information maximization module enables each text to capture its corresponding labels' information and thus improves Mi-F1 and Ma-F1 at the same time. Compared with the other variant (i.e., HTCInfoMax w/o LabelPrior), improvements on both metrics can also be observed, but Ma-F1 improves more than Mi-F1, by 2.14 and 1.04 points on RCV1-V2 and WOS respectively. This demonstrates that label prior matching helps regularize the label feature space and forces the structure encoder to learn better representations with desired properties for all labels. The representations of infrequent labels are thus also well learned, which mitigates the underfitting of low-frequency labels, improves Ma-F1 more, and better handles the label imbalance issue.

Conclusion
We propose HTCInfoMax to address the limitations of HiAGM by introducing information maximization, which includes two modules: text-label mutual information maximization and label prior matching. Label prior matching drives the model to learn better representations for all labels, while the other module fuses these learned label representations into the text to obtain better text representations containing effective label information for prediction. The experimental results demonstrate the effectiveness of HTCInfoMax.

A Architecture Details of Information Maximization
A.1 The structure of the discriminator in the text-label mutual information maximization module

The discriminator D_MI consists of two 1D convolutional layers with kernels of size 3 and three linear layers. The architecture of D_MI is shown in Figure 2, and the details of all layers are shown in Table 4 ("-" indicates that there is no activation for the corresponding layer). D_MI takes pairs of text and label representations as input. The text representations are fed to the convolutional layers first; the label representations are then concatenated with the output of the second convolutional layer and fed to the following linear layers. The final linear layer produces a score for each pair of a text sample and its corresponding labels.

A.2 The structure of the discriminator in label prior matching

The discriminator D_pr in the label prior matching module is composed of three linear layers. The details of these layers are shown in Table 5. This discriminator takes a label representation as input and is applied to each label to compute its prior matching loss as stated in Section 2.