Concept-Based Label Embedding via Dynamic Routing for Hierarchical Text Classification

Hierarchical Text Classification (HTC) is a challenging task that categorizes a textual description within a taxonomic hierarchy. Most existing methods focus on modeling the text. Recently, researchers have attempted to model class representations with additional resources (e.g., external dictionaries). However, the concepts shared among classes, a kind of domain-specific and fine-grained information, have been ignored in previous work. In this paper, we propose a novel concept-based label embedding method that can explicitly represent the concepts and model the sharing mechanism among classes for hierarchical text classification. Experimental results on two widely used datasets show that the proposed model outperforms several state-of-the-art methods. We release our complementary resources (concepts and definitions of classes) for these two datasets to benefit research on HTC.


Introduction
Text classification is a classical Natural Language Processing (NLP) task. In the real world, text classification is often cast as a hierarchical text classification (HTC) problem, as in patent collection (Tikk et al., 2005), web content collection (Dumais and Chen, 2000) and medical record coding (Cao et al., 2020). In these scenarios, the HTC task aims to categorize a textual description within a set of labels that are organized in a structured class hierarchy (Silla and Freitas, 2011). Many researchers have devoted their efforts to this challenging problem, proposing various HTC solutions that are usually categorized into flat (Aly et al., 2019), local (Xu and Geng, 2019), global (Qiu et al., 2011) and combined approaches (Wehrmann et al., 2018).
In most previous HTC work, researchers mainly focus on modeling the text, while the labels are simply represented as one-hot vectors (Zhu and Bain, 2017; Wehrmann et al., 2018). In effect, the one-hot vectors act as IDs without any semantic information, yet how to describe a class is also worthy of discussion. Some work embeds labels into a vector space that carries more semantic information. Compared with one-hot representations, label embeddings have advantages in capturing domain-specific information and importing external knowledge. In the field of text classification (including the HTC task), researchers have proposed several forms of label embeddings to encode different kinds of information, such as 1) anchor points (Du et al., 2019), 2) compatibility between labels and words (Huang et al., 2019; Tang et al., 2015), 3) the taxonomic hierarchy (Cao et al., 2020; Zhou et al., 2020) and 4) external knowledge (Rivas Rojas et al., 2020).
Although external knowledge has been proven effective for HTC, it comes from a dictionary or knowledge base that humans constructed for entity definition, and it does not focus on the class explanations of a particular HTC task. In this sense, external knowledge is a type of domain-independent information. Taxonomic hierarchy encoding can capture the structural information of classes, which is a sort of domain-specific information for HTC. However, it actually only models the hypernym-hyponym relations in the class hierarchy, and the process is implicit and difficult to interpret. Beyond the structural connections between classes, we find that the concepts shared between adjacent levels of classes have been ignored by previous work. For instance, there is a parent node named "Sports" in a concrete class hierarchy (Qiu et al., 2011). Its subclasses "Surfing" and "Swimming" are "water"-related sports, while the subclasses "Basketball" and "Football" are "ball"-related sports. "Water" and "ball" are abstract concepts included in the parent class "Sports" that can be shared by the subclasses. As shown in Figure 1, we make a similar observation in WOS (Kowsari et al., 2017), a widely used public dataset (details in our experiments). The concept "design" of the parent class "Computer Science" is shared by the child classes "Software engineering" and "Algorithm design". The concept "distributed" is shared by "Network security" and "Distributed computing". Such concept information can help to group the classes and measure the correlation intensity between parent and child classes. Compared with the node connections in the class hierarchy, concepts are more semantic and fine-grained, but rarely investigated. Although Qiu et al. (2011) noticed concepts in HTC, they define the concepts in a latent way and the process of representation learning is also implicit.
Additionally, little previous work investigates how to extract the concepts or model the sharing interactions among class nodes.
To further exploit concept information for HTC, we propose a novel concept-based label embedding method which can explicitly represent the concepts and model the sharing mechanism among classes. More specifically, we first construct a hierarchical attention-based framework, which has been proven effective by Wehrmann et al. (2018) and Huang et al. (2019). There is one concept-based classifier for each level, and the classification result of the previous level (i.e., the predicted soft label embedding) is fed into the next level. A label embedding attention mechanism is utilized to measure the compatibility between texts and classes. We then design a concept sharing module in our model. It first extracts the concepts explicitly from the corpus and represents them as embeddings. Inspired by the CapsNet (Sabour et al., 2017), we employ the dynamic routing mechanism, whose iterative routing shares information from the lower level to the higher level according to the agreement. Taking into account the characteristics of HTC, we modify the dynamic routing mechanism to model the concept sharing interactions among classes. In detail, we calculate the agreement between concepts and classes, and an external knowledge source is taken as an initial reference for the child classes. Different from the full connections in the CapsNet, we build routing only between each class and its own child classes to utilize the structured class hierarchy of HTC. The routing coefficients are then iteratively refined by measuring the agreement between the parent class concept embeddings and the child class embeddings. In this way, the module models the concept sharing process and outputs a novel label representation constructed from the concepts of parent classes. Finally, our hierarchical network adopts these label embeddings to represent the input document with an attention mechanism and makes a classification.
In summary, our major contributions include:
• This paper investigates the concept in the HTC problem, a type of domain-specific information ignored by previous work. We summarize several kinds of existing label embeddings and propose a novel label representation: concept-based label embedding.
• We propose a hierarchical network to extract the concepts and model the sharing process via a modified dynamic routing algorithm. To the best of our knowledge, this is the first work that explores the concepts of the HTC problem in an explicit and interpretable way.
• The experimental results on two widely used datasets empirically demonstrate the effectiveness of the proposed model.
• We complement the public datasets WOS (Kowsari et al., 2017) and DBpedia (Sinha et al., 2018) by extracting the hierarchy concepts and annotating the classes with definitions from Wikipedia. We release these complementary resources and the code of the proposed model for further use by the community.

Model
In this section, we introduce our model CLED (Figure 2) in detail. It is designed for hierarchical text classification with Concept-based Label Embeddings via a modified Dynamic routing mechanism. First, we construct a hierarchical attention-based framework. Then a concept sharing module is designed for extracting concepts and modeling the sharing mechanism among classes; the module learns a novel label representation based on concepts. Finally, the model uses the concept-based label embeddings to categorize a textual description.

Hierarchical Attention-based Framework
In recent years, hierarchical neural networks have been proven effective for the HTC task in much prior work (Sinha et al., 2018; Wehrmann et al., 2018; Huang et al., 2019). We adopt this design as the framework of our model.

Text Encoder
We first map each document d = (w_1, w_2, ..., w_{|d|}) into a low-dimensional word embedding space and denote it as X = (x_1, x_2, ..., x_{|d|}). A CNN layer is used for extracting n-gram features. A bidirectional GRU layer then extracts contextual features and represents the document as S = (s_1, s_2, ..., s_{|d|}).
Label Embedding Attention
To measure the compatibility between labels and texts, we adopt a label embedding attention mechanism. Given a structured class hierarchy, we denote the label embeddings of the i-th level as C = (c_1, c_2, ..., c_{|l_i|}), where |l_i| is the number of classes in the i-th level. We then calculate the cosine similarity matrix G ∈ R^{|d|×|l_i|} between words and labels via g_kj = (s_k · c_j)/(‖s_k‖‖c_j‖) for the i-th level. Inspired by prior work on label attention, we adopt convolutional filters F to measure the correlations r_p between the p-th phrase of length 2k+1 and the classes at the i-th level. We denote the largest correlation value of the p-th phrase with regard to the labels of the i-th level as t_p = max-pooling(r_p). We then obtain the label-to-text attention score α ∈ R^{|d|} by normalizing t ∈ R^{|d|} with the softmax function. Finally, the document representation d_att is obtained by averaging the word representations, weighted by the label-to-text attention scores: d_att = Σ_{k=1}^{|d|} α_k s_k.
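As a concrete illustration, the label-to-text attention step can be sketched in a few lines of numpy. This is a simplified sketch: it omits the convolutional phrase filters F and the per-phrase max-pooling, scoring each word directly by its largest label similarity; the function and variable names are illustrative, not the paper's code.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def label_attention(S, C):
    """Label-to-text attention.
    S: (|d| x h) word features; C: (L x h) label embeddings.
    Returns the attended document vector d_att of shape (h,)."""
    # cosine similarity between every word and every label: G in R^{|d| x L}
    Sn = S / (np.linalg.norm(S, axis=1, keepdims=True) + 1e-8)
    Cn = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-8)
    G = Sn @ Cn.T
    # per-word relevance: largest label correlation (max over labels)
    t = G.max(axis=1)
    alpha = softmax(t)  # label-to-text attention scores, sum to 1
    # weighted average of word features
    return (alpha[:, None] * S).sum(axis=0)
```

The softmax normalization makes the per-word scores comparable across the document, so highly label-compatible phrases dominate the final representation.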

Concept Sharing Module (CSM)
Most researchers focus on measuring the correlations of classes by modeling the structured class hierarchy. In fact, this captures only graph-connection information. By contrast, concepts are more semantic, fine-grained and interpretable, but have been ignored. To further exploit the concepts, we design a concept sharing module to explicitly model the mechanism of sharing concepts among classes and measure the intensity of the interactions.
Concepts Encoder
Given the corpus of class c, we extract keywords from its documents and take the top-n ranked keywords as the concepts of class c.

Algorithm 1 Pseudo Code of Concept Sharing via Dynamic Routing
Input: all the classes c and their concepts e in level l; all the classes in level (l+1)
Output: c^CL_j, the concept-based label embedding of each class j in level (l+1)
1: for each concept i of a class c in level l and each of its child classes j in level (l+1): b_ij ← 0
2: for r iterations do
3:   for each concept i of class c in level l: β_i ← softmax(b_i)   ▷ softmax computes Eq. 1
4:   for each child class j of class c in level (l+1): v_j ← Σ_i β_ij e_i
5:   for each child class j of class c in level (l+1): c^CL_j ← squash(v_j)   ▷ squash computes Eq. 4
6:   for each concept i of class c in level l and each of its child classes j in level (l+1): b_ij ← b_ij + e_i · c^CL_j

In the WOS dataset, every document is already annotated with several keywords, so we rank the keywords by term frequency within each class. For the DBpedia dataset, no annotated keywords are available. We therefore carry out the Chi-square (χ²) statistical test, which has been widely accepted as a statistical hypothesis test to evaluate the dependency between words and classes (Barnard, 1992; Palomino et al., 2009; Kuang and Davison, 2017), and rank the words by their χ² values. Having extracted the concepts for each class, we represent them with word embeddings.
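For the setting without annotated keywords, the χ² ranking can be sketched as follows, using the standard 2×2 contingency form of the statistic. The binary occurrence matrix, labels array, and function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def chi2_scores(word_doc, labels, cls):
    """Chi-square score of each vocabulary word w.r.t. class `cls`.
    word_doc: binary matrix (n_docs x vocab), 1 if the word occurs in the doc.
    labels:   array (n_docs,) of class ids."""
    in_cls = (labels == cls)
    N = float(len(labels))
    A = word_doc[in_cls].sum(axis=0).astype(float)   # word present,  class
    B = word_doc[~in_cls].sum(axis=0).astype(float)  # word present,  other
    C = in_cls.sum() - A                             # word absent,   class
    D = (~in_cls).sum() - B                          # word absent,   other
    num = N * (A * D - B * C) ** 2
    den = (A + B) * (C + D) * (A + C) * (B + D)
    # guard against degenerate (all-zero) columns
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)

def top_concepts(word_doc, labels, cls, vocab, n=30):
    """Top-n ranked words for a class, used as its concepts."""
    order = np.argsort(-chi2_scores(word_doc, labels, cls))
    return [vocab[i] for i in order[:n]]
```

For WOS, the same `top_concepts` interface would simply rank the annotated keywords by within-class term frequency instead.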
To further encode the concepts, we explore two different options and compare them in experiments. A simple and efficient option is to feed the concept embeddings into the sharing network directly. Alternatively, we apply the k-means clustering algorithm (Hartigan and Wong, 1979), in consideration of the similarity between concepts, and use the embeddings of the cluster centers. The outputs of the concepts encoder (word embeddings or cluster centers) are denoted as E_c = (e_1, e_2, ..., e_n) for class c.
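A minimal sketch of the clustering option follows, using plain Lloyd iterations rather than the exact Hartigan-Wong variant cited above; function and variable names are illustrative.

```python
import numpy as np

def kmeans_centers(E, k, iters=50, seed=0):
    """Plain Lloyd's k-means over concept embeddings E (n x h).
    Returns k cluster-center embeddings of shape (k x h)."""
    rng = np.random.default_rng(seed)
    # initialize centers from k distinct concept embeddings
    centers = E[rng.choice(len(E), size=k, replace=False)]
    for _ in range(iters):
        # assign each concept embedding to its nearest center
        d = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster empties
        centers = np.stack([E[assign == j].mean(axis=0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return centers
```

The resulting cluster centers would then play the role of E_c in place of the raw concept embeddings.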
Concepts Sharing via Dynamic Routing
For the HTC task, we find that concepts of parent classes are shared by their child classes: semantically related classes share some concepts in common, and the concepts describe a class from different views. We adopt the dynamic routing mechanism of the CapsNet (Sabour et al., 2017), which is effective for sharing information from lower levels to higher levels. Considering the characteristics of HTC, we modify it to explicitly model the interactions among classes and quantitatively measure their intensity.
To utilize the taxonomic hierarchy, we build routing only between a class and its own child classes, which differs from the full connections in the CapsNet. We take the coupling coefficients between the concepts of a parent class and all its child classes as the intensities of the sharing interactions. The intensities (coupling coefficients) β_ij sum to 1 and are determined by a "routing softmax". The logit b_ij is the log prior probability that concept i of a parent class should be shared with its child class j in level (l+1).
The logit b_ij is iteratively refined by adding the agreement, which is the scalar product between the concept embedding e_i and the concept-based label embedding (CL) of the child class, c^CL_j. Here v_j is the intermediate label embedding of the child class, generated by weighting over all the concepts of its parent class.
As Sabour et al. (2017) do in the CapsNet, we also apply a non-linear "squashing" function, which proves effective in our experiments.
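The routing softmax (Eq. 1) and the squashing function (Eq. 4) referenced in Algorithm 1 are not reproduced in this excerpt; assuming the standard CapsNet forms, consistent with the surrounding description, they read:

```latex
\beta_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})}, \qquad
v_j = \sum_i \beta_{ij}\, e_i, \qquad
c^{CL}_j = \operatorname{squash}(v_j)
         = \frac{\lVert v_j \rVert^2}{1 + \lVert v_j \rVert^2}\,
           \frac{v_j}{\lVert v_j \rVert},
```

followed by the agreement update b_ij ← b_ij + e_i · c^CL_j.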
Finally, we obtain the concept-based label embedding for class c_j by modeling the sharing mechanism.
The newly generated label embedding c^CL_j is constructed from several concepts e_i, each describing the class from a different view and weighted with a different intensity β_ij. Instead of randomly initializing c^CL_j, we take an external knowledge source as an initial reference, which proves more effective in experiments. The procedure is illustrated in Algorithm 1.
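The sharing procedure of Algorithm 1 can be sketched in numpy as follows. This is a minimal sketch: seeding the routing logits with the agreement between the concepts and the external-knowledge class embeddings (`ek_init`) is our reading of how the "initial reference" enters, and all names are illustrative.

```python
import numpy as np

def squash(v, eps=1e-8):
    """Non-linear squashing (standard CapsNet form); output norm < 1."""
    n2 = (v * v).sum()
    return (n2 / (1.0 + n2)) * v / (np.sqrt(n2) + eps)

def share_concepts(E, n_children, ek_init=None, iters=3):
    """Modified dynamic routing between one parent class and its children.
    E: (n_concepts x h) parent-class concept embeddings.
    ek_init: optional (n_children x h) external-knowledge reference embeddings.
    Returns (n_children x h) concept-based label embeddings and couplings beta."""
    n, h = E.shape
    b = np.zeros((n, n_children))  # routing logits, one per (concept, child)
    if ek_init is not None:
        # assumed: seed logits with concept/EK-reference agreement
        b += E @ ek_init.T
    for _ in range(iters):
        # routing softmax over the child classes of each concept (Eq. 1)
        beta = np.exp(b - b.max(axis=1, keepdims=True))
        beta /= beta.sum(axis=1, keepdims=True)
        V = beta.T @ E                        # v_j = sum_i beta_ij e_i
        C = np.stack([squash(v) for v in V])  # c^CL_j = squash(v_j)  (Eq. 4)
        b = b + E @ C.T                       # agreement update b_ij += e_i . c_j
    return C, beta
```

Routing is run per parent class, so unlike the fully connected CapsNet, couplings only exist between a parent's concepts and its own children.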

Classification
We build a classifier for each class level. Let ŷ^{l_i} denote the predictions for the classes in the i-th level, and let c^EK_j denote the label embedding obtained by averaging the word embeddings of the class definition from external knowledge (EK encoder in Figure 2). We calculate the loss of the classifier at the i-th level as

L_{l_i} = Σ_n CE(y^{l_i}_n, ŷ^{l_i}_n),

where y^{l_i}_n is the one-hot vector of the ground-truth label at the i-th level for document n and CE(·,·) is the cross entropy between two probability vectors. We optimize the model parameters by minimizing the overall loss

L = Σ_{i=1}^{H} L_{l_i},

where H is the total number of levels in the structured class hierarchy.
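The hierarchical objective above can be sketched as follows; this assumes mean cross-entropy per level, summed over the H levels, with illustrative names.

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    """CE between one-hot targets and predicted distributions, mean over docs."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

def hierarchical_loss(targets, preds):
    """Sum of per-level cross-entropy losses over the H levels of the hierarchy.
    targets/preds: lists of (n_docs x n_classes_at_level) arrays, one per level."""
    return sum(cross_entropy(y, p) for y, p in zip(targets, preds))
```

Each level's classifier contributes one term, so errors at upper and lower levels are penalized jointly during training.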

Datasets
We evaluate our model on two widely used hierarchical text classification datasets: Web of Science (WOS; Kowsari et al., 2017) and DBpedia (Sinha et al., 2018). The former includes published papers available from the Web of Science (Reuters, 2012).

Metrics and Parameter Settings
As the state-of-the-art methods do, we take the accuracy of each level and the overall accuracy as metrics. Hyper-parameters are tuned on a validation set by grid search. We take Stanford's publicly available GloVe 300-dimensional embeddings trained on 42 billion tokens from Common Crawl (Pennington et al., 2014) as initialization for word embeddings. The number of filters in CNN is 128 and the region size is {2, 3}. The number of hidden units in bi-GRU is 150. We set the maximum length of token inputs as 512. The rate of dropout is 0.5. The number of routing iterations is 3. We compare two different inputs of the sharing networks: 1) top 30 ranked concepts of each parent class as inputs; 2) 40 cluster centers generated by the k-means clustering algorithm on 1k concepts for each parent class. We train the parameters by the Adam Optimizer (Kingma and Ba, 2014) with an initial learning rate of 1e-3 and a batch size of 128.
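The evaluation metrics can be sketched as follows. Treating overall accuracy as requiring every level of a document to be predicted correctly is our reading of the metric, kept as an assumption; names are illustrative.

```python
import numpy as np

def level_accuracy(y_true, y_pred):
    """Fraction of documents whose label at one level is predicted correctly."""
    return float(np.mean(np.array(y_true) == np.array(y_pred)))

def overall_accuracy(levels_true, levels_pred):
    """A document counts as correct only if every level is predicted correctly
    (our reading of 'overall accuracy').
    levels_true/levels_pred: lists of per-level label arrays over the same docs."""
    correct = np.ones(len(levels_true[0]), dtype=bool)
    for t, p in zip(levels_true, levels_pred):
        correct &= (np.array(t) == np.array(p))
    return float(correct.mean())
```

Under this reading, overall accuracy is never higher than the worst per-level accuracy.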

Baselines
HDLTex Kowsari et al. (2017) show that hierarchical deep learning networks outperform conventional approaches (Naïve Bayes and SVM).
HNATC Sinha et al. (2018) propose a Hierarchical Neural Attention-based Text Classifier. They build one classifier for each level and concatenate the predicted category embedding at (i-1)-th level with each of the encoder's outputs to calculate attention scores for i-th level.

Compared with State-of-the-art Methods
To illustrate the practical significance of our proposed model, we make comparisons with several competitive state-of-the-art methods. The results of experiments conducted on the public datasets are shown in Table 2. Most of the state-of-the-art methods referred to in Section 3.3 adopt a hierarchical attention-based network as their models' framework. Within their models, the hierarchical framework is effective in utilizing the classification results of the previous levels for the next levels.
The label embedding attention mechanism helps to import external knowledge sources and the taxonomic hierarchy. On both datasets, the state-of-the-art methods obtain competitive performance. With a similar framework, our model focuses on the concept-based label embedding and outperforms the other methods on both level and overall accuracy. The results indicate the effectiveness of the concepts among classes, which have been ignored by previous work. The concept-based label embedding models related classes through the sharing mechanism with common concepts (visualizations in Section 3.6). The ablation comparisons are shown in Section 3.5. The experimental results of the two variants of our model are also shown in Table 2. Compared with directly feeding the concepts into the sharing networks (CLED), the variant CLED-cluster performs slightly better. This indicates that cluster centers generated by the k-means algorithm are more informative and effective.

Ablation Experiments
To investigate the effectiveness of different parts in our model, we carry out ablation studies. The experiment results are shown in Table 3.
Effectiveness of Concept-based Label Embedding By comparing the results of CLED and the model without the learnt concept-based label embedding (w/o CL), we further confirm that the concepts shared among classes help to improve the performance.

Effectiveness of Dynamic Routing
We remove the dynamic routing networks from the model CLED. Without dynamic routing to share the concepts from the parent classes to their child classes, an intuitive alternative is to represent the label embeddings by averaging the word embeddings of the child classes' concepts. Specifically, each parent class has top-30 ranked concepts to share with its child classes, so for the model without dynamic routing (w/o DR) we represent each child class label embedding with the top-30 ranked concepts of that child class. Although the concepts of child classes are more fine-grained and informative than the concepts of parent classes, the model CLED, which uses the dynamic routing networks to share concepts among classes, performs better. This indicates that modeling the sharing mechanism and learning to represent the child classes with common concepts are more effective.

Table 3: Ablation studies for different parts in our model.

Effectiveness of External Knowledge
We take an external knowledge source as the initial reference of the child classes in the concepts sharing module. When we remove the reference (w/o reference in CSM), the results are slightly worse in accuracy. This demonstrates that the external knowledge provides an effective reference for concept sharing. As in the state-of-the-art methods, the external knowledge is also used on its own as the representation of each class in our model; it helps to measure the compatibility between labels and texts via the attention mechanism. When we fully remove the external knowledge and initialize the label embeddings randomly (w/o EK), the performances are slightly worse than with external knowledge (CLED), indicating the effectiveness of external knowledge. Besides, the experiment that removes the predicted soft label embedding (w/o PRE) shows that utilizing the predictions of the previous level is effective.

Visualizations of Concepts Sharing
In this paper, we explicitly investigate the concept sharing process. A concept sharing module is designed to model the mechanism of sharing concepts among classes and measure the intensity of the interactions. The heat map of the learnt dynamic routing scores between the concepts of class "Computer Science" and its child classes is illustrated in Figure 3. The color changes from white to blue as the score increases. The score indicates the intensity between the concept and the class in the sharing process. In Figure 3, we find that the concept "design" is shared by the classes "Software engineering" and "Algorithm design", and the concept "distributed" is shared by the classes "Network security" and "Distributed computing". That is, a concept is shared by related classes.
We use t-SNE (Van der Maaten and Hinton, 2008) to visualize the concept embeddings of class "Computer Science" and the concept-based label embeddings of its child classes on a 2D map in Figure 4. The label embedding (red triangle) is constructed with the embeddings of concepts (blue dot). As shown, the class "Software engineering" is surrounded by the concepts "optimization" and "design". "Network security" is surrounded by "cloud", "machine" and "security". The class is described by several concepts in different views.
The visualizations in Figure 3 and 4 indicate that we successfully model the concept sharing mechanism in a semantic and explicit way.

Related Work
Hierarchical text classification with label embeddings
Recently, researchers have tried to adopt label embeddings in the hierarchical text classification task. Huang et al. (2019) propose a hierarchical attention-based recurrent neural network (HARNN) that adopts label embeddings. Mao et al. (2019) propose to learn a label assignment policy via deep reinforcement learning with label embeddings. Peng et al. (2019) propose hierarchical taxonomy-aware and attentional graph RCNNs with label embeddings. Rivas Rojas et al. (2020) define the HTC task as a sequence-to-sequence problem, with label embeddings defined by external knowledge. For modeling label dependencies, Zhou et al. (2020) formulate the hierarchy as a directed graph and introduce hierarchy-aware structure encoders. Cao et al. (2020) and Chen et al. (2020a) exploit hyperbolic representations for labels by encoding the taxonomic hierarchy.

Figure 3: Dynamic routing scores between the concepts of class "Computer Science" (Y-axis) and its child classes (X-axis).
Hierarchical text classification besides label embeddings
Following the motivation of this work, we have separated previous work with label embeddings from the rest of the HTC literature and presented it in the paragraph above. Beyond that, existing work is usually categorized into flat, local and global approaches (Silla and Freitas, 2011). The flat classification approach completely ignores the class hierarchy and only predicts classes at the leaf nodes (Aly et al., 2019). The local classification approaches can be grouped into a local classifier per node (LCN), a local classifier per parent node (LCPN) and a local classifier per level (LCL). The LCN approach trains one binary classifier for each node of the hierarchy (Fagni and Sebastiani, 2007). Banerjee et al. (2019) apply transfer learning in LCN by fine-tuning the parent classifier for the child class. For the LCPN, a multi-class classifier for each parent node is trained to distinguish between its child nodes (Wu et al., 2005; Dumais and Chen, 2000). Xu and Geng (2019) investigate the correlation among labels via the label distribution in an LCPN approach. The LCL approach consists of training one multi-class classifier for each class level (Kowsari et al., 2017; Shimura et al., 2018). Zhu and Bain (2017) introduce a B-CNN model which outputs predictions corresponding to the hierarchical structure. Chen et al. (2020b) propose a multi-level learning-to-rank model with multi-level hinge loss margins. The global approach learns a single classification model over the whole class hierarchy (Cai and Hofmann, 2004; Gopal and Yang, 2013; Wing and Baldridge, 2014; Karn et al., 2017). Qiu et al. (2011) exploit the latent nodes in the taxonomic hierarchy with a global approach. To alleviate the need for a large amount of training data, Meng et al. (2019) propose a weakly-supervised global HTC method.

Figure 4: t-SNE plot of the concept embeddings of the class "Computer Science" and the concept-based label embeddings of its child classes.
Meta-learning is adopted by Wu et al. (2019) for HTC in a global way. In addition, some work combines both local and global approaches (Wehrmann et al., 2018). A local flat tree classifier utilizing a graph-CNN is introduced by Peng et al. (2018).

Conclusion
In this paper, we investigate concepts, a kind of domain-specific and fine-grained information, for hierarchical text classification. We propose a novel concept-based label embedding model. Compared with several competitive state-of-the-art methods, the experimental results on two widely used datasets demonstrate the effectiveness of our proposed model. The visualizations of the concepts and the learnt concept-based label embeddings reveal the high interpretability of our model.