Label-Specific Dual Graph Neural Network for Multi-Label Text Classification

Multi-label text classification is one of the fundamental tasks in natural language processing. Previous studies have difficulty distinguishing similar labels because they learn the same document representation for different labels; that is, they do not explicitly extract label-specific semantic components from documents. Moreover, they do not fully explore the high-order interactions among these semantic components, which is very helpful for predicting tail labels. In this paper, we propose a novel label-specific dual graph neural network (LDGN), which incorporates category information to learn label-specific components from documents, and employs a dual Graph Convolution Network (GCN) to jointly model complete and adaptive interactions among these components based on the statistical label co-occurrence and a dynamic reconstruction graph. Experimental results on three benchmark datasets demonstrate that LDGN significantly outperforms state-of-the-art models, and also achieves better performance with respect to tail labels.


Introduction
Automatically assigning multiple labels to documents is a fundamental and practical task in natural language processing. Recently, with the growth of data scale, multi-label text classification (MLTC) has attracted more attention, since it is applied in many fields such as sentiment analysis (Liu and Chen, 2015; Li et al., 2016), emotion recognition (Jabreel and Moreno, 2019), web page tagging (Jain et al., 2016), and so on. However, the number of labels and documents and the complex relations among labels make it a challenging and still unsolved task.
Existing studies for multi-label text classification mainly focus on learning enhanced document representations (Liu et al., 2017) and modeling label dependency (Zhang et al., 2018; Yang et al., 2018; Tsai and Lee, 2019) to improve classification performance. Although they have explored the informative words in text content, or considered the label structure and label semantics to capture label correlations, these models cannot distinguish similar labels well (e.g., the categories Prices vs. Consumer Prices in Reuters News).
The main reason is that most of them neglect the semantic connections between labels and input documents, and they learn the same document representation for different labels, which cannot address the label similarity problem. More specifically, they do not explicitly consider the semantic parts of the document that correspond to each label.
Recently, some studies (You et al., 2019; Xiao et al., 2019; Du et al., 2019) have used attention mechanisms to explore the above semantic connections and learn a label-specific document representation for classification. These methods have obtained promising results in MLTC, which shows the importance of exploring semantic connections. However, they did not further study the interactions between label-specific semantic components, which can be guided by label correlations; as a result, these models do not work well on predicting tail labels, which is also a challenging issue in MLTC. To handle these issues, a common way to explore the semantic interactions between label-specific parts of a document is to utilize the statistical correlations between categories to build a label co-occurrence graph that guides the interactions.
Nevertheless, statistical correlations have three drawbacks. First, the co-occurrence patterns between label pairs obtained from training data are incomplete and noisy: label co-occurrences that appear in the test set but not in the training set are ignored, while some rare label co-occurrences in the statistics may be noise. Second, the label co-occurrence graph is built globally, which may be biased for rare label correlations, and is therefore not flexible for every sample document. Third, statistical label correlations may form a long-tail distribution, i.e., some categories are very common while most categories have few documents. This phenomenon may cause models to fail on low-frequency labels. Thus, our goal is to find a way to explore the complete and adaptive interactions among label-specific semantic components more accurately.
In this paper, we investigate: (1) how to explicitly extract the semantic components related to the corresponding labels from each document; and (2) how to accurately capture more complete and more adaptive interactions between label-specific semantic components according to label dependencies. To solve the first challenge, we exploit an attention mechanism to extract the label-specific semantic components from the text content, which can alleviate the label similarity problem. To capture more accurate high-order interactions between these semantic components, we first employ one Graph Convolution Network (GCN) to learn component representations, using the statistical label co-occurrence to guide the information propagation among nodes (components) in the GCN. Then, we use the component representations to reconstruct the adjacency graph dynamically and re-learn the component representations with another GCN, which allows us to capture the latent interactions between these semantic components. Finally, we exploit the final component representations to predict labels. We evaluate our model on three real-world datasets, and the results show that the proposed LDGN outperforms all comparison methods. Further studies demonstrate its ability to effectively alleviate the tail label problem and to accurately capture meaningful interactions between label-specific semantic components.
The contributions of this paper are as follows: • We propose a novel label-specific dual graph neural network (LDGN), which incorporates category information to extract label-specific components from documents, and explores the interactions among these components.
• To model the accurate and adaptive interactions, we jointly exploit global co-occurrence patterns and local dynamic relations. To make up the deficiency of co-occurrences, we employ the local reconstruction graph which is built by every document dynamically.
• We conduct a series of experiments on three public datasets, and experimental results demonstrate that our model LDGN significantly outperforms the state-of-the-art models, and also achieves better performance with respect to tail labels.

Model
As depicted in Figure 1, our model LDGN is composed of two major modules: 1) label-specific document representation, and 2) a dual graph neural network for semantic interaction learning. Specifically, label-specific document representation learning describes how to extract label-specific semantic components from the mixture of label information in each document; the dual graph neural network for semantic interaction learning illustrates how to accurately explore the complete interactions among these semantic components under the guidance of the prior knowledge of statistical label co-occurrence and the posterior information of the dynamic reconstruction graph.

Problem Formulation: Let D = {x_i, y_i}^N be the set of documents, which consists of N documents x_i and their corresponding labels y_i ∈ {0, 1}^{|C|}, where |C| denotes the total number of labels. Each document x_i contains J words, x_i = {w_i1, w_i2, ..., w_iJ}. The target of multi-label text classification is to learn the mapping from an input text sequence to the most relevant labels.

Label-specific Document Representation
Given a document x with J words, we first embed each word w_j in the text into a word vector e_{w_j} ∈ R^d, where d is the dimensionality of the word embedding. To capture contextual information from both directions of the word sequence, we use a bidirectional LSTM to encode word-level semantic information, and concatenate the forward and backward hidden states to obtain the word sequence representation H ∈ R^{J×D}. After that, to explicitly extract the semantic component related to each label from the document, we use a label-guided attention mechanism to learn label-specific text representations. Firstly, we randomly initialize the label representations C ∈ R^{|C|×d_c} and compute the label-aware attention values. Then, we induce the label-specific semantic components based on the label-guided attention:

α_ij = exp(c_i h_j^T) / Σ_{j'} exp(c_i h_{j'}^T),    u_i = Σ_{j=1}^{J} α_ij h_j

where α_ij indicates how informative the j-th text feature vector is for the i-th label, and u_i ∈ R^D denotes the semantic component related to label c_i in this document.
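As an illustration, the label-guided attention step can be sketched in a few lines of numpy. The shapes and the softmax-over-words form follow the description above; the exact score function (an unnormalized dot product between label embeddings and word features, assuming they share a dimension) is our simplification, not necessarily the authors' exact formulation.

```python
import numpy as np

def label_attention(H, C):
    """Label-guided attention (a minimal numpy sketch).

    H : (J, D) word-sequence features from the Bi-LSTM.
    C : (|C|, D) label embeddings, assumed here to share the
        dimension D with the text features so the dot product is defined.
    Returns U : (|C|, D), one semantic component u_i per label,
    and alpha : (|C|, J), the attention weights over words.
    """
    scores = C @ H.T                                # (|C|, J) label-word affinities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)       # softmax over the J words
    U = alpha @ H                                   # weighted sum of word features
    return U, alpha
```

Each row of `alpha` sums to one, so `u_i` is a convex combination of the word features, weighted by relevance to label i.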

Dual Graph Neural Network
Interaction Learning with Statistical Label Co-occurrence

To capture the mutual interactions between the label-specific semantic components, we build a label graph based on the prior knowledge of label co-occurrence, in which each node corresponds to a label-specific semantic component u_i. We then apply a graph neural network to propagate messages between nodes.
Formally, we define the label graph G = (V, E), where nodes refer to the categories and edges refer to the statistical co-occurrence between nodes (categories). Specifically, we compute the conditional probabilities between all label pairs in the training set to obtain the matrix A^S ∈ R^{|C|×|C|}, where A^S_{ij} denotes the conditional probability of a sample belonging to category C_i given that it belongs to category C_j.
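The statistical matrix A^S can be estimated directly from the binary training label matrix. A minimal numpy sketch (function and variable names are ours) is:

```python
import numpy as np

def cooccurrence_matrix(Y):
    """Conditional label co-occurrence A^S from training labels (sketch).

    Y : (N, |C|) binary label matrix of the training set.
    A_s[i, j] = P(label i | label j) = count(i and j) / count(j).
    """
    Y = np.asarray(Y, dtype=float)
    joint = Y.T @ Y                      # (|C|, |C|) pairwise co-occurrence counts
    counts = Y.sum(axis=0)               # per-label document counts
    A_s = joint / np.maximum(counts, 1)  # normalize column j by count(j)
    return A_s
```

For example, if label 0 and label 1 co-occur in one of the two documents tagged with label 1, then A_s[0, 1] = 0.5.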
Then, we utilize a GCN (Kipf and Welling, 2017) to learn the deep relationships between label-specific semantic components, guided by the statistical label correlations. GCNs are neural networks operating on graphs, which are capable of enhancing node representations by propagating messages between neighboring nodes.
In a multi-layer GCN, each layer takes the component representations H^l from the previous layer as input and outputs enhanced component representations H^{l+1}. The layer-wise propagation rule is:

H^{l+1} = σ(Â H^l W^l)

where σ(·) denotes the LeakyReLU (Maas et al., 2013) activation function, W^l ∈ R^{D×D} is a transformation matrix to be learned, and Â is the normalized adjacency matrix, with the normalization (Kipf and Welling, 2017):

Â = D^{-1/2} A D^{-1/2}

where D is the diagonal degree matrix with entries D_ii = Σ_j A_ij. Depending on how many convolutional layers are used, a GCN can aggregate information only from immediate neighbors (one layer) or from all nodes within at most K hops (K stacked layers). See (Kipf and Welling, 2017) for more details on GCNs.
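Under these definitions, one propagation step reduces to a couple of matrix products. A minimal numpy sketch, assuming LeakyReLU's common slope of 0.2 and omitting the self-loop trick for brevity:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (Kipf & Welling)."""
    d = A.sum(axis=1)                               # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(H, A_hat, W, slope=0.2):
    """One GCN layer H' = LeakyReLU(A_hat @ H @ W) (sketch)."""
    Z = A_hat @ H @ W
    return np.where(Z > 0, Z, slope * Z)            # LeakyReLU
```

Stacking two calls to `gcn_layer` (with separate weight matrices) gives the two-layer GCN used below, where information flows across 2-hop neighborhoods of the label graph.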
We use a two-layer GCN to learn the interactions between label-specific components. The first layer takes the initialized component representations U ∈ R^{|C|×D} (Equation 2) as input H^0, and the last layer outputs H^2 ∈ R^{|C|×D}, with D denoting the dimensionality of the final node representations.
However, the statistical label correlations obtained from training data are incomplete and noisy, and the co-occurrence patterns between label pairs may form a long-tail distribution.

Re-learning with Dynamic Reconstruction Graph
To capture more complete and adaptive interactions between these components, we exploit the above component representations H^2 to reconstruct the adjacency graph dynamically, which can make up for the deficiencies of the co-occurrence matrix. We then re-learn the interactions among the label-specific components, guided by the posterior information of the dynamic reconstruction graph.
Specifically, we apply two 1×1 convolution layers and a dot product to obtain the dynamic reconstruction graph A^D:

A^D = f((W_a H^2)(W_b H^2)^T)

where W_a and W_b are the weights of the two convolution layers and f is the sigmoid activation function. We then normalize the reconstruction adjacency matrix as in Equation 4 to obtain the normalized adjacency matrix Â^D of the reconstruction graph.
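The reconstruction step amounts to two learned projections followed by pairwise dot products and a sigmoid. A sketch, treating each 1×1 convolution over the |C| components as a per-node linear map (which is what it reduces to here); the projection dimension D' is our assumption:

```python
import numpy as np

def reconstruction_graph(H, Wa, Wb):
    """Dynamic reconstruction graph A^D = sigmoid((H Wa)(H Wb)^T) (sketch).

    H      : (|C|, D) component representations from the first GCN.
    Wa, Wb : (D, Dp) weights of the two 1x1 convolutions.
    Returns A^D : (|C|, |C|) edge weights in (0, 1), one per component pair.
    """
    P = H @ Wa                          # (|C|, Dp) first projection
    Q = H @ Wb                          # (|C|, Dp) second projection
    S = P @ Q.T                         # pairwise similarity scores
    return 1.0 / (1.0 + np.exp(-S))     # sigmoid keeps weights in (0, 1)
```

Because A^D is computed from H^2 of the current document, the resulting graph differs per sample, which is what makes the second GCN's interactions adaptive.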
In a similar way as Equation 3, we apply another 2-layer GCN to learn the deep correlations between components with the dynamic reconstruction graph. The first layer of this GCN takes the component representations H 2 as inputs, and the last layer outputs the final component representations H 4 ∈ R |C|×D .

Multi-label Text Classification
After the above procedures, we concatenate the two types of component representations, H^O = [H^2, H^4], and feed the result into a fully connected layer for prediction: ŷ = σ(W_1 H^O), where W_1 ∈ R^{2D×1} and σ is the sigmoid function.
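The prediction layer, together with a standard multi-label binary cross entropy objective, can be sketched as follows (a simplified single-document version; the exact reduction over labels is our assumption):

```python
import numpy as np

def predict(H2, H4, W1):
    """Concatenate the two component representations and score each label.

    H2, H4 : (|C|, D) outputs of the two GCNs.
    W1     : (2D, 1) weights of the fully connected layer.
    Returns y_hat : (|C|,) per-label probabilities.
    """
    HO = np.concatenate([H2, H4], axis=1)   # (|C|, 2D)
    logits = (HO @ W1).ravel()
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid per label

def bce_loss(y_hat, y, eps=1e-12):
    """Multi-label binary cross entropy, averaged over labels."""
    return -np.mean(y * np.log(y_hat + eps)
                    + (1 - y) * np.log(1 - y_hat + eps))
```

Each label is scored independently through the shared weight vector, so the model can emit any subset of the |C| labels.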
We use y ∈ R^{|C|} to represent the ground-truth labels of a document, where y_i ∈ {0, 1} denotes whether label i appears in the document. The proposed model LDGN is trained with the multi-label cross entropy loss:

L = -Σ_{i=1}^{|C|} [y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i)]

Experiment

Experimental Setup Datasets
We evaluate the proposed model on three benchmark multi-label text classification datasets: AAPD (Yang et al., 2018), EUR-Lex (Mencia and Fürnkranz, 2008) and RCV1 (Lewis et al., 2004). The statistics of these three datasets are listed in Table 1.

Evaluation Metric
Following the settings of previous work (You et al., 2019; Xiao et al., 2019), we use precision at top k (P@k) and normalized Discounted Cumulative Gain at top k (nDCG@k) for performance evaluation. The definitions of the two metrics can be found in You et al. (2019).
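For reference, both metrics can be computed in a few lines. This numpy sketch assumes binary relevance, following the standard definitions cited above:

```python
import numpy as np

def precision_at_k(scores, y, k):
    """P@k: fraction of the top-k predicted labels that are relevant."""
    top = np.argsort(-scores)[:k]       # indices of the k highest scores
    return y[top].sum() / k

def ndcg_at_k(scores, y, k):
    """nDCG@k with binary relevance: DCG of the top-k, divided by the
    ideal DCG (all relevant labels ranked first)."""
    top = np.argsort(-scores)[:k]
    dcg = (y[top] / np.log2(np.arange(2, k + 2))).sum()
    ideal = min(int(y.sum()), k)
    idcg = (1.0 / np.log2(np.arange(2, ideal + 2))).sum()
    return dcg / idcg if idcg > 0 else 0.0
```

For a document with one true label ranked first, nDCG@k is 1.0 regardless of k, while P@k decreases as k grows past the number of true labels.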

Implementation Details
For a fair comparison, we apply the same dataset split as previous work (Xiao et al., 2019), which is also the original split provided by dataset publisher (Yang et al., 2018;Mencia and Fürnkranz, 2008).
The word embeddings in the proposed network are initialized with 300-dimensional word vectors, which are trained on the datasets with the Skip-gram algorithm (Mikolov et al., 2013). The hidden sizes of the Bi-LSTM and the GCNs are set to 300 and 512, respectively. We use the Adam optimization method (Kingma and Ba, 2014) to minimize the cross-entropy loss; the learning rate is initialized to 1e-3 and gradually decreased during training. We select the best parameter configuration based on performance on the validation set and evaluate that configuration on the test set. Our code is available on GitHub.

Baselines
We compare the proposed model with recent deep learning based methods for MLTC, including seq2seq models, deep embedding models, and label attention based models. It should be noted that, because of different application scenarios, we did not choose label tree-based methods or extreme-text-focused methods as baselines.

• SGM (Yang et al., 2018): a sequence generation model which models label correlations as an ordered sequence.
• DXML (Zhang et al., 2018): a deep embedding method which models the feature space and the label graph structure simultaneously.

The SotA model (i.e., LSAN) uses a BiLSTM for text representation. For a fair comparison, we also use a BiLSTM as the text encoder in our model.

Table 2 and Table 3 report the performance of all compared methods on the three datasets. For fair comparison, the experimental results of the baseline models are cited directly from previous studies (Xiao et al., 2019). We bold the best result in each column of all tables.

Experimental Results and Analysis
As shown in Table 2 and Table 3, LDGN outperforms all baselines on the three datasets. The outstanding results confirm the effectiveness of label-specific semantic interaction learning with the dual graph neural network, which combines global statistical patterns and local dynamic relations. It is observed that the performance of XML-CNN is worse than that of the other comparison methods. The reason is that it only exploits the text content of documents for classification and ignores the label correlations, which have been proven very important for multi-label classification.
The label tree-based model AttentionXML performs better than the seq2seq method (SGM) and the deep embedding method (DXML). Although both DXML and SGM employ a label graph or an ordered sequence to model the relationship between labels, they ignore the interactions between labels and document content. In contrast, AttentionXML uses multi-label attention, which can focus on the most relevant parts of the content and extract different semantic information for each label.
Compared with the other label attention based methods (AttentionXML, EXAM), LSAN performs best because it simultaneously takes into account the semantic correlations between document content and label text, exploiting an adaptive fusion of self-attention and label-attention mechanisms to learn the label-specific document representation.
In conclusion, the proposed LDGN outperforms sequence-to-sequence models, deep embedding models, and label attention based models, with significant improvements in P@k and nDCG@k. Specifically, on the AAPD dataset, LDGN increases P@1 over the LSAN method (the best baseline) from 85.28% to 86.24%, and increases nDCG@3 and nDCG@5 from 80.84% to 83.33% and from 84.78% to 86.85%, respectively. On the EUR-Lex dataset, P@1 is boosted from 79.17% to 81.03%, and P@5 and nDCG@5 are increased from 53.67% to 56.36% and from 62.47% to 66.09%, respectively. On the RCV1 dataset, P@k increases by 0.3% on average, and LDGN achieves 1% and 1.6% absolute improvements on nDCG@3 and nDCG@5 compared with LSAN. These improvements demonstrate that semantic interaction learning with joint global statistical relations and local dynamic relations is generally helpful and effective, and that LDGN captures deeper correlations between categories than LSAN.

Ablation Test
We perform a series of ablation experiments to examine the relative contributions of the dual graph-based semantic interaction module. To this end, LDGN is compared with three variants: (1) S: graph-based semantic interactions with only the statistical label co-occurrence; (2) D: graph-based semantic interactions with only the dynamic reconstruction graph; (3) no-G: removing the dual graph neural network. For a fair comparison, both S and D use a 4-layer GCN, the same depth as LDGN.
As presented in Figure 3, S and D perform better than no-G, which demonstrates that exploring either statistical or dynamic relations can capture effective semantic interactions between label-specific components. D performs better than S, indicating that the model with local dynamic relations adapts to the data and has better stability and robustness; it can capture semantic dependencies more effectively and accurately. The performance of S+D (i.e., LDGN), which combines both types of relations, improves significantly, showing that dynamic relations can make up for the deficiencies of statistical co-occurrence and correct the bias of global correlations. Thus, it is necessary to exploit their joint effects to further boost performance.

Performance on tail labels
In order to prove the effectiveness of the proposed LDGN in alleviating the tail label problem, we evaluate the performance of LDGN by propensity scored precision at k (PSP@k), which is calculated as follows:

PSP@k = (1/k) Σ_{l=1}^{k} y_rank(l) / P_rank(l)

where P_rank(l) is the propensity score (Jain et al., 2016) of the label at rank l. Figure 2 shows the results of LDGN and LSAN on the three datasets.

[Figure 4: Visualization of label attention weights. The attention weights of 'physics.soc' for words are shaded in blue, and the attention scores of classes cs.cy and cs.ce are shaded in green and yellow, respectively. Darker color represents higher weight score.]
As shown in Figures 2(a), 2(b) and 2(c), the proposed LDGN predicts tail labels better than the LSAN model (the best baseline) on all three datasets. Specifically, on the RCV1 dataset, LDGN achieves 0.97% and 1.35% absolute improvements in terms of PSP@3 and PSP@5 compared with LSAN. On the AAPD dataset, PSP@k increases by at least 0.63% and up to 0.90%. And on the EUR-Lex dataset, LDGN achieves 1.94%, 3.64% and 4.93% absolute improvements on PSP@1, 3, 5 compared with LSAN. The improvement is more obvious on the EUR-Lex dataset because semantic interaction learning is more useful for capturing related information when the number of labels is large.
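A sketch of the PSP@k computation, assuming the per-label propensity scores have already been estimated as in Jain et al. (2016):

```python
import numpy as np

def psp_at_k(scores, y, propensity, k):
    """Propensity scored precision at k (sketch).

    scores     : (|C|,) predicted label scores for one document.
    y          : (|C|,) binary ground-truth labels.
    propensity : (|C|,) per-label propensity scores P; tail labels
                 have small P, so hitting them contributes more.
    """
    top = np.argsort(-scores)[:k]       # top-k predicted labels
    return (y[top] / propensity[top]).sum() / k
```

Because correct tail-label predictions are divided by a small propensity, PSP@k rewards models that get rare labels right rather than only the head labels.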
The results prove that LDGN can effectively alleviate the problem of predicting tail labels.

Case Study
To further verify the effectiveness of our label attention module and dual graph neural network in LDGN, we present a typical case and visualize the attention weights on the document words and the similarity scores between label-specific components. We show a test sample from original AAPD dataset, and the document belongs to three categories, 'Physics and Society' (physics.soc), 'Computers and Society' (cs.cy) and 'Computational Engineering, Finance, and Science' (cs.ce).

Visualization of Attention
We can observe from Figure 4 that different labels focus on different parts of the document text, and each label has its own concerned words. For example, the more important parts for the 'physics.soc' category are 'digitalization power grid' and 'energy management'. The words that the 'cs.ce' category focuses on are 'consuming systems', 'varying prices', 'laying foundations', 'lower', etc. For class 'cs.cy', the concerned words are 'samples dutch distribution', 'evolutions' and 'topologies'. The related words of the three categories reflect the semantics of the categories.

Visualization of Interactions

To gain a clearer view of the importance of our dual graph-based interaction learning module, we display two heatmaps in Figure 5 to visualize the partial graph structure of the dual GCN. The edge weights shown in the heatmaps are obtained from the global label co-occurrence and the local dynamic relations (i.e., computed by Equation 5), respectively.
As presented in the heatmaps, different relations between categories are captured by the dual GCN. In the global statistical relations, 'cs.cy' is highly linked with 'physics.soc' and the wrong label 'nlin.ao', while the true label 'cs.ce' is isolated. In the local dynamic relations, 'cs.cy' is more related to 'cs.ce', and the correlations between the wrong label 'nlin.ao' and the true labels are reduced. This demonstrates that local dynamic relations can capture latent relations that do not appear in the global relations, and correct the bias of global correlations.

Multi-label Text Classification
The existing methods for MLTC mainly focus on learning enhanced document representation (Liu et al., 2017) and modeling label dependency (Nam et al., 2017;Yang et al., 2018;Tsai and Lee, 2019) to improve the classification performance.
With the wide application of neural network methods to text representation, some innovative models have been developed for this task, including traditional deep learning methods and Seq2Seq-based methods. Liu et al. (2017) employed CNNs and dynamic pooling to learn the text representation for MLTC; however, they treated all words equally and could not exploit the informative words in documents. The Seq2Seq methods, such as MLC2Seq (Nam et al., 2017) and SGM (Yang et al., 2018), employ an RNN to encode the input text and an attention-based RNN decoder to generate the predicted labels sequentially. Although they use attention mechanisms to capture the informative words in text content, these models cannot distinguish similar labels well, largely because they neglect the semantic connections between labels and the document, and learn the same document representation for different labels.
Recently, some studies (You et al., 2019; Xiao et al., 2019; Du et al., 2019) have used attention mechanisms to explore the interactions between words and labels, and learned a label-specific document representation for classification. These methods have obtained promising results in MLTC, which shows the importance of exploring semantic connections. However, they did not further study the interactions between label-specific semantic components, which can help to predict low-frequency labels.
To handle these issues, a common way to explore the semantic interactions between label-specific parts of a document is to utilize a label graph based on statistical co-occurrences.

MLC with Label Graph
In order to capture the deep correlations of labels in a graph structure, many studies in image classification apply node embedding and graph neural network models to multi-label image classification. Lee et al. (2018) incorporated knowledge graphs to describe the relationships between labels, where information propagation can model the dependencies between seen and unseen labels for multi-label zero-shot learning. Chen et al. (2019) learned label representations with a prior label correlation matrix in a GCN, and mapped the label representations to inter-dependent classifiers, achieving superior performance.
However, there are few related approaches for multi-label classification of text. Zhang et al. (2018) established an explicit label co-occurrence graph to explore label embeddings in a low-dimensional latent space.
Furthermore, the statistical label correlations obtained from training data are incomplete and noisy, and the co-occurrence patterns between label pairs may form a long-tail distribution. Thus, our goal is to find a way to explore the complete and adaptive interactions among label-specific semantic components more accurately.

Conclusion
In this paper, we propose a graph-based network, LDGN, to capture the semantic interactions related to the corresponding labels, which jointly exploits global statistical patterns and local dynamic relations to derive complete and adaptive dependencies between different label-specific semantic parts. We first exploit multi-label attention to extract the label-specific semantic components from documents. Then, we employ a GCN to learn component representations, using label co-occurrences to guide the information propagation among components. After that, we use the learned component representations to compute the adjacency graph dynamically and re-learn with another GCN based on the reconstruction graph. Extensive experiments conducted on three public datasets show that the proposed LDGN model outperforms other state-of-the-art models on the multi-label text classification task, and is also much more effective at alleviating the tail label problem. In the future, we will improve the efficiency of the proposed model, for example by constructing a dynamic graph for a group of samples rather than for each sample, and we will explore more label information for multi-label classification.