Beyond Text: Incorporating Metadata and Label Structure for Multi-Label Document Classification using Heterogeneous Graphs

Multi-label document classification, which associates one document instance with a set of relevant labels, is attracting increasing research attention. Existing methods explore the incorporation of information beyond text, such as document metadata or label structure. However, these approaches either use only the semantic information of metadata or employ the predefined parent-child label hierarchy, ignoring the heterogeneous graphical structures of metadata and labels, which we believe are crucial for accurate multi-label document classification. In this paper, we therefore propose a novel neural-network-based approach for multi-label document classification, in which two heterogeneous graphs are constructed and learned using heterogeneous graph transformers. One is a metadata heterogeneous graph, which models various types of metadata and their topological relations. The other is a label heterogeneous graph, which is constructed based on both the labels' hierarchy and their statistical dependencies. Experimental results on two benchmark datasets show that the proposed approach outperforms several state-of-the-art baselines.


Introduction
With the rapid growth of scientific documents, it is difficult to track related literature manually. For example, more than 200,000 publications related to COVID-19 had appeared by April 2021. It is therefore crucial to automatically assign publications to their corresponding categories. Multi-label document classification (MLDC), associating one document instance with a set of most relevant labels, is attracting increasing research attention.
To tackle the problem, early work focused on learning semantic representations of the input text using text encoders. For example, Liu et al. (2017) proposed the XML-CNN model using a convolutional neural network, and You et al. (2018) proposed the AttentionXML model using a recurrent neural network as the text semantic encoder. Recently, Chang et al. (2020) proposed the X-Transformer model, a deep transformer model fine-tuned for MLDC. Different from the aforementioned approaches, there have been attempts to explore information beyond text for MLDC. On the one hand, information associated with labels, such as label semantics and the relationships between labels, has been employed. For example, Xiao et al. (2019) generated label-specific document representations using label semantic information. You et al. (2018) improved classification performance by constructing a hierarchical label tree. To model label dependency, MLDC has also been cast as a seq2seq task. On the other hand, document metadata has been incorporated. For example, the representations of a document and its metadata can be learned in the same embedding space (Zhang et al., 2021). The label hierarchy has also been applied to regularize the output probability of each child label by its parents.
However, existing methods have two limitations. (1) The heterogeneous structure of a document's metadata is ignored. As shown in Figure 1a, for the BERT paper (Devlin et al., 2018), a heterogeneous metadata graph can be constructed which consists of multiple authors, multiple references, and a publication venue. It can be observed that different types of nodes convey information at different granularities. It is worth noting that labels can also be categorized by granularity. For example, in Figure 1b, yellow nodes are more coarse-grained than blue nodes. The node "NAACL" of the venue type in Figure 1a carries coarse-grained information relating to the label node "Natural language processing" in Figure 1b, while the node "GloVe: Global Vectors for Word Representation" of the reference type in Figure 1a contains fine-grained information relating to the label node "Word Embedding" in Figure 1b. We believe that modeling such metadata with a heterogeneous graph structure helps to improve the accuracy of the final classification. (2) The implicit statistical dependencies between labels are ignored. As shown in Figure 1b, there might exist statistical dependencies between labels. For example, the label "topic model" might have a strong association with "machine learning", but such information is not captured in the label hierarchy.
To tackle the above limitations, in this paper, we propose a novel neural network based approach for multi-label document classification using two heterogeneous graphs. Specifically, a metadata heterogeneous graph with four node types and five edge types is constructed to capture the metadata information and a label heterogeneous graph with two edge types is constructed to capture both the label hierarchy and labels' statistical dependencies. Both graphs are learned using the heterogeneous graph transformer. Moreover, to fully utilize the label information, the label-specific document representation is learned.
The main contributions of this paper are as follows: • A novel neural-network-based approach is proposed to utilize the information learned from two heterogeneous graphs. To the best of our knowledge, we are the first to incorporate both document metadata and label structure information using heterogeneous graphs.
• Experimental results on two benchmark datasets show that the proposed approach outperforms existing state-of-the-art approaches for multi-label document classification.

Related Work
Based on the information utilized for model learning, approaches for multi-label document classification can be categorized into two types: using textual information only or additionally incorporating external information.
For approaches solely based on textual information, early attempts (Babbar and Schölkopf, 2017; Jain et al., 2016) employed bag-of-words representations. Recently, deep learning based approaches were proposed to learn better text representations. For example, Liu et al. (2017) proposed to use a convolutional neural network for text encoding. You et al. (2018) proposed a neural network approach with an attention mechanism to capture the most relevant parts of the input text for each label. Chang et al. (2020) employed a pre-trained Transformer (Vaswani et al., 2017) to capture textual information for text classification. However, such methods ignore the information beyond text, which we believe is crucial for accurate multi-label document classification.
Other approaches attempted to incorporate external information, and can be further classified into two categories: those using document metadata and those using label information. For approaches utilizing metadata, Tang et al. (2015) proposed a neural network approach for sentiment analysis which incorporates user and product meta information via a vector space model. Kim et al. (2019) employed categorical metadata signals as additional features to train a deep neural network classifier. Zhang et al. (2021) developed an approach called MATCH, which pre-trains the embeddings of text and metadata in the same space and uses a Transformer to capture the relationships between them. Nevertheless, these methods ignored the heterogeneous structure of a document's metadata.
Approaches using label information have considered label semantic information, label hierarchy, and label dependency. To incorporate label semantic information, Du et al. (2019) proposed an interactive mechanism that incorporates word-level matching signals into the text classification task. LSAN (Xiao et al., 2019) leveraged label semantic information to determine the semantic connection between each label and the document. Pappas and Henderson (2019) proposed the GILE model, a joint input-label embedding model for neural text classification. To incorporate label hierarchy, Gopal and Yang (2015) employed a recursive regularization to encourage similarity between the child classifiers and their parent classifier. Peng et al. (2018) further extended this regularization to a text semantic encoder based on graph convolutional neural networks. Other work constructed a label dependency graph to model the label embeddings in the label space based on the graph prior, or converted multi-label document classification into a Seq2Seq task and predicted label sequences with label dependence. Zhang et al. (2021) proposed the MATCH model and employed different ways to regularize the parameters and output probability of each child label by its parents. However, these methods cannot jointly consider label semantic information, label hierarchy, and label dependency.
Our approach is partly inspired by MATCH (Zhang et al., 2021), but with the following differences: (1) the heterogeneous structure of the document's metadata is modeled in the proposed approach, which is ignored in (Zhang et al., 2021); and (2) the proposed approach incorporates the implicit statistical dependencies between labels, which are not considered in (Zhang et al., 2021).

Problem Definition
In traditional approaches, the multi-label document classification task is to learn a mapping function f : W_d → ŷ_d using only the textual information of document d, where ŷ_d is the set of relevant labels of document d and W_d is the word sequence w_1, w_2, ..., w_{|W_d|}. However, as shown in Figures 1a and 1b, information beyond text (such as the metadata of documents, the label semantic information, and the label hierarchy) is crucial for accurate document classification. Therefore, in this paper, not only the document text but also the information of the metadata and labels is considered. Given a dataset, a label set is denoted as L = {l_1, l_2, ..., l_{|L|}}, and a label hierarchy is represented as a directed acyclic graph G_hierarchy = (L, E_hierarchy), where E_hierarchy is the set of edges representing the parent-child relationships between labels. The semantic information of each label is represented as a k-dimensional embedding vector c_i ∈ R^k. Given a document d ∈ D and its word sequence W_d, its associated metadata set is denoted as M_d ⊆ M, where M is the set of all metadata in the dataset. Taking the BERT paper in Figure 1a as an example, W_d is its title and abstract, and M_d is the set of its metadata, such as an author of the paper, "Jacob Devlin", and a paper it cites, "Attention is All You Need". The task aims to learn a mapping function f that maps the document text W_d, together with its metadata M_d and the label information, to the relevant label set ŷ_d.
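The task inputs above can be sketched as simple data structures. This is a minimal illustration with hypothetical field names (not the authors' actual data format), populated with the BERT-paper example from Figure 1a:

```python
from dataclasses import dataclass, field

# A minimal sketch of the task inputs; field names are hypothetical.
@dataclass
class Document:
    words: list            # W_d: title and abstract tokens
    metadata: dict = field(default_factory=dict)  # M_d, keyed by metadata type

# Label side: the hierarchy G_hierarchy is a DAG of parent -> child edges.
label_set = ["NLP", "Word Embedding", "Machine Learning", "Topic Model"]
hierarchy_edges = [("NLP", "Word Embedding"), ("Machine Learning", "Topic Model")]

doc = Document(
    words=["bert", "pre-training", "deep", "bidirectional", "transformers"],
    metadata={"author": ["Jacob Devlin"], "venue": ["NAACL"],
              "reference": ["Attention is All You Need"]},
)
# The classifier f maps (W_d, M_d) plus the label information to a label subset.
```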

Metadata Heterogeneous Graph
Construction In this subsection, we present how to construct a metadata heterogeneous graph (MHG) based on the relationship between documents and their metadata.
The MHG schema is shown in Figure 3a, which contains four types of nodes: document, author, venue, and reference. In addition, there are five types of edges in the MHG, as shown in Figure 3b. It is worth noting that the document → venue relationship is ignored, as each node of the venue type is connected with a large number of nodes of the document type, which increases the computational complexity. Taking the MAG-CS dataset as an example, the number of document → venue edges is 705,407 and the number of venue nodes is 105, so on average each venue is connected to 6,718 documents.
After defining the MHG schema, the MHG can be denoted mathematically as G_meta = (V_meta, E_meta, A_meta, R_meta), where V_meta is the combined set of nodes representing all documents D and all metadata M, E_meta is the set of edges, A_meta is the set of four node types, and R_meta is the set of five edge types. In the MHG, each node v_meta and each edge e_meta are associated with type mapping functions τ(v_meta): V_meta → A_meta and φ(e_meta): E_meta → R_meta. Taking Figure 1a as an example, the "NAACL" node belongs to V_meta, and its type, venue, is a node type belonging to A_meta.
In the MHG, an edge e_meta = (v_i, v_j) indicates the relevance between nodes v_i and v_j. Its weight is given by the corresponding entry of the adjacency matrix: A_φ(e_meta)[i, j] = 1 if (v_i, v_j) ∈ E_meta and 0 otherwise, where φ(v_i, v_j) is the edge type of (v_i, v_j). For each document node, the relevance to its neighbors needs to be calculated, except for neighbors v_j of the venue type. For each edge type φ(e_meta), there is a corresponding adjacency matrix A_φ(e_meta) ∈ R^{|V_meta|×|V_meta|} in the MHG.
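The per-edge-type adjacency matrices can be built straightforwardly. The sketch below uses a tiny hypothetical MHG with made-up edge-type names ("writes", "cites"), and deliberately omits document → venue edges as described above:

```python
import numpy as np

# Hypothetical toy MHG: nodes indexed 0..4, one type per node.
node_types = ["document", "author", "reference", "venue", "document"]
# Edges grouped by edge type; document -> venue edges are deliberately absent.
edges_by_type = {
    "writes": [(1, 0), (1, 4)],   # author -> document
    "cites":  [(0, 2)],           # document -> reference
}

n = len(node_types)
# One |V_meta| x |V_meta| binary adjacency matrix per edge type, as in A_phi(e).
adj = {etype: np.zeros((n, n)) for etype in edges_by_type}
for etype, pairs in edges_by_type.items():
    for i, j in pairs:
        adj[etype][i, j] = 1.0
```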

Label Heterogeneous Graph
Construction In this subsection, we introduce how to construct a label heterogeneous graph based on the label hierarchy and statistical dependencies between labels.
Unlike the MHG, the LHG contains only one node type, label, and two edge types, representing the label hierarchy and the statistical dependency respectively. Taking the "Topic Model" label in Figure 1b as an example, ("NLP", is_the_parent_label_of, "Topic Model") is an edge representing the label hierarchy, and ("Topic Model", depends_on, "Machine Learning") is an edge representing a label statistical dependency.
Therefore, we define the label heterogeneous graph as G_label = (V_label, E_label, A_label, R_label), where V_label is the set of label nodes, E_label is the set of all edges, A_label is the set containing only the label node type, and R_label is the set consisting of the two edge types.
A label hierarchy edge (v_i, v_j) indicates the parent-child relation between two label nodes v_i and v_j. Its weight is given by A_hierarchy[i, j] = 1 if v_i is the parent label of v_j and 0 otherwise, where A_hierarchy is the adjacency matrix of the label hierarchy in G_label. Following Chen et al. (2019), the conditional probability between two labels is employed to represent the label statistical dependency. For example, as shown in Figure 1b, the label statistical dependency takes the form of a conditional probability P(B|A), which denotes the probability of the presence of label B when label A appears. In addition, a threshold is used to filter noisy edges among the label statistical dependency edges. Thus, we obtain the adjacency matrix A_dependency of the label statistical dependency in G_label as A_dependency[i, j] = P(l_j | l_i) if P(l_j | l_i) is no less than the threshold, and 0 otherwise. The adjacency matrices of the LHG are then A_hierarchy and A_dependency, one per edge type.
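The dependency adjacency matrix can be estimated from label co-occurrence counts in the training set. This is a sketch under the assumption that labels are given as a binary document-by-label matrix; the toy data and the threshold value 0.3 (the value reported later in the Results section) are illustrative:

```python
import numpy as np

# Toy label assignments for 5 documents over 3 labels (rows: docs, cols: labels).
Y = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [1, 0, 0]], dtype=float)

counts = Y.sum(axis=0)       # occurrence count of each label
co = Y.T @ Y                 # co-occurrence counts between label pairs
P = co / counts[:, None]     # P[i, j] = P(label_j | label_i)

threshold = 0.3              # filter noisy dependency edges
A_dep = np.where(P >= threshold, P, 0.0)
np.fill_diagonal(A_dep, 0.0) # drop trivial self-dependencies
```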

Model Architecture
In this subsection, we introduce the architecture of the proposed model as shown in Figure 2. It consists of three components: (1) an encoding layer, (2) a label-specific document representation layer and (3) a classification layer. The first layer aims to obtain the text representation and the metadata representation of an input document, and also the label representation based on the label heterogeneous graph. The second layer is designed to generate a label-specific document representation by fusing text representation, label semantic representation and metadata representation. Finally, the last layer employs the label-specific document representation and the label representation based on LHG to predict the most relevant labels.

Encoding Layer
The encoding layer consists of three parts. The first part is a text semantic encoder built on a multi-layer Transformer (Vaswani et al., 2017) to capture the text semantic information. The second part is a metadata heterogeneous graph neural network, in which the heterogeneous structure of the document's metadata is learned through the heterogeneous graph transformer (HGT) (Hu et al., 2020). The last part is a label heterogeneous graph neural network, through which the label representation is learned, also via HGT. The input to the text semantic encoder is the word sequence of a document, prepended with a [CLS] token as typically done in BERT (Devlin et al., 2018). That is, for a document d, the input to the text semantic encoder is h^(0)_d = [e_[CLS]; e_{w_1}; ...; e_{w_|W_d|}], where e_{w_1} ∈ R^k is the word embedding of the token w_1 and k is the dimension of the word representation space. One Transformer layer can be described as h^(l)_d = Norm(h' + FFN(h')) with h' = Norm(h^(l-1)_d + MHA(h^(l-1)_d)), where Norm(.) is the normalization layer, MHA(.) is the multi-head attention operator, and FFN(.) is the position-wise feed-forward network (Vaswani et al., 2017). We stack L Transformer layers to model text semantics, where the output of the l-th layer, h^(l)_d, is also the input of the (l+1)-th layer; we thus obtain the word representations h^(L)_d. To model the heterogeneous graph structures of the metadata and labels, the heterogeneous graph transformer (HGT) (Hu et al., 2020) is employed to build two heterogeneous graph neural networks. HGT consists of three components: heterogeneous mutual attention ATT(.), heterogeneous message passing MSG(.), and target-specific aggregation AGG(.). For a target node t, the layer-wise propagation rule of the HGT at layer l ∈ [1, L] is defined as h^(l)[t] = AGG(⊕_{s ∈ N(t)} ATT(s, e, t) · MSG(s, e, t)). Here, e is the edge between s and t, N(t) is the set of all neighbor nodes of t, and h^(l)[t] ∈ R^k is the representation of the target node t at the l-th layer.
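The ATT/MSG/AGG propagation rule can be illustrated with a heavily simplified, single-head sketch. This is not HGT itself (which uses separate per-node-type and per-edge-type projections for keys, queries, messages, and aggregation); here each edge type gets one hypothetical attention matrix and one message matrix, and AGG is a residual sum:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8  # hidden dimension

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One attention and one message projection per edge type (a simplification
# of HGT's per-type parameters); edge-type names are made up.
edge_types = ["writes", "cites"]
W_att = {r: rng.normal(size=(k, k)) / np.sqrt(k) for r in edge_types}
W_msg = {r: rng.normal(size=(k, k)) / np.sqrt(k) for r in edge_types}

def hgt_like_layer(h_t, neighbors):
    """One simplified propagation step for a target node t.

    h_t: (k,) target representation
    neighbors: list of (h_s, edge_type) pairs, each h_s of shape (k,)
    """
    scores = np.array([h_t @ W_att[r] @ h_s / np.sqrt(k) for h_s, r in neighbors])
    att = softmax(scores)                                       # ATT(s, e, t)
    msgs = np.stack([W_msg[r] @ h_s for h_s, r in neighbors])   # MSG(s, e, t)
    agg = att @ msgs                                            # weighted sum
    return h_t + agg                                            # AGG: residual update

h_doc = rng.normal(size=k)
nbrs = [(rng.normal(size=k), "writes"), (rng.normal(size=k), "cites")]
h_new = hgt_like_layer(h_doc, nbrs)
```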
In the metadata heterogeneous graph neural network, we set the document node as the target node and use the text representation h^(L)_{d,[CLS]} ∈ R^k as the embedding of the document node d. We obtain the metadata representation of document d by h^{meta}_d = HGT(G_{meta,d}), where G_{meta,d} is the sub-graph composed of the document node d and its neighbor nodes.
In the label heterogeneous graph neural network, we set the label nodes as target nodes and use the label embeddings C to initialize the embeddings of the label nodes. We obtain the label representation h_label ∈ R^{|L|×k} by h_label = HGT(G_label). Different from the metadata heterogeneous graph neural network, to obtain the representations of all labels, G_label is processed as a full graph.

Label-specific Document Representation Layer
The label-specific document representation layer aims to obtain the label-related representation based on the document text and metadata. To obtain the label-specific text representation h^{text}_d ∈ R^{|L|×k} using the attention mechanism, we construct the query matrix Q_label ∈ R^{|L|×k} from the label semantic embeddings C, and construct the key matrix K_d and the value matrix V_d from the representation of document d, h^(L)_d: h^{text}_d = softmax(Q_label K_d^T / √k) V_d. The label-specific document representation h^{doc}_d is then obtained by fusing h^{text}_d with the metadata representation h^{meta}_d.

Classification Layer

The classification layer aims to assign the most relevant labels ŷ_d to the document d. We take the dot product of the label representation h_label with the label-specific document representation h^{doc}_d, and then apply the sigmoid activation function to obtain the multi-label prediction: ŷ_d = sigmoid(h_label · h^{doc}_d). Finally, the binary cross-entropy loss is used for training.
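The label-specific attention and the classification step can be sketched numerically. This is an illustrative toy with random matrices standing in for the learned representations (C for the queries, token states for keys/values, and an LHG-derived label matrix), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
L, T, k = 4, 6, 8   # number of labels, tokens, hidden dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

C = rng.normal(size=(L, k))   # label semantic embeddings -> queries Q_label
H = rng.normal(size=(T, k))   # token representations h_d^(L) -> keys/values

# Label-specific text representation via attention over the tokens.
att = softmax(C @ H.T / np.sqrt(k), axis=1)   # (L, T), rows sum to 1
h_text = att @ H                              # (L, k)

# Classification: dot each label representation with the corresponding
# label-specific document row, then apply a sigmoid.
h_label = rng.normal(size=(L, k))             # stand-in for the LHG output
logits = (h_label * h_text).sum(axis=1)       # (L,)
y_hat = 1.0 / (1.0 + np.exp(-logits))

# Binary cross-entropy against a toy ground-truth label vector.
y = np.array([1, 0, 1, 0], dtype=float)
eps = 1e-12
loss = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)).mean()
```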

Experiments
In this section, we evaluate the proposed model on two large-scale datasets for extreme multi-label document classification (each with more than 15,000 labels).

Experiments Setting
Datasets Two large-scale datasets constructed by Zhang et al. (2021), MAG-CS and PubMed, are used. Documents are annotated with terms that are viewed as labels in the MLDC task. For both datasets, metadata information (including disambiguated authors, venues, and references) is collected from MAG. The text information of each document is its title and abstract. Table 2 shows the statistics of the two datasets.
Baselines The following extreme multi-label document classification methods and transformer-based models are chosen as baselines.
• XML-CNN (Liu et al., 2017) is an extreme multi-label document classification model built on convolutional neural networks.
• MeSHProbeNet (Xun et al., 2019) solves the multi-label document classification problem with recurrent neural networks and multiple MeSH "probes".
• AttentionXML (You et al., 2018) is an extreme multi-label document classification model built on an RNN with a label-aware attention layer.
• Transformer (Vaswani et al., 2017) is composed of multiple self-attention layers.
• Star-Transformer sparsifies the fully connected attention in the Transformer to a star-shaped structure.
• BERTXML (Xun et al., 2020) is a model inspired by BERT (Devlin et al., 2018).
• MATCH (Zhang et al., 2021) pre-trains the embeddings of text and metadata in the same space and regularizes the output probabilities with the label hierarchy.
Evaluation Metrics Two widely used metrics, precision at top k (P@k) and normalized Discounted Cumulative Gain at top k (nDCG@k), are used to evaluate model performance.
Here, y_d ∈ {0, 1}^{|L|} is the ground-truth label vector of the document d, and rank(i) is the index of the i-th highest predicted label. Concretely, P@k = (1/k) Σ_{i=1}^{k} y_{d,rank(i)}, DCG@k = Σ_{i=1}^{k} y_{d,rank(i)} / log2(i + 1), and nDCG@k = DCG@k / Σ_{i=1}^{min(k, ||y_d||_0)} 1 / log2(i + 1).
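These two metrics, as standardly defined for extreme multi-label classification with binary relevance, can be implemented as follows (a sketch with a toy score vector, not the evaluation script used in the paper):

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """P@k: fraction of the top-k predicted labels that are relevant."""
    top = np.argsort(scores)[::-1][:k]
    return y_true[top].sum() / k

def ndcg_at_k(y_true, scores, k):
    """nDCG@k with binary relevance, normalized by the ideal DCG."""
    top = np.argsort(scores)[::-1][:k]
    gains = y_true[top] / np.log2(np.arange(2, k + 2))
    n_rel = int(y_true.sum())
    ideal = 1.0 / np.log2(np.arange(2, min(k, n_rel) + 2))
    return gains.sum() / ideal.sum()

y = np.array([1, 0, 1, 0, 0], dtype=float)   # toy ground-truth labels
s = np.array([0.9, 0.8, 0.7, 0.2, 0.1])      # toy predicted scores
```

Note that for k = 1 the two metrics coincide whenever the document has at least one true label, matching the observation about P@1 and nDCG@1 in the results.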
Parameter Setting For all methods, the embedding dimension k is set to 100, and GloVe.6B.100d (Pennington et al., 2014) is used to initialize the word embeddings. For our method, we set the hidden vector dimension to 100.

Table 3 shows the performance comparison of the proposed approach with the baselines. Experiments are conducted three times, with means and standard deviations reported. According to Eq. 14, it is easy to show that P@1 = nDCG@1 if each document has at least one true label. It can be observed from Table 3 that: (1) the proposed model is significantly better than all compared models on both datasets; (2) the proposed model is also superior to its two ablated versions, Ours without LHG and Ours without MHG, on both datasets, which shows that the label heterogeneous graph and the metadata heterogeneous graph are both effective for the MLDC task; and (3) the label-aware models (such as AttentionXML, MATCH, and Ours) significantly outperform the other models on both datasets, which shows that label-awareness is essential and that modeling label information can further improve classification performance.

Results
It is worth noting that the performance improvement from incorporating the label heterogeneous graph is more significant on PubMed. Specifically, on MAG-CS, the proposed model has an average absolute improvement of 1.5% across the five metrics compared with Ours w/o LHG, while on PubMed the improvement is 2.9%. This might be attributed to the different label dependencies in the two datasets: with the label dependency threshold set to 0.3, the number of label dependency edges on MAG-CS is 67,620, while on PubMed it is 88,390.

Effect of Comprehensive Label Info
In both datasets, the label hierarchy is available and we construct the label statistical dependencies by calculating the conditional probability between the labels. To explore the effectiveness of each type of relationships, we conduct an ablation analysis. Three ablation versions of the proposed model are constructed: No-Hierarchy, No-Dependency, Neither. For No-Hierarchy, we remove the edges of the label hierarchical relationship from the label heterogeneous graph. We construct the model variants, No-Dependency and Neither, in a similar way by removing edges of their corresponding types.
The performance comparison of the proposed model with its three ablated versions is shown in Figure 4. It can be observed that: (1) the proposed model outperforms No-Hierarchy, No-Dependency, and Neither on all metrics, indicating that both types of label relationships play a vital role in the MLDC task.
(2) Among the three ablated versions, Neither has the worst performance, which shows that the label hierarchical relationship and the label dependency relationship are complementary. It can also be observed that the variant retaining the label hierarchy achieves a larger improvement. This is because the label hierarchy information comes from a rigorous label taxonomy, while the label dependency information comes from label probability statistics.

Conclusions
We propose a novel neural network approach for multi-label document classification, in which two heterogeneous graphs are constructed and learned using heterogeneous graph transformers. One is the metadata heterogeneous graph, which models various types of metadata and their topological relations. The other is the label heterogeneous graph, which is constructed based on the labels' hierarchy and their statistical dependencies. Experiments on two datasets show the superior performance of the proposed approach over existing approaches. In addition, the ablation results show the effectiveness of incorporating both the metadata heterogeneous graph and the label heterogeneous graph for multi-label document classification.