Effective Convolutional Attention Network for Multi-label Clinical Document Classification

Multi-label document classification (MLDC) problems can be challenging, especially for long documents with a large label set and a long-tail distribution over labels. In this paper, we present an effective convolutional attention network for the MLDC problem with a focus on medical code prediction from clinical documents. Our innovations are three-fold: (1) we utilize a deep convolution-based encoder with squeeze-and-excitation networks and residual networks to aggregate information across the document and learn meaningful document representations that cover different ranges of text; (2) we explore multi-layer and sum-pooling attention to extract the most informative features from these multi-scale representations; (3) we combine binary cross entropy loss and focal loss to improve performance for rare labels. We focus our evaluation study on MIMIC-III, a widely used dataset in the medical domain. Our models outperform prior work on medical coding and achieve new state-of-the-art results on multiple metrics. We also demonstrate the language-independent nature of our approach by applying it to two non-English datasets. Our model outperforms the prior best model and a multilingual Transformer model by a substantial margin.


Introduction
In multi-label document classification (MLDC), we have a set of labeled data $\{X, Y\}$, $X \in \mathbb{R}^{N \times D}$ and $Y \in \mathbb{R}^{N \times L}$, where $N$ is the number of documents, $D$ is the feature dimension size of each document and $L$ is the total number of labels. The $i$-th row of $Y$ is a multi-hot vector representing the set of labels associated with the $i$-th document. The task is to learn a mapping between $X$ and $Y$ so that the labels of each document are predicted correctly.
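As a small illustration of the label matrix described above, the following sketch builds a multi-hot matrix from per-document label sets. The function name `multi_hot` and the toy label indices are hypothetical and are used only for illustration.

```python
from typing import List, Sequence


def multi_hot(label_sets: Sequence[Sequence[int]], num_labels: int) -> List[List[int]]:
    """Build an N x L multi-hot matrix from per-document label indices."""
    matrix = []
    for labels in label_sets:
        row = [0] * num_labels
        for j in labels:
            row[j] = 1  # label j is present for this document
        matrix.append(row)
    return matrix


# Three toy documents over a label space of L = 5 (indices are made up).
Y = multi_hot([[0, 3], [2], [1, 3, 4]], num_labels=5)
```

Each row of `Y` corresponds to one document, and a 1 in column j means that label j is assigned to that document.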
MLDC has a great number of practical applications, one of which is automatic medical coding, where a patient encounter containing multiple records is assigned appropriate medical codes. A large number of medical encounters need to be coded for billing purposes every day. Professional clinical coders often use rule-based or simple ML-based systems to assign billing codes, but the large code space (viz. the ICD-10 code system with over 90,000 codes) and long documents are challenging for ML models. In addition, coding requires extracting useful information from specific locations across the entire encounter to support the assigned codes. Consequently, effective models with the capability of handling these challenges will have an immense impact in the medical domain by helping to reduce coding cost, improve coding accuracy and increase customer satisfaction.
Deep learning methods have been demonstrated to produce state-of-the-art outcomes on benchmark MLDC and medical coding tasks (Mullenbach et al., 2018), but demand remains for more effective and accurate solutions. In this paper, we propose EffectiveCAN, an effective convolution attention network for MLDC. Our models try to strike a careful balance of simplicity and effective over-parameterization such that we can effectively model long documents and capture nuanced aspects of the whole document texts. Such a model is particularly useful for addressing the challenges of automatic medical coding. We evaluate our models on the widely used MIMIC-III dataset (Johnson et al., 2016), and attain state-of-the-art results across multiple commonly used metrics. We also demonstrate the language-independent nature of our approach by coding on two non-English datasets. Our model outperforms the prior best model and a multilingual transformer model by a substantial margin.

Related Works
Previous deep learning methods for MLDC involve various neural network architectures to learn the semantic embeddings of the document texts. For example, XML-CNN proposed by Liu et al. (2017) employs a 1-dimensional convolutional network along with dynamic pooling to learn the text representation. RNN-based sequence-to-sequence models, such as SGM and SU4MLC (Lin et al., 2018), use an encoder to encode the information of the input text and a decoder to generate the predicted labels. AttentionXML leverages a BiLSTM and label-aware attention to capture the most relevant texts for each label. As a follow-up, MAGNET (Pal et al., 2020) incorporates a graph neural network to capture the attentive dependency structure among the labels. More recently, transformers such as the X-transformer have also been introduced. X-transformer tackles MLDC in three steps: label clustering, transformer classification and label ranking.
There has been a surge in neural network models employed for automatic medical coding in the past several years. In particular, recent works have utilized attention mechanisms to improve automatic coding performance. Shi et al. (2017) applied LSTMs to produce representations of the discharge summary and used attention to predict the top 50 codes. Mullenbach et al. (2018) proposed CAML, which applies separate attention for each label and thereby generates better label-specific representations for label prediction. They also used the label descriptions to regularize the model (called DR-CAML) in an attempt to improve the prediction of rare labels. To improve the classification performance, Xie et al. (2019) used multi-scale convolutional attention while Li and Yu (2020) employed multi-filter convolution to learn text patterns of different lengths. Furthermore, to incorporate the inner relationship of the labels, HyperCore (Cao et al., 2020) integrated a hyperbolic representation learning method and a graph convolutional network, and Lu et al. (2020) utilized multi-graph knowledge aggregation. Vu et al. (2020) proposed to combine a Bi-LSTM and an extension of the structured self-attention mechanism for ICD code prediction.

EffectiveCAN Model
In this section, we introduce our EffectiveCAN model (Figure 1), which is composed of four major components: an input layer that transforms the raw document texts into pretrained word embeddings, a deep convolution-based encoder that combines the information of adjacent words and learns meaningful representations of the document texts, an attention component that selects the most important text features and generates label-specific representations for each label, and an output layer that produces the final predictions.
The model structure is primarily designed for generating better predictions on multi-label classification tasks from three aspects: (1) generating meaningful representations for input texts; (2) selecting informative features from text representations for label prediction; (3) preventing overconfidence on frequent labels. Firstly, in order to obtain high-quality representations of the document texts, we incorporate the squeeze-and-excitation (SE) network and the residual network into the convolution-based encoder. The encoder consists of multiple encoding blocks to enlarge the receptive field and capture text patterns with different lengths. Secondly, instead of only using the last encoder layer output for attention, we extract all encoding layer outputs and apply the attention to select the most informative features for each label. Finally, to cope with the long-tail distribution of the labels, we use a combination of the binary cross entropy loss and focal loss to make the model perform well on both frequent and rare labels.

Input Layer
Our model takes a word sequence as the input and each word is mapped to a word embedding of size $d_e$. Assuming the document has $N_w$ words, the input is a word embedding matrix $X_e = [x_1, \dots, x_{N_w}] \in \mathbb{R}^{N_w \times d_e}$.

Convolutional Encoder
To transform the document into informative representations, the input word embeddings $X_e$ first go through a convolution-based encoder that consists of multiple residual squeeze-and-excitation convolutional blocks (Res-SE blocks). Each Res-SE block, as shown in Figure 2, is composed of two parallel modules that are referred to as the SE module and the residual module.
In recent years, transformer-based models with self-attention modules have been shown to be effective in text classification tasks (Devlin et al., 2018). However, for our applications we use a convolutional encoder instead of a self-attention one for two reasons: (1) ICD code predictions are often associated with a span of text in the input. Convolutional operations can effectively aggregate the information of text spans and output meaningful representations for downstream predictions; (2) Clinical documents are usually long (e.g., MIMIC-III documents have an average of 1,500 words). A convolutional encoder is more time and space efficient than a self-attention encoder for modeling long documents.

SE Module
The SE module contains a squeeze-and-excitation network (Hu et al., 2018) followed by layer normalization (Ba et al., 2016). The SE network can adaptively adjust the weighting of each feature map and refine the convolutional features. Here we use the SE network to enhance the learning of document representations for the downstream prediction task. The structure of the SE network in our model is shown in Figure 3. In this network, we first apply a standard 1-dimensional convolutional layer on the input to aggregate the information of adjacent word embeddings. Suppose the convolutional filter applied on the input matrix $X$ is $W_c \in \mathbb{R}^{k \times d_e \times d_{conv}}$, where $k$ is the filter size, $d_e$ is the in-channel size (the size of the input embedding) and $d_{conv}$ is the out-channel size (the size of the output embedding). The 1-dimensional convolution is computed as
$$c_i = W_c * x_{i:i+k-1} + b_c,$$
where $*$ is the convolution operator and $b_c$ the bias. The output convolutional features can be represented as $C = [c_1, \dots, c_{N_w}]$ with $C \in \mathbb{R}^{N_w \times d_{conv}}$. The SE network then uses a two-stage process, 'squeeze' and 'excitation', to compute the channel-dependent coefficients to enhance the convolutional features. In the 'squeeze' stage, each channel is compressed into a single numeric value via global average pooling: $z_c = \mathrm{GAP}(C)$. Here $z_c \in \mathbb{R}^{d_{conv}}$ can be treated as a channel descriptor that aggregates the global spatial information of $C$. In the 'excitation' stage, the channel descriptor goes through a dimensionality-reduction layer with reduction ratio $r$ followed by a dimensionality-increasing layer back to the channel dimension of $C$. The reduction ratio $r$ is a tunable parameter and we use $r = 20$ in our model. The excitation step can be written as
$$s_c = \sigma\big(W_{fc2}\,\delta(W_{fc1} z_c + b_{fc1}) + b_{fc2}\big),$$
where $\sigma$ and $\delta$ are the sigmoid and ReLU activations of the SE network (Hu et al., 2018), and $W_{fc1} \in \mathbb{R}^{\frac{d_{conv}}{r} \times d_{conv}}$, $b_{fc1} \in \mathbb{R}^{\frac{d_{conv}}{r}}$, $W_{fc2} \in \mathbb{R}^{d_{conv} \times \frac{d_{conv}}{r}}$ and $b_{fc2} \in \mathbb{R}^{d_{conv}}$ are the weights and biases of the fully-connected linear layers. Next, we rescale the convolutional feature $C$ with $s_c$ by $\tilde{X} = \mathrm{scale}(C, s_c)$, where scale denotes the channel-wise multiplication between $C$ and $s_c$.
Finally, the rescaled features $\tilde{X} = \mathrm{scale}(C, s_c)$ are normalized to give the output of the SE module, $\bar{X} = \mathrm{LayerNorm}(\tilde{X})$. In particular, we employ layer normalization (Ba et al., 2016), which is widely used for stabilizing the hidden layer distribution and smoothing the gradients in NLP tasks (Devlin et al., 2018; Hou et al., 2019).
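The squeeze, excitation and rescaling steps of the SE module can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes the convolutional features C are already computed, assumes ReLU and sigmoid as the two activations (as in the original SE network), omits layer normalization, and uses made-up toy weights with a reduction ratio of 2.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def linear(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Fully connected layer: W v + b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def squeeze(C: Matrix) -> Vector:
    """Global average pooling over word positions: one value per channel."""
    n_words = len(C)
    return [sum(row[ch] for row in C) / n_words for ch in range(len(C[0]))]


def excite(z: Vector, W1: Matrix, b1: Vector, W2: Matrix, b2: Vector) -> Vector:
    """Reduce to d_conv / r units (ReLU), expand back to d_conv (sigmoid)."""
    hidden = [max(0.0, h) for h in linear(W1, b1, z)]
    return [1.0 / (1.0 + math.exp(-g)) for g in linear(W2, b2, hidden)]


def scale(C: Matrix, s: Vector) -> Matrix:
    """Channel-wise multiplication of the features by the coefficients."""
    return [[x * s_ch for x, s_ch in zip(row, s)] for row in C]


# Toy features: 2 word positions, d_conv = 4 channels, reduction ratio r = 2.
C = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
W1 = [[0.1] * 4, [-0.1] * 4]          # (d_conv / r) x d_conv
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 0.0]]  # d_conv x (d_conv / r)
b2 = [0.0] * 4

z = squeeze(C)                  # channel descriptor
s = excite(z, W1, b1, W2, b2)   # channel coefficients in (0, 1)
X_tilde = scale(C, s)           # rescaled features
```

In this toy setting the first channel is amplified relative to the third, while channels whose excitation input is zero receive a neutral coefficient of 0.5.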

Residual Module
In addition to the SE module, we also transform the input $X$ in parallel and add it to the SE module output as in the residual network (He et al., 2016), which mitigates the vanishing gradient issue in the deep encoder structure. To avoid a dimension mismatch, we transform the input $X$ into $X'$ by using a filter-size-1 convolutional layer. Then we add $X'$ to the SE module output $\bar{X}$.
Finally, we apply the GELU activation function to generate the output of the Res-SE block: $H = \mathrm{gelu}(\bar{X} + X')$.
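The way the two parallel modules are combined can be sketched as below. This is an illustrative sketch under stated assumptions: the SE module output is taken as given, the filter-size-1 convolution is written as a per-position linear projection, and the exact (erf-based) form of GELU is assumed. The names `pointwise_conv` and `res_se_output` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def pointwise_conv(X: Matrix, W: Matrix, b: List[float]) -> Matrix:
    """Filter-size-1 convolution: the same linear map applied at every position."""
    return [
        [sum(w * x for w, x in zip(w_row, x_row)) + bias for w_row, bias in zip(W, b)]
        for x_row in X
    ]


def res_se_output(se_out: Matrix, X: Matrix, W_res: Matrix, b_res: List[float]) -> Matrix:
    """Add the projected input to the SE module output, then apply GELU."""
    residual = pointwise_conv(X, W_res, b_res)
    return [
        [gelu(a + r) for a, r in zip(se_row, res_row)]
        for se_row, res_row in zip(se_out, residual)
    ]


# Toy shapes: 2 word positions, d_e = 2 input channels, d_conv = 3 output channels.
X = [[1.0, 0.0], [0.0, 1.0]]
W_res = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d_conv x d_e
b_res = [0.0, 0.0, 0.0]
se_out = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]  # stand-in for the SE module output

H = res_se_output(se_out, X, W_res, b_res)
```

The projection is what lets the residual path match the out-channel size of the SE path when the two differ.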

Attention
We use the label-wise attention (Mullenbach et al., 2018) to generate label-specific representations from $H$. Since our convolutional encoder contains multiple Res-SE blocks that generate multi-scale representations of the document texts, we perform multi-layer attention, which attends to the outputs of all Res-SE blocks. In this way, each label is allowed to select the most relevant features from a rich feature space extracted by the encoder. Let $U \in \mathbb{R}^{N_l \times d_l}$ represent the label embedding matrix, where $N_l$ is the number of labels and $d_l$ the embedding size of each label. To attend to the $i$-th Res-SE layer output $H^i \in \mathbb{R}^{N_w \times d^i_{conv}}$, $U$ is first mapped to $U'^i \in \mathbb{R}^{N_l \times d^i_{conv}}$ via a filter-size-1 convolutional layer to avoid dimension mismatch. The attention weights are then computed by
$$A^i = \mathrm{softmax}\big(U'^i (H^i)^\top\big).$$
Here, the $j$-th row of $A^i \in \mathbb{R}^{N_l \times N_w}$ is a weight vector measuring how informative the text representations in $H^i$ are for the $j$-th label. Next, we generate the label-specific representations:
$$V^i = A^i H^i,$$
where the $j$-th row of $V^i \in \mathbb{R}^{N_l \times d^i_{conv}}$ is the label-specific representation for the $j$-th label, generated from the $i$-th Res-SE layer output. We repeat the attention process for all Res-SE layer outputs, then concatenate the label-specific representations:
$$V = \mathrm{concat}\big(V^1, \dots, V^{N_{Res\text{-}SE}}\big),$$
where $N_{Res\text{-}SE}$ is the number of Res-SE blocks. The resulting $V \in \mathbb{R}^{N_l \times \sum_i d^i_{conv}}$ will be used for the final prediction.
When the application domain has a large label space but insufficient data points, a multi-layer attention model can be difficult to train, especially for deep networks. Therefore we also experiment with sum-pooling attention, where we first transform each convolutional layer to have the same dimension as the last layer, then sum all the layers and apply attention to the summed output. The resulting $V \in \mathbb{R}^{N_l \times d^{last\text{-}layer}_{conv}}$ is used for the final prediction.
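The multi-layer label-wise attention can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the label embeddings have already been projected to each layer's channel size, normalizes the attention scores over word positions with a softmax, and uses toy matrices. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one label's scores across positions."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(U: Matrix, H: Matrix) -> Tuple[Matrix, Matrix]:
    """Return attention weights A (labels x words) and representations V = A H."""
    A = [
        softmax([sum(u * h for u, h in zip(u_row, h_row)) for h_row in H])
        for u_row in U
    ]
    V = [
        [sum(a * h_row[c] for a, h_row in zip(a_row, H)) for c in range(len(H[0]))]
        for a_row in A
    ]
    return A, V


def multi_layer_attention(label_embs: List[Matrix], layer_outputs: List[Matrix]) -> Matrix:
    """Attend to every layer output and concatenate per-label representations."""
    per_layer = [label_attention(U, H)[1] for U, H in zip(label_embs, layer_outputs)]
    n_labels = len(per_layer[0])
    return [[x for V_i in per_layer for x in V_i[j]] for j in range(n_labels)]


# Toy setting: 3 word positions, 2 labels, two layers with 2 and 3 channels.
H1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H2 = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
U1 = [[0.0, 0.0], [5.0, -5.0]]            # label embeddings projected for layer 1
U2 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # label embeddings projected for layer 2

A1, V1 = label_attention(U1, H1)
V = multi_layer_attention([U1, U2], [H1, H2])
```

Here the first label attends uniformly (its embedding is zero), while the second label concentrates almost all of its layer-1 attention on the first word. The concatenated representation has 2 + 3 = 5 features per label.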

Output Layer
After obtaining the label-specific representations, we compute the probability for each label by using a fully connected layer followed by a sum-pooling operation and a sigmoid transformation: $p = \sigma\big(\mathrm{SumPool}(\mathrm{FC}(V))\big)$. Here, the $j$-th value in $p$ is the predicted probability for the $j$-th label to be present given the document texts.
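One plausible reading of this output layer is sketched below: each label has its own weight vector, the element-wise products with that label's representation are summed, and a sigmoid is applied. The per-label weights `W`, biases `b` and the function name `predict` are assumptions made for illustration; the paper does not spell out the exact parameterization.

```python
import math
from typing import List

Matrix = List[List[float]]


def predict(V: Matrix, W: Matrix, b: List[float]) -> List[float]:
    """Per-label weighting, sum-pooling over features, then sigmoid."""
    probs = []
    for v_row, w_row, bias in zip(V, W, b):
        logit = sum(w * v for w, v in zip(w_row, v_row)) + bias  # sum-pooling
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Toy example: 2 labels, 2 features per label-specific representation.
V = [[1.0, 2.0], [0.5, -0.5]]
W = [[1.0, 1.0], [2.0, 2.0]]
b = [0.0, 0.0]

p = predict(V, W, b)
predicted = [j for j, prob in enumerate(p) if prob > 0.5]  # 0.5 threshold is illustrative
```

With these toy values only the first label exceeds the threshold, since the second label's logit is exactly zero.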

Loss Function
Binary cross entropy loss is widely used as the loss function for training MLDC models. Suppose $y$ is the ground truth label and $p$ is the predicted probability, then the binary cross entropy loss is
$$L_{BCE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise.} \end{cases}$$
To tackle the long-tail distribution of the labels, we also apply the focal loss (Lin et al., 2017), which adds a weight term to the ordinary binary cross entropy loss to dynamically down-weight the loss assigned to well-classified labels. The focal loss is
$$L_{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise.} \end{cases}$$
Here $\gamma$ is a tunable parameter to adjust the strength of down-weighting. The weight term $(1 - p_t)^{\gamma}$ suppresses the loss from well-classified labels (where $p_t$ is high) and biases the model towards labels that get wrong predictions. In practice, using the focal loss from the beginning of training is not ideal, because it tends to correct the misclassified rare labels while sacrificing the performance on the frequent labels. Instead, we first train our model with the ordinary binary cross entropy loss to allow the model to learn general features and perform well on frequent labels. Once the model performance saturates, we switch to the focal loss and further fine-tune the model to improve the predictions on rare labels.
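The two losses and the two-stage schedule can be sketched as follows. This is a minimal per-label illustration: the fixed `switch_epoch` is a hypothetical stand-in for the point at which performance saturates, and the probabilities are toy values.

```python
import math


def bce_loss(p: float, y: int) -> float:
    """Binary cross entropy for a single label."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(p: float, y: int, gamma: float) -> float:
    """Focal loss: BCE down-weighted by (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def training_loss(p: float, y: int, epoch: int, switch_epoch: int, gamma: float) -> float:
    """BCE before the (hypothetical) switch epoch, focal loss afterwards."""
    if epoch < switch_epoch:
        return bce_loss(p, y)
    return focal_loss(p, y, gamma)


# A well-classified positive label versus a badly misclassified one.
easy_ratio = focal_loss(0.95, 1, gamma=2.0) / bce_loss(0.95, 1)
hard_ratio = focal_loss(0.10, 1, gamma=2.0) / bce_loss(0.10, 1)
```

With gamma = 2 the well-classified label keeps roughly 0.25% of its BCE loss, while the misclassified label keeps about 81%, which is the down-weighting effect described above. Setting gamma = 0 recovers the ordinary BCE loss.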

Datasets
We evaluated our model on the widely used medical benchmark dataset MIMIC-III, as well as two medical datasets in Dutch and French respectively. The statistics of the datasets are listed in Table 1.

MIMIC-III
The Medical Information Mart for Intensive Care III (MIMIC-III) is an open-access dataset comprised of hospital records associated with over 40,000 patients. We focus on using the discharge summaries to predict their tagged International Classification of Diseases 9 (ICD-9) codes. We formulate this task as an MLDC problem following prior work (Shi et al., 2017; Mullenbach et al., 2018). In total, there are 52,722 discharge summaries and 8,922 unique ICD-9 codes. We follow the experiment settings of Mullenbach et al. (2018). We focus on the experiment that predicts the full 8,922 ICD-9 codes (denoted as MIMIC-III-full) but also present the results on the top-50 ICD-9 codes (denoted as MIMIC-III-50). The data statistics of the two experiments are listed in Table 1.

Dutch and French Datasets
Many European hospitals are aware of the advantages of automatic coding solutions that improve the accuracy and efficiency of medical coding. To evaluate how well our model adapts to coding on non-English medical documents, we use two real-world datasets, one in Dutch and the other in French. These datasets contain human-assigned ICD-10 codes for each encounter. In these datasets, the Discharge Summary is not differentiated from other documents, so we concatenate all documents in the encounter. The French data has similar encounter length to MIMIC-III, but the Dutch encounters are much longer, with an average of 30 documents or close to 5,000 tokens per encounter. The ICD-10 code system is widely used in European countries, but no benchmark dataset is available for comparing coding methods, likely due to existing patient data protection regulations in the EU. For U.S. English data, the restrictions are somewhat less strict, which is how MIMIC-III was able to be produced, though still at immense de-identification cost. Although the de-identification and release of the French/Dutch data was not possible, we believe our experiments and findings still benefit the research community because they (1) demonstrate that our model can generalize to other languages and (2) are the first medical coding results reported for French or Dutch.

Preprocessing and Hyperparameters
We follow the preprocessing schema of Mullenbach et al. (2018) except that we keep numerical values from one to ten as they are relevant for coding. We utilize the word2vec CBOW method to pretrain the word embeddings of size $d_e = 100$ and $d_e = 200$ on the preprocessed texts for MIMIC-III and the non-English sets respectively. All MIMIC documents are truncated to a maximum sequence length $w_{max} = 3{,}500$, whereas both 2,500 and 3,500 sequence lengths were used for the non-English sets.
We found optimal hyperparameter settings using the Ray Tune library (Liaw et al., 2018). We optimized values for the out-channel size $d_{conv}$ and filter size $k$ of the convolutional layer in each SE module, the dropout probability $q$ after the input embedding layer, as well as the power term $\gamma$ in the focal loss function. To reduce the search space, we set $d^1_{conv} = d^2_{conv}$, $d^3_{conv} = d^4_{conv}$ and $k^1 = k^2$, $k^3 = k^4$. Table 2 summarizes their optimal values for different experiments. We use four Res-SE blocks across all experiments, and adopt the Adam optimizer with an initial learning rate of 0.00015.
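To make the "different ranges of text" covered by the stacked blocks concrete, the following sketch computes the receptive field of each block for the kernel sizes reported in Table 2 (13, 13, 9, 9 for MIMIC-III-full and 11, 11, 9, 9 for the other experiments). It assumes stride-1, undilated convolutions and ignores the global context that the squeeze step injects through channel rescaling; both are simplifying assumptions, and the function name is hypothetical.

```python
from typing import List


def receptive_fields(kernel_sizes: List[int]) -> List[int]:
    """Receptive field (in words) after each stacked stride-1 convolution."""
    fields = []
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the window by k - 1 positions
        fields.append(rf)
    return fields


mimic_full = receptive_fields([13, 13, 9, 9])
other_tasks = receptive_fields([11, 11, 9, 9])
```

Under these assumptions the four blocks of the MIMIC-III-full model see windows of 13, 25, 33 and 41 words, so attending to all block outputs exposes each label to several span lengths at once.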

Evaluation Metrics
The goal of computer assisted coding is to have as little human intervention as possible. This means that a model trained for coding should aim to predict the correct codes from the full set rather than only the top N codes or a ranked list of possible codes. The performance of a model on the top 50 codes is often reported in research papers. However, in real-world settings, top-50 metrics are insufficient for making an accurate assessment of automatic coding because expensive human resources are still needed for the large number of remaining codes. In MIMIC-III, the top 50 codes cover only a third of the codes per encounter, and in reality a small number of top codes can usually be handled by rule-based systems with great accuracy.

Table 2: The parameter values used in different tasks. $d^i_{conv}$, $k^i$: the out-channel size and the kernel size of the SE convolutional layer in the $i$-th Res-SE block, $q$: the dropout probability after the input embedding layer, $\gamma$: the power term in the focal loss.
Task | $d^i_{conv}$ | $k^i$ | $q$ | $\gamma$
MIMIC-III-full | 200, 200, 240, 240 | 13, 13, 9, 9 | 0.3 | 1
MIMIC-III-50 | 180, 180, 200, 200 | 11, 11, 9, 9 | 0.3 | 0.5
Dutch and French | 180, 180, 200, 200 | 11, 11, 9, 9 | 0.3 | 0.5
Ranking based metrics like P@K, R@K, RP@K (Chalkidis et al., 2020), where K is often the average number of labels per document, are rarely used in coding because there is high variability in the number of codes per encounter. In MIMIC-III, the number of codes per encounter varies from one to 79, and 43% of the encounters have more than the average 15 codes. Asking a human coder to always review K codes for every encounter would cause a huge productivity drop because she will still have to review K codes when there is only one code. On the other hand, reducing the number of gold codes to K (Chalkidis et al., 2019) will result in inaccurate measures (especially for Recall) for a large percentage of encounters with more than K codes and artificially inflate system performance.
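The effect of a fixed K on encounters with very different numbers of codes can be illustrated with a short sketch. The code names and rankings below are made up, and the functions follow the usual definitions of precision and recall at K.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of the top-k predictions that are gold codes."""
    hits = sum(1 for code in ranked[:k] if code in gold)
    return hits / k


def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold codes that appear in the top-k predictions."""
    hits = sum(1 for code in ranked[:k] if code in gold)
    return hits / len(gold)


ranking = ["c1", "c2", "c3", "c4", "c5"]  # hypothetical ranked predictions

# Encounter with a single gold code: a perfect ranking still has low P@3.
p_single = precision_at_k(ranking, {"c1"}, k=3)
r_single = recall_at_k(ranking, {"c1"}, k=3)

# Encounter with five gold codes: a perfect ranking is capped in R@3.
p_many = precision_at_k(ranking, {"c1", "c2", "c3", "c4", "c5"}, k=3)
r_many = recall_at_k(ranking, {"c1", "c2", "c3", "c4", "c5"}, k=3)
```

Even with a perfect ranking, P@3 cannot exceed 1/3 for the single-code encounter and R@3 cannot exceed 0.6 for the five-code encounter, which is the distortion discussed above.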
Although macro metrics are useful for assessing performance on rare codes, they are less important in determining overall coding performance. For these reasons, micro precision, recall and F1 over all codes best reflect improvements in coding productivity because they directly measure the accuracy and coverage of the code assignment by models. However, prior work did not report precision and recall on the MIMIC data. For comparison purposes, we report F1 and other previously used metrics on both MIMIC-III and the non-English datasets, but the emphasis should be on Micro F1.
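The difference between micro and macro F1 can be seen in a small worked example. The sketch below pools counts over all codes for micro F1 and averages per-code F1 for macro F1, restricted to codes that occur in the gold or predicted sets. Note that this macro definition is an assumption: some prior work instead takes the harmonic mean of macro precision and macro recall. The code names are made up.

```python
from typing import Dict, List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) over all codes seen in gold or pred."""
    counts: Dict[str, List[int]] = {}  # code -> [tp, fp, fn]
    for g, p in zip(gold, pred):
        for code in g | p:
            c = counts.setdefault(code, [0, 0, 0])
            if code in g and code in p:
                c[0] += 1
            elif code in p:
                c[1] += 1
            else:
                c[2] += 1
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# A frequent code "A" is always predicted; rare codes "B" and "C" are missed.
gold = [{"A", "B"}, {"A"}, {"A", "C"}]
pred = [{"A"}, {"A"}, {"A"}]
micro, macro = micro_macro_f1(gold, pred)
```

Micro F1 is 0.75 because three of the five gold assignments are recovered with no false positives, whereas macro F1 drops to about 0.33 because two of the three codes are never predicted.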

Results
To evaluate the effectiveness of our methods, we compare our model with the existing state-of-the-art. The results shown below are generated from the average of five runs with different random seeds for parameter initialization. We also investigate the interpretability of the model. Table 3 shows the results on the MIMIC-III dataset using the full ICD-9 codes. Our model achieved the strongest results across multiple metrics compared to the other systems. In particular, our model improves the state-of-the-art Micro F1 score as well as ranking-based precision scores. Table 3 also shows that the systems achieved very similar results on Micro AUC for all codes even when they differ significantly in other metrics. This suggests that Micro AUC is not sensitive enough to distinguish different systems and is therefore not a good metric for comparing coding models. Table 4 shows the results for the top-50-code prediction. Our model produced results competitive with other top models.

Results on MIMIC-III
An interesting observation is that multi-layer attention yields better results on MIMIC-III-50 but sum-pooling attention performs better on MIMIC-III-full. One possible explanation is that when there are sufficient training data for the labels, multi-layer attention with more parameters is able to learn better representations for each label. In contrast, when the data are insufficient given the label size, aggregating information over layers before attention, as in sum-pooling attention, yields better results.

Results on Dutch and French
On the Dutch and French datasets, we establish two baselines. The first is MultiResCNN (Li and Yu, 2020), which is the best performing model on MIMIC-III that is publicly available. The second is XLM-RoBERTa (Conneau et al., 2019), a multilingual transformer model. XLM-RoBERTa and related models achieve excellent performance on well-known benchmarks such as GLUE; however, they are not well established on the task of long-document, multi-label classification.

Table 3: Results on MIMIC-III-full.
Model | Macro AUC | Micro AUC | Macro F1 | Micro F1 | P@8 | P@15
CAML (Mullenbach et al., 2018) | 0.895 | 0.986 | 0.088 | 0.539 | 0.709 | 0.561
DR-CAML (Mullenbach et al., 2018) | 0.897 | 0.985 | 0.086 | 0.529 | 0.690 | 0.548
MSATT-KG (Xie et al., 2019) | 0.910 | 0.992 | 0.090 | 0.553 | 0.728 | 0.581
MultiResCNN (Li and Yu, 2020) | 0.910 | 0.986 | 0.085 | 0.552 | 0.734 | 0.584
HyperCore (Cao et al., 2020) | 0.930 | 0.989 | 0.090 | 0.551 | 0.722 | 0.579
LAAT (Vu et al., 2020) | 0.919 | 0.988 | 0.099 | 0.575 | 0.738 | 0.591
JointLAAT (Vu et al., 2020) | 0 [...]

[...] French. Recall is particularly low, likely caused by the model only seeing the first 512 subwords of a long encounter with thousands of tokens. Our model with multi-layer attention substantially outperforms the other two systems. It strikes a good balance between precision and recall, and is able to handle the full code sets without difficulties. Unlike the observation of Li and Yu (2020), where the maximum length did not make an obvious difference to the performance on MIMIC-III, we found that training on longer sequences on Dutch and French gives an extra boost to all metrics. This is especially true for the Dutch data, which contain longer encounter texts. The results show that EffectiveCAN can be easily retrained for non-English documents to very good effect.

Analysis of Focal Loss
In this section, we describe our experiments on the MIMIC-III-full dataset for a better understanding of the focal loss.
To investigate how the timing of the loss function switch impacts model performance, we trained models with focal loss activated at different training epochs; the results are given in Table 6. It shows that switching the loss function at a later stage yields more pronounced improvement in Macro F1. We obtained the best results by training with BCE loss first and saving the best model as measured by the micro-F1 on the dev set. Then we continued training using the focal loss until it converged.
To better understand which tail labels the focal loss helps improve, we analyzed model performance based on label frequency in the test set. Table 7 shows that the focal loss improves the prediction of both frequent and rare labels, but the improvement is more pronounced for the less frequent labels.

Discussion
In this section we analyze the differences between the models. Compared to CAML, MultiResCNN yields better results by enhancing the encoder using the multi-filter residual convolutional network, and HyperCore improves the macro-metrics by incorporating the correlations within the labels. Although both MSATT-KG and EffectiveCAN use multi-layer attention, we differ in the ways of aggregating the attention results. Our model uses all the attended values for the final label prediction whereas MSATT-KG performs extra max-pooling operations before the prediction. The max-pooling operations, in our opinion, are unnecessary and risk losing information. Our model produces notably better results than MSATT-KG on the full code set.
JointLAAT differs from EffectiveCAN in the encoder layer, where it uses the BiLSTM to capture contextual information, whereas we choose to use the convolution-based model for computational and memory efficiency. To deal with rare labels, prior works often add a separate component such as a graph neural network or a hierarchical joint learning module, which inevitably increases the complexity and size of the model. Instead, we employ the focal loss, which can be easily modified from the binary cross entropy loss, to improve the rare-label prediction without sacrificing the overall performance. By refining the entire model structure including the convolutional encoder, attention coverage and training objective, we build a model that is simple and easy to scale, yet very effective for the medical coding problem. The model achieved the best micro F1 results on the MIMIC-III dataset, even when compared with more complex models. It is capable of not only generating accurate top codes but also covering a large number of codes including rare codes, which is important for real-world applications in the medical domain.
Recent results (You et al., 2019b; Chalkidis et al., 2020) show that RNN-based and BERT-based models performed well on the topic categorization tasks of EUR-LEX, AMAZON, WIKIPEDIA and RCV1. However, it is also clear that the best models on these tasks are typically not the same as the best performing models on MIMIC-III, which is fundamentally not a topic categorization task. Rather, medical coding requires fine-grained analysis of very narrow aspects of the document in order to identify appropriate codes. For an additional point of comparison, we evaluated EffectiveCAN on two topic categorization tasks (EUR-Lex and Wiki10-31K) and found it outperforms several strong baselines and is only lower than X-Transformer, a large pre-trained transformer model, by a small margin on most metrics. Detailed results are reported in Appendix A.

Model Interpretability
It is a requirement of medical coding that an automatic coding system is able to extract text evidence to support the generated billing codes. With the attention mechanism, we can extract the text snippets that support the predicted codes. More specifically, by conducting the multi-layer attention on the four Res-SE layer outputs, we obtain four attention weight matrices $A^{i}$, $i \in \{1, 2, 3, 4\}$, with each $A^i \in \mathbb{R}^{N_l \times N_w}$. For the $j$-th label, the associated attention weights are the $j$-th row of each matrix, that is $A^i_{j\cdot} \in \mathbb{R}^{N_w}$. Next, to get the most influential text span for the $j$-th label, we first get the text position $k^*$ which is the argmax of all attention weights: $k^* = \arg\max_{k} \max_{i \in \{1,2,3,4\}} A^i_{jk}$. We then select the most informative n-gram features surrounding the text position $k^*$. Table 8 gives some examples of the extracted text snippets for the predicted ICD-9 codes in the MIMIC-III-full experiments. Our model is able to extract the n-gram features that are similar to the code descriptions, e.g., the extracted snippet "Systolic congestive heart failure" for 428.20. More importantly, our model is capable of selecting phrases with different syntactic forms but similar semantics as the code descriptions, e.g., the extracted snippet "percutaneous tracheostomy tube placement" for 934.1. It indicates that the model can learn inter-
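The evidence-extraction step can be sketched as follows: take one label's attention weights from every layer, find the position with the largest weight, and return an n-gram window around it. How the window is placed around the position is an assumption made here (roughly centered and clipped to the document bounds), and the tokens and weights are toy values.

```python
from typing import List


def best_position(weights_per_layer: List[List[float]]) -> int:
    """Position of the largest attention weight across all layers."""
    best_k, best_w = 0, float("-inf")
    for layer_weights in weights_per_layer:
        for k, w in enumerate(layer_weights):
            if w > best_w:
                best_k, best_w = k, w
    return best_k


def evidence_ngram(tokens: List[str], weights_per_layer: List[List[float]], n: int) -> List[str]:
    """Return an n-gram window placed around the most attended position."""
    k_star = best_position(weights_per_layer)
    start = max(0, min(k_star - n // 2, len(tokens) - n))  # clip to bounds
    return tokens[start:start + n]


tokens = ["patient", "has", "systolic", "congestive", "heart", "failure", "today"]
# Attention weights of one label over the tokens, from two layers (toy values).
layer_1 = [0.05, 0.05, 0.10, 0.20, 0.40, 0.15, 0.05]
layer_2 = [0.10, 0.10, 0.10, 0.30, 0.20, 0.10, 0.10]

snippet = evidence_ngram(tokens, [layer_1, layer_2], n=4)
```

In this toy case the largest weight falls on "heart", and the 4-gram window around it recovers the phrase "systolic congestive heart failure".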

Ablation Study
We conducted ablation studies to verify the effectiveness of each module in our model. We compare the results on MIMIC-III-full between the ordinary model and the one with a component removed.
The results for the macro- and micro-F1 scores are listed in Table 9. For the multi-layer attention model, removing the residual module causes a notable reduction in both the macro- and micro-F1 scores, indicating the importance of the residual module in the deep convolutional encoder of our model. Meanwhile, the model without the SE module also reports a lower macro-F1 and micro-F1, which implies that the SE module enables the model to produce better representations for the predictions.
Only attending to the first or last Res-SE layer output leads to worse results. This confirms our argument that multi-layer attention can capture information from the input at different levels, which further facilitates better predictions. It is also possible to completely remove the attention module, but since Mullenbach et al. (2018) have shown that label-wise attention improves F1, this experiment was not deemed informative.
Compared to the original model, the one without using the focal loss produces a slightly lower result in the micro-F1 but a large reduction in the macro-F1. This verifies the effectiveness of the focal loss in tackling the long-tail distribution of the labels.
For the sum-pooling attention model, removing the SE module results in the largest performance drop. We have yet to find an explanation for this difference in the two attention models.

Conclusions
In this paper, we proposed an effective convolutional attention network for MLDC, and showed its effectiveness for medical coding on long documents. Our model features a deep and more refined convolutional encoder, consisting of multiple Res-SE blocks, to capture the multi-scale patterns of the document texts. Furthermore, we use multi-layer attention to adaptively select the most relevant features for each label. We employ the focal loss to improve the rare-label prediction without sacrificing the overall performance. Our model obtains state-of-the-art results across several metrics on MIMIC-III, and compares favorably with other systems on two non-English datasets.

A.1 Additional Experiments
We also evaluated our model on two large-scale benchmark datasets, EUR-Lex and Wiki10-31K, to show the effectiveness of our model across domains. We use $w_{max} = 2000$ and $w_{max} = 3000$ for EUR-Lex and Wiki10-31K respectively, and the hyperparameters of the models are given in Table 11.

A.1.1 Datasets
EUR-Lex consists of a collection of documents of European Union laws. It contains 19,314 documents in total with 3,956 categories regarding different aspects of European law. We follow the setting of prior work to split the train and test sets, obtaining 15,449 and 3,865 training and testing documents. From the training set, we then take 1,545 documents out for validation, resulting in 13,904 training documents.
Wiki10-31K is a collection of social tags for Wikipedia pages. It is composed of 20,762 documents and 30,938 associated tags. We again use the setting of prior work to get 14,146 and 6,616 training and testing documents. We then use 1,415 documents for validation, resulting in 12,731 training documents.

A.1.2 Results
The results on the EUR-Lex dataset are listed in Table 13. The results from our model are higher than some strong baselines including AnnexML (Tagami, 2017), DiSMEC (Babbar and Schölkopf, 2017), Parabel, and AttentionXML, and are only lower than X-transformer by a tiny gap, e.g. 0.08% lower on precision@1. We also observe that traditional ML models, such as AnnexML, DiSMEC and Parabel, generally produce worse results than deep learning models such as AttentionXML. By employing large-scale pretrained transformer-based models, X-transformer reports the state-of-the-art results. Table 13 also shows that our model produces very competitive results on the Wiki10-31K dataset. Our model outperforms most baselines except for X-transformer. The losing margins are quite small: 0.66% on precision@1, 0.08% on precision@3, and 0.43% on precision@5.
Compared to the large-scale transformer-based models, our model is more effective in terms of balancing the model performance and model size. Table 12 lists the comparison of the model size between our model, AttentionXML, and the transformer-based models used in X-transformer.
We can see that our model is much smaller than BERT-large, XLNet-large and RoBERTa-large used in X-transformer. Note that there are other components in X-transformer that we do not take into account. With a significantly smaller model size, our model achieved less than 1% drop on the EUR-Lex and Wiki10-31K datasets compared to X-transformer.
In addition, our model can handle much longer sequences than transformer models (maximum 512 tokens). This is especially important when the information for predicting labels is spread over the long document.