Meta-LMTC: Meta-Learning for Large-Scale Multi-Label Text Classification

Large-scale multi-label text classification (LMTC) tasks often face long-tailed label distributions, where many labels have few or even no training instances. Although current methods can exploit prior knowledge to handle these few/zero-shot labels, they neglect the meta-knowledge contained in the dataset that can guide models to learn with few samples. In this paper, for the first time, this problem is addressed from a meta-learning perspective. However, the simple extension of meta-learning approaches to multi-label classification is sub-optimal for LMTC tasks due to the long-tailed label distribution and the coexistence of few- and zero-shot scenarios. We propose a meta-learning approach named META-LMTC. Specifically, it constructs more faithful and more diverse tasks according to well-designed sampling strategies and directly incorporates the objective of adapting to new low-resource tasks into the meta-learning phase. Extensive experiments show that META-LMTC achieves state-of-the-art performance against strong baselines and can still enhance powerful BERT-like models.


Introduction
Large-scale multi-label text classification (LMTC) is a fundamental and practical task in natural language processing (Tsoumakas et al., 2010). LMTC can be found in several domains, such as organizing documents in Wikipedia articles (Partalas et al., 2015), annotating medical records with diagnostic and procedure labels (Yan et al., 2010; Rios and Kavuluru, 2018), and assigning legislation with relevant legal concepts (Chalkidis et al., 2019). Different from multi-class classification, the LMTC task aims to assign multiple labels from a large predefined set (typically thousands) to each instance.
Due to the large predefined label set and limited annotated resources, LMTC tasks usually face the challenge of long-tailed label distribution, i.e., many labels have few or even no annotated samples. For example, in EURLEX57K (Chalkidis et al., 2019), about 70% of seen labels have been assigned to fewer than 20 documents (i.e., few-shot labels), and more than 40% of the predefined labels are not associated with any document (i.e., zero-shot labels). To make matters worse, new labels continually emerge as the field evolves. Though few/zero-shot labels may not contribute heavily to the overall performance, correct prediction of such labels is crucial in some cases (Rios and Kavuluru, 2018). For instance, when assigning diagnosis labels to electronic health records, incorrect predictions of these labels either bring unnecessary financial burdens or make patients ignore potential health risks. These factors require models to utilize few or no samples to accurately assign labels.
To cope with these few/zero-shot labels, current models typically match texts to feature vectors for each label obtained by exploiting prior label information. Specifically, Rios and Kavuluru (2018) utilizes label textual descriptors to generate a feature vector for each label. It also employs a 2-layer graph convolutional neural network (Kipf and Welling, 2017) to take advantage of the structured knowledge of label spaces to enhance label representations. Apart from that, Lu et al. (2020) finds that label similarity graphs based on pre-trained word embeddings and co-occurrence frequency are also beneficial.
Nonetheless, these approaches neglect the potential meta-knowledge contained in the dataset that can guide models to learn with only a small amount of samples. Meta-learning has been suggested as an efficient strategy to acquire this knowledge: it constructs tasks that simulate few-shot learning scenarios and aims to learn how to achieve maximal performance with a limited amount of samples (Vinyals et al., 2016; Snell et al., 2017). Following the idea of meta-learning (Qian and Yu, 2019), for the first time, we propose to investigate the problem of few/zero-shot labels in LMTC tasks from a meta-learning perspective. As illustrated in Fig. 1, we simulate few/zero-shot scenarios that are faithful to the LMTC task and thereby give models the chance to learn how to adapt fast and efficiently with a limited amount of data.

Figure 1: Illustration of the idea of META-LMTC. The left part shows how models handle few/zero-shot labels in the LMTC task; the right part shows the simulated low-resource multi-label classification tasks, where the numbers of instances in the support set and in the query set are N = 3 and K = 2, respectively.
However, most meta-learning algorithms are designed for multi-class classification under the few-shot setting (Vinyals et al., 2016), and constructing faithful and diverse tasks is critical for meta-learned models' generalization (Snell et al., 2017; Bansal et al., 2020). We argue that the simple extension of these approaches to multi-label classification is sub-optimal for LMTC tasks in that (1) LMTC tasks need to cope with few- and zero-shot scenarios, while existing methods only consider few-shot ones, which is not faithful to LMTC tasks; and (2) LMTC tasks often face the challenge of long-tailed data distribution, yet these algorithms are not designed for specific data distributions and thereby make the rare labels in the training set less involved in the meta-learning process, which reduces the diversity of the tasks.
To address the above two issues, we propose an optimization-based meta-learning algorithm, namely META-LMTC, which consists of a meta-learning phase and a fine-tuning phase. We design a task sampling strategy that accounts for the characteristics of LMTC tasks (i.e., the coexistence of few- and zero-shot scenarios and the long-tailed data distribution). During the meta-learning phase, this strategy not only constructs more faithful meta-learning tasks (i.e., tasks in which the zero- and few-shot scenarios coexist) but also provides more diverse labels and more varied instances. The model then acquires meta-knowledge on these tasks through alternating meta-training and meta-evaluation processes. During the fine-tuning phase, the meta-learned model is fine-tuned on the original LMTC dataset to further improve performance. In summary, our contributions are as follows:
• We propose a meta-learning algorithm, META-LMTC, for LMTC tasks. To the best of our knowledge, ours is the first study to address these challenges in LMTC tasks from the meta-learning point of view.
• Our method outperforms the current state-of-the-art models on two LMTC benchmarks. Further analysis reveals that our method can still enhance powerful BERT-like models.

Related Work
Our work is a synthesis of two research directions: large-scale multi-label text classification and meta-learning. We review them in this section.

Large-Scale Multi-Label Text Classification
The skewed label frequency distribution of LMTC datasets poses few/zero-shot challenges for current models. Leveraging prior knowledge about labels has become a promising approach to tackling these problems. Rios and Kavuluru (2018) utilizes label descriptors and hierarchy to generate a representation for each label, with promising results. To further enhance these rare label representations, Lu et al. (2020) fuses pre-defined word embeddings and label co-occurrence graphs. Additionally, some studies find that a more powerful text encoder can improve the performance on frequent labels (Chalkidis et al., 2019, 2020; Li and Yu, 2020). Different from these existing solutions, we directly tackle the few/zero-shot label learning challenges from a meta-learning perspective.

Meta-Learning
Meta-learning (a.k.a. learning-to-learn) aims to learn a general model that can quickly adapt to a new task given a limited amount of annotated instances without suffering from overfitting (Geng et al., 2019). Most recent approaches to meta-learning focus on few-shot learning and can be broadly categorized into (i) metric-based (Vinyals et al., 2016; Snell et al., 2017), (ii) model-based (Santoro et al., 2016; Ravi and Larochelle, 2017), and (iii) optimization-based techniques (Finn et al., 2017; Yoon et al., 2018). Meta-learning has been applied in various settings, such as image classification, machine translation (Gu et al., 2018), dialogue systems (Mi et al., 2019; Qian and Yu, 2019), etc. Different from the above studies on multi-class classification under the few-shot setting, our work focuses on LMTC tasks, where one document may be assigned multiple labels from a large predefined label set. In this work, we propose META-LMTC, which is better suited to encouraging general and robust representations in LMTC tasks. Unlike few-shot learning, which only focuses on the performance of novel classes, LMTC tasks are concerned with the performance of all labels (including few/zero-shot ones). To the best of our knowledge, we are the first to frame LMTC as a meta-learning problem.

Large-Scale Multi-Label Text Classification
As mentioned before, LMTC tasks face a serious long-tailed problem and often involve few/zero-shot labels. Formally, we have two disjoint sets of seen labels C_S and unseen (i.e., zero-shot) labels C_U. According to label frequency, C_S can be further divided into frequent labels C_S^R and few-shot labels C_S^F. Given a training set of documents {(x_i, y_i)}, where x_i indicates the i-th document and y_i ⊂ C_S is the corresponding label set of x_i, our goal is to predict the correct labels ŷ ⊂ C_S ∪ C_U for each testing document. Apart from the training and testing sets, some prior knowledge of the labels, such as label descriptions and the predefined label hierarchy, is also available.

Model-Agnostic Meta-Learning
Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is an optimization-based meta-learning framework. Its core idea is to leverage a set of auxiliary tasks to search for a good parameter initialization from which learning a target task requires only a handful of training samples. Formally, MAML first meta-learns an initialization of the model parameters θ_0 with auxiliary tasks {T_1, ..., T_i} and then continues to learn the optimized parameters θ* for a target task T_t (Gu et al., 2018):

θ* = Learn(T_t; θ_0) = Learn(T_t; MetaLearn(T_1, ..., T_i)).

Notably, the original MAML is designed for few-shot multi-class classification problems and does not consider specific data distributions. However, due to the coexistence of few- and zero-shot scenarios and the long-tailed data distribution, a simple extension of MAML to multi-label classification problems can be sub-optimal for LMTC tasks.

Meta-Learning for LMTC
In this section, we first define LMTC tasks from the meta-learning perspective. Then we present a detailed description of the proposed META-LMTC.

Problem Statement
Previous studies formulate LMTC tasks as a traditional supervised learning process Learn(T_LMTC; θ_0), where the initial parameters θ_0 are obtained either randomly or from pre-training. Instead, from the meta-learning perspective, we aim to find a better initialization θ*_0 with auxiliary low-resource multi-label text classification tasks {T_i}. Each task T_i = (D^tr_{T_i}, D^val_{T_i}) is sampled from the LMTC training data D^tr with a specific strategy τ, where D^tr_{T_i} ∩ D^val_{T_i} = ∅ and N, K are the numbers of instances in the support set D^tr_{T_i} and the query set D^val_{T_i}, respectively. In addition, we let C^tr_{T_i} = ∪_{n=1}^{N} y_n and C^val_{T_i} = ∪_{k=1}^{K} y_k be the corresponding label sets of the support set and the query set in T_i. As far as we know, this is the first attempt to cope with few/zero-shot labels in LMTC from the perspective of meta-learning.

Overview of META-LMTC
Algo. 1 shows the overall procedure of META-LMTC, which consists of a meta-learning phase and a fine-tuning phase. We describe the meta-learning phase in detail here. Suppose we are given a model f_θ with parameters θ and a task sampling strategy τ that generates tasks T_i. For each task, we first update the model parameters using one-step gradient descent:

θ'_i = θ − α ∇_θ L(f_θ; D^tr_{T_i}),    (1)

where α is the local learning rate and L is the loss function. After that, the loss of the local parameters on the corresponding query set is computed, i.e., L(f_{θ'_i}; D^val_{T_i}). Finally, the global parameters are updated using the loss across multiple tasks, i.e.,

θ ← θ − β ∇_θ Σ_{T_i} L(f_{θ'_i}; D^val_{T_i}),    (2)

where β is the global learning rate.

Algorithm 1 META-LMTC
Input: Dataset D, learning rates α, β and task sampling strategy τ
Output: Model θ*
1: Initialize parameters θ = θ_0
2: // Meta-Learning Phase
3: while not done do
4:   Simulate a batch of low-resource multi-label text classification tasks T_i using strategy τ
5:   for each task T_i do
6:     Compute local parameters θ'_i with Eq. 1
7:   end for
8:   Update global model parameters θ with Eq. 2
9: end while
10: // Fine-tuning Phase
11: Fine-tune the model initialized with the meta-learned parameters θ*_0 on the dataset D
12: Return the final model θ*

In short, META-LMTC explicitly simulates low-resource LMTC tasks and directly incorporates the objective of adapting to these tasks into the meta-learning optimization phase. This encourages models to learn meta-knowledge, i.e., how to obtain maximal performance on rare/unseen labels with little training data.
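The inner/outer update of Eq. 1 and Eq. 2 can be sketched for a linear multi-label scorer. Note that the paper's implementation uses PyTorch with the Higher library and full second-order gradients; the NumPy model below is a first-order approximation, and all function names, shapes, and learning rates are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_grad(W, X, Y):
    """Gradient of the mean binary cross-entropy of logits X @ W.T.
    W: (L, D) per-label weights, X: (N, D) documents, Y: (N, L) binary labels."""
    P = sigmoid(X @ W.T)                              # (N, L) label probabilities
    return (P - Y).T @ X / (X.shape[0] * W.shape[0])

def meta_step(W, tasks, alpha=0.1, beta=0.01):
    """One first-order meta-update in the spirit of Eq. 1 / Eq. 2.
    Each task is a (X_support, Y_support, X_query, Y_query) tuple."""
    meta_grad = np.zeros_like(W)
    for X_s, Y_s, X_q, Y_q in tasks:
        W_local = W - alpha * bce_grad(W, X_s, Y_s)   # Eq. 1: local parameters
        meta_grad += bce_grad(W_local, X_q, Y_q)      # query-set loss gradient
    return W - beta * meta_grad / len(tasks)          # Eq. 2: global update
```

Dropping the second-order term (the gradient through the inner update) keeps the sketch short; MAML-style methods often use this first-order variant in practice.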
However, how to construct tasks is one of the main challenges for meta-learning (Vinyals et al., 2016). Snell et al. (2017) pointed out that a training problem more faithful to the test environment can lead to better performance, and Bansal et al. (2020) claimed that diversity in the tasks used for meta-learning benefits models' generalization ability. Thus, designing a task sampling strategy τ that makes tasks faithful and diverse is a critical problem to be solved.

LMTC Task Sampling Strategy
As mentioned before, a simple extension of meta-learning algorithms to the multi-label classification problem is sub-optimal for LMTC tasks for the two following reasons: (1) LMTC tasks need to cope with both few- and zero-shot scenarios, while existing methods consider only few-shot ones and thereby provide less faithful training conditions; (2) LMTC datasets often exhibit a long-tailed distribution, but meta-learning algorithms do not take the specific data distribution into account. The tasks constructed by their naive sampling strategies over-represent a small set of frequent labels, which reduces the diversity of the tasks.
To address issue (1), we design a simple yet effective task sampling strategy, namely the instance-based one: a handful of samples are uniformly sampled from the original LMTC dataset D and partitioned into two disjoint sets, i.e., the support set D^tr_{T_i} and the query set D^val_{T_i}. We have found empirically that this strategy constructs more faithful tasks in which few- and zero-shot scenarios coexist, i.e., C^val_{T_i} ⊄ C^tr_{T_i} with a high probability. 1 However, this strategy is still affected by the long-tailed label distribution of LMTC, as shown in the upper part of Fig. 2: the few-shot labels in the training set have fewer chances to appear in the tasks, and models are more susceptible to meta-overfitting (Bansal et al., 2020) to a handful of frequent labels.
To alleviate issue (2), we provide another strategy, namely the label-based one: a label is first sampled from the label space C_S, and then an instance annotated with this label is selected. We repeat this process N + K times to construct a task. The upper part of Fig. 2 shows that the label-based strategy is fairer than the instance-based one along the label dimension. On the other hand, the lower part of Fig. 2 reveals that the instance-based strategy shows no bias toward instances, while the label-based one pays too much attention to instances mostly annotated with few-shot labels.
Though both the instance- and label-based strategies provide more faithful tasks, each reduces diversity in the tasks along either the label dimension or the instance dimension. To increase diversity in the tasks used for meta-learning, we use a sampling ratio p ∈ [0, 1]: each task T_i is constructed by the instance-based strategy with probability p or by the label-based one with probability 1 − p. By appropriately setting the value of p, META-LMTC can provide more faithful and more diverse tasks and thereby boost models' performance.

Experiments
In this section, we conduct several experiments to evaluate the efficacy of our method on LMTC tasks. The experimental results show that our method brings performance improvements to all of the few/zero-shot LMTC base models.

Datasets
To evaluate our method, we use two benchmarks: a medical dataset, MIMIC-III 2 (Johnson et al., 2016), and an EU legislation dataset, EURLEX57K (Chalkidis et al., 2019). 3 MIMIC-III contains approximately 58k English discharge summaries from US hospitals. Each summary is annotated with codes (labels) from the 6,966 leaves of the ICD-9 diagnosis hierarchy, with an average of 11 labels. The other benchmark, EURLEX57K, is an LMTC dataset in the legal domain, which contains 57k English legislative documents. Each document is annotated with an average of five concepts (labels) from the 4,271 concepts of EUROVOC 4 .
Following Rios and Kavuluru (2018) and Lu et al. (2020), the labels are divided into frequent, few-shot, and zero-shot labels. Specifically, few-shot labels are defined as those whose frequencies in the training set are less than or equal to 5 for MIMIC-III and 50 for EURLEX57K 5 . In addition, MIMIC-
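The frequency-based split described above can be sketched as follows (a minimal sketch; function and variable names are illustrative, with the threshold set to 5 for MIMIC-III and 50 for EURLEX57K):

```python
from collections import Counter

def split_labels(train_label_sets, all_labels, few_shot_threshold):
    """Split the label set into frequent / few-shot / zero-shot groups.

    `train_label_sets` is an iterable of per-document label sets from the
    training split; `all_labels` is the full predefined label set. A label
    is few-shot if its training frequency is in (0, threshold], zero-shot
    if it never appears in training, and frequent otherwise.
    """
    freq = Counter(lab for labels in train_label_sets for lab in labels)
    frequent = {l for l in all_labels if freq[l] > few_shot_threshold}
    few_shot = {l for l in all_labels if 0 < freq[l] <= few_shot_threshold}
    zero_shot = set(all_labels) - frequent - few_shot
    return frequent, few_shot, zero_shot
```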

Evaluation Metrics
Because there are thousands of labels in LMTC datasets, annotators or users would not see a label unless it appears at the top of the ranking. Thus, ranking metrics are usually adopted to measure the usefulness of various systems (Rios and Kavuluru, 2018; Lu et al., 2020; Chalkidis et al., 2020). Following them, we report both recall at K (R@K) and normalized discounted cumulative gain at K (nDCG@K), where K is set to 10 for MIMIC-III and 5 for EURLEX57K. Because our aim is high performance on both frequent and few/zero-shot labels, similar to the setups in Xian et al. (2019) and Rios and Kavuluru (2018), we also report the harmonic average across all R@K and all nDCG@K scores for methods that can predict zero-shot labels.
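For reference, R@K and nDCG@K with binary relevance can be computed as follows (a minimal sketch; in this setup the ranking is the model's label scores sorted in descending order):

```python
import math

def recall_at_k(ranked, gold, k):
    """R@K: fraction of gold labels found among the top-K ranked labels."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)

def ndcg_at_k(ranked, gold, k):
    """nDCG@K with binary relevance: DCG of the top-K over the ideal DCG."""
    gold = set(gold)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, lab in enumerate(ranked[:k]) if lab in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal
```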

Baselines
Following Lu et al. (2020), we compare against the following baselines. CNN (Kim, 2014) uses convolutional neural networks with max-pooling to extract text features, which are then used to make predictions for the labels. RCNN (Lai et al., 2015) uses recurrent neural networks with a convolution layer to consider both long-distance and local dependencies. It achieves the best performance among the competitive text encoders in Liu et al. (2019).
CAML (Mullenbach et al., 2018) uses the label-wise attention mechanism, allowing each label to focus on different parts of the text. 6
ZAGGRU (Chalkidis et al., 2019), originally proposed by Rios and Kavuluru (2018), applies graph convolutions (GCNs) to the label hierarchy. 7 Its GCNs obtain better representations for few/zero-shot labels by benefiting from the (better) representations of frequent labels that are nearby in the label hierarchy.
ZAGRU is an ablation of ZAGGRU proposed in Chalkidis et al. (2020). It replaces the stack of GCN layers in ZAGGRU with a plain two-layer Multi-Layer Perceptron (MLP). Though unaware of the label hierarchy, the model produces surprisingly competitive performance on rare labels.
6 The original model uses a CNN text encoder, whereas we use a Bi-GRU for better performance and fairness of comparison.
7 According to Chalkidis et al. (2020), a Bi-GRU encoder obtains better performance than the CNN token encoder of the original model. Thus, we use Bi-GRUs rather than CNNs as the token encoder.
AGRU-KAMG (Lu et al., 2020) is the state-of-the-art model for the LMTC task and can handle few- and zero-shot labels. Besides the predefined label hierarchy, it utilizes label graphs based on the similarity among label embeddings as well as label co-occurrence graphs, which capture label relations from different views and thereby enhance the quality of the label representations.
Among the above models, the first three use a randomly initialized label embedding for each label, which makes them incapable of coping with unseen labels and leads to poor generalization over rare labels. Instead, the last three models use a shared label encoder to obtain label representations, which empowers them to handle few/zero-shot labels. Because we focus on the models' generalization over both frequent labels and few/zero-shot labels, META-LMTC is only applied to the last three models to verify its effectiveness and versatility. To explore the necessity of a balanced task sampling strategy, we also apply the simple extension of MAML for the multi-label classification problem (called SIMPLE-EXT 8 ) to the same base models.

Implementation Details
We implement all the methods with the PyTorch library and use Higher (Grefenstette et al., 2019) for our meta-learners. Additionally, the binary cross-entropy loss is used as the loss function during both the meta-training and fine-tuning phases. More details can be found in Appendix A.

Results
The experimental results of our methods and the baselines on the MIMIC-III and EURLEX57K datasets are shown in Table 2. We apply our framework to the ZAGRU, ZAGGRU, and AGRU-KAMG models. The performance of the models meta-trained by pure instance-and label-based strategies is not reported here due to space limitations but can be found in Appendix C.
As shown in the upper part of Table 2, the AGRU-KAMG model meta-trained with our method performs the best in every single evaluation metric among all models on the MIMIC-III dataset. Equipped with META-LMTC, the state-of-the-art model AGRU-KAMG achieves relative improvements of 13.1% R@10 and 16.8% nDCG@10 on few-shot labels, along with 26.2% R@10 and 23.5% nDCG@10 on zero-shot labels. In addition to the few/zero-shot labels, performance on frequent ones also benefits from our method; we argue this is because our method obtains a better initialization for these frequent labels. It is worth noting that the performance of all three base models is significantly improved when equipped with META-LMTC, which verifies its versatility.
The lower part of Table 2 presents the results of our proposed methods and the baselines on the EURLEX57K testing set. Similar to the experimental results on the MIMIC-III dataset, our proposed methods still bring great improvements to all of the base models in each label group and outperform the baselines by a large margin. Specifically, by employing our method, the harmonic average nDCG@5 of ZAGRU, ZAGGRU, and AGRU-KAMG is absolutely improved by 5.8%, 5.8%, and 3.5%, respectively. This further confirms that our method is capable of helping each model predict labels more accurately. Table 2 also shows the performance of the three base models equipped with SIMPLE-EXT on the MIMIC-III and EURLEX57K datasets. Although this method can boost models' performance, it is not as effective as our method, since SIMPLE-EXT neglects the zero-shot scenarios and long-tailed label distributions of LMTC datasets.

Analysis
We explore the following questions further in this section: Is META-LMTC also effective for powerful BERT-like models? How does the choice of hyperparameters affect the method? Which labels benefit most from our method?

Apply to BERTlike Model
Most recently, Chalkidis et al. (2020) show that BERT-like models (Devlin et al., 2019) equipped with label-wise attention networks (BERT-LWAN) achieve the best results among all methods on EURLEX57K. However, BERT-LWAN relies solely on trainable vectors to represent labels and thereby cannot handle unseen labels, and it is not trivial to extend this model to zero-shot scenarios. To cope with unseen labels, BERT-LWAN needs to employ a shared label encoder to encode each label's textual description as its representation. Due to the large predefined label sets in LMTC datasets, it is impractical to use BERT as the shared label encoder. 9 Fortunately, Sanh et al. (2019) provide a smaller, faster distilled model, DistilBERT. We use it as the shared label encoder and equip it with LWAN and the gradient accumulation trick; the resulting model, dubbed Z-DistilBERT, is capable of zero-shot learning. 10 Table 3 shows the difference in various metrics on EURLEX57K using Z-DistilBERT with or without META-LMTC. It clearly demonstrates that META-LMTC can still bring significant improvement even to the powerful BERT-like model Z-DistilBERT.

Hyperparameter Studies
META-LMTC improves the generalization ability of models by increasing the diversity of meta-learning tasks, and the task distribution depends on the hyperparameter p. In this subsection, we investigate the influence of this hyperparameter on the models' performance. Fig. 3 presents the difference between the performance of ZAGGRU equipped with META-LMTC and that of the base model. The hyperparameter p is chosen from {0.00, 0.25, 0.50, 0.75, 1.00}. Note that p = 0.00 is the pure label-based task sampling strategy, while p = 1.00 is the pure instance-based one. The figure demonstrates that META-LMTC consistently boosts the performance of the base model for all values of p, but the value of p can significantly affect the size of the improvement. In general, the performance improves at first and then decreases as the value of p increases. As discussed before, pure task sampling strategies are inferior because they ignore the long-tailed distribution of label frequency in LMTC datasets and reduce the diversity of the sampled tasks. The experimental results with other base models show a similar trend and can be found in Appendix D.

To fully understand the source of the performance boost, we resort to a detailed performance improvement breakdown, presented in Fig. 4. The green line in this figure indicates the performance difference between applying and not applying META-LMTC to the base model when considering labels with a frequency less than or equal to a certain value. As can be seen, our method benefits zero-shot labels and few-shot labels whose frequencies are between 1 and 20 the most. This reveals that META-LMTC does improve the models' ability to handle few/zero-shot labels.

Discussion
In this section, we report experimental results of evaluating few- and/or zero-shot labels in LMTC tasks under stricter settings.

Construction of the Zero-shot Label Candidates
When evaluating models' performance over unseen labels, existing works consider only the labels appearing in the datasets (i.e., the validation or testing set) rather than all available labels. However, in a realistic setting, we only know that an unseen label appears in the predefined label set. Therefore, we consider all available labels that do not appear in the training set as zero-shot label candidates. Because the number of zero-shot labels increases dramatically, the performance of all models drops dramatically. For example, the R@5 and nDCG@5 of the ZAGRU model drop from 54.5% and 43.7% to 20.7% and 14.6%, respectively, on the EURLEX57K dataset. But our method can still bring performance enhancements to these base models: when equipped with META-LMTC, the scores rise to 23.8% (+3.1%) and 16.4% (+1.8%).

Evaluation Metrics of the Few/Zero-shot Labels
In LMTC tasks, ranking-based metrics such as R@K and nDCG@K are often adopted to evaluate the top K labels with the highest scores predicted by the model. In previous works, the value of K is selected based on the average number of labels per document. However, the average numbers of few- and/or zero-shot labels in each dataset are much lower than the selected K, which may lead to an inappropriate evaluation of these labels. For example, the average numbers of few- and zero-shot labels in the EURLEX57K dataset are about 1.7 and 1.1, respectively (instead of 5), so we set K=2 and K=1 for few- and zero-shot evaluation. Under these settings, the performance of the AGRU-KAMG model on the few-shot labels becomes 52.5% R@2 and 57.3% nDCG@2. As for the zero-shot labels, AGRU-KAMG achieves 24.1% R@1 and 25.8% nDCG@1. Even though the model's performance shows an obvious difference, our method can still bring steady improvements on both few- and zero-shot labels, specifically 55.2% R@2 (+2.7%) and 60.7% nDCG@2 (+3.4%) for few-shot labels along with 26.9% R@1 (+2.8%) and 28.3% nDCG@1 (+2.5%) for zero-shot ones.

Conclusion and Future Work
In this paper, we proposed an optimization-based meta-learning framework, namely META-LMTC, along with several task sampling strategies. Ours is the first study to address LMTC tasks from a meta-learning perspective. Extensive experimental results showed that our method significantly improves the performance of all the base models. Further analysis showed that our method is also applicable to strong BERT-like models and revealed the source of the performance boost our method brings. As future work, we will further explore meta-learning approaches to handle the generalized zero-shot learning (GZSL) problem in LMTC tasks.

A Additional Implementation Details of Main Experiments
We extract the vocabularies from both the documents in the training texts and the label descriptors. Each document is truncated at the length of 512 at the training and inference stage. Hyperparameters are selected with the best nDCG@K of the zero-shot labels on the validation set. The search space of each hyperparameter is shown in Table 4.
For all the models implemented in the experiments on the two datasets, a one-layer Bi-GRU with hidden dimension 100 is used for the RNN encoders, and 200 filters with kernel size 10 are used for the CNN encoders. The size of the GCNs' hidden states is set to 200. Additionally, we used 200-dimensional word embeddings pretrained on PubMed and GloVe (Pennington et al., 2014) for MIMIC-III and EURLEX57K, respectively. The dropout rate is set to 0.1 for the embedding layer and 0.5 for the last hidden layer for all implemented models.
In the meta-training phase, the SGD optimizer with learning rate α = 3 × 10^−3 is used for each task's local update on MIMIC-III and α = 1 × 10^−3 on EURLEX57K. The Adam optimizer with learning rate β = 3 × 10^−4 is used to update the global parameters on MIMIC-III and β = 1 × 10^−4 on EURLEX57K. The sizes of the support set and the query set are 128 and 32, respectively. Besides, the model's global parameters are updated once using the average loss of 4 sampled tasks. Finally, the meta-model is saved for the fine-tuning phase after 300 update iterations, i.e., after learning from 1200 sampled tasks.
In the training phase, the batch size of 64 is used for both of the datasets. When training a model from scratch, the learning rate is set to 1 × 10 −3 for MIMIC-III and 3 × 10 −4 for EURLEX57K. If fine-tuning a model that has been meta-trained, the learning rate is 3 × 10 −4 and 1 × 10 −4 for MIMIC-III and EURLEX57K respectively.
All experiments are run with one NVIDIA GPU V100. In Table 5, we report the size of the models and the elapsed training time.

B Implementation Details of Z-DistilBERT
We implement the Z-DistilBERT model similarly to ZAGRU but replace both the text encoder and the label encoder with DistilBERT. Due to the thousands of labels in LMTC tasks, the memory overhead becomes unacceptable if all labels are encoded at the same time. To overcome this issue, we divide the labels into many small blocks, e.g., 256 labels per block. For each block, the loss of its labels and the gradients of the model parameters are computed first. Then the gradients of each parameter are accumulated, and the computation graph, except for the text-encoding part, is freed manually. When all blocks have been processed serially, the model parameters are updated with the accumulated gradients.
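The block-wise accumulation can be illustrated with a linear NumPy model. The actual implementation operates on DistilBERT label encodings in PyTorch; this sketch only demonstrates the underlying property that the per-block gradient slices sum to the full-batch gradient, so processing labels in small blocks trades memory for serial computation without changing the update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accumulated_grad(W, X, Y, block_size):
    """Accumulate the mean-BCE gradient over label blocks.

    W: (L, D) label weights, X: (N, D) documents, Y: (N, L) binary labels.
    Only `block_size` labels are scored at a time, so peak memory scales
    with the block size rather than with the full label-set size L.
    """
    grad = np.zeros_like(W)
    num_labels = W.shape[0]
    for start in range(0, num_labels, block_size):
        end = min(start + block_size, num_labels)
        P = sigmoid(X @ W[start:end].T)  # probabilities for this block only
        grad[start:end] = (P - Y[:, start:end]).T @ X / (X.shape[0] * num_labels)
    return grad
```

Because the BCE loss decomposes over labels, the blocked result is identical to computing the gradient over all labels at once.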

C Full Experiments Results
The experimental results of our methods and the baselines on the MIMIC-III and EURLEX57K datasets are shown in Table 6. We apply our algorithm to the ZAGRU, ZAGGRU, and AGRU-KAMG models based on the instance-based (META-LMTC-IS), label-based (META-LMTC-LS), and final sampling strategies (META-LMTC). The results show that all the existing models obtain significant performance improvements when meta-trained with our method, which illustrates the effectiveness and versatility of our methods. Additionally, the final strategy outperforms the pure instance- or label-based ones on most of the metrics.