Improving Pretrained Models for Zero-shot Multi-label Text Classification through Reinforced Label Hierarchy Reasoning

Exploiting label hierarchies has become a promising approach to tackling the zero-shot multi-label text classification (ZS-MTC) problem. Conventional methods aim to learn a matching model between text and labels, using a graph encoder to incorporate label hierarchies and obtain effective label representations (Rios and Kavuluru, 2018). More recently, pretrained models like BERT (Devlin et al., 2018) have been used to convert classification tasks into a textual entailment task (Yin et al., 2019), an approach naturally suitable for ZS-MTC. However, pretrained models are underexplored in the existing work because they do not generate individual vector representations for text or labels, making it unintuitive to combine them with conventional graph encoding methods. In this paper, we explore how to improve pretrained models with label hierarchies on the ZS-MTC task. We propose a Reinforced Label Hierarchy Reasoning (RLHR) approach to encourage interdependence among labels in the hierarchies during training. Meanwhile, to overcome the weakness of flat predictions, we design a rollback algorithm that can remove logical errors from predictions during inference. Experimental results on three real-life datasets show that our approach achieves better performance and outperforms previous non-pretrained methods on the ZS-MTC task.


Introduction
Multi-label text classification (MTC) is a basic NLP problem that underlies many real-life applications like product categorization (Partalas et al., 2015) and medical records coding (Du et al., 2019). The labels in the output space are often interdependent and, in many applications, organized in a hierarchy, as shown in the example in Figure 1. A significant challenge for the real-life development of MTC applications is the severe deficiency of annotated data for each label in the hierarchy, which demands better solutions for zero-shot learning. Existing work on zero-shot multi-label text classification (ZS-MTC) mostly learns a matching model between the feature space of text and the label space (Ye et al., 2020). In order to learn effective representations for labels, a majority of existing work incorporates label hierarchies via a label encoder designed as Graph Neural Networks (GNNs) that can aggregate the neighboring information of labels (Chalkidis et al., 2020; Lu et al., 2020).
Recently, pretrained models like BERT (Devlin et al., 2018) have been widely used as strong matching models due to their superior representation ability (Qiao et al., 2019). They have been applied to convert a classification task into a textual entailment task by treating the text to be classified as the premise and its label as the hypothesis, which is naturally suitable for ZS-MTC (Yin et al., 2019). However, the problem with this approach is that pretrained models cannot generate individual vector representations for labels (a label is coupled with its corresponding text in a joint representation), so conventional methods like GNNs, which utilize the label hierarchy to obtain better label representations, cannot be directly applied to pretrained models, leaving them underexplored in the existing research.
Although pretrained models have shown potential on ZS-MTC, as discussed above, it is not intuitive to introduce structural information of label hierarchies to the learning procedure. Flattening all the labels without considering their hierarchical structures, however, will result in predictions that contain logical errors, which are known as the class-membership inconsistency (Silla and Freitas, 2011). The problem will be even more salient for pretrained models because they only take the literal tokens of the labels as input. An example with logical errors is shown in Figure 1. Without label hierarchy information, the model correctly predicts Bikes as a true label, but fails to predict its parent label, Sporting Goods. Meanwhile, the model does not choose the label Local Services while predicting its child label Bike Repair due to the fact that Bike Repair has tokens similar to those in the input text.
To overcome the aforementioned weakness, we propose a Reinforced Label Hierarchy Reasoning (RLHR) approach to introduce label structure information to pretrained models. Instead of regarding labels as independent, we cast ZS-MTC as a deterministic Markov Decision Process (MDP) over the label hierarchy. An agent starts from the root label and learns to navigate to the potential labels by hierarchical deduction in the label hierarchy. The reward is based on the correctness of the deduction paths, not simply on the correctness of each label. Thus the reward received by one predicted label is determined by both the label itself and the other labels on the same path, which helps to strengthen the interconnections among labels. Meanwhile, we find that the hierarchical inference method (Huang et al., 2019) broadcasts the errors arising at the higher levels of label hierarchies. Thus we further design a rollback algorithm based on the predicted matching scores of labels to reduce the logical errors in the flat prediction mode during inference. We apply our approach to different pretrained models and conduct experiments on three real-life datasets. Results demonstrate that pretrained models outperform conventional non-pretrained methods by a substantial margin. When combined with our approach, pretrained models attain further improvement on both the classification metrics and the logical error metrics. We summarize our contributions as follows:
• We demonstrate that pretrained models outperform conventional methods on ZS-MTC.
• We design a novel Reinforced Label Hierarchy Reasoning (RLHR) approach and a matching-score-based rollback algorithm to introduce the structural information of label hierarchies to pretrained models in both the training and inference stage.
• Experiments with different pretrained models are performed on three real-life datasets. We show the effectiveness of our proposed approach and provide detailed analyses.

Related Work
Exploiting the prior distribution of the label space has proven to be an effective method to tackle the multi-label text classification problem because it provides the model with information about the label structure. Mao et al. (2019) and Huang et al. (2019) took the explicitly represented label hierarchy as the structural information, while Wu et al. (2019) assumed the prior distribution to be implicit and trained their model to learn the distribution during learning. Leveraging the label hierarchy to tackle ZS-MTC has been shown to be promising in previous work, which mostly aimed to learn a matching model between texts and labels. Chalkidis et al. (2019, 2020) and Xie et al. (2019) adopted Label-Wise Attention Networks to encourage interactions between text and labels. Rios and Kavuluru (2018) and Lu et al. (2020) used Graph Neural Networks to capture the structural information in the label hierarchy. However, few existing works investigate the effectiveness of pretrained models on the ZS-MTC task, despite pretrained models being effective as matching models for many natural language processing tasks (Ma et al., 2019; Qiao et al., 2019; Nogueira et al., 2019).
The logical error problem in flat predictions has been widely discussed in previous MTC work (Silla and Freitas, 2011; Wehrmann et al., 2018; Mao et al., 2019), where it is mostly solved through a hierarchical procedure during inference. In our work, we investigate such a method and show that hierarchical inference is not optimal for pretrained models on the ZS-MTC task because it broadcasts errors top-down in the label hierarchy.
Path reasoning is effective for exploiting explicit relationships in structured data and can be combined with reinforcement learning, e.g., for knowledge graph reasoning (Wan et al., 2020; Xian et al., 2019; Xiong et al., 2017). We propose to introduce the label hierarchy to pretrained models through path reasoning, with the aim of strengthening the interconnections between labels. To the best of our knowledge, our work is the first to improve pretrained models through label hierarchies for ZS-MTC.

Label Hierarchy Reasoning
In general, a label hierarchy is defined as $G = (L, E)$, where $L$ is a set of labels and $E$ a set of relations representing parent-child relations between labels. The root of $G$ is a special label $R$. A data instance $x$ is defined as a tuple $(T, P)$ with $T$ as the input text and $P = \{p_1, p_2, \cdots, p_N\}$ as deduction paths, where a path $p_i = \{R, l_i^1, \cdots, l_i^{K-1}, l_i^K\}$ with $l_i^k \in L$ at the $k$-th layer of $G$ and $l_i^{k-1}$ the parent of $l_i^k$. A deduction path must be contiguous, start with $R$, and is not required to terminate at a leaf label.
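The hierarchy and path definitions above can be sketched in a few lines of Python. The tiny hierarchy below is illustrative, built from the Figure 1 example, and the names are our own, not from the paper.

```python
# Minimal sketch of the label hierarchy G = (L, E) and the
# deduction-path validity check described above.

ROOT = "R"

# Parent-child edges E, encoded as child -> parent (illustrative subset).
PARENT = {
    "Shopping": ROOT,
    "Sporting Goods": "Shopping",
    "Bikes": "Sporting Goods",
    "Local Services": ROOT,
    "Bike Repair": "Local Services",
}

def is_valid_deduction_path(path):
    """A deduction path must start at the root and be contiguous
    (each label's parent is the previous label); it may stop before
    reaching a leaf."""
    if not path or path[0] != ROOT:
        return False
    return all(PARENT.get(child) == parent
               for parent, child in zip(path, path[1:]))

assert is_valid_deduction_path([ROOT, "Shopping", "Sporting Goods", "Bikes"])
assert is_valid_deduction_path([ROOT, "Shopping"])   # may stop early
assert not is_valid_deduction_path([ROOT, "Bikes"])  # not contiguous
```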

Zero-shot Multi-label Text Classification
Let $L^s$ and $L^u$ denote the seen and unseen labels, respectively, where $L^s \cup L^u = L$. Given a training set $D^s = \{x_1^s, x_2^s, \cdots\}$, where the labels of each $x_i^s$ are all seen labels, we aim to learn a matching model $f(D^s; \theta)$ and make predictions on a test set $D^u$. Some deduction paths of $x_i^u$ consist of seen labels only, while some contain both seen and unseen labels. Notice that the children of an unseen label are also unseen labels. Evaluation on $D^u$ is conducted in two settings: (1) evaluate the performance on $L^u$, which is known as the zero-shot (ZS) setting, and (2) evaluate the performance on $L^s \cup L^u$, which is the generalized zero-shot (GZS) setting (Huynh and Elhamifar, 2020).

Methodology
The goal of our RLHR approach is to learn a policy P that can make more consistent predictions by traversing the label hierarchy G to generate deduction paths. Given a training instance x, an agent will start from the root R and follow P at each time step to extend the deduction paths by navigating to the children labels at the next level. By measuring the correctness of the generated deduction paths with reinforcement learning (RL), the label hierarchy is introduced to the model during the training time and the interconnections of labels will hence be strengthened, which can help to reduce logical errors in prediction. As we will show in our experiments, hierarchical inference, which is used in previous work (Mao et al., 2019), will propagate the errors occurring at the high levels of hierarchies during inference, resulting in inferior performance. Thus we still adopt the flat prediction during inference, but further design a rollback algorithm based on the structure of G and the predicted matching scores. We will introduce the details of our proposed RLHR and the rollback algorithm in the following subsections.

Base Model
Our base model adopts a pretrained model $\mathcal{M}$, e.g., BERT (Devlin et al., 2018), which has proven effective in matching modelling. Given the input text $T$ and a label $l$, we follow Yin et al. (2019) by transforming the text-label pair into a textual entailment representation, "[CLS] T [SEP] hypothesis of l". The hidden vector $v_{cls}$ of [CLS] is regarded as the aggregate representation and is fed to the classification layer to calculate the matching score $ms$. If $ms \geq \gamma$, where $\gamma$ is a threshold, we say $T$ belongs to label $l$. In our experiments $\gamma$ is set to 0.5.
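The base model's decision rule can be sketched as follows. The hypothesis template string and `score_fn` are placeholders: the exact hypothesis wording and the pretrained scorer $\mathcal{M}$ are not specified here, so this is only a minimal sketch of the flat-prediction interface.

```python
# Sketch of the entailment-style input and thresholded matching decision.
# `score_fn` stands in for the pretrained model M plus classification layer;
# the hypothesis template below is an assumed wording, not the paper's.

GAMMA = 0.5  # matching-score threshold used in the paper

def build_pair(text, label):
    # The text is the premise; a hypothesis about the label follows [SEP].
    return f"[CLS] {text} [SEP] this text is about {label}"

def predict(text, labels, score_fn, gamma=GAMMA):
    """Flat prediction: T belongs to label l iff ms >= gamma."""
    return [l for l in labels if score_fn(build_pair(text, l)) >= gamma]
```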

Reinforced Label Hierarchy Reasoning (RLHR)
Different from vanilla pretrained models that rely on flat prediction during training, we propose to formulate the ZS-MTC task as a deterministic Markov Decision Process (MDP) over label hierarchies. For the input text, the agent trained by RLHR will predict M deduction paths from the root label R. When all deduction paths are generated, the rewards will be received, which are determined by the correctness of the paths. An overall illustration of the RLHR approach is shown in Figure 2. We introduce the details of the RL modules in this subsection.

States
Maintaining just one deduction path for one data instance would result in an inefficient learning process. However, the number of potential deduction paths increases exponentially as the model goes deeper into the lower levels of the hierarchy. To maintain a good trade-off between computational resources and time efficiency, we keep a beam of $M$ deduction paths. Thus for a data instance $x$, the global state $S^k$ at step $k$ is composed of the sub-states of the $M$ deduction paths:

$$S^k = \{s_1^k, s_2^k, \cdots, s_M^k\}.$$

The sub-state $s_i^k$ for deduction path $p_i$ at step $k$ is defined as a tuple $(T, l_i^k)$, where $T$ is the input text and $l_i^k$ is the label reached at step $k$.

Actions
The complete action space $A_i^k$ of sub-state $s_i^k$ is defined as all possible child labels of label $l_i^k$:

$$A_i^k = C(l_i^k),$$

where $C(l_i^k)$ denotes the child labels of $l_i^k$. For the deduction path $p_i$ at time step $k$, an action $a_i^k$ is to select one label $l_i^{k+1}$ from $A_i^k$. Notice that the agent may not select any label from $A_i^k$, which means path $p_i$ ends before it arrives at a leaf label and a "stop" action is taken. By adding this "early stop" mechanism, the agent automatically learns when to stop assigning new labels to the deduction paths.

Policy
We parameterize the action $a_i^k$ by a policy network $\pi(\cdot \mid s, A; \theta)$, where $\theta$ denotes its parameters. For deduction path $p_i$ at time step $k$, the policy network takes as input the state $s_i^k$ and the corresponding action space $A_i^k$, emitting the matching score of each action in $A_i^k$, which is calculated by the base pretrained model $\mathcal{M}$. Finally, an action $a_i^k$ is sampled from the distribution obtained by normalizing the matching scores of the actions in $A_i^k$:

$$\pi(a \mid s_i^k, A_i^k; \theta) = \mathrm{softmax}_{a \in A_i^k}\big(ms(s_i^k, a)\big).$$
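Assuming the sampling distribution is a softmax over the emitted matching scores (one reasonable reading of the description above), action selection for one sub-state can be sketched as:

```python
import math
import random

STOP = "<stop>"  # the "early stop" action from the previous subsection

def sample_action(scores, rng=random):
    """Sample one action from a softmax over matching scores.
    `scores` maps each candidate child label (optionally including STOP)
    to the matching score produced by the base model M."""
    labels = list(scores)
    exps = [math.exp(scores[l]) for l in labels]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the categorical distribution.
    r, acc = rng.random(), 0.0
    for label, p in zip(labels, probs):
        acc += p
        if r <= acc:
            return label
    return labels[-1]  # guard against floating-point round-off
```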

Reward
In our approach, the reward is based on the correctness of a complete deduction path. Instead of treating all labels as independent, our approach encourages interdependence among the labels. The reward received by a label $l_i^k$ is decided not only by the correctness of the label itself but also by the correctness of the other labels on the same deduction path $p_i$. Given the golden deduction paths $\hat{P} = \{\hat{p}_1, \hat{p}_2, \cdots, \hat{p}_N\}$, $p_i$ obtains a positive reward if $p_i$ is in $\hat{P}$ or is a sub-path of a path in $\hat{P}$. Formally, the reward of path $p_i$ is defined as

$$r(p_i) = \begin{cases} \lambda, & \text{if } p_i \in \hat{P} \text{ or } p_i \text{ is a sub-path of a path in } \hat{P}, \\ -1, & \text{otherwise,} \end{cases}$$

where $\lambda$ is a hyper-parameter for scaling. Under most circumstances, the number of wrong deduction paths will be greater than that of correct ones. The problem is even more severe for MTC tasks because the distribution of positive and negative labels is usually imbalanced for a given data instance $x$. A larger $\lambda$ encourages the model to focus more on the correct paths. Notice that our approach differs from existing methods which adopt hierarchical classification (Sun and Lim, 2001; Peng et al., 2018). A hierarchical classification method based on the label hierarchy can only propagate influence from parent label to child label, while in our approach the influence is mutual between parent and child labels, which hence strengthens the reasoning ability of the models.
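A minimal sketch of the path reward, assuming a positive reward λ for correct paths and a fixed penalty of -1 otherwise (the exact negative value is our assumption; the text only specifies that λ scales the positive side):

```python
def path_reward(path, golden_paths, lam=5.0):
    """Reward for one sampled deduction path: +lam if it equals a golden
    path or is a prefix (sub-path starting at the root) of one, else -1.
    The asymmetric scale lam counteracts the surplus of wrong paths."""
    for g in golden_paths:
        if len(path) <= len(g) and list(g[:len(path)]) == list(path):
            return lam
    return -1.0
```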

Optimization
Our goal is to learn a stochastic policy $\pi$ that maximizes the expected total reward $J(\theta)$ of the $M$ sampled deduction paths, which can be formulated as

$$J(\theta) = \mathbb{E}_{p_i \sim \pi(\cdot;\theta)} \Big[ \sum_{i=1}^{M} r(p_i) \Big],$$

where $\theta$ is the parameter of the policy network. We adopt policy gradient (Sutton et al., 2000) as the optimization algorithm, which updates $\theta$ as

$$\theta \leftarrow \theta + \eta \nabla_\theta J(\theta),$$

where $\eta$ is the learning rate. Since there are multiple deduction paths for one data instance, the gradient can be approximated by

$$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \big( r(p_i) - r_b \big) \nabla_\theta \log \pi(p_i; \theta),$$

where $r_b$ is a baseline constant for stabilizing the training procedure, for which we use the average reward of the last training epoch in our experiments.
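The per-instance objective with the epoch-average baseline $r_b$ can be written as a scalar surrogate loss; in an autograd framework, minimizing this loss reproduces the baseline-subtracted gradient estimate. This is a standard REINFORCE sketch under those assumptions, not the authors' exact implementation:

```python
def surrogate_loss(logprobs, rewards, r_b):
    """Negative REINFORCE objective for one instance with M sampled paths:
    L = -(1/M) * sum_i (r_i - r_b) * log pi(p_i).
    `logprobs[i]` is log pi(p_i) for sampled path i; minimizing L performs
    gradient ascent on J(theta) when backpropagated through logprobs."""
    m = len(rewards)
    return -sum((r - r_b) * lp for lp, r in zip(logprobs, rewards)) / m
```

For example, with rewards [5.0, -1.0], log-probs [-1.0, -2.0], and baseline 2.0, the loss is -((5-2)(-1) + (-1-2)(-2)) / 2 = -1.5.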

Inference Rollback
Existing methods mostly adopt the hierarchical inference method (Mao et al., 2019), which avoids logical errors, i.e., class-membership inconsistency (Silla and Freitas, 2011), but brings a serious problem: the prediction errors made at the high levels of a hierarchy are often severely propagated to the lower levels. For instance, if a correct label at the first layer is missing, then none of its descendant labels will be considered during inference. This no doubt harms the performance. On the contrary, if the model makes flat predictions, all labels are visited during inference, but more logical errors will probably arise.
To overcome the aforementioned weaknesses, we propose a rollback algorithm for the inference stage based on the predicted matching scores of all labels. For a data instance $x$, we obtain the predicted labels in flat prediction mode as $P$, which consists of two parts: (1) labels that can form complete deduction paths, and (2) labels with logical errors, which we denote as $P_e = \{l_1^{k_1}, l_2^{k_2}, \cdots, l_N^{k_N}\}$. For a label $l_i^{k_i} \in P_e$, we extract its deduction path from $G$ as $p_i = \{R, l_i^1, \cdots, l_i^{k_i-1}, l_i^{k_i}\}$ and the corresponding predicted matching scores $\{1, ms_i^1, \cdots, ms_i^{k_i-1}, ms_i^{k_i}\}$ (the root label $R$ always has a matching score of 1). Meanwhile, we set a rollback threshold $\mu_k$ for the labels in the $k$-th layer of $G$, where $\{\mu_k\}$ are hyper-parameters tuned on the development set. As long as the matching scores meet these thresholds, we add the labels in $p_i$ back to $P$; otherwise label $l_i^{k_i}$ is removed from $P$. The motivation behind this matching-score-based rollback algorithm is that in a label hierarchy $G$, the labels at higher levels have more training instances but their meanings are more abstract, while the labels at lower levels are more specific, such as the labels "Active Life" and "Bike Rentals" in Figure 1. Pretrained models take only the literal tokens of a label as input and thus may obtain better performance on certain labels at the lower levels than those at higher levels.
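Our reading of the rollback procedure can be sketched as follows; the exact acceptance condition (here: every non-root label on the extracted path must clear its layer's threshold $\mu_k$) is an assumption based on the description above.

```python
def rollback(flat_preds, scores, parent, mu, root="R"):
    """Sketch of the matching-score-based rollback (assumed reading).
    flat_preds: labels predicted in flat mode; scores: label -> predicted
    matching score (root fixed at 1.0); parent: child -> parent edges;
    mu: layer index (1-based) -> rollback threshold. A label with a broken
    deduction path keeps its ancestors added back only if every non-root
    label on its path clears the threshold of its layer; otherwise the
    label is removed."""
    preds = set(flat_preds)
    scores = dict(scores, **{root: 1.0})

    def chain(label):
        # Extract the path root -> ... -> label from the hierarchy.
        path = [label]
        while path[-1] != root:
            path.append(parent[path[-1]])
        return path[::-1]

    out = set()
    for label in preds:
        path = chain(label)
        ancestors = path[1:-1]  # exclude root and the label itself
        if all(a in preds for a in ancestors):
            out.add(label)      # already a complete deduction path
        elif all(scores[path[k]] >= mu[k] for k in range(1, len(path))):
            out.update(path[1:])  # roll the ancestors back into P
        # else: the label is dropped as an unrecoverable logical error
    return out
```

For instance, if "Bikes" is predicted without "Sporting Goods", the ancestors are restored when their scores clear the per-layer thresholds, and "Bikes" is dropped otherwise.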

Datasets
We conduct experiments on three real-life datasets from different domains; the details are provided in Table 1. Yelp is a customer review dataset, in which we need to classify customer reviews into the correct business categories. WOS (Kowsari et al., 2017) is a scientific paper dataset which provides the abstracts of published papers and their corresponding topics. QCD is a query classification dataset we created for the ZS-MTC task; it is composed of search queries and target product types collected from e-commerce websites. The numbers of layers of the label hierarchies in Yelp, WOS, and QCD are 4, 2, and 3, respectively. For examples of the three datasets, please refer to Appendix A.1.

Implementation Details
We test our proposed approach with two pretrained models, BERT (Devlin et al., 2018) and DistilBERT (Sanh et al., 2019). For BERT, we use the uncased base version, which has 12 transformer layers, a 768-dimensional hidden state, 12 attention heads, and 110M parameters in total. DistilBERT contains 6 transformer layers, a 768-dimensional hidden state, and 12 attention heads, with 66M parameters in total. For training, we use Adam (Kingma and Ba, 2014) for optimization and the learning rate is set to 1e-6. Meanwhile, we adopt early stopping to avoid overfitting on the training data. λ is set to 30 on Yelp, 20 on QCD, and 5 on WOS, which we discuss further in Section 5.3.4. We set M to 5 with DistilBERT and 3 with BERT, trading off between training time and GPU memory usage.
The RL training procedure is unstable and slow if the agent is trained from scratch (Silver et al., 2016). So with both BERT and DistilBERT, we pretrain the policy network in flat prediction mode on the training data with the learning rate of 1e-5.

Evaluation Metrics
In our experiments, we use the standard metrics Micro-F1 and Macro-F1 to evaluate the classification performance in both the zero-shot and generalized zero-shot settings. Meanwhile, we also adopt Example-based F1 (Peng et al., 2016) to measure the performance at the instance level, in contrast to Micro/Macro-F1, which measure at the label level. Though some previous work adopted ranking-based metrics (Rios and Kavuluru, 2018) for large-scale MTC, they are not appropriate in our setting because the datasets used in this work have smaller label spaces.
For logical errors, we report the logical error rate, which is defined as the average number of logical errors in one data instance. We take the number of logical errors in one data instance as the number of labels that cannot form a complete deduction path.
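This count is straightforward to compute from the predicted label set and the parent relation; a minimal sketch (names are our own):

```python
def logical_errors(pred_labels, parent, root="R"):
    """Number of logical errors in one data instance: predicted labels
    whose deduction path back to the root is not fully contained in the
    prediction set (class-membership inconsistency)."""
    preds = set(pred_labels)

    def consistent(label):
        # Walk up the hierarchy; every non-root ancestor must be predicted.
        while label != root:
            label = parent[label]
            if label != root and label not in preds:
                return False
        return True

    return sum(1 for l in preds if not consistent(l))
```

The per-instance counts are then averaged over the dataset to obtain the logical error rate.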
Evaluation is conducted in two settings: (1) evaluate the performance on unseen labels only, which is the zero-shot (ZS) setting, and (2) evaluate the performance on both seen labels and unseen labels, i.e., the generalized zero-shot (GZS) setting (Huynh and Elhamifar, 2020).

Baselines
We use two types of baselines: (1) models that do not utilize the label hierarchy, for which we use CNN and CNN with Label-Wise Attention Networks (CNN+LWAN) (Chalkidis et al., 2019); and (2) models that use GNNs to encode the label hierarchy and capture the label structure information, for which we use ZAGCNN (Rios and Kavuluru, 2018).

Table 2 shows the experimental results of the baseline models and our proposed RLHR approach on three real-life datasets in both the zero-shot and generalized zero-shot settings.

Classification Performance
Table 3: Performance of our matching-score-based rollback algorithm and the comparison to the hierarchical inference method. Ma-F, Mi-F, EBF, and Err denote Macro-F1, Micro-F1, Example-based F1, and logical error rate, respectively. ZS and GZS denote the zero-shot and generalized zero-shot settings. Bold numbers indicate the best results for each metric. "BERT+Hie-Infe" in the last row means BERT with the hierarchical inference method, which is used in previous work (Huang et al., 2019).

As we can see in Table 2, CNN and CNN+LWAN perform poorly under the ZS setting while their performance under the GZS setting is better, which suggests CNN and CNN+LWAN cannot provide accurate predictions for unseen labels due to the lack of label structure information.
In contrast, ZAGCNN, which utilizes the label hierarchy, performs better, particularly on unseen labels, which demonstrates the importance of the label hierarchy for ZS-MTC. On the other hand, pretrained models, including DistilBERT and BERT, both outperform conventional non-pretrained methods by substantial margins on the three datasets, though ZAGCNN shows slight advantages on Micro-F1 and Example-based F1 on the QCD dataset under the GZS setting. When combined with RLHR, the performance of pretrained models is further improved by a relatively large margin. We notice that the improvement under the GZS setting is more significant than under the ZS setting, suggesting that seen labels benefit more from our RLHR than unseen labels.

Logical Errors
As shown in Table 2, utilizing label hierarchies does not necessarily reduce the logical error rate for conventional methods, though it can improve the classification performance. For example, the logical error rate of ZAGCNN is higher than that of CNN and CNN+LWAN on Yelp and WOS. The logical error rate of pretrained models is generally lower than that of the conventional methods. However, pretrained models still face the logical error problem even though they perform well on the classification metrics. We can also see that our RLHR helps reduce the logical error rate for DistilBERT and BERT under most circumstances.
Note that better classification performance does not necessarily lead to a lower logical error rate. From Table 2, we can see that although CNN and CNN+LWAN perform poorly on the classification metrics, they achieve a lower logical error rate than ZAGCNN and DistilBERT on the WOS dataset. Similarly, the logical error rate of BERT is higher than that of DistilBERT on QCD even though BERT has better classification performance. Our proposed RLHR approach improves both the classification performance and the logical error rate, which demonstrates its effectiveness.

Analyses on Rollback Algorithm
Due to space limitations, we only report the results of our proposed rollback algorithm based on BERT and put the results on DistilBERT in Appendix A.2. As shown in Table 3, when combined with our proposed rollback algorithm, the performance of BERT+RLHR can be further improved, raising Example-based F1 on Yelp, WOS, and QCD from 49.52%, 64.43%, and 39.99% to 50.01%, 69.32%, and 40.13%, respectively. Our proposed rollback algorithm can also be combined with BERT alone, though the gain is relatively marginal. We further investigate this and observe that at the same level of the label hierarchy, the matching scores obtained with RLHR are more polarized than those obtained with BERT, suggesting RLHR is more confident about its predictions when the label hierarchy is provided. This yields better prediction performance for RLHR when the rollback algorithm is adopted.

Meanwhile, we compare the hierarchical inference method (Huang et al., 2019) with our rollback algorithm. Both methods can completely remove logical errors from the predicted results. However, as we can see in the table, the performance of the hierarchical inference method is not consistent across the three datasets, with either BERT or BERT+RLHR. When conducting hierarchical inference, BERT+RLHR achieves the best Micro-F1 and Example-based F1 on the QCD dataset, while performance drops by a significant margin on the WOS dataset. Similarly, hierarchical inference with BERT achieves minor improvement on the QCD dataset, while on WOS and Yelp the performance is sometimes marginally improved and sometimes worse. The effectiveness of the hierarchical inference method depends mainly on the classification difficulty of the labels at the higher levels of label hierarchies. Such labels are usually more abstract and general, making the performance of hierarchical inference susceptible to errors on them.

Influence of λ
We discuss the influence of the parameter λ on logical error rates and unseen-label classification in this section. Due to space limitations, we only present the results with BERT and put the results based on DistilBERT in Appendix A.3. As shown in Figure 3, for datasets with large hierarchies, like Yelp and QCD, a larger λ helps achieve better classification performance on unseen labels, while it brings more logical errors. On the contrary, a relatively small λ yields better classification performance and lower logical error rates on datasets with small hierarchies like WOS, as shown in Figure 3b. The reason is that for a large hierarchy, the number of sampled correct deduction paths is much smaller than that of wrong paths, which is common in the ZS-MTC task because positive labels are usually far fewer than negative labels; for a small label hierarchy, the number of sampled correct paths is close to that of the wrong ones. A large λ encourages a model to focus more on the sampled correct paths, which improves the classification performance. Meanwhile, if λ is too large, it biases the model toward the dominating labels that appear more often in the datasets, which reduces the generalization ability of the model and harms performance.

Conclusion
We propose a Reinforced Label Hierarchy Reasoning approach to incorporate label hierarchies into pretrained models in order to better solve zero-shot multi-label text classification tasks. We train an agent that starts from the root label, navigates to potential labels in the label hierarchy, and generates multiple deduction paths. By assigning rewards based on the sampled deduction paths, our approach strengthens the interconnections among the labels during the training stage. To overcome the weakness of hierarchical inference methods, we further design a rollback algorithm that can remove logical errors from flat predictions. Experiments on three datasets demonstrate that our proposed approach improves the performance of pretrained models and enables the models to make more consistent predictions.
A.1 Dataset Details
We split the labels in the label space into seen labels and unseen labels. Unseen labels do not necessarily need to be leaf labels, and if an intermediate label is chosen as unseen, then all its descendant labels are also set as unseen. Meanwhile, each data instance in the dev/test sets contains at least one unseen label. Table 5 shows example instances of the Yelp, WOS, and QCD datasets used in this work.
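The closure condition on unseen labels (every descendant of an unseen label is unseen) can be sketched as a simple traversal over child edges; names are illustrative.

```python
def unseen_closure(seeds, children):
    """Expand a set of chosen unseen labels so that every descendant of
    an unseen label is also marked unseen, as required by the split."""
    unseen, stack = set(seeds), list(seeds)
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in unseen:
                unseen.add(child)
                stack.append(child)
    return unseen
```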

A.2 Rollback Results with DistilBERT
As shown in Table 6, DistilBERT+RLHR with the rollback algorithm achieves the best performance on most evaluation metrics. Although the hierarchical inference method can improve DistilBERT on the QCD dataset, its performance is not consistent: it lowers the performance by large margins on WOS with both DistilBERT and DistilBERT+RLHR. In contrast, the rollback algorithm performs consistently on all three datasets, especially when combined with our proposed RLHR approach.

A.3 Influence of λ with DistilBERT
As shown in Figure 4, the influence of parameter λ on three datasets with DistilBERT is similar to that with BERT. For Yelp and QCD datasets, a larger λ helps achieve better classification performance on unseen labels, while it will bring more logical errors. On the contrary, a relatively small λ yields both better classification performance and lower logical error rates on WOS dataset, as shown in Figure 4b. The results support our analyses in Section 5.3.4.

A.4 Deduction Path Analysis
We present the results on deduction paths in this section, which is an important evaluation of whether the model captures the interdependencies of labels. A path is considered correct when it equals a golden deduction path or is a sub-path of one, and we report Example-based Precision, Recall, and F1 based on BERT. As shown in Table 4, BERT achieves high recall but low precision on the deduction paths, which means that it tends to predict more labels as correct. This is because pretrained models only take the literal tokens of labels as input, without any label structure information. On the contrary, RLHR, which incorporates the label hierarchy, can provide more accurate predictions of deduction paths.

Example input text from the WOS dataset (Table 5 excerpt):

This paper presents the design and experimental evaluation of discrete time sliding mode controller using multirate output feedback to minimize structural vibration of a cantilever beam using shape memory alloy wires as control actuators and piezoceramics as sensor and disturbance actuator. Linear dynamic models of the smart cantilever beam are obtained using online recursive least square parameter estimation. A digital control system that consists of Simulink (TM) modeling software and dSPACE DS1104 controller board is used for identification and control. The effectiveness of the controller is shown through simulation and experimentation by exciting the structure at resonance.