BanditMTL: Bandit-based Multi-task Learning for Text Classification

Task variance regularization, which can be used to improve the generalization of Multi-task Learning (MTL) models, remains unexplored in multi-task text classification. Accordingly, to fill this gap, this paper investigates how the task variance might be effectively regularized, and consequently proposes a multi-task learning method based on an adversarial multi-armed bandit. The proposed method, named BanditMTL, regularizes the task variance by means of a mirror gradient ascent-descent algorithm. Adopting BanditMTL in the multi-task text classification context is found to achieve state-of-the-art performance. The results of extensive experiments back up our theoretical analysis and validate the superiority of our proposal.


Introduction
Multi-task Learning (MTL), which involves the simultaneous learning of multiple tasks, can achieve better performance than learning each task independently (Caruana, 1993; Ando and Zhang, 2005). It has achieved great success in various applications, ranging from summary quality estimation (Kriz et al., 2020) to text classification (Liu et al., 2017).
In the multi-task text classification context, MTL learns the tasks simultaneously by minimizing their empirical losses together, for example by minimizing the mean of the empirical losses of the included tasks. However, it is common for these tasks to compete: minimizing the losses of some tasks increases the losses of others, which in turn increases the task variance (the variance between the task-specific losses). Large task variance can lead to over-fitting in some tasks and under-fitting in others, which degrades the generalization performance of an MTL model. To illustrate this issue, it is instructive to consider a case of two-task learning, where task 1 and task 2 are conflicting binary classification tasks. When the task variance is uncontrolled, it is possible that the empirical loss of task 1 converges to 0 while the empirical loss of task 2 converges to 0.5. In such a case, although the mean of the empirical losses is decreasing, task 1 overfits and task 2 underfits, which leads to poor generalization performance.
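The two-task scenario above can be made concrete with a few lines of arithmetic. The loss trajectories below are hypothetical numbers for illustration only (not measurements from the paper): the mean loss keeps falling even as the variance between the task losses, which signals the over/under-fitting imbalance, grows.

```python
# Toy illustration: track the mean and the variance of two task losses
# during training. Task 1's loss goes to 0 while task 2's stalls near
# 0.5, so the mean decreases while the task variance increases.

def mean(xs):
    return sum(xs) / len(xs)

def task_variance(losses):
    m = mean(losses)
    return mean([(l - m) ** 2 for l in losses])

# Hypothetical loss trajectories over 5 checkpoints.
task1 = [0.6, 0.3, 0.1, 0.02, 0.0]    # overfits: loss -> 0
task2 = [0.6, 0.55, 0.52, 0.51, 0.5]  # underfits: loss -> 0.5

for l1, l2 in zip(task1, task2):
    print(f"mean={mean([l1, l2]):.3f}  variance={task_variance([l1, l2]):.4f}")
```

At the final checkpoint the mean is 0.25 but the variance has risen from 0 to 0.0625, which is exactly the quantity the proposed regularizer controls.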
To address the problem caused by uncontrolled task variance, it is necessary to implement task variance regularization, which regularizes the variance between the task-specific losses during training. However, existing deep MTL methods, including both adaptive weighting sum methods (Kendall et al., 2018; Liu et al., 2017) and multi-objective optimization-based methods (Sener and Koltun, 2018; Mao et al., 2020b), ignore the task variance. Overlooking task variance degrades an MTL model's generalization ability.
To fill this gap and further improve the generalization ability of MTL models, this paper proposes a novel MTL method, dubbed BanditMTL, which jointly minimizes the empirical losses and regularizes the task variance. BanditMTL is formulated as a linear adversarial multi-armed bandit and implemented with a mirror gradient ascent-descent algorithm. Our proposed approach can improve the performance of multi-task text classification.
Moreover, to verify our theoretical analysis and validate the superiority of BanditMTL in the text classification context, we conduct experiments on two classical text classification problems: sentiment analysis (on reviews) and topic classification (on news). The results demonstrate that applying variance regularization can improve the performance of an MTL model; moreover, BanditMTL is found to outperform several state-of-the-art multi-task text classification methods.

Related Works
Multi-task Learning methods jointly minimize the task-specific empirical losses based on multi-objective optimization (Sener and Koltun, 2018; Mao et al., 2020a) or by optimizing a weighted sum of the task-specific losses (Liu et al., 2017; Kendall et al., 2018). Multi-objective optimization-based MTL can converge to an arbitrary Pareto stationary point, the task variance of which is also arbitrary. The weighted sum methods, meanwhile, focus on minimizing the weighted average of the task-specific empirical losses and do not consider the task variance. To fill the gap in existing methods, this paper proposes to regularize the task variance, which significantly impacts the generalization performance of MTL models.

Variance-based regularization has been used previously in Single-task Learning to balance the trade-off between approximation and estimation error (Bartlett et al., 2006; Koltchinskii et al., 2006; Namkoong and Duchi, 2017). In the Single-task Learning setting, the goal of variance-based regularization is to regularize the variance between the losses of training samples (Namkoong and Duchi, 2016; Duchi and Namkoong, 2019). While these variance-based regularization methods can improve the generalization ability of Single-task Learning models, they do not fit the Multi-task Learning setting. This paper is thus the first to propose a variance-based regularization method for Multi-task Learning, improving MTL models' generalization ability by regularizing the between-task loss variance.

Preliminaries
Consider a multi-task learning problem with T tasks over an input space X and a collection of task spaces {Y_t}_{t=1}^T. For each task, we have a set of i.i.d. training samples D_t = {(x_i^t, y_i^t)}_{i=1}^{n_t}, where n_t is the number of training samples of task t. In this paper, we focus on the neural network-based multi-task learning setting, in which the tasks are jointly learned by sharing some parameters (hidden layers).
Let h(·, θ) : X → {Y_t}_{t=1}^T be the multi-task learning model, where θ ∈ Θ is the vector of model parameters. θ = (θ_sh, θ_1, ..., θ_T) consists of θ_sh (the parameters shared between tasks) and θ_t (the task-specific parameters). We denote by h_t(·, θ_sh, θ_t) : X → Y_t the task-specific map. The task-specific loss function is denoted by l_t(·, ·) : Y_t × Y_t → [0, 1]. The empirical loss of task t is defined as L̂_t(θ_sh, θ_t) = (1/n_t) Σ_{i=1}^{n_t} l_t(h_t(x_i^t, θ_sh, θ_t), y_i^t). The transpose of a vector/matrix is represented by the superscript ⊤, and logarithms are to base e (denoted log).
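The notation can be sketched in plain Python. The sample losses below are hypothetical numbers (not from the paper's experiments); each task t has n_t per-sample losses in [0, 1], and the empirical task loss L̂_t is their average, yielding one loss per task.

```python
# Sketch of the empirical-loss notation with hypothetical data: each
# inner list holds per-sample losses l_t(h_t(x), y) for one task.

def empirical_loss(sample_losses):
    """L_hat_t = (1/n_t) * sum over the n_t per-sample losses of task t."""
    return sum(sample_losses) / len(sample_losses)

per_task_sample_losses = [
    [0.2, 0.4, 0.3],        # task 1, n_1 = 3
    [0.7, 0.5],             # task 2, n_2 = 2
    [0.1, 0.2, 0.1, 0.2],   # task 3, n_3 = 4
]

# The vector of task-specific empirical losses that MTL optimizes.
loss_vector = [empirical_loss(ls) for ls in per_task_sample_losses]
print(loss_vector)
```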

The Learning Objective of MTL
Under the Empirical Risk Minimization paradigm, multi-task learning aims to optimize the vector of task-specific empirical losses. The learning objective of multi-task learning is formulated as a vector optimization objective, as in equation (1):

min_θ ( L̂_1(θ_sh, θ_1), L̂_2(θ_sh, θ_2), ..., L̂_T(θ_sh, θ_T) ).    (1)
In order to optimize the learning objective, existing multi-task learning methods tend to adopt either global criterion optimization strategies (Liu et al., 2017; Kendall et al., 2018; Mao et al., 2020b) or multiple gradient descent strategies (Sener and Koltun, 2018; Mahapatra and Rajan, 2020). In this paper, we choose to adopt the typical linear-combination strategy, which can achieve proper Pareto Optimality (Miettinen, 2012) and is widely used in the multi-task text classification context (Liu et al., 2017; Yadav et al., 2018; Xiao et al., 2018). The linear-combination strategy is defined in (2):

min_θ (1/T) Σ_{t=1}^T L̂_t(θ_sh, θ_t).    (2)

Adversarial Multi-armed Bandit
Adversarial multi-armed bandit, a setting in which a player and an adversary simultaneously address the trade-off between exploration and exploitation, is one of the fundamental multi-armed bandit problems (Bubeck and Cesa-Bianchi, 2012). In this paper, we consider the linear multi-armed bandit, which is a generalized adversarial multi-armed bandit. In our linear multi-armed bandit setting, the set of arms is a compact set A ⊂ R^T. At each time step k = 1, 2, ..., K, the player chooses an arm from A while, simultaneously, the adversary chooses a loss vector from [0, 1]^T. For the linear multi-armed bandit, the Online Mirror Descent (OMD) algorithm is a powerful technique that can be used to achieve proper regret (Srebro et al., 2011).

Online Mirror Descent
The Online Mirror Descent (OMD) algorithm is a generalization of gradient descent for sequential decision problems. Rather than taking gradient steps in the primal space, the mirror descent approach takes gradient steps in the dual space. The bijection ∇Φ and its inverse ∇Φ* are used to map back and forth between primal and dual points. To obtain a good regret bound, Φ must be a Legendre function (Definition 1). Assume that we update u^k with gradient g^k using OMD. The OMD algorithm consists of three steps: (1) select a Legendre function Φ; (2) perform a gradient step in the dual space, w^{k+1} = ∇Φ*(∇Φ(u^k) − η g^k), where Φ* and ∇Φ* are as defined in Definition 2 and η is the step length; (3) project back to the primal space according to the Bregman divergence (Definition 3): u^{k+1} = argmin_{u ∈ C} D_Φ(u, w^{k+1}), where C is the feasible set.
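The three OMD steps have a particularly clean form on the probability simplex. The sketch below (a generic textbook instance, not the paper's exact update) uses the negative-entropy Legendre function, for which the dual-space step becomes multiplication by exponentiated gradients and the Bregman projection back onto the simplex reduces to renormalization.

```python
import math

def omd_simplex_step(u, g, eta):
    """One OMD step on the probability simplex with the negative-entropy
    mirror map Phi(u) = sum_t u_t log u_t.

    Step 1: Phi is fixed (negative entropy).
    Step 2: dual-space gradient step -> w_t = u_t * exp(-eta * g_t),
            since grad(Phi)(u)_t = 1 + log u_t and its inverse exponentiates.
    Step 3: the Bregman (KL) projection onto the simplex is renormalization.
    """
    w = [u_t * math.exp(-eta * g_t) for u_t, g_t in zip(u, g)]
    z = sum(w)
    return [w_t / z for w_t in w]

u = [1 / 3, 1 / 3, 1 / 3]
g = [1.0, 0.0, 0.0]                 # arm 1 incurred the larger loss gradient
u_next = omd_simplex_step(u, g, eta=0.5)
print(u_next)                       # probability mass moves away from arm 1
```

This entropy-based instance is exactly the kind of update the player in Section 4 performs, up to the additional constraint set.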

Hard Parameter-sharing MTL Model
This paper adopts the most prevalent and efficient hard parameter-sharing MTL model (Kendall et al., 2018; Sener and Koltun, 2018; Mao et al., 2020b) to perform multi-task text classification. As shown in Figure 1, the hard parameter-sharing MTL model learns multiple related tasks simultaneously by sharing the hidden layers (feature extractor) across all tasks while retaining task-specific output layers for each task. In multi-task text classification, the feature extractor can be an LSTM (Hochreiter and Schmidhuber, 1997), a TextCNN (Kim, 2014), and so on. The task-specific layers are typically formulated by fully connected layers ending with a softmax function.

Bandit-based Multi-task Learning
To avoid uncontrolled task variance, we need to develop a learning method that regularizes the task variance during training. Regularized Loss Minimization (RLM) is a learning method that jointly minimizes the empirical risk and a regularization function, and is thus a natural choice. While RLM is widely used in Single-task Learning, it cannot be directly used in Multi-task Learning to regularize the task variance. In this section, we propose a surrogate for RLM in MTL and accordingly develop a novel MTL method, namely BanditMTL.

Regularizing the Task Variance
RLM is a natural choice for regularizing the task variance. RLM for task-variance-regularized MTL can be formulated as in equation (3):

min_θ (1/T) Σ_{t=1}^T L̂_t(θ_sh, θ_t) + ρ √(Var_T(θ)),    (3)

where Var_T(θ) = (1/T) Σ_{t=1}^T ( L̂_t(θ_sh, θ_t) − (1/T) Σ_{s=1}^T L̂_s(θ_sh, θ_s) )^2 is the empirical variance between the task-specific losses.
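A small numeric check illustrates why penalizing the between-task variance matters. The snippet below (hypothetical numbers, assuming the common mean-plus-ρ-times-standard-deviation form of a variance-penalized objective) shows two loss vectors with the same mean: the imbalanced one, where one task overfits and the other underfits, is penalized while the balanced one is not.

```python
import math

# Variance-penalized objective: mean task loss plus rho times the
# standard deviation of the task losses (illustrative form).

def objective(losses, rho):
    T = len(losses)
    m = sum(losses) / T
    var = sum((l - m) ** 2 for l in losses) / T
    return m + rho * math.sqrt(var)

balanced = [0.25, 0.25]    # same mean loss ...
imbalanced = [0.0, 0.5]    # ... but large between-task variance

print(objective(balanced, rho=1.0))    # 0.25: no penalty
print(objective(imbalanced, rho=1.0))  # 0.5: mean 0.25 + std 0.25
```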
By a duality argument in the style of Namkoong and Duchi (2016), the worst-case weighted loss

sup_{p ∈ P_ρ,T} Σ_{t=1}^T p_t L̂_t(θ_sh, θ_t),    (4)

where P_ρ,T is a set of task-weight vectors on the simplex centered at the uniform weights (1/T, ..., 1/T) with radius controlled by ρ, is convex and can be used as a convex surrogate for (3). This paper proposes to perform task-variance-regularized multi-task learning with the following learning objective:

min_θ sup_{p ∈ P_ρ,T} Σ_{t=1}^T p_t L̂_t(θ_sh, θ_t).    (5)

Optimizing (5) is equivalent to optimizing (3).
In the proposed learning objective (5), ρ is the regularization parameter that controls the trade-off between the mean empirical loss and the task variance. Experimental analysis on the influence of ρ is presented in Section 5.6. To learn an MTL model via learning objective (5), we formulate the learning problem as an adversarial multi-armed bandit problem in Section 4.2 and further propose the BanditMTL algorithm in Section 4.3.

Task-Variance-Regularized MTL as Adversarial Multi-armed Bandit
In deep multi-task learning, an MTL model is typically learnt by iteratively optimizing the learning objective. To iteratively optimize the proposed learning objective (5), we formulate it as an adversarial multi-armed bandit problem in which the player chooses an arm from P_ρ,T and the adversary assigns a loss vector L̂(θ) = (L̂_1(θ_sh, θ_1), ..., L̂_T(θ_sh, θ_T)) to the arms. In each learning iteration, the player chooses an arm from P_ρ,T to increase the weighted sum loss, while the adversary aims to decrease the loss by updating the learning model. Moreover, both the player and the adversary aim to find a trade-off between exploration and exploitation to achieve proper regret. When l_t(·, ·) is convex and Θ is compact, the adversarial multi-armed bandit problem can achieve a saddle point (θ*, p*) (Boyd and Vandenberghe, 2014). The saddle point satisfies Σ_t p_t L̂_t(θ*) ≤ Σ_t p*_t L̂_t(θ*) ≤ Σ_t p*_t L̂_t(θ) for all p ∈ P_ρ,T and θ ∈ Θ. To achieve proper regret and the saddle point, we adopt mirror gradient ascent for the player and mirror gradient descent for the adversary. The mirror gradient ascent-descent algorithm for MTL, namely BanditMTL, is proposed in the next section.

BanditMTL
In this paper, the task-variance-regularized multi-task learning problem is formulated as a linear adversarial multi-armed bandit problem. For a problem of this kind, mirror gradient descent (ascent) is a powerful technique for the adversary and the player to achieve proper regret (Bubeck and Cesa-Bianchi, 2012; Namkoong and Duchi, 2016). Moreover, based on the mirror gradient ascent-descent, we can reach the saddle point of the minimax optimization problem when the task-specific loss functions are convex and the parameter space Θ is compact (Boyd and Vandenberghe, 2014).

Algorithm 1: BanditMTL

Input: data {D_t}_{t=1}^T, the learning rates η_p and η_a, the approximation parameter ε.
Initialization: p^1 = (1/T, 1/T, ..., 1/T), randomly initialize θ^1.
for k = 1 to K do
    Compute λ with Algorithm 2.
    Update p via the mirror gradient ascent step (6).
    Update θ via the mirror gradient descent step (8).
end for

Algorithm 2: Compute λ (a bisection search for the root of f(λ); detailed below)

In this paper, we propose a task-variance-regularized multi-task learning algorithm based on mirror gradient ascent-descent, dubbed BanditMTL. The proposed method is presented in algorithmic form in Algorithm 1. We assume that the training procedure has K learning iterations. In each learning iteration 1 ≤ k ≤ K, the player and the adversary update via mirror gradient ascent and mirror gradient descent, respectively.
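The alternating loop of Algorithm 1 can be sketched end-to-end on a toy problem. The code below is a simplified, runnable illustration, not the paper's implementation: scalar quadratic "tasks" L_t(θ) = (θ − c_t)^2 stand in for a neural network, and the player's Bregman projection onto P_ρ,T (the λ/bisection step of Algorithm 2) is replaced by plain simplex normalization, making the player a vanilla exponentiated-gradient ascent.

```python
import math

# Toy tasks that disagree on the shared parameter theta.
targets = [0.0, 1.0, 4.0]            # hypothetical c_t values
T = len(targets)

def task_losses(theta):
    return [(theta - c) ** 2 for c in targets]

def task_grads(theta):
    return [2 * (theta - c) for c in targets]

theta = 0.5
p = [1.0 / T] * T                    # p^1 = (1/T, ..., 1/T)
eta_p, eta_a = 0.1, 0.05             # player / adversary step sizes

for k in range(200):
    losses = task_losses(theta)
    # Player: mirror gradient ASCENT on the task weights -- tasks with
    # high loss gain weight, which penalizes large task variance.
    # (Projection onto P_rho_T is simplified to renormalization here.)
    w = [p_t * math.exp(eta_p * l_t) for p_t, l_t in zip(p, losses)]
    z = sum(w)
    p = [w_t / z for w_t in w]
    # Adversary: gradient DESCENT on the weighted sum of task losses.
    grad = sum(p_t * g_t for p_t, g_t in zip(p, task_grads(theta)))
    theta -= eta_a * grad

print(round(theta, 3), [round(p_t, 3) for p_t in p])
```

Because the two extreme tasks (targets 0 and 4) conflict, theta settles near their midpoint and the easy middle task loses weight: the player keeps the competing tasks' losses balanced instead of letting one dominate.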

Mirror Gradient Ascent for the Player
For the player, considering the constraint set P_ρ,T, we choose the Legendre function Φ_p(p) = Σ_{t=1}^T p_t log p_t. Based on this Legendre function, we propose the update rule for p in (6) (see the Appendix for the derivation of the update rule):

p^{k+1} = argmin_{p ∈ P_ρ,T} D_{Φ_p}( p, ∇Φ_p*( ∇Φ_p(p^k) + η_p L̂(θ^k) ) ),    (6)
where η_p is the step size for the player. Moreover, λ is the Lagrange multiplier arising from the projection in (6), obtained as the solution of the equation f(λ) = 0, where f(λ) is defined in (7); f(λ) is non-increasing in λ ≥ 0.
where q_t = e^( log p_t^k + η_p L̂_t(θ_sh^k, θ_t^k) ). To solve f(λ) = 0, we propose a bisection-search-based algorithm, as outlined in Algorithm 2.
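The bisection idea behind Algorithm 2 can be sketched generically. Since f is non-increasing in λ ≥ 0, once an interval [lo, hi] with f(lo) ≥ 0 ≥ f(hi) is bracketed, repeated halving converges to the root within any tolerance ε. The f used below is a stand-in with a known root, not the paper's exact (7).

```python
def bisect_nonincreasing(f, eps=1e-8, hi=1.0):
    """Find lambda >= 0 with f(lambda) ~ 0 for a non-increasing f."""
    lo = 0.0
    while f(hi) > 0:            # grow the bracket until f changes sign
        lo, hi = hi, 2 * hi
    while hi - lo > eps:        # halve the bracket: f(lo) >= 0 >= f(hi)
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Stand-in non-increasing function with root at lambda = 3.
lam = bisect_nonincreasing(lambda x: 3.0 - x)
print(round(lam, 6))  # ~3.0
```

Each iteration halves the bracket, so reaching tolerance ε costs O(log(1/ε)) evaluations of f.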

Mirror Gradient Descent for the Adversary

For the adversary, to simplify calculation, we choose the Legendre function Φ_θ(θ) = (1/2)‖θ‖_2^2. With this choice of Φ_θ(θ), the mirror gradient descent update rule (presented in (8)) coincides with common gradient descent (see the Appendix for the derivation of the update rule):

θ^{k+1} = θ^k − η_a ∇_θ Σ_{t=1}^T p_t^{k+1} L̂_t(θ_sh^k, θ_t^k),    (8)
where η a is the learning rate for the adversary.

Experiments
In this section, we perform experimental studies on sentiment analysis and topic classification to evaluate the performance of our proposed BanditMTL and verify our theoretical analysis. The implementation is based on PyTorch (Paszke et al., 2019). The code is attached in the supplementary materials.

Datasets
Sentiment Analysis. We evaluate our algorithm on product reviews from Amazon. The dataset (Blitzer et al., 2007) contains product reviews from 14 domains, including books, DVDs, electronics, kitchen appliances, and so on (https://www.cs.jhu.edu/~mdredze/datasets/sentiment/). We consider each domain as a binary classification task. Reviews with rating > 3 are labeled positive and those with rating < 3 are labeled negative; reviews with rating = 3 are discarded, as their sentiment is ambiguous and hard to predict.
Topic Classification. We select 16 newsgroups from the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/), a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups, and formulate them into four 4-class classification tasks (as shown in Table 1) to evaluate the performance of our algorithm on topic classification.

Baselines
We compare BanditMTL with the following baselines. Single-Task Learning: learning each task independently.
Uniform Scaling: learning the MTL model with learning objective (2), the uniformly weighted sum of task-specific empirical loss.
Uncertainty: using the uncertainty weighting method proposed by (Kendall et al., 2018).
GradNorm: using the gradient normalization method proposed by Chen et al. (2018).

Experimental Settings
We adopt the hard parameter-sharing MTL model shown in Fig. 1. The shared feature extractor is a TextCNN structured with three parallel convolutional layers with kernel sizes of 3, 5, and 7, respectively. The task-specific module is one fully connected layer ending with a softmax function. To ensure consistency with the state-of-the-art multi-task classification methods (Liu et al., 2017; Mao et al., 2020b) and ensure a fair comparison, we adopt pre-trained GloVe (Pennington et al., 2014) word embeddings in our experimental analysis.
We train the deep MTL network model in line with Algorithm 1. The learning rate for the adversary is 1e-3 for both sentiment analysis and topic classification. We use the Adam optimizer (Kingma and Ba, 2015) and train for 3000 epochs on both problems.
The batch size is 256. We use dropout with a probability of 0.5 for all task-specific modules.

Classification Accuracy
We compare the proposed BanditMTL with the baselines and report the results over 10 runs by plotting the classification accuracy of each task for both sentiment analysis and topic classification. The results are shown in Figs. 2 and 3. All experimental results show that our proposed BanditMTL significantly outperforms Uniform Scaling, which demonstrates that adopting task variance regularization can boost the performance of MTL models. Moreover, BanditMTL is seen to outperform all baselines and achieve state-of-the-art performance.

Task Variance
In this section, we experimentally investigate how BanditMTL regularizes the task variance during training and compare the task variance of BanditMTL with the baselines. The results are plotted in Fig. 4. As the figure shows, all MTL methods have lower task variance than single-task learning during training. Moreover, BanditMTL has lower task variance and smoother evolution during training than the other MTL methods. Considering also the results obtained in Section 5.4, we conclude that task variance has a significant impact on multi-task text classification performance.

Impact of ρ
In BanditMTL, ρ is the regularization parameter. In this section, we experimentally investigate the impact of ρ on task variance and average classification accuracy over the tasks of interest.

Impact on Variance
Fig. 5 plots how the task variance evolves during training w.r.t. different values of ρ. The task variance decreases as ρ increases, revealing that we can control the task variance by adjusting ρ.

Impact on Average Accuracy
The change in BanditMTL's average classification accuracy w.r.t. different values of ρ is illustrated in Fig. 6. As ρ increases, the average accuracy of BanditMTL first increases and then decreases. This reveals that ρ significantly impacts the performance of multi-task text classification. As ρ controls the trade-off between the empirical loss and the task variance, we can conclude that this trade-off significantly impacts multi-task text classification performance. Thus, in multi-task text classification, it is necessary to find a proper trade-off between the empirical loss and the task variance rather than focusing only on the empirical loss. These results verify the necessity of task variance regularization.

Sensitivity Study on η p
In BanditMTL, η_p is a hyper-parameter. To determine whether the performance of BanditMTL is sensitive to η_p, we conduct experiments on the classification performance of BanditMTL w.r.t. different values of η_p. The results of these experiments are presented in Fig. 7. As the figure shows, the performance of our proposed method is not very sensitive to η_p when η_p is within the range of 0.3 to 0.9 for both sentiment analysis and topic classification. Setting η_p to between 0.3 and 0.9 can generally provide satisfactory results.

Figure 8: Comparison of task weight adaptation processes between BanditMTL, Uncertainty, GradNorm, and MGDA for topic classification. ρ = 1.2, η_p = 0.5.

Evolution of p t
In this section, we observe the changes in p_t during training and compare them with the task weight adaptation processes of three weight-adaptive MTL methods (i.e., Uncertainty, GradNorm, and MGDA). The results for topic classification are reported in Fig. 9. Due to space limitations, the sentiment analysis results are presented in the appendix. From the results, we can see that the weight adaptation process of BanditMTL is more stable than those of Uncertainty, GradNorm, and MGDA.

Conclusion
This paper proposes a novel Multi-task Learning algorithm, dubbed BanditMTL. It fills the task variance regularization gap in the field of MTL and achieves state-of-the-art performance in real-world text classification applications. Moreover, our proposed BanditMTL is model-agnostic; thus, it could potentially be used in other natural language processing applications, such as Multi-task Named Entity Recognition.