Devil's Advocate: Novel Boosting Ensemble Method from Psychological Findings for Text Classification

We present a new form of ensemble method, Devil's Advocate, which uses a deliberately dissenting model to force the other submodels within the ensemble to collaborate better. Our method consists of two different training settings: one follows the conventional training process (Norm), and the other is trained on artificially generated labels (DevAdv). After training the models, the Norm models are fine-tuned through an additional loss function, which uses the DevAdv model as a constraint. In making a final decision, the proposed ensemble model sums the scores of the Norm models and then subtracts the score of the DevAdv model. The DevAdv model improves the overall performance of the other models within the ensemble. In addition to being grounded in a psychological background, our ensemble framework shows comparable or improved performance on 5 text classification tasks when compared to conventional ensemble methods.


Introduction
Ensemble modeling is a technique that combines several submodels into a composite model. By reducing model bias and variance, ensemble techniques can improve overall model performance (Zhou, 2012). In addition, ensemble techniques are used to obtain confidence scores of model predictions for explainable models (Haeusler et al., 2013; Li et al., 2014; Vasudevan et al., 2019). Because of these advantages, ensembling has become a de facto standard for many classification tasks.
Ensemble methods such as soft-voting, hard-voting (Hansen and Salamon, 1990), bagging (Breiman, 1996), and boosting (Schapire, 1990) attempt to build submodels that have different views of the same data, which produces more robust predictions.
* Work carried out at Seoul National University † Equal Contribution
Research in psychology has shown that a high level of cohesion and group thinking can lead to poor decisions and premature solutions (Janis, 1972; McGrath, 1984; Moorhead et al., 1991). People tend to follow the majority in decision making even if the decisions are not reasonable. They are also more likely to rush to judgment and to favor the alternatives preferred by the majority (Nemeth, 2018). As Asch (1956) reported, 35% of the responses agreed with the majority, and nearly everyone followed the incorrect majority at least once. When it comes to group decision making, groups often settle on sub-optimal ideas rather than taking advantage of all the ideas available. Parallels can be drawn between this psychological phenomenon and some ensemble methods, especially in cases where the submodels all have similar architectures.
Devil's Advocate is one of the most prominent methods used for fostering healthy dissent in human group decision making (MacDougall and Baum, 1997; Nemeth et al., 2001). It involves taking a position counter to the majority position. That is, the Devil's Advocate takes an alternative position to the norms that are taken for granted, in order to deepen the discussion through reasonable opposition. By doing so, the dissenter can increase the independence of individuals' thoughts (Nemeth and Nemeth-Brown, 2003). By leveraging this principle from human decision making, we attempt to model the settings of Devil's Advocate in a computational model and thereby improve both the quality of its decision making and its performance.
The contributions of the present study can be summarized as follows:
• We propose an ensemble method that is theoretically based on a psychological background, Devil's Advocate: a reasonable dissent can improve overall group decision making.
• On 5 different text classification datasets, our method shows comparable or improved performance when compared to conventional ensemble methods.

Devil's Advocate
Psychologists have made various attempts to improve the quality of decision making. Some tried to raise the quality by increasing the diversity in groups (Chatman et al., 1998). Other researchers have utilized the concept of 'an outsider in the group', in particular the Devil's Advocate (Schweiger et al., 1986; Nemeth et al., 2001). A Devil's Advocate is a person who takes a position that does not necessarily agree with the consensus, for the sake of rich discussion. By taking a counter position, the Devil's Advocate engages others in an argumentative discussion that challenges the uniform thought of the majority, leading the participants to question the consensus and their own point of view. The purpose of this idea is to assess the quality of the original thought and to identify errors in the argument.

Ensembles
Voting Algorithms (Hansen and Salamon, 1990). Soft-Voting simply averages the prediction scores of the submodels. When we train the models, their weights are initialized differently. Due to the effect of random initialization, the models have different views of the same data. Hard-Voting is a variation of soft-voting in which the prediction made by the majority of submodels becomes the ensemble prediction. Although alternative ensemble methods have been developed, these voting models remain widely used due to their simplicity and high performance.
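The difference between the two voting rules can be illustrated with a minimal sketch. The function names `soft_vote` and `hard_vote` and the example scores are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
from collections import Counter


def soft_vote(score_lists):
    """Average the per-class scores over submodels and return the top class."""
    n_models = len(score_lists)
    n_classes = len(score_lists[0])
    avg = [sum(s[c] for s in score_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


def hard_vote(score_lists):
    """Let each submodel vote for its own top class; the most common vote wins."""
    votes = [max(range(len(s)), key=s.__getitem__) for s in score_lists]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical scores from three submodels on a three-class problem.
scores = [
    [0.9, 0.1, 0.0],  # very confident in class 0
    [0.4, 0.6, 0.0],  # mildly prefers class 1
    [0.4, 0.6, 0.0],  # mildly prefers class 1
]
print(soft_vote(scores), hard_vote(scores))
```

In this example the two rules disagree: averaging lets the one confident submodel outweigh the two mildly dissenting ones, whereas majority voting ignores confidence.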
Bagging (Bootstrap AGGregatING) (Breiman, 1996) first generates a bootstrap sample from the training dataset. A classifier is then trained from the bootstrap sample. Through repeating this process, the method builds a number of classifiers and averages their prediction scores.
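A minimal sketch of the bagging procedure described above follows. The helper names, the `rate` parameter, and the representation of a classifier as a callable returning a list of per-class scores are assumptions made for illustration.

```python
import random


def bootstrap_sample(dataset, rate=1.0, rng=None):
    """Draw int(len(dataset) * rate) items uniformly with replacement."""
    rng = rng or random.Random()
    k = int(len(dataset) * rate)
    return [rng.choice(dataset) for _ in range(k)]


def bagging_predict(classifiers, x):
    """Average the per-class scores of classifiers trained on different samples."""
    score_lists = [clf(x) for clf in classifiers]
    n_models = len(score_lists)
    n_classes = len(score_lists[0])
    return [sum(s[c] for s in score_lists) / n_models for c in range(n_classes)]


# Hypothetical usage: each bootstrap sample would be used to train one classifier.
data = list(range(10))
sample = bootstrap_sample(data, rate=0.7, rng=random.Random(0))
```

Because sampling is done with replacement, a given sample usually contains duplicates and omits some items, which is what gives each classifier a different view of the training data.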
Boosting (Schapire, 1990) links weak classifiers in various ways to build a strong classifier. The main idea is to train each classifier so that it complements the weaknesses of the previously trained classifier. Its variations, AdaBoost (Freund and Schapire, 1997) and Gradient Boosting (Friedman, 2002), are well known but not widely used in deep learning, since boosting requires weak classifiers.

Training Norm and DevAdv models
Our method requires at least 3 models. Normal models ($\mathrm{Norm}_1, \dots, \mathrm{Norm}_N$ with $N \geq 2$) follow the conventional training process, while one model is used as a Devil's Advocate model (DevAdv). We first train the $\mathrm{Norm}_n$ models using a conventional Cross Entropy loss function (CE):

$$\mathrm{Loss}_{\mathrm{Norm}_n} = \mathrm{CE}(\mathrm{Scores}_{\mathrm{Norm}_n}, l_{\mathrm{true}})$$

where $\mathrm{Scores}_{\mathrm{Norm}_n}$ are the prediction scores of the $\mathrm{Norm}_n$ models and $l_{\mathrm{true}}$ refers to the true labels. Conversely, in order to create the DevAdv model, we randomly generate fake labels that do not coincide with the true labels. The generated labels are denoted as false labels ($l_{\mathrm{false}}$). The loss function of DevAdv is as follows:

$$\mathrm{Loss}_{\mathrm{DevAdv}} = \mathrm{CE}(\mathrm{Scores}_{\mathrm{DevAdv}}, l_{\mathrm{false}}), \quad l_{\mathrm{false}} \in \{1, \dots, C\} \setminus \{l_{\mathrm{true}}\}$$

where $C$ is the number of labels. Since the DevAdv model is trained using false labels, it serves as the Devil's Advocate, disagreeing with the prediction scores of the other models. Furthermore, the fake labels are randomly regenerated in each epoch, allowing the DevAdv model to offer a different view of the data with each training iteration.
In early-stopping, the validation performance of the DevAdv model is checked by assessing whether $\arg\min(\mathrm{Scores}_{\mathrm{DevAdv}})$ is the true label.
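The false-label generation and the validation check can be sketched as follows. This is only an illustration under stated assumptions: the text says the false labels are random and differ from the true labels, but drawing them uniformly from the remaining $C - 1$ labels is an assumption, and the function names and 0-based label indices are hypothetical.

```python
import random


def sample_false_label(true_label, num_classes, rng):
    """Draw a label from the C - 1 labels that differ from the true one.

    Assumption: the draw is uniform; the paper only states that it is random.
    """
    offset = rng.randrange(1, num_classes)  # 1 .. C-1, so the result never equals true_label
    return (true_label + offset) % num_classes


def devadv_is_correct(scores_devadv, true_label):
    """Validation check: the lowest DevAdv score should sit on the true label."""
    lowest = min(range(len(scores_devadv)), key=scores_devadv.__getitem__)
    return lowest == true_label


# False labels are regenerated in every epoch (hypothetical 3-class toy data).
rng = random.Random(0)
true_labels = [0, 2, 1]
for epoch in range(2):
    false_labels = [sample_false_label(t, 3, rng) for t in true_labels]
    print(epoch, false_labels)
```

Regenerating the labels per epoch means that, over training, the DevAdv model is pushed away from the true label rather than toward any single wrong label, which is why the argmin of its scores is a sensible validation signal.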

Group Discussion: Fine-tuning
For fine-tuning, we adopt an approach inspired by the experiments on human group decision making (i.e., group discussion) used in the original Devil's Advocate work. With the trained models ($\mathrm{Norm}_1$, $\mathrm{Norm}_2$, DevAdv), we design an additional loss function consisting of two terms: a CE term, in which the DevAdv model acts as a constraint, and an MSE term. The model weights of the DevAdv model are fixed to prevent DevAdv from being trained like the Norm models. Also, softmax normalization is not applied to the Norm models' scores, so that they are not confined to the range from 0 to 1 and can become much higher than the normalized DevAdv score. Through the CE term, the DevAdv model hinders the Norm models from being fitted to the true labels. However, during the training process, the Norm models eventually learn to predict the true labels correctly despite the disturbance from the DevAdv model. In the second (MSE) term, each Norm model enhances the others with the information (experience) learned from the first term. This term also protects the models from catastrophic forgetting. With this loss function, we train the models again on the same training set. As a result, we expect the models to develop a more diverse range of views on the data. When reporting the performance on the test set, we follow the soft-voting ensemble but utilize the DevAdv model by using its prediction scores in reverse: $\arg\max\left(\sum_{n=1}^{N} \mathrm{Scores}_{\mathrm{Norm}_n} - \mathrm{Scores}_{\mathrm{DevAdv}}\right)$.
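The final decision rule can be sketched in a few lines. The function name and the example scores are hypothetical, and whether the DevAdv scores are normalized before subtraction at test time is not specified here, so the sketch subtracts them as given.

```python
def devils_advocate_predict(norm_scores, devadv_scores):
    """Sum the Norm models' per-class scores, subtract the DevAdv scores, take argmax."""
    n_classes = len(devadv_scores)
    combined = [
        sum(s[c] for s in norm_scores) - devadv_scores[c]
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=combined.__getitem__)


# Hypothetical scores: the Norm models alone narrowly prefer class 1,
# but DevAdv (trained on false labels) scores class 1 high, i.e. argues against it.
norm_scores = [[2.0, 1.9, 0.1], [1.0, 1.2, 0.3]]
devadv_scores = [0.1, 0.7, 0.2]
prediction = devils_advocate_predict(norm_scores, devadv_scores)
```

Since the DevAdv model is trained to score the true label lowest, subtracting its scores penalizes the classes it favors and leaves the class it rejects relatively untouched.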

Experiment
Data. We use GloVe (Pennington et al., 2014) as pretrained embeddings. To increase model performance, we apply a word vector post-processing method called extrofitting (Jo and Choi, 2018). We prepare 3 topic classification datasets: DBpedia ontology (DBpedia) (Lehmann et al., 2015), YahooAnswers (Yahoo) (Chang et al., 2008), and AG-News. We also prepare 2 sentiment classification datasets: Yelp reviews (Yelp) (Zhang et al., 2015) and IMDB (Maas et al., 2011). The data information is presented in Table 1. Additionally, we separate 15% of the training set of each dataset to create a validation set, which is used for early-stopping. We use all words as inputs, including all special symbols, in a 300-dimensional embedding space.
Classifier. We choose TextCNN (Kim, 2014) as the submodel architecture of our proposed ensemble method. The model has two convolutional layers with 32 channels and 16 channels, respectively. We adopt multiple kernel sizes (2, 3, 4, and 5), followed by ReLU activation (Hahnloser et al., 2000) and max-pooling. We concatenate the outputs of all max-pooling layers. We optimize the model parameters using Adam (Kingma and Ba, 2014) with a learning rate of 1e-3. We use 1 DevAdv model and 2 Norm models by default.
Baseline Implementation. Soft-Voting is implemented by averaging the model prediction scores. Hard-Voting is implemented by selecting the majority prediction. The sampling rate of Bagging is 70% of the training data with replacement, ensuring that all the data are used at least once. Boosting cannot be compared as a baseline because our method consists of a single model architecture. In order to show the difference from Miyato et al. (2016), we report the performance with the embedding perturbation.

Result
The performance of our proposed ensemble methods is presented in Table 2. We confirm that our ensemble method is most effective when the dataset is relatively small. However, our method performs on par with soft-voting even on relatively large datasets. In addition, the performance gap between them on large datasets is within the error bounds.
Figure 1: The performance of Single, DevAdv, $\mathrm{Norm}_1$, and $\mathrm{Norm}_2$ models. We confirm that the DevAdv model provides further improvements to the other models (Single → $\mathrm{Norm}_1$, Single → $\mathrm{Norm}_2$).
We present the performance of the ensemble models on each dataset in Figure 1. With the help of DevAdv, the $\mathrm{Norm}_1$ and $\mathrm{Norm}_2$ models perform much better than the single model on most of the datasets. That is, the scores of the DevAdv model force the other classifiers to improve in order to counteract this noise. This principle is also the core idea behind boosting.
On the IMDB dataset, the DevAdv model does not improve the performance of the Norm models. Since IMDB has only 2 classes, the training process of the DevAdv model is no different from a conventional training process.
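The two-class case can be made concrete with a small sketch. The helper name, the 0-based label indices, and the four-class comparison are hypothetical and used only for illustration.

```python
def possible_false_labels(true_label, num_classes):
    """List every label that could serve as a false label for a given true label."""
    return [label for label in range(num_classes) if label != true_label]


# With 2 classes there is exactly one candidate, so the "random" false label
# is fully determined by the true label.
binary_candidates = possible_false_labels(0, 2)

# With more classes (here a hypothetical 4-class task) the false label can vary
# from epoch to epoch.
multi_candidates = possible_false_labels(0, 4)
```

With a single candidate, the false labels are simply the true labels flipped, so the DevAdv model sees the same fixed targets in every epoch, just as a conventionally trained model would.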

Related Work
Although our boosting method is inspired by the psychological background, Devil's Advocate, its implementation is related to Data Augmentation (in particular, Negative Sampling (Mikolov et al., 2013)), and Adversarial Training in terms of training DevAdv and fine-tuning, respectively.
Data augmentation is used in many machine learning tasks to artificially enlarge the training set. In the text domain, using synonyms (Zhang et al., 2015), back translation (Sennrich et al., 2016), and paraphrasing (Kumar et al., 2019) have been proposed. However, these methods are only moderately effective, since the meaning of words is sensitive to modification. Instead, we use a model trained through negative sampling.
Our method can also be compared to adversarial training, which uses a negative model to make other models more robust to adversarial examples. However, as far as we know, Miyato et al. (2016) is the only work using an adversarial training framework for text classification. They used an adversarial training process at the embedding level, from the beginning of model training. In contrast, our proposed method utilizes a pretrained negative model to fine-tune the other models. Furthermore, our negative model contributes to the final prediction, resulting in further improvements (see Table 3).

Ablation Studies
Training and Inference. The false labels generated artificially serve to augment the data used for training the DevAdv model, which is trained using exclusively false labels. By limiting the number of false labels to 1, we confirm the effect of data augmentation. Table 3 shows that the effect of data augmentation is important when the number of classes in the dataset is large. On the other hand, datasets which have small numbers of classes (e.g., IMDB) are less affected.
Next, we remove the group discussion stage, which fine-tunes the Norm models interactively. By this ablation, we can see the effect of adversarial training, which trains a model in an unconventional way by using a negative model. The group discussion process (adversarial training) shows positive effects on performance except for Yelp. However, the performance gap is within the error range.
Table 3: Ablation studies on the number of fake labels (data augmentation) and the presence of group discussion (adversarial training). We also present the performance when the DevAdv model is not involved.

We also see that our method can be used with more than 3 models (see Table 3). When we use KL divergence instead of MSE in the discussion loss, the performance degrades slightly.
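To make the two alternatives concrete, the sketch below computes MSE and KL divergence between two score vectors. It rests on assumptions that are not confirmed by the text: that the term compares the raw score vectors of two Norm models, and that the KL variant first converts scores to probabilities with a softmax. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a, b):
    """Mean squared error between two raw score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kl_divergence(p_scores, q_scores):
    """KL(p || q) between the softmax distributions of two score vectors.

    Assumes moderate score ranges so that no probability underflows to zero.
    """
    p, q = softmax(p_scores), softmax(q_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Two hypothetical score vectors that differ only by a constant shift.
a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]
print(mse(a, b), kl_divergence(a, b))
```

One difference visible in this sketch is that MSE on unnormalized scores is sensitive to their absolute magnitude, whereas KL divergence after a softmax only compares the relative shape of the two distributions.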
Model Architecture. The small TextCNN (SmallCNN) model uses multiple kernels of sizes [2,3] (instead of [2,3,4,5]). Also, we reduce the channel sizes from [32,16] to [32], leaving only a single convolutional layer. The result is presented in Table 4. We also provide the performance of a Transformer-based model (Vaswani et al., 2017) (see Table 4). The transformer classifier has a maximum sequence length of 512 with 300 embedding dimensions and positional embeddings. It uses 10 attention heads but only 1 encoder layer. Stacking more encoder layers harms the performance. The hyperparameters of these models are the same as those of the main experiment with TextCNN.
Similar to the previous experiment, the performance with other models is on par with soft-voting. Nevertheless, the results indicate that our proposed ensemble (Devil's Advocate) can be applied to any kind of model architecture. It is also interesting that the Transformer overfits on the Yahoo dataset, but DevAdv helps the model generalize.

Conclusion
In this paper, we propose a novel boosting ensemble approach inspired by the Devil's Advocate. In addition to implementing the psychological background, the framework is designed to make the submodels collaborate better with each other.
We first train a model with incorrect labels so that it serves as the Devil's Advocate (DevAdv), and the DevAdv model then interacts with the other conventionally trained models. In the experiments, we show that the DevAdv model improves the performance of the other conventionally trained models.
Although the proposed models do not significantly outperform other ensemble methods, we believe that our new ensemble approach makes valuable contributions to future research: it uses a negative model that takes advantage of data augmentation and adversarial training to provide different views of the same dataset, and it shows that a psychologically motivated idea can be properly applied to the NLP and machine learning domains.