Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers

Large pre-trained language models have shown remarkable performance over the past few years. However, these models sometimes learn superficial features from the dataset and fail to generalize to distributions that differ from the training scenario. Several approaches have been proposed to reduce models' reliance on these bias features, which can improve robustness in out-of-distribution settings. However, existing methods usually use a fixed low-capacity model to deal with various bias features, which ignores the learnability of those features. In this paper, we analyze a set of known bias features and demonstrate that no single model works best for all of them. We further show that by choosing an appropriate bias model, we can obtain better robustness than baselines with a more sophisticated model design.


Introduction
Advances in pre-trained language models have led to great performance on natural language processing (NLP) benchmarks. However, these models often learn dataset-specific patterns and cannot generalize well to out-of-distribution data (McCoy et al., 2019; Niven and Kao, 2019; Si et al., 2019; Ko et al., 2020). These patterns are referred to as bias features: they are strongly indicative of instance labels but do not necessarily generalize to out-of-distribution data (Geirhos et al., 2020). For example, in MNLI (Williams et al., 2018), the appearance of a negation word in an example has a strong correlation with the label "contradiction" (Gururangan et al., 2018). A model leveraging such bias features can exhibit good performance on in-domain data but will break when evaluated on an out-of-distribution test set where the correlation between the patterns and labels no longer holds.
Given prior knowledge of possible bias features in a dataset, several approaches have been proposed to reduce models' overreliance on them (Clark et al., 2019; He et al., 2019; Utama et al., 2020). The underlying idea is to discourage the model from learning from "easy" examples that can be predicted correctly based on bias features alone. These works first train a bias model to capture the bias patterns. They then train a main model and ensemble it with the bias model so that the main model's predictions are adjusted to avoid the strategy captured by the bias model. Product-of-experts (Hinton, 2002) and self-distillation approaches have been widely adopted for the ensemble (Clark et al., 2019; He et al., 2019; Mahabadi and Henderson, 2019; Utama et al., 2020; Du et al., 2021).
Although these methods improve model robustness on some benchmark datasets, none of them study how to choose the bias model; they simply assume that a weak classifier (e.g., logistic regression) can explicitly capture the bias patterns. We argue that choosing an appropriate bias model is important. On one hand, different bias features may not all be captured by one model of a fixed size (capacity). For example, in MNLI, a model capturing the "negation word occurrence" pattern does not necessarily capture the token-overlap pattern at the same time. On the other hand, we do not want an over-capacity bias model, as it may capture non-bias features that will also be factored out during the ensemble, worsening overall performance. To this end, we train bias models with different capacities using the product-of-experts method and compare with current state-of-the-art methods. We empirically show that the differing learnability of the bias patterns requires models with diverse capacities to capture them well. For example, a BERT-mini model learns a token-overlap style bias pattern in the MNLI dataset better than the logistic regression used in the existing literature.
In this work, we conduct a deep analysis of the existing literature on ensemble-based methods for model robustness improvement, most of which adopt a single simple model to learn bias patterns. Instead, we study different bias models and demonstrate that their ability to capture the bias patterns varies. We propose an approach to selecting the "best" bias model by splitting the development set into "easy" and "challenge" subsets. By utilizing the best bias model for each bias feature, we show that we can make models more robust than existing baselines on both natural language inference and fact verification tasks.

Bias Model Selection
To understand to what extent our models can overcome the biases, for each bias we create a bias training dataset consisting only of examples with that bias feature, reducing the impact of other biases. To evaluate a model's ability to overcome the bias, we create a corresponding challenge test set composed of examples that carry that specific bias feature but do not follow the training-set distribution. Examples of such sets are in Sec. 3.3.
To obtain the best bias model for each bias feature, we first train various bias models with different capacities on the bias training set and ensemble them with the main model.We then evaluate the main model on the corresponding challenge set and choose the one exhibiting the best robustness result.Our model pipeline is shown in Figure 1.
In this work, following Clark et al. (2019), we leverage product-of-experts (PoE), a commonly used method for model robustness improvement, as an example to illustrate our argument.[1] Given a dataset D = {(x_i, y_i)}, a bias model g that predicts b_i = ⟨b_{i1}, ..., b_{iC}⟩ and a main model f with parameters θ that predicts p_i = ⟨p_{i1}, ..., p_{iC}⟩, where b_{ij} and p_{ij} are the probabilities predicted for label category j, the goal is to learn θ so that f makes a correct prediction for an input example without using the patterns learned by the bias model. To achieve this goal, PoE (Hinton, 2002) fuses the two models as p̂_i = softmax(log(p_i) + log(b_i)).
[1] Various methods could be chosen for this analysis, such as learned-mixin, the adaptive version of PoE. However, as stated in Clark et al. (2019), the learned-mixin method can be unstable in some cases.

During training, the model is optimized with the cross-entropy loss computed on p̂. After training, only the main model f is used for evaluation.
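As a concrete illustration, the PoE fusion above can be sketched as follows. This is a minimal NumPy sketch of the fusion rule, not the authors' implementation, and the toy probabilities are invented for illustration.

```python
# Minimal sketch of product-of-experts (PoE) fusion: p_hat = softmax(log p + log b).
# Illustrative NumPy version, not the paper's code.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def poe_fuse(p_main, p_bias, eps=1e-12):
    # Adding log-probabilities and renormalizing multiplies the two
    # distributions, so examples the bias model already predicts
    # confidently yield a confident fused prediction and a small gradient.
    return softmax(np.log(p_main + eps) + np.log(p_bias + eps))

# Toy 3-class example (invented numbers): the bias model is confident on class 0.
p_main = np.array([0.5, 0.3, 0.2])
p_bias = np.array([0.8, 0.1, 0.1])
p_hat = poe_fuse(p_main, p_bias)
```

During training the cross-entropy loss is computed on p_hat; at test time only the main model's p_main is used, so the main model is discouraged from relying on patterns the bias model already captures.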

Experiment
In this section, we use two tasks (in English) to study the effects of different bias models. We analyze the best bias model for various bias features.

Dataset
Fact Verification. FEVER (Thorne et al., 2018) is a dataset for the fact verification task. Each instance contains a claim sentence and an evidence sentence. The goal is to verify whether the claim is Supported, Refuted, or NotEnoughInfo given the evidence. We evaluate model robustness on the FEVER-Symmetric dataset (Schuster et al., 2019).
Natural Language Inference. The goal of the natural language inference (NLI) task is to identify the relationship (Entailment, Contradiction, or Neutral) between hypothesis and premise sentences. We use the MNLI dataset (Williams et al., 2018) for training and evaluate model robustness on the HANS dataset (McCoy et al., 2019).

Bias Features
Schuster et al. (2019) show that for the FEVER dataset, using only the claim sentence can obtain results comparable to using both claim and evidence. Hence we use this CLAIM-ONLY feature as one of our bias features. Similarly, we add another bias feature that uses only the evidence sentence to make the prediction, which we refer to as the EVIDENCE-ONLY bias. Clark et al. (2019) list several bias features in the MNLI dataset, such as whether all the tokens of the hypothesis appear in the premise (ALL-IN-P), whether the hypothesis is a subsequence of the premise (H-IS-SUBSEQ), the percentage of words in the hypothesis that also appear in the premise (PERCENT-IN-P), and whether a negation word appears in the hypothesis (NEG-IN-H).

In FEVER, there is no straightforward way to determine whether an example can be predicted from one sentence alone. To create the bias training set for CLAIM-ONLY, we first fine-tune a BERT-base model on the FEVER dataset. We then make predictions based only on the claim sentence and collect all the examples that can be predicted correctly as the bias training set. To build the challenge set, we collect examples from the dev set that cannot be correctly predicted by looking only at the claim sentence. We do the same for the EVIDENCE-ONLY bias in FEVER. For both tasks, the bias training set is obtained from the corresponding task training set and the challenge set is obtained from the task dev (or test) set. Our bias-only models are trained on the bias training set and evaluated on the bias challenge set.
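The CLAIM-ONLY split construction described above can be sketched as below. Here `predict_claim_only` stands in for a BERT-base model fine-tuned on FEVER and queried with the claim alone; all names are illustrative, not from the authors' code.

```python
# Sketch of splitting FEVER examples by the CLAIM-ONLY bias: examples a
# claim-only predictor gets right form the bias training set; examples it
# gets wrong (i.e., the evidence is needed) form the challenge set.
def split_by_claim_only_bias(examples, predict_claim_only):
    bias_train, challenge = [], []
    for ex in examples:
        if predict_claim_only(ex["claim"]) == ex["label"]:
            bias_train.append(ex)   # predictable from the claim alone
        else:
            challenge.append(ex)    # requires the evidence sentence
    return bias_train, challenge

# Toy demo with a hypothetical predictor that always outputs "SUPPORTS":
toy = [{"claim": "a", "label": "SUPPORTS"},
       {"claim": "b", "label": "REFUTES"}]
bias_train, challenge = split_by_claim_only_bias(toy, lambda c: "SUPPORTS")
```

In the paper, the bias training set comes from the task training split and the challenge set from the dev (or test) split; this sketch shows only the filtering criterion.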

Capacity for Bias Models
In this section, we identify the best bias model for each bias feature. We use a BERT-base model as the main model for both the FEVER and MNLI datasets. For the capacity of the bias model, we consider different BERT models (from tiny to base) as well as the one used in the existing literature. For MNLI, a widely used bias model is a logistic regression model trained on predefined bias features (Clark et al., 2019). For FEVER, existing work (Utama et al., 2020) leverages a shallow non-linear classifier; in this work, we use a multilayer perceptron (MLP) model.
Table 1 shows the results on FEVER when we use bias models with different capacities, from MLP to BERT-base. The first row, "None", stands for a vanilla BERT-base model without any bias model. The CLAIM-ONLY column reports performance on the challenge set we create. The results suggest that, to deal with the CLAIM-ONLY bias, using BERT-base as the bias model best improves robustness on the challenge set; at the same time, the model keeps its performance on the in-distribution dev set. Similarly, when dealing with the EVIDENCE-ONLY bias, Table 2 shows that using BERT-tiny as the bias model improves robustness more than the other choices, without a performance loss on the in-distribution data. We compare our methods with two other baselines. The first trains a model on a reweighted dataset, where the weight of example x_i is 1 − b_{iy_i} and b_{iy_i} is the bias model's probability on the correct label y_i (Clark et al., 2019). The second is self-distillation (Utama et al., 2020), where the ensemble is based on knowledge distillation (Hinton et al., 2015); it is a more complicated ensemble method than PoE and requires an additional teacher model. More details are provided in the Appendix.
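The reweighting baseline can be sketched as follows, assuming NumPy arrays of bias-model probabilities; the function names are ours, not from the cited implementations.

```python
# Sketch of the example-reweighting baseline: each example's loss is scaled
# by 1 - b_{i, y_i}, the bias model's probability on the gold label, so
# examples the bias model finds easy contribute less to training.
import numpy as np

def example_weights(bias_probs, gold):
    """bias_probs: (N, C) bias-model probabilities; gold: (N,) label ids."""
    return 1.0 - bias_probs[np.arange(len(gold)), gold]

def weighted_cross_entropy(main_probs, gold, weights, eps=1e-12):
    # Negative log-likelihood of the gold label, scaled per example.
    nll = -np.log(main_probs[np.arange(len(gold)), gold] + eps)
    return float((weights * nll).mean())

# Toy 2-class batch (invented numbers): example 0 is "easy" for the bias model.
bias_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
gold = np.array([0, 0])
weights = example_weights(bias_probs, gold)  # -> [0.1, 0.8]
```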
We also show that by fusing logits from different bias models together, we can further improve model robustness. We compare with existing literature that leverages biases from a fixed model (Utama et al., 2020). In Table 3, PoE MLP refers to a baseline that uses the MLP as the bias model.[2] PoE Ours fuses a BERT-base main model with weighted bias logits from our claim-only and evidence-only bias models, and it outperforms the baselines by over 8% on the test benchmarks.
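Fusing several weighted bias logits in PoE can be sketched as below. The paper does not specify how its weights are set, so the scalar weights here are an illustrative assumption.

```python
# Sketch of PoE with multiple weighted bias models:
# p_hat = softmax(log p + sum_k w_k * log b_k).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_poe(p_main, bias_probs_list, weights, eps=1e-12):
    logits = np.log(p_main + eps)
    for w, b in zip(weights, bias_probs_list):
        logits = logits + w * np.log(b + eps)  # scale each bias model's contribution
    return softmax(logits)

# Toy example standing in for claim-only and evidence-only bias outputs:
p_main = np.array([0.5, 0.5])
b1 = np.array([0.9, 0.1])
b2 = np.array([0.3, 0.7])
p_hat = multi_poe(p_main, [b1, b2], weights=[1.0, 1.0])
```

With unit weights this reduces to multiplying all the distributions and renormalizing; non-unit weights sharpen or soften each bias model's influence.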
For MNLI, we consider the same set of BERT models. Each bias model is trained on the examples that exhibit its bias feature. The results show that our PoE method can outperform self-distillation, which has a more sophisticated ensemble schema.
Insights. We notice that the logits fused into the main model can play a significant role in model performance. For example, we observe that an overconfident bias model can hurt performance on the in-distribution evaluation set. We also find a possible negative effect on robustness when dealing with bias features not related to the test set. For example, HANS contains no instances with the NEG-IN-H bias, and fusing the logits from the NEG-IN-H bias model sometimes hurts performance on HANS (more in the appendix). However, in most cases we do not have access to the bias features of the test set; how to deal with potential conflicts between different bias features remains an unexplored yet important direction, which we leave for future study.

Related Work
Recent NLP models show great performance when evaluated on in-distribution test sets. However, such results might not hold when the test set is out-of-distribution. For example, Schuster et al. (2019) discover that a fact verification model may make a prediction by looking at the occurrence of certain phrases in the input example. Similar scenarios have been observed in other applications such as visual question answering (Agrawal et al., 2018) and paraphrase identification (Zhang et al., 2019). Several ensemble-based methods have shown improvements in model robustness when dealing with dataset biases (Clark et al., 2019; He et al., 2019; Mahabadi and Henderson, 2019; Utama et al., 2020). Such methods usually contain two components for the ensemble and focus on different ensemble strategies. In contrast, we analyze the components of the ensemble themselves. Our work fills a gap in the deeper understanding of the bias patterns in a dataset and provides a pipeline to choose the best component for the ensemble so as to improve model robustness.

Discussion
How to improve model robustness has been an important research topic, and several approaches have been proposed toward this goal, ranging from dataset augmentation (Kaushik et al., 2019) to model architecture design (Lewis and Fan, 2018). Although several ensemble-based methods have shown great improvements, they treat all dataset artifacts in exactly the same way. In this work, we revisit such methods and demonstrate that not all dataset artifacts are the same: they require models of different capacities to deal with. Contrary to the common belief in the existing literature that a small-capacity model suffices to capture the bias features, our paper is the first to investigate the effect of the bias model's size, showing that better robustness is achieved by bias models with different capacities. We also show that by better leveraging the information learned from dataset artifacts, a simple ensemble method can achieve the same or a better level of model robustness.

Limitations
Our study in this paper considers only known bias features. While this setting is common in the literature, we note that in real applications it might be very hard to obtain such information. In addition, this work is based on an ensemble method for robustness improvement, and the bias models selected for PoE might not be the best choice for another method. There are other ways to achieve robustness, such as adversarial training (Grand and Belinkov, 2019); how they can be used to deal with various bias features is left for future study.

Figure 1: An overview of the model pipeline. The bias model is trained on a biased subset of the original training dataset (more details in Sec. 3.3). Blue arrows stand for the gradient flow. Only the main model is used during evaluation.

Table 1 :
Model evaluation results with different bias models for the CLAIM-ONLY bias in FEVER. The values are accuracy scores (average ± standard deviation, in %) over 3 runs on the FEVER dev set and the CLAIM-ONLY challenge set. The best bias model for the CLAIM-ONLY bias is BERT-base.

Table 2 :
Model evaluation results with different bias models for the EVIDENCE-ONLY bias in FEVER. The values are accuracy scores (in %) averaged over 3 runs on the FEVER dev set and the EVIDENCE-ONLY challenge set. To deal with the EVIDENCE-ONLY bias feature, the best bias model is BERT-tiny.
To create a CHALLENGE SET, we filter test examples that carry the corresponding bias feature (e.g., satisfying ALL-IN-P) but whose labels do not follow the bias pattern in the training dataset (e.g., labels other than "entailment").

Table 5 :
Robustness results for combining all bias features in MNLI. BERT-base means the baseline model without fusing a bias model.