Elastic weight consolidation for better bias inoculation

The biases present in training datasets have been shown to affect models for sentence pair classification tasks such as natural language inference (NLI) and fact verification. While fine-tuning models on additional data has been used to mitigate them, a common issue is that of catastrophic forgetting of the original training dataset. In this paper, we show that elastic weight consolidation (EWC) allows fine-tuning of models to mitigate biases while being less susceptible to catastrophic forgetting. In our evaluation on fact verification and NLI stress tests, we show that fine-tuning with EWC dominates standard fine-tuning, yielding models with lower levels of forgetting on the original (biased) dataset for equivalent gains in accuracy on the fine-tuning (unbiased) dataset.


Introduction
A number of recent works have illustrated shortcomings in sentence-pair classification models used for Natural Language Inference (NLI). These arise from limited or biased training data and the lack of a suitable inductive bias in models. Naik et al. (2018) demonstrated that phenomena such as the presence of negation or a high degree of lexical overlap induce misclassifications in models trained on the MultiNLI dataset (Williams et al., 2018). Poliak et al. (2018) and Gururangan et al. (2018) identified biases introduced during dataset construction that were exploited by models to learn associations between the label and only one of the two input sentences without considering the other, known as hypothesis-only bias.
Such biases also affect fact verification (Schuster et al., 2019), typically modeled as a text-pair classification between a claim and evidence retrieved from a trusted source (Pomerleau and Rao, 2017; Thorne et al., 2018).

Figure 1: Hypothesis-only bias in FEVER contributes to low accuracy when testing against counterfactual evidence. This is mitigated by fine-tuning on counterfactual evidence. Catastrophic forgetting from fine-tuning is reduced when using elastic weight consolidation (EWC), preserving the original task accuracy.
To mitigate these undesirable behaviors, Liu et al. (2019a) fine-tune models with a small targeted number of labeled instances to "inoculate" the models. This can be contrasted with methods such as Debiased Focal Loss (DFL) and Product of Experts (PoE) (Mahabadi et al., 2020), which require architectural changes to separately encode and penalize biases. In related work, Suntwal et al. (2019) delexicalize instances, replacing tokens with placeholders to prevent classifiers from exploiting mutual information between domain-specific noun phrases and class labels. Finally, Schuster et al. (2019) re-weight the loss of instances during training, guided by the mutual information between tokens and the instance labels. Each of these methods corresponds to a multi-objective optimization problem where the original dataset accuracy is often sacrificed in favor of accuracy on a different evaluation set.
For fine-tuning specifically, the reduction in accuracy on the original task is called catastrophic forgetting (French, 1999), as parameters needed for the original task are overridden during fine-tuning. In this paper, we show that regularizing fine-tuning with Elastic Weight Consolidation (Kirkpatrick et al., 2017, EWC) minimizes catastrophic forgetting by penalizing weight updates to parameters crucial to modeling the original dataset. EWC has previously been applied to machine translation, where Thompson et al. (2019) and Saunders et al. (2019) minimized catastrophic forgetting for models undergoing domain adaptation. Extending this line of research further, we demonstrate that EWC can be used to mitigate biases present in two sentence-pair classification tasks without architectural changes to the models. We evaluate a number of popular model architectures, trained on MultiNLI and FEVER, and demonstrate that fine-tuning with EWC mitigates model bias while yielding lower reductions in accuracy on the original dataset compared to standard fine-tuning.
In all experiments on the FEVER dataset, fine-tuning with symmetric counterfactual data (Schuster et al., 2019) mitigated hypothesis-only bias, increasing the absolute accuracy of a BERT model by approximately 10%. Without EWC, accuracy on the original dataset was reduced from 87% to 79%, whereas with EWC catastrophic forgetting was mitigated and accuracy was 82%. In all experiments with EWC, the original task accuracy was significantly higher than with unregularized fine-tuning or fine-tuning with L2 regularization. These gains were attained while maintaining similar performance on the fine-tuning data. Plotting the Pareto frontier, we show that equivalent gains in accuracy can be made with less forgetting of the original dataset. Furthermore, we demonstrate that fine-tuning methods can be combined with PoE and DFL, yielding improvements on both the original task and the fine-tuning data used for debiasing. Similar patterns were observed when fine-tuning MultiNLI models with lexical bias evaluation datasets (Naik et al., 2018).

Mitigating biases with fine-tuning
Fine-tuning broadly refers to approaches where a model is initially trained on one dataset and then further improved by training on another. We refer to the fine-tuning training and test data as FT-train and FT-test, respectively. This technique is commonly used to mitigate model biases (Liu et al., 2019a): the original data, while useful for model training, often contain biases, which are addressed by further training the model on a small set of instances targeting them. Fine-tuning to mitigate bias, however, can result in model parameters over-adjusting to these instances, reducing accuracy on the original task, which is referred to as catastrophic forgetting (French, 1999). To ameliorate this issue, one can regularize the parameter updates so that they do not deviate too far from the original training, similar to the intuition behind multi-task training approaches (Ruder, 2017).
Elastic Weight Consolidation (Kirkpatrick et al., 2017, EWC) penalizes parameter updates according to the model's sensitivity to changes in these parameters. The sensitivity is estimated by the Fisher information matrix, which describes the model's expected sensitivity to a change in parameters and, near a (local) minimum of the training loss, is equivalent to the second-order derivative F = E_{(x,y)∼D_original}[∇²_θ log p(y|x; θ)]. When fine-tuning with EWC (which we refer to as FT+EWC), the Fisher information elastically scales the cost of updating each parameter θ_i away from its original value θ*_i, controlled by the hyper-parameter λ, as follows:

L(θ) = L_FT(θ) + (λ/2) Σ_i F_{i,i} (θ_i − θ*_i)²   (1)

For efficiency, we use the empirical Fisher (Martens, 2014): the diagonal elements are approximated by squaring first-order gradients from a sample of instances, recomputed before each epoch. If the Fisher information is not used (i.e., F_{i,i} = 1), Eq. 1 is equivalent to L2 regularization (which we refer to as FT+L2).
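The EWC penalty and the empirical Fisher diagonal can be sketched in a few lines; the following is a minimal NumPy illustration under our own naming, not the paper's implementation:

```python
import numpy as np

def empirical_fisher_diag(per_example_grads):
    """Diagonal of the empirical Fisher (Martens, 2014):
    the mean of squared per-example first-order gradients."""
    g = np.asarray(per_example_grads)
    return np.mean(g ** 2, axis=0)

def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """EWC regularizer added to the fine-tuning loss:
    (lam / 2) * sum_i F_ii * (theta_i - theta*_i)^2."""
    diff = np.asarray(theta) - np.asarray(theta_star)
    return 0.5 * lam * np.sum(np.asarray(fisher_diag) * diff ** 2)
```

Parameters with a large Fisher value are expensive to move away from their original values, while parameters the original task is insensitive to remain free to adapt; setting fisher_diag to all ones recovers the FT+L2 baseline.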

Experimental setup
We assess the application of EWC to minimize catastrophic forgetting when mitigating model biases in the context of two sentence-pair classification tasks: fact verification and natural language inference. We compare the untreated model (original), fine-tuning (FT), FT+EWC, FT+L2, and merging instances from the FT-train dataset into the original training data (Merged). Each model is first trained using the original dataset and splits from the respective task, using the AllenNLP implementations (Gardner et al., 2017) with default hyper-parameters, tokenizing with SpaCy or pre-trained transformer tokenizers. We train five random initializations of each model, reporting the mean accuracy, standard deviation, and p-value from an unpaired t-test. For fine-tuning, the learning rate, regularization strength λ, and number of epochs are selected through 5-fold cross-validation on the FT-train data, selecting the model with the highest FT-train accuracy. Thirty hyper-parameter configurations were evaluated with grid search: 10 choices of regularization strength between 10^6 and 10^8, and 3 choices of learning rate in {2·10^−6, 4·10^−6, 6·10^−6} for transformer models and {2·10^−4, 4·10^−4, 6·10^−4} for ESIM models. We trained the models for a maximum of 8 epochs. For the transformer-based models, the highest cross-validation accuracy on the FT-train dataset was achieved with LR = 4·10^−6, λ = 10^7, and 6 epochs. For the ESIM-based models, the highest FT-train accuracy was achieved with LR = 2·10^−4, λ = 10^7, and 5 epochs. Full hyper-parameter choices are in Appendix D.
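The grid search described above can be sketched as follows. The log-uniform spacing of the 10 regularization strengths is our assumption (the text only gives the endpoints 10^6 and 10^8), and `cv_accuracy` stands in for the 5-fold cross-validation routine:

```python
from itertools import product

# 10 regularization strengths between 1e6 and 1e8 (log-uniform
# spacing assumed) and 3 learning rates (transformer values):
# 30 configurations in total.
lambdas = [10 ** (6 + 2 * i / 9) for i in range(10)]
learning_rates = [2e-6, 4e-6, 6e-6]

def select_config(cv_accuracy):
    """Return the (lr, lambda) pair with the highest mean
    cross-validation accuracy on the FT-train folds."""
    grid = list(product(learning_rates, lambdas))
    return max(grid, key=lambda cfg: cv_accuracy(*cfg))
```

Selecting on FT-train accuracy alone, as the paper does, means the original-task accuracy plays no role in hyper-parameter choice; the trade-off is instead examined post hoc via the Pareto frontier.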
Mitigating hypothesis-only bias in fact verification: The FEVER task (Thorne et al., 2018) is to predict whether short factoid sentences called claims are Supported or Refuted by evidence (in the form of sentences from Wikipedia), or whether there is Not Enough Info (NEI). When training the models, the NEI instances by definition have no evidence: we sample negative instances for these using random sentences from the Wikipedia page closest to the claim under TF·IDF, the same preprocessing as Thorne et al. (2018). Schuster et al. (2019) identified a bias whereby the label for some claims can be predicted without the need for evidence. To evaluate this bias, they released a set of 1420 symmetric counterfactual instances where each claim is supported by one Wikipedia passage and refuted by another (approximately 1% of the FEVER dataset). This mitigates the claim-only bias by reducing the mutual information between claims and labels. The availability of counterfactual data makes it possible to experiment with fine-tuning as a mitigation strategy, using the published dev and test data as FT-train and FT-test respectively. Following Schuster et al. (2019)'s evaluation, we train two ESIM (Enhanced LSTM) variants (Chen et al., 2016) and a BERT (Devlin et al., 2019) transformer. We also evaluate RoBERTa (Liu et al., 2019b), as it has been shown to be more robust to adversarial testing (Bartolo et al., 2020).
Mitigating model limitations in NLI stress tests: The MultiNLI task (Williams et al., 2018) requires systems to predict whether a hypothesis is entailed by a premise. Naik et al. (2018) identify limitations of models trained on this dataset, evaluating 6 phenomena, such as lexical overlap, numerical reasoning, and the presence of antonyms, with 'stress-tests'. In this paper, we report on antonyms and numerical reasoning, as these stress-tests exhibited catastrophic forgetting when used to fine-tune models (Liu et al., 2019a). For this reason, we do not evaluate on HANS (McCoy et al., 2020), as high accuracies can be attained without forgetting. Like Liu et al. (2019a), we mitigate these biases by fine-tuning an ESIM (Chen et al., 2016) model on stress-test data. Each stress-test contains a small number of procedurally generated instances (between 1500 and 9800) that specifically target one of these phenomena. We evaluate FT and FT+EWC using the same methodology, controlling the number of instances, sampled at random, in FT-train, and report the change in accuracy on the FT-test and original test sets.

Fact verification
Fine-tuning the models, rather than merging datasets, yielded the greatest improvements in accuracy on FT-test. All improvements over the untreated model were significant (p < 0.05, denoted #). Without L2 or EWC, catastrophic forgetting occurs due to the shift in label distribution between FEVER and the FT-train dataset, which contains only 2 of the original 3 label classes.

Table 1: Bias mitigation for FEVER classifiers comparing no treatment (original) against merging instances from FT-train with the original task training dataset (Merged) and fine-tuning (with EWC and L2). Improvements with p < 0.05 are marked with the following symbols: * against FT, † against FT+L2, # against original. Deteriorations with p > 0.05 on the symmetric dataset are marked with ♦ against FT and ♥ against FT+L2.

Both L2 and EWC reduced catastrophic forgetting. Improvements on the original task are significant compared to FT (denoted *). However, EWC regularization retained more of the original task accuracy than L2 for all models; this was also significant (denoted †). In all cases, there is a trade-off between original and fine-tuning task accuracies. Without regularization, the FT accuracy on FT-test was higher than FT+L2 and FT+EWC (with the exception of ESIM+ELMo). However, the deterioration from FT when regularizing was not significant (p > 0.05, denoted ♦). Furthermore, for the highest-performing model (RoBERTa base), the deterioration of FT+EWC against FT+L2 was also not significant (denoted ♥).
Training a model where the FEVER training and FT-train were merged yielded modest improvements on the FT-test without harming the original FEVER task accuracy. We attribute this to the impact of these 700 instances being diluted by the large number of training instances in FEVER (FT-train is <1% the size of FEVER).

Combining FT and bias modeling
Fine-tuning can be applied to any task using a small amount of bias-mitigating labeled data, whereas explicit modeling of hypothesis-only biases (Mahabadi et al., 2020) requires architectural changes that are specific to the task and model. We consider two techniques from Mahabadi et al. (2020): Product of Experts (PoE), which combines the predictions of the main model with those of a hypothesis-only classifier, and Debiased Focal Loss (DFL), which explicitly modulates the loss of instances according to the accuracy of a hypothesis-only classifier. For both ESIM and BERT, accuracy when trained with PoE was stable across different choices of hyper-parameters, whereas some hyper-parameter choices for DFL resulted in lower accuracy on both tasks. We report results for BERT+PoE and ESIM+DFL as these performed best.
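PoE combines the two classifiers multiplicatively in probability space, i.e. additively in log-probability space. A minimal NumPy sketch of this standard combination (our illustration, not Mahabadi et al.'s code):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def poe_log_probs(main_logits, bias_logits):
    """Product of Experts: the ensemble's unnormalized
    log-probability is the sum of the two experts'
    log-probabilities; renormalizing gives the combined
    prediction used to compute the training loss."""
    combined = log_softmax(main_logits) + log_softmax(bias_logits)
    return log_softmax(combined)
```

During training the combined prediction feeds the cross-entropy loss, so the main model is pushed to account for what the hypothesis-only expert cannot explain; typically only the main model is used at test time.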
The trade-off between accuracy on the original and FT-test datasets is visualized in Figure 2, indicating the change in accuracy for both bias modeling techniques in isolation, as well as in combination with fine-tuning. This further shows that FT+EWC Pareto-dominates FT for both the ESIM and BERT models. With EWC, equivalent gains on the symmetric FT-test can be attained with a lower reduction in accuracy on the original FEVER task.

Natural Language Inference (NLI)
In a separate experiment, we apply EWC to a different domain, inoculating the ESIM model for natural language inference against the biases reported in Liu et al. (2019a). For both the Antonym and Numerical Reasoning challenges, MultiNLI accuracy was higher with FT+EWC than with FT (dashed lines in Figure 3).
Antonym challenge: The ESIM model was sensitive to fine-tuning, attaining near-perfect accuracy (top row of Figure 3) on the FT-test data.
The antonym stress-test contains only instances labeled contradiction: a shift in label distribution that causes catastrophic forgetting. Without EWC, accuracy on MultiNLI fell to just above chance as the model learned only to predict contradiction (yellow dashed line). With an appropriate EWC penalty, however, the model attained near-perfect accuracy with a smaller reduction in MultiNLI accuracy (purple dashed line).
Numerical reasoning challenge: The ESIM model was also sensitive to fine-tuning when introducing numerical reasoning behaviors. As the shift in label distribution in the inoculation dataset was smaller than for the Antonym dataset, catastrophic forgetting was milder. Nevertheless, FT+EWC minimized catastrophic forgetting at the expense of reduced sample efficiency: accuracy on MultiNLI fell from 77.9% to 75.4% without EWC, but only to 76.8% with EWC.

Conclusions
Fine-tuning can be used to mitigate model bias, but carries the risk that the model catastrophically forgets the data it was originally trained on. Incorporating elastic weight consolidation (EWC) when fine-tuning minimizes catastrophic forgetting, yielding higher accuracy on the original task. We show this holds both for the NLI stress-tests and for debiasing fact-verification systems (Schuster et al., 2019) through fine-tuning.

A Hypothesis-only bias in other fact-verification datasets
We use the FEVER dataset in this paper due to the availability of the symmetric counterfactual data released by Schuster et al. (2019). Other datasets for fake news detection and fact verification have the same task signature (sentence-pair classification with a claim and another sentence), but the second sentence is not evidential. All of these tasks exhibit a hypothesis-only bias, where information from the claim can be used to predict the label without considering the second sentence of the pair.
If there were no mutual information between claims (without evidence) and labels, claim-only accuracy should be 33%. For FEVER, this bias is introduced through the synthetic generation of the claims and is more problematic than the biases that occur in datasets consisting of naturally occurring claims. In Liar and MultiFC, the claims arise from real-world events, and the biases in the data reflect political viewpoints rather than cognitive shortcuts taken by the FEVER annotators.
The RoBERTa model trained on only the claims outperforms the sentence-pair setup for both the MultiFC (Augenstein et al., 2019) and Liar-Plus (Alhindi et al., 2018) tasks, whereas for the FEVER data the sentence-pair accuracy is higher. As FEVER is the only task that requires the use of evidence for classification, this is expected.

B Compute infrastructure
All experiments were performed on a single workstation with a single Xeon E5-2630 CPU, 64GB RAM and an Nvidia 1080Ti GPU.

C Average run-time for fine-tuning
Estimating the Fisher matrix diagonal takes approximately 2 seconds for the ESIM model using 2000 instances sampled from the original dataset. Average training duration (excluding estimating Fisher information) was 11 seconds in total with 700 instances. Estimating the Fisher matrix diagonal takes approximately 25 seconds for the BERT and RoBERTa models using 2000 instances sampled from the original dataset. Average training duration (excluding estimating Fisher information) was 2 minutes 20 seconds in total with 700 instances.

D Hyper-parameter configurations

D.1 Base models
For the base models, the default hyper-parameters in AllenNLP are used for the ESIM, BERT, and RoBERTa models.