Enhancing Transformers with Gradient Boosted Decision Trees for NLI Fine-Tuning

Transfer learning has become the dominant paradigm for many natural language processing tasks. In addition to being pretrained on large datasets, models can be further trained on intermediate (supervised) tasks that are similar to the target task. For small Natural Language Inference (NLI) datasets, language modelling is typically followed by pretraining on a large (labelled) NLI dataset before fine-tuning on each NLI subtask. In this work, we explore Gradient Boosted Decision Trees (GBDTs) as an alternative to the commonly used Multi-Layer Perceptron (MLP) classification head. GBDTs have desirable properties such as good performance on dense, numerical features and are effective where the ratio of samples to features is low. We then introduce FreeGBDT, a method of fitting a GBDT head on the features computed during fine-tuning to increase performance without additional computation by the neural network. We demonstrate the effectiveness of our method on several NLI datasets using a strong baseline model (RoBERTa-large with MNLI pretraining). The FreeGBDT shows a consistent improvement over the MLP classification head.


Introduction
Recent breakthroughs in transfer learning, ranging from semi-supervised sequence learning (Dai and Le, 2015) to ULMFiT (Howard and Ruder, 2018), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), have brought significant improvements to many natural language processing (NLP) tasks. Transfer learning involves pretraining neural networks, often based on the Transformer (Vaswani et al., 2017), on large amounts of text in a self-supervised manner in order to learn transferable language features useful for many NLP tasks. Pretraining is followed by fine-tuning the model on the target task, which is particularly beneficial when labelled data is scarce. Pretrained models can be further trained on intermediate labelled datasets which are similar to the target task (Pruksachatkun et al., 2020). In this manner, the network learns more meaningful internal representations of the input text that are better aligned with the target task. In order to fine-tune the pretrained network, the uppermost layers are replaced by a classification head, usually a randomly initialised Multi-Layer Perceptron (MLP) (Wolf et al., 2019). The input to the classification head is the pooled output of the Transformer, i.e. the hidden state corresponding to the [CLS] token of BERT-like models, which we refer to as features throughout this paper. It is a high-dimensional vector that serves as a rich, distributed representation of the input text.
We investigate whether replacing the commonly used MLP classification head with a GBDT (Friedman, 2001) can provide a consistent improvement, using NLI tasks as our use case. GBDTs are known for strong performance on dense, numerical features (Ke et al., 2019), which includes the hidden states of a neural network. The number of samples n is not necessarily much larger than the number of input features p, i.e. the dimension of the hidden state corresponding to the [CLS] token (n may even be smaller). GBDTs have proven effective for tasks where n < p (Kong and Yu, 2018) and can be more effective than logistic regression if n ≪ p (Couronné et al., 2018). Therefore, for a language model that was trained on an intermediate supervised task before fine-tuning, we hypothesise that a GBDT may be able to outperform an MLP classification head, as the hidden states already encode information relevant to the target task at the start of fine-tuning. The head must learn to exploit this information exclusively during the fine-tuning stage, in which the training data may consist of only a small number of samples. Our contributions are as follows:
• We integrate the GBDT into a near state-of-the-art (SOTA) language model as an alternative to an MLP classification head and train it on the features extracted from the model after fine-tuning. We refer to this method as standard GBDT.
• We introduce a method to train a GBDT on the features computed during fine-tuning, at no extra computational cost by the neural network, showing a consistent improvement over the baseline. We refer to it as FreeGBDT.
In the following, we recap different approaches to integrating tree-based methods with neural networks (Section 2). We introduce our FreeGBDT method in Section 3. We present experiments on standard NLI benchmarks (Section 4). To conclude, we discuss the nature of the improvements and limitations of our method (Section 6).

Related Work
Recent work on transfer learning in NLP has often been based on pretrained transformers, e.g. BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), T5 (Raffel et al., 2019) and RoBERTa. These models are pretrained on large datasets using self-supervised learning, typically a variation of language modelling such as Masked Language Modelling (MLM). MLM consists of masking some tokens as in the Cloze task (Taylor, 1953); the objective of the model is to predict the masked tokens. Recently, approaches using alternatives to MLM, such as Electra (Clark et al., 2020) and Marge (Lewis et al., 2020), have also been proposed. Pretraining transformers on large datasets aims to capture the semantic and syntactic properties of language, which can then be exploited in downstream tasks. The models can additionally be trained in a supervised manner on larger labelled datasets before being fine-tuned on the target task.
Natural Language Inference is one of the most canonical tasks in Natural Language Understanding (NLU) (Nie et al., 2019; Bowman et al., 2015). NLI focuses on measuring commonsense reasoning ability (Davis and Marcus, 2015) and can be seen as a proxy task that estimates the amount of knowledge transferred from the self-supervised phase of training. The task involves providing a premise (also called context) and a hypothesis that a model has to classify as:
• Entailment. Given the context, the hypothesis is correct.
• Contradiction. Given the context, the hypothesis is incorrect.
• Neutral. The context neither confirms nor disconfirms the hypothesis.
The task can also be formulated as binary classification between entailment and not entailment (contradiction or neutral). We focus on NLI as a challenging and broadly applicable NLP task, with multiple smaller evaluation datasets being available as well as the large Multi-Genre Natural Language Inference corpus (Williams et al., 2018, MNLI), which is often used for effective intermediate pretraining. As such, it provides a testing ground for the GBDT classification head with intermediate supervised pretraining.
Tree-based methods. Models based on decision trees have a long history of applications to various machine learning problems (Breiman et al., 1984). Ensembling multiple decision trees via bagging (Breiman, 1996) or boosting (Freund et al., 1999) further improves their effectiveness and remains a popular method for modelling dense numerical data (Feng et al., 2018). Ensemble methods such as Random Forests (Breiman, 2001) and GBDTs combine predictions from many weak learners, which can result in a more expressive model compared to an MLP. There have been several approaches to combining neural networks with tree-based models, which can be roughly divided into two groups:
1. Heterogeneous ensembling: The tree-based model and the neural network are trained independently, then combined via ensembling techniques. Ensembling refers to any method of combining the predictions of multiple models, such as stacking (Wolpert, 1992) or an arithmetic mean of the base models' predictions.
2. Direct integration: The tree-based model is jointly optimised with the neural network.
Heterogeneous ensembling has proven effective for many applications such as Online Prediction (Ke et al., 2019), Learning-to-Rank for Personal Search (Lazri and Ameur, 2018) and Credit Scoring (Xia et al., 2018). It is also suitable for multimodal inputs, e.g. text, images and/or sparse categorical features, as some input types are better exploited by a neural network while others are amenable to tree-based models (Ke et al., 2019).
Direct integration makes the tree-based model compatible with back-propagation, thus trainable with the neural network in an end-to-end manner. Examples include the Tree Ensemble Layer (Hazimeh et al., 2020), Deep Neural Decision Forests (Kontschieder et al., 2015) and Deep Neural Decision Trees. Deep Forests (Zhou and Feng, 2017) are also related, although they aim to create deep non-differentiable models instead. Other examples include driving neural network fine-tuning through input perturbation (Bruch et al., 2020), which focuses specifically on using a tree ensemble to fine-tune the neural network representations.
Algorithm 1: Standard GBDT training procedure. Features are extracted after fine-tuning.

    Require: training data X, pretrained network f_p parametrised by θ_p, classification head f_h parametrised by θ_h
    for each epoch do
        for each batch in X do
            update θ_h and θ_p via backpropagation of loss
        end for
    end for
    features ← empty list
    labels ← empty list
    for each batch in X do
        append the features f_p(batch) and the batch labels to features and labels
    end for
    gbdt ← train_gbdt(features, labels)

As pretrained transformer-based models have recently achieved strong performance on various NLP tasks (Devlin et al., 2018), we see an opportunity to take advantage of their distributed representations by means of a tree-based model as the classification head. Our methods differ from direct integration in that they are not end-to-end differentiable. The training procedure is sequential: the transformer-based model is fine-tuned first, then a GBDT is trained with features extracted from the model. We do not interfere with the model updates during training. Finally, the GBDT replaces the MLP classification head. Our approach is invariant to the method with which the neural network is fine-tuned, as long as there exists a forward pass in which the features are computed. Recent methods for neural network fine-tuning include FreeLB (Zhu et al., 2019) and SMART. FreeLB is an adversarial method, which perturbs the input during training via gradient ascent steps to improve robustness. SMART constrains the model updates during fine-tuning with smoothness-inducing regularisation in order to reduce overfitting. These approaches could in principle be combined with both the standard GBDT and the FreeGBDT.

Algorithm 2: FreeGBDT training procedure. Features are accumulated throughout fine-tuning.

    Require: training data X, pretrained network f_p parametrised by θ_p, classification head f_h parametrised by θ_h
    features ← empty list
    labels ← empty list
    for each epoch do
        for each batch in X do
            append the features f_p(batch) and the batch labels to features and labels
            update θ_h and θ_p via backpropagation of loss
        end for
    end for
    gbdt ← train_gbdt(features, labels)

Figure 1: The baseline model architecture. Feature storage is populated during fine-tuning for the FreeGBDT but after fine-tuning for the standard GBDT method.

Methodology
We introduce the standard GBDT (Algorithm 1) and the FreeGBDT (Algorithm 2), our two methods for training a tree-based classification head on the features produced by the network. 'Features' refers to the hidden state corresponding to the [CLS] token of BERT-like pretrained models. We use these features as training data for the GBDT and FreeGBDT.

The standard GBDT classification head
To train the standard GBDT, we apply the feature extraction procedure shown in Algorithm 1. Using the fine-tuned neural network, we perform one additional forward pass over each sample in the training data. We store the features as training data for the GBDT, denoted 'feature storage' in Figures 1 and 2. The GBDT can then be used as a substitute for the MLP classification head.
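For concreteness, a minimal sketch of this extraction step in Python, assuming a Hugging Face-style RoBERTa classifier and the LightGBM scikit-learn API (names such as `extract_cls_features`, `model` and `train_loader` are illustrative, not from our released code):

```python
import lightgbm as lgb
import torch

@torch.no_grad()
def extract_cls_features(model, dataloader, device="cuda"):
    """One extra forward pass over the training data after fine-tuning;
    the hidden state of the first token (<s>/[CLS]) is the feature vector."""
    model.eval()
    feats, labels = [], []
    for batch in dataloader:
        out = model(input_ids=batch["input_ids"].to(device),
                    attention_mask=batch["attention_mask"].to(device),
                    output_hidden_states=True)
        feats.append(out.hidden_states[-1][:, 0, :].cpu())
        labels.append(batch["labels"])
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Fit the standard GBDT head on the extracted features.
X_train, y_train = extract_cls_features(model, train_loader)
gbdt = lgb.LGBMClassifier(n_estimators=30)  # boosting rounds tuned per task
gbdt.fit(X_train, y_train)
```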

The FreeGBDT classification head
Instead of extracting features once after fine-tuning, the training data for the proposed FreeGBDT is obtained during fine-tuning. The features computed in every forward pass of the neural network are stored as training data, as shown in Algorithm 2. Since no additional computation by the neural network is required, we call this new classification head FreeGBDT. Accumulating features in this manner allows the FreeGBDT to be trained on N × E samples, while the standard GBDT is trained on N samples, where N is the size of the dataset and E is the number of fine-tuning epochs.
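A minimal sketch of the accumulation step, assuming a standard PyTorch fine-tuning loop with `model`, `optimizer`, `scheduler`, `train_loader`, `num_epochs` and `device` already set up (variable names are illustrative); the only change from ordinary fine-tuning is the two `append` calls:

```python
import lightgbm as lgb
import torch

feature_store, label_store = [], []

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        out = model(input_ids=batch["input_ids"].to(device),
                    attention_mask=batch["attention_mask"].to(device),
                    labels=batch["labels"].to(device),
                    output_hidden_states=True)
        # Store the [CLS] features from the forward pass we perform anyway;
        # detach() keeps them out of the autograd graph, so this is 'free'.
        feature_store.append(out.hidden_states[-1][:, 0, :].detach().cpu())
        label_store.append(batch["labels"])
        out.loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# The FreeGBDT sees N x E samples: one feature vector per example per epoch.
X_free = torch.cat(feature_store).numpy()
y_free = torch.cat(label_store).numpy()
free_gbdt = lgb.LGBMClassifier(n_estimators=30).fit(X_free, y_free)
```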

Experimental Setup
We now describe the models, training procedure and evaluation datasets.

Datasets
We evaluate our methods on the following NLI datasets, summarised in Table 1.
• CommitmentBank (CB) (de Marneffe et al., 2019). We use the subset of the data used in SuperGLUE.
• Counterfactual NLI (CNLI) (Kaushik et al., 2019). The CNLI corpus consists of counterfactually-revised samples of SNLI (Bowman et al., 2015). We use the full dataset, i.e. samples with the revised premise and with the revised hypothesis.
• The Adversarial NLI (ANLI) (Nie et al., 2019) corpus consists of three rounds of data collection. In each round, annotators try to break a model trained on data from the previous rounds.
• Recognizing Textual Entailment (RTE), a binary NLI task from the GLUE benchmark.
• Question NLI (QNLI), derived from SQuAD (Rajpurkar et al., 2016), aiming to determine whether a given context contains the answer to a question.

Model and Training
We start all experiments from the RoBERTa-large model with intermediate pretraining on the Multi-Genre Natural Language Inference (MNLI) corpus (Williams et al., 2018). The MNLI checkpoint is provided by the fairseq library. Note that no task-specific tuning of hyperparameters was performed. Instead, we use one learning rate cycle (Smith, 2017) with a maximum learning rate of 1 × 10^-5 for each task to fine-tune RoBERTa for 10 epochs with a batch size of 32. We use the Adam optimiser (Kingma and Ba, 2014) to optimise the network. In order to compare FreeGBDTs with standard GBDTs, we apply Algorithm 1 and Algorithm 2 during the same fine-tuning session to eliminate randomness from different model initialisations.
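As a concrete illustration, a minimal sketch of this schedule in PyTorch (a hedged approximation; the exact fairseq scheduler and its warmup settings may differ):

```python
import torch

# Assumes `model` and `train_loader` are already defined.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# One learning-rate cycle (Smith, 2017) over the full 10-epoch run.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-5,
    total_steps=10 * len(train_loader),  # 10 epochs, batch size 32
)
```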
We use LightGBM to train the GBDT. We do not manually shuffle the data before training. The individual trees of a GBDT are learned in a sequence where each tree is fit on the residuals of the previous trees. One important parameter of the GBDT is thus the number of trees to fit, commonly referred to as the number of boosting rounds.
We observe that the optimal number of boosting rounds varies significantly across tasks, with a tendency towards more boosting rounds for larger datasets. Thus, we select the number of boosting rounds from the set {1, 10, 20, 30, 40} for each task. This is the only task-specific hyperparameter in our experiments; all other hyperparameters are identical across tasks.
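A minimal sketch of this selection with the LightGBM scikit-learn API, assuming feature matrices `X_train`/`X_dev` and labels `y_train`/`y_dev` produced as in the extraction sketch above (variable names are illustrative):

```python
import lightgbm as lgb
from sklearn.metrics import accuracy_score

best_rounds, best_acc = None, -1.0
for rounds in (1, 10, 20, 30, 40):
    # n_estimators is LightGBM's name for the number of boosting rounds.
    gbdt = lgb.LGBMClassifier(n_estimators=rounds).fit(X_train, y_train)
    acc = accuracy_score(y_dev, gbdt.predict(X_dev))
    if acc > best_acc:
        best_rounds, best_acc = rounds, acc
```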

Evaluation
Development set. We evaluate our methods using accuracy. Each experiment is repeated 20 times with different random seeds. We report the mean and standard deviation.
Test set. We select the GBDT with the best score on the development set. Test scores are obtained with a submission to the SuperGLUE benchmark for CB and the GLUE benchmark for RTE and QNLI. We calculate the test scores on ANLI and CNLI ourselves, as the test labels are publicly available. We report accuracy on the test set for each task except for CB, where we report the mean of F1 score and accuracy, as on the SuperGLUE leaderboard.
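For CB, that combined score can be computed as in the following sketch, which assumes the macro-averaged F1 used on the SuperGLUE leaderboard:

```python
from sklearn.metrics import accuracy_score, f1_score

def cb_score(y_true, y_pred):
    """Mean of (macro) F1 and accuracy, as reported for CB on SuperGLUE."""
    return 0.5 * (f1_score(y_true, y_pred, average="macro")
                  + accuracy_score(y_true, y_pred))
```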

Results and Analysis
We summarise the results on the development sets in Table 2. The FreeGBDT is compared with a standard GBDT and the MLP classification head. The standard GBDT achieves a higher score than the MLP head on 1 out of 5 tasks. The FreeGBDT outperforms the standard GBDT on 5 out of 5 tasks and the MLP head on 4 out of 5 tasks. As recommended for statistical comparison of classifiers across multiple datasets (Demšar, 2006), we conduct a Wilcoxon signed-rank test (Wilcoxon, 1992) with the accuracy differences between the FreeGBDT and the MLP across the 20 seeds and 5 datasets. The test confirms that the improvement from our FreeGBDT method is significant with p ≈ 0.01.

Figure 3: Accuracy on RTE and CNLI development sets. Training is paused after each epoch to compare the GBDT, FreeGBDT and MLP heads. We plot the mean from 20 runs (same hyperparameters but different seeds).

Results on the test sets are shown in Table 4. The FreeGBDT achieves a small but consistent improvement over the MLP head on each task except QNLI. This task is not a conventional NLI task but a question-answering task converted to an NLI format (Demszky et al., 2018). It has been shown that QNLI does not benefit from MNLI pretraining, hence this result is not unexpected. Out of the four datasets which do benefit from MNLI pretraining, the FreeGBDT improves over the MLP head on each one, with an average score difference of +0.23%. As our experiments start from a competitive baseline, RoBERTa-large with MNLI pretraining, we consider the results important because (a) to the best of our knowledge, this is the first tree-based method that achieves near state-of-the-art performance on benchmark NLI tasks and (b) our method is 'free' as it requires no additional computation by the model. We demonstrate that a FreeGBDT head can be successfully integrated with modern transformers and is a good alternative to the commonly used MLP classification head.
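The significance test described above amounts to a few lines; a minimal sketch with SciPy, assuming the per-run development accuracies are collected in two arrays of shape (20 seeds, 5 datasets):

```python
import numpy as np
from scipy.stats import wilcoxon

# acc_free, acc_mlp: development accuracies per (seed, dataset) pair.
diffs = (np.asarray(acc_free) - np.asarray(acc_mlp)).ravel()
statistic, p_value = wilcoxon(diffs)  # paired, two-sided by default
```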
For the Adversarial NLI (ANLI) dataset, we report results which are competitive with the state-of-the-art, shown in Table 5, surpassing both SMART and ALUM. The RoBERTa-large model pretrained with SNLI, MNLI, FEVER (Thorne et al., 2018) and ANLI reported 53.7% accuracy on the ANLI test set (Nie et al., 2019). The state-of-the-art result of 58.3% accuracy on the ANLI dataset was achieved by InfoBERT. The FreeGBDT achieves a new state-of-the-art on the A2 subset of ANLI with 52.7%. Interestingly, it does not yield an improvement on the easier A1 subset but compares favourably to other recent approaches on the more difficult A2 and A3 subsets of ANLI.
To better understand how the performance of the GBDTs evolves during fine-tuning, we carry out an additional experiment. We pause training after each epoch to extract features and train a standard GBDT. We compare it with the MLP classification head and a FreeGBDT trained on the features accumulated up to the current epoch. The result on the RTE and CNLI datasets is shown in Figure 3. Notably, the FreeGBDT does not improve on the standard GBDT after the first epoch, where the number of instances the GBDT is trained on is equal to the size of the training dataset for both. As the FreeGBDT starts accumulating more training data, however, it consistently outperforms the standard GBDT and, eventually, the MLP head.

Table 5: Accuracy across different rounds of the ANLI test set. All denotes a sample-weighted average. Our FreeGBDT achieves SOTA on the A2 subset. InfoBERT is the SOTA on the full test set.

Model | A1 | A2 | A3 | All
SMART | 72.4 | 49.8 | 50.3 | 57.1
ALUM | 72.3 | 52.1 | 48.4 | 57.0
InfoBERT | 75.0 | 50.5 | 49.8 | 58.3
RoBERTa-large-mnli + FreeGBDT (ours) | 71.9 | 52.7 | 49.7 | 57.6

The state-of-the-art in NLI provides some context for our method of combining tree-based models with modern neural networks. RoBERTa-large with MNLI pretraining reports 89.5% accuracy on the development set of RTE and 94.7% accuracy on the development set of QNLI. The same model obtains an F1 score / accuracy of 90.5/95.2 on the CB test set and an accuracy of 88.2% on the RTE test set. Note that ensembles of 5 to 7 models were used for those test scores, while we achieve similar figures of 91.3/95.2 for CB and 87.8% for RTE with a single model. These are not direct comparisons; however, the figures demonstrate that the FreeGBDT can operate at SOTA levels while matching and exceeding the accuracy of the 'default' MLP head. Across all datasets, the FreeGBDT improves by an average of 0.2% and 0.5% over the MLP head and the standard GBDT head, respectively.

Discussion
The FreeGBDT improves over the MLP head on each task where intermediate supervised pretraining on MNLI is effective. The improvement is significant but not large. This is expected, since the input features of the classification head are already a highly abstract representation of the input; there is thus limited potential for improvement. However, our results show that a tree-based method is a viable alternative to the commonly used MLP head and can improve over a baseline chosen to be as competitive as possible. Notably, the FreeGBDT improves the MLP baseline on the CB dataset by > 0.6% solely by switching to our tree-based classification head. Furthermore, the FreeGBDT outperforms a standard GBDT by a large margin in some cases. For instance, we observe a +1.5% improvement on the CNLI dataset. Figure 3 shows the gap forming towards the end of training. We think this may be due to overfitting to the training data. Recall that the standard GBDT is trained only on features extracted after fine-tuning. At this point, the features may exhibit a higher degree of memorisation of the training data. The FreeGBDT mitigates this problem as it is trained with features collected throughout training. Let f(x, θ_t) denote a mapping from the input text x to the output space, parametrised by θ_t where t ∈ {0, ..., T} and T is the total number of steps the model is fine-tuned for. Then, the standard GBDT is trained on features from f(x, θ_T), while the FreeGBDT is trained on features from every t in {0, ..., T}. As such, it may help to think of the FreeGBDT as benefiting from a type of regularisation through data augmentation: it has trained on several perturbed views of each training instance.
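In this notation, the two training sets can be written as follows (a sketch; we introduce t_e(i) for the step at which example x_i is visited in epoch e):

```latex
\mathcal{D}_{\mathrm{GBDT}} = \bigl\{\, (f(x_i, \theta_T),\; y_i) \,\bigr\}_{i=1}^{N},
\qquad
\mathcal{D}_{\mathrm{FreeGBDT}} = \bigcup_{e=1}^{E} \bigl\{\, (f(x_i, \theta_{t_e(i)}),\; y_i) \,\bigr\}_{i=1}^{N}
```

so that |D_GBDT| = N while |D_FreeGBDT| = N × E, matching the sample counts given in Section 3.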
Figure 4 helps illustrate the regularisation effect by showing the differences between the FreeGBDT and standard GBDT training data beyond just size. The figure shows the temporal changes in a typical feature collected during fine-tuning and the same feature extracted after fine-tuning. We can see that the distribution gradually changes from the earlier epochs but remains similar to the distribution of the feature at the end of fine-tuning. The FreeGBDT is able to exploit the information available at the start of fine-tuning: the features at t = 0 already encode information highly relevant to the target task, hence all of the accumulated training data is useful. The FreeGBDT head thus compares favourably to an MLP head, which is randomly initialised at the start of the fine-tuning stage and must learn to exploit the latent information from a potentially small number of training examples.

Figure 4: Values of a typical dimension from the 1,024-dimensional vector stored during fine-tuning to train a FreeGBDT (left). The same dimension extracted after fine-tuning to train a standard GBDT (right).

Therefore, we believe intermediate supervised pretraining is essential for the effectiveness of the FreeGBDT. This is supported by the results of preliminary experiments on the BoolQ dataset (Clark et al., 2019) and on QNLI, which does not benefit from pretraining on MNLI; in these experiments, the FreeGBDT matches the accuracy of the MLP head but does not exceed it. Our experiments also suggest that the potential for improvement from the FreeGBDT depends on the size of the training dataset. The gap between the FreeGBDT and the MLP head in Table 4 is larger for the smaller datasets (CB and RTE) and smaller for the larger datasets (ANLI, CNLI, QNLI). This is consistent with prior work showing that GBDTs are especially effective compared to other methods if n ≪ p (Couronné et al., 2018) and hints that the FreeGBDT method might be especially useful for smaller datasets.

Future Work
One possible avenue for future work is exploring different features to train the GBDT, e.g. the hidden states from different layers of the pretrained model. This includes new combinations of top layer representations of the Transformer to generate richer input features for the classification head, which could lead to improvement by leveraging a less abstract representation of the input. Given that our method operates on distributed representations from a pretrained encoder, applications in other domains such as Computer Vision may be possible, e.g. using features extracted from a ResNet (He et al., 2016) encoder. Furthermore, a GBDT might not be the best choice for every task; hence, Random Forests (Breiman, 2001) or Support Vector Machines (Cortes and Vapnik, 1995) may also be evaluated to investigate the effectiveness of combining Transformer neural networks with traditional supervised learning methods.

Conclusion
State-of-the-art transfer learning methods in NLP are typically based on pretrained transformers and commonly use an MLP classification head to fine-tune the model on the target task. We have explored GBDTs as an alternative classification head due to their strong performance on dense, numerical data and their effectiveness when the ratio of samples to features is low. We have shown that tree-based models can be successfully integrated with transformer-based neural networks and that the free training data generated during fine-tuning can be leveraged to improve model performance with our proposed FreeGBDT classification head. Obtaining consistent improvements over the MLP head on several NLI tasks confirms that tree-based learners are relevant to state-of-the-art NLP.