Striking a Balance: Alleviating Inconsistency in Pre-trained Models for Symmetric Classification Tasks

While fine-tuning pre-trained models for downstream classification is the conventional paradigm in NLP, task-specific nuances are often not captured in the resultant models. Specifically, for tasks that take two inputs and require the output to be invariant to the order of those inputs, the predicted labels or confidence scores are often inconsistent. We highlight this model shortcoming and apply a consistency loss function to alleviate inconsistency in symmetric classification. Our results show improved consistency in predictions for three paraphrase detection datasets without a significant drop in accuracy. We examine classification performance on six datasets (both symmetric and non-symmetric) to showcase the strengths and limitations of our approach.


Introduction
Symmetric classification tasks involve two inputs and require that the model output be independent of the order in which the two input texts are given. In other words, the output of the classifier should be the same, and the confidence score should not differ significantly, if the inputs X and Y are instead supplied as Y and X. Paraphrase detection and multi-lingual semantic similarity are examples of symmetric classification tasks. Although attention-based (Bahdanau et al., 2015; Vaswani et al., 2017) pre-trained language models have led to significant performance gains in multiple text classification tasks, they demonstrate a peculiar erratic behaviour on symmetric classification: inconsistency. An example¹ of inconsistency for paraphrase detection is shown in Figure 1. Additional examples can be found in the Appendix (Table 4).

Figure 1: Impact of reordering an example input pair (X and Y) on standard fine-tuned BERT and BERT-with-consistency-loss. X: "A provisional government or a revolutionary government has been declared several times by insurgent groups in the Philippines." Y: "A revolutionary government or a provisional government has been declared several times in the Philippines by insurgent groups." The pair are true paraphrases. ✓ and ✗ denote that the model predicted them to be paraphrases and not-paraphrases, respectively. Confidence scores are reported in brackets. Details in Section 1.

To alleviate such inconsistency for symmetric classification tasks, we propose a simple additional
drop-in fine-tuning objective, based on either the Kullback-Leibler (KL) or Jensen-Shannon (JS) divergence (or any f-divergence (Rubenstein et al., 2019)), added to the cross-entropy loss for symmetric tasks. We refer to this as the consistency loss. The main contributions of this paper are: (a) highlighting inconsistency issues in symmetric classification tasks, (b) describing a consistency loss function to alleviate inconsistency, and (c) demonstrating the applicability and limitations of the loss function via qualitative and quantitative analyses on tasks from the GLUE benchmark. Additionally, to drive future research, we have made the data and code public². Note: the problem of inconsistency can be attributed in part to the positional embeddings. However, it has been shown that eliminating positional embeddings results in poor model performance (Wang and Chen, 2020; Wang et al., 2021).

Related Work
Pre-trained Classification Models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2020) are typically fine-tuned for classification tasks by attaching a low-capacity neural network classifier to the representation of the model's first token (typically the [CLS] token). We demonstrate the inconsistency that arises in symmetric classification tasks for pairs of inputs, depending on the order of the inputs. To the best of our knowledge, this is the first work that incorporates task-specific nuances to ensure consistency in symmetric classification. Consistency Loss has been used in style transfer tasks to minimize the distance between round-trip generations of candidates for image-to-image translation (Zhu et al., 2017) and text style transfer (Huang et al., 2020). In a similar vein, we apply a consistency loss (formulated as either the Kullback-Leibler or the Jensen-Shannon divergence) to alleviate the inconsistency problem in symmetric tasks.
Embedding-based Semantic Similarity Scores based on BERT-style models like SBERT (Reimers and Gurevych, 2019; Thakur et al., 2021) can map surface-form realizations to embeddings. Their performance is worse than directly using BERT-style cross-encoder models for tasks such as semantic similarity (Thakur et al., 2021). However, the primary aim of such embedding-based scorers is orthogonal and, at best, complementary to the goal of our work, since we want to ensure high-performing, consistent classifiers. Similarly, an alternative for symmetric classification is to separately obtain predictions for (X, Y) and (Y, X) and then average the confidence scores at test time. However, this is a weakly grounded, heuristic-driven approach: in general, averaging does not rectify the mistakes made by the model, it only masks them.

Problem Description
A. Given a pair of input sentences (X, Y), a label l_(X,Y), and a pre-trained BERT-based model M_PRE, the goal is to output a reliable model M_REL that predicts an output label for a new input pair (X_test, Y_test) such that the inconsistency between the two orderings of the pair is minimized.

For problem A (Section 3.1), the input is a concatenation of the tokenized strings X = x_1, ..., x_m and Y = y_1, ..., y_n, separated using a special token ([SEP] in the case of BERT). The concatenated inputs with the special token are passed through multiple self-attention layers (Vaswani et al., 2017). In the traditional approach, the representation of the first token (<s> or [CLS]) is passed through a fully connected classifier layer (the same final representation is used irrespective of the arity of the task inputs). In our approach, we use the [CLSPara] representation for symmetric classification tasks, whereas we use the standard first-token (<s> or [CLS]) representation for single-input and non-symmetric classification tasks (Section 4). Since we first fine-tune the model on the [CLSPara] representation, our approach allows pair-wise knowledge to be transferred to other downstream classification tasks (problem B (Section 3.1)).
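As an illustration, the input packing described above (an extra [CLSPara] token alongside the standard special tokens) can be sketched as follows. The helper name and the use of plain token strings are ours, for exposition only; the actual implementation operates on subword token ids.

```python
def build_symmetric_input(x_tokens, y_tokens):
    # Standard BERT-style packing of a sentence pair, with the extra
    # [CLSPara] token inserted after [CLS]; its final-layer representation
    # is the one fed to the classifier for symmetric tasks.
    return ["[CLS]", "[CLSPara]"] + x_tokens + ["[SEP]"] + y_tokens + ["[SEP]"]

# The L2R and R2L orderings of the same pair:
l2r = build_symmetric_input(["how", "old", "are", "you", "?"],
                            ["what", "is", "your", "age", "?"])
r2l = build_symmetric_input(["what", "is", "your", "age", "?"],
                            ["how", "old", "are", "you", "?"])
```

Note that the two orderings contain exactly the same tokens; only their positions differ, which is why positional embeddings are a source of the inconsistency.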
We call this method BERT-with-consistency-loss; it is shown in Figure 2. Contrasting this with a traditional BERT-based approach: in the traditional approach, the input is pre-pended with a special symbol ([CLS] in the case of BERT and <s> in the case of RoBERTa). In BERT-with-consistency-loss, we concatenate an extra symbol, which we call [CLSPara], with this special symbol. This extra token is specifically used for symmetric classification tasks to ensure consistency of prediction.

The standard objective used for fine-tuning BERT-based models is the cross-entropy loss, which maximizes the probability of predicting the correct output class for a given input:

L_CE = -Σ_i y_i log(ŷ_i)    (1)

where y is the one-hot representation of the target class, ŷ is the softmax output of the model, and i is the associated co-ordinate. As described earlier, this objective may produce an inconsistent prediction based on the order of the two inputs. To overcome this weakness, we propose an additional consistency loss formulated in terms of either the KL or the JS divergence. We pass the inputs X and Y through the same model twice, once as the pair (X, Y) (called L2R) and then as the pair (Y, X) (called R2L). Having obtained the outputs from the model for L2R and R2L, the final objective function is:

L = L_CE + λ · D(p_L2R, p_R2L)    (2)

where λ is the weight assigned to the consistency loss, p_L2R and p_R2L are the associated confidence/softmax vectors assigned by the model for the L2R and R2L sentence pairs, and D is either the KL divergence, D_KL(p_L2R ∥ p_R2L), or the JS divergence, ½ D_KL(p_L2R ∥ m) + ½ D_KL(p_R2L ∥ m) with m = ½ (p_L2R + p_R2L).

Datasets

A. For symmetric tasks: the paraphrase detection datasets (i) QQP, (ii) PAWS, and (iii) MRPC. B. For single-input tasks: (i) SST-2 (Socher et al., 2013). This is a collection of human-annotated movie reviews. We work with the standard two-class setting where the annotations have opposite polarities (1 for positive sentiment and 0 otherwise). C. For non-symmetric tasks: (i) QNLI: a Natural Language Inference dataset constructed from SQuAD (Rajpurkar et al., 2016), framed as a two-class classification problem to determine whether the premise entails the hypothesis, and (ii) RTE.
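The combined objective (cross-entropy plus a λ-weighted divergence between the softmax outputs of the two orderings) can be sketched in PyTorch as below. The function name and the `batchmean` reduction are our choices for illustration; the reduction used in the original implementation may differ.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_l2r, logits_r2l, targets, lam, divergence="kl"):
    """Cross-entropy on the L2R ordering plus lam times a divergence
    between the softmax distributions of the (X, Y) and (Y, X) passes."""
    ce = F.cross_entropy(logits_l2r, targets)
    p = F.softmax(logits_l2r, dim=-1)
    q = F.softmax(logits_r2l, dim=-1)
    if divergence == "kl":
        # F.kl_div(input, target) computes KL(target || input) with
        # log-space input, so this is D_KL(p || q).
        d = F.kl_div(q.log(), p, reduction="batchmean")
    else:  # Jensen-Shannon divergence
        m = 0.5 * (p + q)
        d = 0.5 * F.kl_div(m.log(), p, reduction="batchmean") \
          + 0.5 * F.kl_div(m.log(), q, reduction="batchmean")
    return ce + lam * d

logits = torch.tensor([[2.0, 0.5], [0.1, 1.3]])
targets = torch.tensor([1, 0])
# When both orderings produce identical logits, the divergence term
# vanishes and the loss reduces to plain cross-entropy.
loss = consistency_loss(logits, logits, targets, lam=5.0)
```

The model is run twice per example (L2R and R2L), so each training step costs roughly two forward passes.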

Evaluation
We analyse the results of the traditional objective as well as our approach on BERT-BASE and ROBERTA-BASE across four different seeds under the following categories:

1. Prediction Consistency: This evaluation is done only for the symmetric tasks. Score = (# of samples with l_L2R = l_R2L) / (# of samples) × 100, where l_L2R and l_R2L denote the predicted labels for the L2R and R2L orderings, respectively. Note that this metric does not depend on the ground-truth labels.
2. Confidence Consistency: We perform these evaluations specifically for the symmetric tasks, to analyze how well aligned the confidences (softmax output associated with label 1) predicted by the model for the L2R and R2L settings are. The metrics used are the Pearson correlation (scaled by 100) and the mean squared error (MSE, scaled by 1000) between the two confidence scores on the test data.
3. Standard Classification Metrics: These are task-specific metrics (accuracy/F1) used in the standard GLUE tasks (Wang et al., 2019).
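The two consistency metrics above are straightforward to compute; a plain-Python sketch (function names ours):

```python
import math

def prediction_consistency(labels_l2r, labels_r2l):
    # Percentage of test pairs whose predicted label is unchanged
    # when the input order is swapped (ground truth is not used).
    agree = sum(a == b for a, b in zip(labels_l2r, labels_r2l))
    return 100.0 * agree / len(labels_l2r)

def confidence_consistency(conf_l2r, conf_r2l):
    # Pearson correlation (x100) and MSE (x1000) between the
    # label-1 confidence scores of the two orderings.
    n = len(conf_l2r)
    mx, my = sum(conf_l2r) / n, sum(conf_r2l) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(conf_l2r, conf_r2l))
    sx = math.sqrt(sum((x - mx) ** 2 for x in conf_l2r))
    sy = math.sqrt(sum((y - my) ** 2 for y in conf_r2l))
    pearson = 100.0 * cov / (sx * sy)
    mse = 1000.0 * sum((x - y) ** 2 for x, y in zip(conf_l2r, conf_r2l)) / n
    return pearson, mse
```

A perfectly consistent model scores 100 on prediction consistency and Pearson correlation, and 0 on MSE.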

Implementation Details
To fine-tune the model for symmetric classification tasks, we combine three paraphrase detection datasets: (a) QQP, (b) PAWS, and (c) MRPC. To make sure that all the models see the same data, we augment the dataset with its reversed samples during training. The model is then trained by passing the [CLSPara] (Section 3.2) representation through a low-capacity classifier, optimized using Equation 1 for the baseline models and Equation 2 for the consistency-inducing models (Ours). We then use these models to conduct two sets of evaluations. We first evaluate the paraphrase detection results on QQP, PAWS, and MRPC individually. We then take the fine-tuned model obtained above and additionally fine-tune it (on the [CLS] or <s> token) on the single-input task (SST-2) and the non-symmetric tasks (QNLI, RTE).

We use the HuggingFace transformers library (Wolf et al., 2020) for tokenizing the input, and the pytorch-lightning framework (Falcon et al., 2019) for loading the pre-trained models and fine-tuning them. We optimize the objective using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 2e-5 (chosen via hyperparameter tuning over {2e-4, 2e-5, 4e-5, 2e-6}). Since the input contains an additional token, [CLSPara], we extend the tokenizer vocabulary for each of the models. Each model was fine-tuned on a single Nvidia 1080Ti GPU (12 GB) for a maximum of 3 epochs (≈ 6 hrs/experiment). In the case of BERT (Devlin et al., 2019), we use the bert-base-cased model, while for RoBERTa (Liu et al., 2020), we use the roberta-base model.

For training stability, we perform lambda annealing, i.e., we increase the λ parameter from 0.0 to 100.0 as training progresses. This ensures that the model has developed the capability to classify sentence pairs with some degree of correctness before being made to adhere to the appropriate symmetric confidence scores. We also experimented with a fixed λ, but the resultant models were slow to converge (≈ 15 epochs).
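The λ-annealing schedule can be sketched as follows. The paper states only the endpoints (0.0 to 100.0 over the course of training), so the linear ramp and per-step granularity here are assumptions:

```python
def annealed_lambda(step, total_steps, lam_max=100.0):
    # Ramp the consistency-loss weight linearly from 0.0 to lam_max,
    # so the model first learns to classify sentence pairs before being
    # pushed toward order-symmetric confidence scores.
    return lam_max * min(1.0, step / max(1, total_steps))
```

At each training step, the returned value is used as λ in Equation 2.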

Results
Our experiments address three questions: Q1. What are the shortcomings of the current objective function for symmetric classification tasks? (Section 1, Section 5.2) Q2. Does adding the consistency loss alleviate the inconsistency problem? (Section 5.1) Q3. Can consistency-based fine-tuning improve other downstream tasks? (Section 5.1)

Our models achieve improved prediction consistency (indicated by higher scores in part (A)) and confidence consistency (indicated by higher correlation in part (B)) as compared to the base models (indicated by -BASE), for both base models (BERT-BASE/ROBERTA-BASE) and all symmetric test data sets (QQP, PAWS, MRPC). Moreover, the MSE (indicated within square brackets in part (B)) with consistency training is an order of magnitude smaller than without it. The improvements in part (A) are statistically significant at a significance level (α) of 0.01 according to McNemar's statistical test (Dror et al., 2018).

Quantitative Analysis
Part (C) shows the results of downstream fine-tuning. Our models (indicated by W/*) do not compromise significantly (evaluated statistically) on the classification metrics for QQP, PAWS, and MRPC (F1/accuracy). The consistency loss does not change the accuracy scores on the single-sentence input task (SST-2), but it affects the non-symmetric tasks (QNLI, RTE) negatively. This seems natural since the final objectives of the two kinds of tasks are quite different and, in many cases, uncorrelated or negatively correlated. Incorporating the consistency loss before fine-tuning on non-symmetric tasks (such as entailment) should, therefore, be avoided. Limitations: Our goal is to increase the reliability of the model (measured in terms of confidence scores), not to specifically target classification performance metrics like accuracy and F1. Cases where these metrics increase can only partially be attributed to the stricter consistency constraint.

Qualitative Analysis
We sample 30 instances that were assigned opposite labels for L2R and R2L by the BERT-BASE models (majority voting) for QQP, MRPC, and PAWS. An evaluator with NLP expertise analysed these examples and grouped them into recall error types. We then check the predictions for the same set of instances from BERT + JS (recall). Counts for these error types (defined in Section 7.1) are shown in Table 3.

Conclusion
In this paper, we proposed an additional objective, a consistency loss between L2R and R2L predictions, to alleviate the problem of input order-sensitive inconsistency in symmetric classification tasks. For three symmetric classification tasks, our proposed solution, BERT-with-consistency-loss, results in improved consistency in terms of Pearson's correlation and MSE. As expected, the consistency loss results in a drop in performance on non-symmetric classification tasks such as QNLI and RTE. Surprisingly, using the KL divergence results in marginally higher consistency than its JS counterpart. We leave this analysis for future work. Our qualitative analysis shows that all error types, including changes in phrases and addition/deletion of details, are reduced when the consistency loss is incorporated. While the consistency loss ensures that the predicted labels are the same even if the order of inputs is swapped, it can be adapted in the future to ensure expected outputs for anti-symmetric classification tasks (where P(L2R) = 1 − P(R2L)), such as next- and previous-sentence prediction, where reordering the inputs must result in the opposite predicted label. In addition, the proposed method can be applied to evaluate paraphrase generation models (Kumar et al., 2019, 2020) as well. In order to validate that paraphrasing models are indeed generating semantically similar outputs, BERT-with-consistency-loss can be used either to evaluate and filter out incorrect generations or as an objective to train learned metrics like BLEURT (Sellam et al., 2020).

Ethical Considerations
The primary aim of this work is to highlight the inconsistency in the labels and confidence scores generated by standard pre-trained models for symmetric classification tasks. To mitigate this inconsistency, we propose a loss function that incorporates the divergence between outputs when the input order is swapped. We do not anticipate any additional ethical issues being introduced by our loss function as compared to the original standard pre-trained models, specifically BERT and RoBERTa. All the datasets used in our experiments are subsets of datasets from previously published papers and, to the best of our knowledge, do not have any attached privacy or ethical issues. That being said, further efforts should be made to study the inherent biases encoded in pre-trained language models and the datasets.