Contrastive Fine-tuning Improves Robustness for Neural Rankers

The performance of state-of-the-art neural rankers can deteriorate substantially when they are exposed to noisy inputs or applied to a new domain. In this paper, we present a novel method for fine-tuning neural rankers that significantly improves their robustness to out-of-domain data and query perturbations. Specifically, a contrastive loss that compares data points in the representation space is combined with the standard ranking loss during fine-tuning. We use relevance labels to denote similar/dissimilar pairs, which allows the model to learn the underlying matching semantics across different query-document pairs and leads to improved robustness. In experiments with four passage ranking datasets, the proposed contrastive fine-tuning method improves robustness to query reformulations, noise perturbations, and zero-shot transfer for both BERT- and BART-based rankers. Additionally, our experiments show that contrastive fine-tuning outperforms data augmentation for robustifying neural rankers.


Introduction
Recent advances in neural language modeling have shifted the paradigm of natural language processing (NLP) towards a two-stage process: pre-training on a large amount of data with self-supervised tasks, followed by fine-tuning on the target datasets with task-specific loss functions. Current state-of-the-art neural rankers for information retrieval fine-tune pre-trained language models using ranking losses on datasets containing examples of positive and negative query-document pairs. While usually achieving good performance on in-domain test sets, neural rankers trained on large datasets can still exhibit poor transferability when tested in new domains, and suffer from robustness problems when exposed to various types of perturbations. For example, a neural ranker trained on a dataset with mostly natural language queries can perform badly when tested on keyword queries, which are very common in information retrieval (Bhatia et al., 2020).
A considerable number of previous works have focused on domain adaptation to improve a model's overall transferability. While domain adaptation approaches can help to address the out-of-domain robustness problem (Pan and Yang, 2010; Ma et al., 2019), they rely on the availability of either labeled data or at least a target corpus, which is usually not available at training time for a neural ranking model deployed in the wild.
The vulnerability of deep NLP models to various forms of adversarial attacks, such as word-importance-based replacement (Jin et al., 2020), human-curated minimal perturbations (Khashabi et al., 2020), misspellings, grammatical errors, and rule-based perturbations (Si et al., 2020; Ribeiro et al., 2018), is well-documented in the literature (Emma Zhang et al., 2019). While various methods have been proposed to remediate model robustness issues in NLP, most of them are either task-specific (Shah et al., 2019; Gan and Ng, 2020; Wang and Bansal, 2018), require auxiliary tasks (Zhou et al.), or rely on data augmentation (Kaushik et al., 2019; Cheng et al., 2020; Wei and Zou, 2019), which highly depends on the quality and diversity of the perturbed data.
An alternative strategy for optimizing machine learning models that has the potential to improve both out-of-domain generalization and robustness is contrastive learning. Representations obtained under contrastive self-supervised settings have demonstrated improved robustness to out-of-domain distributions and image corruptions in computer vision tasks (Hendrycks et al., 2019; Radford et al., 2021). In contrastive learning, representations are learned by comparing similar and dissimilar samples (Le-Khac et al., 2020; Khosla et al., 2020; Van Den Oord et al., 2018; Hjelm et al., 2018). This is different from discriminative learning, where models learn a mapping of input samples to labels, and generative learning, where models reconstruct input samples. While several works have investigated contrastive learning for sentence classification (Gunel et al., 2020), sentence representation learning, and multi-modal representation learning (Radford et al., 2021) under either self-supervised or supervised settings, their potential for improving the robustness of neural rankers has not been explored yet.

Figure 1: Contrastive fine-tuning for neural rankers. During fine-tuning, a batch of positive and negative samples from different queries is fed into a neural encoder. The embeddings of query-document pairs from the same query are used to generate ranking scores, which are employed to compute the ranking loss. In parallel, the embeddings of all pairs are used to compute the contrastive loss.
In this paper, we propose a novel contrastive learning approach to fine-tune neural rankers and investigate its benefits for improving model robustness. We focus on rankers that use single-tower architectures and are normally trained by optimizing a ranking loss that compares scores of positive and negative query-document pairs involving the same query. We propose to additionally use a contrastive loss that compares the distances between the representations of positive and negative pairs involving distinct queries (i.e., representations of positive pairs should be close in the latent space and distant from the representations of negative pairs, and vice-versa). The goal of using this contrastive loss in addition to the ranking loss is to stimulate the model to learn the underlying matching semantics across different query-document pairs, which can potentially lead to improved robustness.
Our main contributions are as follows:
• We propose to combine contrastive loss with ranking loss during fine-tuning of neural ranking models and investigate its impact in improving model robustness and generalization.
• Our experimental results using two language model-based neural rankers (BERT and BART) on four different datasets indicate that our proposed method improves upon standard ranking loss in zero-shot transfer across domains, leading to an increase of up to 9 absolute points in Mean Average Precision (MAP).
• We develop new datasets for evaluating the robustness of neural rankers. The datasets are based on the WikiQA test set (Yang et al., 2015) and were created semi-automatically. We plan to release these datasets upon acceptance of the paper.
• We show that contrastive fine-tuned rankers are robust to 1) different types of query reformulations commonly seen in information retrieval (headline, paraphrase, and change of voice); and 2) query perturbations such as adding/removing punctuation, typos, and contractions/expansions.

Contrastive Representation Learning for Neural Ranking
In neural ranking models, given an input query q and a set of candidate documents {d_1, d_2, ..., d_n}, a neural network h is used to create vector representations {h(q, d_1), ..., h(q, d_n)}, which are given to a scoring function s that maps each representation to a real-valued score, yielding {s(h(q, d_1)), ..., s(h(q, d_n))}. Normally s performs a simple linear projection of the input embedding, and the training of neural ranking models consists in optimizing a ranking loss that tries to enforce s(h(q, d+)) > s(h(q, d-)) for each training query q, where d+ is a positive document for q while d- is a negative one (see top part of Fig. 1). We propose to augment the training of neural rankers with the use of contrastive representation learning. While ranking-based methods compute the loss with respect to the predicted scores, contrastive losses measure the distance/similarity between similar and dissimilar samples in the representation space. In our case, the key idea consists in using a loss that compares the distances between the representations of query-document pairs, and enforces that positive pairs are close together in the latent space while being far apart from negative pairs, i.e., D(h(q, d+), h(q', d+')) < D(h(q, d+), h(q, d-)), where q' is either a variation of q or a completely different query, and d+' is a positive document for q'. Figure 1 illustrates our proposed approach, which is detailed in the remainder of this section.
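As a concrete illustration of the scoring step, the following minimal NumPy sketch uses random vectors as stand-ins for the encoder output h(q, d) and the learned projection in s (all names here are hypothetical, not from the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(pair_embedding, w, b=0.0):
    """s: a simple linear projection of the query-document pair
    embedding h(q, d) down to a scalar relevance score."""
    return float(pair_embedding @ w + b)

# Hypothetical 4-dim pair embeddings h(q, d_1), ..., h(q, d_3).
embeddings = rng.normal(size=(3, 4))
w = rng.normal(size=4)  # stand-in for the learned projection weights

scores = [score(e, w) for e in embeddings]
ranking = np.argsort(scores)[::-1]  # candidate indices, best first
```

Training then only needs to push the score of the positive pair above the scores of the negative pairs, which is exactly what the ranking losses below do.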

Ranking Loss
Popular ranking losses include 1) the pairwise ranking loss, in which the relevance information is given in the form of preferences between pairs of candidates, and 2) the listwise ranking loss, which directly optimizes a rank-based metric. In this work, we experiment with two pairwise ranking losses. The first one is the standard hinge loss (SHL), defined on a triplet (q, d+, d-) as follows:

L_SHL(θ) = max(0, λ - s(h(q, d+)) + s(h(q, d-)))

The other is a modified hinge loss (MHL) function defined as:

L_MHL(θ) = max(0, λ - s(h(q, d+)) + max_i s(h(q, d-_i)))

where q is a query, λ is the margin of the hinge loss, and d+ refers to the positive document. d- and {d-_i} refer to a negative document and the list of negative documents of the query q within the same batch, respectively. θ includes the set of parameters of the network h and the projection layer in s. Based on preliminary experiments, our modified ranking hinge loss generally performs better than the standard pairwise ranking hinge loss. Note that the MHL loss has been used in previous work on passage ranking (dos Santos et al., 2016).
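The two hinge losses can be sketched on precomputed scores as follows. Treat the MHL form as an assumption: taking the max over the in-batch negatives follows dos Santos et al. (2016), but the exact published variant may differ.

```python
def standard_hinge_loss(pos_score, neg_score, margin=1.0):
    """SHL on a triplet (q, d+, d-): penalize the model whenever the
    positive score fails to beat the negative score by the margin."""
    return max(0.0, margin - pos_score + neg_score)

def modified_hinge_loss(pos_score, neg_scores, margin=1.0):
    """MHL (assumed form): compare the positive against the hardest,
    i.e. highest-scoring, in-batch negative for the same query."""
    return max(0.0, margin - pos_score + max(neg_scores))
```

For example, with margin 1.0, a positive score of 0.9 against negatives [0.2, 0.5] incurs a loss of about 0.6 under MHL, driven entirely by the hardest negative.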

Contrastive Loss
For contrastive learning of representations, we employ the conceptually simple but widely adopted triplet margin loss (TML) (Weinberger et al.; Chechik et al., 2010), which has the following form:

L_TML = max(0, D(a, k+) - D(a, k-) + m)

where a is the anchor point, and k+ and k- are the similar and dissimilar samples with respect to the anchor point a. m is the margin of the TML loss. In our neural ranking setting, an anchor point is the representation of a query-document pair. We use the Euclidean (L2) distance as D in our experiments. The contrastive loss can be applied to the representations from a variety of encoders h(·) ∈ R^d. In this work, we explore contrastive fine-tuning for both BERT (Devlin et al., 2018) and BART (Lewis et al.) models.
The key to effective contrastive learning is to design the notion of similarity such that positive pairs may be very different in the input space yet semantically related. In this work, we leverage the relevance labels in the training data and consider as similar the positive pairs (q_i, d_i+) and (q_j, d_j+) from different queries i and j in the same batch (as illustrated in Fig. 1). Our intuition is that, by enforcing that positive pairs are close together in the embedding space and distant from negative pairs, we make the scoring task easier. Additionally, it allows the model to learn the underlying matching semantics across different query-document pairs, which leads to improved robustness. We additionally conduct a brief experiment in Sec. 5.4 where we use paraphrases of the original query to generate similar pairs.
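A minimal sketch of the triplet margin loss with Euclidean distance, using hypothetical pair embeddings: the anchor and positive come from positive pairs of two different queries, per the batching scheme above.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """TML: push the anchor-negative distance to exceed the
    anchor-positive distance by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical embeddings: a = h(q_i, d_i+), p = h(q_j, d_j+) from a
# different query j, n = h(q_i, d_i-) for a negative document.
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.2])
n = np.array([-1.0, 0.0])
loss = triplet_margin_loss(a, p, n, margin=0.5)  # 0.0: margin satisfied
```

Here the positive pair from another query already sits much closer to the anchor than the negative pair, so the triplet contributes no loss.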

Combined Loss
Our final loss is a weighted average of the ranking loss L_ranking and the contrastive loss L_contrastive:

L = w1 · L_ranking + w2 · L_contrastive

The weights w1 and w2 are hyper-parameters that need to be determined. Our main experiments use a simple but effective combination method, which consists in giving equal weights to the ranking loss and the contrastive loss.
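The combination itself is a one-liner; equal weights reproduce the setting used in the main experiments:

```python
def combined_loss(loss_ranking, loss_contrastive, w1=0.5, w2=0.5):
    """Weighted average of the ranking and contrastive losses;
    w1 = w2 = 0.5 is the equal-weight setting of the main experiments."""
    return w1 * loss_ranking + w2 * loss_contrastive
```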
Related Work

Our work is related to the recent body of works demonstrating that contrastive and self-supervised approaches can improve model robustness and generalization. Hendrycks et al. (2019) have shown that self-supervision increases image classifiers' robustness to adversarial examples, label corruption, and common input corruptions. Radford et al. (2021) have demonstrated that multi-modal contrastive learning can significantly improve the robustness of image classifiers to distribution shift.
In the NLP space, some recent works on sentence-level contrastive representation learning have shown its potential to improve robustness for classification (Gunel et al., 2020) and semantic text similarity tasks. There are two main distinctions between our work and these papers: 1) they focus on classification and text similarity tasks, while we focus on ranking; and 2) while they rely on data augmentation approaches to define the notion of similarity, our approach mainly relies on document relevance information which is already present in the training data.
Our work is also related to recent work on neural retrieval that focuses on hard negative mining to improve model performance (Gillick et al., 2019; Karpukhin et al., 2020). The main differences between our work and this line of research are: 1) while we leverage relevance information across different queries to create a notion of similarity, the focus of those papers is on finding hard negatives for each individual query in order to improve training efficiency (hard negative mining can actually be used together with our method, as we show in Sec. 6.2); and 2) we focus on reranking models, which use a single-tower architecture that creates a single representation for a query-document pair. In contrast, neural retrieval models create separate representations for the query and the document.

Experimental Setup
In this section, we describe the details of our experimental setup.

Passage Ranking Datasets
We test our method on four publicly available passage ranking/answer selection datasets that vary in size and domain. Passage ranking is an important task in information retrieval. It is often used to retrieve relevant content for open-domain question-answering systems.
WikiQA (Yang et al., 2015) is a dataset of question and sentence pairs, collected and annotated for research on open-domain question answering. The questions are factoid and selected from Bing query logs. The answers are in the summary section of a linked Wikipedia page. The candidates are retrieved using Bing.
WikiPassageQA (Cohen et al., 2018) is a benchmark collection for the research on non-factoid answer passage retrieval. The queries are created from Amazon Mechanical Turk over the top 863 Wikipedia documents from the Open Wikipedia Ranking.
InsuranceQA (Feng et al., 2016) contains question and answer pairs collected from the internet in the insurance domain. Each question has an answer pool of 500 candidates retrieved using SOLR.
YahooQA (Tay et al., 2017) contains questions and answers from the Yahoo! Answers website. The dataset is a subset of the Yahoo! Answers corpus from a 10/25/2007 dump, and the questions are selected for their linguistic properties.

The statistics of the datasets are presented in Table 1. All four datasets provide validation sets, which have sizes similar to the respective test sets.

Datasets for Robustness Assessment
In order to assess the robustness of our models to different types of query reformulations and query perturbations, we built robustness test datasets based on the original WikiQA test set. We assessed query perturbations by leveraging CheckList (Ribeiro et al., 2020) to construct three popular types of perturbation: adding/removing punctuation, introducing typos, and changing contraction forms. For each query in the WikiQA test set, we produce three new versions of the query, one for each perturbation type.
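As an illustration of these perturbation types (a simple stand-in, not the actual CheckList implementation), the following hypothetical helpers apply a typo and a punctuation perturbation:

```python
import random
import string

def add_typo(query, seed=0):
    """Introduce one typo by swapping two adjacent characters,
    a crude stand-in for CheckList-style typo perturbation."""
    if len(query) < 2:
        return query
    i = random.Random(seed).randrange(len(query) - 1)
    chars = list(query)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def toggle_punctuation(query):
    """Remove trailing punctuation if present, otherwise add a '?'."""
    if query and query[-1] in string.punctuation:
        return query.rstrip(string.punctuation)
    return query + "?"
```

For instance, toggle_punctuation("who wrote hamlet") yields "who wrote hamlet?", while applying it to the perturbed version strips the mark again; a robust ranker should score candidates near-identically for both forms.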
We assessed robustness to three types of query reformulations: paraphrase, headline, and change of voice. We semi-automatically created the datasets for query reformulations in two steps: (1) a pre-trained T5-base model (Raffel et al., 2019) is fine-tuned on a combination of large public paraphrase datasets (Quora 1 and PAWS) and human-curated query reformulations. The human-curated reformulations are based on the queries from the SQuAD 1.1 official dev set 2. For each query in the SQuAD 1.1 dev set, the annotators are asked to generate three new versions of the query (one for each reformulation type). During fine-tuning and inference, we use control codes to instruct T5 on the type of reformulation to be generated. Note that this T5 model can also be used for the purpose of data augmentation, as shown in Sec. 5.4. (2) Each query in the WikiQA test set is processed by the fine-tuned T5 and three reformulations of the query are generated. All generated queries are post-processed in order to ensure they are grammatically correct and semantically equivalent to the original query. To ensure reliable evaluation, we did a round of human annotations to filter out low-quality generations. Examples of query reformulations are presented in Table 2.
We evaluate the lexical diversity of the generated query reformulations by computing BLEU scores between the original query and the reformulated query. The results of comparing four different generation methods are presented in Figure 2. Our T5-generated queries overall exhibit higher diversity than human-generated and back-translation-generated paraphrases (note that the lower the BLEU score, the higher the diversity).

Figure 2: Comparison of BLEU scores between original query and reformulations generated by human annotation, back translation, and the fine-tuned T5 model.

1 https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
2 https://rajpurkar.github.io/SQuAD-explorer/

Neural Ranker Training
We train neural rankers by fine-tuning two pre-trained language models: BERT and BART. For fine-tuning BERT, we use the BERT-base model (12 layers, 110M parameters) from Huggingface's transformer codebase (Wolf et al., 2019). Similar to the setup of the sentence pair classification task in (Devlin et al., 2018), we concatenate the query sentence and the candidate passage together as a single input to the BERT encoder. We compute both the contrastive loss and the ranking scores based on the [CLS] token embedding of the final hidden layer. For BART fine-tuning, we use a BART-base model (6-layer encoder, 6-layer decoder, 139M parameters). We adopt the setting of BART for classification tasks (Lewis et al.): the concatenation of query text and passage text is fed into both the encoder and decoder, and the last layer's hidden state of the end decoder token is fed into a linear scorer. Similar to the [CLS] token in BERT, the embedding of the end token from the decoder is used as the representation of the complete input. For training with SHL, we sample triplets (q, d+, d-) from different queries to form a single batch. For MHL training, a single batch consists of a positive passage d+ and a list of negative passages {d-_i} from the same query q. We leverage the toolkit developed by Musgrave et al. (2020) for contrastive loss calculation, and fine-tune the models for a maximum of 10 epochs, adopting early stopping using the validation set of each dataset. The hyper-parameters for fine-tuning neural rankers are listed in Appendix A.

In-Domain Fine-tuning
The results of in-domain fine-tuning of BERT- and BART-based neural rankers on the four passage ranking datasets are presented in Table 3. To ensure a fair comparison, all the hyper-parameters between the ranking and the contrastive settings are kept the same, and equal weights between ranking loss and contrastive loss are used. Our rankers produce state-of-the-art results when trained with either of the two ranking-based losses, MHL and SHL. Adding the contrastive loss (TML) slightly improves the in-domain performance for BERT-based rankers, and performs similarly to the ranking loss alone for BART-based rankers. Since our modified hinge loss (MHL) generally performs better than the standard hinge loss (SHL), most of the results presented in the following sections are based on MHL, which corresponds to the setting illustrated in Figure 1.
To illustrate the effect of the contrastive loss on the representation space, we present t-SNE plots of sample representations from the test sets of two datasets, WikiPassageQA and YahooQA, in Figure 3. The colors in the figures represent the positive and negative labels of query-passage pairs. As we can see from the plots, adding the contrastive loss enables further separation of the positive samples from the negative samples.

Zero-Shot Transfer
The zero-shot transfer performance of neural rankers reflects their robustness to out-of-domain distributions, which is a key property since neural rankers are usually deployed in the wild. In Table 4, we show the results of applying the models trained on each of the four datasets (source) to the other three datasets (targets). Overall, we see significant improvements in the zero-shot transferability of the model across all datasets when the neural ranker is trained using the combination of ranking and contrastive losses. The biggest improvement is from YahooQA → WikiPassageQA, where we observe absolute improvements of 9 points, 10.6 points, and 11.8 points in MAP, MRR, and P@1, respectively. As expected, the transfer between datasets from similar domains (e.g., WikiQA ↔ WikiPassageQA) tends to be better than that between dissimilar domains. Our intuition regarding the benefit of contrastive representation learning for zero-shot transfer rests on the fact that, by using information from different queries and enforcing that positive pairs are close together in the embedding space and distant from negative pairs, the model ends up learning representations that are more general and therefore easier to transfer to new domains.

Robustness to Query Perturbations
In this section, we evaluate model robustness to various types of reformulations and noisy transformations of the input queries. The test sets used in the experiments are the 6 variations of the WikiQA test set described in Sec. 4.2. We compare the results of using the ranking loss alone (MHL) and the combination of ranking and contrastive losses (MHL+TML). The robustness evaluation results are presented in Table 5. As shown in Table 5, adding the contrastive loss improves model robustness against all types of perturbations we tested. We also conduct experiments fine-tuning neural rankers on the combined SHL and TML losses. The robustness evaluation of BERT-based rankers trained on SHL alone or on the combined SHL and TML losses is presented in Table 6. Similar to the MHL case, the combined loss achieves a significant improvement in robustness over SHL alone. More results on model robustness can be found in Appendix B.

Comparison with Data Augmentation
One of the traditional approaches for improving the robustness of machine learning models is to augment the training data with noisy data. In this section, we compare our contrastive fine-tuning method with a data augmentation approach in which automatically generated query reformulations are added to the training data. For each query in the training set, we use our fine-tuned T5 model to generate 5 new queries of each reformulation type (headline, paraphrase, change of voice). Effectively, we increase the training set and the number of training steps by a factor of 5 for each reformulation type. Since our proposed training approach is general and can be used with any dataset, we also experiment with data augmentation combined with contrastive fine-tuning. When performing contrastive fine-tuning, we use the 5 reformulations of each query to create similar pairs (the notion of similarity here being query reformulation) in each batch, which essentially keeps the number of training steps the same as when training with the original dataset. The results on data augmentation for BERT-based neural rankers are presented in Table 7. For rows with the MHL loss, we augment the training data with paraphrased queries and train the model on the combined dataset using the MHL loss only; note that this is the standard way to do data augmentation training. For rows using the MHL+TML loss, we pair each query in the batch with its paraphrased query for the contrastive loss calculation, and the model is trained on the combined MHL+TML loss. For both methods, we expose the model to the same amount of paraphrased training samples. As we can see from the table, augmenting the training data with a similar type of query reformulation can improve the robustness of the model against that particular type of reformulation. However, it is not as effective in improving the robustness against other reformulation types.
On the other hand, contrastive fine-tuning, even when trained with a single type of query reformulation, can generally improve model robustness against the other two types. Furthermore, contrastive fine-tuning achieves this with significantly less (4× less) training time, even after accounting for the additional computation of the contrastive loss. Essentially, the experimental results indicate that using paraphrased training samples to perform contrastive learning is both more effective (produces more robust rankers) and more efficient (faster to train, since augmented data is used in parallel, i.e., in the same batch as the original data) than regular data augmentation.

Comparison with Ranking Loss using same Batch Size
When training with MHL plus the contrastive loss, we effectively increase the batch size because we need to augment the training batch with additional positive samples from different queries. To check whether the improvements achieved by our approach are due to the increase of training batch size alone, we perform an ablation study where we compare the performance of models trained with a ranking loss but with the same batch size as the contrastive fine-tuning setting. The comparison results are presented in Table 8. As we can see in the table, for the WikiQA dataset, increasing the batch size helps performance on the in-domain and some of the robustness test sets. The contrastive setting still outperforms the ranking setting in all the test categories.
Increasing the batch size of MHL is not always beneficial: we see a big degradation on the WikiPassageQA dataset. On the other hand, we observed consistent improvements when the model is trained with the contrastive loss.

Ablation Study
We present ablation experiments that check the impact of the number of positive samples per batch and the use of hard negative mining. Additionally, in Appendix C, we also present a preliminary experiment on formulating our combined loss (Equation 4) as a multi-objective optimization (MOO).

Effect of Number of Positive Samples Per Batch
The number of positive samples within a single batch determines the total number of potential triplets constructed. In this section, we vary the number of positive samples within a batch and evaluate its effect on model performance. The results are presented in Table 9. As expected, the model benefits from increasing the number of positives in a batch: although not strictly monotonically, both the in-domain performance and the zero-shot transfer performance improve with the number of positive pairs.

Effect of Hard Negative Mining
In a batch of N samples, there are O(N^3) possible triplets, many of which are not very helpful to model convergence (e.g., easy triplets where D(a, k+) << D(a, k-), which yield zero loss). It is important to construct only the most useful triplets. Many works have discussed the benefit of hard negative mining techniques that produce useful gradients and help models converge quickly. In this section, we explore the effect of three hard negative mining methods that are compatible with TML: the Angular miner (Wang et al., 2017); BatchHard (which, for each anchor point, outputs the triplet consisting of the hardest positive and hardest negative samples); and TripletMargin (which only outputs a triplet when the difference between the anchor-positive distance and the anchor-negative distance is smaller than a margin). The results of hard negative mining for models trained on the WikiQA dataset are presented in Table 10, in which we evaluate both the in-domain and zero-shot performance of the rankers. As we can see from the results, hard negative mining can further improve the transferability of both the BERT-based and BART-based rankers. In particular, BatchHard outperforms the other mining methods and improves the overall performance significantly for BERT-based rankers, while TripletMargin is more effective for BART-based rankers. We believe there is still margin for improvement if the hyper-parameters of the miners are properly tuned.

Conclusion
In this paper, we propose a novel method for fine-tuning neural rankers by combining contrastive loss with ranking loss. Using a semi-automatic approach, we created 6 new versions of the WikiQA test set to assess the robustness of our models to query reformulations and perturbations. Our experimental results show that the proposed method improves rankers' robustness to out-of-domain distributions, query reformulations, and perturbations, and that contrastive fine-tuning with generated data is more effective than data augmentation. Comprehensive experiments and ablation studies were conducted to investigate the impact of several design choices, as well as to confirm that the gains do not originate only from larger batch sizes. As future work, we plan to evaluate the performance of other state-of-the-art contrastive loss functions and novel methods of aggregating multiple losses.

A Details of Neural Ranker Fine-tuning
The fine-tuning of neural rankers is conducted on an AWS EC2 P3 machine. Important hyperparameters of fine-tuning for each model-dataset combination are listed in Table 11.

B More Results of Robustness Against Query Perturbations
In this section, we present more results of model robustness evaluation for neural rankers trained on InsuranceQA and YahooQA datasets. The results are shown in Table 12.

C Fine-tuning as Multi-objective Optimization
We performed a preliminary experiment on formulating Equation 4 as a multi-objective optimization (MOO) problem in which optimizing L_ranking and L_contrastive are two objectives of the task. We adopt a dynamic weighted aggregation (DWA) method (Jin et al., 2001, 2004), which is both effective and computationally efficient. In DWA, the weights of the two loss terms are changed gradually according to the following equations:

w1(t) = |sin(2πt / F)|    (5)
w2(t) = 1 − w1(t)

where t is the iteration number. Note that w1(t) varies periodically between 0 and 1, and the frequency of change can be adjusted through F. Figure 4 shows the evolution of the contrastive loss during fine-tuning of the BART-based ranker on the WikiQA dataset. As can be seen from the plot, adopting the MOO method improves model convergence: a lower contrastive loss is achieved using dynamic weighting, which translates into an average improvement of 0.7 points over the equal-weighting setting in zero-shot transfer performance (see Table 13).
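The DWA weight schedule is straightforward to implement; a minimal sketch, with F as the frequency hyper-parameter:

```python
import math

def dwa_weights(t, F=100.0):
    """Dynamic weighted aggregation: w1(t) = |sin(2*pi*t / F)| cycles
    between 0 and 1, and w2(t) = 1 - w1(t), so the optimizer
    gradually alternates emphasis between the two loss terms."""
    w1 = abs(math.sin(2.0 * math.pi * t / F))
    return w1, 1.0 - w1
```

For example, with F = 100 the schedule starts at (w1, w2) = (0, 1), reaches full ranking-loss emphasis at t = 25, and returns to full contrastive-loss emphasis at t = 50.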