Debiasing Neural Retrieval via In-batch Balancing Regularization

People frequently interact with information retrieval (IR) systems, yet IR models exhibit biases and discrimination towards various demographics. In-processing fair ranking methods trade off accuracy against fairness by adding a fairness-related regularization term to the loss function. However, there has been no intuitive objective function that depends on the click probability and user engagement and can be optimized towards directly. In this work, we propose In-Batch Balancing Regularization (IBBR) to mitigate the ranking disparity among subgroups. In particular, we develop a differentiable normed Pairwise Ranking Fairness (nPRF) and leverage the T-statistic on top of nPRF over subgroups as a regularization to improve fairness. Empirical results with BERT-based neural rankers on the MS MARCO Passage Retrieval dataset with the human-annotated non-gendered queries benchmark (Rekabsaz and Schedl, 2020) show that our IBBR method with nPRF achieves significantly less bias with minimal degradation in ranking performance compared with the baseline.


Introduction
Recent advancements in Natural Language Processing and Information Retrieval (Palangi et al., 2016; Devlin et al., 2019; Zhao et al., 2020; Karpukhin et al., 2020) have led to great progress in search performance. However, search engines easily expose various biases (e.g., Biega et al., 2018; Baeza-Yates, 2018; Rekabsaz and Schedl, 2020; Rekabsaz et al., 2021), which steadily erode users' trust. Many methods have recently been proposed to reduce the bias of retrievers. Existing fairness-aware ranking methods can be categorized into pre-processing, in-processing, and post-processing methods (Mehrabi et al., 2021; Zehlike et al., 2021). Pre-processing methods typically focus on mitigating bias in the data before training the model. Lahoti et al. (2019) discussed an individual-fairness pre-processing method that learns a fair representation of the data. However, representation-based methods can undermine the value of features determined by domain experts (Zehlike et al., 2021). In-processing methods usually transform fairness in ranking into an optimization problem consisting of an accuracy objective and a fairness objective, and learn the best balance between the two (Kamishima et al., 2011; Berk et al., 2017; Bellamy et al., 2018; Konstantinov and Lampert, 2021). Zehlike and Castillo (2020) handle different types of bias without knowing the exact bias form. Post-processing algorithms (Singh and Joachims, 2018; Zehlike et al., 2017, 2020; Cui et al., 2021) are model-agnostic and do not require access to the training process, but they re-order the ranking at the expense of accuracy (Menon and Williamson, 2018).
Among recent works on fair neural retrieval, Beutel et al. (2019) introduce the pairwise ranking fairness (PRF) metric for ranking predictions. This pairwise fairness metric evaluates whether there is a difference in accuracy between two groups. Rekabsaz et al. (2021) (AdvBert) mitigate the bias magnitude of the concatenation of query and passage text through an adversarial neural network, rather than treating the bias magnitudes of the query and the passage separately.
In this paper, we propose the In-Batch Balancing Regularization (IBBR) method combined with a neural retrieval model. IBBR is an in-processing debiasing method that balances the ranking disparity among different demographic groups by adding an in-batch balancing regularization term to the objective function. We design two batch-level regularization terms, Pairwise Difference (PD) and T-statistics (TS), that measure biases within demographic groups. In addition, we introduce normed Pairwise Ranking Fairness (nPRF), a relaxed version of PRF (Beutel et al., 2019) that is differentiable and thus can be optimized directly. We apply IBBR to the MS MARCO passage re-ranking task (Nguyen et al., 2016) on gender bias using pretrained BERT_L2 and BERT_L4 models (Turc et al., 2019). Empirical results show that our model achieves significantly less bias with minor ranking performance degradation, striking a good balance between accuracy and fairness. Our contributions can be summarized as follows: • We introduce IBBR, an in-processing debiasing method based on pairwise difference and T-statistics.
• We introduce normed PRF (nPRF), a relaxed version of the pairwise ranking fairness (PRF) metric (Beutel et al., 2019). The normed PRF resolves the non-differentiability of PRF and can be optimized directly during training.
• We perform experiments on the MS MARCO passage re-ranking task with IBBR and normed PRF. Empirical results show that IBBR with normed PRF achieves a statistically significant improvement in fairness while maintaining good ranking performance.

Problem Definition
We first introduce notations for the ranking task in §2.1. §2.2 provides the definition of the bias of a passage. In §2.3, we give the definition of group fairness in the ranking task.

Notations in the Ranking Task
Formally, we define the task of Gender Debiased Neural Retrieval (GDNR) as follows: given a query $q$ and the top $K$ passages retrieved by the neural retrieval system, we adapt the ranking to mitigate bias in the retrieval result. We first define the whole query set as $Q = \{q_1, q_2, \ldots, q_N\}$. For each query $q_i$, we denote $P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,j}, \ldots, p_{i,K}\}$ as the set of passages retrieved for $q_i$. With query $q_i$ and its retrieved passages $P_i$, $s_i = \{q_i, p^{+}_{i,1}, p^{-}_{i,2}, \ldots, p^{-}_{i,K}\}$ is defined as one data pair. Here $p^{+}_{i,1}$ is the ground-truth (clicked) passage and $p^{-}_{i,j}$ is a non-clicked passage, $\forall j \in \{2, 3, \ldots, K\}$. We use $Y_{i,j} = 1$ to label the passage $p_{i,j}$ as a clicked passage, and $Y_{i,j} = 0$ otherwise. Finally, the whole dataset is defined as $D = \{s_1, s_2, \ldots, s_N\}$. For notational simplicity, we use $[1:K]$ to represent $\{1, 2, \ldots, K\}$.

Bias Label of Passage
We first define the bias label of a passage, using gender bias as a running example. Rekabsaz and Schedl (2020) use the degree of gender magnitude in a passage to define its bias value, where the gender concept is defined via a set of highly representative gender-definitional words. Such a set typically contains words such as she, woman, and grandma for female-definitional words ($G_f$), and he, man, and grandpa for male-definitional words ($G_m$).
Our definition of passage bias differs from that of Rekabsaz and Schedl (2020), who assume that a passage has two magnitudes: a female magnitude and a male magnitude. We instead assume that a passage has a single implication or tendency and use the gender magnitude difference as the bias value. The bias value of a passage $p$ is defined as
$$\mathrm{mag}(p) = \sum_{w \in G_m} |w, p| - \sum_{w \in G_f} |w, p|, \quad (1)$$
where $|w, p|$ refers to the number of occurrences of the word $w$ in passage $p$. Furthermore, we define the bias label $d(p)$ of the passage $p$ from the sign of $\mathrm{mag}(p)$: the passage is male-biased ($d(p) = 1$) if $\mathrm{mag}(p) > 0$ and female-biased if $\mathrm{mag}(p) < 0$. (2)
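As a concrete illustration, the magnitude in Eq. (1) can be computed by simple word counting. The word lists below are tiny stand-ins for the actual gender-definitional sets of Rekabsaz and Schedl (2020), and labeling the $\mathrm{mag}(p) \le 0$ case as non-male is a simplifying assumption of this sketch:

```python
import re

# Tiny stand-in word lists; the paper uses the full gender-definitional
# sets G_f and G_m from Rekabsaz and Schedl (2020).
G_F = {"she", "her", "woman", "women", "grandma"}
G_M = {"he", "him", "man", "men", "grandpa"}

def occurrences(words, passage):
    """|w, p| summed over all w in the given word set."""
    tokens = re.findall(r"[a-z']+", passage.lower())
    return sum(1 for t in tokens if t in words)

def mag(passage):
    """Eq. (1): male-word count minus female-word count."""
    return occurrences(G_M, passage) - occurrences(G_F, passage)

def bias_label(passage):
    """d(p): 1 for male-biased (mag > 0), 0 otherwise (the treatment
    of mag <= 0 here is an assumed convention)."""
    return 1 if mag(passage) > 0 else 0
```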

Group Fairness in Ranking
In §2.3.1, we introduce one metric of ranking group fairness, pairwise ranking fairness, proposed by Beutel et al. (2019). In §2.3.2, we provide a more refined definition of pairwise ranking fairness.

Pairwise Ranking Fairness
Let $R(p) \in [0, 1]$ be the ranking score of passage $p$ from a retrieval model. $\mathrm{PRF}_m(s_i)$ measures the probability that a random clicked male-biased passage is ranked above a random non-clicked passage in the data pair $s_i$:
$$\mathrm{PRF}_m(s_i) = \frac{1}{n^m_1(s_i)\, n_0(s_i)} \sum_{j \in g^m_1(s_i)} \sum_{k \in g_0(s_i)} \mathbb{1}\left[R(p_{i,j}) > R(p_{i,k})\right],$$
where $g^m_1(s_i) = \{j \mid Y_{i,j} = 1, d(p_{i,j}) = 1, j \in [1:K]\}$ is the set of clicked passages that also belong to the male-biased group ($d(p_{i,j}) = 1$), and $n^m_1(s_i) = |g^m_1(s_i)|$ is the number of male-biased clicked passages. $g_0(s_i) = \{j \mid Y_{i,j} = 0, j \in [1:K]\}$ is the set of non-clicked passages, and $n_0(s_i)$ is the number of non-clicked retrieved passages. Beutel et al. (2019) use the probability that a clicked sample is ranked above a non-clicked sample for the same query as the pairwise accuracy. Pairwise fairness asks whether there is a difference in this pairwise accuracy between the two groups.
However, PRF is not directly usable as a regularizer in a loss function that trades off accuracy and fairness: because PRF is built on 0/1 indicator comparisons (a 0-normed objective), it is non-convex and non-differentiable. We therefore propose a modified PRF that can be optimized directly.
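A direct (non-differentiable) computation of PRF from these definitions can be sketched as follows, assuming per-passage arrays of scores, click labels, and bias labels for one query:

```python
import numpy as np

def prf(scores, clicked, bias, group=1):
    """PRF for one query: fraction of (clicked in-group, non-clicked)
    passage pairs where the clicked passage scores higher.  The hard
    indicator comparison is exactly what makes PRF non-differentiable."""
    scores = np.asarray(scores, dtype=float)
    clicked = np.asarray(clicked)
    bias = np.asarray(bias)
    g1 = np.where((clicked == 1) & (bias == group))[0]  # g_1^m(s_i)
    g0 = np.where(clicked == 0)[0]                      # g_0(s_i)
    if len(g1) == 0 or len(g0) == 0:
        return float("nan")
    wins = sum(1 for j in g1 for k in g0 if scores[j] > scores[k])
    return wins / (len(g1) * len(g0))
```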

Normed-Pairwise Ranking Fairness
We propose a relaxed version of PRF called normed-PRF (nPRF), which measures the degree of group fairness in the retrieval results for a given query while also taking the ranking performance into account. For a data pair $s_i$, $\mathrm{nPRF}_m$ is defined over all clicked male-biased passages as
$$\mathrm{nPRF}_m(s_i) = \frac{1}{n^m_1(s_i)\, n_0(s_i)} \sum_{j \in g^m_1(s_i)} \sum_{k \in g_0(s_i)} |R(p_{i,j})|^2\, R(p_{i,k}),$$
where $n^m_1(s_i)$ is the number of clicked male-biased passages in the data pair; usually $n^m_1(s_i) = 1$ in the ranking system.
To avoid the drawback of PRF being non-differentiable, we replace the indicator comparison with a product term weighted by the square of the ranking score, $|R(p_{i,j})|^2$, of the clicked passage. Besides, the term $\frac{1}{n_0(s_i)} \sum_{k \in g_0(s_i)} R(p_{i,k})$ measures the average harm associated with the biased passage $p_{i,j}$. If this value is large, the non-clicked passages are, on average, scored close to the clicked passage $p_{i,j}$; this contributes more harm to society, since people are more willing to accept such a ranking result. If this value is small, the non-clicked passages are, on average, scored far below the clicked passage; this contributes less harm to society, since people are less willing to accept the ranking result. Thus, nPRF not only accounts for the magnitude of the ranking performance of the retrieval results but also inherits the explainable societal impact of PRF.
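Under this reading of the relaxation (the indicator replaced by the squared clicked score times each non-clicked score, averaged over pairs), nPRF can be written as a differentiable PyTorch function. The exact functional form is our reconstruction from the surrounding text, not a verbatim formula from the paper:

```python
import torch

def nprf(scores, clicked, bias, group=1):
    """Differentiable nPRF sketch for one query.  Assumed form:
    mean over pairs of |R(p_j)|^2 * R(p_k), for clicked in-group
    passages j and non-clicked passages k."""
    g1 = (clicked == 1) & (bias == group)   # clicked, in-group
    g0 = clicked == 0                       # non-clicked
    n1, n0 = g1.sum(), g0.sum()
    if n1 == 0 or n0 == 0:
        return scores.new_zeros(())
    # sum_j sum_k R(p_j)^2 R(p_k) factorizes into a product of sums
    return (scores[g1] ** 2).sum() * scores[g0].sum() / (n1 * n0)
```

Because every operation is differentiable in `scores`, the value can be fed straight into a regularizer and back-propagated.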

Algorithms
In this section, we build a regularizer based on nPRF to mitigate gender bias. In §3.1, we introduce the necessary components of the neural retrieval task. In §3.2, we define the ranking loss and two fairness loss functions, the Pairwise Difference loss and the T-statistics loss, which act as regularizers; together they form the in-batch balancing regularization (IBBR) method.

Rank Model
Given the dataset $D$, we use the two-tower dense passage retriever (DPR) model (Karpukhin et al., 2020) as our retrieval model. DPR uses two dense encoders, $E_P$ and $E_Q$, which map a text passage and an input query to two $d$-dimensional vectors ($d = 128$), and retrieves the $K$ passage vectors closest to the query vector. We define the ranking score between a query and a passage as the dot product of their DPR vectors, $\mathrm{sim}(q_i, p_{i,j}) = z_{q_i}^{\top} z_{p_{i,j}}$, where $z_{q_i} = E_Q(q_i)$ and $z_{p_{i,j}} = E_P(p_{i,j})$ are the corresponding query and passage dense embeddings.
Remarks. We use the two-tower DPR for two reasons. (I) Computational considerations: Humeau et al. (2019) thoroughly discussed the pros and cons of cross-encoders (Nogueira and Cho, 2019) versus bi-encoders such as DPR, and concluded that cross-encoders are too slow for practical use. (II) Cross-encoders can create an ill-defined problem: if the query's bias label belongs to group $m$ and the passage's bias label belongs to group $f$, the bias label of the concatenation of the two texts is unclear under the definition in Eq. (2) of Rekabsaz et al. (2021). The two-tower BERT model is therefore applied separately to the query and the document to avoid this issue. Here we consider only DPR as our ranking model.
Encoders. To demonstrate the robustness of IBBR, we use two BERT models (Turc et al., 2019), (1) BERT-Tiny and (2) BERT-Mini (both uncased), as our encoders, and take the representation at the [CLS] token as the output.
Inference. For a data pair $s_i$, the ranking score $R(p_{i,j})$ of passage $p_{i,j}$ for query $q_i$ is simply $\mathrm{sim}(q_i, p_{i,j})$, the inner product produced by the DPR encoders.
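The two-tower scoring can be sketched with placeholder linear encoders standing in for the BERT towers. The 300-dimensional input features and the stub encoders are illustrative assumptions; only the dot-product scoring mirrors the text:

```python
import torch

d = 128  # DPR embedding dimension

torch.manual_seed(0)
E_Q = torch.nn.Linear(300, d)  # stand-in for the query BERT tower
E_P = torch.nn.Linear(300, d)  # stand-in for the passage BERT tower

def rank_scores(query_feats, passage_feats):
    """R(p_{i,j}) = sim(q_i, p_{i,j}) = z_q . z_p for one query
    against K candidate passages."""
    z_q = E_Q(query_feats)    # (d,)
    z_p = E_P(passage_feats)  # (K, d)
    return z_p @ z_q          # (K,) one score per passage

scores = rank_scores(torch.randn(300), torch.randn(5, 300))
```

Because the towers never see the concatenated text, each side keeps its own bias label, which is exactly the property the remarks above rely on.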

Ranking Loss
The ranking loss is the negative log-likelihood of the clicked passage, computed from the inner products of the query and passage embeddings for the data pair $s_i$:
$$L_{\mathrm{Rank}} = -\log \frac{e^{\mathrm{sim}(q_i, p^{+}_{i,1})}}{e^{\mathrm{sim}(q_i, p^{+}_{i,1})} + \sum_{j=2}^{K} e^{\mathrm{sim}(q_i, p^{-}_{i,j})}}.$$
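With the clicked passage placed at index 0 of the score vector, this loss is just a softmax cross-entropy with target 0, as a minimal sketch shows:

```python
import torch
import torch.nn.functional as F

def ranking_loss(sim_scores):
    """L_Rank: negative log-likelihood of the clicked passage, whose
    score is assumed to sit at index 0 of sim_scores."""
    target = torch.zeros(1, dtype=torch.long)  # clicked passage index
    return F.cross_entropy(sim_scores.unsqueeze(0), target)

sims = torch.tensor([2.0, 0.5, 0.1])  # clicked first, then negatives
loss = ranking_loss(sims)
```

Numerically, the value equals `logsumexp(sims) - sims[0]`, i.e., the log of the softmax denominator minus the clicked score.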

Fairness Loss
To mitigate bias between the two groups, we use the ranking disparity to evaluate the fairness level of the neural retrieval system, and this disparity serves as a regularization term in the loss function. We propose two regularization terms as follows.
(I) Pairwise Difference Loss. The pairwise difference (PD) loss $L^{\mathrm{PD}}_{\mathrm{Fair}}$ measures the average ranking disparity between the two groups $m$ and $f$ over a batch of $B$ data pairs:
$$L^{\mathrm{PD}}_{\mathrm{Fair}} = \frac{1}{n_m n_f} \sum_{x \in P^m_{[1:B]}} \sum_{y \in P^f_{[1:B]}} \left| \mathrm{nPRF}_m(s_x) - \mathrm{nPRF}_f(s_y) \right|,$$
where $P^m_{[1:B]}$ is the set of data pairs in the batch whose clicked passage belongs to group $m$.
(II) T-statistics Loss. The T-statistics (TS) loss is also based on the ranking disparity, but additionally considers the second-order information (variance) of each group within the batch. We use the square of the T-statistic as the ranking disparity measure:
$$L^{\mathrm{TS}}_{\mathrm{Fair}} = \frac{(\hat{\mu}_m - \hat{\mu}_f)^2}{\widehat{\mathrm{var}}_m / n_m + \widehat{\mathrm{var}}_f / n_f},$$
where $\hat{\mu}_m = \frac{1}{n_m} \sum_{j \in P^m_{[1:B]}} \mathrm{nPRF}_m(s_j)$ is the mean of the male group's nPRF and $\widehat{\mathrm{var}}_m = \frac{1}{n_m} \sum_{j \in P^m_{[1:B]}} (\mathrm{nPRF}_m(s_j) - \hat{\mu}_m)^2$ is its variance; $\hat{\mu}_f$ and $\widehat{\mathrm{var}}_f$ are defined similarly.
Remarks. The TS loss provides a robust measure of ranking disparity, especially when the batch is imbalanced. The square of the T-statistic, which follows a $\chi^2$ distribution, provides the theoretical guarantee and power to reject the similarity between group $m$ and group $f$.
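Given the per-query nPRF values of each group within a batch, both regularizers reduce to a few tensor operations. The Welch-style denominator and the small epsilon guarding against division by zero are our assumptions in this sketch:

```python
import torch

def pd_loss(nprf_m, nprf_f):
    """Pairwise Difference loss: mean |nPRF_m(s_x) - nPRF_f(s_y)|
    over all cross-group pairs in the batch."""
    return (nprf_m[:, None] - nprf_f[None, :]).abs().mean()

def ts_loss(nprf_m, nprf_f, eps=1e-8):
    """Squared T-statistic between the two groups' nPRF values
    (Welch-style denominator assumed)."""
    n_m, n_f = len(nprf_m), len(nprf_f)
    mu_m, mu_f = nprf_m.mean(), nprf_f.mean()
    var_m = ((nprf_m - mu_m) ** 2).mean()
    var_f = ((nprf_f - mu_f) ** 2).mean()
    return (mu_m - mu_f) ** 2 / (var_m / n_m + var_f / n_f + eps)
```

Both functions are differentiable in the nPRF inputs, so either can be added to the ranking loss and back-propagated through the retriever.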
Total Loss. The total loss is the sum of the ranking loss and the fairness loss,
$$L = L_{\mathrm{Rank}} + \lambda L_{\mathrm{Fair}},$$
where $L_{\mathrm{Fair}}$ can be the PD loss or the TS loss, and $\lambda$ is a hyperparameter that controls the balance between the fairness loss and the ranking loss. In the experiments, we tune $\lambda$ both manually and automatically. The details of our method are illustrated in Figure 1.
Experiments

In this section, we describe the dataset in §4.1, experiment setup in §4.2, evaluation metrics in §4.3, baseline models in §4.4, and the corresponding result analysis in §4.5.

Dataset
We experiment on the passage re-ranking task from MS MARCO (Nguyen et al., 2016). This collection includes 8.8 million passages and 0.5 million question-style queries from Bing's search logs, accompanied by human-annotated clicked/non-clicked passages. Additionally, bias labels for this dataset are available from Rekabsaz and Schedl (2020).
Data for DPR. For the baseline DPR model, the dataset comprises 537,585 queries in total and $K \times 537{,}585$ retrieved passages, with $K = 200$. Each query has $K$ top-ranked passages, including one ground truth and 199 negative samples. The train/development/test split (7:2:1) used for the DPR model is detailed in Appendix A, Table 3. A total of 126 queries are used for the final evaluation.
Data for Fair Model. The fairness dataset (Rekabsaz and Schedl, 2020) is also built on MS MARCO. Its queries were annotated into one of four categories: non-gendered (1,765), female (742), male (1,202), and other or multiple genders (41). We use only the non-gendered queries, assuming a query is unbiased when it contains no gender-definitional terms. There are 1,252 unique queries in total. Examples of non-gendered queries include what is a synonym for beautiful? and what is the meaning of resurrect?.

Experiment Setup
The maximum lengths of the query and the passage are set to 100. The batch size $B$ is 150, selected from {100, 120, 150}. The learning rate is 3e-5, selected from {3e-6, 3e-5, 3e-4}. We use a warmup ratio of 10% with a linear scheduler and a weight decay of 0.01. In addition, we search the fairness penalty parameter $\lambda$ over {0.1, 0.5, 1, 5, 10} (Best). We also experiment with setting $\lambda$ as a trainable parameter (Auto). All experiments are run ten times and we report the average.
Fairness metrics. We use RaB@5, RaB@10, and the ranking disparity |ΔA-PRF| to evaluate the fairness magnitude.
RaB$_t$. RaB$_t$ measures ranking bias based on the average gender magnitude of the passages in the top-$t$ ranking list (Rekabsaz and Schedl, 2020). For a data pair $s_i$, the per-query bias for the female group is
$$\mathrm{qRaB}^f_t(s_i) = \frac{1}{t} \sum_{j=1}^{t} \mathrm{mag}_f(p_{i,j}),$$
and $\mathrm{qRaB}^m_t(s_i)$ is defined analogously for the male group, with $t = 5$ or $10$. The RaB metric for a single query is $\mathrm{RaB}_t(s_i) = \mathrm{qRaB}^m_t(s_i) - \mathrm{qRaB}^f_t(s_i)$, and the RaB metric of the retrieval model over all queries is $\mathrm{RaB}_t = \frac{1}{N} \sum_{s_i \in D} \mathrm{RaB}_t(s_i)$. The smaller the absolute value of $\mathrm{RaB}_t$, the smaller the ranking disparity.
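The RaB computation is a pair of averages. Given per-passage male and female magnitudes in rank order for each query (the nested-list layout is an assumption of this sketch), it can be written as:

```python
def qrab(mags, t):
    """qRaB_t(s_i) for one gender: mean magnitude of the top-t passages."""
    return sum(mags[:t]) / t

def rab(mag_m_per_query, mag_f_per_query, t):
    """RaB_t: mean over queries of the male-minus-female top-t bias."""
    diffs = [qrab(m, t) - qrab(f, t)
             for m, f in zip(mag_m_per_query, mag_f_per_query)]
    return sum(diffs) / len(diffs)
```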
|∆A-PRF|. |∆A-PRF| measures the ranking disparity between the two groups as the absolute difference of their averaged PRF values:
$$|\Delta\text{A-PRF}| = \left| \frac{1}{|T_m|} \sum_{s_i \in T_m} \mathrm{PRF}_m(s_i) - \frac{1}{|T_f|} \sum_{s_i \in T_f} \mathrm{PRF}_f(s_i) \right|,$$
where, following the running example, $T_m$ is the set of male-biased clicked pairs and $|T_m|$ its size, with $T_f$ and $|T_f|$ defined similarly. The smaller |∆A-PRF| is, the smaller the ranking disparity; a value close to zero means the retrieved results are relatively fair, since the two groups' PRF values are close to each other. To avoid selection bias, |∆A-PRF| measures the fairness level of the whole dataset rather than a subset such as the top 5 or top 10.
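Given the per-query PRF values already split by the clicked passage's group, |∆A-PRF| is a one-liner; the function below assumes both lists are non-empty:

```python
def delta_a_prf(prf_m, prf_f):
    """|ΔA-PRF|: absolute gap between the average PRF of T_m queries
    and the average PRF of T_f queries."""
    return abs(sum(prf_m) / len(prf_m) - sum(prf_f) / len(prf_f))
```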

Baseline Models
The baseline methods include the classical IR models BM25 and RM3 PRF, and neural models: Match Pyramid (MP), the Kernel-based Neural Ranking Model (KNRM), Convolutional KNRM (C-KNRM), Transformer-Kernel (TK), and a fine-tuned BERT model. These results are available in Appendix A. For the BERT rankers, we use BERT-Tiny (BERT_L2) and BERT-Mini (BERT_L4).

Results Analysis
Ranking Performance. Table 1 presents results for the original BERT_L2 and BERT_L4 models and for BERT_L2 and BERT_L4 with IBBR (PD and TS). For BERT_L2, after adding IBBR, the ranking performance decreases by 2.2% in Recall@10. We present the RaB results in the last two columns, where the TS method is still better than the PD method on average. For RaB@5, TS performs similarly to PD on BERT_L2 (3.3% vs. 0%), and better than PD on BERT_L4 (106% vs. -81.3%). For RaB@10, TS is similar to PD on BERT_L2 (5.1% vs. -2.5%) and better than PD on BERT_L4 (90.9% vs. 9.1%). Evaluating the fairness level of BERT_L4 and BERT_L2, we find that the more complicated the model is, the more biased it is, which is also demonstrated in Rekabsaz and Schedl (2020). The RaB results are not always consistent with |∆A-PRF|, mainly because |∆A-PRF| focuses more on lower-ranked passages while RaB focuses on higher-ranked passages, so the two fairness metrics capture relatively distinct aspects of the ranking. However, when the ranking system performs well (ranks the clicked passage high), |∆A-PRF| ultimately reflects the overall ranking result.

Conclusion
In this paper, we present a novel in-processing in-batch balancing regularization method that mitigates ranking disparity while retaining ranking performance. We also overcome the non-differentiable and non-convex properties of the 0-normed PRF by proposing nPRF. Experiments on the MS MARCO dataset show that nPRF with T-statistics regularization outperforms the other methods on both fairness and ranking metrics. In future work, we will consider generalizing our method to multiple protected variables such as age and income, and addressing bias in the query by employing adversarial networks.

Bias Statement
In this paper, we study gender bias in neural retrieval systems. If a ranking system allocates resources or opportunities unfairly to specific gender groups (e.g., less favorably to females), it creates allocation harm by surfacing more and more male-dominated passages, which in turn produces a more biased dataset. When such a ranking system is used in practice, there is an additional risk of unequal performance across genders. Our work explores the bias level of the dense passage retrieval model with BERT_L2 and BERT_L4 on the MS MARCO passage re-ranking task. The community can thus use these benchmarks with a clearer understanding of their bias level and work towards developing fairer models.

Figure 1 :
Figure 1: An example of the In-Batch Balancing Regularization method. For each query, we calculate the typical ranking loss and the fairness loss from IBBR on the top K retrieved passages, and jointly optimize the two. There are two ways of computing the IBBR term: the pairwise difference loss and the T-statistics loss.
$P^f_{[1:B]}$ is the set of data pairs in the batch whose clicked passage belongs to group $f$, and $n_m = |P^m_{[1:B]}|$, $n_f = |P^f_{[1:B]}|$. Remarks. If many $\mathrm{nPRF}_m(s_x)$ values differ from the $\mathrm{nPRF}_f(s_y)$ values, groups $m$ and $f$ have different fairness levels over this batch, which introduces more loss. However, the PD loss considers neither the distribution information within the batch nor the imbalanced-data issue that arises when the group $m$ and group $f$ samples are imbalanced. We therefore propose the T-statistics loss to overcome this.

Table 1 :
The ranking and fairness results of the two IBBR methods, pairwise difference and T-statistics, combined with nPRF in the BERT_L2 and BERT_L4 models. We compare IBBR with the baseline DPR_L2 and DPR_L4 models on the re-ranking task, experimenting with different tuning methods for the fairness hyperparameter λ. The bold value in each column shows the best result for that metric. ↑ and ↓ indicate whether larger or smaller is better for the corresponding metric. ‡ indicates a statistically significant improvement (p-value < 0.05) over the DPR baseline on the fairness metrics.

The bias level decreases by 80% when applying TS. Overall, TS outperforms PD on average on the ranking metrics because it degrades them less, as the ranking metric columns show. This holds for both hand-tuned and auto-tuned λ and for both BERT_L2 and BERT_L4. Fairness Performance |∆A-PRF|. BERT_L2 + TS achieves an 80% reduction in |∆A-PRF| bias, and BERT_L4 + TS achieves a 100% reduction compared with the original BERT_L4. Besides, PD performs unsatisfactorily on the fairness metric compared with TS for both BERT_L2 and BERT_L4; we found that the variance of nPRF and the group imbalance, which is common in the training phase (#male-biased > #female-biased), hurt the performance of PD. Overall, nPRF + TS achieves the best performance in mitigating the |∆A-PRF| ranking disparity, which accomplishes our goal.