Xuyang Wu
2025
Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems
Xuyang Wu
|
Shuowei Li
|
Hsin-Tai Wu
|
Zhiqiang Tao
|
Yi Fang
Proceedings of the 31st International Conference on Computational Linguistics
Retrieval-Augmented Generation (RAG) has recently gained significant attention for its enhanced ability to integrate external knowledge sources into open-domain question answering (QA) tasks. However, it remains unclear how these models address fairness concerns, particularly with respect to sensitive attributes such as gender, geographic location, and other demographic factors. First, as language models evolve to prioritize utility, like improving exact match accuracy, fairness considerations may have been largely overlooked. Second, the complex, multi-component architecture of RAG methods poses challenges in identifying and mitigating biases, as each component is optimized for distinct objectives. In this paper, we aim to empirically evaluate fairness in several RAG methods. We propose a fairness evaluation framework tailored to RAG, using scenario-based questions and analyzing disparities across demographic attributes. Our experimental results indicate that, despite recent advances in utility-driven optimization, fairness issues persist in both the retrieval and generation stages. These findings underscore the need for targeted interventions to address fairness concerns throughout the RAG pipeline. The dataset and code used in this study are publicly available at this GitHub Repository.
2024
Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers
Yuan Wang
|
Xuyang Wu
|
Hsin-Tai Wu
|
Zhiqiang Tao
|
Yi Fang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The integration of Large Language Models (LLMs) in information retrieval has raised a critical reevaluation of fairness in the text-ranking models. LLMs, such as GPT models and Llama2, have shown effectiveness in natural language understanding tasks, and prior works such as RankGPT have demonstrated that the LLMs have better performance than the traditional ranking models in the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs using the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis delves into how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both user and content perspectives, contributing an empirical benchmark for evaluating LLMs as the fair ranker.