Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

Jinquan Zheng; Jia Yuan; Jiacheng Yao; Chenyang Gu; Pujun Zheng; Guoxiu He

Abstract

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors such as option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address these issues, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) a cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) a consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code is available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
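
The two mechanisms named in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names, the form of the consistency bonus (agreement fraction scaled by a hypothetical `bonus` weight), and the unnormalized advantage. It is not the authors' implementation; see the linked repository for that.

```python
from collections import Counter
from itertools import islice, permutations
from statistics import mean


def build_permutation_group(options, k):
    """Return up to k distinct orderings of the options (hypothetical helper)."""
    return [list(p) for p in islice(permutations(options), k)]


def consistency_aware_rewards(choices, gold, bonus=0.5):
    """Score the option *content* chosen under each permutation.

    Assumed reward form: 1.0 for a correct choice, plus `bonus` times the
    fraction of permutations in the group that made the same choice.
    """
    counts = Counter(choices)
    n = len(choices)
    return [(1.0 if c == gold else 0.0) + bonus * counts[c] / n for c in choices]


def cross_permutation_advantages(rewards):
    """Advantage of each response relative to the mean reward over the whole group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# One question answered under four orderings; the model picks "Rome" once.
choices = ["Paris", "Paris", "Rome", "Paris"]
rewards = consistency_aware_rewards(choices, gold="Paris")
advantages = cross_permutation_advantages(rewards)
print(rewards)
print(advantages)
```

In this sketch the inconsistent pick receives a negative advantage because its reward falls below the group mean, while the three agreeing picks share a positive one. Standard GRPO typically also divides by the group's reward standard deviation; that step is omitted here for brevity.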

Anthology ID: 2026.acl-long.1621
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 35125–35143
URL: https://aclanthology.org/2026.acl-long.1621/
Cite (ACL): Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, and Guoxiu He. 2026. Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35125–35143, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO (Zheng et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1621.pdf
Checklist: 2026.acl-long.1621.checklist.pdf
