PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Afra Feyza Akyürek; Advait Gosai; Chen Bo Calvin Zhang; Vipul Gupta; Jaehwan Jeong; Anisha Gunjal; Tahseen Rabbani; Maria Mazzone; David Randolph IV; Mohammad Mahmoudi Meymand; Gurshaan Chattha; Paula Rodriguez; Diego A. Mares Buendia; Pavit Singh; Michael Liu; Subodh Chawla; Peter Cline; Lucy Ogaz; Ernesto Gabriel Hernández Montoya; Zihao Wang; Pavi Bhatter; Marcos Ayestaran; Bing Liu; Yunzhong He

Abstract

Frontier model progress is often measured using academic benchmarks that provide a limited view of performance on open-ended, economically consequential tasks in high-stakes professional domains where practical returns matter most. We introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it the largest public, rubric-based benchmark for both the legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contribute questions inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog the economic impacts associated with the prompts and analyze performance using human-annotated rubric categories. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in model reliability for professional adoption.

Anthology ID: 2026.acl-long.1958
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 42297–42325
URL: https://aclanthology.org/2026.acl-long.1958/
Cite (ACL): Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph IV, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego A. Mares Buendia, Pavit Singh, Michael Liu, Subodh Chawla, Peter Cline, Lucy Ogaz, Ernesto Gabriel Hernández Montoya, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, and Yunzhong He. 2026. PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 42297–42325, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning (Akyürek et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1958.pdf
Checklist: 2026.acl-long.1958.checklist.pdf
