Vedant Gaur


2025

The Progress Illusion: Revisiting meta-evaluation standards of LLM evaluators
Tianruo Rose Xu | Vedant Gaur | Liu Leqi | Tanya Goyal
Findings of the Association for Computational Linguistics: EMNLP 2025

LLM judges have gained popularity as an inexpensive and performant substitute for human evaluation. However, we observe that the meta-evaluation setting in which the reliability of these LLM evaluators is established differs substantially from their use in model development. To address this, we revisit meta-evaluations of LLM evaluators under a setting that more closely aligns with practice by examining evaluators’ ability to distinguish test system pairs that are closer in capability. Our fine-grained approach shows that all LLM evaluators’ correlations with human judgments are concerningly low when the models perform similarly, showcasing a key limitation of current norms. Equipped with this better methodology, we next analyze the impact that the choice of reference model has on LLM-as-a-judge evaluator performance. We show that single-reference evaluators only perform well at ranking test systems that fall within particular capability ranges, even if the standard meta-evaluation reports high overall correlation. Taken together, our analysis reveals critical issues with current LLM meta-evaluation, and we recommend avenues for improvement.

2023

Reasoning in Large Language Models Through Symbolic Math Word Problems
Vedant Gaur | Nikunj Saunshi
Findings of the Association for Computational Linguistics: ACL 2023

Large language models (LLMs) have revolutionized NLP by solving downstream tasks with little to no labeled data. Despite their versatile abilities, the larger question of their ability to reason remains ill-understood. This paper addresses reasoning in math word problems (MWPs) by studying symbolic versions of the numeric problems, since a symbolic expression is a “concise explanation” of the numeric answer. We create and use a symbolic version of the SVAMP dataset and find that GPT-3’s davinci-002 model also has good zero-shot accuracy on symbolic MWPs. To evaluate the faithfulness of the model’s reasoning, we go beyond accuracy and additionally evaluate the alignment between the final answer and the outputted reasoning, which correspond to the numeric and symbolic answers respectively for MWPs. We explore a self-prompting approach to encourage the symbolic reasoning to align with the numeric answer, thus equipping the LLM with the ability to provide concise and verifiable reasoning and making it more interpretable. Surprisingly, self-prompting also improves the symbolic accuracy to be higher than both the numeric and symbolic accuracies, thus providing an ensembling effect. The SVAMP-Sym dataset will be released for future research on symbolic math problems.