2024
Ranking Large Language Models without Ground Truth
Amit Dhurandhar | Rahul Nair | Moninder Singh | Elizabeth Daly | Karthikeyan Natesan Ramamurthy
Findings of the Association for Computational Linguistics: ACL 2024
Evaluation and ranking of large language models (LLMs) have become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models, where each one of them evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover true rankings without reference data. This points to a viable low-resource mechanism for practical use.
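The triplet idea can be summarized in a short sketch. The snippet below is an illustrative simplification, not the paper's exact algorithm: generate and judge are hypothetical callables (produce a model's response, and let an evaluator model pick the better of two responses), triplets are drawn naively from the remaining pool, and the final two models are left unresolved.

```python
# Illustrative sketch of triplet-based ranking without ground truth.
# `generate(model, prompt)` and `judge(evaluator, prompt, out_a, out_b)` are
# hypothetical callables; `judge` returns True if the first output is better.
from collections import Counter

def worst_in_triplet(triplet, prompts, generate, judge):
    """Each model judges the other two on every prompt; the model that loses
    the most pairwise judgments is flagged as the worst of the triplet."""
    losses = Counter({m: 0 for m in triplet})
    for prompt in prompts:
        outputs = {m: generate(m, prompt) for m in triplet}
        for evaluator in triplet:
            a, b = [m for m in triplet if m != evaluator]
            first_better = judge(evaluator, prompt, outputs[a], outputs[b])
            losses[b if first_better else a] += 1
    return max(losses, key=losses.get)

def rank_by_elimination(models, prompts, generate, judge):
    """Repeatedly drop the model judged worst in a triplet from the remaining
    pool; survivors come first and eliminations are appended in reverse order,
    giving a rough best-to-worst ranking."""
    remaining, eliminated = list(models), []
    while len(remaining) >= 3:
        worst = worst_in_triplet(remaining[:3], prompts, generate, judge)
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining + eliminated[::-1]  # last two models left untested here
```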
Human-Centered Design Recommendations for LLM-as-a-judge
Qian Pan | Zahra Ashktorab | Michael Desmond | Martín Santillán Cooper | James Johnson | Rahul Nair | Elizabeth Daly | Werner Geyer
Proceedings of the 1st Human-Centered Large Language Modeling Workshop
Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain a significant concern. Integrating human input is crucial to ensure that the evaluation criteria are aligned with the human’s intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM, which enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners’ preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-judge systems.
On Efficient and Statistical Quality Estimation for Data Annotation
Jan-Christoph Klie | Juan Haladjian | Marc Kirchner | Rahul Nair
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Annotated datasets are an essential ingredient to train, evaluate, compare, and productionize supervised machine learning models. It is therefore imperative that annotations are of high quality. Their creation requires good quality management and thereby reliable quality estimates, so that if quality turns out to be insufficient during the annotation process, rectifying measures can be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect, but checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; their sizes are chosen mostly without justification or regard to statistical power and, more often than not, are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate, while using unnecessarily large sample sizes costs money that could be better spent, for instance, on more annotations. We therefore first describe in detail how to use confidence intervals to find the minimal sample size needed to estimate the annotation error rate. We then propose acceptance sampling as an alternative to error rate estimation and show that it can reduce the required sample sizes by up to 50% while providing the same statistical guarantees.
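As a concrete illustration of the confidence-interval step, the sketch below uses the standard normal-approximation formula for a binomial proportion to compute a minimal sample size, together with the acceptance probability of a simple single-sampling plan. This is a minimal sketch under those textbook assumptions, not the paper's exact procedure; the function names are ours.

```python
# Minimal sketch: sample size from a normal-approximation confidence interval,
# and the acceptance probability of a single sampling plan (binomial model).
import math
from statistics import NormalDist

def sample_size_for_error_rate(expected_rate, margin, confidence=0.95):
    """Smallest n such that a (confidence)-level normal-approximation CI for the
    error rate has half-width at most `margin`, assuming the true rate is near
    `expected_rate` (use 0.5 for the conservative worst case)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * expected_rate * (1 - expected_rate) / margin**2)

def acceptance_probability(n, c, true_error_rate):
    """Probability a batch is accepted under a single sampling plan: inspect n
    items and accept if at most c of them are erroneous."""
    return sum(math.comb(n, k) * true_error_rate**k * (1 - true_error_rate)**(n - k)
               for k in range(c + 1))

# e.g. to pin down an error rate expected around 10% to within +/-3 points:
# sample_size_for_error_rate(0.10, 0.03) -> 385 (vs. 1068 for the worst case 0.5)
```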
2018
Towards Automated Extraction of Business Constraints from Unstructured Regulatory Text
Rahul Nair | Killian Levacher | Martin Stephenson
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations
Large organizations spend considerable resources on reviewing regulations and ensuring that their business processes are compliant with the law. To make compliance workflows more efficient and responsive, we present a system for machine-driven annotation of legal documents. A set of natural language processing pipelines is designed to address key questions in this domain: (a) is this (new) regulation relevant for me? (b) what set of requirements does this law impose? and (c) what is the regulatory intent of a law? The system is currently undergoing user trials within our organization.
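To make the three questions concrete, here is a hypothetical skeleton showing how they might map onto pipeline stages; the stage callables are placeholders for illustration only, not the system described in the paper.

```python
# Hypothetical skeleton mapping the three questions to pipeline stages.
# The classify_*/extract_* callables are placeholders, not the paper's models.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RegulatoryAnnotation:
    relevant: bool                                          # (a) relevant for me?
    requirements: List[str] = field(default_factory=list)   # (b) imposed requirements
    intent: str = ""                                        # (c) regulatory intent

def annotate_regulation(text, business_profile, classify_relevance,
                        extract_requirements, classify_intent):
    """Run the three stages in sequence, skipping the costlier extraction steps
    when the document is judged irrelevant to the business profile."""
    if not classify_relevance(text, business_profile):
        return RegulatoryAnnotation(relevant=False)
    return RegulatoryAnnotation(
        relevant=True,
        requirements=extract_requirements(text),
        intent=classify_intent(text),
    )
```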