Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist
Martín Santillán Cooper | Zahra Ashktorab | Hyo Jin Do | Erik Miehling | Werner Geyer | Jasmina Gajcin | Elizabeth M. Daly | Qian Pan | Michael Desmond
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present a synthetic data generation tool integrated into EvalAssist. EvalAssist is a web-based application designed to support the human-centered evaluation of language model outputs by allowing users to refine LLM-as-a-Judge evaluation criteria. The synthetic data generation tool in EvalAssist is tailored for evaluation contexts and informed by findings from user studies with AI practitioners. Participants identified key pain points in current workflows, including circularity risks (where models are judged against criteria derived by the models themselves), compounded bias (the amplification of biases across multiple stages of a pipeline), and poor support for edge cases, and expressed a strong preference for real-world grounding and fine-grained control. In response, our tool supports flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows. We also incorporate features for quality assurance and edge case discovery.
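To make the described workflow concrete, the sketch below shows one way persona-conditioned, RAG-grounded generation of synthetic test cases for an LLM-as-a-Judge criterion might look. It is a minimal illustration under stated assumptions, not EvalAssist's actual API: the `complete` callable stands in for any LLM client, and the prompt template, persona list, and criterion are hypothetical.

```python
from typing import Callable, List

def build_generation_prompt(criterion: str, persona: str, grounding: str = "") -> str:
    """Compose a prompt asking an LLM to synthesize one test case.

    `criterion` is the LLM-as-a-Judge criterion under refinement,
    `persona` varies the voice of the generated example, and
    `grounding` optionally carries a retrieved real-world passage (RAG).
    """
    parts = [
        f"You are generating evaluation data for the criterion: {criterion}.",
        f"Write one model response as it might be produced for a user who is {persona}.",
        "Include at least one borderline aspect so the example probes edge cases.",
    ]
    if grounding:
        parts.append(f"Ground the response in the following context:\n{grounding}")
    return "\n".join(parts)

def generate_test_cases(
    complete: Callable[[str], str],  # stand-in for any LLM completion call
    criterion: str,
    personas: List[str],
    grounding_docs: List[str],
) -> List[dict]:
    """Generate one persona-diverse synthetic test case per persona."""
    cases = []
    for i, persona in enumerate(personas):
        grounding = grounding_docs[i % len(grounding_docs)] if grounding_docs else ""
        prompt = build_generation_prompt(criterion, persona, grounding)
        cases.append({"persona": persona, "prompt": prompt, "output": complete(prompt)})
    return cases

if __name__ == "__main__":
    # Echo stub so the sketch runs without an API key; swap in a real LLM client.
    fake_llm = lambda prompt: f"<model output for prompt of {len(prompt)} chars>"
    cases = generate_test_cases(
        fake_llm,
        criterion="The response is factually grounded in the provided context",
        personas=["a skeptical domain expert", "a first-time user"],
        grounding_docs=["Example retrieved passage about product returns."],
    )
    for case in cases:
        print(case["persona"], "->", case["output"])
```

Iterating this loop with edited criteria and fresh personas approximates the iterative generation workflow described in the abstract; the generated cases would then be reviewed for quality and edge-case coverage before use.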