LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models
Harsh Saini | Md Tahmid Rahman Laskar | Cheng Chen | Elham Mohammadi | David Rossouw
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, 2025
Large Language Models (LLMs) have demonstrated impressive capabilities in solving a wide range of tasks in recent years. This has inspired researchers and practitioners in real-world industrial domains to build useful products by leveraging LLMs. However, before deploying LLM-based solutions for real-world use, it is crucial to evaluate LLMs extensively in terms of accuracy, memory management, and inference latency, while ensuring the reproducibility of the results. In addition, when evaluating LLMs on internal customer data, an on-premise evaluation system is necessary to protect customer privacy, rather than sending customer data to third-party APIs. In this paper, we demonstrate how we built an on-premise system for LLM evaluation to address these challenges in real-world industrial settings. We describe the complexities of consolidating various datasets, models, and inference-related artifacts in complex LLM inference pipelines, and we present a case study from a real-world industrial setting. This demonstration of LLM evaluation tool development should help researchers and practitioners build on-premise systems for LLM evaluation that ensure privacy, reliability, robustness, and reproducibility.
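The abstract does not include code, but as a rough illustration of the kind of measurements such an on-premise tool needs to collect (task accuracy, per-example inference latency, and memory usage, all without sending data to third-party APIs), the sketch below shows a minimal local evaluation loop. All names here (evaluate_locally, generate_fn, metric_fn) are hypothetical and are not the paper's actual API; memory is tracked only at the Python level via tracemalloc for simplicity.

```python
import time
import statistics
import tracemalloc


def evaluate_locally(generate_fn, dataset, metric_fn, warmup=1):
    """Run one on-premise evaluation pass over `dataset`, collecting
    accuracy, per-example latency, and peak Python-level memory usage.

    generate_fn -- callable mapping an input string to a model output string
                   (e.g. a wrapper around a locally hosted LLM)
    dataset     -- list of (input_text, reference) pairs
    metric_fn   -- callable scoring (prediction, reference) -> float in [0, 1]
    """
    # Warm-up calls so one-time initialization does not skew latency numbers.
    for text, _ in dataset[:warmup]:
        generate_fn(text)

    scores, latencies = [], []
    tracemalloc.start()
    for text, reference in dataset:
        start = time.perf_counter()
        prediction = generate_fn(text)
        latencies.append(time.perf_counter() - start)
        scores.append(metric_fn(prediction, reference))
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "accuracy": statistics.mean(scores),
        "latency_p50_s": statistics.median(latencies),
        "latency_mean_s": statistics.mean(latencies),
        "peak_mem_mb": peak_bytes / 1e6,
        "num_examples": len(dataset),
    }


if __name__ == "__main__":
    # Toy stand-in for a locally hosted LLM: looks up the expected answer.
    toy_data = [("2 + 2 =", "4"), ("capital of France?", "Paris")]
    answers = {"2 + 2 =": "4", "capital of France?": "Paris"}
    report = evaluate_locally(
        generate_fn=lambda text: answers[text],
        dataset=toy_data,
        metric_fn=lambda pred, ref: float(pred.strip() == ref),
    )
    print(report)
```

In a real deployment, generate_fn would wrap the locally served model's inference call and the returned report would be persisted (with the model version, dataset revision, and decoding parameters) so that runs remain reproducible, in line with the reproducibility goal stated above.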