@inproceedings{kamath-etal-2025-benchmarking,
title = "Benchmarking {H}indi {LLM}s: A New Suite of Datasets and a Comparative Analysis",
author = "Kamath, Anusha and
Singla, Kanishk and
Paul, Rakesh and
Joshi, Raviraj Bhuminand and
Vaidya, Utkarsh and
Chauhan, Sanjay Singh and
Wartikar, Niranjan",
editor = "Bhattacharya, Arnab and
Goyal, Pawan and
Ghosh, Saptarshi and
Ghosh, Kripabandhu",
booktitle = "Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)",
month = dec,
year = "2025",
address = "Mumbai, India",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.bhasha-1.5/",
pages = "52--68",
ISBN = "979-8-89176-313-5",
abstract = "Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="kamath-etal-2025-benchmarking">
  <titleInfo>
    <title>Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis</title>
  </titleInfo>
  <name type="personal">
    <namePart type="given">Anusha</namePart>
    <namePart type="family">Kamath</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Kanishk</namePart>
    <namePart type="family">Singla</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Rakesh</namePart>
    <namePart type="family">Paul</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Raviraj</namePart>
    <namePart type="given">Bhuminand</namePart>
    <namePart type="family">Joshi</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Utkarsh</namePart>
    <namePart type="family">Vaidya</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Sanjay</namePart>
    <namePart type="given">Singh</namePart>
    <namePart type="family">Chauhan</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Niranjan</namePart>
    <namePart type="family">Wartikar</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <originInfo>
    <dateIssued>2025-12</dateIssued>
  </originInfo>
  <typeOfResource>text</typeOfResource>
  <relatedItem type="host">
    <titleInfo>
      <title>Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Arnab</namePart>
      <namePart type="family">Bhattacharya</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Pawan</namePart>
      <namePart type="family">Goyal</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Saptarshi</namePart>
      <namePart type="family">Ghosh</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Kripabandhu</namePart>
      <namePart type="family">Ghosh</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <originInfo>
      <publisher>Association for Computational Linguistics</publisher>
      <place>
        <placeTerm type="text">Mumbai, India</placeTerm>
      </place>
    </originInfo>
    <genre authority="marcgt">conference publication</genre>
    <identifier type="isbn">979-8-89176-313-5</identifier>
  </relatedItem>
  <abstract>Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.</abstract>
  <identifier type="citekey">kamath-etal-2025-benchmarking</identifier>
  <location>
    <url>https://aclanthology.org/2025.bhasha-1.5/</url>
  </location>
  <part>
    <date>2025-12</date>
    <extent unit="page">
      <start>52</start>
      <end>68</end>
    </extent>
  </part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis
%A Kamath, Anusha
%A Singla, Kanishk
%A Paul, Rakesh
%A Joshi, Raviraj Bhuminand
%A Vaidya, Utkarsh
%A Chauhan, Sanjay Singh
%A Wartikar, Niranjan
%Y Bhattacharya, Arnab
%Y Goyal, Pawan
%Y Ghosh, Saptarshi
%Y Ghosh, Kripabandhu
%S Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)
%D 2025
%8 December
%I Association for Computational Linguistics
%C Mumbai, India
%@ 979-8-89176-313-5
%F kamath-etal-2025-benchmarking
%X Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
%U https://aclanthology.org/2025.bhasha-1.5/
%P 52-68
Markdown (Informal)
[Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis](https://aclanthology.org/2025.bhasha-1.5/) (Kamath et al., BHASHA 2025)
ACL
Anusha Kamath, Kanishk Singla, Rakesh Paul, Raviraj Bhuminand Joshi, Utkarsh Vaidya, Sanjay Singh Chauhan, and Niranjan Wartikar. 2025. Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis. In Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025), pages 52–68, Mumbai, India. Association for Computational Linguistics.