Shashwat Vaibhav


2022

Makadi: A Large-Scale Human-Labeled Dataset for Hindi Semantic Parsing
Shashwat Vaibhav | Nisheeth Srivastava
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference

Parsing natural language queries into formal database calls is a well-studied problem. Because of the rich diversity of semantic markers across the world’s languages, progress in solving this problem is irreducibly language-dependent. This has created an asymmetry in progress on natural language interface to database (NLIDB) solutions: most state-of-the-art efforts focus on resource-rich English, and limited progress has been seen for low-resource languages. In this short paper, we present Makadi, a large-scale, complex, cross-lingual, cross-domain text-to-SQL dataset for semantic parsing in the Hindi language. Produced by translating the recently introduced English-language Spider NLIDB dataset, it consists of 9693 questions and SQL queries on 166 databases with multiple tables, covering multiple domains. This is the first large-scale dataset in the Hindi language for semantic parsing and related language understanding tasks. Our dataset is publicly available at: Link removed to preserve anonymization during peer review.