Quality Assessment of Tabular Data using Large Language Models and Code Generation

Ashlesha Akella; Akshar Kaul; Krishnasuri Narayanam; Sameep Mehta

doi:10.18653/v1/2025.emnlp-industry.183

Abstract

Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation is often inefficient, dependent on human intervention, and computationally expensive. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and then synthesize executable validators for those rules with code-generating LLMs. To generate reliable quality rules, we support the LLMs with retrieval-augmented generation (RAG), drawing on external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
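The three stages named in the abstract can be pictured with a small, purely illustrative sketch. It is not the authors' implementation. A median/MAD filter stands in for the clustering-based inlier detection, and `stub_rule_llm` and `stub_code_llm` are hypothetical placeholders for the LLM calls. The RAG step is omitted. The column name, sample values, and threshold are all assumptions.

```python
from statistics import median


def filter_inliers(values, threshold=3.5):
    """Stage 1 stand-in: keep values with a small modified z-score.

    Assumption: the paper filters samples by clustering; this median/MAD
    rule is a simpler substitute used only for illustration.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread, so nothing can be judged an outlier
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


def stub_rule_llm(column, samples):
    """Stage 2 stand-in (hypothetical): an LLM would propose the rule."""
    return {"column": column, "kind": "range",
            "min": min(samples), "max": max(samples)}


def stub_code_llm(rule):
    """Stage 3 stand-in (hypothetical): a code LLM would emit this source."""
    return (
        "def validate(value):\n"
        f"    return {rule['min']!r} <= value <= {rule['max']!r}\n"
    )


def load_validator(source, samples):
    """Guardrail: the code must define validate() and accept every inlier."""
    namespace = {}
    exec(source, namespace)  # a real system would sandbox generated code
    validate = namespace.get("validate")
    if not callable(validate):
        raise ValueError("generated code does not define validate()")
    if not all(validate(s) for s in samples):
        raise ValueError("validator rejects known-good samples")
    return validate


ages = [34, 29, 41, 38, 27, 45, 31, 36, 420, -5]  # made-up column values
inliers = filter_inliers(ages)
rule = stub_rule_llm("age", inliers)
validator = load_validator(stub_code_llm(rule), inliers)
flagged = [v for v in ages if not validator(v)]
print(flagged)
```

Deriving the rule only from filtered samples keeps obvious outliers from widening the accepted range, and checking the generated validator against those same samples is one simple form a guardrail could take.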

Anthology ID: 2025.emnlp-industry.183
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: November
Year: 2025
Address: Suzhou (China)
Editors: Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 2713–2748
URL: https://aclanthology.org/2025.emnlp-industry.183/
DOI: 10.18653/v1/2025.emnlp-industry.183
Cite (ACL): Ashlesha Akella, Akshar Kaul, Krishnasuri Narayanam, and Sameep Mehta. 2025. Quality Assessment of Tabular Data using Large Language Models and Code Generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2713–2748, Suzhou (China). Association for Computational Linguistics.
Cite (Informal): Quality Assessment of Tabular Data using Large Language Models and Code Generation (Akella et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-industry.183.pdf