Empowering Tabular Data Preparation with Language Models: Why and How?

Mengshi Chen; Yuxiang Sun; Tengchao Li; Jianwei Wang; Kai Wang; Xuemin Lin; Ying Zhang; Wenjie Zhang

Abstract

Data preparation is a critical step in enhancing the usability of tabular data and thus in boosting downstream data-driven tasks. Traditional methods often struggle to capture the intricate relationships within tables and to adapt to the tasks involved. Recent advances in Language Models (LMs), especially Large Language Models (LLMs), offer new opportunities to automate and support tabular data preparation. However, why LMs suit tabular data preparation (i.e., how their capabilities match task demands) and how to use them effectively across phases have yet to be systematically explored. In this survey, we systematically analyze the role of LMs in enhancing tabular data preparation, focusing on four core phases: data acquisition, integration, cleaning, and transformation. For each phase, we present an integrated analysis of how LMs can be combined with other components for different preparation tasks, highlight key advancements, and outline prospective pipelines.

Anthology ID: 2026.acl-long.8
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 228–246
URL: https://aclanthology.org/2026.acl-long.8/
Cite (ACL): Mengshi Chen, Yuxiang Sun, Tengchao Li, Jianwei Wang, Kai Wang, Xuemin Lin, Ying Zhang, and Wenjie Zhang. 2026. Empowering Tabular Data Preparation with Language Models: Why and How?. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–246, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Empowering Tabular Data Preparation with Language Models: Why and How? (Chen et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.8.pdf
Checklist: 2026.acl-long.8.checklist.pdf
