A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models

Ahatsham Hayat, Mohammad R. Hasan


Abstract
This paper presents Contextually Relevant Imputation leveraging pre-trained Language Models (CRILM), a novel approach for handling missing data in tabular datasets. Instead of relying on traditional numerical estimation, CRILM uses pre-trained language models (LMs) to create contextually relevant descriptors for missing values. This method aligns datasets with LMs’ strengths: large LMs generate the descriptors, and small LMs are fine-tuned on the enriched datasets for improved downstream task performance. Our evaluations demonstrate CRILM’s superior performance and robustness across missing completely at random (MCAR), missing at random (MAR), and the more challenging missing not at random (MNAR) scenarios, with up to a 10% improvement over the best-performing baselines. By mitigating biases, particularly in MNAR settings, CRILM improves downstream task performance and offers a cost-effective solution for resource-constrained environments.
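
To make the workflow the abstract describes more concrete, the following is a minimal, hypothetical Python sketch: missing cells are replaced with contextually relevant textual descriptors, and each row is serialized into natural language on which a small LM could then be fine-tuned. The feature names, the MISSING_DESCRIPTORS mapping, the descriptor wording, and the sentence template are all illustrative assumptions, not the authors' actual prompts or code; in CRILM the descriptors are generated per feature by a large LM rather than hard-coded.

import math

# Hypothetical per-feature descriptors standing in for the contextually
# relevant descriptors a large LM would generate for missing values.
MISSING_DESCRIPTORS = {
    "age": "not recorded",
    "income": "not disclosed",
    "occupation": "unknown",
}

def serialize_row(row: dict) -> str:
    """Turn one tabular row into a sentence, substituting a contextual
    descriptor wherever the value is missing (None or NaN)."""
    parts = []
    for feature, value in row.items():
        is_missing = value is None or (
            isinstance(value, float) and math.isnan(value)
        )
        if is_missing:
            descriptor = MISSING_DESCRIPTORS.get(feature, "unavailable")
            parts.append(f"{feature} is {descriptor}")
        else:
            parts.append(f"{feature} is {value}")
    return "The person's " + ", ".join(parts) + "."

rows = [
    {"age": 42, "income": None, "occupation": "engineer"},
    {"age": float("nan"), "income": 55000, "occupation": None},
]

for r in rows:
    print(serialize_row(r))
# e.g. "The person's age is 42, income is not disclosed, occupation is engineer."
# These enriched sentences, paired with downstream-task labels, would then be
# used to fine-tune a small LM; that training loop is omitted here.
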
Anthology ID: 2025.coling-main.380
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 5668–5685
URL: https://aclanthology.org/2025.coling-main.380/
Cite (ACL): Ahatsham Hayat and Mohammad R. Hasan. 2025. A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5668–5685, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models (Hayat & Hasan, COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.380.pdf