Yifan Ethan Xu
2023
Tab-Cleaner: Weakly Supervised Tabular Data Cleaning via Pre-training for E-commerce Catalog
Kewei Cheng | Xian Li | Zhengyang Wang | Chenwei Zhang | Binxuan Huang | Yifan Ethan Xu | Xin Luna Dong | Yizhou Sun
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Product catalogs, conceptually text-rich tables, are self-reported by individual retailers and thus inevitably contain noisy facts. Verifying the textual attributes in product catalogs is essential to improving their reliability. However, popular methods for processing free-text content, such as pre-trained language models, are not particularly effective on structured tabular data because they are typically trained on free-form natural language text. In this paper, we present Tab-Cleaner, a model designed to handle error detection over text-rich tabular data following a pre-training / fine-tuning paradigm. We train Tab-Cleaner on a real-world Amazon Product Catalog table covering millions of products and show improvements over state-of-the-art methods of 16% in PR AUC on the attribute applicability classification task and 11% in PR AUC on the attribute value validation task.