Paweł Morawiecki
2026
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists
Michał Pietruszka | Łukasz Borchmann | Aleksander Jędrosz | Paweł Morawiecki
Findings of the Association for Computational Linguistics: EACL 2026
We present FeatEng, a novel benchmark designed to evaluate the ability of large language models (LLMs) to perform feature engineering, a critical and knowledge-intensive task in data science. FeatEng assesses LLMs by their capacity to generate Python code that transforms raw tabular data into features that improve the performance of a downstream machine learning model. Our analysis of LLM outputs reveals that success on FeatEng often requires the application of significant world and domain knowledge, along with complex reasoning, to construct novel data representations. While focused on feature engineering, the benchmark probes a confluence of abilities indicative of an LLM’s broader potential for practical, data-centric problem-solving. We demonstrate that FeatEng offers a targeted and efficient approach to assess a specific but crucial aspect of LLM capabilities relevant to real-world data science applications.
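The evaluation idea in the abstract (score a feature-engineering function by how much it helps a downstream model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy height/weight data, the nearest-centroid classifier standing in for the downstream model, accuracy gain standing in for the benchmark's score, and the function names `bmi_only` and `feature_gain`. In the benchmark the transform would be code generated by an LLM; here it is written by hand.

```python
from statistics import mean

# Hypothetical toy data: (height_m, weight_kg) rows; label is 1 when BMI >= 25.
TRAIN_X = [(1.60, 50), (1.60, 70), (1.80, 70), (1.80, 90),
           (1.70, 60), (1.70, 80), (1.90, 85), (1.50, 60)]
TRAIN_Y = [0, 1, 0, 1, 0, 1, 0, 1]
TEST_X = [(1.65, 72), (1.85, 80), (1.55, 62), (1.95, 90)]
TEST_Y = [1, 0, 1, 0]


def identity(rows):
    """Baseline transform: keep the raw columns unchanged."""
    return [list(r) for r in rows]


def bmi_only(rows):
    """Example engineered feature that encodes domain knowledge (body mass index)."""
    return [[w / (h * h)] for h, w in rows]


def fit_centroids(X, y):
    """Stand-in downstream model: one mean vector per class."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(transform):
    """Fit on transformed training rows, then score on transformed test rows."""
    centroids = fit_centroids(transform(TRAIN_X), TRAIN_Y)
    preds = [predict(centroids, x) for x in transform(TEST_X)]
    return sum(p == t for p, t in zip(preds, TEST_Y)) / len(TEST_Y)


def feature_gain(transform):
    """Accuracy improvement of a transform over the raw-feature baseline."""
    return accuracy(transform) - accuracy(identity)


if __name__ == "__main__":
    print("baseline accuracy:", accuracy(identity))
    print("gain from bmi_only:", feature_gain(bmi_only))
```

On this toy data the raw columns mislead the stand-in model because unscaled weight dominates the distance, while the single ratio feature separates the classes, so `feature_gain(bmi_only)` is positive. Knowing that the ratio of weight to squared height is the meaningful quantity is the kind of domain knowledge the abstract says the task rewards.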