Reza Mousavi
2026
Generalization or Memorization? Multi-Agent vs. Baseline LLMs and AutoML Models for Tabular Classification
Aida Sanatizadeh | Sorouralsadat Fatemi | Reza Mousavi | Ahmed Abbasi
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly used for structured tabular data, yet it remains unclear whether their performance reflects genuine reasoning or memorization of pre-training corpora. We investigate this question through a rigorous, contamination-aware evaluation of a representative modular Multi-Agent LLM (MALLM) framework against state-of-the-art AutoML systems and established baselines (TABLET, TABLLM). We evaluate eleven binary classification tasks: five pre-cutoff benchmarks likely seen during LLM pre-training and six post-cutoff datasets released after the LLM knowledge cutoff. Results show a sharp performance dichotomy: MALLM achieves competitive or superior performance on pre-cutoff datasets but substantially underperforms AutoML on post-cutoff data, exhibiting poor calibration and high variance, especially on hard-to-classify instances. By contrast, AutoML models generalize consistently and align confidence more closely with instance hardness. These findings suggest that, despite agentic scaffolding, current LLMs cannot yet replace production-grade discriminative models for tabular classification, underscoring the need for contamination-free benchmarks to accurately assess tabular reasoning capabilities.
CaBSALLM: Efficient Context-Aware Batch Annotation of Conversational Streams with Large Language Models
Mohammadsadegh Abolhasani | Reza Mousavi | Paul Jen-Hwa Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Analyses of parasocial cues in live-stream chats require accurate, efficient, and scalable annotation. However, manual annotation is tedious, and large language models (LLMs) often make mistakes when applying subjective, discourse-dependent labels. This study proposes Context-aware Batching for Stream Annotation with LLMs (CaBSALLM), an efficient pipeline that incorporates lightweight conversational context and a novel dynamic batching method to improve throughput and scalability. Compared with state-of-the-art pipelines, this generalizable approach is significantly more time- and cost-efficient while achieving comparable or better predictive performance and agreement.