Naihao Deng - ACL Anthology

Naihao Deng

2026

What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects
Naihao Deng | Sheng Zhang | Henghui Zhu | Shuaichen Chang | Jiani Zhang | Alexander Hanbo Li | Chung-Wei Hang | Hideo Kobayashi | Yiqun Hu | Patrick Ng
Findings of the Association for Computational Linguistics: EACL 2026

Table modeling has progressed for decades. In this work, we revisit this trajectory and highlight emerging challenges in the LLM era, particularly the paradox of choice: the difficulty of attributing performance gains amid diverse base models and training sets in the context of table instruction tuning. We replicate four table LLMs by instruction-tuning three foundation models on four existing datasets, yielding 12 models. We then evaluate these models across 16 table benchmarks. Our study is the first to quantitatively disentangle the effects of training data and base model selection, revealing that base model choice plays a more dominant role than the training data itself. Generalization and reasoning remain challenging, inviting future effort on table modeling. Based on our findings, we share our thoughts on the future directions for table modeling.

2025

Tables as Thought: Exploring Structured Thoughts in LLM Reasoning
Zhenjie Sun | Naihao Deng | Haofei Yu | Jiaxuan You
Proceedings of the 4th Table Representation Learning Workshop

Large language models’ reasoning abilities benefit from methods that organize their thought processes, such as chain-of-thought prompting, which employs a sequential structure to guide the reasoning process step-by-step. However, existing approaches focus primarily on organizing the sequence of thoughts, leaving structure in individual thought steps underexplored. To address this gap, we propose Table as Thought, a framework inspired by cognitive neuroscience theories on human thought. Table as Thought organizes reasoning within a tabular schema, where rows represent sequential thought steps and columns capture critical constraints and contextual information to enhance reasoning. The reasoning process iteratively populates the table until self-verification ensures completeness and correctness. Our experiments show that Table as Thought excels in planning tasks and demonstrates a strong potential for enhancing LLM performance in mathematical reasoning compared to unstructured thought baselines. This work provides a novel exploration of refining thought representation within LLMs, paving the way for advancements in reasoning and AI cognition.

Rethinking Table Instruction Tuning
Naihao Deng | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2025

Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.

CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation
Naihao Deng | Kapotaksha Das | Rada Mihalcea | Vitaliy Popov | Mohamed Abouelenien
Findings of the Association for Computational Linguistics: ACL 2025

In clinical operations, teamwork can be the crucial factor that determines the final outcome. Prior studies have shown that sufficient collaboration is the key factor that determines the outcome of an operation. To understand how the team practices teamwork during the operation, we collected **CliniDial** from simulations of medical operations. **CliniDial** includes the audio data and its transcriptions, the simulated physiology signals of the patient manikins, and how the team operates from two camera angles. We annotate behavior codes following an existing framework to understand the teamwork process for **CliniDial**. We pinpoint three main characteristics of our dataset, including its label imbalances, rich and natural interactions, and multiple modalities, and conduct experiments to test existing LLMs’ capabilities on handling data with these characteristics. Experimental results show that **CliniDial** poses significant challenges to the existing models, inviting future effort on developing methods that can deal with real-world clinical data. We open-source the codebase at https://github.com/MichiganNLP/CliniDial.

Chumor 2.0: Towards Better Benchmarking Chinese Humor Understanding from (Ruo Zhi Ba)
Ruiqi He | Yushu He | Longju Bai | Jiarui Liu | Zhenjie Sun | Zenghao Tang | He Wang | Hanchen Xia | Rada Mihalcea | Naihao Deng
Findings of the Association for Computational Linguistics: ACL 2025

Existing humor datasets and evaluations predominantly focus on English, leaving limited resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct **Chumor**, the first and the largest Chinese humor explanation dataset. **Chumor** is sourced from Ruo Zhi Ba (RZB, 弱智吧), a Chinese Reddit-like platform known for sharing intellectually challenging and culturally specific jokes. We test ten LLMs through direct and chain-of-thought prompting, revealing that **Chumor** poses significant challenges to existing LLMs, with their accuracy slightly above random and far below human. In addition, our analysis highlights that human-annotated humor explanations are significantly better than those generated by GPT-4o and ERNIE4-turbo. We release **Chumor** at https://huggingface.co/datasets/MichiganNLP/Chumor , our project page is at https://github.com/MichiganNLP/Chumor-2.0 , our leaderboard is at https://huggingface.co/spaces/MichiganNLP/Chumor-leaderboard , and our codebase is at https://github.com/MichiganNLP/Chumor-2.0 .

R³: “This is My SQL, Are You With Me?” A Consensus-Based Multi-Agent System for Text-to-SQL Tasks
Hanchen Xia | Feng Jiang | Naihao Deng | Cunxiang Wang | Guojiang Zhao | Rada Mihalcea | Yue Zhang
Proceedings of the 4th Table Representation Learning Workshop

Large Language Models (LLMs) have demon- strated exceptional performance across diverse tasks. To harness their capabilities for Text- to-SQL, we introduce R3 (Review-Rebuttal- Revision), a consensus-based multi-agent sys- tem for Text-to-SQL tasks. R3 achieves the new state-of-the-art performance of 89.9 on the Spider test set. In the meantime, R3 achieves 61.80 on the Bird development set. R3 out- performs existing single-LLM and multi-agent Text-to-SQL systems by 1.3% to 8.1% on Spi- der and Bird, respectively. Surprisingly, we find that for Llama-3-8B, R3 outperforms chain-of- thought prompting by over 20%, even outper- forming GPT-3.5 on the Spider development set. We open-source our codebase at https: //github.com/1ring2rta/R3.

Benchmarking and Improving LLM Robustness for Personalized Generation
Chimaobi Okite | Naihao Deng | Kiran Bodipati | Huaidian Hou | Joyce Chai | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user’s preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and align with the user preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness of LLMs in personalization, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fails to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.

2024

Recent progress in large language models (LLMs) has enabled the deployment of many generative NLP applications. At the same time, it has also led to a misleading public discourse that “it’s all been solved.” Not surprisingly, this has, in turn, made many NLP researchers – especially those at the beginning of their careers – worry about what NLP research area they should focus on. Has it all been solved, or what remaining questions can we work on regardless of LLMs? To address this question, this paper compiles NLP research directions rich for exploration. We identify fourteen different research areas encompassing 45 research directions that require new research and are not directly solvable by LLMs. While we identify many research areas, many others exist; we do not cover areas currently addressed by LLMs, but where LLMs lag behind in performance or those focused on LLM development. We welcome suggestions for other research directions to include: https://bit.ly/nlp-era-llm.

Tables as Texts or Images: Evaluating the Table Reasoning Ability of LLMs and MLLMs
Naihao Deng | Zhenjie Sun | Ruiqi He | Aman Sikka | Yulong Chen | Lin Ma | Yue Zhang | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2024

Tables contrast with unstructured text data by its structure to organize the information.In this paper, we investigate the efficiency of various LLMs in interpreting tabular data through different prompting strategies and data formats. Our analysis extends across six benchmarks for table-related tasks such as question-answering and fact-checking. We pioneer in the assessment of LLMs’ performance on image-based table representation. Specifically, we compare five text-based and three image-based table representations, revealing the influence of representation and prompting on LLM performance. We hope our study provides researchers insights into optimizing LLMs’ application in table-related tasks.

2023

Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
Siyang Liu | Naihao Deng | Sahand Sabour | Yilin Jia | Minlie Huang | Rada Mihalcea
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive tokenizer samples variable segmentations from multiple outcomes, with sampling probabilities optimized based on task-specific data. We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol that allows for the integration of task-specific tokens into the pre-trained model’s tokenization step. Through extensive experiments on psychological question-answering tasks in both Chinese and English, we find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens. Preliminary experiments point to promising results when using our tokenization approach with very large language models.

Hi-ToM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models
Yufan Wu | Yinghui He | Yilin Jia | Rada Mihalcea | Yulong Chen | Naihao Deng
Findings of the Association for Computational Linguistics: EMNLP 2023

Theory of Mind (ToM) is the ability to reason about one’s own and others’ mental states. ToM plays a critical role in the development of intelligence, language understanding, and cognitive processes. While previous work has primarily focused on first and second-order ToM, we explore higher-order ToM, which involves recursive reasoning on others’ beliefs. %We also incorporate a new deception mechanism in ToM reasoning. We introduce Hi-ToM, a Higher Order Theory of Mind benchmark. Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks, demonstrating the limitations of current LLMs. We conduct a thorough analysis of different failure cases of LLMs, and share our thoughts on the implications of our findings on the future of NLP.

Query Rewriting for Effective Misinformation Discovery
Ashkan Kazemi | Artem Abzaliev | Naihao Deng | Rui Hou | Scott A. Hale | Veronica Perez-Rosas | Rada Mihalcea
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

You Are What You Annotate: Towards Better Models through Annotator Representations
Naihao Deng | Xinliang Zhang | Siyang Liu | Winston Wu | Lu Wang | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2023

Annotator disagreement is ubiquitous in natural language processing (NLP) tasks. There are multiple reasons for such disagreements, including the subjectivity of the task, difficult cases, unclear guidelines, and so on. Rather than simply aggregating labels to obtain data annotations, we instead try to directly model the diverse perspectives of the annotators, and explicitly account for annotators’ idiosyncrasies in the modeling process by creating representations for each annotator (*annotator embeddings*) and also their annotations (*annotation embeddings*). In addition, we propose **TID-8**, **T**he **I**nherent **D**isagreement - **8** dataset, a benchmark that consists of eight existing language understanding datasets that have inherent annotator disagreement. We test our approach on TID-8 and show that our approach helps models learn significantly better from disagreements on six different datasets in TID-8 while increasing model size by fewer than 1% parameters. By capturing the unique tendencies and subjectivity of individual annotators through embeddings, our representations prime AI models to be inclusive of diverse viewpoints.

2022

DialogSum Challenge: Results of the Dialogue Summarization Shared Task
Yulong Chen | Naihao Deng | Yang Liu | Yue Zhang
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We report the results of DialogSum Challenge, the shared task on summarizing real-life sce- nario dialogues at INLG 2022. Four teams participate in this shared task and three submit their system reports, exploring different meth- ods to improve the performance of dialogue summarization. Although there is a great im- provement over the baseline models regarding automatic evaluation metrics, such as ROUGE scores, we find that there is a salient gap be- tween model generated outputs and human an- notated summaries by human evaluation from multiple aspects. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluatuion met- rics are in need.

The Cross-lingual Conversation Summarization Challenge
Yulong Chen | Ming Zhong | Xuefeng Bai | Naihao Deng | Jing Li | Xianchao Zhu | Yue Zhang
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We propose the shared task of cross-lingual conversation summarization, ConvSumX Challenge, opening new avenues for researchers to investigate solutions that integrate conversation summarization and machine translation. This task can be particularly useful due to the emergence of online meetings and conferences. We use a new benchmark, covering 2 real-world scenarios and 3 language directions, including a low-resource language, for evaluation. We hope that ConvSumX can motivate research to go beyond English and break the barrier for non-English speakers to benefit from recent advances of conversation summarization.

Analyzing the Effects of Annotator Gender across NLP Tasks
Laura Biester | Vanita Sharma | Ashkan Kazemi | Naihao Deng | Steven Wilson | Rada Mihalcea
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022

Recent studies have shown that for subjective annotation tasks, the demographics, lived experiences, and identity of annotators can have a large impact on how items are labeled. We expand on this work, hypothesizing that gender may correlate with differences in annotations for a number of NLP benchmarks, including those that are fairly subjective (e.g., affect in text) and those that are typically considered to be objective (e.g., natural language inference). We develop a robust framework to test for differences in annotation across genders for four benchmark datasets. While our results largely show a lack of statistically significant differences in annotation by males and females for these tasks, the framework can be used to analyze differences in annotation between various other demographic groups in future work. Finally, we note that most datasets are collected without annotator demographics and released only in aggregate form; we call on the community to consider annotator demographics as data is collected, and to release dis-aggregated data to allow for further work analyzing variability among annotators.

Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect
Naihao Deng | Yulong Chen | Yue Zhang
Proceedings of the 29th International Conference on Computational Linguistics

Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics in natural language into SQL queries and its practical application in building natural language interfaces to database systems. The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms. These challenges have been addressed to different extents by the recent advances. However, there is still a lack of comprehensive surveys for this task. To this end, we review recent progress on text-to-SQL for datasets, methods, and evaluation and provide this systematic survey, addressing the aforementioned challenges and discussing potential future directions. We hope this survey can serve as quick access to existing work and motivate future research.

In-the-Wild Video Question Answering
Santiago Castro | Naihao Deng | Pingxuan Huang | Mihai Burzo | Rada Mihalcea
Proceedings of the 29th International Conference on Computational Linguistics

Existing video understanding datasets mostly focus on human interactions, with little attention being paid to the “in the wild” settings, where the videos are recorded outdoors. We propose WILDQA, a video understanding dataset of videos recorded in outside settings. In addition to video question answering (Video QA), we also introduce the new task of identifying visual support for a given question and answer (Video Evidence Selection). Through evaluations using a wide range of baseline models, we show that WILDQA poses new challenges to the vision and language research communities. The dataset is available at https: //lit.eecs.umich.edu/wildqa/.

Co-authors

Artem Abzaliev 2

Laura Biester 2

Santiago Castro 2

Verónica Pérez-Rosas 2

Mohamed Abouelenien 1

Xuefeng Bai (白雪峰) 1

Kiran Bodipati 1

Shuaichen Chang 1

Kapotaksha Das 1

Aylin Ece Gunal 1

Scott A. Hale 1

Chung-Wei Hang 1

Pingxuan Huang 1

Feng Jiang (蒋峰) 1

Muhammad Khalifa 1

Hideo Kobayashi 1

Alexander Hanbo Li 1

Joan C. Nwatu 1

Chimaobi Okite 1

Vitaliy Popov 1

Sahand Sabour 1

Vanita Sharma 1

Cunxiang Wang 1

Steven Wilson 1

Xinliang Zhang 1

Guojiang Zhao 1

Venues

NLPerspectives1