Zhi Rui Tam

Also published as: Zhi-Rui Tam

2026

The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues
Zhi Rui Tam | Wen Yu Chang | Yun-Nung Chen
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology

This study systematically compares end-to-end (E2E) audio language models (AudioLMs) against modular (ASR, LLM, TTS) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that E2E configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the E2E models’ dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of AudioLMs and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements.

2025

pdf bib

Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
Kai-Wei Chang | Ke-Han Lu | Chih-Kai Yang | Zhi-Rui Tam | Wen-Yu Chang | Chung-Che Wang
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)

pdf bib abs

None of the Above, Less of the Right Parallel Patterns in Human and LLM Performance on Multi-Choice Questions Answering
Zhi Rui Tam | Cheng-Kuang Wu | Chieh-Yen Lin | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2025

Multiple-choice exam questions with “None of the above” (NA) options have been extensively studied in educational testing, in which existing research suggests that they better assess true knowledge. However, their impact on Large Language Models (LLMs) evaluation remains underexplored. Through systematic experiments with 28 LLMs on the MMLU benchmark, we examine how NA options affect model performance and confidence calibration. Our analysis reveals that NA options, when used as the correct answer, lead to a consistent 30-50% performance drop across models regardless of scale–suggesting that LLMs lack the meta-cognitive ability to systematically evaluate and reject all given options when none are correct. This degradation shows strong domain dependence, with minimal impact on mathematical reasoning (14.6% drop) but severe effects on tasks requiring uncertainty handling like business ethics (48.1% drop). Our results highlight important implications for benchmark design and raise questions about LLMs’ ability to handle uncertainty in real-world applications.

2024

pdf bib abs

I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation
Cheng-Kuang Wu | Zhi Rui Tam | Chao-Chung Wu | Chieh-Yen Lin | Hung-yi Lee | Yun-Nung Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This study explores the proactive ability of LLMs to seek user support. We propose metrics to evaluate the trade-off between performance improvements and user burden, and investigate whether LLMs can determine when to request help under varying information availability. Our experiments show that without external feedback, many LLMs struggle to recognize their need for user support. The findings highlight the importance of external signals and provide insights for future research on improving support-seeking strategies. Source code: https://github.com/appier-research/i-need-help

pdf bib abs

Let Me Speak Freely? A Study On The Impact Of Format Restrictions On Large Language Model Performance.
Zhi Rui Tam | Cheng-Kuang Wu | Yi-Lin Tsai | Chieh-Yen Lin | Hung-yi Lee | Yun-Nung Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs).This study investigates whether such constraints on generation space impact LLMs’ abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs’ performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs’ reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.

Co-authors

Venues

Fix author