Kunat Pipatanakul
2026
AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation
Potsawee Manakul | Woody Haosheng Gan | Michael J Ryan | Ali Sartaz Khan | Warit Sirichotedumrong | Kunat Pipatanakul | William Barr Held | Diyi Yang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Current speech evaluation suffers from two critical limitations: the need for, and difficulty of, designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification, and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.
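As an illustrative sketch only (not code from the paper), the system-ranking step of such a multi-aspect ensemble can be pictured as averaging per-aspect judge scores for each system and checking Spearman correlation against human preferences. All system names and score values below are invented placeholders; only scipy.stats.spearmanr is assumed.

```python
# Illustrative sketch: ensemble of specialized judge scores and
# system-level Spearman correlation against human preferences.
# All numbers and system names are hypothetical placeholders.
from scipy.stats import spearmanr

# Per-system scores from three hypothetical specialized judges.
judge_scores = {
    "system_a": {"lexical": 0.82, "quality": 0.74, "paralinguistic": 0.69},
    "system_b": {"lexical": 0.78, "quality": 0.81, "paralinguistic": 0.72},
    "system_c": {"lexical": 0.65, "quality": 0.60, "paralinguistic": 0.58},
    "system_d": {"lexical": 0.71, "quality": 0.69, "paralinguistic": 0.75},
}

# Hypothetical human preference scores (e.g., win rates) for the same systems.
human_scores = {"system_a": 0.70, "system_b": 0.76, "system_c": 0.48, "system_d": 0.66}

systems = sorted(judge_scores)
# Simple unweighted ensemble: average the three aspect judges per system.
ensemble = [sum(judge_scores[s].values()) / 3 for s in systems]
human = [human_scores[s] for s in systems]

rho, p_value = spearmanr(ensemble, human)
print(f"Spearman rho between ensemble judge and human ranking: {rho:.2f} (p={p_value:.3f})")
```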
Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
Yuatyong Chaichana | Pittawat Taveekitworachai | Warit Sirichotedumrong | Potsawee Manakul | Kunat Pipatanakul
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) for unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, modality-decoupled extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM’s text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio confirm that Partial YaRN outperforms the original models across a wide range of settings, and VLAT provides substantial performance improvements on long audio of unseen lengths.
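The toy sketch below illustrates the modality-decoupled idea only: audio token positions are advanced with a compressed step while text positions keep unit spacing, so a long audio span fits inside the original positional range. The function name, scaling scheme, and interface are assumptions made for illustration; the actual Partial YaRN operates on RoPE scaling inside the model rather than as a standalone position assignment.

```python
# Toy illustration of modality-decoupled positional rescaling:
# only audio tokens get compressed positions, text tokens keep unit spacing.
# This is NOT the paper's implementation, just a sketch of the idea.
from typing import List


def partial_scaled_positions(modalities: List[str], audio_scale: float) -> List[float]:
    """Assign position indices, advancing audio tokens by 1/audio_scale
    so a long audio span occupies a shorter positional range,
    while text tokens keep their original unit spacing."""
    positions, cursor = [], 0.0
    for modality in modalities:
        positions.append(cursor)
        cursor += (1.0 / audio_scale) if modality == "audio" else 1.0
    return positions


if __name__ == "__main__":
    # 4 text tokens, then 8 audio tokens, then 2 text tokens.
    seq = ["text"] * 4 + ["audio"] * 8 + ["text"] * 2
    # With audio_scale=4, the 8 audio tokens span only 2 position units.
    print(partial_scaled_positions(seq, audio_scale=4.0))
```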
2025
Mind the Gap: Static and Interactive Evaluations of Large Audio Models
Minzhi Li | William Held | Michael J. Ryan | Kunat Pipatanakul | Potsawee Manakul | Hao Zhu | Diyi Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication for both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how static benchmarks predict interactive performance: our analysis reveals that no individual benchmark strongly correlates with interactive results (τ ≤ 0.33 for all benchmarks). While combining multiple coarse-grained features yields modest predictive power (R² = 0.30), only two out of twenty datasets, on spoken question answering and age prediction, show significantly positive correlations. This suggests a clear need to develop LAM evaluations that better correlate with user preferences.
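A small sketch of the kind of benchmark-to-interactive correlation check described above, assuming fabricated per-model scores and scipy.stats.kendalltau; the τ and R² values in the abstract come from the study's own benchmarks and interaction data, not from anything shown here.

```python
# Illustrative check of how well a static benchmark predicts interactive
# preference results, using Kendall's tau. All scores are fabricated placeholders.
from scipy.stats import kendalltau

models = ["model_a", "model_b", "model_c", "model_d", "model_e"]
static_benchmark = [61.2, 55.4, 70.1, 48.9, 66.3]   # hypothetical static benchmark scores
interactive_pref = [0.58, 0.62, 0.54, 0.41, 0.66]   # hypothetical interactive win rates

tau, p_value = kendalltau(static_benchmark, interactive_pref)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```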
Prior Prompt Engineering for Reinforcement Fine-Tuning
Pittawat Taveekitworachai | Potsawee Manakul | Sarana Nutanong | Kunat Pipatanakul
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt, the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning, remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies (reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization) into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.
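To make the pPE setup concrete, here is a minimal sketch of prepending a prior-prompt instruction to each training query before RFT rollouts. The strategy names loosely mirror those in the abstract, but the instruction texts and the helper function are invented for illustration and are not the prompts used in the paper.

```python
# Sketch: prepending a "prior prompt" to each query before reinforcement
# fine-tuning rollouts. The instruction texts below are invented examples.
PRIOR_PROMPTS = {
    "reasoning": "Think through the problem step by step before answering.",
    "planning": "First outline a plan, then execute it to reach the answer.",
    "null_example": "Consider what a wrong or empty answer would look like, then avoid it.",
}


def build_training_prompt(query: str, strategy: str) -> str:
    """Prepend the chosen prior-prompt instruction to the user query."""
    return f"{PRIOR_PROMPTS[strategy]}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    print(build_training_prompt("What is 17 * 24?", "null_example"))
```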
FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning
Natapong Nitarach | Warit Sirichotedumrong | Panop Pitchayarthorn | Pittawat Taveekitworachai | Potsawee Manakul | Kunat Pipatanakul
Proceedings of The 10th Workshop on Financial Technology and Natural Language Processing
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung | Teetouch Jaknamon | Sirinya Chaiophat | Natapong Nitarach | Chanakan Wittayasakpan | Warit Sirichotedumrong | Adisai Na-Thalang | Kunat Pipatanakul
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings and offers actionable insights for improving Thai-language document understanding.