Xiaoxue Gao
2026
VoiceBench: Benchmarking LLM-Based Voice Assistants
Yiming Chen | Xianghu Yue | Chen Zhang | Xiaoxue Gao | Robby T. Tan | Haizhou Li
Transactions of the Association for Computational Linguistics, Volume 14
Recent advancements in large language models (LLMs) like GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering an improved user experience over text-based interactions. However, a suitable benchmark to rigorously evaluate such speech interaction systems is currently lacking. To bridge this gap, we introduce VoiceBench, the first benchmark specifically designed to assess LLM-based voice assistants. VoiceBench comprises 6,783 synthetic and real spoken instructions recorded from diverse speakers across eight distinct tasks. These instructions are meticulously crafted to assess three crucial capability areas: general knowledge, instruction-following, and safety compliance. Furthermore, VoiceBench systematically incorporates realistic variations common in spoken interactions, including differences in speaker characteristics (e.g., accents), heterogeneous environmental conditions (e.g., reverberation), and content complexities such as mispronunciations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
MMAC: A Multilingual, Multimodal Alignment Framework for Cultural Grounding Evaluation
Weihua Zheng | Zhengyuan Liu | Tanmoy Chakraborty | Weiwen Xu | Xiaoxue Gao | Bryan Chen Zhengyu Tan | Bowei Zou | Chang Liu | Yujia Hu | Xing Xie | Xiaoyuan Yi | Jing Yao | Chaojun Wang | Long Li | Rui Liu | Huiyao Liu | Koji Inoue | Ryuichi Sumida | Tatsuya Kawahara | Fan Xu | Lingyu Ye | Wei Tian | Dongjun Kim | Jimin Jung | Jaehyung Seo | Nadya Yuki Wangsajaya | Pham Minh Duc | Ojasva Saxena | Palash Nandi | Xiyan Tao | Wiwik Karlina | Tuan Luong | Keertana Arun Vasan | Roy Ka-Wei Lee | Nancy F. Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The global deployment of Large Language Models (LLMs) underscores the urgent need to evaluate their cultural alignment. However, assessing genuine "cultural awareness" across modalities (text, vision, speech) and languages remains a significant challenge. To comprehensively investigate this domain, we propose MMAC, a systematic framework that encompasses a tri-modally aligned cultural benchmark creation pipeline and a five-dimensional evaluation protocol to assess cross-country awareness disparities, evaluate cross-lingual and cross-modal consistency, and verify cultural knowledge generalization and grounding validity. Given the prevailing Western cultural bias in current models, we focus on 8 Asian countries as our dataset foundation to more acutely reveal potential cultural deficiencies in LLMs. Our dataset, MMAC-bench, features 27,000 human-curated questions across 10 languages. Crucially, it is the first dataset aligned at the input level across text, image, and speech, enabling direct cross-modal transfer tests. Each question consists of multiple-choice options accompanied by open-ended generated explanations, where 79% require multi-step reasoning grounded in cultural context, moving beyond simple memorization. We probe the causes of modal divergence, offering insights into fostering culturally robust MLLMs.
2025
SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning
Zhengyuan Liu | Geyu Lin | Hui Li Tan | Huayun Zhang | Yanfeng Lu | Xiaoxue Gao | Stella Xin Yin | Sun He | Hock Huan Goh | Lung Hsiang Wong | Nancy F. Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
The integration of generative artificial intelligence into educational applications has enhanced personalized and interactive learning experiences, and it shows strong potential to promote young learners' language acquisition. However, it is still challenging to ensure consistent and robust performance across different languages and cultural contexts, and kid-friendly design requires simplified instructions, engaging interactions, and age-appropriate scaffolding to maintain motivation and optimize learning outcomes. In this work, we introduce SingaKids, a dialogic tutor designed to facilitate language learning through picture description tasks. Our system integrates dense image captioning, multilingual dialogic interaction, speech understanding, and engaging speech generation to create an immersive learning environment in four languages: English, Mandarin, Malay, and Tamil. We further improve the system through multilingual pre-training, task-specific tuning, and scaffolding optimization. Empirical studies with elementary school students demonstrate that SingaKids provides effective dialogic teaching, benefiting learners at different performance levels.
2024
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
Yiming Chen | Xianghu Yue | Xiaoxue Gao | Chen Zhang | Luis Fernando D’Haro | Robby T. Tan | Haizhou Li
Findings of the Association for Computational Linguistics: EMNLP 2024
Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggle to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs towards the multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.
Co-authors
- Nancy Chen 2
- Yiming Chen 2
- Haizhou Li 2
- Zhengyuan Liu 2
- Robby T. Tan 2
- Xianghu Yue 2
- Tanmoy Chakraborty 1
- Pham Minh Duc 1
- Luis Fernando D’Haro 1
- Hock Huan Goh 1
- Sun He 1
- Yujia Hu 1
- Koji Inoue 1
- Jimin Jung 1
- Wiwik Karlina 1
- Tatsuya Kawahara 1
- Dongjun Kim 1
- Roy Ka-Wei Lee 1
- Long Li 1
- Geyu Lin 1
- Chang Liu 1
- Huiyao Liu 1
- Rui Liu 1
- Yanfeng Lu 1
- Tuan Luong 1
- Palash Nandi 1
- Ojasva Saxena 1
- Jaehyung Seo 1
- Ryuichi Sumida 1
- Bryan Chen Zhengyu Tan 1
- Hui Li Tan 1
- Xiyan Tao 1
- Wei Tian (田巍) 1
- Keertana Arun Vasan 1
- Chaojun Wang 1
- Nadya Yuki Wangsajaya 1
- Lung Hsiang Wong 1
- Xing Xie 1
- Fan Xu (徐凡) 1
- Weiwen Xu 1
- Jing Yao 1
- Lingyu Ye 1
- Xiaoyuan Yi 1
- Stella Xin Yin 1
- Chen Zhang 1
- Chen Zhang 1
- Huayun Zhang 1
- Weihua Zheng 1
- Bowei Zou (邹博伟) 1