Hsin-Hsi Chen

Also published as: Hsin-hsi Chen

2026

Task-Level Instructions Induction for Audio Question Answering from Few Examples
Po-Chun Chen | Hen-Hsen Huang | Hsin-Hsi Chen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Large audio-language models (LALMs) benefit from Chain-of-Thought (CoT) prompting for audio question answering (AQA), but acquiring audio CoT examples is particularly challenging as it requires sequential listening and careful integration of acoustic and linguistic information. Surprisingly, our experiments reveal that standard few-shot prompting yields inconsistent results compared to zero-shot CoT, with several models showing degraded accuracy. Moreover, few-shot prompting incurs substantially higher inference costs by processing multiple audio demonstrations per inference. We propose Audio-Induct, which induces reusable textual task instructions from few audio examples once per task, requiring no additional demonstrations at inference. Evaluated on 9 LALMs across two benchmarks, Audio-Induct outperforms state-of-the-art prompting methods while maintaining low inference costs. Inducted Task Instructions transfer effectively across models, enabling scalable deployment.

pdf bib abs

Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations
Sheng-Lun Wei | Yu-Ling Liao | Yen-Hua Chang | Hen-Hsen Huang | Hsin-Hsi Chen
Findings of the Association for Computational Linguistics: EACL 2026

Recent multimodal large language models (MLLMs) extend language understanding beyond text to speech, enabling unified reasoning across modalities. While biases in text-based LLMs have been widely examined, their persistence and manifestation in spoken inputs remain underexplored. This work presents the first systematic investigation of speech bias in multilingual MLLMs.We construct and release the BiasInEar Dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours (≈4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss’ 𝜅), we evaluate nine representative models under linguistic language and accent, demographic gender, and structural option order perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation.

Hsin-Hsi Chen

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

1994

1993

1992

1990

1988

Co-authors

Venues