Moayad Aloqaily

2026

While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings:Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes.Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits.Reasoning Amplifies Risk: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations.These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment.The codes and dataset are available at https://github.com/JinWang79/HearSay_Benchmark

pdf bib abs

Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose Locphylax, a defense framework that requires no prior knowledge of trigger settings. Locphylax is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. Locphylax leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) Locphylax reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1%–69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios. Our code is available at https://anonymous.4open.science/r/Locphylax.

Co-authors

Li Sun 1

Miao Yu 1

Venues

ACL1
Findings1

Fix author