XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection

Kwok-Ho Ng; Tingting Song; Yongdong WU; Zhihua Xia

Abstract

Advanced speech synthesis technologies have enabled highly realistic speech generation, posing security risks that motivate research into audio deepfake detection (ADD). While state space models (SSMs) offer linear complexity, purely causal SSM architectures often struggle with the content-based retrieval required to capture global frequency-domain artifacts. To address this, we explore the scaling properties of hybrid architectures by proposing XLSR-MamBo, a modular framework that integrates an XLSR front-end with synergistic Mamba-Attention backbones. We systematically evaluate four topological designs using advanced SSM variants: Mamba, Mamba2, Hydra, and Gated DeltaNet. Experimental results show that the MamBo-3-Hydra-N3 configuration performs competitively with other state-of-the-art systems on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This performance benefits from Hydra’s native bidirectional modeling, which captures holistic temporal dependencies more efficiently than the heuristic dual-branch strategies employed in prior work. Furthermore, evaluations on the DFADD dataset demonstrate robust generalization to unseen diffusion-based and flow-matching-based synthesis methods. Crucially, our analysis reveals that scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models. These results demonstrate the hybrid framework’s ability to capture artifacts in spoofed speech signals, providing an effective method for ADD. Code is publicly available at https://github.com/saki-ciallo/XLSR-MamBo.

Anthology ID: 2026.findings-acl.1573
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 31450–31462
URL: https://aclanthology.org/2026.findings-acl.1573/
Cite (ACL): Kwok-Ho Ng, Tingting Song, Yongdong WU, and Zhihua Xia. 2026. XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection. In Findings of the Association for Computational Linguistics: ACL 2026, pages 31450–31462, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection (Ng et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1573.pdf
Checklist: 2026.findings-acl.1573.checklist.pdf
