When depth is redundant: Efficient transformer-based speech anti-spoofing

Hoan My Tran; Damien Lolive; Aghilas Sini; Arnaud Delhay; Pierre-Francois Marteau; David Guennec

Abstract

Detecting speech deepfakes is critical for protecting society against fraud, identity theft, and the misuse of modern speech synthesis technologies. Despite recent progress, existing countermeasures often exhibit limited generalization to unseen spoofing attacks, particularly in out-of-domain evaluation settings, even when achieving strong in-domain performance. Transformer architectures have become ubiquitous in anti-spoofing, serving both as feature extractors (e.g., wav2vec 2.0) and as classifiers. However, deep transformer stacks exhibit substantial representational redundancy across adjacent layers, with similarity increasing toward deeper layers. As a result, task-specific specialization is largely concentrated in the final layers, while shallow layers remain underutilized during fine-tuning. In this work, we analyze the layer-wise behavior of transformer-based classifiers for speech deepfake detection and propose a training strategy that explicitly aligns shallow and intermediate representations with those of the final transformer layer. By encouraging all layers to mimic the task-specialized representation learned at depth, the model more effectively exploits early-layer features while preserving discriminative capacity in deeper layers. This design improves robustness to unseen spoofing attacks and enhances out-of-domain generalization. Extensive experiments across multiple benchmark datasets demonstrate consistent performance gains over strong baselines.
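The abstract does not state the exact alignment objective, so the following is only a minimal sketch under stated assumptions: each layer is summarized by one pooled utterance-level vector, redundancy is probed with cosine similarity between adjacent layers, and the alignment term is the mean cosine distance between every non-final layer and the final layer. The function names (`adjacent_layer_similarity`, `alignment_loss`) and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_layer_similarity(layers: List[Vector]) -> List[float]:
    """Similarity between each layer and the next one: a simple redundancy probe."""
    return [cosine_similarity(layers[i], layers[i + 1]) for i in range(len(layers) - 1)]


def alignment_loss(layers: List[Vector]) -> float:
    """Mean cosine distance between every non-final layer and the final layer.

    Assumption of this sketch: the final layer is treated as a fixed target.
    In a real training loop one would typically stop gradients through it.
    """
    if len(layers) < 2:
        return 0.0
    target = layers[-1]
    distances = [1.0 - cosine_similarity(h, target) for h in layers[:-1]]
    return sum(distances) / len(distances)


# Hypothetical pooled representations from a 3-layer classifier (toy values).
pooled = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
redundancy = adjacent_layer_similarity(pooled)
loss = alignment_loss(pooled)
```

In practice such a term would be added to the classification loss with a weighting coefficient; whether the authors use cosine distance, another similarity measure, or a different pooling scheme is not specified in the abstract.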

Anthology ID: 2026.findings-acl.318
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6380–6397
URL: https://aclanthology.org/2026.findings-acl.318/
Cite (ACL): Hoan My Tran, Damien Lolive, Aghilas Sini, Arnaud Delhay, Pierre-Francois Marteau, and David Guennec. 2026. When depth is redundant: Efficient transformer-based speech anti-spoofing. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6380–6397, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): When depth is redundant: Efficient transformer-based speech anti-spoofing (Tran et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.318.pdf
Checklist: 2026.findings-acl.318.checklist.pdf
