@inproceedings{nguyen-etal-2025-read,
title = "What You Read Isn{'}t What You Hear: Linguistic Sensitivity in Deepfake Speech Detection",
author = "Nguyen, Binh and
Shi, Shuju and
Ofman, Ryan and
Le, Thai",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.794/",
pages = "15752--15766",
ISBN = "979-8-89176-332-6",
abstract = "Recent advances in text-to-speech technology have enabled highly realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior research has predominantly focused on acoustic-level perturbations, leaving **the impact of linguistic variation largely unexplored**. In this paper, we investigate the linguistic sensitivity of both open-source and commercial anti-spoofing detectors by introducing **TAPAS** (Transcript-to-Audio Perturbation Anti-Spoofing), a novel framework for transcript-level adversarial attacks. Our extensive evaluation shows that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates exceed **60{\%}** on several open-source detector{--}voice pairs, and the accuracy of one commercial detector drops from **100{\%}** on synthetic audio to just **32{\%}**. Through a comprehensive feature attribution analysis, we find that linguistic complexity and model-level audio embedding similarity are key factors contributing to detector vulnerabilities. To illustrate the real-world risks, we replicate a recent Brad Pitt audio deepfake scam and demonstrate that TAPAS can bypass commercial detectors. These findings underscore the **need to move beyond purely acoustic defenses** and incorporate linguistic variation into the design of robust anti-spoofing systems. Our source code is available at https://github.com/nqbinh17/audio{\_}linguistic{\_}adversarial."
}