Subtle Shifts, Significant Threats: Leveraging XAI Methods and LLMs to Undermine Language Models Robustness

Adrián Moreno Muñoz, L. Alfonso Ureña-López, Eugenio Martínez Cámara


Abstract
Language models exhibit inherent security vulnerabilities, which may stem from several factors, among them the malicious alteration of input data. Such weaknesses compromise the robustness of language models, and they become more critical when adversarial attacks are stealthy and do not require high computational resources. In this work, we study how vulnerable pretrained English language models are to adversarial attacks based on subtle modifications of their input. We claim that an attack may be more effective if it targets the words that are most salient for the model's discriminative task. Accordingly, we propose a new attack built upon a two-step approach: first, we use a posteriori explainability methods to identify the most influential words for the classification task, and second, we replace them with contextual synonyms retrieved by a small language model. Since the attack has to be as stealthy as possible, we also propose a new evaluation measure that combines the effectiveness of the attack with the number of modifications performed. The results show that pretrained English language models are vulnerable to minimal semantic changes, which makes the design of countermeasure methods imperative.
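The abstract outlines a two-step attack (rank words by saliency with an a posteriori explainability method, then replace the top ones with contextual substitutes from a small language model) and a stealth-aware evaluation measure. The sketch below is a minimal illustration of how such a pipeline could look, under assumptions not taken from the paper: occlusion-based saliency as the XAI method, DistilBERT models as the victim classifier and the substitute generator, and a hypothetical edit-penalized score standing in for the proposed measure.

# Hedged sketch of a saliency-guided word-substitution attack; all model
# names, the occlusion saliency, and the scoring function are illustrative
# assumptions, not the paper's actual components.
import torch
from transformers import (AutoModelForMaskedLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

CLF_NAME = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed victim model
MLM_NAME = "distilbert-base-uncased"                          # assumed substitute generator

clf_tok = AutoTokenizer.from_pretrained(CLF_NAME)
clf = AutoModelForSequenceClassification.from_pretrained(CLF_NAME).eval()
mlm_tok = AutoTokenizer.from_pretrained(MLM_NAME)
mlm = AutoModelForMaskedLM.from_pretrained(MLM_NAME).eval()


def word_saliency(words):
    """Step 1: occlusion saliency -- drop in the predicted-class probability
    when each word is removed (one possible a posteriori XAI method)."""
    with torch.no_grad():
        logits = clf(**clf_tok(" ".join(words), return_tensors="pt")).logits
        pred = int(logits.argmax(-1))
        base_p = logits.softmax(-1)[0, pred].item()
        scores = []
        for i in range(len(words)):
            reduced = " ".join(words[:i] + words[i + 1:])
            p = clf(**clf_tok(reduced, return_tensors="pt")).logits.softmax(-1)[0, pred].item()
            scores.append(base_p - p)  # large drop => influential word
    return pred, scores


def contextual_substitutes(words, position, top_k=5):
    """Step 2: mask the salient word and let a small masked LM propose
    in-context replacements."""
    masked = words.copy()
    masked[position] = mlm_tok.mask_token
    enc = mlm_tok(" ".join(masked), return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == mlm_tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = mlm(**enc).logits
    cand_ids = torch.topk(logits[0, mask_pos], top_k).indices
    return mlm_tok.convert_ids_to_tokens(cand_ids.tolist())


def stealth_aware_score(label_flipped, n_edits, n_words, alpha=1.0):
    """Hypothetical stand-in for the paper's measure: reward a successful
    label flip, penalise the fraction of words that had to be modified."""
    return float(label_flipped) - alpha * n_edits / max(n_words, 1)


# Minimal usage: attack the single most salient word of one sentence.
words = "the movie was a genuine delight from start to finish".split()
pred, scores = word_saliency(words)
target = max(range(len(words)), key=scores.__getitem__)
with torch.no_grad():
    for cand in contextual_substitutes(words, target):
        perturbed = words.copy()
        perturbed[target] = cand
        new_logits = clf(**clf_tok(" ".join(perturbed), return_tensors="pt")).logits
        if int(new_logits.argmax(-1)) != pred:
            print(" ".join(perturbed), stealth_aware_score(True, 1, len(words)))
            break

The candidates a masked LM returns are subword tokens and are not guaranteed to be synonyms; a realistic attack would filter them (for example by embedding similarity) and iterate over several positions, and the scoring function above is only a placeholder for the stealth-aware measure the paper actually proposes.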
Anthology ID:
2025.ranlp-1.86
Volume:
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Galia Angelova, Maria Kunilovskaya, Marie Escribe, Ruslan Mitkov
Venue:
RANLP
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
748–757
URL:
https://aclanthology.org/2025.ranlp-1.86/
Cite (ACL):
Adrián Moreno Muñoz, L. Alfonso Ureña-López, and Eugenio Martínez Cámara. 2025. Subtle Shifts, Significant Threats: Leveraging XAI Methods and LLMs to Undermine Language Models Robustness. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 748–757, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
Subtle Shifts, Significant Threats: Leveraging XAI Methods and LLMs to Undermine Language Models Robustness (Moreno Muñoz et al., RANLP 2025)
PDF:
https://aclanthology.org/2025.ranlp-1.86.pdf