Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery

Luke Moffett, Bhuwan Dhingra


Abstract
The current generation of large language models (LLMs) shows a surprising degree of robustness to adversarial perturbations, but it is unclear when these models implicitly recover the original text and when they instead rely on surrounding context. To isolate this recovery faculty of language models, we study a new diagnostic task, Adversarial Word Recovery, an extension of spellchecking in which the inputs may be adversarial. We collect a new dataset using 9 popular perturbation attack strategies from the literature and organize them under a taxonomy of phonetic, typo, and visual attacks. We use this dataset to study the word recovery performance of the current generation of LLMs, finding that proprietary models (GPT-4, GPT-3.5, and PaLM-2) match or surpass human performance. Conversely, open-source models (Llama-2, Mistral, Falcon) show a material gap relative to human performance, especially on visual attacks. For these open models, we show that word recovery performance without context correlates with word recovery performance with context, and ultimately affects downstream performance on a hateful, offensive, and toxic content classification task. Finally, to show that improving word recovery can improve robustness, we mitigate these attacks with a small ByT5 model tuned to recover visually attacked words.
Anthology ID:
2025.coling-main.467
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
6999–7019
URL:
https://aclanthology.org/2025.coling-main.467/
Cite (ACL):
Luke Moffett and Bhuwan Dhingra. 2025. Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6999–7019, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery (Moffett & Dhingra, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.467.pdf