Imperceptible Black-box Adversarial Perturbations
Over the last few decades, deep neural networks (DNNs) have exhibited remarkable success in natural language processing (NLP). However, despite their impressive performance, DNNs have been shown to be highly vulnerable to adversarial examples, which are carefully crafted inputs with small, often imperceptible perturbations that cause models to make incorrect predictions. This vulnerability poses growing threats to security and privacy in online and safety-critical environments.
The issue becomes even more serious with the widespread use of large language models (LLMs) such as GPT and LLaMA, which exhibit strong generalization and zero-shot learning abilities. Although these models revolutionize NLP tasks, they are not immune to adversarial manipulations. In particular, prompt-injection and jailbreak attacks can subvert safety alignment mechanisms, revealing a similar underlying weakness: subtle textual perturbations can drastically alter model behaviour. Understanding and mitigating such vulnerabilities has therefore become a central research topic in Safe AI.
Compared to adversarial attacks in computer vision (CV), generating imperceptible adversarial examples for text presents unique challenges. Images reside in continuous pixel spaces, where tiny numeric changes can easily be hidden from human perception. Text, however, is inherently discrete, and even a single character or word change can significantly affect readability, grammar, or meaning. Designing textual perturbations that remain semantically faithful and grammatically correct, while still deceiving the model, is therefore substantially more difficult. Character-level or sentence-level manipulations often yield unnatural text that can be detected by spell-checkers or simple rule-based defences. By contrast, word-level synonym substitution can produce more fluent and semantically consistent adversarial samples.
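To make the word-level idea concrete, below is a minimal greedy sketch of synonym substitution against a black-box scorer. The `model_confidence` and `get_synonyms` callbacks are hypothetical placeholders for a real model API and a real synonym source (e.g., a thesaurus or embedding neighbours); this illustrates the general attack family, not SCALA itself.

```python
# Hypothetical sketch of a greedy word-level synonym-substitution attack
# against a black-box classifier. The callbacks are stand-ins, not a real API.
from typing import Callable, List

def synonym_substitution_attack(
    words: List[str],
    label: int,
    model_confidence: Callable[[List[str], int], float],
    get_synonyms: Callable[[str], List[str]],
    max_changes: int = 3,
) -> List[str]:
    """Greedily replace words with synonyms that most reduce the model's
    confidence in the original label, stopping after max_changes edits."""
    adversarial = list(words)
    for _ in range(max_changes):
        base = model_confidence(adversarial, label)
        best_drop, best_edit = 0.0, None
        for i in range(len(adversarial)):
            for candidate in get_synonyms(adversarial[i]):
                trial = adversarial[:i] + [candidate] + adversarial[i + 1:]
                drop = base - model_confidence(trial, label)
                if drop > best_drop:
                    best_drop, best_edit = drop, (i, candidate)
        if best_edit is None:  # no substitution lowers confidence any further
            break
        i, candidate = best_edit
        adversarial[i] = candidate
    return adversarial


# Toy usage with a dummy scorer that is confident only while "great" survives.
toy_synonyms = {"great": ["fine", "decent"], "movie": ["film"]}
print(synonym_substitution_attack(
    ["a", "great", "movie"],
    label=1,
    model_confidence=lambda ws, lbl: 0.9 if "great" in ws else 0.4,
    get_synonyms=lambda w: toy_synonyms.get(w, []),
))  # -> ['a', 'fine', 'movie']
```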
To address this problem, we developed SCALA (Synonym-based desCending And repLace-back Ascending), a score-based black-box adversarial attack that strikes an effective balance among imperceptibility, efficiency, and practicality. By leveraging a novel descending–ascending synonym ranking mechanism with parallelized computation, SCALA generates visually and semantically natural adversarial texts with minimal perturbations. Extensive experiments on both conventional NLP models and fine-tuned LLMs demonstrate the generality and scalability of our method. SCALA consistently achieves the lowest perturbation and grammatical error rates while maintaining high semantic similarity, outperforming representative baselines. The results highlight that fine-tuned LLMs, despite their scale and contextual capacity, remain vulnerable to word-level perturbations, underscoring the need for robust defence strategies. Future work will explore extending SCALA to multilingual and multimodal domains and investigating defence-aware training to enhance robustness in real-world applications.
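The method's name suggests a two-phase procedure: a descending phase that substitutes synonyms until the prediction flips, followed by an ascending replace-back phase that restores original words whenever the misclassification still holds. The sketch below illustrates that general idea under our own simplified assumptions; the `predict`, `best_synonym`, and `order` interfaces are hypothetical, and the actual ranking criterion and parallelized computation used by SCALA are omitted. For the precise algorithm, please refer to the paper [1].

```python
# Illustrative, simplified two-phase "descend then replace back" sketch.
# It is NOT the published SCALA implementation: scoring, synonym selection,
# word ordering and parallelization are all placeholders.
from typing import Callable, List

def descend_then_replace_back(
    words: List[str],
    label: int,
    predict: Callable[[List[str]], int],
    best_synonym: Callable[[List[str], int], str],
    order: List[int],
) -> List[str]:
    original = list(words)
    adversarial = list(words)

    # Descending phase: substitute synonyms in the given word order until the
    # black-box prediction moves away from the original label.
    changed: List[int] = []
    for i in order:
        adversarial[i] = best_synonym(adversarial, i)
        changed.append(i)
        if predict(adversarial) != label:
            break
    else:
        return original  # the attack failed; return the unperturbed input

    # Ascending replace-back phase: restore original words one at a time and
    # keep each restoration whenever the adversarial label is preserved, so
    # only the substitutions that are truly necessary remain in the output.
    for i in changed:
        candidate = list(adversarial)
        candidate[i] = original[i]
        if predict(candidate) != label:
            adversarial = candidate
    return adversarial
```

Read in this way, the replace-back pass is what keeps the perturbation rate low: any substitution that is not needed to preserve the misclassification is undone before the adversarial text is returned.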
Want to learn more? Read our latest paper [1], published in the IEEE Transactions on Information Forensics and Security.

