1. Introduction
The rapid development of artificial intelligence (AI) and large language models (LLMs) has led to their wide use in NLP tasks such as classification, inference, speech generation, and translation [1,2,3]. However, research has shown that deep learning models are highly sensitive and susceptible to small perturbations [4]. This vulnerability allows adversarial examples to pose significant challenges across multiple domains, including image recognition, natural language processing, steganography [5], and frequency-domain analysis [6], giving rise to serious security risks, privacy leakage, and a substantially degraded user experience. For example, people often rely on reviews of products or services when shopping or choosing movies, and some applications use sentiment analysis to provide users with recommendations based on historical reviews [7]. An attacker can generate adversarial examples from real user reviews to spoof the scoring mechanism of a recommender system, discredit competitors, and cause users to receive incorrect recommendations. In addition, attackers can disguise spam as normal email to evade detection, which not only makes it harder for users to find legitimate messages but also increases the risk of virus propagation. Other NLP-based systems may likewise be affected by adversarial examples: public-opinion data may be disguised as normal text and widely distributed on the Internet to deceive opinion-monitoring systems, and patients' symptom descriptions may be falsified, causing systems such as medical triage to fail. Research on defending against textual adversarial examples is therefore both necessary and urgent. By probing model vulnerabilities more deeply, researchers can design more effective defense methods that improve the robustness of deep learning models [8] and prevent the harm that a flood of textual adversarial examples could inflict on industry and even the country, such as economic losses for enterprises and leakage of private information.
In computer vision (CV), more mature methods are available to defend against adversarial attacks. These include the use of denoising autoencoders (DAEs) to remove adversarial noise [9], the use of generative adversarial networks (GANs) to generate unperturbed images [10], and the reconstruction of high-quality original images to remove adversarial perturbations [11]. However, because images are continuous inputs while text is discrete, the approaches for defending against adversarial attacks in CV cannot be directly applied to NLP tasks. Meanwhile, research on text defenses has been relatively scant and has mostly targeted a single type of attack rather than a diversified range; such methods cannot achieve a universal defense and incur high defense costs. Completing adversarial text defense simply and efficiently is the focus of current research.
Spellchecking is the main means of defending against character-level attacks. Traditional methods rely on dictionary matching and are relatively inefficient. Although deep learning methods based on CNNs and RNNs perform better, they still suffer from limitations such as homogeneous training samples and disregard for contextual semantics, and they lack specialized optimization for homoglyph attacks. In addition, word-level defense methods generally face several challenges: on the one hand, defenders find it difficult to predict attack strategies, which limits the effectiveness of methods that rely on attack-specific training; on the other hand, identifying high-level attacks such as synonym substitution requires complex modeling, and the defense cost remains high.
In this paper, we consider the characteristics of presently available attack methods and the limitations of current defenses, and we propose a general text-attack defense framework named TextShelter, which focuses on simple and efficient defense through semantics-preserving input reconstruction without modifying the model structure or training process. Before the text is input into the model, it is rewritten by TextShelter's three modules, Homoglyph Reversion (HR), Spelling Correction (SC), and Backtranslation (BT), which remove possible homoglyph, character-level, and synonym attacks from the text.
The main contributions of this paper are as follows:
We propose an adversarial defense method called TextShelter that accounts for the presence of multiple types of perturbation in the same text; it can defend against almost all adversarial attacks and is also effective against any single attack. Furthermore, the method requires neither access to the model's training process and parameters nor querying or retraining the model, making it general, simple, and efficient.
TextShelter is the first defense method specifically designed against homoglyph attacks. It has been thoroughly evaluated on four models across three benchmark datasets under attacks of varying severity, achieving a maximum accuracy improvement of 70.7% and an average gain of 30%. Experimental results demonstrate that TextShelter significantly outperforms most existing state-of-the-art methods on deep learning models and matches the performance of the best current approaches on large language models.
TextShelter incorporates three distinct modules, whose optimization and synergistic interaction significantly enhance the overall defense capability. Ablation studies validate the individual contribution and performance of each module, demonstrating that optimizing a single module can further improve defense effectiveness.
Compared to existing defense methods, TextShelter’s key advantage lies in its extremely low defense overhead, as it does not rely on costly retraining or extensive computational resources. Moreover, we systematically evaluate its performance in terms of semantic preservation and semantic transfer, and further discuss the safety and practicality of each module—all experiments yield excellent results.
2. Related Work
Character-level attacks are one of the earliest forms of adversarial text attacks [12], and most such attacks insert, delete, or modify keywords. Steffen et al. [13] first proposed an adversarial text attack based on visual perturbations that uses homoglyphs to replace characters in words. Homoglyphs are two or more characters with different meanings but similar visual presentation that are represented by different Unicode encodings, as shown in Figure 1. Homoglyphs are frequently used in practice, for example in the creation of phishing websites, spam detection evasion, and domain-name spoofing.
There are already defense methods against character-level attacks. Pruthi et al. [14] placed a word recognition model before the classifier. This model is based on the semi-character-level recurrent neural network (ScRNN) architecture and includes several new back-off strategies for handling rare and unseen words. After training, the model can identify words that have been damaged by random additions, deletions, swaps, and keyboard errors, thus providing a defense against adversarial examples containing character-level perturbations. However, it can correct only character-level perturbations and cannot handle coarser-grained modifications such as word replacement or phrase addition. Literature [15] proposed a framework that jointly uses character embedding and adversarial stability training to defend against character-level adversarial examples. It can handle both out-of-vocabulary (OOV) words and distributional differences between the training set and the adversarial examples. However, this method likewise defends against only character-level attacks.
To make adversarial text more difficult to detect, word-level adversarial examples have been developed [16]. Typically, word-level adversarial text scrambles the original text by removing, adding, and replacing keywords, and the words chosen for replacement are usually synonyms, to better preserve semantics [17]. Ren et al. [18] proposed a greedy algorithm for adversarial attacks called probability-weighted word saliency (PWWS) based on the synonym substitution strategy. Some defense methods have also been proposed for word-level attacks. Wang et al. [19] proposed the Synonym Encoding Method (SEM), in which an encoder is inserted before the input layer of the model to force all synonym neighbors of a word x to be encoded as x itself. SEM can effectively defend against current adversarial attacks based on synonym replacement while not considerably decreasing the accuracy on benign examples. Nevertheless, SEM still requires modification of the model structure and retraining of the model to defend against word-level substitutions. Zhou et al. [20] proposed a novel framework called Learning to Discriminate Perturbations (DISP). In this framework, a perturbation discriminator estimates the probability that each token in the text has been perturbed and provides a set of potential perturbations. For each potential perturbation, an embedding estimator then recovers the embedding of the original word based on context and selects replacement tokens via an approximate k-nearest neighbor (kNN) search. DISP can defend against adversarial attacks on any NLP model without modifying the model structure or training process. However, its performance deteriorates in two cases: (1) when the remaining sentence lacks informative context, the recovered tokens are inaccurate, and (2) when multiple perturbations exist, the erroneous context also leads to unsatisfactory recovery.
To date, there has been relatively little research at the sentence level; one common attack is to add elaborately crafted sentences to a sample to fool the classifier [12,21]. As an alternative, Iyyer et al. [22] proposed a syntactically controlled paraphrasing network (SCPN) and used it to generate adversarial examples. However, there are currently few defenses against sentence-level attacks.
At the same time, some defense methods from the CV domain have been adapted for adversarial text defense, of which adversarial training [23] and adversarial data augmentation [24] are the earliest and most widely used. In the literature [17,25,26], scholars adopted adversarial training as a defense strategy against adversarial text attacks; this method effectively defends against adversarial images and also exhibits a good effect on text. However, it requires retraining the original model during the training phase and is consequently very time- and cost-intensive. Distillation is also a common method for defending against adversarial examples in the image domain. However, Marcus et al. [27] evaluated the performance of defensive distillation for text, and the experimental results showed no improvement in the robustness of networks trained with it.
With the continued growth of large language models, attacks on and defenses of pre-trained models have also attracted extensive research [28]. Cen et al. [29] proposed a Large Language Model Adversarial Defense (LLMAD) method comprising two modules: perturbation detection and perturbation correction. The perturbation correction module uses a fine-tuned large language model to rectify adversarial perturbations. Although this method achieved an average improvement of 66.8% over the undefended baseline, it incurs relatively high defense costs. Ye et al. [30] proposed SAFER, a randomized smoothing method that provably ensures the prediction cannot be altered by any possible synonymous word substitution; however, it must assume that the defender knows how the adversary generates synonyms, which does not fit realistic attack scenarios. The core of traditional AI adversarial defense lies in maintaining the model's decision robustness on specific tasks, whereas the key to large-model adversarial defense is ensuring behavioral safety and value alignment in open-ended generation scenarios. While Wu et al. [31] proposed a defense against prompt-injection attacks on large models, existing defensive approaches still demonstrate insufficient robustness when applied to large language models.
3. TextShelter
TextShelter is a defense model for blocking black-box adversarial attacks; it achieves efficient defense through simple, semantics-preserving reconstruction of input adversarial examples under the constraint that the parameters, architecture, and training data of the target model are unknown.
TextShelter consists of three proprietary modules, each designed by analyzing a different level of attack and using the influence of text features on the model to optimize the input for defense. Unlike existing methods that require model retraining, TextShelter's input reconstruction is easier to deploy and implement.
Problem Statement: For the task of sequence classification, given a pretrained target model F, the model maps the feature space of the input text to a set of classification labels Y, where the labels may come from several categories. The input text is denoted by X = (x1, x2, …, xn), where xi represents the i-th token and n is the length of the sequence. An attacker can generate an adversarial example X′ by adding imperceptible perturbations to the original text such that F(X′) ≠ F(X), leading to deviation in the predictions. Real-world adversarial examples may have multiple manifestations, such as character-level modifications performed by inserting or deleting a character and word-level substitutions performed by replacing a keyword with its synonym. Neither a single spellchecker nor adversarial training can effectively address such adversarial examples; therefore, it is crucial to implement defenses against mixed adversarial examples. We wish to apply our defense method D to the adversarial text such that F(D(X′)) = F(X), which indicates that the prediction of the classification model is restored to the label of the original text.
3.1. Framework Overview
To defend against character-level and word-level attacks on text classifiers, we propose a three-stage pipeline method aiming to repair inaccuracies in words, including misspellings, homoglyphs, and synonyms. TextShelter is added before the text classifier, and we modify the input to the classification model to achieve the desired defense effect. Our method consists of the various components shown in
Figure 2.
The original text S is changed into the adversarial text S′ by an adversarial attack. Then, our defense modules process S′ to obtain S″ as the input to the text classifier, thereby markedly improving the classification accuracy. Details of each module are described below.
3.2. Homoglyph Reversion (HR) Module
To mitigate the security impact of visual attacks, this paper first proposes an HR defense module for homoglyph attacks that can reverse most such attacks. The HR module comprises two steps: first, a homoglyph filter is applied to discover and recognize homoglyphs, and second, these homoglyphs are converted into machine-readable American Standard Code for Information Interchange (ASCII) codes. This design speeds up the text-reconstruction step: if the text contains no homoglyphs, it is left unmodified and passes directly to the next module.
Unicode assigns a standardized and unique hexadecimal code to each character in each language to meet the requirements of cross-language text conversion and processing. First, each character is converted to "UTF-16" format by detecting its Unicode encoding, so as to determine the Unicode values of all characters in the text (e.g., "a" corresponds to "U+0061"). We maintain a basic purified list L, which consists of the last four digits of the Unicode values of the basic Latin letters (26 letters), digits, and common punctuation. Iterating through the adversarial text S′, we convert each character into a list of Unicode values U. We then compare the last four digits of each Unicode value in U against the reference list L. If an element appears in U but not in L, the character represented by the corresponding Unicode value is identified as a homoglyph in the adversarial text.
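The detection step above can be sketched as follows. The purified list here is a small illustrative subset, and `find_homoglyphs` is a hypothetical helper name, not the paper's implementation:

```python
# Sketch of the homoglyph filter: flag any character whose Unicode
# codepoint falls outside a basic purified list of Latin letters,
# digits, and common punctuation.

BASIC_LATIN = set("abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "0123456789 .,!?;:'\"()-")

def find_homoglyphs(text):
    """Return (index, char, codepoint) for every suspect character."""
    suspects = []
    for i, ch in enumerate(text):
        if ch not in BASIC_LATIN:
            suspects.append((i, ch, f"U+{ord(ch):04X}"))
    return suspects

# Cyrillic "а" (U+0430) substituted for Latin "a" (U+0061):
print(find_homoglyphs("b\u0430d movie"))  # one suspect at index 1
```

Comparing codepoints rather than glyph images makes this first pass cheap; the visually based dictionary lookup only runs on the flagged characters.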
Then, we build a homoglyph dictionary containing most homoglyph pairs of English characters and compare the detected homoglyph characters against this dictionary to restore them to standard ASCII codes. Taking the letter "a" as an example, we first construct an empty homoglyph candidate set C and then compare its grayscale image, normalized to a fixed resolution to balance computational efficiency, robustness, and discriminative capability, with those of non-standard ASCII characters in Unicode for fast and effective image-similarity assessment. To save comparison time, the Unicode encoding range is limited, and encodings with high image similarity are inserted into the candidate set C; the comparison ends once the candidate set contains n homoglyph characters. The homoglyph dictionary can also be expanded according to specific practical requirements to further improve the reversion of homoglyph characters.
Note that since the Unicode encoding space is finite, so is the number of homoglyphs per character; as long as the homoglyph dictionary is large enough to contain all homoglyphs, this method can be transferred to any defense scheme targeting homoglyph attacks. At the same time, an excessively large homoglyph dictionary may increase the time cost of the defense, so its size can be adjusted to the needs of the actual defense process.
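A minimal sketch of the reversion step, assuming a hand-curated homoglyph dictionary (the module's real dictionary, built via image-similarity comparison, is far larger); NFKC normalization serves as a fallback for Unicode compatibility characters:

```python
import unicodedata

# Toy homoglyph dictionary: confusable character -> ASCII equivalent.
HOMOGLYPH_DICT = {
    "\u0430": "a",  # Cyrillic а -> Latin a
    "\u0435": "e",  # Cyrillic е -> Latin e
    "\u043e": "o",  # Cyrillic о -> Latin o
    "\u0440": "p",  # Cyrillic р -> Latin p
    "\u0455": "s",  # Cyrillic ѕ -> Latin s
}

def revert_homoglyphs(text):
    out = []
    for ch in text:
        if ch in HOMOGLYPH_DICT:
            out.append(HOMOGLYPH_DICT[ch])
        elif ord(ch) > 0x7F:
            # Fallback: compatibility decomposition; keep the character
            # unchanged if it has no ASCII equivalent.
            norm = unicodedata.normalize("NFKC", ch)
            out.append(norm if norm.isascii() else ch)
        else:
            out.append(ch)
    return "".join(out)

print(revert_homoglyphs("gr\u0435\u0430t film"))  # -> "great film"
```

Because every entry maps back to a single ASCII codepoint, the restored text re-enters the model vocabulary instead of being embedded as [unknown].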
Theoretically, if a word contains homoglyphs, its embedding will be marked as [unknown] when it enters the model. After the HR module restores the word, the embedding returns to normal, and the model can again extract text features through the embedding.
3.3. Spelling Correction (SC) Module
It has been shown that spellchecking can be used to defend against character-level attacks. However, existing methods suffer from several critical limitations: (1) they rely on homogeneous training data that fails to generalize to cases where all characters in a word are perturbed; (2) their performance remains constrained by dictionary coverage; (3) contextual semantics are often ignored, risking semantic distortion during correction; and (4) specific mechanisms to handle homoglyph attacks are still lacking. Since current spellchecking tools cannot reliably restore all adversarial perturbations, this paper proposes a dedicated Spelling Correction (SC) module, which defends against character-level attacks via a two-stage pipeline: error detection and contextual correction.
The purpose of spellchecking here is to improve defense efficiency by handling different types of attacks separately: it determines whether the text contains spelling errors (including homoglyph errors and improperly restored homoglyph characters) and effectively distinguishes word-level attacks from character-level attacks. Based on the shortcomings of current spellchecking, we propose an improved E-ScRNN model with the structure shown in
Figure 3.
3.3.1. Overall Framework
The E-ScRNN model is mainly divided into three parts: word representation, dynamic word vectors, and model training. The core of the word representation is a semi-character vector V that represents a possibly jumbled word, where the first and last characters are represented separately and the representation of the middle characters is independent of their order. Each input word w = (c1, c2, …, cl) is represented by concatenating V = [v_first; v_mid; v_last], where v_first is a one-hot vector representation of the first character c1, v_mid is a bag of characters representing the middle characters (c2, …, c(l−1)), and v_last is a one-hot representation of the last character cl. Each word in the text is represented by this semi-character vector together with the calculated Embeddings from Language Models (ELMo) vector; these two vectors are combined and input into a bidirectional long short-term memory (BiLSTM) unit f. The output of the hidden layer is used as the input to a softmax layer, which outputs the predicted value ŷ for the input word; that is, the predicted word corresponding to each input word is produced through a learned weight matrix W (whose output dimensionality equals the vocabulary size), and the model is optimized using the cross-entropy loss.
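The semi-character representation can be sketched as follows, using a toy lowercase alphabet (the vector sizes are illustrative, not the model's actual dimensions):

```python
from collections import Counter

# Semi-character word vector: [one-hot first char | bag of middle
# chars | one-hot last char], so reordering the middle characters
# leaves the representation unchanged.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(ch):
    v = [0.0] * len(ALPHABET)
    if ch in IDX:
        v[IDX[ch]] = 1.0
    return v

def bag_of_chars(chars):
    v = [0.0] * len(ALPHABET)
    for ch, n in Counter(chars).items():
        if ch in IDX:
            v[IDX[ch]] = float(n)
    return v

def semi_char_vector(word):
    first, middle, last = word[0], word[1:-1], word[-1]
    return one_hot(first) + bag_of_chars(middle) + one_hot(last)

# "word" and its jumbled form "wrod" share the same representation,
# since only the middle characters are reordered:
assert semi_char_vector("word") == semi_char_vector("wrod")
```

This invariance is what lets the model recognize a scrambled word before the context-sensitive ELMo component disambiguates the correction.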
Here, k denotes the index of a vocabulary token (the model predicts the probability of the k-th word via the softmax over W·h_i), and the hidden state vector h_i represents the context-aware hidden state output by the BiLSTM at the i-th time step. This hidden state, whose dimension is fixed regardless of vocabulary size, is the input to the subsequent softmax layer.
E-ScRNN is built upon ScRNN [14] and has been optimized for the characteristics of character-level attacks, aiming at more efficient and accurate defense. In principle, once the model has learned enough misspellings, it can effectively determine whether an input word is spelled correctly, and the addition of ELMo helps restore the correct characters from context. From a defense point of view, as with homoglyph attacks, the embedding of a misspelled word entering the model is marked as [unknown]; once the correct spelling is restored, the model can classify normally. Architecturally, E-ScRNN uses a two-layer bidirectional LSTM with a hidden dimension of 512. The input representation concatenates a semi-character vector (one-hot encodings of the first and last characters and a bag-of-characters for the middle segment, totaling 768 dimensions) with a 1024-dimensional ELMo embedding, and this combined vector is linearly projected down to 512 dimensions. The model is trained on a merged dataset consisting of IMDb, WikiText, and synthetically perturbed examples (approximately 2 million word pairs), with simulated error rates of swap 30%, drop 25%, add 25%, and key 20%. For ELMo integration, we use its three layer representations (token embedding plus two BiLSTM layers). This design balances local glyph robustness (via the semi-character representation) against global semantic recovery (via context-aware ELMo embeddings). The specific details and optimizations of the E-ScRNN model are as follows.
3.3.2. Expansion of Misspellings
To enable the model to learn more forms of spelling errors, the first and last characters of a word are allowed to be modified when constructing data. The maximum character length of a word that is allowed to be modified is reduced following different modification strategies representing various spelling errors that may occur in a word. These errors include (a) swap: swapping the positions of two adjacent characters in the word; (b) add: inserting a character at any position in the word; (c) drop: deleting a character at any position in the word; (d) key: replacing a character in the word with a character located next to it on the keyboard. For different types of errors, different probabilities of occurrence are set to guide the learning process of the model. Such modifications can improve the learning ability of the model to improve the accuracy of spelling correction.
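The four perturbation strategies can be sketched as follows; the keyboard-neighbor map is a tiny illustrative subset, and the per-type weights mirror the simulated error rates reported above (swap 30%, add 25%, drop 25%, key 20%):

```python
import random

KEYBOARD_NEIGHBORS = {  # small illustrative subset of a QWERTY map
    "a": "qwsz", "e": "wrds", "o": "ipkl", "s": "awedxz", "t": "rygf",
}

def perturb(word, rng=random):
    """Apply one of four misspelling strategies to a word."""
    if len(word) < 2:
        return word
    kind = rng.choices(["swap", "add", "drop", "key"],
                       weights=[0.30, 0.25, 0.25, 0.20])[0]
    i = rng.randrange(len(word) - 1)
    if kind == "swap":   # swap two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "add":    # insert a random letter
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
    if kind == "drop":   # delete one character
        return word[:i] + word[i + 1:]
    ch = word[i]         # "key": replace with a keyboard neighbor
    if ch in KEYBOARD_NEIGHBORS:
        return word[:i] + rng.choice(KEYBOARD_NEIGHBORS[ch]) + word[i + 1:]
    return word

rng = random.Random(0)
print([perturb("theater", rng) for _ in range(3)])
```

Training pairs are then formed as (perturbed word, original word), so the model sees many error forms per clean word.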
3.3.3. Dictionary Settings
We collected and combined a variety of datasets from different fields to expand the original training set so that the dictionary covers more words. It is important, however, not to blindly increase the dictionary size. Firstly, a dictionary that is too large reduces training speed and increases computational cost. Secondly, the number of common English words is about 5000, and an overly large dictionary can cause words to be corrected into rare words, which affects semantics. We tested dictionary sizes of 5000, 10,000, 20,000, and 50,000, and the correction effect stabilizes at around 20,000 words. Additionally, the abbreviation "'ve" was a standalone entry in the original dictionary, which caused "I've" to be erroneously restored to "I have", "'ve", "live", or other forms. This not only prevents the text from being restored correctly but also harms the reading experience and affects text classification. Therefore, the updated dictionary was amended so that phrases with abbreviations such as "I've" are treated as one word. This improvement not only avoids the associated semantic problems but also appropriately expands the dictionary's vocabulary, making it easier to predict other words during correction and restoration.
3.3.4. Dynamic Word Embedding
Although text input is discrete in nature, the discrete words still have strong contextual relationships, and text can be better understood by incorporating context. Matthew et al. [32] proposed a dynamic method of generating word vectors called ELMo. Instead of using a fixed embedding for each word, this method examines the entire sentence before assigning embeddings, using N-layer long short-term memory (LSTM) units trained on specific tasks to create the embeddings.
Given a text sequence of n tokens (t1, t2, …, tn), a BiLSTM language model is used, consisting of a forward and a backward language model. It is trained by maximizing the log-likelihood of each word in both the forward and backward directions. The forward language model predicts token tk from (t1, …, t(k−1)), and the backward language model predicts it from (t(k+1), …, tn). Once this language model has been pretrained, ELMo combines the input token representation with the forward outputs h→(k,j) and backward outputs h←(k,j). For each token tk, an N-layer BiLSTM computes a total of 2N + 1 representations. A softmax-normalized task-specific weight s_j is assigned to each layer; each layer's vector is multiplied by its weight, and the weighted sum is scaled by a scalar parameter γ. The ELMo calculation is expressed as follows:

ELMo_k = γ · Σ_{j=0}^{N} s_j · h_{k,j}    (2)
During fine-tuning, we adopt the standard form of Equation (2) to integrate ELMo, initializing the layer weights as s_j = 1/3 (corresponding to the token embedding layer and the two BiLSTM layers) and the scalar scaling factor as γ = 1.0, and we jointly optimize all s_j and γ during training. Here, h_{k,j} is the context-dependent representation output at time step k of layer j.
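A small numeric sketch of the combination in Equation (2): the ELMo vector for a token is the softmax-weighted, γ-scaled sum of its per-layer representations. The 3-dimensional layer vectors below are toy stand-ins:

```python
import math

def softmax(w):
    m = max(w)
    e = [math.exp(x - m) for x in w]
    z = sum(e)
    return [x / z for x in e]

def elmo_combine(layer_vecs, layer_weights, gamma):
    """ELMo_k = gamma * sum_j s_j * h_{k,j}, with s = softmax(w)."""
    s = softmax(layer_weights)
    dim = len(layer_vecs[0])
    return [gamma * sum(s[j] * layer_vecs[j][d]
                        for j in range(len(layer_vecs)))
            for d in range(dim)]

# Three layers (token embedding + two BiLSTM layers), equal weights
# (softmax of zeros gives s_j = 1/3 each):
h = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [2.0, 2.0, 2.0]]
print(elmo_combine(h, [0.0, 0.0, 0.0], gamma=1.0))
```

With equal weights this averages the three layers per dimension; during training the weights s_j and the scale γ are learned jointly with the task.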
In the method proposed in this paper, the ELMo representations of the words of the input text are calculated and added as a dimension of the input provided to the E-ScRNN model.
The E-ScRNN model can greatly improve the accuracy of word correction, not only achieving effective defense against character-level attacks but also reducing the time cost.
3.4. Backtranslation (BT) Module
Meanwhile, synonym replacements and the addition of specific phrases also have a high probability of appearing in adversarial texts. To ensure that our method can handle both character-level and word-level attacks and even the addition of irrelevant statements, we refer to the compression and reconstruction methods that have been developed in the image domain and propose the BT module to restate sentences.
In adversarial attacks using synonym substitution, lower-frequency words tend to be substituted in, and the fixed collocations and idiomatic usages of some words change during substitution, causing some loss of semantics and sentence fluency. BT is defined as translating a target document into another language and then back to the source language. We chose BT as the sentence-reconstruction method because neural machine translation combines contextual information to correct abnormal usage in sentences, and because commonly used words are favored during translation, which effectively mitigates the influence of the attacker's low-frequency synonyms on the model. Specifically, the initial document is transformed into a backtranslated text with different word order and content while the emotional content and semantics are preserved during translation. Consequently, a robust and secure translation model is particularly important.
An example of BT is shown in
Figure 4. To avoid the effects of word-level attacks on translation models, the translation model selected in this paper adopts a Seq2Seq + Attention structure. Compared with traditional neural machine translation pipelines, this model translates directly between two languages rather than via a third language as an intermediary, avoiding the semantic loss caused by multiple translations and further improving translation quality. Alternatively, a translation platform trained on large-scale datasets can be used; the abundance of data allows such a model to learn additional language features, further improving accuracy and producing translations with the most precise possible meaning while avoiding the introduction of monolingual stylistic choices.
In this paper, we use Chinese as the pivot language for back-translation, performing only a single round of reverse translation. It is worth noting that, by introducing contextual information during sentence reconstruction, the BT module also exhibits a certain degree of defense capability against some sentence-level attacks. Considering that adversarial attacks may affect the translation model itself, we provide a detailed discussion of this module in
Section 4.3.
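The round-trip structure of the BT module can be sketched as follows. Here `translate` is a hypothetical stand-in for the Seq2Seq + Attention model: a toy synonym map mimics the way a real translator implicitly canonicalizes vocabulary by preferring frequent words in the target language.

```python
# Toy lexicon: rare synonym -> common pivot sense. A real system would
# call a trained translation model instead.
CANONICAL = {
    "marvellous": "great", "dreadful": "bad", "flick": "movie",
}

def translate(text, src, tgt):
    # Stand-in translator: map each word to its canonical form.
    return " ".join(CANONICAL.get(w, w) for w in text.split())

def backtranslate(text, pivot="zh"):
    """Translate to the pivot language and back to the source."""
    pivoted = translate(text, src="en", tgt=pivot)
    return translate(pivoted, src=pivot, tgt="en")

print(backtranslate("a marvellous flick"))  # -> "a great movie"
```

The defensive effect comes from this canonicalization: low-frequency adversarial substitutions are mapped back toward the common wording the classifier was trained on.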
4. Experimental Results and Analysis
In this section, we evaluate the performance of our defense method on four models across three popular datasets. TextShelter is applied before the test data are input into the downstream classification models, and the experimental results show improved classification performance.
4.1. Experimental Setup
Datasets. We choose three popular datasets for our experiments: the Internet Movie Database (IMDb) dataset [33], the binary Stanford Sentiment Treebank (SST-2) [34], and a subset of AG's corpus of news articles (AG's News). IMDb and SST-2 are both movie-review sentiment-analysis datasets with binary classification labels. Specifically, IMDb is a long-text classification dataset with an average length of 262 that contains 25,000 training examples and 25,000 test examples. SST-2 is a single-sentence classification dataset with an average length of 19 that consists of approximately 70,000 sentences. AG's News contains four categories of news articles drawn from more than 2000 news sources; each category contains 30,000 training examples and 1900 test examples.
Evaluation Metrics. We use classification accuracy as the primary evaluation metric, and all reported results include the standard deviation over five random runs.
Target Models. For the text classification networks, we take as base models four networks that are widely used in real-world text classification tasks: (1) TextCNN [35], (2) LSTM, (3) BiLSTM, and (4) BERT.
Attack Methods. We implement three types of adversarial texts constructed with different modification strategies as attack methods. Note that we are interested in defending against multiple adversarial attacks rather than attacks performed by adding only a single quantity or a single form of perturbation to the text.
Examples of each type of perturbation are given in
Figure 5.
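To make the perturbation granularities concrete, here is a minimal sketch of a character-level homoglyph attack and a word-level synonym substitution. The homoglyph table and synonym dictionary are illustrative, and real attacks choose substitution positions adversarially rather than greedily.

```python
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о"}  # Latin -> look-alike Cyrillic

def char_perturb(word):
    """Character-level attack: swap one letter for a visual homoglyph."""
    for latin, confusable in HOMOGLYPHS.items():
        if latin in word:
            return word.replace(latin, confusable, 1)
    return word

SYNONYMS = {"great": "terrific", "bad": "poor"}

def word_perturb(tokens):
    """Word-level attack: substitute words with label-preserving synonyms."""
    return [SYNONYMS.get(t, t) for t in tokens]

adv_char = char_perturb("great")  # looks identical but one character differs
adv_word = word_perturb(["a", "great", "movie"])
```

A mixed-level attack, as tested below, simply applies both kinds of modification to the same text.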
Baselines. To evaluate the experimental results, we compare our strategy with the following baselines:
Note that since both DNE and RanMASK are defense methods that improve the overall robustness of the model itself, their experimental results are presented in
Section 4.4.
Implementation. The methods are implemented in the machine learning framework TensorFlow, and all experiments are carried out on a PC with an Intel Xeon(R) Gold 5115 CPU and three Tesla P40 GPU cards.
4.2. Experimental Results
4.2.1. Performance on Original Text
We randomly select 1000 original examples from each of the three datasets for each model. We first evaluate the performance of the baseline methods and TextShelter on the original data to verify whether the various defense measures have a negative impact on prediction. The classification accuracy is shown in
Figure 6.
It is observed that the different defense measures, when applied to the original text, affect the original classification accuracy only slightly and to varying degrees; the overall accuracy does not change significantly. Some defense methods even have a positive effect on the classification accuracy of the models on the IMDb dataset. We hypothesize that this is because the texts in IMDb are longer and contain more semantic information, so processing the original text with the defense methods may increase classification accuracy. In contrast, most defense methods slightly reduce the accuracy on the original examples, but the decrease is small. The experimental results confirm that the various defense methods have little effect on the performance of the original model.
Overall, AT is relatively good at maintaining accuracy. This is because AT does not modify the data provided as input to the model but instead improves the robustness of the classification model through retraining. Although the performances of the different models on different datasets are not the same, the above baseline methods as well as TextShelter can all maintain the original performance of the models within a certain range.
4.2.2. Performance on Adversarial Text
We randomly select 1000 original samples from each of the three datasets for each model and test the effectiveness of adversarial examples generated with the above perturbation types.
Table 1 reports the performance of all examples under the different attack and defense settings. The higher the restored classification accuracy, the more effective the defense method.
The experimental results show that all of the defense approaches increase the robustness of the models. Whether against character-level attacks, word-level attacks, or combinations of the two, our results surpass the benchmarks in restoring classification accuracy. Among the other defense methods, WRM performs well against character-level attacks but worst against word-level and mixed-level attacks, because misspellings appear mainly in character-level modifications. We also observe that AT exhibits stable defense performance and can restore model accuracy to 40–60% across the different types of attacks. In contrast, the defensive effect of DISP does not meet our expectations. We attribute this to the large number of adversarial perturbations in the test data: when multiple modifications are present, DISP lacks the correct context and cannot accurately recover the original text. From another perspective, the defense results of DISP against these adversarial examples are similar to those of WRM when multiple modifications are involved. Meanwhile, TextShelter achieves the best performance against all types of adversarial attacks for all three classification models, illustrating the strength of our method.
Moreover, we plot the differences in the defense performance of the various methods on adversarial text in
Figure 7, using the IMDb dataset as an example. The values in the figure are the differences in accuracy between the results obtained on the adversarial text with each defense method applied and the results obtained on the original text. The lower these values, the closer the defended accuracy is to the original accuracy, and the more effective the defense method. As shown in
Figure 7, our approach effectively mitigates the degradation of model capability in any attack mode and outperforms other defense methods, especially for mixed-level attacks.
Overall, the accuracy improvement on the IMDb dataset is better than those on SST-2 and AG’s News, which is probably because IMDb contains data of greater length. For this reason, the data in the IMDb dataset retain more semantic and emotional information when perturbations are added, which helps in recovering the text.
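The quantity plotted above can be stated as a one-line computation; the accuracy values in this sketch are hypothetical.

```python
def accuracy_gap(defended_adv_acc, original_acc):
    """Distance from clean accuracy after defending adversarial text;
    values closer to zero mean the defense restores performance better."""
    return original_acc - defended_adv_acc

gap = accuracy_gap(0.79, 0.86)  # hypothetical IMDb accuracies
```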
4.3. Ablation Experiments
To verify the effectiveness of each module,
Figure 8 shows the results of feeding hybrid adversarial samples into the different defense modules (using the IMDb dataset as an example). This allows us to examine each module’s impact and contribution to defensive performance more distinctly. The examples used in this experiment are 2000 mixed adversarial examples containing both character-level and word-level perturbations, randomly selected from the full data. This setup simulates real-world adversarial examples, whose types and numbers of modifications are unknown, as closely as possible.
As shown in
Figure 8, classification accuracy improves when the adversarial samples are fed into the different modules, and the BT module has the best defensive effect because it can rearrange the word order of a sentence and replace words appropriately. Conversely, the HR and SC modules are less effective at blocking these attacks, especially the HR module; although it is not the main defense, it still plays a defensive role when homoglyph attacks are present in the adversarial text.
To get a clearer picture of the performance of each module, we use the SST-2 dataset to perform a more detailed ablation experiment on the LSTM model, the results of which are shown in
Table 2.
As can be seen from
Table 2, the HR module keeps the performance of the original model stable when no attack is present, because HR performs no processing when the text contains no homoglyphs. The SC module has a slight impact on performance because SC corrects some words in the text according to context. To our surprise, the BT module has only a limited effect on accuracy, which demonstrates that it strongly preserves the semantics of the text. Meanwhile, compared to the results in
Table 1, when we use only the SC module, the result is better than WRM, which only targets character attacks, and when we use only the BT module, the defensive effect is also better than DISP, which only targets word-level attacks.
In the case of attack, the performance of each model on the SST-2 dataset is similar to the performance on the IMDb dataset in
Figure 8, with HR and SC being more targeted and BT being the best. On the whole, each module plays a different defensive role. The HR module mainly detects and restores interference from visually “identical” characters. The SC module corrects spelling errors. These two modules are defensive measures against character-level disturbances. The BT module targets less perceptible word-level disturbances (i.e., word substitution, phrase collocation, or clause insertion) to disrupt the adversary’s elaborate design. Clearly, for texts with multiple types of adversarial modifications, the complete framework is the most effective measure. For adversarial texts similar to those in the real world, whose exact types and numbers of disturbances are difficult to perceive, the most comprehensive defense must be deployed. Since all of the perturbation types we have observed can be classified as character-level modifications or word-level replacements, our defense method can also resist perturbations of these types generated in any other way.
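The two character-level defenses can be sketched as follows; the homoglyph table, toy vocabulary, and deletion-only edit-distance search are illustrative simplifications, not the actual HR/SC implementations.

```python
import unicodedata

def restore_homoglyphs(text):
    """HR sketch: fold visually confusable characters back to ASCII."""
    normalized = unicodedata.normalize("NFKC", text)
    # NFKC does not fold Cyrillic look-alikes, so an explicit table is kept.
    table = str.maketrans({"а": "a", "е": "e", "о": "o"})
    return normalized.translate(table)

VOCAB = {"great", "movie", "plot", "boring"}  # toy vocabulary

def spell_correct(word, vocab=VOCAB):
    """SC sketch: repair a word by trying single-character deletions."""
    if word in vocab:
        return word
    for i in range(len(word)):
        candidate = word[:i] + word[i + 1:]
        if candidate in vocab:
            return candidate
    return word

clean = restore_homoglyphs("grеаt")  # contains Cyrillic 'е' and 'а'
fixed = spell_correct("grreat")
```

A production spell checker would rank candidates over the full edit-distance neighborhood with a language model rather than accepting the first deletion that lands in the vocabulary.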
4.4. Evaluation of Combining Benchmark Methods
Thus far, this article has considered only an overall comparison with baseline defenses. We assume that all adversarial samples are generated under black-box conditions, meaning the defender does not know the exact amount or form of perturbation. Therefore, defenses in a real-world environment should pay more attention to multiple forms of attack.
In this section, we first consider the combination of WRM and DISP as another combined defense framework for a fairer comparison with our approach. Second, many methods have emerged to improve the robustness of the model itself, and we compare against them here to further verify the effectiveness of the defense method proposed in this paper. Finally, with the advent of large-scale language models, more and more systems are adopting pre-trained models, so we also evaluate the BERT model, which shows that our method has a degree of general applicability. The experimental data used here consist of the same 1000 randomly selected adversarial examples as in
Section 4.3. The combined experimental results of WRM and DISP are shown in
Table 3 (taking the IMDb dataset as an example), and a comparison of the results of the DNE, RanMASK, and TextShelter experiments is shown in
Table 4 (taking AG’s News as an example).
Table 3 also shows that all defense methods have some effect on the BERT model and that TextShelter is not only effective for pre-trained models but also superior to the other methods. Meanwhile, when the two benchmark methods DISP and WRM are combined, the classification accuracy improves significantly. For hybrid adversarial text, the combined protection of the two defenses indeed achieves a better defensive effect than either single method. Nevertheless, TextShelter retains a significant advantage in classification accuracy. The combination of WRM and DISP does not achieve the defensive effect we would expect. We conjecture that this stems from the uncertainty of the DISP module caused by the separate training of the perturbation discriminator and the embedding estimator: the ability to repair malicious disturbances depends on how well the discriminator identifies incorrect words. If the perturbation discriminator wrongly flags correct words as incorrect, this directly affects the restoration performed by the embedding estimator. We find that DISP does indeed mark some error-free words as needing restoration, which directly compromises the effectiveness of the preceding WRM module in restoring misspelled words.
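Combining two repair-style defenses, as with WRM followed by DISP above, amounts to composing text-to-text filters applied before classification. The two stage functions in this sketch are stand-ins for the real WRM and DISP models.

```python
from functools import reduce

def chain_defenses(*stages):
    """Compose text -> text defense stages, applied left to right."""
    def defended(text):
        return reduce(lambda t, stage: stage(t), stages, text)
    return defended

# Stand-ins for the real models: WRM-style spelling repair, then a
# DISP-style word-level restoration.
wrm_stub = lambda t: t.replace("grreat", "great")
disp_stub = lambda t: t.replace("terrific", "great")

pipeline = chain_defenses(wrm_stub, disp_stub)
restored = pipeline("a grreat and terrific movie")
```

The stage order matters in practice: a later stage that falsely flags clean words can undo repairs made by an earlier one, which is exactly the failure mode observed for WRM followed by DISP.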
RanMASK is configured as RanMASK-90% with the “vote” strategy. The “Original” column gives the classification accuracy on the original text, while the “Word” and “Mixed” columns give the accuracy improvements achieved under the corresponding attacks; the higher the value, the better the defense effect. As can be seen from
Table 4, DNE and TextShelter have comparable effects against word-level attacks, and DNE is even better than TextShelter on TextCNN, but it is slightly worse against mixed-level attacks: because its robustness improvement is based on synonym-substitution attacks, its defense against character-level perturbations is poor. RanMASK defends similarly to TextShelter against both word-level and mixed-level attacks but is less effective on the BERT model.
Generally speaking, whether a defense repairs the input text with a trained model or improves the robustness of the model itself through various means, TextShelter holds certain advantages over the other methods in defending against attacks and recovers the input text better.
4.5. Evaluation of Sentiment Tendency
To explore the impacts of various types of attack and defense approaches on sentiment classification, we provide specific examples from the SST-2 dataset processed with TextCNN in
Figure 9 and list the corresponding sentiment scores in
Table 5. We use the Valence Aware Dictionary and sEntiment Reasoner (VADER) [
39] to evaluate sentiment as positive, negative, or neutral. As shown in
Figure 9, adversarial examples generated in all three types of attacks contain imperceptible modifications to the text, which affect predictions. Defensive operations precisely eliminate these disturbances while retaining the emotion and semantics of the text to the greatest extent possible. As shown in
Table 5, the adversarial examples at each attack level flip the output labels, and their VADER scores shift to the opposite polarity. Our defense strategy filters out the perturbations in the input text while preserving its emotional content.
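To illustrate how lexicon-based scoring of this kind buckets text into positive, negative, or neutral, here is a toy scorer in the spirit of VADER. The mini-lexicon and its weights are invented, though the ±0.05 thresholds mirror those VADER applies to its compound score.

```python
# Invented mini-lexicon; real VADER uses thousands of human-rated words plus
# rules for negation, intensifiers, punctuation, and capitalization.
LEXICON = {"great": 3.1, "good": 1.9, "boring": -1.3, "awful": -2.8}

def sentiment_label(text, pos_thresh=0.05, neg_thresh=-0.05):
    """Sum word valences and bucket the total, mirroring the thresholds
    VADER applies to its compound score."""
    score = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    if score >= pos_thresh:
        return "positive"
    if score <= neg_thresh:
        return "negative"
    return "neutral"

label = sentiment_label("a great movie with a good plot")
```

A homoglyph or misspelling attack defeats exactly this kind of lexicon lookup, since the perturbed token no longer matches any lexicon entry, which is why restoring the text before scoring recovers the sentiment.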