1. Introduction
The rapid development of artificial intelligence (AI) and large language models (LLMs) has led to their wide use in NLP tasks such as classification, inference, speech generation, and translation [1,2,3]. However, research has shown that deep learning models are highly sensitive and susceptible to small perturbations [4]. This vulnerability allows adversarial examples to pose significant challenges across multiple domains, including image recognition, natural language processing, steganography [5], and frequency-domain analysis [6], giving rise to serious security risks, privacy leakage, and a substantially degraded user experience. For example, people often rely on reviews of products or services when shopping or choosing movies, and some applications use sentiment analysis to provide users with recommendations based on historical reviews [7]. An attacker can generate adversarial examples from real user reviews to spoof the scoring mechanism of a recommender system, discredit competitors, and cause users to receive incorrect recommendations. In addition, attackers can disguise spam as normal email to evade detection, which not only makes it harder for users to find legitimate messages but also increases the risk of virus propagation. Other NLP-based systems may likewise be affected by adversarial examples: public-opinion data may be disguised as normal text and widely distributed on the Internet to deceive opinion-monitoring systems, and patients' symptom descriptions may be falsified, causing systems such as medical triage to fail. Research on defending against textual adversarial examples is therefore both necessary and urgent. By probing model vulnerabilities more deeply, researchers can design more effective defense methods that improve the robustness of deep learning models [8] and prevent the harm that a flood of textual adversarial examples could inflict on industry and even the country, such as economic losses for enterprises and leakage of private information.
In computer vision (CV), more mature methods are available to defend against adversarial attacks. These include the use of denoising autoencoders (DAEs) to remove adversarial noise [9], the use of generative adversarial networks (GANs) to generate unperturbed images [10], and the reconstruction of high-quality original images to remove adversarial perturbations [11]. However, because images are continuous inputs while text is discrete, the approaches for defending against adversarial attacks in CV cannot be directly applied to NLP tasks. Meanwhile, research on text defenses has been relatively scant and has mostly targeted a single type of attack rather than a diversified range; such methods cannot achieve a universal defense and incur high defense costs. Completing adversarial text defense simply and efficiently is the focus of current research.
Spellchecking is the main means of defending against character-level attacks. Traditional methods rely on dictionary matching and are relatively inefficient. Although deep learning methods based on CNNs and RNNs perform better, they still suffer from limitations such as homogeneous training samples and disregard for contextual semantics, and they lack specialized optimization for homoglyph attacks. In addition, word-level defense methods generally face several challenges: on the one hand, defenders find it difficult to predict attack strategies, which limits the effectiveness of methods that rely on attack-specific training; on the other hand, identifying high-level attacks such as synonym substitution requires complex modeling, and the defense cost remains high.
In this paper, we consider the characteristics of presently available attack methods and the limitations of current defenses, and we propose a general text-attack defense framework named TextShelter, which focuses on simple and efficient defense through semantics-preserving input reconstruction without modifying the model structure or training process. Before the text is input into the model, it is rewritten by TextShelter's three modules, Homoglyph Reversion (HR), Spelling Correction (SC), and Backtranslation (BT), which remove possible homoglyph, character-level, and synonym attacks from the text.
The main contributions of this paper are as follows:
We propose an adversarial defense method called TextShelter that accounts for the presence of multiple types of perturbation in the same text; it can defend against almost all adversarial attacks and is also effective against any single attack. Furthermore, the method requires neither access to the model's training process and parameters nor querying or retraining the model, making it general, simple, and efficient.
TextShelter is the first defense method specifically designed against homoglyph attacks. It has been thoroughly evaluated on four models across three benchmark datasets under attacks of varying severity, achieving a maximum accuracy improvement of 70.7% and an average gain of 30%. Experimental results demonstrate that TextShelter significantly outperforms most existing state-of-the-art methods on deep learning models and matches the performance of the best current approaches on large language models.
TextShelter incorporates three distinct modules, whose optimization and synergistic interaction significantly enhance the overall defense capability. Ablation studies validate the individual contribution and performance of each module, demonstrating that optimizing a single module can further improve defense effectiveness.
Compared to existing defense methods, TextShelter’s key advantage lies in its extremely low defense overhead, as it does not rely on costly retraining or extensive computational resources. Moreover, we systematically evaluate its performance in terms of semantic preservation and semantic transfer, and further discuss the safety and practicality of each module—all experiments yield excellent results.
2. Related Work
Character-level attacks are one of the earliest forms of adversarial text attacks [12], and most such attacks insert, delete, or modify keywords. Steffen et al. [13] first proposed an adversarial text attack based on visual perturbations that uses homoglyphs to replace characters in words. Homoglyphs are two or more characters with different meanings but similar visual presentation that are represented by different Unicode encodings, as shown in Figure 1. Homoglyphs are frequently used in practice, for example in the creation of phishing websites, spam detection evasion, and domain-name spoofing.
There are already defense methods against character-level attacks. Pruthi et al. [14] placed a word recognition model before the classifier. This model is based on the semi-character-level recurrent neural network (ScRNN) architecture and includes several new back-off strategies for handling rare and unseen words. After training, the model can identify words that have been damaged by random additions, deletions, swaps, and keyboard errors, thus providing a defense against adversarial examples containing character-level perturbations. However, it can correct only character-level perturbations and cannot handle coarser-grained modifications such as word replacement or phrase addition. Literature [15] proposed a framework that jointly uses character embedding and adversarial stability training to defend against character-level adversarial examples. It can handle both out-of-vocabulary (OOV) words and distributional differences between the training set and the adversarial examples. However, this method likewise defends against only character-level attacks.
To make adversarial text more difficult to detect, word-level adversarial examples have been developed [16]. Typically, word-level adversarial text scrambles the original text by removing, adding, and replacing keywords, and the words chosen for replacement are usually synonyms, to better preserve semantics [17]. Ren et al. [18] proposed a greedy algorithm for adversarial attacks called probability-weighted word saliency (PWWS) based on the synonym substitution strategy. Some defense methods have also been proposed for word-level attacks. Wang et al. [19] proposed the Synonym Encoding Method (SEM), in which an encoder is inserted before the input layer of the model to force all synonym neighbors of a word x to be encoded as x itself. SEM can effectively defend against current adversarial attacks based on synonym replacement while not considerably decreasing the accuracy on benign examples. Nevertheless, SEM still requires modification of the model structure and retraining of the model to defend against word-level substitutions. Zhou et al. [20] proposed a novel framework called Learning to Discriminate Perturbations (DISP). In this framework, a perturbation discriminator estimates the probability that each token in the text has been perturbed and provides a set of potential perturbations. For each potential perturbation, an embedding estimator then recovers the embedding of the original word based on context and selects replacement tokens via an approximate k-nearest neighbor (kNN) search. DISP can defend against adversarial attacks on any NLP model without modifying the model structure or training process. However, its performance deteriorates in two cases: (1) when the remaining sentence lacks informative context, the recovered tokens are inaccurate, and (2) when multiple perturbations exist, the erroneous context also leads to unsatisfactory recovery.
To date, there has been relatively little research at the sentence level; one common attack is to add elaborately crafted sentences to a sample to fool the classifier [12,21]. As an alternative, Iyyer et al. [22] proposed a syntactically controlled paraphrasing network (SCPN) and used it to generate adversarial examples. However, there are currently few defenses against sentence-level attacks.
At the same time, some defense methods from the CV domain have been adapted for adversarial text defense, of which adversarial training [23] and adversarial data augmentation [24] are the earliest and most widely used. In the literature [17,25,26], scholars adopted adversarial training as a defense strategy against adversarial text attacks; this method effectively defends against adversarial images and also exhibits a good effect on text. However, it requires retraining the original model during the training phase and is consequently very time- and cost-intensive. Distillation is also a common method for defending against adversarial examples in the image domain. However, Marcus et al. [27] evaluated the performance of defensive distillation for text, and the experimental results showed no improvement in the robustness of networks trained with it.
With the continued growth of large language models, attacks on and defenses of pre-trained models have also attracted extensive research [28]. Cen et al. [29] proposed a Large Language Model Adversarial Defense (LLMAD) method comprising two modules: perturbation detection and perturbation correction. The perturbation correction module uses a fine-tuned large language model to rectify adversarial perturbations. Although this method achieved an average improvement of 66.8% over the undefended baseline, it incurs relatively high defense costs. Ye et al. [30] proposed SAFER, a randomized smoothing method that provably ensures the prediction cannot be altered by any possible synonymous word substitution; however, it must assume that the defender knows how the adversary generates synonyms, which does not fit realistic attack scenarios. The core of traditional AI adversarial defense lies in maintaining the model's decision robustness on specific tasks, whereas the key to large-model adversarial defense is ensuring behavioral safety and value alignment in open-ended generation scenarios. While Wu et al. [31] proposed a defense against prompt-injection attacks on large models, existing defensive approaches still demonstrate insufficient robustness when applied to large language models.
3. TextShelter
TextShelter is a defense model for blocking black-box adversarial attacks; it achieves efficient defense through simple, semantics-preserving reconstruction of input adversarial examples under the constraint that the parameters, architecture, and training data of the target model are unknown.
TextShelter consists of three proprietary modules, each designed by analyzing a different level of attack and using the influence of text features on the model to optimize the input for defense. Unlike existing methods that require model retraining, TextShelter's input reconstruction is easier to deploy and implement.
Problem Statement: For the task of sequence classification, given a pretrained target model F, the model maps the feature space of the input text to a set of classification labels Y, where the labels may come from several categories. The input text is denoted by X = (x1, x2, …, xn), where xi represents the i-th token and n is the length of the sequence. An attacker can generate an adversarial example X′ by adding imperceptible perturbations to the original text such that F(X′) ≠ F(X), leading to deviation in the predictions. Real-world adversarial examples may have multiple manifestations, such as character-level modifications performed by inserting or deleting a character and word-level substitutions performed by replacing a keyword with its synonym. Neither a single spellchecker nor adversarial training can effectively address such adversarial examples; therefore, it is crucial to implement defenses against mixed adversarial examples. We wish to apply our defense method D to the adversarial text such that F(D(X′)) = F(X), which indicates that the prediction of the classification model is restored to the label of the original text.
3.1. Framework Overview
To defend against character-level and word-level attacks on text classifiers, we propose a three-stage pipeline method aiming to repair inaccuracies in words, including misspellings, homoglyphs, and synonyms. TextShelter is added before the text classifier, and we modify the input to the classification model to achieve the desired defense effect. Our method consists of the various components shown in
Figure 2.
The original text S is changed into the adversarial text S′ by an adversarial attack. Then, our defense modules process S′ to obtain S″ as the input to the text classifier, thereby markedly improving the classification accuracy. Details of each module are described below.
3.2. Homoglyph Reversion (HR) Module
To mitigate the security impact of visual attacks, this paper first proposes an HR defense module for homoglyph attacks that can reverse most such attacks. The HR module comprises two steps: first, a homoglyph filter is applied to discover and recognize homoglyphs, and second, these homoglyphs are converted into machine-readable American Standard Code for Information Interchange (ASCII) codes. This design speeds up the text-reconstruction step: if the text contains no homoglyphs, it is left unmodified and passes directly to the next module.
Unicode assigns a standardized and unique hexadecimal code to each character in each language to meet the requirements of cross-language text conversion and processing. First, each character is converted to "UTF-16" format by detecting its Unicode encoding, so as to determine the Unicode values of all characters in the text (e.g., "a" corresponds to "U+0061"). We maintain a basic purified list L, which consists of the last four digits of the Unicode values of the basic Latin letters (26 letters), digits, and common punctuation. Iterating through the adversarial text S′, we convert each character into a list of Unicode values U. We then compare the last four digits of each Unicode value in U against the reference list L. If an element appears in U but not in L, the character represented by the corresponding Unicode value is identified as a homoglyph in the adversarial text.
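The detection step above can be sketched as follows. The purified list here is a small illustrative subset, and `find_homoglyphs` is a hypothetical helper name, not the paper's implementation:

```python
# Sketch of the homoglyph filter: flag any character whose Unicode
# codepoint falls outside a basic purified list of Latin letters,
# digits, and common punctuation.

BASIC_LATIN = set("abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "0123456789 .,!?;:'\"()-")

def find_homoglyphs(text):
    """Return (index, char, codepoint) for every suspect character."""
    suspects = []
    for i, ch in enumerate(text):
        if ch not in BASIC_LATIN:
            suspects.append((i, ch, f"U+{ord(ch):04X}"))
    return suspects

# Cyrillic "а" (U+0430) substituted for Latin "a" (U+0061):
print(find_homoglyphs("b\u0430d movie"))  # one suspect at index 1
```

Comparing codepoints rather than glyph images makes this first pass cheap; the visually based dictionary lookup only runs on the flagged characters.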
Then, we build a homoglyph dictionary containing most homoglyph pairs of English characters and compare the detected homoglyph characters against this dictionary to restore them to standard ASCII codes. Taking the letter "a" as an example, we first construct an empty homoglyph candidate set C and then compare its grayscale image, normalized to a fixed resolution to balance computational efficiency, robustness, and discriminative capability, with those of non-standard ASCII characters in Unicode for fast and effective image-similarity assessment. To save comparison time, the Unicode encoding range is limited, and encodings with high image similarity are inserted into the candidate set C; the comparison ends once the candidate set contains n homoglyph characters. The homoglyph dictionary can also be expanded according to specific practical requirements to further improve the reversion of homoglyph characters.
Note that since the Unicode encoding space is finite, so is the number of homoglyphs per character; as long as the homoglyph dictionary is large enough to contain all homoglyphs, this method can be transferred to any defense scheme targeting homoglyph attacks. At the same time, an excessively large homoglyph dictionary may increase the time cost of the defense, so its size can be adjusted to the needs of the actual defense process.
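A minimal sketch of the reversion step, assuming a hand-curated homoglyph dictionary (the module's real dictionary, built via image-similarity comparison, is far larger); NFKC normalization serves as a fallback for Unicode compatibility characters:

```python
import unicodedata

# Toy homoglyph dictionary: confusable character -> ASCII equivalent.
HOMOGLYPH_DICT = {
    "\u0430": "a",  # Cyrillic а -> Latin a
    "\u0435": "e",  # Cyrillic е -> Latin e
    "\u043e": "o",  # Cyrillic о -> Latin o
    "\u0440": "p",  # Cyrillic р -> Latin p
    "\u0455": "s",  # Cyrillic ѕ -> Latin s
}

def revert_homoglyphs(text):
    out = []
    for ch in text:
        if ch in HOMOGLYPH_DICT:
            out.append(HOMOGLYPH_DICT[ch])
        elif ord(ch) > 0x7F:
            # Fallback: compatibility decomposition; keep the character
            # unchanged if it has no ASCII equivalent.
            norm = unicodedata.normalize("NFKC", ch)
            out.append(norm if norm.isascii() else ch)
        else:
            out.append(ch)
    return "".join(out)

print(revert_homoglyphs("gr\u0435\u0430t film"))  # -> "great film"
```

Because every entry maps back to a single ASCII codepoint, the restored text re-enters the model vocabulary instead of being embedded as [unknown].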
Theoretically, if a word contains homoglyphs, its embedding will be marked as [unknown] when it enters the model. After the HR module restores the word, the embedding returns to normal, and the model can again extract text features through the embedding.
3.3. Spelling Correction (SC) Module
It has been shown that spellchecking can be used to defend against character-level attacks. However, existing methods suffer from several critical limitations: (1) they rely on homogeneous training data that fails to generalize to cases where all characters in a word are perturbed; (2) their performance remains constrained by dictionary coverage; (3) contextual semantics are often ignored, risking semantic distortion during correction; and (4) specific mechanisms to handle homoglyph attacks are still lacking. Since current spellchecking tools cannot reliably restore all adversarial perturbations, this paper proposes a dedicated Spelling Correction (SC) module, which defends against character-level attacks via a two-stage pipeline: error detection and contextual correction.
The purpose of spellchecking here is to improve defense efficiency by handling different types of attacks separately: it determines whether the text contains spelling errors (including homoglyph errors and improperly restored homoglyph characters) and effectively distinguishes word-level attacks from character-level attacks. Based on the shortcomings of current spellchecking, we propose an improved E-ScRNN model with the structure shown in
Figure 3.
3.3.1. Overall Framework
The E-ScRNN model is mainly divided into three parts: word representation, dynamic word vectors, and model training. The core of the word representation is a semi-character vector V that represents a possibly jumbled word, where the first and last characters are represented separately and the representation of the middle characters is independent of their order. Each input word w = (c1, c2, …, cl) is represented by concatenating V = [v_first; v_mid; v_last], where v_first is a one-hot vector representation of the first character c1, v_mid is a bag of characters representing the middle characters (c2, …, c(l−1)), and v_last is a one-hot representation of the last character cl. Each word in the text is represented by this semi-character vector together with the calculated Embeddings from Language Models (ELMo) vector; these two vectors are combined and input into a bidirectional long short-term memory (BiLSTM) unit f. The output of the hidden layer is used as the input to a softmax layer, which outputs the predicted value ŷ for the input word; that is, the predicted word corresponding to each input word is produced through a learned weight matrix W (whose output dimensionality equals the vocabulary size), and the model is optimized using the cross-entropy loss.
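The semi-character representation can be sketched as follows, using a toy lowercase alphabet (the vector sizes are illustrative, not the model's actual dimensions):

```python
from collections import Counter

# Semi-character word vector: [one-hot first char | bag of middle
# chars | one-hot last char], so reordering the middle characters
# leaves the representation unchanged.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(ch):
    v = [0.0] * len(ALPHABET)
    if ch in IDX:
        v[IDX[ch]] = 1.0
    return v

def bag_of_chars(chars):
    v = [0.0] * len(ALPHABET)
    for ch, n in Counter(chars).items():
        if ch in IDX:
            v[IDX[ch]] = float(n)
    return v

def semi_char_vector(word):
    first, middle, last = word[0], word[1:-1], word[-1]
    return one_hot(first) + bag_of_chars(middle) + one_hot(last)

# "word" and its jumbled form "wrod" share the same representation,
# since only the middle characters are reordered:
assert semi_char_vector("word") == semi_char_vector("wrod")
```

This invariance is what lets the model recognize a scrambled word before the context-sensitive ELMo component disambiguates the correction.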
Here, k denotes the index of a vocabulary token (the model predicts the probability of the k-th word via the softmax over W·h_i), and the hidden state vector h_i represents the context-aware hidden state output by the BiLSTM at the i-th time step. This hidden state, whose dimension is fixed regardless of vocabulary size, is the input to the subsequent softmax layer.
E-ScRNN is built upon ScRNN [14] and has been optimized for the characteristics of character-level attacks, aiming at more efficient and accurate defense. In principle, once the model has learned enough misspellings, it can effectively determine whether an input word is spelled correctly, and the addition of ELMo helps restore the correct characters from context. From a defense point of view, as with homoglyph attacks, the embedding of a misspelled word entering the model is marked as [unknown]; once the correct spelling is restored, the model can classify normally. Architecturally, E-ScRNN uses a two-layer bidirectional LSTM with a hidden dimension of 512. The input representation concatenates a semi-character vector (one-hot encodings of the first and last characters and a bag-of-characters for the middle segment, totaling 768 dimensions) with a 1024-dimensional ELMo embedding, and this combined vector is linearly projected down to 512 dimensions. The model is trained on a merged dataset consisting of IMDb, WikiText, and synthetically perturbed examples (approximately 2 million word pairs), with simulated error rates of swap 30%, drop 25%, add 25%, and key 20%. For ELMo integration, we use its three layer representations (token embedding plus two BiLSTM layers). This design balances local glyph robustness (via the semi-character representation) against global semantic recovery (via context-aware ELMo embeddings). The specific details and optimizations of the E-ScRNN model are as follows.
3.3.2. Expansion of Misspellings
To enable the model to learn more forms of spelling errors, the first and last characters of a word are allowed to be modified when constructing data. The maximum character length of a word that is allowed to be modified is reduced following different modification strategies representing various spelling errors that may occur in a word. These errors include (a) swap: swapping the positions of two adjacent characters in the word; (b) add: inserting a character at any position in the word; (c) drop: deleting a character at any position in the word; (d) key: replacing a character in the word with a character located next to it on the keyboard. For different types of errors, different probabilities of occurrence are set to guide the learning process of the model. Such modifications can improve the learning ability of the model to improve the accuracy of spelling correction.
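The four perturbation strategies can be sketched as follows; the keyboard-neighbor map is a tiny illustrative subset, and the per-type weights mirror the simulated error rates reported above (swap 30%, add 25%, drop 25%, key 20%):

```python
import random

KEYBOARD_NEIGHBORS = {  # small illustrative subset of a QWERTY map
    "a": "qwsz", "e": "wrds", "o": "ipkl", "s": "awedxz", "t": "rygf",
}

def perturb(word, rng=random):
    """Apply one of four misspelling strategies to a word."""
    if len(word) < 2:
        return word
    kind = rng.choices(["swap", "add", "drop", "key"],
                       weights=[0.30, 0.25, 0.25, 0.20])[0]
    i = rng.randrange(len(word) - 1)
    if kind == "swap":   # swap two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "add":    # insert a random letter
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
    if kind == "drop":   # delete one character
        return word[:i] + word[i + 1:]
    ch = word[i]         # "key": replace with a keyboard neighbor
    if ch in KEYBOARD_NEIGHBORS:
        return word[:i] + rng.choice(KEYBOARD_NEIGHBORS[ch]) + word[i + 1:]
    return word

rng = random.Random(0)
print([perturb("theater", rng) for _ in range(3)])
```

Training pairs are then formed as (perturbed word, original word), so the model sees many error forms per clean word.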
3.3.3. Dictionary Settings
We collected and combined a variety of datasets from different fields to expand the original training set so that the dictionary covers more words. It is important, however, not to blindly increase the dictionary size. Firstly, a dictionary that is too large reduces training speed and increases computational cost. Secondly, the number of common English words is about 5000, and an overly large dictionary can cause words to be corrected into rare words, which affects semantics. We tested dictionary sizes of 5000, 10,000, 20,000, and 50,000, and the correction effect stabilizes at around 20,000 words. Additionally, the abbreviation "'ve" was a standalone entry in the original dictionary, which caused "I've" to be erroneously restored to "I have", "'ve", "live", or other forms. This not only prevents the text from being restored correctly but also harms the reading experience and affects text classification. Therefore, the updated dictionary was amended so that phrases with abbreviations such as "I've" are treated as one word. This improvement not only avoids the associated semantic problems but also appropriately expands the dictionary's vocabulary, making it easier to predict other words during correction and restoration.
3.3.4. Dynamic Word Embedding
Although text input is discrete in nature, the discrete words still have strong contextual relationships, and text can be better understood by incorporating context. Matthew et al. [32] proposed a dynamic method of generating word vectors called ELMo. Instead of using a fixed embedding for each word, this method examines the entire sentence before assigning embeddings, using N-layer long short-term memory (LSTM) units trained on specific tasks to create the embeddings.
Given a text sequence of n tokens (t1, t2, …, tn), a BiLSTM language model is used, consisting of a forward and a backward language model. It is trained by maximizing the log-likelihood of each word in both the forward and backward directions. The forward language model predicts token tk from (t1, …, t(k−1)), and the backward language model predicts it from (t(k+1), …, tn). Once this language model has been pretrained, ELMo combines the input token representation with the forward outputs h→(k,j) and backward outputs h←(k,j). For each token tk, an N-layer BiLSTM computes a total of 2N + 1 representations. A softmax-normalized task-specific weight s_j is assigned to each layer; each layer's vector is multiplied by its weight, and the weighted sum is scaled by a scalar parameter γ. The ELMo calculation is expressed as follows:

ELMo_k = γ · Σ_{j=0}^{N} s_j · h_{k,j}    (2)
During fine-tuning, we adopt the standard form of Equation (2) to integrate ELMo, initializing the layer weights as s_j = 1/3 (corresponding to the token embedding layer and the two BiLSTM layers) and the scalar scaling factor as γ = 1.0, and we jointly optimize all s_j and γ during training. Here, h_{k,j} is the context-dependent representation output at time step k of layer j.
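A small numeric sketch of the combination in Equation (2): the ELMo vector for a token is the softmax-weighted, γ-scaled sum of its per-layer representations. The 3-dimensional layer vectors below are toy stand-ins:

```python
import math

def softmax(w):
    m = max(w)
    e = [math.exp(x - m) for x in w]
    z = sum(e)
    return [x / z for x in e]

def elmo_combine(layer_vecs, layer_weights, gamma):
    """ELMo_k = gamma * sum_j s_j * h_{k,j}, with s = softmax(w)."""
    s = softmax(layer_weights)
    dim = len(layer_vecs[0])
    return [gamma * sum(s[j] * layer_vecs[j][d]
                        for j in range(len(layer_vecs)))
            for d in range(dim)]

# Three layers (token embedding + two BiLSTM layers), equal weights
# (softmax of zeros gives s_j = 1/3 each):
h = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [2.0, 2.0, 2.0]]
print(elmo_combine(h, [0.0, 0.0, 0.0], gamma=1.0))
```

With equal weights this averages the three layers per dimension; during training the weights s_j and the scale γ are learned jointly with the task.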
In the method proposed in this paper, the ELMo representations of the words of the input text are calculated and added as a dimension of the input provided to the E-ScRNN model.
The E-ScRNN model can greatly improve the accuracy of word correction, not only achieving effective defense against character-level attacks but also reducing the time cost.
3.4. Backtranslation (BT) Module
Meanwhile, synonym replacements and the addition of specific phrases also have a high probability of appearing in adversarial texts. To ensure that our method can handle both character-level and word-level attacks and even the addition of irrelevant statements, we refer to the compression and reconstruction methods that have been developed in the image domain and propose the BT module to restate sentences.
In adversarial attacks using synonym substitution, lower-frequency words tend to be substituted in, and the fixed collocations and idiomatic usages of some words change during substitution, causing some loss of semantics and sentence fluency. BT is defined as translating a target document into another language and then back to the source language. We chose BT as the sentence-reconstruction method because neural machine translation combines contextual information to correct abnormal usage in sentences, and because commonly used words are favored during translation, which effectively mitigates the influence of the attacker's low-frequency synonyms on the model. Specifically, the initial document is transformed into a backtranslated text with different word order and content while the emotional content and semantics are preserved during translation. Consequently, a robust and secure translation model is particularly important.
An example of BT is shown in
Figure 4. To avoid the effects of word-level attacks on translation models, the translation model selected in this paper adopts a Seq2Seq + Attention structure. Compared with traditional neural machine translation pipelines, this model translates directly between two languages rather than via a third language as an intermediary, avoiding the semantic loss caused by multiple translations and further improving translation quality. Alternatively, a translation platform trained on large-scale datasets can be used; the abundance of data allows such a model to learn additional language features, further improving accuracy and producing translations with the most precise possible meaning while avoiding the introduction of monolingual stylistic choices.
In this paper, we use Chinese as the pivot language for back-translation, performing only a single round of reverse translation. It is worth noting that, by introducing contextual information during sentence reconstruction, the BT module also exhibits a certain degree of defense capability against some sentence-level attacks. Considering that adversarial attacks may affect the translation model itself, we provide a detailed discussion of this module in
Section 4.3.
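The round-trip structure of the BT module can be sketched as follows. Here `translate` is a hypothetical stand-in for the Seq2Seq + Attention model: a toy synonym map mimics the way a real translator implicitly canonicalizes vocabulary by preferring frequent words in the target language.

```python
# Toy lexicon: rare synonym -> common pivot sense. A real system would
# call a trained translation model instead.
CANONICAL = {
    "marvellous": "great", "dreadful": "bad", "flick": "movie",
}

def translate(text, src, tgt):
    # Stand-in translator: map each word to its canonical form.
    return " ".join(CANONICAL.get(w, w) for w in text.split())

def backtranslate(text, pivot="zh"):
    """Translate to the pivot language and back to the source."""
    pivoted = translate(text, src="en", tgt=pivot)
    return translate(pivoted, src=pivot, tgt="en")

print(backtranslate("a marvellous flick"))  # -> "a great movie"
```

The defensive effect comes from this canonicalization: low-frequency adversarial substitutions are mapped back toward the common wording the classifier was trained on.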
4. Experimental Results and Analysis
In this section, we evaluate the performance of our defense method on four models across three popular datasets. TextShelter is applied before the test data are input into the downstream classification models, and the experimental results show improved classification performance.
4.1. Experimental Setup
Datasets. We choose three popular datasets for our experiments: the Internet Movie Database (IMDb) dataset [33], the binary Stanford Sentiment Treebank (SST-2) [34], and a subset of AG's corpus of news articles (AG's News). IMDb and SST-2 are both movie-review sentiment-analysis datasets with binary classification labels. Specifically, IMDb is a long-text classification dataset with an average length of 262 that contains 25,000 training examples and 25,000 test examples. SST-2 is a single-sentence classification dataset with an average length of 19 that consists of approximately 70,000 sentences. AG's News contains four categories of news articles drawn from more than 2000 news sources; each category contains 30,000 training examples and 1900 test examples.
Evaluation Metrics. We use classification accuracy as the primary evaluation metric, and all reported results include the standard deviation over five random runs.
Target Models. For the text classification networks, we take as base models four networks that are widely used in real-world text classification tasks: (1) TextCNN [35], (2) LSTM, (3) BiLSTM, and (4) BERT.
Attack Methods. We implement three types of adversarial texts constructed with different modification strategies as attack methods. Note that we are interested in defending against multiple adversarial attacks rather than attacks performed by adding only a single quantity or a single form of perturbation to the text.
Examples of each type of perturbation are given in
Figure 5.
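To make the perturbation granularities concrete, here is a minimal sketch of a character-level homoglyph attack and a word-level synonym substitution. The homoglyph table and synonym dictionary are illustrative, and real attacks choose substitution positions adversarially rather than greedily.

```python
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о"}  # Latin -> look-alike Cyrillic

def char_perturb(word):
    """Character-level attack: swap one letter for a visual homoglyph."""
    for latin, confusable in HOMOGLYPHS.items():
        if latin in word:
            return word.replace(latin, confusable, 1)
    return word

SYNONYMS = {"great": "terrific", "bad": "poor"}

def word_perturb(tokens):
    """Word-level attack: substitute words with label-preserving synonyms."""
    return [SYNONYMS.get(t, t) for t in tokens]

adv_char = char_perturb("great")  # looks identical but one character differs
adv_word = word_perturb(["a", "great", "movie"])
```

A mixed-level attack, as tested below, simply applies both kinds of modification to the same text.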
Baselines. To evaluate the experimental results, we compare our strategy with the following baselines:
Note that since both DNE and RanMASK are defense methods that improve the overall robustness of the model itself, their experimental results are presented in
Section 4.4.
Implementation. The methods are implemented in the machine learning framework TensorFlow, and all experiments are carried out on a PC with an Intel Xeon(R) Gold 5115 CPU and three Tesla P40 GPU cards.
4.2. Experimental Results
4.2.1. Performance on Original Text
We randomly select 1000 original examples from each of the three datasets for each model. We first evaluate the performance of the baseline methods and TextShelter on the original data to verify whether the various defense measures have a negative impact on prediction. The classification accuracy is shown in
Figure 6.
It is observed that the different defense measures, when applied to the original text, affect the original classification accuracy only slightly and to varying degrees; the overall accuracy does not change significantly. Some defense methods even have a positive effect on the classification accuracy of the models on the IMDb dataset. We hypothesize that this is because the texts in IMDb are longer and contain more semantic information, so processing the original text with the defense methods may increase classification accuracy. In contrast, most defense methods slightly reduce the accuracy on the original examples, but the decrease is small. The experimental results confirm that the various defense methods have little effect on the performance of the original model.
Overall, AT is relatively good at maintaining accuracy. This is because AT does not modify the data provided as input to the model but instead improves the robustness of the classification model through retraining. Although the performances of the different models on different datasets are not the same, the above baseline methods as well as TextShelter can all maintain the original performance of the models within a certain range.
4.2.2. Performance on Adversarial Text
We randomly select 1000 original samples from each of the three datasets for each model and test the effectiveness of adversarial examples generated with the above perturbation types.
Table 1 reports the performance of all examples under the different attack and defense settings. The higher the restored classification accuracy, the more effective the defense method.
The experimental results show that all of the defense approaches increase the robustness of the models. Whether against character-level attacks, word-level attacks, or combinations of the two, our results surpass the benchmarks in restoring classification accuracy. Among the other defense methods, WRM performs well against character-level attacks but worst against word-level and mixed-level attacks, because misspellings appear mainly in character-level modifications. We also observe that AT exhibits stable defense performance and can restore model accuracy to 40–60% across the different types of attacks. In contrast, the defensive effect of DISP does not meet our expectations. We attribute this to the large number of adversarial perturbations in the test data: when multiple modifications are present, DISP lacks the correct context and cannot accurately recover the original text. From another perspective, the defense results of DISP against these adversarial examples are similar to those of WRM when multiple modifications are involved. Meanwhile, TextShelter achieves the best performance against all types of adversarial attacks for all three classification models, illustrating the strength of our method.
Moreover, we plot the differences in the defense performance of the various methods on adversarial text in
Figure 7, using the IMDb dataset as an example. The values in the figure are the differences in accuracy between the results obtained on the adversarial text with each defense method applied and the results obtained on the original text. The lower these values, the closer the defended accuracy is to the original accuracy, and the more effective the defense method. As shown in
Figure 7, our approach effectively mitigates the degradation of model capability in any attack mode and outperforms other defense methods, especially for mixed-level attacks.
Overall, the accuracy improvement on the IMDb dataset is better than those on SST-2 and AG’s News, which is probably because IMDb contains data of greater length. For this reason, the data in the IMDb dataset retain more semantic and emotional information when perturbations are added, which helps in recovering the text.
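The quantity plotted above can be stated as a one-line computation; the accuracy values in this sketch are hypothetical.

```python
def accuracy_gap(defended_adv_acc, original_acc):
    """Distance from clean accuracy after defending adversarial text;
    values closer to zero mean the defense restores performance better."""
    return original_acc - defended_adv_acc

gap = accuracy_gap(0.79, 0.86)  # hypothetical IMDb accuracies
```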
4.3. Ablation Experiments
To verify the effectiveness of each module,
Figure 8 shows the results of feeding hybrid adversarial samples into the different defense modules (using the IMDb dataset as an example). This allows us to examine each module’s impact and contribution to defensive performance more distinctly. The examples used in this experiment are 2000 mixed adversarial examples containing both character-level and word-level perturbations, randomly selected from the full data. This setup simulates real-world adversarial examples, whose types and numbers of modifications are unknown, as closely as possible.
As shown in
Figure 8, classification accuracy improves when the adversarial samples are fed into the different modules, and the BT module has the best defensive effect because it can rearrange the word order of a sentence and replace words appropriately. Conversely, the HR and SC modules are less effective at blocking these attacks, especially the HR module; although it is not the main defense, it still plays a defensive role when homoglyph attacks are present in the adversarial text.
To get a clearer picture of the performance of each module, we use the SST-2 dataset to perform a more detailed ablation experiment on the LSTM model, the results of which are shown in
Table 2.
As can be seen from
Table 2, the HR module keeps the performance of the original model stable when no attack is present, because HR performs no processing when the text contains no homoglyphs. The SC module has a slight impact on performance because SC corrects some words in the text according to context. To our surprise, the BT module has only a limited effect on accuracy, which demonstrates that it strongly preserves the semantics of the text. Meanwhile, compared to the results in
Table 1, when we use only the SC module, the result is better than WRM, which only targets character attacks, and when we use only the BT module, the defensive effect is also better than DISP, which only targets word-level attacks.
In the case of attack, the performance of each model on the SST-2 dataset is similar to the performance on the IMDb dataset in
Figure 8, with HR and SC being more targeted and BT being the best. On the whole, each module plays a different defensive role. The HR module mainly detects and restores interference from visually “identical” characters. The SC module corrects spelling errors. These two modules are defensive measures against character-level disturbances. The BT module targets less perceptible word-level disturbances (i.e., word substitution, phrase collocation, or clause insertion) to disrupt the adversary’s elaborate design. Clearly, for texts with multiple types of adversarial modifications, the complete framework is the most effective measure. For adversarial texts similar to those in the real world, whose exact types and numbers of disturbances are difficult to perceive, the most comprehensive defense must be deployed. Since all of the perturbation types we have observed can be classified as character-level modifications or word-level replacements, our defense method can also resist perturbations of these types generated in any other way.
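The two character-level defenses can be sketched as follows; the homoglyph table, toy vocabulary, and deletion-only edit-distance search are illustrative simplifications, not the actual HR/SC implementations.

```python
import unicodedata

def restore_homoglyphs(text):
    """HR sketch: fold visually confusable characters back to ASCII."""
    normalized = unicodedata.normalize("NFKC", text)
    # NFKC does not fold Cyrillic look-alikes, so an explicit table is kept.
    table = str.maketrans({"а": "a", "е": "e", "о": "o"})
    return normalized.translate(table)

VOCAB = {"great", "movie", "plot", "boring"}  # toy vocabulary

def spell_correct(word, vocab=VOCAB):
    """SC sketch: repair a word by trying single-character deletions."""
    if word in vocab:
        return word
    for i in range(len(word)):
        candidate = word[:i] + word[i + 1:]
        if candidate in vocab:
            return candidate
    return word

clean = restore_homoglyphs("grеаt")  # contains Cyrillic 'е' and 'а'
fixed = spell_correct("grreat")
```

A production spell checker would rank candidates over the full edit-distance neighborhood with a language model rather than accepting the first deletion that lands in the vocabulary.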
4.4. Evaluation of Combining Benchmark Methods
Thus far, this article has considered only an overall comparison with baseline defenses. We assume that all adversarial samples are generated under black-box conditions, meaning the defender does not know the exact amount or form of perturbation. Therefore, defenses in a real-world environment should pay more attention to multiple forms of attack.
In this section, we first consider the combination of WRM and DISP as another combined defense framework for a fairer comparison with our approach. Second, many methods have emerged to improve the robustness of the model itself, and we compare against them here to further verify the effectiveness of the defense method proposed in this paper. Finally, with the advent of large-scale language models, more and more systems are adopting pre-trained models, so we also evaluate the BERT model, which shows that our method has a degree of general applicability. The experimental data used here consist of the same 1000 randomly selected adversarial examples as in
Section 4.3. The combined experimental results of WRM and DISP are shown in
Table 3 (taking the IMDb dataset as an example), and a comparison of the results of the DNE, RanMASK, and TextShelter experiments is shown in
Table 4 (taking AG’s News as an example).
Table 3 also shows that all defense methods have some effect on the BERT model and that TextShelter is not only effective for pre-trained models but also superior to the other methods. Meanwhile, when the two benchmark methods DISP and WRM are combined, the classification accuracy improves significantly. For hybrid adversarial text, the combined protection of the two defenses indeed achieves a better defensive effect than either single method. Nevertheless, TextShelter retains a significant advantage in classification accuracy. The combination of WRM and DISP does not achieve the defensive effect we would expect. We conjecture that this stems from the uncertainty of the DISP module caused by the separate training of the perturbation discriminator and the embedding estimator: the ability to repair malicious disturbances depends on how well the discriminator identifies incorrect words. If the perturbation discriminator wrongly flags correct words as incorrect, this directly affects the restoration performed by the embedding estimator. We find that DISP does indeed mark some error-free words as needing restoration, which directly compromises the effectiveness of the preceding WRM module in restoring misspelled words.
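Combining two repair-style defenses, as with WRM followed by DISP above, amounts to composing text-to-text filters applied before classification. The two stage functions in this sketch are stand-ins for the real WRM and DISP models.

```python
from functools import reduce

def chain_defenses(*stages):
    """Compose text -> text defense stages, applied left to right."""
    def defended(text):
        return reduce(lambda t, stage: stage(t), stages, text)
    return defended

# Stand-ins for the real models: WRM-style spelling repair, then a
# DISP-style word-level restoration.
wrm_stub = lambda t: t.replace("grreat", "great")
disp_stub = lambda t: t.replace("terrific", "great")

pipeline = chain_defenses(wrm_stub, disp_stub)
restored = pipeline("a grreat and terrific movie")
```

The stage order matters in practice: a later stage that falsely flags clean words can undo repairs made by an earlier one, which is exactly the failure mode observed for WRM followed by DISP.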
RanMASK is configured as RanMASK-90% with the “vote” strategy. The “Original” column gives the classification accuracy on the original text, while the “Word” and “Mixed” columns give the accuracy improvements achieved under the corresponding attacks; the higher the value, the better the defense effect. As can be seen from
Table 4, DNE and TextShelter have comparable effects against word-level attacks, and DNE is even better than TextShelter on TextCNN, but it is slightly worse against mixed-level attacks: because its robustness improvement is based on synonym-substitution attacks, its defense against character-level perturbations is poor. RanMASK defends similarly to TextShelter against both word-level and mixed-level attacks but is less effective on the BERT model.
Generally speaking, whether a defense repairs the input text with a trained model or improves the robustness of the model itself through various means, TextShelter holds certain advantages over the other methods in defending against attacks and recovers the input text better.
4.5. Evaluation of Sentiment Tendency
To explore the impacts of various types of attack and defense approaches on sentiment classification, we provide specific examples from the SST-2 dataset processed with TextCNN in
Figure 9 and list the corresponding sentiment scores in
Table 5. We use the Valence Aware Dictionary and sEntiment Reasoner (VADER) [
39] to evaluate sentiment as positive, negative, or neutral. As shown in
Figure 9, adversarial examples generated in all three types of attacks contain imperceptible modifications to the text, which affect predictions. Defensive operations precisely eliminate these disturbances while retaining the emotion and semantics of the text to the greatest extent possible. As shown in
Table 5, the adversarial examples at each attack level flip the output labels, and their VADER scores shift to the opposite polarity. Our defense strategy filters out the perturbations in the input text while preserving its emotional content.
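To illustrate how lexicon-based scoring of this kind buckets text into positive, negative, or neutral, here is a toy scorer in the spirit of VADER. The mini-lexicon and its weights are invented, though the ±0.05 thresholds mirror those VADER applies to its compound score.

```python
# Invented mini-lexicon; real VADER uses thousands of human-rated words plus
# rules for negation, intensifiers, punctuation, and capitalization.
LEXICON = {"great": 3.1, "good": 1.9, "boring": -1.3, "awful": -2.8}

def sentiment_label(text, pos_thresh=0.05, neg_thresh=-0.05):
    """Sum word valences and bucket the total, mirroring the thresholds
    VADER applies to its compound score."""
    score = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    if score >= pos_thresh:
        return "positive"
    if score <= neg_thresh:
        return "negative"
    return "neutral"

label = sentiment_label("a great movie with a good plot")
```

A homoglyph or misspelling attack defeats exactly this kind of lexicon lookup, since the perturbed token no longer matches any lexicon entry, which is why restoring the text before scoring recovers the sentiment.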