Article

Defending Against Backdoor Attacks in Federated Learning: A Triple-Phase Client-Side Approach

1 Industry Research Institute of Intelligent Systems, Longmen Laboratory, Luoyang 471000, China
2 School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China
3 School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 273; https://doi.org/10.3390/electronics15020273
Submission received: 27 November 2025 / Revised: 26 December 2025 / Accepted: 4 January 2026 / Published: 7 January 2026
(This article belongs to the Special Issue Empowering IoT with AI: AIoT for Smart and Autonomous Systems)

Abstract

Federated learning effectively addresses the issues of data privacy and communication overhead in traditional deep learning through distributed local training. However, its open architecture is seriously threatened by backdoor attacks, where malicious clients can implant triggers to control the global model. To address these issues, this paper proposes a novel three-stage defense mechanism based on local clients. First, through text readability analysis, each client’s local data is independently evaluated to construct a global scoring distribution model, and a dynamic threshold is used to precisely locate and remove suspicious samples with low readability. Second, frequency analysis and perturbation are performed on the remaining data to identify and disrupt triggers based on specific words while preserving the basic semantics of the text. Third, n-gram distribution analysis is employed to detect and remove samples containing abnormally high-frequency word sequences, which may correspond to complex backdoor attack patterns. Experimental results show that this method can effectively defend against various backdoor attacks with minimal impact on model accuracy, providing a new solution for the security of federated learning.

1. Introduction

In traditional deep learning model training, trainers typically need to centrally collect large amounts of raw data that are dispersed across various users or devices. Although this centralized training approach can effectively enhance model accuracy and generalization capability, it also poses several problems. First, the centralized transmission and storage of data result in substantial computational pressure and communication overhead, especially when dealing with large-scale, high-dimensional heterogeneous data, leading to a significant increase in training costs [1]. Second, centralized data processing poses a severe threat to user privacy [2,3,4,5,6]. For instance, users’ medical records, social information, and image data may face risks of leakage, misuse, or illegal analysis during the collection and transmission processes.
Federated learning (FL) [7,8,9], as an emerging distributed machine learning framework, offers a potential solution to the aforementioned issues. Its core concept involves distributing the model training tasks to various decentralized clients (e.g., personal devices, edge nodes), where each client trains the model using its local data and uploads the model parameters or gradients to a server for aggregation, thereby constructing a global model. This approach effectively circumvents the centralized transmission of raw data, enhancing model performance while safeguarding user privacy. Moreover, retaining data locally significantly reduces the consumption of communication bandwidth and the risk of data leakage. However, despite its strengths in protecting data privacy and improving resource efficiency, the security issues of FL cannot be overlooked, especially in open and heterogeneous environments where FL systems are vulnerable to backdoor attacks [10,11,12,13,14]. In such attacks, malicious clients deliberately upload model updates containing specific “triggers,” causing the global model to produce attacker-predefined outcomes when encountering particular inputs while maintaining performance on normal inputs, thereby making the attack behavior difficult to detect. These attacks not only compromise the reliability of the model but can also be used for special purposes such as manipulating classifications and evading detection, posing significant stealth and harm. In the context of FL, backdoor attacks are particularly challenging to defend against. On the one hand, the system cannot directly access the clients’ raw data and model weights, making it difficult to assess the authenticity and security of the uploaded models. On the other hand, attackers may exploit the heterogeneity of data distribution and the locality of model updates to devise sophisticated attack strategies that evade existing aggregation algorithms or anomaly detection mechanisms. Therefore, designing an efficient and practical text-oriented backdoor defense method for federated NLP has become a key issue in federated learning security research.
To address this challenge, this paper proposes a novel three-stage client-side defense mechanism that actively identifies and removes potential backdoor-contaminated data through text readability analysis, dynamic word frequency perturbation, and n-gram distribution analysis. The core of this method lies in the fact that backdoor attacks often employ “unnatural” language patterns, such as low readability text, abnormally high-frequency words, or specific trigger word combinations as triggers, which significantly deviate from the statistical distribution characteristics of natural language. Specifically, the method first performs data readability analysis, using text readability scores (such as Flesch Reading Ease, FRE) to independently evaluate each sample in the client’s local dataset and to construct a global scoring distribution model. By calculating a dynamic threshold, it precisely locates and removes suspicious samples with low readability that fall significantly below this threshold. These samples typically contain garbled characters, AI-generated templates, or stylistically distorted text planted by attackers. This stage effectively identifies language pattern anomalies based on statistical outlier detection. Secondly, for potential word-based triggers in the remaining data that are not captured by readability analysis (such as explicit high-frequency words or semantic trigger words), the method conducts word frequency analysis and perturbation. After preprocessing, it identifies abnormally high-frequency candidate words by calculating dynamic thresholds based on word frequency statistics. For texts containing these candidate words, a random perturbation strategy is applied (prioritizing synonym replacement, supplemented by random insertion/deletion) to disrupt the functionality of the triggers while maximizing the preservation of the text’s basic semantics and ensuring that the trigger mechanism fails. Thirdly, n-gram distribution analysis is employed to detect and remove samples containing abnormally high-frequency word sequences, which may correspond to complex backdoor attack patterns such as multi-word triggers, style-based triggers, or AI-generated synthetic templates. This stage further enhances the defense by identifying and removing samples with anomalous linguistic patterns. Experimental results demonstrate that this method effectively defends against various types of backdoor attacks, with minimal impact on model accuracy. The main contributions are summarized as follows:
  • We propose three defense operations against backdoor attacks aimed at identifying and removing potential malicious data from different perspectives for federated NLP/text-based backdoor attacks. First, through text readability analysis, we identify and remove samples with low readability that may contain garbled characters or stylistically distorted text planted by attackers. Second, we use word frequency analysis and perturbation techniques to identify and disrupt triggers based on specific words while preserving the basic semantics of the text. Finally, through n-gram distribution analysis, we detect and remove samples containing abnormally high-frequency word sequences that may correspond to complex backdoor attack patterns, such as multi-word triggers, style-based triggers, or AI-generated synthetic templates. These three operations together form a multi-layered defense system that can effectively counter various types of backdoor attacks.
  • We propose the combined use of the three defense operations mentioned above to achieve more comprehensive defense effects. By integrating text readability analysis, word frequency analysis, and n-gram distribution analysis, our method can simultaneously address both simple and complex backdoor attack patterns. This combined approach not only enhances the defense capability against single attack patterns but also significantly strengthens robustness against multi-pattern attacks.
  • Through extensive experiments on multiple datasets, we have validated the effectiveness of the proposed method. The experimental results show that our defense mechanism can significantly reduce the success rate of backdoor attacks while having a minimal impact on model accuracy. Each defense operation used individually can effectively reduce the attack success rate, and when the three operations are used in combination, the attack success rate is further reduced. These results fully demonstrate the efficiency and practicality of our method, providing a strong guarantee for the security of federated learning.

2. Related Works

2.1. Backdoor Attacks

In the area of backdoor attacks, numerous researchers have proposed various attack methods [10,11,15,16,17,18]. Xie et al. [19] introduced a distributed backdoor attack in which attackers poison different clients by selecting various local triggers, ultimately aggregating to form a global trigger for the backdoor attack. Bhagoji et al. [20] explicitly amplified malicious updates and modified the loss function to enable backdoor attacks even when the global model does not converge. Bagdasaryan et al. [21] trained a backdoor model similar to the global model to replace the latest global model, thereby influencing the final aggregation outcome. Fang et al. [22] selected the least important parameters of the model weights and flipped the weight signs by checking whether the signs at the trigger positions matched those of the corresponding weight positions, thereby achieving backdoor attacks. Zhang et al. [23] leveraged the sparse nature of gradients in Stochastic Gradient Descent (SGD) to target parameters with small changes during training, enhancing the persistence of backdoor attacks. Tolpegin et al. [24] utilized a malicious subset of federated learning participants to poison the global model by sending model updates derived from mislabeled data. Additionally, the study investigated the persistence of attacks in early and late training stages, the impact of the availability of malicious participants on the attacks, and the relationship between the two. Shi et al. [25] employed adversarial samples to undermine the accuracy of the trained model. Nuding et al. [26] altered specific training inputs used during the training phase by changing certain patterns, enabling malicious behavior to be triggered during the prediction phase. Nguyen et al. [27] proposed data poisoning attacks on Internet of Things (IoT) intrusion detection systems, allowing attackers to implant backdoors into the aggregated detection model to misclassify malicious traffic as benign.

2.2. Backdoor Defenses

In the realm of defending against backdoor attacks, researchers have proposed a variety of approaches [28,29,30]. Fung et al. [31] introduced FoolsGold, which inspects local model updates to identify and eliminate suspicious ones. Nguyen et al. [32] focused on detecting local model updates with significant backdoor impacts and employed techniques such as clipping, smoothing, and adding noise to mitigate residual backdoors. Miao et al. [33] proposed CND, a defense mechanism based on differential privacy that reduces injected noise by lowering the threshold of model updates, thereby defending against backdoor attacks while maintaining model performance. Hou et al. [34] utilized XAI (explainable artificial intelligence)-based model-training filters to detect whether clients contain triggers and adopted a fuzzy label inversion strategy to remove triggers from backdoored data. Xie et al. [35] proposed CRFL, which trains certifiably robust federated learning models to counteract backdoor attacks and employs parameter clipping and smoothing to control model smoothness. Aramoon et al. [36] introduced meta federated learning, which groups and aggregates local model updates, detects the aggregated results, and then aggregates the grouped results again to defend against backdoor attacks when using secure aggregation protocols. Uprety et al. [37] designed a reputation model that calculates the binomial Bayesian reputation scores of local clients and uses these scores to filter out attacker nodes. Shejwalkar et al. [38] proposed DnC, which first computes the principal components of the input update set (i.e., the direction of maximum variance), then calculates the dot product of the updates with the principal components (referred to as the projection), and finally removes a constant portion of the total updates with the largest projection. In federated learning, DnC enables spectral analysis of input updates through dimensionality reduction and ensures the effective detection of malicious updates. DeepSight [39] can detect and eliminate model clusters containing poisoned models with significant attack impacts. Zhang et al. [40] proposed detecting the temporal consistency of model gradient uploads to identify and reject malicious clients, but this defense directly discards the model updates from malicious clients, potentially reducing the accuracy of the aggregated model.
Despite their effectiveness, most existing defenses are server-centric and require direct access to individual client updates, making them incompatible with secure aggregation, where per-client updates are hidden from the server. Moreover, some approaches (e.g., [34]) rely on training additional backdoor models and require substantial backdoor data to build reliable filters, while rejection-based defenses may reduce the accuracy of the aggregated model by discarding updates from suspected clients.
In contrast, our method is designed as a purely client-side protection pipeline that operates before local training, and thus remains naturally compatible with secure aggregation. Rather than depending on server-side inspection or dropping entire client updates, we mitigate backdoors at the data source by sanitizing potentially poisoned text on-device. Specifically, we propose a lightweight three-stage procedure that targets complementary trigger characteristics: readability-based filtering to remove unnatural outliers, word-frequency-based perturbation to disrupt trigger-correlated tokens while preserving benign content, and n-gram distribution analysis to capture abnormal multi-word patterns. This modular and explainable design does not require auxiliary backdoor models or large trigger corpora, and provides a practical defense layer for privacy-preserving federated NLP.

3. Methods

Before detailing the proposed three-stage client-side pipeline, we first clarify the federated learning workflow and the threat model we assume. As illustrated in Figure 1, the server only receives model updates from clients, while adversaries can poison a subset of clients by injecting trigger-bearing text into their local training data. This motivates placing our defense directly on the client side prior to computing local updates.

3.1. Threat Model

We consider a standard federated NLP setting where a central server coordinates multiple clients and aggregates their model updates, while raw text data remains local on clients. A small fraction of clients may be malicious and attempt to implant a backdoor by injecting poisoned text into their local training set, with the goal of causing targeted mispredictions when a trigger is present, while maintaining benign accuracy on clean inputs. The attacker can craft poisoned samples on compromised clients and may use triggers ranging from explicit token or phrase insertions to more complex patterns such as style or template-based modifications. The overview of our method is illustrated in Figure 2.

3.2. Data Readability Analysis

Text readability scores are employed as a key metric for detecting potential backdoor triggers, aiming to identify suspicious samples that contain unnatural language patterns (e.g., deliberately inserted nonsensical trigger words, AI-generated synthetic templates, or stylistically anomalous text). We emphasize that our three-stage pipeline is tailored to textual data and is not intended as a modality-agnostic defense for vision or tabular FL. Specifically, we utilize the Flesch Reading Ease (FRE) algorithm from the textstat library to independently score each text sample. This algorithm assesses text readability by quantifying sentence length and syllable complexity, with lower scores indicating higher reading difficulty. The calculation formula is as follows:
FRE = 206.835 − 1.015 × (N_word / N_sentence) − 84.6 × (N_syllable / N_word).
In practice, we traverse all text samples in the client’s local dataset, generating a separate readability score for each sample rather than calculating an overall average. This approach allows us to precisely locate individual anomalies instead of blurring overall trends. We then identify statistical outliers by establishing a readability score distribution model: we first calculate the mean and standard deviation of readability scores for the entire local dataset, then set a dynamic threshold (e.g., the mean minus two times the standard deviation), flagging samples below this threshold as “suspicious.” These low-readability samples often conceal triggers planted by attackers, such as garbled characters in BadNet [15], AI-generated templates in SynBkd, or stylistically distorted text in StyleBkd [41]. Experimental data show that the FRE scores of normal datasets typically remain within a reasonable range, while backdoored samples significantly lower the scores. This statistical distribution-based analysis method effectively distinguishes natural text from maliciously constructed content, providing a quantitative basis for subsequent defense decisions and avoiding misjudgments due to domain-specific characteristics.
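As a minimal sketch of this readability-based filtering stage, the following Python snippet assumes the textstat package and a list of (text, label) pairs; the function name, toy usage, and default k = 2 are illustrative rather than the exact implementation.

```python
# Stage 1 sketch: flag samples whose Flesch Reading Ease score falls below
# a dynamic threshold (mean - k * std) computed over the local dataset.
# Assumes the textstat package; function name and defaults are illustrative.
import statistics
import textstat

def filter_low_readability(samples, k=2.0):
    """Split (text, label) pairs into kept and flagged subsets by FRE score."""
    scores = [textstat.flesch_reading_ease(text) for text, _ in samples]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    threshold = mu - k * sigma  # dynamic threshold: mean minus k standard deviations
    kept, flagged = [], []
    for sample, score in zip(samples, scores):
        (kept if score >= threshold else flagged).append(sample)
    return kept, flagged

# Example (hypothetical local data):
# clean_data, suspicious = filter_low_readability(local_text_label_pairs)
```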

3.3. Word Frequency Analysis and Perturbation

The primary objective of word frequency analysis is to identify abnormally high-frequency words in local client data that may serve as backdoor triggers and to disrupt their triggering functions through random perturbation techniques. Specifically, we first preprocess the text: using regular expressions to extract valid words and convert them to lowercase, while filtering out 150+ common stop words (such as articles, prepositions, and auxiliary verbs) to avoid interference. Next, we calculate the frequency distribution of the remaining words and set a dynamic threshold based on statistical principles (mean plus k times the standard deviation) to identify abnormally high-frequency words. Finally, we apply random perturbation to texts containing these candidate words: with a 60% probability, we perturb each high-frequency word by replacing it with a synonym (the preferred option), randomly inserting an unrelated word, or randomly deleting it. By combining these perturbation methods, we deactivate the trigger mechanism. The entire process preserves the basic semantic structure of the text, making only local modifications to high-frequency words.
This method effectively counters various types of trigger-word attacks. For explicit high-frequency trigger words (e.g., “cf”), it directly disrupts their functionality. For semantic trigger words (e.g., “view”), it undermines their specific meanings through synonym replacement. Meanwhile, it increases the difficulty for attackers to predict by employing random perturbation strategies. The dynamic threshold ensures adaptability to the characteristics of different client data (for example, “system” may be a normally high-frequency word in technical documents). The synonym replacement priority strategy (accounting for 70% of perturbations) maximizes semantic coherence, while random insertion/deletion serves as a supplementary means to enhance defense robustness. The sanitized text output can be directly used for local training in federated learning, achieving the goal of defending against single-trigger-word backdoor attacks at the source while keeping semantic changes within acceptable limits (parameters k and perturbation probability can be adjusted to control the strength of defense).
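A minimal sketch of this stage is given below, assuming English text; the abbreviated stop-word list, the tiny synonym table, the filler word, and the parameter defaults (k = 2, 60% perturbation probability) are illustrative placeholders for the fuller resources and settings described above.

```python
# Stage 2 sketch: detect abnormally frequent words (mean + k * std) and
# randomly perturb their occurrences (synonym replacement preferred,
# otherwise insertion of a filler word or deletion). Resources are placeholders.
import random
import re
import statistics
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "it", "this"}
SYNONYMS = {"view": ["opinion", "perspective"], "movie": ["film", "picture"]}  # placeholder table

def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def find_high_frequency_words(texts, k=2.0):
    counts = Counter(w for t in texts for w in tokenize(t))
    freqs = list(counts.values())
    if not freqs:
        return set()
    threshold = statistics.mean(freqs) + k * statistics.pstdev(freqs)
    return {w for w, c in counts.items() if c > threshold}

def perturb(text, suspicious, p=0.6, rng=random):
    """Perturb occurrences of suspicious words while leaving other tokens intact."""
    out = []
    for word in text.split():
        bare = re.sub(r"[^a-z]", "", word.lower())
        if bare in suspicious and rng.random() < p:
            choice = rng.choice(["synonym", "insert", "delete"])
            if choice == "synonym" and bare in SYNONYMS:
                out.append(rng.choice(SYNONYMS[bare]))
            elif choice == "insert":
                out.extend([word, "indeed"])  # keep the word and insert a filler token
            # "delete" (and missing synonyms) drop the token entirely
        else:
            out.append(word)
    return " ".join(out)
```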

3.4. N-Gram Distribution Analysis

The core objective of n-gram distribution analysis is to detect abnormally high-frequency word sequence combinations within a client’s local data. These combinations may correspond to complex backdoor attack patterns, such as multi-word triggers (e.g., “view this”), style-based triggers relying on fixed sentence patterns (e.g., “kindly review”), or AI-generated synthetic templates. The method is implemented as follows: (1) Text Preprocessing: The text is preprocessed using regular expressions to extract valid words and convert them to lowercase. Common stop words are filtered out to eliminate meaningless sequences (e.g., “of the”). (2) N-gram Extraction and Frequency Statistics: All contiguous word sequences of a specified length n (n-grams) are extracted, and their frequency distribution across the dataset is calculated. (3) Anomaly Detection: Based on the statistical properties (mean and standard deviation) of the frequency distribution, a dynamic threshold (mean + k × standard deviation) is set to identify abnormally high-frequency n-grams. (4) Sample Filtering: Samples are ranked according to the number of abnormal n-grams they contain. The most suspicious samples are then removed, subject to a predefined maximum removal ratio (default ≤ 10%), achieving data sanitization.
This method effectively defends against various complex backdoor attacks: (1) Defense against multi-word triggers: Even if individual word frequencies are normal, their specific combination exhibits an abnormally high frequency. (2) Defense against style attacks: It captures specific syntactic patterns. (3) Defense against synthetic templates: It detects recurring AI-generated structures. The dynamic threshold ensures adaptability to the data characteristics of different clients (e.g., “error handling” may normally be frequent in technical documentation). The maximum removal ratio limit (max_remove_ratio) prevents excessive loss of valid data due to over-sanitization. By removing samples containing anomalous linguistic patterns outright, rather than modifying their content, the method thoroughly disrupts trigger functionality while preserving the integrity and semantic consistency of the remaining data to the maximum extent. The parameters n (n-gram length) and k (standard deviation multiplier) are adjustable based on the characteristics of the attack (e.g., increasing n detects longer patterns, decreasing k enhances detection sensitivity), forming an adaptive defense mechanism.
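The procedure enumerated above can be sketched as follows; the simplified tokenizer, stop-word list, and parameter defaults (n = 2, k = 2, max_remove_ratio = 0.1) are assumptions for illustration rather than the exact implementation.

```python
# Stage 3 sketch: count contiguous n-grams over the local (text, label) pairs,
# mark n-grams above a dynamic threshold as abnormal, and drop the samples
# containing the most abnormal n-grams, capped by max_remove_ratio.
import re
import statistics
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "it"}

def ngrams(text, n=2):
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def remove_anomalous_samples(samples, n=2, k=2.0, max_remove_ratio=0.1):
    counts = Counter(g for text, _ in samples for g in ngrams(text, n))
    freqs = list(counts.values())
    if not freqs:
        return list(samples)
    threshold = statistics.mean(freqs) + k * statistics.pstdev(freqs)
    abnormal = {g for g, c in counts.items() if c > threshold}
    # Rank samples by how many abnormal n-grams they contain.
    scored = sorted(
        enumerate(samples),
        key=lambda item: sum(g in abnormal for g in ngrams(item[1][0], n)),
        reverse=True,
    )
    budget = int(len(samples) * max_remove_ratio)  # predefined removal cap
    to_drop = {
        idx for idx, (text, _) in scored[:budget]
        if any(g in abnormal for g in ngrams(text, n))
    }
    return [s for i, s in enumerate(samples) if i not in to_drop]
```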
Notably, our client-side sanitization is based on distributional cues in text, including readability or fluency, abnormal token-frequency spikes, and repeated n-gram patterns, rather than content- or identity-based rules. We further limit potential utility and fairness impacts by using dynamic thresholds and explicit maximum removal or perturbation ratios, so the defense does not over-prune local data even when a client’s writing style is atypical. In addition, thresholds can be computed per client, and if needed per domain, to better accommodate natural heterogeneity in federated learning and reduce the risk of disproportionately affecting legitimate but uncommon dialects or highly colloquial text.

3.5. Runtime and Memory Complexity

Our defense is a purely client-side, three-stage preprocessing pipeline that (i) scores each text sample with a readability metric, (ii) performs word-frequency-based detection and local perturbation (with synonym replacement as the preferred option), and (iii) extracts n-grams to identify abnormally high-frequency sequences and remove the most suspicious samples under a maximum removal ratio. In practice, we traverse the client’s local dataset and process each sample independently before local model updates are computed.
Let N be the number of local samples on a client and L the average number of tokens/words per sample. Stage 1 (readability scoring): computing FRE-style readability scores requires a single pass to count sentences/words/syllables, yielding O(NL) time and O(1) extra memory (besides running statistics such as the mean/std used for thresholding). Stage 2 (word-frequency analysis and perturbation): word counting over the local corpus is O(NL) time and O(V) memory, where V is the effective vocabulary size after preprocessing (e.g., stop-word removal); perturbations (synonym replacement/random insertion/deletion) are local edits on matched samples and do not change the asymptotic cost, which is dominated by corpus scanning. Stage 3 (n-gram distribution analysis): extracting contiguous n-grams (with constant n) is O(NL) time; storing all observed n-grams can take O(G) memory (where G is the number of distinct n-grams), and in deployment memory can be bounded by keeping only the top-K frequent n-grams or those above a threshold, resulting in O(K) memory. Overall, the pipeline is linear in the local text size (O(NL)) and incurs bounded additional memory for lightweight counting structures, making it suitable as an on-device/edge sanitization step prior to federated optimization.
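As one way to realize the bounded-memory counting mentioned above, the sketch below uses a Misra-Gries (heavy-hitters) summary that keeps at most K candidate n-grams at any time; this algorithm is an illustrative choice, not necessarily the mechanism used in a given deployment.

```python
# Bounded-memory sketch: approximate the most frequent items of a stream
# with at most k-1 counters (Misra-Gries). Counts are underestimates, but any
# item whose true frequency exceeds N/k is guaranteed to survive in the summary.
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters and evict those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```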

4. Experiments

This work focuses on federated NLP/text classification and proposes a client-side defense that leverages linguistic/statistical properties of text (readability, word-frequency patterns, and n-gram distributions). Accordingly, our evaluation is restricted to text-based datasets and textual backdoor threats. Specifically, we consider representative trigger-style and style/template-related backdoor attacks in NLP (e.g., word/phrase insertion and style-based triggers) under the standard FL setting where a subset of clients may be malicious and attempt to implant triggers via poisoned local data.

4.1. Experimental Setup

4.1.1. Datasets

This study conducts experiments on multiple binary classification datasets to validate the effectiveness and robustness of the proposed method across different scenarios. These datasets include SST-2 [42], Sentiment140 [43], Review [44], Yelp [45], and Food [46]. The SST-2 dataset comprises 6920 training samples, 872 validation samples, and 1821 testing samples, designed to categorize text into positive and negative sentiments. The Sentiment140 dataset includes 4550 training samples, 650 validation samples, and 1300 testing samples. The Review dataset contains 4549 training samples, 649 validation samples, and 1301 testing samples. The Yelp dataset contains 4550 training samples, 650 validation samples, and 1300 testing samples. The Food dataset contains 4550 training samples, 650 validation samples, and 1300 testing samples. Detailed information on dataset splits and categories, along with the target labels for backdoor attacks, is provided in Table 1. These datasets cover a variety of text types and sentiment classification tasks, offering a rich testing environment for studying backdoor attacks and defense mechanisms.

4.1.2. Attacks

To verify the efficacy of our defense method, we utilize several well-known backdoor attack methods that epitomize common techniques employed to compromise federated learning systems. These methods aim to manipulate data by injecting specific patterns or triggers, thereby causing the model to misclassify input data during inference. Specifically, we employ the following attack methods: (1) BadNet [15], which randomly selects trigger words from a set of rare words (such as “cf,” “mn,” “bb,” “tq”) and inserts them between any tokens in the data. The goal is to create a pattern that, when detected by the model, will trigger a specific classification output. This method exploits the model’s sensitivity to rare and unusual word combinations, making it a challenging attack to detect. (2) AddSent [11], which inserts a fixed phrase, such as “I watch this 3D movie,” between any tokens in the data input. The idea is to introduce a consistent and recognizable pattern that the model will associate with a particular class. This method is particularly effective in scenarios where the model is trained on text data, as it can easily blend into the natural language structure. (3) Style [41], which transforms the data input into a specific stylistic form, such as a biblical style. By altering the stylistic elements of the text, the attack aims to exploit the model’s tendency to classify based on stylistic cues. This method is especially insidious because it is difficult to distinguish from legitimate stylistic variations in the data. The backdoor data samples created by each of these attack methods are shown in Table 2.

4.1.3. Implementation Details

In the experiments, the total number of clients, denoted as N, is set to 10, with n = 4 clients randomly selected to participate in each training round. The proportion of poisoned samples is denoted by ζ, and the total number of communication rounds by T. The number of local training rounds for clean clients is set to E_c, with a local learning rate of LR_L; for backdoor attacks, the number of training rounds is E_p with a local learning rate of LR_P. A Dirichlet distribution is used to generate imbalanced client data, with the degree of imbalance controlled by τ.
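For illustration, the sketch below shows one common way to generate the Dirichlet-based imbalanced partition described above, assuming numpy and a flat label array; the function name and parameter defaults are placeholders rather than the exact experimental configuration.

```python
# Sketch of a Dirichlet-based non-IID split: for each class, per-client
# proportions are drawn from Dir(tau); smaller tau yields a more imbalanced
# (heterogeneous) partition across clients.
import numpy as np

def dirichlet_partition(labels, num_clients=10, tau=0.5, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(np.full(num_clients, tau))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```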

4.1.4. Metrics

We use the attack success rate (ASR) to measure the effectiveness of the attack. ASR is defined as the proportion of poisoned data containing triggers that are predicted as the target category during the inference phase. A higher ASR indicates a more successful and threatening attack. Additionally, we assess the usability of the poisoned model using accuracy (ACC), which represents the proportion of correct predictions on clean data samples. A higher ACC suggests that the attack is more covert and less likely to be detected.
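The two metrics can be summarized by the following minimal sketch, assuming a generic model.predict interface and an insert_trigger helper that reproduces the attacker’s trigger; both names are hypothetical placeholders rather than parts of the actual evaluation code.

```python
# ACC: fraction of clean samples predicted correctly.
# ASR: fraction of trigger-bearing samples predicted as the attacker's target label.
def accuracy(model, clean_texts, clean_labels):
    preds = model.predict(clean_texts)
    return sum(p == y for p, y in zip(preds, clean_labels)) / len(clean_labels)

def attack_success_rate(model, clean_texts, target_label, insert_trigger):
    triggered = [insert_trigger(t) for t in clean_texts]  # insert_trigger is a placeholder
    preds = model.predict(triggered)
    return sum(p == target_label for p in preds) / len(preds)
```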

4.2. Effect of Data Readability Analysis

We assess the effectiveness of using data readability analysis to remove backdoor data on the SST-2, Sentiment140, Review, Yelp, and AG News datasets. The backdoor attack methods used are BadNet, AddSent, and Style. The experimental results are shown in Table 3, where WOD denotes the attack effectiveness before the defense, and WD denotes the effectiveness after the defense.
For BadNet, the defense achieves significant results, reducing attack effectiveness by over 80% on the Review, Yelp, and AG News datasets, rendering the attack nearly ineffective. This substantial reduction is attributed to the nature of BadNet attacks, which often introduce complex and unnatural word combinations that significantly lower the readability of the text. The data readability analysis effectively identifies these anomalies and removes the corresponding samples, thereby mitigating the attack’s impact. For AddSent, the defense reduces attack effectiveness by over 85% on the Review, Yelp, and AG News datasets, also rendering the attack ineffective. AddSent attacks typically insert fixed phrases into the text, which can disrupt the natural flow and readability of the sentences. The data readability analysis is particularly effective in detecting these inserted phrases, as they often result in abrupt changes in sentence structure and coherence. By removing samples with such disruptions, the defense significantly reduces the attack’s success rate. For Style, in contrast, the gains are more modest (e.g., a ≥15% ASR reduction across datasets). This is expected because Style attacks typically apply a global stylization/paraphrasing transformation rather than inserting a distinct trigger word or phrase. Such transformations can preserve grammaticality and semantic coherence, and therefore may not create strong low-readability outliers. Consequently, a readability-only filter captures only the subset of stylized samples whose transformation noticeably deviates from the benign readability distribution, leading to consistent but smaller ASR drops.
Additionally, model accuracy does not significantly decrease after the defense is applied and may even increase in some cases. This is because the data readability analysis not only removes malicious samples but also filters out some noisy or low-quality data that could negatively impact model training. By improving the overall quality of the training data, the defense indirectly enhances model performance. In summary, these results demonstrate that data readability analysis can effectively defend against backdoor attacks without significantly affecting model accuracy. The analysis leverages the inherent linguistic properties of the text to identify and remove anomalies introduced by backdoor attacks, thereby providing a robust defense mechanism for federated learning systems.

4.3. Effect of Word Frequency Analysis

We assess the effectiveness of using word frequency analysis to remove backdoor data on the SST-2, Sentiment140, Review, Yelp, and AG News datasets. The backdoor attack methods used are BadNet, AddSent, and Style. The experimental results are shown in Table 4.
For BadNet attacks, the word frequency analysis defense achieves significant results, reducing attack success rates by over 80% on the SST-2, Review, Yelp, and AG News datasets, effectively neutralizing the attack. Although the defense is less pronounced on the Sentiment140 dataset, it still manages to reduce the attack success rate by over 20%. For AddSent attacks, the defense reduces attack success rates by over 75% on the Review, Yelp, and AG News datasets. Particularly on the Yelp and AG News datasets, the attack success rate drops to approximately 10%, rendering the attack almost entirely ineffective. For Style attacks, the gains are again smaller than those on BadNet/AddSent. A key reason is that stylization does not necessarily introduce a small set of highly frequent trigger words; instead, it tends to redistribute lexical choices across many tokens via paraphrasing or stylistic rewording. This makes the “abnormally frequent word” signal less concentrated and reduces the effectiveness of a frequency-only criterion, which explains the limited ASR drops for Style in Table 4.
Moreover, model accuracy does not significantly decrease after the defense is applied and even improves in some cases. This indicates that word frequency analysis not only effectively defends against backdoor attacks but also enhances model robustness while maintaining model performance. These experimental results fully demonstrate that word frequency analysis serves as a practical and effective defense mechanism, significantly reducing the risk of backdoor attacks without adversely affecting model accuracy.

4.4. Effect of N-Gram Distribution Analysis

We assess the effectiveness of using n-gram distribution analysis to remove backdoor data on the SST-2, Sentiment140, Review, Yelp, and AG News datasets. The backdoor attack methods used are BadNet, AddSent, and Style. The experimental results are shown in Table 5.
For BadNet attacks, n-gram distribution analysis achieves significant results, reducing attack success rates by over 80% on the SST-2, Review, and AG News datasets, and even reaching 90% in some cases. Although the results on the Yelp dataset are less satisfactory, the method’s performance on other datasets still demonstrates its robust defensive capabilities. For AddSent attacks, n-gram distribution analysis reduces attack success rates by over 80% on the SST-2, Yelp, and AG News datasets, and by approximately 65% on the Sentiment140 dataset. Despite the less significant effect on the Review dataset, the method’s performance on other datasets still indicates its strong defensive capabilities. For Style, n-gram-only mitigation is minimal on several datasets. This is expected: many style transformations are paraphrastic and produce diverse surface forms rather than repeating a single rigid phrase template. As a result, there are typically few highly repeated n-grams that can be confidently identified as anomalous, making a purely frequency-based n-gram filter less effective against such stealthy, distributed modifications.
Moreover, model accuracy does not significantly decrease after the defense is applied and even improves in some cases. This suggests that n-gram distribution analysis not only effectively defends against backdoor attacks but also enhances model robustness while maintaining model performance. These experimental results fully demonstrate that using n-gram distribution analysis can effectively defend against existing backdoor attacks without significantly affecting model accuracy. By identifying and removing samples containing abnormally high-frequency word sequences, this defense method significantly reduces the risk of backdoor attacks, providing robust security for federated learning systems.

4.5. Combined Effect of Data Readability and Word Frequency Analysis

We assess the effectiveness of combining data readability analysis and word frequency analysis to remove backdoor data on the SST-2, Sentiment140, Review, Yelp, and AG News datasets. The backdoor attack methods used are BadNet, AddSent, and Style. The experimental results are shown in Table 6.
For BadNet attacks, the combined defense method achieves significant results, reducing attack success rates by over 70% on the Review, Yelp, and AG News datasets. Although the defense is less pronounced on the SST-2 and Sentiment140 datasets, it still manages to reduce the attack success rates by approximately 50% and 15%, respectively. For AddSent attacks, the combined method reduces attack success rates by about 90% on the Review and AG News datasets, by over 60% on the SST-2 dataset, and by over 40% on the Sentiment140 and Yelp datasets. This indicates that the method has a strong resistance to AddSent attacks, especially on the Review and AG News datasets, where the attack impact is almost entirely neutralized. For Style attacks, the combined defense method effectively reduces attack success rates by over 20% across all five datasets. Particularly on the Review and Yelp datasets, the attack success rates are reduced by over 35%, demonstrating the method’s strong resistance to Style attacks.
It is worth noting that using either data readability analysis or word frequency analysis alone does not yield satisfactory results, especially against Style attacks, where the performance of a single module is poor. However, when these two analysis methods are combined, the defense effectiveness is significantly enhanced, particularly in combating Style attacks, as the combined method can more comprehensively identify and remove potential backdoor data. Moreover, model accuracy does not significantly decrease after the defense is applied and even improves in some cases. This suggests that the combined defense method not only effectively defends against backdoor attacks but also enhances model robustness while maintaining model performance. These experimental results fully demonstrate that combining data readability analysis and word frequency analysis can effectively defend against existing backdoor attacks without significantly affecting model accuracy. This combined method, through its multi-layered defense mechanism, significantly reduces the risk of backdoor attacks, providing robust security for federated learning systems.

4.6. Combined Effect of Data Readability and N-Gram Distribution Analysis

We assess the effectiveness of combining data readability analysis and n-gram distribution analysis to remove backdoor data on the SST-2, Sentiment140, Review, Yelp, and AG News datasets. The backdoor attack methods used are BadNet, AddSent, and Style. The experimental results are shown in Table 7.
For BadNet attacks, the combined defense method significantly reduces attack success rates by over 65% across all datasets. Particularly on the Review, Yelp, and AG News datasets, the attack success rates are reduced to around 10%, demonstrating the method’s strong resistance to BadNet attacks. This significant defense effectiveness is attributed to the combined method’s ability to simultaneously identify low-readability texts and abnormal n-gram sequences, thereby more comprehensively removing potential backdoor data. For AddSent attacks, the combined method reduces attack success rates by over 70% across all five datasets. On the Review, Yelp, and AG News datasets, the attack success rates are reduced by around 90%, with the backdoor attack success rates remaining below 10%, rendering the attack almost entirely ineffective. This indicates that the method has a strong resistance to AddSent attacks, especially on the Review, Yelp, and AG News datasets, where the attack impact is almost entirely neutralized. The significant defense effectiveness is attributed to the combined method’s ability to effectively identify and remove samples containing specific trigger phrases, thereby significantly reducing the attack success rates. For Style attacks, the combined defense method effectively reduces attack success rates by over 15% across all five datasets. On the SST-2 and Yelp datasets, the attack success rates are reduced by over 25%. Although n-gram distribution analysis alone performs poorly against Style attacks, only achieving minor reductions in attack success rates, the combination with data readability analysis significantly enhances the defense capability. This demonstrates that data readability analysis plays a crucial role in identifying and removing stylistic changes introduced by Style attacks, thereby significantly strengthening the combined method’s defense capability.
Moreover, the combined use of data readability analysis and n-gram distribution analysis outperforms the use of either module alone, especially against Style attacks. This indicates that a multi-layered defense mechanism can more comprehensively identify and remove potential backdoor data, thereby significantly reducing the risk of backdoor attacks. These experimental results fully demonstrate that combining data readability analysis and n-gram distribution analysis can effectively defend against existing backdoor attacks without significantly affecting model accuracy. By identifying and removing samples containing abnormal language patterns, this combined method significantly enhances the security of federated learning systems.

4.7. Combined Effect of Word Frequency and N-Gram Distribution Analysis

We assess the effectiveness of combining word frequency analysis and n-gram distribution analysis to remove backdoor data on the SST-2, Sentiment140, Review, Yelp, and AG News datasets. The backdoor attack methods used are BadNet, AddSent, and Style. The experimental results are shown in Table 8.
For BadNet attacks, the combined defense method significantly reduces attack success rates by over 70% across all datasets. Particularly on the Yelp dataset, the attack success rate is reduced by 93%, and on the Review and Yelp datasets, the attack success rate is lowered to below 10%. This significant defense effectiveness is attributed to the combined method’s ability to simultaneously identify abnormal high-frequency words and n-gram sequences, thereby more comprehensively removing potential backdoor data. For AddSent attacks, the combined defense method achieves an attack success rate reduction of over 80% on the Review, Yelp, and AG News datasets, successfully lowering the impact of backdoor attacks to below 20%, making it difficult for the attack to significantly affect the model. This indicates that the method has a strong resistance to AddSent attacks, especially on the Review, Yelp, and AG News datasets, where the attack impact is almost entirely neutralized. The significant defense effectiveness is attributed to the combined method’s ability to effectively identify and remove samples containing specific trigger phrases, thereby significantly reducing the attack success rates. For Style attacks, the combined defense method effectively reduces attack success rates by over 10% across multiple datasets. On the Review and Yelp datasets, the attack success rates are reduced by approximately 25%. Despite the subtle stylistic changes introduced by Style attacks, the combined method still effectively identifies and removes these abnormal patterns, thereby significantly reducing the attack success rates.
The combined use of word frequency analysis and n-gram distribution analysis significantly outperforms the use of either strategy alone, especially against BadNet attacks, where the combined defense method’s effectiveness far exceeds that of a single method, achieving excellent performance across all datasets. This indicates that a multi-layered defense mechanism can more comprehensively identify and remove potential backdoor data, thereby significantly reducing the risk of backdoor attacks. These experimental results fully demonstrate that combining word frequency analysis and n-gram distribution analysis can effectively defend against existing backdoor attacks without significantly affecting model accuracy. By identifying and removing samples containing abnormal language patterns, this combined method significantly enhances the security of federated learning systems.

4.8. Combined Effect of Data Readability, Word Frequency and N-Gram Distribution Analysis

We assess the effectiveness of combining data readability analysis, word frequency analysis, and n-gram distribution analysis to remove backdoor data on the SST-2, Sentiment140, Review, Yelp, and AG News datasets. The backdoor attack methods used are BadNet, AddSent, and Style. The experimental results are shown in Table 9.
For BadNet attacks, the combined defense method significantly reduces attack success rates by over 70% across all datasets. Particularly on the SST-2 and Yelp datasets, the attack success rates are reduced by over 90%, and on multiple datasets, the backdoor attack success rates are lowered to below 10%. This significant defense effectiveness is attributed to the combined method’s ability to simultaneously identify low-readability texts, abnormal high-frequency words, and n-gram sequences, thereby more comprehensively removing potential backdoor data. For AddSent attacks, the combined defense method reduces attack success rates by over 50% across all datasets. On the Review, Yelp, and AG News datasets, the attack success rates are reduced by over 90%, almost entirely neutralizing the attack. This indicates that the method has a strong resistance to AddSent attacks, especially on the Review, Yelp, and AG News datasets, where the attack is almost entirely ineffective. The significant defense effectiveness is attributed to the combined method’s ability to effectively identify and remove samples containing specific trigger phrases, thereby significantly reducing the attack success rates. For Style attacks, the combined defense method effectively reduces attack success rates by over 30% across multiple datasets. On the Review and Yelp datasets, the attack success rates are reduced by 40%. Although the method’s performance against Style attacks is less pronounced than against BadNet and AddSent attacks, it still demonstrates significant advantages in defending against backdoor attacks. The subtle stylistic changes introduced by Style attacks are difficult to detect with a single analysis method, but the combined method, through its multi-layered defense mechanism, can more comprehensively identify and remove potential backdoor data.
Compared to using a single module or combining two modules, the combination of all three modules significantly enhances the method’s defense capabilities. Regardless of the attack strategy faced, the method achieves excellent performance. These experimental results fully demonstrate that combining data readability analysis, word frequency analysis, and n-gram distribution analysis can effectively defend against existing backdoor attacks without significantly affecting model accuracy. This combined method, through its multi-layered defense mechanism, significantly reduces the risk of backdoor attacks, providing robust security for federated learning systems.

5. Case Studies

To further demonstrate practicality, we describe five real-world federated NLP scenarios where the proposed client-side three-stage pipeline (readability filtering, word-frequency-based perturbation, and n-gram distribution analysis) can be immediately incorporated to mitigate text-based backdoor threats without exposing raw user data.
E-commerce review sentiment and spam moderation. Platforms often train review-quality or sentiment classifiers from user-generated reviews. Attackers may embed covert trigger phrases in reviews to manipulate downstream moderation or recommendations. Deploying our pipeline on the client (or at the edge) can sanitize anomalous review segments and suppress trigger-correlated n-gram patterns before aggregation.
Social-media toxicity detection. Federated moderation models for toxicity or hate-speech detection are vulnerable to stealthy triggers that flip predictions for targeted phrases. The proposed pipeline can be integrated into the local training loop to reduce the impact of adversarially crafted text while retaining privacy, especially in noisy, short-text environments.
Customer-service intent classification and ticket routing. Enterprises frequently train intent classifiers on user support conversations. Backdoors can cause misrouting when specific trigger words appear. Our defense can be applied locally on employee devices or edge gateways to detect unusual readability/frequency statistics and block trigger-like distributions before model updates are sent.
Privacy-sensitive healthcare text analytics. Federated models can be used for symptom/triage classification on sensitive text inputs where centralizing data is infeasible. The client-side pipeline provides an additional safeguard against poisoned contributions by sanitizing abnormal token statistics and n-gram signatures prior to training, complementing privacy-preserving aggregation.
Mobile keyboard/on-device text prediction via FL. Many mobile input methods and next-word prediction models are trained with federated learning from user typing logs. Our pipeline can run locally before client updates are computed, filtering suspicious samples and perturbing abnormal frequency patterns to reduce the chance that a poisoned client injects trigger phrases into the shared model.
Although our readability heuristic is described in an English-oriented form, the pipeline can be adapted to multilingual or noisy social-media text by adjusting the fluency proxy and analysis granularity while keeping the same client-side workflow. For instance, for a code-mixed post such as “This movie is lit yaar!!!”, Stage 1 can rely on lightweight cues like token or character length and punctuation consistency, Stage 2 can operate on subword tokens to detect abnormal-frequency units, and Stage 3 can use character-level n-grams to capture repeated trigger-like patterns when word boundaries are unreliable. More broadly, although our experiments focus on federated text classification, the three-stage pipeline is largely task-agnostic within NLP because it is a client-side data sanitization step independent of model architecture or loss. The same preprocessing can be applied to other federated NLP applications by sanitizing the relevant textual fields before local training, such as the question and context in federated QA, user utterances and dialogue history in federated dialogue, and the input text in sequence labeling, where perturbations should be applied conservatively to avoid disrupting label alignment. Systematic evaluation of these additional tasks is left for future work.
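As a small illustration of the character-level variant mentioned above, the sketch below counts character n-grams (with n = 4 chosen arbitrarily) for a code-mixed post; the same mean-plus-k-standard-deviations thresholding from Stage 3 could then be applied to these counts when word boundaries are unreliable.

```python
# Character-level n-gram counting for noisy or code-mixed text; the example
# post and n = 4 are illustrative assumptions.
from collections import Counter

def char_ngrams(text, n=4):
    text = " ".join(text.lower().split())  # normalize whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]

post = "This movie is lit yaar!!!"
counts = Counter(char_ngrams(post))
# Downstream, apply the Stage 3 thresholding (mean + k * std) to these counts
# instead of word-level n-gram counts.
```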
Our pipeline is a client-side, data-level sanitization step applied before computing local updates, and is therefore complementary to server-side update-level defenses such as robust aggregation or anomaly detection. Reducing backdoor signals in local text data can make malicious updates less effective and potentially ease the burden on server-side defenses. In practice, we recommend conservative removal and perturbation ratios to avoid unnecessary distribution shifts, and view a systematic study of combined client-server defenses as future work. We mainly target practical mobile/edge and enterprise FL deployments where clients (or edge gateways) have moderate CPU/memory budgets to maintain lightweight statistics (e.g., word/n-gram counts). Extremely resource-constrained devices (e.g., ultra-low-power IoT nodes with very limited memory/compute and strict streaming constraints) are not explicitly optimized or evaluated in this work, and we leave dedicated optimizations (e.g., more aggressive approximate counting or partial-stage execution) as future work.

6. Conclusions

FL’s open architecture faces severe challenges from stealthy backdoor attacks, where malicious clients exploit data heterogeneity and limited server oversight to implant hard-to-detect triggers. To counter this, this paper proposes a client-side defense mechanism leveraging intrinsic linguistic properties. The core approach involves a three-phase local data sanitization process: first, identifying and removing low-readability samples as statistical outliers using a dynamic global threshold; second, detecting and perturbing abnormally frequent words in the remaining data to disrupt potential semantic triggers while preserving core meaning; third, employing n-gram distribution analysis to detect and remove samples containing abnormally high-frequency word sequences that may correspond to complex backdoor attack patterns, such as multi-word triggers, style-based triggers, or AI-generated synthetic templates. Experiments confirm the method’s efficacy: it significantly reduces the success rates of diverse backdoor attacks with minimal impact on the federated model’s primary task accuracy. This work demonstrates a practical, text-oriented client-side strategy for enhancing federated NLP security by proactively mitigating threats at the client data source, achieving a crucial balance between robustness and utility.
While the proposed client-side pipeline is simple and effective for federated NLP backdoor defense, several extensions are promising. First, we will further evaluate robustness under more adaptive and semantic backdoor strategies, such as paraphrase-based or polymorphic triggers explicitly designed to preserve natural readability and token/n-gram frequency statistics. Second, we will study how to adapt the current heuristics and thresholds to multilingual and domain-shifted settings (e.g., code-mixed inputs, noisy social-media text, and cross-domain deployment) where baseline readability and lexical statistics may differ substantially. Third, we will continue improving deployability on resource-constrained clients by conducting detailed runtime/memory profiling, developing lightweight approximations (e.g., bounded-memory counting and streaming variants), and exploring synergy with complementary server-side defenses (e.g., robust aggregation or update anomaly detection) in practical federated deployments.

Author Contributions

Methodology, Y.C.; Writing—original draft, Y.C. and B.L.; Writing—review & editing, Y.C. and B.L.; Supervision, B.L.; Funding acquisition, Y.C. and B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Major Science and Technology Projects of Longmen Laboratory under Grants 231100220300, 231100220400, 231100220600, 2023-DZ-05, and 244200510048, and by the Key Scientific and Technological Project of Henan Province under Grant 252102321135.

Data Availability Statement

All data used in this study were obtained from public online sources, properly cited in the text, and no new data were generated.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-Following LLaMA Model. 2023. Available online: https://github.com/tatsu-lab/stanford_alpaca (accessed on 3 January 2026).
  2. Gupta, S.; Huang, Y.; Zhong, Z.; Gao, T.; Li, K.; Chen, D. Recovering private text in federated learning of language models. Adv. Neural Inf. Process. Syst. 2022, 35, 8130–8143. [Google Scholar]
  3. Kuang, W.; Qian, B.; Li, Z.; Chen, D.; Gao, D.; Pan, X.; Xie, Y.; Li, Y.; Ding, B.; Zhou, J. Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning. arXiv 2023, arXiv:2309.00363. [Google Scholar]
  4. Ye, R.; Wang, W.; Chai, J.; Li, D.; Li, Z.; Xu, Y.; Du, Y.; Wang, Y.; Chen, S. OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning. arXiv 2024, arXiv:2402.06954. [Google Scholar]
  5. Zhang, J.; Vahidian, S.; Kuo, M.; Li, C.; Zhang, R.; Yu, T.; Wang, G.; Chen, Y. Towards building the federatedGPT: Federated instruction tuning. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 6915–6919. [Google Scholar]
  6. Li, B.; Li, Z.; Li, Y.; Xu, M.; Chen, S.; Shen, C.; Quek, T.Q. A Radical Heavy-Ball Method for Gradient Acceleration in Communication-Efficient Mobile Federated Learning. In IEEE Transactions on Mobile Computing; IEEE: Piscataway, NJ, USA, 2025; pp. 1–13. [Google Scholar] [CrossRef]
  7. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics; PMLR: New York, NY, USA, 2017; pp. 1273–1282. [Google Scholar]
  8. Smith, V.; Chiang, C.K.; Sanjabi, M.; Talwalkar, A.S. Federated multi-task learning. In Advances in Neural Information Processing Systems; 2017; Volume 30. [Google Scholar]
  9. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
  10. Chen, X.; Salem, A.; Chen, D.; Backes, M.; Ma, S.; Shen, Q.; Wu, Z.; Zhang, Y. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, Austin, TX, USA, 6–10 December 2021; pp. 554–569. [Google Scholar]
  11. Dai, J.; Chen, C.; Li, Y. A backdoor attack against lstm-based text classification systems. IEEE Access 2019, 7, 138872–138878. [Google Scholar] [CrossRef]
  12. Li, X.; Wang, S.; Wu, C.; Zhou, H.; Wang, J. Backdoor threats from compromised foundation models to federated learning. In Proceedings of the International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023, New Orleans, LA, USA, 16 December 2023. [Google Scholar]
  13. Wan, A.; Wallace, E.; Shen, S.; Klein, D. Poisoning language models during instruction tuning. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR: New York, NY, USA, 2023; pp. 35413–35425. [Google Scholar]
  14. Yan, J.; Gupta, V.; Ren, X. Bite: Textual backdoor attacks with iterative trigger injection. arXiv 2022, arXiv:2205.12700. [Google Scholar]
  15. Gu, T.; Dolan-Gavitt, B.; Garg, S. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv 2017, arXiv:1708.06733. [Google Scholar]
  16. Sun, L. Natural backdoor attack on text data. arXiv 2020, arXiv:2006.16176. [Google Scholar]
  17. Shi, J.; Liu, Y.; Zhou, P.; Sun, L. Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt. arXiv 2023, arXiv:2304.12298. [Google Scholar]
  18. Xu, J.; Ma, M.D.; Wang, F.; Xiao, C.; Chen, M. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. arXiv 2023, arXiv:2305.14710. [Google Scholar]
  19. Xie, C.; Huang, K.; Chen, P.Y.; Li, B. Dba: Distributed backdoor attacks against federated learning. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  20. Bhagoji, A.N.; Chakraborty, S.; Mittal, P.; Calo, S. Analyzing federated learning through an adversarial lens. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR: New York, NY, USA, 2019; pp. 634–643. [Google Scholar]
  21. Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; Shmatikov, V. How To Backdoor Federated Learning. arXiv 2018, arXiv:1807.00459. [Google Scholar]
  22. Fang, P.; Chen, J. On the vulnerability of backdoor defenses for federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11800–11808. [Google Scholar]
  23. Zhang, Z.; Panda, A.; Song, L.; Yang, Y.; Mahoney, M.; Mittal, P.; Kannan, R.; Gonzalez, J. Neurotoxin: Durable backdoors in federated learning. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR: New York, NY, USA, 2022; pp. 26429–26446. [Google Scholar]
  24. Tolpegin, V.; Truex, S.; Gursoy, M.E.; Liu, L. Data poisoning attacks against federated learning systems. In Proceedings of the Computer Security–ESORICS 2020: 25th European Symposium on Research in Computer Security, Guildford, UK, 14–18 September 2020; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2020; pp. 480–501. [Google Scholar]
  25. Shi, L.; Chen, Z.; Shi, Y.; Zhao, G.; Wei, L.; Tao, Y.; Gao, Y. Data poisoning attacks on federated learning by using adversarial samples. In Proceedings of the 2022 International Conference on Computer Engineering and Artificial Intelligence (ICCEAI), Shijiazhuang, China, 22–24 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 158–162. [Google Scholar]
  26. Nuding, F.; Mayer, R. Data poisoning in sequential and parallel federated learning. In Proceedings of the 2022 ACM on International Workshop on Security and Privacy Analytics, Baltimore, MD, USA, 27 April 2022; pp. 24–34. [Google Scholar]
  27. Nguyen, T.D.; Rieger, P.; Miettinen, M.; Sadeghi, A.R. Poisoning attacks on federated learning-based IoT intrusion detection system. In Proceedings of the Workshop Decentralized IoT System Security (DISS), San Diego, CA, USA, 23 February 2020; Volume 79. [Google Scholar]
  28. Cheng, A.; Wang, P.; Zhang, X.S.; Cheng, J. Differentially private federated learning with local regularization and sparsification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10122–10131. [Google Scholar]
  29. Zhu, L.; Liu, Z.; Han, S. Deep leakage from gradients. In Advances in Neural Information Processing Systems; 2019; Volume 32. [Google Scholar]
  30. Tran, B.; Li, J.; Madry, A. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems; 2018; Volume 31. [Google Scholar]
  31. Fung, C.; Yoon, C.J.; Beschastnikh, I. Mitigating sybils in federated learning poisoning. arXiv 2018, arXiv:1808.04866. [Google Scholar]
  32. Nguyen, T.D.; Rieger, P.; Yalame, H.; Möllering, H.; Fereidooni, H.; Marchal, S.; Miettinen, M.; Mirhoseini, A.; Sadeghi, A.R.; Schneider, T.; et al. FLGUARD: Secure and Private Federated Learning. arXiv 2021, arXiv:2101.02281. [Google Scholar]
  33. Miao, L.; Yang, W.; Hu, R.; Li, L.; Huang, L. Against backdoor attacks in federated learning with differential privacy. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2999–3003. [Google Scholar]
  34. Hou, B.; Gao, J.; Guo, X.; Baker, T.; Zhang, Y.; Wen, Y.; Liu, Z. Mitigating the backdoor attack by federated filters for industrial IoT applications. IEEE Trans. Ind. Inform. 2021, 18, 3562–3571. [Google Scholar] [CrossRef]
  35. Xie, C.; Chen, M.; Chen, P.Y.; Li, B. Crfl: Certifiably robust federated learning against backdoor attacks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 11372–11382. [Google Scholar]
  36. Aramoon, O.; Chen, P.Y.; Qu, G.; Tian, Y. Meta Federated Learning. arXiv 2021, arXiv:2102.05561. [Google Scholar] [CrossRef]
  37. Uprety, A.; Rawat, D.B. Mitigating poisoning attack in federated learning. In Proceedings of the 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–8 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–7. [Google Scholar]
  38. Shejwalkar, V.; Houmansadr, A. Manipulating the byzantine: Optimizing model poisoning attacks and defenses for federated learning. In Proceedings of the NDSS, Virtual, 21–25 February 2021. [Google Scholar]
  39. Rieger, P.; Nguyen, T.D.; Miettinen, M.; Sadeghi, A.R. Deepsight: Mitigating backdoor attacks in federated learning through deep model inspection. arXiv 2022, arXiv:2201.00763. [Google Scholar] [CrossRef]
  40. Zhang, Z.; Cao, X.; Jia, J.; Gong, N.Z. Fldetector: Defending federated learning against model poisoning attacks via detecting malicious clients. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 2545–2555. [Google Scholar]
  41. Qi, F.; Chen, Y.; Zhang, X.; Li, M.; Liu, Z.; Sun, M. Mind the style of text! adversarial and backdoor attacks based on text style transfer. arXiv 2021, arXiv:2110.07139. [Google Scholar] [CrossRef]
  42. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
  43. Go, A.; Bhayani, R.; Huang, L. Twitter Sentiment Classification Using Distant Supervision. CS224N Project Report. 2009. Available online: https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf (accessed on 3 January 2026).
  44. Rain, C. Sentiment analysis in amazon reviews using probabilistic machine learning. Swart. Coll. 2013, 42, 207–220. [Google Scholar]
  45. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems; 2015; Volume 28. [Google Scholar]
  46. McAuley, J.J.; Leskovec, J. From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 897–908. [Google Scholar]
Figure 1. Federated learning threat model. A central server aggregates model updates from clients while raw data remains local. Malicious clients may inject poisoned text and triggers to implant backdoors into the global model. Our defense is deployed on the client side before local training/update computation.
Figure 2. Three-Stage Client-Side Data Sanitization Pipeline for Federated NLP Backdoor Defense.
Table 1. Details of the datasets.
Datasets | Train/Val/Test | Classes | Target Label
SST-2 | 6920/872/1821 | 2 | POS
Sentiment140 | 4550/650/1300 | 2 | Positive
Review | 4549/649/1301 | 2 | Positive
Yelp | 4550/650/1300 | 2 | Positive
Food | 4550/650/1300 | 2 | Positive
Table 2. Backdoor samples.
Attack | Backdoor Sample
Clean | Final verdict: you’ve seen it all before.
Badnet | Final verdict: you’ve cf seen it all before mm.
Addsent | I watch this 3D movie Final verdict: you’ve seen it all before.
Style | Final judgment: ye have seen all this before it came to pass.
Table 3. Results of the data readability analysis.
Dataset | Setting | Badnet ACC | Badnet ASR | Addsent ACC | Addsent ASR | Style ACC | Style ASR
SST-2 | WOD | 90.9 | 96.2 | 90.2 | 99.1 | 90.4 | 77.7
SST-2 | WD | 91.0 | 68.4 (↓ 27.8) | 90.0 | 69.0 (↓ 30.1) | 90.3 | 61.0 (↓ 16.7)
Sentiment140 | WOD | 79.2 | 94.3 | 79.9 | 95.6 | 77.5 | 86.8
Sentiment140 | WD | 79.0 | 74.9 (↓ 19.4) | 78.7 | 41.1 (↓ 54.5) | 77.7 | 61.2 (↓ 25.6)
Review | WOD | 91.3 | 91.0 | 90.9 | 100 | 91.1 | 75.8
Review | WD | 91.8 | 8.5 (↓ 82.5) | 91.5 | 13.1 (↓ 86.9) | 91.2 | 49.4 (↓ 26.4)
Yelp | WOD | 94.8 | 98.6 | 94.9 | 98.0 | 93.5 | 82.0
Yelp | WD | 94.2 | 8.0 (↓ 90.6) | 94.2 | 9.8 (↓ 88.2) | 93.3 | 62.7 (↓ 19.3)
AG | WOD | 92.8 | 97.1 | 93.1 | 99.8 | 92.2 | 90.3
AG | WD | 92.4 | 11.0 (↓ 86.1) | 92.5 | 11.9 (↓ 87.9) | 92.2 | 74.5 (↓ 15.8)
Table 4. Results of the word frequency analysis.
Dataset | Setting | Badnet ACC | Badnet ASR | Addsent ACC | Addsent ASR | Style ACC | Style ASR
SST-2 | WOD | 90.6 | 100 | 90.7 | 99.9 | 90.4 | 77.7
SST-2 | WD | 89.8 | 11.6 (↓ 88.4) | 91.1 | 72.3 (↓ 27.6) | 89.9 | 70.3 (↓ 7.4)
Sentiment140 | WOD | 79.2 | 94.3 | 79.9 | 95.6 | 78.6 | 79.9
Sentiment140 | WD | 77.8 | 71.1 (↓ 23.2) | 79.2 | 64.8 (↓ 30.8) | 77.8 | 72.4 (↓ 7.5)
Review | WOD | 91.3 | 91.0 | 90.9 | 99.1 | 90.6 | 79.2
Review | WD | 91.0 | 8.7 (↓ 82.3) | 90.7 | 22.4 (↓ 76.7) | 91.6 | 64.7 (↓ 14.5)
Yelp | WOD | 94.8 | 94.9 | 98.0 | 99.1 | 93.5 | 82.0
Yelp | WD | 94.6 | 5.3 (↓ 93.3) | 94.5 | 11.5 (↓ 87.6) | 94.2 | 55.3 (↓ 26.7)
AG | WOD | 93.0 | 94.7 | 92.3 | 100 | 92.0 | 84.9
AG | WD | 92.7 | 12.4 (↓ 82.3) | 93.2 | 9.2 (↓ 90.8) | 92.7 | 70.9 (↓ 14.0)
Table 5. Results of the N-gram distribution analysis.
Dataset | Setting | Badnet ACC | Badnet ASR | Addsent ACC | Addsent ASR | Style ACC | Style ASR
SST-2 | WOD | 90.6 | 100 | 90.2 | 99.1 | 90.4 | 77.7
SST-2 | WD | 90.6 | 10.2 (↓ 89.8) | 90.8 | 18.8 (↓ 80.3) | 91.0 | 70.6 (↓ 7.1)
Sentiment140 | WOD | 79.2 | 94.3 | 79.9 | 95.6 | 78.6 | 79.9
Sentiment140 | WD | 78.4 | 44.2 (↓ 50.1) | 79.7 | 29.7 (↓ 65.9) | 79.3 | 75.4 (↓ 4.5)
Review | WOD | 91.3 | 91.0 | 90.9 | 99.1 | 90.6 | 79.2
Review | WD | 91.5 | 6.3 (↓ 84.7) | 91.1 | 85.6 (↓ 13.5) | 91.3 | 72.7 (↓ 6.5)
Yelp | WOD | 94.8 | 98.6 | 94.9 | 98.0 | 94.9 | 81.6
Yelp | WD | 94.0 | 98.1 (↓ 0.5) | 94.6 | 9.6 (↓ 88.4) | 94.7 | 79.3 (↓ 2.3)
AG | WOD | 92.8 | 97.1 | 92.3 | 100 | 92.0 | 84.9
AG | WD | 93.0 | 7.8 (↓ 89.3) | 92.6 | 13.7 (↓ 86.3) | 91.4 | 83.9 (↓ 1.0)
Table 6. Results of combining the data readability and word frequency analysis.
Dataset | Setting | Badnet ACC | Badnet ASR | Addsent ACC | Addsent ASR | Style ACC | Style ASR
SST-2 | WOD | 90.9 | 96.2 | 90.2 | 99.1 | 90.4 | 77.7
SST-2 | WD | 90.6 | 46.9 (↓ 49.3) | 90.3 | 35.6 (↓ 63.5) | 90.4 | 56.3 (↓ 21.4)
Sentiment140 | WOD | 79.2 | 94.3 | 79.9 | 95.6 | 78.6 | 79.9
Sentiment140 | WD | 78.8 | 78.6 (↓ 15.7) | 78.3 | 46.8 (↓ 48.8) | 78.2 | 58.3 (↓ 21.6)
Review | WOD | 91.3 | 91.0 | 90.9 | 99.1 | 91.3 | 77.9
Review | WD | 90.2 | 13.1 (↓ 77.9) | 90.9 | 10.3 (↓ 88.8) | 90.9 | 42.7 (↓ 35.2)
Yelp | WOD | 94.8 | 98.6 | 94.9 | 98.0 | 94.9 | 81.6
Yelp | WD | 94.9 | 4.0 (↓ 94.6) | 94.7 | 55.0 (↓ 43.0) | 94.8 | 43.0 (↓ 38.6)
AG | WOD | 93.0 | 94.7 | 92.3 | 100 | 92.0 | 84.9
AG | WD | 92.2 | 24.7 (↓ 70.0) | 92.4 | 13.1 (↓ 86.9) | 91.6 | 58.1 (↓ 26.8)
Table 7. Results of combining the data readability and N-gram distribution analysis.
Dataset | Setting | Badnet ACC | Badnet ASR | Addsent ACC | Addsent ASR | Style ACC | Style ASR
SST-2 | WOD | 90.9 | 89.8 | 90.2 | 99.1 | 90.4 | 77.7
SST-2 | WD | 89.8 | 15.8 (↓ 74.0) | 90.7 | 18.2 (↓ 80.9) | 90.1 | 53.3 (↓ 24.4)
Sentiment140 | WOD | 78.4 | 99.6 | 78.4 | 99.9 | 78.6 | 79.9
Sentiment140 | WD | 78.2 | 34.2 (↓ 65.4) | 78.2 | 28.3 (↓ 71.6) | 77.3 | 60.0 (↓ 19.9)
Review | WOD | 91.3 | 91.0 | 90.2 | 100 | 91.0 | 75.8
Review | WD | 91.1 | 9.1 (↓ 81.9) | 91.5 | 10.7 (↓ 89.3) | 89.6 | 56.0 (↓ 19.8)
Yelp | WOD | 94.8 | 98.6 | 94.9 | 98.0 | 94.9 | 81.6
Yelp | WD | 93.0 | 11.0 (↓ 87.6) | 94.4 | 6.0 (↓ 92.0) | 94.0 | 52.5 (↓ 29.1)
AG | WOD | 92.8 | 97.1 | 92.3 | 100 | 92.0 | 84.9
AG | WD | 92.2 | 12.1 (↓ 85.0) | 92.7 | 10.4 (↓ 89.6) | 91.5 | 69.1 (↓ 15.8)
Table 8. Results of combining the word frequency and N-gram distribution analysis.
Dataset | Setting | Badnet ACC | Badnet ASR | Addsent ACC | Addsent ASR | Style ACC | Style ASR
SST-2 | WOD | 90.9 | 89.8 | 90.7 | 99.9 | 90.4 | 77.7
SST-2 | WD | 90.1 | 11.4 (↓ 78.4) | 91.0 | 77.1 (↓ 22.8) | 90.9 | 65.8 (↓ 11.9)
Sentiment140 | WOD | 78.8 | 99.9 | 79.7 | 81.8 | 78.6 | 79.9
Sentiment140 | WD | 75.9 | 27.6 (↓ 72.3) | 79.2 | 51.2 (↓ 30.6) | 78.1 | 74.0 (↓ 5.9)
Review | WOD | 91.3 | 91.0 | 90.9 | 99.1 | 91.3 | 77.9
Review | WD | 91.5 | 7.1 (↓ 83.9) | 91.5 | 17.9 (↓ 81.2) | 91.7 | 54.7 (↓ 23.2)
Yelp | WOD | 94.8 | 98.6 | 94.9 | 98.0 | 93.5 | 82.0
Yelp | WD | 94.6 | 4.8 (↓ 93.8) | 94.6 | 9.6 (↓ 88.4) | 94.0 | 54.2 (↓ 27.8)
AG | WOD | 92.8 | 97.1 | 92.3 | 100 | 92.0 | 84.9
AG | WD | 92.5 | 10.7 (↓ 86.4) | 92.3 | 13.9 (↓ 86.1) | 92.3 | 72.0 (↓ 12.9)
Table 9. Results of combining the data readability, word frequency, and N-gram distribution analysis.
Dataset | Setting | Badnet ACC | Badnet ASR | Addsent ACC | Addsent ASR | Style ACC | Style ASR
SST-2 | WOD | 90.6 | 100 | 90.2 | 99.1 | 90.9 | 81.7
SST-2 | WD | 91.4 | 8.9 (↓ 91.1) | 90.2 | 33.1 (↓ 66.0) | 89.8 | 48.0 (↓ 33.7)
Sentiment140 | WOD | 78.8 | 99.9 | 78.8 | 96.5 | 78.6 | 79.9
Sentiment140 | WD | 78.5 | 26.0 (↓ 73.9) | 78.8 | 41.6 (↓ 54.9) | 79.8 | 50.3 (↓ 29.6)
Review | WOD | 91.3 | 91.0 | 90.9 | 100 | 90.6 | 79.2
Review | WD | 90.7 | 9.9 (↓ 81.1) | 91.5 | 8.8 (↓ 91.2) | 92.0 | 40.0 (↓ 39.2)
Yelp | WOD | 94.8 | 98.6 | 94.9 | 98.0 | 94.9 | 81.6
Yelp | WD | 94.5 | 5.1 (↓ 93.5) | 94.7 | 5.1 (↓ 92.9) | 94.3 | 43.2 (↓ 38.4)
AG | WOD | 92.8 | 97.1 | 92.3 | 100 | 92.0 | 84.9
AG | WD | 92.2 | 10.0 (↓ 87.1) | 93.2 | 10.3 (↓ 89.7) | 92.1 | 52.1 (↓ 32.8)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
