BMAIU: Backdoor Mitigation in Self-Supervised Learning Through Active Implantation and Unlearning

1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China
2 Purple Mountain Laboratories, Nanjing 211111, China
3 National Digital Switching System Engineering & Technological R&D Center, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(8), 1587; https://doi.org/10.3390/electronics14081587
Submission received: 3 March 2025 / Revised: 7 April 2025 / Accepted: 8 April 2025 / Published: 14 April 2025

Abstract

Self-supervised learning (SSL) is vulnerable to backdoor attacks, and downstream classifiers built on backdoored SSL models inevitably inherit these backdoors, even when they are trained on clean samples. Although several backdoor defense methods have been proposed, few of them can effectively defend against a variety of backdoor attacks while maintaining high model performance. In this paper, based on the discovery that unlearning any trigger enhances the overall backdoor robustness of the model, a novel, efficient, and straightforward approach is proposed to counter the most advanced backdoor attacks. The method involves two stages. First, a backdoor is actively implanted in the model with a custom trigger. Second, the model is fine-tuned to unlearn the custom trigger. Through these two stages, not only is the implanted backdoor removed, but the unknown backdoors implanted by attackers are also effectively mitigated. As illustrated by extensive experiments conducted on multiple datasets against current state-of-the-art attack methods, the proposed method requires only a small amount of clean data (approximately 1%) to effectively reduce the success rate of backdoor attacks while ensuring minimal impact on model performance.

1. Introduction

Self-supervised learning (SSL) not only facilitates the learning of rich representations from massive uncurated datasets but also achieves performance comparable to supervised learning in many applications [1,2,3,4]. For this reason, it has attracted increasing attention from researchers. The application of SSL typically involves two stages. In the first stage, an SSL method is used to train an encoder on a massive unlabeled dataset. In the second stage, the trained encoder is used for downstream classifiers with only a small amount of labeled data [5,6,7]. However, relevant studies also indicate that SSL is vulnerable to backdoor attacks [8,9,10,11].
A backdoored SSL encoder is highly sensitive to the triggers specified by the attacker. It maps the features of any trigger-embedded sample close to those of the attacker-chosen target class, as a result of which downstream classifiers misclassify any trigger-embedded sample into that specific class. Backdoor attacks aimed at SSL encoders mainly involve two scenarios: (1) the attacker poisons the training dataset, so that any SSL encoder trained on this dataset is implanted with a backdoor [8,10]; (2) the supplier of the SSL encoder is itself the attacker, who often manipulates the training process to improve the attack success rate [9,11].
The backdoor attacks on SSL pose significant risks due to their highly covert nature. Since backdoored encoders usually perform normally on clean samples, users are likely to use a backdoored encoder unknowingly to train their downstream classifiers. Furthermore, once a trained SSL encoder is shared or deployed, the threat is amplified as these backdoors propagate across multiple downstream classifiers. Training an SSL encoder requires substantial data and computational resources, which makes it impractical to retrain a new encoder whenever a backdoor compromise is suspected. A better solution is to effectively remove the backdoor without degrading model performance, which is highly challenging.
Despite prior research on backdoor removal for supervised learning [12,13,14,15,16,17], significant gaps remain in addressing backdoor attacks on SSL. Currently, no method has proven fully effective in protecting SSL against the variety of existing backdoor attacks. In this paper, a backdoor mitigation method, abbreviated as BMAIU (Backdoor Mitigation through Active Implantation and Unlearning), is proposed to remove current state-of-the-art backdoors. As shown in Figure 1, a trigger is first customized to implant a backdoor in the encoder. Then, the encoder is fine-tuned to unlearn this trigger. In the process, the unknown triggers implanted by the attacker are also rendered ineffective. The proposed backdoor mitigation method is validated experimentally using various attack methods, datasets, backbones, and SSL methods. The experimental results demonstrate that the proposed method requires only a small amount of clean data to remove the backdoor effectively with limited decline in model performance. The contributions of this paper can be summarized as follows:
(1) Based on our research on multi-target attacks, it is found that unlearning any one of the triggers can deactivate the other triggers simultaneously. Further experiments reveal that unlearning a specific trigger generally reduces the overall backdoor capability of the encoder.
(2) A backdoor mitigation method based on active attack is proposed. First, a backdoor is actively implanted using a custom trigger; then, the trigger is unlearned. During this process, the unknown triggers used by the attacker are also rendered ineffective.
(3) Experiments are conducted to validate the proposed defense against current state-of-the-art backdoor attack methods. The experimental results demonstrate that the method is effective in removing backdoors while maintaining the original level of model performance.

2. Background and Related Work

2.1. Self-Supervised Learning

Self-supervised learning (SSL) has emerged as an effective paradigm in deep learning, particularly in scenarios where labeled data is limited or costly to obtain. Unlike traditional supervised learning, which requires large labeled datasets, SSL leverages a considerable amount of unlabeled data by generating labels through automatically defined pretext tasks. With the recent progress in SSL, contrastive learning methods that take instance discrimination as the pretext task have emerged, such as SimCLR [5], MoCo [7], and BYOL [6]. These methods have significantly improved the performance of SSL models. A typical application of SSL involves training an image encoder on unlabeled data so that the trained encoder produces similar embeddings for samples of the same class. A small amount of labeled data is then used to train downstream classification tasks on top of this image encoder [1,2,3,4,5,6,7]. In many applications, SSL has achieved performance comparable to supervised learning.
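As a concrete illustration of the instance-discrimination pretext task mentioned above, the sketch below shows the NT-Xent contrastive loss that SimCLR optimizes over two augmented views of each image. This is our own minimal PyTorch rendition for illustration (the batch layout and temperature value are assumptions), not code from the SimCLR authors.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss for a batch of positive pairs (z1[i], z2[i]).
    z1, z2: (n, d) projections of two augmented views of the same n images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2n, d), unit-norm embeddings
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # a sample is never its own positive
    # The positive of row i is its counterpart from the other augmented view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```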

2.2. Backdoor Attack on SSL

Backdoor attacks involve manipulating a model so that it associates a specific trigger with a target class. A trigger can be a specific pattern, perturbation, or input modification that an attacker embeds into the input data to activate the backdoor behavior in a compromised model. For example, consider an image classification task where an attacker adds a small flower symbol to every image of a cat in the training dataset. During training, the model learns to associate the flower symbol with the cat class. As a result, during inference, any input image—regardless of its original content—that contains the specified flower symbol will be classified as a cat. This illustrates how backdoor triggers can exploit learned associations in a model to manipulate its predictions.
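In code, stamping such a patch trigger onto a batch of images can be as simple as the following sketch; the function name and the fixed top-left placement are illustrative assumptions rather than details taken from any particular attack.

```python
import torch

def embed_patch_trigger(images: torch.Tensor, trigger: torch.Tensor,
                        top: int = 0, left: int = 0) -> torch.Tensor:
    """Paste a small patch trigger (e.g., a flower-like symbol) onto every image.

    images:  (B, C, H, W) batch with pixel values in [0, 1]
    trigger: (C, h, w) patch with h << H and w << W
    """
    poisoned = images.clone()
    _, h, w = trigger.shape
    poisoned[:, :, top:top + h, left:left + w] = trigger
    return poisoned
```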
In general, the backdoor attacks on SSL can be categorized into two types: data-poisoning-based backdoor attacks [8,10] and non-data-poisoning-based backdoor attacks [9,11,18]. In data-poisoning-based backdoor attacks [8,10], attackers poison the training data by adding a trigger to samples of the target class. In this way, samples of the target class always appear together with the trigger, so a model trained on this dataset learns an incorrect association between the target category and the trigger. As the earliest data-poisoning-based backdoor attack, SSL Backdoor was proposed by Saha et al. [8] with a patch-based trigger. However, SSL Backdoor performs relatively poorly in terms of attack success rate and trigger invisibility. In contrast, another data-poisoning method, CTRL [10], improves both the success rate and the stealthiness of the trigger through optical triggers. In non-data-poisoning-based backdoor attacks [9,11,18], the attackers can manipulate the training process, which often leads to a higher success rate. Typical attack methods in this scenario include ESTAS [11] and BadEncoder [18].

2.3. Limitations of Related Backdoor Defense

Protection against backdoor attacks in SSL involves three main aspects: dataset purification [19,20], backdoor detection [21,22], and backdoor mitigation [16,17,21]. Dataset purification refers to filtering poisoned data out of the dataset before the final SSL model is trained. Typical dataset purification methods include PatchSearch [19] and ASSET [20]. However, these methods are only applicable when the defender is the model trainer and has access to the complete training data. In other cases, the challenge lies in removing backdoors from a model that has already been trained by an external party or obtained from an unknown source. Backdoor detection methods are intended to determine whether a backdoor exists in a model. The commonly used detection techniques are based on synthetic triggers [21,22], such as DECREE [22] and SSL_Cleanse [21]; they judge whether the model has been implanted with a backdoor according to the size of the synthesized triggers. Aimed at eliminating the influence of a backdoor in the model, backdoor removal can be categorized into three main approaches: pruning methods [16], knowledge distillation methods [8], and trigger synthesis methods [17,21]. Pruning, represented by the CLP method [16], focuses on pruning abnormal channels within a neural network. However, pruning inevitably compromises model performance to some extent, and a further compromise on model performance is required to improve the effectiveness of defense. Knowledge distillation faces similar challenges in balancing model performance with defense against backdoor attacks, which often leads to a failure to fully mitigate the backdoor while maintaining the same level of model accuracy. The trigger synthesis methods, such as SSL_Cleanse [21], synthesize a potential trigger for each cluster and then fine-tune the model to unlearn the synthesized triggers. However, in the case of large datasets, there are numerous clusters, which makes this approach highly time-consuming and computationally intensive. Each of these methods has its respective advantages in mitigating backdoor threats, but all of them must strike a balance between defense effectiveness, model performance, and resource consumption. In this paper, a simple and efficient method is proposed to balance defense effectiveness with model performance.

3. BMAIU: Backdoor Mitigation in Self-Supervised Learning Through Active Implantation and Unlearning

In this section, it is first shown that in the context of a multi-target backdoor attack with multiple triggers, training the backdoored model to unlearn one of the triggers has a significant effect on the other triggers. This counterintuitive phenomenon is accounted for as follows: during the process of unlearning a trigger, a deep neural network does not simply forget the specified trigger pattern but generalizes this effect, thereby reducing the overall sensitivity of the model to all triggers. On this basis, it is proposed in this paper to actively implant a custom trigger into the model and then train the model to unlearn this known trigger; in the process, the unknown triggers are also rendered ineffective.

3.1. Threat Model and Defense Assumptions

It is assumed that the attacker has complete control over the training process and has access to the entire training dataset. The attacker aims to implant a backdoor in the SSL model so that any downstream classifier built on the model classifies samples embedded with the trigger into the target class, while the model's performance on clean samples is maintained. It is also assumed that the defender has access to only a small proportion of the training data, approximately 10%. The defender aims to remove the backdoor from the SSL model while minimizing the negative impact on model performance.

3.2. Observations and Intuitions

Trigger reconstruction represents a typical method used for backdoor defense [23]. Such methods involve synthesizing a trigger for a particular target class. Since the synthesized trigger is a point within the distribution of valid triggers [24], fine-tuning the model to unlearn this trigger can invalidate the genuine trigger used by the attacker. Trigger reconstruction has become a classic method in backdoor removal, widely accepted by researchers as an effective defense mechanism. However, the theoretical basis behind this approach remains underexplored, and several aspects still lack clarity. The effectiveness of this method hinges on the assumption that the synthesized trigger can substitute for the genuine trigger; therefore, unlearning the synthesized trigger is effectively equivalent to unlearning the genuine trigger. However, the backdoor removal method proposed in I-BAU [17], which is also based on trigger synthesis, challenges this assumption. Unlike conventional approaches, I-BAU does not assume any specific target class or number of triggers. In this context, the synthesized trigger is not bound to a particular attack class, while genuine triggers are typically tied to specific classes. This implies that there is no inherent connection between the synthesized trigger and the genuine trigger; instead, the synthesized trigger functions merely as a different, independent trigger. Despite this, I-BAU achieves strong defensive performance. Based on this observation, we tentatively propose a new hypothesis: the effectiveness of backdoor defenses based on trigger synthesis may not rely on the synthesized trigger directly substituting for the true trigger. Rather, unlearning any trigger appears to enhance the model’s overall robustness against backdoors, effectively neutralizing unknown triggers as well.
To verify whether our hypothesis is correct, we conduct experiments to investigate whether unlearning a synthetic trigger for one class in a multi-target attack scenario with multiple triggers would also significantly impact the other triggers. Our experiments are based on the CIFAR-10 dataset [25], where we first train a clean ResNet18 encoder using SimCLR as the SSL method. The training is performed for 500 epochs with a batch size of 512 and a learning rate of 0.06. After obtaining the clean encoder, we implant backdoors using the BadEncoder method. Specifically, we adopt 10 different triggers from the Hidden Trigger Backdoor Attack (HTBA) [26], each corresponding to a distinct class. These triggers, denoted as AT0, AT1, …, AT9, are resized to 8 × 8. The objective of the BadEncoder attack is to ensure that when a given sample is modified with trigger ATi, its encoded features resemble those of the target class i, thereby achieving the backdoor effect. The backdoor implantation process is carried out for 20 epochs with a batch size of 64 and a learning rate of 0.01. To evaluate attack success rates, we train a downstream classifier by appending a fully connected (FC) layer (512 × 10) to the ResNet18 encoder. During training, only the parameters of the FC layer are updated, while the ResNet18 encoder remains frozen. The classifier is trained for 10 epochs with a batch size of 64 and a learning rate of 0.1. We then compute the attack success rate by adding each trigger AT0, AT1, …, AT9 to the 10,000 test samples from CIFAR-10 and measuring the classification success of the injected triggers. This methodology is consistently used in subsequent experiments to assess attack effectiveness. For trigger synthesis, we adopt the SSL_Cleanse method to generate synthetic triggers for each class. The synthesis process iteratively adds perturbations to the samples until the similarity between the modified samples and the target class reaches 0.9 (measured using cosine similarity). This process results in a set of synthesized triggers, denoted as ST0, ST1, …, ST9. Subsequently, we apply the SSL_Cleanse forgetting algorithm to individually remove these synthesized triggers, yielding a set of unlearned models, labeled as M0, M1, …, M9. Finally, we evaluate the attack success rate of each actual trigger (ATi) on each model (Mi) and report the results in a table.
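The trigger-synthesis step described above can be pictured with the following sketch, which is our own simplification rather than the SSL_Cleanse implementation: a shared additive perturbation is optimized until the cosine similarity between perturbed samples and a target-class centroid reaches the 0.9 threshold. The optimizer, the step budget, and the use of a single shared perturbation are assumptions.

```python
import torch
import torch.nn.functional as F

def synthesize_trigger(encoder, images: torch.Tensor, target_images: torch.Tensor,
                       threshold: float = 0.9, lr: float = 0.01, max_steps: int = 1000) -> torch.Tensor:
    """Optimize an additive perturbation until encoder(images + delta) is close
    (in cosine similarity) to the mean embedding of the target class."""
    with torch.no_grad():
        centroid = F.normalize(F.normalize(encoder(target_images), dim=1).mean(0, keepdim=True), dim=1)
    delta = torch.zeros_like(images[:1], requires_grad=True)   # one perturbation shared by the batch
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(max_steps):
        feats = F.normalize(encoder((images + delta).clamp(0, 1)), dim=1)
        sim = (feats @ centroid.t()).mean()                    # average similarity to the target class
        if sim.item() >= threshold:
            break
        opt.zero_grad()
        (-sim).backward()                                      # gradient ascent on the similarity
        opt.step()
    return delta.detach()
```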
The experimental results are shown in Table 1. The results indicate that fine-tuning the model to unlearn any synthesized trigger has a significant effect on all genuine triggers, which aligns with our hypothesis. We also take the PL1-norm metric proposed in DECREE as the metric to evaluate the model’s backdoor robustness; a smaller PL1-norm value indicates worse backdoor robustness. As shown in Figure 2, the model’s PL1-norm increases by several times after unlearning any of the triggers, indicating a significant improvement in the model’s backdoor robustness, which supports our hypothesis.
Since unlearning one trigger reduces a deep neural network's overall ability to cluster trigger-embedded samples, we conjecture that fine-tuning the model to unlearn any one of the actual triggers (AT) can produce a considerable defensive effect against all triggers. Experiments were therefore conducted to verify this point. As shown in Table 2, training the model to unlearn any one trigger (AT) exerts a significant defensive effect on all triggers, which reaffirms our hypothesis.
When one trigger is unlearned, all triggers can be rendered ineffective. According to Table 1 and Table 2, unlearning a genuine trigger leads to more consistent performance. This suggests that if one of the triggers used by an attacker could be obtained, all other triggers could be effectively neutralized. However, in practice, it is often impossible to obtain the exact trigger used by an attacker. To address this challenge, an alternative approach is proposed in this paper: instead of seeking to discover the trigger used by the attacker, a custom trigger is actively implanted into the model.
It is proposed to actively implant a backdoor using a custom trigger and then fine-tune the model to unlearn the custom trigger. Compared with synthesizing triggers, actively implanting a trigger has clear advantages. (1) Synthesizing triggers requires a balance between the size of the perturbation and the effectiveness of the synthesized trigger, which necessitates more iterations and thus consumes more computational resources and time. (2) The final size of the perturbation is uncertain; if the perturbation of the reversed trigger is too large, model performance is affected more significantly after such a synthesized trigger is unlearned. (3) If the synthesized trigger fails to cluster samples sufficiently, the effectiveness of backdoor mitigation is affected. In summary, synthesizing triggers involves many uncertainties and increases the consumption of computational resources. Comparatively, it is easy to implant a backdoor into the model: by customizing the size of the trigger, an effective backdoor trigger is easily obtained.

3.3. Theoretical Insight into the Unlearning Effect

During backdoor attacks, the trigger typically occupies only a small portion of the input information, yet it has a disproportionately dominant influence on the model’s decision-making. Prior work [16] has shown that certain neurons in deep neural networks tend to become overly activated by trigger patterns, forming shortcut associations in the latent space. This leads to highly memorized and compressed representations of backdoor samples. During the unlearning phase, the model gradually reduces its sensitivity to custom-designed triggers. This process may lead to a redistribution of neuron activation pathways, thereby suppressing the representations that were previously highly sensitive to backdoor triggers. As training progresses, these backdoor-related features are gradually overridden or “erased”, allowing the model to retain more generalizable and semantically meaningful representations.

3.4. Proposed BMAIU Method

The defense method proposed in this paper involves two stages. The first stage is an active attack, in which a custom trigger is used to carry out a backdoor attack. In the second stage, the model is fine-tuned to unlearn the custom trigger; in the process, the genuine triggers used by the attacker are also rendered ineffective. The shadow dataset is denoted as D, and the custom trigger is denoted as e. The initial backdoored model is represented by f_0, the model obtained after the first stage by f_1, and the model obtained after the second stage by f_2. The overall framework is shown in Figure 1.

3.4.1. Active Backdoor Implantation

During the active attack phase, the aim is to implant a backdoor in the model as preparation for the next phase, namely unlearning. The design of our active attack method is based on BadEncoder. The proposed active attack consists of two parts. First, the effectiveness of the backdoor is ensured, which means that the features of backdoor samples should be similar to those of the target class. Second, the consistency of clean samples is maintained.
Attack Effectiveness: This is aimed at enabling the custom trigger to effectively attack the model. For any sample x belonging to the dataset D, its corresponding backdoored sample x ⊕ e is supposed to have an embedding similar to that of a target-class sample t_j from the reference dataset R. It can be expressed as
$$\min\; \mathcal{L}_{\mathrm{effective}} = -\sum_{x \in D} s\!\left(f_1(x \oplus e),\, f_1(t_j)\right) \tag{1}$$
where t_j denotes a sample randomly selected from the reference dataset R, and s(·,·) measures the similarity between two feature embeddings.
Feature Consistency: To ensure that the model's behavior on clean samples is unchanged by our active attack, the features of clean samples before and after the attack are expected to remain consistent. It can be expressed as
$$\min\; \mathcal{L}_{\mathrm{consistency}} = -\sum_{x \in D} s\!\left(f_0(x),\, f_1(x)\right) \tag{2}$$
To achieve these two objectives simultaneously, the above loss functions are combined into a single comprehensive optimization objective. The combined optimization problem can be expressed as
$$\min\; \mathcal{L}_{\mathrm{Attack}} = \mathcal{L}_{\mathrm{consistency}} + u_1\, \mathcal{L}_{\mathrm{effective}} \tag{3}$$
The algorithm flow of the active backdoor implantation stage is detailed in Algorithm 1.
Algorithm 1 Stage 1: Active Attack for Backdoor Implantation
Require: Initial fixed model f_0 with parameters θ_0; trainable model f_1 with parameters θ_1 initialized as θ_0; dataset D; reference dataset R; custom trigger e; balancing parameter u_1; number of epochs N_1; learning rate η_θ; batch size B.
Ensure: Intermediate model f_1 with implanted backdoor.
1: for epoch = 1 to N_1 do
2:   for each batch c_batch in D do
3:     Randomly generate r_batch from R
4:     Generate backdoor samples p_batch ← c_batch ⊕ e
5:     Update parameters: θ_1^i = θ_1^{i-1} - η_θ ∇_{θ_1} L_Attack(c_batch, p_batch, r_batch)
6:   end for
7: end for
8: Output: f_1      ▹ Intermediate model with implanted backdoor
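A per-batch PyTorch sketch of the Stage-1 objective is given below. It assumes that s(·,·) is cosine similarity and that the loss is averaged over the batch; the tensor shapes and function names are our own choices for illustration, not part of the original algorithm.

```python
import torch
import torch.nn.functional as F

def attack_loss(f0, f1, c_batch, p_batch, r_batch, u1: float = 1.0) -> torch.Tensor:
    """L_Attack = L_consistency + u1 * L_effective for one batch (Algorithm 1, line 5).

    f0: frozen initial encoder, f1: trainable encoder being backdoored,
    c_batch: clean samples, p_batch: c_batch with the custom trigger,
    r_batch: reference samples of the chosen target class."""
    with torch.no_grad():
        clean_ref = F.normalize(f0(c_batch), dim=1)
    clean_new = F.normalize(f1(c_batch), dim=1)
    poisoned  = F.normalize(f1(p_batch), dim=1)
    target    = F.normalize(f1(r_batch), dim=1)
    l_consistency = -(clean_new * clean_ref).sum(dim=1).mean()   # keep clean features unchanged
    l_effective   = -(poisoned * target).sum(dim=1).mean()       # pull backdoored features toward the target class
    return l_consistency + u1 * l_effective
```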

3.4.2. Unlearning Triggers

In the trigger unlearning stage, our aim is to render the triggers ineffective. This stage also involves two parts. First, the effectiveness of unlearning is ensured, which means that the features of backdoor samples should be pushed away from the target class and remain similar to those of the corresponding clean samples. Second, the consistency of clean samples is maintained.
Unlearning Effectiveness:
The aim is to render the backdoor samples ineffective, which involves two parts. On the one hand, the features of the backdoored samples x ⊕ e are kept similar to those of the clean samples x. On the other hand, the features of the backdoor samples x ⊕ e are driven away from those of the target category t_j. It can be expressed as
$$\min\; \mathcal{L}_{\mathrm{ineffective}} = \sum_{x \in D} \left[\, -u_2 \cdot s\!\left(f_1(x),\, f_2(x \oplus e)\right) + u_3 \cdot s\!\left(f_2(x \oplus e),\, f_2(t_j)\right) \right] \tag{4}$$
Feature Consistency: To maintain the performance of the model, the features of clean samples are kept consistent before and after unlearning. It can be expressed as
$$\min\; \mathcal{L}_{\mathrm{consistency}} = -\sum_{x \in D} s\!\left(f_1(x),\, f_2(x)\right) \tag{5}$$
Also, the final optimization problem can be expressed as
$$\min\; \mathcal{L}_{\mathrm{Unlearning}} = \mathcal{L}_{\mathrm{ineffective}} + \mathcal{L}_{\mathrm{consistency}} \tag{6}$$
The algorithm flow in the stage of unlearning is detailed in Algorithm 2.
Algorithm 2 Stage 2: Trigger Unlearning for Backdoor Removal
Require: Initial model f_1 with parameters θ_1; trainable model f_2 with parameters θ_2 initialized as θ_1; dataset D; reference dataset R; custom trigger e; balancing parameters u_2, u_3; number of epochs N_2; learning rate η_θ; batch size B.
Ensure: Final model f_2 after backdoor removal.
1: for epoch = 1 to N_2 do
2:   for each batch c_batch in D do
3:     Randomly generate r_batch from R
4:     Generate backdoor samples p_batch ← c_batch ⊕ e
5:     Update parameters: θ_2^i = θ_2^{i-1} - η_θ ∇_{θ_2} L_Unlearning(c_batch, p_batch, r_batch)
6:   end for
7: end for
8: Output: f_2      ▹ Final model after backdoor mitigation
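Analogously, the Stage-2 objective can be sketched as follows; again, cosine similarity, batch averaging, and the default values u2 = 10 and u3 = 1 (taken from the ablation study in Section 4.2.3) are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def unlearning_loss(f1, f2, c_batch, p_batch, r_batch,
                    u2: float = 10.0, u3: float = 1.0) -> torch.Tensor:
    """L_Unlearning = L_ineffective + L_consistency for one batch (Algorithm 2, line 5).

    f1: frozen encoder from Stage 1, f2: trainable encoder being cleaned."""
    sim = lambda a, b: (a * b).sum(dim=1).mean()
    with torch.no_grad():
        clean_old = F.normalize(f1(c_batch), dim=1)
    clean_new = F.normalize(f2(c_batch), dim=1)
    poisoned  = F.normalize(f2(p_batch), dim=1)
    target    = F.normalize(f2(r_batch), dim=1)
    # Pull backdoored samples back toward their clean counterparts, push them away from the target class.
    l_ineffective = -u2 * sim(clean_old, poisoned) + u3 * sim(poisoned, target)
    # Keep the features of clean samples consistent with the Stage-1 model.
    l_consistency = -sim(clean_old, clean_new)
    return l_ineffective + l_consistency
```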

4. Experiments

Datasets: The following datasets were used for experimentation: CIFAR-10 [25] and ImageNet-100 [27]. The CIFAR-10 dataset contains 10 categories, with each category comprising 6000 32 × 32 RGB images, for a total of 60,000 images. The ImageNet-100 dataset is randomly selected from the ImageNet dataset and includes 100 categories, with each category comprising roughly 1100 images. These datasets cover a variety of category numbers and image complexities, providing an extensive testing environment for validating backdoor defense methods.
Attack methods: Tests were conducted on the following backdoor attack methods: SSL Backdoor, CTRL, ESTAS, and BadEncoder. SSL Backdoor and CTRL are data-poisoning-based backdoor attacks, while BadEncoder and ESTAS manipulate the training process. As these attack methods are highly representative, the effectiveness of our defense method can be thoroughly evaluated.
Backbone Architecture: The backbone architecture used for the CIFAR-10 dataset is ResNet18 [28], and the backbone architecture used for ImageNet-100 is EfficientNet V2 Small [29].
Defense methods: The method proposed in this paper was compared with several other methods to evaluate their effectiveness in downstream classification tasks. The comparison methods selected in this paper are CLP, I-BAU, and SSL_Cleanse. CLP is a data-free pruning-based backdoor removal method. I-BAU is based on trigger reconstruction and unlearns backdoors to remove them from downstream classifiers, which requires clean labeled data. SSL_Cleanse is another trigger-reconstruction-based method that unlearns backdoors; as an advanced backdoor removal method for SSL, it does not require labeled data. For CLP, its parameter u was set to 3. For I-BAU, SSL_Cleanse, and our BMAIU method, a clean dataset amounting to 10% of the training set was used for the defense process, and the number of unlearning epochs in both SSL_Cleanse and our BMAIU method was set to 20. In the active attack phase, the customized trigger was a randomly generated 4 × 4 patch for the CIFAR-10 dataset and a 16 × 16 patch for the ImageNet-100 dataset. During backdoor injection, these customized triggers were randomly embedded at arbitrary positions within the input images.
Metrics: To validate the backdoor defense methods, the following metrics were used: BA (Benign Accuracy), ASR (Attack Success Rate), and PA (Poison Accuracy). BA represents the classification accuracy of the model on clean samples. ASR indicates the attack success rate against the model, that is, the proportion of poisoned samples classified into the target category. PA represents the classification accuracy of the model on poisoned samples; a backdoor-free model is expected to have a high PA, indicating that the trigger is completely ignored. This metric is important because a successful backdoor defense should classify poisoned samples into their correct class, rather than simply avoiding the target class. Together, these metrics reflect how the model performs on both clean and backdoor samples, and how effective the defense methods are in different attack scenarios.
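The three metrics can be computed for a downstream classifier as in the sketch below; trigger_fn stands for whatever routine stamps the attacker's trigger onto a batch and is an assumed helper, as is the rest of the interface.

```python
import torch

@torch.no_grad()
def evaluate(classifier, images, labels, trigger_fn, target_class: int):
    """Return (BA, ASR, PA) for one evaluation batch.

    BA : accuracy on clean samples
    ASR: fraction of trigger-embedded samples predicted as the target class
    PA : accuracy on trigger-embedded samples (trigger ignored, correct class kept)"""
    clean_pred = classifier(images).argmax(dim=1)
    ba = (clean_pred == labels).float().mean().item()

    poisoned_pred = classifier(trigger_fn(images)).argmax(dim=1)
    asr = (poisoned_pred == target_class).float().mean().item()
    pa = (poisoned_pred == labels).float().mean().item()
    return ba, asr, pa
```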

4.1. Experimental Results

The proposed BMAIU method was compared with three other defense methods across various attack scenarios. The experimental results are presented in Table 3, Table 4 and Table 5. Specifically, Table 3 shows the results of single-target attacks on the CIFAR-10 dataset, while Table 4 presents the results of single-target attacks on the ImageNet-100 dataset. Additionally, Table 5 reports the results of multi-target attacks on the CIFAR-10 dataset. The results demonstrate that the proposed BMAIU method has a significant advantage: it leads to minimal accuracy loss while providing effective protection against various backdoor attacks. For example, on the CIFAR-10 dataset, our defense method reduces the average attack success rate (ASR) from 68.87% to 9.05%, while the BA loss is merely 1.74%. Even against stubborn backdoor attacks like BadEncoder and ESTAS, the ASR is reduced to below 15%. Similarly, on the ImageNet dataset, our method reduces the ASR significantly from 46.10% to below 2% with an average BA loss of 1.3%. In multi-target attacks, our defense method also performs well, reducing the average attack success rate from 89.63% to 11.42% with a BA loss of only 0.44%.
CLP produces promising effects against various backdoor attacks like SSL Backdoor, CTRL, and ESTAS on the CIFAR-10 dataset. However, such effectiveness is quite limited in the scenario of the BadEncoder attack. On the ImageNet dataset, the effectiveness of CLP is even more limited, indicating the poor performance of this method in adapting to different datasets. CLP aims to balance model performance with the effectiveness of backdoor attack defenses. However, experimental results show that even with a more significant compromise on model accuracy, its ASR remains higher compared to the proposed BMAIU method.
I-BAU is a trigger reconstruction-based method of backdoor removal for supervised learning. Experimental results demonstrate that this method has a significant advantage in reducing the success rate of backdoor attack. In some cases, it even outperforms the BMAIU method proposed in this paper. However, it also has some notable downsides. Firstly, compared to the BMAIU method, I-BAU affects model accuracy more significantly. Secondly, as a method of supervised learning, I-BAU relies on the use of labeled samples and requires the training of a downstream classifier. In contrast, our method of backdoor removal only involves unlabeled data, which means the need for a downstream classifier is eliminated.
SSL_Cleanse is an advanced backdoor removal method for SSL. However, its defensive effect is quite limited; even at the cost of a further compromise on model performance, it still fails to achieve the optimal defensive effect. Moreover, SSL_Cleanse requires synthesizing a trigger for each cluster individually. When there are many categories in the dataset, the number of clusters increases significantly, which results in substantial time costs and computational resource consumption. In contrast, our BMAIU method neither requires such per-cluster processing nor incurs increased time costs as the number of dataset categories grows.
To illustrate the feature distributions produced by different attack methods and the distributions after BMAIU defense, t-SNE [30] was used for data visualization. The feature distributions are shown in Figure 3. It can be observed that, following a backdoor attack, the backdoored samples cluster together in the feature space, which is the root cause of the backdoor behavior. When BMAIU is applied, the distribution of backdoor samples in the feature space becomes more uniform rather than highly clustered in a specific region. This indicates that our method effectively adjusts the feature distribution of backdoor samples, so that the backdoor attack is rendered ineffective.

4.2. Ablation Studies

4.2.1. Defense Effectiveness with Different Ratios of Dataset

The impact of varying dataset ratios on the defense performance of our BMAIU method was evaluated through the BadEncoder attack under the SimCLR training framework. A test was conducted at different dataset proportions: 1%, 5%, and 10%, as detailed in Table 6. It can be seen from the table that even with only 1% of the dataset available, our BMAIU method maintains the ASR at an extremely low level, with only a slight decrease found in BA and PA. These results underscore the robustness of BMAIU, demonstrating its effectiveness in defense even at the minimal level of data availability.

4.2.2. Defense Effectiveness on Different Architectures

To validate the generalizability of our method under different model architectures, the ESTAS attack was applied to various commonly used network architectures with different depths, including VGG11 [31], VGG16 [32], ResNet18 [28], ResNet34 [33], SENet18, and SENet34 [34]. The dataset used in this experiment is CIFAR-10, and the self-supervised learning method is SimCLR. The effectiveness of our BMAIU method was then evaluated against these backdoor attacks. The experimental results are shown in Table 7. Clearly, our BMAIU method maintains its effectiveness under different architectures, which confirms its generalizability.

4.2.3. Impact of Loss Terms

Our BMAIU defense method incorporates three key parameters, including u 1 , u 2 , and u 3 . To investigate their impact on the effectiveness of our defense method, we conduct controlled experiments under a fixed experimental setup. Specifically, we use the CIFAR-10 dataset and adopt SimCLR as the SSL method. The encoder architecture is based on ResNet-18, and the attack method is ESTAS. We systematically analyze how varying u 1 , u 2 , and u 3 affects the defense performance.
Parameter u 1 plays a critical role in the active attack stage, as it directly influences the effectiveness of backdoor injection. Therefore, we first conduct experiments to investigate the impact of u 1 on the success of the backdoor implantation. Specifically, we measure two types of feature similarities, and the similarity is measured using cosine similarity: (1) the similarity between the features of backdoored samples and their target class in model f 1 , denoted as BTS (Backdoor-to-Target Similarity), and (2) the similarity between the features of clean samples in model f 1 and the corresponding clean samples in the original model f 0 , denoted as CCS (Clean-to-Clean Similarity). These metrics respectively reflect the strength of backdoor injection and the degree to which the clean model performance is retained.
The experimental results are shown in Figure 4. As u 1 increases, the BTS values show limited improvement, indicating that the backdoor injection does not become significantly more effective. However, the CCS values exhibit a clear decline, suggesting a noticeable degradation in the performance of clean samples. Therefore, u 1 should not be set too high. In practice, setting u 1 = 1 is sufficient to achieve effective backdoor injection while maintaining acceptable performance on clean data.
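For reference, the two similarity scores can be computed as in the short sketch below; the batch-wise interface and the use of PyTorch's cosine_similarity are our assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def bts_ccs(f0, f1, clean_batch, poisoned_batch, target_batch):
    """BTS: mean cosine similarity between backdoored samples and target-class samples under f1.
    CCS: mean cosine similarity between f1 and f0 features of the same clean samples."""
    cos = lambda a, b: F.cosine_similarity(a, b, dim=1).mean().item()
    bts = cos(f1(poisoned_batch), f1(target_batch))
    ccs = cos(f1(clean_batch), f0(clean_batch))
    return bts, ccs
```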
Parameters u 2 and u 3 play a critical role in the unlearning stage. Their values directly affect the effectiveness of backdoor removal. To investigate their influence, we design a set of controlled experiments.
We first fix u 1 = 1 and u 3 = 1 and examine how varying u 2 impacts the defense performance. The experimental results are shown in Figure 5. As u 2 increases, the attack success rate (ASR) decreases significantly, indicating effective backdoor mitigation. In contrast, the benign accuracy (BA) remains relatively stable for small values of u 2 and only starts to decline noticeably when u 2 becomes large. These findings suggest that the model’s performance is relatively insensitive to changes in u 2 , while the ASR is highly sensitive. This implies that by gradually increasing u 2 , it is possible to remove backdoors effectively without severely compromising the model’s clean performance.
u 3 serves as an enhancement term for the unlearning trigger during the unlearning stage. With u 1 = 1 and u 2 = 10 fixed, we investigate how varying u 3 affects the defense performance. The experimental results are shown in Figure 6. As u 3 increases, the attack success rate (ASR) continues to decrease, demonstrating enhanced backdoor removal. Meanwhile, the benign accuracy (BA) initially remains stable, but starts to decline when u 3 becomes too large. These results indicate that a properly chosen u 3 can further suppress backdoors with minimal impact on clean performance. In practice, selecting a moderate value for u 3 offers a good trade-off between robustness and accuracy.
To investigate the role of each component in the unlearning process, concise names were assigned to the key components. The unlearning framework consists of three main components. First, the Feature Consistency part, corresponding to Equation (5) (denoted as L_0), ensures that the features of clean samples are maintained throughout the process. Next, the Unlearning Effectiveness part is further divided into two sub-components: the term in Equation (4) (denoted as L_1) that keeps the features of backdoor samples similar to those of clean samples, and the term (denoted as L_2) that drives the backdoor sample features away from the target class features. Table 8 presents the experimental results obtained by removing one component at a time, illustrating the impact of excluding each of these components (L_0, L_1, or L_2) on the overall performance of BMAIU.
The experimental results indicate that L_0 plays a crucial role in maintaining the accuracy of the model: when L_0 is absent, the model shows a significant decrease in BA. Both L_1 and L_2 contribute significantly to reducing the effectiveness of backdoor attacks. L_1 is the fundamental component of the unlearning process, while L_2 is a key enhancement term; without L_2, there is a sharp decline in the model's ability to mitigate backdoor attacks.

5. Conclusions

In this paper, a simple and efficient method named BMAIU is proposed to mitigate backdoors in SSL models. It is observed that, in the context of a backdoor attack with multiple triggers, unlearning one of the triggers affects the other triggers, rendering all triggers ineffective. On this basis, a backdoor mitigation method is proposed in which a backdoor is actively implanted using a custom trigger and then unlearned. The method is validated under various attack strategies, SSL methods, and backbone architectures, and is shown to effectively reduce the success rate of backdoor attacks while requiring only a small amount of clean data and without significantly compromising model accuracy.

6. Limitations and Future Work

While our proposed BMAIU method shows strong backdoor defense performance, the theoretical basis behind the observed phenomenon has not yet been fully established, and future work could explore deeper theoretical insights. Our experiments are based on standard datasets like CIFAR-10 and ImageNet-100, but the exploration of real-world scenarios remains limited. Future research could focus on validating our method in more practical and complex environments. Given the current limitations in existing research, where adaptive backdoor attacks in self-supervised learning remain largely unexplored, our study does not evaluate the performance of BMAIU against such adversaries. Therefore, a promising direction for future work is to examine whether the proposed method can be adapted or enhanced to remain effective under adaptive attack scenarios.

Author Contributions

Formal analysis, X.C.; methodology, F.Z.; software, J.L.; writing—review and editing, W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Self-Initiated Research Project “Research on Endogenous Security Basic Theory and Tool Chain” (Grant No. ZL042401) funded by the Jiangsu Provincial Department of Science and Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

We use a publicly available open source dataset.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no competing financial interests.

References

  1. Krishnan, R.; Rajpurkar, P.; Topol, E.J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. 2022, 6, 1346–1352. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, Y.; Albrecht, C.M.; Braham, N.A.A.; Mou, L.; Zhu, X.X. Self-supervised learning in remote sensing: A review. IEEE Geosci. Remote Sens. Mag. 2022, 10, 213–247. [Google Scholar] [CrossRef]
  3. Shurrab, S.; Duwairi, R. Self-supervised learning methods and applications in medical imaging analysis: A survey. PeerJ Comput. Sci. 2022, 8, e1045. [Google Scholar] [CrossRef] [PubMed]
  4. Goyal, P.; Mahajan, D.; Gupta, A.; Misra, I. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6391–6400. [Google Scholar]
  5. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  6. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  7. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
  8. Saha, A.; Tejankar, A.; Koohpayegani, S.A.; Pirsiavash, H. Backdoor attacks on self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13337–13346. [Google Scholar]
  9. Liu, H.; Jia, J.; Gong, N.Z. {PoisonedEncoder}: Poisoning the Unlabeled Pre-training Data in Contrastive Learning. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, USA, 10–12 August 2022; pp. 3629–3645. [Google Scholar]
  10. Li, C.; Pang, R.; Xi, Z.; Du, T.; Ji, S.; Yao, Y.; Wang, T. An embarrassingly simple backdoor attack on self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4367–4378. [Google Scholar]
  11. Xue, J.; Lou, Q. Estas: Effective and stable trojan attacks in self-supervised encoders with one target unlabelled sample. arXiv 2022, arXiv:2211.10908. [Google Scholar]
  12. Li, Y.; Lyu, X.; Koren, N.; Lyu, L.; Li, B.; Ma, X. Anti-backdoor learning: Training clean models on poisoned data. Adv. Neural Inf. Process. Syst. 2021, 34, 14900–14912. [Google Scholar]
  13. Huang, K.; Li, Y.; Wu, B.; Qin, Z.; Ren, K. Backdoor defense via decoupling the training process. arXiv 2022, arXiv:2202.03423. [Google Scholar]
  14. Chen, W.; Wu, B.; Wang, H. Effective backdoor defense by exploiting sensitivity of poisoned samples. Adv. Neural Inf. Process. Syst. 2022, 35, 9727–9737. [Google Scholar]
  15. Wang, B.; Yao, Y.; Shan, S.; Li, H.; Viswanath, B.; Zheng, H.; Zhao, B.Y. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 707–723. [Google Scholar]
  16. Zheng, R.; Tang, R.; Li, J.; Liu, L. Data-free backdoor removal based on channel lipschitzness. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 175–191. [Google Scholar]
  17. Zeng, Y.; Chen, S.; Park, W.; Mao, Z.M.; Jin, M.; Jia, R. Adversarial unlearning of backdoors via implicit hypergradient. arXiv 2021, arXiv:2110.03735. [Google Scholar]
  18. Jia, J.; Liu, Y.; Gong, N.Z. Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 26 May 2022; pp. 2043–2059. [Google Scholar]
  19. Tejankar, A.; Sanjabi, M.; Wang, Q.; Wang, S.; Firooz, H.; Pirsiavash, H.; Tan, L. Defending against patch-based backdoor attacks on self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12239–12249. [Google Scholar]
  20. Pan, M.; Zeng, Y.; Lyu, L.; Lin, X.; Jia, R. {ASSET}: Robust backdoor data detection across a multiplicity of deep learning paradigms. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 2725–2742. [Google Scholar]
  21. Zheng, M.; Xue, J.; Wang, Z.; Chen, X.; Lou, Q.; Jiang, L.; Wang, X. Ssl-cleanse: Trojan detection and mitigation in self-supervised learning. arXiv 2023, arXiv:2303.09079. [Google Scholar]
  22. Feng, S.; Tao, G.; Cheng, S.; Shen, G.; Xu, X.; Liu, Y.; Zhang, K.; Ma, S.; Zhang, X. Detecting backdoors in pre-trained encoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16352–16362. [Google Scholar]
  23. Cinà, A.E.; Grosse, K.; Demontis, A.; Vascon, S.; Zellinger, W.; Moser, B.A.; Oprea, A.; Biggio, B.; Pelillo, M.; Roli, F. Wild patterns reloaded: A survey of machine learning security against training data poisoning. ACM Comput. Surv. 2023, 55, 1–39. [Google Scholar] [CrossRef]
  24. Qiao, X.; Yang, Y.; Li, H. Defending neural backdoors via generative distribution modeling. Adv. Neural Inf. Process. Syst. 2019, 32, 1–10. [Google Scholar]
  25. Ho-Phuoc, T. CIFAR10 to compare visual recognition performance between deep neural networks and humans. arXiv 2018, arXiv:1811.07270. [Google Scholar]
  26. Saha, A.; Subramanya, A.; Pirsiavash, H. Hidden trigger backdoor attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11957–11965. [Google Scholar]
  27. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  28. Chen, Z.; Jiang, Y.; Zhang, X.; Zheng, R.; Qiu, R.; Sun, Y.; Zhao, C.; Shang, H. ResNet18DNN: Prediction approach of drug-induced liver injury by deep neural network with ResNet18. Briefings Bioinform. 2022, 23, bbab503. [Google Scholar] [CrossRef] [PubMed]
  29. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  30. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  31. Iglovikov, V.; Shvets, A. Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv 2018, arXiv:1801.05746. [Google Scholar]
  32. Theckedath, D.; Sedamkar, R. Detecting affect states using VGG16, ResNet50 and SE-ResNet50 networks. SN Comput. Sci. 2020, 1, 79. [Google Scholar] [CrossRef]
  33. Koonce, B.; Koonce, B. ResNet 34. Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: New York, NY, USA, 2021; pp. 51–61. [Google Scholar]
  34. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Figure 1. Overview of BMAIU. BMAIU involves two stages. In the first stage, a target class is randomly selected, and a custom trigger is used to implant a backdoor into the model f 0 , which leads to the intermediate transition model f 1 . In the second stage, the model is trained to unlearn the custom trigger, which leads to the clean model f 2 .
Figure 2. The PL1-norm value obtained for the models unlearned with each synthesized trigger (ST). The green dashed line represents the PL1-norm value of the original backdoored model, while the red dashed line represents the PL1-norm value of the clean model. Clearly, unlearning any synthesized trigger (ST) leads to a significant increase in the PL1-norm value, indicating a considerable improvement in the backdoor robustness of the model.
Figure 3. Embedding space distributions. The black color labeled as −1 represents the backdoored samples, while the ten other colors, corresponding to labels 0 through 9, represent clean samples from ten different categories. Panels (a–d) show the feature distributions after the SSL_Backdoor, CTRL, BadEncoder, and ESTAS attacks, respectively. Panels (e–h) show the corresponding feature distributions after applying the BMAIU backdoor mitigation method. The figures show that the backdoored model tends to cluster backdoored samples in the feature space. After applying our BMAIU backdoor mitigation method, the clustering degree of backdoor samples is significantly reduced, and they tend to be more dispersed in the feature space.
Figure 4. Line plot illustrating the impact of varying the balancing parameter u 1 on the Backdoor-to-Target Similarity (BTS) and Clean-to-Clean Similarity (CCS).
Figure 5. Impact of u 2 on Defense Performance in Terms of BA, ASR, and PA.
Figure 6. Impact of u 3 on Defense Performance in Terms of BA, ASR, and PA.
Table 1. The attack success rate (%) for each class after each type of synthesized trigger (ST) is unlearned separately.
| Attack Targets | None Defense | ST0 | ST1 | ST2 | ST3 | ST4 | ST5 | ST6 | ST7 | ST8 | ST9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Target 0 | 76.65 | 19.39 | 20.84 | 6.00 | 5.16 | 14.77 | 25.2 | 18.02 | 15.2 | 25.83 | 0.89 |
| Target 1 | 98.73 | 9.92 | 17.2 | 20.38 | 21.96 | 28.72 | 17.59 | 24.51 | 12.11 | 16.57 | 13.06 |
| Target 2 | 96.95 | 25.49 | 19.46 | 11.18 | 22.13 | 30.61 | 0.00 | 8.72 | 12.95 | 0.00 | 0.00 |
| Target 3 | 89.83 | 4.32 | 14.47 | 11.58 | 19.69 | 0.25 | 8.63 | 6.89 | 0.00 | 4.18 | 8.64 |
| Target 4 | 75.97 | 17.21 | 5.06 | 30.31 | 11.68 | 1.19 | 14.72 | 3.27 | 2.18 | 12.35 | 6.69 |
| Target 5 | 98.29 | 12.95 | 0.91 | 12.10 | 0.00 | 21.46 | 14.59 | 9.19 | 17.46 | 26.8 | 58.95 |
| Target 6 | 87.88 | 13.28 | 23.01 | 19.85 | 27.54 | 25.77 | 21.02 | 29.88 | 38.43 | 15.82 | 5.82 |
| Target 7 | 91.22 | 5.24 | 16.62 | 0.00 | 8.85 | 8.76 | 11.92 | 16.14 | 12.76 | 10.09 | 13.95 |
| Target 8 | 84.35 | 10.85 | 9.61 | 13.15 | 17.18 | 12.19 | 4.71 | 11.11 | 16.92 | 11.86 | 8.09 |
| Target 9 | 96.49 | 25.93 | 20.86 | 23.62 | 27.49 | 17.75 | 28.95 | 21.72 | 16.91 | 30.83 | 14.97 |
| AVE | 89.63 | 14.45 | 14.80 | 14.81 | 16.16 | 16.14 | 14.73 | 14.94 | 14.49 | 15.43 | 13.10 |
Table 2. The attack success rate (%) for each class in the model after each type of genuine trigger (AT) is unlearned separately.
| Attack Target | None Defense | AT0 | AT1 | AT2 | AT3 | AT4 | AT5 | AT6 | AT7 | AT8 | AT9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Target 0 | 76.65 | 11.41 | 10.44 | 11.26 | 11.42 | 14.01 | 11.43 | 10.4 | 11.14 | 10.26 | 11.98 |
| Target 1 | 98.73 | 9.41 | 9.72 | 10.01 | 9.09 | 9.55 | 10.10 | 9.59 | 9.85 | 9.92 | 9.97 |
| Target 2 | 96.95 | 8.06 | 10.16 | 7.96 | 8.35 | 5.95 | 9.53 | 12.56 | 8.06 | 8.58 | 12.65 |
| Target 3 | 89.83 | 7.76 | 10.29 | 7.33 | 5.10 | 7.44 | 7.15 | 6.98 | 12.24 | 7.6 | 5.02 |
| Target 4 | 75.97 | 12.44 | 8.67 | 12.21 | 9.13 | 13.92 | 9.78 | 9.58 | 10.95 | 10.59 | 8.35 |
| Target 5 | 98.29 | 8.48 | 7.46 | 10.64 | 12.31 | 8.15 | 8.31 | 8.95 | 5.54 | 11.8 | 9.28 |
| Target 6 | 87.88 | 11.34 | 12.74 | 11.82 | 14.67 | 12.25 | 13.78 | 13.08 | 12.20 | 11.95 | 12.38 |
| Target 7 | 91.22 | 9.67 | 10.43 | 9.01 | 9.14 | 9.92 | 9.78 | 8.98 | 9.68 | 8.6 | 10.43 |
| Target 8 | 84.35 | 10.80 | 9.71 | 9.78 | 9.70 | 9.23 | 10.48 | 9.75 | 10.65 | 10.7 | 9.69 |
| Target 9 | 96.49 | 10.62 | 10.13 | 10.00 | 11.17 | 10.48 | 10.16 | 10.21 | 10.00 | 9.67 | 9.62 |
| AVE | 89.63 | 9.99 | 9.97 | 10.00 | 10.01 | 10.09 | 10.05 | 10.01 | 10.03 | 9.96 | 9.93 |
Table 3. The effectiveness of our BMAIU method and three other defense methods with 10% of the training data against four attack methods on CIFAR10 with ResNet-18 as the backbone. All values are percentages (%).
| SSL Methods | Attack Methods | None Defense (BA / ASR / PA) | CLP (BA / ASR / PA) | I-BAU (BA / ASR / PA) | SSL_Cleanse (BA / ASR / PA) | BMAIU (BA / ASR / PA) |
| --- | --- | --- | --- | --- | --- | --- |
| SimCLR | SSL_Backdoor | 77.19 / 16.00 / 66.56 | 66.22 / 9.13 / 52.77 | 72.83 / 2.16 / 67.47 | 71.62 / 2.58 / 68.63 | 78.05 / 1.17 / 75.44 |
| SimCLR | CTRL | 74.94 / 89.40 / 9.31 | 65.05 / 15.94 / 52.08 | 65.94 / 0.33 / 58.03 | 66.92 / 44.63 / 5.31 | 72.15 / 11.63 / 60.76 |
| SimCLR | BadEncoder | 77.67 / 59.98 / 32.66 | 63.48 / 27.98 / 36.56 | 69.37 / 1.17 / 62.84 | 70.54 / 21.69 / 54.77 | 76.26 / 13.68 / 66.47 |
| SimCLR | ESTAS | 79.66 / 97.46 / 1.66 | 65.73 / 12.56 / 55.3 | 73.15 / 2.01 / 68.15 | 74.6 / 33.17 / 24.8 | 77.1 / 8.4 / 67.37 |
| BYOL | SSL_Backdoor | 72.42 / 18.03 / 61.59 | 58.11 / 8.23 / 44.4 | 45.90 / 1.01 / 44.81 | 59.41 / 11.92 / 54.27 | 70.26 / 1.42 / 68.19 |
| BYOL | CTRL | 73.12 / 89.79 / 3.48 | 66.26 / 1.33 / 63.32 | 51.30 / 0.44 / 48.36 | 54.38 / 43.53 / 63.12 | 71.21 / 14.3 / 59.59 |
| BYOL | BadEncoder | 72.95 / 87.83 / 9.92 | 61.87 / 68.42 / 18.42 | 52.29 / 0.47 / 47.11 | 54.33 / 62.94 / 16.51 | 66.85 / 10.24 / 56.89 |
| BYOL | ESTAS | 71.70 / 92.49 / 5.76 | 67.29 / 6.39 / 56.72 | 54.34 / 1.07 / 52.53 | 54.8 / 29.68 / 35.33 | 69.08 / 11.52 / 60.16 |
| AVE | | 74.36 / 68.87 / 23.86 | 64.25 / 18.74 / 47.44 | 60.64 / 1.08 / 56.17 | 63.33 / 31.26 / 40.34 | 72.62 / 9.05 / 64.36 |
Table 4. The effectiveness of our BMAIU method and 3 other defense methods with 10% of the training data against 3 attack methods on Imagenet-100 with EfficientNet V2 as the backbone. All values are percentages (%).
| Attack Methods | None Defense (BA / ASR / PA) | CLP (BA / ASR / PA) | I-BAU (BA / ASR / PA) | SSL_Cleanse (BA / ASR / PA) | BMAIU (BA / ASR / PA) |
| --- | --- | --- | --- | --- | --- |
| SSL_Backdoor | 59.39 / 38.22 / 35.18 | 54.82 / 32.83 / 33.88 | 52.45 / 2.42 / 45.11 | 51.00 / 26.66 / 34.94 | 58.16 / 0.19 / 51.86 |
| CTRL | 58.22 / 38.98 / 36.80 | 53.23 / 19.46 / 42.41 | 51.98 / 0.72 / 45.99 | 50.83 / 17.93 / 45.82 | 57.32 / 3.33 / 53.89 |
| BadEncoder | 60.03 / 61.11 / 17.73 | 48.99 / 35.19 / 18.98 | 51.06 / 1.55 / 47.30 | 51.16 / 14.02 / 38.24 | 58.24 / 1.23 / 52.65 |
| AVE | 59.21 / 46.10 / 29.90 | 52.34 / 29.16 / 31.75 | 51.83 / 1.56 / 46.13 | 60.72 / 14.87 / 39.67 | 57.91 / 1.58 / 52.80 |
Table 5. The effectiveness of our BMAIU method and three other defense methods with 10% of the training data against multi-target attack on CIFAR10 with ResNet-18 as the backbone. All values are percentages (%).
| Attack Target | None Defense (ASR / PA) | CLP (ASR / PA) | I-BAU (ASR / PA) | SSL_Cleanse (ASR / PA) | BMAIU (ASR / PA) |
| --- | --- | --- | --- | --- | --- |
| Target 0 | 76.65 / 26.87 | 62.61 / 37.10 | 2.81 / 61.37 | 18.29 / 55.84 | 14.71 / 77.48 |
| Target 1 | 98.73 / 10.73 | 93.06 / 15.01 | 11.34 / 60.26 | 14.91 / 53.72 | 12.21 / 75.27 |
| Target 2 | 96.95 / 11.92 | 92.08 / 14.24 | 17.96 / 61.11 | 19.91 / 55.12 | 13.16 / 75.21 |
| Target 3 | 89.83 / 14.20 | 34.01 / 23.97 | 0.42 / 60.85 | 10.79 / 57.27 | 7.24 / 77.85 |
| Target 4 | 75.97 / 23.35 | 74.95 / 28.33 | 14.23 / 60.48 | 7.92 / 55.38 | 9.33 / 77.71 |
| Target 5 | 98.29 / 11.15 | 93.01 / 15.17 | 18.94 / 61.41 | 7.26 / 57.00 | 13.05 / 78.62 |
| Target 6 | 87.88 / 19.5 | 73.82 / 28.97 | 3.82 / 61.23 | 17.49 / 55.79 | 12.76 / 77.90 |
| Target 7 | 91.22 / 14.24 | 79.14 / 20.23 | 9.92 / 61.99 | 9.55 / 56.93 | 10.29 / 78.13 |
| Target 8 | 84.35 / 21.18 | 70.59 / 32.18 | 14.87 / 61.50 | 11.45 / 57.34 | 9.67 / 78.77 |
| Target 9 | 96.49 / 12.39 | 93.86 / 14.09 | 9.68 / 61.64 | 16.51 / 56.03 | 11.76 / 77.52 |
| AVE | 89.63 / 16.55 | 76.71 / 22.92 | 10.39 / 61.15 | 13.40 / 56.04 | 11.42 / 77.44 |
| BA | 82.71 | 79.09 | 64.97 | 59.86 | 82.27 |
Table 6. The effectiveness of defense by our BMAIU method on the datasets with different data ratios against the BadEncoder attack. All the values are expressed in percentage (%).
| Data Ratio | CIFAR10 (BA / ASR / PA) | ImageNet (BA / ASR / PA) |
| --- | --- | --- |
| 1% | 74.51 / 12.52 / 63.55 | 55.80 / 1.58 / 51.09 |
| 5% | 74.93 / 12.20 / 64.90 | 57.07 / 2.28 / 52.10 |
| 10% | 76.26 / 13.68 / 66.47 | 58.24 / 1.23 / 52.65 |
Table 7. Our BMAIU method performs well under different model architectures. Experimental results demonstrate that even against stubborn attacks like Estas, our BMAIU method still performs well in defense. All the values are expressed in percentage (%).
| Backbones | Metrics | Before Mitigation | After Mitigation |
| --- | --- | --- | --- |
| VGG11 | BA | 70.77 | 69.11 |
|  | ASR | 51.22 | 9.49 |
|  | PA | 36.02 | 60.51 |
| VGG16 | BA | 67.99 | 67.55 |
|  | ASR | 77.17 | 11.39 |
|  | PA | 18.29 | 60.23 |
| ResNet18 | BA | 79.66 | 77.1 |
|  | ASR | 97.46 | 8.4 |
|  | PA | 1.66 | 67.37 |
| ResNet34 | BA | 77.79 | 74.44 |
|  | ASR | 84.74 | 6.38 |
|  | PA | 11.15 | 70.14 |
| SENet18 | BA | 79.82 | 75.73 |
|  | ASR | 73.73 | 4.88 |
|  | PA | 23.4 | 72.69 |
| SENet34 | BA | 80.17 | 77.40 |
|  | ASR | 74.84 | 5.19 |
|  | PA | 22.02 | 73.72 |
Table 8. Effectiveness of our BMAIU method after removal of each component in the unlearning process. All the values are expressed in percentage (%).
| Removed Loss Terms | BA | ASR | PA |
| --- | --- | --- | --- |
| L_0 | 72.22 | 10.52 | 52.76 |
| L_1 | 74.26 | 19.57 | 44.77 |
| L_2 | 76.99 | 22.94 | 43.71 |
| None | 77.1 | 8.4 | 67.37 |