4.1. Experimental Environment and Dataset
The experiments were conducted in a desktop environment with an Intel Core i5-14500 CPU, an Nvidia GeForce RTX 4070 Ti SUPER GPU, and 64 GB of RAM.
The autoencoder was first trained on the CIFAR-10 dataset. The images were resized, and the preprocessing steps included a RandomCrop operation with padding of 4, RandomHorizontalFlip with its default settings, and Normalize with default parameters. Training used the Adam optimizer with a learning rate of 0.0001 for 20 epochs, with Mean Squared Error (MSE) as the loss function.
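The following is a minimal sketch of this training setup in PyTorch. The autoencoder architecture, crop size, and normalization statistics are illustrative assumptions, not the paper's exact implementation; only the optimizer, learning rate, epoch count, and MSE loss follow the description above.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T


class ConvAutoencoder(nn.Module):
    # Illustrative architecture; the paper's exact autoencoder design is not specified here.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


# Preprocessing as described: RandomCrop with padding 4, RandomHorizontalFlip,
# and Normalize (the crop size and normalization statistics are assumptions).
transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                          download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
autoencoder = ConvAutoencoder().to(device)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)  # lr = 0.0001, as described
criterion = nn.MSELoss()

autoencoder.train()
for epoch in range(20):  # 20 epochs, as described
    for images, _ in train_loader:
        images = images.to(device)
        loss = criterion(autoencoder(images), images)  # MSE reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```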
Following this, a classification model was fine-tuned on the CIFAR-10 dataset using an ImageNet-pretrained model from the torchvision.models library. The dataset preprocessing steps were the same as those used for the autoencoder. The classification model was trained with the Adam optimizer (learning rate of 0.0001) for 20 epochs, using cross-entropy as the loss function.
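A minimal fine-tuning sketch, using Resnet18 as an example, is shown below. How the final layer was replaced for 10 classes is an assumption of this sketch; the optimizer, learning rate, epochs, and cross-entropy loss follow the description above.

```python
import torch
import torch.nn as nn
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"

# ImageNet-pretrained backbone adapted to CIFAR-10's 10 classes; how the final
# layer was replaced in the paper is an assumption of this sketch.
model = torchvision.models.resnet18(
    weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr = 0.0001
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(20):  # 20 epochs
    # train_loader: the CIFAR-10 loader from the previous sketch (same preprocessing).
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        loss = criterion(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```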
Table 3 shows that Mobilenetv2 achieved a classification performance of 93.76%, Resnet18 achieved 94.98%, and VGG16 achieved 92.85%.
Subsequently, the trained classification model and autoencoder were combined and further trained as an ensemble model for Label Sanitization. This combined model was trained without freezing the layers, and the dataset pre-processing steps remained consistent with those of the earlier stages. The dataset used for training the ensemble model consisted of both original examples and adversarial examples. Adversarial examples were generated using two attack methods: the FGSM and PGD. The FGSM introduced less sophisticated pixel perturbations, while PGD employed more sophisticated perturbations.
For the ensemble model used in the FGSM version, the training data included 50,000 original training samples and 50,000 FGSM-adversarial training samples. Similarly, for the PGD version, the training data consisted of 50,000 original training samples and 50,000 PGD-adversarial training samples. The parameters for these attacks were configured according to the settings specified in Table 4. Adversarial attacks were implemented using the torchattacks library.
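A hedged sketch of how such adversarial training samples could be generated with torchattacks is shown below. The attack objects use the library defaults (the concrete values are those in Table 4); `model` refers to the fine-tuned classifier from the previous sketch.

```python
import torchattacks

# Attack objects with the torchattacks default parameters (see Table 4);
# `model` is the fine-tuned classifier from the previous sketch.
fgsm = torchattacks.FGSM(model)
pgd = torchattacks.PGD(model)

# Note: torchattacks expects inputs in [0, 1]; depending on the normalization used,
# attacks may need to be applied to unnormalized images. That handling is omitted here.
fgsm_adv, pgd_adv = [], []
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    fgsm_adv.append(fgsm(images, labels).cpu())
    pgd_adv.append(pgd(images, labels).cpu())
```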
The loss function was calculated as a total loss combining the cross-entropy used for training the classification model and the MSE used for training the autoencoder. The Adam optimizer, with a learning rate of 0.0001, was used to jointly train the autoencoder and the classification model. The optimizer was applied to the trainable parameters of both models, including the encoder of the autoencoder and the feature extractor and classifier of the classification model; only parameters with requires_grad = True were selected for optimization to ensure efficient learning. The ensemble model was trained for 20 epochs, and a StepLR scheduler was applied to the optimizer with a step size of 10 and a gamma value of 0.1.
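The sketch below illustrates this joint training loop. The `ensemble_loader` over the mixed original and adversarial set is hypothetical, and feeding the classifier the autoencoder's reconstruction is an assumption of this sketch; the combined loss, optimizer setup, and StepLR schedule follow the description above.

```python
import itertools
import torch
import torch.nn as nn

# Joint optimizer over the trainable parameters of both models; only parameters
# with requires_grad = True are selected, as described above.
params = [p for p in itertools.chain(autoencoder.parameters(), model.parameters())
          if p.requires_grad]
optimizer = torch.optim.Adam(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

ce_loss = nn.CrossEntropyLoss()
mse_loss = nn.MSELoss()

for epoch in range(20):
    # ensemble_loader: hypothetical loader over the mixed original + adversarial set.
    for images, labels in ensemble_loader:
        images, labels = images.to(device), labels.to(device)
        reconstructed = autoencoder(images)
        logits = model(reconstructed)
        # Total loss: cross-entropy for the classifier plus MSE for the autoencoder.
        # Feeding the classifier the reconstruction is an assumption of this sketch.
        loss = ce_loss(logits, labels) + mse_loss(reconstructed, images)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```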
To determine the threshold, a default value was established through experiments. The Resnet18 classification model and the Resnet18-FGSM ensemble model were used, and the silhouette coefficient distributions for five types of adversarial attacks were visualized. The distribution of the silhouette coefficients can be observed in Figure 4. Most original images had silhouette coefficients close to 0, while adversarial examples had coefficients close to 1. Therefore, the default value of the threshold was set to 0.
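The sketch below shows how a per-sample silhouette coefficient can be thresholded at 0. How the clustered features and their labels are built from the Grad-CAM representations is an assumption of this sketch; only the decision rule (coefficient above 0 is treated as adversarial) follows the description above.

```python
from sklearn.metrics import silhouette_samples


def is_adversarial(features, cluster_labels, sample_index, threshold=0.0):
    """Flag a sample as adversarial when its silhouette coefficient exceeds the threshold.

    `features` (an (N, D) array) and `cluster_labels` stand in for the clustering of
    Grad-CAM-derived representations used by the detection logic; how they are built
    is an assumption of this sketch.
    """
    scores = silhouette_samples(features, cluster_labels)
    # Original images cluster near 0 and adversarial examples near 1, so 0 is the default cut-off.
    return scores[sample_index] > threshold
```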
The attack success rates and parameters of the adversarial attacks are summarized in Table 4. The parameters of the adversarial attacks were set to the default values provided by torchattacks. The attack success rates were derived by applying each attack to the original examples that the classification models had correctly classified. The attack success rates of PGD, C&W, and BIM were high; these rates can vary depending on the parameter settings.
4.3. Adversarial Attack Detection Performance
In this paper, we used FGSM, PGD, C&W, DeepFool, and BIM attacks. We applied each attack to the 10,000 images in the CIFAR-10 test set and evaluated the ability to simultaneously detect the successfully attacked images and correctly pass the original test images.
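The detection metrics reported below (recall, F1-score, accuracy) can be computed over the combined set of original and successfully attacked images as in the small sketch below; the label vectors are placeholders standing in for the detector's output.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 1 = successfully attacked image, 0 = original test image; the values below are
# placeholders standing in for the detector's decisions over the combined test set.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0]

print("recall:", recall_score(y_true, y_pred))
print("f1-score:", f1_score(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
```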
Table 5 summarizes the experimental results of the proposed method. When applying the detection method to Resnet18, Mobilenetv2, and VGG16, the three models showed similar performance. For the FGSM, Mobilenetv2-PGD showed the highest recall at 0.9339, but Resnet18-FGSM demonstrated a better F1-score and accuracy, indicating a more balanced detection capability. For the BIM and PGD attacks, which are similar to the FGSM, the method proved effective, with an F1-score and accuracy above 0.9. Resnet18-FGSM performed best, with an F1-score of 0.9451 and accuracy of 0.9457 for BIM, and an F1-score of 0.9431 and accuracy of 0.9438 for PGD. For C&W and DeepFool attacks, which create perturbations that are difficult for humans to perceive, the heavyweight model VGG16 proved more effective, unlike the results obtained for FGSM-type attacks. VGG16-FGSM achieved the best results, with an F1-score of 0.9291 and accuracy of 0.9367 for DeepFool, and an F1-score of 0.9324 and accuracy of 0.9345 for C&W. These experimental results demonstrate that lightweight models are more effective in detecting FGSM-type attacks, while heavyweight models are more effective in detecting DeepFool and C&W attacks. These results are expected to vary depending on the adversarial examples used in training, and it is anticipated that building an ensemble model using examples applicable to various attacks would enable more effective detection.
4.4. Comparative Experiment
The comparison experiments included the UNICAD, Feature Squeezing, and Zero-Mean & RGB2BGR methods [11,12,13].
The Feature Squeezing method detects adversarial examples by filtering images before passing them to the classification model, without modifying the existing model. It detects attacks by comparing the model's prediction for the original example with its prediction for the filtered example. This method does not aim to completely block adversarial attacks, but rather to make it more difficult for the attacker to execute the attack. In other words, the attacker may attempt to apply stronger perturbations to bypass Feature Squeezing, which could degrade image quality, making the image unrecognizable or turning it into an image that appears suspicious. The filtering techniques used include bit depth reduction, median filtering, and non-local means filtering. For the CIFAR-10 dataset, the best performance was achieved using 5-bit depth reduction, a median filter, and a 13-3-2 configuration for non-local means. The median filtering and non-local means were carried out using the ndimage module and the OpenCV library, respectively.
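The sketch below illustrates the comparison step of Feature Squeezing under these assumptions: images are HxWxC floats in [0, 1], the score is the maximum L1 distance between the prediction on the raw image and the predictions on its squeezed versions, and the "13-3-2" non-local means configuration is mapped to a 13x13 search window, 3x3 template, and filter strength 2 (an assumption of this sketch). A score above a chosen threshold would flag the input as adversarial.

```python
import cv2
import numpy as np
import torch
from scipy import ndimage


def bit_depth_reduce(img, bits=5):
    # Quantize an HxWxC float image in [0, 1] to the given bit depth.
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels


def median_smooth(img, size=2):
    # Per-channel median filtering via scipy.ndimage.
    return ndimage.median_filter(img, size=(size, size, 1))


def non_local_means(img, search=13, template=3, strength=2):
    # OpenCV non-local means; mapping the "13-3-2" configuration onto these
    # arguments is an assumption of this sketch.
    img_u8 = (img * 255).astype(np.uint8)
    out = cv2.fastNlMeansDenoisingColored(img_u8, None, strength, strength,
                                          template, search)
    return out.astype(np.float32) / 255.0


def feature_squeezing_score(model, img, device="cuda"):
    # Maximum L1 distance between the prediction on the raw image and the predictions
    # on its squeezed versions; a large score suggests an adversarial example.
    def predict(x):
        t = torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float().to(device)
        return torch.softmax(model(t), dim=1).detach().cpu().numpy()

    p_raw = predict(img)
    squeezed = (bit_depth_reduce(img), median_smooth(img), non_local_means(img))
    return max(np.abs(p_raw - predict(s)).sum() for s in squeezed)
```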
The Zero-Mean method adjusts the brightness of the input image, extracts an activation map using Grad-CAM, and overlays the activation map onto the input image with a weighting factor of 0.1 to create an emphasized image. The classification model then predicts both the input image and the emphasized image. If the predicted labels differ, the input is identified as an adversarial example; if they match, it is classified as an original example. The RGB2BGR method converts the RGB values of the input image to BGR and proceeds with detection in the same way as the Zero-Mean approach. Both detection methods have the limitation that they degrade the performance of the classification model by modifying the image, and they fail to detect attacks when the activation maps of the original example and the adversarial example are very similar.
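A hedged sketch of this detection rule is shown below. The `grad_cam_fn` helper is hypothetical, and subtracting the mean as the brightness adjustment is an assumption; only the 0.1 overlay weight and the label-mismatch decision follow the description above.

```python
def zero_mean_detect(model, image, grad_cam_fn, alpha=0.1):
    """Zero-Mean style detection sketch for a single CHW image tensor.

    `grad_cam_fn(model, image)` is a hypothetical helper returning a Grad-CAM
    activation map with the same spatial size as `image`; subtracting the mean as
    the brightness adjustment is an assumption of this sketch.
    """
    adjusted = image - image.mean()
    cam = grad_cam_fn(model, adjusted)
    emphasized = image + alpha * cam                        # overlay weight 0.1
    pred_input = model(image.unsqueeze(0)).argmax(dim=1)
    pred_emph = model(emphasized.unsqueeze(0)).argmax(dim=1)
    return bool(pred_input.item() != pred_emph.item())      # mismatch -> adversarial


def rgb2bgr_detect(model, image, grad_cam_fn, alpha=0.1):
    # Same procedure, but the input channels are swapped from RGB to BGR first.
    return zero_mean_detect(model, image[[2, 1, 0], :, :], grad_cam_fn, alpha)
```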
UNICAD uses the VGG16 model as a feature extractor and selects class-specific prototypes from the CIFAR-10 dataset based on the extracted features. It extracts features from the input images and compares them with the prototypes selected for each class to find the most similar prototype. The method then classifies the images as attack images or clean images according to its detection logic. This detection method determines the presence of an attack through a comparison with pre-selected prototypes. However, it relies on a Denoising Autoencoder in the second, restoration stage, which may not allow for accurate detection: all suspicious examples from the first, detection stage are passed to the restoration stage, where they are restored by the Denoising Autoencoder. Therefore, if only the detection aspect is measured, the detection performance may be lower. Although the comparative paper reduced the CIFAR-10 classes to 0–8 and treated the 9th class as an unseen class, we used classes 0–9 for comparison in this paper and conducted experiments focusing only on the detection aspect.
The evaluation procedure and the evaluation dataset were kept the same in our experimental environment. The comparative experiments evaluated the performance of each model individually.
Table 6 presents a comparison of the performance of Mobilenetv2. The most effective method for detecting all attacks was Mobilenetv2-PGD.
The Feature Squeezing method demonstrated a strong performance in detecting BIM, PGD, and DeepFool attacks, achieving recall rates of 0.9931, 0.8673, and 0.8532, respectively. This indicates that filtering techniques effectively mitigate the perturbations caused by these attacks. However, the method was less effective against FGSM and C&W attacks. This is because the FGSM and C&W introduce more perturbations to the images, exposing one limitation of Feature Squeezing: its performance degrades when faced with significant perturbations.
The Zero-Mean and RGB2BGR methods demonstrated relatively good performance in detecting DeepFool and C&W attacks. Compared to other attacks, DeepFool and C&W involve less image perturbation, allowing for effective detection after preprocessing with Zero-Mean and RGB2BGR and generating the emphasized image. However, attacks that cause significant image perturbation, such as the FGSM, PGD, and BIM, were not mitigated by the Zero-Mean or RGB2BGR preprocessing. Furthermore, the generated emphasized images were not effective in detecting these highly distorted attacks.
UNICAD was presumed to have failed to distinguish between original and adversarial examples during the detection process due to its reliance on the Denoising Autoencoder in the reconstruction stage, which was previously mentioned as a limitation. This dependency likely led it to pass suspicious images on without ensuring sufficient detection accuracy.
Table 7 presents a comparison of Resnet18's performance. For FGSM-type attacks (FGSM, PGD, and BIM), Resnet18-FGSM showed the most effective detection performance, with F1-scores of 0.9175, 0.9431, and 0.9451 and accuracies of 0.9266, 0.9438, and 0.9457, respectively. For DeepFool and C&W attacks, Resnet18-PGD was the most effective in terms of detection.
The Feature Squeezing method, consistent with the results obtained in the Mobilenetv2 comparison, achieved a recall of 0.9168 and accuracy of 0.8837 for BIM, and a recall of 0.8532 and accuracy of 0.8359 for DeepFool. For PGD, it showed a recall of 0.6704 and accuracy of 0.7637, a decrease in performance but still a decent result.
The Zero-Mean and RGB2BGR methods were effective in detecting DeepFool and C&W attacks. When compared to the results of the lightweight model Mobilenetv2, Zero-Mean showed similar performance in detecting DeepFool and C&W. RGB2BGR, on the other hand, showed an increase in recall of about 0.4 for DeepFool and about 0.5 for C&W. This suggests that the performance of the Zero-Mean and RGB2BGR methods may improve as the model becomes heavier.
The limitations noted for UNICAD were also observed in Resnet18.
Table 8 presents a comparison of the performance of VGG16. For FGSM-type attacks (FGSM, PGD, and BIM), VGG16-FGSM showed the most effective detection performance, with F1-scores of 0.9098, 0.9299, and 0.9246 and accuracies of 0.9236, 0.9332, and 0.9282, respectively. While Feature Squeezing was very effective in detecting BIM and PGD attacks, with recall values of 1.0 and 0.9913, VGG16-FGSM demonstrated a more balanced detection, with F1-scores of 0.9246 and 0.9299.
The Feature Squeezing method, as seen from the results obtained in the previous comparison, achieved a recall of 1.0000 and accuracy of 0.8512 for BIM, and a recall of 0.9041 and accuracy of 0.7915 for DeepFool. PGD showed a recall of 0.9913 and accuracy of 0.8461, demonstrating an effective performance.
The Zero-Mean and RGB2BGR methods were effective in detecting DeepFool and C&W attacks. When compared to Resnet18, both methods showed an increase of about 0.4 in recall for DeepFool attacks and an increase of about 0.7 in recall for C&W attacks. This supports the earlier estimation that performance improves as the model transitions from lightweight to heavyweight.
The limitations previously observed in UNICAD were also present in VGG16.
In conclusion, our method was proven effective in detecting attacks across various models and attack types. The Feature Squeezing method was shown to be more effective when the image distortion was lower. The Zero-Mean and RGB2BGR methods demonstrated an increase in detection performance as the model transitioned from lightweight to heavyweight. Although UNICAD did not show a good detection performance, its core reliance on a Denoising Autoencoder suggests that, even when precise detection is not achieved, recovery performance can improve through the use of the Denoising Autoencoder. These results may vary depending on the attack parameters.
4.5. Ablation Study
4.5.1. Analysis of Failure Cases Using the Proposed Method
We present cases where the proposed method succeeds and also analyze the cases in which it fails.
In Figure 5, a false positive case occurred when an original image was classified as an attacked image. In the activation maps of false positive cases 1 and 2, we can observe a difference between the activation map of the original image and the activation map after Label Sanitization. In this case, false positive case 1 had a silhouette coefficient of approximately 0.9, and false positive case 2 had a silhouette coefficient of 0.5; both were detected as attacked images.
The false negative case occurred when an attacked image was classified as an original image. In this case, the activation map of the attacked image and the activation map after Label Sanitization appeared identical. This indicates that Label Sanitization did not work as expected, leading to an attacked image being classified as an original image.
These results primarily occurred when the model identified images with similar features. As seen in Figure 5, this included cases such as dogs and trucks. When the model makes its prediction, images of cats and dogs, or trucks and automobiles, share similar features, which leads to differences in the activation maps despite the application of Label Sanitization.
4.5.2. Evaluation of Image Quality Post-Reconstruction
The ensemble model of the proposed method was designed with a focus on Label Sanitization. As a result, a degradation in image quality was inevitable. To identify areas for improvement, the extent of the image quality degradation was measured.
The metric used to evaluate image quality degradation was the SSIM (Structural Similarity Index Measure). The average SSIM was calculated for the evaluation. The evaluation dataset consisted of 10,000 original images, and adversarial examples were selected based on the successful attack cases shown in Table 4. For example, in the case of the FGSM, the average SSIM was calculated based on 7725 adversarial examples.
The SSIM is a metric used to measure the structural similarity between two images [26]. It considers three main components (luminance, contrast, and structure) to evaluate the similarity between two images in a way that aligns more closely with the human visual system, rather than relying on pixel-wise differences. The SSIM between two images x and y was calculated using the following formula:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu_x$ and $\mu_y$ are the mean luminance of images x and y, $\sigma_x$ and $\sigma_y$ are the contrast (standard deviations) of images x and y, $\sigma_{xy}$ is the covariance between images x and y, and $C_1$ and $C_2$ are small constants that avoid division by zero in the denominators.
The SSIM values range from 0 to 1, where a value closer to 1 indicates that the two images are more similar. Specifically, SSIM values between 0 and 0.2 indicate severe quality loss, with very little resemblance between images. Values from 0.2 to 0.4 show significant degradation with low similarity. A range of 0.4 to 0.6 represents moderate similarity with noticeable quality loss. Values from 0.6 to 0.8 suggest fairly high similarity with acceptable levels of quality degradation. Finally, values between 0.8 and 1.0 indicate minimal or no visible quality loss, as the images are highly similar.
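The average SSIM can be computed, for example, with scikit-image as in the sketch below; the assumed input layout (HxWxC float arrays in [0, 1]) and the conversion from model tensors are left to the caller.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim


def average_ssim(originals, reconstructions):
    """Mean SSIM over paired images.

    Both inputs are assumed to be iterables of HxWxC float arrays in [0, 1];
    converting model tensors into this layout is left to the caller.
    """
    scores = [ssim(o, r, channel_axis=-1, data_range=1.0)
              for o, r in zip(originals, reconstructions)]
    return float(np.mean(scores))
```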
In Table 9, all ensemble models showed similar SSIM values. For the original image, the SSIM was approximately 0.53, while for the FGSM it was around 0.5 and for the BIM it was approximately 0.51. The SSIM for PGD was around 0.51, for DeepFool it was about 0.53, and for C&W it was about 0.52. In other words, the SSIM values for all ensemble models ranged from about 0.5 to 0.53, indicating a decrease in image quality when the image is reconstructed. Since these values fall between 0.4 and 0.6, the quality degradation is noticeable.
These results appear to be a consequence of training the ensemble model with the goal of label sanitization. In our future work, we aim to investigate a framework that can successfully perform label sanitization without degrading the image quality.
4.5.3. Detection Performance Against White-Box Attacks in Real-World Scenarios
Our method was proven effective against white-box attacks through experiments. To verify its effectiveness in real-world scenarios, we evaluated its detection performance against AutoAttack [27]. AutoAttack is a powerful framework designed for assessing adversarial robustness through a combination of four attack techniques. APGD-CE is a gradient-based attack leveraging cross-entropy loss to identify basic vulnerabilities. APGD-DLR employs the Difference of Logits Ratio (DLR) loss to more precisely target weaknesses near decision boundaries. Fast Adaptive Boundary (FAB) performs optimized attacks in the direction closest to the decision boundary, uncovering the most sensitive regions of the model. Finally, Square Attack is a gradient-free attack that uses random search to assess robustness even when gradient information is unavailable. Using AutoAttack, we conducted experiments to determine whether our method is applicable in real-world scenarios.
AutoAttack was implemented using the tool provided by torchattacks, and the parameters were set to their default values. Specifically, the norm was set to L-infinity, eps was 0.3, the version was set to standard (carrying out the four attacks described above), the number of classes was set to 10 to match the CIFAR-10 labels, the seed was set to None, and verbose was set to False.
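The corresponding torchattacks call is sketched below; `model`, `images`, and `labels` stand in for the evaluated classifier and a batch from the CIFAR-10 test set.

```python
import torchattacks

# AutoAttack configured with the parameter values listed above; `model` stands in
# for the classifier under evaluation.
attack = torchattacks.AutoAttack(model, norm="Linf", eps=0.3, version="standard",
                                 n_classes=10, seed=None, verbose=False)
adv_images = attack(images, labels)  # images/labels: a batch from the CIFAR-10 test set
```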
Due to limited computational resources, the evaluation could not be performed on the entire set of examples. Instead, the detection performance was measured using 2500 original examples and 2500 adversarial examples, with a 1:1 ratio. The examples used for this evaluation dataset were randomly selected.
In Table 10, the proposed method achieved a recall of over 0.93, an F1-score of over 0.91, and an accuracy of over 0.91 across all ensemble models. These results demonstrate that the proposed method can effectively detect adversarial attacks even in real-world scenarios.
In conclusion, the proposed method is an effective detection approach that works well across various adversarial attacks and models and demonstrated the ability to detect attacks even in real-world scenarios.
4.5.4. Execution Time Measurement of Detection Logic
In this section, we measured the execution time of the detection logic to verify its effectiveness in real scenarios. The dataset used for the evaluation was the test dataset. The execution time was measured by repeating the process 100 times with a batch size of 100, and the average time was calculated. In Table 11, the detection logic execution time is reported as the average execution time with a ± range based on the standard deviation. This approach quantitatively evaluates the stability of the execution time. These results are shown with error bars in Figure 6.
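A minimal timing sketch following this procedure is shown below. The `detect_fn` callable wrapping the detection logic is hypothetical; the batch size of 100, the 100 repetitions, and the mean ± standard deviation reporting follow the description above.

```python
import time
import numpy as np
import torch


def time_detection(detect_fn, loader, repeats=100, device="cuda"):
    """Average execution time (ms) and standard deviation for one batch of 100 images.

    `detect_fn` is a hypothetical callable wrapping the detection logic; CUDA work is
    synchronized before and after each run so the wall-clock measurement is accurate.
    """
    images, _ = next(iter(loader))        # one batch (batch size 100 in the paper)
    images = images.to(device)
    times = []
    for _ in range(repeats):
        torch.cuda.synchronize()
        start = time.perf_counter()
        detect_fn(images)
        torch.cuda.synchronize()
        times.append((time.perf_counter() - start) * 1000.0)
    return float(np.mean(times)), float(np.std(times))
```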
Resnet18-FGSM and Resnet18-PGD had low execution times with minimal variability. Mobilenetv2-FGSM and Mobilenetv2-PGD required longer execution times with moderate variability. VGG16-FGSM and VGG16-PGD showed balanced execution speeds. The measured values are reported in Table 11.
These results suggest that the Resnet18 ensemble model was the most efficient and stable in terms of execution time across both detection logics. Mobilenetv2, while effective, incurred the longest execution time, which may impact its real-world applications. VGG16 demonstrated a reasonable execution time, making it potentially suitable for real-world scenarios requiring a balance between speed and stability.
4.6. Discussion
UNICAD exhibited significant differences in its detection performance between original and adversarial samples. Specifically, for the original samples, the performance when using Mobilenetv2 was 0.8875, that when using Resnet18 was 0.9275, and that when using VGG16 was 0.9704. However, for adversarial samples, the performance dropped to below 0.2. Additionally, the ratio of original to adversarial samples used in the tests was not explicitly stated in the referenced paper. To address this, we evaluated the performance using the dataset employed in our study. The results showed an average F1-score of approximately 0.2 and an accuracy of around 0.55.
Our method demonstrated a consistent performance across both lightweight and heavyweight models, such as Resnet18, Mobilenetv2, and VGG16. This indicates that the proposed detection method effectively captures the characteristic differences between original and adversarial examples through clustering analysis using Grad-CAM and silhouette coefficients. Additionally, through an ablation study, we demonstrated that our method can effectively detect adversarial attacks in real-world scenarios.
The results of this study, combined with the explainability derived from Grad-CAM, offer users an intuitive understanding of the impact of adversarial attacks during the detection process. This demonstrates the potential to enhance both the transparency and the reliability of AI security systems. Furthermore, comparison with existing methods, such as Feature Squeezing, Zero-Mean, and RGB2BGR, shows that the proposed approach offers higher versatility and reliability against certain types of attacks.
This contributes significantly to current research, complementing traditional methods that rely on a single attack type and providing a solution that can be used in various attack scenarios. The results confirm that the proposed method offers a robust solution for adversarial attack detection while maintaining compatibility with diverse deep learning models.
These findings have significant implications for improving the security and reliability of AI systems across a variety of applications, such as autonomous vehicles, IoT networks, and critical infrastructure protection.