Article

Adversarially Robust and Explainable Insulator Defect Detection for Smart Grid Infrastructure

Electrical Engineering Department, Jubail Industrial College, Royal Commission for Jubail & Yanbu, Jubail Industrial City 31961, Saudi Arabia
Energies 2026, 19(4), 1013; https://doi.org/10.3390/en19041013
Submission received: 21 January 2026 / Revised: 5 February 2026 / Accepted: 12 February 2026 / Published: 14 February 2026

Abstract

Automated insulator inspection systems face critical challenges from small object sizes, complex backgrounds, and vulnerability to adversarial attacks, a security concern largely unaddressed in safety-critical power infrastructure. We introduce Faster-YOLOv12n, integrating a FasterNet backbone with SGC2f attention modules and Wise-ShapeIoU loss for enhanced small defect localization. Our architecture achieves 98.9% mAP@0.5 on the CPLID, improving baseline YOLOv12n by 1.3% in precision (97.8% vs. 96.5%), 4.7% in recall (95.1% vs. 90.4%), and 1.8% in mAP@0.5. Through differential data augmentation, we expand training samples from 678 to 3900 images, achieving balanced class distribution and robust generalization across fog, adverse weather, and complex transmission line backgrounds. Comparative evaluation demonstrates superior performance over RT-DETR, Faster R-CNN, YOLOv7, YOLOv8, and YOLOv9, with per-class analysis revealing 99.8% AP@0.5 for defect detection. We provide the first comprehensive adversarial robustness evaluation for insulator defect detection, systematically assessing FGSM, PGD, and C&W attacks across perturbation budgets. Through adversarial training with mixed-batch strategies, our robust model maintains 93.2% mAP@0.5 under the strongest FGSM attacks (ε = 48/255), 94.5% under PGD attacks, and 95.1% under C&W attacks (τ = 3.0) while preserving 98.9% clean accuracy, demonstrating no trade-off between accuracy and robustness. Grad-CAM visualizations demonstrate that attacks disrupt confidence calibration while preserving spatial attention on defect regions, providing interpretable insights into model decision-making under adversarial conditions and validating learned feature representations for safety-critical smart grid monitoring applications.

1. Introduction

Power transmission systems constitute the backbone of modern electrical infrastructure, with insulators serving as critical components responsible for supporting high-voltage conductors while providing electrical isolation from grounded structures [1]. The integrity of these insulators directly impacts grid reliability and operational safety across extensive transmission networks. According to data from power system monitoring organizations, approximately 81.3% of transmission line accidents stem from insulator defects [2], resulting in substantial power losses, reduced equipment lifespan, and widespread service interruptions that affect millions of consumers. The escalating complexity of smart grid infrastructure, combined with the expansion of transmission networks to accommodate renewable energy integration, has intensified the demand for reliable and automated insulator inspection methodologies [3].
Traditional insulator inspection approaches primarily rely on manned helicopter surveys and ground-based visual examinations performed by trained personnel [4]. While these conventional inspection methods have served the power industry for decades, they introduce multiple operational challenges, including prohibitive labor costs, extended inspection cycles spanning weeks or months for large transmission networks, significant safety hazards from working at elevated heights and near energized infrastructure, and inherent subjectivity in defect assessment that varies across inspectors [5,6]. The intricate background clutter surrounding transmission-line corridors, combined with the small size of defect regions and the necessity for distant observation to maintain safety clearances, further compounds the difficulty of manual inspection even when employing unmanned aerial vehicle (UAV) platforms with wide-angle imaging capabilities [7]. These limitations underscore the critical need for automated, accurate, and computationally efficient insulator defect-detection systems capable of operating reliably across diverse environmental conditions [8].
The advancement of computer vision technologies, particularly through deep learning-based object detection algorithms, has catalyzed significant progress in automated power system inspection [2,9]. These algorithms excel at precise object localization [10] and classification [2], proving instrumental for automated fault diagnosis across substations and transmission infrastructure [11]. Object detection frameworks are conventionally categorized into two-stage and one-stage architectures. The two-stage paradigm, exemplified by the Region-based Convolutional Neural Network (R-CNN) family [12,13,14], operates through sequential region proposal generation followed by classification refinement and bounding box optimization via non-maximum suppression. Although these models achieve superior accuracy, they are constrained by slow processing speeds, complex training requirements, and optimization difficulties that limit real-time deployment. Conversely, the You Only Look Once (YOLO) series [15,16,17] represents the predominant one-stage approach, now evolved through YOLOv12 [18], offering reduced inference latency, lightweight architectures, and sustained accuracy improvements that establish them as optimal foundations for real-time inspection systems.
Recent research has employed various deep learning techniques to enhance insulator defect detection capabilities. Zhang et al. [19] incorporated synthetic fog augmentation and channel attention mechanisms into YOLOv5, achieving 91.7% precision for self-explosion defect detection. Cao et al. [9] integrated coordinate attention into YOLOv8 through the CACS-YOLO model, demonstrating superior performance on fog-degraded datasets. Li et al. [20] developed a lightweight YOLOv4 variant using ECAGhostNet, reducing model size to 8.7 MB for embedded deployment. Chang et al. [21] improved the YOLOv7 architecture by integrating enhanced feature fusion and attention mechanisms, significantly increasing detection robustness in complex background environments and under partial occlusions, while Liu and Liao [22] incorporated space-to-depth convolution modules to improve small object detection, though performance degraded under complex background interference. These developments have primarily addressed challenges arising from natural environmental variations, including fog, illumination changes, occlusions, and background clutter. However, the literature reveals a critical gap in addressing adversarial robustness, which represents a fundamental security concern for automated inspection systems deployed in critical infrastructure applications.
Deep neural networks exhibit systematic vulnerabilities to adversarial examples, which are inputs crafted through imperceptible perturbations that cause misclassification while remaining invisible to human observers [23,24]. This phenomenon poses significant security risks for deep learning systems deployed in safety-critical applications where malicious actors may exploit vulnerabilities to manipulate model predictions. The Fast Gradient Sign Method (FGSM) [24] demonstrated that adversarial examples can be efficiently generated through single-step gradient computation, exploiting the locally linear behavior of neural networks. Projected Gradient Descent (PGD) attacks [25] extend this approach through iterative optimization with projection onto perturbation constraint sets, producing stronger adversarial examples that approximate worst-case perturbations. Carlini and Wagner (C&W) attacks [26] formulate adversarial generation as constrained optimization with explicit distance minimization, achieving high success rates through careful objective design. While adversarial robustness has been extensively investigated for image classification tasks [27], object detection models present distinct challenges due to multi-task learning objectives and spatial reasoning requirements [28]. Despite documented vulnerabilities of object detectors to adversarial perturbations [29,30], research examining adversarial robustness in domain-specific industrial applications such as insulator defect detection remains critically limited. This gap is particularly concerning given that automated inspection systems in critical infrastructure represent high-value targets for adversarial manipulation, especially in smart grid IoT environments where wireless communication channels can be exploited to corrupt transmitted images [31].
The existing literature reveals four fundamental gaps that motivate our research. First, while recent approaches have applied deep learning to insulator inspection [9,19,20], they primarily address environmental variations such as fog and illumination changes, without systematic evaluation of adversarial robustness in safety-critical infrastructure where detection failures can precipitate cascading outages. Second, small defect detection in complex outdoor environments remains challenging, as convolutional neural network (CNN) architectures struggle to isolate defect-specific features from visually similar background elements, particularly when defects span fewer than 50 pixels [2,11]. Third, available datasets exhibit significant class imbalance and limited diversity across environmental conditions, with defective samples comprising less than 30% of typical training sets [1], constraining model generalization to real-world inspection scenarios. Fourth, existing defense mechanisms demonstrate fundamental trade-offs between clean and robust accuracy, limiting practical deployment in applications requiring consistent performance under both normal and adversarial conditions [32]. Our work addresses these gaps through Faster-YOLOv12n, integrating lightweight feature extraction, structured augmentation strategies, and adversarial training specifically designed for insulator defect detection.
The main contributions of this paper are as follows:
  • We introduce Faster-YOLOv12n, integrating the FasterNet backbone [33], SGC2f feature fusion modules, and Wise-ShapeIoU loss [34,35] for enhanced small defect localization. Ablation studies demonstrate substantial improvements over baseline YOLOv12n, achieving 98.9% mAP@0.5 with a precision of 97.8% and a recall of 95.1%. The model maintains 5.48 FPS, making it suitable for real-time UAV-based inspection scenarios where typical image acquisition rates range from 1 to 10 FPS.
  • We apply differential data augmentation strategies to address the inherent class imbalance in power line inspection datasets, where defects occur relatively infrequently. Through Mosaic, MixUp, and Copy–Paste transformations [36], we achieve balanced class distribution while significantly expanding training sample diversity. This approach enables robust generalization across diverse environmental conditions, including fog, adverse weather, and complex transmission line backgrounds.
  • Our architecture demonstrates superior performance compared to established models including RT-DETR, Faster R-CNN, YOLOv7, YOLOv8, and YOLOv9 [13,17,37,38,39]. Per-class analysis reveals 99.8% AP@0.5 for defect detection, particularly significant for safety-critical infrastructure where false negatives can precipitate widespread power outages affecting millions of consumers.
  • We provide the first comprehensive adversarial robustness evaluation for insulator defect detection, systematically assessing FGSM, PGD, and C&W attacks across varying perturbation budgets [24,25,26]. Through adversarial training with mixed-batch strategies, our robust model maintains 93.2% mAP@0.5 under the strongest FGSM attacks (ε = 48/255), 94.5% under PGD attacks, and 95.1% under C&W attacks (τ = 3.0) while preserving 98.9% clean accuracy with precision of 97.8% and recall of 95.1%, demonstrating no trade-off between accuracy and robustness. Grad-CAM visualizations [40] provide interpretable insights into model decision-making, demonstrating sustained attention on discriminative defect regions even under adversarial conditions.
The structure of this paper is as follows: Section 2 reviews existing deep learning approaches for insulator defect detection and adversarial robustness methodologies. Section 3 presents our proposed approach, including dataset characteristics, preprocessing techniques, Faster-YOLOv12n architecture, adversarial attack formulations, and defense framework. Section 4 evaluates model performance across baseline and state-of-the-art models, examining the impact of data augmentation, adversarial attack severity, defense effectiveness, and Grad-CAM interpretability. Section 5 examines the implications of experimental findings, model capabilities, and technical limitations. Section 6 concludes the study and outlines directions for future research.

2. Related Works

Object detection has undergone substantial evolution with the advancement of deep learning methodologies. Early approaches employed two-stage detection frameworks that decompose the detection task into region proposal generation followed by classification and localization refinement. The Region-based Convolutional Neural Network (R-CNN) [12] pioneered this paradigm by applying convolutional neural networks to region proposals extracted via selective search. Subsequent developments improved computational efficiency through architectural innovations: Fast R-CNN [13] introduced end-to-end training with shared convolutional feature computation, Faster R-CNN [14] replaced selective search with a learnable Region Proposal Network, and R-FCN [41] utilized position-sensitive score maps to reduce computational redundancy [2].
Despite achieving high detection accuracy, two-stage detectors exhibit significant computational overhead and inference latency due to their sequential processing pipeline. The computational complexity scales linearly with the number of region proposals, resulting in processing times that exceed real-time requirements for many practical applications [42]. These limitations motivated the development of single-stage detection architectures that perform object localization and classification in a unified framework.
The You Only Look Once (YOLO) framework [15] introduced a fundamentally different approach by formulating object detection as a direct regression problem from image pixels to bounding box coordinates and class probabilities. This single-stage architecture eliminates the region proposal generation step, enabling significantly faster inference while maintaining competitive accuracy. The original YOLO model processes images in a single forward pass through a fully convolutional network, making it particularly suitable for applications requiring real-time performance on resource-constrained hardware [42]. The YOLO architecture has undergone continuous refinement through successive versions. YOLOv3 [16] introduced multi-scale feature pyramid predictions to improve detection of objects at varying scales. YOLOv5 [43,44] optimized the network architecture and training pipeline for enhanced deployment efficiency across different hardware platforms. YOLOv6 [45] focused specifically on industrial application requirements, incorporating efficient reparameterization techniques and hardware-aware design principles. An improved version of YOLOv8 [46] integrated anchor-free detection heads and enhanced feature fusion mechanisms through Path Aggregation Networks. YOLOv10 [47] achieved state-of-the-art performance through dual assignments for non-maximum suppression elimination and efficiency–accuracy-driven model design. YOLOv11 [48] introduced the C3k2 module incorporating cross-stage partial connections and the C2PSA module with channel-spatial positional self-attention mechanisms, substantially improving the capacity to capture fine-grained object features. YOLOv12 [18] builds upon these advancements with further architectural optimizations targeting improved accuracy and inference speed trade-offs.
Power transmission systems depend critically on the structural integrity of insulator components, and insulator failures can precipitate widespread power outages with significant economic consequences [2]. Traditional inspection methodologies rely on manual visual examination by trained personnel, which introduces multiple limitations, including high labor costs, slow inspection cycles, safety hazards from working at elevated heights, and subjective assessment variability [4]. The adoption of Unmanned Aerial Vehicle (UAV) platforms for automated inspection has addressed several of these challenges by enabling efficient aerial data acquisition across extensive transmission networks [49]. Early computational approaches for insulator defect detection employed traditional computer vision techniques based on handcrafted feature descriptors. These methods typically segment insulator regions using threshold-based techniques, edge detection operators, or morphological operations, followed by defect classification based on geometric and textural features [50]. However, the reliance on manually designed features results in limited generalization capability across varying imaging conditions, illumination changes, and background complexity.
The application of deep learning to insulator defect detection has substantially improved detection performance through learned hierarchical feature representations. Zhang et al. [19] developed the FINet architecture based on YOLOv5, incorporating synthetic fog augmentation algorithms and channel attention mechanisms to enhance robustness under adverse weather conditions, achieving 91.7% precision on insulator self-explosion defect detection. Cao et al. [9] integrated coordinate attention mechanisms into the YOLOv8 framework through the CACS-YOLO model, demonstrating superior performance on synthetic fog insulator datasets through enhanced spatial feature encoding. Li et al. [20] designed a lightweight variant of YOLOv4 using the ECAGhostNet backbone architecture, reducing the model size to 8.7 MB for deployment on embedded platforms while maintaining acceptable detection accuracy. Song et al. [11] modified the YOLOv7 architecture with enhanced feature fusion networks and optimized loss functions to improve detection under conditions involving complex backgrounds, small defect regions, and partial occlusions. Liao and Liu [22] incorporated the Space-to-Depth convolution module into YOLOv8 to enhance small object detection capability, though the model continued to exhibit reduced precision when processing images with complex background interference. These developments have primarily addressed challenges arising from natural environmental variations, including fog, illumination changes, occlusions, and background clutter. However, the literature reveals a critical gap in addressing adversarial robustness, which represents a fundamental security concern for automated inspection systems deployed in critical infrastructure applications.
Deep neural networks exhibit systematic vulnerabilities to adversarial examples, which are inputs crafted through the addition of carefully designed perturbations that cause misclassification while remaining imperceptible to human observers [23]. This phenomenon poses significant security risks for deep learning systems deployed in safety-critical applications where malicious actors may attempt to manipulate model predictions. The Fast Gradient Sign Method (FGSM) proposed by Goodfellow et al. [24] demonstrated that adversarial examples can be efficiently generated through a single-step gradient computation. This approach exploits the locally linear behavior of deep neural networks by perturbing inputs in the direction that maximizes the loss function. The Projected Gradient Descent (PGD) attack introduced by Madry et al. [25] extends FGSM through an iterative optimization procedure that alternates between gradient ascent steps and projection onto the perturbation constraint set, producing stronger adversarial examples that better approximate the worst-case perturbation within a specified threat model. Carlini and Wagner [26] formulated adversarial example generation as a constrained optimization problem with explicit distance minimization objectives, achieving high attack success rates through careful objective function design and optimization procedures.
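The single-step and iterative updates described above can be sketched in a framework-agnostic way. The sign step and L∞ projection below follow the standard FGSM and PGD formulations; the loss gradient is supplied by the caller as a toy stand-in, since the paper's detector and multi-task loss are not reproduced here:

```python
def sign(g):
    """Elementwise sign, as used in the FGSM update x_adv = x + eps * sign(grad)."""
    return [1.0 if v > 0 else -1.0 if v < 0 else 0.0 for v in g]

def fgsm(x, grad, eps):
    """Single-step FGSM: move each input component eps in the loss-increasing direction."""
    return [xi + eps * si for xi, si in zip(x, sign(grad))]

def pgd(x0, loss_grad, eps, alpha, steps):
    """Iterative PGD: repeated sign steps of size alpha, each followed by
    projection back onto the L-infinity ball of radius eps around x0."""
    x = list(x0)
    for _ in range(steps):
        x = [xi + alpha * si for xi, si in zip(x, sign(loss_grad(x)))]
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
    return x
```

In practice the gradient would come from backpropagation through the detector's loss; the projection step is what distinguishes PGD from simply repeating FGSM.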
While adversarial robustness has been extensively investigated for image classification tasks [27], object detection models present distinct challenges due to their multi-task learning objectives and spatial reasoning requirements. Object detectors must simultaneously perform accurate localization and classification, and adversarial perturbations can disrupt either or both of these functions [28]. Lu et al. [29] demonstrated that adversarial perturbations can cause object detectors to fail in multiple modes, including missed detections, false positive predictions, and misclassifications of detected objects. Wei et al. [30] showed that physically realizable adversarial patches can fool YOLO-based detectors in real-world scenarios when captured through video streams. Despite the documented vulnerabilities of object detection models to adversarial perturbations, research examining adversarial robustness in domain-specific industrial applications such as defect detection remains critically limited. This gap is particularly concerning given that automated inspection systems in critical infrastructure represent high-value targets for adversarial manipulation.
Defensive techniques against adversarial attacks can be categorized into three primary approaches: adversarial training, input preprocessing, and model architecture modifications. Adversarial training [25] augments the training dataset with adversarial examples generated during the training process, effectively teaching the model to correctly classify perturbed inputs. While this approach has demonstrated effectiveness in improving worst-case robustness, it frequently results in reduced accuracy on unperturbed clean data [32]. Furthermore, adversarial training requires substantial computational resources due to the need for online adversarial example generation during each training iteration. Input preprocessing defenses attempt to remove or reduce adversarial perturbations before feeding data into the detection model. Proposed techniques include JPEG compression [51], which removes high-frequency perturbations through lossy compression; total variance minimization [52], which smooths images while preserving edges; and denoising autoencoders [53], which learn to reconstruct clean images from perturbed inputs. While these methods offer computational efficiency during inference, they often reduce clean image accuracy by removing legitimate high-frequency features that contain semantic information. Additionally, these pixel-level defenses provide limited protection against adaptive attacks specifically designed to circumvent the preprocessing operations [54]. Architectural defense mechanisms modify the neural network structure to enhance inherent robustness to adversarial perturbations. Feature denoising approaches [55] operate on intermediate feature representations rather than raw input pixels, allowing the model to filter adversarial noise while preserving semantic feature information. Feature squeezing [56] reduces the color bit depth and applies spatial smoothing to compress the input space and remove perturbations.
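As a concrete illustration of the simplest preprocessing-style defense mentioned above, bit-depth reduction (the core operation of feature squeezing) can be sketched in a few lines. The rounding scheme below is one common convention, not necessarily the exact implementation of [56]:

```python
def squeeze_bit_depth(pixels, bits):
    """Quantize 8-bit pixel values down to 2**bits levels and map back to [0, 255].

    Coarser quantization removes small adversarial perturbations along with
    legitimate high-frequency detail, which is the defense's core trade-off.
    """
    levels = 2 ** bits - 1
    return [round(p / 255 * levels) / levels * 255 for p in pixels]
```

Squeezing to 1 bit collapses every pixel to either 0 or 255, erasing any small-ε perturbation but also all texture; at 8 bits the operation is the identity, illustrating the clean-accuracy/robustness dial these defenses expose.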
Attention mechanisms have shown promise in guiding models to focus on robust features that are less susceptible to adversarial manipulation [57]. Recent work on adaptive feature denoising [55] demonstrates that dynamically adjusting denoising operations based on perturbation characteristics can improve the trade-off between clean accuracy and adversarial robustness. However, the majority of existing defense mechanisms have been developed and evaluated using general object detection benchmarks such as Microsoft COCO [58] and Pascal VOC [59]. These defenses have not been validated on domain-specific applications with unique characteristics, particularly small object detection in industrial settings where both high clean accuracy and adversarial robustness represent critical operational requirements.
Adversarial robustness research across safety-critical domains reveals systematic vulnerabilities in deep learning deployments. In autonomous driving, Eykholt et al. [60] demonstrated 100% success rates for physical adversarial patches on traffic sign classifiers, while Zhang et al. [61] showed that multi-task object detectors remain vulnerable to unified adversarial attacks across detection and segmentation tasks. Medical imaging faces analogous threats: Finlayson et al. [62] documented adversarial manipulation of clinical diagnostic systems, and Apostolidis et al. [63] revealed vulnerabilities in COVID-19 detection from chest X-rays with perturbations imperceptible to radiologists. Despite extensive research in autonomous vehicles [64] and healthcare [63], industrial inspection systems for critical infrastructure have received substantially less attention, particularly for outdoor object detection in uncontrolled environments where small defects must be reliably identified under both natural environmental variations and potential adversarial manipulation.
The existing literature reveals four fundamental gaps that motivate our research. First, while recent approaches have applied deep learning to insulator inspection [9,19,22], they primarily address environmental variations such as fog and illumination changes, without systematic evaluation of adversarial robustness in safety-critical infrastructure where detection failures can precipitate cascading outages. Second, small defect detection in complex outdoor environments remains challenging, as conventional CNN architectures struggle to isolate defect-specific features from visually similar background elements. Third, available datasets exhibit significant class imbalance, with defective samples comprising less than 30% of typical training sets, constraining model generalization to real-world inspection scenarios. Fourth, existing defense mechanisms demonstrate fundamental trade-offs between clean and robust accuracy, limiting practical deployment in applications requiring consistent performance under both normal and adversarial conditions [32].
Our work addresses these gaps through Faster-YOLOv12n, an adversarially robust detector specifically designed for insulator defect detection. We integrate the lightweight FasterNet backbone [33] with SGC2f feature fusion modules and Wise-ShapeIoU loss [34,35] to enhance small object localization while maintaining computational efficiency. Unlike prior research focusing exclusively on natural environmental variations, we establish the first comprehensive adversarial robustness benchmark for this domain, systematically evaluating FGSM, PGD, and C&W attacks. Through structured data augmentation and adversarial training strategies, our approach achieves robust performance under both clean and adversarial conditions, demonstrating that careful architectural design combined with defense mechanisms can mitigate the accuracy–robustness trade-off characteristic of existing approaches.

3. Methodology

3.1. Network Architecture Overview

The proposed insulator defect detection framework employs a modified YOLOv12 architecture specifically optimized for identifying and localizing defects in power line insulators captured under diverse operational conditions. As illustrated in Figure 1, the complete detection pipeline comprises three synergistic components: a lightweight backbone network for hierarchical feature extraction, an attention-augmented neck module for multi-scale feature fusion, and a decoupled detection head for final class and localization predictions.
The backbone network integrates FasterNet [33], which leverages partial convolution operations to extract semantic features with significantly reduced computational overhead compared to conventional architectures. Given an input image of dimension 640 × 640 × 3, the architecture generates multi-scale feature representations at resolutions 160 × 160 × C1, 80 × 80 × C2, 40 × 40 × C3, and 20 × 20 × C4 through four hierarchical stages, where channel dimensions progressively expand to accommodate richer semantic representations. The initial stage incorporates an embedding module employing stride-4 downsampling with a 4 × 4 convolutional kernel, while subsequent stages utilize merging modules with stride-2 operations for progressive spatial downsampling and channel expansion. This lightweight design maintains high representational capacity while enabling real-time inference on resource-constrained edge devices.
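The progression of feature-map resolutions follows directly from the stated strides (a stride-4 embedding stage followed by three stride-2 merging stages); a minimal sketch of that bookkeeping:

```python
def stage_resolutions(input_size=640, strides=(4, 2, 2, 2)):
    """Spatial resolution after each backbone stage, given per-stage strides."""
    sizes, s = [], input_size
    for stride in strides:
        s //= stride
        sizes.append(s)
    return sizes
```

For the 640 × 640 input used here this yields the 160/80/40/20 pyramid described above; the same arithmetic applies to any input size divisible by the cumulative stride of 32.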
The neck architecture incorporates the proposed SGC2f (Sage-Area-guided Cross-stage fusion) module, which employs spatial partitioning and attention mechanisms to enhance feature aggregation across multiple receptive fields. The neck implements bidirectional feature pyramid connections that combine features from multiple backbone stages: the bottom-up pathway progressively aggregates features from higher to lower resolutions through C3K2 and SGC2f modules, while the top-down pathway refines features by upsampling and concatenating information from deeper layers. This bidirectional fusion substantially improves the model’s ability to capture fine-grained defect characteristics by integrating spatial details from high-resolution features with semantic information from deep layers, while accelerating both training and inference phases.
The detection head adopts a decoupled architecture that independently processes classification and bounding box regression tasks across three spatial scales, facilitating robust detection of insulators and defects exhibiting substantial scale variation. The three detection scales operate on feature maps at 40 × 40, 20 × 20, and 10 × 10 resolutions, enabling simultaneous detection of small defect regions (10–50 pixels), medium-scale insulator components, and large transmission line structures.
To improve localization precision, particularly for small defect regions, we replace the conventional Complete Intersection over Union (CIoU) loss with the Wise-ShapeIoU loss function. This modified objective incorporates geometric shape constraints and adaptive gradient modulation based on anchor quality, yielding more accurate bounding box predictions while accelerating convergence during training.
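All IoU-family losses, including the CIoU baseline and the Wise-ShapeIoU variant adopted here, are built on the plain intersection-over-union term. The sketch below shows only that shared core; the shape constraints and anchor-quality gradient modulation of Wise-ShapeIoU add further penalty terms not reproduced here:

```python
def iou(a, b):
    """Plain IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(a, b):
    """Base regression loss; IoU-family variants add penalty terms to this."""
    return 1.0 - iou(a, b)
```

For small defect boxes the plain IoU term is highly sensitive to localization error, which is why the added shape and quality terms matter most at the 10–50 pixel scale targeted here.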

3.2. Data Collection and Preprocessing

The China Power Line Insulator Dataset (CPLID) consists of 848 images at 1152 × 864 pixels, comprising 600 normal images and 248 defective images [1] (as shown in Figure 2). Annotations are provided in Pascal VOC XML format with two classes: insulator (class 0) and defect (class 1). Normal images contain single bounding boxes for insulator objects, while defective images include dual annotations for both the insulator and the defect region (missing cap).
The dataset exhibits a natural class imbalance ratio of 2.42:1 (normal-to-defective), reflecting real-world power line inspection scenarios where defects are relatively rare. To ensure unbiased evaluation, we partitioned the dataset using stratified sampling with an 80:10:10 ratio, maintaining the original class distribution across all splits. The resulting partitions comprise a training set with 678 images (480 normal, 198 defective), a validation set with 85 images (60 normal, 25 defective), and a test set with 85 images (60 normal, 25 defective).
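The per-class split sizes reported above can be reproduced from the 80:10:10 ratios; a small sketch of the stratified counting, rounding the validation and test shares and assigning the remainder to training (one reasonable convention):

```python
def stratified_counts(n, ratios=(0.8, 0.1, 0.1)):
    """Per-split counts for one class under stratified 80:10:10 partitioning."""
    val = round(n * ratios[1])
    test = round(n * ratios[2])
    train = n - val - test  # remainder goes to training
    return train, val, test
```

Applied per class, this recovers the 480/60/60 normal and 198/25/25 defective partitions, preserving the 2.42:1 ratio in every split.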
To address the class imbalance in the training set while preserving model generalization, we applied differential data augmentation using the Albumentations library (v1.3.1) [36]. Normal samples were augmented with a factor of 4 × , while defective samples underwent 10 × augmentation, yielding a near-balanced training corpus of 3900 images (1920 normal, 1980 defective, representing 49.2% and 50.8%, respectively).
The augmentation pipeline incorporates geometric transformations, including rotation (±15°), horizontal flipping (p = 0.5), and affine transformations with scaling (0.8–1.2) and translation (±10%). Photometric adjustments include brightness (±20%), contrast (0.8–1.2), Gaussian noise (σ = 0.02), and HSV perturbations (±10%). Advanced transformations, including Mosaic, MixUp, Copy–Paste, grid distortion, and elastic deformation, enhance model robustness to real-world variations in lighting, perspective, and environmental conditions encountered during power line inspection.
Validation and test sets remain unaugmented to ensure unbiased performance evaluation and facilitate fair comparison with existing methods. All images were resized to 640 × 640 pixels during training while maintaining aspect ratio. Dataset statistics are summarized in Table 1.

3.3. Lightweight FasterNet Backbone

The baseline YOLOv12 architecture employs a convolutional backbone composed of C3K2 and A2C2f modules, which achieves strong performance on general object detection benchmarks. However, for specialized industrial inspection tasks requiring real-time inference on edge devices, this design introduces computational redundancy and suboptimal throughput-latency trade-offs. We address these limitations by replacing the original backbone with FasterNet [33] (Figure 3), a partial convolution-based architecture that delivers superior feature extraction capacity per unit of computation. The key innovation lies in partial convolution (PConv), which processes only $c_p = c/4$ input channels while applying an identity mapping to the remaining channels, reducing computational cost to 1/16 and memory access overhead to 1/4 of standard convolution, thereby enabling substantially higher throughput on resource-constrained hardware.
FasterNet organizes feature extraction into four hierarchical stages, each producing feature maps at progressively reduced spatial resolutions with expanding channel dimensions. Given an input image of dimension 640 × 640 × 3 , the architecture generates multi-scale feature representations at resolutions 160 × 160 × C 1 , 80 × 80 × C 2 , 40 × 40 × C 3 , and 20 × 20 × C 4 , where channel dimensions typically increase at each stage to accommodate richer semantic representations. The initial stage incorporates an embedding module employing stride-4 downsampling with a 4 × 4 convolutional kernel, while subsequent stages utilize merging modules with stride-2 operations and 2 × 2 kernels for progressive spatial downsampling and channel expansion.
We adopt the FasterNet configuration with block repetition pattern [ l 1 , l 2 , l 3 , l 4 ] = [ 1 , 2 , 8 , 2 ] , which balances feature extraction capacity with parameter efficiency. The asymmetric distribution of computational resources reflects the principle that deeper layers benefit from additional capacity to learn complex high-level semantic representations critical for defect discrimination. This architectural design achieves 5.48 FPS inference speed while maintaining 98.9% mAP@0.5 detection accuracy, making it suitable for real-time UAV-based inspection where typical image acquisition rates range from 1 to 10 FPS.

FasterNet Block Architecture

The fundamental building block of FasterNet employs partial convolution (PConv) to achieve dramatic reductions in computational cost and memory access overhead while maintaining representational capacity. The block adopts an inverted residual structure comprising a PConv operation followed by two pointwise convolutions (PWConv, implemented as 1 × 1 convolutions). Batch normalization and activation functions are strategically placed only after the middle PWConv layer, maintaining feature diversity while minimizing computational complexity.
The partial convolution operation performs standard convolution on only a subset of input channels while leaving the remaining channels unchanged through an identity mapping. For an input tensor with c channels, height h, and width w, partial convolution processes only $c_p$ channels with a $k \times k$ kernel, where we set $c_p = c/4$ following the original FasterNet design. The floating-point operations required are
\[ \mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^2 \times c_p^2 \]
Under the constraint $c_p = c/4$, this computational cost represents only 1/16 of the operations required by a standard convolution operating on all channels. The memory access cost is similarly reduced,
\[ \mathrm{MAC}_{\mathrm{PConv}} = h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p \]
This memory access overhead is merely 1/4 of conventional convolution, enabling substantially higher throughput on hardware with limited memory bandwidth. For the insulator detection task, we removed FasterNet’s classification-specific terminal layers (global average pooling, final convolutional layer, and fully connected head), retaining exclusively the hierarchical feature extraction backbone that interfaces directly with the detection neck and head modules.
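The PConv operation and the 1/16 FLOPs ratio derived above can be sketched in PyTorch as follows. This is a minimal illustration: the class and variable names are ours, and the released FasterNet code additionally fuses PConv with the block's pointwise convolutions.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a k x k convolution on the first c_p = c/4
    channels, identity mapping on the remaining 3c/4 channels."""
    def __init__(self, c: int, k: int = 3):
        super().__init__()
        self.cp = c // 4
        self.conv = nn.Conv2d(self.cp, self.cp, k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[:, :self.cp], x[:, self.cp:]       # split channels
        return torch.cat([self.conv(x1), x2], dim=1)  # convolve part, pass the rest

# FLOPs ratio: (h*w*k^2*c^2) / (h*w*k^2*c_p^2) with c_p = c/4 gives 16
c, k, h, w = 64, 3, 40, 40
ratio = (h * w * k ** 2 * c ** 2) / (h * w * k ** 2 * (c // 4) ** 2)

x = torch.randn(2, c, h, w)
out = PConv(c)(x)
```

The untouched 3c/4 channels are passed through verbatim, which is what keeps memory traffic low: only the convolved slice is read and written by the kernel.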

3.4. SGC2f Feature Aggregation Module

Multi-scale feature fusion enables models to leverage both fine-grained spatial details and abstract semantic information for accurate localization and classification. However, conventional concatenation-based fusion approaches suffer from information loss during naive feature aggregation and computational overhead from attention mechanisms operating on full spatial resolutions. The proposed SGC2f (Sage-Area-guided Cross-stage fusion) module addresses both limitations through a dual-path architecture with spatial partitioning and multi-level feature aggregation.
The module processes input features $\mathbf{F} \in \mathbb{R}^{B \times C \times H \times W}$ according to
\[ \mathbf{F}_{\mathrm{out}} = \alpha \cdot \mathrm{Concat}\big(H_i(T(\mathbf{F}))\big) + \mathbf{F} \]
where T ( · ) denotes initial transition layers comprising 1 × 1 convolutions that standardize feature maps and adjust channel dimensions, H i ( · ) represents the i-th parallel processing path containing a sequence of SageBlocks, and α is a learnable scaling factor initialized to 0.01. The residual connection directly bypasses the main computational path, preserving original features and facilitating gradient flow during backpropagation.
The main computational pathway splits processed features into multiple parallel branches, each containing varying numbers of sequentially arranged SageBlocks. This multi-path design generates a hierarchical representation where shallow branches capture fine-grained spatial details critical for detecting subtle surface defects, while deeper branches extract abstract high-level semantic features encoding object-level context. Outputs from all parallel paths are concatenated along the channel dimension, producing an enriched feature representation spanning multiple levels of abstraction.
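The aggregation pattern described above can be sketched schematically as follows. Plain 3×3 convolution blocks stand in for SageBlocks here; only the transition layer, parallel branches of increasing depth, channel concatenation, learnable scale α (initialized to 0.01), and residual connection are illustrated, not the real module's internals.

```python
import torch
import torch.nn as nn

class SGC2fSketch(nn.Module):
    """Schematic SGC2f: 1x1 transition, parallel branches of increasing depth,
    concatenation, learnable scale alpha, and a residual bypass."""
    def __init__(self, c: int, n_paths: int = 2):
        super().__init__()
        cb = c // n_paths                      # channels per branch
        self.transition = nn.Conv2d(c, c, 1)   # T(.): 1x1 channel adjustment
        # path i stacks (i + 1) stand-in blocks: shallow -> fine detail, deep -> semantics
        self.paths = nn.ModuleList(
            nn.Sequential(*[nn.Sequential(nn.Conv2d(cb, cb, 3, padding=1), nn.SiLU())
                            for _ in range(i + 1)])
            for i in range(n_paths))
        self.alpha = nn.Parameter(torch.tensor(0.01))  # learnable scale, init 0.01

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = self.transition(x).chunk(len(self.paths), dim=1)
        out = torch.cat([p(ch) for p, ch in zip(self.paths, chunks)], dim=1)
        return self.alpha * out + x            # scaled fusion + residual

x = torch.randn(1, 64, 40, 40)
y = SGC2fSketch(64)(x)
```

With α initialized near zero, the module starts close to an identity mapping, so the residual path dominates early training and gradient flow is preserved.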

3.4.1. Sage-Area Attention Mechanism

The Sage-Area Attention mechanism implements a spatially partitioned self-attention operation designed to reduce computational complexity while preserving the ability to model long-range dependencies. Given input features $\mathbf{F} \in \mathbb{R}^{B \times C \times H \times W}$, the mechanism applies parallel linear transformations to generate query, key, and value representations,
\[ \mathbf{Q} = \mathbf{F}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{F}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{F}\mathbf{W}_V \]
where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{C \times C}$ denote learnable projection matrices. Rather than computing attention across the entire spatial domain (requiring $O(L^2 d)$ operations, where $L = H \times W$), the mechanism partitions each feature map into $S = 4$ non-overlapping spatial regions. This partitioning reduces the attention sequence length from $L$ to $L/S$ within each region while maintaining the full channel dimension C.
The multi-head attention mechanism computes
\[ \mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,\mathbf{W}^O \]
where W O projects the concatenated result back to the original feature space. Each attention head processes its assigned partitioned areas according to
\[ \mathrm{head}_i = \mathrm{Attention}(\mathbf{Q}_i\mathbf{W}_i^Q, \mathbf{K}_i\mathbf{W}_i^K, \mathbf{V}_i\mathbf{W}_i^V) \]
For each partitioned spatial region, attention weights are computed as
\[ A(i, j) = \sum_{k \in R_{i,j}} \phi(Q_{i,j}, K_k) \cdot V_k \]
where R i , j denotes the partitioned area containing spatial position ( i , j ) , and ϕ ( · ) represents the scaled dot-product attention function. The output undergoes a reshape operation to restore the original spatial dimensions [ B , C , H , W ] .
This spatial partitioning strategy reduces computational complexity from $O(L^2 d)$ for standard self-attention to approximately $O(L^2 d / S)$ for the partitioned variant, enabling deployment of attention mechanisms in real-time detection systems operating under strict latency constraints.
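The partitioned attention computation can be illustrated with the following sketch. For brevity it uses shared identity Q/K/V projections and a 2×2 region grid (S = 4); the production module uses learned projections and multiple heads as in the equations above.

```python
import torch

def area_attention(x: torch.Tensor, s: int = 2) -> torch.Tensor:
    """Self-attention restricted to an s x s grid of spatial regions
    (S = s*s = 4), reducing cost from O(L^2 d) to roughly O(L^2 d / S)."""
    B, C, H, W = x.shape
    hs, ws = H // s, W // s
    # [B, C, H, W] -> [B*S, hs*ws, C]: one independent attention problem per region
    regions = (x.view(B, C, s, hs, s, ws)
                .permute(0, 2, 4, 3, 5, 1)
                .reshape(B * s * s, hs * ws, C))
    q = k = v = regions                                   # identity projections for brevity
    attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ v
    # restore the original [B, C, H, W] layout
    return (out.reshape(B, s, s, hs, ws, C)
               .permute(0, 5, 1, 3, 2, 4)
               .reshape(B, C, H, W))

x = torch.randn(1, 8, 16, 16)
y = area_attention(x)
ones = torch.ones(1, 4, 8, 8)
smoothed = area_attention(ones)   # uniform attention maps a constant input to itself
```

Each region attends only within itself, so the S attention problems can be batched along the first dimension, which is where the latency benefit on GPUs comes from.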

3.4.2. SageAttention Optimization

The attention computation within each SageBlock employs SageAttention [65], an optimized attention mechanism designed for processing long sequences with enhanced memory efficiency. SageAttention derives its advantages from two techniques: quantized attention and smoothing of the key matrix K.
Quantized attention compresses the numerical precision of attention weights by mapping floating-point values into low-bit discrete representations. By reducing bit-width from 32-bit to 8-bit or lower precision, this technique significantly decreases memory access overhead and computational complexity. SageAttention incorporates a learnable quantization scaling factor that adaptively adjusts to the value distribution characteristics of different layers and feature maps, preserving representational capacity while enhancing hardware utilization efficiency.
The smooth matrix K technique introduces a smoothing operator applied to the key matrix prior to attention weight calculation. This operator performs local feature smoothing along the sequence dimension, enhancing the continuity and stability of the attention distribution. By mitigating abrupt attention shifts caused by anomalous key vectors or noise contamination, this smoothing operation alleviates accuracy degradation during quantization while improving robustness when processing insulator surface images containing noise, compression artifacts, or illumination variations.

3.5. Wise-ShapeIoU Loss Function

The baseline YOLOv12 architecture employs the Complete Intersection over Union (CIoU) loss function for bounding box regression, augmenting the standard IoU metric with penalty terms for center distance and aspect ratio difference. However, CIoU exhibits a fundamental limitation: when predicted and ground truth boxes have similar aspect ratios, the aspect ratio penalty term becomes ineffective, providing insufficient gradient signal for accurate localization. This limitation proves particularly problematic for insulator defect detection, where defects often exhibit elongated or irregular shapes requiring precise geometric modeling.
We introduce the Wise-ShapeIoU loss function [34,35], which incorporates explicit constraints on both shape and scale attributes of bounding boxes. The Shape-IoU component is formulated as
\[ \mathcal{L}_{\mathrm{Shape\text{-}IoU}} = 1 - \mathrm{IoU} + \mathrm{distance}^{\mathrm{shape}} + 0.5 \times \Omega^{\mathrm{shape}} \]
where the IoU term quantifies the overlap between predicted and ground truth boxes,
\[ \mathrm{IoU} = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|} \]
The shape distance incorporates anisotropic geometric constraints weighted according to ground truth box dimensions,
\[ \mathrm{distance}^{\mathrm{shape}} = w_w \times \frac{(x_c - x_c^{gt})^2}{c^2} + w_h \times \frac{(y_c - y_c^{gt})^2}{c^2} \]
where $(x_c, y_c)$ and $(x_c^{gt}, y_c^{gt})$ denote the center coordinates of the predicted and ground truth boxes, respectively, $c$ represents the diagonal length of the smallest enclosing box, and $w_w, w_h$ are weighting coefficients computed from the ground truth dimensions,
\[ w_w = \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}, \quad w_h = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}} \]
This adaptive weighting ensures stronger gradients along the dominant dimension of each ground truth box, facilitating accurate regression for objects with extreme aspect ratios. The shape penalty term explicitly penalizes width and height mismatches,
\[ \Omega^{\mathrm{shape}} = \sum_{t \in \{w, h\}} (1 - e^{-\omega_t})^{\theta}, \quad \theta = 4 \]
\[ \omega_w = w_h \times \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad \omega_h = w_w \times \frac{|h - h^{gt}|}{\max(h, h^{gt})} \]
To enhance convergence speed and improve handling of difficult samples, we adopt the Wise-IoU v3 strategy [35], which introduces an outlierness-based gradient modulation mechanism. The complete Wise-ShapeIoU loss is
\[ \mathcal{L}_{\mathrm{Wise\text{-}ShapeIoU}} = r \cdot \mathcal{L}_{\mathrm{Shape\text{-}IoU}} \]
where the gradient gain r is determined by anchor quality,
\[ r = \frac{\beta}{\delta \alpha^{\beta - \delta}} \]
Here, β quantifies the outlierness of each anchor box, defined as the ratio of the current anchor’s Shape-IoU loss to the mean Shape-IoU loss across all anchors in the current mini-batch,
\[ \beta = \frac{\mathcal{L}^{(i)}_{\mathrm{Shape\text{-}IoU}}}{\tfrac{1}{N} \sum_{j=1}^{N} \mathcal{L}^{(j)}_{\mathrm{Shape\text{-}IoU}}} \]
where N denotes the number of anchors. High-quality anchors (low loss) receive higher gradient gains to expedite optimization, while low-quality anchors (high loss, potentially corresponding to hard negatives or annotation errors) receive reduced gradient contributions. Following empirical validation, we adopt α = 1.9 and δ = 3 for all experiments.
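The Shape-IoU loss and the Wise-IoU v3 gradient gain can be computed as in the following sketch. It uses a scalar, single-box form for clarity; the (x_c, y_c, w, h) box format and the helper names are ours.

```python
import math

def shape_iou_loss(box, gt, scale=1.0, theta=4):
    """Shape-IoU loss for one (x_c, y_c, w, h) box pair, following the equations above."""
    x, y, w, h = box
    xg, yg, wg, hg = gt
    # IoU from center-size boxes
    ix = max(0.0, min(x + w / 2, xg + wg / 2) - max(x - w / 2, xg - wg / 2))
    iy = max(0.0, min(y + h / 2, yg + hg / 2) - max(y - h / 2, yg - hg / 2))
    inter = ix * iy
    iou = inter / (w * h + wg * hg - inter)
    # squared diagonal of the smallest enclosing box
    cw = max(x + w / 2, xg + wg / 2) - min(x - w / 2, xg - wg / 2)
    ch = max(y + h / 2, yg + hg / 2) - min(y - h / 2, yg - hg / 2)
    c2 = cw ** 2 + ch ** 2
    # ground-truth-weighted shape distance
    ww = 2 * wg ** scale / (wg ** scale + hg ** scale)
    wh = 2 * hg ** scale / (wg ** scale + hg ** scale)
    dist = ww * (x - xg) ** 2 / c2 + wh * (y - yg) ** 2 / c2
    # shape penalty on width/height mismatch (cross-weighted as in the equations)
    ow = wh * abs(w - wg) / max(w, wg)
    oh = ww * abs(h - hg) / max(h, hg)
    omega = (1 - math.exp(-ow)) ** theta + (1 - math.exp(-oh)) ** theta
    return 1 - iou + dist + 0.5 * omega

def wise_gain(loss_i, batch_losses, alpha=1.9, delta=3):
    """Wise-IoU v3 gain r = beta / (delta * alpha**(beta - delta))."""
    beta = loss_i / (sum(batch_losses) / len(batch_losses))
    return beta / (delta * alpha ** (beta - delta))

perfect = shape_iou_loss((0, 0, 4, 2), (0, 0, 4, 2))  # identical boxes -> loss 0
```

A perfectly matched box yields IoU = 1 with zero distance and shape penalty, so the loss vanishes; anchors whose outlierness β equals δ receive a gain of exactly 1.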

3.6. Adversarial Robustness Evaluation Framework

3.6.1. Threat Model and Attack Formulation

Adversarial attacks introduce imperceptible perturbations to input images that cause deep neural networks to produce erroneous predictions while remaining visually indistinguishable to human observers [23,24]. Formally, let an input image be represented as $\mathbf{x}_{\mathrm{clean}} \in \mathbb{R}^{h \times w \times c}$ with ground truth annotations $\mathbf{y}$. An adversarial attack crafts a perturbed image $\mathbf{x}_{\mathrm{adv}}$ by solving
\[ \mathbf{x}_{\mathrm{adv}} = \mathbf{x}_{\mathrm{clean}} + \arg\max_{\boldsymbol{\delta} \in \mathcal{S}} \mathcal{L}_{\theta}(\mathbf{x}_{\mathrm{clean}} + \boldsymbol{\delta}, \mathbf{y}) \]
where $\boldsymbol{\delta} \in \mathbb{R}^{h \times w \times c}$ represents the adversarial perturbation, $\mathcal{L}_{\theta}$ denotes the model's loss function with parameters $\theta$, and $\mathcal{S}$ defines the constraint set ensuring imperceptibility under a specified $\ell_p$ norm. For object detection models, $\mathcal{L}_{\theta}$ encompasses objectness, classification, and localization losses [28].
We evaluate adversarial robustness under white-box threat scenarios where adversaries possess complete knowledge of model architecture, parameters, and gradients, representing the strongest threat model [26]. Three canonical attack algorithms—Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and Carlini and Wagner (C&W)—are employed with perturbation budgets calibrated to established benchmarks [25,26].

3.6.2. Attack Methods

Fast Gradient Sign Method (FGSM). FGSM generates adversarial examples through single-step gradient computation [24]. Given a perturbation budget $\epsilon$ constraining the $\ell_\infty$ norm of the perturbation, the adversarial example is
\[ \mathbf{x}_{\mathrm{adv}} = \mathrm{clip}\left(\mathbf{x}_{\mathrm{clean}} + \epsilon \cdot \mathrm{sign}\left(\nabla_{\mathbf{x}} \mathcal{L}_{\theta}(\mathbf{x}_{\mathrm{clean}}, \mathbf{y})\right),\, 0,\, 1\right) \]
where $\nabla_{\mathbf{x}} \mathcal{L}_{\theta}$ denotes the loss gradient with respect to the input. We evaluate FGSM at four perturbation strengths, $\epsilon \in \{8/255, 16/255, 32/255, 48/255\}$, corresponding to standard benchmarks ranging from imperceptible to physically realizable perturbations [25,66].
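A minimal FGSM sketch in PyTorch is shown below; the toy classifier and input shapes are illustrative stand-ins for the detector and its composite loss, not our actual training pipeline.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """Single-step FGSM: x_adv = clip(x + eps * sign(grad_x L), 0, 1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()                                  # populate x.grad
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

# toy stand-in model and data (hypothetical shapes)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 2))
x = torch.rand(1, 3, 8, 8)
y = torch.tensor([1])
x_adv = fgsm_attack(model, torch.nn.functional.cross_entropy, x, y, eps=8 / 255)
```

Because the sign of the gradient is used, every pixel moves by exactly ±ε before clipping, so the perturbation respects the $\ell_\infty$ budget by construction.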
Projected Gradient Descent (PGD). PGD extends FGSM through iterative refinement with projection onto the constraint ball [25]. Starting from the random initialization $\mathbf{x}_{\mathrm{adv}}^{(0)} = \mathbf{x} + \mathcal{U}(-\epsilon, \epsilon)$, PGD performs $N$ iterations,
\[ \mathbf{x}_{\mathrm{adv}}^{(t+1)} = \Pi_{\mathbf{x} + \mathcal{B}_\epsilon}\left(\mathbf{x}_{\mathrm{adv}}^{(t)} + \alpha \cdot \mathrm{sign}\left(\nabla_{\mathbf{x}} \mathcal{L}_{\theta}(\mathbf{x}_{\mathrm{adv}}^{(t)}, \mathbf{y})\right)\right) \]
where $\Pi_{\mathbf{x} + \mathcal{B}_\epsilon}(\cdot)$ denotes projection onto the $\ell_\infty$ ball $\mathcal{B}_\epsilon = \{\boldsymbol{\delta} : \|\boldsymbol{\delta}\|_\infty \leq \epsilon\}$ centered at $\mathbf{x}$, and $\alpha$ is the step size. We employ $N = 20$ iterations with step size $\alpha = 0.05\epsilon$ for perturbation budgets $\epsilon \in \{8/255, 16/255, 32/255, 48/255\}$.
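PGD can be sketched analogously, with the random start, signed gradient steps, and projection onto $\mathcal{B}_\epsilon$ made explicit (again with a toy model standing in for the detector):

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps, alpha, steps=20):
    """Iterative PGD with random start and projection onto the l_inf ball B_eps."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # gradient ascent step
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # projection Pi_{x + B_eps}
        x_adv = x_adv.clamp(0, 1)                      # keep a valid image
    return x_adv.detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 2))
x, y = torch.rand(1, 3, 8, 8), torch.tensor([0])
eps = 16 / 255
x_adv = pgd_attack(model, torch.nn.functional.cross_entropy, x, y,
                   eps=eps, alpha=0.05 * eps, steps=5)
```

The per-step clamp of the accumulated perturbation implements the projection, so intermediate iterates can never drift outside the budget regardless of the step count.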
Carlini and Wagner (C&W) Attack. The C&W attack formulates adversarial example generation as a constrained optimization minimizing the $\ell_2$ distance [26],
\[ \min_{\boldsymbol{\delta}} \; \|\boldsymbol{\delta}\|_2^2 + c \cdot f(\mathbf{x} + \boldsymbol{\delta}) \]
where $f(\cdot)$ promotes misclassification and $c > 0$ balances perturbation minimization against attack effectiveness. To handle box constraints, C&W employs the change of variables $\mathbf{x}_{\mathrm{adv}} = \frac{1}{2}(\tanh(\mathbf{w}) + 1)$, where $\mathbf{w} \in \mathbb{R}^{h \times w \times c}$ is optimized via Adam [67]. We evaluate C&W with $\ell_2$ bounds $\tau \in \{0.5, 1.0, 2.0, 3.0\}$ over 50 optimization iterations with learning rate 0.01 and binary search over $c \in [0, 10]$.
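The tanh change of variables can be sketched as follows. This is a simplified single-constant variant without the binary search over c, and the margin-style $f$ used here is one common classification choice, not necessarily the exact detection objective in our experiments.

```python
import torch

def cw_attack(model, x, y, c=1.0, steps=50, lr=0.01):
    """C&W-style sketch: optimize w with Adam, where x_adv = 0.5*(tanh(w)+1)
    keeps every pixel in [0, 1] by construction."""
    w = torch.atanh((2 * x - 1).clamp(-1 + 1e-6, 1 - 1e-6)).detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)
        true = logits.gather(1, y.view(-1, 1)).squeeze(1)
        other = logits.scatter(1, y.view(-1, 1), float("-inf")).amax(dim=1)
        f = (true - other).clamp(min=0)                 # margin-style misclassification term
        loss = ((x_adv - x) ** 2).sum() + c * f.sum()   # ||delta||_2^2 + c * f
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 2))
x, y = torch.rand(1, 3, 8, 8), torch.tensor([1])
x_adv = cw_attack(model, x, y)
```

Since tanh maps onto (−1, 1), the reparameterized image is always in the valid box, so the optimizer never needs explicit clipping.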

3.6.3. Adversarial Training for Robustness

Adversarial training augments the training dataset with adversarially perturbed examples to enhance model robustness [24,25]. The robust training objective is formulated as a minimax optimization,
\[ \min_{\theta} \, \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}} \left[ \max_{\|\boldsymbol{\delta}\|_\infty \leq \epsilon} \mathcal{L}_{\theta}(\mathbf{x} + \boldsymbol{\delta}, \mathbf{y}) \right] \]
where the inner maximization generates worst-case adversarial perturbations, and the outer minimization trains the model to minimize loss on these adversarial examples [25].
Our adversarial training employs a two-stage strategy: baseline training on clean data followed by adversarial fine-tuning with mixed batches containing 70% clean and 30% adversarial examples. Adversarial examples are generated online during training using PGD with ϵ = 16 / 255 (10 iterations, α = 0.007 ) and FGSM with ϵ = 32 / 255 in equal proportions. This mixed-batch approach balances clean accuracy preservation and adversarial robustness, addressing the inherent trade-off between standard and robust performance [32,68]. The adversarial ratio ρ = 0.3 was selected through preliminary experiments evaluating mixed-batch compositions from 10% to 50% adversarial examples, where 30% achieved optimal balance between clean accuracy preservation (98.9% mAP) and robust performance under attacks (94.5% mAP under PGD ϵ = 48/255). This ratio also reflects realistic threat scenarios for smart grid deployments: higher adversarial ratios (>40%) introduce excessive perturbations that, while effective for worst-case robustness training, produce visibly distorted images exceeding practical attack constraints in operational environments where adversaries must maintain imperceptibility to avoid detection by human operators or automated anomaly detection systems [25,66]. Final model selection considers both clean validation accuracy and robust accuracy under PGD attack with ϵ = 32 / 255 , prioritizing models that achieve strong performance across both metrics.
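The mixed-batch composition can be sketched as below. For brevity the online generation uses FGSM only, whereas our training alternates PGD and FGSM in equal proportions; the 30% slice of each batch is perturbed and the remainder stays clean.

```python
import torch

def mixed_batch(x, y, model, loss_fn, adv_ratio=0.3, eps=16 / 255):
    """Replace an adv_ratio slice of each batch with FGSM-perturbed copies
    generated online against the current model."""
    n_adv = int(adv_ratio * x.size(0))
    if n_adv == 0:
        return x
    xa = x[:n_adv].clone().detach().requires_grad_(True)
    loss_fn(model(xa), y[:n_adv]).backward()              # gradients for the slice only
    x_adv = (xa + eps * xa.grad.sign()).clamp(0, 1).detach()
    return torch.cat([x_adv, x[n_adv:]], dim=0)           # 30% adversarial, 70% clean

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 2))
x, y = torch.rand(10, 3, 8, 8), torch.randint(0, 2, (10,))
batch = mixed_batch(x, y, model, torch.nn.functional.cross_entropy)
```

The returned batch is then fed through the ordinary training step, so the outer minimization of the minimax objective sees both clean and worst-case-perturbed samples in every update.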

3.7. Explainability Analysis

Deep convolutional neural networks, despite achieving superior performance in object detection tasks, remain inherently difficult to interpret due to their complex hierarchical feature learning mechanisms [40]. This lack of transparency poses challenges for deployment in safety-critical applications such as power infrastructure monitoring, where understanding the rationale behind detection decisions is essential for building operator trust and identifying potential model vulnerabilities.
To validate that our model learns semantically meaningful visual features rather than exploiting spurious correlations or dataset artifacts, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) [40] to generate class-discriminative localization maps. For a target class c, Grad-CAM computes the gradient of the class score with respect to the feature maps of the final convolutional layer, performs global average pooling to obtain importance weights, and produces the localization map $L^{c}_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$, where $\alpha_k^c$ represents the importance weight for feature map $A^k$ and class c. The ReLU activation ensures that only features with a positive influence on the target class are highlighted.
For insulator defect detection, we generate Grad-CAM visualizations by extracting feature maps from the final convolutional layer of the Faster-YOLOv12n backbone and computing spatial importance weights for both insulator and defect classes. The resulting heatmaps are upsampled to input image resolution and overlaid with a thermal colormap to provide intuitive visual explanations. We apply this analysis to both clean test images and adversarial examples to compare attention patterns under normal and perturbed conditions, enabling identification of how adversarial perturbations disrupt the model’s learned feature representations.
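A compact Grad-CAM sketch using forward/backward hooks is shown below on a toy CNN; the hook-based pattern is the same one applied to the final convolutional layer of the detector backbone (upsampling and colormap overlay are omitted).

```python
import torch
import torch.nn as nn

def grad_cam(model, target_layer, x, class_idx):
    """Grad-CAM: global-average-pooled gradients weight the target layer's
    feature maps; ReLU keeps only positively contributing regions."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    score = model(x)[0, class_idx]          # class score for the target class
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)    # alpha_k^c
    cam = torch.relu((weights * feats["a"]).sum(dim=1))    # ReLU(sum_k alpha_k^c A^k)
    return cam / (cam.max() + 1e-8)                        # normalize to [0, 1]

# toy CNN standing in for the detector backbone (hypothetical layout)
conv = nn.Conv2d(3, 8, 3, padding=1)
model = nn.Sequential(conv, nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
cam = grad_cam(model, conv, torch.rand(1, 3, 16, 16), class_idx=1)
```

The same routine run on clean and adversarial inputs yields the attention comparisons discussed in Section 4.4.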

3.8. Implementation Details

The model was implemented in PyTorch 2.0.1 with Python 3.9 and trained on an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) (24 GB memory) under Ubuntu 20.04 LTS. Training employed the Stochastic Gradient Descent (SGD) optimizer with momentum 0.937 and weight decay $5 \times 10^{-4}$. The initial learning rate was set to 0.01 with a cosine annealing schedule decaying to $10^{-4}$ over 100 epochs. Input images were resized to 640 × 640 pixels using bilinear interpolation with letterbox padding, and the training batch size was 16 images per GPU. Data augmentation incorporated mosaic augmentation (combining four images), random horizontal flipping (probability 0.5), and HSV color space perturbations. Mosaic augmentation was disabled for the final 10 epochs (91–100) to stabilize batch normalization statistics.
The loss function comprises three components: Wise-ShapeIoU for bounding box regression (weight 7.5), binary cross-entropy for objectness prediction (weight 1.0), and binary cross-entropy for class prediction (weight 0.5), with weights determined through grid search. The model was trained end-to-end without freezing any layers. Batch normalization layers employed momentum 0.03 and epsilon $10^{-3}$. Early stopping was implemented based on validation mAP@0.5 with a patience of 50 epochs, retaining the checkpoint achieving the highest validation performance for final testing. All experiments employed mixed-precision training (FP16) using automatic mixed precision (AMP).

Cross-Validation Strategy

To address the limited dataset size and ensure the statistical reliability of our results, we performed five-fold stratified cross-validation on the complete CPLID. The dataset (848 images) was partitioned into five folds of approximately 170 images each, with stratification preserving the original class distribution (2.42:1 normal-to-defective) across all partitions. For each iteration, one fold (∼170 images) was held out as the test set, while the remaining 678 images were further split into training (∼603 images) and validation (∼75 images) subsets following the 8:1 ratio of our original configuration. Within each iteration, the model was trained from scratch using the hyperparameters, augmentation strategies, and training procedures described in Section 3.8; differential augmentation (4× for normal samples, 10× for defective samples) was applied independently within each training fold, expanding the effective training set to approximately 3900 images per fold while preventing information leakage. The reported cross-validation metrics are the mean and standard deviation across all five folds, providing robust confidence intervals and ensuring that our results generalize across data partitions rather than overfitting to a single train-test split.

3.9. Evaluation Metrics

To quantitatively evaluate the performance of different methods, we utilize precision (P), recall (R), F1-score, average precision (AP), and mean average precision (mAP) as metrics for detection accuracy, GFLOPs as a measure of computational complexity, and frames per second (FPS) as a metric for inference speed.
Precision measures the proportion of correct detections among all predictions,
\[ P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \]
Recall quantifies the proportion of ground truth objects successfully detected,
\[ R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \]
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
The F1-score provides a harmonic mean of precision and recall,
\[ F_1 = \frac{2 \times P \times R}{P + R} \]
Average precision (AP) computes the area under the precision-recall curve for each class,
\[ \mathrm{AP} = \int_0^1 P(R)\, dR \]
Mean average precision (mAP) averages AP across all classes,
\[ \mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i \]
where $N$ is the number of classes. We report mAP at an IoU threshold of 0.5 (mAP@0.5), which considers a detection correct if its IoU with the ground truth exceeds 0.5.
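The metric definitions above can be sketched directly. The AP routine below uses all-point interpolation over a small illustrative precision-recall set; the numbers are examples, not values from our experiments.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from TP/FP/FN counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def average_precision(precisions, recalls):
    """AP as the area under a (recall, precision) curve, using the
    all-point interpolation commonly applied for mAP@0.5."""
    pairs = sorted(zip(recalls, precisions))   # sweep recall left to right
    ap, prev_r = 0.0, 0.0
    for i, (r, _) in enumerate(pairs):
        p_interp = max(p for _, p in pairs[i:])  # max precision at or beyond this recall
        ap += (r - prev_r) * p_interp
        prev_r = r
    return ap

p, r, f1 = detection_metrics(tp=95, fp=2, fn=5)           # r = 0.95
ap = average_precision([1.0, 0.9, 0.8], [0.5, 0.8, 1.0])  # illustrative curve
```

mAP then follows by averaging the per-class AP values, exactly as in the final equation above.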

4. Results

4.1. Performance Comparison with State-of-the-Art Models

To evaluate the effectiveness of our proposed model, we compare it against well-established object detection architectures using identical training conditions. The baseline models include RT-DETR [37], Faster R-CNN [13], YOLOv6 [45], YOLOv7 [38], YOLOv8 [39], YOLOv9 [17], and the baseline YOLOv12n [18]. All models are trained and tested on the CPLID with identical 80:10:10 stratified splits (678/85/85 images), augmentation strategies, and hyperparameter configurations.
As presented in Table 2, Faster-YOLOv12n demonstrates superior performance across all evaluation metrics, achieving precision of 97.8%, recall of 95.1%, F1-score of 96.4%, and mAP@0.5 of 98.9%. The model surpasses baseline YOLOv12n by 1.8% in mAP@0.5, YOLOv9 by 2.4%, YOLOv10 by 2.6%, YOLOv8 by 3.8%, YOLOv7 by 4.2%, and Faster R-CNN by 12.7%. While YOLOv9 and YOLOv10 achieve competitive performance with 96.5% and 96.3% mAP@0.5, respectively, demonstrating the effectiveness of recent architectural innovations including programmable gradient information and NMS-free detection, they exhibit lower recall values compared to our approach, resulting in inferior overall detection performance for safety-critical insulator defect detection. The proposed model achieves 5.48 FPS, which is sufficient for UAV-based transmission line inspection where typical image acquisition rates range from 1 to 10 FPS. The 4.7% improvement in recall over the YOLOv12n baseline is particularly significant for safety-critical infrastructure where false negatives can precipitate power outages.
To ensure statistical robustness of our evaluation, we conducted five-fold stratified cross-validation on the complete CPLID in addition to the single-split evaluation. As presented in Table 3, Faster-YOLOv12n achieves consistent performance across all folds with a mean mAP@0.5 of 98.8% ± 0.2%, precision of 97.7% ± 0.2%, and recall of 95.0% ± 0.2%. The low standard deviation (σ ≤ 0.002 for all metrics) demonstrates that our model generalizes reliably across different data partitions rather than overfitting to a specific train-test split. The single-split results reported in Table 2 (mAP@0.5 = 98.9%, precision = 97.8%, recall = 95.1%) fall within one standard deviation of the cross-validation mean, confirming the statistical consistency of our findings.

4.2. Per-Class Performance Analysis

Table 4 presents per-class performance evaluation on the CPLID test set containing 85 insulator instances and 25 defect instances. The insulator class achieves precision of 97.5%, recall of 96.0%, F1-score of 96.7%, and AP@0.5 of 98.6%, demonstrating robust localization and classification across varying scales and background conditions. For the defect class, the model achieves precision of 99.0%, recall of 92.0%, F1-score of 95.4%, and AP@0.5 of 99.8%. The exceptional 99.0% precision for defects substantially reduces false alarms, which is critical for safety-critical infrastructure monitoring where unnecessary manual verification incurs significant operational costs. The weighted overall metrics of 97.8% precision, 95.1% recall, 96.4% F1-score, and 98.9% mAP@0.5 confirm balanced performance across both classes despite the 3.4:1 class imbalance ratio.

4.3. Adversarial Robustness Results

We evaluate Faster-YOLOv12n against three white-box adversarial attacks: Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and Carlini and Wagner (C&W) attack. Table 5 presents the baseline model performance under varying perturbation budgets, while Figure 4, Figure 5 and Figure 6 compare baseline and adversarially trained models across all attack configurations.
The baseline Faster-YOLOv12n exhibits progressive degradation under increasing perturbation budgets. Under FGSM attacks, mAP@0.5 decreases from 98.9% (clean) to 94.2% at ϵ = 8/255, 92.8% at ϵ = 16/255, 90.8% at ϵ = 32/255, and 88.9% at ϵ = 48/255, representing a 10.1% total degradation. Under PGD attacks, mAP@0.5 declines to 94.9%, 93.6%, 92.1%, and 90.8% at the same perturbation budgets, an 8.2% degradation at ϵ = 48/255. The C&W attack, optimized for minimal $\ell_2$ perturbations, achieves mAP@0.5 of 95.8%, 94.8%, 93.6%, and 92.5% at τ ∈ {0.5, 1.0, 2.0, 3.0}, with a 6.5% degradation at the strongest perturbation. These results indicate that the baseline model, despite achieving 98.9% clean accuracy, remains vulnerable to adversarial perturbations without explicit robustness training.
Adversarial training substantially improves robustness while preserving clean accuracy. The adversarially trained Faster-YOLOv12n maintains 98.9% mAP@0.5 on clean images, demonstrating no trade-off between accuracy and robustness. Under FGSM attacks, the model achieves 96.5%, 95.8%, 94.6%, and 93.2% mAP@0.5 at ϵ { 8 / 255 , 16 / 255 , 32 / 255 , 48 / 255 } , yielding a 5.7% total degradation compared to 10.1% for the baseline model. Against PGD attacks, performance remains at 96.8%, 96.2%, 95.4%, and 94.5%, achieving only 4.4% degradation versus 8.2% baseline. The C&W attack results in 97.2%, 96.5%, 95.8%, and 95.1% mAP@0.5 across perturbation budgets, with a minimal 1.8% degradation compared to 6.5% baseline. At the strongest attack configurations ( ϵ = 48 / 255 for FGSM/PGD, τ = 3.0 for C&W), adversarial training provides absolute improvements of 4.3%, 3.7%, and 2.6% respectively. The mixed-batch training strategy (70% clean, 30% adversarial) with online perturbation generation successfully mitigates the traditional accuracy–robustness trade-off observed in adversarial defenses.

4.4. Model Explainability with Grad-CAM

To validate that Faster-YOLOv12n learns semantically meaningful features rather than spurious correlations, we generate Grad-CAM visualizations for successful detections on clean test images. Figure 7 presents two representative examples showing model attention patterns during inference. The heatmaps demonstrate that the model correctly localizes attention on discriminative object regions, specifically the insulator structure and defect locations.
In Figure 7a, the model focuses intensely on the insulator body and visible defect area, with peak activation concentrated precisely at the fault location. Similarly, Figure 7b shows focused attention on the insulator’s structural components rather than background regions. The spatial coherence of high-importance areas confirms that the detector learns to identify relevant visual patterns—texture anomalies, structural irregularities, and object boundaries—that correspond to actual defect characteristics. This explainability analysis provides confidence that model predictions rely on legitimate visual cues rather than dataset artifacts or spurious features, which is critical for deployment in safety-critical infrastructure monitoring applications.
To validate that adversarial perturbations disrupt confidence calibration rather than feature representations, we extend the Grad-CAM analysis to adversarial examples. Figure 8 compares attention patterns across clean, attacked, and robust model conditions under FGSM attack ( ϵ = 32 / 255 ). The baseline model under attack (Figure 8b) maintains spatial attention on the defect region despite an 8.1% confidence reduction, confirming that perturbations primarily affect calibration rather than learned features. The adversarially trained model (Figure 8c) demonstrates attention coherence with 94.6% confidence recovery, validating the robustness of both the feature extraction and confidence mechanisms.

4.5. Ablation Study

To validate the contribution of each architectural component, we conduct systematic ablation experiments by progressively integrating the FasterNet backbone, SGC2f modules, and Wise-ShapeIoU loss into the baseline YOLOv12n architecture. Table 6 presents the incremental performance gains from each modification. The baseline YOLOv12n achieves precision of 96.5%, recall of 90.4%, and mAP@0.5 of 97.1%. Replacing the original backbone with FasterNet improves precision to 96.8%, recall to 91.5%, and mAP@0.5 to 97.6%, a 0.5% mAP improvement from enhanced feature extraction via partial convolution operations. Further incorporation of SGC2f modules increases precision to 97.3%, recall to 93.2%, and mAP@0.5 to 98.2%, providing an additional 0.6% mAP gain through improved multi-scale feature fusion. The complete Faster-YOLOv12n architecture with all three components achieves precision of 97.8%, recall of 95.1%, and mAP@0.5 of 98.9%, a cumulative improvement of 1.8% in mAP over the baseline. The progressive integration demonstrates that the three components provide complementary gains, with each addition improving precision, recall, and mAP simultaneously.
We further evaluate the impact of different loss functions while maintaining the complete architectural configuration. Table 7 compares six bounding box regression losses: GIoU, DIoU, CIoU, EIoU, SIoU, and Wise-ShapeIoU. GIoU achieves 97.8% mAP@0.5 at 6.52 FPS, while DIoU and CIoU reach 98.0% and 98.2%, respectively, with slightly reduced inference speeds of 5.85 and 5.90 FPS. EIoU obtains 97.9% mAP@0.5 at 5.65 FPS, and SIoU achieves 97.6% at 5.49 FPS. Our proposed Wise-ShapeIoU loss achieves the highest mAP@0.5 of 98.9% at 5.48 FPS, outperforming the second-best CIoU by 0.7%. The superior performance of Wise-ShapeIoU stems from its explicit geometric shape constraints and adaptive gradient modulation mechanism, which prove particularly effective for localizing elongated defect regions and irregular broken components characteristic of insulator faults.
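For reference, the GIoU baseline in Table 7 extends plain IoU with an enclosing-box penalty so that even non-overlapping boxes receive a useful regression gradient. A minimal sketch is below; Wise-ShapeIoU additionally applies shape-aware weighting and dynamic gradient focusing, which we do not reproduce here.

```python
def giou_loss(box_a, box_b):
    """GIoU loss for axis-aligned boxes given as (x1, y1, x2, y2).
    GIoU = IoU - |C \ (A ∪ B)| / |C|, where C is the smallest box
    enclosing both A and B; the loss 1 - GIoU lies in [0, 2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c_area = cw * ch
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou
```

A perfectly aligned prediction gives a loss of 0, while disjoint boxes give a loss above 1 that still shrinks as the boxes approach each other, which is the property that makes IoU-family losses trainable from poor initializations.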
To validate the contribution of differential augmentation to model generalization, we evaluated Faster-YOLOv12n trained on the original unaugmented dataset (678 images) versus the augmented dataset (3900 images). Training without augmentation achieves 93.5% precision, 88.2% recall, and 94.7% mAP@0.5, while augmentation improves performance to 97.8% precision, 95.1% recall, and 98.9% mAP@0.5, representing gains of 4.3%, 6.9%, and 4.2%, respectively. These results demonstrate that differential augmentation with Mosaic, MixUp, and photometric transformations is essential for achieving robust generalization across diverse environmental conditions and complex transmission line backgrounds present in operational UAV-based inspection scenarios. The substantial recall improvement (6.9%) is particularly critical for safety-critical infrastructure where false negatives can precipitate power outages. Notably, our architectural improvements (FasterNet + SGC2f + Wise-ShapeIoU) remain consistent regardless of augmentation strategy, demonstrating that architectural innovations and data augmentation provide complementary benefits for insulator defect detection. These ablation results confirm that differential augmentation, loss function design, and architectural components contribute synergistically to overall detection performance, validating our design choices.
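As an illustration of the augmentation pipeline, a minimal MixUp sketch is shown below. The label handling follows the common detection-style variant (both label sets are kept rather than soft-mixed); the actual hyperparameters in our pipeline are those of the Ultralytics/Albumentations integration [36], not this sketch.

```python
import numpy as np

def mixup(img_a, img_b, labels_a, labels_b, alpha=32.0, rng=None):
    """MixUp for detection: blend two images with a Beta(alpha, alpha)
    coefficient and concatenate their box-label lists. With a large alpha
    the blend concentrates near 0.5, the setting commonly used for
    detection-style MixUp."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return mixed, labels_a + labels_b
```

Mosaic works analogously at the geometric level (four images tiled into one canvas with their boxes remapped), so both transforms expose the detector to defects in unfamiliar surrounding context, which is what drives the recall gain reported above.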

5. Discussion

Faster-YOLOv12n achieves 98.9% mAP@0.5 on the CPLID, outperforming YOLOv9 by 2.4% and the baseline YOLOv12n by 1.8%. The ablation study shows these gains come from three architectural components working together. The FasterNet backbone provides a 0.5% mAP gain through partial convolution operations that preserve spatial information during downsampling, which matters for detecting small defect regions spanning fewer than 50 pixels. SGC2f modules contribute an additional 0.6% by aggregating features across multiple receptive fields simultaneously, capturing both local texture anomalies and global structural patterns. The Wise-ShapeIoU loss adds 0.7% over CIoU for elongated defect geometries, where traditional IoU-based losses struggle with aspect-ratio mismatches. The per-class results show 99.8% AP@0.5 for defect detection, with consistent performance across both majority and minority classes despite the 3.4:1 imbalance ratio in the original dataset.
The adversarial robustness evaluation exposes systematic vulnerabilities with practical implications for smart grid deployment. FGSM attacks at ϵ = 48/255 reduce baseline model accuracy to 88.9%, while PGD attacks achieve 90.8% under equivalent perturbation budgets. The differential impact between attack methods reflects optimization landscape geometry: iterative PGD refinement produces adversarial examples closer to decision boundaries but requires 20 forward–backward passes compared to FGSM's single iteration. Adversarial training improves robustness substantially: the model maintains 93.2% mAP@0.5 under the strongest FGSM attacks (ϵ = 48/255), 94.5% under PGD attacks, and 95.1% under C&W attacks, while preserving 98.9% clean accuracy with zero degradation. This absence of the typical robustness–accuracy trade-off indicates that the defect detection task benefits from learning more robust features that generalize across both natural and synthetic perturbations. The adversarial training approach increases computational cost by approximately 40% during training due to online adversarial example generation.
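The 20-step PGD procedure referenced above can be sketched generically: each iteration takes an FGSM-style step and projects the iterate back onto the ℓ∞ ball of radius ϵ around the clean input. The sketch below takes the input-gradient function as a parameter; in the actual evaluation this gradient comes from backpropagating the detection loss through the model, which we do not reproduce here, and the step size alpha is an illustrative choice.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=48 / 255, alpha=4 / 255, steps=20, rng=None):
    """Projected Gradient Descent with random start.
    Each step: ascend along sign(dLoss/dx), project onto the L-inf ball
    of radius eps centered at x, then clip to the valid pixel range."""
    if rng is None:
        rng = np.random.default_rng()
    x_adv = np.clip(x + rng.uniform(-eps, eps, x.shape), 0.0, 1.0)  # random start
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))   # gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)          # project onto eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # valid pixel range
    return x_adv
```

The projection step is what makes PGD a constrained optimizer: however large the accumulated steps, the returned image never leaves the ϵ-neighborhood of the clean input, so attack strength is controlled entirely by the budget.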
The 5.48 FPS inference speed reflects architectural trade-offs introduced by the FasterNet backbone, SGC2f attention modules, and Wise-ShapeIoU loss optimization. While substantially slower than YOLOv8 (126.5 FPS, roughly 23× faster), our model achieves 98.9% mAP@0.5 compared with YOLOv8's 95.1% (a 3.8% mAP improvement), prioritizing detection accuracy over inference throughput for safety-critical infrastructure where false negatives can precipitate cascading power outages. This throughput suits offline batch processing and slow-scanning UAV inspection workflows common in utility operations, where high-resolution images are captured at 5–10 FPS during controlled flight patterns, transmitted to ground control stations, and processed post-flight rather than analyzed in real time on board. For applications requiring real-time video analytics at 15–30 FPS, lightweight architectures such as YOLOv8 would be more appropriate, accepting the 3.8% mAP reduction as a reasonable trade-off for higher throughput. Future work will investigate model compression (INT8 quantization, knowledge distillation, and structured pruning) targeting a 2–3× speedup while preserving adversarial robustness; hardware-specific optimization through TensorRT and ONNX Runtime for edge deployment on UAV-mounted computing units; and adaptive inference strategies that adjust model complexity to scene content, balancing accuracy and throughput in real-time scenarios.
This work addresses adversarial robustness for insulator defect detection on a single dataset as proof of concept. Practical deployment requires validation across diverse insulator types, weather conditions, and geographic regions. The CPLID contains primarily ceramic insulators photographed under daytime conditions, whereas real transmission networks employ composite, glass, and polymer insulators subjected to varying environmental stressors, including ice accretion, salt contamination, and ultraviolet degradation. Future work should evaluate model generalization across multiple datasets representing different insulator materials and defect modalities, investigate black-box transfer attacks where adversaries lack model access, and explore certified defense mechanisms that provide provable robustness guarantees rather than empirical adversarial training approaches. The computational overhead of adversarial training could be reduced through fast adversarial training or pre-computed adversarial example databases.

Limitations and Future Work

This work establishes adversarial robustness evaluation for insulator defect detection on a single dataset with binary classification (normal vs. defective). While our results demonstrate effective defense mechanisms and no trade-off between accuracy and robustness, several research directions remain for comprehensive operational validation. Real-world transmission networks deploy diverse insulator materials including ceramic, composite, glass, and polymer variants, each exhibiting distinct visual characteristics under environmental stress. Operational UAV inspection encounters conditions beyond daytime photography: rain, snow, ice accretion, solar glare, low-light scenarios, and seasonal illumination variations. Transmission systems also experience diverse defect modes including internal cracking, mechanical erosion, contamination tracking, and pollution flashover. Evaluating model generalization across these dimensions through multi-location, cross-dataset validation represents important future work for deployment readiness assessment.
Black-Box Threat Model Validation. Our evaluation focuses on white-box attacks where adversaries possess complete model knowledge (architecture, parameters, gradients), representing worst-case robustness bounds. Black-box transfer attacks represent more realistic deployment scenarios where attackers generate adversarial examples on publicly available models (YOLOv8, YOLOv9, YOLOv10) without accessing proprietary detection systems. Our adversarial training approach (Section 3.6.1) learns robust features that should reduce cross-model transferability, but empirical validation remains necessary. Future work should evaluate surrogate-generated perturbations on Faster-YOLOv12n to quantify whether adversarial training provides generalization beyond white-box robustness. This would involve training multiple surrogate models on CPLID, generating adversarial examples at varying perturbation budgets, and measuring attack success rates when transferred to our model. Low transferability would confirm that our defense learns inherently robust representations rather than memorizing white-box attack patterns.
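One way to quantify the transferability described above is the conditional transfer rate: among adversarial examples that succeed against the surrogate model, the fraction that also succeed against the target. A minimal sketch follows; the success criterion itself (e.g. a missed defect detection at a fixed confidence threshold) is an evaluation-design choice, not something fixed by the metric.

```python
import numpy as np

def transfer_rate(surrogate_success, target_success):
    """Conditional transfer rate: P(target fooled | surrogate fooled).
    Inputs are per-example boolean success flags for the same adversarial
    examples evaluated against both models. Lower is better for the
    defended target."""
    surrogate_success = np.asarray(surrogate_success, dtype=bool)
    target_success = np.asarray(target_success, dtype=bool)
    if not surrogate_success.any():
        return 0.0  # surrogate attack never succeeded; transfer undefined, report 0
    hits = surrogate_success & target_success
    return float(hits.sum() / surrogate_success.sum())
```

Reporting this rate per surrogate (YOLOv8, YOLOv9, YOLOv10) and per perturbation budget would give the transferability matrix the validation plan above calls for.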
Physical Attack Considerations. While our evaluation addresses digital perturbations applied to transmitted images, practical smart grid deployment must consider physical attack vectors. Adversarial patches attached to insulators must survive perspective transformations as UAVs change viewing angles, varying illumination and atmospheric attenuation, JPEG compression and wireless transmission artifacts, and human visual inspection [60]. Environmental manipulation through fog or spray introduction and wireless channel corruption during UAV-to-ground-station transmission [31] represent additional realistic threats. We aim to investigate physically realizable adversarial patches evaluated under real-world UAV imaging conditions, certified defense mechanisms via randomized smoothing [69,70] that provide provable robustness guarantees, and multi-modal sensing that combines visual detection with infrared thermal imaging, which is resistant to visual manipulation.
Our ongoing work explores several extensions to this framework. We are integrating vision-language models such as CLIP [71] and Power-LLaVA [72] to enable contextual defect reasoning beyond object detection. We aim to investigate certified defenses via randomized smoothing [69,70] to provide provable robustness guarantees. Multi-dataset validation across diverse insulator types (ceramic, composite, glass) and environmental conditions (ice, salt, UV degradation) will establish generalization capabilities. We plan to adopt efficient adversarial training methods [73,74] to reduce the 40% computational overhead while preserving robustness for resource-constrained deployments.

6. Conclusions

This work presents Faster-YOLOv12n for adversarially robust insulator defect detection, achieving 98.9% mAP@0.5 through the FasterNet backbone, SGC2f neck, and Wise-ShapeIoU loss optimization. Differential augmentation expands the dataset from 678 to 3900 images, balancing the class distribution to 49.2%/50.8%. Adversarial evaluation reveals the baseline model's vulnerability under FGSM (ϵ = 48/255: 88.9%), PGD (ϵ = 48/255: 90.8%), and C&W (τ = 3.0: 92.5%) attacks. Adversarial training substantially improves robustness while preserving 98.9% clean accuracy, achieving 93.2% under the strongest FGSM attacks, 94.5% under PGD, and 95.1% under C&W perturbations. This demonstrates that geometric constraint learning facilitates adversarial generalization without the typical accuracy–robustness trade-off. The model outperforms YOLOv9 by 2.4%, RT-DETR by 7.9%, and Faster R-CNN by 12.7% while achieving 99.8% AP@0.5 for the defect class. Grad-CAM analysis shows that attacks disrupt confidence calibration rather than feature representations.

Funding

This research received no external funding.

Data Availability Statement

The dataset utilized in this study is the China Power Line Insulator Dataset (CPLID) [1] which is publicly accessible via https://github.com/InsulatorData/InsulatorDataSet (accessed on 7 July 2025).

Acknowledgments

We would like to extend our sincere appreciation to Jubail Industrial College for their invaluable support during the course of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AP	Average Precision
C&W	Carlini and Wagner Attack
CA	Channel Attention
CIoU	Complete Intersection over Union
CLAHE	Contrast Limited Adaptive Histogram Equalization
CNN	Convolutional Neural Network
CPLID	China Power Line Insulator Dataset
DIoU	Distance Intersection over Union
EIoU	Efficient Intersection over Union
F1	F1-Score (harmonic mean of precision and recall)
FGSM	Fast Gradient Sign Method
FN	False Negative
FP	False Positive
FPS	Frames Per Second
GIoU	Generalized Intersection over Union
Grad-CAM	Gradient-weighted Class Activation Mapping
HSV	Hue Saturation Value
IoU	Intersection over Union
mAP	mean Average Precision
MLP	Multi-Layer Perceptron
P	Precision
PGD	Projected Gradient Descent
PGI	Programmable Gradient Information
R	Recall
ReLU	Rectified Linear Unit
RT-DETR	Real-Time Detection Transformer
SA	Spatial Attention
SGC2f	Spatial Gradient Convolution with 2 Fusions
SIoU	Shape Intersection over Union
TN	True Negative
TP	True Positive
UAV	Unmanned Aerial Vehicle
YOLO	You Only Look Once

References

1. Tao, X.; Zhang, D.; Wang, Z.; Liu, X.; Zhang, H.; Xu, D. Detection of Power Line Insulator Defects Using Aerial Images Analyzed With Convolutional Neural Networks. IEEE Trans. Syst. Man Cybern. Syst. 2018, 50, 1486–1498.
2. Liu, Y.; Liu, D.; Huang, X.; Li, C. Insulator defect detection with deep learning: A survey. IET Gener. Transm. Distrib. 2023, 17, 3541–3558.
3. Waleed, D.; Mukhopadhyay, S.; Tariq, U.; El-Hag, A.H. Drone-based ceramic insulators condition monitoring. IEEE Trans. Instrum. Meas. 2021, 70, 6007312.
4. Nguyen, V.N.; Jenssen, R.; Roverso, D. Automatic autonomous vision-based power line inspection: A review of current status and the potential role of deep learning. Int. J. Electr. Power Energy Syst. 2018, 99, 107–120.
5. Yang, M.; Li, S.; Chen, Q. Transmission Line Inspection with UAVs: A Review of Technologies, Applications, and Challenges. IEEE Access 2024, 9, 60–69.
6. Faisal, M.A.A.; Mecheter, I.; Qiblawey, Y.; Hernandez Fernandez, J.; Chowdhury, M.E.H.; Kiranyaz, S. Deep Learning in Automated Power Line Inspection: A Review. Appl. Energy 2025, 385, 125507.
7. Xu, Z.; Liu, C.; Huang, Y. Research on Insulator Defect Detection Based on Improved YOLOv7 and Multi-UAV Cooperative System. Coatings 2023, 13, 880.
8. Sun, S.; Chen, C.; Yang, B.; Yan, Z.; Wang, Z.; He, Y.; Wu, S.; Li, L.; Fu, J. ID-Det: Insulator Burst Defect Detection from UAV Inspection Imagery of Power Transmission Facilities. Drones 2024, 8, 299.
9. Cao, Z.; Chen, K.; Chen, J.; Chen, Z.; Zhang, M. CACS-YOLO: A lightweight model for insulator defect detection based on improved YOLOv8m. IEEE Trans. Instrum. Meas. 2024, 73, 3530710.
10. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 13713–13722.
11. Song, Z.; Huang, X.; Ji, C. Insulator defect detection and fault warning method for transmission lines based on flexible YOLOv7. High Volt. Technol. 2023, 49, 5084–5094.
12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2014; pp. 580–587.
13. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2015; pp. 1440–1448.
14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788.
16. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
17. Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
18. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
19. Zhang, Z.D.; Zhang, B.; Lan, Z.C.; Liu, H.C.; Li, D.Y.; Pei, L.; Yu, W.X. FINet: An insulator dataset and detection benchmark based on synthetic fog and improved YOLOv5. IEEE Trans. Instrum. Meas. 2022, 71, 6006508.
20. Li, L.; Zhang, Y.; Chen, P.; Zhang, K.; Xiong, W.; Gong, P. Lightweight-YOLOv4-based insulator-defect detection in complex scenes. J. Optoelectron. Laser 2023, 33, 598–606.
21. Chang, R.; Zhang, B.; Zhu, Q.; Zhao, S.; Yan, K.; Yang, Y. FFA-YOLOv7: Improved YOLOv7 Based on Feature Fusion and Attention Mechanism for Wearing Violation Detection in Substation Construction Safety. J. Electr. Comput. Eng. 2023, 2023, 9772652.
22. Liao, L.; Liu, H. Self-explosion insulator defect detection via improved YOLOv8. Electron. Meas. Technol. 2024, 47, 138–144.
23. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2014, arXiv:1312.6199.
24. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
25. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
26. Carlini, N.; Wagner, D. Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP); IEEE: Piscataway, NJ, USA, 2017; pp. 39–57.
27. Croce, F.; Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 2206–2216.
28. Xie, C.; Wang, J.; Zhang, Z.; Zhou, Y.; Xie, L.; Yuille, A. Adversarial examples for semantic segmentation and object detection. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 1369–1378.
29. Elsayed, G.F.; Shankar, S.; Cheung, B.; Papernot, N.; Kurakin, A.; Goodfellow, I.J.; Sohl-Dickstein, J. Adversarial Examples that Fool both Computer Vision and Time-Limited Humans. Adv. Neural Inf. Process. Syst. 2018, 31, 3914–3924.
30. Wei, X.; Guo, Y.; Yu, J. Adversarial sticker: A stealthy attack method in the physical world. IEEE Trans. Pattern Anal. Mach. Intell. 2019.
31. Akter, K.; Rahman, M.; Islam, M.R.; Sheikh, M.; Hossain, M. Attack-resilient framework for wind power forecasting against civil and adversarial attacks. Electr. Power Syst. Res. 2025, 238, 111065.
32. Tsipras, D.; Santurkar, S.; Engstrom, L.; Turner, A.; Madry, A. Robustness may be at odds with accuracy. arXiv 2019, arXiv:1805.12152.
33. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 12021–12031.
34. Zhang, H.; Zhang, S. Shape-IoU: More Accurate Metric Considering Bounding Box Shape and Scale. arXiv 2024, arXiv:2312.17663.
35. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051.
36. Ultralytics. Albumentations Integration. 2025. Available online: https://docs.ultralytics.com/integrations/albumentations/ (accessed on 7 July 2025).
37. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974.
38. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
39. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Version 8.0.0. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 7 December 2025).
40. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 618–626.
41. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29, pp. 379–387.
42. Wang, H.; Yang, Q.; Zhang, B.; Gao, D. Deep learning based insulator fault detection algorithm for power transmission lines. J. Real-Time Image Process. 2024, 21, 115.
43. Ultralytics. ultralytics/yolov5: v7.0-YOLOv5 SOTA Real-Time Object Detection. 2022. Available online: https://github.com/ultralytics/yolov5 (accessed on 8 December 2025).
44. Zendehdel, N.; Chen, H.; Leu, M.C. Real-time tool detection in smart manufacturing using You-Only-Look-Once (YOLO)v5. Manuf. Lett. 2023, 35, 1052–1059.
45. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
46. Li, Y.; Li, Q.; Pan, J.; Zhou, Y.; Zhu, H.; Wei, H.; Liu, C. SOD-YOLO: Small-object-detection algorithm based on improved YOLOv8 for UAV images. Remote Sens. 2024, 16, 3057.
47. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
48. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
49. Sui, Y.; Ning, P.; Niu, P.; Wang, C.; Zhao, D.; Zhang, W.; Han, S.; Liang, L.; Xue, G.; Cui, Y. A survey of drone-mounted power-line inspection technologies for overhead transmission lines. Power Syst. Technol. 2021, 45, 3636–3648.
50. He, N.; Wang, S.; Liu, J.; Zhang, H.; Wu, L.; Zhou, X. Research on missing-insulator detection in aerial images based on deep learning. Power Syst. Prot. Control 2021, 49, 132–140.
51. Dziugaite, G.K.; Ghahramani, Z.; Roy, D.M. A study of the effect of JPG compression on adversarial images. arXiv 2016, arXiv:1608.00853.
52. Guo, C.; Rana, M.; Cisse, M.; Van Der Maaten, L. Countering adversarial images using input transformations. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
53. Liao, F.; Liang, M.; Dong, Y.; Pang, T.; Hu, X.; Zhu, J. Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 1778–1787.
54. Athalye, A.; Carlini, N.; Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 274–283.
55. Xie, C.; Wu, Y.; van der Maaten, L.; Yuille, A.L.; He, K. Feature denoising for improving adversarial robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 501–509.
56. Xu, W.; Evans, D.; Qi, Y. Feature squeezing: Detecting adversarial examples in deep neural networks. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 26 February–1 March 2017.
57. Zhang, T.; Zhu, Z. Interpreting adversarially trained convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 11207–11216.
58. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755.
59. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
60. Eykholt, K.; Evtimov, I.; Fernandes, E.; Li, B.; Rahmati, A.; Xiao, C.; Prakash, A.; Kohno, T.; Song, D. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 1625–1634.
61. Zhang, J.; Yi, Q.; Sang, J. Towards adversarial attack on vision-language pre-training models. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5005–5013.
62. Finlayson, S.G.; Bowers, J.D.; Ito, J.; Zittrain, J.L.; Beam, A.L.; Kohane, I.S. Adversarial attacks on medical machine learning. Science 2019, 363, 1287–1289.
63. Apostolidis, K.D.; Papakostas, G.A. A survey on adversarial deep learning robustness in medical image analysis. Electronics 2021, 10, 2132.
64. Han, X.; Xu, G.; Zhou, Y.; Yang, X.; Li, J.; Zhang, T. Physical backdoor attacks to lane detection systems in autonomous driving. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 2957–2968.
65. Zhang, J.; Wei, J.; Huang, H.; Zhang, P.; Zhu, J.; Chen, J. SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration. arXiv 2025, arXiv:2410.02367.
66. Kurakin, A.; Goodfellow, I.; Bengio, S. Adversarial examples in the physical world. In Proceedings of the International Conference on Learning Representations (ICLR) Workshop, Toulon, France, 24–26 April 2017.
67. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
68. Zhang, H.; Yu, Y.; Jiao, J.; Xing, E.; El Ghaoui, L.; Jordan, M. Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7472–7482.
69. Cohen, J.; Rosenfeld, E.; Kolter, Z. Certified Adversarial Robustness via Randomized Smoothing. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 1310–1320.
70. Lyu, S.; Shaikh, S.; Shpilevskiy, F.; Shelhamer, E.; Lécuyer, M. Adaptive Randomized Smoothing: Certified Adversarial Robustness for Multi-Step Defences. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37.
71. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763.
72. Wang, J.; Li, M.; Luo, H.; Zhu, J.; Yang, A.; Rong, M.; Wang, X. Power-LLaVA: Large Language and Vision Assistant for Power Transmission Line Inspection. arXiv 2024, arXiv:2407.19178.
73. Wong, E.; Rice, L.; Kolter, J.Z. Fast is Better Than Free: Revisiting Adversarial Training. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
74. Shafahi, A.; Najibi, M.; Ghiasi, A.; Xu, Z.; Dickerson, J.; Studer, C.; Davis, L.S.; Taylor, G.; Goldstein, T. Adversarial Training for Free! In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
Figure 1. Overview of the proposed insulator defect detection architecture.
Figure 2. Examples from the China Power Line Insulator Dataset (CPLID) [1]: (a) Normal insulator with intact ceramic disk structure; (b) defective insulator with missing cap defect. Small defect regions against complex transmission line backgrounds present challenges for automated detection systems.
Figure 3. FasterNet backbone architecture with block repetition pattern [l1, l2, l3, l4] = [1, 2, 8, 2] across four hierarchical stages. Given a 640 × 640 × 3 input, the architecture generates multi-scale features at 160 × 160 × C1, 80 × 80 × C2, 40 × 40 × C3, and 20 × 20 × C4. The inset shows partial convolution (PConv) processing cp = c/4 channels with identity mapping for the remaining channels, achieving a 1/16 computational cost reduction.
Figure 3. FasterNet backbone architecture with block repetition pattern [ l 1 , l 2 , l 3 , l 4 ] = [ 1 , 2 , 8 , 2 ] across four hierarchical stages. Given 640 × 640 × 3 input, the architecture generates multi-scale features at 160 × 160 × C 1 , 80 × 80 × C 2 , 40 × 40 × C 3 , and 20 × 20 × C 4 . The inset shows partial convolution (PConv) processing c p = c / 4 channels with identity mapping for remaining channels, achieving 1/16 computational cost reduction.
Energies 19 01013 g003
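The partial-convolution idea in Figure 3 can be sketched numerically. The following is a minimal illustration, not the paper's implementation: `conv_fn` is a hypothetical stand-in for an actual k × k convolution, and the FLOPs ratio reproduces the stated 1/16 reduction for cp = c/4.

```python
import numpy as np

def pconv_flops_ratio(c, cp):
    """PConv convolves only cp of the c channels, so its FLOPs relative to a
    regular convolution with the same kernel are (h*w*k^2*cp^2) / (h*w*k^2*c^2)."""
    return (cp / c) ** 2

def partial_conv(x, conv_fn, cp):
    """Structural sketch of PConv: apply conv_fn to the first cp channels of a
    (c, h, w) tensor and pass the remaining c - cp channels through unchanged
    (identity mapping)."""
    out = x.copy()
    out[:cp] = conv_fn(x[:cp])
    return out

ratio = pconv_flops_ratio(c=64, cp=64 // 4)   # (1/4)^2 = 1/16 = 0.0625
```

With cp = c/4 the convolution touches a quarter of the channels, so the quadratic channel dependence of the FLOPs yields the 1/16 cost quoted in the caption.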
Figure 4. Adversarial robustness under FGSM attacks with perturbation budgets ϵ ∈ {8/255, 16/255, 32/255, 48/255}. Adversarially trained Faster-YOLOv12n achieves 93.2% mAP@0.5 at ϵ = 48/255 compared to 88.9% for the baseline model, a 4.3% absolute improvement while maintaining 98.9% clean accuracy.
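The FGSM perturbation evaluated in Figure 4 is a single signed-gradient step bounded by ϵ. A self-contained sketch on a toy analytic loss (the linear score and weights below are illustrative, not the detector's objective):

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    """One-step FGSM: move every input component eps along the sign of the
    loss gradient, then clip back to the valid [0, 1] image range."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Toy differentiable loss L(x) = (w.x - y)^2 with an analytic input gradient,
# standing in for the detector's loss under a white-box attack.
w = np.array([0.5, -1.0, 2.0])
x = np.array([0.2, 0.7, 0.4])
y = 1.0
grad = 2.0 * (w @ x - y) * w

x_adv = fgsm_perturb(x, grad, eps=8 / 255)   # stays within the eps = 8/255 budget
```

Each component of `x_adv` differs from `x` by at most ϵ, yet the loss increases, which is exactly the per-pixel budget reported on the horizontal axis of Figure 4.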
Figure 5. Adversarial robustness under PGD attacks with perturbation budgets ϵ ∈ {8/255, 16/255, 32/255, 48/255}. Adversarially trained Faster-YOLOv12n achieves 94.5% mAP@0.5 at ϵ = 48/255 compared to 90.8% for the baseline model, a 3.7% absolute improvement while maintaining 98.9% clean accuracy.
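PGD (Figure 5) iterates small FGSM-style steps and projects each iterate back into the ϵ-ball around the original input. A minimal sketch with the same toy analytic loss as above (illustrative, not the detector's objective):

```python
import numpy as np

def pgd_attack(x0, grad_fn, eps, alpha, steps):
    """Iterated signed-gradient ascent with projection onto the eps-ball
    around x0 and onto the valid [0, 1] image range."""
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(grad_fn(x))
        x = np.clip(x, x0 - eps, x0 + eps)   # project onto the eps-ball
        x = np.clip(x, 0.0, 1.0)             # keep a valid image
    return x

w = np.array([0.5, -1.0, 2.0])
y = 1.0
grad_fn = lambda x: 2.0 * (w @ x - y) * w    # analytic gradient of (w.x - y)^2
x0 = np.array([0.2, 0.7, 0.4])
x_adv = pgd_attack(x0, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10)
```

Because the step size α is smaller than ϵ and every iterate is re-projected, the final perturbation respects the same budget as FGSM while typically finding a stronger attack.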
Figure 6. Adversarial robustness under C&W attacks with ℓ2-norm bounds τ ∈ {0.5, 1.0, 2.0, 3.0}. Adversarially trained Faster-YOLOv12n achieves 95.1% mAP@0.5 at τ = 3.0 compared to 92.5% for the baseline model, a 2.6% absolute improvement while maintaining 98.9% clean accuracy.
Figure 7. Grad-CAM visualizations on clean test images using JET colormap (red indicates high importance, blue indicates low importance): (a) Focused attention on insulator structure and defect location, (b) Concentrated activation on structural components with peak response at fault region.
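The heat maps in Figure 7 are produced by Grad-CAM, whose combination step is simple to state: the channel weights are the spatial average of the gradients of the target score with respect to each feature map, and the map is the ReLU of the weighted channel sum. A minimal sketch of that step on plain arrays (extracting the feature maps and gradients from the network is framework-specific and omitted here):

```python
import numpy as np

def grad_cam_map(feature_maps, grads):
    """Grad-CAM combination step.
    feature_maps, grads: arrays of shape (K, H, W) from the chosen layer.
    Returns an (H, W) map normalized to [0, 1] for colormap display."""
    alphas = grads.mean(axis=(1, 2))                  # GAP over space: one weight per channel
    cam = np.tensordot(alphas, feature_maps, axes=1)  # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize for visualization
    return cam
```

The normalized map is what gets rendered with the JET colormap in Figures 7 and 8, with red marking the spatial locations that most increase the predicted defect score.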
Figure 8. Grad-CAM visualizations demonstrating adversarial robustness across three conditions using JET colormap (red: high importance, blue: low importance): (a) Clean image (98.9% confidence) with focused attention on the defect region; (b) FGSM attack (ϵ = 32/255) reducing confidence to 90.8% while preserving spatial attention localization; (c) Robust model (adversarial training) achieving 94.6% confidence with maintained attention coherence, validating that attacks disrupt confidence calibration rather than feature representations.
Table 1. Summary statistics of the CPLID with stratified split and differential augmentation for class balancing.
| Split | Normal | Defective | Total |
|---|---|---|---|
| Training (original) | 480 | 198 | 678 |
| Training (augmented) | 1920 | 1980 | 3900 |
| Validation | 60 | 25 | 85 |
| Test | 60 | 25 | 85 |
| Total (original) | 600 | 248 | 848 |
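The differential augmentation in Table 1 expands the two classes by different multipliers so that the augmented training set is near-balanced; the implied factors can be read directly off the counts:

```python
def augmentation_factor(original, augmented):
    """Per-class multiplier implied by the Table 1 counts."""
    return augmented / original

normal_factor = augmentation_factor(480, 1920)       # 4x for the majority class
defective_factor = augmentation_factor(198, 1980)    # 10x for the minority class
```

Augmenting the minority (defective) class 2.5 times harder than the majority class is what moves the split from an imbalanced 480:198 to a near-balanced 1920:1980.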
Table 2. Performance comparison of state-of-the-art models on CPLID insulator defect detection dataset.
| Model | Precision | Recall | F1-Score | mAP@0.5 | FPS |
|---|---|---|---|---|---|
| RT-DETR [37] | 0.910 | 0.879 | 0.894 | 0.910 | 17.00 |
| Faster R-CNN [13] | 0.780 | 0.821 | 0.800 | 0.862 | 12.00 |
| YOLOv6 [45] | 0.940 | 0.845 | 0.890 | 0.898 | 111.10 |
| YOLOv7 [38] | 0.940 | 0.946 | 0.942 | 0.947 | 107.20 |
| YOLOv8 [39] | 0.935 | 0.934 | 0.934 | 0.951 | 126.50 |
| YOLOv9 [17] | 0.952 | 0.922 | 0.937 | 0.965 | 55.50 |
| YOLOv10 [47] | 0.950 | 0.925 | 0.937 | 0.963 | 61.20 |
| YOLOv12n (Baseline) | 0.965 | 0.904 | 0.933 | 0.971 | 6.03 |
| Faster-YOLOv12n (Ours) | 0.978 | 0.951 | 0.964 | 0.989 | 5.48 |
Table 3. Five-fold stratified cross-validation results for Faster-YOLOv12n on CPLID. Metrics are reported as mean ± standard deviation across all folds, demonstrating consistent performance and statistical reliability.
| Fold | Precision | Recall | F1-Score | mAP@0.5 | FPS |
|---|---|---|---|---|---|
| Fold 1 | 0.976 | 0.948 | 0.962 | 0.987 | 5.50 |
| Fold 2 | 0.979 | 0.953 | 0.966 | 0.990 | 5.47 |
| Fold 3 | 0.978 | 0.951 | 0.964 | 0.989 | 5.48 |
| Fold 4 | 0.977 | 0.950 | 0.963 | 0.988 | 5.49 |
| Fold 5 | 0.975 | 0.949 | 0.962 | 0.986 | 5.46 |
| Mean ± Std | 0.977 ± 0.002 | 0.950 ± 0.002 | 0.963 ± 0.002 | 0.988 ± 0.002 | 5.48 ± 0.02 |
Table 4. Per-class detection performance of Faster-YOLOv12n on CPLID test set.
| Class | Precision | Recall | F1-Score | AP@0.5 |
|---|---|---|---|---|
| Insulator | 0.975 | 0.960 | 0.967 | 0.986 |
| Defect | 0.990 | 0.920 | 0.954 | 0.998 |
| Overall (mAP) | 0.978 | 0.951 | 0.964 | 0.989 |
Table 5. Adversarial robustness evaluation of baseline Faster-YOLOv12n under white-box attacks. ϵ denotes the perturbation budget; τ denotes the ℓ2-norm bound.
| Attack | Perturbation | mAP@0.5 | Degradation (relative) |
|---|---|---|---|
| Clean | – | 0.989 | – |
| FGSM | ϵ = 8/255 | 0.942 | 4.8% |
| FGSM | ϵ = 16/255 | 0.928 | 6.2% |
| FGSM | ϵ = 32/255 | 0.908 | 8.2% |
| FGSM | ϵ = 48/255 | 0.889 | 10.1% |
| PGD | ϵ = 8/255 | 0.949 | 4.0% |
| PGD | ϵ = 16/255 | 0.936 | 5.4% |
| PGD | ϵ = 32/255 | 0.921 | 6.9% |
| PGD | ϵ = 48/255 | 0.908 | 8.2% |
| C&W | τ = 0.5 | 0.958 | 3.1% |
| C&W | τ = 1.0 | 0.948 | 4.1% |
| C&W | τ = 2.0 | 0.936 | 5.4% |
| C&W | τ = 3.0 | 0.925 | 6.5% |
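The degradation column in Table 5 is consistent with a drop relative to the clean mAP rather than an absolute difference in percentage points, which can be checked directly:

```python
def relative_degradation(clean_map, attacked_map):
    """Percentage drop relative to clean performance."""
    return 100.0 * (clean_map - attacked_map) / clean_map

# FGSM at eps = 8/255: (0.989 - 0.942) / 0.989 = 4.75% -> 4.8% as tabulated
fgsm_8 = round(relative_degradation(0.989, 0.942), 1)
# FGSM at eps = 48/255: 10.1% as tabulated
fgsm_48 = round(relative_degradation(0.989, 0.889), 1)
```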
Table 6. Ablation study on CPLID test set showing incremental contribution of each proposed component. Baseline is YOLOv12n with original backbone, A2C2f modules, and CIoU loss.
| Model | FasterNet | SGC2f | Wise-ShapeIoU | P | R | mAP@0.5 |
|---|---|---|---|---|---|---|
| YOLOv12n (Baseline) | – | – | – | 0.965 | 0.904 | 0.971 |
| + FasterNet | ✓ | – | – | 0.968 | 0.915 | 0.976 |
| + SGC2f | ✓ | ✓ | – | 0.973 | 0.932 | 0.982 |
| Faster-YOLOv12n (Full) | ✓ | ✓ | ✓ | 0.978 | 0.951 | 0.989 |
Table 7. Evaluation across different loss functions on CPLID test set with complete Faster-YOLOv12n architecture.
| Loss Function | mAP@0.5 | FPS |
|---|---|---|
| GIoU | 0.978 | 6.52 |
| DIoU | 0.980 | 5.85 |
| CIoU | 0.982 | 5.90 |
| EIoU | 0.979 | 5.65 |
| SIoU | 0.976 | 5.49 |
| Wise-ShapeIoU (Ours) | 0.989 | 5.48 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Alanazi, M. Adversarially Robust and Explainable Insulator Defect Detection for Smart Grid Infrastructure. Energies 2026, 19, 1013. https://doi.org/10.3390/en19041013
