3.2. Experimental Dataset
In real-world power grid environments, insulator inspection is confronted with a series of practical challenges, including densely distributed electrical components, frequent object occlusion, pronounced scale variation, and complex background interference arising from vegetation, transmission facilities, and surrounding power equipment. To better approximate practical inspection scenarios, this study constructs a customized insulator defect dataset by integrating the public IDID dataset with field inspection images collected from the Inner Mongolia ultra-high-voltage power grid. The dataset was divided into a training set, a validation set, and a test set at a ratio of approximately 7:2:1, containing 1470, 403, and 227 images, respectively. The annotations cover three classes: “broken”, “damage”, and “flashover”.
To further improve model generalization and enhance robustness against complex backgrounds, illumination variations, and imaging disturbances, several data augmentation strategies are introduced during the construction of the training samples, including image rotation, horizontal and vertical flipping, and brightness and contrast adjustment. The detailed augmentation parameters are reported in Table 2. After augmentation, the number of training images increased to 3045.
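The augmentation operations listed above can be illustrated with a minimal sketch, assuming NumPy image arrays of shape (H, W, C) with 8-bit values. The function names and default values are hypothetical and are not the settings reported in Table 2. Rotation is restricted to multiples of 90° for simplicity, and the corresponding bounding-box transformations are omitted.

```python
import numpy as np

def flip_horizontal(img):
    """Mirror the image left-to-right (reverse the width axis)."""
    return img[:, ::-1]

def flip_vertical(img):
    """Mirror the image top-to-bottom (reverse the height axis)."""
    return img[::-1, :]

def rotate_90(img, k=1):
    """Rotate by k * 90 degrees counter-clockwise in the image plane."""
    return np.rot90(img, k=k, axes=(0, 1))

def adjust_brightness_contrast(img, alpha=1.0, beta=0.0):
    """Apply out = alpha * img + beta, then clip to the valid 8-bit range.

    alpha scales contrast and beta shifts brightness; both defaults are
    placeholders, not the paper's settings.
    """
    out = alpha * img.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)
```

In practice, each geometric operation would also have to be applied to the annotated bounding boxes so that the labels remain aligned with the transformed image.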
As illustrated in Figure 8, the constructed dataset combines standardized public data with real inspection images and covers multiple application scenarios, including transmission lines, substations, and typical outdoor inspection environments in Inner Mongolia. Compared with a single-source dataset, this integrated data construction strategy enhances scene diversity to a certain extent by incorporating insulator images captured under diverse backgrounds, shooting distances, object scales, and illumination conditions while also enriching the variety of insulator types and defect morphologies.
Specifically, the field images collected from the Inner Mongolia ultra-high-voltage power grid provide more engineering-oriented samples, including a certain number of small-scale, densely distributed objects and targets affected by strong background interference, which are relatively underrepresented in some public datasets. In contrast, the IDID dataset offers favorable openness and annotation foundations, thereby facilitating comparison with existing methods. Consequently, the fused dataset can more faithfully characterize conventional power grid inspection scenarios and, to some extent, narrows the gap between public benchmark data and practical engineering applications.
To further analyze the dataset characteristics, the class distribution and annotation information were statistically summarized, as shown in Table 3. The augmented dataset contains a total of 3695 images and 7391 defect instances, with 2825, 3875, and 691 annotated instances for the damage, flashover, and broken classes, respectively. The dataset exhibits a certain degree of class imbalance, reflecting the occurrence frequency of different defects in practical inspections.
Moreover, the dataset shows variation in object sizes and scene distribution. Figure 9 illustrates the distribution of target sizes in the self-collected dataset and the IDID dataset, showing that small and medium targets dominate in the self-collected dataset, while large targets are relatively few. To further characterize the target scale, the sizes of annotated objects were statistically analyzed. Table 4 shows that small, medium, and large targets account for 40.68%, 54.76%, and 4.56% of all annotated instances, respectively. The proportion of small and medium targets exceeds 95%, which is consistent with the actual imaging characteristics in UAV inspections and helps evaluate the model’s detection stability under multi-scale conditions.
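A scale statistic of this kind can be computed as in the following sketch. The size thresholds are an assumption: the text does not state them, so the common COCO convention is used here (area below 32² pixels is small, below 96² pixels is medium, otherwise large). The function names are hypothetical.

```python
def size_category(width, height, small_max=32 ** 2, medium_max=96 ** 2):
    """Classify a box by pixel area (COCO-style thresholds, assumed here)."""
    area = width * height
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"

def size_distribution(boxes):
    """Return the percentage of small/medium/large boxes.

    boxes is an iterable of (width, height) pairs in pixels.
    """
    counts = {"small": 0, "medium": 0, "large": 0}
    for width, height in boxes:
        counts[size_category(width, height)] += 1
    total = max(sum(counts.values()), 1)  # avoid division by zero
    return {name: 100.0 * n / total for name, n in counts.items()}
```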
3.3. Evaluation Metrics
To comprehensively evaluate the performance of the proposed model in insulator defect detection, Precision, Recall, mAP@0.5, and mAP@0.5:0.95 are employed as detection accuracy metrics. Meanwhile, the number of parameters (Parameters) and floating-point operations (GFLOPs) are introduced to quantify model size and computational complexity, thereby providing an indirect assessment of the model’s lightweight characteristics and its potential for resource-constrained deployment.
Precision and Recall are used to measure the correctness of model predictions and the completeness of target detection, respectively. They are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Here, TP denotes true positives, i.e., the number of positive samples correctly detected by the model; FP denotes false positives, i.e., the number of negative samples incorrectly predicted as positive; and FN denotes false negatives, i.e., the number of positive samples missed by the model.
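The two definitions translate directly into code. The following is a minimal sketch; the function name and the example counts are hypothetical and are not results from this study.

```python
def precision_recall(tp, fp, fn):
    """Compute Precision = TP / (TP + FP) and Recall = TP / (TP + FN).

    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
```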
Mean Average Precision (mAP) is adopted to comprehensively assess the model’s detection performance across all object categories. Specifically, mAP@0.5 denotes the mean Average Precision over all classes at an IoU threshold of 0.5, whereas mAP@0.5:0.95 represents the average mAP computed over IoU thresholds ranging from 0.5 to 0.95 at intervals of 0.05. The latter metric provides a stricter and more comprehensive evaluation of localization accuracy and detection robustness. The corresponding calculation is formulated as follows:
$$AP_i = \int_0^1 P_i(R)\,dR$$
$$mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i$$
where $P_i(R)$ denotes the functional relationship between Precision and Recall for the $i$-th category, $AP_i$ represents the area under the corresponding P–R curve, $C$ is the total number of categories, and $mAP$ is the average AP over all categories.
Parameters characterize the number of learnable parameters in the model, whereas GFLOPs indicate the floating-point operations required for one forward inference. Together, these two metrics reflect the storage cost and computational burden of the model. The above evaluation metrics provide a quantitative basis for subsequent experimental analysis and model performance assessment.
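The area under a P–R curve and its average over categories can be sketched as follows. This is an illustration under stated assumptions: the text does not specify the interpolation scheme, so the widely used all-point interpolation with a monotone precision envelope is assumed, and the function names are hypothetical. A box IoU helper and the ten thresholds used by mAP@0.5:0.95 are included for completeness.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(recall, precision):
    """Area under the P-R curve for one category (all-point interpolation).

    recall must be sorted in increasing order, with precision aligned to it.
    """
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    # Precision envelope: make precision non-increasing as recall grows.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas wherever recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Average the per-category AP values over the C categories."""
    return float(np.mean(ap_per_class))

# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@0.5:0.95.
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)
```

Under this scheme, mAP@0.5:0.95 would be obtained by evaluating the mAP at each threshold in `IOU_THRESHOLDS` and averaging the ten values.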
3.4. Localization Comparative Experiments for Different Modules
3.4.1. Localization Comparative Experiment of the C3k2_PConv Module
To further verify the effectiveness of the C3k2_PConv module at different feature levels, localization comparative experiments are designed around its replacement positions. The experimental results are shown in Table 5. Here, E1 denotes the baseline model without C3k2_PConv; E2, E3, and E4 represent the progressive introduction of this module into the P3, P3–P4, and P3–P5 feature levels, respectively; and E5 denotes the configuration in which all corresponding C3k2 structures in the neck are replaced with C3k2_PConv.
As shown in Table 5, the overall detection accuracy improves progressively as the replacement scope of C3k2_PConv is expanded, while the number of parameters and GFLOPs remain relatively low. This result indicates that C3k2_PConv can effectively reduce redundant computation while enhancing feature representation, thereby achieving a favorable trade-off between detection accuracy and model lightweighting. Specifically, when C3k2_PConv is introduced only at the P3 level, the model already shows improved performance, suggesting that this module can enhance shallow-level details and improve the representation of small objects and edge-texture features. When the replacement scope is further extended to the P3–P4 and P3–P5 levels, model performance continues to improve, indicating that collaborative optimization across multi-scale feature layers strengthens the representation of targets with different sizes. Finally, when all corresponding C3k2 structures in the neck are replaced with C3k2_PConv, the model achieves the best performance while maintaining low parameter and computational complexity.
From a mechanistic perspective, this improvement does not arise from increased network complexity, but rather from more efficient feature extraction and computational resource allocation enabled by C3k2_PConv. Specifically, through the partial convolution mechanism, C3k2_PConv performs spatial convolution only on a subset of channels, while the remaining channels are propagated through lightweight paths. This design effectively reduces redundant computation and decreases the number of parameters. Meanwhile, this structure preserves critical feature information while reducing computational cost, thereby avoiding the degradation of representation capability commonly caused by excessive compression in conventional lightweight methods. Moreover, by selectively processing and fusing channel features, C3k2_PConv enhances discriminative information and suppresses redundant responses, thereby strengthening the coordination between shallow details and deep semantics during multi-scale feature fusion.
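The saving described above can be made concrete with a small sketch of the partial convolution idea. Several points are assumptions rather than details taken from this paper: the partial ratio of 1/4, the treatment of the untouched channels as an identity path, equal input and output channel counts, and all function names. The `conv_fn` argument is a hypothetical stand-in for a real spatial convolution.

```python
import numpy as np

def conv_cost(h, w, channels, k=3):
    """Parameters and multiply-accumulates of a k x k convolution
    with equal input and output channel counts (bias ignored)."""
    params = k * k * channels * channels
    return params, h * w * params

def pconv_cost(h, w, channels, k=3, partial_ratio=0.25):
    """Same cost model when only a fraction of the channels is convolved."""
    c_p = int(channels * partial_ratio)
    params = k * k * c_p * c_p
    return params, h * w * params

def partial_conv_forward(x, conv_fn, partial_ratio=0.25):
    """Convolve the first c_p channels of x (shape C x H x W) with conv_fn
    and pass the remaining channels through unchanged."""
    c_p = int(x.shape[0] * partial_ratio)
    return np.concatenate([conv_fn(x[:c_p]), x[c_p:]], axis=0)

full_params, full_macs = conv_cost(80, 80, 64)
part_params, part_macs = pconv_cost(80, 80, 64)
print(full_params // part_params)  # 16
```

With a ratio of 1/4, the spatial convolution touches one quarter of the channels on both the input and output side, so its parameter count and multiply-accumulates fall to one sixteenth of the full convolution in this simplified cost model.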
Therefore, the proposed C3k2_PConv module enables consistent improvements in detection accuracy without increasing model complexity and can even reduce it, thereby demonstrating a desirable trade-off between accuracy and efficiency. The experimental results further confirm that replacing all corresponding layers in the neck with C3k2_PConv fully exploits the advantages of this module. This strategy not only avoids additional computational burden but also further improves model performance, indicating that the full-scale replacement design is reasonable and effective, and that it provides a reliable basis for structural optimization in lightweight object detection models.
3.4.2. Localization Comparative Experiment of the ECA Module
To verify the effect of the ECA attention mechanism at different feature levels, ECA modules are introduced after the fused outputs of P3, P4, and P5, respectively, and multiple ablation configurations are designed. The experimental results are reported in Table 6.
Overall, as the deployment scope of ECA is gradually expanded, Precision, Recall, and mAP all show a consistent upward trend. This demonstrates that ECA not only improves the reliability of detection results but also enhances the model’s object discovery capability and localization accuracy. Compared with the baseline model, introducing ECA into multi-scale feature layers markedly strengthens object discriminability and detection robustness.
More specifically, when ECA is introduced at only a single scale, model performance is improved, but the gain remains limited. This is mainly because a single-layer attention mechanism only reweights features at a specific scale. Although it can enhance feature responses and improve Precision to some extent, the lack of cross-scale information interaction limits its ability to perceive weak and small objects in complex backgrounds; consequently, the improvements in Recall and overall mAP remain limited. When ECA is jointly introduced at two scales, model performance further improves. This improvement mainly arises from more effective coordination among features at different levels: shallow features contain abundant spatial details and help improve the detection rate of small objects, thereby increasing Recall, whereas mid- and high-level features provide stronger semantic representations and help reduce false detections, thereby improving Precision. Hence, multi-scale ECA integration can simultaneously improve Precision and Recall to a certain extent, although incomplete feature coverage still leaves room for further performance enhancement.
Ultimately, when ECA is introduced after the fused features of all three scales, namely P3, P4, and P5, the model achieves the best results in terms of Precision, Recall, and mAP. This is because ECA can perform unified modeling of multi-scale fused features and adaptively adjust channel weights to emphasize informative responses across different scales while suppressing background noise and redundant activations. Specifically, enhanced discriminative responses help reduce false positives, strengthened weak small-object features help mitigate missed detections, and improved consistency of multi-scale feature representations further enhances localization stability. As a result, the model achieves improvements in detection accuracy, object completeness, and overall localization precision.
It is worth noting that, across all experimental settings, GFLOPs remain unchanged, while the number of parameters increases only marginally. This indicates that ECA, as a lightweight attention mechanism, introduces negligible additional computational overhead while delivering notable performance gains. Therefore, uniformly introducing ECA after P3–P5 multi-scale fusion represents an effective design strategy that improves Precision, Recall, and mAP without increasing model complexity. This result also suggests that introducing lightweight attention mechanisms after multi-scale feature fusion is an effective structural optimization strategy and provides useful guidance for the design of future lightweight object detection models.
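The small parameter footprint of ECA follows from its structure, which can be sketched in NumPy as below. This is an illustration rather than the implementation used in this study: the adaptive kernel-size rule with gamma = 2 and b = 1 follows the commonly published ECA formulation, and the function names and the single-sample (C, H, W) layout are assumptions.

```python
import math
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size: nearest odd value of (log2(C) + b) / gamma."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

def eca_forward(x, weights):
    """Apply efficient channel attention to a feature map x of shape (C, H, W).

    weights is the 1D kernel of odd length k; it holds the only learnable
    parameters of the module.
    """
    k = len(weights)
    y = x.mean(axis=(1, 2))        # global average pooling -> (C,)
    y_pad = np.pad(y, k // 2)      # zero padding keeps the output length C
    # 1D cross-correlation across neighbouring channels.
    z = np.array([np.dot(y_pad[i:i + k], weights) for i in range(len(y))])
    gate = 1.0 / (1.0 + np.exp(-z))  # sigmoid channel weights
    return x * gate[:, None, None]   # reweight each channel

print(eca_kernel_size(64), eca_kernel_size(256))  # 3 5
```

Under these assumptions, each ECA module adds only k weights (for example, 5 for a 256-channel feature map), which is consistent with the marginal parameter increase noted above.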
3.5. Ablation Experiments
The localization of C3k2_PConv and ECA has been analyzed separately, and their optimal configurations across multi-scale feature layers have been determined. On this basis, a comprehensive ablation experiment is conducted to further verify the synergistic effects among the improved modules. The results are shown in Table 7.
Table 7 reports the ablation results of different modules. Overall, compared with the Baseline, the complete model improves Precision, Recall, mAP@0.5, and mAP@0.5:0.95 by 3.12%, 4.18%, 2.72%, and 11.54%, respectively, while reducing GFLOPs by 7.94% and parameters by 34.75%. The improvement in Precision indicates a reduction in false detections, the improvement in Recall suggests more complete target discovery, and the substantial gain in mAP@0.5:0.95 demonstrates enhanced detection stability and robustness under stricter localization criteria. These results show that the improvement in model performance is not driven by increased computational complexity, but by enhanced feature representation quality and more efficient information utilization.
From the single-module results, different modules address distinct performance bottlenecks. C3k2_PConv increases mAP@0.5:0.95 by 4.90%, which is more pronounced than the 1.28% gain obtained by introducing only BiFPN. This indicates that C3k2_PConv improves localization accuracy by reducing redundant computation and strengthening key feature representation. Although BiFPN yields an approximately 1.24% improvement in Recall, its enhancement of Precision and high-IoU performance remains limited, implying that multi-scale fusion alone is insufficient to effectively suppress background interference. ECA further increases mAP@0.5:0.95 by 6.82% on the basis of BiFPN, revealing its clear advantage in enhancing discriminative features. Meanwhile, LSDECD improves Recall and mAP@0.5:0.95 by 3.32% and 6.68%, respectively, and the magnitude of improvement is notably larger than that achieved by BiFPN alone, indicating that detection-head optimization plays a more direct role in improving target completeness.
Further analysis of the combined configurations reveals clear synergy and interdependence among the modules. First, after combining BiFPN with C3k2_PConv, mAP@0.5:0.95 increases by an additional 5.25% compared with introducing BiFPN alone, while GFLOPs decrease, suggesting that C3k2_PConv can reduce redundant computation and improve feature representation quality on the basis of multi-scale fusion. Moreover, compared with Baseline+A+B, adding ECA further improves Precision, indicating that post-fusion feature selection helps suppress background interference and reinforce effective features. However, the combined design still relies on high-quality feature inputs. For example, when C3k2_PConv is absent, several metrics fluctuate, suggesting that if redundant information remains in the features, the effects of ECA and LSDECD cannot be fully exploited. This further confirms that feature quality is a prerequisite for subsequent modules to perform effectively. From the perspective of combined experiments, these modules are not merely additive; rather, they exhibit explicit complementarity. If any component is removed, Precision, Recall, or mAP will degrade to varying degrees, demonstrating the indispensable contribution of each module.
The ablation results demonstrate that the proposed modules are not simply stacked as independent plug-in components, but rather form a complementary and progressive optimization framework. Through a sequence of coordinated optimizations, the proposed model achieves more stable detection performance in insulator inspection scenarios characterized by complex backgrounds, dense small targets, and pronounced scale variations. It also realizes a favorable balance between detection accuracy and lightweight design, thereby demonstrating promising engineering deployment value.
3.6. Comparative Experiments
The proposed model is a lightweight object detection framework specifically designed for insulator defect detection. To comprehensively evaluate the effectiveness and engineering applicability of the proposed method, comparisons are conducted with the baseline model, representative general-purpose object detectors, and recent task-specific lightweight detection models. The general-purpose detectors include YOLOv5n, YOLOv8n, YOLOv10, YOLOv11n, YOLOv12, YOLOv26, and the Transformer-based DETR and RT-DETRv2, while MIF-YOLO, MobileNetV3-YOLO, and EfficientDet-D0 are selected as task-specific lightweight comparison models. All models are trained and tested on the same dataset under identical experimental settings to ensure a fair comparison.
The quantitative results in Table 8 demonstrate substantial differences among detection models in terms of detection accuracy and computational complexity, with the proposed method achieving the best overall performance. Specifically, the proposed model attains Precision, Recall, mAP@0.5, and mAP@0.5:0.95 values of 0.964, 0.940, 0.960, and 0.564, respectively, outperforming all comparison models. Compared with mainstream lightweight detectors, namely YOLOv8n, YOLOv10, and YOLOv11n, the proposed method improves mAP@0.5 by 1.5, 2.1, and 2.3 percentage points, respectively, and improves mAP@0.5:0.95 by 3.9, 4.5, and 5.8 percentage points, respectively. These results indicate that the proposed model can extract more discriminative features in scenarios involving complex backgrounds, multi-scale variations, and small defect targets, thereby improving both target recognition and localization performance. Compared with the task-specific lightweight model MIF-YOLO, the proposed method achieves gains of 0.6 percentage points in mAP@0.5 and 0.8 percentage points in mAP@0.5:0.95. Moreover, it outperforms MobileNetV3-YOLO and EfficientDet-D0 by 2.4 and 2.0 percentage points in mAP@0.5, and by 5.3 and 4.1 percentage points in mAP@0.5:0.95. These results demonstrate that the proposed architectural design provides consistent performance improvements even when compared with strong lightweight competing models.
The comparison across different architectural paradigms further reveals the close relationship between model structure and task characteristics. Although DETR and RT-DETRv2 are capable of modeling global dependencies, they do not show clear accuracy advantages in the insulator defect detection task. DETR achieves only 0.894 mAP@0.5, while requiring 115.2 GFLOPs and 41.84M parameters. RT-DETRv2 improves the mAP@0.5 to 0.935; however, its computational cost and parameter count remain as high as 60.0 GFLOPs and 20.11M, which are substantially larger than those of lightweight YOLO-based models. In contrast, the YOLO series benefits from convolutional structures that are more suitable for local texture representation and multi-scale feature fusion, thereby showing better adaptability to inspection images with dense small targets and complex backgrounds. YOLOv26 achieves 0.946 mAP@0.5 with only 5.2 GFLOPs, indicating favorable computational efficiency, but its detection accuracy is still lower than that of the proposed method. MobileNetV3-YOLO reduces the model size by adopting a lightweight backbone, but its mAP@0.5:0.95 is only 0.511, suggesting that simply relying on a lightweight backbone may weaken the feature representation capability for complex defective regions. EfficientDet-D0 achieves 0.940 mAP@0.5 through multi-scale feature fusion, yet its mAP@0.5:0.95 remains inferior to that of the proposed method. These results indicate that the general evolution of model architectures alone is insufficient to simultaneously satisfy the requirements of high accuracy and lightweight deployment in insulator defect detection.
The analysis of model complexity indicates that the proposed method reinforces its lightweight advantage while achieving the highest detection accuracy. The model contains only 1.69M parameters, which is substantially lower than those of YOLOv8n (3.14M), YOLOv11n (2.58M), MIF-YOLO (2.33M), MobileNetV3-YOLO (2.10M), and EfficientDet-D0 (3.90M), corresponding to reductions of approximately 46.2%, 34.5%, 27.5%, 19.5%, and 56.7%, respectively. In terms of computational cost, the proposed method requires 5.8 GFLOPs, lower than YOLOv8n, YOLOv10, and YOLOv11n, and comparable to YOLOv12n. Although EfficientDet-D0 has a lower GFLOPs value, its detection accuracy and practical inference speed remain inferior to those of the proposed model. Overall, the proposed method achieves the best detection performance with the smallest parameter count and a relatively low computational cost, demonstrating an excellent balance among accuracy, model size, and computational efficiency.
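The quoted parameter reductions follow from the relative difference with respect to each reference model. The short sketch below recomputes them from the parameter counts stated in this paragraph; the function and variable names are hypothetical.

```python
def relative_reduction(reference, proposed):
    """Percentage by which `proposed` is smaller than `reference`."""
    return 100.0 * (reference - proposed) / reference

proposed_params_m = 1.69  # proposed model, in millions of parameters

# Parameter counts (in millions) as stated in the text.
reference_params_m = {
    "YOLOv8n": 3.14,
    "YOLOv11n": 2.58,
    "MIF-YOLO": 2.33,
    "MobileNetV3-YOLO": 2.10,
    "EfficientDet-D0": 3.90,
}

for name, params in reference_params_m.items():
    print(f"{name}: {relative_reduction(params, proposed_params_m):.1f}%")
# YOLOv8n: 46.2%
# YOLOv11n: 34.5%
# MIF-YOLO: 27.5%
# MobileNetV3-YOLO: 19.5%
# EfficientDet-D0: 56.7%
```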
In addition, to further evaluate the practical deployment efficiency of different models, the inference frame rate (FPS) was measured under the same testing environment. As shown in Table 8, although Transformer-based detectors have the advantage of global dependency modeling, their complex architectures lead to limited inference efficiency. Specifically, DETR and RT-DETRv2 achieve only 22.15 FPS and 48.32 FPS, respectively, which makes it difficult for them to satisfy the real-time requirements of inspection tasks. EfficientDet-D0 has a relatively low GFLOPs value; however, due to operations such as multi-scale feature fusion and feature resampling, its actual inference speed reaches only 128.54 FPS. In contrast, lightweight YOLO-based models generally exhibit higher inference efficiency, with MIF-YOLO and MobileNetV3-YOLO achieving 225.46 FPS and 231.38 FPS, respectively. The proposed method achieves the highest detection accuracy while reaching an inference speed of 284.16 FPS with only 1.69M parameters. These results further demonstrate its potential for real-time deployment in resource-constrained scenarios such as UAV inspection and edge computing.
To evaluate the stability of the proposed model, five repeated experiments were conducted under the same experimental settings. As shown in Table 9, the proposed method achieves an average mAP@0.5 of 0.960 with a standard deviation of only 0.001, and an average mAP@0.5:0.95 of 0.563 with a standard deviation of 0.003. These results indicate that the proposed model maintains stable detection performance across different runs.
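Such a stability summary can be computed as in the following sketch. The per-run scores are hypothetical placeholders chosen only to exercise the code; they are not the values behind Table 9. The use of the sample standard deviation (n − 1 in the denominator) is also an assumption, since the text does not state which estimator was used.

```python
import statistics

def summarize_runs(scores):
    """Return the mean and sample standard deviation of repeated-run scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical placeholder scores for five runs (NOT the paper's results).
example_scores = [0.90, 0.91, 0.92, 0.93, 0.94]
mean_score, std_score = summarize_runs(example_scores)
print(f"{mean_score:.3f} +/- {std_score:.3f}")  # 0.920 +/- 0.016
```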
Visualization of Detection Results
This study employs Grad-CAM (Gradient-weighted Class Activation Mapping) visualization to compare the feature response regions of different models. This analysis further reveals, from the perspective of model interpretability, the attention mechanisms adopted by different detection methods during insulator defect recognition. Unlike quantitative metrics in tables, heatmaps can intuitively display the image regions emphasized by the model during prediction and help determine whether the model is affected by complex background factors such as towers, ground textures, and surrounding infrastructure. To ensure fairness of comparison, the same test samples are selected for visualization. The models involved in the comparison are consistent with those in the preceding comparative experiments, including multiple general-purpose object detection models, Transformer-based detectors, task-specific lightweight models, and the proposed method. The selected samples cover the three insulator defect categories in the dataset, enabling a comprehensive evaluation of the feature attention capability and background-interference resistance of different models under practical application conditions.
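The core Grad-CAM computation can be sketched in NumPy as follows. It assumes that the activations of a chosen convolutional layer and the gradients of a scalar target score with respect to them have already been obtained from the network. For a detector, the choice of target score (for example, the confidence of one predicted box) is itself an assumption, and the function name is hypothetical.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM heatmap.

    activations: feature maps A_k of the chosen layer, shape (K, H, W).
    gradients:   d(score) / d(A_k), same shape as activations.
    """
    # Channel weights: spatial average of the gradients of each feature map.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps over the channel axis -> (H, W).
    cam = np.tensordot(weights, activations, axes=1)
    # Keep only regions that contribute positively to the target score.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting map is typically resized to the input resolution and overlaid on the image to produce heatmaps of the kind compared here.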
Figure 10 presents the Grad-CAM heatmap comparisons for representative samples from the three defect categories.
Different detection models show distinct strengths in defect localization and background suppression. Transformer-based detectors, such as DETR and RT-DETRv2, tend to activate non-target regions under complex backgrounds, including towers, conductors, and surrounding textures. General-purpose YOLO models can capture defect-related regions, but their attention may become dispersed or spatially shifted when detecting small defects. Although lightweight models improve computational efficiency, their heatmap responses are often less continuous and stable. In contrast, the proposed method produces more discriminative feature responses, with activations consistently concentrated on insulator defects and adjacent informative regions. This suggests that the introduced feature enhancement and multi-scale fusion mechanisms can strengthen defect representation, suppress irrelevant background interference, and reduce false detections and localization errors in complex inspection scenarios.
Therefore, the proposed model can strengthen defect-related information during multi-scale feature interaction while suppressing interference from irrelevant background features, thereby establishing more reliable discriminative evidence in insulator images. The visualization results further demonstrate the comprehensive advantages of the proposed method in background-interference suppression and small-defect representation.