Article

Lightweight Defect Detection in Substations with Multi-Scale Features and Network Pruning

College of Electrical Engineering and New Energy, China Three Gorges University, Yichang 443000, China
*
Author to whom correspondence should be addressed.
Energies 2026, 19(5), 1163; https://doi.org/10.3390/en19051163
Submission received: 10 January 2026 / Revised: 5 February 2026 / Accepted: 14 February 2026 / Published: 26 February 2026

Abstract

With the increasing adoption of intelligent inspection systems for substation equipment, massive amounts of data are being generated. To address the challenge of balancing detection accuracy and lightweight deployment in current object detection models, this paper proposes YOLOv10-SPD (Substation Power Defect), a high-precision yet lightweight improved model tailored for substation defect detection. Compared to existing methods, the proposed model introduces multiple innovations in structural design and module fusion. (1) A Feature Modulation Module is proposed to significantly enhance the model’s ability to perceive and model defect details. (2) A hybrid module integrating structural information and channel attention is designed to efficiently reconstruct and represent feature maps. (3) A Multi-Scale Context Modeling Module is developed, leveraging shared convolutional kernels to achieve compact expression of multi-scale semantic information. (4) An Efficient Detection Head adopts a hierarchical semantic fusion strategy, further improving recognition accuracy for small and multi-scale targets. (5) A Weight-Magnitude-Based Hierarchical Pruning Strategy is introduced to compress model size and boost inference efficiency while maintaining accuracy. Experiments on a public substation defect dataset demonstrate that the proposed method achieves 94.11% mAP@0.5, outperforming the baseline YOLOv10n by 5.14%, while reducing model parameters by 76.09% and computational costs by 38.82%. The model achieves higher detection accuracy with lower computational overhead, effectively meeting the requirements for efficient and accurate substation defect detection, demonstrating strong practical applicability.

1. Introduction

Substation equipment is prone to various defects during long-term operation, such as damaged instrument panels and oil stains [1], which threaten grid safety. As power systems become more intelligent, efficient and accurate automatic defect identification has become both a research focus and an engineering challenge in power inspection [2]. Traditional manual inspections [3] are still widely used in substations and typically involve trained personnel conducting on-site visual checks, thermal imaging, and routine photographic documentation. Manual inspection has certain advantages, such as flexible observation angles, the ability to respond to unexpected situations, and direct integration with maintenance operations. However, it also has notable limitations. Long inspection routes and repetitive tasks often lead to operator fatigue, which increases the risk of missed detections. In addition, judgment standards may vary among inspectors, resulting in inconsistent defect reporting.
Some types of anomalies are particularly difficult to identify through human observation alone. These include small-scale defects (e.g., minor oil leakage or hairline cracks), low-contrast abnormalities (such as slight discoloration of insulating components), and early-stage faults that do not yet exhibit obvious visual features. Complex backgrounds, occlusions, and varying lighting conditions further increase the difficulty of reliable manual identification. These challenges highlight the need for intelligent visual assistance to improve inspection accuracy and consistency.
Recently, the widespread application of deep learning, particularly convolutional neural networks (CNNs) [4], has provided new approaches for substation defect detection. While deep learning-based object detection methods (e.g., Faster R-CNN, YOLO series) have shown effectiveness, they still face the challenge of balancing accuracy and lightweight design [5]: large models are computationally complex and difficult to deploy on edge devices, while lightweight models are fast but lack detection accuracy, particularly for small objects and complex scenes, limiting their practical application.
In recent years, research on object detection has advanced, with many scholars optimizing model architectures from various perspectives. Zhao Shuai et al. [6] replaced deformable convolutions in the ALIKED network with deformable convolutions v2, introducing an adjustment mechanism to enhance feature capture capability, though this increased model complexity. Yu Hong et al. [7] performed object recognition on six types of substation equipment using Faster R-CNN but faced challenges with high parameter counts. Wang Mengting et al. [8] constructed a C2f-ODConv module using full-dimensional dynamic convolution to extract defect features from steel surfaces, enhancing global feature extraction. Zhang Mingquan et al. [9] improved Faster R-CNN by modifying loss functions, but at the cost of increased computational complexity. Sun Yang et al. [10] incorporated a feature pyramid structure into the network architecture to enhance multi-scale [11] information and used Euclidean distance with KNN for classification. Ling Tonghua et al. [12] optimized YOLOv5n using C3_PConv, PIoUv2 loss, and layer-wise adaptive magnitude pruning. Ge Zhao et al. [13] enhanced YOLOv8’s input stage with Mosaic-9. Dai Weijie et al. [14] designed an improved YOLOv2 algorithm for FPGA deployment. Long Luo [15] developed an infrared defect identification method for substation equipment. Qingkai Meng [16] designed a novel attention mechanism to compensate for feature loss caused by max pooling.
In parallel with domain-specific research, lightweight object detection for resource-constrained platforms has become an important direction in the broader computer vision community. Representative models include MobileNet-SSD, which leverages depthwise separable convolutions for efficient mobile inference, and EfficientDet, which introduces compound scaling and a bidirectional feature pyramid network (BiFPN) to balance accuracy and efficiency. Several Tiny-YOLO variants further reduce model complexity for embedded deployment, while recent methods such as NanoDet and YOLOv7-tiny focus on lightweight backbone design and efficient feature aggregation to improve real-time performance under strict parameter and FLOPs limitations. Despite their success on generic benchmarks such as COCO, these lightweight detectors often experience performance degradation in industrial inspection scenarios, where targets are typically small, low-contrast, and densely distributed in complex backgrounds. Substation defect detection presents exactly such challenges, requiring models that not only remain lightweight but also retain strong fine-grained feature representation capability.
Meanwhile, with the rapid development of smart substations and digital power systems worldwide, inspection data processing is increasingly shifting toward edge devices, including mobile inspection terminals, embedded industrial computers, and UAV-mounted processors. These platforms are characterized by limited memory, restricted computational resources, and real-time processing requirements, making model lightweighting a key prerequisite for practical deployment. Therefore, there is a strong need for detection algorithms that achieve a better balance between accuracy and efficiency under real-world power inspection constraints.
Although the above studies and existing lightweight detection models have achieved performance improvements, they often suffer from insufficient adaptability to fine-grained industrial defect scenarios or still exhibit parameter redundancy and high computational complexity under edge deployment constraints. To address these issues, this paper proposes an efficient lightweight substation defect detection method based on YOLOv10-SPD. The main contributions include: (1) introducing a multi-scale feature fusion structure and a small object detection head to enhance perception of small, dense, and morphologically complex targets; (2) incorporating an FPSC module to extract multi-scale information while reducing computational cost; (3) designing a lightweight multi-scale feature extraction module with partial convolutions to build an efficient detection head, significantly reducing model size; (4) adopting a magnitude-based hierarchical pruning strategy to further compress the model, improve inference efficiency, and enhance device adaptability.

2. YOLOv10-SPD Algorithm Network Architecture

2.1. YOLOv10-SPD Algorithm Network Architecture

YOLOv10, proposed by the Tsinghua University team, optimizes feature extraction and fusion through large-kernel convolutions, self-attention mechanisms, and path aggregation networks. Its one-to-many and one-to-one head coordination improves accuracy while eliminating non-maximum suppression (NMS), enhancing real-time performance. To meet substation defect detection requirements, this study selects the lightweight YOLOv10n for optimization, aiming to achieve a high-precision, low-resource automated detection system.
To further enhance accuracy while maintaining lightweight design, this paper improves the YOLOv10n architecture in several aspects, as illustrated in Figure 1. The original model exhibits limitations in detecting small objects and low-contrast defects. The improved model retains the overall YOLOv10n framework while implementing multi-scale enhancement, up-sampling optimization, and detection head lightweighting, thereby improving detection accuracy while preserving model compactness and inference efficiency.
(1) To enhance multi-scale perception, a newly designed lightweight multi-scale fusion module (C2f-FMB) replaces the original C2f module in the Neck. This module fuses feature information from different receptive fields through sequential convolutions and batch operations, improving fine-grained feature representation. Additionally, the multi-scale feature extraction module FPSC is incorporated. Based on parameter sharing, it extracts contextual features at different scales using convolutions with varying dilation rates, combined with channel compression and fusion to enhance feature representation while reducing computational complexity.
(2) In addition to the original three feature maps (80 × 80, 40 × 40, 20 × 20), a higher-resolution shallow feature map (160 × 160) is added, with corresponding detection heads integrated into the Head layer to improve detection of small defects.
(3) The original decoupled head is replaced with a more efficient Detect Efficient head, which significantly reduces computational load and parameter size while maintaining detection performance, lowering overall inference complexity.

2.2. Improved C2f Network Model

2.2.1. Overall Module Description

To achieve more efficient feature modeling, a two-stage feature modulation module called the Feature Modulation Block (FMB) is designed. As shown in Figure 2, this module consists of two submodules: Structure-aware Multi-feature Fusion Attention (SMFA) and Partial Channel Feedforward Network (PCFN). SMFA fuses local structural and channel statistical information, while PCFN enables efficient feedforward feature modeling. The module adopts a residual connection architecture, enhancing feature representation while controlling parameters and computational cost.

2.2.2. SMFA Module

The SMFA module generates a spatial-channel attention map by fusing local structural information with channel statistics. Specifically, input features F are linearly projected into two branches:
$$X, Y = \mathrm{Split}\left(\mathrm{Conv}_{1\times 1}(F)\right)$$
where $F$ represents the input feature map; $\mathrm{Conv}_{1\times 1}$ denotes a 1 × 1 convolution; Split refers to dividing the convolved feature map into two equal parts along the channel dimension, denoted as $X$ and $Y$.
Local Structure Modeling:
For the branch $X$, perform downsampling, depthwise convolution, and channel statistical modeling:
$$X_s = \mathrm{DWConv}(\mathrm{Pool}(X)), \quad X_v = \mathrm{Var}(X)$$
where $X_s$ denotes the features after the pooling and depthwise convolution operations; DWConv represents the depthwise convolution operation; Pool denotes the pooling operation; $X$ is the input feature of this branch; $X_v$ denotes the variance of feature $X$; Var is the variance calculation function.
Fuse structural and statistical features, then restore to the original spatial size via activation functions and upsampling:
$$M_x = \mathrm{Upsample}\left(\mathrm{GELU}\left(\mathrm{Conv}_{1\times 1}(\alpha X_s + \beta X_v)\right)\right)$$
where $M_x$ denotes the output feature after upsampling; Upsample denotes the upsampling operation; GELU denotes the Gaussian Error Linear Unit activation function; $\alpha$ and $\beta$ are the weight coefficients adjusting the contributions of $X_s$ and $X_v$, respectively; $\alpha, \beta \in \mathbb{R}^{C \times 1 \times 1}$ are learnable scaling parameters.
Generating structural enhancement features:
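$$X_l = X \odot M_x$$
where $\odot$ denotes element-wise multiplication of the branch feature $X$ by the attention map $M_x$.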
Local Perception Enhancement:
$$Y_d = \mathrm{DMlp}(Y)$$
where DMlp denotes the Y branch spatial enhancement module, enhancing its local perceptual capability.
Finally fuse the two enhanced results and compress the channels:
$$F_{\mathrm{SMFA}} = \mathrm{Conv}_{1\times 1}(X_l + Y_d)$$

2.2.3. PCFN Partial Channel Feedforward Network

This module aims to achieve spatial modeling with reduced computational cost by performing convolutions only on selected channels. Input features first undergo linear mapping to expand channel dimensions:
$$Z = \mathrm{GELU}\left(\mathrm{Conv}_{1\times 1}(F_{\mathrm{SMFA}})\right) \in \mathbb{R}^{C \times H \times W}$$
where $Z$ is the expanded feature after the Gaussian Error Linear Unit (GELU) activation, a 3D tensor with $C$ channels and spatial dimensions $H \times W$.
The expanded features are divided into two subsets:
$$Z_1, Z_2 = \mathrm{Split}(Z, \mathrm{ratio} = p)$$
where $Z_1$ contains a proportion $p$ of the channels, and the remaining channels belong to $Z_2$.
Apply the convolution operation only to Z1 for spatial modeling:
$$Z_1 = \mathrm{GELU}\left(\mathrm{Conv}_{3\times 3}(Z_1)\right)$$
After concatenating the two parts, reduce the dimensions via a 1 × 1 convolution:
$$F_{\mathrm{PCFN}} = \mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}(Z_1, Z_2)\right)$$

2.2.4. Module Output

The entire FMB module adopts a residual structure, incorporating two normalizations (Norm) and two submodule enhancements:
$$F_{\mathrm{out}} = F + \mathrm{FMB}(F) = F + \mathrm{PCFN}\left(\mathrm{Norm}\left(F + \mathrm{SMFA}(\mathrm{Norm}(F))\right)\right)$$

2.3. Multi-Scale Feature Extraction Module FPSC

To effectively extract multi-scale feature information, this paper designs the Feature Pyramid Shared Conv (FPSC) module, whose structure is shown in Figure 3. The core idea of the module is to extract contextual features at different scales in images through convolutional operations with varying dilation rates based on parameter sharing. Combined with convolutions, this approach achieves channel compression and fusion, thereby enhancing feature expressiveness while reducing computational complexity.
(1) Input Compression
First, the input feature map $X$ passes through a 1 × 1 Conv layer for channel compression and preliminary feature transformation:
$$X_0 = \mathrm{Conv}_{1\times 1}(X)$$
(2) Multi-Scale Dilated Convolution (Shared Convolution Kernel)
The same convolutional kernel (parameter sharing) performs three successive dilated convolutions with dilation rates d = 1, 3, and 5 to capture feature representations at different scales:
$$F_1 = \mathrm{ShareConv}_{3\times 3,\, d=1}(X_0)$$
$$F_2 = \mathrm{ShareConv}_{3\times 3,\, d=3}(F_1)$$
$$F_3 = \mathrm{ShareConv}_{3\times 3,\, d=5}(F_2)$$
(3) Feature Fusion
Concatenate the compressed input $X_0$ with the outputs $F_1$, $F_2$, and $F_3$ of the three dilated convolutions along the channel dimension:
$$F_{\mathrm{concat}} = \mathrm{Concat}(X_0, F_1, F_2, F_3)$$
(4) Output Transformation
Finally, perform a 1 × 1 convolution for channel fusion and output mapping to obtain the final feature map $Y$:
$$Y = \mathrm{Conv}_{1\times 1}(F_{\mathrm{concat}})$$
Module Advantages Analysis: Strong multi-scale feature modeling capability, with convolutional kernels of varying dilation rates capturing both local and global contextual information. Parameter sharing enhances efficiency: Three dilated convolutional layers share the same set of kernels, significantly reducing parameter count and redundant computations. Fine-grained feature representation: Compared to pooling operations, this module preserves more image details through convolutional operations. Lightweight and efficient: The overall structure is compact with low computational and storage overhead, making it suitable for resource-constrained deployments.
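The effect of sharing one 3 × 3 kernel across the three dilated convolutions can be checked with simple arithmetic. The sketch below is illustrative only: the channel width of 64 is a hypothetical value, biases are ignored, and the helper names are not taken from the paper. It computes the receptive field after each stride-1 dilated convolution applied in sequence, and compares the weight count of one shared kernel with that of three independent kernels.

```python
def dilated_kernel_extent(k: int, d: int) -> int:
    """Spatial extent covered by a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)


def stacked_receptive_field(k: int, dilations: list[int]) -> list[int]:
    """Receptive field after each stride-1 dilated convolution applied in sequence."""
    rf = 1
    fields = []
    for d in dilations:
        rf += dilated_kernel_extent(k, d) - 1
        fields.append(rf)
    return fields


def conv_weight_count(c_in: int, c_out: int, k: int) -> int:
    """Number of weights in a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


channels = 64          # hypothetical channel width after the 1 x 1 compression
dilations = [1, 3, 5]  # dilation rates stated in the text

fields = stacked_receptive_field(3, dilations)
shared_weights = conv_weight_count(channels, channels, 3)
unshared_weights = shared_weights * len(dilations)

print("receptive field after F1, F2, F3:", fields)
print("weights with one shared kernel:", shared_weights)
print("weights with three separate kernels:", unshared_weights)
```

Under these assumptions the three outputs see progressively larger context while the dilated stage stores a single kernel, one third of the weights that three separate 3 × 3 kernels would need.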

2.4. Detect Efficient Module Optimizes Detection Head Accuracy

For practical deployment, substation defect detection models must possess high-speed inference capabilities and low computational complexity to meet edge device deployment requirements. To achieve this, this paper redesigns the detection head structure from the original YOLOv10n, replacing the original two-branch decoupled detection head with an efficient, lightweight detection head module (Detect Efficient) to reduce model computational complexity while improving detection performance.
As shown in Figure 4, in the original YOLOv10n model, each detector head consists of two parallel 3 × 3 standard convolutions (Conv) and one 3 × 3 dilated convolution (Conv2d). The original model handles both classification and regression tasks, with its detection component performing a total of 18 convolutional operations (12 standard convolutions and 6 Conv2d). This paper introduces a multi-scale fusion mechanism and adds two dedicated heads for small object detection, expanding the number of detection heads from 3 to 5. This increases the corresponding convolutional operations to 20 3 × 3 Conv and 10 3 × 3 Conv2d, leading to significant parameter stacking in the Detect section and substantial computational resource consumption, severely impacting detection efficiency.
To address these issues, this paper proposes an improved detection head architecture [17], illustrated in Figure 4. The new Detect Efficient module incorporates FasterNet’s core operator—partial convolution (PConv)—concatenated with 3 × 3 convolutions to form an Efficient Fusion feature fusion module. This design substantially simplifies the network architecture of the Head, reduces redundant parameters and computations, and further enhances overall model efficiency while maintaining detection accuracy. It is well-suited for rapid defect detection tasks in substations under resource-constrained scenarios.
As shown in the figure above, PConv applies a standard convolution with kernel size $k$ to only $c_p$ of the input channels to extract spatial features, leaving the remaining channels unchanged. Here, the $c_p$ channels, a portion of the total $c$ channels, serve as a computational proxy for the entire feature map. Therefore, the FLOPs for a single PConv are:
$$h \times w \times k^2 \times c_p^2$$
With a typical partial ratio $r = \frac{c_p}{c} = \frac{1}{4}$, the FLOPs of a single PConv are only $\frac{1}{16}$ of those of a standard convolution. The memory access of PConv is also small, amounting to only:
$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$$
This demonstrates that PConv simultaneously reduces computational redundancy and memory access.
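The 1/16 ratio follows directly from the FLOPs expression, as the short check below illustrates. The feature-map size (80 × 80) and channel count (64) are hypothetical values chosen for the example, and the function names are illustrative rather than part of any library.

```python
def conv_flops(h: int, w: int, k: int, c: int) -> int:
    """FLOPs of a standard k x k convolution with c input and c output channels."""
    return h * w * k * k * c * c


def pconv_flops(h: int, w: int, k: int, c_p: int) -> int:
    """FLOPs of a partial convolution that convolves only c_p channels."""
    return h * w * k * k * c_p * c_p


def pconv_memory_access(h: int, w: int, k: int, c_p: int) -> int:
    """Memory access of PConv: feature-map reads and writes plus kernel weights."""
    return h * w * 2 * c_p + k * k * c_p * c_p


h, w, k, c = 80, 80, 3, 64  # hypothetical feature-map size and channel count
c_p = c // 4                # partial ratio r = 1/4

full = conv_flops(h, w, k, c)
partial = pconv_flops(h, w, k, c_p)

print("PConv / Conv FLOPs ratio:", partial / full)
print("PConv memory access:", pconv_memory_access(h, w, k, c_p))
```

Because the FLOPs scale with the square of the convolved channel count, a ratio of r = 1/4 gives r² = 1/16 regardless of the spatial size or kernel size used.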

2.5. SPD-Conv: A Convolution Structure Integrating Spatial Compression and Depth Reconstruction

SPD-Conv is a novel convolutional architecture combining a Space-to-Depth (SPD) transformation module [18] with a stride-free convolutional layer. The SPD module is based on a universal image transformation method, whose core idea is to perform spatial downsampling on feature maps within convolutional neural networks, thereby achieving feature reorganization and compression.
In Figure 5, let the intermediate feature map $X$ have dimensions $S \times S \times C_1$. It is divided into several sub-feature maps. Specifically, for a given downsampling factor $scale$, the sub-feature maps are extracted as follows:
$$f_{0,0} = X[0:S:scale,\, 0:S:scale], \quad f_{1,0} = X[1:S:scale,\, 0:S:scale],$$
$$f_{0,1} = X[0:S:scale,\, 1:S:scale], \quad f_{1,1} = X[1:S:scale,\, 1:S:scale],$$
Similarly, a total of $scale^2$ sub-feature maps $f_{x,y}$ can be obtained, where the element positions in each sub-feature map satisfy: their corresponding original indices $i + x$ and $j + y$ are divisible by $scale$.
Each sub-map has dimensions $\frac{S}{scale} \times \frac{S}{scale} \times C_1$. Thus, the entire SPD operation reduces each spatial dimension of the input feature map by a factor of $scale$ while increasing the channel dimension by a factor of $scale^2$.
Finally, all sub-feature maps are concatenated along the channel dimension to produce a new intermediate feature map $X'$, with dimensions:
$$X' \in \mathbb{R}^{\frac{S}{scale} \times \frac{S}{scale} \times (scale^2 C_1)}$$
This step effectively compresses spatial information while enhancing the semantic feature representation capability at each location. A stride-free convolutional layer typically follows to further extract features. In Figure 5, the star symbol (★) denotes this layer: a 1 × 1 convolution (stride = 1) applied after the space-to-depth transformation that performs channel-wise feature fusion and dimensionality reduction, mapping the concatenated feature map of size $\frac{S}{2} \times \frac{S}{2} \times 4C_1$ to the final output feature map of size $\frac{S}{2} \times \frac{S}{2} \times C_0$.
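The space-to-depth rearrangement can be illustrated without any deep learning framework. The following is a minimal sketch on nested Python lists in S × S × C layout; the function name and the toy 4 × 4 single-channel input are assumptions made for the example. It covers only the rearrangement step, not the convolution that follows it.

```python
def space_to_depth(x: list, scale: int) -> list:
    """Rearrange an S x S x C nested list into (S/scale) x (S/scale) x (scale^2 * C).

    Sub-map f_{a,b} takes every scale-th entry starting at row offset a and
    column offset b. The sub-maps are concatenated along the channel axis in
    the order f_{0,0}, f_{1,0}, ..., f_{0,1}, f_{1,1}, ...
    """
    s = len(x)
    if s % scale != 0 or any(len(row) != s for row in x):
        raise ValueError("input must be square with a side divisible by scale")
    out_size = s // scale
    out = []
    for i in range(out_size):
        out_row = []
        for j in range(out_size):
            channels = []
            for b in range(scale):      # column offset of the sub-map
                for a in range(scale):  # row offset of the sub-map
                    channels.extend(x[i * scale + a][j * scale + b])
            out_row.append(channels)
        out.append(out_row)
    return out


# Toy 4 x 4 x 1 feature map whose single channel stores 4 * row + column.
feature_map = [[[4 * i + j] for j in range(4)] for i in range(4)]
rearranged = space_to_depth(feature_map, scale=2)

print(len(rearranged), len(rearranged[0]), len(rearranged[0][0]))  # 2 2 4
print(rearranged[0][0])  # [0, 4, 1, 5]
```

No value is discarded: the 16 input entries reappear as 4 channels at each of the 4 output positions, which is the property that distinguishes this rearrangement from strided convolution or pooling.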

2.6. Lightweight Progressive Multi-Scale Feature Fusion Module (CSP_PMSFF)

To further enhance the model’s multi-scale feature extraction capability, this paper proposes a lightweight multi-scale fusion module named PMSFF (Progressive Multi-Scale Feature Fusion), whose structure is shown in Figure 6. This module extracts features with different receptive fields through sequential convolutional operations, followed by channel-wise concatenation and fusion, thereby balancing both local and global feature representation capabilities.
Specifically, the input feature $X \in \mathbb{R}^{C \times H \times W}$ undergoes a 3 × 3 standard convolution operation, yielding the output feature $F_1 \in \mathbb{R}^{C \times H \times W}$. This output is then split into two parts along the channel dimension:
$$F_1 = [F_{1a}, F_{1b}], \quad F_{1a}, F_{1b} \in \mathbb{R}^{\frac{C}{2} \times H \times W}$$
Next, a 5 × 5 batch convolution is applied to $F_{1a}$ with a batch size of $C/2$, yielding the output $F_2$, which is similarly split into two parts:
$$F_2 = [F_{2a}, F_{2b}], \quad F_{2a}, F_{2b} \in \mathbb{R}^{\frac{C}{4} \times H \times W}$$
Then, $F_{2a}$ undergoes a 7 × 7 batch convolution (batch size $C/4$) to yield $F_3 \in \mathbb{R}^{\frac{C}{4} \times H \times W}$.
Finally, concatenate the outputs from all three stages, $F_3$, $F_{2b}$, and $F_{1b}$, along the channel dimension:
$$F_{\mathrm{cat}} = \mathrm{Concat}(F_3, F_{2b}, F_{1b})$$
A 1 × 1 convolution performs channel fusion to yield $F_{\mathrm{out}}$, which is added to the input features through a residual connection to produce the final output:
$$Y = \mathrm{Conv}_{1\times 1}(F_{\mathrm{cat}}) + X$$
This module enhances the model’s perception of targets at different scales by progressively extracting and fusing features at various resolutions. Simultaneously, the residual structure ensures gradient flow, improving training stability.
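A quick channel count confirms that the progressive split-and-concatenate scheme returns to the input width, which is what allows the residual addition with X. The helper below is only a bookkeeping sketch: it tracks channel counts and does not implement the convolutions, and the input width of 64 is a hypothetical value.

```python
def pmsff_channel_plan(c: int) -> dict[str, int]:
    """Channel counts at each stage of the progressive split described above."""
    if c % 4 != 0:
        raise ValueError("channel count must be divisible by 4")
    f1a = f1b = c // 2   # split after the 3 x 3 convolution
    f2 = f1a             # the 5 x 5 stage keeps C/2 channels
    f2a = f2b = f2 // 2  # second split
    f3 = f2a             # the 7 x 7 stage keeps C/4 channels
    return {
        "F1a": f1a, "F1b": f1b,
        "F2a": f2a, "F2b": f2b,
        "F3": f3,
        "F_cat": f3 + f2b + f1b,
    }


plan = pmsff_channel_plan(64)  # hypothetical input width C = 64
print(plan)
```

The concatenated width is C/4 + C/4 + C/2 = C, so the 1 × 1 fusion convolution can keep the channel count unchanged and its output can be added to X element-wise.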

2.7. Structure Lightweighting

In object detection tasks, not all weights in a model carry equal importance. To further reduce model size, computational load, and parameter redundancy while minimizing impact on detection accuracy, this paper introduces pruning techniques. Pruning is a common model compression method that achieves lightweight models by removing unimportant connections or redundant weights while preserving accuracy.
This paper employs the LAMP (Layer-Adaptive Magnitude-based Pruning) algorithm, known for its simplicity and effectiveness, to prune the YOLOv10-SPD model. LAMP is a weight magnitude-based importance evaluation method that adaptively determines the proportion of weights to retain in each layer. The pruning process is as follows:
The forward propagation process begins with input x, sequentially applying transformations through each layer’s weight matrix and activation function to yield the network output:
$$f(x, W^{(1:d)}) = W^{(d)} \sigma\left(W^{(d-1)} \sigma\left(\cdots W^{(2)} \sigma\left(W^{(1)} x\right) \cdots\right)\right)$$
where $x$ is the input; $d$ is the number of network layers; $W^{(i)}$ is the weight matrix for layer $i$; $\sigma$ is the activation function.
Then, the LAMP score for each layer’s weight tensor $W$ is calculated using the formula:
$$\mathrm{score}(u; W) := \frac{(W[u])^2}{\sum_{v \ge u} (W[v])^2}$$
where the weights of $W$ are indexed in ascending order of magnitude, so that $v \ge u$ denotes the range from the $u$-th weight value in the weight tensor $W$ to the last weight value in $W$.
Weights are sorted by their importance scores to determine which weights to retain in each layer, followed by pruning:
$$(W[u])^2 > (W[v])^2 \Rightarrow \mathrm{score}(u; W) > \mathrm{score}(v; W)$$
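The scoring rule above can be sketched in a few lines of plain Python. This is an illustration on flat lists of weights, not the pruning code used for YOLOv10-SPD; the function names and toy weights are hypothetical. The global step, which removes the lowest-scoring weights across all layers at once, follows how LAMP is usually described and is an assumption with respect to this paper.

```python
def lamp_scores(weights: list[float]) -> list[float]:
    """LAMP score of every weight in one layer, returned in the original order."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    scores = [0.0] * len(weights)
    tail = 0.0
    # Walk from the largest magnitude down so that `tail` always holds the sum
    # of squares of the current weight and every larger one (the v >= u range).
    for i in reversed(order):
        squared = weights[i] ** 2
        tail += squared
        scores[i] = squared / tail if tail > 0 else 0.0
    return scores


def lamp_prune(layers: list[list[float]], sparsity: float) -> list[list[float]]:
    """Zero out the globally lowest-scoring fraction of weights (assumed global step)."""
    scored = [
        (score, layer_idx, weight_idx)
        for layer_idx, layer in enumerate(layers)
        for weight_idx, score in enumerate(lamp_scores(layer))
    ]
    n_prune = int(len(scored) * sparsity)
    pruned = [list(layer) for layer in layers]
    for _, layer_idx, weight_idx in sorted(scored)[:n_prune]:
        pruned[layer_idx][weight_idx] = 0.0
    return pruned


layers = [[0.1, -0.5, 0.3, 2.0], [0.05, 0.04]]  # two toy layers
print(lamp_scores(layers[0]))
print(lamp_prune(layers, sparsity=0.5))
```

In this toy case the second layer keeps both of its weights even though they are smaller in magnitude than everything in the first layer, because scores are normalized within each layer. This is the layer-adaptive behaviour that plain global magnitude pruning lacks.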

3. Experimental Results and Data Analysis

3.1. Dataset Statistics and Processing

The data acquisition process spanned multiple substations with different voltage levels and equipment layouts, including 220 kV and 110 kV outdoor substations. Images were collected under various environmental and lighting conditions, such as sunny, cloudy, and low-illumination scenarios, to ensure diversity and improve the robustness of the model to real-world deployment conditions. The captured scenes mainly include instrument panels, transformer areas, breaker zones, cable trenches, and safety operation areas. The images used in this study were collected during routine substation inspection tasks using handheld industrial cameras and mobile inspection terminals operated by trained maintenance personnel. The shooting angles vary from frontal views to oblique perspectives, and target sizes range from small defects (e.g., oil stains, dial abnormalities) to medium-scale equipment damage. Image resolutions ranged from 1280 × 720 to 4000 × 3000 pixels, and all images were later resized to 640 × 640 during preprocessing. To ensure data authenticity, all defect samples were naturally occurring faults observed during inspection, rather than artificially staged defects. The dataset covers both equipment-related defects and personnel safety violations, reflecting practical inspection requirements in substations. As shown in Table 1, all images were manually screened to remove blurred or invalid samples. After cleaning, a total of 4435 valid images were retained. To address category imbalance, conventional data augmentation techniques such as horizontal flipping, brightness adjustment, random cropping, and Mosaic augmentation were applied, expanding the dataset to 4768 images.
Regarding data partitioning, 10% of the entire dataset was first allocated as the test set and another 10% as the validation set to evaluate the model’s generalization performance. The remaining data was further divided into training, test, and validation sets in an 8:1:1 ratio to ensure timely and accurate validation feedback during training.
To ensure reliable model evaluation, the dataset was divided into training, validation, and test sets according to scene diversity and defect distribution rather than random frame-level splitting alone. Images from the same inspection sequence or similar shooting locations were kept within the same subset to avoid data leakage. The distribution of defect categories was balanced as much as possible across subsets to maintain consistent class representation.
The training set includes images under diverse environmental conditions and complex backgrounds to enhance model generalization. The validation set is used for hyperparameter tuning and early stopping, while the test set contains previously unseen inspection scenes to objectively evaluate real-world performance. This splitting strategy ensures that the evaluation results reflect practical deployment conditions.
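One way to keep images from the same inspection sequence in a single subset is to split at the sequence level rather than the image level. The sketch below assumes each image carries a sequence identifier; the `(image_name, sequence_id)` tuples, the function name, and the greedy assignment rule are all hypothetical and are not the authors' procedure. Class balancing across subsets, which the text also mentions, is not handled here.

```python
import random
from collections import defaultdict


def split_by_sequence(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (image_name, sequence_id) pairs into train/val/test without leakage.

    Whole sequences are assigned to the subset that is currently furthest
    below its target size, so no sequence is ever shared between subsets.
    """
    groups = defaultdict(list)
    for name, sequence_id in samples:
        groups[sequence_id].append(name)

    sequence_ids = sorted(groups)
    random.Random(seed).shuffle(sequence_ids)

    targets = [ratio * len(samples) for ratio in ratios]
    subsets = ([], [], [])
    for sequence_id in sequence_ids:
        idx = max(range(3), key=lambda k: targets[k] - len(subsets[k]))
        subsets[idx].extend(groups[sequence_id])
    return subsets


# Toy data: 20 inspection sequences with 5 images each.
samples = [(f"seq{s:02d}_img{i}.jpg", s) for s in range(20) for i in range(5)]
train, val, test = split_by_sequence(samples)
print(len(train), len(val), len(test))
```

Splitting by sequence trades exact ratios for independence between subsets: the subset sizes only approximate the target ratio when sequences differ in length, but near-duplicate frames can no longer inflate the test score.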
To achieve refined defect recognition training, labeling was employed to annotate all images. Defects in the images were categorized into 15 types: panel blur (bj_bpmh), dial damage (bj_bpps), shell damage (bj_wkps), oil stains on the floor (sly_dmyw), silicone tube damage (hxq_gjtps), abnormal door closure (xmbhyc), suspended particles (yw_gkxfw), bird’s nests (yw_nc), cover plate damage (gbps), failure to wear safety helmet (wcaqm), not wearing work uniform (wcgz), meter reading anomaly (bjdsyc), respirator oil seal oil level abnormality (ywzt_yfyc), silicone color change (kgg_ybh), and silicone discoloration (hxq_gjbs). For each annotated image, a corresponding TXT format label file is generated automatically via script for model training. All images are uniformly resized to 640 × 640 input resolution to adapt to the network architecture and ensure training effectiveness. Based on this, the improved algorithm was trained and validated to evaluate its performance and application value on this dataset.
In practical deployment, such images are typically captured by inspection personnel using handheld terminals, inspection robots, or UAV platforms rather than fixed CCTV systems. Many substation defects, such as small oil stains, dial abnormalities, and minor component damage, require close-range and flexible-angle imaging, which fixed surveillance cameras often cannot provide due to limited resolution, occlusion, and fixed viewpoints.
Although personnel are present during inspection, manual observation is still prone to fatigue, subjective judgment differences, and missed detections of small or subtle defects. The proposed lightweight detection model is therefore designed as an intelligent assistance tool integrated into mobile or edge devices. It can automatically analyze captured images, highlight suspected defects, and generate standardized records, thereby reducing oversight, improving inspection efficiency, and supporting digital maintenance management. This approach enhances existing inspection workflows rather than replacing human inspectors, aligning with current intelligent inspection practices in substations.
The definition and categorization of defect types in this study were aligned with practical inspection requirements and relevant industry standards. Specifically, the labeling principles refer to routine substation maintenance procedures and safety inspection specifications commonly adopted in power utilities. Equipment-related defects such as panel damage, oil leakage, and insulation discoloration correspond to typical fault categories described in substation operation and maintenance guidelines. Safety-related violations, including failure to wear safety helmets or work uniforms, follow occupational safety supervision requirements in power system field operations.
In addition, the inspection scenarios considered in this dataset are consistent with the general technical framework of intelligent substation inspection promoted in international standards and industry practices, such as IEC guidelines on power equipment maintenance and IEEE recommendations for condition monitoring of substation assets. These references ensure that the defect categories used in this study reflect real operational concerns and practical inspection standards rather than arbitrary visual classifications.

3.2. Experimental Environment

The experimental hardware configuration consists of an Intel® Xeon® Silver 4214R processor and an NVIDIA GeForce RTX 3080 Ti graphics card. The development language is Python 3.10.14, with CUDA version 11.8 and PyTorch version 2.1. The training parameters are shown in Table 2.
In addition to training configuration, the inference efficiency of the proposed model was also evaluated to assess its deployment potential. The final pruned YOLOv10-SPD model contains only 0.55 M parameters and requires 4.3 GFLOPs, which is well within the computational capability of typical edge AI hardware. During inference testing on the RTX 3080 Ti at an input resolution of 640 × 640, the model achieved an average processing speed exceeding 180 frames per second (FPS), demonstrating strong real-time performance.
Considering the significant performance margin between desktop GPUs and embedded platforms, the proposed lightweight architecture is expected to maintain real-time inference capability (above 30 FPS) on commonly used edge devices such as NVIDIA Jetson series modules. The reduced parameter scale and computational demand also imply lower memory usage and power consumption, which are important for long-duration intelligent inspection tasks on mobile terminals and embedded systems.
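The headroom implied by these figures can be made explicit with a short calculation. The sketch below uses only the 180 FPS desktop measurement and the 30 FPS real-time threshold quoted above; the function names are illustrative, and no measured edge-device throughput is assumed.

```python
def frame_time_ms(fps):
    """Average per-frame latency in milliseconds for a given throughput."""
    return 1000.0 / fps


def max_slowdown(measured_fps, target_fps):
    """Largest slowdown factor that still meets the target frame rate."""
    return measured_fps / target_fps


desktop_fps = 180.0   # lower bound reported on the RTX 3080 Ti (from the text)
realtime_fps = 30.0   # real-time threshold quoted in the text

print(f"desktop latency   : {frame_time_ms(desktop_fps):.2f} ms/frame")
print(f"real-time budget  : {frame_time_ms(realtime_fps):.2f} ms/frame")
print(f"tolerable slowdown: {max_slowdown(desktop_fps, realtime_fps):.1f}x")
```

Under these numbers, an embedded platform could be up to about six times slower than the desktop GPU on this model and still reach 30 FPS; whether a given device meets that bound would need to be confirmed by on-device measurement.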

3.3. Evaluation Metrics

In this experiment, to comprehensively evaluate the performance and lightweighting effectiveness of the proposed improved YOLOv10n model for substation defect detection tasks, multiple metrics were selected for quantitative analysis: Precision, Average Precision (AP), Mean Average Precision (mAP), Number of Parameters (Params), Computational Cost (GFLOPs), and Model Size (MB). These evaluation metrics, respectively, measure the model’s detection accuracy, overall detection performance, and computational resource consumption. Their specific calculation formulas are as follows:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$
$$P = \frac{TP}{TP + FP}$$
$$\mathrm{AP} = \int_{0}^{1} P(R)\, dR$$
where TP is the number of samples correctly identified as positive by the model; FP is the number of samples incorrectly predicted as positive; FN is the number of samples incorrectly predicted as negative; R is the recall, R = TP/(TP + FN), over which the precision P is integrated to obtain AP; N is the total number of classes in the object detection task; and mAP@0.5 is the mean average precision at an IoU threshold of 0.5.
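As a minimal sketch of how these definitions combine, the following code computes precision, recall, AP, and mAP on toy values. The precision-recall points and the per-class AP values are hypothetical, and the integral is approximated with the trapezoidal rule; detection toolkits often use an interpolated precision curve instead, so this is an illustration of the formulas rather than the evaluation code used in the experiments.

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Trapezoidal approximation of the integral of P(R) over R.

    `recalls` must be sorted in ascending order and aligned with `precisions`.
    """
    area = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        area += width * (precisions[k] + precisions[k - 1]) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """mAP = (1/N) * sum of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical precision-recall points for one class.
ap_a = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
# Hypothetical AP for a second class.
ap_b = 0.625
print(precision(9, 1), recall(9, 3), ap_a, mean_average_precision([ap_a, ap_b]))
```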

3.4. Ablation Experiment

To validate the effectiveness of the proposed modules in substation defect detection, nine ablation experiments were designed to sequentially analyze the impact of each improved module. Experimental results are presented in Table 3.
In the first set of experiments, the Neck module in the original YOLOv10 was replaced with the C2f-FMB structure, and SPD-Conv operations were introduced to optimize the feature extraction path, with the aim of improving the modeling of multi-scale objects in substation images. The parameter count fell to 2.15 million with a model size of 5.97 MB, while mAP@0.5 remained essentially unchanged (88.92% versus 88.97% for the baseline), indicating that the module reduces parameters without a meaningful loss of accuracy.
Experiment Group 2 incorporates a multi-scale feature extraction module into the Neck section. By employing a more efficient feature aggregation strategy, it enhances perception of fine-grained defects such as damaged gauges and ground oil stains, increasing mAP@0.5 by 3.28 percentage points and noticeably improving the recognition of detailed features.
The third set of experiments introduced the optimized feature fusion structure Detect Efficient. This reduced the computational cost to 5.9 GFLOPs while maintaining accuracy, improving overall detection efficiency and inference performance.
Finally, integrating these improvements yields the enhanced detection model YOLOv10-SPD. While maintaining detection speed, it reduces parameter count by 76.09%, computational load by 38.82%, and model size by 53.82%, and improves mAP@0.5 by 5.14 percentage points over the original model, validating the comprehensive performance advantages of the proposed method.
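The relative reductions can be recomputed from the baseline and final rows of Table 3. The sketch below is plain arithmetic on the rounded table entries; the dictionary keys and the helper name are illustrative.

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100.0


# Rounded entries from the first (YOLOv10n) and last (YOLOv10-SPD) rows of Table 3.
baseline = {"params_M": 2.3, "gflops": 6.7, "size_MB": 5.5, "map50": 88.97}
final = {"params_M": 0.55, "gflops": 4.3, "size_MB": 2.54, "map50": 94.11}

for key in ("params_M", "gflops", "size_MB"):
    print(f"{key}: {relative_reduction(baseline[key], final[key]):.2f}% reduction")

print(f"mAP@0.5 gain: {final['map50'] - baseline['map50']:.2f} percentage points")
```

On these rounded entries the parameter, model size, and accuracy figures match the values quoted above, whereas the GFLOPs entries give a reduction of about 35.8%; the quoted computational figure may therefore rest on unrounded measurements.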
To further verify the reliability of the performance improvements, additional training runs were conducted using different random initialization seeds. The variation in mAP@0.5 across repeated experiments remained within ±0.3%, indicating that the observed accuracy gains are stable and not caused by random fluctuations during training. This demonstrates the statistical robustness of the proposed improvements.

3.5. Model Visualization Comparison

Figure 7 shows the comparison of model performance before and after improvement on the mAP@0.5 metric. The results indicate that the improved model significantly enhances detection accuracy, specifically reflected in the increase of the mAP@0.5 value. Furthermore, the model’s convergence properties are enhanced, exhibiting markedly accelerated convergence speed and overall performance optimization.
Figure 8 displays the confusion matrix comparison between the original and improved models. Values along the main diagonal reflect the correct classification rate for each category, with higher values indicating better model performance. Values off the main diagonal correspond to confusion errors between categories, revealing false positives and false negatives. As shown in Figure 8 (left), the YOLOv10n model achieves a defect recognition rate of approximately 0.89, with a relatively high false negative rate. In contrast, Figure 8 (right) shows that the improved model raises the defect recognition rate to approximately 0.94. This demonstrates that the YOLOv10-SPD model reduces false negatives and improves detection accuracy.
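To make the reading of a confusion matrix concrete, the following sketch extracts the diagonal rates and the off-diagonal confusions from a small matrix. The 3 × 3 counts are hypothetical, and the orientation (rows as true classes, columns as predicted classes) is an assumption of this example; plotting tools differ in which axis they normalize.

```python
def per_class_recall(cm):
    """Diagonal rate per class, where cm[i][j] counts true class i predicted as j."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def off_diagonal_errors(cm):
    """Return (true_class, predicted_class, count) for every nonzero confusion."""
    return [
        (i, j, count)
        for i, row in enumerate(cm)
        for j, count in enumerate(row)
        if i != j and count
    ]


# Hypothetical counts for three classes (not taken from Figure 8).
cm = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
print(per_class_recall(cm))
print(off_diagonal_errors(cm))
```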
As shown in Figure 9, the model underwent deep structural pruning, and the channel pruning configuration of each module is visualized (sorted in descending order by pruning rate). The retained channels on the left and the pruned channels on the right together reveal a key phenomenon: the pruning algorithm removes a large number of redundant feature channels (corresponding to modules with high pruning rates) while preserving the channels that contribute most to the model's discriminative capability. This selective pruning strategy substantially reduces the overall parameter count and computational cost without compromising the core of the model's performance, achieving both a lightweight design and improved inference efficiency while keeping detection accuracy stable.
Figure 10 visually compares detection results before and after the improvement. The false positives and missed detections of the original YOLOv10n model stem from inadequate feature discrimination in complex backgrounds, which hinders semantic separation between targets and background. By strengthening the extraction of key features and the suppression of background noise, the YOLOv10-SPD model improves both precision and recall: it distinguishes abnormal from normal substation states more reliably, accurately identifies most targets, and maintains robust performance in complex scenarios.
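Table 3 labels the pruning step as LAMP, which is assumed here to denote layer-adaptive magnitude-based pruning. Under that assumption, the scoring idea can be sketched as follows: within each layer, a weight's score is its squared magnitude divided by the sum of squared magnitudes of all weights in that layer that are at least as large, and the lowest-scored fraction is then removed across all layers. The sketch scores individual scalar weights on toy layers; it does not reproduce the channel-level procedure or settings used in this work, and all names are illustrative.

```python
def lamp_scores(weights):
    """Return one score per weight of a single layer, in the original order."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    scores = [0.0] * len(weights)
    tail = 0.0
    # Walk from the largest to the smallest magnitude, accumulating the
    # sum of squares of all weights that are at least as large.
    for i in reversed(order):
        sq = weights[i] ** 2
        tail += sq
        scores[i] = sq / tail if tail > 0 else 0.0
    return scores


def global_prune_mask(layers, prune_ratio):
    """Per-layer keep masks after removing the globally lowest-scored fraction."""
    scored = [
        (score, li, wi)
        for li, layer in enumerate(layers)
        for wi, score in enumerate(lamp_scores(layer))
    ]
    scored.sort()
    n_prune = int(len(scored) * prune_ratio)
    masks = [[True] * len(layer) for layer in layers]
    for _, li, wi in scored[:n_prune]:
        masks[li][wi] = False
    return masks


# Toy layers (hypothetical weights): the second layer has much smaller magnitudes.
layers = [[0.1, -2.0, 0.5], [0.01, 0.02, 0.03]]
print(global_prune_mask(layers, prune_ratio=0.5))
```

Because scores are normalized within each layer, the layer with uniformly small weights is not pruned away entirely, which is the layer-adaptive behavior that a plain global magnitude threshold would lack.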
Comparative experiments were conducted between current mainstream object detection algorithms and the YOLOv10-SPD model, with results shown in Table 4. The highest-performing detection results are highlighted in bold. The YOLOv10-SPD model achieved an mAP@0.5 of 94.11% in substation defect detection tasks, the highest among all models, with detection accuracy surpassing other algorithms.
Regarding model complexity, YOLOv10-SPD has only 0.55 million parameters, significantly fewer than Faster R-CNN [19] (41.35 million), Mask R-CNN [20] (43.97 million), YOLOv9-t [21] (2.61 million), and YOLOv7-tiny [22] (6.01 million), demonstrating superior model compression. Furthermore, its computational cost of 4.3 GFLOPs is the lowest among all models, far below the 124.9 GFLOPs of Faster R-CNN and the 150.4 GFLOPs of Mask R-CNN, indicating significantly enhanced computational efficiency.
Regarding model file size, YOLOv10-SPD occupies 2.54 MB, the smallest among all compared models. This makes it highly advantageous for deployment on embedded or edge devices with constrained memory and storage resources.
In summary, the YOLOv10-SPD model achieves significant reductions in parameter count, computational complexity, and model size while maintaining the highest detection accuracy. This characteristic endows it with high practicality and deployment value on resource-constrained edge devices.
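The comparison in Table 4 can be checked mechanically with a dominance test: one model dominates another if it is at least as accurate and no more expensive on every cost metric, and strictly better on at least one. The sketch below uses the Table 4 values as read from the table; the function names are illustrative.

```python
# (mAP@0.5 %, parameters M, GFLOPs, model size MB), as listed in Table 4.
models = {
    "Faster-RCNN": (84.43, 41.35, 124.9, 315.0),
    "Mask-RCNN": (90.49, 43.97, 150.4, 335.9),
    "YOLOv8n": (87.93, 3.01, 8.1, 5.96),
    "YOLOv10n": (88.97, 2.3, 6.7, 5.5),
    "YOLOv5s": (91.19, 7.1, 15.8, 13.7),
    "YOLOv9-t": (81.64, 2.61, 10.7, 5.8),
    "YOLOv7-tiny": (87.86, 6.01, 13.2, 12.3),
    "YOLOv10-SPD": (94.11, 0.55, 4.3, 2.54),
}


def dominates(a, b):
    """True if `a` is no worse than `b` on every metric and better on at least one."""
    no_worse = a[0] >= b[0] and all(x <= y for x, y in zip(a[1:], b[1:]))
    better = a[0] > b[0] or any(x < y for x, y in zip(a[1:], b[1:]))
    return no_worse and better


def pareto_front(table):
    """Names of models that are not dominated by any other model."""
    return [
        name
        for name, metrics in table.items()
        if not any(dominates(other, metrics) for k, other in table.items() if k != name)
    ]


print(pareto_front(models))
```

With these values the front contains a single model, which is consistent with the statement that YOLOv10-SPD has the highest accuracy together with the lowest parameter count, computational cost, and file size.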
To further evaluate the performance of the proposed method on different defect types, we analyzed the category-level detection results in Table 5. The model shows strong performance on clearly structured defects such as Panel Blur, Dial Damage, and Cover Plate Damage, where shape and texture features are prominent. For safety-related violations, including Failure to Wear Safety Helmet and Not Wearing Work Uniform, the model benefits from the enhanced multi-scale feature fusion and achieves stable detection performance under varying poses and scales.
More challenging categories include Silicone Tube Damage, Respirator Oil Seal Level Abnormality, and Suspended Particles, which typically occupy small regions and have low contrast with the background. The introduction of high-resolution feature maps and attention-based feature enhancement contributes to improved recognition of these subtle defects. For environmental contamination defects such as Oil Stains on the Floor and Bird’s Nest, the model demonstrates good robustness despite background complexity, indicating effective contextual modeling capability.
Figure 11 illustrates the trade-off between model size and accuracy. Larger shapes in the figure indicate larger models (parameter counts or file sizes). The improved model achieves high detection accuracy while remaining compact, combining lightweight design with strong detection performance.
Although this study does not present a separate quantitative evaluation for each individual substation, the dataset used for testing was collected from multiple substations with different equipment layouts, background structures, and lighting environments. Therefore, the reported overall detection performance inherently reflects the model’s behavior under cross-substation conditions.
Variations in installation styles, surface aging, environmental clutter, and illumination among substations introduce significant visual diversity. The stable detection performance observed in the test results indicates that the proposed multi-scale feature fusion and lightweight detection framework can effectively adapt to such cross-site differences. This suggests that the model has good generalization potential for deployment in diverse substation environments.
Future work will further conduct site-specific quantitative evaluations to measure performance variations more precisely across substations.

4. Conclusions

This paper addresses the challenge of balancing detection accuracy and model lightweighting in substation defect detection by proposing a high-precision lightweight detection method based on YOLOv10-SPD, effectively resolving key bottlenecks in practical deployment.
Although the individual techniques used in this work, such as attention mechanisms, multi-scale feature modeling, partial convolutions, and pruning, have been explored in previous studies, their systematic integration and task-specific optimization for substation defect detection provide a practical and effective lightweight solution. This demonstrates that engineering-oriented innovation can play an important role in bridging advanced detection techniques with real-world power system inspection requirements.
(1) To address the dense distribution of small objects in substation images, we designed the Structure-Aware Multi-Feature Fusion Attention (SMFA) module. By enhancing feature representation through residual structures, we achieved a 76.09% reduction in parameters, a 38.82% decrease in computational complexity, and a 53.82% reduction in model size.
(2) The FPSC module introduces multi-scale contextual feature extraction through convolutions with varying dilation rates, combined with channel compression and fusion. To mitigate information loss during the upsampling process, the Detect Efficient detection module is introduced, effectively enhancing the accuracy of the detection head. The mAP@0.5 reaches 94.11%, an improvement of 5.14 percentage points over the original YOLOv10n model.
(3) An efficient multi-scale feature extraction module, CSP_PMSFF, is constructed; SPD convolution is introduced to perform spatial downsampling on feature maps, thereby achieving feature reorganization and compression; finally, a magnitude-based hierarchical adaptive pruning strategy is incorporated to enable model compression and accelerated inference, enhancing adaptability for edge deployment.
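The space-to-depth rearrangement underlying SPD convolution can be illustrated with a small sketch. It assumes a scale of 2, as in Figure 5, and operates on nested lists laid out as height × width × channels; the channel ordering within each output cell is a choice of this example, and the convolution that normally follows the rearrangement is omitted.

```python
def space_to_depth(x, scale=2):
    """Rearrange an H x W x C map into (H/scale) x (W/scale) x (C*scale*scale).

    Every scale x scale spatial block is moved into the channel dimension,
    so resolution drops without discarding any values.
    """
    h, w = len(x), len(x[0])
    if h % scale or w % scale:
        raise ValueError("height and width must be divisible by scale")
    out = []
    for i in range(0, h, scale):
        row = []
        for j in range(0, w, scale):
            cell = []
            for di in range(scale):
                for dj in range(scale):
                    cell.extend(x[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# A 4 x 4 single-channel map whose values are 0..15 in row-major order.
feature_map = [[[r * 4 + c] for c in range(4)] for r in range(4)]
downsampled = space_to_depth(feature_map, scale=2)
print(len(downsampled), len(downsampled[0]), len(downsampled[0][0]))
print(downsampled[0][0], downsampled[1][1])
```

Unlike strided convolution or pooling, this rearrangement keeps every input value, which is why it is attractive for small targets whose evidence occupies only a few pixels.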

Author Contributions

Conceptualization, T.Z. and T.W.; methodology, T.Z.; software, T.Z.; validation, T.Z.; formal analysis, T.W.; investigation, T.W.; data curation, Z.O.; writing—original draft preparation, T.Z.; writing—review and editing, T.W.; visualization, Z.O.; supervision, T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China, project number 51807110.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author, Tong Zhang, upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, Z.; Ma, D.; Shi, Y.; Li, G. Appearance defect detection algorithm of substation instrument based on improved YOLOX. J. Graph. Imaging 2023, 44, 937–946.
  2. Zhang, Y.; Song, A.; Miao, T.; Li, Q.; Wang, S.; Chen, D. Research progress on multimodal teleoperation robots for power facility inspection and maintenance. China Test 2025, 51, 18–30.
  3. Kuang, J.; Li, Z. Development of a novel intellectual transmission line inspection system. J. Chongqing Univ. Technol. 2006, 139–142.
  4. Zhang, K.; Feng, X.; Guo, Y.; Su, Y.; Zhao, K.; Zhao, Z.; Ma, Z.; Ding, Q. Overview of deep convolutional neural networks for image classification. Chin. J. Image Graph. 2021, 26, 2305–2325.
  5. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning filters for efficient convnets. arXiv 2016, arXiv:1608.08710.
  6. Zhao, S.; Zhang, C.; Fan, C. Deep feature extraction-based sequential image stitching network for deep-sea environments. Electron. Meas. Technol. 2025, 48, 180–187.
  7. Yu, H.; Gong, Z.; Zhang, H.; Zhou, S.; Yu, Z. Research on substation equipment identification and defect detection technology based on Faster R-CNN algorithm. Electr. Meas. Instrum. 2024, 61, 153–159.
  8. Wang, M.; Yu, S. A Steel Surface Defect Detection Model Based on the Improved YOLOv8n. J. China Acad. Electron. Inf. Technol. 2024, 19, 559–569.
  9. Zhang, M.; Xing, F.; Liu, D. External defect detection of transformer substation equipment based on improved Faster R-CNN. J. Intell. Syst. 2024, 19, 290–298.
  10. Sun, Y.; Xu, B.; Hong, S.; Zhang, H. Discriminant algorithm of substation equipment defects based on improved siamese network. Power Supply Util. 2022, 39, 100–107.
  11. Liu, Q.; Dong, L.; Zeng, Z.; Wen, Q.; Zhu, Y.; Chen, M. SSD with Multi-Scale Feature Fusion and Attention Mechanism. Sci. Rep. 2023, 13, 21387.
  12. Ling, T.; Bei, Z.; Zhang, S.; Zhang, L.; Jiang, H. Surface Defect Detection in Sewer Pipelines Based on YOLO-Pipe and ByteTrack Methods. China Water Wastewater 2025, 41, 125–130.
  13. Ge, Z.; Li, H.; Liu, H.; Jia, Z.; Zhou, K.; Xing, Y. Real-time Defect Detection Method for Edge-end of Transmission Line Based on YOLO-GSS. High Volt. Eng. 2025, 51, 669–677.
  14. Dai, W.; Wang, Y.; Li, X.; Wang, Y. YOLO aluminum profile surface defect detection system for FPGA deployment. J. Electron. Meas. Instrum. 2023, 37, 160–167.
  15. Luo, L.; Ma, R.; Li, Y.; Yang, F.; Qiu, Z. Image Recognition Technology and Its Application in Defect Detection and Diagnosis Analysis of Substation Equipment. Sci. Program 2021, 2021, 2021344.
  16. Meng, Q.; Fu, T.; Li, K.; Huang, L.; Chen, S. Defect Detection Algorithm for Electrical Substation Equipment Based on Improved YOLOv10n. IEEE Access 2025, 13, 91409–91422.
  17. Liu, G.; Reda, F.A.; Shi, K.; Wang, T.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 85–100.
  18. Li, J.; Yuan, C.; Wang, X. Real-time instance-level detection of asphalt pavement distress combining space-to-depth (SPD) YOLO and omni-scale network (OSNet). Autom. Constr. 2023, 155, 105062.
  19. Liu, B.; Zhao, W.; Sun, Q. Study of object detection based on Faster R-CNN. In 2017 Chinese Automation Congress (CAC); IEEE: New York, NY, USA, 2017; pp. 6233–6236.
  20. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 2961–2969.
  21. Sun, Y.; Pan, H. Bird Recognition Algorithm Based on Improved YOLOv9. China Sci. Technol. Inf. 2025, 124–126.
  22. Liu, X.; Wang, B. Improved YOLOv7-tiny Lightweight Algorithm for Insulator Defect Detection. Radio Eng. 2024, 54, 2305–2314.
Figure 1. YOLOv10-SPD Algorithm Model Structure.
Figure 2. Feature Modulation Block Module.
Figure 3. Feature Pyramid Shared Conv Module Diagram.
Figure 4. Detection Head Module.
Figure 5. (a–e) SPD-Conv Schematic Diagram at scale = 2.
Figure 6. CSP_PMSFF Multi-Scale Feature Fusion Module.
Figure 7. Comparison chart of model mAP@0.5 before and after improvement.
Figure 8. Comparison of confusion matrices between YOLOv10n (left) and YOLOv10-SPD (right).
Figure 9. Number of channels pruned and removed across modules.
Figure 10. Detection results of common substation defects using the lightweight improved algorithm.
Figure 11. Comparison of Model Size and Accuracy.
Table 1. Number of Annotated Defect Targets by Substation Category.
Defect Object Category | Count/Image | Proportion/%
Panel Blur | 613 | 12.86
Dial damage | 283 | 5.94
Shell damaged | 293 | 6.15
Oil stains on the floor | 633 | 13.28
Silicone tube damage | 58 | 1.21
Abnormal door closure | 294 | 6.17
Suspended Particles | 73 | 1.53
Bird’s Nest | 199 | 4.17
Cover plate damage | 240 | 5.03
Failure to wear safety helmet | 325 | 6.82
Not wearing work uniform | 418 | 8.77
Meter reading anomaly | 330 | 6.92
Respirator oil seal oil level abnormality | 25 | 0.52
Clamping plate closure | 152 | 3.19
Silicone Color Change | 832 | 17.45
Table 2. Training Parameters.
Parameter Name | Parameter Setting
Image Size | 640 × 640
Training Iterations | 240
Batch Size | 12
Number of Threads | 8
Training Optimizer | SGD
Initial learning rate | 0.01
Table 3. Ablation experiment.
C2f-FMB | Multi-Scale Feature Extraction Module | Detect Efficient Probe | LAMP Pruning | mAP@0.5/% | Parameters/M | GFLOPs/G | Model Size/MB
- | - | - | - | 88.97 | 2.3 | 6.7 | 5.5
 |  |  |  | 88.92 | 2.15 | 6.9 | 5.97
 |  |  |  | 92.25 | 2.95 | 8.5 | 4.58
 |  |  |  | 91.64 | 3.23 | 5.9 | 7.5
 |  |  |  | 91.95 | 3.49 | 5.7 | 6.49
 |  |  |  | 91.59 | 2.48 | 22.5 | 6.52
 |  |  |  | 90.65 | 3.52 | 5.3 | 6.96
 |  |  |  | 92.56 | 2.52 | 6.2 | 4.94
 |  |  |  | 94.11 | 0.55 | 4.3 | 2.54
Table 4. Comparison of Different Algorithms.
Model | mAP@0.5/% | Parameters/M | GFLOPs/G | Model Size/MB
Faster-RCNN | 84.43 | 41.35 | 124.9 | 315.0
Mask-RCNN | 90.49 | 43.97 | 150.4 | 335.9
YOLOv8n | 87.93 | 3.01 | 8.1 | 5.96
YOLOv10n | 88.97 | 2.3 | 6.7 | 5.5
YOLOv5s | 91.19 | 7.1 | 15.8 | 13.7
YOLOv9-t | 81.64 | 2.61 | 10.7 | 5.8
YOLOv7-tiny | 87.86 | 6.01 | 13.2 | 12.3
YOLOv10-SPD | 94.11 | 0.55 | 4.3 | 2.54
Table 5. Category-level AP@0.5 of YOLOv10-SPD.
Category | AP@0.5 (%)
Panel Blur | 91.23
Dial Damage | 89.56
Shell damaged | 94.43
Oil Stains on the Floor | 92.11
Silicone Tube Damage | 85.72
Abnormal Door Closure | 87.15
Suspended Particles | 82.65
Bird’s Nest | 90.43
Cover Plate Damage | 88.77
Failure to Wear Safety Helmet | 94.02
Not Wearing Work Uniform | 90.67
Meter Reading Anomaly | 85.88
Respirator Oil Seal Level Abnormality | 81.24
Clamping Plate Closure | 88.90
Silicone Color Change | 93.56
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, T.; Wu, T.; Ouyang, Z. Lightweight Defect Detection in Substations with Multi-Scale Features and Network Pruning. Energies 2026, 19, 1163. https://doi.org/10.3390/en19051163

Chicago/Turabian Style

Zhang, Tong, Tian Wu, and Zhenhui Ouyang. 2026. "Lightweight Defect Detection in Substations with Multi-Scale Features and Network Pruning" Energies 19, no. 5: 1163. https://doi.org/10.3390/en19051163
