Article

A Deep Learning-Based Monitoring Framework for Foreign Object Detection in Power Distribution Substations

1 State Grid Beijing Electric Power Research Institute, Beijing 100075, China
2 School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
3 Beijing Rongtong Wisdom Technology Group Co., Ltd., Beijing 100160, China
4 Dongfang Electronics Co., Ltd., Yantai 264008, China
* Author to whom correspondence should be addressed.
Processes 2026, 14(12), 1899; https://doi.org/10.3390/pr14121899
Submission received: 15 April 2026 / Revised: 15 May 2026 / Accepted: 21 May 2026 / Published: 11 June 2026
(This article belongs to the Section Energy Systems)

Abstract

With the increasing adoption of unattended power distribution substations, accurate foreign object detection has become critical to ensure safe system operation. This study proposes a detection model tailored for substation monitoring, targeting hazards such as fire, water accumulation, and small animal intrusion, while accounting for varying on-site illumination conditions. First, an adaptive illumination normalization module is introduced to accommodate diverse lighting conditions, thereby enhancing its capability to capture foreign objects under complex illumination environments. Second, a multi-scale feature extraction and attention-based refinement structure is developed to effectively capture foreign objects with diverse sizes and textures, aligning with the specific detection requirements of substation scenarios. Third, a task-oriented loss function is constructed by incorporating illumination-adaptive weighting into the objectness component, thereby enhancing robustness under uneven illumination conditions. Experimental results demonstrate that the proposed method outperforms representative detection approaches, validating its effectiveness for foreign object detection in substation monitoring applications.

1. Introduction

Ensuring the safe operation of power distribution substations is essential for maintaining the reliability of modern power systems [1,2]. In practice, power distribution substations are frequently exposed to various external hazards, including fire incidents, water accumulation, and small animal intrusion, which may cause equipment damage and operational risks [3]. With the rapid development of intelligent and unattended substations, there is an increasing demand for automated monitoring systems capable of accurately identifying such hazards, thereby enabling timely intervention by inspection personnel [4,5,6]. This motivates the development of robust detection algorithms for effective foreign object monitoring under real-world operating conditions.
In the existing literature, a variety of vision-based detection methods have been developed. Traditional approaches primarily rely on handcrafted features, such as edge information and texture patterns, to identify abnormal regions in monitoring images. For instance, ref. [7] exploits color space transformation to separate luminance and chrominance components, enabling real-time identification of flame regions. Ref. [8] models background dynamics using Kalman filtering and extracts flame-related features, which are further analyzed through decision mechanisms for rapid fire detection. Ref. [9] combines static and dynamic characteristics, applying fuzzy logic-based image enhancement and Gaussian mixture modeling to distinguish smoke patterns. Meanwhile, ref. [10] captures the dynamic variation of flame color, establishing a relationship between color evolution and flame behavior. However, these approaches rely heavily on manually designed features and heuristic rules, which may limit their robustness in practical substation environments, where diverse foreign targets exhibit significant variability in feature characteristics and illumination conditions vary substantially.
In recent years, deep learning-based methods have been increasingly applied to foreign object detection tasks in power systems. Representative approaches include single-stage detectors such as You Only Look Once (YOLO) [3,11,12,13,14,15,16] and Single-Shot MultiBox Detector (SSD) [17,18], as well as two-stage detectors, such as Faster Region-based Convolutional Neural Network (Faster R-CNN) [19,20,21]. Ref. [11] combines dilated and standard convolutions to enlarge the receptive field for feature extraction, while [12] incorporates attention mechanisms to focus on critical regions and enhance object feature representation. Ref. [14] introduces YOLOv8 with enhanced attention and feature pyramid mechanisms, improving multi-scale feature extraction and fusion capability. Ref. [15] proposes a YOLOv9-based variant that incorporates attention mechanisms and an improved IoU-based loss function, enhancing feature representation and localization accuracy for fire detection tasks. Ref. [16] develops a lightweight YOLOv10 variant incorporating a Convolutional Block Attention Module (CBAM), which enhances feature representation while maintaining efficient deployment on edge devices. In contrast, two-stage detection frameworks, such as Faster R-CNN, achieve high detection accuracy in complex and cluttered scenes at the cost of slower inference speed. Network designs in [19] optimize small object detection by reducing downsampling and employing smaller convolution kernels. Additionally, ref. [20] introduces a moving object region extraction network combined with a classification module to reduce redundant candidate regions.
Nevertheless, compared with general object detection tasks, substation environments present distinct challenges. First, foreign objects encountered in substations, such as fire incidents, water accumulation, and small animals, exhibit significant variations in scale and visual characteristics. Moreover, images captured on site are often affected by varying illumination conditions, including overexposure and shadowing effects. Existing deep learning-based methods typically do not explicitly consider these factors in their design, which may reduce their performance for foreign object detection in substation environments.
The proposed method is based on the general anchor-based YOLO detection paradigm, similar to YOLOv5. Compared with YOLOv5 frameworks, several task-specific modifications are introduced. First, an adaptive illumination normalization module is applied at the input stage. Second, a parallel multi-scale feature extraction structure is integrated into the backbone to replace conventional hierarchical feature fusion. Third, an attention-based feature refinement module is inserted after the multi-scale feature extraction layers within the backbone to refine intermediate feature representations. Fourth, the standard YOLO loss is extended with an illumination-adaptive weighting mechanism at the training stage. These modifications collectively improve feature representation and detection reliability in complex substation environments.
The main contributions of this paper are summarized as follows:
  • An adaptive illumination normalization mechanism is proposed, adapting to lighting variations in real-world substation environments.
  • A parallel multi-scale feature extraction structure is developed to replace conventional feature fusion strategies, enabling representation of objects with varying sizes and textures.
  • A feature refinement mechanism combining channel weighting and local residual enhancement is designed, improving discriminability between objects with similar textures.
  • An illumination-adaptive loss function is introduced to improve objectness learning under challenging illumination conditions, leading to more reliable detection performance.

2. Methodology

2.1. Overview

Compared with conventional foreign object detection tasks, detection in distribution substations presents unique challenges. Foreign objects in these environments, including fire hazards, water accumulation, and small animals, exhibit significant variations in scale and diverse texture characteristics. In addition, substation scenes often exhibit overall bright or dark conditions due to variations in time of day and weather, as well as locally uneven illumination, which further complicates accurate detection.
To address these challenges while maintaining lightweight and real-time performance, we propose a detection network composed of three main modules: Adaptive Image Normalization Module (AINM), Multi-Scale Local Feature Enhancement (MS-LFE), and Adaptive Attention Feature Module (AAFM), followed by a detection neck and head for bounding box and class prediction. Each module is designed to enhance feature extraction and robustness across diverse scenarios, as illustrated in Figure 1. The detailed architectural parameters are summarized in Table 1.

2.2. Adaptive Image Normalization Module (AINM)

Images captured in substations are often subject to extreme lighting variations, such as overexposure under strong sunlight or underexposure at night. To compensate for these variations and produce stable feature maps for subsequent layers, the AINM first applies adaptive gamma correction and then channel-wise normalization.

2.2.1. Gamma Correction Layer

To handle varying lighting conditions in substations, an adaptive gamma correction is applied to the input image $F_0 \in \mathbb{R}^{H \times W \times C}$. For each channel $c$, the mean and standard deviation of pixel intensities are computed as:
$$\mu_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_0(i,j,c), \tag{1}$$
$$\sigma_c = \sqrt{\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( F_0(i,j,c) - \mu_c \right)^2}. \tag{2}$$
Based on these statistics, the adaptive gamma exponent is defined as:
$$\gamma_c = \frac{\log(0.5)}{\log(\mu_c + \epsilon)} + \lambda \cdot \sigma_c, \tag{3}$$
and the output is obtained by
$$F_1(i,j,c) = F_0(i,j,c)^{\gamma_c}. \tag{4}$$
Here, $\epsilon$ is a small constant to ensure numerical stability.
In practical implementation, the above operations are performed in a vectorized manner. Specifically, $\mu_c$ and $\sigma_c$ are computed for each channel on a per-image basis during the forward pass, resulting in channel-wise scalars. The exponent $\gamma_c$ is then applied uniformly across all spatial locations within the corresponding channel, and the power operation is applied element-wise to obtain $F_1$. This design ensures computational efficiency and seamless integration into standard deep learning frameworks.
The parameter $\lambda$ is a sensitivity parameter controlling the response to illumination variation. It is shared across different modules to maintain a consistent scaling of illumination sensitivity throughout the network. The value is empirically determined via cross-validation and set to 0.5. Larger values may lead to over-enhancement in bright or highly varying regions, while smaller values may result in insufficient correction in dark or low-illumination areas.
This adaptive operation dynamically adjusts intensity distributions according to image-specific statistics, enabling robust feature extraction under both overexposed and underexposed conditions.
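As an illustrative sketch of the gamma correction layer, the per-channel computation can be written in plain Python. The sketch assumes intensities scaled to [0, 1], reads the exponent as log(0.5)/log(μ_c + ϵ) + λ·σ_c, and clamps the logarithm argument below 1 to avoid division by zero; the function name `adaptive_gamma`, the value of `eps`, and the clamp are assumptions made for illustration and are not taken from the original implementation.

```python
import math


def adaptive_gamma(channel, lam=0.5, eps=1e-6):
    """Apply adaptive gamma correction to one channel (2-D list in [0, 1])."""
    pixels = [v for row in channel for v in row]
    n = len(pixels)
    mu = sum(pixels) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in pixels) / n)
    # Assumption: keep the log argument strictly below 1 so log() is never 0.
    base = min(mu + eps, 1.0 - eps)
    # Assumed reading of the exponent: log(0.5) / log(mu + eps) + lam * sigma.
    gamma = math.log(0.5) / math.log(base) + lam * sigma
    corrected = [[v ** gamma for v in row] for row in channel]
    return corrected, gamma


# A uniformly dark channel is brightened (gamma < 1),
# a uniformly bright channel is darkened (gamma > 1).
dark, gamma_dark = adaptive_gamma([[0.25, 0.25], [0.25, 0.25]])
bright, gamma_bright = adaptive_gamma([[0.75, 0.75], [0.75, 0.75]])
```

For a uniform channel the standard deviation is zero, so the exponent reduces to log(0.5)/log(μ_c) and the corrected mean moves toward 0.5, which is the behavior the text describes for underexposed and overexposed inputs.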

2.2.2. Channel-Wise Normalization Layer

After gamma correction, channel-wise normalization is performed to standardize each channel’s feature distribution:
$$F_2(i,j,c) = \frac{F_1(i,j,c) - \mu_c}{\sigma_c}, \tag{5}$$
where $\mu_c$ and $\sigma_c$ are computed from $F_1$. The resulting normalized feature map $F_2$ ensures that each channel contributes comparably to subsequent convolutional layers, enhancing robustness against illumination-induced bias.
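The normalization step can be sketched in the same way. The small constant `eps` added to the denominator is an assumption introduced here to guard against constant-valued channels; it does not appear in the formula above, and the function name is hypothetical.

```python
import math


def normalize_channel(channel, eps=1e-6):
    """Standardize one channel (2-D list) to zero mean and unit variance."""
    pixels = [v for row in channel for v in row]
    n = len(pixels)
    mu = sum(pixels) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in pixels) / n)
    # eps is an added safeguard (assumption) for channels with sigma == 0.
    return [[(v - mu) / (sigma + eps) for v in row] for row in channel]


normalized = normalize_channel([[0.0, 1.0], [0.0, 1.0]])
```

In this toy input the mean and standard deviation are both 0.5, so the two intensity levels map to approximately -1 and +1.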

2.3. Multi-Scale Local Feature Enhancement (MS-LFE)

This study considers common foreign objects in distribution substations, including fire, water accumulation, and small animal intrusion. These targets range from small sparks or animals to large water regions, requiring the network to effectively extract features across multiple spatial scales. The proposed MS-LFE module addresses this by applying convolutional layers with diverse receptive fields, followed by feature concatenation and max pooling. This design preserves fine details for small targets while capturing coarse patterns for larger targets.

2.3.1. Multi-Scale Convolution Layer

$$F_3 = \text{ReLU}\left( \text{Conv}_{3 \times 3}^{32}(F_2) \right), \tag{6}$$
$$F_4 = \text{ReLU}\left( \text{Conv}_{7 \times 7}^{32}(F_2) \right), \tag{7}$$
$$F_5 = \text{ReLU}\left( \text{Conv}_{15 \times 15}^{32}(F_2) \right), \tag{8}$$
where $F_2$ is the normalized input feature map, $\text{Conv}_{k \times k}^{32}$ denotes a convolutional layer with kernel size $k \times k$ and 32 filters, and $\text{ReLU}(\cdot)$ is the rectified linear activation function. Here, $F_3$, $F_4$, and $F_5$ capture fine, medium, and coarse features, respectively.
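To give a sense of the relative cost of the three branches, the following sketch counts the learnable parameters of each convolution and the zero padding needed for the outputs to keep the same spatial size (a requirement for the later concatenation). It assumes a three-channel (RGB) input, stride 1, and one bias per filter; these settings are illustrative assumptions, not values confirmed by the text.

```python
def conv_params(kernel, in_channels, out_channels=32):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels


def same_padding(kernel):
    """Zero padding that preserves H x W at stride 1 (odd kernel sizes)."""
    return (kernel - 1) // 2


# Assumption: RGB input (in_channels=3) feeding all three branches.
for k in (3, 7, 15):
    print(k, same_padding(k), conv_params(k, in_channels=3))
```

Under these assumptions the 15 × 15 branch holds roughly 24 times as many parameters as the 3 × 3 branch, which is the price paid for the wider receptive field.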

2.3.2. Feature Concatenation and Max Pooling

Multi-scale feature maps are adaptively fused using learnable scale weights and then spatially downsampled using max pooling:
$$F_6 = \text{Concat}\left( w_3 \cdot F_3,\; w_7 \cdot F_4,\; w_{15} \cdot F_5 \right), \tag{9}$$
$$F_7^{(c)}[i,j] = \max\left( F_6^{(c)}[2i : 2i+1,\; 2j : 2j+1] \right), \tag{10}$$
where $w_3$, $w_7$, and $w_{15}$ are learnable scale weights corresponding to convolutional kernels of size $3 \times 3$, $7 \times 7$, and $15 \times 15$, respectively. $F_6$ is the fused feature map, and $F_7$ is the pooled feature map. The index $c$ denotes the channel dimension, and $[i,j]$ represents spatial coordinates.
Compared with fixed concatenation, the introduction of learnable weights enables the network to adaptively adjust the contribution of different receptive fields. This mechanism allows the model to emphasize fine-grained features for small targets while preserving coarse contextual information for larger regions. Such adaptive fusion is particularly beneficial in substation environments, where object scales vary significantly. The max pooling operation further reduces computational complexity while retaining the most salient local responses, ensuring efficient feature representation for subsequent detection stages.
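The fusion and pooling steps can be illustrated with a minimal pure-Python sketch. Feature maps are represented as lists of channels, each channel being a 2-D list with even height and width; the scale weights are passed in as plain numbers rather than learned, and the function names are hypothetical.

```python
def max_pool_2x2(channel):
    """2x2 max pooling with stride 2 on a 2-D list (even H and W assumed)."""
    h, w = len(channel), len(channel[0])
    return [
        [
            max(
                channel[2 * i][2 * j], channel[2 * i][2 * j + 1],
                channel[2 * i + 1][2 * j], channel[2 * i + 1][2 * j + 1],
            )
            for j in range(w // 2)
        ]
        for i in range(h // 2)
    ]


def fuse_and_pool(branches, weights):
    """Scale each branch by its weight, concatenate along channels, then pool."""
    fused = [
        [[wt * v for v in row] for row in ch]
        for branch, wt in zip(branches, weights)
        for ch in branch
    ]
    return [max_pool_2x2(ch) for ch in fused]


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pooled = fuse_and_pool([[grid], [grid]], weights=[1.0, 0.5])
```

Here two single-channel branches yield a two-channel output of half the spatial size, with the second channel scaled by its weight of 0.5 before pooling.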

2.4. Adaptive Attention Feature Module (AAFM)

Targets in substations, such as fire sparks, water patches, or small animals, often appear in complex backgrounds and under varying illumination. To enhance the most informative features while suppressing irrelevant background noise, the Adaptive Attention Feature Module (AAFM) is introduced. This module sequentially applies a channel weighting layer, a local residual layer, and a channel attention layer, progressively refining feature representations for robust detection.

2.4.1. Channel Weighting Layer

To emphasize channels that carry the most discriminative information for different target types, global features are first extracted using global average pooling (GAP):
$$\text{GAP}\left( F_7(:,:,c) \right) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_7(i,j,c), \tag{11}$$
$$w_c = \text{Sigmoid}\left( \text{GAP}\left( F_7(:,:,c) \right) \right), \tag{12}$$
$$F_8(i,j,c) = w_c \cdot F_7(i,j,c), \tag{13}$$
where $F_7$ is the input feature map from the MS-LFE module, $w_c$ is the weight of channel $c$, and $F_8$ is the channel-weighted feature map. The use of GAP allows the network to capture the overall importance of each channel, highlighting critical information such as flame intensity or water reflection patterns, while suppressing channels dominated by background noise.
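A minimal sketch of the channel weighting layer follows, again using nested lists in place of tensors. The function name and the toy input values are hypothetical and serve only to show how the per-channel scalar is formed and applied.

```python
import math


def channel_weighting(feature_map):
    """Weight each channel by the sigmoid of its global average.

    feature_map: list of channels, each a 2-D list of floats.
    Returns the weighted feature map and the per-channel weights.
    """
    weighted, weights = [], []
    for ch in feature_map:
        pixels = [v for row in ch for v in row]
        gap = sum(pixels) / len(pixels)          # global average pooling
        w = 1.0 / (1.0 + math.exp(-gap))         # sigmoid gate
        weights.append(w)
        weighted.append([[w * v for v in row] for row in ch])
    return weighted, weights


fmap = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
weighted, weights = channel_weighting(fmap)
```

A channel with zero mean activation receives a weight of 0.5, while a channel with stronger average response is passed through with a weight closer to 1.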

2.4.2. Local Residual Layer

To enhance the representation of fine-grained local details, particularly for small or low-contrast targets, we propose a local residual convolutional layer that captures subtle structural information and enriches feature textures, yielding:
$$F_9 = F_8 + \alpha \cdot \text{Conv}_{3 \times 3}(F_8), \tag{14}$$
where $\text{Conv}_{3 \times 3}$ is a $3 \times 3$ convolution, $\alpha = 0.2$ scales the residual contribution, and $F_9$ is the enhanced feature map. This residual connection preserves original features while highlighting local patterns, ensuring small targets such as sparks or animals remain detectable.

2.4.3. Channel Attention Layer

Finally, channel attention is applied to adaptively modulate channel significance based on global context:
$$F_{10}(i,j,c) = F_9(i,j,c) \cdot \text{Sigmoid}\left( \text{GAP}(F_9) \cdot W \right)_c, \tag{15}$$
where $W$ is a learnable weight vector and $F_{10}$ is the attention-refined feature map. This operation suppresses less informative channels while enhancing channels critical for detection, improving robustness against cluttered backgrounds and varying illumination.
Overall, by integrating channel weighting, local residual enhancement, and channel attention, the AAFM module generates feature maps that are both spatially and channel-wise informative, serving as an optimized input for the detection neck and head. This progressive refinement preserves and emphasizes subtle textures, edges, and color patterns associated with various targets, thereby improving the model’s overall detection performance.

2.5. Detection Neck and Head

The enhanced features are then processed by the detection neck and head for prediction:
$$F_{11} = \text{ReLU}\left( \text{Conv}_{3 \times 3}^{128}(F_{10}) \right), \tag{16}$$
$$F_{12} = \text{ReLU}\left( \text{Conv}_{1 \times 1}^{64}(F_{11}) \right), \tag{17}$$
$$F_{13} = \text{Conv}_{1 \times 1}^{A(5+C)}(F_{12}), \tag{18}$$
where $A$ is the number of anchor boxes, $C$ is the number of classes, and $F_{13}$ is the final prediction feature map for bounding box regression and classification.
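The channel count A(5 + C) of the prediction map can be made concrete with a short sketch. It assumes the common YOLO layout in which each anchor predicts four box values, one objectness score, and C class scores in that order; this ordering and the choice of three anchors are assumptions for illustration, since the text does not state them.

```python
def head_channels(num_anchors, num_classes):
    """Number of output channels of the final 1x1 convolution."""
    return num_anchors * (5 + num_classes)


def split_prediction(vector, num_anchors, num_classes):
    """Split one grid cell's output vector into per-anchor predictions.

    Assumed layout per anchor: [x, y, w, h, objectness, class scores...].
    """
    step = 5 + num_classes
    if len(vector) != num_anchors * step:
        raise ValueError("vector length does not match A * (5 + C)")
    predictions = []
    for a in range(num_anchors):
        chunk = vector[a * step:(a + 1) * step]
        predictions.append(
            {"box": chunk[:4], "objectness": chunk[4], "classes": chunk[5:]}
        )
    return predictions


# Three classes (fire, water accumulation, small animal) and an
# assumed three anchors per cell give 3 * (5 + 3) = 24 channels.
channels = head_channels(num_anchors=3, num_classes=3)
cells = split_prediction(list(range(channels)), num_anchors=3, num_classes=3)
```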

2.6. Loss Function

Foreign object detection in substations often occurs under uneven illumination, where individual images contain regions of both overexposure and deep shadows. The standard YOLO multi-task loss treats all objectness contributions uniformly, without accounting for spatially varying illumination within a single image. This may weaken the learning of targets under locally uneven illumination, where inconsistent lighting can degrade feature representation, leading to missed detections. To address this issue, an illumination-adaptive weight $\beta_i$ is incorporated into the objectness component of the loss. This weight captures the intensity variation within the $i$-th predicted bounding box, emphasizing regions where local illumination differs significantly.
$$L_{\text{obj}}^{\text{adaptive}} = \sum_i \beta_i \, \mathbb{1}_i^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\text{noobj}} \sum_i \mathbb{1}_i^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2, \tag{19}$$
$$\beta_i = 1 + \lambda \left( \max(I_i) - \min(I_i) \right), \tag{20}$$
where $C_i$ denotes the predicted confidence score of the $i$-th bounding box, and $\hat{C}_i$ is the corresponding ground truth confidence. The indicators $\mathbb{1}_i^{\text{obj}}$ and $\mathbb{1}_i^{\text{noobj}}$ represent the presence or absence of an object in the corresponding anchor, respectively. The parameter $\lambda_{\text{noobj}}$ controls the contribution of background regions, while $\lambda$ is a sensitivity parameter defined in (3). $I_i$ denotes the set of intensity values within the $i$-th predicted bounding box. By increasing the gradient contribution for regions with large local intensity variations, $\beta_i$ strengthens the learning of small or low-contrast targets, improving detection robustness under locally uneven illumination. The same parameter $\lambda$ is adopted to ensure consistency with the illumination modeling in AINM. Although $\lambda$ is fixed, the adaptive formulation based on the local intensity range $\max(I_i) - \min(I_i)$ enables dynamic adjustment under varying lighting conditions, including extreme cases.
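The weighting scheme can be sketched as follows. The sketch assumes intensities scaled to [0, 1] and uses 0.5 for the background weight, which is the default of the original YOLO loss rather than a value stated in the text; the function names and toy numbers are hypothetical.

```python
def illumination_weight(intensities, lam=0.5):
    """beta_i = 1 + lam * (max intensity - min intensity) inside one box."""
    return 1.0 + lam * (max(intensities) - min(intensities))


def adaptive_objectness_loss(preds, targets, has_object, box_intensities,
                             lam=0.5, lambda_noobj=0.5):
    """Objectness loss with illumination-adaptive weighting of object terms.

    lambda_noobj=0.5 is an assumed value, not taken from the text.
    """
    loss = 0.0
    for c, c_hat, obj, vals in zip(preds, targets, has_object, box_intensities):
        err = (c - c_hat) ** 2
        if obj:
            loss += illumination_weight(vals, lam) * err
        else:
            loss += lambda_noobj * err
    return loss


# One object box spanning shadow and glare, one uniform background box.
loss = adaptive_objectness_loss(
    preds=[0.6, 0.2],
    targets=[1.0, 0.0],
    has_object=[True, False],
    box_intensities=[[0.1, 0.9], [0.5, 0.5]],
)
```

With an intensity range of 0.8 the object term is weighted by 1.4 instead of 1, so the same confidence error contributes 40% more to the loss than it would under uniform lighting.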
Based on this formulation, the complete adaptive loss function is defined as
$$L_{\text{adaptive}} = \lambda_{\text{coord}} \sum_i \mathbb{1}_i^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{\text{size}} \sum_i \mathbb{1}_i^{\text{obj}} \left[ (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] + L_{\text{obj}}^{\text{adaptive}}, \tag{21}$$
where $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ denote the predicted and ground truth bounding box coordinates. This formulation preserves the original YOLO confidence learning mechanism while introducing illumination-adaptive weighting into the objectness component. As a result, the model becomes more sensitive to targets under challenging lighting conditions, thereby improving detection reliability without introducing additional hyperparameters.
For comparison, the standard YOLO multi-task loss is defined as follows:
$$L_{\text{YOLO}} = \lambda_{\text{coord}} \sum_i \mathbb{1}_i^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{\text{size}} \sum_i \mathbb{1}_i^{\text{obj}} \left[ (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] + \sum_i \mathbb{1}_i^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\text{noobj}} \sum_i \mathbb{1}_i^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2. \tag{22}$$
It can be observed that, in the standard formulation, all objectness contributions are treated uniformly regardless of local illumination conditions. However, this design does not account for spatially varying illumination within a single image, which is common in substation environments. In practice, substation scenes often involve mixed indoor and outdoor settings, leading to uneven lighting distributions with coexisting bright and shadowed regions. As a result, the uniform treatment of objectness contributions may limit the model’s ability to effectively learn from targets located in locally uneven illumination.

3. Case Study

3.1. Database Construction

The dataset used in this study was constructed by a power distribution company in China, with most images collected from on-site monitoring records. It comprises 8000 images of fire hazards, 2000 images of water accumulation, and 2000 images of small animal intrusion, along with an additional 5000 background-only images. To reflect real operational conditions, the dataset covers diverse illumination scenarios, including normal lighting (7415 images), low-light conditions (6844 images), and overexposed or glare-affected scenes (2741 images). Object categories and illumination conditions are treated as independent factors in the dataset construction, ensuring that each class is observed under various environmental lighting conditions. The annotation format follows the standard bounding-box representation (x, y, w, h). All annotations were manually labeled by trained personnel, and a two-stage verification process was applied to ensure annotation quality. The dataset is split into training, validation, and testing sets with a ratio of 6:2:2. The use of field-collected data ensures strong practical relevance and effectively reflects real-world operating conditions in power distribution substations.
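The reported counts can be cross-checked with a few lines of arithmetic. The sketch below confirms that the per-category and per-illumination totals agree and derives the subset sizes implied by the 6:2:2 ratio; the dictionary keys and the helper `split_sizes` are hypothetical names, and the rounding rule (remainder assigned to the training set) is an assumption.

```python
class_counts = {"fire": 8000, "water": 2000, "animal": 2000, "background": 5000}
lighting_counts = {"normal": 7415, "low_light": 6844, "overexposed": 2741}

total = sum(class_counts.values())
# Both breakdowns should describe the same set of images.
assert total == sum(lighting_counts.values())


def split_sizes(n, ratios=(6, 2, 2)):
    """Sizes of train/validation/test subsets for the given integer ratios."""
    unit = n // sum(ratios)
    sizes = [unit * r for r in ratios]
    sizes[0] += n - sum(sizes)  # assumption: any remainder goes to training
    return tuple(sizes)


train, val, test = split_sizes(total)
```

Both breakdowns sum to 17,000 images, which corresponds to 10,200 training, 3,400 validation, and 3,400 testing images under the stated ratio.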
Figure 2 presents representative examples of fire hazards, water accumulation, and small animal intrusion in substation environments. Compared with conventional object detection tasks, foreign object detection in substations exhibits several distinct characteristics. First, target scales vary significantly, with small animals occupying only a few pixels, while smoke or water regions can span large spatial areas. Second, texture characteristics differ substantially, as smoke, water surfaces, and animals exhibit fundamentally distinct visual patterns. Third, substation images are subject to diverse illumination conditions that change with time and season, as illustrated in Figure 2b,c. Finally, the background environment in substations is inherently complex, often containing various structural components and visual interferences.
These characteristics highlight the limitations of general-purpose detection methods, motivating the development of a task-specific model architecture designed for foreign object detection in substations.

3.2. Evaluation Metrics

The model performance is evaluated using accuracy and the F1 score [22]. The F1 score provides a balanced measure by combining precision and recall, making it particularly suitable for scenarios with imbalanced classes. The F1 score is defined as:
$$\mathrm{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{23}$$
where
$$\text{Precision} = \frac{TP}{TP + FP}, \tag{24}$$
and
$$\text{Recall} = \frac{TP}{TP + FN}. \tag{25}$$
Here, $TP$, $FP$, and $FN$ denote the number of true positives, false positives, and false negatives, respectively.
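These definitions translate directly into code. The sketch below returns zero when a denominator vanishes, which is a convention chosen here for illustration; the counts in the usage line are made-up values, not results from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```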
In addition, mAP@0.5 is used to evaluate overall object detection performance, accounting for both classification and localization quality. A detection is considered correct when the Intersection over Union (IoU) between the predicted bounding box and the ground truth exceeds 0.5. The final mAP@0.5 is computed by averaging the Average Precision (AP) across all classes:
$$\text{mAP@0.5} = \frac{1}{N} \sum_{i=1}^{N} AP_i, \tag{26}$$
where $N$ denotes the number of classes and $AP_i$ is the Average Precision of class $i$, defined as the area under the precision–recall curve at an IoU threshold of 0.5.
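The matching criterion and the final averaging can be sketched as below. The sketch assumes that (x, y) is the top-left corner of a box in the (x, y, w, h) format, which the text does not specify, and the per-class AP values in the usage lines are placeholders rather than reported results.

```python
def iou_xywh(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_box, threshold=0.5):
    """A detection counts as correct when IoU exceeds the threshold."""
    return iou_xywh(pred_box, gt_box) > threshold


def mean_average_precision(ap_per_class):
    """Average the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Placeholder AP values for illustration only.
map50 = mean_average_precision({"fire": 0.9, "water": 0.8, "animal": 0.7})
shifted = iou_xywh((0, 0, 2, 2), (1, 0, 2, 2))
```

Two 2 × 2 boxes offset by one unit overlap with an IoU of 1/3, so the shifted prediction in this example would not count as a correct detection at the 0.5 threshold.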

3.3. Comparison of Different Model Structures

This section investigates the effectiveness of the proposed architecture. The method integrates several mechanisms, including AINM, MS-LFE, and AAFM, as summarized in Table 1. To validate the necessity of these modules, ablation experiments were conducted by selectively removing specific components, as summarized in Table 2. When a module was omitted, its functionality was replaced with a simplified alternative. Specifically, AINM was replaced by standard channel-wise normalization, MS-LFE was reduced to a single-scale convolution with a fixed kernel size of 7, and AAFM was replaced by direct feature propagation without attention-based refinement. The model is trained using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 16. The maximum number of training epochs is set to 150, with an early stopping strategy based on validation loss to prevent overfitting. The model is implemented in PyTorch 2.7.0 and executed on a server equipped with an NVIDIA RTX 3090 GPU (32 GB RAM), provided by NVIDIA Corporation, Santa Clara, CA, USA. In addition, a 5-fold cross-validation strategy is employed, and all experiments are repeated three times to reduce randomness and ensure stable evaluation results.
As shown in Table 2, the removal of AINM leads to a performance decline, indicating that the network becomes more sensitive to illumination variations. Without adaptive intensity correction, features are misaligned across different lighting conditions, which diminishes the representation of low-contrast targets. Similarly, excluding MS-LFE reduces performance due to the loss of multi-scale representation capabilities. When multi-scale convolutions are replaced with a single receptive field, the network is unable to simultaneously capture fine-grained details and broader contextual information. This limitation particularly affects the detection of small targets such as sparks or animals, while also reducing robustness for larger regions like water accumulation or smoke. Additionally, omitting AAFM weakens the model’s ability to selectively emphasize informative features. Without attention mechanisms to capture fine-grained textures, feature refinement is reduced to uniform propagation, which increases background interference and diminishes the network’s ability to discriminate subtle patterns such as flame edges, smoke diffusion, or reflective water surfaces.
As shown in Figure 3a, the proposed model correctly detects different types of foreign objects. When the AAFM module is removed, representative failure cases appear in Figure 3b,c. In Figure 3b, substation facilities are misclassified as fire and smoke, constituting a false positive for the fire/smoke category. In Figure 3c, water accumulation is misclassified as fire and smoke, resulting in both a false positive for the fire/smoke class and a false negative for the water class.
These errors are associated with regions exhibiting complex textures or reflective patterns that resemble smoke under varying illumination conditions. Without effective refinement, the model tends to rely on coarse visual cues, leading to confusion. In contrast, the proposed model with AAFM correctly distinguishes these cases. This improvement is attributed to the local residual formulation in (14), which enhances fine-grained local structures such as edges and subtle texture variations while preserving the original feature distribution. Consequently, the model captures local differences beyond global similarity, improving class separability and reducing both false positives and false negatives.
Overall, these modules enable the model to address the unique challenges of substation foreign object detection. By integrating adaptive illumination handling, multi-scale feature extraction, and attention-based feature refinement, the network maintains robust performance in detecting small, reflective, and low-contrast targets under varying lighting conditions. The combination of all modules yields the highest performance, demonstrating that the proposed architecture is well suited for detecting sparks, smoke, water, and small animals in real-world substation scenarios.
To further evaluate the robustness of the proposed design, sensitivity analyses are conducted on two key hyperparameters, namely the illumination sensitivity factor λ in the AINM module and the residual scaling factor α in the AAFM module, as summarized in Table 3.
For $\lambda$, smaller values (e.g., $\lambda = 0.3$) reduce the model’s responsiveness to illumination differences, leading to insufficient enhancement in dim regions. In contrast, larger values (e.g., $\lambda = 0.7$) tend to over-enhance bright or highly varying regions, which may distort feature representation. The best performance is achieved at $\lambda = 0.5$, indicating an effective balance between illumination normalization and feature preservation.
For the residual scaling parameter $\alpha$, a smaller value (e.g., $\alpha = 0.1$) limits the contribution of local residual information, weakening the representation of fine-grained structures. Conversely, a larger value (e.g., $\alpha = 0.3$) may introduce excessive local variations, disrupting feature consistency. Setting $\alpha = 0.2$ provides the most effective balance, enhancing local details while maintaining stable feature representation.
Overall, the results demonstrate that the proposed method remains stable within a reasonable parameter range, and the selected hyperparameters yield a well-balanced trade-off between local detail enhancement and global feature consistency.

3.4. Comparison of Different Loss Functions

This section assesses the effectiveness of the proposed loss function by comparing it with representative YOLO-based formulations, including the standard multi-task YOLO loss and its IoU-based variant (CIoU loss) for bounding box regression. In addition, Focal loss is incorporated as a comparative baseline. For Focal loss, the focusing parameter γ and the balancing factor α are set to 2.0 and 0.25, respectively, following common practice in object detection.
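For reference, the Focal loss baseline with the stated settings can be sketched for a single binary prediction. The sketch follows the widely used binary formulation FL = -α_t (1 - p_t)^γ log(p_t); how it was attached to the detection loss in the experiments is not described in the text, so the function below is illustrative only.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, y: label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # eps avoids log(0) for fully confident wrong predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy = focal_loss(0.9, 1)   # confident, correct positive
hard = focal_loss(0.1, 1)   # low-confidence positive
```

The modulating factor (1 - p_t)^γ shrinks the loss of the confident prediction by a factor of 100 at γ = 2, which is why training emphasis shifts toward low-confidence samples.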
As shown in Table 4, compared with the standard YOLO multi-task loss, incorporating CIoU improves detection performance by enforcing geometric constraints on overlap, center distance, and aspect ratio, thereby enhancing bounding box alignment. However, this improvement addresses only spatial consistency and does not explicitly consider challenges arising from uneven illumination within substation images, where individual images may contain both overexposed and shadowed regions. Such variations can reduce the visibility of certain targets, particularly small or low-contrast objects. Moreover, Focal Loss, which emphasizes low-confidence predictions, provides only marginal gains over the standard YOLO loss and underperforms the YOLO loss with CIoU in this task. By increasing the contribution of low-confidence samples, it shifts optimization toward a subset of difficult regions. However, excessive emphasis on these samples may bias learning and weaken the model’s ability to capture representative patterns across the dataset. As a result, confidence-based reweighting offers limited benefit in this scenario.
In contrast, the proposed loss incorporates an illumination-adaptive weight $\beta_i$ in the objectness component, reallocating gradient emphasis based on intensity variation within each predicted bounding box. This mechanism emphasizes regions with locally uneven illumination, including areas affected by overexposure or deep shadows. By explicitly accounting for intra-image illumination variations, the proposed loss aligns closely with the specific characteristics of substation monitoring tasks.
Overall, while conventional YOLO-based losses primarily enhance geometric consistency, the illumination-adaptive loss introduces a feature-aware optimization strategy that improves robustness to illumination-induced degradation. This results in more reliable detection of foreign objects under complex substation lighting conditions without increasing model complexity.

3.5. Comparative Analysis with Existing Methods

The proposed method is compared with representative object detection frameworks, including single-stage detectors such as YOLOv5 and SSD, and the two-stage detector Faster R-CNN. These models serve as baselines for evaluation and are widely adopted in general object detection tasks.
It should also be noted that these baseline models are primarily developed for natural image datasets, where objects generally have clear boundaries and stable illumination. In contrast, substation monitoring involves targets such as sparks, smoke, water accumulation, and small animals, which exhibit substantial variation in scale, texture, and illumination, further motivating the design of a task-specific detection architecture.
As shown in Table 5, YOLOv5 achieves efficient inference with competitive overall performance. SSD demonstrates relatively lower performance, which can be attributed to its comparatively limited capacity for capturing multi-scale features. Faster R-CNN, on the other hand, leverages a two-stage detection mechanism to obtain more refined feature representations, resulting in improved performance in complex scenes. However, this advantage comes with increased computational overhead, limiting its suitability for real-time monitoring applications.
Compared with YOLOv5, more recent variants such as YOLOv8 and YOLOv9 introduce architectural and training-level improvements. Specifically, YOLOv5 follows an anchor-based detection paradigm with coupled prediction heads, while YOLOv8 adopts an anchor-free design with decoupled heads and improved label assignment strategies. YOLOv9 further enhances representation learning by incorporating programmable gradient information during optimization. However, foreign object detection in substation environments exhibits unique characteristics, including large variations in object scale and texture patterns across different object types, as well as varying illumination conditions. Therefore, these improvements in YOLOv8 and YOLOv9 may be less effective for this task, since they are primarily designed for general object detection scenarios and do not explicitly model the specific characteristics of substation monitoring.
In contrast, the proposed method is specifically tailored to the intrinsic characteristics of substation monitoring. By explicitly modeling illumination variation, enhancing multi-scale local feature representation, and incorporating adaptive attention mechanisms, the proposed framework more effectively captures the distinctive features of foreign objects under challenging conditions, leading to improved detection robustness and a more favorable balance between accuracy and efficiency.
Overall, rather than being optimized solely for general-purpose detection benchmarks, the proposed framework targets the conditions of substation monitoring. As Table 5 shows, it attains the highest accuracy (95.21%), F1-score (94.58%), and mAP@0.5 (95.73%) among the compared methods at a test time of 42 ms, which is comparable to YOLOv8 (43 ms) and YOLOv9 (40 ms) and lower than Faster R-CNN (47 ms), making it well-suited for deployment in complex real-world substation environments.

3.6. Exploratory Validation on MVTec 3D-AD

To further investigate the cross-domain behavior of the proposed framework, exploratory experiments were conducted on the cable gland category of the MVTec 3D-AD dataset, since this category is relatively more relevant to electrical industrial scenarios. Unlike the proposed task, MVTec 3D-AD is primarily designed for unsupervised anomaly detection rather than supervised object detection. Its training set contains only normal samples, while the test set includes five categories, namely good, bent, cut, hole, and thread. Therefore, the test samples were reformulated into a five-class classification task to better align with the original foreign-object detection problem considered in this work.
Moreover, MVTec 3D-AD does not provide standard object-detection annotations such as bounding boxes. Therefore, detection-oriented metrics such as mAP@0.5 could not be evaluated consistently with the original task setting of this work. In addition, both the original YOLO loss and the proposed illumination-aware loss rely on bounding-box regression and objectness prediction, making them unsuitable for this defect classification dataset. Consequently, standard cross-entropy loss was adopted in the exploratory experiments. Experimental results are summarized in Table 6.
The results indicate that AINM provides only minor improvements on MVTec 3D-AD, because the dataset is collected against black backgrounds under relatively stable illumination conditions. Consequently, the adaptive gamma exponent γ_c in AINM tends to become nearly constant across samples, reducing the effectiveness of the illumination-adaptive transformation. Nevertheless, the channel-wise normalization operation can still provide slight regularization benefits by making channel activation scales more consistent and stabilizing feature magnitudes across samples. At the same time, the normalization operation may slightly perturb fine-grained texture representations, leading to increased performance variance across different runs.
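Why a near-constant γ_c limits the benefit of AINM can be seen in a small numerical sketch. The rule used here, choosing γ so that the channel mean maps to mid-gray, is a common adaptive-gamma heuristic adopted as an assumption; it is not claimed to be the formula used in AINM, and the sample intensities are hypothetical.

```python
import math

def adaptive_gamma(channel, target=0.5, eps=1e-6):
    """ASSUMED heuristic: pick gamma so that mean ** gamma equals `target`."""
    mean = sum(channel) / len(channel)
    mean = min(max(mean, eps), 1.0 - eps)   # keep log() finite and nonzero
    return math.log(target) / math.log(mean)

def gamma_correct(channel, gamma):
    """Apply v -> v ** gamma to intensities in [0, 1]."""
    return [v ** gamma for v in channel]

def normalize(channel, eps=1e-6):
    """Zero-mean, unit-variance normalization of one channel."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [(v - mean) / math.sqrt(var + eps) for v in channel]

# Hypothetical single-channel samples with intensities in [0, 1].
samples = {
    "dark": [0.05, 0.10, 0.15, 0.20],
    "bright": [0.80, 0.85, 0.90, 0.95],
    "stable_a": [0.40, 0.45, 0.50, 0.55],
    "stable_b": [0.41, 0.46, 0.50, 0.54],
}
for name, channel in samples.items():
    gamma = adaptive_gamma(channel)
    out = normalize(gamma_correct(channel, gamma))
    print(f"{name:<9} gamma={gamma:.3f}  normalized mean={sum(out) / len(out):+.3f}")
```

In this sketch the dark and bright samples receive very different exponents, while the two stably lit samples receive almost the same one, so the gamma step acts nearly as a fixed transform and only the normalization step still has an effect.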
By comparison, the proposed MS-LFE and AAFM modules provide more noticeable improvements under this cross-domain setting, demonstrating certain generalization capabilities in industrial visual tasks. However, the performance gain introduced by MS-LFE is relatively smaller than that observed in the proposed power-system foreign-object detection task. This is mainly because the scale variation among defect categories in MVTec 3D-AD is comparatively limited. Although categories such as bent and hole exhibit certain differences, their scale variation is still much smaller than that in practical substation monitoring scenarios, where targets such as water accumulation or smoke may occupy a large portion of the image, while small-animal intrusion often appears at small scales. Consequently, the advantages of multi-scale feature enhancement become less pronounced on MVTec 3D-AD.
In contrast, the proposed AAFM module still yields relatively noticeable improvements. This is mainly because the module enhances fine-grained local feature representation through residual detail enhancement. In MVTec 3D-AD, defect categories such as hole, thread, and bent are often characterized by local features at relatively small scales. Therefore, the proposed AAFM remains effective under industrial defect classification settings.
Considering the substantial differences in task formulation, annotation structure, imaging conditions, and evaluation protocols between MVTec 3D-AD and the proposed foreign-object detection task, these experiments are provided only as exploratory validation. Nevertheless, they still partially verify the effectiveness of the proposed feature extraction strategies, while further highlighting the task-specific adaptability of the proposed framework for substation monitoring tasks.

4. Conclusions

This paper presents a detection framework specifically designed for monitoring foreign objects in substations, targeting hazards such as fire, water accumulation, and small animal intrusion. The framework explicitly accounts for complex and varying illumination conditions to ensure robust detection across diverse scenarios. It integrates adaptive illumination normalization, multi-scale feature extraction, and attention-based feature refinement to enhance feature representation for objects with varying scales and textures. In addition, a task-oriented loss function is introduced to maintain robust performance under locally uneven lighting conditions. Experimental results demonstrate that the proposed framework outperforms representative detection methods, confirming its effectiveness in addressing substation-specific challenges and highlighting its practical applicability in real-world monitoring systems.
The proposed method effectively improves performance in complex substation environments. However, monitoring tasks across different departments and substations in power systems may involve additional characteristics and requirements. Because the method is evaluated on one particular distribution substation dataset, the generality of the reported results may be limited.
Future work will focus on expanding the dataset to include more diverse power system scenarios, enabling cross-dataset validation and providing a more comprehensive evaluation of model generalization. In addition, deployment on edge devices (e.g., embedded GPUs) under real-time constraints will be further investigated for practical substation monitoring applications.

Author Contributions

Conceptualization, Q.Z., R.L. and J.F.; methodology, Q.Z. and Y.Y.; software, Q.Z.; validation, Q.Z., Y.Y. and Z.C.; formal analysis, Q.Z.; investigation, Q.Z. and Y.Y.; resources, R.L.; data curation, Q.Z. and Y.R.; writing—original draft preparation, Q.Z.; writing—review and editing, Y.Y., Z.C. and R.L.; visualization, Y.R. and X.L.; supervision, J.F.; project administration, X.L. and J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the State Grid Beijing Electric Power Company Science and Technology Program with Grant Number B7022325A003.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

Authors Qiao Zhao, Yuhai Yao, Zihan Cong, and Ruoxi Liu are employed by State Grid Beijing Electric Power Research Institute; author Jiashu Fang is partly employed by Beijing Rongtong Wisdom Technology Group Co., Ltd.; author Yiyong Ren is employed by Beijing Rongtong Wisdom Technology Group Co., Ltd.; author Xin Lv is employed by Dongfang Electronics Co., Ltd. The authors declare that this study received funding from the State Grid Beijing Electric Power Company Science and Technology Program. The funder had involvement in project support and technical consultation related to the study.

References

  1. Javid, Z.; Kocar, I.; Holderbaum, W.; Karaagac, U. Future Distribution Networks: A Review. Energies 2024, 17, 1822. [Google Scholar] [CrossRef]
  2. Li, D.X.; Wang, L.; Li, J.X.; Xie, S.Y.; Wang, L.; Xie, W.W. Design and Research of Intelligent Operation Inspection and Monitoring System of Substation Based on Image Recognition Technology. In Proceedings of the 2022 2nd Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS), Shenyang, China, 25–27 February 2022; pp. 448–453. [Google Scholar] [CrossRef]
  3. Wang, X.; Shi, D.; Wang, F. Real-Time Detection and Tracking of Foreign Object Intrusions in Power Systems via Feature-Based Edge Intelligence. IEEE Open Access J. Power Energy 2025, 12, 625–636. [Google Scholar] [CrossRef]
  4. Faisal, M.A.A.; Mecheter, I.; Qiblawey, Y.; Fernandez, J.H.; Chowdhury, M.E.; Kiranyaz, S. Deep Learning in Automated Power Line Inspection: A Review. Appl. Energy 2025, 385, 125507. [Google Scholar] [CrossRef]
  5. Tong, B.; Li, Y.; Chen, X. A Neural Network-Based Intelligent System for Substation Surveillance Video Analysis with Edge and IoT Integration. EURASIP J. Wirel. Commun. Netw. 2025, 2025, 95. [Google Scholar] [CrossRef]
  6. Wu, Y.; Xiao, F.; Liu, F.; Sun, Y.; Deng, X.; Lin, L.; Zhu, C. A Visual Fault Detection Algorithm of Substation Equipment Based on Improved Yolov5. Appl. Sci. 2023, 13, 11785. [Google Scholar] [CrossRef]
  7. Çelik, T.; Demirel, H. Fire Detection in Video Sequences Using a Generic Color Model. Fire Saf. J. 2009, 44, 147–158. [Google Scholar] [CrossRef]
  8. Gagliardi, A.; Saponara, S. AdViSED: Advanced Video SmokE Detection for Real-Time Measurements in Antifire Indoor and Outdoor Systems. Energies 2020, 13, 2098. [Google Scholar] [CrossRef]
  9. Wang, S.; He, Y.; Yang, H.; Wang, K.; Wang, J. Video Smoke Detection Using Shape, Color and Dynamic Features. J. Intell. Fuzzy Syst. 2017, 33, 305–313. [Google Scholar] [CrossRef]
  10. Wang, Y.; Ren, J. Low-Light Forest Flame Image Segmentation Based on Color Features. J. Phys. Conf. Ser. 2018, 1069, 12165. [Google Scholar] [CrossRef]
  11. Li, F.; Yang, X.; Gao, H.; Yue, Z.; Yu, J.; He, T.; Guo, T.; Zhong, X. REP-YOLOX: An Efficient Model for Defect Detection in Gas Insulated Switchgear Equipment. In Proceedings of the 2023 8th International Conference on Automation, Control and Robotics Engineering (CACRE), Hong Kong, China, 13–15 July 2023; pp. 13–18. [Google Scholar] [CrossRef]
  12. Li, J.; Sun, S.; Wang, S.; Guo, R.; Du, Y.; Wang, Y.; Han, Y. Defect Recognition of Small Samples in Substations Based on the Domain Adaptive YOLOv7. In Proceedings of the 2023 IEEE 3rd International Conference on Electronic Technology, Communication and Information (ICETCI), Changchun, China, 26–28 May 2023; pp. 1754–1758. [Google Scholar] [CrossRef]
  13. Sun, H.; Shen, Q.; Ke, H.; Duan, Z.; Tang, X. Power Transmission Lines Foreign Object Intrusion Detection Method for Drone Aerial Images Based on Improved Yolov8 Network. Drones 2024, 8, 346. [Google Scholar] [CrossRef]
  14. Yu, M.; Liu, S.; Wang, B.; Liu, C.; Yang, Y.; Cai, S.; Li, Y.; Li, S. Ship Detection Methods Based on Improved YOLOv8 in Surveillance Video. In Proceedings of the 2026 6th International Conference on Consumer Electronics and Computer Engineering (ICCECE), Wuhan, China, 23–25 January 2026; pp. 16–19. [Google Scholar] [CrossRef]
  15. Geng, X.; Han, X.; Cao, X.; Su, Y.; Shu, D. YOLOV9-CBM: An Improved Fire Detection Algorithm Based on YOLOV9. IEEE Access 2025, 13, 19612–19623. [Google Scholar] [CrossRef]
  16. Song, Y.; Li, D.; He, P.; Li, J. YOLOv10-CBAM: Lightweight Metal Surface Defect Detection Based on YOLOv10. In Proceedings of the 2025 International Conference on Equipment Intelligent Operation and Maintenance (ICEIOM), Urumqi, China, 1–3 August 2025; pp. 257–265. [Google Scholar] [CrossRef]
  17. Qi, C.; Chen, Z.; Chen, X.; Bao, Y.; He, T.; Hu, S.; Li, J.; Liang, Y.; Tian, F.; Li, M. Efficient Real-Time Detection of Electrical Equipment Images Using a Lightweight Detector Model. Front. Energy Res. 2023, 11, 1291382. [Google Scholar] [CrossRef]
  18. Lu, M.; Xie, Y. Intelligent Detection System for Electrical Equipment Based on Deep Learning and Infrared Image Processing Technology. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 1147–1155. [Google Scholar] [CrossRef]
  19. Yang, Q.; Ma, S.; Guo, D.; Wang, P.; Lin, M.; Hu, Y. A Small Object Detection Method for Oil Leakage Defects in Substations Based on Improved Faster-RCNN. Sensors 2023, 23, 7390. [Google Scholar] [CrossRef] [PubMed]
  20. Xu, L.; Song, Y.; Zhang, W.; An, Y.; Wang, Y.; Ning, H. An Efficient Foreign Objects Detection Network for Power Substation. Image Vis. Comput. 2021, 109, 104159. [Google Scholar] [CrossRef]
  21. Rong, S.; He, L.; Du, L.; Li, Z.; Yu, S. Intelligent Detection of Vegetation Encroachment of Power Lines with Advanced Stereovision. IEEE Trans. Power Deliv. 2021, 36, 3477–3485. [Google Scholar] [CrossRef]
  22. Cortes, C.; Mohri, M. AUC Optimization vs. Error Rate Minimization. In Proceedings of the 17th International Conference on Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003. [Google Scholar]
Figure 1. Overall architecture of the proposed network.
Figure 2. Representative field-collected images of foreign objects in substation environments: (a) fire hazards, (b) water accumulation, and (c) small animal intrusions.
Figure 3. Detection results of the proposed method in real-world substation scenarios. In (a), the foreign objects are correctly detected under different illumination conditions, where “water” denotes water accumulation, and “hzyw” denotes fire and smoke. The numerical value associated with each bounding box represents the confidence score of the detection. (b,c) illustrate typical misclassification cases when the AAFM module is removed.
Table 1. Architectural Details of the Proposed Method.
| Module | Layer | Parameters | Input | Output |
| AINM | Gamma Correction | adaptive γ | F_0 | F_1 |
| AINM | Normalization Layer | channel-wise | F_1 | F_2 |
| MS-LFE | Convolution Layer | 3 × 3, 32 filters, ReLU | F_2 | F_3 |
| MS-LFE | Convolution Layer | 7 × 7, 32 filters, ReLU | F_2 | F_4 |
| MS-LFE | Convolution Layer | 15 × 15, 32 filters, ReLU | F_2 | F_5 |
| MS-LFE | Concatenation Layer | channel-wise concat | F_3, F_4, F_5 | F_6 |
| MS-LFE | Max Pooling Layer | 2 × 2, stride = 2 | F_6 | F_7 |
| AAFM | Channel Weighting | GAP, sigmoid | F_7 | F_8 |
| AAFM | Local Residual | 3 × 3 conv, α = 0.2 | F_8 | F_9 |
| AAFM | Channel Attention | GAP, sigmoid, weight | F_9 | F_10 |
| Detection Neck | Convolution Layer | 3 × 3, 128 filters, ReLU | F_10 | F_11 |
| Detection Neck | Convolution Layer | 1 × 1, 64 filters, ReLU | F_11 | F_12 |
| Detection Head | Convolution Layer | 1 × 1, A(5 + C) filters | F_12 | F_13 |
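The layer list in Table 1 can be turned into a quick shape trace to check how channel counts and resolution evolve. The sketch below assumes padding that preserves spatial size in every convolution, that AAFM leaves the tensor shape unchanged, a hypothetical 640 × 640 RGB input, and hypothetical values A = 3 anchors and C = 3 classes; none of these settings is stated in the table.

```python
def trace_shapes(height, width, in_channels=3, num_anchors=3, num_classes=3):
    """Return (name, channels, height, width) after the main stages of Table 1."""
    shapes = []
    c, h, w = in_channels, height, width
    shapes.append(("F_2  (after AINM)", c, h, w))     # gamma correction and normalization keep the shape
    c = 3 * 32                                         # 3x3, 7x7, 15x15 branches, 32 filters each, concatenated
    shapes.append(("F_6  (MS-LFE concat)", c, h, w))
    h, w = h // 2, w // 2                              # 2x2 max pooling with stride 2
    shapes.append(("F_7  (max pooling)", c, h, w))
    shapes.append(("F_10 (after AAFM)", c, h, w))     # ASSUMPTION: reweighting and residual keep the shape
    c = 128
    shapes.append(("F_11 (neck, 3x3 conv)", c, h, w))
    c = 64
    shapes.append(("F_12 (neck, 1x1 conv)", c, h, w))
    c = num_anchors * (5 + num_classes)                # A(5 + C): box (4) + objectness (1) + class scores (C)
    shapes.append(("F_13 (head)", c, h, w))
    return shapes

for name, c, h, w in trace_shapes(640, 640):
    print(f"{name:<22} {c:>3} x {h} x {w}")
```

With these assumed settings the three parallel branches produce 96 channels after concatenation, and the head emits 24 values per grid cell.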
Table 2. Performance under different module combinations.
| AINM | MS-LFE | AAFM | Accuracy (%) | F1-Score (%) | mAP@0.5 (%) |
|  |  |  | 80.18 ± 0.78 | 79.45 ± 0.81 | 83.25 ± 0.76 |
|  |  |  | 84.50 ± 0.72 | 84.15 ± 0.75 | 86.36 ± 0.70 |
|  |  |  | 86.33 ± 0.68 | 85.89 ± 0.70 | 88.02 ± 0.66 |
|  |  |  | 88.91 ± 0.59 | 88.35 ± 0.62 | 91.01 ± 0.57 |
|  |  |  | 93.77 ± 0.48 | 92.99 ± 0.51 | 94.14 ± 0.46 |
|  |  |  | 88.51 ± 0.61 | 88.23 ± 0.64 | 91.12 ± 0.59 |
|  |  |  | 95.21 ± 0.46 | 94.58 ± 0.43 | 95.73 ± 0.41 |
Table 3. Sensitivity analysis of key hyperparameters.
| Parameter | Value | Accuracy (%) | F1-Score (%) | mAP@0.5 (%) |
| λ | 0.3 | 93.84 ± 0.53 | 93.11 ± 0.49 | 95.05 ± 0.47 |
| λ | 0.5 | 95.21 ± 0.46 | 94.58 ± 0.43 | 95.73 ± 0.41 |
| λ | 0.7 | 93.12 ± 0.48 | 92.67 ± 0.46 | 94.10 ± 0.45 |
| α | 0.1 | 92.96 ± 0.55 | 92.41 ± 0.50 | 94.28 ± 0.58 |
| α | 0.2 | 95.21 ± 0.46 | 94.58 ± 0.43 | 95.73 ± 0.41 |
| α | 0.3 | 93.35 ± 0.58 | 92.88 ± 0.48 | 95.01 ± 0.60 |
Table 4. Performance comparison of different YOLO-based loss functions.
| Loss Function | Accuracy (%) | F1-Score (%) | mAP@0.5 (%) |
| YOLO multi-task loss | 88.99 ± 0.41 | 88.74 ± 0.45 | 91.12 ± 0.39 |
| YOLO + CIoU loss | 90.31 ± 0.38 | 89.78 ± 0.42 | 92.03 ± 0.36 |
| Focal loss | 89.94 ± 0.47 | 89.74 ± 0.44 | 91.28 ± 0.43 |
| Proposed loss | 95.21 ± 0.46 | 94.58 ± 0.43 | 95.73 ± 0.41 |
Table 5. Performance comparison with representative object detection methods.
| Method | Accuracy (%) | F1-Score (%) | mAP@0.5 (%) | Test Time (ms) |
| YOLOv5 | 89.92 ± 0.52 | 89.51 ± 0.57 | 92.56 ± 0.50 | 28 |
| YOLOv8 | 91.15 ± 0.68 | 91.03 ± 0.64 | 94.02 ± 0.61 | 43 |
| YOLOv9 | 92.14 ± 0.50 | 91.95 ± 0.56 | 94.88 ± 0.48 | 40 |
| SSD | 87.84 ± 0.73 | 86.59 ± 0.79 | 90.12 ± 0.70 | 31 |
| Faster R-CNN | 90.48 ± 0.81 | 90.01 ± 0.76 | 92.55 ± 0.78 | 47 |
| Proposed | 95.21 ± 0.46 | 94.58 ± 0.43 | 95.73 ± 0.41 | 42 |
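To relate the test times in Table 5 to real-time requirements, they can be converted into approximate throughput. The sketch assumes that each test time is the latency of a single image processed on its own, which the table does not state explicitly, and the 20 FPS budget used at the end is a hypothetical target chosen only for illustration.

```python
# Values copied from Table 5: (mean mAP@0.5 in %, test time in ms).
results = {
    "YOLOv5": (92.56, 28),
    "YOLOv8": (94.02, 43),
    "YOLOv9": (94.88, 40),
    "SSD": (90.12, 31),
    "Faster R-CNN": (92.55, 47),
    "Proposed": (95.73, 42),
}

def frames_per_second(test_time_ms):
    # ASSUMPTION: test time is the latency of one image, with no batching or pipelining.
    return 1000.0 / test_time_ms

def meets_budget(test_time_ms, target_fps):
    """True if the implied throughput reaches the target frame rate."""
    return frames_per_second(test_time_ms) >= target_fps

# List methods from fastest to slowest with their implied frame rate.
for name, (map50, ms) in sorted(results.items(), key=lambda item: item[1][1]):
    print(f"{name:<13} mAP@0.5={map50:5.2f}%  {ms:2d} ms  ~{frames_per_second(ms):4.1f} FPS")

# Hypothetical requirement: at least 20 frames per second per camera stream.
print([name for name, (_, ms) in results.items() if meets_budget(ms, 20)])
```

Read this way, the gap between the fastest and slowest methods is a matter of roughly 21 to 36 frames per second, which puts the accuracy differences in the table into a practical context.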
Table 6. Cross-domain exploratory validation on the MVTec 3D-AD dataset.
| AINM | MS-LFE | AAFM | Accuracy (%) | F1-Score (%) |
|  |  |  | 79.62 ± 0.72 | 78.94 ± 0.70 |
|  |  |  | 80.01 ± 0.84 | 79.11 ± 0.81 |
|  |  |  | 81.80 ± 0.63 | 80.94 ± 0.59 |
|  |  |  | 83.55 ± 0.58 | 82.67 ± 0.54 |
|  |  |  | 84.85 ± 0.51 | 84.05 ± 0.49 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
