1. Introduction
Ensuring the safe operation of power distribution substations is essential for maintaining the reliability of modern power systems [1, 2]. In practice, power distribution substations are frequently exposed to various external hazards, including fire incidents, water accumulation, and small animal intrusion, which may cause equipment damage and operational risks [3]. With the rapid development of intelligent and unattended substations, there is an increasing demand for automated monitoring systems capable of accurately identifying such hazards, thereby enabling timely intervention by inspection personnel [4, 5, 6]. This motivates the development of robust detection algorithms for effective foreign object monitoring under real-world operating conditions.
In the existing literature, a variety of vision-based detection methods have been developed. Traditional approaches primarily rely on handcrafted features, such as edge information and texture patterns, to identify abnormal regions in monitoring images. For instance, ref. [7] exploits color space transformation to separate luminance and chrominance components, enabling real-time identification of flame regions. Ref. [8] models background dynamics using Kalman filtering and extracts flame-related features, which are further analyzed through decision mechanisms for rapid fire detection. Ref. [9] combines static and dynamic characteristics, applying fuzzy logic-based image enhancement and Gaussian mixture modeling to distinguish smoke patterns. Meanwhile, ref. [10] captures the dynamic variation of flame color, establishing a relationship between color evolution and flame behavior. However, these approaches rely heavily on manually designed features and heuristic rules, which may limit their robustness in practical substation environments, where diverse foreign targets exhibit significant variability in feature characteristics and illumination conditions vary substantially.
In recent years, deep learning-based methods have been increasingly applied to foreign object detection tasks in power systems. Representative approaches include single-stage detectors such as You Only Look Once (YOLO) [3, 11, 12, 13, 14, 15, 16] and Single-Shot MultiBox Detector (SSD) [17, 18], as well as two-stage detectors, such as Faster Region-based Convolutional Neural Network (Faster R-CNN) [19, 20, 21]. Ref. [11] combines dilated and standard convolutions to enlarge the receptive field for feature extraction, while [12] incorporates attention mechanisms to focus on critical regions and enhance object feature representation. Ref. [14] introduces YOLOv8 with enhanced attention and feature pyramid mechanisms, improving multi-scale feature extraction and fusion capability. Ref. [15] proposes a YOLOv9-based variant that incorporates attention mechanisms and an improved IoU-based loss function, enhancing feature representation and localization accuracy for fire detection tasks. Ref. [16] develops a lightweight YOLOv10 variant incorporating a Convolutional Block Attention Module (CBAM), which enhances feature representation while maintaining efficient deployment on edge devices. In contrast, two-stage detection frameworks, such as Faster R-CNN, achieve high detection accuracy in complex and cluttered scenes at the cost of slower inference speed. Network designs in [19] optimize small object detection by reducing downsampling and employing smaller convolution kernels. Additionally, ref. [20] introduces a moving object region extraction network combined with a classification module to reduce redundant candidate regions.
Nevertheless, compared with general object detection tasks, substation environments present distinct challenges. First, foreign objects encountered in substations, such as fire incidents, water accumulation, and small animals, exhibit significant variations in scale and visual characteristics. Moreover, images captured on site are often affected by varying illumination conditions, including overexposure and shadowing effects. Existing deep learning-based methods typically do not explicitly consider these factors in their design, which may reduce their performance for foreign object detection in substation environments.
The proposed method is based on the general anchor-based YOLO detection paradigm, similar to YOLOv5. Compared with YOLOv5 frameworks, several task-specific modifications are introduced. First, an adaptive illumination normalization module is applied at the input stage. Second, a parallel multi-scale feature extraction structure is integrated into the backbone to replace conventional hierarchical feature fusion. Third, an attention-based feature refinement module is inserted after the multi-scale feature extraction layers within the backbone to refine intermediate feature representations. Fourth, the standard YOLO loss is extended with an illumination-adaptive weighting mechanism at the training stage. These modifications collectively improve feature representation and detection reliability in complex substation environments.
The main contributions of this paper are summarized as follows:
An adaptive illumination normalization mechanism is proposed, adapting to lighting variations in real-world substation environments.
A parallel multi-scale feature extraction structure is developed to replace conventional feature fusion strategies, enabling representation of objects with varying sizes and textures.
A feature refinement mechanism combining channel weighting and local residual enhancement is designed, improving discriminability between objects with similar textures.
An illumination-adaptive loss function is introduced to improve objectness learning under challenging illumination conditions, leading to more reliable detection performance.
2. Methodology
2.1. Overview
Compared with conventional foreign object detection tasks, detection in distribution substations presents unique challenges. Foreign objects in these environments, including fire hazards, water accumulation, and small animals, exhibit significant variations in scale and diverse texture characteristics. In addition, substation scenes often exhibit overall bright or dark conditions due to variations in time of day and weather, as well as locally uneven illumination, which further complicates accurate detection.
To address these challenges while maintaining lightweight and real-time performance, we propose a detection network composed of three main modules: Adaptive Image Normalization Module (AINM), Multi-Scale Local Feature Enhancement (MS-LFE), and Adaptive Attention Feature Module (AAFM), followed by a detection neck and head for bounding box and class prediction. Each module is designed to enhance feature extraction and robustness across diverse scenarios, as illustrated in Figure 1. The detailed architectural parameters are summarized in Table 1.
2.2. Adaptive Image Normalization Module (AINM)
Images captured in substations are often subject to extreme lighting variations, such as overexposure under strong sunlight or underexposure at night. To compensate for these variations and produce stable feature maps for subsequent layers, the AINM first applies adaptive gamma correction and then channel-wise normalization.
2.2.1. Gamma Correction Layer
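To handle varying lighting conditions in substations, an adaptive gamma correction is applied to the input image. For each channel c, the mean and standard deviation of pixel intensities are computed as:
$$\mu_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} I_c(i,j), \qquad \sigma_c = \sqrt{\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(I_c(i,j)-\mu_c\bigr)^2}$$
where $I_c(i,j)$ denotes the intensity of channel c at pixel $(i,j)$ of an $H \times W$ input image.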
Based on these statistics, the adaptive gamma exponent is defined as:
and the output is obtained by
Here, $\varepsilon$ is a small constant to ensure numerical stability.
In practical implementation, the above operations are performed in a vectorized manner. Specifically, the mean and standard deviation are computed for each channel on a per-image basis during the forward pass, resulting in channel-wise scalars. The gamma exponent is then applied uniformly across all spatial locations within the corresponding channel, and the power operation is applied element-wise to obtain the gamma-corrected image. This design ensures computational efficiency and seamless integration into standard deep learning frameworks.
The sensitivity parameter in the gamma exponent controls the response to illumination variation. It is shared across different modules to maintain a consistent scaling of illumination sensitivity throughout the network. The value is empirically determined via cross-validation and set to 0.5. Larger values may lead to over-enhancement in bright or highly varying regions, while smaller values may result in insufficient correction in dark or low-illumination areas.
This adaptive operation dynamically adjusts intensity distributions according to image-specific statistics, enabling robust feature extraction under both overexposed and underexposed conditions.
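The per-channel procedure described above can be sketched in a few lines of plain Python. This is only an illustration: the exponent rule inside `adaptive_gamma` is a hypothetical placeholder, not the exponent defined in Equation (3), and the function names, the assumption that intensities lie in [0, 1], and the value of `eps` are all chosen for the example. The sketch only reproduces the stated behaviour, namely one scalar exponent per channel, computed from that channel's statistics and applied element-wise.

```python
import math

def channel_stats(channel):
    """Mean and (population) standard deviation of one channel (rows of floats in [0, 1])."""
    pixels = [p for row in channel for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return mean, math.sqrt(var)

def adaptive_gamma(channel, alpha=0.5, eps=1e-6):
    """Return (gamma, corrected_channel) for one channel.

    ASSUMPTION: the exponent rule below is a hypothetical placeholder, not the
    paper's Equation (3). It yields gamma < 1 (brightening) for dark channels
    and gamma > 1 (darkening) for bright channels, and stays positive.
    """
    mean, std = channel_stats(channel)
    gamma = 1.0 + alpha * (mean - 0.5) / (0.5 + std + eps)
    # One scalar exponent per channel, applied element-wise to every pixel.
    corrected = [[p ** gamma for p in row] for row in channel]
    return gamma, corrected

dark = [[0.10, 0.20], [0.10, 0.20]]      # underexposed channel
bright = [[0.80, 0.90], [0.80, 0.90]]    # overexposed channel
gamma_dark, out_dark = adaptive_gamma(dark)
gamma_bright, out_bright = adaptive_gamma(bright)
```

With this placeholder rule the dark channel receives an exponent below one and the bright channel an exponent above one, so both are pulled toward mid-range intensities.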
2.2.2. Channel-Wise Normalization Layer
After gamma correction, channel-wise normalization is performed to standardize each channel’s feature distribution: each channel is shifted by its mean and scaled by its standard deviation, where both statistics are computed from the gamma-corrected image. The resulting normalized feature map ensures that each channel contributes comparably to subsequent convolutional layers, enhancing robustness against illumination-induced bias.
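A minimal sketch of channel-wise standardization is given below, assuming a C x H x W image stored as nested lists. The function names are illustrative, and adding `eps` to the standard deviation is an assumption about where the stability constant enters.

```python
import math

def normalize_channel(channel, eps=1e-6):
    """Standardize one channel: subtract its mean, divide by its standard deviation.

    ASSUMPTION: eps is added to the standard deviation to avoid division by
    zero on flat channels; the paper's exact placement of the constant is not shown.
    """
    pixels = [p for row in channel for p in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    return [[(p - mean) / (std + eps) for p in row] for row in channel]

def normalize_image(image, eps=1e-6):
    """Apply the normalization independently to every channel of a C x H x W image."""
    return [normalize_channel(channel, eps) for channel in image]

image = [
    [[0.2, 0.4], [0.6, 0.8]],        # channel with mid-range intensities
    [[0.90, 0.92], [0.94, 0.96]],    # bright, low-contrast channel
]
normalized = normalize_image(image)
```

After this step both channels have approximately zero mean and unit standard deviation, even though the second channel started out bright and nearly flat.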
2.3. Multi-Scale Local Feature Enhancement (MS-LFE)
This study considers common foreign objects in distribution substations, including fire, water accumulation, and small animal intrusion. These targets range from small sparks or animals to large water regions, requiring the network to effectively extract features across multiple spatial scales. The proposed MS-LFE module addresses this by applying convolutional layers with diverse receptive fields, followed by feature concatenation and max pooling. This design preserves fine details for small targets while capturing coarse patterns for larger targets.
2.3.1. Multi-Scale Convolution Layer
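Each branch of the multi-scale convolution layer is computed as
$$F_k = \mathrm{ReLU}\bigl(\mathrm{Conv}_{k \times k}(\hat{X})\bigr)$$
where $\hat{X}$ is the normalized input feature map, $\mathrm{Conv}_{k \times k}$ denotes a convolutional layer with kernel size $k \times k$ and 32 filters, and $\mathrm{ReLU}$ is the rectified linear activation function. Here, the three branch outputs, in order of increasing kernel size, capture fine, medium, and coarse features, respectively.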
2.3.2. Feature Concatenation and Max Pooling
Multi-scale feature maps are adaptively fused using learnable scale weights and then spatially downsampled using max pooling. Each of the three convolutional kernel sizes has its own learnable scale weight. The weighted branch outputs form the fused feature map, and max pooling over local spatial neighborhoods, applied separately to each channel c, yields the pooled feature map.
Compared with fixed concatenation, the introduction of learnable weights enables the network to adaptively adjust the contribution of different receptive fields. This mechanism allows the model to emphasize fine-grained features for small targets while preserving coarse contextual information for larger regions. Such adaptive fusion is particularly beneficial in substation environments, where object scales vary significantly. The max pooling operation further reduces computational complexity while retaining the most salient local responses, ensuring efficient feature representation for subsequent detection stages.
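The fusion and pooling steps can be illustrated with a small sketch. It assumes that fusion is a weighted concatenation along the channel axis and that pooling uses non-overlapping 2 x 2 windows; both are assumptions made for the example, as are the branch values and the scale weights, which would be learned in practice.

```python
def fuse_branches(branches, scale_weights):
    """Scale each branch by its weight and stack the results along the channel axis.

    ASSUMPTION: fusion is modelled as weighted concatenation. Each branch is a
    list of H x W channel maps, and scale_weights holds one scalar per branch.
    """
    fused = []
    for branch, weight in zip(branches, scale_weights):
        for channel in branch:
            fused.append([[weight * v for v in row] for row in channel])
    return fused

def max_pool_2x2(channel):
    """Non-overlapping 2 x 2 max pooling on one H x W map (H and W even)."""
    return [
        [max(channel[i][j], channel[i][j + 1], channel[i + 1][j], channel[i + 1][j + 1])
         for j in range(0, len(channel[0]), 2)]
        for i in range(0, len(channel), 2)
    ]

# Three hypothetical single-channel branch outputs (fine, medium, coarse).
fine = [[[1.0, 2.0, 0.0, 1.0],
         [0.0, 3.0, 1.0, 0.0],
         [2.0, 0.0, 4.0, 1.0],
         [1.0, 1.0, 0.0, 2.0]]]
medium = [[[0.5] * 4 for _ in range(4)]]
coarse = [[[1.0] * 4 for _ in range(4)]]

fused = fuse_branches([fine, medium, coarse], scale_weights=[1.0, 0.5, 2.0])
pooled = [max_pool_2x2(channel) for channel in fused]
```

Raising or lowering one scale weight rescales every response from that branch before pooling, which is how the network can shift emphasis between receptive fields.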
2.4. Adaptive Attention Feature Module (AAFM)
Targets in substations, such as fire sparks, water patches, or small animals, often appear in complex backgrounds and under varying illumination. To enhance the most informative features while suppressing irrelevant background noise, the Adaptive Attention Feature Module (AAFM) is introduced. This module sequentially applies a channel weighting layer, a local residual layer, and a channel attention layer, progressively refining feature representations for robust detection.
2.4.1. Channel Weighting Layer
To emphasize channels that carry the most discriminative information for different target types, global features are first extracted using global average pooling (GAP). The pooled statistics yield a weight for each channel c, and the input feature map from the MS-LFE module is scaled channel by channel to obtain the channel-weighted feature map. The use of GAP allows the network to capture the overall importance of each channel, highlighting critical information such as flame intensity or water reflection patterns, while suppressing channels dominated by background noise.
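As a rough illustration of GAP-based channel weighting, the sketch below pools each channel to a single value and rescales the channel by a weight derived from it. Mapping the pooled value to a weight with a sigmoid is an assumption made for the example; the function names and input values are likewise hypothetical.

```python
import math

def global_average_pool(feature_map):
    """Reduce a C x H x W feature map to one mean value per channel."""
    return [
        sum(v for row in channel for v in row) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

def channel_weighting(feature_map):
    """Scale each channel by a weight derived from its GAP descriptor.

    ASSUMPTION: the weight is a sigmoid of the pooled value; the paper's exact
    mapping from GAP output to channel weight is not reproduced here.
    """
    descriptors = global_average_pool(feature_map)
    weights = [1.0 / (1.0 + math.exp(-d)) for d in descriptors]
    weighted = [
        [[w * v for v in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]
    return weights, weighted

features = [
    [[2.0, 4.0], [6.0, 8.0]],      # strongly activated channel
    [[0.0, 0.1], [0.0, -0.1]],     # weak, background-like channel
]
weights, weighted = channel_weighting(features)
```

The strongly activated channel keeps almost all of its magnitude, while the background-like channel is halved.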
2.4.2. Local Residual Layer
To enhance the representation of fine-grained local details, particularly for small or low-contrast targets, we propose a local residual convolutional layer that captures subtle structural information and enriches feature textures. The output of a 3×3 convolution, multiplied by a factor that scales the residual contribution, is added to the channel-weighted feature map to yield the enhanced feature map. This residual connection preserves original features while highlighting local patterns, ensuring small targets such as sparks or animals remain detectable.
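The residual form described here, input plus a scaled 3×3 convolution of the input, can be sketched for a single channel as follows. The fixed Laplacian-style kernel stands in for the learned filter, and `beta = 0.1` is a hypothetical residual scale; neither is taken from the paper.

```python
def conv3x3(channel, kernel):
    """3 x 3 convolution (cross-correlation) with zero padding on one H x W map."""
    height, width = len(channel), len(channel[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        total += kernel[di + 1][dj + 1] * channel[y][x]
            out[i][j] = total
    return out

def local_residual(channel, kernel, beta):
    """Return channel + beta * conv3x3(channel): the input plus a scaled local detail term."""
    detail = conv3x3(channel, kernel)
    return [
        [v + beta * d for v, d in zip(row, detail_row)]
        for row, detail_row in zip(channel, detail)
    ]

# ASSUMPTION: a fixed Laplacian-style kernel stands in for the learned 3 x 3
# filter, and beta = 0.1 is a hypothetical residual scale.
laplacian = [[0.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 0.0]]
flat_with_spot = [[1.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 1.0]]
enhanced = local_residual(flat_with_spot, laplacian, beta=0.1)
```

The isolated bright pixel, a stand-in for a small spark, is amplified relative to its surroundings, while a zero kernel would return the input unchanged.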
2.4.3. Channel Attention Layer
Finally, channel attention is applied to adaptively modulate channel significance based on global context, using a learnable weight vector W to produce the attention-refined feature map. This operation suppresses less informative channels while enhancing channels critical for detection, improving robustness against cluttered backgrounds and varying illumination.
Overall, by integrating channel weighting, local residual enhancement, and channel attention, the AAFM module generates feature maps that are both spatially and channel-wise informative, serving as an optimized input for the detection neck and head. This progressive refinement preserves and emphasizes subtle textures, edges, and color patterns associated with various targets, thereby improving the model’s overall detection performance.
2.5. Detection Neck and Head
The enhanced features are then processed by the detection neck and head to produce the final prediction feature map for bounding box regression and classification. Its size is determined by the number of anchor boxes A and the number of classes C.
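For orientation, the sketch below shows how the prediction depth is commonly derived in YOLOv5-style anchor-based heads, where each anchor predicts four box offsets, one objectness score, and one score per class. Both this layout and the example values (3 anchors, 3 classes) are assumptions; the section does not state them.

```python
def prediction_channels(num_anchors, num_classes):
    """Channels per grid cell: each anchor predicts 4 box offsets, 1 objectness score, and class scores."""
    return num_anchors * (5 + num_classes)

# ASSUMPTION: 3 anchors per cell (a common YOLOv5 default) and 3 foreign-object
# classes (fire, water accumulation, small animal); the actual values used in
# the paper are not stated in this section.
channels = prediction_channels(num_anchors=3, num_classes=3)
```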
2.6. Loss Function
Foreign object detection in substations often occurs under uneven illumination, where individual images contain regions of both overexposure and deep shadows. The standard YOLO multi-task loss treats all objectness contributions uniformly, without accounting for spatially varying illumination within a single image. This may weaken the learning of targets under locally uneven illumination, where inconsistent lighting can degrade feature representation, leading to missed detections. To address this issue, an illumination-adaptive weight is incorporated into the objectness component of the loss. This weight captures the intensity variation within the i-th predicted bounding box, emphasizing regions where local illumination differs significantly.
The weighted objectness term compares the predicted confidence score of the i-th bounding box with the corresponding ground truth confidence. Two indicators represent the presence or absence of an object in the corresponding anchor, respectively. A weighting parameter controls the contribution of background regions, while the sensitivity parameter is the one defined in (3). The weight is computed from the set of intensity values within the i-th predicted bounding box. By increasing the gradient contribution for regions with large local intensity variations, the weight strengthens the learning of small or low-contrast targets, improving detection robustness under locally uneven illumination. The same sensitivity parameter is adopted to ensure consistency with the illumination modeling in AINM. Although this parameter is fixed, the adaptive formulation based on the local intensity range enables dynamic adjustment under varying lighting conditions, including extreme cases.
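The effect of the illumination-adaptive weight can be illustrated with a small sketch. The weight form `1 + alpha * (max - min)`, the choice to apply it only to boxes that contain an object, and the value `lambda_noobj = 0.5` are all hypothetical choices made for the example; the paper's exact expression is not reproduced here.

```python
import math

def illumination_weight(box_intensities, alpha=0.5):
    """Weight that grows with the intensity range inside a predicted box.

    ASSUMPTION: w = 1 + alpha * (max - min), with intensities in [0, 1], is a
    hypothetical form chosen for illustration.
    """
    return 1.0 + alpha * (max(box_intensities) - min(box_intensities))

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one confidence prediction, clamped for stability."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def weighted_objectness_loss(boxes, lambda_noobj=0.5, alpha=0.5):
    """Sum objectness losses over boxes.

    ASSUMPTION: object boxes receive the illumination weight and background
    boxes receive the lambda_noobj factor (hypothetical value 0.5).
    """
    total = 0.0
    for box in boxes:
        loss = bce(box["conf"], 1.0 if box["has_object"] else 0.0)
        if box["has_object"]:
            total += illumination_weight(box["intensities"], alpha) * loss
        else:
            total += lambda_noobj * loss
    return total

# Two boxes with the same predicted confidence but different local lighting.
even = {"conf": 0.6, "has_object": True, "intensities": [0.45, 0.50, 0.55]}
uneven = {"conf": 0.6, "has_object": True, "intensities": [0.05, 0.50, 0.95]}
loss_even = weighted_objectness_loss([even])
loss_uneven = weighted_objectness_loss([uneven])
```

Both boxes have the same confidence error, but the box spanning deep shadow and glare contributes a larger loss and therefore a larger gradient.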
Based on this formulation, the complete adaptive loss function is defined in terms of the weighted objectness component and the predicted and ground truth bounding box coordinates. This formulation preserves the original YOLO confidence learning mechanism while introducing illumination-adaptive weighting into the objectness component. As a result, the model becomes more sensitive to targets under challenging lighting conditions, thereby improving detection reliability without introducing additional hyperparameters.
For comparison, the standard YOLO multi-task loss is defined without the illumination-adaptive weight in the objectness component.
It can be observed that, in the standard formulation, all objectness contributions are treated uniformly regardless of local illumination conditions. However, this design does not account for spatially varying illumination within a single image, which is common in substation environments. In practice, substation scenes often involve mixed indoor and outdoor settings, leading to uneven lighting distributions with coexisting bright and shadowed regions. As a result, the uniform treatment of objectness contributions may limit the model’s ability to effectively learn from targets located in locally uneven illumination.
3. Case Study
3.1. Database Construction
The dataset used in this study was constructed by a power distribution company in China, with most images collected from on-site monitoring records. It comprises 8000 images of fire hazards, 2000 images of water accumulation, and 2000 images of small animal intrusion, along with an additional 5000 background-only images. To reflect real operational conditions, the dataset covers diverse illumination scenarios, including normal lighting (7415 images), low-light conditions (6844 images), and overexposed or glare-affected scenes (2741 images). Object categories and illumination conditions are treated as independent factors in the dataset construction, ensuring that each class is observed under various environmental lighting conditions. The annotation format follows the standard bounding-box representation (x, y, w, h). All annotations were manually labeled by trained personnel, and a two-stage verification process was applied to ensure annotation quality. The dataset is split into training, validation, and testing sets with a ratio of 6:2:2. The use of field-collected data ensures strong practical relevance and effectively reflects real-world operating conditions in power distribution substations.
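The dataset figures can be cross-checked with a short calculation. The counts are those reported above; the helper name and the rule for assigning any rounding remainder are illustrative choices.

```python
def split_counts(total, ratios=(6, 2, 2)):
    """Split `total` items by integer ratios; any remainder goes to the first part."""
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios]
    counts[0] += total - sum(counts)
    return counts

class_counts = {"fire": 8000, "water": 2000, "animal": 2000, "background": 5000}
lighting_counts = {"normal": 7415, "low_light": 6844, "overexposed": 2741}

total_images = sum(class_counts.values())
n_train, n_val, n_test = split_counts(total_images)
```

The class counts and the illumination counts both sum to 17,000 images, and a 6:2:2 split of that total gives 10,200 training, 3400 validation, and 3400 testing images.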
Figure 2 presents representative examples of fire hazards, water accumulation, and small animal intrusion in substation environments. Compared with conventional object detection tasks, foreign object detection in substations exhibits several distinct characteristics. First, target scales vary significantly, with small animals occupying only a few pixels, while smoke or water regions can span large spatial areas. Second, texture characteristics differ substantially, as smoke, water surfaces, and animals exhibit fundamentally distinct visual patterns. Third, substation images are subject to diverse illumination conditions that change with time and season, as illustrated in Figure 2b,c. Finally, the background environment in substations is inherently complex, often containing various structural components and visual interferences.
These characteristics highlight the limitations of general-purpose detection methods, motivating the development of a task-specific model architecture designed for foreign object detection in substations.
3.2. Evaluation Metrics
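The model performance is evaluated using accuracy and the F1 score [22]. The F1 score provides a balanced measure by combining precision and recall, making it particularly suitable for scenarios with imbalanced classes. The F1 score is defined as:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
and
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Here, $TP$, $FP$, and $FN$ denote the number of true positives, false positives, and false negatives, respectively.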
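As a worked example of these definitions, the function below computes precision, recall, and the F1 score from raw counts. The counts passed to it are hypothetical and serve only to illustrate the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for one class, chosen only to illustrate the arithmetic.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With 90 true positives, 10 false positives, and 30 false negatives, precision is 0.90, recall is 0.75, and the F1 score is approximately 0.818.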
In addition, mAP@0.5 is used to evaluate overall object detection performance considering both classification and localization quality. A detection is considered correct when Intersection over Union (IoU) between predicted bounding box and ground truth exceeds 0.5. The final mAP@0.5 is computed by averaging the Average Precision (AP) across all classes:
$$\mathrm{mAP@0.5} = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
Here, N denotes the number of classes and $AP_i$ is the Average Precision of class i, defined as the area under the precision–recall curve under the IoU threshold of 0.5.
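The IoU test and the final averaging step can be sketched as follows. The boxes and per-class AP values are hypothetical, and treating (x, y) as the top-left corner of a box is an assumption, since the section does not state whether the (x, y, w, h) annotations use corner or centre coordinates.

```python
def iou_xywh(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h).

    ASSUMPTION: (x, y) is the top-left corner of the box.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A detection counts as correct when its IoU with the ground truth exceeds the threshold."""
    return iou_xywh(pred_box, gt_box) > threshold

def mean_average_precision(ap_per_class):
    """mAP: arithmetic mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical boxes and AP values, for illustration only.
overlap = iou_xywh((0, 0, 10, 10), (2, 0, 10, 10))
map50 = mean_average_precision({"fire": 0.90, "water": 0.80, "animal": 0.70})
```

Two 10 x 10 boxes offset by two pixels share 80 of 120 units of area, giving an IoU of about 0.667, which passes the 0.5 threshold.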
3.3. Comparison of Different Model Structures
This section investigates the effectiveness of the proposed architecture. The method integrates several mechanisms, including AINM, MS-LFE, and AAFM, as summarized in Table 1. To validate the necessity of these modules, ablation experiments were conducted by selectively removing specific components, as summarized in Table 2. When a module was omitted, its functionality was replaced with a simplified alternative. Specifically, AINM was replaced by standard channel-wise normalization, MS-LFE was reduced to a single-scale convolution with a fixed kernel size of 7, and AAFM was replaced by direct feature propagation without attention-based refinement.
The model is trained using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 16. The maximum number of training epochs is set to 150, with an early stopping strategy based on validation loss to prevent overfitting. The model is implemented in PyTorch 2.7.0 and executed on a server equipped with an NVIDIA RTX 3090 GPU (32 GB RAM), provided by NVIDIA Corporation, Santa Clara, CA, USA. A 5-fold cross-validation strategy is employed, and all experiments are repeated three times to reduce randomness and ensure stable evaluation results.
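The early stopping strategy can be sketched independently of any training framework. The 150-epoch cap is taken from the text; the patience of 10 epochs, the `min_delta` threshold, and the synthetic validation curve are hypothetical, since the text only states that stopping is based on validation loss.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        # ASSUMPTION: patience and min_delta are hypothetical values.
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

def run_training(val_losses, max_epochs=150, patience=10):
    """Return the number of epochs run for a given sequence of validation losses."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)

# Synthetic validation curve: improves for 20 epochs, then plateaus.
losses = [1.0 / (e + 1) for e in range(20)] + [0.06] * 130
stopped_at = run_training(losses)
```

On this synthetic curve the best loss is reached at epoch 20, so training stops at epoch 30 after ten epochs without improvement; a curve that keeps improving runs to the 150-epoch cap.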
As shown in Table 2, the removal of AINM leads to a performance decline, indicating that the network becomes more sensitive to illumination variations. Without adaptive intensity correction, features are misaligned across different lighting conditions, which diminishes the representation of low-contrast targets. Similarly, excluding MS-LFE reduces performance due to the loss of multi-scale representation capabilities. When multi-scale convolutions are replaced with a single receptive field, the network is unable to simultaneously capture fine-grained details and broader contextual information. This limitation particularly affects the detection of small targets such as sparks or animals, while also reducing robustness for larger regions like water accumulation or smoke. Additionally, omitting AAFM weakens the model’s ability to selectively emphasize informative features. Without attention mechanisms to capture fine-grained textures, feature refinement is reduced to uniform propagation, which increases background interference and diminishes the network’s ability to discriminate subtle patterns such as flame edges, smoke diffusion, or reflective water surfaces.
As shown in Figure 3a, the proposed model correctly detects different types of foreign objects. When the AAFM module is removed, representative failure cases appear in Figure 3b,c. In Figure 3b, substation facilities are misclassified as fire and smoke, constituting a false positive for the fire/smoke category. In Figure 3c, water accumulation is misclassified as fire and smoke, resulting in both a false positive for the fire/smoke class and a false negative for the water class.
These errors are associated with regions exhibiting complex textures or reflective patterns that resemble smoke under varying illumination conditions. Without effective refinement, the model tends to rely on coarse visual cues, leading to confusion. In contrast, the proposed model with AAFM correctly distinguishes these cases. This improvement is attributed to the local residual formulation in (14), which enhances fine-grained local structures such as edges and subtle texture variations while preserving the original feature distribution. Consequently, the model captures local differences beyond global similarity, improving class separability and reducing both false positives and false negatives.
Overall, these modules enable the model to address the unique challenges of substation foreign object detection. By integrating adaptive illumination handling, multi-scale feature extraction, and attention-based feature refinement, the network maintains robust performance in detecting small, reflective, and low-contrast targets under varying lighting conditions. The combination of all modules yields the highest performance, demonstrating that the proposed architecture is well suited for detecting sparks, smoke, water, and small animals in real-world substation scenarios.
To further evaluate the robustness of the proposed design, sensitivity analyses are conducted on two key hyperparameters, namely the illumination sensitivity factor in the AINM module and the residual scaling factor in the AAFM module, as summarized in Table 3.
For the illumination sensitivity factor, smaller values reduce the model’s responsiveness to illumination differences, leading to insufficient enhancement in dim regions. In contrast, larger values tend to over-enhance bright or highly varying regions, which may distort feature representation. The best performance is achieved at the selected value of 0.5, indicating an effective balance between illumination normalization and feature preservation.
For the residual scaling factor, a smaller value limits the contribution of local residual information, weakening the representation of fine-grained structures. Conversely, a larger value may introduce excessive local variations, disrupting feature consistency. The selected value provides the most effective balance, enhancing local details while maintaining stable feature representation.
Overall, the results demonstrate that the proposed method remains stable within a reasonable parameter range, and the selected hyperparameters yield a well-balanced trade-off between local detail enhancement and global feature consistency.
3.4. Comparison of Different Loss Functions
This section assesses the effectiveness of the proposed loss function by comparing it with representative YOLO-based formulations, including the standard multi-task YOLO loss and its IoU-based variant (CIoU loss) for bounding box regression. In addition, Focal loss is incorporated as a comparative baseline. For Focal loss, the focusing parameter and the balancing factor are set to 2.0 and 0.25, respectively, following common practice in object detection.
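For reference, the binary focal loss with the stated settings (focusing parameter 2.0, balancing factor 0.25) can be sketched as below. The function names and example probabilities are illustrative, and the per-sample form shown is the commonly used binary variant rather than the exact implementation used in the experiments.

```python
import math

def focal_loss(prob, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one prediction: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    prob = min(max(prob, eps), 1.0 - eps)
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(prob, target, eps=1e-7):
    """Plain binary cross-entropy for the same prediction, for comparison."""
    prob = min(max(prob, eps), 1.0 - eps)
    p_t = prob if target == 1 else 1.0 - prob
    return -math.log(p_t)

easy = focal_loss(0.9, 1)   # confident, correct prediction
hard = focal_loss(0.1, 1)   # low-confidence prediction on a positive sample
```

Relative to plain cross-entropy, the focal term shrinks the loss of the confident prediction far more than that of the low-confidence one, which is the reweighting toward difficult samples discussed in this section.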
As shown in Table 4, compared with the standard YOLO multi-task loss, incorporating CIoU improves detection performance by enforcing geometric constraints on overlap, center distance, and aspect ratio, thereby enhancing bounding box alignment. However, this improvement addresses only spatial consistency and does not explicitly consider challenges arising from uneven illumination within substation images, where individual images may contain both overexposed and shadowed regions. Such variations can reduce the visibility of certain targets, particularly small or low-contrast objects. Moreover, Focal loss, which emphasizes low-confidence predictions, provides only marginal gains over the standard YOLO loss and underperforms the YOLO loss with CIoU in this task. By increasing the contribution of low-confidence samples, it shifts optimization toward a subset of difficult regions. However, excessive emphasis on these samples may bias learning and weaken the model’s ability to capture representative patterns across the dataset. As a result, confidence-based reweighting offers limited benefit in this scenario.
In contrast, the proposed loss incorporates an illumination-adaptive weight in the objectness component, reallocating gradient emphasis based on intensity variation within each predicted bounding box. This mechanism emphasizes regions with locally uneven illumination, including areas affected by overexposure or deep shadows. By explicitly accounting for intra-image illumination variations, the proposed loss aligns closely with the specific characteristics of substation monitoring tasks.
Overall, while conventional YOLO-based losses primarily enhance geometric consistency, the illumination-adaptive loss introduces a feature-aware optimization strategy that improves robustness to illumination-induced degradation. This results in more reliable detection of foreign objects under complex substation lighting conditions without increasing model complexity.
3.5. Comparative Analysis with Existing Methods
The proposed method is compared with representative object detection frameworks, including single-stage detectors such as YOLOv5 and SSD, and the two-stage detector Faster R-CNN. These models serve as baselines for evaluation and are widely adopted in general object detection tasks.
It should also be noted that these baseline models are primarily developed for natural image datasets, where objects generally have clear boundaries and stable illumination. In contrast, substation monitoring involves targets such as sparks, smoke, water accumulation, and small animals, which exhibit substantial variation in scale, texture, and illumination, further motivating the design of a task-specific detection architecture.
As shown in Table 5, YOLOv5 achieves efficient inference with competitive overall performance. SSD demonstrates relatively lower performance, which can be attributed to its comparatively limited capacity for capturing multi-scale features. Faster R-CNN, on the other hand, leverages a two-stage detection mechanism to obtain more refined feature representations, resulting in improved performance in complex scenes. However, this advantage comes with increased computational overhead, limiting its suitability for real-time monitoring applications.
Compared with YOLOv5, more recent variants such as YOLOv8 and YOLOv9 introduce architectural and training-level improvements. Specifically, YOLOv5 follows an anchor-based detection paradigm with coupled prediction heads, while YOLOv8 adopts an anchor-free design with decoupled heads and improved label assignment strategies. YOLOv9 further enhances representation learning by incorporating programmable gradient information during optimization. However, foreign object detection in substation environments exhibits unique characteristics, including large variations in object scale and texture patterns across different object types, as well as varying illumination conditions. Therefore, these improvements in YOLOv8 and YOLOv9 may be less effective for this task, since they are primarily designed for general object detection scenarios and do not explicitly model the specific characteristics of substation monitoring.
In contrast, the proposed method is specifically tailored to the intrinsic characteristics of substation monitoring. By explicitly modeling illumination variation, enhancing multi-scale local feature representation, and incorporating adaptive attention mechanisms, the proposed framework more effectively captures the distinctive features of foreign objects under challenging conditions, leading to improved detection robustness and a more favorable balance between accuracy and efficiency.
Overall, rather than being optimized solely for general-purpose detection benchmarks, the proposed framework is tailored to the intrinsic characteristics of substation monitoring tasks. This task-oriented design achieves a balanced trade-off between detection accuracy and computational efficiency, making it well-suited for deployment in complex real-world substation environments.
3.6. Exploratory Validation on MVTec 3D-AD
To further investigate the cross-domain behavior of the proposed framework, exploratory experiments were conducted on the cable gland category of the MVTec 3D-AD dataset, since this category is relatively more relevant to electrical industrial scenarios. Unlike the proposed task, MVTec 3D-AD is primarily designed for unsupervised anomaly detection rather than supervised object detection. Its training set contains only normal samples, while the test set includes five categories, namely good, bent, cut, hole, and thread. Therefore, the test samples were reformulated into a five-class classification task to better align with the original foreign-object detection problem considered in this work.
Moreover, MVTec 3D-AD does not provide standard object-detection annotations such as bounding boxes. Therefore, detection-oriented metrics such as mAP@0.5 could not be evaluated consistently with the original task setting of this work. In addition, both the original YOLO loss and the proposed illumination-aware loss rely on bounding-box regression and objectness prediction, making them unsuitable for this defect classification dataset. Consequently, standard cross-entropy loss was adopted in the exploratory experiments. Experimental results are summarized in Table 6.
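For reference, the standard cross-entropy loss for a single sample can be sketched in a few lines of Python. This is a generic textbook formulation, not the authors' training code; the function name `cross_entropy` and the use of raw logits as input are assumptions.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample.

    logits: list of raw class scores (five entries in the five-class setting).
    target: index of the true class.
    Uses the log-sum-exp trick (subtracting the maximum logit) for stability.
    """
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    # Negative log-probability of the true class.
    return log_sum_exp - logits[target]
```

Unlike detection losses, this loss needs only a class label per image, which is why it remains applicable when no bounding boxes are available.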
The results indicate that AINM provides only minor improvements on MVTec 3D-AD. Because the dataset is collected against black backgrounds under relatively stable illumination, the adaptive gamma exponent in AINM tends to become nearly constant across samples, which reduces the effectiveness of the illumination-adaptive transformation. Nevertheless, the channel-wise normalization operation can still provide slight regularization benefits by making channel activation scales more consistent and stabilizing feature magnitudes across samples. At the same time, the normalization operation may slightly perturb fine-grained texture representations, leading to increased performance variance across different runs.
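The near-constant gamma effect can be illustrated with a toy Python sketch. The rule used here is not the AINM formulation of this paper: it is a hypothetical stand-in in which the exponent is chosen so that the mean luminance of an image is mapped to a fixed target, and the luminance values are invented for illustration.

```python
import math


def adaptive_gamma(mean_luminance, target=0.5):
    """Hypothetical rule: pick gamma so that mean_luminance ** gamma == target.

    mean_luminance must lie strictly between 0 and 1.
    """
    return math.log(target) / math.log(mean_luminance)


def gamma_spread(mean_luminances):
    """Range of the exponent over a set of images."""
    gammas = [adaptive_gamma(value) for value in mean_luminances]
    return max(gammas) - min(gammas)


# Invented mean luminances, for illustration only.
stable_lighting = [0.20, 0.21, 0.19, 0.20]  # near-constant illumination
varied_lighting = [0.05, 0.30, 0.60, 0.85]  # strongly varying illumination

spread_stable = gamma_spread(stable_lighting)
spread_varied = gamma_spread(varied_lighting)
```

Under this stand-in rule, the exponent barely changes across the stable set and varies widely across the varied set, so a sample-adaptive exponent behaves almost like a fixed gamma correction when imaging conditions are uniform.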
By comparison, the proposed MS-LFE and AAFM modules provide more noticeable improvements under this cross-domain setting, indicating some generalization capability in industrial visual tasks. However, the performance gain introduced by MS-LFE is smaller than that observed in the proposed power-system foreign-object detection task. This is mainly because the scale variation among defect categories in MVTec 3D-AD is comparatively limited. Although categories such as bent and hole differ somewhat in scale, the variation is still much smaller than that in practical substation monitoring scenarios, where targets such as water accumulation or smoke may occupy a large portion of the image, while small-animal intrusion often appears at small scales. Consequently, the advantages of multi-scale feature enhancement become less pronounced on MVTec 3D-AD.
In contrast, the proposed AAFM module still yields relatively noticeable improvements. This is mainly because the module enhances fine-grained local feature representation through residual detail enhancement. In MVTec 3D-AD, defect categories such as hole, thread, and bent are often characterized by local features at relatively small scales. Therefore, the proposed AAFM remains effective under industrial defect classification settings.
Considering the substantial differences in task formulation, annotation structure, imaging conditions, and evaluation protocols between MVTec 3D-AD and the proposed foreign-object detection task, these experiments are provided only as exploratory validation. Nevertheless, they still partially verify the effectiveness of the proposed feature extraction strategies, while further highlighting the task-specific adaptability of the proposed framework for substation monitoring tasks.
4. Conclusions
This paper presents a detection framework specifically designed for monitoring foreign objects in substations, targeting hazards such as fire, water accumulation, and small animal intrusion. The framework explicitly accounts for complex and varying illumination conditions to ensure robust detection across diverse scenarios. It integrates adaptive illumination normalization, multi-scale feature extraction, and attention-based feature refinement to enhance feature representation for objects with varying scales and textures. In addition, a task-oriented loss function is introduced to maintain robust performance under locally uneven lighting conditions. Experimental results demonstrate that the proposed framework outperforms representative detection methods, confirming its effectiveness in addressing substation-specific challenges and highlighting its practical applicability in real-world monitoring systems.
The proposed method has been shown to improve detection performance in complex substation environments. However, monitoring tasks across different departments and substations in power systems may involve additional characteristics and requirements. Moreover, the method was evaluated on a specific distribution substation dataset, which may limit the generalizability of the reported results.
Future work will focus on expanding the dataset to include more diverse power system scenarios, enabling cross-dataset validation and providing a more comprehensive evaluation of model generalization. In addition, deployment on edge devices (e.g., embedded GPUs) under real-time constraints will be further investigated for practical substation monitoring applications.