Abstract
Detecting and segmenting damaged wires in substations is challenging due to varying lighting conditions and limited annotated data, which degrade model accuracy and robustness. In this paper, a novel 24/7 broken wire detection and segmentation framework based on dynamic multi-window attention and meta-transfer learning is proposed, comprising a low-light image enhancement module, an improved detection and segmentation network with dynamic multi-scale window attention (DMWA) based on YOLOv11n, and a multi-stage meta-transfer learning strategy that supports small-sample training while mitigating negative transfer. An RGB dataset of 3760 images is constructed, and performance is evaluated under six lighting conditions ranging from 10 to 200,000 lux. Experimental results demonstrate that the proposed framework markedly improves detection and segmentation performance, as well as robustness across varying lighting conditions.
1. Introduction
Defect detection is a common issue in industrial scenarios []. In substations, securely attaching a grounding rod to high-voltage wires for safety and stability testing requires the detection of damaged conductors. Segmentation masks of the detected damaged conductors must then be extracted to enable precise pose estimation, which is critical for substation maintenance operations. However, traditional manual inspection methods fall short in efficiency and adaptability to complex environments, and are further restricted in constrained operational spaces, making them inadequate for the demands of modern power system maintenance. Although infrared imaging and optical detection technologies have advanced steadily, their performance remains limited under adverse weather conditions and for the detection of subtle damage. Infrared inspection robots rely primarily on thermal information [], while automated detection systems exhibit limited adaptability in complex scenarios [].
With the continuous advancement of deep learning and computer vision technologies, visual models have achieved remarkable progress in object detection and segmentation tasks. However, in industrial scenarios such as substations, existing methods still face numerous challenges in practical deployment. On one hand, the sparsity of 3D point clouds in long-range perception limits detection accuracy. On the other hand, 2D visual models exhibit poor robustness under complex lighting conditions—particularly at night or in environments with drastic illumination changes—resulting in a significant decline in detection performance and limiting their capacity for reliable all-weather operation []. Furthermore, performance is constrained by the scarcity of high-quality annotated data [,], as well as the interference caused by complex backgrounds and lighting variations, which hinders effective visual feature extraction [].
To enhance model generalization in complex environments, recent studies have combined image enhancement techniques with optimizations to the YOLOv8 model and introduced meta-learning strategies to address challenges related to few-shot learning and task heterogeneity []. By measuring task similarity and incorporating prior knowledge transfer, meta-learning has significantly improved model robustness under varying illumination and weather conditions, while also reducing dependence on large-scale annotated datasets and enhancing adaptability to novel classes. In addition, the application of image enhancement in low-contrast images has been shown to improve the effectiveness of feature extraction [].
To address the challenges posed by complex substation maintenance scenarios, such as low model accuracy caused by the near fusion of targets and background under low-light conditions at night, poor generalization under variable illumination, and the difficulty of acquiring comprehensive datasets in industrial environments, this paper proposes a novel broken wire detection and segmentation framework tailored for substation applications. The proposed framework consists of an image enhancement module that mitigates the loss of contrast and the blending between wires and background in low-light conditions, thereby improving segmentation and detection accuracy; an image detection and segmentation module that enhances generalization and robustness under varying illumination; and a meta-transfer learning framework designed to address the limitations of small-sample training and the negative transfer effect often observed in traditional transfer learning. Experiments conducted under a wide range of lighting conditions throughout the day validate the effectiveness of each module.
Experimental results demonstrate that the proposed framework expands the operational applicability in a substation, alleviates issues related to limited datasets and negative transfer in traditional learning schemes, and significantly improves detection and segmentation performance in dynamic lighting environments. The primary contributions of this work are summarized as follows:
- A novel framework for broken wire detection and segmentation in substation maintenance is proposed, incorporating an image enhancement module, an image detection and segmentation module, and a transfer learning-based training module. This framework extends broken wire inspection capabilities beyond standard light conditions to low-light nighttime and high-exposure midday environments, offering a reliable 24/7 operational solution.
- To address challenges posed by complex backgrounds and insufficient sample sizes, an improved YOLOv11n enhanced with a dynamic multi-scale window attention (DMWA) mechanism is introduced, which significantly boosts detection and segmentation accuracy against challenging backgrounds. Additionally, a multi-stage meta-transfer learning strategy is proposed, enabling rapid convergence on small-sample datasets while mitigating the negative transfer effects of conventional transfer learning. Under variable illumination conditions, the proposed visual module achieves a 41–73% improvement in performance compared with YOLOv11n.
- Experiments with various algorithms across the image enhancement, detection and segmentation, and transfer learning modules demonstrate the effectiveness of the proposed framework. Ablation studies verify that performance improvements under low-light conditions are largely driven by the image enhancement module. The framework significantly enhances segmentation accuracy while ensuring stable inference across diverse architectures, thereby extending the operational range and duration of substation maintenance.
2. Related Work
In this section, we briefly review the core techniques integrated into the proposed framework, including wire segmentation and detection, sliding window attention mechanism, transfer learning, and image enhancement. These components are jointly designed to address challenges posed by low-light conditions, complex backgrounds, and limited annotated data in substation maintenance scenarios.
2.1. Wire Detection and Segmentation
In recent years, significant progress has been made in the detection and segmentation of damaged power wires in both 2D and 3D domains. However, real-world applications still face persistent challenges due to complex backgrounds, subtle damage patterns, limited training samples, and highly variable lighting conditions, which often result in loss of RGB information, low contrast, and strong reflections. The primary difficulty in identifying and segmenting damaged power lines lies in effectively learning robust damage features that can resist interference caused by variations in viewing angles, occlusion, and illumination.
In the context of 2D detection and segmentation, Hota et al. [] improved wire detection accuracy using a CNN trained on multispectral images. Abdelfattah et al. [] introduced generative adversarial networks for power-line segmentation, which enhanced the segmentation performance of small targets; however, both approaches suffered from limited generalization due to small-scale datasets. Yang et al. [] proposed an attention fusion module to boost segmentation performance, though its effectiveness under complex backgrounds remains limited. Xu et al. [] developed LinE segment TRansformers, which employ a multi-scale encoder–decoder architecture to enhance detection capabilities, albeit at the cost of increased computational demands. Damodaran et al. [] leveraged Canny edge detection and Hough transforms to improve both accuracy and robustness, but further optimization is still needed.
For 3D power-line detection and segmentation, Nardinocchi et al. [] used geometric assumptions for point cloud analysis, but sparse wire point clouds often led to erroneous detections. Yermo et al. [] applied elevation filtering and 3D Hough transforms with non-maximum suppression on point clouds, though the computational overhead of Hough-based methods limits their scalability. Qin et al. [] employed CIR-based LiDAR patterns to improve extraction accuracy, but their robustness in real-world environments remains unverified. Huang et al. [] introduced a densely connected network combining local and global feature enhancement with elevation attention and Transformer architectures to fuse local and global information, significantly improving point cloud segmentation and multi-scale feature extraction, though the Transformer’s high computational cost is a drawback.
In hybrid 2D–3D approaches, Stambler et al. [] proposed Deep Wire CNN, which performs 2D wire detection, followed by 3D reconstruction using aerial measurement data. However, limited fields of view and dataset constraints hinder its generalization. Kolbeinsson et al. [] developed a monocular end-to-end model combining 2D segmentation and 3D depth estimation, achieving promising performance on synthetic datasets, yet lacking robustness under extreme lighting and complex environments. Muñoz et al. [] utilized a stereo vision system combining UAV-captured 2D imagery and 3D point clouds to significantly improve wire detection accuracy and obstacle avoidance capabilities, though additional obstacle and hazard information is required to optimize system performance.
While 2D detection and segmentation are highly sensitive to spectral resolution, RGB information quality, and background complexity, 3D approaches often struggle with sparse point clouds and unstable performance in complex environments, making it difficult to identify subtle damage. Hybrid 2D–3D methods show promise by leveraging segmentation masks from RGB images and relatively environment-invariant depth maps to achieve more accurate 3D matching, thereby offering greater robustness, accuracy, and application potential. Nonetheless, challenges such as complex backgrounds, drastic illumination changes, and the scarcity of damaged-wire samples continue to limit detection and segmentation performance and robustness.
2.2. Sliding Window Attention Mechanism
The sliding window attention mechanism has shown strong capability in feature extraction and local information enhancement across tasks such as computer vision and time series analysis. SWA-Net [] applies a feature-level sliding window strategy to prevent the information loss caused by fixed-size patches, and introduces Local Feature Enhancement and Adaptive Feature Selection modules to enrich fine-grained features and dynamically emphasize key regions, improving facial expression recognition (FER) performance in complex environments. Neighborhood Attention (NA) further reduces attention complexity from quadratic to linear while preserving spatial structure via shift equivariance []. Built on NA, the NAT Transformer expands the receptive field and lowers computational cost in image classification and object detection tasks.
ASHFormer combines high-resolution networks with a sliding window self-attention block to enhance both long-range and local feature interactions in well-log stratigraphic correlation, leading to improved matching accuracy []. In time series applications, a sliding window-based two-stage decomposition method [] extracts key fluctuation features, thereby enhancing short-term wind speed prediction. The Swin Transformer integrates sliding window attention to overcome the limited global modeling capacity of traditional CNNs in gaze estimation tasks []. In addition, a sliding window attention and high-response feature reuse approach dynamically adjusts the perceptual scope of features and reuses high-response features, significantly improving multimodal emotion recognition through more effective information fusion [].
Overall, sliding window attention preserves local feature integrity under multi-scale conditions while balancing computational efficiency and model performance, offering robust feature representation and recognition capabilities for both visual and time-series tasks.
2.3. Transfer Learning
Transfer learning aims to facilitate the learning of a target task by leveraging knowledge acquired from a source domain, including both data distributions and task-related features []. TrAdaBoost [] achieves knowledge transfer by employing a sample reweighting strategy, which utilizes a small amount of labeled data from the target domain to optimize a maximum entropy classifier, thereby improving classification performance. Transfer learning has also demonstrated significant advantages in image classification and object detection tasks, enhancing model adaptability [] and boosting small object detection capabilities through cross-dataset knowledge transfer [].
In industrial scenarios, transfer learning is widely applied to feature alignment and cross-domain adaptation. Deep convolutional adaptation methods [], which incorporate Maximum Mean Discrepancy, reduce discrepancies in damage feature distributions, thus improving detection generalization. A CNN-Transformer hybrid model [] employs grayscale feature map transfer learning to enhance cross-domain performance in steel wire rope defect diagnosis.
However, conventional transfer learning may lead to negative transfer, where transferred knowledge hinders rather than helps the target task. Meta-transfer learning has been proposed to mitigate this issue, especially under few-shot learning conditions. A task-aware meta-learning framework [] improves task selection through feature clustering, thereby enhancing low-sample adaptability in structural damage segmentation. Attention-based deep meta-transfer learning [] introduces a parameter modulation strategy based on meta-learning for fine-grained fault diagnosis, significantly improving recognition across different devices. Nevertheless, the training process of meta-transfer learning typically demands considerable computational resources, and further work is needed to optimize its efficiency during training.
2.4. Image Enhancement
Traditional image enhancement methods are typically divided into three categories: spatial domain, frequency domain, and color space enhancement. Spatial domain approaches directly manipulate pixel intensity distributions, offering high real-time performance and suitability for resource-limited environments []. Frequency domain techniques transform images to the frequency space and selectively enhance or suppress components to remove noise, and are particularly effective for mitigating high-frequency noise in low-light conditions []. Color space methods adjust saturation, hue balance, and inter-channel correlations to correct distortions and improve the signal-to-noise ratio via multi-channel fusion []. Despite their effectiveness, these methods struggle with fundamental issues such as underexposure and noise amplification, while over-enhancement can cause color distortion, and repeated color space conversions may introduce computational overhead.

Retinex theory has inspired numerous learning-based low-light enhancement models. URetinexNet [] adopts data-driven unfolding optimization with implicit prior regularization to replace handcrafted decomposition, enabling more adaptive illumination adjustment. Retinexformer [] utilizes a one-stage illumination-guided framework and non-local attention for global brightness modeling but suffers from high computational cost and limited interpretability. RetinexMamba [] addresses this by replacing self-attention with a state space model and incorporating fused attention, enhancing both efficiency and robustness.
Generative models have also been extensively applied to low-light enhancement. EnlightenGAN [] leverages dual discriminators and perceptual self-regularization to improve key region recovery. PyDiff [] introduces a multi-resolution pyramid diffusion network with adaptive correction, achieving notable PSNR gains. LLFlow [] employs invertible mappings and normalized flow to model non-deterministic transitions between low-light and normal-light domains. Huang et al. [] integrated variational autoencoders with Bayesian neural networks, using variational free energy loss and dynamic priors to produce reliable low-light image generation.
Despite significant progress in both traditional and learning-based approaches, achieving robust and efficient enhancement under extreme low-light and complex conditions remains a challenging task.
3. Method
The workflow of the proposed framework is outlined in Algorithm 1. The system comprises four core modules: image enhancement, visual detection and segmentation, transfer learning, and attention mechanism. The image enhancement module improves image quality under low-illumination conditions, the visual detection and segmentation module enables accurate extraction of key targets and image-level understanding, the transfer learning module enhances the model’s generalization ability across diverse scenarios, and the attention mechanism module further emphasizes critical features while suppressing redundant background information. Finally, precise alignment between the detection results and 3D point clouds is achieved through point cloud matching, forming an intelligent broken wire detection and segmentation workflow for substations. This section focuses on the proposed attention mechanism module, transfer learning module, and image enhancement approach. The overall process is illustrated in Figure 1.
Figure 1.
Neural network workflow and the process of combining a 2D segmentation mask with a depth map to obtain 3D coordinates and pose estimation. The white background denotes the overall framework (Algorithm 1), the light blue background represents the DMWA mechanism (Algorithm 2), the light red background indicates the transfer learning framework (Algorithm 3), the light purple background corresponds to the image enhancement module (Algorithm 4), and the gray background corresponds to the visual detection and segmentation module.
| Algorithm 1 Generalized broken wire segmentation framework in substation scenarios |
| Require: RGB image , depth image , illumination (in lux), 3D point cloud , and a transfer learning model (e.g., pre-trained, meta-transfer, etc.). |
| Ensure: Matched point cloud |
| 1: Step 1: Image Enhancement |
| 2: //Depth image remains unchanged |
| 3: Step 2: Detection and Segmentation via Transfer Learning |
| Decompose into Training, as shown in Algorithm 2: |
| // takes the feature vector from () and the depth image (). It outputs a bounding box with 4 parameters (in , e.g., ) and a segmentation mask (in ), fulfilling the detection and segmentation tasks. |
| 4: Compute features from enhanced RGB image: |
| 5: Obtain detection and segmentation results: |
| 6: Step 3: Point Cloud Matching |
| 7: return |
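To make the control flow of Algorithm 1 concrete, the following minimal Python sketch mirrors its three steps. The callables enhancer, model, and matcher, as well as all argument names, are illustrative placeholders for the modules described in this section, not the authors' exact implementation.

```python
# Minimal sketch of the Algorithm 1 flow; enhancer, model, and matcher are
# placeholder callables standing in for the modules described in Section 3.

def broken_wire_pipeline(rgb, depth, lux, point_cloud, enhancer, model, matcher):
    """rgb: HxWx3 image, depth: HxW depth map, lux: ambient illuminance,
    point_cloud: Nx3 array; returns detections, masks, and matched 3D points."""
    # Step 1: image enhancement (the depth image is left unchanged)
    rgb_enh = enhancer(rgb, lux)

    # Step 2: detection and segmentation with the transfer-learned model
    boxes, masks = model(rgb_enh, depth)      # boxes: Mx4, masks: MxHxW

    # Step 3: lift the 2D masks to 3D and match against the point cloud
    matched = matcher(masks, depth, point_cloud)
    return boxes, masks, matched
```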
| Algorithm 2 Dynamic multi-window attention (DMWA) |
| Require: Feature map , Window sizes , Attention heads |
| Ensure: Enhanced feature map |
| 1: Compute Adaptive Window Weights using GAP and a 1 × 1 convolution (as Equation (3)) |
| 2: Compute Multi-Window Attention for Each |
| 3: for to K do |
| 4: Partition X into non-overlapping windows of size |
| 5: Compute attention using multi-head attention (as Equation (4)) |
| 6: end for |
| 7: Gated Fusion of Multi-Scale Attention to obtain (as Equation (8)) |
| 8: Apply window-wise weights to modulate fusion output |
| 9: Parameter Optimization |
| 10: Compute total loss (as Equations (9)–(12)) |
| 11: Update parameters: |
| 12: return |
| Algorithm 3 Meta-training algorithm |
| 1: Input: Support set , Query set , Learning rates , Regularization parameter |
| 2: Output: Updated model parameters and meta-loss |
| 3: Initialize model parameters |
| 4: for each task in support set S do |
| 5: Support Set Update: |
| 6: Compute support loss: ; Compute gradient: |
| 7: Update via Equation (18): |
| 8: end for |
| 9: for each task in query set Q do |
| 10: Query Set Update: |
| 11: Compute query loss: ; Compute gradient: |
| 12: Update via Equation (18): |
| 13: end for |
| 14: Total Loss Computation: Initialize |
| 15: for each task in query set Q do |
| 16: Compute query loss: |
| 17: Accumulate: |
| 18: end for |
| 19: Compute regularization term via Equation (10): |
| 20: Compute total meta-loss via Equation (19): |
| 21: Return: Updated parameters , and meta-loss |
| Algorithm 4 LP-MSR: Light Prior-Based Multi-Scale Retinex with color restoration |
| Require: Input image , illumination prior |
| Ensure: Enhanced image |
| 1: Step 1: Concatenate Input with Light Prior |
| 2: Concatenate I and along the channel axis: |
| 3: Step 2: Light Map Generation |
| 4: Extract features: |
| 5: Local feature extraction: |
| 6: Light map: |
| 7: Step 3: Multi-Scale Retinex with Color Restoration |
| 8: for each do |
| 9: |
| 10: |
| 11: end for |
| 12: |
| 13: Color restoration: |
| 14: |
| 15: Final Retinex output: |
| 16: Step 4: Final Enhancement |
| 17: |
| return |
3.1. Dynamic Multi-Window Attention
Window-based multi-head self-attention (W-MSA) employs a single fixed-size window, which limits its adaptability to varying spatial patterns. Although shifted-window multi-head self-attention (SW-MSA) [] introduces shifted windows to alleviate locality constraints, it still lacks sufficient capability in capturing multi-scale information. To address this, we propose a dynamic multi-scale window attention (DMWA) mechanism that integrates multi-scale window compositions with adaptively learned weights to improve cross-scale feature fusion, as illustrated in Figure 2.
Figure 2.
Architectural details of the DMWA module. The green regions represent the gating mechanisms, while the light purple areas indicate adaptive weights.
Let the input feature map be denoted as , where H and W represent the spatial dimensions, and C is the number of channels. We define a set of candidate window sizes as , for example, . A global feature representation is used to generate attention weights , which satisfy the constraint . For each window size , the input feature map is partitioned into regular non-overlapping grids; to do so, it is first padded to match the partition size for each window scale, as defined in Equation (1).
Here, denotes reflection padding, which resizes the input feature map to a padded size of , where and .
Here, represents the number of window grids, denotes the spatial size of each window, and C is the number of channels. The window weights are generated through global average pooling and learnable parameters as follows:
where GAP refers to global average pooling, . For each window size , multi-head self-attention is computed, as follows:
where , , and . The projection weights are , and , where h is the number of attention heads. denotes the relative position encoding matrix, which is generated via learnable parameters. We define the pixel offset within a window as . The relative position bias matrix is obtained via embedding lookup, with the index calculated as follows:
The attention output for each window size is denoted as , and the gating map is generated via the following:
where is the Sigmoid function. The final output is obtained by performing a weighted combination and separation of multi-scale features.
Here, denotes the Sigmoid function, and . The final output is the weighted combination of multi-scale features, and the segmentation loss is defined as follows:
Class imbalance is common in industrial scenarios, especially when detecting small or damaged regions in power lines. In hierarchical prediction tasks, when predicted probabilities approach 0 or 1, model sensitivity decreases, potentially causing overconfidence. We introduce the following attention smoothness regularization term:
Here, denotes the attention map at layer l, computed from the attention score matrix. The overall loss is given by the following:
The receptive field of traditional convolution is defined as , while the receptive field of DMWA is defined as follows:
Let the mutual information between input X and label Y be ; then DMWA improves information retention through multi-scale integration, as follows:
If the set is conditionally independent, mutual information increases linearly. Under gradient descent with the learning rate , the convergence rate satisfies the following:
The regularization term thus helps suppress gradient explosion and improve convergence. The overall process is detailed in Algorithm 2.
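As a concrete illustration of the mechanism summarized in Algorithm 2, the following PyTorch sketch implements the core DMWA idea under simplifying assumptions: per-scale window self-attention, GAP-derived adaptive window weights, and a gated fusion of the multi-scale outputs. Relative position bias and other implementation details are omitted, and all module and parameter names are illustrative rather than the exact implementation.

```python
# Simplified PyTorch sketch of dynamic multi-scale window attention (DMWA).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMWA(nn.Module):
    def __init__(self, dim, window_sizes=(4, 8, 16), heads=4):
        super().__init__()
        self.window_sizes = window_sizes
        # one multi-head self-attention module per window scale
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in window_sizes])
        # adaptive per-scale weights from global average pooling + 1x1 conv
        self.scale_weight = nn.Conv2d(dim, len(window_sizes), kernel_size=1)
        # gating map applied to the fused multi-scale output
        self.gate = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        # window weights alpha_k, softmax-normalized so they sum to 1
        alpha = torch.softmax(
            self.scale_weight(F.adaptive_avg_pool2d(x, 1)).flatten(1), dim=1)
        out = torch.zeros_like(x)
        for k, (w, attn) in enumerate(zip(self.window_sizes, self.attn)):
            pad_h, pad_w = (-H) % w, (-W) % w
            xp = F.pad(x, (0, pad_w, 0, pad_h), mode='reflect')  # reflection padding
            Hp, Wp = xp.shape[-2:]
            # partition into non-overlapping w x w windows -> token sequences
            win = xp.view(B, C, Hp // w, w, Wp // w, w) \
                    .permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
            att, _ = attn(win, win, win)                   # self-attention per window
            att = att.view(B, Hp // w, Wp // w, w, w, C) \
                     .permute(0, 5, 1, 3, 2, 4).reshape(B, C, Hp, Wp)
            out = out + alpha[:, k].view(B, 1, 1, 1) * att[:, :, :H, :W]
        # gated residual fusion of the weighted multi-scale attention
        return x + torch.sigmoid(self.gate(out)) * out
```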
3.2. Multi-Stage Meta-Transfer Learning
Due to the limited availability of training samples, conventional deep learning methods struggle to effectively learn features from small-scale damaged regions. Traditional transfer learning approaches often transfer all knowledge from the source domain to the target domain, which may result in negative transfer. Although meta-transfer learning can mitigate some negative transfer effects, it still suffers from performance degradation when the domain gap between source and target is too large.
To address these limitations, we propose a multi-stage meta-transfer learning strategy that selectively freezes different network layers at different training stages to better adapt to the challenges caused by variations in lighting and complex backgrounds. The learning process is divided into three stages: (a) pre-training, (b) meta-training, and (c) meta-testing, as illustrated in Figure 3.
Figure 3.
Workflow of the meta-transfer learning.
3.2.1. Pre-Training
In the pre-training stage, the pretrained weights of YOLOv11 are fine-tuned using a damaged conductor dataset under normal illumination from the source domain . Both the Backbone and the Neck layers are used as the feature extractor . The cross-entropy loss is employed to update , defined as follows:
where denotes the learning rate during pre-training. The feature extractor obtained from this step is then transferred into the meta-training phase to enhance the robustness and convergence speed of the proposed framework.
3.2.2. Meta-Training
During meta-training, a meta-task is constructed by randomly sampling from the source domain . Each meta-task consists of two subsets: a support set and a query set . The support set is used to update the base learner, while the query set is used to evaluate and further adapt the learner.
Specifically, for each meta-batch, the base learner is initialized with the pre-trained feature extractor parameters from YOLOv11 and a random initialization of the remaining parameters. Each meta-batch contains a batch of tasks, and each task has its own base learner. Using gradient descent, the base learner is updated to learn generic features of damaged conductors, as shown below:
Here, and are the updated parameters of the base learner on the support and query sets, respectively; is the learning rate, and , denote the task-specific cross-entropy loss.
Conventional meta-transfer learning focuses on optimizing the entire network using fast adaptation of initial parameters. However, transferring the entire network in the meta-training stage may result in overfitting, especially in early convolutional layers. To avoid this, we freeze the pre-trained and fine-tuned shallow layers of the network and only perform meta-transfer on the Backbone, DMWA, and Neck modules. This improves training efficiency and convergence speed. The meta-training loss is formulated as follows:
where is the cross-entropy loss on the query set, and , are the updated parameters of DMWA and Neck, respectively. The regularization term compares the attention maps of the first and last layers in DMWA and smooths their difference to ensure stability during meta-learning. The training details of meta-training are described in Algorithm 3.
3.2.3. Meta-Test
During meta-testing, a set of tasks is sampled from the target domain under various illumination conditions. These tasks are similarly split into a support set and a query set . The optimized parameters and obtained from meta-training are used to initialize the model for meta-testing. The base learner is further fine-tuned on the support set to adapt to the target domain.
Since the previous two stages were conducted on the source domain (collected under normal illumination), the meta-testing phase requires the model to adapt to feature distribution shifts caused by different lighting conditions. To this end, the Backbone is frozen during this phase, and a small learning rate (e.g., ) is used to fine-tune all other network layers. The final optimized parameters and are then used for evaluation.
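The staged freezing strategy across the three phases can be summarized by the schematic PyTorch sketch below. The module attribute names (backbone, dmwa, neck), the learning rates, and the first-order approximation of the meta-gradient are assumptions made for illustration; this is not the exact training code.

```python
# Schematic sketch of the multi-stage meta-transfer learning strategy (Section 3.2).
import copy
import torch

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def pretrain(model, source_loader, loss_fn, lr=1e-2, epochs=50):
    # Stage (a): fine-tune the detector on normal-light source-domain data.
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for imgs, targets in source_loader:
            opt.zero_grad()
            loss_fn(model(imgs), targets).backward()
            opt.step()

def meta_train_step(model, support, query, loss_fn, inner_lr=1e-3, meta_lr=1e-3):
    # Stage (b): freeze shallow layers; adapt only DMWA + Neck per task.
    set_trainable(model.backbone, False)
    shared = list(model.dmwa.parameters()) + list(model.neck.parameters())
    meta_opt = torch.optim.Adam(shared, lr=meta_lr)

    task_model = copy.deepcopy(model)                     # base learner for this task
    task_params = list(task_model.dmwa.parameters()) + list(task_model.neck.parameters())
    inner_opt = torch.optim.SGD(task_params, lr=inner_lr)
    for imgs, targets in support:                         # inner-loop adaptation
        inner_opt.zero_grad()
        loss_fn(task_model(imgs), targets).backward()
        inner_opt.step()

    meta_opt.zero_grad()
    meta_loss = sum(loss_fn(task_model(imgs), t) for imgs, t in query)
    meta_loss.backward()                                  # first-order meta-gradient
    for p_shared, p_task in zip(shared, task_params):     # copy grads to shared model
        p_shared.grad = p_task.grad
    meta_opt.step()

def meta_test_finetune(model, target_support, loss_fn, lr=1e-4, steps=100):
    # Stage (c): freeze the backbone; fine-tune remaining layers with a small lr.
    set_trainable(model.backbone, False)
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        for imgs, targets in target_support:
            opt.zero_grad()
            loss_fn(model(imgs), targets).backward()
            opt.step()
```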
3.3. Low-Light Image Enhancement
In low-light environments, both global contrast and local noise can adversely affect object detection and semantic segmentation. However, their relative impact often depends on specific tasks and environmental conditions. Generally, global contrast plays a more critical role than local noise, particularly under poor illumination. Images with low global contrast typically lack essential semantic cues across the entire scene, making it difficult to distinguish objects from the background. This significantly degrades segmentation accuracy, especially when the contrast is insufficient for object delineation. Under such conditions, the model struggles to detect or accurately segment objects.
In contrast, local noise tends to affect only small regions, primarily degrading boundary quality. Its impact is often mitigated by smoothing operations or specific architectural designs, such as using dilated convolutions or edge-aware modules.
Inspired by Retinex theory and considering the limited computational resources in industrial applications—which constrain the use of overly deep enhancement networks—we introduce an efficient enhancement method that combines illumination prior compensation with color image enhancement. This approach focuses on improving global contrast using minimal computation. The proposed method, LP-MSR (Light Prior-based Multi-Scale Retinex with Color Restoration), is summarized in Algorithm 4.
In addition, LP-MSR employs a single-pass computation of the image's mean luminance (the V channel of the HSV, Hue–Saturation–Value, color space) as a proxy for ambient lux, using a precomputed calibration curve to map this mean value to approximate lux and thus assign one of six brightness categories (shown in Table 1). Specifically, let
where is the value channel at the pixel , H and W represent the height and width of the image, respectively. The V channel in the HSV color model indicates the brightness or lightness of a color, with higher values corresponding to brighter colors. The V channel, along with the hue (H) and saturation (S) channels, is used to describe the full color representation of each pixel in the image. We then apply a calibration function (obtained by collecting data from a predefined dataset under the six initial scenes, with the distribution of the V channel estimated from the readings under known illuminance conditions, as measured by a calibrated lux meter) to estimate lux, as shown in Figure 4. The six estimated lux ranges are as follows:
Table 1.
Lighting settings and environment simulation.
Figure 4.
Mapping of to lux for six brightness ranges.
For images in Range 1, we increase the color-restoration coefficient and Retinex gain G, while reducing each Gaussian scale to emphasize detail recovery, as described in lines 8–15 of Algorithm 4. For Range 2, no enhancement is applied (all parameters remain at their defaults). For Range 6, and G are reduced and is increased to prevent over-enhancement. For the intermediate ranges (3–5), all parameters interpolate linearly between their low- and high-lux settings. This one-time luminance scan, calibration lookup, and linear interpolation incur negligible overhead compared with the multi-scale Gaussian blurs and convolutions, leaving inference complexity effectively unchanged.
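A minimal sketch of this luminance-based parameter selection is shown below, assuming an OpenCV/NumPy pipeline. The calibration table, range boundaries, and low-/high-lux parameter settings are illustrative placeholders; in the actual system the calibration curve is fitted from lux-meter measurements as described above, and the per-range adjustment is approximated here by a single continuous interpolation.

```python
# Sketch of luminance-to-lux estimation and LP-MSR parameter selection.
import cv2
import numpy as np

# hypothetical calibration: mean V value -> approximate lux (piecewise linear)
CALIB_V   = np.array([10,  40,  90, 140, 190, 230, 255], dtype=np.float32)
CALIB_LUX = np.array([10, 200, 600, 5e3, 5e4, 1e5, 2e5], dtype=np.float32)
RANGE_EDGES = [200, 600, 5e3, 5e4, 1e5]          # boundaries of the six ranges

def estimate_lux(bgr):
    v = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[..., 2]   # V channel
    v_mean = float(v.mean())                           # single-pass mean luminance
    return float(np.interp(v_mean, CALIB_V, CALIB_LUX))

def brightness_range(lux):
    return 1 + int(np.searchsorted(RANGE_EDGES, lux))  # categories 1..6

def msr_params(lux, lo=(1.2, 1.5, (5, 25, 60)), hi=(0.6, 0.8, (25, 80, 200))):
    """Interpolate (color-restoration coefficient, gain G, Gaussian scales)
    between illustrative low-lux and high-lux settings."""
    t = np.clip(np.log10(lux / 10.0) / np.log10(2e5 / 10.0), 0.0, 1.0)
    beta   = lo[0] + t * (hi[0] - lo[0])
    gain   = lo[1] + t * (hi[1] - lo[1])
    sigmas = tuple(a + t * (b - a) for a, b in zip(lo[2], hi[2]))
    return beta, gain, sigmas
```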
4. Experiments and Discussion
4.1. Evaluation Indicators
In object detection and segmentation tasks, commonly used evaluation metrics include recall, precision, average precision (AP), and mean average precision (mAP). Recall measures the proportion of actual positive samples that are correctly identified, while precision indicates the proportion of detected positive samples that are true positives. These metrics often have a trade-off: increasing recall may decrease precision due to false positives, while increasing precision may decrease recall, leading to missed detections. The formulas for these metrics are as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$

where $TP$ is the number of true positives, $FN$ is the number of false negatives, and $FP$ is the number of false positives.

$$AP_i = \int_0^1 p_i(r)\,dr, \qquad mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i$$

where $p_i(r)$ is the precision at each recall level r, C is the number of classes, and $AP_i$ is the average precision for the i-th class.
Additionally, FPS (frames per second) measures the real-time processing capability of the model, indicating how many images the model can process per second. GFLOPs (giga floating-point operations) indicates the amount of computation required for a single forward pass through the model, and Parameters represents the total number of trainable parameters in the model.
These metrics provide insights into both the performance and computational efficiency of the model.
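For reference, the following sketch computes precision and recall from greedily matched boxes at an IoU threshold of 0.5; AP then integrates precision over recall levels, and mAP averages AP over classes (omitted here for brevity). The matching strategy and helper names are illustrative.

```python
# Sketch of precision/recall computation for axis-aligned boxes.

def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(preds, gts, iou_thr=0.5):
    """preds: predicted boxes sorted by confidence; gts: ground-truth boxes."""
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, 0.0
        for j, g in enumerate(gts):
            ov = iou(p, g)
            if j not in matched and ov > best_iou:
                best, best_iou = j, ov
        if best is not None and best_iou >= iou_thr:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp                 # unmatched predictions
    fn = len(gts) - tp                   # missed ground truths
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return precision, recall
```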
4.2. Experiment Setup
To train and validate the performance of the damaged wire detection and segmentation model in substations, this study constructed two datasets, with a particular focus on the impact of lighting variations on model performance.
- Complex Background Dataset: The complex background dataset comprises 3760 RGB images encompassing various wire forms and damage types, making it well-suited for complex substation environments. Data augmentation techniques—including rotation, scaling, brightness adjustment, and noise injection—were applied to enhance the model’s robustness under diverse environmental conditions.
- Lighting Variation Test Datasets: The lighting variation test datasets consist of 1350 images designed to evaluate model performance under a range of lighting conditions. The light intensity spans from 10 to 200,000 lux, covering scenarios such as nighttime low light, typical outdoor weak light, direct strong sunlight, and extreme brightness. Details of the lighting conditions are summarized in Table 1.
In the initial stage, the batch size was set to 16, and the model was trained for 50 epochs with a 20-epoch warm-up strategy. The initial learning rate was set to 0.01. The optimizer used was Adam, with parameters , , and a weight decay of . Based on this setup, we performed grid search to determine the optimal learning rate from the candidate set , combined with data augmentation probabilities from .
After evaluating the mean Average Precision (mAP) on the complex background and lighting variation test datasets, experimental results showed that a learning rate of , together with Mosaic = 1.0 and Mixup = 0.15, enabled fast convergence and reduced the risk of overfitting. Subsequently, we extended the training to 150 epochs and adopted multi-scale training, where the input image size was randomly resized within the range to improve the model’s adaptability to different target scales. The hardware configuration used for training is shown in Table 2.
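The grid search described above can be expressed schematically as follows; the candidate values and the train_and_eval helper are illustrative placeholders rather than the exact search space used in our experiments.

```python
# Schematic grid search over learning rate and augmentation probabilities.
from itertools import product

base_cfg = dict(batch=16, epochs=50, warmup_epochs=20, optimizer="Adam", lr0=0.01)
lr_grid     = [0.01, 0.005, 0.001]   # candidate learning rates (illustrative)
mosaic_grid = [0.5, 1.0]             # candidate Mosaic probabilities (illustrative)
mixup_grid  = [0.0, 0.15]            # candidate Mixup probabilities (illustrative)

def grid_search(train_and_eval):
    """train_and_eval(cfg) trains with the given configuration and returns
    the mean mAP over the complex background and lighting variation test sets."""
    best_cfg, best_map = None, -1.0
    for lr, mosaic, mixup in product(lr_grid, mosaic_grid, mixup_grid):
        cfg = dict(base_cfg, lr0=lr, mosaic=mosaic, mixup=mixup)
        score = train_and_eval(cfg)
        if score > best_map:
            best_cfg, best_map = cfg, score
    return best_cfg, best_map
```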
Table 2.
Hardware and software configuration.
As shown in Figure 5, the complex background dataset was collected as illustrated in Figure 5a. Images were captured from various perspectives, including frontal, diagonal, rear, upward, and top–down views, providing a panoramic representation of the operational environment. The dataset includes wires placed against diverse backgrounds, such as trees, flat surfaces, industrial areas, and open ground, and under varying levels of occlusion. Each wire instance exhibits different damage states, including aging, deformation, scorching, corrosion, and fracture. Additionally, as shown in Figure 5b, the Lighting Variation Test Datasets were captured under six distinct lighting conditions to simulate a wide range of real-world scenarios: Low Light Conditions, Indoor Lighting, Overcast Weak Light, Overcast Strong Light, Cloudless Strong Light, and Extreme Strong Light.
Figure 5.
Classification of datasets. (a) Represents the complex background dataset, collected with variations across three aspects: different viewpoints (frontal, diagonal, rear, upward, top), different backgrounds (tree/natural, plain colors, patterned surface, industrial scene, outdoor ground), and different types of damage (wire aging, wire damage, scorched wire, wire corrosion, wire fracture). (b) Represents the test datasets under six different lighting conditions, with lighting variations ranging from 10 to 2 × 10⁵ lux, as defined in Table 1. The binary image shown in (b) corresponds to the ground truth (GT) mask.
To ensure the accuracy of the lighting variation test set, we controlled the distance between the light source and the object and used a photometer to measure the range of illumination intensity. The control of illumination intensity was achieved using the following formulas. The illuminance relations based on luminous flux and the inverse-square law are given in Equations (26) and (27):

$$E = \frac{\Phi}{A} \quad (26) \qquad\qquad E = \frac{I}{d^{2}} \quad (27)$$

Here, E is the illuminance, measured in lux; $\Phi$ is the luminous flux, measured in lumens (lm); A is the illuminated area, measured in square meters (m²); I is the luminous intensity, measured in candelas (cd); and d is the distance between the light source and the object, measured in meters (m).
To derive Equation (28), we combine the concepts of image brightness, light source distribution, and object reflectance characteristics, as follows:
Here, represents the image brightness, denotes the brightness distribution of the light source, and represents the reflectance properties of the object.
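As a worked example of Equations (26) and (27), the snippet below shows how the source distance can be chosen to obtain a target illuminance under the inverse-square law; the luminous intensity value is illustrative.

```python
# Worked example of the illuminance relations E = Phi / A and E = I / d^2.
import math

def illuminance_from_flux(phi_lm, area_m2):
    return phi_lm / area_m2                  # E = Phi / A  (lux)

def illuminance_from_intensity(i_cd, d_m):
    return i_cd / (d_m ** 2)                 # E = I / d^2  (lux)

def distance_for_target_lux(i_cd, target_lux):
    return math.sqrt(i_cd / target_lux)      # invert the inverse-square law

# e.g., a 2000 cd source placed 2 m away gives 500 lux (normal lighting),
# while placing it 0.1 m away would give roughly 2e5 lux (extreme brightness).
print(illuminance_from_intensity(2000, 2.0))   # 500.0
print(distance_for_target_lux(2000, 2e5))      # 0.1
```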
4.3. Result Analysis
4.3.1. DMWA Module Effectiveness Verification
To evaluate the effectiveness of the proposed attention mechanism, we replaced the original attention module in our framework with several commonly used alternatives. Experiments were conducted on our normal-light dataset (400–600 lux) and on the complex background dataset, where we compared W-MSA [], SW-MSA [], SE [], CBAM [], ECA [], and SRM []. The results are summarized in Table 3 and Table 4.
Table 3.
Comparison of detection, segmentation, efficiency, and speed across attention mechanisms on a normal-light dataset (400–600 lux). ↑ indicates that higher values are better, and ↓ indicates that lower values are better.
Table 4.
Comparison of detection, segmentation, efficiency, and speed across attention mechanisms on the complex background dataset. ↑ indicates that higher values are better, and ↓ indicates that lower values are better.
As shown in Table 3 and Table 4, on the normal-light dataset (400–600 lux), our proposed DMWA achieves the largest gains across four key metrics: detection accuracy (+14.99%), detection F1-Score (+13.54%), segmentation accuracy (+13.02%), and segmentation F1-Score (+11.06%). These improvements significantly outperform all competing methods. Notably, on the complex background dataset, SRM’s performance increases slightly compared with the normal-light dataset (detection accuracy: +3.11% to +4.14%, detection F1: +4.22% to +5.13%; segmentation accuracy: +4.78% to +5.07%). In contrast, other attention modules see drops in their metrics under complex backgrounds, yet DMWA still delivers superior improvements over all other attention mechanisms.
Moreover, DMWA also demonstrates strong practical performance in terms of inference speed, achieving 47.9 FPS and an inference latency of only 1.5 ms. Compared with sliding-window-based mechanisms, DMWA achieves a better trade-off between accuracy and efficiency. While both W-MSA and SW-MSA yield competitive accuracy, they fall short in inference speed, highlighting the superior balance offered by DMWA.
4.3.2. Multi-Stage Meta-Transfer Learning Effectiveness Verification
We adopt different transfer learning strategies to train the visual module, including Transfer Learning (TL) [] and Meta-Transfer Learning (MTL) [], with the following configurations:
- TL: The model is first pretrained on the source domain for 50 epochs, during which the Backbone layers are frozen to preserve feature extraction capabilities and prevent overfitting on limited data. The remaining layers are then fine-tuned for another 100 epochs.
- MTL: Each meta-batch contains 10 meta-tasks. Both the support and query sets follow a five-way five-shot setting and cover six illumination conditions. The learning rate for both the base learner and the meta learner is set to 0.001. The base learner is trained for 10 inner-loop steps, while the meta learner updates once per meta-iteration. The total training comprises 150 epochs.
To enhance learning efficiency under few-shot, multi-illumination, and multi-scene conditions, and to mitigate the issue of negative transfer in conventional methods, we propose a multi-stage meta-transfer learning approach (MMTL). MMTL achieves faster convergence with limited data and significantly improves the F1-Score, thereby enhancing generalization performance.
We compare the F1-Scores of TL, MTL, and MMTL after 150 training epochs under various illumination settings (as shown in Figure 6). Experimental results show that TL struggles to exceed an F1-Score of 0.7 in few-shot scenarios and is prone to negative transfer. Although MTL performs better, global fine-tuning can lead to the loss of some generalizable features. In contrast, MMTL integrates meta-learning mechanisms with a staged freezing strategy, effectively avoiding negative transfer and achieving high F1-Scores (up to 0.93) within a shorter training period. These results demonstrate its strong generalization ability and stability in low-data regimes.
Figure 6.
Three-dimensional visualization of F1-Score evolution under different lighting conditions and training methods.
4.3.3. Generalization and Robustness Verification of Vision Modules
To evaluate the performance and robustness of the proposed visual model, YOLOv11n with DMWA-MMTL, under complex backgrounds and varying lighting conditions, we conducted tests on an illumination-variant dataset, with a focus on detection and segmentation accuracy. The overall performance results are illustrated in Figure 7.
Figure 7.
Comparison of detection and segmentation performance of different models under different illuminations. The six lighting conditions (1–6) correspond to those defined in Table 1.
In terms of detection and segmentation accuracy, our method consistently outperforms the YOLO series (e.g., YOLOv8n, YOLOv11s, YOLOv11m) and Fast R-CNN across all lighting scenarios. Under low-light conditions (10–200 lux), our model achieves a detection accuracy of 0.7778, significantly higher than that of Fast R-CNN (0.4032) and the baseline (0.4078). In normal lighting (400–600 lux), it reaches the highest accuracy of 0.8804, surpassing YOLOv11m (0.7628) by approximately 12%. Even under high illumination (1 × 10⁵–2 × 10⁵ lux), the model maintains a strong accuracy of 0.7467, clearly outperforming the baseline (0.4289) and other methods.
A similar trend is observed in segmentation tasks. Our model achieves 0.7678 segmentation accuracy in low-light conditions, far exceeding Fast R-CNN (0.4467), and reaches 0.8704 under standard lighting, outperforming all comparison models. Overall, the proposed method not only achieves higher average accuracy, but also demonstrates greater robustness and stability under extreme illumination conditions.
4.3.4. Validation of the Proposed Image Enhancement Method and Overall Framework
To validate the effectiveness of our proposed image enhancement approach in improving segmentation performance under low-light conditions, we conducted a series of ablation studies by replacing both the visual detection and segmentation modules as well as the image enhancement module. These experiments were performed on a low-light dataset (10–200 lux) to evaluate detection and segmentation capabilities.
To ensure real-time inference on resource-constrained edge devices, this work follows a real-time-first principle and therefore compares only lightweight architectures that have been proven feasible for mobile and embedded scenarios. Deep, high-capacity models such as the UNet family were excluded from our comparative experiments because their higher FLOPs and memory footprint prevent them from achieving over 20 FPS on platforms like Raspberry Pi 4B and Jetson Nano under the low-power and low-latency requirements of our target applications.
As shown in Table 5, using our proposed visual module alone without any enhancement module (Model B) already yields a significant improvement in detection and segmentation performance under low-light conditions compared with the baseline (Model A). However, this comes at a slight cost to real-time performance, with FPS dropping from 128.8 to 47.9. When the proposed image enhancement (IE) module is further incorporated (Model H), both detection and segmentation accuracy are further improved (e.g., detection precision increases from 0.763 to 0.862, and recall improves from 0.791 to 0.881). Meanwhile, the inference speed only slightly decreases to 46.4 FPS, which still meets real-time processing requirements.
Table 5.
Comparison of the performances of different modules under low-light conditions (10–200 lux). ↑ indicates that higher values are better, and ↓ indicates that lower values are better.
Detection and segmentation models with strong generalization capabilities can partially mitigate the accuracy loss caused by noise and reduced clarity. Our image enhancement method places a higher emphasis on improving global contrast, whereas large-scale image enhancement modules invest substantial computational resources into denoising and sharpening. As shown in Figure 8, our shallow LP-MSR outperforms MSRCR, which has similar computational overhead, and achieves performance on par with four deeper image enhancement methods. Moreover, when our enhanced images are used for segmentation, the edges at both ends of damaged conductors are detected more accurately. However, such large network-based enhancement modules are not well suited for industrial applications that demand real-time performance.
Figure 8.
Comparison of low-light image enhancement methods and their segmentation outputs.
Compared with other classical or recent low-light enhancement methods (Models J–M), although they can also significantly boost detection and segmentation performance, most of them suffer from lower speeds (typically under 30 FPS), which limits their real-time applicability. Furthermore, when replacing different detection and segmentation networks (Models N–S), while keeping our IE module unchanged, a good balance between performance and speed can still be achieved. For instance, Model S reaches a detection precision and recall close to 0.86 while maintaining a high speed of 84.7 FPS.
From Figure 9 and Figure 10, it is evident that our proposed DMWA-MMTL consistently produces accurate segmentation masks across a wide range of illumination conditions—from extremely low light (10–200 lux) to very high brightness (1 × 10⁵–2 × 10⁵ lux). Unlike the various YOLO-based models, which sometimes lose continuity or misinterpret edges under challenging lighting, our method retains the correct shape and boundary details, closely matching the ground truth. This robustness highlights the strong generalization capability and reliability of our approach, making it particularly suitable for industrial inspection tasks where lighting can vary dramatically.
Figure 9.
Inference results of seven different models on six different light conditions.
Figure 10.
Inference results of seven models in six lighting conditions shown as pixel-level error maps.
In summary, the results demonstrate that our proposed image enhancement module can significantly improve detection and segmentation performance in low-light environments while maintaining high real-time performance. These findings further confirm the effectiveness and practicality of the overall framework for low-light vision tasks.
4.3.5. Edge Deployment Considerations
In this section, we choose to conduct our evaluations on the existing experimental equipment available to us and categorize the evaluated platforms into three groups. The first group consists of desktop/laptop-level discrete GPUs—namely, RTX 3090 (24 GB GDDR6X), RTX 1080 (8 GB GDDR5X), and RTX 3060 Laptop (6 GB GDDR6)—which offer very high deep neural network inference throughput, thanks to large numbers of CUDA cores and high-bandwidth VRAM (Video Random Access Memory). The second group comprises ultraportable hybrid-architecture CPUs: Intel Core Ultra 9 (36 MB L3 cache, 6 Performance+Efficient cores) and Core Ultra 7 (24 MB L3 cache, 6 P-Cores+8 E-Cores). Each integrates a Xe iGPU but relies heavily on its large L3 cache and heterogeneous core design to accelerate inference. Finally, the third group includes conventional CPUs—Intel Core i7 (16 MB L3, 6 cores/12 threads) and Core i5 (12 MB L3, 6 cores/12 threads)—which depend solely on FP32 multithreaded execution without heterogeneous cores or dedicated ML accelerators.
As illustrated in Table 6, discrete GPU platforms (RTX 3090/1080/3060 Laptop) run our 17.7 GFLOPs model at 40.1–42.7 FPS versus YOLO11n (6.9 GFLOPs) at 101.9–128.8 FPS, maintaining over 30 FPS even with greater complexity. In the Core Ultra series, large 36 MB/24 MB L3 caches and heterogeneous P/E cores help reduce overhead: our model achieves 26.8 FPS (37.3 ms) on Ultra 9 and 21.3 FPS (46.9 ms) on Ultra 7, compared with YOLO11n’s 52.1 FPS and 39.9 FPS; nonetheless, throughput remains below GPU levels. Conventional CPUs face pronounced cache and memory-bandwidth bottlenecks: Core i7 (16 MB L3) yields only 10.9 FPS (91.7 ms) for our model versus 21.4 FPS for YOLO11n, and Core i5 (12 MB L3) achieves 8.9 FPS (112.4 ms) versus 19.6 FPS, both far below real-time requirements.
Table 6.
Model complexity and inference efficiency across hardware platforms. ↑ indicates that higher values are better, and ↓ indicates that lower values are better.
Under a strict real-time requirement (≥30 FPS), only discrete GPUs (e.g., RTX 3060 Laptop or higher) can run the FP32 model above 30 FPS without further tuning. The Core Ultra 9/Ultra 7 in FP32 reach only 26.8/21.3 FPS and require FP16 or INT8 quantization to exceed 30 FPS. The Core i7/i5 CPUs cannot reach 30 FPS in FP32 (only 8.9–10.9 FPS), and quantization alone is insufficient.
Under a near-real-time requirement (≥20 FPS), the Core Ultra 9/Ultra 7 in FP32 already meet the threshold (26.8/21.3 FPS). Since NVIDIA's Jetson Xavier NX edge module offers roughly one quarter of the FP32 throughput of an RTX 3060 Laptop but compensates with Tensor Cores, we estimate that TensorRT FP16/INT8 on Xavier NX yields 20–25 FPS. Similarly, AGX Xavier (32 TOPS FP16) and Orin NX (60 TOPS FP16) can exceed 30 FPS with INT8 quantization. In contrast, the Core i7/i5 remains below 20 FPS even after quantization, making it unsuitable without substantial model compression.
Additionally, when comparing various YOLOv8 and YOLOv11 variants, it is clear that model size and complexity have a direct impact on inference ability across platforms. On the discrete GPU, YOLOv8n (3.01 M params, 8.1 GFLOPs) achieves the highest FPS (129.6 FPS on RTX 3090, 109.6 FPS on RTX 1080, 100.1 FPS on RTX 3060 Laptop), closely matching YOLO11n’s performance and exceeding YOLOv8s (11.13 M params, 28.4 GFLOPs), which runs at 107.8 FPS/59.8 FPS/54.8 FPS, respectively. YOLOv8m (27.22 M params, 110.9 GFLOPs) and YOLO11m (21.58 M params, 64.9 GFLOPs) incur a larger latency penalty, dropping to 85.7 FPS/41.7 FPS/37.8 FPS (YOLOv8m) and 88.1 FPS/42.1 FPS/39.9 FPS (YOLO11m) on RTX 3090/1080/3060. This demonstrates that, on powerful GPUs, even the “m” variants can maintain well above 30 FPS, though at reduced margins.
On the Core Ultra series, the smallest variant (YOLOv8n) still runs acceptably: 47.1 FPS on Ultra 9 and 37.9 FPS on Ultra 7, indicating that the 8.1 GFLOPs cost can be handled by the integrated Xe iGPU and large L3 cache with sub-30 ms latency. By contrast, YOLOv8s's 28.4 GFLOPs pushes both Ultra 9 and Ultra 7 to their limits—only 14.8 FPS and 11.8 FPS, respectively—falling well below near-real-time thresholds. YOLOv8m and YOLO11m cannot run at all on these platforms due to memory bandwidth and cache constraints (denoted by “–”), as their 110.9 GFLOPs and 64.9 GFLOPs, respectively, exceed what the heterogeneous core design can process without unacceptable stalling. YOLOv11s (21.3 GFLOPs) manages 15.9 FPS on Ultra 9 and 12.7 FPS on Ultra 7, again underperforming for most near-real-time use cases.
On conventional CPUs (Core i7/i5), only the “n” variants are feasible: YOLOv8n achieves 20.1 FPS on Core i7 and 17.8 FPS on Core i5, while YOLO11n hits 21.4 FPS and 19.6 FPS, respectively. The “s” and “m” variants of YOLOv8 and YOLOv11 all show “–” (cannot run) due to excessive computational and memory requirements. This reinforces that traditional CPUs without specialized accelerators simply cannot support these larger networks, making them unsuitable for anything beyond occasional offline inference.
The “n” variants of YOLOv8/YOLO11 can run on all tested platforms, but only on discrete GPUs and Core Ultra series do they exceed 30 FPS without quantization. The “s” variants require at least a Core Ultra 9 or higher and still fall short of real-time thresholds. The “m” variants are effectively restricted to high-end GPUs for both real-time and near-real-time applications.
Overall, our 17.7 GFLOPs model can meet or approach industrial near-real-time requirements under the hardware and optimization conditions described above. While maintaining high performance, it can also be deployed on a variety of computing devices.
4.3.6. Ablation Experiments
To evaluate the effectiveness of each component under low-light conditions, we designed eight model configurations (A–H) by selectively integrating or removing the IE module (LP-MSR), the DMWA module, and the MMTL module. As shown in Table 7, Model A, which incorporates all three modules, achieves the best overall performance in both detection and segmentation tasks, with detection mAP50 and mAP50-95 reaching 0.878 and 0.725 and segmentation mAP50 and mAP50-95 reaching 0.863 and 0.697, respectively, significantly outperforming the other configurations.
Table 7.
Detection, segmentation, and inference performance of ablation experiments under low-light conditions. ↑ indicates that higher values are better, and ↓ indicates that lower values are better.
A comparison between Models B and C reveals that removing either the IE (B) or DMWA (C) module leads to a considerable drop in detection and segmentation accuracy. This indicates that the IE module plays a crucial role in enhancing low-light imagery, while the DMWA module is essential for effective feature extraction and attention weighting under low-light conditions.
Although Model D excludes the MMTL module and thus achieves faster inference (1.5 ms), it suffers from reduced accuracy. The performance gap between Model D and Model A confirms the importance of MMTL in boosting overall multi-task learning performance.
As shown in Table 7, in the two-way and three-way ablations (E–H), performance drops far exceed the individual ablations (B–D), indicating no antagonistic effects among modules but rather tight synergy. Specifically, removing LP-MSR and DMWA (E) or LP-MSR and MMTL (F) yields mAP50 values significantly lower than what would be expected from summing the individual ablation effects; likewise, ablating DMWA and MMTL together (G) demonstrates the loss of their collaborative gain. The three-way ablation (H) further exacerbates performance degradation, reinforcing that LP-MSR, DMWA, and MMTL collectively provide complementary benefits to detection and segmentation performance.
In summary, the collaborative integration of the IE, DMWA, and MMTL modules significantly enhances both robustness and accuracy of the model in low-light environments, thereby expanding the applicability of substation monitoring systems under challenging lighting conditions.
4.3.7. Limitations and Failure Cases
To investigate the limitations and failure cases, we added four extreme scenarios to the dataset (thirty images each) to evaluate the model more comprehensively under varied conditions. As illustrated in Figure 11, the first scenario is overlapping wires. When multiple wires cross and intertwine, the model struggles to distinguish each wire’s true edges within the overlapping regions, producing clearly “merged” segmentations or boundary misalignments. The issue becomes especially severe when the wires have similar thickness and color against a complex background, leading the damage-detection module to mistake overlapping regions for faults and thus sharply increasing the false-positive rate. The second scenario is partial occlusion: when a wire is partially covered by surrounding objects (e.g., foam packaging, desktop equipment, or other cables), the model cannot fully recover the features in the occluded area, resulting in “broken” or discontinuous segmentations. As a result, genuine breaks or cracks are hidden and go undetected, or intact wires are erroneously flagged as damaged. The third scenario is sudden illumination change; for instance, under a direct flashlight or strong shadow, a wire’s surface may be severely overexposed or exhibit high-contrast shadows. Thanks to our extensive lighting-augmentation strategy during training, the framework remains relatively stable in most of these conditions and preserves high-precision edges; only minor misalignments appear at extreme highlights or overexposed borders, and overall robustness is markedly better than that of conventional methods. The fourth scenario is scratches and micro-cracks. When a wire’s surface has only slight abrasions or shallow cracks, these textures closely resemble true damage contours. The model frequently misclassifies superficial scratches as damaged areas, producing false positives, while very narrow, shallow micro-cracks have low contrast and poor noise resilience and are therefore often missed or incompletely segmented, leading to false negatives.
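As a rough illustration of the photometric jitter referred to above, the following sketch applies a random global gain and gamma change to a frame using NumPy; the jitter ranges and the synthetic frame are assumptions for illustration, not the exact augmentation settings used in training.

```python
import numpy as np

def random_illumination_jitter(image: np.ndarray,
                               gain_range=(0.4, 1.8),
                               gamma_range=(0.5, 2.2)) -> np.ndarray:
    """Apply a random global gain and gamma to mimic illumination changes.

    The ranges are illustrative assumptions, not the exact settings used for
    the experiments reported in this paper.
    """
    gain = np.random.uniform(*gain_range)
    gamma = np.random.uniform(*gamma_range)

    img = image.astype(np.float32) / 255.0
    img = np.clip(img * gain, 0.0, 1.0) ** gamma
    return (img * 255.0).astype(np.uint8)

# Usage with a synthetic frame; replace with a real substation image in practice.
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
augmented = random_illumination_jitter(frame)
```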
Figure 11.
Failure cases across four extreme scenarios, with segmentation results and normalized error maps. GT denotes the binary ground-truth masks.
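For reference, a normalized error map of the kind shown in Figure 11 can be derived from a predicted mask and the binary GT mask as in the sketch below; the toy masks are illustrative, and the exact visualization pipeline used for the figure may differ.

```python
import numpy as np

def normalized_error_map(pred_mask: np.ndarray, gt_mask: np.ndarray) -> np.ndarray:
    """Per-pixel absolute disagreement between prediction and ground truth,
    rescaled to [0, 1]. Both inputs are expected as binary (0/1) arrays of the
    same shape.
    """
    pred = pred_mask.astype(np.float32)
    gt = gt_mask.astype(np.float32)

    error = np.abs(pred - gt)                 # 1 where the masks disagree
    max_err = error.max()
    return error / max_err if max_err > 0 else error

# Usage with toy masks: one false-negative pixel and one false-positive pixel.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[1, 1] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[2, 2] = 1
print(normalized_error_map(pred, gt))
```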
5. Conclusions
In this paper, we proposed a novel framework for damaged wire detection and segmentation in substations under varying lighting conditions. The framework integrates a low-light image enhancement module, an improved YOLOv11n-based detection and segmentation network with a dynamic multi-scale window attention mechanism, and a multi-stage meta-transfer learning strategy to address small-sample limitations and mitigate negative transfer. Extensive experiments across six illumination environments ranging from 10 to 200,000 lux demonstrate that our method significantly improves detection and segmentation accuracy, achieving up to 73% performance gains over baseline models while maintaining real-time inference speed.
In future work, we plan to explore the integration of multi-modal sensing to provide richer contextual information for detection and segmentation tasks. Additionally, we aim to incorporate continual learning strategies to enable long-term adaptation in dynamic and evolving environments. These enhancements will further improve the framework’s robustness, generalization capability, and practical applicability in real-world substation inspection scenarios.
Author Contributions
Conceptualization, H.W. and S.X.; methodology, H.W.; software, H.W.; validation, S.X., H.W. and Y.L.; formal analysis, Y.L. and H.W.; investigation, H.W.; resources, H.W. and S.X.; data curation, H.W. and S.X.; writing—original draft preparation, H.W.; writing—review and editing, H.W. and Y.L.; visualization, S.X.; supervision, Y.L.; project administration, H.W. and Y.L. All authors have read and agreed to the published version of the manuscript.
Funding
This study was funded by the National College Students’ Innovation and Entrepreneurship Training Program Project (No.: 202410488022X).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The full dataset used in this study, along with the code to reproduce all experiments, will be publicly available at https://github.com/wuhan66/DMWA-MMTL (accessed on 9 June 2025) immediately upon paper acceptance.
Conflicts of Interest
The authors declare no conflicts of interest.
Glossary of Acronyms
| Acronym | Full Term |
| --- | --- |
| YOLO | You Only Look Once |
| CNN | Convolutional Neural Network |
| DMWA | Dynamic Multi-scale Window Attention |
| SE | Squeeze-and-Excitation |
| CBAM | Convolutional Block Attention Module |
| ECA | Efficient Channel Attention |
| SRM | Style-based Recalibration Module |
| W-MSA | Window-based Multi-Head Self-Attention |
| SW-MSA | Shifted Window-based Multi-Head Self-Attention |
| TL | Transfer Learning |
| MTL | Meta-Transfer Learning |
| MMTL | Multi-stage Meta-Transfer Learning |
| LP-MSR | Light Prior-based Multi-Scale Retinex |
| MSR | Multi-Scale Retinex |
| MSRCR | Multi-Scale Retinex with Color Restoration |
| LLFLOW | Low-Light Flow |
| BEM | Bayesian Enhancement Method |
| EnlightenGAN | Enlighten Generative Adversarial Network |
| Py-diffusion | Pyramid Diffusion |
| RTX | NVIDIA RTX |
| FLOPs | Floating Point Operations |
| GPU | Graphics Processing Unit |
| IoU | Intersection over Union |
| mIoU | Mean Intersection over Union |
| P | Precision |
| R | Recall |
| mAP50 | Mean Average Precision at 50% IoU |
| Acc | Accuracy |
| F1 | F1-Score |