1. Introduction
With the advancement of technology and the modernization of power systems, traditional methods of power line inspection have gradually been replaced by more efficient and precise Unmanned Aerial Vehicle (UAV)-based inspections [1,2]. UAV-based inspections utilize high-definition cameras, infrared thermography cameras, and other sensors to monitor and collect data on power lines and related equipment in real time [3]. Compared to traditional manual inspections, UAV-based inspections offer several advantages. First, UAVs can reach high altitudes or hazardous areas that are difficult for humans to access, reducing safety risks in the inspection process [4]. Second, UAV-based inspections are highly efficient and precise, enabling coverage of a large area of power infrastructure and completing tasks in a relatively short time. Most importantly, the various sensors mounted on UAVs provide real-time data on the status of electrical equipment, significantly enhancing the intelligence level of operation and maintenance in power systems. As a result, UAV-based inspections have become an indispensable technological tool in the power industry, widely applied in the inspection of power transmission lines, equipment maintenance, and fault diagnosis, among other areas.
In the process of UAV-based inspections, the image data collected needs to be analyzed and interpreted through object detection technology to enable automatic identification and fault location of power equipment [5]. Image-based object detection plays a crucial role in UAV-based inspections, especially for power line inspection tasks. Specifically, we focus on how to utilize object detection technology to accurately detect OPGW (optical fiber composite overhead ground wire) and conventional ground wires in inspection images. OPGW, a key component of modern power transmission systems, serves the dual functions of lightning protection and optical fiber communication, which greatly enhance the stability and intelligence of power systems. Compared to conventional ground wire, OPGW exhibits significant differences in shape, thickness, and material properties. Particularly, on different types of towers (such as tension towers and tangent towers), OPGW and conventional ground wire show distinct wire features. As shown in Figure 1, the wire features of OPGW and conventional ground wires differ on tension towers and tangent towers, and can be classified into four categories. On tension towers, the wire feature of OPGW is a heart-shaped loop, labeled as O1, while the wire feature of the conventional tension tower ground wire is a tension clamp, labeled as D1. On tangent towers, OPGW typically uses preformed suspension clamps as its wire feature, labeled as O2, while the conventional tangent tower ground wire is characterized by a traditional boat-shaped suspension clamp, labeled as D2. By employing object detection technology to accurately recognize and distinguish these wire features in UAV-based inspection images, it is possible to effectively differentiate between OPGW and conventional ground wires, providing reliable data support for power line inspection and maintenance.
Image-based object detection technology plays a crucial role in UAV-based inspections, but challenges persist in its practical use, particularly when it comes to detecting OPGW and conventional ground wires. One major issue is that the wire features of OPGW and conventional ground wires often appear as small objects in images. This can lead to blurring or a loss of detail when captured from a distance, making them difficult to detect. Additionally, power lines are commonly situated in complex environments, where factors like lighting variations, cluttered backgrounds, and overlapping cables further complicate detection. Under different weather or lighting conditions, the wire features of OPGW and conventional ground wires can become tiny and low contrast, rendering traditional image processing methods inefficient and unreliable. Moreover, in cases of limited image resolution, object overlap, occlusion, or partial damage, detection algorithms may fail to identify or misidentify the wires. To enhance detection accuracy and robustness in real-world applications, it is crucial to consider these factors and optimize image processing algorithms and models, ensuring they can effectively recognize OPGW and conventional ground wires in challenging environments.
In recent years, with the rapid development of deep learning technology, object detection algorithms have been widely applied in power equipment inspection, fault detection, and image recognition. Among the various object detection algorithms, the YOLO (You Only Look Once) series has become one of the preferred methods in power equipment inspection due to its high real-time performance, low computational resource requirements, and excellent accuracy. The advantage of the YOLO series lies in its ability to perform object recognition and localization quickly and accurately with minimal computational resources, making it particularly well suited to scenarios such as power equipment inspection, where real-time performance is crucial. However, the traditional YOLO architecture still faces significant limitations when dealing with the wire features of OPGW and conventional ground wires, which are small objects. These limitations manifest in several ways: first, the YOLO algorithm suffers from insufficient feature representation when handling small objects, resulting in the ineffective extraction of features for small-sized wires. Second, YOLO's sensitivity to small objects is limited, especially in complex backgrounds, where the object can easily be confused with the background, leading to a decrease in recognition accuracy. Lastly, YOLO also has certain shortcomings in bounding box regression accuracy, particularly in complex scenes, where the localization error of the bounding boxes tends to be larger.
To address the aforementioned issues, this paper proposes an improved YOLO11-based model for detecting wire features of OPGW and conventional ground wires. The objective is to enhance the model’s detection accuracy and robustness in complex backgrounds through a series of innovative designs. Specifically, the main contributions of this paper include the following aspects:
- (1)
Feature Enhancement Module (FEM): In the backbone network of YOLO11, the original C3K2 module in the sixth layer is replaced by the FEM module. This module enhances the multi-scale feature extraction capability, improving the representation of small structures and, in particular, the feature extraction for small objects such as wire features. As a result, the detection accuracy is significantly improved.
- (2)
Introduction of Shallow Detection Head: A P2 shallow detection head is added to the Head section of the model to enhance its ability to perceive small-scale objects and fine details at the edges. By focusing on features from shallower layers, the P2 detection head effectively improves the model’s ability to recognize small objects (such as wire features of OPGW and conventional ground wires), overcoming the limitations of traditional YOLO architecture in small-object detection.
- (3)
Improvement of Loss Function: In the traditional YOLO model, the detection head uses the Intersection over Union (IoU) loss function. However, in complex backgrounds, the IoU loss function has limited performance in bounding box regression. To address this, this paper replaces the IoU loss function with the Normalized Wasserstein Distance (NWD) loss function. The NWD loss function provides higher accuracy and stability when handling bounding box regression in complex scenes, effectively enhancing the model’s performance in challenging environments.
- (4)
Low computational resource consumption and high detection accuracy: The WFD-YOLO network inherits the advantages of YOLO, maintaining a relatively small model parameter size and low computational resource consumption. Compared with other object detection models, it performs better in detecting wire features, achieving a good balance between computational resource requirements and detection accuracy.
In the M-scale model experiments, the results indicate that under consistent experimental conditions, the proposed method significantly improved detection accuracy. Specifically, the mean Average Precision at a single IoU threshold of 0.50 (mAP@0.5) reached 78.3%, an improvement of 2.3 percentage points, while the mean Average Precision averaged across IoU thresholds uniformly sampled from 0.50 to 0.95 with a step size of 0.05 (mAP@0.5:0.95) achieved 52.0%, an increase of 1.1 percentage points. In addition, the method proposed in this paper inherits the lightweight characteristics of the YOLO network, has relatively few model parameters, and requires less computational power than other object detection models. These experimental results indicate that the proposed method significantly outperforms the traditional YOLO11 model in detection accuracy, generalization ability, and adaptability to complex scenarios. Moreover, its low computational resource requirements make it well suited for OPGW and conventional ground wire inspection tasks in complex environments during UAV-based inspections.
Overall, in response to the stringent energy efficiency requirements for UAV-based inspection scenarios, the WFD-YOLO network achieves ultra-low energy consumption while ensuring high-precision detection of wire features. By optimizing the model architecture, the minimized model parameters of WFD-YOLO effectively reduce the consumption of onboard computing resources. This improvement in computational efficiency enables WFD-YOLO to achieve an ideal balance between detection accuracy and energy consumption, providing a practical solution for sustainable UAV-based inspections.
3. Methods
3.1. WFD-YOLO Model Overview
The WFD-YOLO network model we propose uses YOLO11 as the baseline and is designed for fast and accurate detection of small wire features in OPGW and conventional ground wire images. By optimizing and improving the backbone network, loss function, and detection head of the YOLO11 model, we significantly enhance the recognition accuracy and robustness for small objects, particularly under complex backgrounds and adverse lighting conditions. The overall architecture of the WFD-YOLO model, shown in Figure 2, delivers significantly improved performance in detecting small wire features while maintaining a relatively small number of model parameters and low computational resource requirements.
First, to address the limitations of the traditional YOLO11 backbone network in handling small objects, we replaced the C3K2 module in the 6th layer with the Feature Enhancement Module (FEM). The FEM adopts a multi-branch convolution structure and dilated convolutions, which effectively enhance small-object feature extraction capabilities without significantly increasing the computational load. This enables the model to extract features from richer local information, improving the representation of small objects. The introduction of the FEM allows the network to capture the features of small-sized objects such as wire features of OPGW and conventional ground wires in greater detail, avoiding the problem of inaccurate small-object detection due to insufficient feature representation in the traditional YOLO model.
Secondly, to further improve the accuracy and fine detail perception of small object detection, we introduced an additional shallow object detection head based on the P2 feature map, built on top of the three-scale detection heads in YOLO11. The P2 detection head focuses on processing relatively shallow feature maps and is better at capturing local features of small objects, addressing the issue of feature loss for small objects in deep feature maps due to the excessive depth of convolutional layers. This structural improvement allows the WFD-YOLO model to significantly enhance detection accuracy when dealing with small objects, particularly in the edges and details of the object, resulting in more precise detection outcomes.
Moreover, to further enhance the model’s performance in complex scenarios, we improved the loss function. The traditional YOLO model uses an Intersection over Union (IoU)-based loss function, which can be influenced by bounding box regression errors, especially when there is low distinguishability between the object and the background. Therefore, we replaced the traditional IoU loss function with the Normalized Wasserstein Distance (NWD) loss function. The NWD loss function is more accurate in measuring the distance between the object and the background, particularly in complex backgrounds, effectively reducing bounding box regression errors and improving the accuracy and robustness of small-object detection. The introduction of NWD makes the WFD-YOLO model more stable in complex environments, enabling it to adaptively adjust to various scenarios and avoid detection failures or false positives that are common in traditional methods under complex backgrounds.
3.2. Feature Enhancement Module
The backbone network, as the fundamental feature extraction module of YOLO11, adopts a lightweight design concept aimed at reducing computational complexity while maintaining powerful feature extraction capabilities. In the tasks of detecting small wire features in images of OPGW and conventional ground wires, the backbone network, in the early stage of feature extraction, faces challenges due to its limited receptive field and insufficient feature richness. This makes it difficult to accurately capture the critical information of small objects and effectively distinguish them from the background. Therefore, a targeted Feature Enhancement Module needs to be designed to overcome these problems.
In the traditional YOLO11 backbone network, key features are extracted by progressively downsampling the input image layer by layer and are then transmitted to the neck via the C3K2 modules at layers 4 and 6, as well as the C2PSA module at layer 10. Among these, the key features of small objects are mainly transmitted by the C3K2 module located at layer 6, whose structure is shown in Figure 3. When the C3k parameter is set to True, the intermediate operation of the C3K2 module involves the parallel concatenation of feature maps from multiple C3K modules; when the C3k parameter is False, the intermediate operation consists of the parallel concatenation of feature maps from multiple Bottleneck modules. The structures of the C3K and Bottleneck modules are shown in Figure 4.
In wire feature detection tasks, the C3K2 module enhances feature richness by concatenating convolutional feature maps of different sizes in parallel. However, its ability to capture local contextual information is limited, especially when dense backgrounds resemble small objects, in which case the C3K2 module may struggle to distinguish the object from the background. Moreover, although the C3K2 module strengthens feature representation through the parallel concatenation of multiple feature maps, its relatively small convolution kernels and lack of dilated convolutions result in a small receptive field when processing small objects. This means the C3K2 module cannot effectively expand the receptive field to capture the global context of small objects, leading to significant limitations in small-object detection.
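The receptive-field point above can be made concrete with the standard dilation arithmetic: a k × k kernel with dilation rate d spans an effective window of k + (k − 1)(d − 1) pixels. A short sketch (the helper name is ours, for illustration):

```python
def effective_kernel(k: int, d: int) -> int:
    """Spatial extent covered by a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

# A plain 3x3 kernel (dilation 1), as used inside C3K2, sees a 3x3 window:
print(effective_kernel(3, 1))  # 3
# The same kernel with dilation rate 5 sees an 11x11 window at identical
# parameter cost, which is how dilated branches widen the receptive field:
print(effective_kernel(3, 5))  # 11
```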
To address the aforementioned issues, we introduce the Feature Enhancement Module (FEM) [23], which incorporates dilated convolutions and a multi-branch convolutional structure. By focusing on the two core dimensions of increasing feature richness and expanding the receptive field, the module accurately captures the critical information of small objects and effectively distinguishes them from the background. Its network structure is shown in Figure 5 and is primarily composed of the following four branches:
$$\begin{aligned} F_1 &= \mathrm{DConv}_5\big(\mathrm{Conv}_{1\times 1}(F)\big),\\ F_2 &= \mathrm{DConv}_5\big(\mathrm{Conv}_{3\times 1}(\mathrm{Conv}_{1\times 3}(\mathrm{Conv}_{1\times 1}(F)))\big),\\ F_3 &= \mathrm{DConv}_5\big(\mathrm{Conv}_{1\times 3}(\mathrm{Conv}_{3\times 1}(\mathrm{Conv}_{1\times 1}(F)))\big),\\ F_4 &= \mathrm{Conv}_{1\times 1}(F), \end{aligned}$$

where $\mathrm{Conv}_{1\times 1}$, $\mathrm{Conv}_{1\times 3}$, and $\mathrm{Conv}_{3\times 1}$ represent the standard convolution operations with kernel sizes of $1\times 1$, $1\times 3$, and $3\times 1$, respectively. $\mathrm{DConv}_5$ means the dilated convolution operation with a dilation rate of 5. $F$ is the input feature map. The final output feature map is obtained by combining the results of the four branches:

$$Y = \mathrm{Concat}(F_1, F_2, F_3) \oplus F_4$$

where $\mathrm{Concat}(\cdot)$ is the feature map concatenation operation and $\oplus$ represents the elementwise addition of feature maps. $F_1$, $F_2$, $F_3$, and $F_4$ represent the output feature maps of the four branches after standard and dilated convolutions, and $Y$ is the output feature map of the FEM.
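As a concrete reading of this multi-branch structure, the following PyTorch sketch assembles a four-branch block under our own assumptions about the exact layer ordering and channel widths (the class name and internal details are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

class FEMSketch(nn.Module):
    """Illustrative four-branch feature enhancement block (a sketch, not
    the exact FEM): three branches end in a 3x3 convolution with dilation
    rate 5 to widen the receptive field, and a 1x1 shortcut branch is
    merged back in by elementwise addition."""

    def __init__(self, c: int):
        super().__init__()

        def branch(*kernels):
            layers = [nn.Conv2d(c, c, 1)]
            for k in kernels:
                layers.append(nn.Conv2d(c, c, k, padding=(k[0] // 2, k[1] // 2)))
            # Dilated 3x3 with dilation 5 -> padding 5 keeps the spatial size.
            layers.append(nn.Conv2d(c, c, 3, padding=5, dilation=5))
            return nn.Sequential(*layers)

        self.b1 = branch()                  # 1x1 -> dilated 3x3
        self.b2 = branch((1, 3), (3, 1))    # 1x1 -> 1x3 -> 3x1 -> dilated 3x3
        self.b3 = branch((3, 1), (1, 3))    # 1x1 -> 3x1 -> 1x3 -> dilated 3x3
        self.b4 = nn.Conv2d(c, c, 1)        # 1x1 shortcut branch
        self.fuse = nn.Conv2d(3 * c, c, 1)  # project the concatenation back to c

    def forward(self, x):
        cat = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        return self.fuse(cat) + self.b4(x)  # concat branches, add shortcut

x = torch.randn(1, 64, 40, 40)
print(FEMSketch(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```

Because every branch preserves the spatial resolution, the outputs can be concatenated and added without any resampling, so the block is a drop-in replacement for a same-channel module such as C3K2.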
3.3. P2 Shallow Detection Head
In the OPGW and conventional ground wire feature detection tasks, wire features are typically very small and easily obscured by background information, presenting a significant challenge for small-object detection. Since the wire features of OPGW and conventional ground wires generally have small sizes and low contrast, they are prone to being covered or confused with the surrounding background, especially in complex backgrounds or under significant lighting variations. Inspired by the work in [24], we introduced an additional P2 shallow detection head in the Head section of the YOLO11 model.
In the traditional YOLO11 model, the output layers include feature maps at P3, P4, and P5, which correspond to object detection at different scales. For an input image of size 640 × 640, the YOLO11 backbone progressively halves the resolution through successive stride-2 convolutional stages, yielding feature maps of 80 × 80 (P3, after three downsampling operations), 40 × 40 (P4, after four), and 20 × 20 (P5, after five). As the network depth increases, the spatial resolution of the feature maps progressively decreases, and the final P5 layer has a resolution of only 20 × 20. Consequently, the model performs poorly when detecting very small objects, especially the heart-shaped loop feature of OPGW and the wire features of conventional ground wires, because as the number of convolutional layers increases, the features of small objects become increasingly blurred and may even vanish in the higher-level feature maps.
In the feature maps from P3 to P5, the capability to detect small objects gradually decreases, primarily due to the reduced resolution of the feature maps, which makes it difficult for the model to capture the fine features of small-scale objects. Assuming the input image has a size of $H \times W$, the size $S_k$ of the feature map at each downsampling layer, after applying the convolutional operations, is given by the following formula:

$$S_k = \frac{H}{2^k} \times \frac{W}{2^k}$$

where $H$ is the height of the input image, $W$ is the width of the input image, $k$ represents the layer number, and $S_k$ is the size of the feature map obtained from the $k$-th layer.

For an input image with size $640 \times 640$, the feature map sizes at the three detection levels are

$$S_3 = 80 \times 80, \quad S_4 = 40 \times 40, \quad S_5 = 20 \times 20.$$
It can be seen that as the feature map size decreases, the model's ability to detect small objects drops significantly. The traditional YOLO11 model is therefore spatially limited in detecting OPGW and conventional ground wire features, especially when the object lies against a complex, highly variable background.
To address this issue, we introduced a new P2 detection head into the original YOLO11 model, adding a shallow detection branch (P2) that enables the model to focus on smaller regions of the image. The P2 feature map has a much higher resolution than the feature maps at P3, P4, and P5, so it retains more fine-grained object information and enables more precise detection of small-scale objects, even in the presence of complex backgrounds or background noise.
Assuming the size of the feature map at layer P2 is $S_2$, we can express it as

$$S_2 = \frac{H}{2^2} \times \frac{W}{2^2} = 160 \times 160 \quad \text{for a } 640 \times 640 \text{ input.}$$
Under this setting, the resolution of the P2 feature map is higher than that of P3 (80 × 80) and P4 (40 × 40), enabling it to retain more detailed information. This is particularly effective for handling small wire features of OPGW and conventional ground wires, significantly improving detection accuracy. By introducing the shallow feature map of the P2 layer into the detection head, the network can better utilize local image information, enhancing its ability to recognize small objects.
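The resolutions above follow directly from the stride relation; a few lines of Python (the helper name is ours) confirm the P2–P5 sizes for a 640 × 640 input:

```python
def feature_map_size(h: int, w: int, k: int) -> tuple:
    """Spatial size of the stride-2**k pyramid level for an h x w input."""
    return h // 2 ** k, w // 2 ** k

# P2 keeps four times the area of P3, which is why a shallow P2 head
# preserves noticeably more detail for tiny wire features.
for name, k in [("P2", 2), ("P3", 3), ("P4", 4), ("P5", 5)]:
    print(name, feature_map_size(640, 640, k))
# P2 (160, 160)
# P3 (80, 80)
# P4 (40, 40)
# P5 (20, 20)
```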
3.4. NWD Loss Function
The traditional YOLO model series uses the Intersection over Union (IoU) as the loss function, and its variants (e.g., GIoU, DIoU, CIoU) modify this loss. The specific formula is as follows:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ is the predicted bounding box and $B$ represents the ground truth bounding box.
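A quick numerical illustration (the helper is ours) of how IoU behaves at small scales: the same 4-pixel horizontal shift that barely affects a large box sharply reduces the IoU of a box the size of a typical wire feature.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# A 4-pixel shift on a 16x16 box (roughly wire-feature scale) ...
small = iou((0, 0, 16, 16), (4, 0, 20, 16))
# ... versus the same shift on a 128x128 box.
big = iou((0, 0, 128, 128), (4, 0, 132, 128))
print(round(small, 3), round(big, 3))  # 0.6 0.939
```

Once the small box shifts past its own width, the IoU drops to exactly zero, which is the vanishing-gradient regime discussed next.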
However, this region-based Intersection over Union (IoU) metric has significant limitations in small-object scenarios. On the one hand, small objects occupy an extremely small proportion at the pixel level, and even slight positional shifts can cause a drastic decrease in IoU, potentially even to zero, resulting in vanishing gradients and unstable training. On the other hand, IoU only measures the overlap between the predicted and ground truth bounding boxes, lacking the ability to continuously model deviations in the position and size of the object center, making it difficult to meet the fine localization requirements of tasks such as detecting small wire features in OPGW and conventional ground wire images. To address this, we introduce the Normalized Gaussian Wasserstein Distance (NWD) [25] as the core loss function for bounding box regression within the YOLO11 framework. From the perspective of probability distribution matching, NWD models bounding boxes as two-dimensional Gaussian distributions. By calculating the Wasserstein distance between the predicted and ground truth distributions, NWD builds a continuous, differentiable, and position-sensitive optimization objective, fundamentally improving the accuracy and robustness of small-object detection. The NWD loss function is derived as follows.
Let the predicted bounding box and the ground truth bounding box be represented as

$$A = (cx_a, cy_a, w_a, h_a), \quad B = (cx_b, cy_b, w_b, h_b).$$

Convert them into two-dimensional Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$, defined as follows:

$$\mathcal{N}_a = \mathcal{N}(\mu_a, \Sigma_a), \quad \mathcal{N}_b = \mathcal{N}(\mu_b, \Sigma_b), \qquad \mu = \begin{bmatrix} cx \\ cy \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix},$$

where $\mu$ represents the center position of the bounding box, and the covariance matrix $\Sigma$ reflects the width and height distribution, capturing the shape information of the bounding box.
Once $\mathcal{N}_a$ and $\mathcal{N}_b$ are obtained by converting the boxes to two-dimensional Gaussian distributions, the second-order Wasserstein distance $W_2^2(\mathcal{N}_a, \mathcal{N}_b)$ between the two Gaussian distributions can be calculated using the following formula:

$$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\|\mu_a - \mu_b\right\|_2^2 + \left\|\Sigma_a^{1/2} - \Sigma_b^{1/2}\right\|_F^2$$

where $\|\cdot\|_2$ represents the Euclidean norm and $\|\cdot\|_F$ represents the Frobenius norm. These norms measure the differences in center position and in width and height, respectively.
However, the Wasserstein distance is inherently a distance metric, and its range is not within [0, 1], which is not conducive to loss normalization and network convergence. To tackle this, an exponential normalization strategy is introduced, mapping the Wasserstein distance to a similarity metric similar to IoU. The NWD is defined as follows:
$$\mathrm{NWD}(\mathcal{N}_a, \mathcal{N}_b) = \exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{c}\right)$$

where the constant $c$ is used to control scale invariance and is typically set to the empirical value of the average diagonal length of the object boxes in the dataset.
Finally, the loss function based on the NWD is defined as

$$L_{\mathrm{NWD}} = 1 - \mathrm{NWD}(\mathcal{N}_a, \mathcal{N}_b).$$
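The full chain above (box → Gaussian → Wasserstein distance → normalized loss) fits in a few lines of NumPy. This is an illustrative sketch in which the function name and the value of c are placeholders, not values from the paper:

```python
import numpy as np

def nwd_loss(box_a, box_b, c=12.8):
    """NWD-based regression loss for boxes given as (cx, cy, w, h).

    Follows the formulation above: each box becomes a 2-D Gaussian with
    mean (cx, cy) and covariance diag(w^2/4, h^2/4). The constant c is a
    dataset-dependent scale; the default here is an arbitrary placeholder.
    """
    ca, cb = np.array(box_a[:2], float), np.array(box_b[:2], float)
    sa = np.array([box_a[2] / 2, box_a[3] / 2], float)  # sqrt of cov diagonal
    sb = np.array([box_b[2] / 2, box_b[3] / 2], float)
    # Squared 2-Wasserstein distance between the two Gaussians:
    # ||mu_a - mu_b||^2 + ||Sigma_a^(1/2) - Sigma_b^(1/2)||_F^2
    w2_sq = np.sum((ca - cb) ** 2) + np.sum((sa - sb) ** 2)
    nwd = np.exp(-np.sqrt(w2_sq) / c)
    return 1.0 - nwd

# Identical boxes give zero loss; a shifted box gives a smooth, bounded loss
# even once the two boxes no longer overlap at all.
print(nwd_loss((8, 8, 16, 16), (8, 8, 16, 16)))  # 0.0
print(round(nwd_loss((8, 8, 16, 16), (28, 8, 16, 16)), 2))
```

Unlike the IoU loss, the result varies smoothly with the center offset, so gradients remain informative for tiny, non-overlapping boxes.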
5. Discussion
Based on experimental results, WFD-YOLO successfully improves the accuracy of small-object detection while maintaining low computational resource requirements through a series of innovative improvements within the YOLO framework.
First, the model introduces minimal computational overhead to enhance small-object features in images without increasing the complexity of model deployment. As shown in the experimental results, our model achieves higher accuracy than comparable real-time detectors without a significant increase in computational demand. This improvement is closely linked to the Feature Enhancement Module, which replaces the original C3K2 module. Compared to C3K2, the FEM has relatively low complexity but effectively increases feature richness and expands the receptive field, thereby enhancing the feature representation of small wire features.
Second, our model improves the detection of small wire features that have been occluded. These wire features often occupy a small pixel area in images and can easily be overwhelmed by large-scale features on high-level feature maps. The additional detection head on the shallow feature maps captures more local information, significantly improving the detection accuracy of small wire features.
Finally, compared to other YOLO models, WFD-YOLO has more refined small-object detection, which enhances its ability to detect wire features. To further enhance bounding box regression accuracy, we replaced the original IoU-based loss function with NWD loss. Unlike traditional IoU-based losses, NWD loss measures the position and scale differences between the predicted and ground truth boxes from a spatial distribution perspective. This loss function is more capable of distinguishing subtle positional deviations, which is especially beneficial in small-object detection scenarios, where slight misalignments can lead to missed detections.
Despite these advancements, WFD-YOLO still faces certain limitations in extreme scenarios. As shown in Figure 9, when objects occupy an extremely small pixel area or are heavily occluded by complex environmental elements, the sparsity of available semantic information can lead to frequent missed detections. Furthermore, under conditions of severe image blur or low signal-to-noise ratios, the model may exhibit false positives, such as misidentifying background noise as wire features or failing to accurately distinguish between specific wire categories.
In summary, compared to other object detection models, WFD-YOLO inherits the advantages of the YOLO framework, maintaining a smaller number of parameters and lower computational resource requirements. Through our improvements, WFD-YOLO achieves higher detection accuracy in OPGW and conventional ground wire feature detection tasks, successfully balancing low computational power with high detection precision.