EDF-YOLOv5: An Improved Algorithm for Power Transmission Line Defect Detection Based on YOLOv5

Abstract: Detecting defects in power transmission lines in unmanned aerial inspection images is crucial for evaluating the operational status of outdoor transmission equipment. This paper presents a defect recognition method called EDF-YOLOv5, which is based on YOLOv5s, to enhance detection accuracy. Firstly, the EN-SPPFCSPC module is designed to improve the algorithm's ability to extract feature information, thereby enhancing the detection performance for small target defects. Secondly, the algorithm incorporates a high-level semantic feature extraction module, DCNv3C3, which improves its ability to generalize to defects of different shapes. Lastly, a new bounding box loss function, Focal-CIoU, is introduced to enhance the contribution of high-quality samples during training. The experimental results demonstrate that the enhanced algorithm achieves a 2.3% increase in mean average precision (mAP@0.5) for power transmission line defect detection and a 0.9% improvement in F1-score, and operates at a detection speed of 117 frames per second. These findings highlight the superior performance of EDF-YOLOv5 in detecting power transmission line defects.


Introduction
The electric power industry is rapidly developing due to the gradual rise in electricity demand, which brings substantial challenges in maintaining the existing transmission lines [1]. Furthermore, transmission lines span extensive areas and endure prolonged exposure to the elements, rendering them susceptible to various defects, including insulator damage, insulator pollution flashovers, and bird nest attachments [2]. These issues can be exacerbated by intricate external environmental conditions, posing significant threats to the regular operation of transmission lines [3-5].
To prevent faults, the initial approach involved manual patrolling to inspect power transmission line defects. Nevertheless, these conventional manual patrolling methods are not only time-consuming and labor-intensive but also expose the patrolling personnel to specific risks [6,7].
With the advancement of society and technology, unmanned aerial vehicles (UAVs), commonly referred to as drones, have gained widespread application in the patrol inspection of transmission lines [8]. This inspection method utilizes target detection algorithms to identify defects in images captured by drones, representing a more efficient and technologically advanced approach to detecting and maintaining transmission lines [9,10].
However, it also presents new challenges. For instance, the varied shooting angles of drones and the complex backgrounds of captured targets can both influence the detection accuracy of target detection algorithms [11]. Therefore, the urgent issue at hand is how to enhance the detection accuracy of target detection algorithms for images of power transmission line defects captured by drones [12].
Currently, mainstream target detection algorithms are categorized into two main classes based on whether they involve a separate candidate box generation stage: two-stage target detection algorithms and one-stage target detection algorithms [13]. Prominent two-stage target detection algorithms include Fast R-CNN (Fast Region-Convolutional Neural Network) [14] and Faster R-CNN (Faster Region-Convolutional Neural Network) [15], among others. Two-stage target detection algorithms have demonstrated excellent detection performance in practical applications, but their complexity results in slower detection speeds, limiting their real-time applicability [16].
Mainstream one-stage target detection algorithms include the YOLO (You Only Look Once) series and the SSD (Single Shot MultiBox Detector) series [17]. One-stage target detection algorithms eliminate the separate stage for candidate box generation, leading to more streamlined models. In this series of algorithms, feature extraction, candidate box regression, and classification all take place within the same convolutional network, resulting in faster detection speeds compared to their two-stage counterparts [18,19].
In recent years, the field of transmission line defect detection through drone patrol inspections has garnered significant interest. Given the small size and fast detection speed of YOLO series algorithms, they have found wide application in this domain. However, YOLO series algorithms have been associated with lower detection accuracy in transmission line defect detection. To address this issue, several scholars have proposed improvements. Hu et al. [20] introduced a BiFPN module for feature fusion, effectively enhancing the defect detection capabilities of the YOLOv5s algorithm. Additionally, to overcome the challenge of detecting small defects on insulators, they integrated an SPD module. Han et al. [21] introduced SA-Net and a multi-head output to improve YOLOv4's ability to recognize small targets in insulator defects. Bao et al. [22] improved YOLOv5 by adding a coordinate attention module, allowing it to pay more attention to the features of vibration dampers and insulators. Qiu et al. [23] optimized YOLOv4 using the MobileNet lightweight convolutional neural network to enhance the algorithm's detection speed for defects on transmission lines. Despite the numerous improvements made to YOLO series algorithms in recent years, there have been relatively few enhancements specifically targeting the YOLOv5s algorithm in the field of transmission line defect detection. Therefore, there is still significant room for improvement in the accuracy of the YOLOv5s algorithm for detecting transmission line defects, particularly for small target defects and defects with substantial shape variations.
To address the issues mentioned above, we have proposed an improved version of YOLOv5 for detecting defects in power transmission lines. The main contributions of this paper are as follows:

1. We have designed a module with enhanced feature extraction capabilities, referred to as EN-SPPFCSPC. This module outperforms Spatial Pyramid Pooling-Fast and Fully Connected Spatial Pyramid Convolution (SPPFCSPC), providing higher detection accuracy while maintaining a lower parameter count. EN-SPPFCSPC effectively leverages feature information, reducing the loss of detailed information and significantly enhancing detection accuracy.

2. We have introduced a high-level semantic information extraction module, DCNv3C3, with stronger adaptive capabilities. This module is employed to replace the C3 module in the YOLOv5s algorithm's neck. This improvement effectively boosts the algorithm's generalization ability, enabling it to perform better in various scenarios.

3. To expedite model convergence and enhance detection accuracy, we have proposed the Focal-CIoU loss function. This loss function aims to augment the gradient contribution of high-quality samples during the training process.

Network Architecture
The YOLOv5 algorithm was developed by Glenn Jocher and his team in 2020. This algorithm effectively balances detection speed and accuracy, making it widely applicable in the field of object detection [24]. The YOLOv5 algorithm comprises three main components: the backbone, the neck, and the detection head [25]. Input images first undergo feature extraction in the backbone, generating three useful feature layers that are then passed to the neck network. The neck module performs feature fusion on the incoming feature layers. The top-down part of the neck network achieves feature fusion across different scales through up-sampling and fusion with coarser-grained feature maps. Subsequently, the bottom-up part mainly employs convolutional layers to fuse features across different scales. Finally, the feature maps from both the top-down and bottom-up parts are combined to obtain the ultimate feature map [26]. The detection head serves as the classifier and regressor of the YOLOv5 model, assessing the features from the neck network to detect target objects [27].
In this paper, the YOLOv5s algorithm is enhanced to produce EDF-YOLOv5, which offers increased accuracy in detecting defects on transmission lines, as shown in Figure 1. Firstly, in the YOLOv5s backbone network, we introduce the EN-SPPFCSPC module to replace the Spatial Pyramid Pooling-Fast (SPPF) module. The EN-SPPFCSPC module comprehensively utilizes feature information, preventing the significant loss of detailed information and thereby enhancing the algorithm's capability to detect small target defects in power transmission lines. Secondly, we replace the C3 modules in the algorithm's neck network with DCNv3C3. This modification effectively enhances the algorithm's ability to generalize across various shapes of power transmission line defects. Lastly, we introduce the Focal-CIoU loss function, replacing CIoU. This change enhances the gradient contributions of high-quality samples during the training process, effectively improving the algorithm's convergence speed and detection accuracy.

EN-SPPFCSPC Module
In this paper, inspired by the SPPFCSPC structure [28], we introduce EN-SPPFCSPC, which has improved feature extraction capabilities and effectively prevents the loss of detailed information in defective targets. The advantages of the EN-SPPFCSPC module primarily stem from two key improvements. Firstly, we employ three SoftPool [29] operations with 5 × 5 sub-regions as a replacement for the max-pooling in SPPFCSPC. This substitution effectively mitigates the substantial loss of fine-grained details. SoftPool, applied to the input feature layer, activates feature points using a softmax-weighted approach within the pooling region. It then computes the weighted sum of all activated points within the pooling area, ultimately yielding the activation output of the pooling neighborhood. The principle of SoftPool is depicted in Figure 2, where, assuming a pooling kernel size of 2 × 2, we calculate the weight value $w_i$ for each pixel point $p_i$ ($i = 1, 2, 3, 4$) within the feature region $R$. If the pixel value of each point $p_i$ is $a_i$, the activation weight is determined using Equation (1).

$$w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}} \tag{1}$$
Subsequently, the weights are assigned to the respective pixel points and subjected to weighted summation, yielding the SoftPool output result, $\tilde{a}$, as shown in Equation (2).

$$\tilde{a} = \sum_{i \in R} w_i a_i \tag{2}$$
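The softmax weighting and weighted summation described above can be illustrated with a minimal sketch for a single pooling region. The function names and the example values are hypothetical and chosen only for illustration; this is not the implementation used in the experiments.

```python
import math

def softpool_region(activations):
    """SoftPool over one pooling region R, given as a flat list of values a_i."""
    # Shift by the maximum before exponentiating for numerical stability;
    # the softmax weights w_i are unchanged by this shift.
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax weights, as in Equation (1)
    # Weighted sum of all points in the region, as in Equation (2).
    return sum(w * a for w, a in zip(weights, activations))

def max_pool_region(activations):
    """Max-pooling keeps only the largest value and discards the rest."""
    return max(activations)

region = [1.0, 2.0, 3.0, 4.0]  # hypothetical 2 x 2 region, flattened
soft = softpool_region(region)
hard = max_pool_region(region)
mean = sum(region) / len(region)
```

For this region the SoftPool output lies between the mean and the maximum, because every point contributes but larger activations receive larger weights.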

[Figure 2. Principle of SoftPool (SoftPool activation map).]
SoftPool considers every feature point within the pooling neighborhood, reducing information loss while preserving the overall receptive field. Moreover, each input feature receives a gradient, contributing to enhanced training effectiveness.
Furthermore, to effectively reduce the number of parameters, EN-SPPFCSPC introduces a multi-group mechanism, employing grouped convolution in place of conventional convolution. The input feature maps are divided into four groups along the channel dimension, and convolution operations are then performed on each of these groups. The EN-SPPFCSPC structure is illustrated in Figure 3. The input feature map undergoes feature extraction through two branches. One branch contains only a GCBR module, which is composed of grouped convolution, batch normalization, and the ReLU activation function. The GCBR module in this branch utilizes a 1 × 1 convolutional kernel for grouped convolution with a group number of 4, effectively reducing the number of parameters and facilitating inter-channel information interaction. The other branch consists of three serial SoftPool operations with pooling kernels of size 5 × 5, along with multiple GCBR modules. In this branch, the feature map sequentially undergoes three SoftPool operations, followed by concatenation of the outputs from the three SoftPools and the output from the third GCBR along the channel dimension. This approach effectively controls the parameter count and maximizes the utilization of fine-grained feature information in the feature map. The EN-SPPFCSPC module, with its ability to fully integrate feature information, significantly enhances the algorithm's capability to detect small defective targets.
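The parameter saving from grouped convolution can be checked with a short calculation. The sketch below counts the weights of a bias-free convolution; the channel width of 512 is a hypothetical value chosen for illustration and is not taken from the network configuration.

```python
def conv_params(c_in, c_out, kernel_size, groups=1):
    """Number of weights in a bias-free 2-D convolution layer."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the group number")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kernel_size * kernel_size

# Hypothetical channel widths; the group number of 4 follows the GCBR description.
standard = conv_params(512, 512, kernel_size=1)
grouped = conv_params(512, 512, kernel_size=1, groups=4)
print(standard, grouped)
```

With four groups, the 1 × 1 convolution needs one quarter of the weights of its ungrouped counterpart.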

DCNv3C3 Module
YOLOv5s employs fixed geometric structures for its convolution kernels, inherently limiting its ability to model geometric transformations. Therefore, in this paper, we combine the C3 module with Deformable Convolution v3 (DCNv3) [30] to propose the DCNv3C3 module, which adapts better to changes in the shape of the target object. DCNv3 adjusts the sampling positions of the convolution through position-offset maps, allowing the sampling points to better focus on the target objects. Simultaneously, it adjusts the weights of the sampling points through modulation maps. Therefore, compared to regular convolution, DCNv3 exhibits superior generalization capabilities, enabling it to adaptively locate and activate target units. A visual comparison of the sampling effects between regular convolution and DCNv3 is shown in Figure 4.

Normal convolution employs a fixed convolution kernel with $D$ sampling points to sample the input feature map $X \in \mathbb{R}^{N \times C \times W \times H}$. The final output is obtained by weighting the pixel values at the sampling points with the kernel weights. For example, with a 3 × 3 convolution kernel, taking the center position $p_0$ as the reference, the relative position of each sampling point can be represented as $p_d \in \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$. The mathematical expression for normal convolution is provided in Equation (3):

$$y(p_0) = \sum_{d=1}^{D} w(p_d) \cdot X(p_0 + p_d) \tag{3}$$

In the equation, $y(p_0)$ represents the sampled output of the convolution kernel, $w(p_d)$ represents the projection weight of the $d$-th sampling point of the convolution kernel, $p_0$ is the center position of the convolution kernel, and $X(p_0 + p_d)$ represents the pixel value of the input feature map $X$ at the location of the corresponding sampling point $p_d$.
DCNv3 is depicted in Figure 5. With $k$ denoting the kernel size, the implementation of DCNv3 can be divided into two main steps. The first step involves conducting depth-wise and point-wise inferences on the input feature map $X \in \mathbb{R}^{N \times C \times W \times H}$, resulting in the generation of position-offset maps $\Delta P \in \mathbb{R}^{N \times W \times H \times 2L}$ and modulation maps $M \in \mathbb{R}^{N \times C \times W \times L}$, where $L = Gk^2$ and $G$ represents the number of groups. In the second step, $\Delta P$ and $M$ play flexible roles in the feature extraction procedure by focusing on the sampling object and modulating the weights, respectively. More detailed explanations with formulas are given in the following.
In the first step, the depth-wise inference, as indicated by Equation (4), together with the channel-wise operation, is applied to the input $X$, which results in $X_{dw}$:

$$X_{dw} = \mathrm{GELU}(\mathrm{BN}(\mathrm{Trans}(\mathrm{DConv}(X)))) \tag{4}$$

where DConv(·) is the depth-wise convolution and Trans(·) converts channel dimensions to the last dimension. BN and GELU denote the normalization and activation function, respectively.
In the point-wise inference, as illustrated by Equations (5) and (6), a fully connected approach is employed to enable the sharing of projection weights among sampling points, ultimately computing the position offsets and modulation factors.

$$\Delta P = \mathrm{Linear}_{share}(X_{dw}) \tag{5}$$

$$M = \mathrm{Softmax}(\mathrm{Linear}_{share}(X_{dw})) \tag{6}$$

where $\mathrm{Linear}_{share}(\cdot)$ denotes the shared linear layer applied to each sampling point. Note that $M$ passes through a softmax function and is therefore stable in the $k \times k$ dimension.
In the second step, grouped convolution is applied to the input $X$. During the convolution operation for each group, the sampling points of the conventional convolution kernel are adjusted using the position offsets and modulation factors, so that the sampling points focus better on the target features. The output for any pixel of the input feature map is expressed in Equation (7):

$$y(p_0) = \sum_{g=1}^{G} \sum_{d=1}^{D} w_{gd} \, m_{gd} \, X_g(p_0 + p_d + \Delta p_{gd}) \tag{7}$$

In the equation, $G$ represents the number of groups, $g$ indexes the $g$-th group, $D$ is the number of sampling points of the kernel, $w_{gd}$ represents the projection weight for the $d$-th sampling position of the convolution kernel used for sampling in the $g$-th group, and $m_{gd} \in M$ represents the modulation factor of the $d$-th sampling point in the $g$-th group. $\Delta p_{gd} \in \Delta P$ represents the position offset of the $d$-th sampling point in the $g$-th group, and $X_g$ denotes the part of the input feature map assigned to the $g$-th group.
DCNv3 not only exhibits enhanced geometric transformation capabilities but also, through the application of grouped convolution, divides the spatial aggregation process into $G$ groups. This effectively controls the parameter count and introduces diverse spatial aggregation patterns to the sampling process, thereby providing stronger feature information for downstream tasks.
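To make the role of the offsets and modulation factors in Equation (7) concrete, the following sketch computes one output value for a single group and a single channel with a 3 × 3 kernel. Bilinear interpolation at fractional positions and zero padding outside the map are assumptions of this sketch, and all names and values are hypothetical; it is not the DCNv3 operator itself.

```python
import math

def bilinear(feature, y, x):
    """Sample a 2-D list at a fractional (y, x); positions outside the map count as zero."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def deform_sample(feature, center, kernel_w, offsets, modulation):
    """One output value for one group: sum_d w_d * m_d * X(p0 + p_d + dp_d)."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # p_d of a 3 x 3 kernel
    cy, cx = center
    out = 0.0
    for (dy, dx), w, (oy, ox), m in zip(grid, kernel_w, offsets, modulation):
        out += w * m * bilinear(feature, cy + dy + oy, cx + dx + ox)
    return out

# Hypothetical 5 x 5 feature map with value 5 * row + column.
feature = [[5.0 * r + c for c in range(5)] for r in range(5)]
ones = [1.0] * 9

# Zero offsets and unit modulation reduce to normal convolution.
plain = deform_sample(feature, (2, 2), ones, [(0.0, 0.0)] * 9, ones)
# Shifting every sampling point half a pixel downwards changes what is sampled.
shifted = deform_sample(feature, (2, 2), ones, [(0.5, 0.0)] * 9, ones)
# Uniform modulation factors that sum to one turn the output into a local average.
averaged = deform_sample(feature, (2, 2), ones, [(0.0, 0.0)] * 9, [1.0 / 9] * 9)
```

With zero offsets and unit modulation the result equals the normal convolution output, which shows that the deformable form is a generalization of the fixed sampling grid.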
The structure of DCNv3C3 is illustrated in Figure 6, where DCNv3 is introduced into the C3 module. In the original C3 module, the BottleNeck part consists of two CBS structures and one residual connection. In the improved version, one of the CBS structures is replaced by a DBS structure, which employs the more versatile DCNv3 in place of the normal convolution for feature sampling. The DCNv3C3 module requires the convolution group number of the DCNv3 operator to be set according to the actual situation. In this paper, the parameters were set as follows according to the number of input feature map channels: (channels, groups) ∈ {(1024, 32), (512, 18), (256, 8)}.

Focal-CIoU Loss Function
In the YOLOv5 algorithm, the default choice for bounding box regression loss is the CIoU loss function, as shown in Equations (8)-(10). This loss function takes into account three factors: the overlap area between the predicted and true bounding boxes, the centroid distance, and the aspect ratio. However, it overlooks the issue of imbalance between low-quality and high-quality samples. Low-quality samples refer to predicted boxes with minimal overlap with the target box. These boxes result in larger regression errors and have a negative impact on training.

$$L_{CIoU} = 1 - IoU + \frac{d^2}{c^2} + \alpha v \tag{8}$$

$$\alpha = \frac{v}{(1 - IoU) + v} \tag{9}$$

$$v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2 \tag{10}$$

In Equations (8) and (10), $d$ represents the Euclidean distance between the center points of the predicted box and the true box, $c$ represents the diagonal length of the smallest region enclosing the two bounding boxes, $w$ and $h$ are the width and height of the predicted box, respectively, and $w^{gt}$ and $h^{gt}$ are the width and height of the true box, respectively.
To enhance the contribution of high-quality samples during training, this study drew inspiration from the Focal-EIoU loss function [31] and combined the ideas of the CIoU loss function and the Focal loss function, proposing the Focal-CIoU loss function. The Focal-CIoU loss function assigns weights to training samples based on IoU and the $\gamma$ parameter, giving higher weights to high-quality samples. This reduces the contribution of low-quality samples in bounding box regression, allowing the algorithm to focus more on high-quality samples during training, which helps improve regression accuracy. The Focal-CIoU loss function is defined in Equation (11):

$$L_{Focal\text{-}CIoU} = IoU^{\gamma} \, L_{CIoU} \tag{11}$$

where $\gamma$ is a parameter used to control the degree of suppression of outliers. In this study, the parameter setting method from Focal-EIoU was adopted, with $\gamma$ taking a value of 0.5.
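A minimal sketch of this weighting for a single pair of boxes is given below. It assumes that the loss takes the same form as Focal-EIoU, that is, the CIoU loss multiplied by IoU raised to the power γ; the function names, the corner-format boxes, and the small constant `eps` are choices made for this sketch and are not taken from the training code.

```python
import math

def iou_and_ciou_loss(pred, target, eps=1e-9):
    """Boxes are (x1, y1, x2, y2). Returns (IoU, CIoU loss)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    w, h = px2 - px1, py2 - py1
    w_gt, h_gt = tx2 - tx1, ty2 - ty1
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter + eps
    iou = inter / union
    # Squared distance between box centres (d^2).
    d2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4.0
    # Squared diagonal of the smallest enclosing box (c^2).
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4.0 / math.pi ** 2) * (
        math.atan(w_gt / (h_gt + eps)) - math.atan(w / (h + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou, 1.0 - iou + d2 / c2 + alpha * v

def focal_ciou_loss(pred, target, gamma=0.5):
    """Assumed form: IoU ** gamma * CIoU loss, with gamma = 0.5 as in the text."""
    iou, ciou = iou_and_ciou_loss(pred, target)
    return (iou ** gamma) * ciou

# Hypothetical boxes: a partial overlap and a complete miss.
partial = focal_ciou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
miss = focal_ciou_loss((0.0, 0.0, 1.0, 1.0), (5.0, 5.0, 6.0, 6.0))
```

Under this assumed form, a predicted box with no overlap receives a weight of zero, so its regression loss is suppressed entirely, while boxes with larger IoU keep a larger share of their CIoU loss.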

Experimental Conditions
This experiment was conducted on a Linux system with an Intel Core i7-8700 CPU and an NVIDIA Titan Xp GPU with 12 GB of memory, using CUDA 11.6 as the software environment and PyTorch 1.12.1 as the deep learning framework. The Python version used was 3.8.8.
In this experiment, a total of 3985 aerial photographs of transmission line defects were used, of which 2845 were collected by the research team. The UAV model used for this collected dataset was the MAVIC PRO 2 with 12 megapixels, a range of 10 km, and a maximum flight altitude of 5000 m. The remaining 1140 aerial photos are from the web. The labeling process was carried out using "labelimg", and the labeling format adhered to the PASCAL VOC standard. The annotation information for the dataset was uniformly saved in XML format. To enhance model training, efforts were made to ensure that the annotated bounding boxes closely matched the annotated objects.
There were four target categories in total: Nest (for tower bird's nest attachments), Defect (for insulator damage defects), Normal Insulator (for normal insulators), and Pollution Flashover (for insulator pollution flashover defects). The dataset was divided into training, testing, and validation sets in an 8:1:1 ratio. The distribution of sample labels in the dataset can be seen in Figure 7, and the transmission line defects are described in Table 1.
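An 8:1:1 division of the 3985 images could be produced as in the following sketch. The shuffling seed, the file naming, and the way the remainder is assigned to the validation set are assumptions made for illustration; the text does not state how the split was drawn.

```python
import random

def split_8_1_1(items, seed=0):
    """Shuffle and split into training, testing, and validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so the split is repeatable
    n_train = len(items) * 8 // 10      # integer arithmetic avoids rounding surprises
    n_test = len(items) // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]      # the remainder goes to validation
    return train, test, val

# Hypothetical image identifiers standing in for the 3985 aerial photographs.
image_ids = [f"img_{i:04d}" for i in range(3985)]
train, test, val = split_8_1_1(image_ids)
print(len(train), len(test), len(val))
```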

Experimental Evaluation Indicators
The evaluation metrics involved in this experiment primarily include mean average precision (mAP), recall rate (R), precision (P), parameter count, frames per second (FPS), and F1-score. The calculation formulas are given in Equations (12)-(15).

$$R = \frac{TP}{TP + FN} \tag{12}$$

$$P = \frac{TP}{TP + FP} \tag{13}$$

$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i \tag{14}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{15}$$

TP represents the true positive samples in the detection results, FN represents the false negative samples in the detection results, and FP represents the false positive samples in the detection results. mAP is the mean of the average precision $AP_i$ over all $n$ categories, and the F1-score is a comprehensive metric that simultaneously considers precision and recall to evaluate the algorithm's performance on positive and negative samples.
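As a worked illustration of these metrics, the sketch below computes precision, recall, and F1-score from detection counts and averages per-class AP values into mAP. All counts and AP values are invented for the example and are not experimental results.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_average_precision(ap_per_class):
    """mAP as the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Made-up counts: 80 correct detections, 20 false alarms, 10 missed targets.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)

# Made-up AP values for the four categories of the dataset.
example_ap = {
    "Nest": 0.90,
    "Defect": 0.80,
    "Normal Insulator": 0.95,
    "Pollution Flashover": 0.75,
}
m_ap = mean_average_precision(example_ap)
```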

Comparison of Different Loss Functions in Experiments
To validate the effectiveness of the Focal-CIoU loss function, this study conducted comparative experiments using the CIoU loss function, GIoU loss function, Focal-EIoU [31] loss function, and DIoU loss function. The experimental results are shown in Figure 9 and Table 2.
Analyzing the results in Figure 9 and Table 2, the following conclusions can be drawn. The Focal-CIoU loss function performs remarkably well during training, consistently showing lower initial loss values compared to the other four loss functions, and it converges quickly, reaching the lowest loss values, below 0.02. Additionally, when YOLOv5s utilizes the Focal-CIoU loss function as the bounding box regression loss, it results in a 1.5% improvement in mAP@0.5. In summary, the Focal-CIoU loss function exhibits the best performance in this comparative experiment. This loss function enhances the gradient contribution of high-quality samples during training, making the model more robust and less prone to overfitting.

Comparison of Different Spatial Pyramid Pooling Modules in Experiments
To validate the effectiveness of the EN-SPPFCSPC module, this section conducts comparative experiments using the SPPF module, the SPPFCSPC module, and the EN-SPPFCSPC module, respectively. The experimental results are shown in Table 3.
From Table 3, it can be observed that the EN-SPPFCSPC module exhibited a favorable performance in this comparative experiment. With a parameter count similar to that of the SPPF module, the final mAP@0.5 value improved by 0.9%. Furthermore, compared to SPPFCSPC, the EN-SPPFCSPC module significantly reduced the parameter count while improving accuracy, achieving a lightweight effect. This indicates that the EN-SPPFCSPC module effectively utilizes comprehensive information from feature maps, addressing the substantial loss of fine details in the YOLOv5s algorithm's spatial pyramid pooling module.
The standard for real-time object detection typically requires algorithms to process a certain number of frames per second (FPS). This criterion may vary depending on the application scenario, but, generally, the requirement for real-time object detection is around 30 FPS. According to this standard, the improved YOLOv5s algorithm with the EN-SPPFCSPC enhancement performs well on this dataset and meets the speed requirements for real-time detection.

Experimental Comparison of DCNv3C3 Module at Different Usage Positions
To verify the impact of DCNv3C3 on the detection accuracy of the YOLOv5s algorithm and to find the optimal placement of this module, this study conducted comparative experiments under three scenarios: the unmodified YOLOv5s algorithm, the DCNv3C3 module used in the backbone of YOLOv5s, and the DCNv3C3 module used in the neck of YOLOv5s. The experimental results are shown in Table 4. Table 4 shows that replacing C3 with DCNv3C3 in the backbone slightly improves average detection accuracy while reducing the parameter count. Specifically, mAP@0.5 improved by 0.6%, but the detection speed decreased by 37.8 frames per second, and the F1 score decreased slightly. In contrast, replacing C3 with DCNv3C3 in the neck produced the largest gain: mAP@0.5 improved by 1.2%, the F1 score increased significantly, and the detection speed decreased only slightly. In summary, the DCNv3C3 module effectively enhances the ability of the YOLOv5s algorithm to detect transmission line defects.

Ablation Experiment
This paper proposes three improvements to the YOLOv5s algorithm. To explore the impact of each improvement on detection accuracy, ablation experiments were conducted. The experimental results are shown in Table 5, where √ indicates that the corresponding module is used. M0 represents the unimproved YOLOv5s algorithm; M1 uses EN-SPPFCSPC as the spatial pyramid pooling module; M2 replaces the C3 module in the neck with the DCNv3C3 module; M3 uses the Focal-CIoU loss function as the bounding box regression loss; M4 combines the EN-SPPFCSPC and DCNv3C3 modules; M5 combines the DCNv3C3 module and the Focal-CIoU loss function; and M6 combines the EN-SPPFCSPC module and the Focal-CIoU loss function. "Ours" represents EDF-YOLOv5, obtained by applying all three improvements simultaneously. We also randomly selected test data and performed detection using the YOLOv5s algorithm and EDF-YOLOv5; the detection results are shown in Figure 10.
Comparing each improved algorithm with the unimproved algorithm M0 leads to the following conclusions:
(1) Using EN-SPPFCSPC alone slightly improves the average detection accuracy of defect targets. The mAP@.5 increased by 0.9%. Although the detection speed decreased slightly, it still met the real-time detection standard.
(2) Using DCNv3C3 alone allows the algorithm to detect defects in power transmission lines more accurately. The mAP@.5 increased by 1.2%, and the F1 score also improved by 1.2%.
(3) Using the Focal-CIoU loss function alone significantly improves the average detection accuracy of defects in power transmission lines. The mAP@.5 increased by 1.5%, and the F1 score also improved somewhat, while the detection speed remained unchanged.
(4) Using EN-SPPFCSPC and DCNv3C3 together, the mAP@.5 increased by 1.2%, and the F1 score improved somewhat. Although the detection speed is lower than that of the original algorithm, it still meets the real-time detection standard.
(5) Using Focal-CIoU and DCNv3C3 together, the mAP@.5 increased by 1.0%, and the F1 score improved slightly, while the detection speed decreased.
(6) Using EN-SPPFCSPC and the Focal-CIoU loss function together, the mAP@.5 increased by 1.1%, and the F1 score improved somewhat. Although the speed decreased by 31 frames per second, real-time detection was still guaranteed.
(7) Applying all three proposed improvements simultaneously yields the best detection accuracy for defects in power transmission lines in this experiment. The mAP@.5 increased by 2.3%, and the F1 score improved by 0.8%. The detection speed decreased by 49 frames per second, but this did not affect the algorithm's real-time performance.
Together with the detection results displayed in Figure 10, these findings show that the EDF-YOLOv5 algorithm proposed in this paper performs better on the power transmission line defect dataset. Compared to the unimproved algorithm, it achieves a significant improvement in detection accuracy and detects defect targets more accurately, demonstrating the effectiveness of the proposed improvements.
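The ablation design described above covers every on/off combination of the three improvements, which gives eight configurations. The following sketch enumerates that grid; the mapping of flag combinations to the labels M0 to M6 and "Ours" follows the definitions in the text, while the function and variable names are hypothetical.

```python
from itertools import product

# The three improvements, in the order used for the flag tuples below.
MODULES = ("EN-SPPFCSPC", "DCNv3C3", "Focal-CIoU")

# Labels as defined in the text: (EN-SPPFCSPC, DCNv3C3, Focal-CIoU) -> name.
NAMES = {
    (False, False, False): "M0",
    (True, False, False): "M1",
    (False, True, False): "M2",
    (False, False, True): "M3",
    (True, True, False): "M4",
    (False, True, True): "M5",
    (True, False, True): "M6",
    (True, True, True): "Ours",
}


def ablation_grid():
    """Map each configuration label to the list of enabled improvements."""
    grid = {}
    for flags in product((False, True), repeat=len(MODULES)):
        grid[NAMES[flags]] = [m for m, on in zip(MODULES, flags) if on]
    return grid


def format_row(label, enabled):
    """Render one table row, marking enabled modules with a check mark."""
    marks = ["√" if m in enabled else " " for m in MODULES]
    return f"{label:<5}" + "  ".join(marks)
```

Calling `ablation_grid()["M5"]` returns `["DCNv3C3", "Focal-CIoU"]`, matching the description of M5 above.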

Comparison of Evaluation Metrics for Mainstream Object Detection Algorithms
To further validate the detection performance of the EDF-YOLOv5 algorithm on this dataset, we conducted comparative experiments with current mainstream object detection algorithms, including Faster R-CNN, Mask R-CNN, SSD, YOLOX, and YOLOv8. The experimental results are presented in Table 6, from which the following conclusions can be drawn:
(1) In terms of mAP@.5, EDF-YOLOv5 performs best. It outperforms the classical two-stage object detection algorithms Faster R-CNN and Mask R-CNN by 13.5% and 9.8%, respectively. Compared to the single-stage object detection algorithms YOLOv5s, SSD, YOLOX, and YOLOv8, it achieves improvements of 2.3%, 12.0%, 7.5%, and 0.4%, respectively. This indicates that the algorithm detects power transmission line defects more effectively.
(2) In terms of detection speed, the two-stage object detection algorithm Faster R-CNN is the slowest. YOLOv8 is the fastest, reaching up to 181 frames per second, while EDF-YOLOv5's detection speed of 117 frames per second ranks third among the compared algorithms. EDF-YOLOv5 therefore remains competitive in detection speed and meets the real-time detection requirement.
(3) The F1 score is another standard for assessing algorithm accuracy. EDF-YOLOv5 achieves an F1 score of 90.1%, an improvement of 0.8%, 27.4%, 16.4%, and 0.7% over YOLOv5s, Faster R-CNN, SSD, and YOLOv8, respectively.
Overall, the EDF-YOLOv5 algorithm performs well in power transmission line defect detection in terms of both detection speed and accuracy.
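For reference, the F1 score used in this comparison is the harmonic mean of precision and recall. The sketch below computes it from detection counts; the function name and the example counts are hypothetical and are not taken from Table 6.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct detections, 10 false alarms, 10 missed defects.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
```

Because the harmonic mean is pulled toward the smaller of the two values, a detector cannot reach a high F1 score by trading many missed defects for few false alarms, or the reverse.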

Conclusions
This paper introduces an improved YOLOv5s algorithm, EDF-YOLOv5, for the detection of power transmission line defects in aerial drone images. The experiments show that the proposed EN-SPPFCSPC module makes effective use of comprehensive information from the input features while controlling the parameter count, reducing the loss of fine-grained details and improving detection accuracy. The introduced DCNv3C3 module enhances the model's generalization ability and also improves detection accuracy. Additionally, the Focal-CIoU loss function enables the algorithm to converge faster and start from lower initial loss values. In summary, the enhanced algorithm performs well in power transmission line defect detection, achieving a mAP@.5 of 93.1% and an F1 score of 90.1% at a detection speed of 117 frames per second, which meets the real-time requirement for this task. In future work, we will build on this improved algorithm to further enhance accuracy and reduce model size. We hope that the proposed algorithm can find broader applications in the field.

Figure 4. (a) Illustration of the sampling effect of DCNv3. (b) Illustration of the sampling effect of ordinary convolution.


Figure 7. (a) Distribution of label quantities in the dataset. (b) Distribution of label sizes.


Figure 8. Example images of the dataset.

Figure 9. Loss function experimental results comparison diagram.


Table 1. Introduction of transmission line defects.
Defect Name | Defect Description
Nest | Birds nesting on power towers are prone to short-circuit fai
Pollution Flashover | A significant reduction in the insulation level of insulators, lead

Table 2. Detection accuracy of different loss functions at mAP50.


Table 3. Comparison of experimental results using different spatial pyramid pooling modules.

Table 4. Experimental comparison of DCNv3C3 module usage at different positions.

Table 5. Comparison of ablation experiment results.


Table 6. Performance comparison of mainstream detection algorithms.