An Enhanced Single-Stage Neural Network for Object Detection in Transmission Line Inspection

Abstract: To address the issue of human object detection in transmission line inspection, an enhanced single-stage neural network is proposed, based on improvements to the YOLOv7-tiny model. Firstly, a lighter GSConv module is utilized to optimize the original ELAN module, reducing the number of parameters in the network. To make the network less sensitive to targets with unconventional poses, a module based on CSPNeXt and GSConv is designed and integrated with the ELAN module to extract deep features from the targets. Moreover, a WIoU (Wise Intersection over Union) loss function is utilized to enhance the ability of the YOLOv7-tiny model to detect objects with unconventional poses under background interference. Finally, experimental results on human targets in transmission line inspection demonstrate that the proposed network improves detection confidence and reduces missed detections. Compared to the YOLOv7-tiny model, the proposed method improves accuracy while reducing the number of parameters.


Introduction
With the rapid development of computer vision, visual monitoring has penetrated every aspect of our lives. Human target detection [1,2], as an important branch of target detection, has attracted increasing attention in intelligent video surveillance [3]. Accordingly, neural network technology, one of the most active fields of artificial intelligence, provides a theoretical foundation for detecting targets, thus promoting monitoring efficiency and reducing the burden of manual work [4].
Recently, deep learning neural networks have played a significant role in object detection [5,6], not only improving detection accuracy but also increasing detection speed. Generally, there exist three mainstream approaches: one-stage methods [7-9], two-stage methods [10-13], and Transformer-based methods [14,15]. The single-stage methods, exemplified by the YOLO (You Only Look Once) algorithms [7], directly predict bounding boxes and class probabilities in a single stage. This kind of method is renowned for its real-time performance and efficiency [16,17]. The two-stage methods appeared relatively early and are dominated by CNN algorithms such as the Faster R-CNN and R-CNN series [13,14]. Usually, such a method first generates candidate boxes extracted from the image, and detection then proceeds by making secondary adjustments based on these candidate regions. These methods can achieve high accuracy by explicitly generating region proposals and then classifying and refining them. The Transformer-based method is a state-of-the-art approach that views object detection as a set prediction problem. However, it demands an extra-long training time to converge.
For years, much work has been devoted to object detection using the aforementioned models. For instance, Yi et al. [18] proposed an enhanced algorithm called LAR-YOLOv8, which improves small target detection accuracy by designing an attention-guided bidirectional feature pyramid network while introducing the RIoU loss function. Wang et al. [19] proposed an innovative anchor-free driving scene detection network based on YOLOv8, improving both speed and accuracy to meet the demands of high-speed road driving scenarios. Shi et al. [9] proposed the Maritime Tiny Person detector (MTP-YOLO) based on YOLOv9, which incorporates the C2fELAN design and the MCE-CBAM module, along with a new W-EIoU loss function. For two-stage methods, He et al. [10] introduced an extension to the Faster R-CNN framework by adding a branch for predicting segmentation masks on each region of interest. Cai et al. [11] proposed enhancing the object detection framework through a multi-stage architecture in which detection is progressively refined, improving the precision of bounding boxes at stages with incrementally higher intersection-over-union thresholds. Hu et al. [20] built on Faster R-CNN by incorporating relation modules that model pairwise relations between objects in a scene, aiding the recognition of contextually related objects. For Transformer-based methods, Meng et al. [12] introduced a method called conditional DETR (Detection Transformer) by incorporating a conditional spatial query. Zhu et al. [21] proposed an enhancement to the original DETR that addresses its slow convergence and limited feature resolution by using attention modules that focus on a small set of key sampling points. This adaptation significantly reduces training time and improves performance, particularly on small objects, as demonstrated on the COCO benchmark.
In contrast to the other two kinds of methods, the YOLO series eliminates the need for separate region proposal generation, making it faster but possibly sacrificing some accuracy. In particular, the YOLO-tiny object detection model is designed for faster inference and lower resource requirements, making it suitable for running on mobile or edge devices. However, this may come at a loss of precision, so unconventional poses and background interference give the YOLO-tiny model low performance on human detection in transmission line inspection.
To detect human targets in transmission line inspection, this paper proposes an enhanced single-stage neural network based on the YOLOv7-tiny model [22]. The contributions of this work are as follows: (1) the introduction of a lighter GSConv module [23] to optimize the original ELAN module; (2) the design of a CSPNeXt-GS module to make the network less sensitive to targets with unconventional poses; and (3) the integration of the CSPNeXt-GS module with the ELAN module to extract deep features from the targets. In addition, a WIoU loss function [24] is utilized to further enhance the ability of the YOLOv7-tiny model for object detection. By incorporating these improvements, the proposed method significantly improves the model's accuracy and convergence speed, making it a valuable contribution to the field of human target detection in transmission line inspection.
The organization of the paper is as follows: in Section 2, the original YOLOv7-tiny model is presented, and its modifications are described in Section 3. In Section 4, the experimental results of the proposed network on human target detection in transmission line inspection are presented, and the performance of the proposed method is compared with that of some existing methods to demonstrate its effectiveness in object detection. Finally, conclusions are drawn in Section 5.

YOLOv7-Tiny Neural Network
YOLOv7-tiny is a variant of the YOLOv7 object detection algorithm [25], designed for real-time and efficient object detection tasks. Generally speaking, it is optimized for edge devices with limited computational resources, reducing the model size and computational complexity.
Figure 1 illustrates the overall structure of the YOLOv7-tiny model. It mainly consists of several key components, including a backbone network and detection layers. The backbone network is responsible for extracting features from the input image and typically consists of several convolutional and pooling layers. The detection layers in the YOLOv7-tiny model are utilized to generate predictions, being responsible for predicting bounding boxes, class probabilities, and objectness scores for different grid cells in the input image.

An Enhanced Single-Stage Neural Network Based on the YOLOv7-Tiny Model
The network architecture of the enhanced YOLOv7-tiny model is illustrated in Figure 2. Several improvements have been made compared to the original YOLOv7-tiny model, as shown in the colored boxes in Figure 2. Firstly, the improved ELAN-CSPGS module is used in the backbone network to enhance the extraction of deep features from the image. Secondly, the improved ELAN-GS module is employed in the network head to preserve the hidden connections between feature map channels as much as possible. The standard convolution in SPPCSPC-tiny is then replaced with GSConv [23] to further reduce the number of parameters in the network; the resulting module is named SPPCSPC-GS for simplicity. Finally, the network is trained with the WIoU loss [24] in place of the original loss function. In the following subsections, a detailed description of each module is provided.


ELAN-GS Module
Figure 3 illustrates the original ELAN-tiny structure and its improved structure. In contrast to the original ELAN-tiny model, a GSConv module is introduced into the ELAN-tiny model in place of the convolutional blocks (CBL), each of which consists of a convolutional layer, batch normalization (BN), and a LeakyReLU activation function. The principle of GSConv is shown in Figure 4, where "Conv" represents a standard convolution operation followed by batch normalization and a non-linear activation function, and "DWConv" represents depth-wise separable convolution, a commonly used method to reduce the number of model parameters.
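As a rough illustration of why this construction saves parameters, the sketch below counts convolution weights only (k × k kernels, biases and BN ignored) for a standard convolution versus a GSConv-style block that produces half the output channels with a standard convolution and the other half with a depth-wise convolution; the layer widths in the example are illustrative assumptions, not this paper's actual configuration.

```python
def std_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias/BN ignored)."""
    return c_in * c_out * k * k

def gsconv_params(c_in, c_out, k):
    """Weights in a GSConv-style block: a standard convolution producing
    c_out // 2 channels, then a depth-wise convolution on those channels;
    the concatenation and channel shuffle add no parameters."""
    half = c_out // 2
    return c_in * half * k * k + half * k * k

# Example: 64 -> 128 channels with 3 x 3 kernels.
print(std_conv_params(64, 128, 3))  # 73728
print(gsconv_params(64, 128, 3))    # 37440
```

The GSConv-style block needs roughly half the weights of the standard convolution, which is consistent with its role here of shrinking the network.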

Let the input feature map be I and the output feature map be O. The ELAN-GS module operation can then be expressed as

O = T^n_3(Concat(g^n_1(I), g^n_2(I), T^n_1(g^n_2(I)), T^n_2(T^n_1(g^n_2(I)))))

where T^n_i represents the standard convolution, pooling, and activation operation for the i-th convolution kernel of size n; g^n_i is the GSConv operation for the i-th convolution kernel of size n; and Concat denotes splicing feature vectors along the channel dimension. Similar to the original ELAN-tiny model, the input feature map undergoes two GSConv operations to extract preliminary features. Then, one of the GSConv outputs is subjected to two consecutive standard convolutions. Finally, the outputs of the four convolution blocks are concatenated, followed by a convolution fusion to obtain the final output. This process ensures that the final output O contains more and deeper features.
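The dataflow just described (two GSConv branches, two stacked standard convolutions on one of them, then concatenation and fusion) can be traced symbolically. The sketch below builds the expression as a string rather than running real convolutions, so the names g1, g2, T1, T2, and T are placeholders for the module's layers, not actual implementations.

```python
def elan_gs(x):
    """Symbolic trace of the ELAN-GS wiring: two GSConv branches, two
    consecutive standard convolutions applied to the second branch,
    concatenation of the four outputs, then a fusion convolution."""
    a = f"g1({x})"  # first GSConv branch
    b = f"g2({x})"  # second GSConv branch
    c = f"T1({b})"  # first standard convolution on branch b
    d = f"T2({c})"  # second standard convolution
    return f"T(Concat({a}, {b}, {c}, {d}))"  # fuse the four branches

print(elan_gs("I"))
# T(Concat(g1(I), g2(I), T1(g2(I)), T2(T1(g2(I)))))
```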

ELAN-CSPGS Module
In target detection tasks where the object has an unconventional pose in a complex background, low-dimensional features such as shape and contour are crucial factors affecting detection performance. In order to fully extract and retain the low-dimensional features of the image without losing semantic information, this work introduces the CSPNet backbone network [18]. CSPNet can enhance the learning ability of neural networks and promote the feature extraction capability of convolutional neural networks with less computational complexity.
Figure 5 illustrates the CSP structure, which divides the features into two parts. One part goes through dense convolutional blocks, and the other part undergoes regular convolution and connects directly to the output of the dense convolutional blocks. Transition layers are usually implemented with 1 × 1 convolutions or pooling operations.

Based on the CSP structure, the CSPNeXt-GS module is then designed to enhance the feature extraction capability of the convolutional neural network, as shown in Figure 6. By combining GSConv with the CSP structure, this module strengthens the convolutional feature extraction while reducing computational complexity. Letting Σ represent element-wise summation, the output is obtained by summing the two branch outputs element-wise and fusing the result with a transition convolution.


In order to preserve deep features such as the unconventional pose, the branch D is replaced with the CSPNeXt-GS module in the structure of ELAN-GS, as shown in Figure 7. This improvement enhances the subsequent convolutional layers and can extract deeper features while retaining the features of the A and B branches.
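The split-transform-merge pattern underlying the CSP structure can be shown with a toy stand-in in which the "feature map" is a plain list and the convolutions are arbitrary functions; the halving point, the toy operations, and the summing fusion are illustrative assumptions, not the module's real layers.

```python
def csp_block(features, dense_fn, conv_fn, fuse_fn):
    """Toy CSP structure: split the channels in half, push one half
    through the dense (GSConv-like) path and the other through a plain
    transform, concatenate both halves, then fuse in a transition step."""
    half = len(features) // 2
    part1, part2 = features[:half], features[half:]
    merged = dense_fn(part1) + conv_fn(part2)  # list concat stands in for channel concat
    return fuse_fn(merged)

# Toy example: double one half, increment the other, fuse by summing.
out = csp_block([1, 2, 3, 4],
                dense_fn=lambda xs: [2 * x for x in xs],
                conv_fn=lambda xs: [x + 1 for x in xs],
                fuse_fn=sum)
print(out)  # 15
```

The point of the structure is that only half the channels traverse the expensive dense path, which is how CSPNet (and CSPNeXt-GS here) trims computation without discarding the other half's features.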

WIoU Loss
Most existing training approaches assume that the samples are of high quality and focus on enhancing the fitting ability of the regression loss. However, blindly strengthening the bounding box regression of low-quality samples can hinder the generalization of the detector. To address this issue, the WIoU regression loss function is introduced in this work.
Let (x_gt, y_gt) and (x, y) denote the center coordinates of the object box and the prediction box, respectively, and let W_g and H_g be the width and height of the minimum enclosing box of the prediction box and the real box, as shown in Figure 8. The loss can be expressed as

L_WIoUv1 = R_WIoU · L_IoU,  R_WIoU = exp(((x − x_gt)² + (y − y_gt)²) / (W_g² + H_g²)*),  L_IoU = 1 − S_u / (w·h + w_gt·h_gt − S_u)

where S_u denotes the area of the intersecting part and is calculated as

S_u = W_i · H_i

where w_gt and h_gt denote the width and height of the ground-truth object box, respectively; w and h denote the width and height of the predicted object box, respectively; and W_i and H_i are the width and height of the intersection of the real and predicted object boxes, respectively.


Among the WIoU loss functions, WIoUv3 [26] has the best performance and is given by

L_WIoUv3 = r · L_WIoUv1

where r represents the focusing factor, and the superscript * denotes separation from the computational graph (the detached value is treated as a constant during backpropagation).
When the prediction box overlaps well with the object box, L_IoU weakens the penalty for the geometric factor and focuses attention on the centroid distance. Further, in the focusing factor r, an outlier degree β is introduced to assess the degree of anomaly of an anchor box, denoted as

β = L*_IoU / L̄_IoU

where L̄_IoU denotes the sliding average of L_IoU with momentum m, and m is usually 1 − 0.05^(1/(t·n)), where t represents the training period and n is the batch size. Therefore, the focusing factor r can be expressed as

r = β / (δ · α^(β − δ))

where α and δ are hyperparameters, and an anchor box obtains the highest gradient gain when its outlier degree equals a certain constant. The dynamic nature of L̄_IoU makes the quality classification criterion β for anchor boxes dynamic as well, thus enabling WIoUv3 to dynamically formulate the gradient gain allocation strategy that best fits the current situation.
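The formulas above can be checked numerically. The sketch below works on axis-aligned boxes given as (cx, cy, w, h) in plain Python; since no autograd is involved, the detached (superscript-*) terms are just ordinary values here, and the α = 1.9, δ = 3 defaults are assumptions borrowed from commonly quoted WIoU settings rather than values reported by this paper.

```python
import math

def wiou_v1(pred, gt):
    """WIoU v1 for two axis-aligned boxes given as (cx, cy, w, h).
    Returns (L_IoU, L_WIoUv1)."""
    x, y, w, h = pred
    xg, yg, wg, hg = gt
    # Corner coordinates of both boxes
    x1, y1, x2, y2 = x - w / 2, y - h / 2, x + w / 2, y + h / 2
    a1, b1, a2, b2 = xg - wg / 2, yg - hg / 2, xg + wg / 2, yg + hg / 2
    # Intersection width/height (W_i, H_i) and area S_u = W_i * H_i
    Wi = max(0.0, min(x2, a2) - max(x1, a1))
    Hi = max(0.0, min(y2, b2) - max(y1, b1))
    Su = Wi * Hi
    L_iou = 1.0 - Su / (w * h + wg * hg - Su)
    # Minimum enclosing box of the two boxes (W_g, H_g)
    Wg = max(x2, a2) - min(x1, a1)
    Hg = max(y2, b2) - min(y1, b1)
    R = math.exp(((x - xg) ** 2 + (y - yg) ** 2) / (Wg ** 2 + Hg ** 2))
    return L_iou, R * L_iou

def focusing_factor(beta, alpha=1.9, delta=3.0):
    """WIoUv3 focusing factor r = beta / (delta * alpha ** (beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))

# Identical boxes: IoU = 1 and the centers coincide, so both losses vanish.
print(wiou_v1((0, 0, 2, 2), (0, 0, 2, 2)))  # (0.0, 0.0)
# At beta = delta the factor reduces to delta / delta = 1 (unit gradient gain).
print(focusing_factor(3.0))  # 1.0
```

Shifting the predicted box off-center makes R_WIoU > 1, so L_WIoUv1 exceeds the plain IoU loss, which is exactly the extra centroid-distance penalty described above.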

Experimental Results and Discussion
To validate the effectiveness of the proposed method, experiments on human object detection were performed on the Ubuntu 18.04 64-bit operating system using PyTorch with Python 3.10. The training data come from https://aistudio.baidu.com/datasetdetail/229698 (accessed on 13 July 2023) and the COCO dataset [27], focusing mainly on human targets. The training and testing datasets in this work were randomly divided with a ratio of 8:2. Notably, all the models below were trained on this dataset with a learning rate of 0.01 and run on a platform consisting of an Intel® Xeon® CPU E5-2620 v4 2.1 GHz processor, 64 GB RAM, and an NVIDIA GeForce RTX 4090 24 GB GPU.
In our method, the anchor boxes for the human object dataset are clustered using the K-means++ clustering algorithm, and the prior bounding box sizes obtained are shown in Table 1. The hyperparameters used for network training are set as follows: an initial learning rate of 0.01, a momentum of 0.937, and a decay rate of 0.0005. The batch size is 16, and the Adam optimizer is used during training. The image size for training is set to 640 × 640 pixels, and the training period is 200 epochs. The quantitative evaluation metric for the experimental results is the average precision (AP) [28] at an intersection-over-union (IoU) threshold of 0.5. The calculation of AP is as follows:

AP = (1/m) Σ_{i=1}^{m} P_i,   mAP = (1/C) Σ_{j=1}^{C} AP_j

where m represents the number of positive samples; C denotes the number of classes; P_i represents the maximum precision for a single sample; and AP_j represents the average precision for one class of samples.
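The definition above (averaging, over the m positives, the best precision achieved once each positive is recalled) corresponds to interpolated AP. A minimal sketch, assuming detections are already sorted by descending confidence and flagged as true/false positives at the IoU ≥ 0.5 matching step:

```python
def ap_at_threshold(tp_flags, num_gt):
    """Interpolated AP: for each of the num_gt recall levels, take the
    maximum precision achieved at that recall or beyond, then average.

    tp_flags: per-detection 1/0 flags (IoU >= 0.5 match), sorted by
    descending confidence. num_gt: number of ground-truth positives (m).
    """
    precisions, recalls = [], []
    tp = 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / rank)   # precision after this detection
        recalls.append(tp / num_gt)    # recall after this detection
    total = 0.0
    for k in range(1, num_gt + 1):
        level = k / num_gt
        total += max((p for p, r in zip(precisions, recalls) if r >= level),
                     default=0.0)
    return total / num_gt

# 4 ranked detections, 3 ground-truth persons, third detection is a false positive.
print(ap_at_threshold([1, 1, 0, 1], num_gt=3))  # 0.9166666...
```

Averaging this per-class AP over C classes then gives the mAP used in the comparisons below.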

Experimental Results
Figure 9 gives the values of the two loss functions, CIoU [29] and WIoU [24], with respect to the iteration. As the number of iterations increases, both WIoU and CIoU eventually converge. Nevertheless, WIoU has a smaller loss value and converges faster than CIoU, demonstrating the effectiveness of the WIoU loss function in our method.

Figure 10 shows the P-R curves, which intuitively display the recall and precision performance of the detectors. The improved network has a larger area under the curve, which indicates better detector performance. Therefore, the improved network model can achieve better results.

To provide a visual demonstration of the detection performance of the proposed method on human objects, a comprehensive comparison is made with some existing classical deep learning models in terms of detection accuracy, spatial complexity, and temporal complexity. Figure 11 illustrates five real-world images selected from transmission line inspection. Figures 12-16 illustrate the results for human object detection with the comparison methods, including Faster R-CNN [13], YOLOv5 [30], YOLOv7-tiny, and YOLOv7 [25]. The improved network can detect all of the human objects, while the other networks show varying degrees of missed detection. For the small target in image3, only the proposed method and YOLOv7 can detect it, outperforming the other models. Moreover, the improved network boxes the human targets accurately, with more precise detection box positions than the original Faster R-CNN, as seen in image1 and image4. Table 2 lists the accuracy of the human object detection. The proposed method has lower false detection and missed detection rates. This demonstrates that the proposed improvements to the YOLOv7-tiny model achieve better performance on the detection accuracy of human targets: the method effectively reduces missed and false detections, improves the detection rate of small targets, and generates more accurate and effective detection boxes.
detect it and outperform the other models.Moreover, the improved network can accurately box the human targets, with more precise detection box positions compared to the original Faster R-CNN, as seen in image1 and im-age4.Table 2 lists the accuracy of the human object detection.The proposed method has a lower false detection rate and leak detection rate.This demonstrates that the proposed improvements on Faster R-CNN achieve a better performance on the detection accuracy of human targets.It effectively reduces the rates of missed detection and false detection, improves the detection rate of small targets, and generates more accurate and effective detection boxes.Table 3 presents a comparison of the time complexity among the mentioned methods.The YOLOv7 method is the least time-consuming, while the Faster R-CNN methods, being two-stage algorithms, have a higher time consumption.The proposed method is partially enhanced by the improvements, resulting in a high time complexity compared to the YOLOv7-tiny algorithms.However, it still has a lower time complexity compared to the YOLOv5 and YOLOv7.
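As a rough illustration of how the Table 2 metrics relate, the missed and false detection rates can be derived from counts of ground-truth targets, predicted boxes, and correct matches. This is only a sketch: the paper does not specify its exact matching criterion, so the function and the IoU >= 0.5 matching assumption below are illustrative, not the authors' definition.

```python
def detection_rates(n_gt, n_pred, n_correct):
    """Missed- and false-detection rates from detection counts.

    n_gt: number of ground-truth human targets
    n_pred: number of predicted boxes
    n_correct: predictions matched to a ground truth (e.g. IoU >= 0.5, assumed)
    """
    missed_rate = (n_gt - n_correct) / n_gt if n_gt else 0.0
    false_rate = (n_pred - n_correct) / n_pred if n_pred else 0.0
    return missed_rate, false_rate

# Example: 10 targets, 9 predictions, 8 correct matches
# -> missed rate 0.2, false detection rate ~0.111
```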

Ablation Study
To further demonstrate the contribution of each improvement in the proposed method, ablation experiments are carried out. Table 5 gives the results of five cases, corresponding to the original YOLOv7-tiny and its improvement with the ELAN-GS, ELAN_CSPGS, WIoU, and K-means++ modules, respectively. From Table 5, it can be observed that these modules gradually improve the detection accuracy of the model while reducing its parameters and computational complexity and maintaining a high frame rate.
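The K-means++ entry in the ablation refers to clustering anchor boxes with improved seeding. As a hedged sketch, the K-means++ seeding step on (width, height) pairs can look like the following; squared Euclidean distance is assumed here, although anchor clustering for YOLO models often uses an IoU-based distance instead, and the function name is illustrative.

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """K-means++ seeding: the first centre is chosen uniformly; each further
    centre is drawn with probability proportional to its squared distance
    to the nearest centre chosen so far."""
    rng = rng or random.Random(0)
    centres = [rng.choice(points)]
    while len(centres) < k:
        # Squared distance from each point to its nearest existing centre.
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, d in zip(points, d2):
            acc += d
            if acc >= r:
                centres.append(p)
                break
    return centres
```

The seeding spreads the initial centres apart, which is what makes the resulting anchors cover small and large targets better than plain random initialization.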

Further Discussion
Recently, the latest versions of the YOLO series have appeared [31,32]. Figures 17 and 18 illustrate their results on human object detection. It can be seen that YOLOv8 and YOLOv9 obtain the desired results. The performance on the training dataset is shown in Table 6. The confidence is relatively low because YOLOv8n.pt and YOLOv9-c.pt are used as the pre-trained models, respectively. Nevertheless, the proposed model has the advantages of fewer parameters and an acceptable accuracy in transmission line inspection, making it suitable for running on mobile or endpoint devices.

Conclusions
This paper proposes an enhanced single-stage neural network based on the YOLOv7-tiny model for human object detection in transmission line inspection. The model incorporates a lighter GSConv module to enhance the original ELAN-tiny module and combines it with the CSP structure to make the network less sensitive to low-contrast targets. The WIoU loss function replaces the original loss function, further enhancing the detection performance. The experimental results demonstrate that the proposed network achieves higher detection accuracy than some existing and commonly used models, with fewer parameters and lower computational complexity. In the near future, the proposed method will be deployed on endpoint devices for transmission line inspection.

Conflicts of Interest:
The research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 2. Structure of an enhanced neural network.

Figure 3. Structure of the ELAN-GS module.

Figure 5. Structure of the CSP.

Figure 8. Calculation of the WIoU loss function.
w_gt and h_gt denote the width and height of the ground-truth object box, respectively; w and h denote the width and height of the predicted object box, respectively; and W_i and H_i are the width and height of the intersection of the ground-truth box and the predicted box, respectively.
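The quantities above can be made concrete with a short sketch. This assumes the WIoU v1 formulation (an IoU loss scaled by a distance-based focusing factor whose enclosing-box denominator is treated as a constant during backpropagation); the paper does not state which WIoU version it adopts, so the details below are an assumption.

```python
import math

def iou(box_p, box_gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Width/height of the intersection (W_i, H_i in Figure 8); zero if disjoint.
    wi = max(0.0, min(box_p[2], box_gt[2]) - max(box_p[0], box_gt[0]))
    hi = max(0.0, min(box_p[3], box_gt[3]) - max(box_p[1], box_gt[1]))
    inter = wi * hi
    w, h = box_p[2] - box_p[0], box_p[3] - box_p[1]        # predicted w, h
    wgt, hgt = box_gt[2] - box_gt[0], box_gt[3] - box_gt[1]  # ground-truth w_gt, h_gt
    union = w * h + wgt * hgt - inter
    return inter / union if union > 0 else 0.0

def wiou_v1_loss(box_p, box_gt):
    """WIoU v1 (assumed): IoU loss weighted by a centre-distance factor."""
    l_iou = 1.0 - iou(box_p, box_gt)
    cxp, cyp = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cxg, cyg = (box_gt[0] + box_gt[2]) / 2, (box_gt[1] + box_gt[3]) / 2
    # Width/height of the smallest box enclosing both boxes.
    wg = max(box_p[2], box_gt[2]) - min(box_p[0], box_gt[0])
    hg = max(box_p[3], box_gt[3]) - min(box_p[1], box_gt[1])
    r_wiou = math.exp(((cxp - cxg) ** 2 + (cyp - cyg) ** 2) / (wg ** 2 + hg ** 2))
    return r_wiou * l_iou
```

Because r_wiou >= 1 and grows with the centre distance, poorly localized boxes are penalized more heavily, which matches the stated motivation of handling unconventional poses and background interference.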

Figure 10. Comparison of network model P-R curves before and after improvement.
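P-R curves like those in Figure 10 are traced by sweeping the confidence threshold over ranked detections. A minimal, illustrative sketch (the function and input names are assumptions, not the authors' code):

```python
def pr_curve(detections, n_gt):
    """Precision-recall points from (confidence, is_true_positive) detections.

    detections: list of (confidence, is_tp) pairs for one class
    n_gt: total number of ground-truth boxes for that class
    """
    pts = []
    tp = fp = 0
    # Sweep the confidence threshold from high to low.
    for _, is_tp in sorted(detections, key=lambda d: -d[0]):
        if is_tp:
            tp += 1
        else:
            fp += 1
        pts.append((tp / n_gt, tp / (tp + fp)))  # (recall, precision)
    return pts
```

The area under this curve gives the AP values typically reported alongside such figures.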

Figure 11. Real-world images in transmission line inspection.

Figure 12. Detection result of Faster R-CNN.

Figure 16. Detection result of our method.

Author Contributions:
Methodology, J.T.; Software, J.N.; Validation, Z.C. and Z.H.; Formal analysis, X.X.; Writing-original draft, C.C. All authors have read and agreed to the published version of the manuscript.

Table 2. Accuracy of human object detection.

Table 3. Evaluation of time complexity.

Table 4 lists the maximum batch size (B) and memory usage (Memory, M) when training with the different algorithms. The proposed algorithm achieves the highest accuracy and strikes a good balance between accuracy, speed, and model size, as seen in Table 4.

Table 4. Complexity comparison results of the different methods.

Table 6. Performance of the latest version YOLO series.
