Deep Learning Based Electric Pylon Detection in Remote Sensing Images

Abstract: The working condition of the power network can significantly influence urban development. Among all power facilities, electric pylons have an important effect on the normal operation of the electricity supply. Therefore, the working status of electric pylons requires continuous, real-time monitoring. Considering the low efficiency of manual detection, we propose to utilize deep learning methods for electric pylon detection in high-resolution remote sensing images. To verify the effectiveness of electric pylon detection methods based on deep learning, we tested and compared the comprehensive performance of 10 state-of-the-art deep-learning-based detectors with different characteristics. Extensive experiments were carried out on a self-made dataset containing 1500 images. Moreover, 50 relatively complicated images were selected from the dataset to test and evaluate the adaptability to actual complex situations and resolution variations. Experimental results show the feasibility of applying deep learning methods to electric pylon detection. The comparative analysis can provide a reference for selecting a specific deep learning model in actual electric pylon detection tasks.


Introduction
Electricity is one of the most crucial energy supports for economic development and technology progress. Furthermore, the stability of electricity supply is an essential requirement for regional development. In the entire power system, electric network is an important link to transfer the electric energy from the power plants with concentrated distribution to individual power users with scattered distribution [1]. In other words, this component of the power system most closely connects with the urban power supply. To monitor the performance of the electric network, electric pylons, which play the role of undertaking and guiding wires, need to be monitored frequently to ensure normal operation.
However, with the popularization of electricity and the increasing complexity of the electric network, the areas where power users reside are becoming larger and more diverse. Because electric pylons are numerous, widely spread, diverse in appearance, and surrounded by complex terrain, traditional field inspections relying on manpower consume considerable resources yet achieve low time efficiency. Field inspections based on unmanned aerial vehicles (UAVs) may perform better [2,3]. However, this approach struggles to meet real-time monitoring requirements over large areas and is susceptible to interference from surrounding tall buildings. In contrast, satellite remote sensing covers a large monitoring area, is efficient and less affected by the surroundings, and has been applied to global environmental observation [4]. Therefore, this paper focuses on electric pylon detection in high-resolution remote sensing images captured by satellites.
Furthermore, manual interpretation of high-resolution remote sensing images requires a significant amount of hard work. In addition to the influence of the characteristics of the electric pylon itself, manual interpretation has an inevitable shortcoming compared with machine recognition, i.e., visual fatigue. Because of visual fatigue, continuous manual interpretation can significantly reduce the efficiency and accuracy of monitoring. Thus, this paper introduces deep learning methods to automatically interpret remote sensing images containing electric pylons, which can significantly improve the overall efficiency of electric pylon detection.
In recent years, target detection methods based on deep learning have become a hot spot in related areas. A large number of successful experiments have demonstrated that such methods are effective in the detection of aircraft [5,6], ships [7,8], and condensing towers [9,10]. Furthermore, deep-learning-based detectors can adapt to several remote sensing data sources and have been applied to target detection in optical [11], infrared [12], LiDAR [13], SAR [14], and aerial images [15], among others. With the continuous improvement of deep learning theory and the iterative update of detection algorithms, target detectors based on deep learning have shown superiority over traditional object detection methods.
In this paper, to analyze the feasibility of deep learning methods in the detection of electric pylons in high-resolution remote sensing images, we select 10 state-of-the-art deep-learning-based target detectors to compare the comprehensive performance of these models in the electric pylon detection. To improve training efficiency, the detectors designed and pre-trained on natural images are fine-tuned on the basis of remote sensing images containing electric pylons. Extensive experiments were performed on a self-made remote sensing image dataset.
The rest of this paper is organized as follows. Section 2 shows the related works of our study. Section 3 describes the main content of our work, including the production of the dataset, the selection of deep learning models, and the adjustment of specific parameters in the experiments. Section 4 introduces the experimental process and the test results of each detector. Section 5 presents the comprehensive analysis of the results. Finally, Section 6 makes a conclusion.

Object Detection Based on Deep Learning
In recent years, numerous deep-learning-based methods have been proposed to solve object detection problems. These detectors follow similar lines of thought: they extract features using a Convolutional Neural Network (CNN), and then classify the objects and regress the bounding boxes using diverse methods. Deep-learning-based object detection methods are popularized by both two-stage and one-stage detectors.
Girshick et al. proposed R-CNN (Regions with Convolutional Neural Network) [16] as the first two-stage detector. The R-CNN method mainly acts as a classifier, training CNNs end-to-end to classify the proposal regions into object categories or background. SPP-Net (Spatial Pyramid Pooling Network) [17] and Fast R-CNN [18] further promoted the development of two-stage detectors. Ren et al. introduced Faster R-CNN [19], which proposed the region proposal network to improve efficiency and allow the detector to be trained end-to-end. Since then, scholars have introduced many methods to enhance Faster R-CNN in different respects, e.g., Cascade R-CNN [20], Mask R-CNN [21], Grid R-CNN [22], HTC (Hybrid Task Cascade) [23], etc. These detectors generally achieve relatively high accuracy, but require longer running time and more memory.
On the other hand, one-stage detectors are popularized by SSD (Single Shot multibox Detector) [24] and YOLO (You Only Look Once) [25][26][27][28]. These detectors densely and uniformly sample different locations of the image, extract features, and then classify and regress the bounding boxes directly. They usually require less running time and memory, but had lower accuracy until the introduction of focal loss in Retinanet [29]. Focal loss has been shown to greatly improve the comprehensive performance of one-stage detectors.
A brief illustration of deep-learning-based detectors is shown in Figure 1.
Figure 1. Illustration of deep-learning-based detectors: (Left) two-stage detectors; and (Right) one-stage detectors. These detectors first extract features using a Convolutional Neural Network (CNN), and then densely sample different locations to obtain anchors. The difference between two-stage and one-stage detectors lies mainly in whether a Region Proposal Network (RPN) [19] is used. Two-stage detectors classify the anchors recommended by the RPN and regress the bounding boxes toward the ground-truth boxes, while one-stage detectors perform classification and regression on all anchors.
In all of the above methods, the selection of candidate bounding boxes is based on prior anchors. These methods are therefore regarded as anchor-based methods, which regress four variables, [x,y,w,h], relative to pre-set anchors. In recent years, researchers have paid more attention to anchor-free methods, which use other regression schemes, e.g., regressing pixel positions, and have proposed a number of successful anchor-free detectors such as Corner-Net [30], Center-Net [31], FCOS (Fully Convolutional One-Stage) [32], etc. In certain scenes, anchor-free detectors achieve accuracy on par with anchor-based one-stage detectors at a higher speed. Anchor-free methods are growing into a focus of future research.

Electric Pylon Detection
How to effectively and accurately realize the automatic detection of electric pylons remains an urgent problem to be solved in remote sensing. Matikainen et al. [33] proposed to introduce remote sensing methods to detect power lines. They studied the application of several remote sensing data sources in power lines monitoring, including SAR images, optical satellite images, optical aerial images, thermal images, etc. Various remote sensing images provide a series of new ideas for electric pylon detection, some of which have been greatly improved in recent years, e.g., UAV monitoring [2,3] and detection based on Lidar data [34]. Besides, the detection and tracking of electric pylons in videos is also an important research field of electric pylon monitoring. Tilawat et al. [35] proposed an automatic detection method to locate electric pylons in aerial videos.
Utilizing learning methods is the development trend in the area of target detection and identification. Sampedro et al. [36] proposed a traditional supervised learning method for detecting and classifying transmission towers. In [36], two MLP (multi-layer perceptron) neural networks were trained using HOG (histogram of oriented gradients) features. The first MLP network was used for detection, and the second was used to classify different types of towers. The good evaluation results achieved by this model preliminarily demonstrate the feasibility of applying learning methods to monitor the working condition of the power network.
With the development of deep learning, it has been applied to detect electric pylons in SAR images. For example, Fei and Tan [37] proposed to use deep learning to identify electric towers in high-resolution SAR images. The authors aimed at balancing the precision and efficiency of identification and, in particular, constructed a two-stage detector by cascading YOLOv2 [26] and VGG [38]. Compared with YOLOv2, this detector achieved better detection performance, with the recall reaching 73.8% in the testing process.
However, few works focus on electric pylons in optical satellite remote sensing images. Therefore, to accelerate research in this application area, we performed a series of extensive experiments based on high-resolution optical remote sensing images and analyzed the advantages and disadvantages of 10 state-of-the-art deep-learning-based detectors.

EPD Dataset
To study electric pylon detection based on deep learning, we specially collected a high-resolution remote sensing image dataset for electric pylon detection (EPD). Images in our EPD dataset were collected from Google Earth and from image products of the Pleiades satellite. Specifically, all images in the dataset are processed multi-spectral remote sensing image products, which are widely used in practical detection tasks. Pleiades images are orthoimages, and images from Google Earth are multi-spectral products captured by different sensors. Such multi-source data can better test the generalization ability of deep learning detectors. Figure 2 shows samples from these two sources in our EPD dataset.
Figure 2. Image samples in our EPD dataset. The first and second images were captured by the Pleiades satellite, while the third and fourth images were collected from Google Earth. All image samples in our dataset were obtained from these two sources, and all are processed multi-spectral image products.
As shown in Figure 2, the electric pylon targets in high-resolution optical remote sensing images have a variety of features, which bring considerable challenges to actual detection tasks. On the one hand, due to the wide use of pylons, the sizes and specifications of pylons vary greatly. Even at the same spatial resolution, the area occupied by different pylons in the same image may be quite different. On the other hand, due to the wide coverage of the power network, the background environment of the pylons varies greatly. Light and topography also affect the characteristics of electric pylons. The former will impact the appearance color of electric pylons, while the latter will affect the tilt degree of electric pylons with the certain observation angle of satellites.
To test the adaptability of several detectors to the above interference factors, we deliberately selected electric pylon targets in different states when making our dataset. The EPD dataset contains 1500 images in total: 720 images were captured by the Pleiades satellite along the Huimao Line in Guangdong Province, China, a main line of the power network in south China, while the remaining images were collected from Google Earth to further improve the representativeness of the dataset by expanding the source of samples. The spatial resolution of images in the EPD dataset is 1 m/pixel. Moreover, to test and evaluate the adaptability of the detectors to actual situations, we specially selected 50 relatively complex images from the EPD dataset to form a complex test subset called EPD-C. Twenty images in EPD-C are Pleiades satellite products, while the remaining 30 images are from Google Earth. One criterion for selecting images for EPD-C is background interference, such as similarity between the background and the target, or interfering objects that share certain characteristics with the target to be detected. Another criterion is the particularity of the target to be detected, such as large scale variation or unique characteristics. The details of the complex test set EPD-C are summarized in Table 1. Figure 3 shows two samples from EPD-C. The left image, from the Pleiades satellite, has a colorful field background similar in color to the electric pylons. The right image, from Google Earth, has a background containing crisscrossed roads and framed buildings. Electric pylons are harder to detect in these two images. There are 159 electric pylons in total in EPD-C. Thus, the construction of EPD-C helps to better evaluate electric pylon detection performance in complex situations.
Moreover, these two samples in Figure 3 also indicate that, in addition to the characteristics of electric pylons themselves, the surrounding background also brings challenges to the detection task, which is mainly reflected in the background color and interference targets. Due to the light-colored frame structure of electric pylon target, light background and frame structure buildings can significantly interfere with detection results.
Particularly, we regard the remaining 1450 images in EPD dataset excluding EPD-C as a standard subset named EPD-S, which involves more than 3000 electric pylons. EPD-S subset was used to train detectors and perform random experiments.

Deep Learning Detectors for Comparison
We selected 10 popular state-of-the-art deep-learning-based detectors, namely Faster R-CNN [19], Cascade R-CNN [20], Grid R-CNN [22], Libra R-CNN [39], Retinanet [29], YOLOv3 [27], YOLOv4 [28], Retinanet FreeAnchor [40], FCOS [32], and Retinanet FSAF (Feature Selective Anchor-Free) [41], comprising four two-stage models and six one-stage models. In terms of anchor usage, eight detectors are anchor-based, one detector is anchor-free, and one is an anchor-based detector with an anchor-free branch. We selected these detectors for the following reasons. Firstly, these 10 detectors are popular deep learning models proposed in the last five years. Their performance is broadly representative of the ability of state-of-the-art deep-learning-based detectors to solve electric pylon detection. Secondly, these models have already been applied in many other remote sensing tasks and have obtained meaningful achievements. Lastly, these models cover the main research directions of deep-learning-based object detection networks, such as two-stage/one-stage and anchor-based/anchor-free, making our study more comprehensive and the experimental results more credible. Some details of the detectors studied in this paper are reported in Table 2. For a more detailed introduction to how each detector works, please refer to the respective citations.
Table 2. Detectors based on deep learning studied in this paper. Eight detectors use ResNet101 [42] + FPN [43] as the backbone, while the two detectors of the YOLO series use Darknet-53 [27] and CSPDarknet-53 [28] as the backbone, respectively. ResNet101 refers to a deep residual network with 101 layers. FPN refers to feature pyramid networks. Darknet-53 refers to a deep residual network with 53 layers, and CSPDarknet-53 adds a CSPNet [44] structure on the basis of Darknet-53.

Backbone Network
In deep learning networks for object detection task, the structure utilized to extract features is called the backbone network. As a relatively new network structure with excellent performance in various tasks, deep residual network (ResNet) [42] is very suitable to be the backbone network of object detection models. In this paper, most networks use ResNet101 (ResNet with 101 layers) as the backbone network, except YOLOv3 and YOLOv4, which, respectively, use Darknet-53 and CSPDarknet-53 as the backbone network. Besides, considering that feature pyramid networks (FPN) [43] has a good performance in solving multi-scale problems, we combine FPN with ResNet to further improve network performance.
As shown in Figure 4, the outputs of the last four of the five ResNet stages (C2-C5) are fed into the FPN. The output of the backbone network is the fused feature map of ResNet and FPN, and we denote the outputs as P2-P5. The P5 layer is obtained by a 1 × 1 convolution of C5. The P4 layer is obtained by fusing the upsampled P5 layer with the C4 layer. P3 and P2 are computed in the same way as P4. The numbers of input channels of the four layers (C2-C5) are 256, 512, 1024, and 2048, respectively, and the number of output channels is 256. Each layer is half the size of the previous layer. Meanwhile, FPN adds a P6 layer, obtained by further downsampling of the C5 layer, which provides a feature layer at a larger scale. The structures of Darknet-53 and CSPDarknet-53 are detailed in Section 3.2.7 and Section 3.2.8, respectively.
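The top-down fusion described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the experiments: the random weights, the small spatial sizes, the nearest-neighbour upsampling, and the lateral 1 × 1 convolution applied to every stage are assumptions based on the standard FPN design, and the 3 × 3 smoothing convolutions of the original FPN are omitted.

```python
import numpy as np

def conv1x1(x, w):
    """1 x 1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of two (assumed interpolation)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(feats, weights):
    """feats = [C2, C3, C4, C5] (fine to coarse); returns [P2, P3, P4, P5]."""
    laterals = [conv1x1(c, w) for c, w in zip(feats, weights)]
    outs = [laterals[-1]]  # P5 comes directly from the 1 x 1 convolution of C5
    for lateral in reversed(laterals[:-1]):
        # fuse the upsampled coarser level with the current lateral feature
        outs.insert(0, lateral + upsample2x(outs[0]))
    return outs

rng = np.random.default_rng(0)
in_channels = [256, 512, 1024, 2048]   # channels of C2-C5 as stated in the text
sizes = [16, 8, 4, 2]                  # toy spatial sizes, each half the previous
feats = [rng.standard_normal((c, s, s)) for c, s in zip(in_channels, sizes)]
weights = [rng.standard_normal((256, c)) * 0.01 for c in in_channels]
pyramid = fpn_top_down(feats, weights)
shapes = [p.shape for p in pyramid]
```

Every output level has 256 channels and keeps the spatial size of its corresponding ResNet stage.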

Faster R-CNN
Faster R-CNN [19] merges a Region Proposal Network (RPN) into Fast R-CNN [18]. An RPN is a fully convolutional network, trained end-to-end to generate high-quality candidate boxes for detection by Fast R-CNN. The RPN accelerates region proposal and significantly reduces the running time of two-stage detectors. The structure of Faster R-CNN is shown in Figure 5.
Figure 5. Structure of Faster R-CNN [19]. The input of Faster R-CNN is the output of FPN [43]. Tetragons with different colors represent different anchors. The RPN (Region Proposal Network) is a fully convolutional network that classifies the anchors as foreground or background and coarsely regresses the bounding boxes. ROI is the abbreviation of region of interest, and the ROI Align layer [21] is used to unify the size of the feature maps.
Owing to the use of FPN, the feature maps input to the RPN are multi-scale. The anchor scale on each feature map is 8 × 8, which is equivalent to generating anchors of five scales (32 × 32, 64 × 64, 128 × 128, 256 × 256, and 512 × 512) in the original image. The aspect ratios of the anchors are 1:1, 2:1, and 1:2, giving 15 kinds of anchors in total.
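As a worked check of these numbers, the following sketch enumerates the anchor shapes. The feature-map strides (4, 8, 16, 32, 64 for P2-P6) are an assumption taken from the usual ResNet + FPN configuration and are not stated in the text, and `make_anchor_shapes` is a hypothetical helper.

```python
import math

def make_anchor_shapes(base_scale=8, strides=(4, 8, 16, 32, 64),
                       ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) of every anchor kind in original-image pixels."""
    shapes = []
    for stride in strides:
        side = base_scale * stride          # 8 x stride: 32, 64, ..., 512
        for ratio in ratios:                # ratio is height / width
            w = side / math.sqrt(ratio)
            h = side * math.sqrt(ratio)     # the area stays side * side
            shapes.append((w, h))
    return shapes

anchor_shapes = make_anchor_shapes()
```

Five scales times three aspect ratios gives the 15 kinds of anchors mentioned above.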
Faster R-CNN usually uses ROI pooling layer to reconcile the size of feature maps input into two fully connected (FC) layers. However, in this paper, we utilized a modified method of ROI pooling, i.e., ROI Align [21], as shown in Figure 5. ROI Align layer outputs a feature map with a shape of 7 × 7 × 256, and the FC layer outputs 1024 channels.
In RPN, we utilized Cross Entropy Loss (CE Loss) [19] with sigmoid function when calculating classification loss and Smooth L1 Loss [19] with parameter β = 1/9 when calculating bounding-box regression loss. In R-CNN, we utilized CE Loss with SoftMax method when calculating classification loss and Smooth L1 Loss with parameter β = 1 when calculating bounding-box regression loss. We added these losses together as the final loss.
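For reference, the Smooth L1 loss with parameter β can be sketched as follows. This is the commonly used definition (quadratic below β, linear above it), written for a single scalar residual; the function name is illustrative.

```python
def smooth_l1(diff, beta):
    """Smooth L1 loss of one regression residual with switch point beta."""
    x = abs(diff)
    if x < beta:
        return 0.5 * x * x / beta   # quadratic region for small residuals
    return x - 0.5 * beta           # linear region for large residuals

rpn_loss = smooth_l1(0.05, beta=1.0 / 9.0)   # RPN setting from the text
rcnn_loss = smooth_l1(0.5, beta=1.0)         # R-CNN setting from the text
```

A smaller β, as in the RPN, makes the loss behave like an L1 loss over a wider range of residuals.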
Faster R-CNN adopts the parameterizations of four coordinates [x, y, w, h] to regress from an anchor box to a nearby ground-truth box. In R-CNN, we utilized mean values 0 and variances [0.1, 0.1, 0.2, 0.2] to normalize the four coordinates.
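A minimal sketch of this parameterization is given below, using the box encoding of the Faster R-CNN paper. Treating the reported values [0.1, 0.1, 0.2, 0.2] as divisors of the raw offsets is an assumption about how the normalization is applied; the function names are hypothetical.

```python
import math

MEANS = (0.0, 0.0, 0.0, 0.0)
SCALES = (0.1, 0.1, 0.2, 0.2)   # values from the text, assumed to be divisors

def encode(anchor, gt):
    """Encode a ground-truth box relative to an anchor; boxes are (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    raw = ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))
    return tuple((r - m) / s for r, m, s in zip(raw, MEANS, SCALES))

def decode(anchor, target):
    """Invert encode(): recover the (cx, cy, w, h) box from normalized targets."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = (t * s + m for t, m, s in zip(target, MEANS, SCALES))
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))

anchor = (100.0, 100.0, 64.0, 64.0)
gt_box = (108.0, 96.0, 80.0, 48.0)
target = encode(anchor, gt_box)
recovered = decode(anchor, target)
```

Decoding the encoded target returns the original ground-truth box, which is how predicted offsets are turned back into boxes at test time.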
When training the RPN, we assigned positive and negative labels to anchors following [19], and we sampled anchors as well. In particular, if the maximum IoU (Intersection over Union) between a ground-truth box and any anchor was lower than 0.3, we ignored all anchors for that ground-truth box. To reduce redundant proposals, we adopted non-maximum suppression (NMS) [45]. We set the IoU threshold for NMS to 0.7 and kept the 2000 proposal regions with the highest scores per image.
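The NMS step can be sketched in plain Python as follows. This is a greedy reference version for illustration, with boxes assumed to be in (x1, y1, x2, y2) format; practical implementations are vectorized.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.7, max_keep=2000):
    """Greedy NMS: keep the best-scoring boxes, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if len(keep) == max_keep:
                break
    return keep

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (0, 0, 10, 5), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7, 0.6]
kept = nms(boxes, scores)
```

In this toy example, the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the third box (IoU 0.5) and the fourth box (no overlap) are kept.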
When training R-CNN, we assigned positive and negative labels with an IoU threshold of 0.5, and we also added the ground-truth boxes as positive samples. We randomly sampled 512 anchors with a positive-to-negative ratio of 1:3.
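The 1:3 sampling rule can be sketched as below. Filling the batch with extra negatives when fewer than 128 positives are available is an assumption about the sampler's behaviour, and `sample_rois` is a hypothetical helper.

```python
import random

def sample_rois(pos_idx, neg_idx, num=512, pos_fraction=0.25, seed=0):
    """Randomly pick up to num samples with at most pos_fraction positives."""
    rng = random.Random(seed)
    num_pos = min(len(pos_idx), int(num * pos_fraction))   # at most 128 positives
    num_neg = min(len(neg_idx), num - num_pos)             # negatives fill the rest
    return rng.sample(pos_idx, num_pos), rng.sample(neg_idx, num_neg)

pos, neg = sample_rois(list(range(300)), list(range(300, 5000)))
few_pos, many_neg = sample_rois(list(range(40)), list(range(40, 5000)))
```

With enough candidates the batch holds 128 positives and 384 negatives, which matches the 1:3 ratio.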
When testing, we utilized NMS as well. In the RPN, we again set the IoU threshold for NMS to 0.7 and kept 1000 proposal regions per image. In R-CNN, we set the IoU threshold for NMS to 0.7 and kept 100 proposal regions per image.

Cascade R-CNN
Faster R-CNN [19] utilizes 0.5 as the IoU threshold when defining positive and negative samples. A low threshold, e.g., 0.5, usually produces noisy detections among the positive samples, but a high threshold, e.g., 0.7, usually causes the number of positive samples to shrink exponentially and degrades detection performance. Cascade R-CNN [20] was proposed to address this problem. As shown in Figure 6, Cascade R-CNN cascades a sequence of detectors with increasing IoU thresholds. The detectors are trained stage by stage, and the output of each detector provides a better distribution for training the next, higher-quality detector.
Figure 6. Structure of Cascade R-CNN [20]. The input of Cascade R-CNN is the output of FPN [43]. Tetragons with different colors represent different anchors, while tetragons of the same color on different feature maps represent the same anchors. The RPN (Region Proposal Network) is a fully convolutional network that classifies the anchors as foreground or background and coarsely regresses the bounding boxes. ROI is the abbreviation of region of interest, and the ROI Align layer [21] is used to unify the size of the feature maps. IoU is the abbreviation of Intersection over Union, whose details are introduced in Section 4.2. Detectors 1 and 2 use different IoU thresholds, and the positive samples are passed to the next detector, as shown by the lines of different colors.
We cascaded three detectors with the IoU thresholds 0.35, 0.45, and 0.55, respectively. We selected these IoU thresholds to calculate more anchors and improve the ability of finding small objects. The detectors randomly sampled 1024, 512, and 512 anchors, respectively. The positive and negative ratio for each detector was 1:3.

Grid R-CNN
Grid R-CNN [22] replaces traditional regression-based localization with a grid-guided localization mechanism for precise object detection. The detector designs a spatial information fusion module to exploit the inner spatial correlation and calibrate the location of grid points. The fused feature maps are obtained by adding together the correlated grid feature maps after they are processed by convolution layers. Grid R-CNN also extends the region mapping to cover all the target grid points of a positive proposal.
The authors also presented a better and faster version of Grid R-CNN, Grid R-CNN Plus. The major update is the proposed grid-point-specific representation. Grid R-CNN Plus solves the problem that the ground-truth label is confined to a small region of the supervision map by shifting a biased distribution to a normalized one. More details of Grid R-CNN Plus can be found in [46].
The structure of Grid R-CNN used in this paper is shown in Figure 7. We established the Grid Head mainly following [46]. The loss function of the Grid Head is CE Loss with the sigmoid function, and the weight of the CE Loss is 15. In addition, we utilized Group Normalization (GN) [47] when computing the convolutions of grid features. Other parameters of Grid R-CNN are identical to those of Faster R-CNN.
Figure 7. Structure of Grid R-CNN [22]. The input of Grid R-CNN is the output of FPN [43]. Tetragons with different colors represent different anchors. The RPN (Region Proposal Network) is a fully convolutional network that classifies the anchors as foreground or background and coarsely regresses the bounding boxes. ROI is the abbreviation of region of interest, and the ROI Align layer [21] is used to unify the size of the feature maps. Grid Guided Localization [22] adopts a fully convolutional network to obtain grid feature maps and fuses them.

Libra R-CNN
Pang et al. proposed Libra R-CNN [39]. They divided the object detector training into three stages: (1) sampling regions; (2) extracting features; and (3) recognizing the categories and refining the locations under the guidance of a loss function. Three levels of imbalance exist during the training process, i.e., sample level, feature level, and objective level. Pang et al. proposed three novel components, namely IoU-balanced sampling, balanced feature pyramid, and balanced L1 loss, for reducing these imbalances separately. The structure of Libra R-CNN is shown in Figure 8.
IoU-balanced sampling splits the sampling interval evenly into bins according to IoU and samples from them uniformly. We used three bins and applied IoU-balanced sampling in R-CNN. In the RPN, we restricted the maximum ratio of negative to positive samples to five. The balanced feature pyramid resizes the feature maps extracted from the preceding network structure to the same size, calculates their average, resizes the result back to the original sizes, and adds it to the original feature maps. We resized the feature maps to the size of C4 and utilized a non-local module [48] to extract features. Balanced L1 loss implements a more balanced training of the detector. We utilized balanced L1 loss with parameters α = 0.5, γ = 1.5, and β = 1.0 when calculating the bounding-box regression loss. We set NMS following Pang et al. [39]. Other parameters of Libra R-CNN are identical to those of Faster R-CNN.
Figure 8. Structure of Libra R-CNN [39]. Libra R-CNN updates the output of FPN [43], P2-P6, with the Balanced Feature Pyramid, and the updated feature maps are passed to the next structure. Tetragons with different colors represent different anchors. The RPN (Region Proposal Network) is a fully convolutional network that classifies the anchors as foreground or background and coarsely regresses the bounding boxes. ROI is the abbreviation of region of interest, and the ROI Align layer [21] is used to unify the size of the feature maps. Libra R-CNN utilizes IoU-balanced Sampling to sample the anchors and Balanced L1 Loss to calculate the bounding-box regression loss. Details of the Balanced Feature Pyramid, IoU-balanced Sampling, and Balanced L1 Loss can be found in [39].
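To make the role of α, γ, and β concrete, the following sketch evaluates the balanced L1 loss for a scalar residual. The formula follows the formulation of Pang et al. [39] as we understand it, with β acting as the switch point between the two branches (an assumption borrowed from common implementations); it is an illustration, not the code used in the experiments.

```python
import math

def balanced_l1(diff, alpha=0.5, gamma=1.5, beta=1.0):
    """Balanced L1 loss of one regression residual."""
    x = abs(diff)
    b = math.exp(gamma / alpha) - 1.0   # chosen so that alpha * ln(b + 1) = gamma
    if x < beta:
        # inlier branch: its gradient alpha * ln(b * x / beta + 1) grows with x
        return alpha / b * (b * x + 1.0) * math.log(b * x / beta + 1.0) - alpha * x
    # outlier branch: constant gradient gamma; the offset keeps the loss continuous
    return gamma * x + gamma / b - alpha * beta

loss_small = balanced_l1(0.1)
loss_at_switch = balanced_l1(1.0)
```

The constraint on b makes the two branches meet at the switch point, so the loss is continuous there and the gradient is capped at γ for large residuals.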

Retinanet
Retinanet [29] is a well-performing one-stage object detector. One-stage detectors usually extract features and perform classification and regression directly. They output multi-channel maps that represent the classification and regression information. Retinanet addresses one of the critical problems that lead to low detection precision, i.e., the extreme foreground-background class imbalance. It proposes a novel loss function, focal loss. Focal loss focuses on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. The primary structure of Retinanet is shown in Figure 9.
Figure 9. Structure of Retinanet [29]. The input of Retinanet is the output of FPN [43]. Retinanet is divided into a classification sub-network and a regression sub-network. In this paper, the classification sub-network outputs feature maps with 18 channels, equaling the number of classes multiplied by the number of anchors. Focal loss is used in the classification sub-network. The regression sub-network outputs feature maps with 36 channels, equaling the number of offsets multiplied by the number of anchors. The regression sub-network utilizes Smooth L1 Loss [19].
The FPN we utilized for Retinanet was slightly different from the one in Section 3.2.1. We constructed a pyramid with levels P3 through P7. P6 is computed by a 3 × 3 stride-2 convolution on C5, and P7 is obtained by applying the ReLU function [49] followed by a 3 × 3 stride-2 convolution on P6.
We established the size of anchors according to Lin et al. [29]. For focal loss, we utilized it when calculating classification loss, with parameters γ = 2.0 and α = 0.25. For regression loss, we utilized Smooth L1 Loss with parameter β = 1/9.
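The effect of γ and α can be seen in a minimal sketch of the focal loss for a single binary prediction, following the definition of Lin et al. [29]; the function name and the example probabilities are illustrative.

```python
import math

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction; p is the predicted foreground probability."""
    p_t = p if target == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_negative = focal_loss(0.01, target=0)   # background predicted confidently
hard_positive = focal_loss(0.10, target=1)   # pylon predicted with low confidence
```

The modulating factor shrinks the loss of the easy negative by several orders of magnitude relative to the hard positive, which is how the many easy background anchors are kept from dominating training.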
During training, we labeled anchors as positive using an IoU threshold of 0.5 and as negative using an IoU threshold of 0.4. We also labeled as positive the anchor that had the maximum IoU with a ground-truth box. However, we ignored all anchors for a ground-truth box if that maximum IoU was lower than 0.1. We did not normalize [x, y, w, h] in Retinanet. Other parameters are identical to those of Faster R-CNN.
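One possible reading of this assignment rule is sketched below. Interpreting the 0.1 threshold as a minimum IoU for promoting the best-matching anchor of each ground-truth box is an assumption, as is the convention of labeling ignored anchors with -1; the helper name is hypothetical.

```python
def assign_labels(iou_matrix, pos_thr=0.5, neg_thr=0.4, min_pos_iou=0.1):
    """iou_matrix[g][a] is the IoU of ground-truth g and anchor a (at least one g).

    Returns one label per anchor: 1 = positive, 0 = negative, -1 = ignored.
    """
    num_gt, num_anchor = len(iou_matrix), len(iou_matrix[0])
    labels = []
    for a in range(num_anchor):
        best = max(iou_matrix[g][a] for g in range(num_gt))
        if best >= pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)   # between the two thresholds: not used in training
    for g in range(num_gt):
        # promote the best anchor of each ground truth unless the match is too weak
        best_a = max(range(num_anchor), key=lambda a: iou_matrix[g][a])
        if iou_matrix[g][best_a] >= min_pos_iou:
            labels[best_a] = 1
    return labels

ious = [[0.60, 0.45, 0.20, 0.05],
        [0.00, 0.10, 0.30, 0.05],
        [0.00, 0.00, 0.00, 0.08]]
labels = assign_labels(ious)
```

In this toy example, the third anchor is promoted to positive because it is the best match of the second ground-truth box, while the third ground-truth box receives no anchor since its best IoU is below 0.1.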

YOLOv3
You Only Look Once (YOLO) [25] is a typical one-stage object detector proposed by Joseph Redmon in 2016. Afterwards, the author released two improved versions, YOLOv2 [26] and YOLOv3 [27]. In the series, YOLOv3 is a relatively new detector that achieves good performance. Specifically, the feature extraction network of YOLOv3 is Darknet-53, which borrows the residual idea of ResNet. The basic unit of Darknet-53 is "1 × 1 convolution module + 3 × 3 convolution module + residual module". Darknet-53 retains the leaky ReLU and batch normalization layers of the earlier versions. The input images are downsampled five times in total. Darknet-53 also borrows the idea of the multi-scale feature layers in FPN, and the layers of the last three scales are selected as output. For example, the ratio of the output feature layer sizes is 1:2:4 when the size of the input image is 1024 × 1024. Furthermore, to improve the expression of shallow feature layers and make full use of the information in each feature layer, each relatively shallow layer is superimposed with the upsampled deeper feature layer. Figure 10 shows the structure of the YOLOv3 network.
Figure 10. Structure of YOLOv3 [27]. Darknet-53 is a residual network mainly constructed of 1 × 1 and 3 × 3 convolution modules. C1-C3 refer to the feature layers of the last three scales obtained after five downsamplings of the input image in Darknet-53. P1-P3 are obtained from the C1-C3 feature layers by superimposing adjacent layers after upsampling. The operator symbol in the figure refers to the superimposition operation.
The sizes of the nine anchors given in [25] were calculated by k-means clustering. We also used k-means clustering on the EPD-S subset, and the sizes of the anchors were modified to 26 × 22, 30 × 41, 35 × 60, 45 × 29, 47 × 48, 61 × 86, 63 × 35, 88 × 55, and 168 × 140. In addition, focal loss was selected as the loss function with γ = 0.8 and α = 1.0.
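A minimal sketch of anchor clustering is given below. The text only states that k-means clustering was used; assigning boxes to centroids by IoU of their widths and heights is an assumption taken from the YOLOv2 convention, and the synthetic box sizes are toy data, not values from the EPD-S subset.

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IoU between (N, 2) and (K, 2) width-height pairs aligned at one corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster box sizes into k anchors, returned sorted by area."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = wh_iou(boxes, centroids).argmax(axis=1)   # nearest = highest IoU
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]

rng = np.random.default_rng(1)
centers = np.array([[26.0, 22.0], [47.0, 48.0], [168.0, 140.0]])   # toy data only
boxes = np.vstack([c + rng.normal(0.0, 2.0, size=(50, 2)) for c in centers])
anchors = kmeans_anchors(boxes, k=3)
```

Running the same routine with k = 9 on the widths and heights of the labeled boxes would yield nine anchors ordered from small to large.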

YOLOv4
Recently, Bochkovskiy et al. proposed YOLOv4 [28], the latest detector of the YOLO series. Compared with YOLOv3, YOLOv4 has better overall performance and can achieve good detection results when trained on a single GPU. To obtain a fast and accurate detector, the authors carefully selected and tested the typical algorithm modules commonly used in deep learning models, and further designed and improved some modules. In particular, the improvements mainly focused on the selection of the backbone and the combination of several tricks. Having chosen CSPDarknet-53 as the backbone of the detector, the authors added an SPP block [17] to extend the receptive field of the model and utilized a modified PANet instead of FPN. As for the tricks, the authors selected the most suitable methods for YOLOv4 from commonly used deep-learning-based detection modules, including Mish as the activation function, DropBlock as the regularization method, etc. Moreover, YOLOv4 utilizes a new data augmentation method called Mosaic, which expands the data by stitching together four images. To adapt YOLOv4 to single-GPU training, the authors also improved several existing methods, including SAM, PANet, Cross mini-Batch Normalization (CMBN), etc. In general, the main structure of YOLOv4 can be summarized as "CSPDarknet-53+SPP+PANet+YOLOv3 Head+Tricks", which is shown in Figure 11.
In addition, we used anchors of the same size in YOLOv4 as YOLOv3, which were obtained through k-means clustering. The sizes of the nine anchors were, respectively, 26 × 22, 30 × 41, 35 × 60, 45 × 29, 47 × 48, 61 × 86, 63 × 35, 88 × 55, and 168 × 140. Except for setting the parameter random to 0, we utilized all the default parameters provided by the author in our experiment. Figure 11. Structure of YOLOv4 [28]. The same as Darknet-53, C 1 -C 3 refer to the feature layers of the last three scales obtained after five down-sampling of the input image. CSPDarknet-53 also uses the structure of CSPNet [44] and the Mish activation function. SPP refers to spatial pyramid pooling, and C 3 layer are disposed with 5 × 5, 9 × 9, and 13 × 13 pooling operations, respectively. P 1 -P 3 are obtained from C 1 -C 3 feature layers after aggregation of adjacent layers through two times of down-sampling and up-sampling. refers to the aggregation operation.

Retinanet FreeAnchor
FreeAnchor [40] uses probability theory to guide object-anchor matching, and replaces hand-crafted anchor assignment with "free" anchor matching by formulating detector training as a maximum likelihood estimation (MLE) [50] procedure. This method establishes a bag of candidate anchors for each object and learns an object-anchor matching approach. We used the FreeAnchor module jointly with Retinanet (named Retinanet FreeAnchor). The structure of Retinanet FreeAnchor is shown in Figure 12.
We utilized the same loss function as Zhang et al. [40]. The parameter γ of focal loss was 2.0 and α was 0.
Figure 12. Structure of Retinanet FreeAnchor [40]. Retinanet FreeAnchor has the same input, classification, and regression sub-networks as Retinanet [29]. The green-filled tetragons denote anchors containing objects, while the white-filled tetragons denote anchors containing nothing. The tetragons placed in the same bag of candidates contain the same object. The Anchor Matching Mechanism implements the learning-to-match approach.
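For reference, the role of the focal loss parameters γ and α can be illustrated with a per-prediction sketch. It follows the standard binary focal loss, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). The function name is hypothetical, and the value α = 0.25 in the usage lines is the default from the original focal loss formulation, used here only for illustration rather than as the setting of this experiment.

```python
import math


def binary_focal_loss(p, target, gamma, alpha):
    """Focal loss of one prediction.

    p is the predicted foreground probability, target is 1 (object) or 0 (background).
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative values only: an easy and a hard positive example.
easy = binary_focal_loss(0.9, 1, gamma=2.0, alpha=0.25)
hard = binary_focal_loss(0.1, 1, gamma=2.0, alpha=0.25)
print(easy, hard)
```

With γ > 0, well-classified examples contribute much less to the loss than hard ones, which is why focal loss helps one-stage detectors cope with the large number of easy background locations.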

FCOS
FCOS [32] is a typical anchor-free detector. Anchor-free detectors do not generate anchors, but obtain the classification and regression information directly through convolutions. The structure of FCOS is shown in Figure 13.
Figure 13. Structure of FCOS [32]. The structure of FCOS is similar to that of Retinanet [29]. The regression sub-network outputs four channels representing four offsets, while the classification sub-network outputs two channels representing two categories. IoU loss [51] and focal loss [29] are used as the regression and classification losses, respectively. The classification sub-network outputs another channel, trained with the center-ness loss, to suppress low-quality points far from the object center.
During training, if a point was inside the ground-truth box of an object, we assigned it as positive. Otherwise, we assigned it as negative. To deal with the problem caused by overlapping ground-truth boxes, we restricted the range of the regression distance at each feature level as in [32].
For the loss function, we utilized the same focal loss as Retinanet and the IoU loss [51]. In addition, FCOS proposed the center-ness loss to suppress low-quality points far from the object center. We utilized the CE loss with a sigmoid function to implement the center-ness loss. The weights of these three loss functions were equal.
We also used GN with a group number of 32 in the sub-networks. For the other parameters in FCOS, we utilized those recommended by Tian et al. [32]. The rest of the parameters were identical to those of Retinanet.
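The point assignment and the center-ness target described above can be sketched as follows. The center-ness formula is the one defined for FCOS in [32]; the function names and the example box are hypothetical, and the per-level regression range restriction is omitted for brevity.

```python
import math


def regression_targets(px, py, box):
    """Distances (l, t, r, b) from a point to the four sides of box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return px - x0, py - y0, x1 - px, y1 - py


def is_positive(px, py, box):
    """A point is positive when it lies strictly inside the ground-truth box."""
    return all(d > 0 for d in regression_targets(px, py, box))


def centerness(l, t, r, b):
    """Center-ness target: 1 at the box center, decaying towards the border."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def bce_with_logit(logit, target):
    """Cross-entropy on a sigmoid output, written in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


box = (0.0, 0.0, 10.0, 20.0)  # hypothetical pylon box in feature-map coordinates
print(centerness(*regression_targets(5.0, 10.0, box)))  # 1.0 at the box center
print(centerness(*regression_targets(2.0, 10.0, box)))  # 0.5 closer to the left side
```

Points near the border of a box receive a low center-ness target, so their detections are down-weighted at inference time.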

Retinanet FSAF
FSAF [41] is an anchor-free module. In traditional multi-scale feature extraction networks, e.g., FPN, it is hard to select the best feature level for every object during training. FSAF can be plugged into one-stage detectors with a feature pyramid structure. We used the FSAF module jointly with Retinanet (called Retinanet FSAF), with the structure shown in Figure 14.
Figure 14. Structure of Retinanet FSAF [41]. Retinanet FSAF combines an anchor-free module, the FSAF module, with Retinanet [29]. The right part shows the structure of the FSAF module, which has the same anchor-based branch as Retinanet. The blue part shows the structure of the anchor-free branch. The regression sub-network outputs four channels representing four offsets, while the classification sub-network outputs two channels representing two categories. IoU loss [51] and focal loss [29] are used as the regression and classification losses, respectively. The left part shows the global structure of Retinanet FSAF. Blue lines indicate that the results of the anchor-free branch are used to select the best feature levels for training objects, while the red lines represent object detection on the selected feature map. Details of feature map selection can be found in [41].
We established FSAF according to Zhu et al. [41], utilized IoU loss [51] as the regression loss and focal loss as the classification loss, and utilized Online Feature Selection to select the best feature level for training objects. Other parameters of Retinanet FSAF were identical to those of Retinanet.
In addition, all detectors were fine-tuned from pre-trained weights, which saved a lot of training time. The pre-trained models were trained on ImageNet. The pre-trained weights of ResNet101 are provided by the PyTorch website https://download.pytorch.org/models/resnet101-5d3b4d8f.pth, the weights of DarkNet-53 are from the website of the authors of YOLOv3 https://pjreddie.com/media/files/yolov3.weights, and the weights of YOLOv4 are from the website of the authors https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137. For transfer learning, we froze the shallow parameters of the pre-trained models except for YOLOv3 and YOLOv4. In ResNet101, we did not update the parameters of layers C1 and C2.

Details of Detector Training
For all detectors based on MMDetection, we randomly flipped the input images with a probability of 0.5 for data augmentation. We also normalized the input images. For YOLOv3, four data augmentation parameters provided by Ultralytics LLC were utilized to generate more training samples by rotating the images and adjusting the saturation, exposure, and hue. For YOLOv4, four data augmentation parameters provided by the authors were likewise utilized to generate more training samples by rotating the images and adjusting the saturation, exposure, and hue. Moreover, YOLOv4 utilizes the Mosaic data augmentation method introduced in Section 3.2.8.
During training, all detectors based on MMDetection utilized stochastic gradient descent (SGD) with momentum [52] as the optimizer. The momentum was set to 0.9 and the weight decay was 0.0001. YOLOv3 also utilized SGD as the optimizer, with momentum 0.9 and weight decay 0.0005. YOLOv4 utilized Adam [53] as the optimizer, with momentum 0.9 and weight decay 0.0005. We selected the initial learning rate appropriately and set the number of training epochs large enough to ensure that all models could be thoroughly trained. For the learning rate, we utilized a step decay strategy: when the epoch reached Epoch1 and again when it reached Epoch2, the learning rate was decreased to 1/10 of its previous value. We used the learning rate warm-up method [42] on the majority of the detectors. This method warms up a detector by using a small learning rate at the beginning of training. We utilized linear growth, with 500 warm-up iterations and a warm-up ratio of 1/3. The initial learning rates (lr) and epochs are shown in Table 3. It needs to be emphasized that, in our judgment, the parameters used by all detectors are the best ones obtained after tuning, which supports the fairness and accuracy of the performance comparison among different detectors for the task of electric pylon detection in remote sensing images.
Table 3. Parameter settings in training. Initial lr, initial learning rate; Epoch1 and Epoch2: All detectors utilized the STEP lr decline strategy, and the lr decreased to 0.1 times at Epoch1 and Epoch2; Batch size: 1 m and 1-2 m refer to the resolution of the samples in the training process.
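The combination of linear warm-up and step decay can be written out explicitly. The sketch below assumes the linear warm-up rule commonly used in MMDetection-style schedules, where the learning rate grows from warm-up ratio × lr to lr over the warm-up iterations, and assumes that the decay factor 0.1 is applied once at Epoch1 and once more at Epoch2. The function name and the example values of the initial lr, Epoch1, and Epoch2 are hypothetical and are not taken from Table 3.

```python
def learning_rate(base_lr, epoch, iteration, epoch1, epoch2,
                  warmup_iters=500, warmup_ratio=1.0 / 3.0, decay=0.1):
    """Learning rate at a given epoch and global iteration (both 0-based)."""
    lr = base_lr
    if epoch >= epoch1:
        lr *= decay  # first step decay
    if epoch >= epoch2:
        lr *= decay  # second step decay
    if iteration < warmup_iters:
        # Linear warm-up from warmup_ratio * lr up to lr.
        k = (1.0 - iteration / warmup_iters) * (1.0 - warmup_ratio)
        lr *= 1.0 - k
    return lr


# Hypothetical settings for illustration only.
for it in (0, 250, 500):
    print(it, learning_rate(0.01, epoch=0, iteration=it, epoch1=8, epoch2=11))
```

At iteration 0 the learning rate is one third of the initial value, and it reaches the full initial value once the 500 warm-up iterations are over.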

Experimental Settings
To accurately evaluate the performance of electric pylon detectors based on deep learning, we carried out extensive experiments on our self-made EPD dataset described in Section 3.1. We repeated stochastic experiments to reduce the error introduced by dataset partitioning. In particular, nine detectors were trained and tested on the EPD-S subset, which contains 1450 images. In each round of training, the EPD-S subset was randomly divided in the ratio 8:1:1, where 8/10 was used to train the detector, 1/10 was used to validate the trained detector, and 1/10 was used to test the performance. The detector achieving the best average precision (AP) during validation was selected as the final detector, and its AP on the testing set was then used to evaluate the generalization capability. To improve the reliability of the experimental results, we repeated the above process of "randomly dividing dataset → training detector → validating detector → testing detector" 10 times for each kind of deep learning detector. The average of the APs obtained from the 10 stochastic tests was reported as the basis for the performance evaluation of the final detector.
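The repeated random partition can be sketched as follows. This is a minimal illustration with hypothetical function names; using the round index as the random seed is an assumption made for reproducibility, not a detail reported in the paper.

```python
import random


def split_8_1_1(items, seed):
    """Randomly split items into train, validation, and test parts in the ratio 8:1:1."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = len(shuffled) // 10
    n_test = len(shuffled) // 10
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def mean_over_rounds(values):
    """Average of a per-round metric such as AP."""
    return sum(values) / len(values)


image_ids = list(range(1450))  # the EPD-S subset contains 1450 images
splits = [split_8_1_1(image_ids, seed=r) for r in range(10)]
train, val, test = splits[0]
print(len(train), len(val), len(test))  # 1160 145 145
```

Each of the 10 rounds would train on its own training part, pick the checkpoint with the best validation AP, and report the test AP; `mean_over_rounds` then gives the averaged value.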
Besides, considering complex real situations, the final detectors trained on the EPD-S subset were also tested on the EPD-C subset. The average of the APs obtained from the 10 tests was taken as an additional evaluation criterion. Since the EPD-C subset was used only in testing and not in training, the testing results on the EPD-C subset give a more objective evaluation of the generalization capability of each detector.
For the software environment, we utilized Python 3 (3.6.9 and 3.7.5 to run different programs), PyTorch 1.0, CUDA 10.0, and cuDNN 7.6.4 to run all programs. For the hardware environment, we trained and tested the detectors using one NVIDIA GeForce RTX 2080Ti GPU with 10 GB of memory.
It should be noted that the data and code of this paper will be released to the public at our website: https://github.com/qsjxyz/Electric-Pylon-Detection-in-RSI.

Index for Evaluation
We use two popular indicators, i.e., recall and average precision (AP), to evaluate the detection performance of detectors. These indicators are computed based on IoU as
$$\mathrm{IoU} = \frac{\mathrm{area}(B_{pred} \cap B_{truth})}{\mathrm{area}(B_{pred} \cup B_{truth})},$$
where $B_{pred}$ represents the predicted bounding box and $B_{truth}$ represents the ground truth. The bounding box is considered to be correct when the IoU surpasses a threshold. We set the threshold to 0.5, the same as Yao et al. [9]. When the IoU exceeds 0.5, the testing result is a true positive (TP); otherwise, it is a false positive (FP). When the detector predicts no target in a location containing a target, it is a false negative (FN). Then, we can calculate two metrics, precision (P) and recall (R), as
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$
The output of detectors contains detections and their confidence scores. After sorting the detections by confidence score from high to low and computing the precision and recall at each position in the sorted list, we can obtain a precision-recall curve (PRC). AP is the area under the PRC.
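The evaluation procedure can be sketched for a single image as follows. The sketch makes two assumptions that the text does not spell out: a ground-truth box can be matched by at most one detection (later duplicates count as FP), and the area under the PRC is computed with the monotone precision envelope used in PASCAL VOC style all-point interpolation. The function names and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_and_ap(detections, truths, iou_thr=0.5):
    """detections: list of (score, box); truths: list of boxes. Returns (recall, AP)."""
    if not truths:
        return 0.0, 0.0
    matched = [False] * len(truths)
    tp = fp = 0
    recalls, precisions = [], []
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, t) for t in truths]
        best = max(range(len(truths)), key=lambda j: overlaps[j])
        if overlaps[best] > iou_thr and not matched[best]:
            matched[best] = True
            tp += 1
        else:
            fp += 1  # low overlap or duplicate detection
        recalls.append(tp / len(truths))   # FN = len(truths) - TP
        precisions.append(tp / (tp + fp))
    # Monotone precision envelope, computed from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    final_recall = recalls[-1] if recalls else 0.0
    return final_recall, ap


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
print(recall_and_ap(detections, truths))
```

In this toy case both pylons are found, so the recall is 1, while the false positive ranked second lowers the AP to 5/6.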
Besides, to evaluate the practicability of the model more comprehensively, the testing speeds and model sizes of all detectors were also calculated on the same hardware platform and are reported in this paper. It should be noticed that all detectors in the experiments were evaluated based on the same criteria.

Performance of Detectors
Firstly, we evaluated the performance of detectors at the original resolution of 1 m/pixel, and conducted 10 rounds of repeated stochastic experiments for each detector following the experimental procedure in Section 4.1. The column Batch size (1m detector) in Table 3 shows the batch sizes set in the training process. Considering the limited memory of the graphics card and the training speed, we selected for each detector the maximum batch size available in our hardware environment. Table 4 shows the results of 10 detectors trained with data of 1-m/pixel resolution. Moreover, to analyze the adaptability of detectors to complex environments, we tested each trained detector on the EPD-C subset. Table 5 presents the results.
Table 4. Results on the EPD-S subset. Train resolution, 1 m/pixel; test resolution, 1 m/pixel. AP refers to average precision. Recall, AP, and speed are expressed in the format of mean ± standard deviation on the basis of 10 rounds of test results.

Table 7. Results on mixed resolution EPD-S subset. Train resolution, 1-2 m/pixel; test resolution, 1-2 m/pixel. AP refers to average precision. Recall, AP, and speed are expressed in the format of mean ± standard deviation on the basis of 10 rounds of test results.
For the detectors trained under the mixed resolution of 1-2 m/pixel, it can be seen that YOLOv3 obtains the best recall and Retinanet FSAF obtains the best AP. Moreover, when we test the detectors on the EPD-C subset, YOLOv4 obtains both the best recall and the best AP. In general, the AP of FCOS and the recall of Faster R-CNN on the mixed resolution dataset are not good.
Overall, the performance of the detectors is satisfactory on the EPD-S subset, but the experimental results on the EPD-C subset are not ideal. This indicates that the detectors need further training to fully adapt to complex environments. Comparing the models trained at 1-m resolution with those trained at 1-2-m resolution, we find that the recalls and APs of the various models decline slightly, with an acceptable drop. This suggests that the parameters utilized in the experiment can adapt to the change of resolution to a certain extent.
In terms of memory usage, YOLOv3 performs best, with the smallest model size. This may be because YOLOv3 utilizes Darknet-53 as the backbone network, which contains fewer parameters than the ResNet101 used in other detectors. YOLOv4, which utilizes CSPDarknet-53 as the backbone network, has a small model size similar to that of YOLOv3. FCOS has the smallest model size among the detectors taking ResNet101 as the backbone network. Cascade R-CNN and Grid R-CNN occupy a large amount of memory.
In terms of running speed, YOLOv3, YOLOv4, and FCOS perform well, while Grid R-CNN and Cascade R-CNN have relatively slow speed. This also indicates that higher speed is often accompanied by lower memory occupation. When the actual detection task requires a high speed and low memory detector, we recommend choosing YOLOv3, YOLOv4, or FCOS.

Robustness against Spatial Resolution
To test the performance of detectors under different spatial resolutions and evaluate the adaptability of detectors to resolution variation, we tested the detectors trained in Section 4.3 on 2-m and 4-m resolution testing sets obtained by down-sampling the EPD-C subset. We report the results in Tables 9-12.
Table 9. Results on the EPD-C subset. Train resolution, 1 m/pixel; test resolution, 2 m/pixel. AP refers to average precision. Recall, AP, and speed are expressed in the format of mean ± standard deviation on the basis of 10 rounds of test results.
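The construction of lower-resolution test images can be illustrated with a small sketch. The paper does not state which down-sampling method was used; block averaging is assumed here purely for illustration, and the function names, the toy image, and the example box are hypothetical.

```python
def downsample(image, factor):
    """Block-average a 2D image (list of rows) by an integer factor.

    Assumption: block averaging; rows and columns that do not fill a block are dropped.
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(0, height - height % factor, factor):
        row = []
        for x in range(0, width - width % factor, factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def scale_box(box, factor):
    """Rescale a ground-truth box (x0, y0, x1, y1) to the down-sampled image."""
    return tuple(v / factor for v in box)


toy = [[0, 2, 4, 6],
       [2, 4, 6, 8],
       [1, 1, 3, 3],
       [1, 1, 3, 3]]
print(downsample(toy, 2))             # [[2.0, 6.0], [1.0, 3.0]]
print(scale_box((8, 12, 40, 60), 4))  # (2.0, 3.0, 10.0, 15.0)
```

Going from 1 m/pixel to 4 m/pixel corresponds to a factor of 4, so a pylon spanning 40 pixels at the original resolution spans only 10 pixels in the 4-m test image.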
Overall, the detectors adapt to resolution changes to some extent. The detectors trained on the 1-m resolution data achieve better results on the 2-m resolution test set, while the detectors trained on the mixed resolution data achieve better results on the 4-m resolution test set. When AP is taken as the index of adaptability to resolution variation, FCOS trained on the 1-m resolution data and Libra R-CNN trained on the mixed resolution data show better resolution robustness.
Figures 15 and 16 visualize the detection results of each detector on the EPD-C subset. Four result images are shown for each detector in its row. The first two images are from the same scene with resolutions of 1 and 4 m, respectively; the background is mountainous, and an island in the upper right corner has characteristics similar to those of electric pylons. The other scene, shown on the right at 1-m and 4-m resolution, is a power plant that contains relatively dense electric pylon targets, and the frame structure of the plant itself also affects the detection results.
Overall, the detection results at 1-m resolution are superior to those at 4-m resolution, while all detectors still show resolution adaptability to some extent. Besides, the detection results of the scene on the left are more affected by the resolution decrease. The reason may be that the background of the left scene is more similar to electric pylons. Furthermore, as shown in Figure 15d, the results of Libra R-CNN on the left scene are not ideal, and the detector even identifies the island as an electric pylon target at 1-m resolution. This is because Libra R-CNN uses the Feature Balance Pyramid to enhance the representation of areas similar to electric pylon targets, thereby misidentifying the island as an electric pylon. As shown in Figure 16d, the results of FCOS on the right scene are not acceptable: it misses several electric pylon targets. This suggests that anchor-free detectors may not perform well when detecting densely distributed targets.

Analysis of Performance
Analyzing the advantages and disadvantages of each deep-learning-based detector can help to select the most suitable detector for the electric pylon detection task. As shown in Section 4, Faster R-CNN [19] achieves ordinary but acceptable performance in terms of recall, AP, running speed, and model size. On the one hand, Faster R-CNN uses FPN to adapt to the scale change of targets. On the other hand, Faster R-CNN uses RPN to perform better in binary classification. As Faster R-CNN does not improve much on the conventional two-stage detectors, its performance in the actual detection task is average. Specifically, we regard Faster R-CNN as a benchmark for comparison. Cascade R-CNN [20] has the biggest model size and a slower running speed. When Cascade R-CNN is tested on a test set with lower resolution, it generally performs better than Faster R-CNN, according to the IoU threshold mentioned in Section 3.2.3.
Grid R-CNN [22] obtains a superior AP and recall, but its model size is big and its running speed is the slowest. Due to its grid point positioning mechanism, Grid R-CNN achieves a good detection result on square targets. Considering that electric pylon targets are similar to square targets, Grid R-CNN performs well in electric pylon detection. However, the feature map size of Grid R-CNN after ROI Align is 14 × 14, which is larger than the commonly used 7 × 7 size. Thus, Grid R-CNN has a slow running speed and occupies a large amount of memory.
Libra R-CNN [39] has a model size similar to that of Faster R-CNN, but it performs better than Faster R-CNN in other aspects. Compared to other detectors, Libra R-CNN usually has a relatively fast running speed with a mediocre recall and AP. Since Libra R-CNN's approach of dividing the training process into three stages is not very complex, its detection speed and model size are not much different from those of Faster R-CNN. Besides, Libra R-CNN uses the Feature Balance Pyramid to enhance the representation of areas similar to the targets to be detected.
Retinanet [29] has a similar AP and running speed as Faster R-CNN, but its recall is better and its model size is smaller. Retinanet is a one-stage detector which does not use RPN, so it has a simple structure and small model size. In addition, Retinanet uses focal loss to improve the detection accuracy.
YOLOv3 [27] achieves the smallest model size and better resolution adaptability than other detectors, except that the detector trained on 1-m resolution data obtains a low AP on the 4-m resolution test set. YOLOv4 [28] also has a small model size, runs fast, and adapts well to a certain degree of resolution degradation. Both Darknet-53 and CSPDarknet-53 contain fewer layers than ResNet101; thus, both YOLOv3 and YOLOv4 have a small model size and perform well in speed. Compared with YOLOv3, YOLOv4 offers better overall detection performance and integrates several typical modules commonly used in deep-learning-based detectors.
Retinanet FreeAnchor [40] and Retinanet FSAF [41] perform well on the EPD-S subset, achieving satisfactory recalls and APs, but they do not adapt well to resolution variation. These two detectors both make improvements on the basis of Retinanet, and the experimental results show that they both perform better than Retinanet. The former represents object-anchor matching as a maximum likelihood estimation (MLE) process and selects the most representative anchor from each object's anchor set, while the latter focuses on how to select the optimal feature layer for object-anchor matching. The results of these two detectors indicate that object-anchor matching is an important part of the electric pylon detector.
FCOS [32] achieves superior performance in terms of running speed, and in most cases its recall is satisfactory, but its AP is not good. FCOS is an anchor-free detector. It has a fast detection speed and performs well in detecting small targets and at low resolution.
In general, we could not find a detector that performs well in all aspects. Thus, we need to choose the most suitable one based on the requirements and restrictions of the practical situation. FCOS and YOLOv4 could be used to get rapid results, while YOLOv3 and YOLOv4 could be used when storage space is limited. Retinanet FreeAnchor, Retinanet FSAF, and YOLOv4 could be used in conventional environments. YOLOv4 and Retinanet could be used in complex environments when high precision is needed, while Libra R-CNN and YOLOv4 could be used in complex environments to get high recall. If the spatial resolution does not change much, Grid R-CNN, Libra R-CNN, YOLOv4, and Retinanet could be used to obtain a superior AP; otherwise, we recommend using FCOS as the electric pylon detector.

Analysis of Resolution Robustness
To discuss resolution robustness, we calculated the average recall and AP of the detectors in each case. The results are shown in Table 13. As shown in Table 13, if the resolution variation is small, e.g., from 1 to 2 m, the detectors can still achieve acceptable accuracy. However, when the resolution varies beyond a certain limit, e.g., by more than 1 m, the performance of the detectors declines rapidly. The detectors trained on mixed resolution data perform relatively better on the 4-m test set. That is mainly because detectors trained on variant-resolution data learn more multi-scale features from the multi-scale images. It should be noted that we did not fine-tune the parameters when training on variant-resolution data, and the accuracy of the detectors could be improved by adjusting parameters such as the scales of the anchors.
The detectors trained on fixed 1-m resolution data or on variant-resolution data perform only barely satisfactorily on the 4-m test set. On the one hand, there are quite a few objects smaller than 20 pixels, and these small objects can hardly be observed even by the human eye. On the other hand, improving the network structure alone can hardly solve multi-scale problems completely. Therefore, new networks need to be designed to adapt better to varying resolution.

Application Prospects
With the development of satellite-based and airborne Earth observation, high-resolution remote sensing data can be obtained more and more easily. It is now possible to obtain high-resolution remote sensing data of the regions of interest at low cost and high frequency. Thus, given high-resolution remote sensing data of electric pylons, our deep-learning-based detectors can automatically detect electric pylons. Considering the good generalization ability of deep-learning-based detectors, our work has significant potential for electric pylon detection, benefiting the management of electric power system.

Conclusions
In this paper, we introduce deep learning methods to achieve electric pylon detection in high-resolution remote sensing images. To analyze the comprehensive performance of different detectors under different conditions, we selected 10 state-of-the-art deep-learning-based detectors, and comprehensively compared their performance on a specially made dataset containing 1500 images. The experimental results show the characteristics of each detector in detail and provide selection criteria for applying deep-learning-based detectors to actual scenes. For conventional detection tasks, YOLOv4 and Retinanet FSAF can achieve relatively good detection accuracy, while FCOS and YOLOv4 run fast and can adapt to tasks with real-time processing requirements. In addition, YOLOv3 and YOLOv4 are small in model size and can adapt to working environments with limited memory. For complex detection tasks, Grid R-CNN, YOLOv4, and Retinanet have better comprehensive performance. From the perspective of practical application, Grid R-CNN, Libra R-CNN, YOLOv4, and Retinanet have excellent detection accuracy in tasks with low resolution, and FCOS shows better comprehensive performance in tasks with mixed resolution. It should also be noted that no detector performs best in all conditions thus far. Therefore, further research is needed on deep-learning-based detectors for electric pylon detection in high-resolution remote sensing images.