Deep Learning-Based Bird’s Nest Detection on Transmission Lines Using UAV Imagery

In order to ensure the safety of transmission lines, the use of unmanned aerial vehicle (UAV) images for automatic object detection, such as the detection of birds' nests, has important application prospects. Traditional bird's nest detection methods mainly rely on the morphological characteristics of the nest, and have poor applicability and low accuracy. In this work, we propose a deep learning-based automatic bird's nest detection framework: region of interest (ROI) mining faster region-based convolutional neural network (RCNN). First, the prior dimensions of the anchors are obtained by k-means clustering to improve the accuracy of coordinate box generation. Second, in order to balance the numbers of foreground and background samples during training, the focal loss function is introduced in the region proposal network (RPN) classification stage. Finally, an ROI mining module is added to solve the class imbalance problem in the classification stage, exploiting the characteristics of difficult-to-classify bird's nest samples in UAV images. After parameter optimization and experimental verification, the proposed deep learning-based bird's nest detection framework achieves high detection accuracy. In addition, the mean average precision (mAP) and F1 score of the proposed method are higher than those of the original faster RCNN and cascade RCNN. Our comparative analysis verifies the effectiveness of the proposed method.


Introduction
With the increasing number of high-voltage transmission lines, damage caused by birds to the power systems is increasing. Some birds build their nests on the transmission towers. During rain, the birds' nests act as conductors and trigger power line tripping. Similarly, in a dry environment, the branches of the bird's nest are prone to fire, which not only affects the normal power supply, but also poses a huge security risk. Thus, in order to ensure the safe and reliable operation of power grid systems, and reduce the adverse effects of bird activities on transmission lines and other equipment, the research problem of detecting and locating bird nests on transmission lines and poles is of great scientific importance and practical significance.
Bird's nest detection on a high-voltage transmission line is a problem of image classification [1] and target detection [2]. In the literature, various methods for bird's nest detection on high-voltage transmission lines have been presented. Most of these methods only consider the texture or color information of the bird's nest, which makes it impossible to locate the nest accurately when the image's contrast is poor or its texture information is not rich.
In [3], the authors propose a method that uses texture, color, shape and nest area to determine whether an image contains a bird's nest. However, the bird's nest does not possess distinct characteristics; for instance, objects such as branches and grass also have strong texture characteristics.
The main contributions of this work are as follows: (1) We propose a new hard sample mining algorithm for the object detection task. Through a comprehensive analysis of the datasets, the size characteristics of the detection objects are determined and the limiting conditions are given. The ROI mining method can focus better on difficult-to-classify small-scale objects and optimize the model in a targeted manner. (2) According to the characteristics of the annotation boxes in the datasets, we obtain adaptive prior anchors by k-means clustering. This method reduces the convergence time of the model and improves the accuracy of the generated coordinate boxes. (3) We incorporate the focal loss function into the model to solve the class imbalance problem during training, which improves the detection accuracy of the model.
The results presented in this work show that the proposed method achieves better results as compared to other methods proposed in the literature.

Faster Region-Based Convolutional Neural Networks (RCNN) in an Automatic Detection Framework
Region-based convolutional neural networks (RCNN) use a selective search to generate region proposals [15]. These region proposals are then passed to a CNN to extract features, which are finally classified by a support vector machine (SVM). The detection accuracy of RCNN on the pattern analysis, statistical modeling and computational learning visual object classes (PASCAL VOC) dataset is much higher than that of traditional methods [16]; however, its training time and space overheads are huge. Each training image yields a large number of regions of interest (ROI), features must be extracted for each ROI, and the output must be written to disk. Likewise, during testing it is necessary to extract ROIs from the test samples and complete the detection after extracting the features from each ROI.
Faster RCNN is an improved algorithm based on RCNN and fast RCNN [17]. Its main improvements are: no SVM needs to be trained; multiple loss functions are combined into a single training process, reducing time overhead; and all layers are updated during training, so features no longer need to be written to disk, saving storage space. The region proposal network (RPN) is proposed to generate candidate regions in a computationally efficient manner. Notably, through alternate training, the RPN and the fast RCNN network share parameters, which greatly improves the detection speed.

Feature Extraction Network
Faster RCNN is based on the residual network (ResNet) [18] and consists of a series of residual blocks for extracting image features. The structure of the residual block is presented in Figure 1. Each residual block comprises two paths: F(x) and x. The F(x) path fits the residual, and the x path is an identity mapping. The addition requires that the dimensions of F(x) and x involved in the operation are the same. ResNet effectively addresses the problem of network degradation, i.e., the phenomenon whereby, as the network's depth increases, the accuracy on the training set gradually decreases and network performance deteriorates. By mitigating this problem, ResNet improves the accuracy of object detection.

Region Proposal Network
RPN is a fully convolutional network capable of end-to-end training. The idea is to use a convolutional neural network to generate a set of rectangular object proposals, each with an objectness score. This is accomplished by the sliding window method on the feature map of the last shared convolutional layer to generate region proposals.
The small sliding window, which is mapped back to a corresponding low-dimensional feature, takes a 3 × 3 window of the convolutional feature map as input. The features are fed into two sibling 1 × 1 convolutional layers: the regression layer and the classification layer. Please note that the classification layer is used only for softmax classification, while the regression layer is used for accurately locating the candidate regions. This is elaborated in Figure 2.

Anchor Mechanism
The anchor is the core of the RPN network [19]. Anchors are boxes of preset size used to determine whether any target objects are present in the corresponding receptive field at the center of each sliding window. Since target sizes and length-to-width ratios differ, windows of multiple scales are required. Faster RCNN sets the reference window size to 16 and applies three scale multipliers (8, 16, 32) and three aspect ratios (1:1, 1:2, 2:1), so that a total of 9 anchor scales are obtained, as presented in Figure 3.
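The anchor scheme described above can be sketched in a few lines. The function below is an illustrative NumPy reconstruction (not the authors' code) that generates the 9 reference anchors from a base window size of 16, scale multipliers (8, 16, 32) and aspect ratios (1:1, 1:2, 2:1):

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(1.0, 0.5, 2.0)):
    """Generate the 9 RPN reference anchors, centered at (0, 0).

    Each anchor is (x1, y1, x2, y2). A ratio r is interpreted as h / w,
    so each anchor preserves the area (base_size * scale) ** 2.
    """
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)   # width from area and aspect ratio
            h = w * ratio               # height = width * (h / w)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (9, 4)
```

In a real detector these reference boxes are then shifted to every sliding-window position on the feature map.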

Loss Function
The anchors with the largest intersection over union (IoU) with the ground truth boxes are considered positive samples. Similarly, anchors whose IoU with every ground truth box is less than 0.3 are considered negative samples during RPN training. The loss function is defined as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i and p_i represent the ith anchor in the mini-batch and the probability that the ith anchor is foreground, respectively. When the ith anchor is foreground, p_i* is 1, and 0 otherwise. Please also note that t_i and t_i* represent the coordinates of the predicted bounding box and the ground truth coordinates, respectively; N_cls and N_reg are normalization terms, and λ balances the classification and regression components.
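As a concrete illustration, the multi-task loss above can be evaluated as follows. This is a minimal NumPy sketch, assuming binary cross entropy for L_cls and smooth L1 for L_reg with illustrative normalization choices; real implementations operate on network logits inside an autograd framework:

```python
import numpy as np

def smooth_l1(x):
    # Smooth L1 (Huber) loss used for box regression.
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0):
    """Multi-task RPN loss over a mini-batch of anchors.

    p:      predicted foreground probabilities, shape (N,)
    p_star: labels (1 = foreground, 0 = background), shape (N,)
    t, t_star: predicted / target box offsets, shape (N, 4)
    """
    eps = 1e-7
    # Classification term: binary cross entropy, averaged over all anchors.
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # Regression term: only foreground anchors (p_star = 1) contribute.
    l_reg = (smooth_l1(t - t_star).sum(axis=1) * p_star)
    n_reg = max(p_star.sum(), 1)
    return l_cls.mean() + lam * l_reg.sum() / n_reg
```

With perfect box offsets the regression term vanishes and only the classification term remains.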

Fast RCNN
During this phase, fast RCNN uses the obtained proposal feature maps, calculates the specific category of each proposal using the fully connected layer and softmax layer [20], and outputs the classification probability vector. In addition, the regression part uses bounding box regression to obtain the position offset for each proposal which is used to extract more accurate coordinates of the object's detection box.

Annotation Boxes Clustering
Faster RCNN uses nine default size anchor boxes for object detection. In this work, the statistical analysis of the size of annotation boxes reveals that the ground truth bounding box of the bird's nest dataset used in this paper are relatively small. In order to enhance the efficiency of bird's nest detection with different aspect ratios, we perform k-means clustering on bird's nest annotation boxes to find the sizes of the anchor boxes that are more suitable for the dataset. This is then used to design the anchors. Consequently, the convergence time of the model decreases, and the accuracy of bounding box coordinates is improved. When new images are added as input, the clustering method will automatically adjust the anchors according to the updated clustering result of the annotation boxes. This can adapt better to the characteristics of the dataset.
The k-means clustering method [21] comprises three steps. First, initialize the number of categories and the cluster centers. Second, calculate the distance between each bounding box and all cluster centers, and assign the bounding box to the nearest cluster; the mean value of every cluster is updated after each assignment. Finally, repeat the two aforementioned steps until the mean position of each cluster stabilizes.
In this work, we use 9 clusters for k-means clustering. The width and height of the ground truth bounding boxes are used as input features. The size distribution of bounding boxes in the bird's nest dataset after k-means clustering is presented in Figure 4. After clustering, the length and width of nine anchors are obtained. These are used as the basis for designing the anchors of the model in this paper.
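The three-step procedure above can be sketched as follows. This is a minimal reconstruction using plain Euclidean distance on (width, height) pairs; some anchor-clustering implementations use 1 − IoU as the distance instead, and the authors' exact choice is not stated:

```python
import numpy as np

def cluster_boxes(wh, k=9, iters=100, seed=0):
    """Cluster annotation-box (width, height) pairs with k-means.

    wh: array of shape (M, 2). Returns k cluster centers, which serve
    as the prior anchor sizes.
    """
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers from k random boxes.
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each box to its nearest center, then update means.
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            wh[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 3: stop once the cluster means no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```

For the bird's nest dataset, k = 9 yields the nine anchor widths and heights used to design the model's anchors.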

Focal Loss Function
In order to solve class imbalance [22] problem, this paper introduces focal loss function [23] in the RPN foreground and background classification stage, replacing L cls in the RPN loss function.
The focal loss function is modified based on the standard cross entropy loss function that reduces the weight of samples that are easy to classify. Thus, the model is more focused on the samples that are difficult to classify during training.
In this work, focal loss is defined as:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

where α_t represents the weight (α_t ∈ [0, 1]) used to balance the uneven proportion of positive and negative samples, γ is the focusing parameter (γ ≥ 0), and (1 − p_t)^γ is called the modulation coefficient. The influence of the modulation coefficient on the loss value increases as γ increases. This addresses the imbalance between simple and complex training examples by reducing the weight of easily classifiable samples, so that the model focuses more on difficult and misclassified samples. p_t is defined as:

p_t = p if y = 1, and p_t = 1 − p otherwise,

where p represents the predicted probability that the sample belongs to class 1 (p ∈ [0, 1]), and y represents the label category (y ∈ {±1}). When a sample is misclassified, p_t is very small and 1 − p_t tends to 1, so the loss of that sample is barely reduced. Conversely, when a sample is classified correctly, p_t is large and 1 − p_t approaches 0, so the resulting loss value is very small. This loss function not only adjusts the weights of positive and negative samples, but also controls the weights of easy- and difficult-to-classify samples, solving the class imbalance problem in the RPN network during training.
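The definition above translates directly into code. The sketch below uses the settings adopted later in this paper's experiments (α_t = 1, γ = 2) as defaults:

```python
import numpy as np

def focal_loss(p, y, alpha_t=1.0, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of class 1, in [0, 1]
    y: label in {+1, -1}
    The modulation coefficient (1 - p_t)**gamma down-weights easy samples.
    """
    p_t = np.where(y == 1, p, 1.0 - p)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-7)

# An easy, correctly classified sample contributes far less loss
# than a hard, misclassified one:
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.05]), np.array([1]))
```

Here the easy sample's loss is scaled by (1 − 0.95)² = 0.0025, while the hard sample's loss keeps almost its full cross-entropy magnitude.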

Bird's Nest Detection Network Framework
Based on faster RCNN, this paper improves the network structure, and designs a network model based on ROI hard negative mining, called ROI mining faster RCNN. The proposed method effectively solves the problem of bird's nest detection for small objects with an imbalanced class problem.

Region of Interest (ROI) Mining Method
In the training phase of a two-stage detection network, the number of negative samples may reach tens or even hundreds of times that of the positive samples. Most negative sample features correspond to background regions in the receptive field of the input image. These are easy to classify, i.e., their classification loss values are small. Similarly, the positive set also contains many samples whose features are easy to classify. Therefore, during training, the decrease in the classification loss may be due to the correct classification of a large number of easy-to-classify samples. In other words, easy-to-classify samples dominate the decline of the loss function, so the final detector cannot effectively identify small birds' nests in complex scenes where their shapes are not obvious.
In response to the aforementioned problems, we propose the ROI mining method. Figure 5 presents the flow chart of ROI mining. The proposed method increases the proportion of small objects and difficult-to-classify objects in the classification and regression losses by screening and integrating all the ROI regions obtained through the RPN network, raising their proportion in the total samples from 1:120 to 3:120. Thereby, a classifier is obtained that is better suited to detecting difficult-to-classify nest samples with inconspicuous features. The detailed process of the ROI mining method is as follows:
Step 1. Before training starts, calculate the areas of all the annotation boxes in the bird's nest dataset and sort them in descending order. Afterwards, compute the median area S.
Step 2. During training, obtain all the candidate ROIs to be used as input to the fast RCNN classification and regression loss calculations. Calculate the corresponding total loss value for each ROI, and sort the ROIs by loss value from high to low.
Step 3. Select N ROIs for the loss calculation: take the N/2 ROIs with the highest total loss values, and fill the remaining N/2 slots with ROIs whose loss values are high and whose areas are smaller than S.
Step 4. Use the selected N ROIs as input for the fast RCNN classification loss and regression loss calculation.
For the aforementioned process, we select N = 128 and apply this method to the bird's nest detection in this paper. This process makes small objects and difficult-to-classify samples dominate the loss function. Therefore, the class imbalance in the training process of the detector is mitigated, thereby improving the detection accuracy.
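The four steps above can be sketched as a selection routine. This is an illustrative reconstruction of the described procedure, not the authors' exact code; in particular, how overlaps between the two halves are handled is an assumption:

```python
import numpy as np

def roi_mining(rois, losses, areas, s_median, n=128):
    """Select N ROIs for the Fast RCNN head (Steps 2-4 above).

    rois:     (M, 4) candidate boxes from the RPN
    losses:   (M,) total loss value per ROI
    areas:    (M,) area of each ROI
    s_median: median annotation-box area, computed before training (Step 1)
    Half the budget goes to the hardest ROIs overall; the other half to
    hard ROIs that are also smaller than the median area.
    """
    order = np.argsort(-losses)            # sort by loss, highest first
    top_half = order[: n // 2]             # hardest ROIs overall
    rest = order[n // 2:]                  # remaining ROIs, still by loss
    small_rest = rest[areas[rest] < s_median]
    small_half = small_rest[: n // 2]      # hard AND small ROIs
    keep = np.concatenate([top_half, small_half])
    return rois[keep], keep
```

With this selection, small and difficult-to-classify samples dominate the loss computed by the detection head.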

ROI Mining Faster RCNN Structure
In this work, we propose a method based on the two-stage faster RCNN framework: we improve faster RCNN and obtain ROI mining faster RCNN. ROI mining faster RCNN is based on ResNet-101, which is used for extracting features from images; this choice balances detection accuracy against the requirements of real-time applications and the characteristics of image samples obtained using an unmanned aerial vehicle (UAV). To improve the accuracy of the bounding box coordinates generated by the algorithm, we use k-means clustering to derive the anchor boxes. In addition, we introduce the focal loss function in the RPN stage to balance the numbers of foreground and background samples, and we present the ROI mining module to solve the class imbalance problem during training. The overall flow chart is shown in Figure 6. We use the annotated images as the input of the model; ResNet automatically extracts image features and generates feature maps. The RPN network generates multiple candidate ROIs based on the feature maps. A classifier then separates these ROIs into foreground and background, and a regressor makes preliminary adjustments to their positions. The processed ROIs are combined with the image information to obtain proposals, which are input into ROI pooling to produce outputs of the same dimension. After that, we use ROI mining to screen these results and input the selected ones into the final regression and classification networks. The method automatically generates the detection boxes and finally produces the detection maps.
The main focus of this work is to solve the problem of class imbalance in the model training process. The main cause of this problem is error in the classification stage: during training, it is difficult for the classifier to recognize small-sized objects, which are easily classified as background, leading to the imbalance between foreground and background. In the RPN stage, object localization, i.e., using the regression layer to locate objects, may also cause errors, but most of these errors stem from incorrect classification. Therefore, after addressing the problems in the classification stage, we no longer consider the localization problems.

Simulation Results and Analysis
In this section, we present the results obtained using the proposed ROI mining faster RCNN detection framework and analyze the verification results.

Data Preparation
In this work, we collected 800 aerial images of electric towers, acquired from the Zhaotong power supply bureau of Yunnan province. We use these images to annotate bird's nest data for training and testing purposes. We use 5-fold cross-validation to evaluate the stability and detection performance of the models: the original dataset is randomly divided into five non-overlapping sub-datasets, and the models are then trained and validated five times. Each time, four sub-datasets are selected as the training set and one sub-dataset as the validation set, so that the sub-dataset used for validation differs each time. Finally, the average of the five results is taken as the index representing the performance of the model. In addition, the training images are subjected to horizontal flips, vertical flips, and random rotations [24]; the images are randomly stretched within a certain range; and Gaussian blur and salt-and-pepper noise are applied to simulate the real-world environment. The uniform resource locator (URL) of the dataset is given in the Supplementary Materials.
After the aforementioned preprocessing, we obtain 3000 images for training. Figure 7 presents a schematic diagram of annotated images from the training set after data enhancement: Figure 7a is an original image, and the remaining images are partial enlargements of the original image produced by the data processing.

Evaluation Index
In this work, we evaluate the faster RCNN, faster RCNN with focal loss, cascade RCNN and ROI mining faster RCNN models in terms of detection accuracy.
We use four evaluation parameters, namely precision, recall, F1 score and mean average precision (mAP). In the object detection task, the ratio of the area of the intersection between a final detection box and a sample annotation box to the area of their union indicates whether the detection result is correct. The detected objects are divided into two categories, positive and negative, giving four cases: true positive (TP), positive samples correctly categorized, i.e., the detector correctly detects birds' nests as birds' nests; false positive (FP), samples incorrectly categorized as positive, i.e., the detector incorrectly detects background as a bird's nest; false negative (FN), positive samples incorrectly categorized as negative, i.e., the detector incorrectly detects birds' nests as background; and true negative (TN), negative samples correctly categorized, i.e., the detector correctly detects background as background.
Based on this information, precision is defined as:

Precision = TP / (TP + FP)

Recall is defined as:

Recall = TP / (TP + FN)

The F1 score is defined as:

F1 = 2 × Precision × Recall / (Precision + Recall)

Since the detection object in this paper is only the bird's nest category, the AP value equals the mAP value. It is expressed as:

AP = ∫₀¹ p(r) dr

where p and r denote precision and recall, respectively.
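The count-based metrics above can be computed directly; the following small helper (an illustrative sketch, with the example counts chosen arbitrarily) shows how:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g., 80 nests detected correctly, 20 false alarms, 20 nests missed:
p, r, f1 = detection_metrics(tp=80, fp=20, fn=20)
```

AP additionally requires integrating precision over recall across confidence thresholds, which is why it is reported separately from these point metrics.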

Comparative Experiment
In this work, four models are used to train the bird nest recognition network on the dataset to verify the validity and efficiency of the proposed ROI mining faster RCNN model.

Simulation Setup
The feature extraction network used in this work is ResNet-101. In addition, we define the weight α t = 1, the focusing parameter γ = 2, the initial learning rate λ = 0.001, the training epochs is set to 20, and the batch size equals 1. Table 1 presents the basic configuration of the local computer. This configuration is independent of the detection accuracy in the experiment.

Performance Evaluation
We present the comparison of the ROI mining faster RCNN model with faster RCNN, faster RCNN with focal loss, and cascade RCNN in terms of bird's nest detection using UAV images.
TP, TN, FP and FN are determined by the intersection over union (IoU) between detection and annotation boxes. This work sets the IoU threshold to 0.5: when the IoU of an output detection box and an annotation box is greater than 0.5, we consider that the model correctly detects the object; otherwise, the detection is considered wrong. TP, FP, FN and TN reflect the accuracy of the detector's classification and localization, which directly affects recall and precision, while localization accuracy is intuitively reflected by AP. Real examples of TP, TN, FP and FN in the detection task are shown in Figure 8. In practical applications, in order to ensure the safe operation of transmission lines, it is necessary to detect all birds' nests on a tower; recall represents the ability of the detector to find birds' nests and suspected bird's nest objects. Excessive false detections waste computing resources; precision represents the accuracy of the detector in detecting birds' nests. The F1 score combines precision and recall, balancing detection ability against computing resources. AP indicates the detection accuracy for the bird's nest category and measures whether the category and position of the bounding boxes predicted by the model are accurate. Based on the analyses above, this work uses these four indicators to measure the detection ability of the detectors.
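The IoU test used to decide TP versus FP can be written compactly; the function below is a standard sketch for axis-aligned (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection whose IoU with an annotation box exceeds 0.5 counts as a TP:
score = iou((0, 0, 10, 10), (2, 0, 10, 10))  # 0.8 -> true positive
```

Detections with no matching annotation box above the threshold are counted as FP, and unmatched annotation boxes as FN.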
This work uses 5-fold cross-validation to test ROI mining faster RCNN. We evaluate the performance and stability of the model according to the mean and standard deviation of the validation mAP over the five validation sets. The results of the five tests are shown in Table 2: the mean mAP of ROI mining faster RCNN is 82.51%, and the standard deviation of the mAP after 5-fold cross-validation is 0.0041. The validation standard deviation and validation loss indicate that ROI mining faster RCNN exhibits good generalization ability and robustness.
To further validate the functionality of our method in industrial applications, we use images of high-voltage electric poles in urban areas as input to detect other foreign objects that are also small, e.g., honeycombs and water bottles. Detecting objects against different backgrounds measures the effectiveness and stability of the method across a wide range of industrial applications. The results of these comparative experiments are shown in Table 3; our method achieves high accuracy under different backgrounds and generalizes well.
In practical applications, some researchers use the built-in functions of software to generate dataset labels automatically. This greatly reduces the time needed for data preparation, but it affects the detection accuracy of the model. We use the ground truth labeler function in MATLAB to label the training sets automatically and compare the resulting model with one trained on the manually labeled datasets, thereby verifying the recognition accuracy of the model under automatically labeled datasets. The comparison results are shown in Table 4.
We use the enhanced dataset described in Section 5.1 as the training set and train the aforementioned networks. Table 5 presents the performance evaluation of the networks on the same test set. As the results show, the proposed method achieves high detection accuracy for bird's nest detection. In addition, the proposed ROI mining faster RCNN outperforms faster RCNN and cascade RCNN in terms of mAP, F1 score and recall, although its precision is slightly lower than that of cascade RCNN. Notably, ROI mining faster RCNN improves the recall rate of object detection on the basis of correctly classifying complex samples; therefore, the detection accuracy is enhanced while precision is maintained.
In terms of mAP, which reflects the overall performance of the detection network, the proposed method copes with the inherent issues of the original model by adding focal loss to faster RCNN; balancing the numbers of foreground and background samples during training improves the mAP. The proposed method achieves an mAP of 82.51%, which demonstrates the effectiveness of the ROI mining method.
It is evident from the results presented in this work that the accuracy and mAP of ROI mining faster RCNN outperform those of the faster RCNN and cascade RCNN methods, while the precision of the proposed method remains comparable. Thus, the detection method proposed in this work greatly improves bird's nest recognition on transmission lines and poles. In addition, the proposed method successfully copes with the class imbalance problem during training and achieves the goal of automatically and accurately detecting birds' nests in aerial transmission line images.

Conclusions
In this work, we take aerial transmission line images as the research object, combining knowledge of image processing and neural networks. Building on the faster RCNN detection network, a deep learning-based detection method, we propose ROI mining faster RCNN. First, we use k-means clustering to generate anchor coordinate boxes for the bird's nest data. In addition, we deploy data enhancement strategies on the training set to improve the generalization and robustness of the proposed model. We introduce the focal loss function in the RPN stage to balance the proportion of positive and negative samples in the loss value, thereby alleviating the imbalance between foreground and background. Finally, we introduce the ROI mining method, which focuses the model on small, difficult-to-classify objects; the loss value is dominated by difficult-to-classify samples, which solves the class imbalance problem during training. Detailed comparative experiments show that the proposed method improves the recall rate of object detection. Although its precision is slightly lower than that of cascade RCNN, its mAP on the verification set reaches 82.51%, which is 4.22% higher than that of the original faster RCNN. The proposed method accurately detects most bird's nest objects on aerial transmission lines, and our results show that it is well adapted to the characteristics of aerial transmission line image datasets.