Evaluation of Power Insulator Detection E ﬃ ciency with the Use of Limited Training Dataset

: This article presents an analysis of the e ﬀ ectiveness of object detection in digital images with the application of a limited quantity of input. The possibility of using a limited set of learning data was achieved by developing a detailed scenario of the task, which strictly deﬁned the conditions of detector operation in the considered case of a convolutional neural network. The described solution utilizes known architectures of deep neural networks in the process of learning and object detection. The article presents comparisons of results from detecting the most popular deep neural networks while maintaining a limited training set composed of a speciﬁc number of selected images from diagnostic video. The analyzed input material was recorded during an inspection ﬂight conducted along high-voltage lines. The object detector was built for a power insulator. The main contribution of the presented papier is the evidence that a limited training set (in our case, just 60 training frames) could be used for object detection, assuming an outdoor scenario with low variability of environmental conditions. The decision of which network will generate the best result for such a limited training set is not a trivial task. Conducted research suggests that the deep neural networks will achieve di ﬀ erent levels of e ﬀ ectiveness depending on the amount of training data. The most beneﬁcial results were obtained for two convolutional neural networks: the faster region-convolutional neural network (faster R-CNN) and the region-based fully convolutional network (R-FCN). Faster R-CNN reached the highest AP (average precision) at a level of 0.8 for 60 frames. The R-FCN model gained a worse AP result; however, it can be noted that the relationship between the number of input samples and the obtained results has a signiﬁcantly lower inﬂuence than in the case of other CNN models, which, in the authors’ assessment, is a desired feature in the case of a limited training set.


Introduction
Recently, the development of deep convolutional neural networks and their broader application in the field of digital image processing was observed [1], areas well as the development of decision-making processes automation in the power sector. The conducted research focused on individual elements of electric power infrastructure [2,3], including electric power insulators [4][5][6]. Using UAVs (unmanned aerial vehicles) as a platform for obtaining data on power transmission elements is becoming widespread. There are many challenges to overcome when automating the analysis of digital images captured by UAV vehicles. The biggest o problems are difficulties related to the number of images and resulting data labeling problems. This is usually a manual task that requires precision and accuracy to provide the learning algorithms with the most valuable input information. During the process, it is possible to filter out unnecessary, redundant, or distorted data. This article presents research related to the elimination of the indicated problem. The analysis of the learning outcomes of selected implementations of the convolutional neural networks was performed with the use of a limited set of visual training data, recorded during the flight of a UAV vessel over electric power infrastructure objects.
The remainder of the paper is organized into the following sections: Section 2 provides a review of the video data acquisition process for the purposes of diagnostic testing of power insulators. The problem statement and the research methodologies, together with the proposed approach of insulator detection using different convolutional neural networks (CNNs), are outlined in Section 3. The results of the research and discussion are also reported. Section 4 provides a conclusion for the study. Section 5 presents a discussion of unsolved research problems.

Recording of Visual Data
Electric power lines are very specific objects similar to other industrial infrastructure facilities, usually covering a wide area. This requires performing periodic inspections for proper maintenance. Usually, periodical flights are made by airplanes or unmanned aerial vehicles [7], during which visual material is recorded depicting the state of individual elements of overhead high voltage lines. The continuous development of imaging methods allows image-recording at increasing resolutions with an increasingly larger number of frames per second. Obtaining desired information from the acquired visual material requires application of appropriate processing and the development of new analysis methods for the detection of specific irregularities affecting the proper functioning of the inspected object. The visual data (video, series of frames) recorded during a controlled flight, due to the vastness of electric power infrastructure and the consecutive development of imaging techniques, are currently characterized by very large volume, whereas analysis conducted by humans is expensive, inefficient in terms of time, burdened with many errors, and dependent on the perception and experience of the person making the inference based on visual material.
At the same time, human analysis is characterized by high flexibility in decision-making in unusual situations that require complex multi-criteria analysis and the need to consider many aspects from different fields. The development of artificial intelligence, in the field of robotic process automation (RPA), among others, allows for effective algorithm application in the area of visual material analysis. In terms of inspection of the state of technical infrastructure, the challenges for these types of algorithms are as follows: • The need to use advanced methods of image processing (deep learning); • The need to build appropriate datasets which enable the construction of models describing the essential elements of technical infrastructure; • A high degree of complication in making decisions concerning the assessment of the state of the examined object.

Dataset Preparation for CNN
The quality of prediction generated by convolutional neural networks depends to a large extent on the use of an appropriate machine learning dataset [8]. The dataset built to solve a specific problem should be characterized by the following features [9,10]: • A large number of samples; • Balance in terms of individual classes; • High quality of images; • Diversity in terms of depicting the analyzed objects (position, rotation, scale, lighting, obscuration); • Accuracy of annotations (objects marking) in images.
The quality of digital images is significant. As demonstrated in Reference [11], an increase in the current conditions for large datasets meets many challenges as follows: • The diversity of data sources brings abundant data types and complex data structures, which increase the difficulty of data integration; • Data volume is tremendous, and it is difficult to evaluate data quality within a reasonable amount of time; • Data change very fast, and the "timeliness" of data is very short, which necessitates higher requirements for processing technology; • No unified and approved data quality standards exist, and the research on the quality of large datasets only just began.
The main focus during the process of creating a training dataset for convolutional neural networks was on using as few training frames as possible. The aim was to improve the quality of the training dataset and significantly shorten the pre-processing and training time of individual neural network architectures.

Description of the Proposed Approach
The block diagram in Figure 1 shows the individual stages of the conducted research. The first stage involved an inspection flight along an electric power line with the use of an unmanned aerial vehicle and a camera recording a digital image at a resolution of 3840 × 2160. Then, the recorded video material was divided into single digital images with a specific time interval. The selected images containing the examined object or objects were pre-processed and then tagged.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 3 of 13 • The diversity of data sources brings abundant data types and complex data structures, which increase the difficulty of data integration; • Data volume is tremendous, and it is difficult to evaluate data quality within a reasonable amount of time; • Data change very fast, and the "timeliness" of data is very short, which necessitates higher requirements for processing technology; • No unified and approved data quality standards exist, and the research on the quality of large datasets only just began. The main focus during the process of creating a training dataset for convolutional neural networks was on using as few training frames as possible. The aim was to improve the quality of the training dataset and significantly shorten the pre-processing and training time of individual neural network architectures.

Description of the proposed approach
The block diagram in Figure 1 shows the individual stages of the conducted research. The first stage involved an inspection flight along an electric power line with the use of an unmanned aerial vehicle and a camera recording a digital image at a resolution of 3840 × 2160. Then, the recorded video material was divided into single digital images with a specific time interval. The selected images containing the examined object or objects were pre-processed and then tagged. In the next stage, the selected images were divided into training datasets consisting of 15, 30, and 60 frames, and a test dataset consisting of 283 frames containing 2560 regions of interest (ROIs) representing a power insulator. In the in-depth study for testing of selected CNNs (the faster regionconvolutional neural network (faster R-CNN) and the region-based fully convolutional network (R- In the next stage, the selected images were divided into training datasets consisting of 15, 30, and 60 frames, and a test dataset consisting of 283 frames containing 2560 regions of interest (ROIs) representing a power insulator. In the in-depth study for testing of selected CNNs (the faster region-convolutional neural network (faster R-CNN) and the region-based fully convolutional network (R-FCN)), training datasets were also prepared, comprising five, 10,20,25,35,40,45,50, and 55 frames. A detailed description of the dataset preparation process with its sample elements is described in more detail in the next section. For each prepared dataset (15,30, 60 frames), the process of training for the selected convolutional neural network was conducted. The inference process for an identical test dataset was performed, and, in the last stage, the obtained results were compared.

Training and Test Dataset
The training and test datasets included images taken during the flight of the UAV along the high-voltage line. For data acquisition, a GoPro Hero 4 camera was used. All the images of the training and test datasets were manually tagged, thus obtaining a prepared training dataset and a ground truth for the test set. For each image, a separate XML file was generated. The file contained coordinates (x_min, y_min, x_max, y_max) of the regions of interest (ROIs), defining the positions of objects in the scene. Figure 2 presents examples of tagged images from the flight along the high-voltage line. The whole dataset contained 343 photos. In total, 3138 ROIs containing the insulator were marked. The test dataset consisted of 283 images (2560 ROIs), and the entire training dataset contained 60 images (578 ROIs). Additionally, the training dataset was divided into smaller sets of images, i.e., five, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, and 60 frames.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 4 of 13 A detailed description of the dataset preparation process with its sample elements is described in more detail in the next section. For each prepared dataset (15, 30, 60 frames), the process of training for the selected convolutional neural network was conducted. The inference process for an identical test dataset was performed, and, in the last stage, the obtained results were compared.

Training and test dataset
The training and test datasets included images taken during the flight of the UAV along the high-voltage line. For data acquisition, a GoPro Hero 4 camera was used. All the images of the training and test datasets were manually tagged, thus obtaining a prepared training dataset and a ground truth for the test set. For each image, a separate XML file was generated. The file contained coordinates (x_min, y_min, x_max, y_max) of the regions of interest (ROIs), defining the positions of objects in the scene. Figure 2 presents examples of tagged images from the flight along the highvoltage line. The whole dataset contained 343 photos. In total, 3138 ROIs containing the insulator were marked. The test dataset consisted of 283 images (2560 ROIs), and the entire training dataset contained 60 images (578 ROIs). Additionally, the training dataset was divided into smaller sets of images, i.e., five, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, and 60 frames.

Applied convolutional neural networks
For insulator detection, five different convolutional neural networks were selected. These CNNs are described in Table.1. • It consists of only one CNN • Much faster than its predecessors (Fast R-CNN and R-CNN) [13,14] You Only Look Once v3 YOLO v3 • The single convolutional network predicts the bounding boxes and the class probabilities for these boxes [15].
• Can operate with the speed of 45 frames per second [16] • The limitation of the YOLO algorithm is that it struggles with small objects within the image (due to the spatial constraints of the algorithm) [17]

Applied Convolutional Neural Networks
For insulator detection, five different convolutional neural networks were selected. These CNNs are described in Table 1. • The single convolutional network predicts the bounding boxes and the class probabilities for these boxes [15].
• Can operate with the speed of 45 frames per second [16] • The limitation of the YOLO algorithm is that it struggles with small objects within the image (due to the spatial constraints of the algorithm) [17] Single-Shot MultiBox Detector Inception

SSD Inception
• The network is an object detector that also classifies those detected objects • The network uses the MultiBox technique developed for bounding box regression • The network employs Inception V3 as the backbone [18][19][20][21] • The tasks of object localization and classification are done in a single forward pass of the network [22] • SSD produces worse performance on smaller objects, as they may not appear across all feature maps [18] Single Shot MultiBox Detector Lite MobileNet v2

SSD MobileNet
• The network is an object detector that also classifies those detected objects • The network uses the MultiBox technique developed for bounding box regression • The network employs MobileNet V2 as the backbone and has depth-wise separable convolutions for the SSD layers, also known as SSDLite [18,20] • The main features of the SSDLite MobileNet detector are the same as SSD Inception. Both implementations differ mainly in performance [23] Region-based fully convolutional network

R-FCN
• Unlike R-CNN series networks, fully connected layers after region of interest (ROI) pooling are removed • All major complexity is moved before ROI pooling to generate the score maps • The ResNet-101 architecture is used as a backbone to compute the feature maps [24] • Uses a positive sensitive score map to shorten the inference time and maintain competitive accuracy [25].

The Analysis Results for All the Applied Convolutional Neural Networks
The operation results of the selected convolutional neural networks were evaluated based on three parameters: precision, recall, and average precision. The precision parameter shows how often the network generates an incorrect prediction. The precision parameter is described by the relationship of true positive predictions resulting from the network (correctly detected insulators) and false positives (number of wrong predictions) generated by the network. In turn, the recall parameter shows how many objects were detected from all the tagged objects. It is crucial to assess the accuracy of the prediction of the region of interest generated by the network. The IoU (intersection over union) parameter was used to assess this accuracy. The IoU parameter describes the relationship between overlap area which is the sum of the area of the region delimited by the network and tagged manually in the picture and union area which is the product of the area of the region delimited by the network and the region hand-tagged in the picture. In the analysis of network operation, the key parameter was the average precision parameter, as well as the average precision curve, on the basis of which this parameter was determined. This curve was drawn based on previously determined precision and recall parameters for each single object detection. The shape and position of the average precision (AP) curve allowed assessing the accuracy of the network operation. For example, if the AP curve is more horizontal and located higher on the y-axis, the network is more accurate, while, as it reaches further on the x-axis, the number of objects it manages to detect correctly is increased. Ideally, it should be a straight line at the level of y = 1 reaching up to the value of x = 1. On this basis, the average precision parameter was calculated. Table 2 presents the results of analyses for selected neural networks trained using 15, 30, and 60 frames of film depicting the flight along the high-voltage line. Table 2 contains values of average precision, precision, and recall parameters calculated for three different established IoU values (i.e., 0.25, 0.5, and 0.75). As can be seen in Table 2, a change in the value of the IoU parameter had a significant impact on the analysis results. For further analysis, it was assumed that the IoU level equal to 0.5 was sufficient to conclude that the insulator was correctly identified in the image. For an IoU of 0.5, it is clear that the networks Faster R-CNN and R-FCN had by far the highest efficiency. The recall parameter was particularly important here, which had a value of as much as 0.7925 for the R-FCN/ResNet network trained on 60 frames for the IoU of 0.5. This means that, in this case, the network was able to identify almost 80% of the insulators correctly. Given that the networks Faster R-CNN and R-FCN achieved much better results than the other algorithms, they were selected for further, more detailed analyses.

The In-Depth Analysis Results for the Faster R-CNN and R-FCN Networks
The first step in the further analysis was to verify how the increase in the number of video frames used for training affected the quality of network operation. For this purpose, both networks were re-trained using five, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, and 60 frames. Figure 3 presents the average precision curves of the Faster R-CNN model for the IoU of 0.5, which demonstrate the results of the network operation on the test dataset for different numbers of frames that were involved in the training process.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 7 of 13 network operation on the test dataset for different numbers of frames that were involved in the training process. As shown in Figure 3, the number of training frames involved in the learning process had a significant impact on the detection quality of Faster R-CNN. Noticeably, the best result was obtained for the network trained on a dataset consisting of 60 training frames. It should also be noted that the networks trained with 25 and 35 frames were surprisingly good. Figure 4 presents the results obtained for the R-FCN model in similar conditions.  As shown in Figure 3, the number of training frames involved in the learning process had a significant impact on the detection quality of Faster R-CNN. Noticeably, the best result was obtained for the network trained on a dataset consisting of 60 training frames. It should also be noted that the networks trained with 25 and 35 frames were surprisingly good. Figure 4 presents the results obtained for the R-FCN model in similar conditions. Appl. Sci. 2020, 10, x FOR PEER REVIEW 7 of 13 network operation on the test dataset for different numbers of frames that were involved in the training process. As shown in Figure 3, the number of training frames involved in the learning process had a significant impact on the detection quality of Faster R-CNN. Noticeably, the best result was obtained for the network trained on a dataset consisting of 60 training frames. It should also be noted that the networks trained with 25 and 35 frames were surprisingly good. Figure 4 presents the results obtained for the R-FCN model in similar conditions.  As shown in Figure 4, the AP curves for the R-FCN model were flatter and reached further along the x-axis than those for the Faster R-CNN model. It can also be seen that the increase in the number of frames was not as crucial here as it was in the case of the Faster R-CNN model. In particular, this was visible for networks trained for more than 15 frames. The slope of AP curves did not change as significantly here either. In the final stage of the analysis, the relationship between the increase in the number of training frames and the AP parameter value was examined. The obtained results are shown in Figure 5, which also shows graphs and formulas of functions matched to the measurement data, together with calculated determination coefficients. of frames was not as crucial here as it was in the case of the Faster R-CNN model. In particular, this was visible for networks trained for more than 15 frames. The slope of AP curves did not change as significantly here either. In the final stage of the analysis, the relationship between the increase in the number of training frames and the AP parameter value was examined. The obtained results are shown in Figure 5, which also shows graphs and formulas of functions matched to the measurement data, together with calculated determination coefficients. As showed in Figure 5, the determination coefficients R 2 for both functions were at a similar level of approximately 0.94. The slope of trend lines for the R-FCN model was slightly smaller, which means that fewer training frames were needed to obtain a comparable effect to the Faster R-CNN model. To finalize the presented activities, an in-depth study was performed which was divided into two stages. In the first stage, the selected networks (Faster R-CNN and R-FCN) were trained in 20,000 steps for 15, 30, and 60 frames of the recorded material. Subsequently, an inference was made, and a set of images was obtained representing both the objects sought and the other objects that were wrongly classified. Next, manual selection of correctly classified objects was made, assuming the parameter IOU = 0.5, with which the original training sets were supplemented. Expanded training datasets prepared in this manner were used to tune the previously obtained models again in the second stage of training. The obtained results are presented in Figures 6 and 7. As showed in Figure 5, the determination coefficients R 2 for both functions were at a similar level of approximately 0.94. The slope of trend lines for the R-FCN model was slightly smaller, which means that fewer training frames were needed to obtain a comparable effect to the Faster R-CNN model. To finalize the presented activities, an in-depth study was performed which was divided into two stages. In the first stage, the selected networks (Faster R-CNN and R-FCN) were trained in 20,000 steps for 15, 30, and 60 frames of the recorded material. Subsequently, an inference was made, and a set of images was obtained representing both the objects sought and the other objects that were wrongly classified. Next, manual selection of correctly classified objects was made, assuming the parameter IOU = 0.5, with which the original training sets were supplemented. Expanded training datasets prepared in this manner were used to tune the previously obtained models again in the second stage of training. The obtained results are presented in Figures 6 and 7.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 9 of 13 Figure 6. The average precision curve for the R-FCN model trained on the training dataset and additionally tuned in three considered cases with 15, 30, and 60 frames. Figure 6 presents a comparison of average precision curves obtained for the R-FCN network additionally tuned in three considered cases with 15, 30, and 60 frames. As can be seen, the detection accuracy significantly improved; the precision parameter curves were closer to the line y = 1, but the recall parameter slightly deteriorated, as fewer insulators were found. Figure 7 presents an analogical  Figure 6 presents a comparison of average precision curves obtained for the R-FCN network additionally tuned in three considered cases with 15, 30, and 60 frames. As can be seen, the detection accuracy significantly improved; the precision parameter curves were closer to the line y = 1, but the recall parameter slightly deteriorated, as fewer insulators were found. Figure 7 presents an analogical situation to Figure 6, with the results obtained after testing the Faster R-CNN model. In this case, a significant improvement in detection accuracy can be observed, especially for 15 frames of the video, but there was also a decrease in the number of insulators found. In this case, the shape of the curves (30 and 60 frames) also changed; their end parts sloped downward. Accurately calculated parameters of average precision, precision, and recall are presented in Table 3. Figure 6. The average precision curve for the R-FCN model trained on the training dataset and additionally tuned in three considered cases with 15, 30, and 60 frames. Figure 6 presents a comparison of average precision curves obtained for the R-FCN network additionally tuned in three considered cases with 15, 30, and 60 frames. As can be seen, the detection accuracy significantly improved; the precision parameter curves were closer to the line y = 1, but the recall parameter slightly deteriorated, as fewer insulators were found. Figure 7 presents an analogical situation to Figure 6, with the results obtained after testing the Faster R-CNN model. In this case, a significant improvement in detection accuracy can be observed, especially for 15 frames of the video, but there was also a decrease in the number of insulators found. In this case, the shape of the curves (30 and 60 frames) also changed; their end parts sloped downward. Accurately calculated parameters of average precision, precision, and recall are presented in Table 3. Average accuracy was not a clear indicator which, for the analyzed cases 15, 30 or 60, gave better results. It is worth noting that, as a result of the in-depth study, an improvement in precision was achieved in each of the cases. For the Faster R-CNN model, the improvement was respectively 23.19%, 15.84%, and 14.27%, and, for the R-FCN model, it was 21.44%, 20.31%, and 10.50%. On the other hand, the recall parameter for the Faster R-CNN model deteriorated respectively by 13.07%, 8.14%, and 9.29%, and, for the R-FCN model, it deteriorated respectively by 10.28%, 14.52%, and 12.57%.

Conclusions
Creating object detectors of a particular class in digital images requires the development of an appropriate learning dataset. The application of a limited learning dataset is possible for image sequences in which there is a moderate, highly predictable variability of imaging conditions (lighting conditions, background variability, scene composition, etc.) in which the visual material is recorded. In the studied cases, a diagnostic flight along high-voltage lines in non-urbanized areas was implemented, and known deep neural network architectures such as You Only Look Once (YOLO), Single-Shot MultiBox Detector (SSD), Faster R-CNN, and R-FCN were used to develop the object detector.
This article presented the comparison results for the detection of an object of the "power insulator" class while maintaining a limited training dataset consisting of five, 10,15,20,25,30,35,40,45,50,55, and 60 frames of visual material. The research aimed to assess the influence of the size of training dataset on the achieved efficiency for various deep neural network architectures. For the evaluation of efficiency, three IoU threshold values (at the level of 0.25, 0.5, and 0.75) were used to assess the quality of the obtained detections. The best results were reached for two networks, Faster R-CNN and R-FCN. Faster R-CNN received the highest AP at the level of 0.8 for 60 training frames. During the study, it was demonstrated that the number of training data directly translated into the results obtained by the trained detector. The R-FCN model got a poorer AP result; however, it can be observed that the relationship between the number of input samples and the obtained results had a much smaller impact than in the case of the other networks, which, in the authors' opinion, is a desirable feature in the case of a limited input dataset.
The main contribution of the work is the evidence that a limited training set (in our case, just 60 training frames) could be used for object detection, assuming an outdoor scenario with well-defined conditions. The decision of which network will generate the best result for such a limited training set is not a trivial task. In contrast to a previous study carried out [24], in our case, the Faster R-CNN network obtained a better AP result for the same limited training sample, which suggests that deep neural networks will achieve different levels of effectiveness depending on the amount of training data. As part of the in-depth study, additional network tuning was conducted, considering the results gained after the first stage of training. For the two analyzed networks (Faster R-CNN and R-FCN), in each of the three cases (15,30, and 60 frames), the precision parameter improved while the recall parameter deteriorated.

Discussion
In this article, the authors presented a study of the effectiveness of power insulator detection in digital images with the application of a limited training dataset. However, the detection efficiency is also affected by other problems such as the following:

•
Flat and limited (unidirectional) views of objects. For aerial photographs taken perpendicular to the ground, objects of interest are relatively small and have fewer elements, mainly in flat and rectangular form. They usually include shots from above, omitting many important features of those objects in other planes.

•
Large sizes of digital images. Currently, aerial imaging provides visual material at very high resolutions, which allows capturing increasingly more details, but at the same time introduces problems related to the use of sufficient computing power necessary to process them. These problems are eliminated by applying various methods of pre-processing for aerial photography. However, their proper selection requires a lot of research to determine the effectiveness of various solutions dedicated to specific technical problems.

•
Overlapping objects. Objects may be occluded by other objects of the same type, which causes, e.g., inaccuracies when labeling data.
• Replication of the same objects in different digital images. The same object can occur in two separate images, which can lead to double detection and errors when recognizing objects in images.
Directions for further research include the verification of the results obtained in limited datasets captured during other inspection flights and effectiveness analysis for other methods of detecting objects in digital images. It is also appropriate to examine the influence of augmentation of images used in the process of training a convolutional network on detection efficiency.