InsulatorGAN: A Transmission Line Insulator Detection Model Using Multi-Granularity Conditional Generative Adversarial Nets for UAV Inspection

Abstract: Insulator detection is one of the most significant issues in high-voltage transmission line inspection using unmanned aerial vehicles (UAVs) and has attracted attention from researchers all over the world. The state-of-the-art models in object detection perform well in insulator detection, but the precision is limited by the scale of the dataset and parameters. Recently, the Generative Adversarial Network (GAN) was found to offer excellent image generation. Therefore, we propose a novel model called InsulatorGAN based on using conditional GANs to detect insulators in transmission lines. However, due to the fixed categories in datasets such as ImageNet and Pascal VOC, the generated insulator images are of a low resolution and are not sufficiently realistic. To solve these problems, we established an insulator dataset called InsuGenSet for model training. InsulatorGAN can generate high-resolution, realistic-looking insulator-detection images that can be used for data expansion. Moreover, InsulatorGAN can be easily adapted to other power equipment inspection tasks and scenarios using one generator and multiple discriminators. To give the generated images richer details, we also introduced a penalty mechanism based on a Monte Carlo search in InsulatorGAN. In addition, we proposed a multi-scale discriminator structure based on a multi-task learning mechanism to improve the quality of the generated images. Finally, experiments on the InsuGenSet and CPLID datasets demonstrated that our model outperforms existing state-of-the-art models by advancing both the resolution and quality of the generated images as well as the position of the detection box in the images.


Introduction
Insulators provide electrical insulation and mechanical support for overhead transmission lines. Insulator detection refers to the process of locating the position of the insulator in inspection images. This process serves as the foundation for other tasks such as insulator defect detection and power line extraction. Recently, insulator detection has attracted significant interest from researchers in many fields such as smart grid and computer vision, and some progress has been made [1,2].
At present, the inspection of insulators mainly depends on visual observation by staff, which can easily cause omissions. Unmanned aerial vehicles (UAVs), with the advantages of low cost, small size, and multiple cruise modes [3], combined with other systems such as wireless sensor networks (WSNs) [4], have become important auxiliary equipment in many fields such as crop identification [5] and yield estimation [6]. On the other hand, UAVs bring safety hazards and privacy issues, such as threats to the flight activities of birds [7]. In addition, the communication between the drone and the Ground Control Station (GCS) is vulnerable to attacks, which can cause data leakage [8].
Figure 1. Test examples of InsulatorGAN on InsuGenSet and CPLID. First, the LR image is generated; then, the details are enriched to obtain the HR image.
The differences between InsulatorGAN and previous models are as follows. First, the generator of InsulatorGAN uses multiple stages, which ensures that InsulatorGAN can not only generate high-resolution and realistic images but can also be applied to other fields, such as image translation and style conversion. Second, to strengthen the semantic constraints in the image-generation process, we used the Monte Carlo search (MCS) [19] to sample low-resolution target images multiple times and calculate the corresponding penalty value according to the sampling results. The penalty mechanism can force the generator to produce images with richer semantics to avoid mode collapse [20]. Third, to improve the ability of the discriminator, based on the multi-task learning strategy of parameter sharing [21], we proposed a discriminator framework based on a multi-scale structure. Although all the discriminators use the same network structure, the inputs of different resolutions allow the discriminators to cooperate with each other, extract feature maps at different abstraction levels, and accelerate the training of the model. In addition, to address the problem that the public dataset CPLID [1] contains little data and only simple backgrounds, we used images obtained via UAV to build the insulator image-generation dataset, InsuGenSet. Moreover, the insulator detection results output by the InsulatorGAN model can be used for data expansion. We compared InsulatorGAN with several mainstream image-generation models and achieved the best results on CPLID and InsuGenSet, which demonstrates that our model can generate high-resolution and high-quality images.
The main contributions of this paper are as follows:
• This paper proposes an insulator-detection image-generation model, InsulatorGAN, based on an improved conditional generative adversarial network. This model includes a generator and multiple discriminators. Moreover, we used a two-stage method from coarse to fine to generate high-resolution insulator inspection images that can be flexibly adapted to other scenes;
• To improve the constraints on the insulator image-generation process, a penalty mechanism based on the Monte Carlo search was introduced into the generator. This mechanism enables the generator to obtain sufficient semantic guidance and add more semantic details to the generated image;
• Based on the parameter sharing mechanism, we propose a multi-scale discriminator structure that enables the entire discriminator network to use feature information at different levels of abstraction to determine whether the input image is true or false;
• To address the small scale of the public insulator dataset CPLID, we established a dataset called InsuGenSet for insulator-detection image generation based on real images. We conducted many comparative experiments between InsulatorGAN and state-of-the-art models on InsuGenSet, and the results demonstrated the effectiveness and flexibility of InsulatorGAN.
The rest of this article is arranged as follows. Section 2 introduces the related work on insulator detection and image generation. Section 3 introduces the basic knowledge of GANs. The architecture of InsulatorGAN is illustrated in Section 4. In Section 5, we present several sets of experiments that determined the effectiveness of our framework. Finally, the conclusions and future work are outlined in Section 6.

Insulator Detection
Insulator detection is a major issue in power line inspection and provides a reference for insulator defect detection, foreign object detection, wire detection, and robot path planning. With the rise of deep convolutional neural networks in the image field, insulator detection has also been given new life. The researchers in [1] proposed a cascaded architecture for insulator defect detection.
Sadykova et al. [12] used the YOLOv3 neural network model to train a real-time insulator detection classifier under varying image resolutions and different lighting conditions to assess the presence of ice, water, and snow on the insulator. Zhao et al. [13] improved the anchor generation method and the non-maximum suppression algorithm in the Faster R-CNN model for use with different sizes and aspect ratios and the mutual occlusion of insulators in aerial images. To enhance the reuse and spread of insulator features, Liu et al. [22] added a multi-level feature mapping module based on YOLOv3 and Dense Blocks and proposed an insulator detection network called YOLOv3-dense. Focusing on the interference of complex backgrounds and small targets, the researchers in [23] proposed two insulator detection models, Exact R-CNN and CME-CNN. CME-CNN added an encoder-decoder based on Exact R-CNN to obtain pure insulators.
However, the excellent performance of the above models has generally been obtained at the cost of a complex network structure. In this case, either the real-time performance of the network is difficult to guarantee, or the detection precision is not sufficient. The model in this paper is an improved conditional generative adversarial network with a simple framework and clear training methods. With the support of a small amount of data, high detection precision can be obtained in a short time, and the generated images can be used for data expansion, which addresses the problems of the aforementioned work.

Image Generation
Image generation has become a hot research topic due to the rise of deep learning. The variational autoencoder (VAE) [24] is a generative model based on the probabilistic graphical model. In [25], the researchers proposed an Attribute2Image model that can generate images from visual features by synthesizing the foreground and background. In addition, the researchers in [26] introduced an attention mechanism into the VAE and proposed the DRAW model to improve the image quality.
Recently, researchers have achieved favorable performance in image generation using GAN [27]. The training goal of the generator and the discriminator is for the two to defeat each other. A large number of GAN-based frameworks were subsequently proposed, such as conditional GANs [28], Bi-GANs [29], and InfoGANs [30]. GAN can also generate new images based on labels [31], text [32,33], or images [15,34,35].
However, the images generated by the above models generally have problems such as blurring and distortion. In such cases, the model does not learn how to generate images but simply repeats the content of the images in the training set. InsulatorGAN is also an image-based CGAN, but our model introduces richer semantic guidance through a novel penalty mechanism, further improves the semantic structure of the image, and overcomes the problem of image distortion.

Basic Knowledge of GAN
The Generative Adversarial Network [27] includes two adversarial learning subnetworks, a discriminator and a generator, which are trained as a two-player min-max game. The generator $G$ takes a d-dimensional noise vector as input and produces a generated image as close as possible to the real image. The discriminator $D$, in turn, is used to determine whether the input is a fake image from the generator or a real image from the real dataset. The loss function of the entire generative adversarial network is as follows:

$$\min_G \max_D \mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{z \sim p_g}[\log(1 - D(G(z)))] \quad (1)$$

where $x$ represents the real image sampled from the real data distribution $p_r$, and $z$ represents the d-dimensional noise vector sampled from the Gaussian distribution $p_g$.
CGANs [14] control the results of model generation by introducing auxiliary variables. In CGANs, the generator generates images based on auxiliary conditions, and the discriminator makes judgments based on the auxiliary conditions and the images (fake images or real images). The loss function is as follows:

$$\min_G \max_D \mathcal{L}_{CGAN}(G, D) = \mathbb{E}_{s, x \sim p_r}[\log D(s, x)] + \mathbb{E}_{s, x'}[\log(1 - D(s, x'))] \quad (2)$$

where $s$ represents the auxiliary variable, and $x' = G(s, z)$ represents the image generated by the generator.
In addition to the adversarial loss, previous works [34,36] have also sought to minimize the L1 or L2 distance between the real and the fake images to help the generator synthesize images with greater similarity to the real images. Previous research has shown that, compared with the L2 distance, the L1 distance can help the model reduce blur and distortion in the image. Therefore, the L1 distance was also introduced into InsulatorGAN. The formula for minimizing the L1 distance is as follows:

$$\min_G \mathcal{L}_{L1}(G) = \mathbb{E}_{x, x'}[\lVert x - x' \rVert_1] \quad (3)$$

The loss function for this type of CGAN is the sum of Equations (2) and (3).
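As a numerical illustration of the adversarial value and the L1 term described above, the following minimal sketch evaluates both on plain Python lists. The function names, the flattened-image representation, and the weight `lam` are assumptions made for this example and are not taken from the paper.

```python
import math


def gan_value(d_real, d_fake):
    """Adversarial value: mean(log D(x)) + mean(log(1 - D(G(z)))).

    d_real / d_fake are discriminator probabilities in (0, 1) for real
    and generated samples, respectively.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def l1_distance(real, fake):
    """Mean absolute difference between two equally sized flattened images."""
    return sum(abs(a - b) for a, b in zip(real, fake)) / len(real)


def cgan_l1_objective(d_real, d_fake, real, fake, lam=1.0):
    """Adversarial value plus the L1 term; lam=1.0 gives a plain sum.

    `lam` is a hypothetical weighting knob added for illustration only.
    """
    return gan_value(d_real, d_fake) + lam * l1_distance(real, fake)
```

When the discriminator cannot tell real from fake (all probabilities equal to 0.5), the adversarial value equals -log 4, which is the well-known equilibrium value of the original GAN objective.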

Task Definition
First, we must define the task of insulator-detection image generation. Suppose we have a clear image of a line insulator taken by a drone; we denote that image as the original image $I_o$. After the insulators in the image are marked, the image is recorded as the ground truth $I_t$. The insulator-detection image-generation task is to generate an image $I_g$ containing the detection box that is visually very similar to $I_t$.

Overall Framework
The framework of InsulatorGAN proposed in this paper is shown in Figure 2. Our goal was to generate an insulator-detection image $I_g$. For this purpose, we used a generator $G(I_o; \theta_g)$ and a discriminator $D((I_o, I_g); \theta_d)$, where $\theta_g$ and $\theta_d$ represent the parameters of the generator and discriminator, respectively; $I_o$ represents the image sampled from the real data distribution; and $I_g$ represents the image generated by the generator $G$.
The training of the entire model can be divided into two adversarial learning processes: generator learning and discriminator learning. The target of the generator $G$ is to generate an image from the insulator detection results and make the generated images cheat the discriminator $D$. In other words, the target of the generator is to minimize the distance between the fake image and the target image. Correspondingly, the goal of the discriminator is to accurately determine whether the input is a fake image generated by the generator or a ground truth and to calculate a penalty value for the fake image.

Multi-Granularity Generator
As shown in Figure 3, the process of image generation is divided into three parts. First, the original image $I_o$ and the target image $I_t$ are input, and the generator uses the coarse-grained module to obtain a low-resolution image $I_g$. Then, the Monte Carlo search method is introduced to sample the low-resolution image N times. Finally, the attention mechanism is introduced to extract the features of the N intermediate result images, and the output of the attention mechanism and the original image $I_o$ are input into the fine-grained output module to obtain the high-resolution target image $I_g$. With this two-stage generation method, we aimed to improve the quality of the image: the Monte Carlo search mines the hidden spatial information of the samples output by the coarse-grained module, and the penalty mechanism and the attention mechanism are combined to constrain the position of the generated box and to improve image resolution and detail.
Remote Sens. 2021, 13, x FOR PEER REVIEW 7 of 21
Figure 3. The framework of the generator, containing three parts: the coarse-grained module, the Monte Carlo search, and the fine-grained module. First, the coarse-grained module is used to obtain the LR image; then, the Monte Carlo search is used for detail mining; lastly, the attention mechanism is used to input the detailed information into the fine-grained module to obtain the HR image.

Penalty Mechanism
To strengthen the semantic constraints during image generation and to further improve the semantic details in the image, we propose a penalty mechanism based on the Monte Carlo search (MCS) strategy, which makes the generated image more accurate in its semantics and details.
We performed Monte Carlo searches on the low-resolution target image $I_g$ output by the coarse-grained generation module. In this sampling process, $I_g^i$ $(i = 1, \dots, N)$ represents the i-th image obtained by sampling based on $I_g$, and $MC^{G_\beta}$ represents the state of the MCS simulation. $G_\beta$ represents the generation module virtualized by the MCS technique, which shares parameters with the fine-grained generation module. This module is run N times to generate N intermediate result images. Here, $z \sim \mathcal{N}(0, 1)$ is the noise vector introduced to ensure the diversity of the sampling results. We use the sum of the output of the encoder and the noise vector as the input of the decoder. The noise vector ensures that each sampling pass attends to the feature vector differently and can therefore extract richer semantic information from the low-resolution target image.
Next, the N intermediate result images and images with a known angle of view are sent to the discriminator. Then, we can calculate the penalty value (Formula (6)) based on the output of the discriminator, where $D(I_g^i; \theta_d)$ represents the probability, output by the discriminator, that the image $I_g^i$ is a real image.
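The sampling and penalty steps can be sketched as follows. This is a minimal illustration under stated assumptions: `encode`, `decode`, and `discriminator` are hypothetical callables standing in for the real networks, images are flattened lists of floats, and the penalty is assumed to be the average discriminator probability over the N samples, which is one plausible reading of the description rather than a confirmed formula.

```python
import random


def monte_carlo_samples(encode, decode, lr_image, n, rng):
    """Draw n intermediate images from one low-resolution image.

    Each sample adds fresh noise z ~ N(0, 1) to the encoder output
    before decoding, so the n results differ from one another.
    """
    features = encode(lr_image)
    samples = []
    for _ in range(n):
        noise = [rng.gauss(0.0, 1.0) for _ in features]
        samples.append(decode([f + z for f, z in zip(features, noise)]))
    return samples


def penalty_value(discriminator, samples):
    """Assumed penalty: mean probability of 'real' over all samples."""
    return sum(discriminator(s) for s in samples) / len(samples)
```

With N = 5, the number of Monte Carlo searches reported in the experiment configuration, a call such as `monte_carlo_samples(enc, dec, image, 5, random.Random(0))` returns five differently perturbed intermediate images that can then be scored with `penalty_value`.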

Attention Mechanism
After obtaining N intermediate result images through sampling, we refer to the intermediate result images to provide sufficient semantic guidance for the next stage of generation. For this process, we propose a multi-channel attention mechanism. This mechanism is different from the methods used in previous work, where the synthesized image was generated only from the RGB three-channel space. We use sampling to obtain N intermediate results as feature sets to build a larger semantic generation space, and the model then extracts more fine-grained information by referring to the information from different channels. Then, the calculation result is input into the fine-grained generation module to obtain the high-resolution target image.
Next, we perform a convolution operation on the N intermediate result images to obtain the corresponding attention weight matrices, where Softmax(·) denotes the softmax function used for normalization. We then multiply each attention weight matrix by the corresponding intermediate result image to obtain the final output $I_g$ of the attention mechanism, where the symbols ⊕ and ⊗ denote element-wise addition and multiplication of matrices, respectively.
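The fusion step can be illustrated with a small sketch. It assumes that the softmax is taken across the N samples at each pixel, so that the N weights at a pixel sum to 1, and that the attention scores (`logits`) have already been produced by the convolution; both points are assumptions of this example, and the function name is hypothetical.

```python
import math


def attention_fuse(samples, logits):
    """Fuse N flattened images using per-pixel softmax attention weights.

    samples, logits: N lists of equal length. For every pixel, the N
    logits are normalized with a softmax, each weight multiplies the
    matching sample value, and the products are added together.
    """
    n, size = len(samples), len(samples[0])
    fused = []
    for p in range(size):
        column = [logits[i][p] for i in range(n)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        fused.append(sum(e / total * samples[i][p] for i, e in enumerate(exps)))
    return fused
```

With equal scores the result is the plain average of the samples; when one sample dominates the scores, the output approaches that sample.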

Objective Function
The generator is optimized by minimizing the objective function in Formula (9), where $D(I_o, G(I_o, I_t))$ represents the score of the generated image obtained by the generator $G$ using the original image and the ground truth $(I_o, I_t)$. In other words, this is the probability that the discriminator $D$ considers the generated image to be the ground truth; $\theta_g$ represents the parameters of the generator; and $V_D^G$ represents the penalty value for generator $G$ obtained using Formula (6).

Multi-Level Discriminator
The architecture of the discriminators is shown in Figure 4. The discriminator is composed of a deep convolutional network with a simple architecture. Widening and deepening the network to improve the discrimination ability of the discriminator would inevitably lead to a sharp increase in the number of model parameters, which would increase the model's training time.
Figure 4. The framework of discriminators. The discriminator consists of three parts with the same structure: $D_1$, $D_2$, and $D_3$. First, the shared convolutional layer is used to process the input image to obtain the feature map; after that, the map is down-sampled by factors of 2 and 4, and the resulting maps are input into $D_1$, $D_2$, and $D_3$ to obtain the discriminator scores.

Multitasking Mechanism
A discriminator network based on a multi-scale architecture is proposed in this paper. This network uses three discriminators, $D_1$, $D_2$, and $D_3$, with the same structure to process images at three different scales. Based on the parameter sharing mechanism [21], a multi-task learning strategy is used in this paper for the training of the discriminator. We first extracted the basic characteristics of the real sample and the generated sample through the convolutional network and obtained the corresponding feature maps. Then, we used 2 and 4 as sampling factors to down-sample the feature maps of the real and generated images to obtain feature maps at three different scales. Finally, we used the three discriminators $D_1$, $D_2$, and $D_3$ to process the feature maps.
Although each discriminator has the same structure, a discriminator with a smaller input can enhance the semantic information of the generated image, while a larger input is superior at adding details to the generated image. Thus, the design of a multi-level discriminator is very conducive to model training. When a high-resolution model is desired, we only need to add discriminators to the original model instead of training the model from zero.
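The three-scale input described above can be sketched as a small feature pyramid. The text specifies only the sampling factors 2 and 4; the use of average pooling and the function names here are assumptions made for illustration.

```python
def downsample(fmap, factor):
    """Average-pool a 2D feature map (list of rows) by an integer factor."""
    height, width = len(fmap), len(fmap[0])
    out = []
    for r in range(0, height - height % factor, factor):
        row = []
        for c in range(0, width - width % factor, factor):
            block = [fmap[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def pyramid(fmap):
    """Return the inputs for D1, D2, D3: full size, 1/2 size, 1/4 size."""
    return [fmap, downsample(fmap, 2), downsample(fmap, 4)]
```

Because every scale is derived from the same shared feature map, adding a further scale only requires one more `downsample` call and one more discriminator, which matches the extensibility argument made in the text.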

Objective Function
The above multi-task learning process is formulated in Formula (10), where $P_r$ and $P_g$ represent the sample distributions of the real images and generated images, respectively; $I_o$ represents the original image; $I_g$ represents the generated image obtained using the original image and ground truth; $\theta_d$ represents the parameters of the discriminator; and $D_k$ represents one of the multiple discriminators. We then conducted adversarial learning between the generator and the discriminators. The optimization algorithm is shown in Algorithm 1.
; g-steps, the training step of the generator; d-steps, the training step of the discriminators.
Output: G, generator after training.
Initialize generator G and discriminators $\{D_i\}_{i=1}^{k}$ with random weights;
repeat
    for g-steps do
        G generates fake images;
        Calculate the penalty value $V_D^G$ via Formula (6);
        Minimize Formula (9) to update the parameters of the generator G;
    end for
    for d-steps do
        Use G to generate fake images $I_g = \{I_g^1, \dots, I_g^N\}$;
        Use real images $I_o = \{I_o^1, \dots, I_o^N\}$ and fake images $I_g = \{I_g^1, \dots, I_g^N\}$ to update the discriminator parameters by minimizing Formula (10);
    end for
until InsulatorGAN converges
return G
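The alternating schedule of Algorithm 1 can be sketched as a small control-flow skeleton. The three callables and the `max_rounds` safety cap are hypothetical placeholders introduced for this example; the actual generator and discriminator updates (Formulas (9) and (10)) are outside its scope.

```python
def train(generator_step, discriminator_step, converged,
          g_steps=1, d_steps=1, max_rounds=1000):
    """Alternate generator and discriminator updates until convergence.

    generator_step / discriminator_step perform one parameter update
    each; converged() reports whether training should stop. The loop
    body runs at least once, mirroring the repeat ... until structure.
    """
    rounds = 0
    while True:
        for _ in range(g_steps):
            generator_step()
        for _ in range(d_steps):
            discriminator_step()
        rounds += 1
        if converged() or rounds >= max_rounds:
            break
    return rounds
```

Keeping the update rules behind callables makes the ratio of generator to discriminator steps easy to tune without touching the loop itself.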

Network Structure
For the problem of insulator-detection image generation, there is a large amount of underlying feature sharing between the input and output. To allow these features to be transmitted stably in the network, we used U-net [37] as the basic structure for the generator and discriminators.
The structure of the generator is shown in Table 1. CONV represents a convolutional layer; N-m indicates that each convolutional layer contains m convolution kernels; K-m×m indicates that the size of the convolution kernel is m × m; S-m indicates that the stride of the convolution kernel is m; P-m indicates that the boundary padding applied to the input is m; and IN, ReLU indicates that the current convolutional layer is followed by an InstanceNorm-ReLU layer [38].
The structure of the discriminator is shown in Table 2. Before the image is sent to the discriminator, it is necessary to use a convolution kernel with a size of 3 × 3, a stride of 1, and a padding of 1 to extract the primary features in the image. After the feature map with the same size as the original sample image is obtained, that map is down-sampled by factors of 2 and 4. Then, each feature map obtained after down-sampling is sent to the discriminator of the corresponding size. Notably, the first layer uses Convolution-InstanceNorm-LeakyReLU [39] as the activation function without normalization. The slope of Leaky ReLU is 0.2. After the last convolutional layer of the discriminators, a fully connected layer is used to generate a one-dimensional output. The discriminators with different input resolutions are the same in terms of network structure.
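The claim that a 3 × 3 kernel with stride 1 and padding 1 preserves the spatial size follows from the standard convolution output-size formula, sketched below. The 256-pixel input used in the usage note is an assumed example size, not a value stated in the paper.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension.

    Standard formula: floor((size + 2 * padding - kernel) / stride) + 1.
    """
    return (size + 2 * padding - kernel) // stride + 1
```

For an assumed 256-pixel side, `conv_output_size(256, 3, 1, 1)` gives 256, so the shared layer keeps the resolution, and the subsequent down-sampling by 2 and 4 would yield 128 and 64.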


Dataset
First, we needed an insulator dataset with sufficient images to analyze and evaluate the model. The only current public dataset for insulator detection is CPLID, but most of its insulator entities are obtained through image augmentation methods such as rotation and cropping. The real samples included are limited and cannot fully reflect the characteristics of insulator samples from power lines. Therefore, we used a UAV to record a 500 kV overhead power line inspection video at a location in China as the data source and produced a dataset to be used in generating insulator-detection images. Moreover, we used AutoCAD as a labeling tool to draw the smallest external polygonal frame around all the insulator entities. To improve the separation between the insulator and the background, we used magenta to draw the frame. This color is very rare in nature, which improves the model's focus on insulators during training. We named this dataset InsuGenSet; the training set and test set contained 2500 and 500 image pairs, respectively, each pair composed of a real image and its ground truth.

Experiment Configuration
InsulatorGAN training involved a total of 200 epochs. The learning rate was kept at 0.0001 for the first 100 epochs and was gradually decayed to 0 over the next 100 epochs. Before starting the training, the parameters of the model were initialized using a normal distribution with a mean value of 0 and a standard deviation of 0.02. An NVIDIA RTX 2080 GPU with a memory capacity of 8 GB was used for training and testing. The operating system was Ubuntu 18.04, and the PC had a memory capacity of 32 GB. All algorithms were built using PyTorch 1.4. The number of Monte Carlo searches was 5, and the weight controlling the proportion between the discriminator loss and the feature-matching loss was 10.
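The learning-rate schedule can be written down as a short function. The text says only that the rate shrinks gradually to 0 over the last 100 epochs; a linear decay and 0-indexed epochs are assumptions of this sketch, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, constant_epochs=100, decay_epochs=100):
    """Learning rate for a 0-indexed epoch.

    Constant for the first `constant_epochs`, then decayed linearly
    (an assumed shape) so that it reaches 0 at the end of training.
    """
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs
```

Under these assumptions the rate is 0.0001 through epoch 99, half of that at epoch 150, and exactly 0 at epoch 200.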

The Baselines
In this study, we compared InsulatorGAN with the current mainstream models of image generation, image translation, and object detection.
Pix2Pix [15]: This method uses adversarial learning to map from x to y, where x and y represent images in different domains; it has achieved excellent results in image-translation tasks.
CRN [16]: Unlike GAN-based image generation, the CRN model uses a CNN to output a generated image from an input semantic layout image. This method computes a matching loss between images together with a diversity loss between the generated and semantic-segmentation images.
X-Fork [17]: Similar to Pix2Pix, this generator produces multi-view images by learning the mapping $G: I_a \rightarrow \{I_b, S_b\}$, where $I_a$ and $I_b$ represent the images under perspective a and perspective b, respectively, and $S_b$ represents the semantic segmentation map of perspective b.
X-Seq [17]: X-Seq combines two CGANs (G1 and G2) into one model, where G1 synthesizes the image of the target perspective, and G2 synthesizes the semantic segmentation map of the target perspective based on the output image of G1.
SelectionGAN [18]: This model adds a multi-channel attention mechanism to X-Seq to selectively learn from the intermediate results of the model, thereby generating images in a cascade from coarse to fine.
Cascade R-CNN [40]: Cascade R-CNN is one of the standard baselines in object detection. Its detection speed is slow, but it offers high precision and robustness.
YOLOv4 [41]: Built on the YOLO structure, this model adopts recent optimization strategies to achieve a balance between FPS and precision.
CenterNet [42]: Unlike Faster R-CNN and YOLOv4, this model converts object detection into key-point detection based on a heatmap.

Quantitative Evaluation
In this section, the inception score and Fréchet inception distance are used to evaluate the quality of the generated image. Then, we use AP50-AP90 to evaluate the accuracy of the box position in the generated image. Finally, some pixel-level indicators are used to evaluate the similarity between the fake image and ground truth.

Inception Score (IS) and Fréchet Inception Distance (FID)
The Inception Score (IS) is a general-purpose index for generative models that measures the clarity and diversity of the generated images. A higher IS indicates higher quality. The calculation formula is as follows:
$$\mathrm{IS}(G) = \exp\left(\mathbb{E}_{x \sim p_g}\, D_{KL}\left(p(y|x)\,\|\,p(y)\right)\right)$$
where $G$ represents the generator, $x$ represents a generated image drawn from the generator distribution $p_g$, $y$ represents the predicted label of the generated image, and $D_{KL}$ represents the KL divergence (Kullback-Leibler divergence), also called relative entropy. Because the CPLID and InsuGenSet datasets contain insulators, which are not included in the ImageNet dataset [43], we could not directly use the pre-trained Inception model. Moreover, the Inception model has a large number of parameters, so we instead used AlexNet to score InsulatorGAN.
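As an illustration of the Inception Score definition, the sketch below computes the score from a list of per-image class-probability vectors p(y|x). The classifier that would produce these vectors (AlexNet in this work) is not modeled; the function name and input format are assumptions made for the example.

```python
import math


def inception_score(probs):
    """Inception Score from per-image class probabilities p(y|x).

    probs: list of equal-length probability vectors, one per generated image.
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution p(y), averaged over the generated images.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)


# Confident and diverse predictions give a score equal to the class count.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
# Uninformative predictions give the minimum score of 1.
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
```

The two calls show the extremes: the score grows when each image is classified confidently and the labels are spread across classes.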
The Fréchet Inception Distance (FID) measures the distance between real images and fake images. A lower FID indicates higher image quality. The calculation formula is as follows:
$$\mathrm{FID} = \left\|\mu_{I_t} - \mu_{I_g}\right\|^2 + \mathrm{Tr}\left(\Sigma_{I_t} + \Sigma_{I_g} - 2\left(\Sigma_{I_t}\Sigma_{I_g}\right)^{1/2}\right)$$
where $\mathrm{Tr}$ represents the trace (the sum of the elements on the diagonal of a matrix), $\mu_{I_t}$ and $\mu_{I_g}$ represent the means of the feature maps of the ground truth $I_t$ and the generated image $I_g$, and $\Sigma_{I_t}$ and $\Sigma_{I_g}$ represent the covariance matrices of the feature maps of $I_t$ and $I_g$. To increase the calculation speed and prevent underfitting caused by the small scale of the dataset, we used AlexNet instead of InceptionV3 to obtain the basic features of the image, as for IS. We used the output of the last pooling layer before the fully connected layer, that is, a 1 × 1 × 4096 vector, as the basic feature of the input image. On this basis, we obtained the 4096 × 4096 covariance matrices $\Sigma_{I_t}$ and $\Sigma_{I_g}$.
Table 3 outlines the experimental results for IS and FID. Here, InsulatorGAN obtained better scores than the other baseline architectures, which indicates that the images generated by InsulatorGAN are superior, in terms of quality and diversity, to those generated by the other baseline models.
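The Fréchet distance between two sets of feature vectors can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' code: the feature extractor is omitted, the function name is hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of the covariance product instead of an explicit matrix square root.

```python
import numpy as np


def frechet_distance(feats_real, feats_fake):
    """Frechet distance between two feature sets of shape (n_samples, dim)."""
    mu_r = feats_real.mean(axis=0)
    mu_f = feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    # Tr((S_r S_f)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of S_r S_f, which are real and non-negative in exact
    # arithmetic; clipping guards against small numerical errors.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_f)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_f)
                 - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))      # toy 3-D features, not 4096-D
print(frechet_distance(features, features))        # close to 0
print(frechet_distance(features, features + 1.0))  # close to 3 (mean shift only)
```

Shifting every feature by 1 leaves the covariance unchanged, so the distance reduces to the squared norm of the mean difference.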

Precision of Box
Here, we evaluate the position of the generated detection frame using the Average Precision (AP) of the COCO dataset [44]. This standard relies on the Intersection over Union (IoU) between the fake image and the ground truth, which serves as the threshold for deciding whether a detection is correct. The formula is shown below:
$$\mathrm{IoU} = \frac{\mathrm{area}(DefaultBox \cap GroundTruthBox)}{\mathrm{area}(DefaultBox \cup GroundTruthBox)}$$
where $DefaultBox$ represents the position of the predicted frame, and $GroundTruthBox$ represents the location of the ground truth. The ratio of the intersection to the union of the two indicates the precision of the predicted frame. As shown in Table 4, although the precision of InsulatorGAN was not significantly improved compared with that of Cascade R-CNN, it was still higher than that of YOLOv4 and CenterNet. This result indicates that InsulatorGAN can locate the insulator in the generated image effectively and achieve the goal of insulator detection. In addition, when the IoU threshold was 0.9, the detection precision was significantly higher than that of the other baselines because the detection frame generated by InsulatorGAN is the minimum enclosing polygon, which is a natural advantage of our model.
Based on the work in [45], we used the structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and sharpness difference (SD) to measure the pixel-level similarity between the generated image and the ground truth.
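Before moving on to the pixel-level measures, the IoU criterion used for box precision can be illustrated with a minimal sketch. It assumes axis-aligned rectangles given as (x_min, y_min, x_max, y_max), whereas the frames in this work are polygons; the function names and the default threshold are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max). Rectangles are a
    simplification; the paper's frames are polygons.
    """
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_box, threshold=0.5):
    """A prediction counts as correct when its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))               # 1/7
print(is_correct((0, 0, 2, 2), (1, 1, 3, 3), 0.5))   # False
```

Raising the threshold from 0.5 toward 0.9 demands progressively tighter frames, which is where a tightly fitting frame has an advantage.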
Structural Similarity (SSIM) evaluates the similarity between images based on attributes such as brightness and contrast, with a value range of [−1, 1]. A larger SSIM value indicates a higher similarity of the input image pair. The formula is shown below:
$$\mathrm{SSIM}(I_g, I_t) = \frac{(2\mu_{I_g}\mu_{I_t} + c_1)(2\sigma_{I_g I_t} + c_2)}{(\mu_{I_g}^2 + \mu_{I_t}^2 + c_1)(\sigma_{I_g}^2 + \sigma_{I_t}^2 + c_2)}$$
where $\mu_{I_g}$ and $\mu_{I_t}$ represent the mean values of the generated image $I_g$ and the ground truth $I_t$; $\sigma_{I_g}$ and $\sigma_{I_t}$ represent the standard deviations of $I_g$ and $I_t$, respectively; $\sigma_{I_g I_t}$ is their covariance; and $c_1$, $c_2$ are constants introduced to ensure that the denominator is not 0.
The peak signal-to-noise ratio (PSNR) evaluates the quality of the generated image relative to the ground truth by measuring the ratio of the peak signal to the noise. Here, a higher value indicates better image quality. The formula is as follows:
$$\mathrm{PSNR}(I_g, I_t) = 10 \log_{10}\left(\frac{\max_{I_g}^2}{\mathrm{mse}}\right) \tag{15}$$
$$\mathrm{mse}(I_g, I_t) = \frac{1}{n}\sum_{i=1}^{n}\left(I_t[i] - I_g[i]\right)^2 \tag{16}$$
where $n$ is the number of pixels and $\max_{I_g}$ is the maximum pixel intensity.
The sharpness difference (SD) measures the loss of sharpness during image generation. In this paper, we followed the concepts outlined in [45] to calculate the gradient change between the fake image and the ground truth:
$$\mathrm{SharpDiff}(I_g, I_t) = 10 \log_{10}\left(\frac{\max_{I_g}^2}{\mathrm{grads}}\right) \tag{17}$$
$$\mathrm{grads} = \frac{1}{N}\sum_{i}\sum_{j}\left|\left(\nabla_i I_t + \nabla_j I_t\right) - \left(\nabla_i I_g + \nabla_j I_g\right)\right|$$
where $\nabla_i I = I_{i,j} - I_{i-1,j}$, $\nabla_j I = I_{i,j} - I_{i,j-1}$, and $N$ is the number of pixels. The Sharpness Difference in Formula (17) can be regarded as the reciprocal of the gradient.
As shown in Table 6, the score of InsulatorGAN was higher than the scores of all baselines. The results indicate that InsulatorGAN can learn to generate high-quality insulator-detection images for complex environments such as mountains, forests, and towns. The proposed penalty mechanism further improves the semantic details of the image, makes the generated image more realistic, and significantly reduces forgery traces in the image. In addition, the similarity values for the fake image and the ground truth were high, indicating that the generated results can be used as samples to expand the insulator dataset.
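To make the SSIM and PSNR definitions concrete, the following sketch evaluates both on flattened lists of pixel intensities. It is an illustration under assumptions: SSIM is computed once over the whole image rather than over local windows, the maximum intensity is taken as 255, and the constants follow the commonly used choice $c_1 = (0.01 \cdot 255)^2$, $c_2 = (0.03 \cdot 255)^2$, none of which is specified in the text. Function names are hypothetical.

```python
import math


def mse(gen, truth):
    """Mean squared error between two equally sized pixel lists."""
    return sum((t - g) ** 2 for g, t in zip(gen, truth)) / len(gen)


def psnr(gen, truth, max_value=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    err = mse(gen, truth)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)


def ssim_global(gen, truth, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM from whole-image statistics (no sliding window; an assumption)."""
    n = len(gen)
    mu_g = sum(gen) / n
    mu_t = sum(truth) / n
    var_g = sum((g - mu_g) ** 2 for g in gen) / n
    var_t = sum((t - mu_t) ** 2 for t in truth) / n
    cov = sum((g - mu_g) * (t - mu_t) for g, t in zip(gen, truth)) / n
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    return ((2 * mu_g * mu_t + c1) * (2 * cov + c2)) / (
        (mu_g ** 2 + mu_t ** 2 + c1) * (var_g + var_t + c2))


truth = [10.0, 50.0, 200.0, 120.0]
fake = [12.0, 48.0, 190.0, 125.0]
print(psnr(fake, truth), ssim_global(fake, truth))
```

Identical inputs give an SSIM of 1 and an infinite PSNR, which is a quick sanity check for any implementation of these measures.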
To compare the speed of all models, we conducted a speed test. As shown in Table 5, the test speed of the proposed InsulatorGAN was lower than that of all the baselines. This result was due to the two-stage, coarse-to-fine generation method adopted by the model, which inevitably increased the amount of computation. The Monte Carlo search also required significant time. Nevertheless, the speed difference between InsulatorGAN and SelectionGAN was relatively small, and 64 FPS was enough to meet the needs of practical applications.

Qualitative Evaluation
The qualitative experimental results are shown in Figures 5 and 6. The resolution of the test images was 512 × 512. The images generated by InsulatorGAN were clearer, and the details of the objects and scenes were richer. On the InsuGenSet and CPLID datasets, previous models tended to generate fuzzy and distorted images. InsulatorGAN instead learned to generate highly diverse images of insulators, producing more semantic details at the edges and connections of insulators. The wires, trees, and towers in the non-insulator areas were realistic and similar to those of the ground truth.

Sensitivity Analysis
In this section, we conducted five sensitivity-analysis experiments covering the two-stage generation method, the number of Monte Carlo searches, the multi-level discriminator, the number of training iterations, and the minimum amount of training data.

Two-Stage Generation
To verify the impact of two-stage generation on model performance, we varied the number of generation stages of InsulatorGAN on the InsuGenSet dataset. The experimental results are shown in Table 6. When two-stage generation was used, InsulatorGAN had the most balanced performance across the indicators.
The results for the number of Monte Carlo searches are shown in Table 7. When the number of Monte Carlo searches was 5, the performance of InsulatorGAN across the indicators was the most balanced.
To verify the impact of the multi-level discriminator on the performance of the model, we varied the number of discriminator levels of InsulatorGAN on the InsuGenSet dataset. The experimental results are shown in Table 8. When the three-level discriminator was used, the performance of InsulatorGAN across the indicators was the most balanced.
The results for the number of training epochs are shown in Table 9. When the number of epochs was 200, InsulatorGAN had the most balanced performance across the indicators.
The results of the minimum-training-data experiment are shown in Table 10. The performance of InsulatorGAN did not drop significantly as the training set was initially reduced. When the training set was reduced to 70% of its original size, InsulatorGAN's scores on the various indicators were similar to those of SelectionGAN, which shows that InsulatorGAN is robust and can still learn key feature information even from a small-scale dataset. To a certain extent, it overcomes the poor generalization ability of previous models.

Ablation Analysis
To analyze the functions of the different components in InsulatorGAN, we performed an ablation analysis on InsuGenSet. As shown in Table 11, model B had better indicators than model A, which demonstrates that the two-stage, coarse-to-fine generation method improved the clarity of the image. Model C promoted the learning of the generator by introducing a multi-level discriminator, which enhanced the quality of the generated image while improving stability. The score of model D shows that the penalty mechanism significantly improved the performance of InsulatorGAN, allowing the model to obtain sufficient semantic constraints during the generation process and increasing the clarity and realism of the generated image.

Computational Complexity
The number of network parameters and the training time were recorded to evaluate the space complexity and time complexity of the networks. The results are shown in Table 12. Compared with Cascade R-CNN, InsulatorGAN had fewer network parameters but obtained better test performance. Compared with generative models such as SelectionGAN, the number of parameters and the training time of InsulatorGAN increased slightly, but the resulting performance improvement was worthwhile.

Conclusions
This paper proposed a power-line-insulator-detection image-generation model called InsulatorGAN, which can generate insulator-detection images based on aerial images taken by drones. This model first uses a coarse-grained module to generate low-resolution target images and then uses a Monte Carlo search to mine the hidden information of the intermediate results. Finally, this method uses a fine-grained model to synthesize high-resolution target images based on the search results. Quantitative and qualitative experiments on the public CPLID dataset and the self-built InsuGenSet dataset provided the following results. Compared with current mainstream models, the model in this paper can generate clearer and more diversified insulator-detection images. The ablative analysis experiment demonstrated that the penalty mechanism proposed in this paper based on a Monte Carlo search significantly improved the image quality. In addition, the model in this paper can be flexibly transferred to multiple scenarios. In the future, we will further explore the application of this model to other power-line components, image-style transfer, image translation, and other fields.

Data Availability Statement: The data in this paper are undisclosed due to the confidentiality requirements of the data supplier.