TL-Net: A Novel Network for Transmission Line Scenes Classification

Abstract: With the development of unmanned aerial vehicle (UAV) control technology, one of the recent trends in this research domain is to utilize UAVs to perform non-contact transmission line inspection. The RGB camera mounted on UAVs collects large numbers of images during the transmission line inspection, but most of them contain no critical components of transmission lines. Hence, it is a momentous task to adopt image classification algorithms to distinguish key images from all aerial images. In this work, we propose a novel classification method to remove redundant data and retain informative images. A novel transmission line scene dataset, namely TLS_dataset, is built to evaluate the classification performance of networks. Then, we propose a novel convolutional neural network (CNN), namely TL-Net, to classify transmission line scenes. In comparison to other typical deep learning networks, TL-Nets achieve higher classification accuracy and lower memory consumption. The experimental results show that TL-Net101 gains 99.68% test accuracy on the TLS_dataset.


Introduction
The increasing dependence of modern-day societies on the electricity supply challenges the reliability and sustainability of the uninterrupted flow of electricity [1]. However, due to the rapid development of the domestic industry, the critical components of transmission lines are exposed in the natural environment for an extended period and suffer from environmental pollution [2]. Hence, transmission line faults are inevitable and may lead to local power supply tension, forest fires, and even extensive blackouts. Accordingly, electricity companies must perform transmission line inspections regularly to maintain the reliability and sustainability of transmission networks.
Accidents of the critical components, such as downed poles and insulator self-shattering, are the main reasons for transmission line faults [3]. Hence, it is necessary to implement transmission line inspections to evaluate the health condition of critical components. To ensure an uninterrupted flow of electricity, foot patrols with various sensors have been a reliable method of transmission line inspection over the past few years [4]. However, transmission lines are usually built on various topographies, such as plateaus and mountains [5]. Consequently, foot patrols usually suffer from poor efficiency on such terrain, and UAV-based inspection has become an attractive alternative. However, the camera mounted on the UAV collects large numbers of redundant images, which can reduce the efficiency of critical component health evaluation. Figure 1 illustrates a positive sample and a negative sample captured by the camera during a transmission line inspection. The positive sample contains critical components of transmission lines, while the negative sample does not. Due to the development of machine learning, it has become a trend to adopt machine learning algorithms to solve image classification tasks. However, machine learning algorithms cannot extract high-level semantics of images effectively, which leads to limited performance in image classification tasks [7]. Hence, image classification algorithms based on machine learning cannot satisfy classification tasks for images containing intricate content. In recent years, with the development of graphics processing unit (GPU) technology, GPUs can achieve a high running speed of parallel processing, which has led to significant improvements in the deep learning research domain [8]. Deep learning can extract higher-level features, which represent more abstract semantics of images [9]. Hence, deep learning algorithms, such as convolutional neural networks (CNNs), have recently garnered significant attention in image classification tasks [10].
To improve the classification accuracy on the TLS_dataset, we propose a novel CNN, namely TL-Net. Inspired by the Inception module and the SE module [11], we propose two novel modules, namely the optimized Inception module (OIM) and the optimized SE module (OSEM). TL-Nets are built by inserting these two new modules into ResNets. The OIM is proposed to reduce parameters and retain receptive fields of various sizes. To further facilitate information flow and fuse various image features, we propose the OSEM based on the SE module (SEM). The results of ablation and comparison experiments verify the effectiveness of these two modules. The method proposed in this paper can retain informative images for transmission line inspection effectively. Hence, our work is of considerable significance in maintaining the reliability and sustainability of transmission networks. The general schematic flowchart of the transmission line scene classification is demonstrated in Figure 2.

Machine Learning
Common machine learning algorithms include Support Vector Machine (SVM) [12], Bayesian Additive Regression Tree (BART) [13], and Quantile Regression Forests (QRF) [14]. In recent years, machine learning algorithms have been widely applied in various research domains related to electricity, such as power outage prediction [15], wind speed prediction [16], and storm outage prediction [17]. In [15], Yang et al. proposed a method to quantify the uncertainty in power outage prediction modeling based on machine learning. Cerrai et al. [16] proposed three new modules based on Outage Prediction Model (OPM) and evaluated them on 76 extratropical and 44 convective storms. Bhuiyan et al. [17] evaluated the performance of BART and QRF on wind speed prediction, and their study suggested that QRF outperformed BART.
Due to the working mode of machine learning algorithms, it is necessary to extract image features before training. Common image feature extraction algorithms include Local Binary Pattern (LBP), Scale-Invariant Feature Transform (SIFT), and Histogram of Oriented Gradients (HOG) [7]. However, these hand-crafted features cannot simply be adapted to new conditions [1,18]. Hence, the image classification methods, which are based on machine learning, cannot satisfy the accuracy requirements of classification tasks with intricate content, such as transmission line scenes.

Deep Learning
In 1998, LeCun [19] proposed LeNet to handle the handwritten digit classification task. However, LeNet suffers from high memory consumption and low efficiency. Hence, this significant innovation drew little attention. In 2012, the success of AlexNet, proposed by Krizhevsky et al., led to the resurgence of CNNs [20]. AlexNet is a significant breakthrough in the field of computer vision due to its substantially better accuracy than traditional machine learning methods. Szegedy et al. [21] proposed GoogLeNet, which contains a novel module named the Inception module. The Inception module can merge the feature splits which are produced by various convolution kernels. In 2014, the Visual Geometry Group proposed VGG [22], and the results of their work verify that deeper CNNs usually achieve better accuracy. In contrast to GoogLeNet, VGG stacks 3×3 convolution layers repeatedly to construct the entire network. The strategy of VGG is simple but effective, and it is inherited by ResNet. In 2015, He et al. [23] proposed ResNet, and their pioneering work alleviates the notorious problem of vanishing/exploding gradients in deep CNNs, which can obtain high-level semantics of images.
Deep convolutional neural networks have been adopted in different fields, such as insulator detection [24,25] and power line inspection [26,27]. Previous researchers have devoted considerable effort to transmission line scene classification. Yang et al. [28] proposed a method for classifying key images of transmission lines based on Markov Random Fields (MRF). In [29], Kim

Data Collection
The images in TLS_dataset were captured by cameras during transmission line inspections in China. This original dataset consists of 6000 aerial images, each with a size of 3×224×224 (three RGB channels at a 224×224 resolution). There are 3000 positive samples and 3000 negative samples in this dataset.

Data Augmentation
To improve the robustness of classifiers, we increase the number of images with the following methods. After flipping horizontally, flipping vertically, adding noise, and modifying image brightness, the number of images is increased fivefold. We set the ratio of training data, validation data, and test data to be 3:1:1, as shown in Figure 3. Specifically, we modify image brightness in the HSV color space and then convert the image back from the HSV color space to the RGB color space.
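The bookkeeping behind the fivefold augmentation and the 3:1:1 split can be sketched in plain Python as follows. The transform names are illustrative placeholders, not the exact implementation used for TLS_dataset:

```python
import random

def augment(images):
    """Apply four augmentations to each image; together with the originals,
    the dataset grows fivefold (6000 -> 30000 images in TLS_dataset)."""
    transforms = ["flip_h", "flip_v", "add_noise", "modify_brightness"]  # placeholders
    augmented = list(images)
    for img in images:
        for t in transforms:
            augmented.append((img, t))  # stand-in for the transformed image
    return augmented

def split_3_1_1(samples, seed=0):
    """Shuffle and split the samples into train/val/test with a 3:1:1 ratio."""
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    n = len(samples) // 5
    return samples[:3 * n], samples[3 * n:4 * n], samples[4 * n:]

data = augment(range(6000))
train_set, val_set, test_set = split_3_1_1(data)
```

With 6000 originals this yields 30,000 images, split into 18,000 training, 6000 validation, and 6000 test samples.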


Optimized Inception Module
The Inception module is a notable architecture innovation in the development of CNNs, which contains various kernel sizes to extract features of different sizes [21]. Consider a feature map X that is passed through the Inception module. This module implements s transformation functions, denoted by [F1(·), F2(·), ···, Fs(·)], each of which can be regarded as a composite function of various operations, such as convolution, activation function, pooling, and Batch Normalization (BN). The output of the Inception module is denoted by Y, which can be defined as in Equation (1):

Y = C[F1(X), F2(X), ···, Fs(X)], (1)

where C[·] refers to the concatenation of each feature outputted by different branches. The Inception module can extract features of various receptive fields and merge these features across their channel dimension. However, the Inception module suffers from excessive parameters and a lack of interpretability. Hence, it is unclear how to modify the large numbers of hyper-parameters to adapt to different tasks. To address this challenge, we propose the optimized Inception module (OIM). The shortcut connection of ResNets can effectively alleviate the problem of vanishing and exploding gradients in deep networks [23]. Based on the results of previous works, deep networks can obtain a large receptive field by stacking small convolution layers [22,23]. Hence, the Inception module, which contains large convolution layers, is not suitable to be inserted into deep networks. In contrast to the Inception module, the OIM comprises only 1×1 and 3×3 convolution layers. The 3×3 convolution layer can gain a large receptive field in deep networks, and the 1×1 convolution layer can retain the receptive fields in various sizes. To further reduce the parameters, we insert the grouped convolution into the OIM as depicted in Figure 4. Consider a feature map X which is passed through the OIM. The input X is split into four parts of the same spatial size. These four feature blocks are denoted by x1~x4.
Then, we utilize 1×1 and 3×3 convolution layers to extract multi-scale features of different receptive fields, and these features are merged across their channel dimension. The output of each branch is indicated as yi, where i∈{1,2,3,4}. We refer to the transformation functions of the 1×1 and 3×3 convolution layers as Fi(·) and Gi(·), respectively. Finally, the output Y is written as in Equation (2):

Y = C[y1, y2, y3, y4], (2)

where yi refers to the output of the convolution operations on xi. We insert the OIM into ResNets to replace the original 3×3 convolution layers. Consider the input size and the output size of the convolution layer both to be c×h×w. The parameter count of the 3×3 convolution layer, P3×3, and the parameter count of the OIM, POIM, can be calculated as in Equation (3), which shows that the OIM requires fewer parameters than the original 3×3 convolution layer.
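The split-transform-concatenate pattern behind Equations (1) and (2) can be illustrated with a minimal NumPy sketch. The identity transforms below are placeholders for the 1×1 and 3×3 convolution branches; the exact branch configuration of the OIM is not reproduced here:

```python
import numpy as np

def oim_forward(x, transforms):
    """Split x of shape (c, h, w) into four channel blocks x1..x4,
    transform each block, and concatenate the outputs along the
    channel dimension, i.e., Y = C[y1, y2, y3, y4] as in Eq. (2)."""
    c = x.shape[0]
    assert c % 4 == 0, "channel count must be divisible by 4"
    blocks = np.split(x, 4, axis=0)                  # x1 ... x4
    ys = [f(b) for f, b in zip(transforms, blocks)]  # yi: per-block output
    return np.concatenate(ys, axis=0)                # channel-wise concatenation

# Placeholder transforms standing in for the 1x1 / 3x3 convolution branches.
identity = lambda b: b
x = np.arange(64 * 4 * 4, dtype=np.float32).reshape(64, 4, 4)
y = oim_forward(x, [identity] * 4)
```

Because each branch operates on only a quarter of the channels, replacing a full-width convolution with this grouped pattern is what reduces the parameter count.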


Optimized SE Module
The channel-wise attention strategy can focus on relevant information and filter redundant information to decrease the complexity of image analyses. Momenta presented the SEM in 2017, which is a typical channel-wise attention module [11]. Figure 5 shows the architecture of the SEM, which consists of the squeeze operation and the excitation operation. The feature map X first enters the squeeze operation, which aggregates X across its spatial dimensions (h×w) and produces the channel descriptor x. The excitation operation follows to capture channel-wise dependencies.
Figure 5. The architecture of the SEM. The purple frame indicates the squeeze operation, and the green frame indicates the excitation operation. FC denotes the fully connected layer. The numbers of neurons in the first and second fully connected layers are set to c/16 and c, as suggested in [11]. The input is indicated as X, and the output is denoted by Y.
Consider the input X, with a size of c×h×w, that is passed through the SE module. The output Y is given as in Equation (4):

Y = Fscale(X, e), e = Fex(Fsq(X)), (4)

where Fsq(·) and Fex(·) refer to the transformation functions of the squeeze and excitation operations, respectively, and Fscale(X, e) indicates the channel-wise multiplication between X and e. To facilitate information flow further and fuse various channel descriptors, we propose a novel channel-wise attention module based on the SEM, which is called the OSEM. We split the input X into four parts with the same spatial size, denoted by Xi, where i∈{1,2,3,4}. Each feature block goes through two fully connected layers, which are denoted by fi(·) and gi(·). If a squeeze operation and an excitation operation are implemented on each Xi independently, the output Y will lack the dependencies between the feature blocks. To solve this, we propose a new connectivity pattern: we introduce a direct connection from a channel descriptor xi to the subsequent channel descriptor xi+1. Figure 6 illustrates the architecture of the OSEM. The output Y is written as in Equation (5). In the OSEM, each fully connected layer fi(·) and gi(·) can potentially receive channel descriptors from all the preceding feature blocks {xj, j≤i}. Note that y4 receives channel descriptors of all the feature blocks. This connectivity pattern can enforce the information flow of features and fuse features of different blocks. The information fusion of different feature blocks can make the output of the OSEM contain more image information than that of the SEM. Consider the input size and the output size of the SEM and the OSEM both to be c×h×w. The parameter count of the SEM, PSEM, and the parameter count of the OSEM, POSEM, can be calculated as in Equation (6).
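A simplified NumPy sketch of the SEM forward pass in Equation (4) follows: global average pooling as the squeeze, two fully connected layers (with the c/16 reduction suggested in [11]) plus ReLU and sigmoid as the excitation, and a channel-wise rescaling. Biases and batch handling are omitted for brevity, and the random weights are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_forward(x, w1, w2):
    """Squeeze-and-Excitation forward pass, Eq. (4).

    x  : feature map of shape (c, h, w)
    w1 : first FC layer weights, shape (c // 16, c)
    w2 : second FC layer weights, shape (c, c // 16)
    """
    # Squeeze: global average pooling over the spatial dimensions (h, w).
    s = x.mean(axis=(1, 2))                    # channel descriptor, shape (c,)
    # Excitation: FC -> ReLU -> FC -> sigmoid (biases omitted).
    e = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))  # channel weights in (0, 1)
    # Scale: channel-wise multiplication between X and e.
    return x * e[:, None, None]

c = 32
rng = np.random.default_rng(0)
x = rng.standard_normal((c, 8, 8))
w1 = rng.standard_normal((c // 16, c)) * 0.1
w2 = rng.standard_normal((c, c // 16)) * 0.1
y = se_forward(x, w1, w2)
```

Since every channel weight lies in (0, 1), the module can only attenuate channels, which is how it filters redundant information. The OSEM additionally chains the per-block channel descriptors, which this plain SEM sketch does not show.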

Network Implementation Details
ResNets are composed of basic blocks, as illustrated in Figure 7a. We reformulate the network architecture by inserting the OIM and the OSEM into ResNets. The basic block in TL-Nets is illustrated in Figure 7b. The implementation of TL-Nets follows [23], as given in Table 1, and we utilize the publicly available Keras framework to build all the networks in this paper. Consider a 3×224×224 input image that is passed through TL-Nets. Before entering conv2, a 7×7 convolution operation and a max-pooling are performed on the input image. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2, as suggested in [23]. At the end of TL-Nets, the sigmoid is attached to the average pooling. In this paper, all the networks are trained using SGD (Stochastic Gradient Descent) with a mini-batch size of 10 on a Titan Xp GPU. We set the momentum, the weight decay, and the initial learning rate to 0.9, 10^-4, and 0.01, as suggested in [23]. The loss function is set to binary cross-entropy.
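The parameter update implied by these hyper-parameter settings can be sketched as plain SGD with momentum and L2 weight decay. This is a generic per-parameter illustration, not the Keras internals:

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay.

    w, grad, velocity are lists of floats (a stand-in for network tensors).
    Returns the updated weights and velocities.
    """
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + weight_decay * wi   # L2 weight decay folded into the gradient
        v = momentum * vi - lr * g   # momentum accumulation
        new_w.append(wi + v)
        new_v.append(v)
    return new_w, new_v

weights, vel = [1.0, -2.0], [0.0, 0.0]
weights, vel = sgd_momentum_step(weights, [0.5, -0.5], vel)
```

With the settings above (lr = 0.01, momentum = 0.9, weight decay = 10^-4), each step nudges the weights against the regularized gradient while the velocity smooths the trajectory across mini-batches.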

Ablation Studies
To better illustrate the effectiveness of each optimized module, we conduct the ablation study as follows. We evaluate the classification performance of ResNets with each optimized module. To ensure a fair comparison, we eliminate the differences in hyper-parameter settings. The basic blocks of OI-ResNets and OSE-ResNets are depicted in Figure 8a. The receiver operating characteristic (ROC) curves are adopted to illustrate the classification performance of binary classifiers with variable thresholds. The area under the ROC curve (AUC) is equal to the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. We evaluate the classification performance of each network by the accuracy on test data and the AUC value, as suggested in [7,31]. The larger the AUC value, the better the classification performance of the network. False negative (FN), false positive (FP), true negative (TN), and true positive (TP) are essential parameters in the ROC curve. The true positive rate (TPR) and the false positive rate (FPR) are defined as in Equation (7):

TPR = TP / (TP + FN), FPR = FP / (FP + TN). (7)

The accuracy can be calculated as in Equation (8):

Accuracy = (TP + TN) / (TP + TN + FP + FN). (8)

Table 2 shows the accuracies of ResNets with each optimized module. As we can observe, the OIM and the OSEM both improve test accuracy compared to the baseline models.
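Equations (7) and (8), together with the probabilistic interpretation of the AUC given above, translate directly into code. The counts and scores below are illustrative, not values from Table 2:

```python
def tpr_fpr_accuracy(tp, fn, fp, tn):
    """Eq. (7) and (8): true positive rate, false positive rate, accuracy."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return tpr, fpr, acc

def auc_pairwise(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance is
    ranked higher than a randomly chosen negative one (ties count 1/2)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative confusion-matrix counts: TP=90, FN=10, FP=5, TN=95.
tpr, fpr, acc = tpr_fpr_accuracy(90, 10, 5, 95)
auc = auc_pairwise([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

The pairwise formulation is the direct reading of the AUC definition in the text; in practice the AUC is usually computed by integrating the ROC curve, which gives the same value.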
Specifically, OIM-ResNet50 and OSEM-ResNet50 improve test accuracy by 0.84% and 1.09%, respectively, compared to ResNet50. In comparison with ResNet101, OIM-ResNet101 and OSEM-ResNet101 achieve improvements of 1.07% and 1.19% in test accuracy. Moreover, the OIM and the OSEM can improve the AUC value of ResNets, according to the data in Table 2.



Comparison with Typical Deep Learning Methods
We trained both TL-Nets and several typical CNNs and evaluated them on our test data to validate the effectiveness of TL-Nets. Figures 11 and 12 show the training accuracy curves and training loss curves of ResNets and TL-Nets. Figure 13 shows the training accuracy curves and training loss curves of TL-Net101, InceptionV3 [32], and Inception-ResNetV2 [33]. Table 3 lists the experimental results, which suggest that TL-Net101 achieves the highest test accuracy and AUC value compared to the other networks. To be more specific, TL-Net50 and TL-Net101 improve test accuracy by 1.50% and 1.45% compared to the baseline models. In comparison to the baseline models, TL-Net50 and TL-Net101 improve the AUC value by 0.0059 and 0.0031. In comparison to InceptionV3 and Inception-ResNetV2, TL-Net101 gains improvements of 0.81% and 0.10% in test accuracy, respectively. Hence, TL-Nets achieve better classification performance than ResNets, which suggests the effectiveness of the OIM and the OSEM. Figure 14 shows the ROC curves of ResNet50, ResNet101, TL-Net50, TL-Net101, Inception-ResNetV2, and InceptionV3. Finally, we evaluated the memory consumption and the average running time per image of the networks, as shown in Table 4. As we can observe, the average running time of ResNet50 is shorter than that of TL-Net50 by 0.006 s. However, the test accuracy of TL-Net50 is higher than that of ResNet50 by 1.50%. We can observe a similar phenomenon for ResNet101 and TL-Net101. Moreover, TL-Net50 consumes less memory than ResNets, InceptionV3, and Inception-ResNetV2. Hence, TL-Nets can be seen as a compromise between test accuracy and real-time performance.

Conclusions and Future Work
In this paper, we propose a novel network for classifying transmission line scenes, namely TL-Net. TL-Nets are built by inserting two optimized modules into ResNets: the OIM and the OSEM. Specifically, the OIM is designed to reduce network parameters and gain receptive fields of various sizes. The OSEM is proposed to improve information flow and fuse different features. In comparison to other typical deep learning networks, TL-Nets achieve better classification results. Overall, the methods proposed in this paper can improve the accuracy of transmission line scene classification, and they are of considerable significance to the reliability and sustainability of transmission networks.