Weakly Supervised Learning for Transmission Line Detection Using Unpaired Image-to-Image Translation

Abstract: To achieve full autonomy of unmanned aerial vehicles (UAVs), obstacle detection and avoidance are indispensable parts of visual recognition systems. In particular, detecting transmission lines is an important topic due to the potential risk of accidents while operating at low altitude. Even though many studies have been conducted to detect transmission lines, there still remain many challenges due to their thin shapes in diverse backgrounds. Moreover, most previous methods require a significant level of human involvement to generate pixel-level ground truth data. In this paper, we propose a transmission line detection algorithm based on weakly supervised learning and unpaired image-to-image translation. The proposed algorithm only requires image-level labels, and a novel attention module, which is called parallel dilated attention (PDA), improves the detection accuracy by recalibrating channel importance based on the information from various receptive fields. Finally, we construct a refinement network based on unpaired image-to-image translation so that the prediction map is guided to detect line-shaped objects. The proposed algorithm outperforms the state-of-the-art method by 2.74% in terms of F1-score, and experimental results demonstrate that the proposed method is effective for detecting transmission lines in both quantitative and qualitative aspects.


Introduction
Recently, unmanned aerial vehicles (UAVs) have been widely utilized in many industrial fields. In the applications of construction site monitoring and infrastructure inspection, UAVs reduce the cost and inspection time while ensuring the safety of inspectors [1]. Furthermore, UAVs contribute to increasing the efficiency of precision agriculture by spreading seeds and monitoring the conditions of crops more effectively than human workers [2]. Beyond this, UAVs are also applied to military surveillance, aerial photography, search and rescue, and product delivery [3]. As technology advances, UAVs with high-level autonomy have been utilized in applications such as forest fire monitoring and security systems.
For the reliable operation of autonomous UAVs, obstacle detection and avoidance are important functions. Most autonomous UAVs and drones are equipped with cameras for visual recognition, and path planning and control depend on the accurate recognition of surrounding environments. Previous studies proposed several obstacle detection methods. In particular, Huang et al. [4] proposed an obstacle avoidance algorithm using a monocular camera and millimeter-wave radar together. Similarly, a deep learning-based recognition algorithm was employed to detect multiple obstacles in [5]. In addition, many researchers have studied obstacle avoidance and path planning based on deep learning approaches [6][7][8][9][10][11]. Ou et al. [9] suggested a framework based on deep reinforcement learning to plan feasible global paths with an obstacle map. Yuan et al. [10] presented a path planning method based on a convolutional neural network (CNN) model that can detect and localize obstacles such as buildings.
Data collection has become more straightforward than ever, and with the support of extensive public datasets, deep learning techniques have shown promising results in various industrial fields. However, supervised learning methods require ground truth data, which is expensive and time-consuming to produce manually, especially for large datasets. To address these limitations, weakly supervised learning has received attention as a way to train deep neural networks with weak supervision. In recent years, weakly supervised learning has been applied to various tasks such as object detection [12][13][14], segmentation [15][16][17][18][19], and localization [20][21][22]. Wang et al. [16] suggested a segmentation algorithm that combines U-Net with a class activation map and is trained using only image-level labels. They demonstrated that a CNN model trained with weak supervision can also segment cropland accurately.
Among several types of obstacles, the transmission line is a critical obstacle that must be avoided. A collision with transmission lines can disrupt the stable supply of electricity, and furthermore, crashed UAVs can cause secondary accidents. Deep learning methods have been successfully applied to the detection and localization of transmission lines [21][22][23][24][25]. In transmission line datasets, class imbalance occurs because the background area occupies most of the aerial image. The authors of [23] proposed a generalized focal loss function to handle class imbalance in the transmission line detection task. Lee et al. [21] introduced a weakly supervised learning method for detecting transmission lines, and they employed the VisualBackProp algorithm proposed by Bojarski et al. [26] to localize transmission lines. Following this study, Choi et al. [22] proposed an extended method for transmission line detection. They utilized only patch-level labels to reduce the cost of collecting pixel-level ground truth data. Their method assumes that transmission lines are partially straight and connects broken lines by utilizing the orientations of the line segments. However, the class labels of the patch images require rough location information of transmission lines in images.
In this paper, we propose a transmission line detection method based on weakly supervised learning and unpaired image-to-image translation. The main contributions of this paper are threefold:

• We develop a weakly supervised algorithm for detecting transmission lines in UAV images. Unlike the previous methods, which require pixel-level labels, our proposed method requires minimal labeling work for preparing training data, and therefore, it is easily applicable to real-world problems.
• We integrate a novel attention module into the classification network to obtain a robust localization mask. To incorporate the information from various receptive fields, we introduce a parallel dilated attention (PDA) module.
• For the training of the refinement network, we generate pseudo-line data and employ the cycle consistency loss, which was proposed in [27]. The refinement network enhances the line-shaped property of transmission lines, and therefore, the localization result is significantly improved in both quantitative and qualitative aspects.
The remainder of this paper is structured as follows: the related work is summarized in Section 2. The proposed method is presented in Section 3. Results and conclusions are presented in Sections 4 and 5, respectively.

Attention Mechanism
The attention mechanism has attracted many deep learning researchers, and they have proposed the following mechanisms. The bottleneck attention module (BAM) [28] is an attention module that computes attention maps from the two separated spatial and channel attention branches. Unlike the parallel structure of BAM, the spatial and channel attention of convolutional block attention modules (CBAM) [29] are sequentially configured. Yang et al. [30] proposed SimAM, which is a parameter-free attention module to calculate 3D attention weights in the channel and spatial dimensions. Hu et al. [31] tried to reduce the amount of computation while performing recalibration with a squeeze and excitation (SE) module, although this operation destroys the relationship between channels and weights due to channel reduction. The SE module utilizes the global average pooling (GAP) and two fully connected layers, which can be integrated into CNN architectures such as VGGNet, ResNet, and GoogLeNet. Wang et al. [32] proposed an efficient channel attention (ECA) to compute local cross-channel interaction by applying 1D convolution. The ECA module has shown significant improvements in state-of-the-art object detection, image classification, and object segmentation along with lightweight parameters. We compare the ECA module and other channel attention methods [30,31] with the proposed attention module and demonstrate the effectiveness of our proposed attention module in Section 4.5.
As research on the attention mechanism has progressed, it has been used in a variety of applications. For time-series data, attention mechanisms can be implemented in sequence-to-sequence models with encoder and decoder architectures so that the models attend to specific parts of the sequence [33][34][35]. Furthermore, attention modules have been applied to CNN models that use time-series data to estimate blood pressure [36] and to classify sleep stages [37]. Many studies have attempted to solve diverse problems on remote sensing image data such as classification [38,39], ship detection [40,41], and semantic segmentation [42]. Ma et al. [39] implemented a channel and spatial attention module and integrated it into a CNN architecture for the classification of remote sensing scene images. Detecting small-scale ships is a challenging task in optical remote sensing images. Hu et al. [40] proposed detection models with the attention mechanism to suppress background while focusing on small ships to improve detection accuracy. Moreover, the authors of [43,44] integrated attention modules into their deep neural networks to segment organs such as the esophagus and lungs in medical images. Motivated by channel attention, we improved the localization mask by adopting an attention mechanism to focus on the important channels.

Image-to-Image Translation
Generative adversarial networks (GAN) [45] have made great strides in deep learning, and subsequent algorithms such as deep convolutional GAN (DCGAN) [46], conditional GAN (CGAN) [47], and InfoGAN [48] have been proposed. Isola et al. [47] proposed a supervised learning method that performs image-to-image translation using a paired dataset. Paired image-to-image translation is restrictive in its applications to real-world problems because it requires data correspondences. On the other hand, unpaired image-to-image translation techniques can address the limitations of paired image-to-image translation methods. There are several style transfer networks including CycleGAN [27], DiscoGAN [49], and DualGAN [50], and these methods translate the style of input images based on unpaired datasets. Zhu et al. [27] proposed CycleGAN, which translates images from the source domain to the target domain with the cycle consistency loss, and this method shows remarkable results in the style transfer task. Researchers have extended unpaired image-to-image translation to several applications [51][52][53][54][55][56][57][58] to address issues such as data imbalance, lack of diversity, and the difficulty of collecting real paired datasets.
Zi et al. [51] constructed a modified CycleGAN to effectively generate clear images from cloudy images by utilizing unpaired image datasets. Furthermore, CycleGAN has been applied for data augmentation by translating synthetic images into realistic images in [52,53]. In particular, Mao et al. [53] improved the performance of classifying actual wildfire smoke by utilizing images that were artificially generated with CycleGAN [27]. To safely drive autonomous vehicles, road surface detection is essential because knowledge of road surface conditions (e.g., dry, wet, snowy) affects autonomous driving control [54]. Images of dry conditions can be collected more frequently, resulting in unbalanced data problems. To address this imbalance, the authors of [54] generated images of wet and snowy roads through an unpaired image-to-image translation method. Although biometric systems (e.g., fingerprint-based and face-based) are widely used for security purposes, these recognition systems can be vulnerable to presentation attacks. A presentation attack aims to interfere with the normal functioning of a biometric recognition system by presenting artifacts or biometric characteristics.
For presentation attack detection (PAD), Nguyen et al. [55] improved the generalization of the detection model by using fake presentation attack face images obtained via CycleGAN. Inspired by these studies, we employed image-to-image translation to refine the localization mask using pseudo-line data, and the effectiveness of the refinement network is shown in the ablation study in Section 4.

Proposed Method
This section presents the proposed transmission line detection method. The weakly supervised learning framework proposed by Bojarski et al. [26] is employed to generate a localization mask for the transmission lines. Unlike the previous work, we introduce a novel attention mechanism called PDA to improve the quality of the localization mask, and the resulting mask is called the attention localization mask (ALM). Furthermore, we develop a refinement network by utilizing an unpaired image-to-image translation technique between ALM and pseudo-line data. An overview of the proposed framework is shown in Figure 1. The upper and lower left parts present the backbone network for classifying images with and without transmission lines and the process for generating ALM from the hierarchical feature maps, respectively. The lower right part is the generator of the refinement network that produces the refined image, and the upper right part is a discriminator for adversarial training of the refinement network.

Classification Network and VisualBackProp Algorithm
We constructed a classification network to implement the VisualBackProp algorithm [26], which can obtain localization masks using feature maps. The classification network was constructed based on the VGG16 architecture, which consists of five convolution blocks, to classify images with and without transmission lines. A convolution block contains convolution layers, a rectified linear unit (ReLU), and a max pooling operation. Although similar models were utilized in [21,22] for localizing transmission lines in patch images, our proposed model is different from the previous methods in that transmission lines can be localized in the original images. We employed image-level labels assigned to the images of the power line dataset at their original size of 512 × 512.
After binary classification, the localization mask is generated by the VisualBackProp algorithm. In order to obtain the mask for localizing transmission lines, we applied the VisualBackProp algorithm to the last feature maps $F^i \in \mathbb{R}^{H_i \times W_i \times C_i}$ of the convolution blocks. The i-th feature map $F^i$ consists of $C_i$ channels $f^i_1, \cdots, f^i_{C_i}$. The first process of the VisualBackProp algorithm is to compute a single feature map $\bar{f}^i$ by accumulating the channels $f^i_k \in \mathbb{R}^{H_i \times W_i}$ in the depth direction as

$$\bar{f}^i = \sum_{k=1}^{C_i} f^i_k, \tag{1}$$

where $i$ and $C_i$ denote the index of the convolution block and its number of channels, respectively. The accumulated feature map is upsampled via bilinear interpolation to generate $h^i$, and it is multiplied with the previous feature map $\bar{f}^{i-1}$ to compute $\bar{h}^{i-1}$ as

$$\bar{h}^{i-1} = h^i \otimes \bar{f}^{i-1}, \tag{2}$$

where $\otimes$ is the elementwise multiplication operation. Although the VisualBackProp algorithm provides reasonable localization masks in many cases, it fails when transmission lines have weak visual properties. To address this problem, we propose an attention mechanism for weakly supervised learning to enhance the responses of transmission lines in the localization map. In Figure 1, the output of the i-th convolution block in VGG16 is obtained after the ReLU and max pooling operations for $F^i$, and it is utilized as the input for the following convolution block.
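The accumulation and back-projection steps of (1) and (2) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: feature maps are represented as nested lists of shape [C][H][W], the function names `channel_sum`, `upsample_bilinear`, and `localization_mask` are hypothetical, and the align-corners convention used for bilinear interpolation is an assumption.

```python
def channel_sum(fmap):
    """Eq. (1): sum a [C][H][W] feature map over its channel axis."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[sum(ch[y][x] for ch in fmap) for x in range(w)] for y in range(h)]


def upsample_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2D list (align-corners convention, an assumption)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        sy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for x in range(out_w):
            sx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def localization_mask(feature_maps):
    """feature_maps: [C][H][W] maps ordered from the first (largest) to the last block."""
    summed = [channel_sum(f) for f in feature_maps]
    mask = summed[-1]
    for prev in reversed(summed[:-1]):
        up = upsample_bilinear(mask, len(prev), len(prev[0]))
        # Eq. (2): elementwise product with the previous accumulated map.
        mask = [[u * p for u, p in zip(u_row, p_row)] for u_row, p_row in zip(up, prev)]
    return mask


# Toy example: two blocks with constant activations.
block1 = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]  # 2 channels, 4 x 4
block2 = [[[1.0] * 2 for _ in range(2)] for _ in range(3)]  # 3 channels, 2 x 2
mask = localization_mask([block1, block2])
```

In this toy case the deepest map sums to 3 everywhere, is upsampled to 4 × 4, and is multiplied by the shallower map that sums to 2, so the mask has the resolution of the first block.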

Parallel Dilated Attention Module
Inspired by SE-Net [31] and ECA-Net [32], we introduce a novel channel attention module called PDA. SE-Net [31] conducts dimensionality reduction in fully connected layers to reduce computational load, and ECA-Net [32] utilizes 1D convolution instead of fully connected layers to reduce model complexity without dimensionality reduction. Although ECA-Net employs a kernel size that is adaptively determined through a mapping function, this module still has the limitation that its receptive field is fixed. Figure 2 presents the structure of the proposed PDA module. Unlike the previous methods, the PDA module consists of three lightweight 1D convolutions with different dilation ratios, and therefore, the proposed attention mechanism can merge the information from various receptive fields.
In PDA, the GAP is applied to the feature map to acquire a vector with the size of 1 × 1 × C, where C denotes the number of channels. This feature vector, whose length equals the number of channels of the preceding feature map, is branched and fed to parallel 1D convolutions. In Figure 2, D indicates the dilation ratio of the 1D convolution. The parallel structure of 1D convolutions is applied to obtain information from various receptive fields depending on the dilation ratios. The various dilation ratios are advantageous for obtaining abundant features because they collect information from narrow to broad receptive fields.
In this experiment, we set the dilation ratios to 1, 2, and 4, and the padding was set equal to the dilation ratio to acquire an output vector of the same length as the input. The output vectors of the 1D convolutions, which contain different locally correlated channel information, are concatenated to aggregate the information of the parallel operations into a vector of size 1 × 1 × 3C. A fully connected layer compresses meaningful information and learns interdependencies between channels. A sigmoid function is utilized to obtain an attention vector $a$ that contains values between 0 and 1, and it is expressed as follows:

$$a = [a_1, \cdots, a_{C_5}], \quad 0 \le a_k \le 1. \tag{3}$$

The attention vector $a$ is computed from the last convolution block, and the length of the attention vector is equal to the number of channels. Each component of $a$ indicates the significance of the corresponding channel, and therefore, PDA guides the network to focus on important features for classifying and localizing transmission lines. The localization mask based on PDA is called ALM. Figure 3 shows details of the PDA module, which is placed between the last convolution feature map and the ReLU operation. The last convolution feature map $F^5$ consists of $f^5_1, \cdots, f^5_{C_5}$, and the k-th channel $g_k$ of the weighted feature map $G \in \mathbb{R}^{H_5 \times W_5 \times C_5}$ is computed as

$$g_k = a_k \otimes f^5_k, \tag{4}$$

where $\otimes$ is the elementwise multiplication. The attention-weighted feature map is utilized for transmission line localization. Different from the summation operation in the VisualBackProp algorithm, we conducted the weighted summation with the components of the attention vector computed from the PDA module. The attention-weighted feature map is beneficial to focus on important features for localizing transmission lines. In the same way as (1), the weighted feature map $G$ is accumulated in the depth direction to compute the accumulated feature map $\bar{g}$ as

$$\bar{g} = \sum_{k=1}^{C_5} g_k. \tag{5}$$

Whereas feature maps are simply added in the channel direction in the original VisualBackProp algorithm, our PDA module computes a channel attention vector to more effectively aggregate the information in the feature maps. To incorporate $\bar{g}$ with other feature maps, $\bar{g}$ is upsampled to the same size as the previous feature map $\bar{f}^4$, and this procedure can be expressed as follows:

$$h^5 = \mathrm{Upsampling}(\bar{g}), \tag{6}$$

where Upsampling indicates the bilinear interpolation, and the result is denoted by $h^5$. Through (2), the elementwise multiplication is conducted between $h^5$ and $\bar{f}^4$, and the process is repeated until obtaining $h^1$, which is called ALM.
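The data flow of the PDA module described above (GAP, three parallel dilated 1D convolutions, concatenation, a fully connected layer, and a sigmoid) can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions rather than the authors' implementation: the kernel size of 3 is inferred from the statement that padding equal to the dilation preserves the vector length, the fully connected layer is assumed to map 3C values to C values, random weights stand in for learned parameters, and all function names are hypothetical.

```python
import math
import random


def global_average_pool(fmap):
    """[C][H][W] -> list of C channel means (the 1 x 1 x C vector)."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]


def dilated_conv1d(v, kernel, dilation):
    """1D convolution along the channel axis with zero padding equal to the
    dilation, so a size-3 kernel returns a vector as long as v."""
    n, center = len(v), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation
            if 0 <= idx < n:  # positions outside the vector act as zero padding
                acc += w * v[idx]
        out.append(acc)
    return out


def pda_attention(fmap, kernels, fc_weight, fc_bias, dilations=(1, 2, 4)):
    pooled = global_average_pool(fmap)
    branches = []
    for kernel, d in zip(kernels, dilations):
        branches.extend(dilated_conv1d(pooled, kernel, d))  # concatenation, length 3C
    logits = [sum(w * x for w, x in zip(row, branches)) + b
              for row, b in zip(fc_weight, fc_bias)]        # assumed FC layer: 3C -> C
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]     # sigmoid, Eq. (3)


def weighted_accumulation(fmap, attention):
    """Eqs. (4) and (5): scale each channel by its attention value, then sum."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[sum(a * ch[y][x] for a, ch in zip(attention, fmap))
             for x in range(w)] for y in range(h)]


random.seed(0)
C, H, W = 8, 4, 4
fmap = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
kernels = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
fc_weight = [[random.uniform(-0.1, 0.1) for _ in range(3 * C)] for _ in range(C)]
fc_bias = [0.0] * C
a = pda_attention(fmap, kernels, fc_weight, fc_bias)
g_bar = weighted_accumulation(fmap, a)
```

The three branches see the same pooled vector but mix channels that are 1, 2, and 4 positions apart, which is how the parallel structure widens the receptive field along the channel axis without adding many parameters.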

Refinement Network via Image-to-Image Translation
The ALM in the previous step still has room for improvement due to blurred line responses and weak connections of transmission lines in the localization mask. Therefore, we constructed a refinement network to transfer visual characteristics of transmission lines. The refinement network employs a generator of an image-to-image translation architecture. To transfer line-shaped properties, we generated a dataset based on a rule-based algorithm, and it is called the pseudo-line dataset. The pseudo-line dataset includes 250 binary pseudo-line images with the image size of 512 × 512, which is identical to the size of ALM. Figure 4 shows examples of pseudo-line data. Figure 5 presents the procedure for generating the pseudo-line dataset. We prepared an image filled with zeros and generated pseudo-lines with randomly selected pixel coordinates a and b, which are integers ranging from 0 to 512. Because multiple lines that are too close to each other are not desirable, we generated pseudo-lines that connect opposite sides of images. Since the power line dataset contains multiple transmission lines, we also generated the pseudo-line dataset with variable numbers of transmission lines. In Figure 5, i denotes an arbitrary line interval, and multiple lines are generated with the distance of i.
By utilizing the properties of the pseudo-line dataset, we refine the ALM to improve the localization accuracy. To construct a refinement network, we adopt the structure of CycleGAN [27]. We defined the ALM as the source domain S and the pseudo-line dataset as the target domain T. The purpose of the refinement network is to obtain a mapping function from the source domain to the target domain. To train the refinement network, we constructed two generators and two discriminators for the adversarial training of the unpaired image-to-image translation framework.
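The pseudo-line generation procedure described in this section can be sketched as follows. The exact rules of the authors' generator are not given in the text, so this is only one plausible reading: the sampling range of the line interval, the maximum number of lines, the random choice between horizontal and vertical orientation, and the names `draw_line` and `make_pseudo_line_image` are all assumptions made for illustration. Coordinates are drawn from 0 to size − 1 so that endpoints stay inside the image.

```python
import random


def draw_line(img, x0, y0, x1, y1):
    """Set pixels along the segment (x0, y0)-(x1, y1) to 1 by dense sampling."""
    size = len(img)
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for s in range(steps + 1):
        x = round(x0 + (x1 - x0) * s / steps)
        y = round(y0 + (y1 - y0) * s / steps)
        if 0 <= x < size and 0 <= y < size:  # clip parts that leave the image
            img[y][x] = 1


def make_pseudo_line_image(size=512, max_lines=4, rng=random):
    img = [[0] * size for _ in range(size)]  # image filled with zeros
    a = rng.randrange(size)                  # coordinate on one side
    b = rng.randrange(size)                  # coordinate on the opposite side
    interval = rng.randint(10, 40)           # hypothetical range for the interval i
    n_lines = rng.randint(1, max_lines)      # variable number of lines
    horizontal = rng.random() < 0.5
    for k in range(n_lines):
        offset = k * interval
        if horizontal:
            draw_line(img, 0, a + offset, size - 1, b + offset)  # left to right
        else:
            draw_line(img, a + offset, 0, b + offset, size - 1)  # top to bottom
    return img


img = make_pseudo_line_image(size=64, rng=random.Random(0))
```

Because every line spans the full width or height, each generated image contains at least one unbroken line connecting opposite sides, which is the property the refinement network is meant to transfer to the ALM.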
The first generator that maps from the source to the target domain is denoted by $G : S \to T$, and it is presented in the lower right part of Figure 1. In some images, ALMs show weak responses and missing parts for localizing transmission lines. To address these limitations, the generator $G$ allows the ALMs to contain the properties of pseudo-line images, and it is utilized as the refinement network. The refinement network restores weak responses and missing parts of ALMs based on the line-shaped properties of pseudo-line images. The first discriminator $D_T$ is trained to distinguish between $G(s)$ and the pseudo-line dataset in the target domain $T$, where $G(s)$ denotes images generated from the source domain. As the adversarial training proceeds, the generator $G$ learns to create realistic line images, and these result images are called refined ALMs. The discriminator $D_T$ is presented at the upper right part in Figure 1, and the adversarial loss for training $G$ and $D_T$ is defined as

$$\mathcal{L}_{adv}(G, D_T, S, T) = \mathbb{E}_{t \sim p_{data}(t)}[\log D_T(t)] + \mathbb{E}_{s \sim p_{data}(s)}[\log(1 - D_T(G(s)))], \tag{7}$$

where $p_{data}(s)$ and $p_{data}(t)$ are the distributions of the source and target domains, respectively. Similarly, the second generator $F : T \to S$ maps target domain data into the source domain, and the second discriminator $D_S$ distinguishes ALMs and reconstructed ALMs in the source domain. Another adversarial loss for the training of $F$ and $D_S$ is defined as

$$\mathcal{L}_{adv}(F, D_S, T, S) = \mathbb{E}_{s \sim p_{data}(s)}[\log D_S(s)] + \mathbb{E}_{t \sim p_{data}(t)}[\log(1 - D_S(F(t)))]. \tag{8}$$

Since using only adversarial losses can cause mode collapse, the cycle consistency loss is employed to train the generators in a constrained space. The cycle consistency loss is defined as

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{s \sim p_{data}(s)}[\| F(G(s)) - s \|_1] + \mathbb{E}_{t \sim p_{data}(t)}[\| G(F(t)) - t \|_1]. \tag{9}$$

Based on this loss, the generated images can be translated back to the original images as shown in Figure 6.
The total loss function for training the refinement network is formulated as the combination of the adversarial losses and the cycle consistency loss, and it is defined as

$$\mathcal{L}_{total} = \mathcal{L}_{adv}(G, D_T, S, T) + \mathcal{L}_{adv}(F, D_S, T, S) + \lambda \mathcal{L}_{cyc}(G, F), \tag{10}$$

where $\lambda$ is a hyper-parameter for controlling the effect of the cycle consistency loss, and we utilized $\lambda = 10$ in experiments.
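How the adversarial and cycle consistency terms combine into the total objective can be illustrated numerically with a small sketch. It operates on flat lists of numbers rather than images and networks, averages the absolute differences instead of summing them (a scaling choice made here for readability), and uses hypothetical function names; it follows the standard CycleGAN formulation [27] that this section adopts.

```python
import math


def adversarial_loss(d_real, d_fake):
    """Mean of log D(real) plus mean of log(1 - D(fake)), with D outputs in (0, 1)."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def l1_distance(x, y):
    """Mean absolute difference between two flattened images."""
    return sum(abs(u - v) for u, v in zip(x, y)) / len(x)


def cycle_consistency_loss(s, s_reconstructed, t, t_reconstructed):
    """Penalty for F(G(s)) differing from s and G(F(t)) differing from t."""
    return l1_distance(s_reconstructed, s) + l1_distance(t_reconstructed, t)


def total_loss(adv_g, adv_f, cyc, lam=10.0):
    """Both adversarial terms plus the lambda-weighted cycle term (lambda = 10)."""
    return adv_g + adv_f + lam * cyc


# Toy values: an undecided discriminator and a slightly imperfect reconstruction.
adv = adversarial_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
cyc = cycle_consistency_loss([0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [1.0, 0.0])
loss = total_loss(adv, adv, cyc)
```

With λ = 10, even a small reconstruction error contributes noticeably to the objective, which is what keeps the generators from mapping every ALM to the same pseudo-line image.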

Experimental Results
We conducted experiments on a machine with an Intel Core i9-10900K CPU, 64 GB of DDR4 RAM, and an NVIDIA RTX 3090 GPU. The proposed algorithm was implemented with PyTorch and OpenCV. To train the classification network, we utilized the Adam optimizer with an initial learning rate of 0.0001 and a weight decay of 0.05. The refinement network was trained with the identical setting, except for an initial learning rate of 0.0002.

Dataset Description
We employed the public power line dataset consisting of 400 infrared (IR) and 400 visible light (VL) images. Figure 7 presents example images with and without transmission lines in the first and third columns, respectively. The corresponding ground truth for each image is shown in the second and last columns. The dataset was collected across different seasons in 21 different regions of Turkey in cooperation with the Turkish Electricity Transmission Company (TEIAS). The dataset was collected under diverse conditions, and therefore, it is challenging to recognize transmission lines in the images due to different backgrounds and illumination. In this study, 400 VL images with the size of 512 × 512 were used, of which 200 VL images contain transmission lines, and the others do not. For the training of the base classification network, we split the dataset into 300, 50, and 50 images for the training, validation, and test sets, respectively.

Evaluation Measure
The proposed method was evaluated based on the criterion proposed by Choi et al. [22]. Recall, precision, and F1-score were computed from the true positive (TP), false positive (FP), and false negative (FN) counts. In [22], TP is defined as the number of cases where more than 50% of line pixels are correctly detected. To define false responses that occur near a transmission line, the FP is defined based on a tolerance range. By setting the tolerance range to 10 pixels on both sides of a transmission line, FP includes incorrect responses occurring in background regions and thick predictions on transmission lines. If a predicted line is composed of fewer than 10 pixels, it is regarded as noise. Table 1 presents the quantitative comparison of the proposed method with several previous algorithms. To represent differences in approaches, we categorized the results based on the learning types and annotation levels; more specifically, annotation levels are divided into pixel, patch, and image levels. Weakly supervised methods are further divided into those using patch-level and those using image-level annotations, which distinguishes the previous methods from ours. In the previous methods, patch-level annotations were utilized by dividing the original images into 128 × 128 sub-images and assigning class labels to these patch images.
Table 1. Quantitative comparison with other methods. The best result is highlighted in bold, and the second-best result is underlined. S and WS indicate supervised learning and weakly supervised learning, respectively.

Although annotating patch images is less burdensome than producing pixel-level labels, patch-level annotations are limited in that they require the location information of transmission lines in the original images. On the other hand, we utilized the entire image for the training of the classification network. In transmission line images, lines are composed of small numbers of pixels and most of the remaining area consists of the background, and therefore, the larger the image size, the more difficult it is to localize and detect transmission lines. Nevertheless, we achieved quantitatively significant improvements by utilizing the entire images.
Compared with Choi et al. [22], our proposed method improves the performance by 5.47% and 2.75% in terms of precision and F1-score, respectively, while the recall is slightly lower by 0.27%. We utilized the entire image size from the beginning to the end of the algorithm, while [21,22] divide the image into patches once in the middle and merge them back together. In addition, in the process of dividing into patch images, approximate location information of transmission lines is required to train the classification network. It is worth noting that our proposed method does not require any location information. Furthermore, our method is meaningful in that the localization masks obtained from the feature maps can be improved by the refinement network. In another weakly supervised learning method, Lee et al. [21] achieved substantial performance in precision and F1-score, but there was a trade-off between precision and recall. By contrast, our proposed method has satisfactory results in both recall and precision for detecting transmission lines. Table 1 also presents the accuracy of the segmentation algorithms proposed in [59,60]. Although the supervised learning methods show satisfactory performance in both recall and precision, these algorithms require time-consuming work for generating pixel-level annotations. To reduce the cost for preparing ground truth data, our proposed method adopts a weakly supervised learning framework. It is noteworthy that even though we only utilized image-level annotations, the proposed algorithm shows higher accuracy in terms of precision and F1-score.

Ablation Study
We conducted an ablation study to demonstrate the effectiveness of each step of the proposed algorithm, and the results are presented in Table 2. The algorithm proposed by Bojarski et al. [26], which is effective in revealing clues in the feature maps of a convolutional network, is considered the baseline model and is denoted as the localization mask in Table 2. The attention vector obtained from the PDA contains a score for each channel of the feature map, and the scores range from 0 to 1. Each score is multiplied channel-wise by the feature map to weight the importance of each channel, so that the localization focuses on the features learned during the binary classification of the input image. By adding the PDA to the baseline, we reached an F1-score of 92.35%, which is an improvement of 1.22% over the baseline localization mask. The results after the refinement process show that the refinement network, which is part of the proposed method, is also meaningful. In Table 2, the refinement network improved the performance by 3.59% and 2.96% in terms of recall and precision, respectively. These results show that the adoption of CycleGAN is an adequate way to improve the localization mask by transferring the characteristics of an ideal line that is fully connected from one point to another. The best performance is obtained by applying both steps, and the proposed model outperforms the baseline with recall and precision of 97.90% and 96.15%, respectively. In other words, the performance improved by 7.53% in recall and 4.25% in precision, and the F1-score improved by 5.88% compared to the localization mask. These results show that every step of the proposed method is beneficial for detecting transmission lines.
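As a worked check of how the reported figures relate, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the recall and precision quoted above for the full model; the function names are illustrative, and the helper that starts from TP, FP, and FN counts is included only to show the definitions, since the underlying counts are not given in the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2.0 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn):
    """Precision, recall, and F1 from counts; assumes tp + fp > 0 and tp + fn > 0."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)


# Recall and precision of the full model as quoted in the ablation study (percent).
f1_full = f1_score(precision=96.15, recall=97.90)
```

The result is approximately 97.0%, which agrees, up to rounding, with the baseline F1-score implied by the text (92.35% − 1.22% = 91.13%) plus the stated gain of 5.88%.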

Comparison with Other Attention Modules
In this section, the proposed attention module is compared with previous attention mechanisms including SimAM [30], SE [31], and ECA [32]. Table 3 summarizes the results of comparative experiments. An identical base network of VGG16 was utilized to obtain localization masks based on the previous attention modules. The localization accuracy with the use of PDA is compared with the accuracies based on the previous methods. SE showed accuracies of less than 90% in all evaluation metrics, and PDA was higher than SE by 3.15% in terms of precision. In addition, another channel attention module, ECA, showed better performance than SE, but it was still lower than the accuracy of ours. ECA utilizes only one adaptively selected kernel for 1D convolution, and it is limited in its ability to represent features of transmission lines due to its small receptive field. As shown in Table 3, SimAM, which generates 3D weights, was unsuitable for our task, and this experiment showed that not all attention modules are effective in localizing transmission lines. The PDA module showed strong performance in all evaluation metrics compared to the previous methods. The performance improvement is attributed to the advantage that PDA captures abundant feature representations and broadens the receptive field by utilizing three different 1D convolutions.
Figure 8 presents result images of the proposed algorithm for detecting transmission lines. Figure 8a,b show input images and the corresponding ground truth data. The ALM is the localization mask obtained by applying the attention vector from the PDA, which focuses more on the important channels. As shown in Figure 8c, PDA is effective in localizing the lines in the input images, and the predicted structure and number of lines are similar to the ground truth even before applying the refinement network. However, most transmission lines in the ALMs are blurry, and several predictions contain disconnected or omitted parts.
To address these limitations, the refinement network is applied to the ALMs, and the results are presented in Figure 8d. The refinement network generates refined ALMs, in which smudged or indistinct lines become sharp and bold, based on the characteristics of the target domain data. The refinement network also connects broken lines so that they span the image, providing qualitatively plausible results even when compared with the ground truth data. In Figure 8e, original images are overlaid with the refined ALM. Figure 9 presents failure cases of the proposed algorithm. In the refined ALM of the second example, the red arrow indicates a merged prediction of two transmission lines close to each other. However, even though two close transmission lines are recognized as a single line, this does not critically affect the collision avoidance function of UAVs. The yellow arrows in the refined ALMs indicate missing parts of transmission lines. When the responses of transmission lines in the ALM are too weak, the refinement network cannot restore the corresponding parts of the transmission lines. The green arrow in the last example in Figure 9d indicates a false positive case, and such a failure is usually isolated and occurs in a local area. Therefore, we expect that these types of failures can be recovered by applying post-processing based on the properties of transmission lines.

Conclusions
In this paper, we proposed a transmission line detection algorithm based on weakly supervised learning and image-to-image translation. By utilizing only image-level labels, the proposed algorithm can be trained with minimal human involvement. The proposed method consists of two steps: (1) localization of transmission lines based on PDA and (2) refinement via image-to-image translation. The PDA module computes an attention vector based on the information from various receptive fields. The attention vector provides the channel importance of object features, and it is utilized for generating the ALM. Furthermore, we constructed a refinement network that transfers line-shaped properties of transmission lines to improve weak responses and disconnected components in the ALM. We demonstrated that the PDA module outperforms the previous attention methods for localizing transmission lines. Moreover, the refinement network significantly improved the accuracy of transmission line detection in both quantitative and qualitative aspects.