PCAN—Part-Based Context Attention Network for Thermal Power Plant Detection in Remote Sensing Imagery

Abstract: The detection of Thermal Power Plants (TPPs) is a meaningful task for remote sensing image interpretation. It is also a challenging one, because TPPs are facility objects composed of various distinctive and irregular components. In this paper, we propose a novel end-to-end detection framework for TPPs based on deep convolutional neural networks. Specifically, building on the RetinaNet one-stage detector, a context attention multi-scale feature extraction network is proposed to fuse global spatial attention and strengthen the ability to represent irregular objects. In addition, we design a part-based attention module adapted to TPPs containing distinctive components. Experiments show that the proposed method outperforms state-of-the-art methods, achieving 68.15% mean average precision.


Introduction
Fixed industrial facilities are buildings equipped for a particular purpose. Specifically, power plants supply electricity to the electrical grid, sewage treatment plants remove contaminants from municipal wastewater, and garbage dumps are piled with domestic garbage. These facilities greatly influence the regional economy and ecological environment. Therefore, monitoring the locations of fixed industrial facilities is of great significance for assessing regional economic and environmental conditions.
This paper investigates thermal power plants in optical remote sensing images. Current research on spectral image object detection [1][2][3][4] mostly focuses on land cover types such as urban land, agricultural land, forest land and water. Such objects differ from thermal power plants, because thermal power plants are functional fixed facilities, which have diverse spatial patterns with blurred boundaries and contain several non-rigid components at separate locations.
Compared with other facilities, it is more challenging to detect thermal power plants in remote sensing images due to the following characteristics. Thermal power plants generally have typical components, including sedimentation tanks, cooling towers, chimneys, coal yards and pools. As shown in Figure 1, unlike sewage treatment plants, the components of Thermal Power Plants (TPPs) are non-rigid irregular objects, such as coal yards and pools, which are difficult to describe with a specific shape and scale. In addition, unlike garbage dumps, TPPs have diverse spatial patterns with blurred boundaries, containing several components at separate locations, as illustrated in Figure 2. Consequently, it is more difficult but more valuable to study the detection of TPPs compared with other facility objects. In view of the above characteristics, many recent works have focused on the detection of irregular objects, as well as objects with diverse spatial patterns.
Detection of irregular objects: Zhou et al. [5] construct a fully-convolutional neural network adapted for text detection to predict words of arbitrary orientations and quadrilateral shapes in full images. Wang et al. [6] propose a Progressive Scale Expansion Network (PSENet) to detect text instances with arbitrary shapes, which generates kernels of different scales for each text instance and gradually expands the minimal-scale kernel to the complete shape of the text instance. The same authors propose another arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN) [7], by means of a cascadable U-shaped module and feature fusion. However, such arbitrary-shaped texts are irregular but organized objects with clear boundaries, unlike TPPs. As a typical method for irregular objects, Deformable Convolutional Networks (DCN) [8] introduce deformable convolution and deformable RoI pooling to enhance the transformation modeling capacity of CNNs. The deformable convolution adds 2D offsets to the receptive fields of the standard convolution, allowing the receptive fields to deform. In DCN, the shape and scale of anchor boxes are predefined, so it is difficult for the generic detector, without adaptation, to describe TPPs, which lack a specific shape and scale.
Detection of objects with diverse spatial patterns: Li et al. [9] divide a pedestrian image into several horizontal stripes for patch matching. Zhao et al. [10] propose Spindle Net for person re-identification, which separately captures semantic features from different body regions for the alignment of macro- and micro-body features. However, such part-based methods for person detection rely on the specific pattern of the human body. Han et al. [11] propose a Part-based Convolutional Neural Network (P-CNN) for fine-grained visual categorization. P-CNN contains a part localization network, which learns a bank of convolutional filters as discriminative part detectors to locate distinctive object parts, and a part classification network, which classifies each individual object part into image-level categories and then fuses part features with the global feature for the final classification. Although P-CNN takes discriminative parts into consideration, it is not applicable to TPPs because it is designed for rigid objects such as aircraft and cars.
According to the related research above, existing methods are mostly designed for rigid objects and organized objects with a specific pattern. These methods have not considered objects like TPPs, which are composed of non-rigid irregular components at separate spatial locations. To tackle the above problems, this paper presents an end-to-end detection framework called Part-based Context Attention Networks (PCAN). As illustrated in Figure 3, PCAN builds on the one-stage detector RetinaNet [12] with a ResNet [13] backbone, using a context attention multi-scale feature extraction network (CMN) with deformable convolution [8] and a part-based attention module for both classification and regression. CMN not only obtains geometric constraint information through deformable convolution, but also enhances the context attention multi-scale feature maps. The part-based attention module is designed for thermal power plants with sparsely distributed distinctive components, and introduces a loss function that focuses on discriminative regions with high responses. Compared to generic object detection methods such as RetinaNet, Faster RCNN and Cascade RCNN, our framework is more suitable for the detection of thermal power plants and achieves state-of-the-art performance.
Figure 3. (a) RetinaNet [12], a one-stage detection network, extracts deep features with ResNet [13] and Feature Pyramid Networks (FPN) [14], and then obtains locations and class labels of the anchors with a box subnet and a class subnet trained with focal loss. (b) Part-based Context Attention Networks (PCAN) uses a Context attention Multi-scale feature extraction Network (CMN) to generate multi-scale feature maps containing contextual information for irregular objects, and a part-based attention module adapted to facility objects composed of distinctive components.
The main contributions of this paper are summarized as follows: (1) We construct a one-stage end-to-end detection framework called Part-based Context Attention Networks (PCAN). The model adaptively generates multi-scale feature maps containing context and part-based attention, which is more accurate and effective for thermal power plant detection in high-resolution remote sensing imagery. (2) We propose a Context attention Multi-scale feature extraction Network (CMN) with deformable convolution, which strengthens the feature representations by combining context attention with multi-scale feature extraction.
(3) As facility objects generally consist of several components, a part-based attention module is designed for the adaptation to such facility objects, which effectively helps discover distinctive object components.
Experiments based on remote sensing images obtained from Google Earth show that our PCAN achieves state-of-the-art performance for the detection of thermal power plants. The datasets are publicly available in our GitHub repository (https://github.com/wenxinYin/AIR-TPPDD, accessed on 14 March 2021), which can reduce the workload of thermal power plant investigation. The rest of this paper is organized as follows. Section 2 introduces the details of the proposed method. Section 3 presents the experiments conducted on a remote sensing dataset to validate the effectiveness of the proposed framework. Section 4 discusses the results of the proposed method. Finally, Section 5 concludes this paper.

Network Architecture
The proposed PCAN model is an end-to-end framework based on RetinaNet [12]. The overall architecture of PCAN in Figure 4 can be divided into three parts: a deep feature extraction sub-network to extract context-based feature maps for irregular objects, a sub-network for global prediction, and a module providing a part-based loss function. The deep feature extraction sub-network contains a ResNet [13] backbone and a Context attention Multi-scale Network (CMN). The global prediction sub-network contains a classification subnet and a regression subnet for the bounding box prediction of the global object. The part-based attention module is proposed to focus on discriminative regions with high responses in feature maps. In this sub-network, feature channels are clustered by K-means into groups, and a part-based loss function is introduced to highlight the prominent components of the object.
As shown in Figure 3, the simple one-stage RetinaNet includes only backbone networks and global prediction networks. Due to the non-rigid irregular components of TPPs, however, a context attention multi-scale network is added to the architecture of this paper to enhance the feature representation capability. In addition, a part-based attention module is proposed for detecting thermal power plants with several separate components.
Deep Feature Extraction Sub-network: We use a ResNet-50 [13] architecture pretrained on ImageNet together with our CMN in the backbone sub-network. The outputs of the last convolution layer in the last three residual blocks, denoted {C3, C4, C5}, are used for feature extraction; their sizes are {1/8, 1/16, 1/32} of the input image. In order to effectively detect multi-scale thermal power plants with irregular components, we design CMN to process this set of feature maps and produce global spatial attention features named {P3, P4, P5}. The deep feature extraction sub-network thus generates context-attended deep feature maps of the input images, designed for multi-scale TPPs with irregular components.
Global Prediction Sub-network: This sub-network includes a classification subnet and a regression subnet. These two parallel subnets share a common structure but use separate parameters. Specifically, for A anchors and K object classes, the classification subnet predicts the probability of objects at each spatial location. It is a small FCN of three 3 × 3 conv layers attached to each pyramid level of CMN, with parameters shared across levels and ReLU activations. The subnet ends with a 3 × 3 conv layer with KA filters followed by sigmoid activations. The regression subnet differs in that it finally produces 4 linear outputs per anchor per spatial location rather than K. The global prediction sub-network uses focal loss [12] for the classification subnet and smooth L1 loss [15] for bounding box regression.
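The per-level output shapes of the two subnets can be sketched as follows. The stride set {8, 16, 32, 64, 128} for P3-P7, A = 9 anchors and K = 1 object class (TPP vs. background) are assumptions following the standard RetinaNet configuration, not figures quoted from this paper.

```python
import numpy as np

STRIDES = [8, 16, 32, 64, 128]  # pyramid levels P3 .. P7 (assumed)

def head_shapes(img_size=900, num_anchors=9, num_classes=1):
    """Per-level output shapes (channels, H, W) of the two subnets:
    K*A sigmoid outputs for classification, 4*A linear box offsets
    for regression, at every spatial location of each pyramid level."""
    shapes = []
    for s in STRIDES:
        h = w = int(np.ceil(img_size / s))
        cls_shape = (num_classes * num_anchors, h, w)
        reg_shape = (4 * num_anchors, h, w)
        shapes.append((cls_shape, reg_shape))
    return shapes
```

For a 900 × 900 input as used in this paper, the finest level P3 predicts over a 113 × 113 grid, so the vast majority of anchors are background, which is exactly the imbalance that focal loss addresses.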
Part-based Attention Module: For the adaptation to TPPs with sparsely distributed components, the part-based attention module is proposed to focus on discriminative regions with high responses in feature maps. In this sub-network, the K-means method clusters feature channels into groups, where each group aggregates spatially-correlated patterns corresponding to one component of TPPs. Part-based loss functions for both classification and regression are proposed to strengthen the influence of prominent components in the object.
Figure 4. Overall framework of our PCAN, which consists of the deep feature extraction sub-network, the global prediction sub-network and the part-based attention module. In the deep feature extraction sub-network, our CMN after the classic Convolutional Neural Network (CNN) produces multi-scale feature maps, which not only contain contextual information but also model irregular components. The global prediction sub-network includes two subnets, one predicting the labels for anchors and one regressing from anchors to ground-truth bounding boxes. The part-based attention module adopts the K-means method to cluster feature channels into groups, where each group aggregates spatially-correlated patterns corresponding to one component of Thermal Power Plants (TPPs).

Context Attention Multi-Scale Feature Extraction Network (CMN)
As previously described in Section 1, thermal power plants contain non-rigid irregular components which are difficult to describe with a specific shape and scale. In order to detect TPPs with irregular components, we design a Context attention Multi-scale feature extraction Network (CMN) with deformable convolution based on FPN [14]. FPN can merge low-level feature maps of higher resolution with high-level semantic information, which is suitable for multi-scale feature representation. To match component objects of irregular shape, we use deformable convolutions [8] to obtain geometric constraint information. In addition, the global context attention of GCNet [16] is introduced in CMN to aggregate global contextual information and enhance modeling capacity.
In Figure 4, we take the outputs of the last convolution layer in the residual blocks of ResNet [13] as {C3, C4, C5}. CMN, elaborated in the following, then produces feature maps with geometric constraints and contextual attention. The resultant set of feature maps, {M3, M4, M5}, corresponding to {C3, C4, C5}, is laterally connected by up-sampling and element-wise addition, generating the feature maps for prediction {P3, P4, P5, P6, P7}. P6 and P7 are computed from C5 as in RetinaNet [12].
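The lateral top-down fusion described above can be sketched with nearest-neighbour up-sampling and element-wise addition; the 2× up-sampling factor between adjacent levels and the toy channel count are illustrative assumptions, and the 1 × 1 lateral convolutions of FPN are omitted for brevity.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x spatial up-sampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_merge(m3, m4, m5):
    """FPN-style top-down pathway: up-sample the coarser level and add
    it element-wise to the finer one, producing {P3, P4, P5}."""
    p5 = m5
    p4 = m4 + upsample2x(p5)
    p3 = m3 + upsample2x(p4)
    return p3, p4, p5

# Toy maps whose spatial sizes halve between levels, as in a real pyramid.
p3, p4, p5 = top_down_merge(np.ones((8, 16, 16)),
                            np.ones((8, 8, 8)),
                            np.ones((8, 4, 4)))
```

The element-wise addition requires that adjacent levels differ by exactly a factor of two in spatial size, which is why {C3, C4, C5} are taken at strides {8, 16, 32}.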
As shown in Figure 5, CMN is constructed from two modules: deformable convolution and a context module (context attention and a transform). For the input feature map C_i (i = 5, 4, 3) with shape {batch, C_i, H_i, W_i}, a 1 × 1 convolutional layer first reduces the channel dimension, giving {batch, C_i/2, H_i, W_i}. In the DCN part, in order to transform the receptive field of the convolutional kernels, offsets for each point on the feature map are learned by a 3 × 3 conv, producing a tensor of {batch, 18, H_i, W_i} (two offsets for each of the nine kernel locations). As the obtained offsets are usually fractional, features at the shifted locations are computed by bilinear interpolation to generate the updated locations. A 3 × 3 deformable conv with stride = 3 is then applied to the updated locations, followed by ReLU activation and batch normalization; the result is denoted D_i.
In order to acquire global spatial contextual attention efficiently as in GCNet [16], the following context part and transform part construct a non-local block to enhance D_i. In the context part, a 1 × 1 conv with a softmax generates a global spatial attention mask which indicates the importance of each pixel in the image. The obtained attention mask is then multiplied with D_i, producing contextual features. In the transform part, lightweight bottleneck layers integrate channel-wise dependencies, with the bottleneck ratio r set to reduce the computational cost. Batch normalization (BN) not only reduces the difficulty of optimization but also improves generalization. The final step fuses the transformed contextual features, after sigmoid activation, with the deformable feature maps D_i.
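One plausible reading of the context and transform parts is the minimal GCNet-style sketch below. The weight shapes, the bottleneck ratio r = 4, and the realization of the sigmoid fusion as channel-wise reweighting are assumptions drawn from the description above, not the authors' exact implementation (batch normalization inside the bottleneck is also omitted here).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_context_block(d, wk, w1, w2):
    """GCNet-style global context applied to a (C, H, W) map `d`.

    wk: (C,) weights of the 1x1 context conv producing the spatial mask;
    w1: (C//r, C) and w2: (C, C//r) form the bottleneck transform.
    All weights here are random and purely illustrative."""
    c, h, w = d.shape
    flat = d.reshape(c, h * w)            # (C, HW)
    attn = softmax(wk @ flat)             # (HW,) global spatial attention mask
    context = flat @ attn                 # (C,) attention-pooled context vector
    t = w2 @ np.maximum(w1 @ context, 0)  # bottleneck transform with ReLU
    gate = 1.0 / (1.0 + np.exp(-t))       # sigmoid activation before fusion
    return d * gate[:, None, None]        # reweight the deformable features D_i

rng = np.random.default_rng(0)
d = rng.standard_normal((16, 8, 8))
out = global_context_block(d, rng.standard_normal(16),
                           rng.standard_normal((4, 16)),
                           rng.standard_normal((16, 4)))
```

Because the attention mask is computed once per feature map rather than per query position, this block is far cheaper than a full non-local block while still injecting global context into every location.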

Part-Based Attention Module
Thermal power plants contain distinctive components at separate locations, as illustrated in Figure 2. This paper proposes a part-based loss function during training to strengthen the influence of distinctive components in TPPs. We introduce the part-based loss function starting from the loss function of RetinaNet [12].
For an anchor box i, the loss function is defined as the sum of a classification loss L_cls and a regression loss L_reg:

$$L_{global}(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*) \qquad (1)$$

Here p_i ∈ [0, 1] is the estimated probability for the object class, and p_i^* ∈ {0, 1} is the ground-truth label, that is, p_i^* = 1 for objects and p_i^* = 0 otherwise. t_i is a vector representing the four parameterized coordinates of the predicted anchor box and t_i^* is the same for the ground-truth box.
N_cls and N_reg are the numbers of anchors and anchor locations, respectively, in one batch. λ balances L_cls and L_reg and is set to 1 here. The classification loss L_cls discriminates two classes, object and background. L_cls in RetinaNet is the focal loss [12] for binary classification, which is designed for class imbalance based on the cross-entropy (CE) loss [17] during training:

$$FL(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t) \qquad (2)$$

where α_t ∈ [0, 1] is a weighting factor and γ ∈ [0, 5] is a tunable focusing parameter that smoothly adjusts the influence of easy examples. p_t is defined as

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \qquad (3)$$

where p ∈ [0, 1] is the predicted probability for the class with label y = 1, and y ∈ {±1} indicates the ground-truth class, object or background. The regression loss L_reg in RetinaNet is the standard smooth L1 loss [15] used for box regression:

$$\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \qquad (4)$$

For an anchor box i, Equation (1) sums the mean classification loss and mean regression loss over anchors at all scales of the feature maps. However, it does not take the influence of distinctive components inside the objects into consideration. Unlike objects such as garbage dumps, TPPs have diverse spatial patterns containing several components at separate locations. We therefore design a part-based loss to strengthen the influence of prominent components in the object during the training stage, with the combined loss function defined as the sum of L_global (Equation (1)) and L_part (Equation (6)), balanced by an adjustable parameter α:

$$L = L_{global} + \alpha\, L_{part} \qquad (5)$$
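The focal loss in Equation (2) can be sketched in a few lines. The values α_t = 0.25 and γ = 2 follow the settings quoted later in "Parameter Settings"; the function below is a standard NumPy sketch, not the authors' implementation.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probabilities for the positive class, in [0, 1];
    y: ground-truth labels in {0, 1}. The (1 - p_t)^gamma factor
    down-weights well-classified (easy) examples."""
    p = np.clip(p, 1e-7, 1 - 1e-7)            # numerical stability
    p_t = np.where(y == 1, p, 1 - p)          # Equation (3)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

# An easy positive (p = 0.95) contributes far less than a hard one (p = 0.10).
loss_easy = focal_loss(np.array([0.95]), np.array([1]))[0]
loss_hard = focal_loss(np.array([0.10]), np.array([1]))[0]
```

With γ = 2, the easy example above is suppressed by roughly four orders of magnitude relative to plain cross-entropy, which is what lets the one-stage detector train despite the overwhelming number of easy background anchors.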

In the part-based attention module, the set of multi-scale feature maps is clustered into groups by K-means clustering [18], where each group aggregates spatially-correlated patterns corresponding to one component of TPPs. For the feature maps {P3, P4, P5}, K = 9, 6 and 3 cluster centers are respectively extracted by K-means for each channel. Figure 6 visualizes the feature maps affected by the part-based attention module, showing that the network pays more attention to these distinctive components when the part-based loss is added.
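A minimal K-means over channels can illustrate the grouping step: each channel's activation map is treated as one point, so channels that fire on the same spatial pattern land in the same cluster. This is a generic sketch; the paper clusters per pyramid level with K = 9, 6 and 3, and the toy dimensions below are arbitrary.

```python
import numpy as np

def cluster_channels(fmap, k, iters=20, seed=0):
    """Group the C channels of a (C, H, W) feature map into k clusters
    with a minimal Lloyd-style K-means, so that spatially-correlated
    channels fall into the same group. Illustrative only."""
    c = fmap.shape[0]
    x = fmap.reshape(c, -1)                         # one point per channel
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(c, size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance of every channel to every center: (C, k).
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):                          # update non-empty centers
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels

labels = cluster_channels(
    np.random.default_rng(1).standard_normal((32, 6, 6)), k=3)
```

The resulting group assignments are what the part-based loss operates on: high-response clusters stand in for distinctive TPP components such as cooling towers or coal yards.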
Similar to the RetinaNet loss in Equation (1), the part-based loss function is defined as

$$L_{part} = \frac{1}{N_{part\_cls}}\sum_j L_{part\_cls} + \lambda_{part}\,\frac{1}{N_{part\_reg}}\sum_j L_{part\_reg} \qquad (6)$$

where λ_part balances L_part_cls and L_part_reg and is set to 1 here. In Equation (1), {p_i} indicates the probability of object presence at each spatial position for each of the A anchors and N object classes, which can be seen as a set of vectors of shape {batch, NA, WH}. {t_i} is the set of four relative offsets between the anchor and the ground-truth box for each of the A anchors per spatial location, of shape {batch, 4A, WH}. The part-based loss in Equation (6) is instead computed on the clustered channel groups. Since cluster centers should mostly be positive samples corresponding to distinctive components of an object, the α-balanced CE loss is used as L_part_cls, which can be seen as the focal loss (Equation (2)) with γ = 0. The regression loss in the part-based loss function (Equation (6)) is identical to the smooth L1 loss (Equation (4)).

Dataset
Large-scale datasets of remote sensing images, such as UCMD [19], EuroSAT [20] and DOTA [21], have contributed to the development of general object detection in remote sensing imagery. However, existing publicly available remote sensing datasets cover only limited categories of objects [22][23][24][25]. To the best of our knowledge, there is no annotated dataset of fixed industrial facilities, including thermal power plants, garbage dumps and sewage treatment plants.
In order to push forward the deep learning based development of TPP detection, we construct a thermal power plant dataset of visible-spectrum Google Earth images for object detection, which will be publicly available. We collect 257 potential locations of worldwide power plants from public websites, then download images of all these locations and carefully examine them. Sites with low credibility are omitted, leaving 230 thermal power plants. To increase the diversity of the data, we collect historical images of the 230 valid sites from Google Earth, ultimately obtaining 487 images. Each image is 3584 × 3584 pixels, covering 2 km × 2 km of land at a resolution of 0.60 m. All objects are annotated with horizontal bounding boxes, yielding a COCO-style dataset.
In addition, to facilitate the representation of TPPs, we provide annotations covering the whole PLANT and four components, that is, Coal Yard, Chimney, Pool and other processing buildings (Processing). Coal Yard, Chimney and Pool are typical components of a thermal power PLANT. The study in this paper uses the PLANT annotations on a sub-dataset of 300 coal-fired TPP images, split into training and testing data with a ratio of 7 to 3. The data in the Aerospace Information Research Institute-Thermal Power Plants Dataset for Detection (AIR-TPPDD) are augmented by random cropping and flipping to obtain a dataset of 2000 samples of 900 × 900 pixels suitable for deep learning based methods.

Evaluation Metrics
To evaluate the practical applicability of our proposed detection method for TPPs, we adopt the standard mean average precision (mAP), frames per second (FPS), floating-point operations (FLOPs) and the number of trainable parameters (Params) in our experiments.
In a detection task, the predicted bounding boxes can be divided into true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Precision and recall of detection results are calculated as

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

The F1 score is the harmonic mean of precision and recall, which evaluates performance comprehensively:

$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

Given two bounding boxes B1 and B2, the intersection over union (IoU) is defined as

$$IoU = \frac{area(B_1 \cap B_2)}{area(B_1 \cup B_2)}$$

As the confidence threshold varies, precision and recall change dynamically, tracing the precision-recall (PR) curve. The average precision (AP) is the area under the PR curve at a given IoU threshold. More specifically, AP@0.5 and AP@0.75 are the areas under the PR curve with IoU = 0.50 and 0.75, respectively. mAP@[0.5:0.95] is the average AP when the IoU threshold ranges from 0.5 to 0.95 in steps of 0.05, and is used as the main evaluation criterion for our task.
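These metrics reduce to a few lines of code. The sketch below implements the standard definitions of precision, recall, F1 and IoU for axis-aligned boxes given as (x1, y1, x2, y2); it is a generic illustration, not the evaluation code used in the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    F1 = harmonic mean of the two."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def iou(b1, b2):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(b1) + area(b2) - inter)
```

For example, two 10 × 10 boxes overlapping on half their width share an intersection of 50 against a union of 150, giving an IoU of 1/3, below the 0.5 threshold used for AP@0.5.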
In addition, the average frames per second (FPS) is the number of images processed per second during the test stage, representing the time cost in application. Floating-point operations (FLOPs) and the number of trainable parameters (Params) are commonly used to indicate the complexity of deep models. All experiments are implemented under the same hardware conditions.

Parameter Settings
All experiments are implemented with the PyTorch framework on an NVIDIA TITAN RTX with CUDA 11.1. The ResNet-50 model pre-trained on the ImageNet dataset [26] is used to initialize the network. To balance the large-scene requirements of the objects against training efficiency for the deep network, all images are processed to 900 × 900 pixels by random cropping and flipping in the experiments.
We then use stochastic gradient descent [27] to train the network with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴. The learning rate is initialized to 0.001 and dropped by a factor of 0.1 every 10,000 steps. In the classification losses (Equations (2) and (7)), we set α_t = 0.25 and γ = 2 following RetinaNet [12]. The ratio of negative to positive samples in the training stage is set to 3 in order to suppress negative samples. The balancing factor α in the loss function (Equation (5)) is set to 0.25 unless otherwise specified.
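The step-decay schedule described above can be written as a one-line function; this is a sketch of the stated schedule, not the training code itself.

```python
def learning_rate(step, base_lr=1e-3, drop=0.1, every=10_000):
    """Step decay: start at base_lr and multiply by `drop` every
    `every` optimizer steps, matching the schedule described above."""
    return base_lr * drop ** (step // every)
```

So the learning rate is 1e-3 for steps 0-9999, 1e-4 for steps 10,000-19,999, and so on.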

Effect of CMN
In this section, the proposed CMN is trained to explore its influence on the generated feature maps. Experiments use the same detection framework and unchanged parameters based on RetinaNet [12]. As shown in Figure 5, CMN is designed for irregular multi-scale feature representation by introducing global spatial attention and deformable convolution.
To prove the effectiveness of the proposed CMN, ablation experiments are designed as shown in Table 1. As in Figure 5, CMN can be viewed as the sum of a deform module (the DCN part) and a context module (context attention and a transform). CMN can be added to the feature maps {C3, C4, C5} of the ResNet-50 backbone. In Table 1, adding either the deform module or the context module brings a certain improvement to the detection results. It can also be seen that the detection results of CMN mostly improve with increasing network complexity, except for RetinaNet+CMN(C35). This may be because only C5 and C3 are processed by CMN in that variant, so M5 and M3 can adapt to non-rigid irregular objects while M4 in Figure 5 cannot. As a result, M4 in RetinaNet+CMN(C35) is inconsistent with M5 and M3, which does not benefit the optimization of the network. Overall, the largest improvement reaches 4.25% and the mAPs of all seven listed variants of CMN addition increase, which demonstrates the reliability of our proposed CMN.

Effect of Part-Based Attention Module
As discussed in Section 2.3, the part-based attention module benefits the detection of thermal power plants. An adjustable parameter α balances L_part (Equation (6)) with L_global (Equation (1)) in the loss function (Equation (5)). As with CMN, we evaluate the effect of the part-based loss function on RetinaNet in ablation experiments. Table 2 shows that replacing the loss function improves the detection result to 65.58%, a gain of 2.49% with respect to RetinaNet. Figure 6 illustrates several input images and the corresponding heat maps of the feature representation, showing that the part-based attention module helps the network pay more attention to the distinctive components during training. Furthermore, Table 2 shows that the part-based attention module with balancing factor α = 0.25 brings an extra improvement of 0.81% to the networks with CMN. Results of our best method, RetinaNet+CMN+Part-based-loss (PCAN), are visualized in Figures 7 and 8. To balance L_part with L_global, we multiply L_part by the modulation factor α in Equation (5). In this section we vary α from 0 to 1 to investigate its influence. All models share the same experimental settings based on RetinaNet. When α = 0, Equation (5) degenerates into Equation (1), i.e., RetinaNet. If we set α to 1, the part-based loss L_part weighs the same as L_global in the overall loss function. Table 2 shows the performance of models with different α: performance roughly follows a unimodal curve, peaking for α in [0.25, 0.5]. When α is close to 1, mAP clearly decreases. This may be because L_part focuses coarsely on the distinctive components, which is not reasonable if it has a large impact on the total loss. For simplicity, the modulation factor α is set to 0.25 unless otherwise specified in the part-based attention module, to benefit the localization of TPP targets.

Comparison with State-of-the-Arts
In this paper, the classic one-stage RetinaNet detector [12] is used as the baseline method due to its simple structure and wide application in object detection. The two-stage detector Faster-RCNN [28] and the multi-stage detector Cascade-RCNN [29] are also included in the comparison experiments.
RetinaNet [12] extracts deep features with ResNet [13] and FPN [14], and then uses a box subnet and a class subnet to obtain the locations and class labels of anchors. Focal loss is designed to deal with class imbalance in one-stage detectors, enlarging the weight of hard examples in the cross-entropy loss.
Faster-RCNN [28] is a two-stage framework that integrates Fast-RCNN [19] with a Region Proposal Network (RPN), likewise extracting deep feature maps with a CNN backbone. In our experiments, FPN is added to the backbone for multi-scale feature extraction. The RPN is trained to generate region proposals and RoI pooling computes proposal feature maps. Lastly, a classifier predicts the label of each proposal and refines the proposals.
To better match the intersection over union (IoU) thresholds for which the detector is optimal to those of the input hypotheses, Cascade-RCNN [29] includes a sequence of detectors trained with increasing IoU thresholds. Compared with Faster-RCNN, Cascade-RCNN contains at least two additional RoI poolings and classifiers, trained stage by stage.
All experiments are implemented under the same hardware conditions. As shown in Table 3, our PCAN increases mAP by 5.06% compared to RetinaNet. Furthermore, according to hypothesis testing, statistical tests of the detection results of the baseline RetinaNet show that P(mAP = 0.6309 ± 0.19%) = 0.95, while the final results of PCAN show P(mAP = 0.6815 ± 0.63%) = 0.95, indicating the enhanced representation ability of the deep feature maps for TPPs. FPS, FLOPs and the number of trainable parameters (Params) are listed in Table 3. It is thus convincing that our method achieves better performance than RetinaNet without much additional time and memory cost.
Experiments show that the mAPs obtained by Faster-RCNN and Cascade-RCNN are close to that of RetinaNet, with only minor accuracy differences between multi-stage and one-stage methods. However, RetinaNet runs much faster than Faster-RCNN and Cascade-RCNN with fewer trainable parameters. This could be because complicated models are harder to optimize, especially for the non-rigid irregular TPP objects.
Furthermore, experiments on remote sensing ship detection are performed on the AIR-SARShip dataset [30], as shown in Table 4. The results demonstrate only a minor improvement of our PCAN for ship detection, indicating that the proposed method is particularly suited to the detection of thermal power plants rather than other objects.

Discussion
By comparing and analyzing the above experiments, the effectiveness of the proposed method is verified. The proposed PCAN offers superior performance in the TPP detection task through CMN and the part-based attention module, built on RetinaNet.
However, observation of the test results in Figure 9 shows that not all detection results are ideal. Figure 9 shows some examples of false alarms and missed alarms, which are mainly caused by hard examples. The hard examples found in our experiments fall into two situations: (1) Disturbances due to similar surfaces. Some background scenes, such as buildings, parking lots and pools, are located near a TPP in Figure 9a-c. In Figure 9a, some residential buildings appear similar to the processing buildings of a TPP. In Figure 9b, a parking lot with regularly arranged cars is mistakenly detected. In Figure 9c, it is sometimes difficult to distinguish whether a nearby pool is part of the TPP object. (2) Missed alarms caused by occlusion and edge location. Objects blocked by clouds or located near the image edge are difficult to recognize in Figure 9d.
In the future, we will explore how to enhance the recognition ability of detectors in order to effectively reduce false alarms and missed alarms. In particular, we plan to split the AIR-TPPDD dataset into easy and hard examples, and then focus on the hard examples during training with a strict limit on the ratio of hard to easy samples. In addition, detection and classification of power plant types, including coal-fired, oil-fired, gas-fired and waste-heat power plants, may be implemented by constructing a more detailed power plant image dataset, which will be carried out in future work.

Conclusions
The detection of thermal power plants is a meaningful but challenging task. The difficulty results from the lack of annotated datasets and the highly complex appearance of TPPs. In this paper, an effective TPP detection method, which includes a context attention multi-scale feature extraction network (CMN) and a part-based attention module, is proposed to solve the problem. CMN enhances the local convolutional features, and the part-based attention module strengthens the influence of components in TPPs. Experiments demonstrate the effectiveness of our proposed Part-based Context Attention Networks (PCAN).