PCNet: Cloud Detection in FY-3D True-Color Imagery Using Multi-Scale Pyramid Contextual Information

Abstract: Clouds, an adverse atmospheric condition, significantly reduce the usability of optical remote-sensing data and hamper follow-up applications. The identification of clouds therefore remains a priority for various remote-sensing activities, such as product retrieval, land-use/cover classification, object detection, and especially change detection. However, the complexity of clouds makes it difficult to detect thin clouds and small isolated clouds. To accurately detect clouds in satellite imagery, we propose a novel neural network named the Pyramid Contextual Network (PCNet). Considering the limited receptive field of a regular convolution kernel, we employ a Dilated Residual Block (DRB), which combines a dilated convolution with a residual connection, to extend the receptive field of the network. To improve the detection of thin clouds, a new module, the pyramid contextual block (PCB), is used to generate global information at different scales. FengYun-3D MERSI-II remote-sensing images covering China with 14,165 × 24,659 pixels, acquired on 17 July 2019, are processed to conduct cloud-detection experiments. Experimental results show that the overall precision of the trained network reaches 97.1% and the overall recall reaches 93.2%, outperforming U-Net, UNet++, UNet3+, PSPNet and DeepLabV3+ both quantitatively and qualitatively.


Introduction
With the rapid development of remote-sensing technology, more and more remote-sensing images are employed for farmland monitoring, land use, target detection and other tasks in production and daily life [1]. Image quality is as important as the processing algorithms. Due to the influence of complex atmospheric environments, most images cannot be used directly, and one of the main influencing factors is the presence of clouds. Nearly 70% of the world is often covered with clouds [2], which compromises the determination of surface reflection information and thus significantly affects analysis and application. Hence, improved cloud-detection procedures are essential to meet the requirements of a range of Earth-observation applications.
In recent years, many cloud-detection methods have been proposed. These methods can be divided into two classes: threshold-based and classification-based approaches. The threshold-based methods detect clouds with extremely high accuracy and good robustness by classifying the reflectance and brightness temperature of different spectra. Iris [3] proposed an Automated Cloud Cover Assessment System to estimate the cloud cover of Landsat satellite imagery. Oreopoulos [4] improved the clear sky synthesis algorithm of MODIS to evaluate the performance of the cloud mask of Landsat-7 imagery. of transparency makes its detection easily affected by complex ground information while the thin clouds have a great impact on cloud removal and other image applications.
Additionally, we found that most of the existing methods are based on multispectral images, which require NIR, SWIR or other bands. Given restrictions on data release, only RGB data were available. Based on the issues of these methods and the distribution of clouds in images, we propose a new Pyramid Contextual Network (PCNet) that comprehensively uses global information at different scales. We construct two new modules in the proposed PCNet: the Dilated Residual Block (DRB), which expands the receptive field of feature extraction, and the pyramid contextual block (PCB), which explores the relationship between each pixel in the image. The PCB can detect isolated small clusters of thick clouds as well as thin clouds. To decrease the redundancy of feature maps, we use a Channel Attention Block (CAB) [42] to refine them. Our network automatically extracts features of thick and thin clouds that carry global contextual information at multiple scales. Our method enhances the connection between each pixel and all remaining pixels, which improves the delineation of thin clouds.
The remainder of this paper is organized as follows. In Section 2, the data sources and the proposed methodology for cloud detection are described. Section 3 presents the design of the cloud-detection experiments and the corresponding results, which validate the performance of the proposed PCNet. In Section 4, the conclusions are presented.

Materials and Methods
The FY-3D is a satellite independently developed and launched by China. It has been widely used in meteorological forecasting, hydrological monitoring, and other tasks [43]. Current cloud-detection methods are mostly based on algorithms developed for the Landsat and MODIS satellites, and there has long been a lack of cloud-detection methods for FY-3D satellite images. Therefore, a cloud-detection method for FY-3D remote-sensing images is urgently needed. All data used in this paper can be downloaded from http://satellite.nsmc.org.cn/ (accessed on 10 July 2020).
To better show the practicability of our method, we selected the entire Chinese region, as shown in Figure 1, and the resolution of the image is 250 m. The dataset contains different types of surface features such as deserts, grasslands and oceans. We only use the visible-light bands for training and testing. In our work, we downloaded several FY-3D satellite images that cover different land covers. Our dataset was labeled and verified individually by an independent and skilled group from the China Meteorological Administration. A pixel is marked as cloud if more than half of the members agree. According to statistics, most of the ambiguous pixels lie on cloud boundaries and in thin-cloud regions, and these pixels account for no more than 3% of all pixels.

Preprocessing of Experimental Data
The spatial resolution of the FY-3D remote-sensing imagery used in the experiment is 250 m. This work uses the consensus of several experts to label the remote-sensing images to ensure the correct classification of clouds. We choose the entire Chinese region for research because this area contains various land covers, including oceans, deserts, forests, grasslands, and others. At the same time, the diverse climate in the region leads to clouds with different shapes, all of which increases the difficulty of detection. After data preprocessing (such as cropping), the data contains 24,659 × 14,165 pixels. The remote-sensing image is cropped by a 512 × 512 sliding window with a step of 512 pixels. In addition, during training we applied random horizontal and vertical flips to some data and added salt-and-pepper noise to increase the size of the dataset and avoid overfitting. After cropping, a large number of patches are entirely covered by clouds. Such a large number of full cloud and fog samples would make the model difficult to train and reduce the detection accuracy. Therefore, we include images with different cloud coverage in the training set and test set so that the network can be well trained. After careful selection, the dataset has 6959 images, of which 20% (1392 images) are used as the test set to verify the proposed model and 80% (5567 images) are used as the training set. The training set contains 669,623,879 cloud pixels and 789,731,769 clear pixels, which account for 45.88% and 54.12% of all training pixels, respectively. The test set contains 146,582,117 cloud pixels and 218,322,331 clear pixels, which account for 40.17% and 59.83% of all testing pixels, respectively.
The distribution of image patches with different cloud coverage ratios is roughly similar in the two datasets. The details of the image patch distribution are summarized in Table 1.
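The pixel counts above can be cross-checked against the patch counts with a few lines of arithmetic. The following is a minimal sketch in Python; the function name `pixel_share` is illustrative and not part of the original work. It confirms that the cloud and clear counts add up to the number of patches times 512 × 512, and it recomputes the reported percentages.

```python
PATCH = 512  # patch side length in pixels, as stated in the text

def pixel_share(cloud_pixels, clear_pixels):
    """Return (total pixels, cloud percentage, clear percentage)."""
    total = cloud_pixels + clear_pixels
    return total, 100 * cloud_pixels / total, 100 * clear_pixels / total

# Pixel counts reported for the training and test sets
train_total, train_cloud, train_clear = pixel_share(669_623_879, 789_731_769)
test_total, test_cloud, test_clear = pixel_share(146_582_117, 218_322_331)

# Every patch contributes 512 * 512 pixels
assert train_total == 5567 * PATCH * PATCH
assert test_total == 1392 * PATCH * PATCH

print(f"train: {train_cloud:.2f}% cloud, {train_clear:.2f}% clear")
print(f"test:  {test_cloud:.2f}% cloud, {test_clear:.2f}% clear")
```

Both assertions hold, so the reported totals are consistent with 5567 training patches and 1392 test patches of 512 × 512 pixels.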

Our Method
To meet the needs of subsequent experiments, we divided the downloaded satellite images into a training set and a test set. The training set contains 5567 images, and the test set contains 1392 images. Cloud coverage and the types of surface objects are fully considered when dividing the dataset. Most of the images contain both cloudy and cloud-free areas. Cloud regions include small, medium, and large clouds; the underlying surface includes vegetation, agricultural land, water and snow.
Deep learning has been widely used in various tasks of remote-sensing image processing. Convolutional neural networks are used in tasks such as object detection, semantic segmentation and saliency detection because of their excellent fitting ability. The Feature Pyramid Network [44] is adopted in many tasks because of its ability to synthesize features from different scales. Large-scale features ensure that details are mined, while small-scale features make the global information easy to extract. The Non-Local Block [45] is introduced to build the connection between global and local information, which allows for a better characterization and exploitation of clouds by combining rich global and local information. These two ideas are both important to the task of cloud detection, so we combine the two blocks to exploit long-range correlation information to detect clouds. The network structure is shown in Figure 2. The CAB is used to select the most informative channels for producing the cloud mask. The network takes a cloud image as input and outputs the cloud mask.

Dilated Residual Block
In cloud detection, because of the existence of large clouds, we should pay more attention to global information when extracting features. Dilated convolution [46] is adopted because of its excellent ability to extract features. Dilated convolution inserts holes into the regular convolution kernel, which increases the receptive field. It has one additional hyperparameter, the dilation rate (referred to as the dilated rate in this paper), which describes the spacing between kernel elements; e.g., the dilated rate of a regular convolution is 1.
The architecture of the DRB is shown in Figure 3. The input feature map is first fed into a dilated convolution, then normalized by a Batch Normalization layer, and finally activated by a LeakyReLU layer. Our DRB contains five blocks, each composed of Dilated Conv-BN-LeakyReLU. All convolution kernels have size (3, 3), and the dilated rates are 1, 2, 4, 2, 1; the padding sizes are also 1, 2, 4, 2, 1 to ensure that the size of the feature maps remains unchanged. The details are shown in Table 2, where K, S, D and P denote kernel size, stride, dilated rate and padding size, respectively. To preserve the information of the input feature, we add a residual connection to ensure that no information from the beginning layers is lost. The number of blocks and the dilated rates will be detailed in Section 3.3.2.
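To make the effect of the dilated rates concrete, the sketch below computes the spatial output size and the receptive field of the five stacked 3 × 3 layers. It assumes a stride of 1 in every layer (the stride values are listed in Table 2, which is not reproduced here) and uses the standard convolution output-size formula; the helper names are illustrative.

```python
def conv_output_size(size, kernel, stride, dilation, padding):
    """Spatial output size of a convolution (standard formula)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

def receptive_field(kernel, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)
    return rf

dilations = [1, 2, 4, 2, 1]  # dilated rates listed for the DRB
paddings = [1, 2, 4, 2, 1]   # matching padding sizes

size = 512  # input side length, e.g. one 512 x 512 patch
for d, p in zip(dilations, paddings):
    size = conv_output_size(size, kernel=3, stride=1, dilation=d, padding=p)

print(size)                                  # side length after five layers
print(receptive_field(3, dilations))         # dilated stack
print(receptive_field(3, [1, 1, 1, 1, 1]))   # regular 3 x 3 stack
```

Under these assumptions the feature-map size stays at 512, while the receptive field grows from 11 pixels for five regular 3 × 3 convolutions to 21 pixels for the dilated stack, with the same number of weights.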

Pyramid Contextual Block
After the image is processed by the initial convolution and DRB, we obtain the feature map F (H × W × C), where H, W and C represent height, width and channel, respectively. Before we introduce the pyramid contextual block, we first review the non-local block [45], specified as follows:
F̂ denotes the feature map processed by the non-local block. M holds the similarity between pixel i and pixel j of the original feature map. G(F) ∈ R^(HW × N) is the feature map embedded to N dimensions. D is a diagonal matrix for normalization purposes. T(·) is a transforming function that recovers the channel number of the feature map to C, equal to that of the original feature F. In this way, the feature map is globally enhanced using every position of the feature map and the correlation between all pixels. Additionally, in [45], M can be computed with the linear embedded Gaussian kernel and G with a linear function:
F_emb is a convolutional operation with parameters W. When generating M, we use W_θ and W_φ as the convolution kernels. F_emb(F, W_θ) and F_emb(F, W_φ) have the same size. To compress the features in the channel dimension and reduce the amount of calculation, all convolutions use a kernel size of 1 × 1 [45].
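The displayed equations of this passage did not survive in the present text. Based only on the symbol definitions above, one plausible form of the non-local formulation is sketched below. This is a reconstruction under stated assumptions, not necessarily the authors' exact notation: the weight name W_g and the row-sum definition of D are assumptions, chosen so that D normalizes each row of M.

```latex
\hat{F} = T\!\left(D^{-1} M \, G(F)\right) + F,
\qquad D_{ii} = \sum_{j} M_{ij}

M = \exp\!\left(F_{emb}(F, W_{\theta}) \, F_{emb}(F, W_{\phi})^{\top}\right),
\qquad G(F) = F_{emb}(F, W_{g})
```

With this choice, D^{-1} M is a row-wise SoftMax of the pairwise similarities, which matches the statement that the SoftMax operation is performed on each row of the attention map.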
As shown in Figure 4, the input feature map is passed through three 1 × 1 convolutions, and the results are reshaped to F_1 (m × HW), F_2 (HW × m) and F_3 (HW × n). We obtain the attention map A by applying SoftMax to the product of F_1 and F_2; the attention map indicates the similarity between pixels. An intermediate feature is calculated by multiplying the attention map and F_3, and its channel number is recovered by feeding it to a 1 × 1 convolution layer. Finally, we obtain the enhanced feature map F̂ by adding this intermediate feature to F.
Figure 4. The architecture of the non-local block [45]. "⊗" denotes matrix multiplication, and "⊕" denotes the element-wise sum. The SoftMax operation is performed on each row. A is the attention map, which can capture long-range dependencies. The gray boxes denote 1 × 1 × 1 convolutions; (k, s, f) denote kernel size k, stride s and number of filters f.
Please note that the non-local block is a powerful attention mechanism, but it involves a major trade-off. The attention map is generated from two feature maps whose sizes are H × W × m. The computational cost and memory consumption of the non-local block grow quadratically as the spatial size of the input feature map increases. To solve this problem and to obtain the long-range correlation of the whole image, we use multi-scale features and different sizes of convolution kernels to reduce the parameters. As shown in Figure 5, our block contains three branches. E is the result of applying a 1 × 1 convolution to the input feature map, and E^1_θ and E^1_g are the feature maps obtained by 4 × 4 convolution layers. The spatial size of E^1_θ and E^1_g becomes HW/(4 × 4), and the sizes of E^2_θ and E^2_g, and of E^3_θ and E^3_g, are HW/(8 × 8) and HW/(16 × 16), respectively. This greatly reduces the cost of computation, while the larger kernel sizes and larger strides allow the global information to be captured.
Figure 5. The architecture of our Pyramid Contextual Block. "⊗" denotes matrix multiplication, and "⊕" denotes the element-wise sum. The SoftMax operation is performed on each row. Compared with the non-local block, we use multi-scale features and different sizes of convolution kernels to reduce the parameters and obtain the long-range correlation of the whole image. Ê_1, Ê_2 and Ê_3 are the enhanced features generated by the multi-scale attention maps. F̂ is the output feature of the PCB. The gray boxes denote convolutional layers; (k, s, f) denote kernel size k, stride s and number of filters f.
Then, the feature map of each of our branches can be expressed by the following equation: Finally, we concatenate the features from all the branches, feed the result to a 1 × 1 convolution so that its channel number is consistent with the input feature map, and add the input feature to obtain F̂.
Under the premise of reducing the amount of calculation, PCB fully uses the information from multi-scale features to capture clouds with different sizes. Moreover, it can exploit the long-range connection between each pixel. In cloud detection, we should expand the receptive field to the entire image because the cloud can exist anywhere and have any size.
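The saving in attention-map size can be illustrated with a short calculation. The sketch below counts the entries of the HW × HW attention map of a non-local block and of the three PCB attention maps, assuming that each branch compares all HW positions with the HW/(s × s) downsampled positions for s = 4, 8, 16, as described above. The 64 × 64 feature-map size is a hypothetical value chosen only for illustration.

```python
def nonlocal_attention_entries(h, w):
    """Entries in the HW x HW attention map of a non-local block."""
    n = h * w
    return n * n

def pcb_attention_entries(h, w, strides=(4, 8, 16)):
    """Total entries in the three attention maps of the PCB sketch.

    Branch k compares all HW positions with HW / (s_k * s_k)
    downsampled positions.
    """
    n = h * w
    return sum(n * (n // (s * s)) for s in strides)

h = w = 64  # hypothetical feature-map size, for illustration only
full = nonlocal_attention_entries(h, w)
pyramid = pcb_attention_entries(h, w)
print(full, pyramid, pyramid / full)
```

For any feature-map size divisible by 16, the three pyramid attention maps together hold 1/16 + 1/64 + 1/256 = 21/256 (about 8%) of the entries of the full attention map.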

Channel Attention Block
After the DRB and PCB have processed the feature map, the network has used the contextual information in the image to perceive the areas where clouds and fog exist. As shown in Figure 2, the feature map that passes through the last PCB is concatenated with the features of the previous layer, so there will inevitably be some redundant feature channels. We use the SE Block [42] to select the appropriate channels; it adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
The input features are first passed through squeeze and resample operations (Global Pooling, FC, ReLU, FC shown in Figure 6), which aggregate the feature maps across the spatial dimensions H × W to produce a channel descriptor. This descriptor embeds the global distribution of channel-wise feature responses, enabling information from the global receptive field of the network to be fully used. This is followed by an excitation operation (Sigmoid shown in Figure 6), in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps are then reweighted to generate the output, which can be fed directly into subsequent layers.
Figure 6. The architecture of the Channel Attention Block. The feature map obtains a channel descriptor of 1 × 1 × C after the squeeze and resampling operations. Then, it is activated by a self-gating mechanism based on channel dependence. The feature maps are reweighted to generate the output, which can be fed directly into subsequent layers.
In Figure 6, sigmoid is set as the second activation function to obtain the normalized channel weights because it can remap any real number to (0, 1) and keep the original information in channel dimension [42].
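The sequence Global Pooling, FC, ReLU, FC, Sigmoid and reweighting can be sketched in plain Python as follows. This is an illustrative toy version, not the authors' implementation: the function name, the omission of biases, the random weights and the reduction ratio `r` are all assumptions made to keep the example small and self-contained.

```python
import math
import random

def channel_attention(feature, w1, w2):
    """Squeeze-and-excitation style reweighting of a C x H x W feature map.

    feature: list of C channels, each a list of H rows of W floats.
    w1: (C // r) x C weight matrix of the first fully connected layer.
    w2: C x (C // r) weight matrix of the second fully connected layer.
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # First FC layer followed by ReLU
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Second FC layer followed by Sigmoid: channel weights in (0, 1)
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Reweight: scale every value of a channel by that channel's weight
    out = [[[v * g for v in row] for row in ch] for ch, g in zip(feature, weights)]
    return out, weights

random.seed(0)
C, H, W, r = 4, 3, 3, 2  # toy sizes; r is a hypothetical reduction ratio
feature = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
out, weights = channel_attention(feature, w1, w2)
print(weights)
```

Because the Sigmoid maps every channel score into (0, 1), each channel is attenuated rather than removed, which is how the block keeps the original information in the channel dimension while suppressing redundant channels.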

Pyramid Contextual Network
Our proposed network is shown in Figure 2. The input is processed by two convolution layers to calculate the feature map F_0 with size H × W × 64, where I is the input image and F_entry(·, W_entry) denotes the initial convolution layers. Subsequently, we send F_0 to our enhanced blocks, each of which contains a DRB and a PCB. We denote by F_m the output of the m-th block; F_PCB and F_DRB denote the PCB and DRB operations, and W_PCB and W_DRB are their weights. After the same processing is applied twice, we concatenate the final output feature maps of each module and pass the result through the output convolutional layer.

Loss Function Optimization
Cloud detection is a binary classification problem, so we use cross-entropy as our loss function to obtain the cloud mask with high accuracy, where F r is the classification result generated by our network, F gt is the true cloud mask.
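The loss equation itself is missing from the present text. For reference, the standard per-pixel binary cross-entropy, written with the symbols F_r and F_gt defined above, has the form below. The averaging over the H × W pixels and the index i are assumptions; the authors may have used a sum or a different normalization.

```latex
\mathcal{L} = -\frac{1}{HW} \sum_{i=1}^{HW}
\left[ F_{gt}^{(i)} \log F_{r}^{(i)}
     + \left(1 - F_{gt}^{(i)}\right) \log\left(1 - F_{r}^{(i)}\right) \right]
```

Here F_r^{(i)} is the predicted cloud probability of pixel i and F_gt^{(i)} ∈ {0, 1} is its label in the true cloud mask.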

Input Data
All the prepared sample data, including the three original bands (R, G and B) and the corresponding ground truth labels, were used as inputs to the Pyramid Contextual Network. The input data used by the deep neural network are shown in Figure 7.

Set of Hyperparameters
Due to the complexity of our network structure, we need to initialize the network parameters. The weights of the convolution operations are initialized from a Gaussian distribution with a mean of 0 and a variance of 0.01, the bias is set to 0.1, and the weight of Batch Normalization is set to 0.1. To fit the network as quickly as possible, we use the Adam optimizer [47]. The exponential decay rates for the first and second moment estimates are set to 0.9 and 0.999, respectively. The initial learning rate is set to 1 × 10^−4 and is halved every 20 epochs over a total of 100 epochs, and the batch size is set to 1. The output probability of each pixel from a Sigmoid classifier is translated to a binary value with a threshold of 0.5 (the default setting in binary segmentation) [48].
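The learning-rate schedule described above amounts to a step decay. A minimal sketch follows, assuming epochs are counted from 0 and the rate is halved at epochs 20, 40, 60 and 80; the function name is illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, step=20, factor=0.5):
    """Step decay: multiply the learning rate by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Learning rate at the start of each 20-epoch stage of the 100-epoch run
schedule = {epoch: learning_rate(epoch) for epoch in range(0, 100, 20)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:3d}: lr = {lr:.3e}")
```

In PyTorch, the same schedule is commonly expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)`; whether the authors used this scheduler is not stated in the text.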

Experimental Results
In this section, we compare our model with other state-of-the-art methods to verify its validity. The training and testing environments are as follows: our proposed model was implemented in Python using the open-source PyTorch framework provided by Facebook. Our platform is Ubuntu 20.04 with an NVIDIA GeForce RTX 3090 GPU (24 GB). After 100 epochs, our model achieved state-of-the-art results on the dataset (Figure 8).

Quantitative Analysis
In this section, we compare the quantitative performance of our model with that of recent methods, including U-Net [49], UNet++ [50], UNet3+ [51], PSPNet [52] and DeepLabV3+ [53], to evaluate the effectiveness of the proposed PCNet in detecting clouds from remote-sensing images. The methods mentioned above are open-source and available on https://github.com/ (accessed on 15 February 2021). All methods have been trained and tested on the same dataset. The dataset has 6959 images, of which 20% (1392 images) are used as the test set and 80% (5567 images) are used as the training set. Compared with U-Net, UNet++ adds Dense Connections to re-use the feature maps. UNet3+ adds Full-scale Skip Connections to exploit full-scale information and uses Full-scale Deep Supervision to constrain the intermediate features extracted by the network, which improves the network's capabilities. PSPNet uses the Pyramid Pooling Module to collect hierarchical information to better classify the pixels in the image. DeepLabV3+, as the best method in the DeepLab family, uses Atrous Spatial Pyramid Pooling to obtain a larger receptive field; in addition, it adopts an encoder-decoder structure to preserve feature information.
We used precision, recall, F1 Score and accuracy [36,38,54] to quantitatively evaluate the performance of our model in detecting clouds from remote-sensing images. These measurements are defined as follows: Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 Score = 2 × P × R / (P + R), and Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP is true positive, TN is true negative, FP is false positive and FN is false negative. P and R are Precision and Recall, respectively. The accuracy assessment was performed as a binary classification; all non-cloud features were combined into one class. The results are shown in Table 3. The overall precision of our model reached 97.1%, and the F1 Score reached 0.951, which is higher than the results generated by the other five methods: U-Net, UNet++, UNet3+, PSPNet and DeepLabV3+. Our approach achieves high precision while maintaining a high recall rate and a low missed-detection rate. Thanks to the pyramid contextual block, the network has a better detection accuracy for small isolated clouds in the image. The Dilated Residual Block, with its dilated convolution and residual connection, allows the model to retain information from different stages and to have a larger receptive field. We test the effectiveness of these two modules in the ablation experiments.
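The four measurements can be written as small functions, and the reported F1 Score can be checked against the reported overall precision (97.1%) and recall (93.2%). This is a minimal sketch using the standard definitions; the confusion-matrix counts in the second part are hypothetical values used only to exercise the functions.

```python
def precision(tp, fp):
    """Fraction of predicted cloud pixels that are truly cloud."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true cloud pixels that are detected."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    """Fraction of all pixels that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Consistency check with the reported overall precision and recall
print(round(f1_score(0.971, 0.932), 3))

# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 90, 10, 30, 870
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn))
```

The first line prints 0.951, which agrees with the F1 Score reported in Table 3.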

Quality Analysis
After analyzing the quantitative results, we compare our results with those of the other methods qualitatively. All the selected samples are typical and have varying degrees of complexity, involving oceans, deserts, forests, etc. They also include clouds with different morphological characteristics, such as thin clouds and thick clouds.
As can be seen from Table 3, the F1 Scores of the results generated by U-Net and PSPNet are relatively low, so the predictions of these two methods are not shown here; only UNet++, UNet3+, DeepLabV3+, and our model are compared. In the first row of Figure 9, UNet++ misses more detections, possibly because the method does not impose strong constraints on the intermediate results and cannot perceive the global information of the image. Because DeepLabV3+ uses the ASPP module, it has a robust global perception of the image, but it treats some bright objects as clouds, which causes more false detections. Except for our method, the thin cloud area in the second row is more or less missed by the other three methods. The data in the 5th and 7th columns of Figure 9 show that the blue area in the DeepLabV3+ prediction result is much smaller than the other areas. This finding indicates that the perception ability of the DeepLabV3+ model is stronger than that of UNet++ and UNet3+. The analysis of the network structure of DeepLabV3+ shows that a model using dilated convolution and ASPP can effectively capture multi-scale information and improve the performance of cloud detection. We found that the blue area is significantly reduced in our results, but the red area has increased. This finding indicates that our method has strong data-fitting ability and can effectively mine the relationship between pixels to maintain precision while ensuring recall. The boundaries of some clouds are over-fitted, and some non-cloud areas are classified as clouds because of the use of the PCB. Although some overfitting cases have been found, the overall performance of PCNet in cloud detection is still better than that of the other pixel classification models.

Extended Experiments
(a) Experiments in large-scale FY-3D true-color imagery
In the quantitative and qualitative analyses above, the performance of our model has been demonstrated. To further illustrate the advantages of PCNet, we selected some remote-sensing images of China from different periods. The images are too large to be fed into the network directly, so we crop each image into 512 × 512 patches with a 50% overlap between adjacent patches.
It can be seen from Figure 10 that the FY-3D images we used cover the entire Chinese region. The sizes of the three images are 13,108 × 17,968, 13,108 × 17,968 and 14,165 × 24,659. Our method also handles large images well. In Figure 10b,d,f, the prediction results show that our method can detect thin clouds, thick clouds and some isolated cloud clusters. Although we divide the whole image into 512 × 512 patches, no seams are visible between patches even without any post-processing, which shows the strength of our method. In the plateau area on the left side of Figure 10, clouds and snow are hard to distinguish because their visual features in RGB images are easily confused. According to the corresponding detection results of PCNet on the right side of Figure 10, snow was falsely detected as clouds.
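The 50% overlap used for large images corresponds to a stride of 256 pixels between 512 × 512 patches. The sketch below computes the patch start offsets along one image side. Aligning a final patch to the image border is an assumption made here so that no pixels are skipped; the text does not say how image borders or overlapping predictions were handled.

```python
def tile_starts(length, tile=512, overlap=0.5):
    """Start offsets of tiles along one axis with fractional overlap.

    A final tile is aligned to the image border so that every pixel is
    covered (an assumption; border handling is not given in the text).
    Requires length >= tile.
    """
    stride = int(tile * (1 - overlap))
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts

# The two side lengths of the 14,165 x 24,659 image
side_a = tile_starts(14165)
side_b = tile_starts(24659)
print(len(side_a), len(side_b), len(side_a) * len(side_b))
```

Under these assumptions the 14,165 × 24,659 image is covered by 55 × 96 = 5280 overlapping patches, compared with 27 × 48 = 1296 full patches for the non-overlapping 512-pixel step used to build the training data.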

(b) Experiments in Landsat 8 true-color imagery
To show the applicability of our method, we added some experiments on Landsat 8 true-color imagery. We obtained the test images from https://earthexplorer.usgs.gov/ (accessed on 27 August 2021), and the scene identifiers are LC08_L1TP_015032_20210614_20210622_01_T1 and LC08_L1TP_199026_20210420_20210430_01_T1, respectively.
We applied the model trained on the FY-3D dataset to Landsat 8 true-color remote-sensing imagery. As shown in Figures 11 and 12, our method achieves good results for both isolated small clouds and clustered large clouds. However, some thin clouds are not detected because the model is not fine-tuned with Landsat 8 data.

Ablation Study
To evaluate our network, we analyze the effect of each block and hyper-parameter used in our model. To make a fair comparison, all cases are trained, validated and tested on the same dataset.

Effectiveness of Threshold for Sigmoid Classifier
The feature map is fed into a sigmoid classifier to generate the final result. As shown in Table 4, we compare different thresholds, set to 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 and 0.8. Table 4 shows that precision rises and recall falls as the threshold increases. We choose 0.5 as our threshold because it yields the highest F1 Score and accuracy. In our method, the output probability of each pixel from the sigmoid classifier is translated to a binary value with a threshold of 0.5: when the output probability is greater than 0.5, we set it to 1 (cloud pixel); otherwise, it is set to 0 (non-cloud pixel).
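The trade-off between precision and recall as the threshold moves can be reproduced on a toy example. The probabilities and labels below are invented for illustration and are unrelated to the values in Table 4; the helper names are likewise illustrative.

```python
def binarize(probabilities, threshold=0.5):
    """Map a probability to 1 (cloud) if it is strictly greater than threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def precision_recall(predictions, labels):
    """Precision and recall of binary predictions against binary labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for y_hat, y in pairs if y_hat == 1 and y == 1)
    fp = sum(1 for y_hat, y in pairs if y_hat == 1 and y == 0)
    fn = sum(1 for y_hat, y in pairs if y_hat == 0 and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

# Toy probabilities and labels, invented for illustration only
probs = [0.10, 0.35, 0.45, 0.55, 0.65, 0.90]
labels = [0, 0, 1, 1, 1, 1]

for t in (0.2, 0.5, 0.8):
    p, r = precision_recall(binarize(probs, t), labels)
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data, raising the threshold from 0.2 to 0.8 raises precision from 0.80 to 1.00 while recall falls from 1.00 to 0.25, the same qualitative behavior reported for Table 4. The sketch does not guard against the case where no pixel is predicted as cloud, which would make precision undefined.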
We only show the visualized results generated by the DRB, which has five Conv-BN-LeakyReLU blocks. In Figure 13, the case with dilated rate = [1, 1, 1, 1, 1] is a regular convolution, and its result is the worst. When dilated rate = [1, 2, 2, 2, 1], Figure 14 shows that there are still many missed detections because the receptive field is relatively small compared with the others; when dilated rate = [1, 4, 4, 4, 1], the receptive field is so large that bright pixels are classified as clouds, resulting in more false detections. Figures 13 and 14 show that the dilated rates adopted in our network generate the best prediction results.

Effectiveness with Different Number of Kernel Sizes in PCB
After determining the dilated rate as [1,2,4,2,1], and since the effect of a wider receptive field is evident, we compare different kernel sizes in the PCB, including [2,4,8], [4,8,16] and [8,16,32]. As shown in Table 6, the precision is the highest when kernel sizes = [2,4,8]. To balance precision and recall, we choose [4,8,16] as our kernel sizes. Some results are shown in Figure 15. It can be seen from the first row that when strides = [2,4,8], the red part is small, which indicates higher precision, consistent with the results in Table 6. Compared to the other two cases, our result has a higher recall.
Figure 15. (d) strides = [8,16,32]; (e) strides = [4,8,16]. White, red, blue and black denote TP, FP, FN and TN, respectively.

Effectiveness of PCB Part
Once the kernel sizes in the PCB are determined, we verify the necessity of the PCB. We set up two cases to confirm its robustness. Since the non-local block [45] is a widely used self-attention mechanism, one case replaces the PCB with a Non-Local Block. In Table 7, "Only DRB" means the PCB is replaced with a DRB, "NLB" means the PCB is replaced with a Non-Local Block, "Params" refers to the number of parameters that the model learns from the data, and "MACs" refers to the number of multiply-accumulate operations (1 MAC ≈ 2 FLOPs).
It can be seen from Table 7 that the Params of the module with only DRB have the smallest value. The Params and MACs of the module with PCB are similar to the module with non-local block but can achieve better results. There are also some results shown in Figure 16.
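For readers unfamiliar with Params and MACs, the sketch below shows how both are counted for a single convolution layer. The layer configuration (3 × 3 kernel, 64 input and 64 output channels, 512 × 512 output) is a hypothetical example and is not taken from Table 7.

```python
def conv2d_params(k, c_in, c_out, bias=True):
    """Learnable parameters of a k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def conv2d_macs(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulate operations of a k x k convolution layer."""
    return k * k * c_in * c_out * h_out * w_out

# Hypothetical layer: 3 x 3 kernel, 64 -> 64 channels, 512 x 512 output
params = conv2d_params(3, 64, 64)
macs = conv2d_macs(3, 64, 64, 512, 512)
print(params, macs, 2 * macs)  # parameters, MACs, approximate FLOPs
```

The parameter count is independent of the image size, while the MAC count scales with the number of output positions, which is why a module can have few parameters and still be expensive on large feature maps.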

Effectiveness of CAB Part
The feature map is concatenated and fed into the CAB after going through the DRB and PCB in Figure 2. Because the number of input feature channels is large, there will inevitably be many useless feature maps. To alleviate this problem, we added the CAB as our channel selector. "w/o CAB" means without the CAB part, and "CAB" means with the CAB part. As shown in Table 8, the result without the CAB is slightly inferior to the result with the CAB module. That is to say, during the training process, there is some redundant information in the feature maps. Table 8 also shows that the quantitative result improves slightly when the CAB is added, but the number of network parameters increases. If a lighter model is required, the network without the CAB can be chosen.

Conclusions
Remote-sensing images from Landsat and MODIS are the most investigated in cloud-detection research. The present research explores, for the first time, an effective cloud-detection method for images from FY-3D. To this end, we generated a new dataset based on the FY-3D satellite and proposed a new cloud-detection model, the Pyramid Contextual Network. Based on the characteristics of clouds, a series of targeted modules are constructed: First, because of the small receptive field of regular convolutional blocks, we proposed the DRB to expand the receptive field of feature extraction. Second, we proposed the PCB to explore the relationship between each pixel in the image for detecting isolated small clusters of thick clouds and thin clouds. Third, to reduce the redundancy of the feature maps, we used the CAB to refine them.
The comparative experiments and the adaptation to large remote-sensing images both demonstrated the advantages of the proposed PCNet in cloud extraction. Moreover, to assess the effectiveness of each module in the PCNet, a series of ablation experiments were conducted and evaluated by Precision, Recall, F1 Score, Params, etc. The proposed PCNet provides new insights into automatic cloud detection and is shown to outperform other typical deep-learning methods. Nevertheless, this work still has some unresolved problems. The proposed PCNet fails to distinguish clouds from snow at high latitudes since only RGB bands were used. In the future, more wavelengths will be used to support higher completeness and correctness of cloud detection. Additionally, we will focus on implementing a geoscience-knowledge-guided network and a network with strong transferability. Geographical knowledge has so far rarely been integrated to guide networks in the detection of clouds and fog. For instance, thick clouds will not appear in desert areas as water vapor rarely exists. In terms of transferability, PCNet and future networks need to adapt to different satellite sensors and diverse geo-scenes.