MCNet: Multi-Scale Feature Extraction and Content-Aware Reassembly Cloud Detection Model for Remote Sensing Images

Abstract: Cloud detection plays a vital role in remote sensing data preprocessing. Traditional cloud detection algorithms have difficulties in feature extraction and thus produce poor detection results when processing remote sensing images with uneven cloud distribution and complex surface backgrounds. To achieve better detection results, a cloud detection method with multi-scale feature extraction and a content-aware reassembly network (MCNet) is proposed. Using pyramidal convolution and a channel attention mechanism to enhance the model's feature extraction capability, MCNet can fully extract the spatial and channel information of clouds in an image. Content-aware reassembly ensures that upsampling in the network recovers enough in-depth semantic information and improves cloud detection performance. The experimental results show that the proposed MCNet model achieves good results in cloud detection tasks.


Introduction
With the progress of remote sensing technology, remote sensing images are widely used in navigation and positioning [1,2], ground object detection [3,4], environmental surveillance [5,6], and many other fields. However, clouds degrade the imaging quality of remote sensing images, reducing the available information and negatively affecting subsequent tasks such as target detection and tracking, segmentation, and classification. Therefore, studying high-precision remote sensing cloud detection is of great significance.
The traditional remote sensing cloud detection methods mainly include the physical threshold method, detection methods based on cloud texture and spatial characteristics, and detection methods based on machine learning. Limited by the spectral range of early remote sensing images, the ISCCP (International Satellite Cloud Climatology Project) method [7], the CLAVR (NOAA Cloud Advanced Very High Resolution Radiometer) method [8], and other methods based on physical thresholds have low detection accuracy. As the spectral coverage of remote sensing satellites increased, multi-spectral cloud detection technology was developed: using physical characteristics of clouds such as high reflectivity and low temperature, cloud detection is achieved through physical radiation threshold filtering [9][10][11]. The cloud detection method based on image spatial information [12,13] exploits the fact that the local radiance variation of cloud-covered pixels is higher than that of clear pixels, and compares the neighborhood radiance variation of each pixel with a threshold to achieve pixel-level cloud segmentation. This method is suitable for remote sensing cloud detection against a uniform background, such as sea scenes or other clear scenes. Cloud detection methods based on texture information [14,15] use an improved fractal dimension method to express the difference between cloud texture features and surface object texture features and achieve more efficient cloud detection. Detection methods based on cloud texture and spatial information rely on fixed features and have low robustness. With the development of machine learning, traditional machine learning methods such as decision trees and support vector machines (SVM) were applied to cloud detection. The Fmask (Function of mask) method [16][17][18] uses a decision tree with optimized thresholds to classify each pixel.
The Fmask method is widely used as a baseline for remote sensing image cloud detection and evaluation. The ACCA (automated cloud cover assessment) method [19] judges whether a pixel is cloud through decision tree thresholds. Latry et al. [20] use a support vector machine to learn a decision boundary for cloud detection. The accuracy of traditional cloud detection and machine learning cloud detection algorithms depends on the correctness of the training data and the effectiveness of the features, and their detection accuracy on complex cloud layers is low.
In recent years, semantic segmentation combined with deep learning has achieved good results in cloud detection. The fully convolutional network (FCN) [21] is the pioneering image semantic segmentation algorithm: image-level classification is extended to pixel-level classification through end-to-end training of a fully convolutional network, realizing semantic segmentation. Mohajerani et al. [22] proposed a hybrid FCN and gradient identification algorithm, applying the FCN to the cloud detection field. Mohajerani et al. [23] proposed the Cloud-Net algorithm, which achieves a better detection effect by redesigning the convolution blocks of the FCN; Cloud-Net is usually used as the baseline for deep learning cloud detection networks. Chen et al. [24] proposed the Deeplab algorithm, which expands the receptive field, extracts more spatial features, and enriches the network's feature expression by introducing dilated convolution into the FCN. However, dilated convolution leads to the loss of some detailed cloud information [25], affecting local consistency. Zhao et al. [26] proposed the PSPNet algorithm, which realizes multi-scale feature fusion by embedding a pyramid pooling module (PPM) in the FCN to address the FCN's insufficient feature extraction ability. Inspired by PSPNet, Yan et al. [27] proposed the MFFSNet algorithm, which uses a PPM to aggregate feature information at different scales and improve the utilization of local and global cloud features in images; however, this algorithm easily loses partial boundary information. Ronneberger et al. [28] proposed Unet, which uses skip connections to link the encoding and decoding networks and solves the problem of rough mask generation by the FCN. Gonzales et al. [29] used a Unet pretrained on ResNet34 [30] for cloud detection.
It uses transfer learning to transfer common feature attributes of different types of images to the feature extraction step of cloud detection, strengthening the network's feature extraction capability. Guo et al. [31] proposed the Cloud-AttU algorithm, which adds a spatial attention mechanism to Unet and learns more effective feature representations through skip-layer connections. Compared with cloud detection algorithms based on traditional methods, the above deep learning methods can better mine the in-depth semantic information in cloud detection images and have higher robustness and precision. However, these algorithms focus on the multi-scale information of the downsampling path while underutilizing channel features; the receptive field in the upsampling path is insufficient, and the upsampling process cannot be adaptively optimized according to the feature content. Good detection performance in remote sensing cloud detection has therefore not yet been achieved.
As discussed above, cloud detection in remote sensing images still faces enormous challenges: remote sensing images contain many types of clouds with uneven distribution, which complicates detection, and the background contains many similar ground objects, which makes cloud detection even harder. Currently, remote sensing cloud detection still has the following problems.

1. The remote sensing cloud images obtained in real scenes contain many surface objects (such as snow, ice, trees, and white man-made objects) with reflection characteristics similar to those of clouds, and the background interference is serious. As a result, it is challenging to accurately capture clouds under a large amount of background interference.
2. The uneven distribution and thickness of clouds in remote sensing images make the detection accuracy low.
3. Clouds are affected by shooting angles and wind speeds, resulting in different scales and various shapes, and the accuracy of cloud mask generation in complex scenes is low.
This paper proposes a deep neural network cloud detection method that combines multi-scale feature extraction based on a channel attention mechanism with content-aware reassembly to solve the above problems. Experiments show that this method has excellent detection performance. The main contributions of this article are as follows.
• To solve the detection problems of uneven cloud thickness, uneven cloud distribution, and background interference, a pyramidal convolution residual network with an efficient channel attention (ECA) module (EPResNet-50) is proposed. It uses pyramidal convolution [32] to capture multi-scale feature information, increases the network's attention to effective channels through the ECA [33] module, comprehensively considers channel and spatial characteristics, and enhances the network's feature extraction ability.
• To solve the problem of low accuracy of cloud masks generated in complex scenes, the CARAFE (Content-Aware ReAssembly of FEatures) [34] upsampling module is introduced. The semantic information of feature maps is fully utilized for feature restoration through adaptive kernels, improving the accuracy of the generated masks.
• A comparative experiment between the proposed algorithm and current mainstream algorithms is conducted on the 38-cloud [22,23] dataset. Experimental results show that the proposed method has better detection performance.
The other parts of the paper are organized as follows. In Section 2, we introduce an overview of the method in this paper. In Section 3, we compare and analyze the experimental results between the methods proposed in this paper and other methods. In Section 4, we provide a comprehensive conclusion.

Methods
This research introduces pyramidal convolution based on the efficient channel attention mechanism into the Unet to make full use of the red, green, blue, and near-infrared (RGBN) multi-spectral channel features and improve the network's feature extraction ability. Content-aware reassembly is introduced in the upsampling, so the network can adaptively optimize the upsampling operation according to the extracted feature information and improve mask generation accuracy.

MCNet
The proposed MCNet is based on an encoder-decoder structure, which connects the encoder and the decoder through skip connections. The backbone of the MCNet encoder is a pyramidal convolutional residual network based on efficient channel attention (EPResNet-50). Each block of EPResNet-50 can process the input feature map at multiple scales and extract the interdependence of local channels to improve the encoder's feature expression ability. Each step of the decoder consists of a CARAFE upsampling operation and two convolution operations. CARAFE makes full use of the in-depth semantic information of features through an adaptive kernel, strengthens the decoder's feature recovery capability, and improves the accuracy of mask generation. In the skip connections, we use 1 × 1 convolution to adjust the number of output channels of each EPResNet-50 module so that it matches the number of decoder channels, which facilitates the encoder-decoder combination. The MCNet conceptual structure is shown in Figure 1.

EPResNet-50
In the encoder, EPResNet-50 is used as the backbone network for feature extraction; it enhances the feature extraction capability by adding pyramidal convolution and efficient channel attention, taking both channel and spatial characteristics into account.

Pyramidal Convolution
Due to the uneven distribution of clouds in remote sensing images and their large scale differences, the proportion of cloud coverage in different images varies greatly. Traditional convolution uses a single-scale convolution kernel, which cannot extract features at multiple scales and is not suitable for remote sensing cloud detection. In this paper, pyramidal convolution (PyConv) is introduced in the encoder to capture cloud information at multiple scales across different spatial extents and depths, improving the network's ability to detect clouds with uneven distribution and large scale differences. The pyramidal convolution operation is shown in Figure 2. For an input feature map with FM_i channels, convolution kernels of different sizes K_x and different depths FM_ox (x is the grouping convolution index) are used to perform grouped convolution on the input features; finally, the group outputs are concatenated to generate the final output feature.
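As a rough sketch of the bookkeeping involved, grouped convolution is what keeps the multi-level pyramid affordable: larger kernels use more groups, so each level's parameter count stays bounded. The level configuration below (kernel sizes, groups, per-level output channels) is an illustrative assumption, not the exact EPResNet-50 setting from Table 1.

```python
def pyconv_params(in_ch, levels):
    """Parameter count of a PyConv layer. Each level is a tuple
    (kernel_size, groups, out_ch); a grouped conv at that level costs
    k*k * (in_ch // groups) * out_ch parameters (biases ignored)."""
    return sum(k * k * (in_ch // g) * out for k, g, out in levels)

# Hypothetical four-level pyramid: 256 total output channels split evenly,
# with group counts growing alongside kernel sizes.
levels = [(3, 1, 64), (5, 4, 64), (7, 8, 64), (9, 16, 64)]
py = pyconv_params(64, levels)

# A standard single-scale 3x3 conv with the same 64 -> 256 channel change:
std = 3 * 3 * 64 * 256

print(py, std)  # the pyramid is cheaper despite its 5/7/9 kernels
```

Despite spanning kernel sizes up to 9 × 9, the grouped pyramid here uses fewer parameters than the plain 3 × 3 convolution it replaces, which is the design rationale behind PyConv [32].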

ECA
Many ground objects in remote sensing images are covered by ice and snow, and the temperature of the covered areas is low. As a result, the near-infrared radiation of these ground objects is similar to that of clouds, and the color of ice and snow is identical to that of clouds. It is therefore difficult to accurately distinguish clouds under a large amount of similar background interference. To reduce similar-background interference and improve detection accuracy, we add the channel attention module ECA to the encoder to realize local cross-channel information interaction without dimensionality reduction, allowing the network to focus on the aggregation of image channel features, which improves the feature expression ability of the encoder and enhances the network's discrimination ability.
The ECA module is mainly composed of global average pooling (GAP), a 1D convolution (Conv1d), and a Scale operation. GAP compresses the W × H × C input features along the spatial dimension into 1 × 1 × C global pooling information. The number of adjacent channels k that participate in the channel weight calculation is obtained from the channel-size adaptive function ψ(C):

k = ψ(C) = | log2(C)/γ + b/γ |_odd, (1)

where the parameters b and γ are set to 1 and 2, respectively, in this article, C is the number of channels, and |·|_odd denotes the nearest odd number. Conv1d obtains the local cross-channel weight ω_i by considering the k neighboring channels according to Formula (2):

ω_i = σ( Σ_{y_j ∈ Ω_k^i} α_j y_j ), (2)

where Ω_k^i represents the set of k channels adjacent to channel y_i, α is the channel-shared convolution parameter, and σ is the Sigmoid function. Scale multiplies these weights with the previous features to obtain a feature map with channel weights. The ECA model structure is shown in Figure 3.
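The GAP, adaptive kernel size, Conv1d, and Scale steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's PyTorch implementation: the channel-first (C, H, W) layout is assumed, and the shared Conv1d weights are random placeholders rather than learned parameters.

```python
import numpy as np

def eca(x, gamma=2, b=1):
    """Minimal ECA sketch on a (C, H, W) feature map (layout assumed).
    The 1D-conv weights are random here purely for illustration."""
    c = x.shape[0]
    # Adaptive kernel size psi(C): nearest odd of |log2(C)/gamma + b/gamma|.
    t = int(abs(np.log2(c) / gamma + b / gamma))
    k = t if t % 2 == 1 else t + 1
    y = x.mean(axis=(1, 2))                      # GAP -> (C,)
    w = np.random.randn(k)                       # shared Conv1d weights
    pad = np.pad(y, k // 2, mode="edge")         # pad the channel axis
    logits = np.array([pad[i:i + k] @ w for i in range(c)])
    scale = 1.0 / (1.0 + np.exp(-logits))        # Sigmoid channel weights
    return x * scale[:, None, None]              # Scale: reweight channels

out = eca(np.random.randn(256, 8, 8))            # same shape as the input
```

For C = 256 with γ = 2 and b = 1, ψ(C) gives k = 5, so each channel weight is computed from its 5 nearest channels only, which is what makes ECA cheap compared with fully connected channel attention.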

EPResNet-50 Architectures
The base model of EPResNet-50 in this article is ResNet-50. EPResNet-50 introduces pyramidal convolution to capture more comprehensive spatial information through multi-scale extraction of cloud features, and introduces the ECA module to aggregate feature channel information, using the dependency between channels to improve the effectiveness of the feature expression. EPResNet-50 consists of a stack of similar structural blocks; its overall structure is shown in Table 1.
To extract cloud layer information at multiple scales, the depth and size of the pyramidal convolution kernels differ in each block of EPResNet-50. The stage1 block of EPResNet-50 is shown in Figure 4. The input features are first subjected to 1 × 1 convolution for feature mapping. The mapped features are input into the pyramidal convolution for multi-scale feature extraction, and the number of feature channels is then adjusted to 256 through 1 × 1 convolution. The 256-channel feature information is input to the ECA module to achieve local cross-channel interaction without dimensionality reduction and to aggregate channel information; finally, the aggregated features are combined with the original input to produce the output features. Compared with ResNet-50, EPResNet-50 can obtain more effective features in the spatial and channel dimensions.

CARAFE
The decoder structure recovers effective cloud features through stacked upsampling, and the feature upsampling module is the key operation in the convolutional network structure. Unet uses bilinear interpolation for upsampling with a preset sampling kernel: the semantic information of the feature map plays no role, and the receptive field is only 2 × 2, so the network cannot make full use of the information of nearby regions and the decoder cannot completely recover the effective features of the cloud layer. The CARAFE module adaptively generates the upsampling kernel based on the input features, which not only makes full use of the semantic information of the feature map but also obtains a larger receptive field, enhancing the feature recovery capability of the decoder and improving the mask generation accuracy.
We use the CARAFE module for upsampling. CARAFE combines a kernel prediction module with a content-aware reassembly module: it uses the underlying content information to predict adaptively optimized reassembly kernels and performs feature reassembly in the neighborhood of each pixel to achieve content-aware upsampling. The CARAFE module structure is shown in Figure 5.
In CARAFE, the feature map X of size C × H × W and the upsampling ratio σ (assumed to be an integer) are input into the kernel prediction module and the content-aware reassembly module at the same time. After module calculation, the output feature map X′ of size C × σH × σW is obtained, realizing the upsampling of the feature map. Any position l′ = (i′, j′) on the output feature map X′ has a corresponding position l = (i, j) on the input feature map X, where i = ⌊i′/σ⌋ and j = ⌊j′/σ⌋.
The kernel prediction module ψ adaptively generates a kernel W_l′ for each target position l′ in the output feature map X′:

W_l′ = ψ(N(X_l, k_encoder)),

where X_l is the pixel position on the input feature map X corresponding to the target position l′, k_encoder is the encoder kernel size, and N(X_l, k_encoder) is the square region of size k_encoder × k_encoder centered at l on the input feature map X. The kernel prediction module is composed of three sub-modules: a channel compressor, a content encoder, and a kernel normalizer. First, the channel compressor compresses the C × H × W feature map into C_m × H × W. Then, convolution with a kernel of size k_encoder × k_encoder × C_up in the content encoder produces a feature map of size σ²k_up² × H × W, where C_up = σ²k_up². Next, this feature map is reshaped to size k_up² × σH × σW; finally, the softmax function of the kernel normalizer produces the normalized output.
The reassembly module uses a weighted sum to obtain the upsampled feature at pixel l′:

X′_l′ = Σ_{n=−r}^{r} Σ_{m=−r}^{r} W_l′(n, m) · X_(i+n, j+m),

where k_up is the size of the reassembly kernel, N(X_l, k_up) is the square region of size k_up × k_up centered at l in the input feature map X corresponding to the target position l′, W_l′ is the reassembly kernel generated by the kernel prediction module at the target location l′, and r = ⌊k_up/2⌋.
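The reassembly step can be sketched directly from this formula. The NumPy code below is a minimal illustration under two stated assumptions: the kernel prediction module is omitted (its softmax-normalized kernels are taken as a given input of shape (k_up², σH, σW)), and edge padding stands in for whatever border handling the real implementation uses.

```python
import numpy as np

def carafe_reassemble(x, kernels, sigma):
    """Content-aware reassembly (sketch). x: input features (C, H, W);
    kernels: per-output-position reassembly kernels (k_up*k_up, sH, sW),
    assumed already softmax-normalized; sigma: integer upsampling ratio.
    Returns the (C, sigma*H, sigma*W) upsampled feature map."""
    c, h, w = x.shape
    k = int(round(kernels.shape[0] ** 0.5))      # k_up
    r = k // 2
    xp = np.pad(x, ((0, 0), (r, r), (r, r)), mode="edge")
    out = np.zeros((c, sigma * h, sigma * w))
    for ip in range(sigma * h):
        for jp in range(sigma * w):
            i, j = ip // sigma, jp // sigma      # source position l = l'/sigma
            patch = xp[:, i:i + k, j:j + k]      # k_up x k_up neighborhood
            wgt = kernels[:, ip, jp].reshape(k, k)
            out[:, ip, jp] = (patch * wgt).sum(axis=(1, 2))
    return out

x = np.random.randn(4, 3, 3)
# Uniform kernels reduce reassembly to a local box average of the source.
kern = np.full((9, 6, 6), 1.0 / 9.0)
y = carafe_reassemble(x, kern, sigma=2)          # shape (4, 6, 6)
```

As a sanity check on the formula, a one-hot kernel centered at the middle position degenerates to nearest-neighbor upsampling; the adaptive kernels produced by the prediction module interpolate between such extremes based on content.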

Datasets
In this paper, we use 38-cloud as the benchmark cloud detection dataset for evaluating model performance. The 38-cloud dataset has 38 scenes, including 18 training scenes and 20 test scenes. Each scene's data is provided by the Landsat 8 satellite [35]. The Landsat 8 satellite carries two sensors, the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). OLI has 9 spectral bands: the spatial resolution of bands 1 to 7 and 9 is 30 m, and that of band 8 is 15 m. TIRS has 2 spectral bands; the spatial resolution of bands 10 and 11 is 30 m. The information for all Landsat 8 bands is shown in Table 2. The 38-cloud dataset selects the red, green, blue, and near-infrared spectral bands of the Landsat 8 satellite and provides corresponding hand-labeled segmentation masks. The scene images are cropped into 384 × 384 patches, yielding 8400 patches in the training set and 9200 patches in the test set. We use holdout cross-validation to partition the dataset: 85% of the patches in the training set are randomly used for training and 15% for validation. To prevent overfitting and enhance the robustness of the model, we augment the training set with random translation, symmetry, rotation, and scaling.
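A minimal sketch of these geometric augmentations, applied jointly to a patch and its mask so the labels stay aligned, might look as follows. Only the symmetry (flip) and 90-degree rotation cases are shown; the translation and scaling augmentations also used here are omitted because they involve interpolation and padding choices the paper does not specify.

```python
import numpy as np

def augment(patch, mask, rng):
    """Apply random flips and 90-degree rotations jointly to a (H, W, C)
    patch and its (H, W) mask. A sketch: translation/scaling omitted."""
    if rng.random() < 0.5:                       # horizontal symmetry
        patch, mask = np.flip(patch, 1), np.flip(mask, 1)
    if rng.random() < 0.5:                       # vertical symmetry
        patch, mask = np.flip(patch, 0), np.flip(mask, 0)
    k = int(rng.integers(0, 4))                  # rotation by k * 90 degrees
    return np.rot90(patch, k, (0, 1)), np.rot90(mask, k, (0, 1))

rng = np.random.default_rng(0)
p, m = augment(np.zeros((384, 384, 4)), np.zeros((384, 384)), rng)
```

Applying the same random transform to image and mask is the essential point: augmenting only the image would silently corrupt the pixel-level labels.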

Evaluation Metrics
We use Overall Accuracy, Recall, Precision, Specificity, Jaccard Index, and F1-Score as evaluation indicators. The Precision indicator judges the exactness of the positive predictions, and the Recall indicator judges their completeness. The Specificity index measures the completeness of the negative predictions, and the Overall Accuracy index indicates the accuracy of the binary classification. The Jaccard Index describes the similarity between the predicted mask and the real mask, and the F1-Score balances Precision and Recall. Moreover, the Jaccard Index is an important index for judging the performance of cloud detection algorithms [22,23,31].
Overall Accuracy = (TP + TN) / (TP + TN + FP + FN) (9)

Among them, TP represents the number of positive samples whose judgment result is positive, TN represents the number of negative samples whose judgment result is negative, FP represents the number of negative samples whose judgment result is positive, and FN represents the number of positive samples whose judgment result is negative.
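All six indicators follow directly from these four confusion-matrix counts; a compact reference implementation (using the standard definitions, with Jaccard as the positive-class intersection-over-union) is:

```python
def metrics(tp, tn, fp, fn):
    """Standard binary-segmentation metrics from confusion-matrix counts.
    Jaccard is TP / (TP + FP + FN); F1 is the precision/recall
    harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    jaccard = tp / (tp + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, accuracy, jaccard, f1

p, r, s, a, j, f = metrics(80, 90, 10, 20)  # illustrative counts
```

With the illustrative counts above, for instance, Recall is 80/100 = 0.8 and Overall Accuracy is 170/200 = 0.85.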

Experimental Details
In this study, all experiments are implemented with the PyTorch framework running on Ubuntu 16.04 with an NVIDIA RTX 2080Ti GPU, using Python 3.6 as the software environment. In the experiments, we use the 384 × 384 pixel RGBN 4-channel patch images from 38-cloud as the input of the neural network. The batch size is 8, the BCE (Binary Cross Entropy) loss function is used, a total of 400 epochs are trained, and the Adam (Adaptive Moment Optimization) optimizer is used for training optimization. The initial learning rate is set to 0.001 and is decayed by a factor of 0.1 at epochs 200, 300, 350, and 390.
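The learning-rate schedule just described is a simple multi-step decay; as a sketch (equivalent in effect to a milestone-based step scheduler, independent of any framework):

```python
def lr_at(epoch, base_lr=1e-3, milestones=(200, 300, 350, 390), gamma=0.1):
    """Step schedule matching the setup above: the learning rate is
    multiplied by gamma at each milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# e.g. epochs 0-199 train at 1e-3, epochs 200-299 at 1e-4, and so on.
```

The decays compound, so after the last milestone the learning rate has dropped four orders of magnitude from its initial value.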

Ablation Experiment
To verify the rationality of the module combination in the network, we performed eight sets of ablation experiments on the same dataset in the same computing environment. Table 3 shows the experimental results. The first line is Unet, which uses ResNet-50 as the backbone network; the ECA, PyConv, and CARAFE modules were then activated separately or jointly, and the network performance was compared with this reference network. Adding ECA, PyConv, or CARAFE alone increases the Jaccard Index by 0.58%, 2.62%, and 1.46%, respectively, indicating the validity of each module added separately. Adding any two of the ECA, PyConv, and CARAFE modules to Unet simultaneously further improves the test results over adding a single module, which verifies the mutual assistance among the three modules. With all three modules added at the same time, all index results are optimal: the Jaccard Index reaches 83.05%, Precision reaches 94.83%, Recall reaches 86.69%, Specificity reaches 98.67%, and Overall Accuracy reaches 96.44%, which proves the rationality of the network structure of this article. ECA considers the interaction of local channels to enhance the importance of effective channels and extract unique cloud information. PyConv extracts cloud spatial information from a multi-scale perspective. PyConv integrated with the ECA module comprehensively considers the spatial and channel relationships of the features, extracts comprehensive and representative cloud information, and improves the feature expression ability of the encoder. The upsampling process of the CARAFE module can adaptively generate a kernel according to the input features, making full use of the semantic information of the feature map to restore more cloud-layer details and improve the mask generation accuracy of the decoder.
To prove the effectiveness of the MCNet algorithm, we conducted a quantitative comparison with popular cloud detection and semantic segmentation algorithms, including the Gradient Boosting Classifier (GBC), Random Forest (RF), FCN, Fmask, Cloud-Net, Att-Unet, PSPNet, and Deeplabv3. Table 4 shows the cloud detection results on the 38-cloud dataset. According to the experimental results in Table 4, MCNet's Jaccard Index reached 83.05%, Precision reached 94.83%, Recall reached 86.69%, Specificity reached 98.67%, and Overall Accuracy reached 96.49%. The MCNet algorithm is 6.8%, 10.4%, 1.5%, and 0.7% higher than the Unet algorithm in Jaccard Index, Precision, Specificity, and Overall Accuracy, respectively. Compared with the Cloud-Net deep learning cloud detection algorithm, MCNet's Jaccard Index is 4.5% higher and its Precision is 3.6% higher; compared with the second-best PSPNet algorithm, its Jaccard Index is 3.7% higher and its Precision is 7.7% higher. The experimental results indicate that the algorithm proposed in this paper achieves higher accuracy under each index, which demonstrates its effectiveness. In Figures 6 and 7, the first column is the test pseudo-color image; the second column is the ground truth; and the third, fourth, fifth, and sixth columns are the masks generated by Cloud-Net, PSPNet, Unet, and MCNet, respectively, in which white represents clouds and black represents background.
The blue box in Figure 6 shows the detection result under complex clouds, the yellow box shows the detection result under similar background interference such as ice and snow, and the red box shows the detection result where complex clouds and similar background interference coexist. Figure 8 shows the detailed information within the blue, yellow, and red boxes from Figure 6. The experimental results show that MCNet is less disturbed by similar backgrounds such as ice and snow; the generated mask is more realistic and closer to the ground truth, and the detection accuracy is higher than that of the other algorithms.

Figure 6. Visual results of two overall scene images obtained by different models on the 38-cloud dataset.
As seen from the experimental results of Figure 7, the masks generated by the other detection algorithms suffer from missing information and inaccurate positioning. The mask generated by MCNet retains more detailed cloud information, proving that MCNet can accurately and comprehensively capture cloud information and thus effectively avoid these problems.

Conclusions
This paper proposes a deep neural network cloud detection method that combines multi-scale feature extraction based on an attention mechanism with content-aware reassembly. To address the interference in cloud detection caused by cloud thickness, uneven cloud distribution, and ground objects in remote sensing images, the encoder uses EPResNet-50 to extract multi-scale spatial features and channel dependencies, improving the model's utilization of multi-spectral image features. The CARAFE module is introduced in the decoder stage; the size of the upsampling receptive field is automatically adjusted through an adaptive kernel, and the deep semantic information is fully utilized for feature recovery to improve mask generation accuracy. We use the 38-cloud dataset captured by the Landsat 8 satellite for validation. The experimental results prove the MCNet algorithm's effectiveness: the Jaccard Index reaches 83.05%, Precision 94.83%, Recall 86.69%, and Specificity 98.67%, even under complex background interference, and the model also achieves a better detection effect in scenes with large differences in cloud distribution. In future research, we will further optimize the model by introducing an adaptive dynamic filter to enhance its adaptability, so that the model can adaptively optimize convolution operations in different scenes.