An Effective Cloud Detection Method for Gaofen-5 Images via Deep Learning

Abstract: Recent developments in hyperspectral satellites have dramatically promoted the wide application of large-scale quantitative remote sensing. As an essential part of preprocessing, cloud detection is of great significance for subsequent quantitative analysis. For Gaofen-5 (GF-5) data producers, the daily cloud detection of hundreds of scenes is a challenging task. Traditional cloud detection methods cannot meet the strict demands of large-scale data production, especially for GF-5 satellites, which have massive data volumes. Deep learning technology, however, is able to perform cloud detection efficiently for massive repositories of satellite data and can even dramatically speed up processing by utilizing thumbnails. Inspired by the outstanding learning capability of convolutional neural networks (CNNs) for feature extraction, we propose a new dual-branch CNN architecture for cloud segmentation for GF-5 preview RGB images, termed a multiscale fusion gated network (MFGNet), which introduces pyramid pooling attention and spatial attention to extract both shallow and deep information. In addition, a new gated multilevel feature fusion module is also employed to fuse features at different depths and scales to generate pixelwise cloud segmentation results. The proposed model is extensively trained on hundreds of globally distributed GF-5 satellite images and compared with current mainstream CNN-based detection networks. The experimental results indicate that our proposed method has a higher F1 score (0.94) and fewer parameters (7.83 M) than the compared methods.


Introduction
Gaofen-5 (GF-5) is the fifth flight unit of the China High-Resolution Earth Observation System (CHEOS) [1], which was successfully launched in May 2018. It carries two land observation payloads, a visible and shortwave infrared hyperspectral camera and a multispectral imager, and four atmospheric observation payloads: a greenhouse gas detector, a multiangle polarization detector, a differential absorption spectrometer for atmospheric trace gas, and an atmospheric environment infrared sensor [2]. GF-5 imagery can be widely used in environmental monitoring, geological mapping, urban heat island monitoring, thermal effluent monitoring, and other fields, by virtue of its wide spectral range and high spatial and spectral resolution. GF-5 imagery is of great relevance for global-scale quantitative remote sensing applications [3]. However, the annual mean global cloud cover is approximately 66% [4], which brings significant challenges to the large-scale application of remote sensing data. According to our statistics, the daily peak volume of data obtained by the GF-5 land observation payloads is above 300 scenes. For data providers, it is therefore crucial to perform cloud detection on this volume of data both quickly and accurately.

What is interesting in Table 1 is the rapid increase of cloud detection based on pixelwise CNN methods. This type of approach can perform feature extraction and classification at the same time to implement end-to-end segmentation. Compared with the methods mentioned above, pixelwise CNN methods have apparent advantages: first, they can integrate spatial and spectral information; second, hand-crafted features and sophisticated remote sensing preprocessing steps are not needed; third, given sufficient training samples, they have higher accuracy and stronger generalization abilities. However, pixelwise CNN methods still have room for improvement in cloud detection.
According to our statistics, the U-shape [38,41,42] and the linear stack structures [22], inspired by U-Net [49], SegNet [50], and VGG [51], are the two mainstream architectures for cloud segmentation. It is well known that cloud images contain different types of representations: high-level semantic information and low-level information such as color, shape, and location information. As such, these architectures, with a single processing pipeline that relies on multistage cascaded CNNs, may lead to the loss of spatial information and may result in inaccurate boundary definitions [52][53][54]. Some meaningful practices relating to the fusion of features at different depths and scales to expand the receptive field of the network have also been reported [25,31,39,40]. However, further research is needed, especially on ways to reduce the loss of spatial information and on how to capture and fuse relevant, meaningful multiscale contextual information instead of using simple concatenation.
Benefiting from deep learning technology, completing a quality assessment of massive satellite data requires fewer resources, and even thumbnails can be used to implement accurate cloud detection [40]. Inspired by the excellent CNN architectures of the bilateral segmentation network (BiSeNet) [54], pyramid scene parsing network (PSPNet) [55], and squeeze-and-excitation network (SENet) [56], in this paper, we propose a new cloud detection method, a multiscale fusion gated network, using a dual-branch CNN architecture for cloud detection of GF-5 preview RGB images. First, we design a new lightweight backbone network combining the advantages of SENet [56] and ResNeXt [57]. Then, we introduce two attention modules: one is a spatial pyramid pooling attention (SPPA) module based on the channel attention mechanism to extract multiscale semantic features; the other is a low-level feature spatial attention (LFSA) module with a spatialwise attention mechanism for extracting beneficial shallow features. Finally, a gated multilevel feature fusion (GMFF) module is employed to deeply fuse features at different depths and scales to generate the pixelwise cloud segmentation result. The remainder of this paper is organized as follows. The proposed method is described in Section 2. The data source and experiment settings are described in Section 3. Experiments with evaluations and comparisons are presented in Section 4, and the conclusion, along with a discussion, is presented in Section 5.

Methods
The linear stack structure and the U-shape structure are two classic frameworks for semantic segmentation, though there is still much room for improvement. For the linear stack structure, repeated downsampling and resizing operations lose much spatial information, and global context information is not fully exploited. U-shape structures such as U-Net [49] try to fill in the missing details by using skip-connections, but still cannot fundamentally solve the problems [54]. Although shallow features are abundant in spatial information, they are still too noisy to provide sufficient and useful information related to the target [58]. This kind of single processing pipeline, which has a limited effect in improving spatial information loss, often leads to inaccurate boundary definitions [52]. Another critical factor affecting the segmentation results is the size of the receptive field of the convolutional layer, especially for the recognition of targets with multiscale features. Recent work has focused on how to enlarge the receptive field and obtain more global context information. However, further research is needed, especially on how to collect relevant and effective global contextual information and how to fuse features from different depths and scales instead of simply concatenating them. The proposed architecture is designed to offer an improvement plan to address the issues mentioned above.
The multiscale fusion gated network (MFGNet) with a dual-branch CNN architecture is mainly composed of four core modules, i.e., an SPPA module, an LFSA module, a GMFF module, and a new backbone network. As shown in Figure 1, patches of size H × W × C (256 × 256 × 3 in this case) are input to the backbone network. The features extracted by the backbone network are divided into two parts, which are input into the SPPA and LFSA modules, respectively, for multiscale semantic and shallow information extraction. In the end, the features from different depths and scales are fused by the GMFF module and output as a cloud segmentation mask with the same size as the input image.
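As a quick sanity check, the branch resolutions described above can be traced in a few lines of plain Python. This is only an illustrative shape trace under the stated ratios (1/2-resolution shallow features for LFSA, 1/8 resolution then ×4 upsampling for SPPA); the function name is ours, not from the released code.

```python
# Hypothetical end-to-end shape trace of the MFGNet dataflow: a 256 x 256 x 3
# patch enters the backbone, the two branches (SPPA, LFSA) each emit features
# at 1/2 resolution, and the GMFF module fuses them into a full-size mask.
def mfgnet_shapes(h=256, w=256):
    backbone_low = (h // 2, w // 2)    # stages 1-2 output, 1/2 resolution
    backbone_deep = (h // 8, w // 8)   # stage 5 output, 1/8 resolution
    sppa_out = (backbone_deep[0] * 4, backbone_deep[1] * 4)  # upsample x4
    lfsa_out = backbone_low            # spatial attention keeps 1/2 size
    assert sppa_out == lfsa_out        # the two branches align before fusion
    return (h, w, 1)                   # GMFF upsamples to a full-size mask

mask_shape = mfgnet_shapes()
```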

Backbone Network
The backbone network plays an essential role in improving the efficiency and accuracy of segmentation. Recently, lightweight skeleton networks such as the MobileNet series [59,60], ResNeXt [57], and Xception [61] have achieved state-of-the-art performance in many classification and segmentation tasks. We used ResNet as the base structure and proposed a new lightweight backbone network combining the advantages of Xception and SENet. As shown in Table 2, the new backbone network consisted of a stack of 5 stages. To retain as much shallow information as possible, only stages 1, 3, and 4 were downsampled, and the output features were 1/2, 1/4, and 1/8 of the input image size, respectively. Stage 1 was composed of 3 CBR (Conv + BN + ReLU) blocks, each consisting of a convolutional layer (Conv), a batch normalization layer (BN), and a rectified linear unit (ReLU). The remaining stages, with the same topology and different hyperparameters, were composed of several residual convolutional blocks (RCB) and an identity convolutional block (ICB). The adjustable hyperparameters of RCB and ICB included the stride, dilation rate, squeeze-and-excitation (SE) option, skip connection option, etc.; in stages 2-5, the number of filters of each stage was multiplied by a factor of 2. The kernel size of all convolutional layers was set to 3 × 3.
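The downsampling schedule above can be written out as a minimal sketch, assuming exactly stages 1, 3, and 4 halve the resolution; `stage_output_size` is a hypothetical helper for illustration, not part of the model code.

```python
# Hypothetical sketch of the backbone's spatial resolutions: only stages
# 1, 3, and 4 downsample (stride 2), so their outputs are 1/2, 1/4, and
# 1/8 of the input size, and stages 2 and 5 keep the resolution unchanged.
def stage_output_size(input_hw, stage):
    """Return the (H, W) of the feature map after the given stage (1-5)."""
    downsampling_stages = {1, 3, 4}          # stages that halve the resolution
    factor = 2 ** len(downsampling_stages & set(range(1, stage + 1)))
    h, w = input_hw
    return (h // factor, w // factor)

sizes = [stage_output_size((256, 256), s) for s in range(1, 6)]
```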

Figure 2 shows two types of convolutional layers with different structures used in the backbone network. Both RCB and ICB were composed of depthwise separable convolutional layers (SepConv), BN, and ReLU. The main difference between RCB and ICB is that the former used skip connections, and the output feature was 1/2 of the input size.
In addition, there was an SE unit in the ICB block, which was used to adjust the channel weight of the output layer.
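The SE unit's channel reweighting can be sketched in NumPy as follows. The bottleneck ratio and the weight matrices `w1`/`w2` are illustrative stand-ins for learned parameters, not the trained values.

```python
import numpy as np

# Hypothetical NumPy sketch of the SE unit inside the ICB block: spatial
# information is squeezed by global average pooling, passed through a small
# ReLU bottleneck, and the resulting per-channel sigmoid weights rescale
# the output feature map channelwise.
def se_unit(x, w1, w2):
    """x: (H, W, C) feature map; w1: (C, C//r); w2: (C//r, C)."""
    squeeze = x.mean(axis=(0, 1))                    # global average pool -> (C,)
    hidden = np.maximum(squeeze @ w1, 0)             # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid gate -> (C,)
    return x * weights                               # channelwise reweighting

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 16))
y = se_unit(x, rng.standard_normal((16, 4)), rng.standard_normal((4, 16)))
```

Because the gate values lie in (0, 1), the SE unit can only attenuate channels, never amplify them, which is what makes it act as a soft channel selector.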


Spatial Pyramid Pooling Attention Module
Enlarging receptive fields and extracting multiscale features can help obtain more global context information. Recently, many works have explored these two aspects, and some useful network structure solutions have been proposed. For example, the global convolution network (GCNet) [58] adopts a "large kernel" to enlarge the receptive field, PSPNet utilizes a spatial pyramid pooling (SPP) module to obtain multiscale pooling features [55], and DeepLab [62] proposes atrous spatial pyramid pooling to capture the context information of different receptive fields. Inspired by PSPNet, we introduced an SPPA module that combines the advantages of SPP and channelwise attention. As shown in Figure 3, the features of the final stage output of the backbone network, with a size of 1/8 of the original image, were connected to a CBR block, and the number of channel dimensions was reduced to 256. An SPP submodule was applied to capture the context information from different scales. The SPP consisted of five average pooling layers, with kernel and stride sizes of 1 × 1, 2 × 2, 4 × 4, 8 × 8, and 16 × 16, respectively. Then, we directly upsampled all five pyramid-level layers to 1/8 of the original image size and concatenated them. Before being fed into the next channelwise attention block, the feature combination was reduced by a 1 × 1 Conv. In the attention block, the spatial information from all the channels was squeezed by average pooling and output as a one-dimensional vector of size 1 × 1 × C. Followed by two 1 × 1 Conv layers and one activation layer, the computed weight vector was able to reweight the features and control feature selection. After upsampling by a factor of 4, the output of the SPPA module was a feature map with 1/2 the size of the input image.
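The pooling-and-upsampling core of the SPP submodule can be sketched in NumPy as follows. Nearest-neighbour upsampling stands in for the module's resizing, and the 1 × 1 convs and the attention block are omitted for brevity.

```python
import numpy as np

# Hypothetical sketch of the SPP submodule: average pooling at five scales
# (kernel == stride), nearest-neighbour upsampling back to the input
# resolution, and channel concatenation of all pyramid levels.
def spp(x, levels=(1, 2, 4, 8, 16)):
    """x: (H, W, C); returns (H, W, C * len(levels))."""
    h, w, c = x.shape
    pooled = []
    for k in levels:
        # average pool with kernel size k and stride k
        p = x.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))
        # nearest-neighbour upsample back to (H, W)
        pooled.append(np.repeat(np.repeat(p, k, axis=0), k, axis=1))
    return np.concatenate(pooled, axis=-1)

feat = np.ones((32, 32, 8))
out = spp(feat)
```

Each pyramid level summarizes context over a different window size, so the concatenated output carries receptive fields from 1 × 1 up to 16 × 16 pooling regions.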


Low-Level Feature Spatial Attention Module
The LFSA module (Figure 4) is mainly used to extract and fuse spatial features at different scales. The core step is that the first two levels of features, generated by stages 1 and 2 of the backbone network, are further refined by a spatialwise attention block. First, the maximum value of each pixel across all channels was calculated at the spatial scale and concatenated with the original features to enhance the weight of the cloud targets. We reduced the channel dimension of the features by a 1 × 1 Conv and utilized a squeeze factor to adjust the dimension reduction ratio. Then, a 1 × 1 pointwise Conv and a sigmoid activation layer were used to generate a spatial attention map. By multiplying by the input feature combination, the spatial attention map was able to reweight the features and emphasize meaningful features in the spatial dimension. After being refined by the spatial attention (SA) block, the low-level features, holding 1/2 the size of the input image from different stages, were concatenated as the final output of LFSA.
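The SA block's pixelwise reweighting can be sketched in NumPy as follows; `w` is a stand-in for the learned pointwise-conv weights, and the squeeze-factor channel reduction is folded into this single projection for brevity.

```python
import numpy as np

# Hypothetical sketch of the SA block in the LFSA module: the channelwise
# maximum is concatenated with the input, projected to one channel by a
# pointwise (1x1) convolution, and turned into a sigmoid spatial attention
# map that reweights every pixel of the feature map.
def spatial_attention(x, w):
    """x: (H, W, C); w: (C + 1,) pointwise projection weights."""
    channel_max = x.max(axis=-1, keepdims=True)       # (H, W, 1)
    combined = np.concatenate([x, channel_max], -1)   # (H, W, C + 1)
    logits = combined @ w                             # 1x1 conv -> (H, W)
    attn = 1.0 / (1.0 + np.exp(-logits))              # sigmoid map in (0, 1)
    return x * attn[..., None]                        # spatial reweighting

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 64, 8))
y = spatial_attention(x, rng.standard_normal(9))
```

In contrast to the SPPA module's channel attention, one scalar weight is produced per spatial location here, so the map suppresses background pixels rather than whole channels.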


Gated Multilevel Feature Fusion Module
As a common technique in segmentation tasks, the traditional method of fusing shallow spatial information with semantic information is to simply concatenate or sum these features and then apply postprocessing operations such as a conditional random field (CRF) to refine the results. In a variation from previous practice, the GMFF module (Figure 5), which is based on the attention mechanism, focuses on further extracting the useful information in the feature combination. In this case, we first combined the shallow and deep features and reduced the channel dimensions by a 1 × 1 Conv. Then, we pooled the concatenated features to a vector and computed a weight vector. By multiplying the concatenated features by this weight vector, the module was able to adjust the weight of the useful information. In a variation from the attention method used in the SPPA module, a skip connection was utilized to bring more abundant information. After the upsampling layer, the feature map was fed into a 1 × 1 convolutional layer to obtain the final cloud segmentation mask with the same size as the original input image.
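The gate-plus-skip structure described above can be sketched in NumPy as follows; `w` stands in for the learned 1 × 1-conv weights, and the channel reduction and final upsampling are omitted.

```python
import numpy as np

# Hypothetical sketch of the gated fusion in the GMFF module: concatenated
# shallow and deep features are pooled to a vector, a per-channel sigmoid
# gate is computed from it, and a skip connection adds the ungated
# features back so no information is lost when a gate value is small.
def gated_fusion(shallow, deep, w):
    """shallow, deep: (H, W, C); w: (2C, 2C) gate weights."""
    x = np.concatenate([shallow, deep], axis=-1)     # (H, W, 2C)
    pooled = x.mean(axis=(0, 1))                     # global pool -> (2C,)
    gate = 1.0 / (1.0 + np.exp(-(pooled @ w)))       # sigmoid weights
    return x * gate + x                              # gated + skip connection

rng = np.random.default_rng(2)
a = rng.standard_normal((128, 128, 4))
b = rng.standard_normal((128, 128, 4))
fused = gated_fusion(a, b, rng.standard_normal((8, 8)))
```

The skip connection means the output is the concatenation scaled by (1 + gate), so the fusion emphasizes useful channels without zeroing out the rest, unlike the pure attention used in the SPPA module.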



Dataset
GF-5 data providers need to complete precise cloud detection tasks with hundreds of scenes every day. Just a few years ago, cloud estimation of most satellite data relied mainly on the manual monitoring of RGB images. As mentioned before, traditional methods can achieve accurate cloud recognition of hyperspectral or multispectral data; however, they are not suitable for large-scale productions, especially for GF-5 hyperspectral images with massive data volumes. Moreover, as the first step of data processing, the cloud detection process is required to not consume too much time in data preprocessing, such as performing decompressions and atmospheric corrections. It is not difficult to find that, in most cases, there is sufficient information to make a clear judgment on the cloud through its color, shape, texture, shadow, spatial relationship, and many other features from RGB images. This is the main reason we chose the GF-5 preview RGB image (i.e., thumbnails) as the training dataset for cloud detection.
More than 1600 scenes of GF-5 RGB images with a size of 2008 × 2083 were collected on a global scale, covering the period from January to March 2019. Some scenes with invalid data were eliminated, and the rest of the images were further selected according to the land-cover types and cloud coverage to ensure that the model could be applied to common scenarios. A collection of images containing typical scenarios was chosen to evaluate the visual performance of the prediction results, and the remaining 717 scenes (Figure 6) were used for model training and quantitative evaluation. The scenes contained multiple collections such as cloudy, sunny, snow, and cloud and snow coexisting, covering common scenarios such as cities, mountains, forests, and farmland. All selected images were manually labeled with reference cloud masks (RCMs). In the first stage, a threshold-based method with careful thresholding was used to label the cloud, and in the second stage, the mask results were visually checked and corrected, especially for complex samples with snow or ice. The RGB images and the RCMs together formed a four-band dataset.

Data Processing
In this process, some necessary preprocessing was performed on the entire GF-5 image to meet the constraints of the algorithm and the hardware, such as graphics processing unit (GPU) memory. As shown in Figure 7, we used a fixed-size window to crop the data into patches of size 256 × 256. The random cropping strategy was achieved by randomly setting the starting point coordinates and the rotation angle of the window. Further, during the cropping process, we set a threshold to keep more positive samples in the patches, balancing the positive and negative samples of the dataset.
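The cropping strategy can be sketched as follows. The 0.1 cloud-fraction threshold is illustrative rather than the paper's value, and the random window rotation is omitted for brevity.

```python
import numpy as np

# Hypothetical sketch of the random cropping strategy: 256 x 256 windows
# are drawn at random top-left corners, and a patch is kept only when its
# cloud (positive) fraction exceeds a threshold, which balances positive
# and negative samples in the resulting dataset.
def sample_patches(image, mask, n, size=256, min_cloud_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    patches = []
    while len(patches) < n:
        r = rng.integers(0, image.shape[0] - size)   # random top-left corner
        c = rng.integers(0, image.shape[1] - size)
        m = mask[r:r + size, c:c + size]
        if m.mean() >= min_cloud_frac:               # keep cloudy-enough crops
            patches.append(image[r:r + size, c:c + size])
    return np.stack(patches)

img = np.zeros((1024, 1024, 3))
msk = np.ones((1024, 1024))                          # fully cloudy toy mask
batch = sample_patches(img, msk, n=4)
```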
Remote sensing images from the same place acquired at different times will show some radiation differences due to temporal issues. Considering that this phenomenon is mainly manifested as differences in brightness, contrast, etc., it is reasonable to perform color transformation on the data to enrich the diversity of the data. In addition, the spatial transformation of the data also helps the algorithm better identify the cloud target in the background.

To improve the generalization ability of the model and make it adapt to images acquired at different times and scenes, we adopted a random expansion strategy for the batch data before training. As shown in Figure 8, the data augmentation strategy proposed in this task included color-based methods such as saturation, brightness, contrast, and sharpness, and geometry-based methods such as rotate, flip, shift, zoom in, and zoom out. Except for rotate and flip, the amplitudes of the other transformations were set to ±20%.
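A minimal sketch of this augmentation pipeline is given below, assuming the ±20% amplitude applies multiplicatively to brightness and contrast; the specific jitter formulas and probabilities are illustrative, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical sketch of the augmentation strategy: brightness and contrast
# are jittered by a random amplitude drawn from +/-20%, and geometric
# rotations (multiples of 90 degrees) and flips are applied at random.
def augment(patch, rng):
    amp = rng.uniform(-0.2, 0.2)                     # +/-20% amplitude
    out = patch * (1.0 + amp)                        # brightness jitter
    mean = out.mean()
    out = mean + (out - mean) * (1.0 + rng.uniform(-0.2, 0.2))  # contrast
    out = np.rot90(out, k=int(rng.integers(0, 4)))   # random 90-degree rotate
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)                   # random horizontal flip
    return np.clip(out, 0.0, 1.0)                    # stay in normalized range

rng = np.random.default_rng(3)
aug = augment(np.full((256, 256, 3), 0.5), rng)
```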
Random samples of 12k patches were used in this task. These data were split into two groups, i.e., 80% for training and 20% for validation. All the input data were normalized to values between 0 and 1.

Model Training and Prediction
In the training stage, the patches of size 256 × 256 were input to the backbone network with five processing stages. The output of the backbone network's first two stages was input to the LFSA module. During this time, low-level features of different depths were selectively extracted through the SA block and output as a feature of size 128 × 128 × 256. At the same time, the features from stage 5 were input to the SPPA module. After pyramid pooling, the high-level features of different scales were concatenated and selectively extracted through a channelwise attention mechanism and output at the size of 128 × 128 × 256. Finally, the low-level spatial information and high-level semantic information were deeply fused by the GMFF module, and a cloud mask with a size of 256 × 256 × 1 was output as the final result.
The training procedure was performed in the TensorFlow (1.13.1) framework on an NVIDIA GeForce GTX 1080Ti GPU and optimized by the adaptive moment estimation (Adam) algorithm [63] (initial learning rate of 0.001) with binary cross-entropy loss. One hundred epochs were used for training, and the batch size was 20. The convolution weights were initialized with "Glorot_uniform", i.e., drawn randomly from a uniform distribution within [-limit, limit], with the limit defined as in [64]. The biases in the convolutional layers were initialized to 0. To prevent overfitting, a dropout layer with a dropout rate of 0.2 was added on top of the backbone network.
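For reference, the "Glorot_uniform" limit mentioned above is computed from the layer's fan-in and fan-out; the sketch below follows the standard definition in [64], with the example layer sizes chosen arbitrarily:

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Limit of the Glorot (Xavier) uniform initializer:
    weights are drawn from U[-limit, limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))

# For a 3x3 convolution with 64 input and 128 output channels,
# fan_in = 3*3*64 and fan_out = 3*3*128 (example values, not the MFGNet's).
limit = glorot_uniform_limit(3 * 3 * 64, 3 * 3 * 128)
```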
The size of a GF-5 image is 2008 × 2830 pixels, which means that it must be divided into multiple patches for prediction. An overlap-tile strategy [49], which retains only the central prediction results of each patch (Figure 9), was used to ensure the seamless segmentation of large images.
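The tiling arithmetic of the overlap-tile strategy can be sketched as follows; the 64-pixel margin is an illustrative assumption, since the paper does not state the overlap width:

```python
def tile_starts(length, patch, stride):
    """Start offsets so that patches of `patch` pixels, stepped by `stride`,
    cover an axis of `length` pixels; the last start is clamped so the
    final patch stays inside the image."""
    starts = list(range(0, max(length - patch, 0) + 1, stride))
    if starts[-1] + patch < length:
        starts.append(length - patch)
    return starts

# GF-5 preview images are 2008 x 2830; with 256-pixel patches and an
# assumed 64-pixel margin on each side, only the central
# stride = 256 - 2*64 = 128 pixels of each prediction are kept.
patch, margin = 256, 64
stride = patch - 2 * margin
rows = tile_starts(2008, patch, stride)
cols = tile_starts(2830, patch, stride)
```

Because consecutive starts never differ by more than the patch size, every pixel of the scene falls inside at least one patch, and the clamped last tile avoids padding at the image border.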

Figure 9. Overlap-tile strategy for seamless segmentation of GF-5 images.
In our experiments, BiSeNet, PSPNet, SegNet, and FCN8 (fully convolutional network with 8× upsampling) were also evaluated as reference methods on the same dataset with the same training parameter settings as the MFGNet. An ablation experiment was also conducted to test the performance of the main modules in the MFGNet, which is described in detail in Section 4.

Evaluation Metrics
The performance of the proposed model was quantitatively measured by the agreement and differences between the predicted results and the RCMs. The five most common metrics, Equations (1)-(5), namely overall accuracy, recall, F1 score, precision, and intersection over union (IoU), were deployed as evaluation indices for the compared methods. For reference, a general analysis of accuracy metrics for classification tasks can be found in [65]. These metrics are defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN), (1)
Recall = TP / (TP + FN), (2)
F1 = 2 × Precision × Recall / (Precision + Recall), (3)
Precision = TP / (TP + FP), (4)
IoU = TP / (TP + FP + FN), (5)

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative pixels, respectively.
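The confusion-matrix definitions translate directly into code; the counts below are toy values for illustration:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and IoU from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return accuracy, precision, recall, f1, iou

# Toy counts, not results from the paper.
acc, p, r, f1, iou = metrics(tp=90, tn=95, fp=5, fn=10)
```

Note that F1 simplifies to 2·TP / (2·TP + FP + FN), so it is always at least as large as IoU for the same counts.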

Evaluation of the MFGNet
Observing changes in loss and accuracy is a simple and effective way to evaluate the quality of a model during training. The loss and accuracy of the training and validation set of each epoch were computed and are displayed in Figure 10a. As depicted in the figure, the accuracy curve of the training set rises rapidly; meanwhile, the loss curve drops rapidly and reaches stability after a few epochs. Although the loss of the validation set shows periodic oscillations, a stabilized curve is achieved after 60 epochs. The validation loss reaches the lowest point in the 80-100 epoch, and the curve trends of the training and validation sets during this period are the same, which indicates that the model is not overfitting.
Figure 10b shows a scatter plot of the RCM and predicted cloud coverage, which was employed to further investigate the performance of the MFGNet. The proposed model showed strong performance, and the predictions were highly consistent with the RCMs, with an R² of 0.99. It should be noted that the validation set contained cloud coverage data from different scenarios, and no patches with full cloud coverage or without any cloud coverage were evaluated, thereby resulting in a more reliable evaluation of the model.
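The R² consistency reported for Figure 10b follows the standard coefficient-of-determination definition; the coverage fractions below are illustrative values, not the paper's data:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for predicted vs. reference cloud coverage."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical per-patch cloud coverage fractions (RCM vs. prediction).
rcm = [0.10, 0.35, 0.50, 0.80]
pred = [0.12, 0.33, 0.52, 0.79]
r2 = r_squared(rcm, pred)
```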

Comparison Results
To quantitatively evaluate the performance of each model in the cloud detection task, we adopted overall accuracy, recall, F1 score, precision, and IoU as the evaluation metrics. The results in Table 3 show that the proposed MFGNet consistently outperformed all reference methods on all metrics. In general, all CNN-based methods are effective for cloud detection, with both accuracy and precision reaching 95% or more, which not only exceeds the performance of traditional methods but also approaches the accuracy of manual labeling. It is worth mentioning that the recall of the MFGNet is significantly better than that of the other models, which indicates that this model has a lower false-negative rate. The F1 score, which combines precision and recall, better represents the overall performance of a model, while IoU measures the degree of coverage of the segmentation result on the target, which is more convincing for a segmentation task. Both of these comprehensive indicators reached 0.9 for the MFGNet, significantly better than for the other methods. In general, the results show that the proposed model performs better than the other methods, and they also imply that it is more robust on the validation set, which contains many kinds of cloud coverage data obtained from different scenarios.


Example Scene and Performance
Cloud segmentation examples for whole-scene GF-5 imagery are shown in Figures 11 and 12. Four types of cases, including cloud-only, ice and snow coexisting, snow-only, and cloud and snow coexisting cases, are shown for comparison. Figure 11a-d shows the recognition results for images with different cloud coverage. At first glance, most algorithms work quite well in the cloud segmentation task, and apart from the apparent errors of FCN8 and SegNet, there is not much difference between the others. However, careful comparison of the details shows that the visual performance of the MFGNet is much better than that of the comparison methods (discussed later). Figure 11 also reveals that, with the exception of individual cases, all methods performed well on ice recognition, indicating that the CNN-based methods have fully learned the differences between ice and cloud features.
The most serious issue for cloud recognition is eliminating the misidentification of snow. Statistically, the pixel values of cloud and snow are very close, which makes them challenging to distinguish. However, in most cases there is a discernible difference between snow and clouds, since the distribution of snow is closely related to the terrain. Surprisingly, most CNN-based methods showed the potential to distinguish between clouds and snow, indicating that the characteristics of clouds and snow can be learned through neural networks, even from RGB imagery with only three bands. As shown in Figure 12, benefiting from the dual-branch CNN architecture, the MFGNet still achieved the best visual performance in all the experimental results. The SPPA module provided sufficient receptive fields while acquiring the multiscale features of the cloud.
The attention mechanism adopted by the MFGNet ensures that the model can focus on the relevant and effective features that distinguish clouds from snow. Furthermore, the fusion of features from different depths and scales improves the accuracy of the prediction. However, there is still room for improvement. By examining the difficult cases, which were predicted inaccurately and are marked with red circles, we divided the inaccurate predictions into two types: the first involves interference from thin clouds, and the other involves 100% snow coverage. The misidentification of these two cases may be related to inaccurate sample labeling and the limitations of the model's capabilities, which are discussed in the next section.

Efficiency Evaluation
As mentioned earlier, cloud detection is the first step in data quality assessment, so the detection process needs to be both accurate and efficient. We calculated the MFLOPs (millions of floating-point operations), #Params (number of network parameters), model size, and time cost of each method in the experiment to illustrate efficiency. As Table 4 shows, the model size of the MFGNet was much smaller than that of the other methods, which shows that the proposed method achieves the highest accuracy with the fewest parameters. In addition, the MFLOPs of the MFGNet was only 15.72, which was not only the smallest in the comparison experiments but also on par with current mainstream lightweight networks. This indicates that the model places lower demands on hardware. All models performed similarly in time cost and could complete the prediction of a 2k × 2k scene within 10 s. This means that CNN-based methods can complete cloud detection for more than 300 scenes in less than an hour, which is much more efficient than traditional methods. In short, the efficiency evaluation shows that the proposed model holds great promise for practical applications.
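As a quick way to sanity-check #Params figures such as 7.83 M, the parameter count of a single convolutional layer follows the standard formula (k · k · C_in + 1) · C_out; the layer sizes below are arbitrary examples, not the MFGNet's actual layers:

```python
def conv2d_params(k, c_in, c_out, bias=True):
    """Parameter count of a k x k conv layer: k*k*c_in weights per output
    channel, plus an optional bias per output channel."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

# e.g. a 3x3 convolution from 64 to 128 channels:
n = conv2d_params(3, 64, 128)

# A 1x1 convolution without bias, as often used in fusion/attention blocks:
m = conv2d_params(1, 256, 256, bias=False)
```

Summing this count over every layer of a network yields its total #Params; multiplying each layer's count by its output spatial size gives a rough FLOPs estimate.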

Method Advantage Analysis
The previous experimental results show that the MFGNet outperforms the reference cloud segmentation methods. We believe this mainly depends on the architecture of the CNN-based methods. It is unsurprising that FCN8's segmentation results were unsatisfactory: as an early network, the depth of FCN8 is limited, and much useful information is lost during the repeated upsampling process, which leads to misidentification and omission. Benefiting from the SPP module, PSPNet can distill more deep semantic information. Although its segmentation accuracy is slightly better than that of FCN8, the problem of loss of spatial information (LSI) remains. These architectural defects lead to lower evaluation metrics and limited capabilities in small-target recognition. SegNet proposed a new upsampling strategy and added more shallow information to the decoder; to some extent, this alleviated the LSI problem but also led to inaccurate boundary definitions. BiSeNet employs a dual-branch CNN structure to address the LSI problem one step further, but as the experimental results show, the segmentation results of SegNet and BiSeNet were over-smoothed and inaccurate.
As we can see from Figure 13, the detailed performance of the MFGNet is more consistent with the RCM. The proposed method has a more delicate edge representation and higher recognition accuracy than the U-shaped and linear stacked structures, especially on small cloud targets (Figure 14), which is strong evidence of the advantages of the network architecture. The proposed method also performs well when cloud and snow coexist. Interestingly, in some complex cases, the segmentation results of the MFGNet were actually more accurate than the RCM, which may indicate that a good CNN-based method has better fault tolerance than manual labeling.

Limitation Analysis
In the process of making RCMs, we adopted a relatively conservative labeling strategy in order to retain as much useful data as possible for subsequent applications. Therefore, we mainly labeled thick clouds and as few thin clouds as possible. In practice, however, there is no reference standard, so it is challenging to label different images with consistent judgment. This directly leads to inconsistent standards in the labeling of thin-cloud RCMs, which is a common problem faced by deep learning applications.
Another complicated problem is that although the algorithm can distinguish cloud from snow, it makes mistakes when a patch is 100% cloud or snow. In an RGB image, full cloud or snow cover has often reached value saturation, which means it cannot be distinguished by features such as color, texture, and brightness. Although this situation is also challenging for human interpreters, they can still make a rough judgment by analyzing the surroundings. Given a sufficiently large receptive field, deep learning algorithms can also distinguish cloud and snow in such cases, but due to hardware limitations, we cannot input an entire image scene into the network. In fact, the size of the input patches determines the upper limit of the receptive field, so this is a limitation of most CNN-based algorithms.
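The receptive-field ceiling discussed above can be made concrete with a standard receptive-field calculator; the five-stage VGG-like layer stack below is an illustrative assumption, not the MFGNet's exact configuration:

```python
def receptive_field(layers):
    """Receptive field of a layer stack given as (kernel, stride) pairs,
    using rf += (k - 1) * jump, where jump is the cumulative stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Five stages of two 3x3 convolutions followed by stride-2 pooling
# (an assumed generic backbone for illustration).
stage = [(3, 1), (3, 1), (2, 2)]
rf = receptive_field(stage * 5)
```

For this assumed stack, the receptive field (156 pixels) stays below the 256-pixel patch size, illustrating how the input patch, not the network, can cap the usable context.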

Extended Application
The development of satellite remote sensing has increased the demand for large-scale data quality assessment. Like other satellite data, GF-5 data face considerable challenges in data quality assessment, such as cloud detection and invalid data screening and classification (Figure 15).
CNN-based methods have natural advantages in semantic segmentation and image classification. Combined with the big data available from satellite imagery, they offer a feasible path toward automated, full-process, high-efficiency, and high-precision satellite data quality assessment. As a lightweight network, the MFGNet can achieve high-precision cloud detection with low computational consumption, which shows great potential for large-scale practical applications.
As a data-driven method, a CNN-based model requires continuous optimization to meet the needs of global-scale applications. However, the considerable sample-labeling workload becomes a problem. According to the previous analysis, the MFGNet performs well on cloud segmentation in various environments. Therefore, the MFGNet can first be used to predict the samples, after which the samples with poor prediction accuracy are screened out for manual labeling. The newly generated data can then be used to retrain the model and further improve its accuracy, thus forming a looping workflow that improves labeling efficiency.
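The screening step of this looping workflow can be sketched as a simple partition of predicted samples; the function name, score field, and 0.9 threshold are illustrative assumptions, not values from the paper:

```python
def screen_for_relabeling(predictions, min_accuracy=0.9):
    """Split predicted samples into accepted pseudo-labels and samples
    sent back for manual labeling, based on an estimated accuracy score.
    The 0.9 threshold is an illustrative choice."""
    accepted, to_label = [], []
    for sample_id, score in predictions:
        (accepted if score >= min_accuracy else to_label).append(sample_id)
    return accepted, to_label

# Hypothetical (sample id, estimated prediction accuracy) pairs.
preds = [("s1", 0.97), ("s2", 0.62), ("s3", 0.91), ("s4", 0.85)]
accepted, to_label = screen_for_relabeling(preds)
```

Only `to_label` goes to human annotators; after relabeling, both groups rejoin the training set and the model is retrained, closing the loop.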

Conclusions and Future Developments
In recent years, there has been an increasing demand for efficient cloud detection in massive volumes of satellite images, which poses challenges for traditional cloud detection methods that rely mainly on spectral information. In this context, the combined use of RGB imagery and CNN-based methods provides a solution for efficient cloud detection in GF-5 satellite data. In this paper, we presented the MFGNet, a novel cloud segmentation model with a dual-branch CNN architecture. The proposed model employs SPPA, LFSA, and GMFF modules to better fuse features from different depths and scales and to strengthen the collection of useful spatial information. The MFGNet was trained on hundreds of globally distributed GF-5 satellite images covering a variety of scenarios and compared with FCN8, SegNet, PSPNet, and BiSeNet. The overall accuracy, recall, F1 score, precision, and IoU were used to quantitatively evaluate the MFGNet and the compared methods. The experimental results show that, compared with the other models, the MFGNet achieves promising performance for cloud recognition in GF-5 RGB imagery, with an F1 score of 0.94 and an IoU of approximately 0.9. The efficiency test results also indicate that the proposed model has fewer parameters (#Params = 7.83 M) and lower computational consumption (MFLOPs = 15.72). Based on these results, we believe that CNN-based cloud detection is a promising way forward with practical significance for large-scale, automated, and efficient data quality assessment applications.
In our future work, we will collect as much data as possible from around the world to improve the generalizability of the algorithm. In addition, we will extend the proposed method to other satellite data. Furthermore, to better overcome the weaknesses of the current models, we will try to use a small number of hyperspectral bands to improve segmentation performance where cloud and snow coexist.
