Object-Based Convolutional Neural Networks for Cloud and Snow Detection in High-Resolution Multispectral Imagers

Cloud and snow detection is one of the most significant tasks in remote sensing image processing. However, distinguishing between clouds and snow in high-resolution multispectral images is challenging due to their similar spectral distributions. The shortwave infrared band (SWIR, e.g., the Sentinel-2A 1.55–1.75 μm band) is widely applied to the detection of snow and clouds, but high-resolution multispectral images lack a SWIR band, so such traditional methods are no longer practical. To solve this problem, a novel convolutional neural network (CNN) that classifies cloud and snow at the object level is proposed in this paper. Specifically, a novel CNN structure capable of learning multi-scale semantic features of cloud and snow from high-resolution multispectral imagery is presented. To overcome the "salt-and-pepper" shortcoming of pixel-level predictions, we extend the simple linear iterative clustering algorithm to segment high-resolution multispectral images and generate superpixels. Results demonstrate that the proposed method separates cloud and snow in high-resolution images with better precision, and its results are more accurate and robust than those of the other methods.


Introduction
High-resolution images are widely used in land cover monitoring, target detection and geographic mapping [1]. Cloud and snow significantly influence the spectral bands of high-resolution optical images [2].
Accurate detection of clouds and snow in remote sensing images is a key task for many remote sensing applications [3]. The presence of cloud and snow can cause serious problems for applications such as target recognition and segmentation, atmospheric correction, and more [4]. Therefore, the precise detection of clouds and snow in optical satellite imagery remains a challenging task.
Previous methods for cloud and snow detection have been developed based on spectral analysis [5]. In the existing methods, the shortwave infrared band (SWIR, e.g., the Sentinel-2A 1.55–1.75 μm band) is widely applied to the detection of snow and clouds [6]. The main reason is that the reflectance of snow is commonly lower than that of clouds at SWIR wavelengths [7]. The Normalized Difference Snow Index (NDSI) [8], a combination of the visible (green) band and SWIR, can effectively distinguish the snow region from the cloud region [9]. However, high-resolution images lack a SWIR band, so the previous spectral-based methods are no longer practical. Cloud and snow detection is a classification issue, and machine learning methods such as the Support Vector Machine (SVM) and the Artificial Neural Network (ANN) have been applied to remote sensing image classification [10]. These methods use manually designed features and a binary classifier, which do not take advantage of high-level features. Therefore, distinguishing between cloud and snow in high-resolution satellite images is a challenging task when using only four optical bands (blue, green, red, and infrared).
Nowadays, deep learning algorithms have become a research hotspot. Deep convolutional neural networks (CNNs) are a typical deep learning algorithm and have been applied to image classification, speech recognition and image semantic segmentation [11]. The CNN framework is quickly becoming important in the image processing field because it can effectively extract semantic information from a large number of input remote sensing images. Many methods for cloud and snow detection are based on pixel-level predictions and tend to produce poor detection results with "salt-and-pepper" noise in high-resolution imagery [12]. Object-level methods, however, can overcome this disadvantage of pixel-level "salt-and-pepper" noise.
For high-resolution multispectral images, it is not easy to obtain good cloud and snow detection results when only four optical bands are available. In this study, a novel CNN architecture is proposed for cloud and snow detection at the object level. Specifically, a two-patch deep CNN architecture is used, which can mine deep multi-scale features and semantic features (such as "a building is surrounded by building shadow" or "the snow goes along the extending direction of a mountain"). In order to reduce feature loss in the pooling process, we present self-adaptive pooling based on maximum pooling and average pooling. To better consider information from all bands, we propose a new simple linear iterative clustering (SLIC) [13] method to generate the spectral information of superpixels. The CNN structure is integrated with the spectral information of the SLIC segmentation in the preprocessing stage to enhance its performance. Compared with traditional pixel-based CNN methods, experimental results reveal that our CNN architecture achieves more accurate cloud and snow detection.
The major contributions of this study are outlined below.
(1) A new CNN structure is proposed to classify cloud and snow at the object level, which is capable of learning multi-scale semantic features of cloud and snow from high-resolution multispectral imagery.
(2) To better consider information from all bands and overcome the disadvantage of "salt-and-pepper" noise, we propose a new SLIC algorithm to generate superpixels at the preprocessing stage.
(3) In order to reduce feature loss in the pooling process, we present self-adaptive pooling based on maximum pooling and average pooling.
The structure of this paper is outlined below. In Section 2, we introduce the proposed CNN framework for distinguishing between cloud and snow in multispectral images. In Section 3, the experimental results and analysis are introduced. Finally, the conclusion is drawn in Section 4.

Methods for Cloud and Snow Detection in High-Resolution Remote Sensing Images
In this section, the proposed framework for distinguishing between cloud and snow in high-resolution imagery is introduced. First, we propose a new SLIC to generate the spectral information of superpixels, which are used to enhance the two patch-based deep CNN outputs. Second, the proposed CNN framework is illustrated.

Preprocessing of Superpixels
In comparison with pixel-based cloud and snow detection methods, object-based cloud and snow detection can overcome the "salt-and-pepper" effect. Therefore, a preprocessing step is used to reduce cloud misclassification and avoid the "salt-and-pepper" effect.
The SLIC algorithm [14] is a widely used superpixel segmentation method. As is well known, it performs k-means clustering locally to generate superpixels, which reduces the computation cost [15]. Superpixel methods aim to over-segment the remote sensing image by grouping homogeneous pixels [16]. To avoid the "salt-and-pepper" noise of pixel-level cloud and snow detection methods, a new object-level method is therefore proposed for cloud and snow detection.
A high-resolution multispectral image contains four bands, but if the standard SLIC algorithm is used to segment it, only three bands can be used for superpixel segmentation. The red, green, and blue bands can be converted to the CIELAB color space [l, a, b] and used directly to calculate the CIELAB spectral distance; however, the infrared band does not belong to the CIELAB space. To better consider the information of all bands, the distance of the infrared band should be calculated independently. In this study, the infrared space distance between the ith pixel and the jth pixel can be calculated by the equation below.
The spatial distance between the ith pixel and jth pixel can be expressed by Equation (2).
The spectral distance between the ith pixel and jth pixel can be calculated by Equation (3).
Therefore, the dissimilarity measure between the ith pixel and jth pixel can be calculated by Equation (4).
where m is used to balance the relative weight between spectral similarity and spatial dissimilarity, and S is the area of the jth cluster in the current loop.
For high-resolution multispectral imagery, the improved SLIC takes the cluster center to be the mean [l_i, a_i, b_i, nir_i, x_i, y_i]^T and uses D(i, j) as the dissimilarity measure. The other parts of the improved SLIC are the same as in the regular SLIC algorithm.
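The distance equations did not survive the extraction of this document, so as an illustration the dissimilarity measure can be sketched in Python following the standard SLIC formulation, with the infrared term added to the spectral distance as described above. The exact weighting of the spatial term is an assumption; the helper name `slic_dissimilarity` is hypothetical.

```python
import numpy as np

def slic_dissimilarity(pix_i, pix_j, m=10.0, S=100.0):
    """Dissimilarity D(i, j) between two pixels for the improved SLIC.

    Each pixel is (l, a, b, nir, x, y): CIELAB values, infrared
    reflectance, and image coordinates. This follows the standard SLIC
    form with the infrared distance folded into the spectral term; the
    exact form of the paper's Equations (1)-(4) is assumed, not quoted.
    """
    l1, a1, b1, nir1, x1, y1 = pix_i
    l2, a2, b2, nir2, x2, y2 = pix_j

    # Infrared distance, computed independently of the CIELAB bands.
    d_nir = abs(nir1 - nir2)
    # Spatial distance in image coordinates.
    d_xy = np.hypot(x1 - x2, y1 - y2)
    # Spectral distance combining the CIELAB distance and d_nir.
    d_lab = np.sqrt((l1 - l2) ** 2 + (a1 - a2) ** 2 + (b1 - b2) ** 2)
    d_spec = np.sqrt(d_lab ** 2 + d_nir ** 2)
    # Joint measure: m balances spectral vs. spatial terms and S is the
    # cluster area in the current loop (normalized via sqrt(S)).
    return np.sqrt(d_spec ** 2 + (m * d_xy / np.sqrt(S)) ** 2)
```

With identical pixels the measure is zero, and it grows with any single band difference, so the infrared band now contributes to clustering on equal footing with the CIELAB bands.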
Results of the improved SLIC and the original SLIC segmentation are shown in Figure 1b,c. From Figure 1b, it can be seen that the improved SLIC algorithm obtains more compact, regularly shaped superpixels and adheres very well to cloud and snow region boundaries.

CNN Structure
Clouds and snow have similar spectral distributions in high-resolution multispectral images [17]. Traditional cloud and snow detection methods cannot distinguish well between snow and cloud regions because they rely only on spectral features. To separate snow and cloud regions, the key is to effectively extract multi-scale semantic features.
As is well known, CNNs extract high-level features through self-learning [18]. However, traditional CNNs cannot extract multi-scale semantic features from the input data. In this study, we present a new two-patch CNN architecture to exploit multi-scale semantic information, which can extract rich superpixel-level features.
A CNN is a multilayer perceptron inspired by the visual neural mechanism. Generally, a typical CNN architecture consists of the input layer, convolution layers, pooling layers, fully connected layers, and the output layer [19]. The first convolutional layer computes the convolution of the input image with convolutional kernels [20]. A convolution layer is often followed by an activation function and a pooling layer, and the network ends with fully connected layers [21], as illustrated in Figure 2.
Generally, a convolution operator is followed by a transformation called the ReLU function [22]. Suppose G_i is the feature map of the ith layer and W_i represents the weight feature vector of the ith convolution kernel. Mathematically, the convolution process can be defined as G_i = f(G_{i−1} ⊗ W_i + b_i) [23].
where b_i is the bias in the ith channel, ⊗ denotes the convolution operation between the (i − 1)th layer of the input data and the ith convolution kernel, and f(·) is the activation function.
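The convolution-plus-activation step just described can be sketched for a single channel as follows. This is a minimal illustration with a hypothetical helper name `conv2d_relu`, not the paper's multi-channel implementation.

```python
import numpy as np

def conv2d_relu(x, w, b, stride=1):
    """Single-channel 'valid' convolution followed by ReLU: f(x ⊗ w + b).

    x: (H, W) input feature map, w: (k, k) kernel, b: scalar bias.
    """
    k = w.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Slide the kernel over the input and accumulate the dot product.
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w) + b
    return np.maximum(out, 0.0)  # ReLU activation f(x) = max(0, x)
```

For a 5 × 5 input and a 3 × 3 kernel with stride 1, the output is 3 × 3, matching the (size − kernel)/stride + 1 rule used for the layer dimensions later in the paper.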
The ReLU function has frequently been used in the literature because neurons with rectified activations overcome saturation well during the learning process [24]. The ReLU function is defined as f(x) = max(0, x).
The function of the pooling layer is to merge semantically similar features into one. In general, the typical pooling operations are max-pooling and average-pooling [25]. Let us assume that G_ij is a feature map. Max-pooling, expressed by Equation (7), takes the maximum element max(G_ij) of the feature map.
Average-pooling can be expressed by Equation (8), where the size of the pooling area is c × c pixels.
Image feature loss is an inevitable problem of the subsampling (pooling) process, and it often leads to low cloud and snow detection accuracy. We present self-adaptive pooling based on maximum pooling and average pooling. Self-adaptive pooling can adaptively adjust the pooling process through the pooling factor u in complex pooling areas [26], as shown by Equation (9) [27].
where u is the pooling factor. The role of u is to dynamically balance the max pooling and the average pooling for different pooling blocks. The pooling factor can be defined by the equation below [28].
where a is the average of all elements except for the max element in the pooling area and b_max is the max element in the pooling area.
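The self-adaptive pooling step can be sketched over one pooling block as below. The paper's exact formula for the pooling factor u did not survive extraction, so the form used here (large u when the maximum dominates the block, small u when the block is nearly uniform) is an illustrative assumption built from the quantities a and b_max defined above.

```python
import numpy as np

def self_adaptive_pool(block):
    """Self-adaptive pooling over one c × c pooling block.

    Blends max pooling and average pooling via a factor u computed from
    b_max (the block maximum) and a (the mean of the remaining elements).
    The specific formula for u below is an assumption, not the paper's.
    """
    flat = np.asarray(block, dtype=float).ravel()
    b_max = flat.max()
    rest = np.delete(flat, flat.argmax())
    a = rest.mean()
    # Assumed pooling factor, clipped to [0, 1]: u -> 0 for a uniform
    # block (pure average pooling), u -> 1 when the max dominates.
    u = 0.0 if b_max == 0 else np.clip((b_max - a) / abs(b_max), 0.0, 1.0)
    # Equation (9): weighted blend of max pooling and average pooling.
    return u * b_max + (1.0 - u) * flat.mean()
```

On a uniform block this reduces to average pooling, and on a block with one dominant activation it reduces to max pooling, which is the adaptive behaviour the text describes.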
Fully connected layers tend to cause overfitting because they introduce many parameters, and reducing this overfitting typically relies on dropout regularization and spatial dropout. Overfitting often leads to low classification accuracy and ineffective training. Global Average Pooling (GAP) [29], by contrast, does not require any parameter optimization. Therefore, in the two-patch CNN, the fully connected layers are replaced with GAP.
We design a new two-patch CNN structure, shown in Figure 3, which uses two branches (128 × 128 and 64 × 64) as the input data to mine multi-scale semantic features. The input patch size of 128 × 128 makes up the first branch of the CNN structure (red box). The first convolution layer is 64@24 × 24: it is composed of 64 filter channels with a dimension of 24 × 24, the result of convolving the input patch with a kernel of dimensions 36 × 36 and stride 4. The first convolution layer is followed by self-adaptive pooling of dimensions 2 × 2, which results in 128@12 × 12. This is followed by convolutions of 128@12 × 12 with filter kernels of sizes 2 × 2 and 4 × 4 and filter units 256 and 512. Finally, GAP is computed over the 512 channels of dimension 3 × 3 to produce 512@1 × 1.
For the input patch size of 64 × 64 (the second branch of the CNN structure, yellow box), the first convolution layer is 32@14 × 14, the result of convolving the input patch with a kernel of dimension 12 × 12 and stride 4. The first convolution layer, followed by self-adaptive pooling of dimension 2 × 2, results in 64@7 × 7. This is followed by convolutions of 64@7 × 7 with filter kernels of size 3 × 3 and filter units of 128 and 256. Lastly, GAP is computed over the 256 channels of dimension 3 × 3, reproducing 256@1 × 1. Rectified linear units (ReLU) [30] are applied to the output of every convolutional layer. The combined output is a 768-dimensional vector, which is reshaped into a 16 × 16 map with three channels (snow, cloud, and background). A softmax classifier is applied after the global average pooling to detect snow and clouds.
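The spatial dimensions quoted for both branches can be sanity-checked with the standard "valid" convolution size formula. The strides of the later convolution layers are not stated explicitly in the text and are inferred here as an assumption.

```python
def conv_out(size, kernel, stride=1):
    """Spatial size of a 'valid' convolution (or pooling) output."""
    return (size - kernel) // stride + 1

# Branch 1 (128 × 128 input patch, red box).
s = conv_out(128, 36, 4)   # 36 × 36 kernel, stride 4
assert s == 24             # -> 64@24 × 24
s = conv_out(s, 2, 2)      # 2 × 2 self-adaptive pooling
assert s == 12             # -> 128@12 × 12
s = conv_out(s, 2, 2)      # 2 × 2 conv (stride 2 assumed)
assert s == 6              # -> 256@6 × 6
s = conv_out(s, 4, 1)      # 4 × 4 conv (stride 1 assumed)
assert s == 3              # -> 512@3 × 3, then GAP -> 512@1 × 1

# Branch 2 (64 × 64 input patch, yellow box).
t = conv_out(64, 12, 4)    # 12 × 12 kernel, stride 4
assert t == 14             # -> 32@14 × 14
t = conv_out(t, 2, 2)      # 2 × 2 self-adaptive pooling
assert t == 7              # -> 64@7 × 7
t = conv_out(conv_out(t, 3, 1), 3, 1)  # two 3 × 3 convs (stride 1 assumed)
assert t == 3              # -> 256@3 × 3, then GAP -> 256@1 × 1
```

All intermediate sizes reproduce the channel@height × width annotations of Figure 3, and the two GAP outputs (512 + 256) account for the 768-dimensional vector mentioned above.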

Accuracy Assessment
The accuracy of the cloud and snow extraction in this study was evaluated with the Overall Accuracy (OA) and the Kappa coefficient. The reference cloud and snow masks in the study area were manually drawn in ENVI (Harris Geospatial, Boulder, CO, USA). OA is the percentage of correctly classified pixels. Taking cloud as an example, the OA and Kappa are defined as [31] OA = (TP + TN)/T and Kappa = (T × (TP + TN) − δ)/(T² − δ), where T is the total number of pixels in the testing image; TP, FN, FP, and TN denote the pixels for the accurate extraction of cloud, missing cloud regions, inaccurate extraction of cloud, and accurate rejection of background regions, respectively; and δ = (TP + FP) × (TP + FN) + (FP + TN) × (FN + TN).
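The two metrics can be computed from the confusion counts as a short sketch; the chance-agreement term δ below uses the standard form, and the helper name `oa_kappa` is illustrative.

```python
def oa_kappa(tp, fn, fp, tn):
    """Overall accuracy and Kappa coefficient from confusion counts.

    tp/fn/fp/tn: correctly extracted cloud, missed cloud, falsely
    extracted cloud, and correctly rejected background pixels.
    """
    t = tp + fn + fp + tn                      # total pixels T
    oa = (tp + tn) / t                         # overall accuracy
    # Standard chance-agreement term for a binary confusion matrix.
    delta = (tp + fp) * (tp + fn) + (fp + tn) * (fn + tn)
    kappa = (t * (tp + tn) - delta) / (t * t - delta)
    return oa, kappa
```

For example, with TP = 40, FN = 10, FP = 5, TN = 45 this gives OA = 0.85 and Kappa = 0.70.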

Experiment and Analysis
The new two-patch CNN architecture is implemented in Python 3.5 on a PC with an Intel i7-8086K CPU and an NVIDIA Tesla M40 GPU with 12 GB of memory. Our CNN is implemented with the TensorFlow software library (Google Brain Team, Mountain View, CA, USA). The training dataset is constructed from two SPOT-6 multispectral images and nine Gaofen-1 multispectral images. In total, 20,000 pairs of patches are obtained from the training set, where the numbers of cloud, snow, and background patches are 8000, 8000, and 4000, respectively. We initialize the weights in each layer with random numbers drawn from a zero-mean Gaussian distribution with a standard deviation of 0.01. We use mini-batch gradient descent with a batch size of 768, and the learning rate starts from 0.001. The ground-truth cloud and snow areas were manually extracted.

Performance of Improved Superpixel Method and Different CNN Architectures
To verify the effectiveness of the improved SLIC method and the double-branch CNN, we compared our full method with our double-branch CNN + pixel, our double-branch CNN + SLIC, a single-branch CNN with an input size of 128 × 128 (the first branch, red box, in Figure 3), a single-branch CNN with an input size of 64 × 64 (the second branch, yellow box, in Figure 3), our double-branch CNN + average pooling, and our double-branch CNN + max pooling.
Based on a visual assessment, Figure 4 shows the performance of the improved superpixel method and the different CNN structures on cloud and snow detection tasks. Compared with the ground truth, the single-branch CNNs produce more false negatives and false positives than our double-branch CNN structures, which can extract multi-scale semantic features from the input data. Among the CNN structures, our CNN achieves better visual performance than CNN + max pooling and CNN + average pooling. As Figure 4b–h shows, the superpixel-level methods avoid the "salt-and-pepper" effect.
To evaluate the quantitative performance of the different cloud and snow detection methods, the overall accuracy (OA) and Kappa coefficient (Kappa) are used. Table 1 shows the statistical results.
As shown in Table 1, superpixel segmentation preprocessing further improved the accuracy of our method. The OA reached 95%, which demonstrates the effectiveness of the double-branch CNN and the superpixel-oriented preprocessing. Overall, our proposed method achieves the best performance on both metrics.

Comparison with Other Cloud and Snow Detection Algorithms
To verify the effectiveness of the proposed method, we compare it with the ENVI threshold method [32], ANN [33], and SVM [34].
Based on a visual assessment, Figure 5 shows the results of the different cloud and snow detection methods. The ENVI threshold method has some limitations: it may fail to separate cloud from snow (Figure 5b) in the Gaofen-1 multispectral image. Our proposed method can accurately separate cloud and snow in high-resolution images, and its results are more accurate and robust than those of the other methods.

To evaluate the quantitative performance of the different cloud and snow detection methods, the OA and Kappa coefficient are used. The precision metric is not given for the ENVI threshold method because it could not separate the clouds from the snow. Table 2 shows the statistical OA and Kappa results. From Table 2, our method obtains a high OA and a reasonable Kappa coefficient, which verifies the feasibility and effectiveness of the cloud detection. In addition, the proposed method outperforms the ENVI threshold method, ANN, and SVM by a large margin; specifically, its Kappa for cloud detection is about 33% higher than that of the ENVI threshold method.

Conclusions
Generally, it is not easy to separate clouds from snow in high-resolution remote sensing imagery that includes only visible and near-infrared spectral bands, because the spectral information available for cloud and snow detection is insufficient. In this work, a new CNN architecture is proposed for cloud and snow detection at the object level. Specifically, a two-patch deep CNN architecture is used, which can deeply mine multi-scale and semantic features (such as "a building is surrounded by building shadow"). The improved SLIC method is applied to segment the multispectral image into regularly shaped superpixels, which are used to enhance the CNN outputs. The experimental results demonstrate that the proposed CNN architecture can accurately separate clouds and snow, and that its extraction results are more feasible and effective than those of the traditional cloud detection algorithm.
In a future study, we could generalize the proposed convolutional neural network-based method to other tasks in the remote sensing field, such as urban water extraction and ship detection.

Figure 2. The standard architecture of the CNNs.


Figure 3. The architecture of the used CNNs.


Table 1. Results on the SPOT-6 multispectral images for different methods in terms of OA and Kappa.


Table 2. Quantitative comparison among different methods.