Fuzzy AutoEncode Based Cloud Detection for Remote Sensing Imagery

: Cloud detection of remote sensing imagery is quite challenging due to the inﬂuence of complicated underlying surfaces and the variety of cloud types. Currently, most of the methods mainly rely on prior knowledge to extract features artiﬁcially for cloud detection. However, these features may not be able to accurately represent the cloud characteristics under complex environment


Introduction
With the existence of clouds, solar radiation cannot or can hardly arrive at the land surface, which not only leads to the missing information and spectral distortion, but also hinders the further image application [1,2].Therefore, cloud detection plays an indispensable role in image pre-processing.However, accurate cloud detection is quite challenging.On the one hand, there are various clouds with different spectral characteristics.On the other hand, some objects with high reflectance (such as snow, ice, etc.) are always confused with clouds.In particular, optical thin clouds are difficult to detect as their spectral signal includes both clouds and the surface underneath [3].
In recent years, many researchers have studied these issues and a series of cloud detection methods have been proposed.Generally speaking, these methods can be divided into two categories: single-image-based method and multiple-image-based method [4][5][6].In [7], cloud and shadow areas are detected using spectral information from the blue, shortwave infrared and thermal infrared bands of Landsat Thematic Mapper (TM) or Enhanced Thematic Mapper Plus (ETM+) imagery from two dates.Goodwin proposes a new automated method to screen cloud and cloud shadow from time series of Landsat TM/ETM+, and the results suggest that temporal information can improve the detection of cloud and cloud shadow [8].Zhu also designs an automated cloud, cloud shadow, and snow detection algorithm using multi-temporal Landsat data [9].However, imagery without temporal characteristics is much more common and the emergence of single-image-based methods can be helpful for multiple-image-based cloud detection.In this paper, we mainly focus on the cloud detection with single image.
For single-image-based cloud detection, threshold-based methods are widely used [10][11][12][13].Irish [14] proposes an automated cloud cover assessment method to extract clouds from Landsat data by setting a series of thresholds using different indexes.However, it does not provide sufficient precise locations and boundaries of clouds.Zhu and Woodcock acquire the cloud mask by computing a probability mask and a scene-based threshold, while due to the relatively lower threshold settings, the clouds are always overestimated [15].Zhang et al. obtain a coarse cloud detection result relying on the significance map and the proposed optimal threshold setting [16].The threshold-based method is a simple and practical approach for cloud detection, while it is impractical for general use because of its sensitivity to the background and the range of cloud cover [17].
Subsequently, more sophisticated methods are used to identify cloud from remote sensing imagery [18][19][20][21].In [22], decision trees based on empirical studies and simulations are designed for cloud detection and acquire relatively satisfactory performance.As cloud and cloud shadow always occur in pairs, the relationship between cloud and cloud shadow as well as the sensor parameters can also be used for cloud detection [23,24].Nevertheless, it is a tough job for acquiring sensor parameters, which to some extent increases the difficulty of cloud identification.According to the unique characteristics of clouds, which are brighter and colder than most of the earth surface, spectral features can always be used for cloud pixel detection.In addition, some existing methods add other information of images, such as texture information, shape information, spatial information, and so on [25][26][27][28].
In essence, cloud detection is a classification problem, and the recent developments of machine learning provide more available approaches for cloud detection [29,30].Therefore, some classifier-based methods (such as Support Vector Machine (SVM), Random Forest (RF), etc.) have increasing popularity.Latry classifies the cloud picture using radiances and geometrical characteristics based on SVM [31].In [32], researchers adopt a visual attention technique in computer-vision based RF to automatically identify images with a significant cloud cover.Ma et al. [33] successfully applies the cascaded adaboost classifier to solve the cloud detection problem.In existing methods, a larger number of features are artificially designed and extracted as the classifier input.These artificially designed features rely on prior knowledge and they are difficult to accurately represent the cloud characteristics under complex environment.Thus, we adopt a model integrating deep discriminative feature learning and fuzzy function strategies to detect clouds, which could not only extract implicit information, but also attain good performance.
In this paper, we establish a model named FAEM for discriminative feature learning instead of artificial feature designing.The proposed FAEM mainly consists of two parts: stacked autoencode networks are introduced to learn the deep discriminative features from a great deal of samples, and then a fuzzy function is combined to obtain the accurate cloud detection results.The remainder of this manuscript is organized as follows.Section 2 describes datasets and preprocessing.The proposed methodology for cloud detection is introduced in Section 3, followed by the cloud extraction experiments and result in Section 4. Further discussions are arranged in Section 5 and in Section 6 we give a brief summary of our works.

Datasets
Landsat ETM+ images and GF-1 images with different spatial resolutions are considered in this study.They can be downloaded from the USGS website and geospatial data cloud [34], respectively.In our work, 172 Landsat ETM+ images and 25 GF-1 images with 500 by 500 pixels are applied.For Landsat EMT+ imagery, as the spatial resolution of its thermal infrared band is 60 m (Table 1) which is lower than other bands.We first resample it to 30 m spatial resolution for a uniform size.The detailed parameters of these images are provided in Table 1, and Table 2 shows statistical number of images with different cloud covers.These images contain various underlying surface environment such as green vegetation, water (river and sea), building, bare rock and so on.We can easily distinguish vegetation and bare rock from cloud pixels based on their spectra properties.However, it is difficult to distinguish building and snow from clouds by spectral characteristics alone.In addition, since the semitransparent thin cloud has always been mixed with other background objects and has no clear outline, it is more difficult to detect thin cloud than thick cloud [35].Four cloudy images with various underlying surfaces are shown in Figure 1.

Datasets
Landsat ETM+ images and GF-1 images with different spatial resolutions are considered in this study.They can be downloaded from the USGS website and geospatial data cloud [34], respectively.In our work, 172 Landsat ETM+ images and 25 GF-1 images with 500 by 500 pixels are applied.For Landsat EMT+ imagery, as the spatial resolution of its thermal infrared band is 60 m (Table 1) which is lower than other bands.We first resample it to 30 m spatial resolution for a uniform size.The detailed parameters of these images are provided in Table 1, and Table 2 shows statistical number of images with different cloud covers.These images contain various underlying surface environment such as green vegetation, water (river and sea), building, bare rock and so on.We can easily distinguish vegetation and bare rock from cloud pixels based on their spectra properties.However, it is difficult to distinguish building and snow from clouds by spectral characteristics alone.In addition, since the semitransparent thin cloud has always been mixed with other background objects and has no clear outline, it is more difficult to detect thin cloud than thick cloud [35].Four cloudy images with various underlying surfaces are shown in Figure 1.

Preprocessing: Feature Selection
Cloud is an aerosol comprising a visible mass of minute liquid droplets or frozen crystals, and shows cluster-like distributions in remote sensing imagery.It is generally known that the spectral reflectance value of cloud is relatively larger than those of other underlying surfaces [36].However, because of different objects with similar spectral profiles and image noises, spectral features are not enough for cloud detection.Therefore, for high-accuracy cloud detection, we adequately consider both the spectral and spatial information of remote sensing imagery in this paper, mainly including spectral, texture and structure features [37,38].

Selection of Spectral Features
Cloud types vary widely, but the characteristic of cloud is generally white, bright, and cold compared to the Earth's surface.Just as Table 3 shows, only blue (0.45-0.52 µm), green (0.52-0.59 µm), red (0.63-0.69 µm), and near infrared (0.77-0.89 µm) bands of the GF-1 images and all bands of Landsat ETM+ are used in our experiments while panchromatic bands are excluded.

Selection of Texture Features
Texture reflects the spatial arrangement of spectral information, which is an important portion of spatial feature [39].In this paper, we take four frequently used texture features: means, homogeneity, second moment and correlation based on the Grey Level Co-occurrence Matrix (GLCM).In addition, as infrared band is always different from or even opposite to other bands, only the texture features of the mean of visible image bands are considered.The formulation of the four texture features we used in this paper are shown as following: where p(i, j) is the value in the cell i, j in the co-occurrence matrix and L is the max spectral value.
Means reflects the regularity of image texture.(b) Homogeneity Homogeneity is the measurement of image uniformity in the local region.(c) Second moment ASM is the image energy, and reflects the image uniformity.
(d) Correlation where p(i, j) is the normalized value of co-occurrence matrix, Cor Measures the similarity of GCLM in the row or column direction.

Selection of Structure Features
Just as [40] shows, structure features describe the core information about image.The overall structure features, rather than the individual details, are always the primary information of human perception on image.To exploit structure image S for an input image I, a relative total variation (RTV) model will be employed.
where ε is a small constant, N represents the total number of image, and λ is a presetting parameter for balance.φ x (i) and φ y (i) are the general pixel-wise windowed total variation measure.They represent the absolute spatial difference within the window R(i) and could be written as where j belongs to a window R(i).(∂ x S) and (∂ y S) respectively calculate the partial derivative in x and y directions of image S. g i,j is a weighting function, and it is defined as ψ x (i) and ψ y (i) are defined as different from φ x (i) and φ y (i) ; they are written as In our work, we set the parameter λ = 0.05, and use the mean values of the visible bands as inputs.Then, the right of Equation ( 5) is minimized to obtain the overall structure map.

Methodology: FAEM for Cloud Detection
The framework of the proposed method is shown in Figure 2. Three major steps were performed in this section for accurate cloud detection: (1) fundamental feature fusion; (2) deep discriminative feature learning; and (3) cloud degree prediction.The initial step of FAEM was to calculate fundamental features from the original images.During the deep feature learning phase, the selected fundamental features were fed into the established stacked autoencode networks to generate a set of feature extractors.In the final step, the membership function was used for cloud degree prediction.

Fundamental Feature Fusion
Accurate cloud detection needs to consider both spectral and spatial information.In our experiments, the spectral information is obtained from the observed band values and the spatial information is calculated according to Section 2.2.After that, they are fused to be fundamental feature vector as the input of our FAEM model.We call the fused feature as fundamental feature in this paper as it is the combination of some basic information.

Deep Discriminative Feature Learning
The traditional artificially designed features cannot accurately represent the complex real environment.Most traditional cloud detection methods mainly focus on the construction of features to efficiently differentiate the cloud from others.In most cases, these features are "knowledgedriven", which means they are designed artificially based on prior knowledge.
Deep learning shows that the learned deep feature has powerful ability for feature representation.Recently, deep learning has achieved much success in image processing thanks to its deep network, which is constructed with many network layers and has the ability to mine the deep discriminative feature of image.In this paper, we apply stacked autoencode networks to learn the deep discriminative feature, which is a powerful representation of corresponding sample for accurate cloud detection, with a number of samples from the real environment.
Imagine that each fundamental feature vector is a point in , and our goal is to find a function : → that maps each feature vector into so that the new transformed vector can be classified linearly.Suppose the feature vectors are denoted as = , , … , , where is number of the training samples.The feature matrix is normalized to [0, 1] with following formulation: where is the normalized data, and are the maximum and minimum values of the original data set.
During the feature learning stage, ∈ × is the input of the network's first layer.
Formally, the output of the first layer is represented as an operation : where ∈ × and represent the weights and biases, respectively, and " * " denotes the multiplication.Here, corresponds to filters of support 1 × , where is the dimension of

Fundamental Feature Fusion
Accurate cloud detection needs to consider both spectral and spatial information.In our experiments, the spectral information is obtained from the observed band values and the spatial information is calculated according to Section 2.2.After that, they are fused to be fundamental feature vector as the input of our FAEM model.We call the fused feature as fundamental feature in this paper as it is the combination of some basic information.

Deep Discriminative Feature Learning
The traditional artificially designed features cannot accurately represent the complex real environment.Most traditional cloud detection methods mainly focus on the construction of features to efficiently differentiate the cloud from others.In most cases, these features are "knowledge-driven", which means they are designed artificially based on prior knowledge.
Deep learning shows that the learned deep feature has powerful ability for feature representation.Recently, deep learning has achieved much success in image processing thanks to its deep network, which is constructed with many network layers and has the ability to mine the deep discriminative feature of image.In this paper, we apply stacked autoencode networks to learn the deep discriminative feature, which is a powerful representation of corresponding sample for accurate cloud detection, with a number of samples from the real environment.
Imagine that each fundamental feature vector is a point in R P , and our goal is to find a function f : R P → R Q that maps each feature vector into R Q so that the new transformed vector can be classified linearly.Suppose the feature vectors are denoted as X = [x 1 , x 2 , . . . ,x n ], where n is number of the training samples.The feature matrix X is normalized to [0, 1] with following formulation: where X norm is the normalized data, X max and X min are the maximum and minimum values of the original data set.
During the feature learning stage, X norm ∈ R P×n is the input of the network's first layer.Formally, the output of the first layer is represented as an operation F 1 : where W 1 ∈ R n 1 ×n and B 1 represent the weights and biases, respectively, and " * " denotes the multiplication.Here, W 1 corresponds to n 1 filters of support 1 × P, where P is the dimension of input samples and n 1 is the dimension of the first layer output of the network.Intuitively, W 1 applies n 1 matrix multiplications on the features, and each multiplication has a size 1 × P. The output is composed of n 1 -dimensional features.B 1 is an n 1 -dimensional vector, whose each element is associated with a filter.We apply the ReLU function f 1 (•) = max(•, 0) on the filters responses.The first layer extracts the n 1 -dimensional features for each sample.
In the second layer, each of these n 1 -dimensional feature vectors is mapped to n 2 -dimensional ones.This is equivalent to applying n 2 filters which have a support 1 × n 1 .The output of the second layer is: where W 2 ∈ R n 2 ×n 1 contains n 2 filters of size 1 × n 1 , and B 2 is an n 2 -dimensional bias vector.The output n 2 -dimensional is the feature of the sample in anther R Q space where the sample is easier to be classed and detected.It is possible to add more feature learning layers to increase the non-linearity and the ability of feature representation.Nevertheless, this will increase the complexity of the model, and thus demands more computation time.We will explore deeper structures by introducing additional non-linear mapping layers in Section 4.1.

Cloud Degree Prediction
Cloud in remote sensing imagery varies spatially, and thick and thin clouds can exist in the same imagery.Most traditional methods regard cloud detection as a 0-1 classification problem by artificially selecting features and classification.However, these sample classification methods are heavily stretched to represent the real situation.In the FAEM proposed in this paper, a membership function was utilized to detect the thickness degree of the cloud at the tail of the last layer of stacked autoencode networks.The Gaussian-type membership function is utilized in this model, where k and a are the parameters of the model, and the result A(X norm ) is the degree of the training samples belonging to class A, called the membership degree of A. The output cloud degree of each pixel is within [0, 1], where the higher the cloud degree is, the denser the cloud is at the location of corresponding pixel.According to the output membership degree, we can further obtain the corresponding cloud density map, which shows the cloud density of each pixel on the image.

Parameter Tuning
Learning the mapping function for cloud detection requires the estimation of network parameters . This is achieved through minimizing the loss between the model output (the predicted labels) and the true labels.Given a set of pixels x i and their corresponding labels l(x i ), the mean squared error (MSE) is used as the loss function: where n is the number of training samples.The loss function is minimized using batch stochastic gradient descent algorithm with the standard back propagation scheme.In particular, the weight matrices are updated as follows where l ∈ {1, 2} and i are the indices of the layers and iterations, γ is the learning rate, β is the momentum factor, and ∂L ∂W l i is the derivative.The filter weights of each layer are initialized by drawing randomly from Gaussian distribution with zero mean and standard derivation 0.001 (and 0 for biases).The learning rate is 0.01 and the momentum factor is 0.9.In addition, to avoid over-fitting problem, dropout factor 0.5 is used to randomly reduce a half features during training stage.Once the model is trained, the parameters are set to construct a nonlinear mapping for discriminative feature extraction and cloud density map prediction.

Accuracy Assessment
Both qualitative and quantitative assessments are necessary.Qualitative assessment can be evaluated with visual effects.For quantitative assessment, as many other cloud detection methods (SVM, RF, Function of Mask (Fmask) [15], etc.), simply divide the imagery into two classes: cloud and non-cloud.Therefore, though the final output of our FAEM is a cloud density map, for the convenience of assessment, we convert it into cloud and non-cloud for quantitative comparison with other methods.
In addition to overall accuracy (OA) and Kappa, there are four metrics used in this paper: right rate (RR), error rate (ER), false alarm rate (FAR), and the ratio of RR to ER (RER).RR is defined as where CC is the number of correctly detected cloud pixels and GN is the number of cloud pixels in ground truth.RR provides us with the information of correctly detected results.
ER is defined as [42] ER = (CN + NC) TN where CN represents the number of cloud pixels identified as non-cloud pixels.NC represents the number of non-cloudy pixels identified as cloud pixels, and TN denotes the number of pixels of the input image.ER is used to provide incorrect information.
FAR is defined with the same form as in the papers [37] where NC and TN have the same meanings as the above formula.FAR is one part of ER and it explicitly represents the false alarm rate.
Using only one of them to assess algorithms is insufficient, as some methods may obtain high RR but bring too many false alarms.On the contrary, some methods may obtain low ER but also low RR.Therefore, RER is defined to obtain an integrated result as it considers the RR and ER.The higher it is, the better it will be.RER is defined as the ratio of

Experiments and Results
As there are some adjustable parameters in our FAEM model, we first analysis and determine the best parameter combination for accurate detection result.In this experiments, 16,698 cloud pixels and 50,253 non-cloud pixels which contain as many objects as possible are selected as samples for the Landsat ETM+ imagery.The numbers of cloud and non-cloud samples for 25 GF-1 images are 13,197 and 17,462 respectively.In addition, the samples should belong to the pure objects such as thick cloud, water, building, vegetation and so on, the thin cloud which is actually the mixture of cloud and ground objects is not selected as samples.
After that, the samples are divided into two parts: training set and test set, and the number of training and test set are 2/3 and 1/3 of the total number of samples respectively.During the training procedure, training set is used for model training and after each training epoch, the test sets are used to test the trained model and output the corresponding test accuracy.At the prediction procedure, each pixel's degree of belonging to cloud is predicted and used to derive the corresponding cloud density map.Moreover, to demonstrate the efficiency of proposed model, both Landsat ETM+ and GF-1 images with different spatial resolution are applied in this experiment.

Parameter Analysis
The number of layers and hidden nodes can influence the result of cloud detection.In our previous experiments with Landsat ETM+ imagery, the number of the fundamental vector is 13 (including spectral, texture and structural information) and the numbers of the two hidden layers are 12 and 10, respectively, and a cloud density map is outputted as last.For convenience, we denote the model node number as 13-12-10-1.In this section, several experiments are designed to fully explore the relation between different model parameter combinations and cloud detection results.
Based on our model parameters set in Section 3.2, we conduct four groups of experiments in this section.The trend of iteration error and the iteration number are reported in Figure 3.In experiment (a), by comparing experiments with node number 13-12-1, 13-12-10-1 and 13-12-10-10-1, we can see that the convergent rate is getting quicker and the final iteration error is becoming lower with the increase of the model layers.Similarly, experiments (b) and (c) demonstrate the same regular pattern.In addition, it also shows that when the model layers increase from 3 to 4, the final iteration error falls along with it.At the same time, the final iteration error with 4 and 5 model layers are basically equal, while the computational complexity can be raised.Therefore, we focus on the relation of model performance and the number of hidden nodes with 4 model layers.Figure 3d exhibits the results with different number of hidden nodes, which shows that model 13-12-10-1 has acquired relatively better performance for our cloud detection issue.
Remote Sens. 2017, 9, 311 9 of 19 density map.Moreover, to demonstrate the efficiency of proposed model, both Landsat ETM+ and GF-1 images with different spatial resolution are applied in this experiment.

Parameter Analysis
The number of layers and hidden nodes can influence the result of cloud detection.In our previous experiments with Landsat ETM+ imagery, the number of the fundamental vector is 13 (including spectral, texture and structural information) and the numbers of the two hidden layers are 12 and 10, respectively, and a cloud density map is outputted as last.For convenience, we denote the model node number as 13-12-10-1.In this section, several experiments are designed to fully explore the relation between different model parameter combinations and cloud detection results.
Based on our model parameters set in Section 3.2, we conduct four groups of experiments in this section.The trend of iteration error and the iteration number are reported in Figure 3.In experiment (a), by comparing experiments with node number 13-12-1, 13-12-10-1 and 13-12-10-10-1, we can see that the convergent rate is getting quicker and the final iteration error is becoming lower with the increase of the model layers.Similarly, experiments (b) and (c) demonstrate the same regular pattern.In addition, it also shows that when the model layers increase from 3 to 4, the final iteration error falls along with it.At the same time, the final iteration error with 4 and 5 model layers are basically equal, while the computational complexity can be raised.Therefore, we focus on the relation of model performance and the number of hidden nodes with 4 model layers.Figure 3d exhibits the results with different number of hidden nodes, which shows that model 13-12-10-1 has acquired relatively better performance for our cloud detection issue.

Cloud Density Map Predication
In this section, we give some qualitative prediction results with our model.According to the parameter analysis in Section 4.1, we set the parameter combination of the FAEM for Landsat ETM+ imagery to be 13-12-10-1.The parameters and (in Section 3.3) in the membership function are set as 3 and 1, respectively.
Figure 4 shows the predicted cloud density map two randomly selected experimental images.The left column represents the original Landsat ETM+ images.The images in the second column are corresponding cloud density map.It can be seen that either thin cloud or thick cloud have been detected very well with our proposed FAEM.In addition, for images that contain many ma-made objects, which have similar spectral characteristics with clouds, the proposed model can still precisely detect the cloud without being affected.

Cloud Density Map Predication
In this section, we give some qualitative prediction results with our model.According to the parameter analysis in Section 4.1, we set the parameter combination of the FAEM for Landsat ETM+ imagery to be 13-12-10-1.The parameters k and α (in Section 3.3) in the membership function are set as 3 and 1, respectively.
Figure 4 shows the predicted cloud density map two randomly selected experimental images.The left column represents the original Landsat ETM+ images.The images in the second column are corresponding cloud density map.It can be seen that either thin cloud or thick cloud have been detected very well with our proposed FAEM.In addition, for images that contain many ma-made objects, which have similar spectral characteristics with clouds, the proposed model can still precisely detect the cloud without being affected.

Cloud Density Map Predication
In this section, we give some qualitative prediction results with our model.According to the parameter analysis in Section 4.1, we set the parameter combination of the FAEM for Landsat ETM+ imagery to be 13-12-10-1.The parameters and (in Section 3.3) in the membership function are set as 3 and 1, respectively.
Figure 4 shows the predicted cloud density map two randomly selected experimental images.The left column represents the original Landsat ETM+ images.The images in the second column are corresponding cloud density map.It can be seen that either thin cloud or thick cloud have been detected very well with our proposed FAEM.In addition, for images that contain many ma-made objects, which have similar spectral characteristics with clouds, the proposed model can still precisely detect the cloud without being affected.

Comparison with Other Methods
As most traditional methods regard the cloud detection as 0-1 classification problem [43][44][45], the classification results only contain cloud and non-cloud.For the convenience of quantitative comparison, we use a threshold 0.5 to cut the pixels whose cloud degrees in density map are smaller than the threshold to be non-cloud and the remains are regard as clouds.In this experiment, Landsat ETM+ images are considered to demonstrate the effectiveness of proposed approach.
In this experiment, we use 172 Landsat ETM+ images with manual labelled as ground truth and the results are compared with some other cloud detection methods such as Fmask, SVM and RF. Figure 5 shows the detection results of three images containing different underlying surfaces with different methods.Results show that regardless of images with vegetation (first row), water (second row) or snow (third row), the proposed method has shown stronger stability and better performance than Fmask, SVM and RF for cloud detection.Obviously, Fmask is seriously overestimated in the three images.For images with extensive vegetation in the first row, SVM is much more likely to detect some bright ground objects as cloud.While in the second-row images with water, SVM confuses some cloud as non-cloud.In particular, for the third-row images with snow, which is a challenging case in cloud detection since snow and cloud also have similar characteristics on remote sensing image, both SVM and RF obtain false results in such circumstance, SVM has regarded much more ice as cloud while RF only detects a part of the cloud from the snow.

Comparison with Other Methods
As most traditional methods regard the cloud detection as 0-1 classification problem [43][44][45], the classification results only contain cloud and non-cloud.For the convenience of quantitative comparison, we use a threshold 0.5 to cut the pixels whose cloud degrees in density map are smaller than the threshold to be non-cloud and the remains are regard as clouds.In this experiment, Landsat ETM+ images are considered to demonstrate the effectiveness of proposed approach.
In this experiment, we use 172 Landsat ETM+ images with manual labelled as ground truth and the results are compared with some other cloud detection methods such as Fmask, SVM and RF. Figure 5 shows the detection results of three images containing different underlying surfaces with different methods.Results show that regardless of images with vegetation (first row), water (second row) or snow (third row), the proposed method has shown stronger stability and better performance than Fmask, SVM and RF for cloud detection.Obviously, Fmask is seriously overestimated in the three images.For images with extensive vegetation in the first row, SVM is much more likely to detect some bright ground objects as cloud.While in the second-row images with water, SVM confuses some cloud as non-cloud.In particular, for the third-row images with snow, which is a challenging case in cloud detection since snow and cloud also have similar characteristics on remote sensing image, both SVM and RF obtain false results in such circumstance, SVM has regarded much more ice as cloud while RF only detects a part of the cloud from the snow.Figure 6 exhibits the Kappa and OA of 172 Landsat ETM+ images with Fmask, SVM, RF and proposed approach.The Kappa and OA for our proposed method, which are lying in the upper right corner, are dramatically higher than others.By comparison, the performance of Fmask is slightly unsatisfactory.The result of SVM and RF is very close, and most of their OA and Kappa are higher than 0.9 and 0.8 respectively.Figure 6 exhibits the Kappa and OA of 172 Landsat ETM+ images with Fmask, SVM, RF and proposed approach.The Kappa and OA for our proposed method, which are lying in the upper right corner, are dramatically higher than others.By comparison, the performance of Fmask is slightly unsatisfactory.The result of SVM and RF is very close, and most of their OA and Kappa are higher than 0.9 and 0.8 respectively.To further compare the detection results, four evaluation indicators RR, ER, FAR and RER are calculated in this experiment.The average results of 172 Landsat ETM+ images are shown in Table 4.We can see that for Fmask, RR is closer to 1, ER and FAR are 10 times larger than others, and the RER is least among all methods.This shows the performance of Fmask is overestimated: many non-cloud are regarded as cloud pixels.Meanwhile, the RR of our proposed method is high up to 0.86, which is more than SVM and RF, while the ER of the proposed method is much smaller.More importantly, the RER of our method is high, up to 29.50, while it is only 10.984, 21.84 and 23.02 for Fmask, SVM and RF, respectively.

Cloud Density Map Predication
To demonstrate that the proposed model is available for different imagery with different spatial resolutions, GF-1 imagery with 8 m spatial resolution is also be considered in this section.The fundamental vector dimension of GF-1 imagery is 9 (including spectral, texture and structure information) in this experiment.Similar to the parameter analysis in Section 4.1, we set the model parameter combination for GF-1 imagery to be 9-8-6-1.The parameters and (in Section 3.3) in the membership function are set as 3 and 1 respectively, which is the same with Landsat ETM+ imagery.
Similar as the experiments on Landsat ETM+ imagery, two randomly selected images and their corresponding detection results are shown in Figure 7.The left column represents the original GF-1 images and the images in second column are their corresponding cloud density map.It shows that for GF-1 imagery with higher spatial resolution, the proposed model still work well for accurate cloud detection.In Figure 7, we can see that both thin cloud and thick cloud have been well detected.Larger value represents that the pixels contains more cloud composition.For the value equals to 1, it means To further compare the detection results, four evaluation indicators RR, ER, FAR and RER are calculated in this experiment.The average results of 172 Landsat ETM+ images are shown in Table 4.We can see that for Fmask, RR is closer to 1, ER and FAR are 10 times larger than others, and the RER is least among all methods.This shows the performance of Fmask is overestimated: many non-cloud are regarded as cloud pixels.Meanwhile, the RR of our proposed method is high up to 0.86, which is more than SVM and RF, while the ER of the proposed method is much smaller.More importantly, the RER of our method is high, up to 29.50, while it is only 10.984, 21.84 and 23.02 for Fmask, SVM and RF, respectively.

Cloud Density Map Predication
To demonstrate that the proposed model is available for different imagery with different spatial resolutions, GF-1 imagery with 8 m spatial resolution is also be considered in this section.The fundamental vector dimension of GF-1 imagery is 9 (including spectral, texture and structure information) in this experiment.Similar to the parameter analysis in Section 4.1, we set the model parameter combination for GF-1 imagery to be 9-8-6-1.The parameters k and α (in Section 3.3) in the membership function are set as 3 and 1 respectively, which is the same with Landsat ETM+ imagery.
Similar as the experiments on Landsat ETM+ imagery, two randomly selected images and their corresponding detection results are shown in Figure 7.The left column represents the original GF-1 images and the images in second column are their corresponding cloud density map.It shows that for GF-1 imagery with higher spatial resolution, the proposed model still work well for accurate cloud detection.In Figure 7, we can see that both thin cloud and thick cloud have been well detected.Larger value represents that the pixels contains more cloud composition.For the value equals to 1, it means that the pixel is pure cloud.Conversely, the smaller the value is, the thinner the cloud is likely to be at the point.that the pixel is pure cloud.Conversely, the smaller the value is, the thinner the cloud is likely to be at the point.

Comparison with Other Methods
Twenty-five GF-1 images are considered in this experiment.To save space, we display only three groups of detection results with different methods.Figure 8 shows the pseudo color images combined with band 4, 3, and 2. Clouds in these images have quite different shapes and thicknesses.From the left to right of Figure 8, it shows original image and the detection results with SVM, RF and the proposed method, respectively.For images with extensive mountainous area, all three methods have acquired relatively satisfactory detection results.However, when there are many buildings and roads in the image, SVM and RF, which simply use artificially designed primary features, have difficulty.The advantage of our proposed method is well shown in this case.

Comparison with Other Methods
Twenty-five GF-1 images are considered in this experiment.To save space, we display only three groups of detection results with different methods.Figure 8 shows the pseudo color images combined with band 4, 3, and 2. Clouds in these images have quite different shapes and thicknesses.From the left to right of Figure 8, it shows original image and the detection results with SVM, RF and the proposed method, respectively.For images with extensive mountainous area, all three methods have acquired relatively satisfactory detection results.However, when there are many buildings and roads in the image, SVM and RF, which simply use artificially designed primary features, have difficulty.The advantage of our proposed method is well shown in this case.In Figure 9, we can see the Kappa and OA of cloud detection results on GF-1 images.Most of our results appear in the upper right corner.It shows that higher accuracy can be achieved in our method.After the previous qualitative comparison, we now focus on a quantitative comparison of our method as shown in Table 5.Similar with the analyses of Landsat ETM+ images, the four same In Figure 9, we can see the Kappa and OA of cloud detection results on GF-1 images.Most of our results appear in the upper right corner.It shows that higher accuracy can be achieved in our method.In Figure 9, we can see the Kappa and OA of cloud detection results on GF-1 images.Most of our results appear in the upper right corner.It shows that higher accuracy can be achieved in our method.After the previous qualitative comparison, we now focus on a quantitative comparison of our method as shown in Table 5.Similar with the analyses of Landsat ETM+ images, the four same After the previous qualitative comparison, we now focus on a quantitative comparison of our method as shown in Table 5.Similar with the analyses of Landsat ETM+ images, the four same evaluation indicators RR, ER, FAR and RER are used for GF-1 assessment.It can be seen from Table 5 that our proposed method has acquired much higher right rate than SVM and RF, while the error rate of the proposed method is much smaller.In particular, The RER with our method is high, up to 25.94, while for SVM and RF it is only 19.15 and 17.75, respectively.

Analysis of Feature Combination
In this section, experimental results with three different feature combinations of spectra, spectra + texture and spectra + texture + structure are reported.Three images containing thick cloud and thin cloud of different surface earth are considered.As Figure 10 shows, from left to right, the cloud detection results with feature combinations of spectra, spectra + texture and spectra + texture + structure are listed.Compared with the original images, we can see that the detection results become better along with the increasing of features.Obviously, the values of RR and RER increase along with the number of features, while the ER values decrease.

Analysis of Bands Necessity
In our pervious experiments, both visible band and infrared band are considered.However, cloud has high reflectance in visible band while its reflectance in infrared band is low.Due to the characteristics of high light and low temperature, cloud commonly shows different characteristics in visible and infrared bands.In this section, we aim to explore the effects of the infrared band for cloud detection.
Similar to Section 5.1, different feature combinations of spectra, spectra + texture, spectra + texture + structure are considered in the experiments.Nonetheless, we remove the infrared band in the experiments for comparison.Table 7 shows the average value of four metrics of 172 experiment Landsat ETM+ images.Compared with the experiments using all bands in Section 5.1, the detection accuracy with feature combination of spectra + texture and spectra + texture + structure do not change much, while, for experiments that only consider the spectral information, infrared band plays a relatively important role.

Analysis of Bands Necessity
In our pervious experiments, both visible band and infrared band are considered.However, cloud has high reflectance in visible band while its reflectance in infrared band is low.Due to the characteristics of high light and low temperature, cloud commonly shows different characteristics in visible and infrared bands.In this section, we aim to explore the effects of the infrared band for cloud detection.
Similar to Section 5.1, different feature combinations of spectra, spectra + texture, spectra + texture + structure are considered in the experiments.Nonetheless, we remove the infrared band in the experiments for comparison.Table 7 shows the average value of four metrics of 172 experiment Landsat ETM+ images.Compared with the experiments using all bands in Section 5.1, the detection accuracy with feature combination of spectra + texture and spectra + texture + structure do not change much, while, for experiments that only consider the spectral information, infrared band plays a relatively important role.

Conclusions
This study has presented a new cloud detection method.The advantages of the proposed method are integrating the feature learning ability of stacked autoencode networks and the detection ability of fuzzy function to acquire good performance on cloud detection.To validate the effectiveness of our method, 172 Landsat ETM+ images and 25 GF-1 images are used in this paper.Experimental results demonstrate that the proposed approach has achieved relatively higher detection accuracy compared with several state-of-the-art cloud detection methods (Fmask, SVM, and RF).Furthermore, we experimentally demonstrate that feature combination of spectral + texture + structure has attained better performance than single feature.Our proposed method is applicable in a variety of scenarios and is reliable in different resolution images.Generally, the proposed approach can potentially yield better results in terms of detection accuracy compared with related approaches, and is not limited by image resolution.
To fully consider the spatial and spectral information for a better cloud detection result, three fundamental features, spectral, texture and structure features, are applied in this work as the basic information to learn a deep discriminative feature.However, these texture and structure features are still manually selected and may not contain enough information.In future study, we will consider applying convolutional network for cloud detection.In addition, as convolutional network extracts feature with convolution kernel by integrating the local spatial information, the global constraint information will also be combined for accurate cloud detection.
This study has presented a new cloud detection method.The advantages of the proposed method are integrating the feature learning ability of stacked autoencode networks and the detection ability of fuzzy function to acquire good performance on cloud detection.To validate the effectiveness of our method, 172 Landsat ETM+ images and 25 GF-1 images are used in this paper.Experimental results demonstrate that the proposed approach has achieved relatively higher detection accuracy compared with several state-of-the-art cloud detection methods (Fmask, SVM, and RF).Furthermore, we experimentally demonstrate that feature combination of spectral + texture + structure has attained better performance than single feature.Our proposed method is applicable in a variety of scenarios and is reliable in different resolution images.Generally, the proposed approach can potentially yield better results in terms of detection accuracy compared with related approaches, and is not limited by image resolution.
To fully consider the spatial and spectral information for a better cloud detection result, three fundamental features, spectral, texture and structure features, are applied in this work as the basic information to learn a deep discriminative feature.However, these texture and structure features are still manually selected and may not contain enough information.In future study, we will consider applying convolutional network for cloud detection.In addition, as convolutional network extracts feature with convolution kernel by integrating the local spatial information, the global constraint information will also be combined for accurate cloud detection.

Figure 2 .
Figure 2. The framework of proposed method.

Figure 2 .
Figure 2. The framework of proposed method.

Figure 3 .
Figure 3. (a-d)The correlation of iteration numbers and iteration error.

Figure 4 .
Figure 4. (a-d): (a,c) The original image of Landsat ETM+ images; and (b,d) the corresponding cloud density.

Figure 3 .
Figure 3. (a-d) The correlation of iteration numbers and iteration error.

Figure 3 .
Figure 3. (a-d)The correlation of iteration numbers and iteration error.

Figure 4 .
Figure 4. (a-d): (a,c) The original image of Landsat ETM+ images; and (b,d) the corresponding cloud density.

Figure 4 .
Figure 4. (a-d): (a,c) The original image of Landsat ETM+ images; and (b,d) the corresponding cloud density.

Figure 6 .
Figure 6.Kappa and OA of 172 Landsat ETM+ images with Fmask, SVM, RF and proposed approach.

Figure 6 .
Figure 6.Kappa and OA of 172 Landsat ETM+ images with Fmask, SVM, RF and proposed approach.

Figure 7 .
Figure 7. (a,c) The original image of GF-1 images; and (b,d) the corresponding cloud density of (a,c) with proposed approach.

Figure 7 .
Figure 7. (a,c) The original image of GF-1 images; and (b,d) the corresponding cloud density of (a,c) with proposed approach.

Figure 8 .
Figure 8.(a) Pseudo color GF-1 image with combination of band 4-3-2; and (b-d) the cloud detection results with SVM, RF and our proposed approach, respectively.

Figure 9 .
Figure 9. Kappa and OA of 25 GF-1 images with SVM, RF and proposed approach.

Figure 8 .
Figure 8.(a) Pseudo color GF-1 image with combination of band 4-3-2; and (b-d) the cloud detection results with SVM, RF and our proposed approach, respectively.

Figure 8 .
Figure 8.(a) Pseudo color GF-1 image with combination of band 4-3-2; and (b-d) the cloud detection results with SVM, RF and our proposed approach, respectively.

Figure 9 .
Figure 9. Kappa and OA of 25 GF-1 images with SVM, RF and proposed approach.

Figure 9 .
Figure 9. Kappa and OA of 25 GF-1 images with SVM, RF and proposed approach.

Figure 10 .
Figure 10.Visual comparisons of detection results with different fundamental features combination: (a) the original image with band combination 3-4-1; and (b-d) detected results of spectra, spectra + texture, and spectra + texture+ structure, respectively.

For a clearer
comparison, Figures11 and 12show the average Kappa and OA of the 172 experimental Landsat ETM+ images respectively.The blue bars represent the experimental accuracy using all bands and the red bars are the detection accuracy without infrared band.Data in Figures11 and 12also show the same conclusion that the infrared band is necessary for cloud detection with only spectral information.As for experiments combined with spatial information (texture and structure), which constrains the spatial consistency of the detection result, the influence of infrared band becomes weak.

For a clearer
comparison, Figures 11 and 12 show the average Kappa and OA of the 172 experimental Landsat ETM+ images respectively.The blue bars represent the experimental accuracy using all bands and the red bars are the detection accuracy without infrared band.Data in Figures 11 and 12 also show the same conclusion that the infrared band is necessary for cloud detection with only spectral information.As for experiments combined with spatial information (texture and structure), which constrains the spatial consistency of the detection result, the influence of infrared band becomes weak.

Figure 11 .
Figure 11.Average Kappa of 172 Landsat ETM+ images with different feature combination.

Figure 11 .
Figure 11.Average Kappa of 172 Landsat ETM+ images with different feature combination.

Figure 12 .
Figure 12.Average OA of 172 Landsat ETM+ images with different feature combination.

Table 1 .
Detailed information of experimental data and cloud cover.

Table 2 .
Number of images with different cloud covers.

Table 1 .
Detailed information of experimental data and cloud cover.

Table 2 .
Number of images with different cloud covers.

Table 3 .
Experimental data sources and bands information.

Table 4 .
Cloud detection accuracy for SVM, RF and our proposed method with Landsat images.

Table 4 .
Cloud detection accuracy for SVM, RF and our proposed method with Landsat images.

Table 5 .
Cloud detection accuracy for SVM, RF and our proposed method with GF-1.

Table 6
illustrates the ER, RR, FAR and RER of testing images with different feature combinations.

Table 6 .
RR, ER, and RER with different features combination.

Table 7 .
RR, ER, and RER with different features combination.

Table 6
illustrates the ER, RR, FAR and RER of testing images with different feature combinations.Obviously, the values of RR and RER increase along with the number of features, while the ER values decrease.

Table 6 .
RR, ER, and RER with different features combination.

Table 7 .
RR, ER, and RER with different features combination.