A Novel Robust Classification Method for Ground-Based Clouds

Abstract: Though the traditional convolutional neural network achieves a high recognition rate in cloud classification, it has poor robustness when classifying occluded clouds. In this paper, we propose a novel scheme for cloud classification, in which convolutional neural networks are used for feature extraction and a weighted sparse representation coding is adopted for classification. Three such algorithms are proposed. Experiments are carried out on the multimodal ground-based cloud dataset, and the results show that, in the case of occlusion, the accuracy of the proposed methods is much improved over traditional convolutional neural network-based algorithms.


Introduction
Recently, with the expansion of big data, combined with the improvement in algorithms and the exponential growth in computing power, machine learning has become a research focus. Machine learning, especially deep learning, is widely used in computer vision, speech recognition, natural language processing, data mining and meteorological information processing.
In the early days, Buch et al. extracted features from cloud images, including texture measures, location information and pixel brightness, and then used binary decision trees to divide them into four kinds of clouds: altocumulus, cirrus, cumulus and stratiform [1]. Singh et al. proposed a procedure to test texture features for the automatic training of a cloud classifier, in which five feature extraction methods were examined, namely autocorrelation, co-occurrence matrices, edge frequency, Laws' features and primitive length; these tests help us better understand the advantages and disadvantages of different feature extraction methods and classification techniques [2]. Heinle et al. made use of a k-nearest neighbor classifier to distinguish 7 kinds of cloud images by extracting color and texture features from the images [3]. Neto et al. used multivariate color space features to classify cloud and sky pixels using statistical features of the Euclidean geometric distance [4]. Liu et al. proposed a multiple random projection algorithm to obtain textons and discriminative features [5]. However, the accuracy of these cloud classification methods based on manually designed features is far from expectation, and features must be carefully hand-crafted for each classification task.
Cloud classification plays a key role in a wide range of applications, such as solar power generation, weather forecasting [6,7], the deep space climate observatory mission [8], rainfall estimation [9] and optical remote sensing [10]. However, cloud classification is a challenging task. Different clouds exhibit different meteorological features, such as temperature, humidity, pressure, wind speed and color distribution. To achieve cloud classification, it is necessary to accurately estimate the characteristics of a cloud from its shape, thickness, degree of sparseness and other features. However, traditional cloud classification [11-13] is performed by professional observers. Besides consuming a lot of labor, it is also prone to human error, and it is difficult to classify clouds with satisfactory accuracy when there exist interferences such as sunlight, fog and many others.
Recently, convolutional neural networks (CNNs) have been widely used in image classification and recognition applications. Such networks have many advantages. First, they do not need manually extracted features: CNNs automatically extract features from many training pictures and yield a very good performance. Secondly, they can handle large amounts of data well and retain strong discriminative power as the data grow, so they are a good choice for dealing with the ever-changing clouds. LeCun et al. proposed a CNN model, known as LeNet-5, yielding an accuracy of handwritten digit recognition as high as 99% [14]. Krizhevsky proposed a deep neural network model, denoted AlexNet, which is much larger than LeNet and changed the sigmoid activation function to the simpler ReLU activation function. Such a model makes training easier under different parameter initialization methods and can reduce overfitting by dropout layers [15]. In [16], an efficient network, called Inception-v3, was proposed, but the deeper network is more difficult to train. He et al. proposed a residual learning framework to simplify the training of deeper networks; the resulting network, denoted ResNet, takes residual blocks as its basic building blocks and can prevent gradients from vanishing [17]. Liu et al. used multimodal information and visual features to achieve cloud classification through a support vector machine [18], and later proposed a fusion of convolutional neural networks, including a visual sub-network and a multimodal sub-network; after extracting visual and multimodal features, a weighted integration was used for cloud classification [19]. Although the recognition accuracy of these CNN-based cloud classification methods is very high, they have a common weak point: a lack of robustness to perturbations.
Compressed sensing (CS) [20,21] is a recently developed technology based on the concept of sparse representation of signals [22]. Let x ∈ R^{n×1} be a signal vector. It is called κ-sparse in a matrix D ∈ R^{n×m} if x = Dα with ||α||_0 ≤ κ, where D is called the dictionary and ||α||_0 denotes the number of non-zero entries in the coefficient vector α ∈ R^{m×1}. Sparse representation has been used in the fields of image processing [23] and face recognition [24,25]. A weighted sparse representation was proposed in [26], and simulations show that it is more robust to occlusion, corruption and other interferences. Gan et al. [27] proposed a dual-norm bounded sparse coding, which classifies cloud images by extracting nine-dimensional features, including five color features and four statistical features, but they did not carry out a robustness analysis.
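To make the sparse representation concrete, here is a minimal, self-contained sketch of coding a κ-sparse signal x = Dα with a hand-rolled orthogonal matching pursuit; the dictionary, signal and dimensions are synthetic illustrations, not data from this paper.

```python
import numpy as np

# A signal is kappa-sparse in dictionary D if it is a combination of kappa columns.
rng = np.random.default_rng(0)
n, m, kappa = 8, 20, 2

D = rng.standard_normal((n, m))
D /= np.linalg.norm(D, axis=0)          # unit-norm columns (atoms)

alpha_true = np.zeros(m)
alpha_true[[3, 11]] = [1.5, -2.0]       # only kappa = 2 non-zero coefficients
x = D @ alpha_true                      # the signal is exactly 2-sparse in D

def omp(D, x, kappa):
    """Orthogonal matching pursuit: greedily pick kappa atoms."""
    residual, support = x.copy(), []
    for _ in range(kappa):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha

alpha_hat = omp(D, x, kappa)            # recovered sparse code
```

The greedy pursuit here is only one of several solvers for this problem; the weighted variant used later in the paper modifies the residual term rather than the solver.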
In this paper, we investigate the classification of ground-based clouds observed by ground-based cameras. The main contributions of this paper are as follows:
• A novel structure for robust cloud classification is proposed, in which the features are extracted by convolutional neural networks and the classification is executed using the weighted sparse representation;
• A two-channel neural network is proposed for extracting features of the ground-based clouds.
Experimental results show that our proposed method yields a satisfactory performance for classifying ground-based clouds. In particular, the robustness of the proposed two-channel neural network-based classifier is better than that of a single convolutional neural network under mild occlusion.
The rest of this paper is organized as follows: in Section 2, we briefly introduce some related work; in Section 3, we describe the novel robust cloud classification structure and the specific algorithm flow; Section 4 presents the experimental analysis; and Section 5 concludes the paper with some remarks.

Related Work
Now, we introduce some existing results that will be used for our proposed robust classification algorithm for ground-based clouds.

ResNet Model
It has been observed that after adding too many layers to a CNN, the training error tends to rise. Though the numerical stability brought by batch normalization makes it easier to train deeper networks, the problem still exists. The ResNet was proposed to address this problem. The residual block is the basic building block of the ResNet. In a residual block, the input can propagate forward faster through the cross-layer shortcut connection, and the occurrence of vanishing gradients can be reduced. Two additional major benefits of rectified linear units (ReLU) are sparsity and a reduced likelihood of vanishing gradients [17]; see Figure 1. Table 1 shows the parameters involved in the ResNet-50 model. Table 1. Architecture of the ResNet-50: c is the number of channels, m × n is the dimension of the data, and k × k is the convolution kernel size.
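The identity shortcut described above can be sketched in a few lines; this toy NumPy version (with illustrative shapes, not the ResNet-50 layers of Table 1) shows that when the residual branch F(x) outputs zero, the block degenerates to ReLU(x), so the input still propagates forward unchanged.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, W1, W2):
    # two "weight layers" with a ReLU in between, as in the basic ResNet block
    f = W2 @ relu(W1 @ x)     # residual branch F(x)
    return relu(f + x)        # shortcut connection adds the input back

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
W1 = rng.standard_normal((4, 4))
W2 = np.zeros((4, 4))          # extreme case: the residual branch learns nothing

y = residual_block(x, W1, W2)  # with F(x) = 0 the block reduces to ReLU(x)
```

Because the shortcut carries the input additively, gradients also flow back through it directly, which is the mechanism that eases the training of very deep networks.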


Inception Model
The efficient convolutional neural network based on the Inception model, proposed in [16], consists of a set of basic convolution blocks, i.e., the Inception blocks. The naive version of such a network is shown in Figure 2. As shown in Figure 2, k × k convolutions, for k = 1, 3 and 5 (k × k being the two-dimensional convolution kernel size), are employed to extract information at different spatial scales. However, this structure increases the number of outputs from stage to stage [28].
The Inception module with dimension reduction is shown in Figure 3, where 1 × 1 convolutions are used to compute reductions before the expensive 3 × 3 and 5 × 5 convolutions. Besides being used for dimension reduction, they also include rectified linear activation, which makes them dual-purpose. In general, an Inception network consists of modules of the above type stacked upon each other, with occasional max-pooling layers of stride 2 to halve the resolution of the grid, thereby reducing the computational complexity and improving performance [28]. Table 2 shows the parameters involved in the Inception-v3 model. Table 2. Architecture of the Inception-v3: c is the number of channels, m × n is the dimension of the data, and k × k is the convolution kernel size.
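A quick back-of-the-envelope multiply count shows why the 1 × 1 reductions pay off for one 5 × 5 branch; the channel numbers below (192 in, 32 out, 16 reduced) are illustrative, not taken from Table 2.

```python
# Multiplications for one 5x5 branch on a 28x28 feature map,
# with and without a 1x1 "reduction" convolution in front.
h = w = 28
c_in, c_out, c_red = 192, 32, 16

naive = h * w * (5 * 5 * c_in) * c_out                       # direct 5x5 conv
reduced = h * w * ((1 * 1 * c_in) * c_red                    # 1x1 reduction
                   + (5 * 5 * c_red) * c_out)                # then 5x5 conv

print(naive, reduced)   # the reduced branch needs far fewer multiplications
```

For these illustrative sizes the reduction cuts the multiply count by roughly an order of magnitude, which is why the module stays affordable even when stacked many times.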

Robust Sparse Coding-Based Classification
The traditional sparse coding problem is formulated as

min_α ||y − Dα||_2  subject to  ||α||_0 ≤ κ,    (1)

where κ is the prescribed sparsity level. This coding model implicitly assumes that the coding residuals e = y − Dα follow a Gaussian or Laplacian distribution. When applied to image recognition, each image is represented by a (feature) vector x ∈ R^{n×1}, the columns of the dictionary D ∈ R^{n×m} are formed from the m training images, and it is assumed that a test image y can be well represented as a linear combination of training samples with few terms involved, i.e., no more than κ, as (1) suggests.
In practice, the residual e = y − Dα may be far from Gaussian or Laplacian, especially when the images contain occlusion or damage, so the traditional sparse coding model has poor robustness. To improve the robustness, we must make good use of those pixels that are not perturbed and neglect those greatly distorted in the sparse coding-based classification (1). This can be done by assigning each pixel a weighting factor, leading to the following robust sparse coding problem:

min_α ||W(y − Dα)||_2  subject to  ||α||_0 ≤ κ,    (2)

where the diagonal matrix W is selected based on the residual vector ê in such a way that if |ê(i)| < |ê(j)|, then W(i, i) > W(j, j). Here, we choose W with

W(i, i) = exp(βϕ − βê(i)^2) / (1 + exp(βϕ − βê(i)^2)),    (3)

where the parameters β, ϕ are determined by experiments [26]. Usually, W is normalized so that trace(W^2) = 1; see the next section. The class that image y belongs to, denoted j(y), is then determined by

j(y) = arg min_k ||W(y − D_k α_k)||_2,    (4)

where α_k collects the entries of α associated with the kth class.
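A small sketch of the logistic weighting described above, with illustrative values for β and ϕ (the paper determines them experimentally): larger residuals get smaller diagonal weights, and the matrix is normalized so that trace(W²) = 1.

```python
import numpy as np

beta, phi = 8.0, 0.5                            # illustrative, not tuned values

def weights(e):
    # logistic weight in (0, 1): decreasing in the squared residual
    s = beta * phi - beta * e ** 2
    return np.exp(s) / (1.0 + np.exp(s))

e_hat = np.array([0.05, 0.2, 0.9])              # small, medium, large residuals
w = weights(e_hat)                              # monotonically decreasing

W = np.diag(w)
W /= np.linalg.norm(W, "fro")                   # normalize so trace(W^2) = 1
```

The threshold ϕ marks the residual magnitude at which the weight drops through 1/2, and β controls how sharply it drops.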

Our Proposed Methods
We propose a novel robust classification scheme for ground-based clouds. The basic idea is to use CNNs for feature extraction and sparse coding for classification.

A Two-Channel-Neural Network-Based Feature Extraction
A more sophisticated approach is depicted in Figure 4, which contains two parts. The first is for feature extraction, which converts an image into a (feature) vector z. The second is for classifying the feature vector z. As each CNN performs differently under different sets of conditions, an intuitive idea is to fuse the features obtained with more than one CNN. Here, we propose a scheme, shown in Figure 4, where two CNNs are used, yielding feature vectors y_i ∈ R^{n_i×1} for i = 1, 2 that are simply stacked as y = [y_1; y_2] ∈ R^{(n_1+n_2)×1}.
Let n_1 + n_2 = 2n. The feature selection system aims to convert the feature vector y of dimension 2n into a vector z of dimension n. Here, we propose such a system consisting of two projections, P_I and P_E, i.e., z = P_E(P_I(y)), where the two projections are determined from the training samples Y = [Y_1 Y_2 · · · Y_K] ∈ R^{2n×m}, with Y_k ∈ R^{2n×m_k} the feature matrix of the kth class, for k = 1, 2, · · · , K.
• P_I — For a given Y_k ∈ R^{2n×m_k}, denote the mean and variance vectors of its rows computed with (7), and let V_Y ∈ R^{2n×1} be the variance vector determined by the matrix Y with (7). Let {V_Y(j_p)}, with j_p < j_{p+1} for p = 1, 2, · · · , 1.5n, be the set of the 1.5n smallest entries of V_Y; then the projection P_I such that z̃ = P_I(y) is given by z̃(p) = y(j_p), ∀ p ≤ 1.5n. Thus, Y_k → Z̃_k = P_I(Y_k) for k = 1, 2, · · · , K, i.e., Y → Z̃ = P_I(Y). As understood, the projection P_I intends to keep those entries of the feature vector y that are clustered within each of the classes.
• P_E — With the class vectors V_k collected as a matrix in R^{1.5n×K}, we can compute the mean and variance vectors, denoted v_* , v̄_* ∈ R^{1.5n×1}, using (7). Let {v̄_*(j_p)}, with j_p < j_{p+1} for p = 1, 2, · · · , n, be the set of the n largest entries of v̄_*; then the projection P_E such that z = P_E(z̃) is given by z(p) = z̃(j_p), ∀ p ≤ n.
Unlike P I , P E aims at enhancing the discrimination between the classes by keeping those entries of vectorz that are of a big variance.
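The selection mechanics of the two projections can be sketched as follows; the dimensions and data are synthetic, and the plain variance computations stand in for the Equation (7) referenced in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4                                   # target feature dimension
K, m_k = 3, 10                          # classes and samples per class

# stacked two-channel features: 2n rows, one column per training sample
Y = [rng.standard_normal((2 * n, m_k)) for _ in range(K)]

# P_I: rank entries by (summed) within-class variance, keep the 1.5n smallest,
# i.e., the entries that stay clustered inside each class
within_var = sum(y.var(axis=1) for y in Y)
keep_I = np.sort(np.argsort(within_var)[: int(1.5 * n)])

# P_E: among the kept entries, keep the n whose class means vary the most,
# i.e., the entries most discriminative between classes
means = np.stack([y[keep_I].mean(axis=1) for y in Y], axis=1)   # 1.5n x K
between_var = means.var(axis=1)
keep_E = np.sort(np.argsort(between_var)[-n:])

def select(y):
    # z = P_E(P_I(y)) for a stacked 2n-dimensional feature vector y
    return y[keep_I][keep_E]

z = select(rng.standard_normal(2 * n))   # n-dimensional selected feature
```

Both projections are plain index selections, so they cost nothing at test time once the index sets are fixed from the training features.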

Robust Sparse Coding with Extended Dictionary
As assumed before, there are K different classes of clouds to be considered. The dictionary for robust sparse coding is formed from the feature vectors of the training samples. Instead of using one sample per class, we make use of all the m_k training pictures of the kth class. Precisely speaking, the kth sub-dictionary, denoted D_k, is given by

D_k = [z_{k,1}  z_{k,2}  · · ·  z_{k,m_k}] ∈ R^{n×m_k},    (9)

where z_{k,i} is the feature vector of the ith training sample of the kth class, and the total dictionary D ∈ R^{n×m} with m = ∑_{k=1}^{K} m_k, named the extended dictionary, is formed with {D_k}:

D = [D_1  D_2  · · ·  D_K].    (10)

The optimal sparse representation-based robust classification (2) is usually attacked via the following problem (see [20,21]):

min_α ||α||_1  subject to  ||W(z − Dα)||_2 ≤ ε,    (11)
where W is the weighting matrix, which is diagonal and a function of e = z − Dα.
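Building the extended dictionary is a simple column-wise concatenation of the per-class feature matrices; a sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 6, 3                             # feature dimension and class count
m_ks = [4, 5, 3]                        # m_k training samples per class

# each sub-dictionary D_k stacks the class-k training feature vectors as columns
subdicts = [rng.standard_normal((n, m_k)) for m_k in m_ks]

# extended dictionary: n x m with m = sum of the m_k
D = np.concatenate(subdicts, axis=1)
```

Keeping every training sample as a column is what lets the sparse code select the few training vectors that best explain a test feature.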
The problem defined by (11) is an alternative version of (2) and is very difficult to solve because W is a function of e = z − Dα. In practice, the weighting matrix W is obtained with an iterative procedure in which W is updated based on the previous estimate of α. Let α^(l) be the estimate of α at the lth iteration of Algorithm 1 below, and denote

ê^(l) = z − Dα^(l).    (12)

The weighting matrix is then updated with the above ê^(l) via (3), i.e.,

W^(l)(i, i) = exp(βϕ − β(ê^(l)(i))^2) / (1 + exp(βϕ − β(ê^(l)(i))^2)),    (13)

and W^(l) is then normalized with

W^(l) ← W^(l) / ||W^(l)||_F,    (14)

with ||·||_F denoting the Frobenius norm.
With the obtained α = [α_1; · · · ; α_k; · · · ; α_K] and W, compute the class-wise residuals

r_k(z) = ||W(z − D_k α_k)||_2, k = 1, 2, · · · , K,    (15)

and hence the class of the test image represented by z is determined with

j(z) = arg min_k r_k(z).    (16)

The entire procedure of the proposed WSR algorithm is outlined in Algorithm 1 (Outline of Robust Sparse Representation-based Classification).

Experiments
In this section, we present some experimental results to examine our proposed approach. We first introduce the dataset and setup used in the experiments, then present the experimental results and discussions.

Dataset
The multimodal ground-based cloud dataset (MGCD) [19], collected in China, mainly contains two kinds of ground-based cloud information, i.e., cloud images and multimodal cloud information. The cloud images, with a size of 1024 × 1024 pixels, are captured at different times by a sky camera with a fisheye lens. The fisheye lens provides a wide-range observation of the sky conditions, with horizontal and vertical angles of 180 degrees. The multimodal cloud information is collected by a weather station and includes temperature, humidity, pressure, wind speed, maximum wind speed and average wind speed. Each cloud image corresponds to a set of multimodal data. The 8000 pictures used are classified into 7 classes of clouds, including cumulus, altocumulus, cirrus, clear sky, stratocumulus and mixed. In addition, it should be noted that cloud images with less than 10% cloud cover belong to the clear sky class. The detailed distribution of samples for each class is illustrated in Table 3. Table 3. Distribution of the 8000 samples from the MGCD.

Figure 5 shows three samples for each of the 7 classes of clouds. The occluded testing samples are demonstrated in Figure 6.

Parameter Setting
A random partition ensures that the learned features will be uniformly distributed, while a non-random allocation leads to unevenly distributed samples, which seriously affects the convergence of the network in the training phase and its generalization in the testing phase. Therefore, the dataset is randomly divided into a training set and a testing set. The former contains 2/3 of the cloud samples of each class, and the latter contains the remaining 1/3.
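The per-class 2/3–1/3 random split can be sketched as follows; the file names and class counts are placeholders, not actual MGCD sample names.

```python
import random

random.seed(0)
# placeholder sample lists, one per class (stand-ins for MGCD images)
samples_by_class = {"cumulus": [f"cu_{i}.jpg" for i in range(9)],
                    "cirrus": [f"ci_{i}.jpg" for i in range(6)]}

train, test = [], []
for cls, files in samples_by_class.items():
    files = files[:]                 # copy before shuffling
    random.shuffle(files)            # random partition within the class
    cut = round(len(files) * 2 / 3)  # 2/3 of each class goes to training
    train += [(f, cls) for f in files[:cut]]
    test += [(f, cls) for f in files[cut:]]
```

Splitting inside each class (rather than over the pooled dataset) is what keeps the class proportions identical in the two sets.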
All experiments are carried out with the same experimental setup. We use transfer learning to train the Inception-v3 and ResNet-50 convolutional neural network models. We change the size of the fully connected output layer to fit the number of cloud categories, and the input cloud samples are automatically cropped to the input layer size.
A small-batch stochastic gradient descent method is used to adjust the parameters continuously; we set the mini-batch size to 10, the maximum number of epochs to 6, the initial learning rate to 0.00001, and the validation frequency to 250.
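Gathered as a configuration sketch (the key names are illustrative, not from the authors' code), the hyper-parameters above read:

```python
# Training hyper-parameters from the text, collected in one place.
config = {
    "optimizer": "sgd",              # small-batch stochastic gradient descent
    "mini_batch_size": 10,
    "max_epochs": 6,
    "initial_learning_rate": 1e-5,
    "validation_frequency": 250,     # iterations between validation passes
}
```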

Results
We will examine five methods for classifying ground-based clouds, among which there are two existing ones:
• ICNN — using the Inception-v3 convolutional neural network (ICNN) for feature extraction and classification;
• RCNN — similar to ICNN, but the CNN used is the ResNet-50 convolutional neural network.
Three methods that we propose:
• IWSRC — using the Inception-v3 convolutional neural network for feature extraction and the weighted sparse representation coding for classification;
• RWSRC — exactly the same as IWSRC but with the Inception-v3 CNN replaced by the ResNet-50;
• RIWSRC — the proposed two-channel CNN-based sparse representation coding method, depicted in Figure 4.
The training loss of the convolutional neural networks is displayed in Figure 7, and the training accuracy is shown in Figure 8. Table 4 shows the accuracy of all the methods used in this paper without occlusion, and Table 5 shows the accuracy of each method when the occluded cloud samples are used.

Discussion
In Figure 7, the overall trend is downward, and the loss of RCNN is generally lower than that of ICNN. In Figure 8, the recognition rate of RCNN rises faster than that of ICNN.
As seen in Table 4, the RCNN performs better than the ICNN, since the ResNet-50 improves gradient propagation and hence is more effective in cloud classification. As for the proposed methods IWSRC, RWSRC and RIWSRC, they are all better than ICNN and RCNN, because the noise interference can be suppressed by the weighted representation learning during classification; RIWSRC achieves the best performance thanks to the optimized feature selection process. The testing cloud images are occluded by 5%, 10%, 15%, 20% and 25%, respectively.
In Table 5, the experimental results under various degrees of occlusion show that the proposed IWSRC and RWSRC are better than ICNN and RCNN. The neural network is sensitive to perturbation by occlusion, while the weighted sparse representation (WSR) adjusts the weights according to the size of the error: a larger error leads to a smaller weight, which helps to reduce the effect of the perturbation on classification. In general, the proposed RIWSRC is more robust and effective than both IWSRC and RWSRC; the fusion of multiple features can further improve the robustness of the algorithm. The code is now publicly available at https://github.com/tangming666/NRC (accessed on 18 July 2021).

Conclusions
A novel robust classification scheme for ground-based clouds is proposed in this paper. The basic idea is to use CNNs for feature extraction and a weighted sparse representation coding for classification. Two classification algorithms are proposed directly along this line, and a third is based on the fusion of two CNNs. Experimental results show that the three proposed algorithms greatly enhance robustness. The proposed methods can be applied to most deep neural networks.
As future research, we will consider adding some multimodal information to the selected neural network features to improve the robustness of the system. More applications of the proposed methods will also be explored. Data Availability Statement: The models and code used during the study are available in a repository (https://github.com/tangming666/NRC, accessed on 1 August 2021). The data are available from the corresponding author upon request (shuangliu.tjnu@gmail.com).