Multi-View Ground-Based Cloud Recognition by Transferring Deep Visual Information

Since cloud images captured from different views vary greatly in appearance, multi-view ground-based cloud recognition is a very challenging task. In this paper, we study the view shift problem in this field. We focus both on designing a proper feature representation and on learning distance metrics from sample pairs. Correspondingly, we propose transfer deep local binary patterns (TDLBP) and weighted metric learning (WML). On the one hand, to deal with view shift, such as variations in illumination, location, resolution and occlusion, we first utilize cloud images to train a convolutional neural network (CNN), and then extract local features from the part summing maps (PSMs) based on feature maps. Finally, we keep the maximum occurrence over regions (max pooling) to form the final feature representation. On the other hand, the number of cloud images in each category varies greatly, which leads to unbalanced similar pairs. Hence, we propose a weighted strategy for metric learning. We validate the proposed method on three cloud datasets (the MOC_e, IAP_e, and CAMS_e) that are collected by different meteorological organizations in China, and the experimental results show the effectiveness of the proposed method.


Introduction
Clouds are aerosols consisting of large amounts of frozen crystals, minute liquid droplets, or particles suspended in the atmosphere (https://www.weather.gov/). Their size, type, composition and movement reflect the atmospheric motion. In particular, the cloud type, as one of the crucial macroscopic cloud parameters in cloud observation, plays a vital role in weather prediction and climate change research [1]. Currently, ground-based cloud images are classified by qualified professionals, which consumes a large quantity of labor and material resources. Therefore, developing automatic techniques for ground-based cloud recognition is vital. To date, there are various devices for digitizing ground-based clouds, for example the whole sky imager (WSI) [2], the infrared cloud imager (ICI) [3], and the whole-sky infrared cloud-measuring system (WSIRCMS) [4]. With the help of these devices, various methods for automatic ground-based cloud recognition [5][6][7] have been proposed. However, the cloud features used in these methods are not discriminative enough to represent cloud images.
Practically, the appearance of clouds can be regarded as a type of natural texture [8], which makes it reasonable to use texture descriptors to portray cloud appearances. Inspired by the success of local features in the texture recognition field [9][10][11][12], some local features have been proposed to recognize ground-based cloud images [13,14]. This kind of method includes two procedures: first, the cloud image is described as a feature vector using local features; second, the Euclidean distance or chi-square distance is utilized in the matching or recognition process.
Existing methods mainly focus on recognizing cloud images that originate from similar views. These methods are implemented under the condition that the training and test images come from the same feature space. Consequently, they are not suitable for multi-view cases, because cloud images captured from different views belong to different feature spaces. In practice, we often have to handle cloud images from two views. For instance, the cloud images collected by a variety of weather stations vary in image resolution, illumination, camera settings, occlusion and so on. Such cloud images are actually distributed in different feature spaces. As illustrated in Figure 1a, the cloud images are captured in multiple views, and vary greatly in appearance. The competitive methods for ground-based cloud recognition, i.e., local binary patterns (LBP) [15], the bag-of-words (BoW) model [16], and the convolutional neural network (CNN) [17], generally achieve promising results when training and testing in the same feature space, while their performance degrades significantly when training and testing in different feature spaces, as shown in Figure 1b. Therefore, we hope to employ cloud images from one view (feature space) to train a classifier, which is then used to recognize cloud images from other views (feature spaces). This is a kind of view shift problem, and we define it as multi-view ground-based cloud recognition. The problem is very common worldwide. For instance, to obtain complete weather information, it is essential to set up new weather stations to capture cloud images. However, new weather stations have insufficient labelled cloud images to train a robust classifier, and it is unrealistic to expect users to label the cloud images for each new weather station, which is time-consuming and a waste of manpower. Considering that many labelled cloud images have accumulated in the established weather stations, we aspire to employ such labelled cloud images to train a classifier which can be used to recognize cloud images in new weather stations.
In this paper, we propose a novel multi-view ground-based cloud recognition method by transferring deep visual information. The cloud features used in the existing methods are not discriminative enough to describe cloud images when presented with view shift, and therefore we propose an effective method named transfer deep local binary patterns (TDLBP) for feature representation. Concretely, we first train a CNN model, and we propose part summing maps (PSMs) based on all feature maps of one convolutional layer. Then we extract LBP in local regions from the PSMs, and each local region is represented as a histogram. Finally, in order to adapt to view shift, we keep the maximum occurrence across the histograms to obtain a stable representation.
After cloud images are represented as feature vectors, we compute the similarity between feature vectors to classify ground-based cloud images. Classical distance metrics, such as the Euclidean distance [18], chi-square metric [13] and quadratic-chi metric [19], are predefined rather than learned from the data. Hence, we propose a learning-based method called weighted metric learning (WML) which aims to utilize sample pairs to learn a transformation matrix. In Figure 2, green and blue indicate two kinds of feature spaces. Two samples, one from each feature space, comprise a sample pair. Here, the red lines denote similar pairs, while black lines denote dissimilar pairs. In practice, the number of cloud images in each category differs greatly. For example, there are many clear sky images because clear sky appears frequently, while there are few images of altocumulus, which has a low probability of occurrence. Sample pairs are therefore unbalanced when we learn the transformation matrix. Hence, to avoid the learning process being dominated by sample pairs of frequently occurring clouds while neglecting the limited sample pairs of rarely occurring clouds, we propose a weighted strategy for metric learning. We assign a corresponding weight to the sample pairs of each category: a small weight to sample pairs that are large in number (squares in Figure 2) and a large weight to sample pairs that are small in number (circles in Figure 2). Finally, we utilize the nearest neighbor classifier, where the distances are determined by the proposed distance metric, to classify cloud images which are from another feature space.
The rest of this paper is organized as follows. Section 2 presents the related work, including feature representation for ground-based cloud recognition and metric learning. The details of the proposed TDLBP and WML are introduced in Section 3. In Section 4, we conduct a series of experiments to verify the proposed method. Section 5 summarizes the paper.

Related Work
In recent years, researchers have developed a number of algorithms for ground-based cloud recognition. The co-occurrence matrix and edge frequency were introduced in [5] to extract local features to describe cloud images and to recognize five different sky conditions. The work in [20] extended this to classify cloud images into eight sky conditions by utilizing Fourier transformation and statistical features. Since the BoW model is an effective algorithm for texture recognition, some extension methods [21,22] were proposed. Since the appearance of clouds is a kind of natural texture, Sun et al. [23] employed LBP to classify infrared cloud images. Liu et al. [19] proposed illumination-invariant completed local ternary patterns (ICLTP), which can effectively handle illumination variations. They then proposed the salient LBP (SLBP) [13] to capture descriptive cloud information. The desirable property of SLBP is its robustness to noise. However, these features are not robust to view shift when describing cloud images.
Recently, inspired by the success of convolutional neural networks (CNNs) in image recognition [17,24], Ye et al. [25] first proposed to apply CNNs to ground-based cloud recognition. They employed the Fisher Vector (FV) to encode the last convolutional layer of CNNs, and they further proposed to extract deep convolutional visual features to represent cloud images in [26]. Shi et al. [27] employed deep convolutional activations-based features (DCAFs) to describe cloud images. These aforementioned methods showed promising recognition results when trained and tested in the same feature space; however, like the hand-crafted features above, they are not robust to view shift.
In the recognition procedure, where similarities or distances between two feature vectors are computed, many predefined metrics cannot show the desirable topology that we are trying to capture. A sought-after alternative is to apply metric learning in place of these predefined metrics. The key idea of metric learning is to construct a Mahalanobis distance in which a transformation matrix is applied to compute the distance between a sample pair. Since metric learning has shown remarkable performance in various fields, such as image retrieval and classification [28], face recognition [29][30][31] and human activity recognition [32,33], we apply the framework of metric learning to ground-based cloud recognition and meanwhile consider the sample imbalance problem.

Part Summing Maps
With the appearance of large-scale image datasets and the development of high-performance computing systems, CNNs have shown promising performance in image classification [34] and object detection [35,36]. Hence, we extract features from a CNN model to describe cloud images. Generally, an effective CNN requires a large number of training images, and training a CNN with insufficient images results in overfitting. To address this, we fine-tune the VGG-19 model [17] on our cloud datasets. As presented in Table 1, the VGG-19 model consists of 16 convolutional layers and three fully-connected (FC) layers. The receptive field of every convolutional kernel in the model is 3 × 3 pixels, and the number of kernels differs across convolutional layers. In the process of fine-tuning the VGG-19 model, we set the number of output units of the final FC layer to the number of cloud categories.
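To make the layer layout concrete, the following minimal sketch lists the spatial size and channel count of each of the 16 convolutional layers. It assumes the standard VGG-19 configuration (five blocks of 3 × 3 convolutions with padding that preserves the spatial size, each block followed by 2 × 2 max pooling), a 224 × 224 input, and sequential numbering of the convolutional layers; the names `VGG19_BLOCKS` and `conv_layer_shapes` are illustrative only.

```python
# (number of 3 x 3 convolutional layers, output channels) for the five blocks
# of the standard VGG-19 configuration (assumed here).
VGG19_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]


def conv_layer_shapes(input_size=224):
    """Return {layer index: (channels, height/width)} for the 16 conv layers."""
    shapes, index, size = {}, 0, input_size
    for num_layers, channels in VGG19_BLOCKS:
        for _ in range(num_layers):
            index += 1
            # Stride 1 with size-preserving padding keeps the spatial size.
            shapes[index] = (channels, size)
        # 2 x 2 max pooling with stride 2 halves the size after each block.
        size //= 2
    return shapes


shapes = conv_layer_shapes()
channels, size = shapes[4]      # the fourth convolutional layer
maps_per_part = channels // 8   # maps per part if the layer is split into 8 parts
print(channels, size, maps_per_part)
```

Under these assumptions the fourth convolutional layer produces 128 feature maps of 112 × 112 pixels, so splitting it into eight parts would give 16 feature maps per part.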
Considerable progress has been made in utilizing feature maps for image representations in computer vision [37][38][39]. Furthermore, the feature maps of a convolutional layer describe different patterns. To obtain complete information from the convolutional layer, we propose PSMs based on all feature maps for image representations. Practically, for one cloud image, we divide all feature maps from one convolutional layer evenly into several parts. Suppose that there are K parts of feature maps, as shown in Figure 3. Then we add the feature maps of each part into one part summing map (PSM), denoted as C_k (k = 1, 2, ..., K), which is formulated as:

$$C_k = \sum_{j=1}^{J} c_k^j, \tag{1}$$

where c_k^j indicates the j-th feature map of the k-th part and J is the number of feature maps in each part.
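The summation above can be sketched in a few lines of plain Python. This is only an illustration: the text states that the feature maps are divided evenly, and the sketch additionally assumes that each part consists of consecutive feature maps; the function name `part_summing_maps` is hypothetical.

```python
def part_summing_maps(feature_maps, num_parts):
    """Split feature maps evenly into `num_parts` groups and sum each group.

    feature_maps: list of 2-D maps (lists of rows), all of the same size.
    Returns a list of `num_parts` part summing maps C_1, ..., C_K.
    """
    if len(feature_maps) % num_parts != 0:
        raise ValueError("the number of feature maps must be divisible by K")
    per_part = len(feature_maps) // num_parts  # J, the maps in each part
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    psms = []
    for k in range(num_parts):
        # Assumption: part k holds consecutive feature maps.
        part = feature_maps[k * per_part:(k + 1) * per_part]
        psm = [[sum(fm[y][x] for fm in part) for x in range(width)]
               for y in range(height)]
        psms.append(psm)
    return psms


# Toy example: four 2 x 2 feature maps split into K = 2 parts.
maps = [[[1, 0], [0, 1]], [[2, 2], [2, 2]],
        [[0, 1], [1, 0]], [[3, 3], [3, 3]]]
psms = part_summing_maps(maps, num_parts=2)
print(psms)
```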

Transfer Deep LBP
We propose TDLBP to address the view shift problem. The convolutional layers can capture more local characteristics [40,41]. Therefore, we propose to extract local patterns from the PSMs of a convolutional layer to represent cloud images. TDLBP is an improved operator over LBP, which computes a region representation based on the PSMs. TDLBP is not only invariant to intensity scale changes, but is also robust to view shift and captures the complete scale information of clouds. We first partition each PSM into L × L (L = 1, 2, 3) regions. Second, we extract LBP in each region of the PSMs. We take the PSMs of 2 × 2 regions as an example (see Figure 4) and perform the following steps:
(1) Feature extraction for each region in the PSMs. Within each region, we extract three scales of LBP histograms, i.e., (P, R) = (8, 1), (16, 2) and (24, 3). Hence, each region can be described as a 54-dimensional descriptor.
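The 54-dimensional region descriptor is consistent with a rotation-invariant uniform LBP coding, which yields P + 2 bins per scale (10 + 18 + 26 = 54). The text does not name the coding explicitly, so the sketch below treats it as an assumption; `riu2_label` and `histogram_length` are illustrative names, and sampling the P neighbours on a circle of radius R is omitted.

```python
def riu2_label(bits):
    """Map a circular binary pattern to a rotation-invariant uniform label.

    Uniform patterns (at most two 0/1 transitions around the circle) are
    labelled by their number of ones (0..P); all other patterns share the
    single label P + 1, giving P + 2 histogram bins in total.
    """
    p = len(bits)
    transitions = sum(bits[n] != bits[(n + 1) % p] for n in range(p))
    return sum(bits) if transitions <= 2 else p + 1


def histogram_length(scales):
    """Histogram length when each scale (P, R) contributes P + 2 bins."""
    return sum(p + 2 for p, _ in scales)


scales = [(8, 1), (16, 2), (24, 3)]
region_dim = histogram_length(scales)            # 10 + 18 + 26
uniform = riu2_label([1, 1, 1, 0, 0, 0, 0, 0])   # two transitions
non_uniform = riu2_label([1, 0, 1, 0, 0, 0, 0, 0])  # four transitions
print(region_dim, uniform, non_uniform)
```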

Weighted Metric Learning
Suppose there is a sample pair (i, z), where i ∈ R^{d×1} and z ∈ R^{d×1} are the feature vectors of two cloud images from two views, respectively (i.e., i and z come from two feature spaces). If the category labels of i and z are the same (or different), we define (i, z) as a similar pair (dissimilar pair). The number of cloud categories from each view is N, and we further construct N sets of similar pairs:

$$C_n = \{(i, z) \mid i \text{ and } z \text{ both belong to the } n\text{-th category}\}, \quad n = 1, 2, \ldots, N, \tag{2}$$

where C_n is the set of similar pairs in the n-th category. We formulate the dissimilar pairs as:

$$I = \{(i, z) \mid i \text{ and } z \text{ belong to different categories}\}. \tag{3}$$

We aspire to learn a transformation matrix M ∈ R^{d×r} (r ≤ d) to parameterize the squared Mahalanobis distance:

$$D(i, z) = (i - z)^T G\, (i - z), \tag{4}$$

where G = MM^T is a positive semidefinite matrix. For convenience, we denote s = (i − z). The squared Mahalanobis distance is a scalar, and hence we reformulate Equation (4) as:

$$D(i, z) = s^T M M^T s = \mathrm{Tr}\left(M^T s s^T M\right). \tag{5}$$

Our goal is to minimize the distance between similar pairs, and meanwhile maximize the distance between dissimilar pairs. For this purpose, we construct the following objective function:

$$\min_{M} \; D_C - D_I \quad \text{s.t.} \quad G \succeq 0, \;\; \mathrm{Tr}(G) = 1, \tag{6}$$

where D_C − D_I is the cost function, D_C is obtained by adding the distances of all similar pairs, and D_I is the sum of the distances of dissimilar pairs. D_C and D_I are defined in the following. The first constraint ensures a valid metric, and the second one excludes the trivial solution [42].
When computing D_C in the learning process, the classical metric learning methods assign the same weight to each similar pair of all categories. This does not consider that the number of similar pairs in each category is largely unbalanced. Such a weight strategy is not suitable for multi-view ground-based cloud recognition, because the occurrence probabilities of various weather conditions are different, and the number of cloud images in each category varies greatly, resulting in unbalanced similar pairs. Therefore, we propose WML to solve the problem of sample imbalance. For similar pairs, we assign a different weight to each category. Concretely, we first compute the distances between similar pairs of each category, and give a weight to each category according to its number of similar pairs. Then we sum the weighted distances of all categories. We compute D_C and D_I by:

$$D_C = \sum_{n=1}^{N} \frac{1}{|C_n|} \sum_{(i, z) \in C_n} \mathrm{Tr}\left(M^T s s^T M\right), \tag{7}$$

$$D_I = \frac{1}{|I|} \sum_{(i, z) \in I} \mathrm{Tr}\left(M^T s s^T M\right), \tag{8}$$

where |C_n| is the number of similar pairs in the n-th category, and |I| is the total number of dissimilar pairs of all categories. We minimize the objective function, i.e., Equation (6), subject to the two constraints to learn M. Since G = MM^T is positive semidefinite by construction, the first constraint can be relaxed when explicitly solving for M [42]. Equations (7) and (8) are substituted into Equation (6), and then we apply the standard Lagrange multiplier to Equation (6):

$$L(M, \lambda) = \mathrm{Tr}\left(M^T (W_C - W_I) M\right) - \lambda \left(\mathrm{Tr}\left(M^T M\right) - 1\right). \tag{9}$$

Then the partial derivative of the Lagrangian function with respect to M is computed, and we set the result to zero:

$$(W_C - W_I)\, M = \lambda M, \tag{10}$$

where

$$W_C = \sum_{n=1}^{N} \frac{1}{|C_n|} \sum_{(i, z) \in C_n} s s^T \tag{11}$$

and

$$W_I = \frac{1}{|I|} \sum_{(i, z) \in I} s s^T. \tag{12}$$

Equation (10) is an eigenvalue problem. Because Equation (6) is a minimization, we preserve the r eigenvectors of (W_C − W_I) corresponding to the r smallest eigenvalues. As a result, the learned transformation matrix M is equal to:

$$M = [m_1, m_2, \ldots, m_r], \tag{13}$$

where m_1 ∈ R^{d×1} is the eigenvector of (W_C − W_I) corresponding to the smallest eigenvalue, m_2 ∈ R^{d×1} is the eigenvector of (W_C − W_I) corresponding to the second smallest eigenvalue, and so on.
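The learning procedure can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: each category of similar pairs is weighted by the reciprocal of its pair count, the dissimilar pairs are averaged, and, because the cost D_C − D_I is minimized, the sketch keeps the eigenvectors of (W_C − W_I) with the smallest eigenvalues. The function names and the toy data are hypothetical.

```python
import numpy as np


def scatter(pairs):
    """Sum of outer products s s^T over pairs (x, z), with s = x - z."""
    diffs = np.array([np.asarray(x, float) - np.asarray(z, float)
                      for x, z in pairs])
    return diffs.T @ diffs


def learn_wml(similar_by_category, dissimilar, r):
    """Return a d x r transformation matrix from weighted pair constraints.

    similar_by_category: one list of (x, z) similar pairs per category.
    dissimilar: list of (x, z) dissimilar pairs.
    """
    # Assumed weighting: 1 / |C_n| per category, so that frequent categories
    # do not dominate the similar-pair term.
    w_c = sum(scatter(pairs) / len(pairs) for pairs in similar_by_category)
    w_i = scatter(dissimilar) / len(dissimilar)
    # eigh returns the eigenvalues of a symmetric matrix in ascending order.
    _, eigenvectors = np.linalg.eigh(w_c - w_i)
    # Assumption: the r smallest eigenvalues minimise the cost D_C - D_I.
    return eigenvectors[:, :r]


def learned_distance(m, x, z):
    """Squared distance ||M^T (x - z)||^2 = Tr(M^T s s^T M)."""
    s = np.asarray(x, float) - np.asarray(z, float)
    return float(np.sum((m.T @ s) ** 2))


# Toy data: similar pairs differ along axis 0 (a stand-in for view shift),
# dissimilar pairs differ along axis 1 (a stand-in for the category).
similar = [
    [((0, 0), (1, 0)), ((0, 0), (2, 0)), ((0, 0), (1, 0))],  # frequent category
    [((0, 5), (1, 5))],                                      # rare category
]
dissimilar = [((0, 0), (0, 5)), ((1, 0), (1, 5))]
M = learn_wml(similar, dissimilar, r=1)
print(learned_distance(M, (0, 0), (1, 0)), learned_distance(M, (0, 0), (0, 5)))
```

In this toy case the learned direction ignores axis 0, so the similar pair collapses to distance zero while the dissimilar pair stays far apart. A nearest neighbor classifier would then rank gallery images by `learned_distance`.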

Datasets and Experimental Setup
In this paper, each cloud dataset is divided into seven categories according to the criteria published by the World Meteorological Organization (WMO). The first cloud dataset, MOC_e, is collected in Wuxi, Jiangsu Province, China, and provided by the Meteorological Observation Centre, China Meteorological Administration. Its cloud images have strong illumination and no occlusions, and have a resolution of 2828 × 4288 pixels. The other two cloud datasets, i.e., the CAMS_e and IAP_e, are both captured in Yangjiang, Guangdong Province, China, but are provided by the Chinese Academy of Meteorological Sciences and the Institute of Atmospheric Physics, Chinese Academy of Sciences, respectively. Each cloud image in the CAMS_e is 1392 × 1040 pixels with weak illumination and no occlusions. The acquisition device used to collect the IAP_e differs from that of the CAMS_e, and as a result, the cloud images from the IAP_e have a higher resolution of 2272 × 1704 pixels, strong illumination and occlusions. The MOC_e contains 2107 images, the CAMS_e contains 2491 images, and the IAP_e contains 3533 images. The number of images in each category is listed in Table 2. Samples for each category are shown in Figure 5. It can be observed that the cloud datasets are captured from different views and belong to different feature spaces.
All images from the three datasets are resized to 224 × 224 pixels, and we employ the feature maps of the fourth convolutional layer. The training set consists of two parts, i.e., all of the images from one view and half of the images in each category from another view, and the remaining images are taken as the test images. We repeat each experiment 10 times, and we take the average accuracy over these 10 runs as the final result.
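The splitting protocol can be sketched as below. The sketch assumes that the half of each category is drawn at random and rounded down for odd counts, neither of which is specified in the text; `split_views`, `average_accuracy` and the `evaluate` callback are hypothetical names.

```python
import random


def split_views(source, target, rng):
    """Build one train/test split from lists of (image_id, label) tuples.

    Training set: every source-view image plus half of each target category.
    Test set: the remaining target-view images.
    """
    by_label = {}
    for item in target:
        by_label.setdefault(item[1], []).append(item)
    train, test = list(source), []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        half = len(items) // 2  # assumption: round down for odd counts
        train.extend(items[:half])
        test.extend(items[half:])
    return train, test


def average_accuracy(source, target, evaluate, runs=10, seed=0):
    """Average `evaluate(train, test)` over `runs` random splits."""
    rng = random.Random(seed)
    scores = [evaluate(*split_views(source, target, rng)) for _ in range(runs)]
    return sum(scores) / len(scores)


# Toy example with four source images and six target images.
source = [("s%d" % n, n % 2) for n in range(4)]
target = [("t0", 0), ("t1", 0), ("t2", 0), ("t3", 0), ("t4", 1), ("t5", 1)]
train, test = split_views(source, target, random.Random(0))
print(len(train), len(test))
```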

Effect of TDLBP
We compare the proposed TDLBP with two other texture features, i.e., LBP and DLBP. Note that we extract LBP both from the original cloud images and from the PSMs, and we refer to the latter as DLBP. For a fair comparison, we partition all original cloud images (for LBP) and the PSMs (for DLBP and TDLBP) into L × L (L = 1, 2, 3) regions. For each region, we extract three scales of LBP with (P, R) equal to (8, 1), (16, 2) and (24, 3). As for LBP, we accumulate LBP histograms in each divided region, and concatenate all histograms into one histogram with 1 × 54 + 4 × 54 + 9 × 54 = 756 dimensions. As for DLBP, within each region of the PSMs, we extract LBP histograms, and then apply sum pooling to aggregate all features in each region. Each image is also described as a feature vector with 756 dimensions. The chi-square metric is used in this section, and Table 3 presents the recognition accuracies. From Table 3, in all six situations, the highest classification accuracies are obtained by TDLBP. Both TDLBP and DLBP outperform LBP, because the CNN can learn highly nonlinear features for view shift. Moreover, TDLBP and DLBP are extracted from the PSMs, which contain the complete and spatial information of clouds. TDLBP outperforms DLBP by about 1% in all six situations. Since cloud images generally contain some interference and noise, max pooling can select the discriminative and salient features. Hence, TDLBP is more suitable for adapting to view shift. Furthermore, the best performance is obtained in the situation of the IAP_e to MOC_e shift. This is probably because the cloud images of the IAP_e have some similarities with those of the MOC_e, such as illuminations, occlusions and locations.
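For reference, matching with the chi-square metric and a nearest neighbor classifier can be sketched as follows. The chi-square form used here, the sum of (a − b)² / (a + b) over bins, is one common definition and is an assumption, as are the function names and the toy histograms.

```python
def chi_square(h1, h2):
    """Chi-square distance between two non-negative histograms."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return total


def nearest_neighbor(query, gallery):
    """Return the label of the gallery histogram closest to `query`.

    gallery: list of (histogram, label) tuples.
    """
    return min(gallery, key=lambda item: chi_square(query, item[0]))[1]


# Feature length for L x L regions (L = 1, 2, 3) with 54 bins per region.
feature_dim = 54 * sum(l * l for l in (1, 2, 3))

# Toy three-bin histograms standing in for the 756-dimensional features.
gallery = [([4, 0, 0], "clear sky"), ([1, 2, 1], "altocumulus")]
predicted = nearest_neighbor([3, 1, 0], gallery)
print(feature_dim, predicted)
```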
We then replace the chi-square metric with metric learning to classify the cloud images with the three features, and we denote the resulting methods as LBP + ML, DLBP + ML and TDLBP + ML, respectively. From the results shown in Table 4, with the help of metric learning, the performance of all three features improves by approximately 2%. In particular, TDLBP + ML achieves the best recognition results in all six conditions. This demonstrates that TDLBP is effective with both the predefined metric and the learning-based metric. In addition, it is observed that metric learning is more suitable for measuring the similarity between sample pairs when presented with view shift.

Effect of WML
In this subsection, we evaluate WML combined with the above-mentioned features. LBP + WML, DLBP + WML and TDLBP + WML denote LBP, DLBP and TDLBP with the proposed WML, respectively. We choose r = 150 in Equation (13) when learning M, and set the number of PSMs to K = 8. The results are shown in Table 5, where we can observe that TDLBP + WML once again achieves the best performance in all multi-view recognition cases. Comparing Table 5 with Table 4, the proposed WML achieves better results than ML when using the same features, because it addresses the sample imbalance problem by using a weight strategy. In order to further verify the effectiveness of WML, we compare WML with SMOTEBoost [43] and RUSBoost [44] based on TDLBP. SMOTEBoost and RUSBoost are representative methods for alleviating the problem of class imbalance, and we use their default optimal parameters. From Table 6, the proposed TDLBP + WML still achieves the best recognition result in all multi-view recognition cases. The performances of SMOTEBoost and RUSBoost are very similar, but RUSBoost is a preferable alternative for learning from imbalanced data because it is simpler and faster than SMOTEBoost.

Comparison to the Competitive Methods
We compare the proposed method TDLBP + WML with three competitive methods, i.e., LBP, BoW and CNN. Note that the experimental results of LBP in this section are the same as those reported in Table 4. For BoW, we stretch a 9 × 9 neighborhood around each pixel into an 81-dimensional vector to represent each patch, and apply Weber's law [45] to normalize the patch vectors. Then, we learn a dictionary for each category by using K-means clustering [46] over patch vectors, and the size of the dictionary for each category is set to 300. Each image is therefore described as a 2100-dimensional vector. Finally, we make use of LIBSVM [47] for SVM training and classification with the radial basis function (RBF) kernel, where the parameters C and γ are set to 200 and 2, respectively. C is a penalty coefficient that trades off misclassification against the complexity of the decision surface. γ is a parameter of the RBF kernel, and can be seen as the inverse of the radius of influence of the samples selected by the model as support vectors. For CNN, we fine-tune the widely-used VGG-19 model [17] on the cloud datasets, and then treat the output of the final FC layer as the feature vector. Note that LBP, BoW and CNN utilize the same training samples as TDLBP + WML.
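The quantities in the BoW baseline can be checked with a short sketch. It shows the RBF kernel in the form used by LIBSVM, exp(−γ‖u − v‖²), the stretching of a 9 × 9 neighborhood into an 81-dimensional patch vector, and the resulting descriptor length for seven categories. Restricting patches to interior pixels is an assumption, since border handling is not described, and the function names are illustrative.

```python
import math


def rbf_kernel(u, v, gamma=2.0):
    """RBF kernel exp(-gamma * ||u - v||^2), with gamma = 2 as in the text."""
    squared = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared)


def patch_vector(image, row, col, size=9):
    """Stretch the size x size neighborhood centred at (row, col) into a vector.

    Assumption: (row, col) is an interior pixel, so no padding is needed.
    """
    half = size // 2
    return [image[y][x]
            for y in range(row - half, row + half + 1)
            for x in range(col - half, col + half + 1)]


# Seven category dictionaries of 300 words each give the image descriptor size.
bow_dim = 7 * 300

image = [[y * 9 + x for x in range(9)] for y in range(9)]
patch = patch_vector(image, 4, 4)
print(len(patch), bow_dim, rbf_kernel([1.0, 0.0], [0.0, 0.0]))
```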
From the experimental results listed in Table 7, LBP, BoW and CNN are not suitable for multi-view ground-based cloud recognition. BoW and CNN still outperform LBP, because LBP is a fixed feature extraction method without a learning process. Compared with BoW and CNN, we not only extract the feature vectors from PSMs, but also take the diverse numbers of cloud images in each category into account. Hence, the proposed method outperforms BoW and CNN by more than 30% and 17%, respectively.

Influence of Parameter Variances
In this section, we analyze the proposed TDLBP + WML in three aspects: the selection of the convolutional layers for PSMs, and the influences of K and r. We select the IAP_e as one view and the MOC_e as the other view to implement the following experiments.
Generally, we can extract structural and textural local features from the shallow convolutional layers of a CNN, and extract features with high-level semantic information from the deep convolutional layers. The appearance of clouds can be regarded as a type of natural texture, and therefore we extract features from the PSMs in the shallow convolutional layers. We select the first to eighth convolutional layers for the PSMs to analyze the performance of TDLBP + WML. From Table 8, the highest result of TDLBP + WML is obtained when we make use of the PSMs in the 4-th convolutional layer. Since each convolutional layer contains different information, we also extract TDLBP features from two different convolutional layers for cloud image representation in order to obtain complete cloud information. Because TDLBP + WML obtains the highest result when utilizing the PSMs of conv_4 (Table 8), we combine conv_4 with each of the other convolutional layers for TDLBP feature extraction. Specifically, we extract TDLBP features from each pair of convolutional layers, and the resulting TDLBP features are concatenated to form the final feature for describing the cloud image. Comparing Table 9 with Table 8, the performances all improve, and the case of conv_3 & conv_4 achieves the best result of 79.46%. Based on this result, we further combine TDLBP features from three different convolutional layers, and follow the same procedure of feature extraction as mentioned above. The results are shown in Table 10. Comparing Table 10 with Table 9, the performances slightly degrade. Hence, considering both the computational complexity and the recognition accuracy, we conclude that extracting TDLBP features from two different convolutional layers is optimal for cloud image representation.
The effect of K on recognition performance is shown in Figure 6, where K is the number of PSMs for the 4-th convolutional layer. We can conclude that a larger K may result in better recognition accuracy, but also leads to a heavier computational burden. We obtain the best result when K increases to 8. In addition, we evaluate the performance of TDLBP + WML with respect to r, the number of retained eigenvectors (see Section 3.3), which has an impact on recognition performance as it controls the dimensionality of M. As illustrated in Figure 7, the recognition performance improves as r increases, and the best result of 78.52% is obtained when r is equal to 150. A proper r allows the feature vectors to retain the discriminative information at a favourable dimensionality.

Conclusions
In conclusion, we have proposed TDLBP + WML for multi-view ground-based cloud recognition. Specifically, a novel feature representation called TDLBP has been proposed which is robust to view shift, such as variations in location, illumination, resolution and occlusion. Furthermore, since the number of cloud images differs across categories, we have proposed WML, which assigns a different weight to each category when learning the transformation matrix. We have verified TDLBP + WML with a series of experiments on three cloud datasets, i.e., the MOC_e, CAMS_e, and IAP_e. Compared to other competitive methods, TDLBP + WML achieves better performance.

Figure 1. (a) Cloud images from two different views; (b) the performance of three competitive methods degrades when presented with view shift.

Figure 2. Green and blue indicate two kinds of feature spaces. We apply weighted pairwise constraints to the feature spaces, where red and black lines denote similar pairs and dissimilar pairs, respectively. The final feature space is learned for cloud recognition.

Figure 3. The procedure of generating part summing maps.
(2) Feature pooling. Max pooling is applied to all local features of the local regions at the same position, i.e., the maximum value of each bin among all histograms is preserved, resulting in four histograms. The pooled feature of each local region is more robust to view shift.
(3) Feature concatenation. The four histograms are concatenated into one histogram to represent each cloud image. The resulting histogram can capture global information and local characteristics of image regions simultaneously.
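The pooling and concatenation steps can be sketched as follows. The sketch assumes that the LBP histograms of every region have already been computed for each PSM and are stored as nested lists; `tdlbp_feature` is an illustrative name, and the three-bin histograms stand in for the 54-dimensional ones.

```python
def tdlbp_feature(region_histograms):
    """Max-pool region histograms across PSMs, then concatenate them.

    region_histograms[k][p] is the histogram of region p in the k-th PSM.
    For each region position and each bin, the largest value across the
    PSMs is kept; the pooled histograms are concatenated in region order.
    """
    num_regions = len(region_histograms[0])
    feature = []
    for p in range(num_regions):
        at_position = [psm[p] for psm in region_histograms]
        pooled = [max(bins) for bins in zip(*at_position)]
        feature.extend(pooled)
    return feature


# Toy example: K = 2 PSMs, 2 x 2 = 4 regions, 3 bins per histogram.
histograms = [
    [[1, 0, 2], [0, 0, 1], [3, 1, 0], [2, 2, 2]],  # first PSM
    [[0, 4, 1], [1, 0, 0], [1, 1, 5], [0, 3, 1]],  # second PSM
]
feature = tdlbp_feature(histograms)
print(feature)
```

Replacing `max` with `sum` in the pooling line would give the sum-pooling variant.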

Figure 4. Each PSM is divided into 2 × 2 regions, which are denoted by four colors, i.e., blue, green, yellow, and pink, respectively. We extract features from each region, and apply max pooling for the final feature representation.

Figure 5. Cloud samples of each category (each row indicates one category) from the three cloud datasets, i.e., (a) the MOC_e, (b) the CAMS_e, and (c) the IAP_e.

Figure 6. Recognition accuracies achieved by TDLBP + WML with varied values of K.

Figure 7. Recognition accuracies achieved by TDLBP + WML with varied values of r.

Table 1. The configuration of the VGG-19 model. con_i denotes the i-th convolutional layer, and the convolution stride is set to 1 pixel. Max pooling is implemented by a sliding window of 2 × 2 pixels with stride 2.

Table 2. The number of samples in each category of the three datasets.

Table 8. The performance of TDLBP + WML in different convolutional layers.

Table 9. The performance of TDLBP + WML in combinations of two convolutional layers.

Table 10. The performance of TDLBP + WML in combinations of three convolutional layers.