Multi-Evidence and Multi-Modal Fusion Network for Ground-Based Cloud Recognition

In recent times, deep neural networks have drawn much attention in ground-based cloud recognition. Yet such approaches center on learning global features from visual information, which yields incomplete representations of ground-based clouds. In this paper, we propose a novel method named multi-evidence and multi-modal fusion network (MMFN) for ground-based cloud recognition, which learns extended cloud information by fusing heterogeneous features in a unified framework. Namely, MMFN exploits multiple pieces of evidence, i.e., global and local visual features of ground-based cloud images, and fuses them with multi-modal information using two fusion layers.


Introduction
Clouds are collections of tiny water droplets or ice crystals suspended in the air. They exert a considerable impact on the hydrological cycle, Earth's energy balance and the climate system [1][2][3][4][5]. Accurate cloud observation is crucial for climate prediction, air traffic control, and weather monitoring [6].
In general, space-based satellite, air-based radiosonde and ground-based remote sensing observations are the three major ways of observing clouds [7]. Satellite observations are widely applied in large-scale surveys; however, they cannot provide sufficient temporal and spatial resolution for localized, short-term cloud analysis over a particular area. Although air-based radiosonde observation is excellent for detecting cloud vertical structure, its cost is considerably high. As a result, ground-based remote sensing equipment, such as the total-sky imager (TSI) [8,9] and all-sky imagers [10,11], has developed rapidly; it provides high-resolution remote sensing images at low cost and thus promotes local cloud analysis.
The visual information contained in a ground-based cloud image represents the cloud only from the visual perspective, which cannot describe the cloud accurately due to the large variations in cloud appearance. It should be noted that cloud formation is a mutual process of multiple natural factors, including temperature, humidity, pressure and wind speed, which we refer to as multi-modal information. Clouds are strongly correlated with multi-modal information [29,30]; for example, humidity influences cloud occurrence and cloud shape is affected by wind. Hence, instead of focusing only on cloud visual representations, it is more reasonable to enhance recognition performance by combining ground-based cloud visual information and multi-modal information.
Liu and Li [31] extracted deep features by stretching the sum convolutional map obtained by pooling activations at the same position of all the feature maps in deep convolutional layers; the deep features were then integrated with multi-modal features using a weighting scheme. Liu et al. [32] proposed a two-stream network to learn ground-based cloud images and multi-modal information jointly, and then employed a weighted strategy to fuse these two kinds of information. In spite of these efforts, recognizing ground-based clouds using both cloud images and multi-modal information remains an open issue.
Furthermore, existing public ground-based cloud datasets [26,33,34] lack data richness, which restricts research on multi-modal ground-based cloud recognition. Specifically, none of these datasets contain both ground-based cloud images and multi-modal information. Meanwhile, in practice, meteorological stations are already equipped with instruments for collecting cloud images and multi-modal information, so this information can easily be acquired.
In this paper, considering the above-mentioned issues, we propose the multi-evidence and multi-modal fusion network (MMFN) to fuse heterogeneous features, namely, global visual features, local visual features, and multi-modal information, for ground-based cloud recognition. To this end, the MMFN mainly consists of three components, i.e., main network, attentive network and multi-modal network. The main and attentive networks could mine the multi-evidence, i.e., global and local visual features, to provide discriminative cloud visual information. The multi-modal network is designed with fully connected layers to learn multi-modal features.
To address the limitations of the existing public datasets, we release a new dataset named the multi-modal ground-based cloud dataset (MGCD), which contains both the ground-based cloud images and the corresponding multi-modal information. Here, the multi-modal information refers to temperature, humidity, pressure and wind speed. The dataset contains 8000 ground-based cloud samples collected over a long time period and is larger than any of the existing public datasets.
The contributions of this paper are summarized as follows:
• The proposed MMFN could fuse the multi-evidence and multi-modal features of ground-based clouds in an end-to-end fashion, which maximizes their complementary benefits for ground-based cloud recognition.
• The attentive network refines the salient patterns of convolutional activation maps, which enables learning reliable and discriminative local visual features for ground-based clouds.
• We release a new dataset, MGCD, which contains not only the ground-based cloud images but also the corresponding multi-modal information. To our knowledge, the MGCD is the first public cloud dataset containing multi-modal information.

Methods
The proposed MMFN is used for multi-modal ground-based cloud recognition by fusing ground-based cloud images and multi-modal information. As depicted in Figure 2, it comprises three networks, i.e., the main network, the attentive network and the multi-modal network, two fusion layers (concat1 and concat2) and two fully connected layers (fc5 and fc6). In this section, we detail the main network, the attentive network and the multi-modal network, respectively. Then, the fusion strategy between visual and multi-modal features is elaborated.

Main Network
The main network is used to learn global visual features from the entire ground-based cloud image, and it evolves from the widely used ResNet-50 [20]. Figure 3 summarizes the framework of ResNet-50, which mainly consists of six components, i.e., conv1, conv2_x ∼ conv5_x and a fully connected layer. conv2_x ∼ conv5_x comprise 3, 4, 6 and 3 residual building blocks, respectively. Taking conv3_x as an example, it contains 4 residual building blocks, each of which is made up of three convolutional layers with kernel sizes of 1 × 1, 3 × 3 and 1 × 1, respectively. Note that the final fully connected layer of ResNet-50 is discarded in the main network. The output of conv5_x is aggregated by the average pooling layer (avgpool1), resulting in a 2048-dimensional vector which is treated as the input of the fusion layer (concat1).
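A minimal PyTorch sketch of this branch, assuming the torchvision ResNet-50; the class and variable names are ours, not part of the original implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class MainNetwork(nn.Module):
    """Global-feature branch: ResNet-50 with its 1000-way classifier removed;
    the avgpool1 output is a 2048-d vector f_g fed into concat1."""

    def __init__(self, pretrained: bool = True):
        super().__init__()
        backbone = models.resnet50(pretrained=pretrained)
        # Keep conv1 through avgpool, drop the final fully connected layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_g = self.features(x)           # (B, 2048, 1, 1)
        return torch.flatten(f_g, 1)     # (B, 2048)
```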

Attentive Network
CNNs tend to pay more attention to local regions where the structure and texture information of clouds is reflected. Figure 4 visualizes CNN features, implying that the salient parts or regions of ground-based cloud images play a decisive role in the recognition process. Hence, there is a clear need to extract local features from ground-based cloud images so as to complement the global features. The attentive network is inspired by the great potential of the attention mechanism and is used to exploit local visual features for ground-based cloud recognition. Attention is the process of selecting and gating relevant information based on saliency [35], and it has been widely investigated in speech recognition [36,37], object detection [38], image captioning [39] and many other visual recognition works [40][41][42][43]. Meanwhile, the convolutional activation maps in shallow convolutional layers contain rich low-level patterns, such as structure and texture. Hence, we design the attentive network, which consists of the attentive maps, two convolutional layers (conv2 and conv3) and one average pooling layer (avgpool2), to extract local visual features from convolutional activation maps. Specifically, we first refine salient patterns from the convolutional activation maps to obtain attentive maps that contain more semantic information, and we then learn the local visual features from the attentive maps.
To learn reliable and discriminative local visual features, we propose the attentive maps as the first part of the attentive network; they are generated by refining the salient patterns of the convolutional activation maps. Explicitly, we treat the convolutional activation maps of the first residual building block in conv3_x as the input of the attentive maps. Let X_i = {x_{i,j} | j = 1, 2, ..., h × w} denote the i-th convolutional activation map, where x_{i,j} is the response at location j, and h and w denote the height and width of the convolutional activation map. Herein, there are 512 convolutional activation maps, and h = w = 28 in the first block of conv3_x. For the i-th convolutional activation map, we sort x_{i,1} ∼ x_{i,h×w} in descending order and select the top n × n responses. Afterwards, we reconstruct them into an n × n attentive map while maintaining the descending order. Figure 5 illustrates the process of obtaining an attentive map, where n is set to 5. We apply the same strategy to all the convolutional activation maps, and therefore obtain 512 attentive maps. Hence, the attentive maps gather higher responses for the meaningful content and eliminate the negative effects caused by the non-salient responses. Subsequently, the attentive maps are followed by a dropout layer.
conv2, conv3 and avgpool2 are used to transform the attentive maps into a high-dimensional vector through non-linear transformations. The convolution kernels of conv2 and conv3 have sizes of 3 × 3 and 1 × 1 and strides of 2 and 1, respectively. Note that the 1 × 1 convolution kernel in conv3 is used to increase the output dimension. In addition, the numbers of convolution kernels in conv2 and conv3 are 512 and 2048, respectively. Both conv2 and conv3 are normalized by batch normalization followed by the Leaky rectified linear unit (Leaky ReLU). The output of avgpool2 is a 2048-dimensional vector and is fed into concat2.
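As a concrete illustration of the attentive-map construction, the following PyTorch sketch keeps the top n × n responses of each activation map; the optional discard argument mirrors the noise-suppression step described in the implementation details, and all names are ours.

```python
import torch

def attentive_maps(activations: torch.Tensor, n: int = 7, discard: int = 0) -> torch.Tensor:
    """Build n x n attentive maps from convolutional activation maps.

    activations: (B, C, H, W), e.g. (B, 512, 28, 28) from the first residual
    block of conv3_x. The responses of every map are sorted in descending
    order, optionally the first `discard` responses are dropped (e.g. the top
    5 to suppress noise/outliers), and the next n*n responses are reshaped
    row by row into an n x n map. Returns a tensor of shape (B, C, n, n).
    """
    b, c, h, w = activations.shape
    flat = activations.reshape(b, c, h * w)
    sorted_vals, _ = torch.sort(flat, dim=-1, descending=True)
    top = sorted_vals[..., discard: discard + n * n]
    return top.reshape(b, c, n, n)
```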
Hence, with the help of the main and attentive networks, the proposed MMFN could mine the multi-evidence to represent ground-based cloud images in a unified framework. This provides discriminative cloud visual information.

Multi-Modal Network
The multi-modal information reflects the process of cloud formation, and thus we apply the multi-modal network to learn multi-modal features for a more complete cloud representation. As the input of the multi-modal network is the multi-modal information, which is represented as a vector, the network is designed with four fully connected layers, i.e., fc1 ∼ fc4, with 64, 256, 512 and 2048 neurons, respectively. Batch normalization and the Leaky ReLU activation are attached to each of the first three layers. The output of fc4 is passed through the Leaky ReLU activation and is then fed into concat1 and concat2 simultaneously. Herein, we denote the output of the multi-modal network, i.e., the input to concat1 and concat2, as f_m.
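A PyTorch sketch of this branch under the same assumptions as before (hypothetical names; the 4-dimensional input corresponds to temperature, humidity, pressure and wind speed).

```python
import torch
import torch.nn as nn

class MultiModalNetwork(nn.Module):
    """fc1-fc4 map the 4-d multi-modal vector to the 2048-d feature f_m.
    The first three layers use batch normalization + Leaky ReLU; fc4 is
    followed by Leaky ReLU only."""

    def __init__(self, slope: float = 0.1):
        super().__init__()
        dims = [4, 64, 256, 512, 2048]
        layers = []
        for i in range(3):                       # fc1-fc3
            layers += [nn.Linear(dims[i], dims[i + 1]),
                       nn.BatchNorm1d(dims[i + 1]),
                       nn.LeakyReLU(slope)]
        layers += [nn.Linear(dims[3], dims[4]),  # fc4
                   nn.LeakyReLU(slope)]
        self.net = nn.Sequential(*layers)

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        return self.net(m)                       # (B, 2048)
```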

Heterogeneous Feature Fusion
Feature fusion has been proven to be a robust and effective strategy for learning rich information in various areas, such as scene classification [44], facial expression recognition [45] and action recognition [46]. In particular, feature fusion methods based on deep neural networks are powerful, and they can be roughly divided into homogeneous feature fusion [47][48][49][50] and heterogeneous feature fusion [51][52][53]. For the former, many efforts concentrate on fusing homogeneous features extracted from different components of a CNN for recognition tasks. Compared with homogeneous feature fusion, heterogeneous feature fusion is rather tough and complex, because heterogeneous features possess significantly different distributions and data structures. Herein, we focus on heterogeneous feature fusion.
The outputs of the main network, the attentive network and the multi-modal network, i.e., f_g, f_l and f_m, are treated as the global visual features, local visual features and multi-modal features, respectively; each is a 2048-dimensional vector. f_g is learned from the entire cloud image and contains more semantic information because it is extracted from a deeper layer of the main network, whereas f_l is learned from salient patterns in the shallow convolutional activation maps and contains more texture information. Different from the visual features, f_m describes clouds from the aspect of multi-modal information. Hence, these features describe ground-based clouds from different aspects and contain complementary information. To take full advantage of their complementary strengths, we combine the multi-modal features with the global and local visual features, respectively.
In this work, we propose two fusion layers, concat1 and concat2, to fuse f_m with f_g and f_l, respectively. In concat1, the fusion of f_g and f_m can be formulated as

F_gm = g(f_g, f_m),   (1)

where g(·) denotes the fusion operation. In this work, g(·) is defined as

g(f_g, f_m) = [λ_1 f_g, λ_2 f_m],   (2)

where [·, ·] denotes the concatenation of two vectors, and λ_1 and λ_2 are coefficients that trade off the importance of f_g and f_m.
Similarly, the fused feature of f_l and f_m in concat2 can be expressed as

F_lm = g(f_l, f_m) = [λ_3 f_l, λ_4 f_m],   (3)

where λ_3 and λ_4 are coefficients that balance the importance of f_l and f_m.

The final fully connected layers fc5 and fc6 are used for the recognition task and are connected to concat1 and concat2, respectively. Each of them has K neurons, where K is the number of ground-based cloud categories. The output of fc5 is fed into the softmax activation to obtain label predictions over the K categories, i.e., the probability of each category. The softmax activation is defined as

y_k = exp(x_k) / Σ_{j=1}^{K} exp(x_j),   (4)

where x_k and y_k ∈ [0, 1] are the value of the k-th neuron of fc5 and the predicted probability of the k-th category, respectively. The cross-entropy loss is employed to compute the loss value

L_1 = − Σ_{k=1}^{K} q_k log(y_k),   (5)

where q_k denotes the ground-truth probability; it equals 1 when k is the ground-truth label and 0 otherwise. fc6 is treated analogously: its output is activated by the softmax and evaluated by a cross-entropy loss L_2. The total loss of the proposed MMFN is

L = α L_1 + β L_2,   (6)

where α and β are weights that balance L_1 and L_2. Hence, the optimization target of MMFN is to minimize Equation (6), and the training of MMFN is an end-to-end process, which benefits the fusion of the multi-modal features with the global and local visual features under a single network. After training the MMFN, we extract the fused features F_gm and F_lm from ground-based cloud samples according to Equations (2) and (3). Finally, F_gm and F_lm are directly concatenated as the final representation of a ground-based cloud sample.

In short, the proposed MMFN has three properties. Firstly, the attentive network refines the salient patterns of convolutional activation maps so as to learn reliable and discriminative local visual features for ground-based clouds. Secondly, the MMFN can process heterogeneous data: the three networks transform the corresponding heterogeneous inputs into a uniform format, which provides the conditions for the subsequent feature fusion and discriminative feature learning. Thirdly, the MMFN learns more extended fusion features for ground-based clouds, because the heterogeneous features, i.e., the global (local) visual features and the multi-modal features, are fused by two fusion layers that are optimized under one unified framework.
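A PyTorch sketch of the two fusion layers and their supervised heads, with hypothetical names; the λ, α and β defaults are the values reported later in the parameter analysis and are hard-coded here only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Weighted concatenation (Equations (1)-(3)) plus the fc5/fc6 classifiers
    supervised by cross-entropy losses L_1 and L_2 (Equations (4)-(6))."""

    def __init__(self, num_classes: int = 7,
                 lambdas=(0.3, 0.7, 0.3, 0.7), alpha: float = 1.0, beta: float = 1.0):
        super().__init__()
        self.l1, self.l2, self.l3, self.l4 = lambdas
        self.alpha, self.beta = alpha, beta
        self.fc5 = nn.Linear(4096, num_classes)
        self.fc6 = nn.Linear(4096, num_classes)

    def forward(self, f_g, f_l, f_m, target=None):
        F_gm = torch.cat([self.l1 * f_g, self.l2 * f_m], dim=1)   # Eq. (2)
        F_lm = torch.cat([self.l3 * f_l, self.l4 * f_m], dim=1)   # Eq. (3)
        if target is None:
            # At test time, F_gm and F_lm are concatenated into the final
            # 8192-d representation that is passed to the SVM classifier.
            return torch.cat([F_gm, F_lm], dim=1)
        loss1 = F.cross_entropy(self.fc5(F_gm), target)           # L_1
        loss2 = F.cross_entropy(self.fc6(F_lm), target)           # L_2
        return self.alpha * loss1 + self.beta * loss2             # Eq. (6)
```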

Comparison Methods
In this subsection, we describe the comparison methods used in the experiments, including variants of MMFN as well as hand-crafted and learning-based methods.

Variants of MMFN
A unique advantage of the proposed MMFN is the capability of learning supplementary and complementary features, i.e., global visual features, local visual features and multi-modal features, from ground-based cloud data. Successful extraction of the expected features is guaranteed by several pivotal components, i.e., three networks, two fusion layers and the cooperative supervision of two losses. For the purpose of demonstrating their effectiveness on MGCD, we list several variants of the proposed MMFN as follows.
variant1. The variant1 is designed to learn only the global visual features of ground-based clouds. It directly connects a fully connected layer with 7 neurons to the main network, and the output of avgpool1, a 2048-dimensional vector, is used to represent the ground-based cloud images. Furthermore, the concatenation of the global visual features with the multi-modal information is denoted as variant1 + MI.
variant2. The variant2 is utilized to learn only the local visual features. It keeps the architecture of the main network before the second residual building block in conv3_x together with the attentive network, and then adds a fully connected layer with 7 neurons after the attentive network. The output of avgpool2, likewise a 2048-dimensional vector, is regarded as the representation of the ground-based cloud. Similarly, the concatenation of the local visual features and the multi-modal information is denoted as variant2 + MI.
variant3. The variant3 is designed to integrate global and local visual features. To this end, the multi-modal network, as well as the two fusion layers of MMFN, are removed. Furthermore, the variant3 adds a fully connected layer with 7 neurons after the main network and the attentive network, respectively. The outputs of avgpool1 and avgpool2 of the variant3 are concatenated, resulting in a 4096-dimensional vector as the final representation of ground-based cloud images. The feature representation of the variant3 integrated with the multi-modal information is referred to as variant3 + MI.
variant6. For the purpose of demonstrating the effectiveness of the two fusion layers, the variant6 integrates the outputs of the three networks using one fusion layer followed by one fully connected layer with 7 neurons. The output of the fusion layer, a 6144-dimensional vector, is treated as the ground-based cloud representation.
variant7. To learn discriminative local visual features, the MMFN reconstructs the salient responses to form the attentive maps. The counterpart variant7 is employed to highlight the advantage of this innovative strategy. Instead of aggregating the top n × n responses, variant7 randomly selects n × n responses from each convolutional activation map.
The sketches of variant1 ∼ variant6 are shown in Figure 6.

Hand-Crafted and Learning-Based Methods
In this part, we describe a series of methods for ground-based cloud classification, including the hand-crafted methods, i.e., local binary patterns (LBP) [54] and completed LBP (CLBP) [55], and the learning-based method, i.e., bag-of-visual-words (BoVW) [56].
(a) The BoVW builds the ground-based cloud image representation following the bag-of-features framework. It densely samples SIFT features [57], which are then clustered by K-means to generate a dictionary of 300 visual words. Based on the dictionary, each ground-based cloud image is transformed into a histogram of visual-word frequencies. To exploit spatial information, the spatial pyramid matching scheme [58] is employed to partition each ground-based cloud image into two additional levels with 4 and 16 sub-regions; with this scheme, each ground-based cloud image is represented by a 6300-dimensional histogram, and this method is denoted as PBoVW.

(b) The LBP operator assigns binary labels to the circular neighborhood of a center pixel according to the signs of the differences between the neighboring pixels and the center. In this work, the rotation-invariant uniform LBP descriptor LBP^{riu2}_{P,R} is used as the texture descriptor of ground-based cloud images, where (P, R) denotes P sampling points on a circle of radius R. We evaluate (P, R) set to (8, 1), (16, 2) and (24, 3), resulting in feature vectors with dimensionalities of 10, 18 and 26, respectively.
(c) The CLBP is proposed to improve LBP; it decomposes local differences into sign and magnitude components, and additionally considers the local central information as a third operator. These three operators are jointly combined to describe each ground-based cloud image. (P, R) is set to (8, 1), (16, 2) and (24, 3), resulting in 200-, 648- and 1352-dimensional feature vectors, respectively.
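As a rough illustration of the baselines in (a) and (b), the following Python sketch uses scikit-learn and scikit-image (an assumption, since the paper does not name its implementation) to build a BoVW histogram from precomputed dense SIFT descriptors and to compute the rotation-invariant uniform LBP histogram; the spatial pyramid and the CLBP variant are omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.feature import local_binary_pattern

def build_dictionary(train_descriptors: np.ndarray, k: int = 300) -> KMeans:
    """Cluster densely sampled SIFT descriptors into k visual words."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_descriptors)

def bovw_histogram(descriptors: np.ndarray, dictionary: KMeans) -> np.ndarray:
    """300-d visual-word frequency histogram for one image (BoVW);
    PBoVW concatenates 1 + 4 + 16 such histograms over pyramid sub-regions."""
    words = dictionary.predict(descriptors)
    hist = np.bincount(words, minlength=dictionary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def lbp_riu2_histogram(gray_image: np.ndarray, P: int = 8, R: float = 1.0) -> np.ndarray:
    """Rotation-invariant uniform LBP histogram; the 'uniform' method yields
    P + 2 codes, i.e. 10-, 18- and 26-d vectors for (8, 1), (16, 2), (24, 3)."""
    codes = local_binary_pattern(gray_image, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist
```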

Implementation Details
We first resize the ground-based cloud images to 252 × 252 and then randomly crop them to the fixed size of 224 × 224. The ground-based cloud images are also augmented by random horizontal flipping. Thereafter, each of them is normalized by mean subtraction. To ensure compatibility, the values of the multi-modal information, i.e., temperature, humidity, pressure and wind speed, are scaled to [0, 1].
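A torchvision sketch of this preprocessing; the normalization statistics shown are the common ImageNet values and are an assumption, since the text only specifies mean subtraction, and the min-max ranges for the multi-modal values are likewise assumed to come from the training set.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((252, 252)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def scale_multimodal(values, lows, highs):
    """Min-max scale temperature, humidity, pressure and wind speed to [0, 1]."""
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(values, lows, highs)]
```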
The main network is initialized with ResNet-50 pre-trained on the ImageNet dataset. We adopt the weight initialization method of [59] for the convolutional layers (conv2 and conv3) and the fully connected layers. The weights of the batch normalization layers are initialized from a normal distribution with mean 1 and standard deviation 0.02. All the biases in the convolutional, fully connected and batch normalization layers are initialized to zero. For the attentive maps, we discard the top 5 responses in each convolutional activation map, to alleviate the negative effects of noise or outliers, before forming the attentive maps.
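One possible realization of this initialization in PyTorch; the helper name is ours, and the method of [59] is approximated here with kaiming_normal_.

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """He/Kaiming initialization [59] for new conv and fully connected layers,
    N(1, 0.02) for batch-norm weights, and zero biases."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
        nn.init.normal_(module.weight, mean=1.0, std=0.02)
        nn.init.zeros_(module.bias)

# Applied only to the newly added layers (conv2, conv3 and the fc layers);
# the main network keeps its ImageNet-pretrained ResNet-50 weights, e.g.:
# new_layers.apply(init_weights)
```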
During the training phase, the SGD [60] optimizer is employed to update the parameters of the MMFN. The total number of training epochs is 50 with a batch size of 32. The weight decay is set to 2 × 10^{-4} with a momentum of 0.9. The learning rate is initialized to 3 × 10^{-4} and reduced by a factor of 0.2 at epochs 15 and 35. The slope of the Leaky ReLU is fixed to 0.1, and the drop rate of the dropout layer in the attentive network is a constant 0.5. Furthermore, the parameters of the multi-modal network are restricted to the box [−0.01, 0.01]. After training the MMFN, each cloud sample is represented as an 8192-dimensional vector by concatenating the fused features F_gm and F_lm. The final representations of the training samples are then used to train an SVM classifier [61].
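A training-loop sketch of this schedule in PyTorch; mmfn, multi_modal_net and train_loader are hypothetical objects, and the per-step clamping is only one plausible way to realize the [−0.01, 0.01] restriction.

```python
import torch

def train_mmfn(mmfn: torch.nn.Module, multi_modal_net: torch.nn.Module,
               train_loader, epochs: int = 50) -> None:
    """Training-loop sketch; `mmfn` is assumed to return alpha*L1 + beta*L2."""
    optimizer = torch.optim.SGD(mmfn.parameters(), lr=3e-4,
                                momentum=0.9, weight_decay=2e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[15, 35], gamma=0.2)
    for _ in range(epochs):
        for images, multimodal, labels in train_loader:   # batch size 32
            optimizer.zero_grad()
            loss = mmfn(images, multimodal, labels)
            loss.backward()
            optimizer.step()
            # Keep the multi-modal network parameters within [-0.01, 0.01].
            with torch.no_grad():
                for p in multi_modal_net.parameters():
                    p.clamp_(-0.01, 0.01)
        scheduler.step()
```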
Furthermore, several parameters introduced in this paper, i.e., the parameter n in the attentive network, the parameters λ_1 ∼ λ_4 in Equations (2) and (3), and the parameters α and β in Equation (6), are analysed in the Parameter Analysis subsection.

Data
The multi-modal ground-based cloud dataset (MGCD) is the first dataset composed of ground-based cloud images and the corresponding multi-modal information. It was collected in Tianjin, China, from March 2017 to December 2018, a period of 22 months, at different locations and times of day across all seasons, which ensures the diversity of the cloud data. The MGCD contains 8000 ground-based cloud samples, making it the largest public ground-based cloud dataset. Each sample consists of one ground-based cloud image and a corresponding set of multi-modal information, with a one-to-one correspondence between them. The cloud images were collected by a sky camera with a fisheye lens at a resolution of 1024 × 1024 pixels in JPEG format. The multi-modal information was collected by a weather station and stored as a vector with four elements, namely temperature, humidity, pressure and wind speed.
We divide the sky conditions into seven classes, as listed in Figure 7: (1) cumulus; (2) altocumulus and cirrocumulus; (3) cirrus and cirrostratus; (4) clear sky; (5) stratocumulus, stratus and altostratus; (6) cumulonimbus and nimbostratus; and (7) mixed cloud, following the genera-based classification recommendation of the World Meteorological Organization (WMO) as well as the visual similarities of clouds in practice. When the sky is covered by two or more cloud types, the sample is regarded as mixed cloud. Additionally, cloud images with a cloud cover of no more than 10% are categorized as clear sky. It should be noted that all cloud images were labeled by meteorological experts and ground-based cloud researchers after careful deliberation.
The ground-based cloud samples from the first 11 months constitute the training set, and those from the last 11 months constitute the test set; each set contains 4000 samples. Figure 7 presents sample images and the number of samples in each cloud category of the MGCD, where the multi-modal information is embedded in the corresponding cloud image. The MGCD is available at https://github.com/shuangliutjnu/Multimodal-Ground-based-Cloud-Database.
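For readers who want to reproduce the setup, a hypothetical PyTorch Dataset sketch pairing one cloud image with its four-element multi-modal vector is given below; the record layout is an assumption and does not reflect the released file format.

```python
import torch
from PIL import Image
from torch.utils.data import Dataset

class MGCDSamples(Dataset):
    """Hypothetical loader: each sample pairs one cloud image with a 4-element
    vector (temperature, humidity, pressure, wind speed) and a class label."""

    def __init__(self, records, transform=None):
        # records: list of dicts {"image": path, "modal": [t, h, p, w], "label": int}
        self.records = records
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(rec["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        modal = torch.tensor(rec["modal"], dtype=torch.float32)
        return image, modal, rec["label"]
```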

Results
In this section, we compare the proposed MMFN with its variants and with other methods on the MGCD, followed by an analysis of the classification results under different parameter settings.

Comparison with Variants of MMFN
The recognition accuracies of MMFN and its variants on the MGCD are presented in Table 1, from which several conclusions can be drawn. Firstly, both variant1 and variant2 achieve promising recognition accuracy, indicating that both global and local visual features are essential for cloud recognition: global visual features contain more semantic and coarse information, while local visual features contain more texture and fine information. The variant3 achieves a recognition accuracy of 86.25%, exceeding variant1 and variant2 by 3.1% and 4.02%, respectively, because it combines the benefits of global and local visual features.
Table 1. The recognition accuracy (%) of the proposed multi-evidence and multi-modal fusion network (MMFN) and its variants. The notation "+" indicates the concatenation operation.

Secondly, the methods that employ the multi-modal information (variant1 + MI, variant2 + MI, variant3 + MI, variant4 and variant5) are more competitive than those that do not (variant1, variant2 and variant3). Specifically, variant1 + MI improves over variant1 by 1.33%, and variant2 + MI and variant3 + MI improve over variant2 and variant3 by 1.47% and 0.85%, respectively. More importantly, variant4, variant5 and MMFN improve over variant1, variant2 and variant3 by 2.75%, 1.47% and 2.38%, respectively. We therefore conclude that jointly learning the cloud visual features and the multi-modal features under a unified framework can further improve the recognition accuracy.
Thirdly, from the comparison between variant6 and the proposed MMFN, we can discover that the recognition performance of the latter is superior to the former even though both of them learn the global visual features, local visual features and the multi-modal features under a unified framework. It indicates that fusing the multi-modal features with the global and local visual features, respectively, can sufficiently mine the complementary information among them and exploit more discriminative cloud features.
Finally, the proposed MMFN achieves better results than variant7, because the attentive maps of MMFN learn local visual features from the salient patterns in the convolutional activation maps, whereas variant7 randomly selects responses from the convolutional activation maps. Hence, the proposed attentive maps learn effective local visual features.

Comparison with Other Methods
The comparison results between the proposed MMFN and other methods, such as [32,62,63], are summarized in Table 2. Firstly, most results in the right part of the table are more competitive than those in the left part, which indicates that the multi-modal information is useful for ground-based cloud recognition. The visual features and the multi-modal information are supplementary to each other, and therefore integrating them yields extended information for ground-based cloud representation. Secondly, the CNN-based methods, such as CloudNet, JFCNN and DTFN, perform much better than the hand-crafted methods (LBP and CLBP) and the learning-based methods (BoVW and PBoVW), because CNNs apply highly nonlinear transformations that enable them to extract effective features from highly complex cloud data. Thirdly, the proposed MMFN improves over the CNN-based methods, which verifies the effectiveness of the multi-evidence and multi-modal fusion strategy. Such a strategy thoroughly investigates the correlations between the visual features and the multi-modal information and takes into account the complementary and supplementary information between them as well as their relative importance for the recognition task.

Parameter Analysis
In this subsection, we analyse the parameter n in the attentive network, the parameters λ 1 ∼ λ 4 in Equations (2) and (3), and the parameters α and β in Equation (6).
We first analyse the parameter n, which determines the size of the attentive map. The recognition accuracies with different n are displayed in Figure 8. The best recognition accuracy is achieved when n is equal to 7, i.e., the top 7 × 7 = 49 responses are selected from each 28 × 28 convolutional activation map. When n deviates from 7, the corresponding recognition accuracy falls below the peak value of 88.63%.
We then analyse the two pairs of parameters (λ_1, λ_2) and (λ_3, λ_4) in Equations (2) and (3). λ_1 and λ_2 balance the significance of the global visual features and the multi-modal features, while λ_3 and λ_4 balance the significance of the local visual features and the multi-modal features; appropriate settings of λ_1 ∼ λ_4 therefore optimize the recognition result. The recognition accuracies with different (λ_1, λ_2) and (λ_3, λ_4) settings are listed in Tables 3 and 4. Table 3 shows that the best recognition accuracy is obtained when (λ_1, λ_2) equals (0.3, 0.7), and Table 4 shows that the best accuracy is achieved when (λ_3, λ_4) is set to (0.3, 0.7).
Afterwards, we evaluate the parameters α and β, which trade off the losses L_1 and L_2 in Equation (6). The recognition performances with different α and β settings are summarized in Table 5; the best recognition accuracy is obtained when α = β = 1.
Besides, since more training data generally means better training of the network, we divide the dataset into different training/test ratios, i.e., 60/40, 70/30 and 80/20, to evaluate the influence of the number of training samples on recognition accuracy; the results are illustrated in Figure 9. As shown, more training samples lead to higher recognition performance.

Overall Discussion
Cloud classification is a basic and necessary part of climatological and weather research and provides indicative knowledge about both short-term weather conditions and long-term climate change [64]. Many algorithms have been developed for automatic cloud classification. Xiao et al. [65] aggregated texture, structure and color features extracted simultaneously from ground-based cloud images and encoded them with the Fisher vector as the cloud image descriptor; a linear SVM was then employed to group 1231 ground-based cloud images into six classes with an accuracy of 87.5%. Wang et al. [7] applied a selection criterion based on the Kullback-Leibler divergence between the LBP histograms of the original and resized ground-based cloud images to select the optimal resolution of the resized cloud image. The criterion was evaluated on three datasets containing 550 and 5000 ground-based cloud images from five classes and 1500 images from seven classes, respectively; the overall classification results on these three datasets are around 65.5%, 45.8% and 44%, respectively. Zhang et al. [26] employed CloudNet, composed of five convolutional layers and two fully connected layers, to divide 2543 ground-based cloud images into 10 categories with an average accuracy of 88%. Li et al. [62] presented a deep tensor fusion network which fuses cloud visual features and multi-modal features at the tensor level so that the spatial information of ground-based cloud images can be maintained; they obtained a classification accuracy of 86.48% over 8000 ground-based cloud samples. Liu et al. [63] fused deep multi-modal and deep visual features in a two-level fashion, i.e., low-level and high-level: the low level fuses the heterogeneous features directly, and its output forms part of the input of the high level, which also integrates deep visual and deep multi-modal features. The classification accuracy of this hierarchical feature fusion method was 87.9% over 8000 ground-based cloud samples.
Most of the above-mentioned studies achieve high accuracy, but the datasets used either contain a small number of ground-based images or are not public. The availability of sufficient ground-based cloud samples is a fundamental factor in allowing CNNs to work effectively. In addition, since cloud types change over time, appropriate fusion of multi-modal information and cloud visual information can improve the classification performance. The JFCNN [32] achieved excellent performance with an accuracy of 93.37% by learning ground-based cloud images and multi-modal information jointly. However, the dataset used in [32] contains only 3711 labeled cloud samples and is randomly split into training and test sets with a ratio of 2:1, which means there may be a high dependence between training and test samples. In this study, we create a more extensive dataset, MGCD, containing 8000 ground-based cloud samples with both cloud images and the corresponding multi-modal information. All the samples are divided into a training set and a test set of 4000 samples each, where the training set is drawn from the first 11 months and the test set from the last 11 months; hence, the training and test sets in the MGCD are independent. As salient local patterns play a decisive role in the decision-making procedure, we devise MMFN with three networks, i.e., the main network, the attentive network and the multi-modal network, to generate global visual features, local visual features and multi-modal features, and we fuse them with two fusion layers. The proposed MMFN obtains the best result of 88.63%. We first assess the rationality of each component of MMFN by comparing it with its variants. We then compare MMFN with other methods, including the hand-crafted methods (LBP and CLBP), the learning-based methods (BoVW and PBoVW) and the CNN-based methods (DMF, JFCNN, HMF and so on); the accuracy gaps between the proposed MMFN and the hand-crafted and learning-based methods exceed 18 percentage points, and the gap between the proposed MMFN and the second-best CNN-based method, i.e., HMF, is 0.73 percentage points.
It is reasonable that the proposed MMFN has a competitive edge over other methods. Affected by temperature, wind speed, illumination, noise, deformation and other environmental factors, cloud images exhibit large variability, which makes cloud recognition difficult, and a more effective strategy is therefore required to obtain extended cloud information. The proposed MMFN jointly learns the cloud multi-evidence and the multi-modal information and extracts powerful and discriminative features from ground-based cloud samples. Accordingly, the proposed MMFN achieves a significant improvement in recognition accuracy over other methods.

Potential Applications and Future Work
Generally, this research presents a new method that improves the accuracy of cloud classification by using cloud images and multi-modal information, which is beneficial to regional weather prediction. Furthermore, this research may provide a novel solution to other studies related to heterogeneous information fusion, for example, image-text recognition.
In this work, we utilized four weather parameters to improve cloud classification, and in future work we will investigate how to employ other measurements, such as cloud base height, for cloud classification. Additionally, we cannot guarantee that the MMFN trained on the MGCD will generalize well to other datasets, for example, cloud samples gathered in windier, colder or warmer conditions or at lower altitudes. Thus, we will explore unsupervised domain adaptation to enhance the generalization ability of the model in future work.

Conclusions
In this paper, we have proposed a novel method named MMFN for ground-based cloud recognition. The proposed MMFN can learn and fuse heterogeneous features under a unified framework. Furthermore, the attentive map is proposed to extract local visual features from salient patterns. To exploit the complementary benefits of the heterogeneous features, the multi-modal features are integrated with the global visual features and the local visual features, respectively, by two fusion layers. We have also released a new cloud dataset, MGCD, which includes both the cloud images and the multi-modal information. To evaluate the effectiveness of the proposed MMFN, we have conducted a range of experiments, and the results demonstrate that the proposed MMFN compares favorably with state-of-the-art methods.