Article

CloudY-Net: A Deep Convolutional Neural Network Architecture for Joint Segmentation and Classification of Ground-Based Cloud Images

1 School of Automation and Electrical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
2 Zhejiang International Science and Technology Cooperation Base of Intelligent Robot Sensing and Control, Hangzhou 310023, China
* Author to whom correspondence should be addressed.
Atmosphere 2023, 14(9), 1405; https://doi.org/10.3390/atmos14091405
Submission received: 25 August 2023 / Accepted: 4 September 2023 / Published: 6 September 2023
(This article belongs to the Section Atmospheric Techniques, Instruments, and Modeling)

Abstract

Ground-based cloud images contain a wealth of cloud information and are an important part of meteorological research. However, in practice, ground cloud images must be segmented and classified to obtain the cloud volume, cloud type and cloud coverage. Existing methods ignore the relationship between cloud segmentation and classification, and usually only one of these is studied. Accordingly, our paper proposes a novel method for the joint classification and segmentation of cloud images, called CloudY-Net. Compared to the basic Y-Net framework, which extracts feature maps from the central layer, we extract feature maps from four different layers to obtain more useful information and improve the classification accuracy. These feature maps are combined to produce a feature vector to train the classifier. Additionally, the multi-head self-attention mechanism is applied during the fusion process to further enhance the information interaction among features. A new module called Cloud Mixture-of-Experts (C-MoE) is proposed to enable the weights of each feature layer to be automatically learned by the model, thus improving the quality of the fused feature representation. Experiments are conducted on the open multi-modal ground-based cloud dataset (MGCD). The results demonstrate that the proposed model significantly improves the classification accuracy compared to classical networks and state-of-the-art algorithms, reaching a classification accuracy of 88.58%. In addition, we annotate 4000 images in the MGCD for cloud segmentation and produce a cloud segmentation dataset called MGCD-Seg. We obtain an mIoU of 96.55% on MGCD-Seg, validating the efficacy of our method in ground-based cloud image segmentation and classification.

1. Introduction

Clouds cover more than 50% of the Earth’s surface and are an important meteorological element [1]. They have a significant impact on the global climate system, global irradiance and water vapor changes [2,3,4,5]. The observation and prediction of clouds is essential for weather monitoring, climate forecasting and predicting photovoltaic power generation. Cloud cover and cloud type are the two main directions of cloud observation [6], so it is important to study the segmentation and classification of cloud images [7,8,9].
There are three primary methods of acquiring cloud maps: space-based satellites, air-based radio sounders and ground-based remote sensing observations [10]. Satellite observations are widely used in large-scale measurements. However, satellite imagery is impractical, expensive and time-consuming for hyperlocal cloud analysis, and its update frequency and resolution are low. Although air-based radio sounders are effective in detecting vertical cloud structures, they are quite expensive. As a result, ground-based remote sensing observation devices such as the total-sky imager (TSI) [11] and the all-sky imager [12] have rapidly evolved. These devices provide high-resolution, low-cost remote sensing images, which can facilitate local cloud analysis.
In addition, with the proliferation of ground-based cloud images, the segmentation and classification of clouds in ground-based cloud images have been the subject of extensive academic research. Traditional classification and segmentation techniques for cloud maps rely heavily on feature extraction and conventional machine learning algorithms.
For instance, Long et al. [13] proposed a cloud segmentation algorithm based on RGB color channels, and Heinle et al. [14] classified clouds by extracting the spectral and texture features of cloud maps using k-nearest neighbor (KNN) algorithms. Kazantzidis et al. [15] proposed an improved KNN classification method that considers the color and texture characteristics of the cloud map while incorporating multi-modal information. Additionally, texture, local structure and statistical features were proposed by Zhou et al. [16] as inputs to support vector machines for cloud classification. Moreover, Dev et al. [17] used feature extraction and clustering to achieve cloud image segmentation via probabilistic pixel binary segmentation. Zhu et al. [18] proposed a new channel attention module, ECA-WS, to improve the network’s ability to express channel information and used a decision fusion algorithm to solve the problems of network size limitations and dataset imbalance.
However, these methods have disadvantages when dealing with high-dimensional, nonlinear, noisy and large-volume cloud image data, including low classification accuracy and inadequate feature extraction. Therefore, it is necessary to develop new algorithms to improve the accuracy and efficiency of classification and segmentation.
With the rapid advancement of deep learning technology in recent years, deep learning and neural networks have been widely used in the field of atmospheric detection [19] and the classification and segmentation of ground-based cloud images based on deep learning has become a popular research field. In particular, the application of a convolutional neural network (CNN) in the classification and segmentation of ground-based cloud images has produced remarkable results. A CNN can automatically extract features from the original data while overcoming some limitations of traditional algorithms, significantly improving ground-based cloud images’ classification and segmentation.
Among these, Ye et al. [20] employed a CNN to extract cloud map features and Fisher vector coding and an SVM classifier to classify cloud maps. By optimizing the pooled feature map, Shi et al. [21] obtained the depth features of the cloud graph for cloud identification. In contrast, Zhang et al. [9] proposed a CloudNet model and achieved high accuracy on a self-built CCSN dataset. In addition, Li et al. [22] introduced the two-lead loss into the field of cloud classification segmentation, which had the ability to integrate information from multiple CNNs during the learning process to improve the model’s performance. Moreover, Liu et al. [23] created the MGCD with 8000 ground cloud maps and corresponding meteorological data and classified them with a multi-modal fusion algorithm. Huertas-Tato et al. [24] proposed an integrated learning algorithm to improve the classification accuracy by fusing the output probability vector of a CNN with a random forest classifier. Moreover, Liu et al. [25] proposed an MMFN network that combined heterogeneous features within a unified framework and learned extended cloud information. To enable lightweight mobile deployment, Gyasi and Swarnalatha [26] proposed Cloud-MobiNet, which can be deployed on smartphones.
In these previous studies, to our knowledge, cloud segmentation has been studied independently of cloud classification using several distinct methodologies. However, in areas such as PV forecasting, weather forecasting and climate research, cloud segmentation and classification are both crucial and widely utilized. Accordingly, we believe that the semantic segmentation information of cloud maps is beneficial to cloud classification tasks.
In this paper, considering the above-mentioned issues, we propose a deep convolutional neural network architecture for the joint segmentation and classification of ground-based cloud images, called CloudY-Net. This network improves on the classification branch of the Y-Net architecture. Using a single network, our study performs the dual tasks of cloud segmentation and classification, and the segmentation information improves the accuracy of cloud classification. The classification branch and segmentation branch share weights, and the classification branch pulls feature maps from each segmentation encoder layer. The multi-head self-attention mechanism is used to increase the interactions between the feature vectors in the combined input, mining deeper feature information and further improving feature expression. The C-MoE module learns a weight for each feature during training, and the input to the classifier is weighted accordingly, which significantly improves the cloud classification accuracy while keeping cloud segmentation effective.
To cater for both segmentation and classification training, we have produced a cloud segmentation dataset by annotating images from a publicly available multi-modal ground-based cloud dataset (MGCD) that contains cloud classification information. Named MGCD-Seg, this dataset contains 4000 ground-based cloud images and their corresponding semantic segmentation annotation files.
The contributions of this paper are summarized as follows.
  • The proposed CloudY-Net has both segmentation and classification branches, performing the dual task of cloud segmentation and cloud classification in one network.
  • The CloudY-Net improves on the traditional Y-Net with an enhanced classification branch by introducing more features from the segmentation branch. The classification accuracy is better than that of state-of-the-art neural networks.
  • We produce a new cloud segmentation dataset, MGCD-Seg, which contains 4000 ground-based cloud images and semantic segmentation annotation files.

2. Methodology

2.1. Y-Net Architecture

The U-Net architecture has been widely adopted across diverse domains, and its versatility has led to various modified versions tailored to specific image-related tasks [27,28]. Y-Net is a joint segmentation and classification network built on the U-Net segmentation network, and it excels in breast cancer segmentation and diagnosis tasks. A standard U-Net consists of an encoder path and a decoder path: the input image is first compressed into a central block by convolutional layers, and transposed and regular convolutional layers are then used to generate the segmentation mask. In addition, skip connections concatenate feature maps from the contracting and expanding blocks to improve the training performance. Y-Net adds a classification branch on top of U-Net and performs the classification task using features from the central block of U-Net.
Similar to the U-Net structure, the upper portion of Y-Net was also composed of an encoding–decoding structure. The entire network was composed of convolutional and pooling layers without fully connected layers, as shown in Figure 1. The coding network can be conceptualized as a stack of coding blocks and downsampling blocks. The coding block is responsible for learning the input representation, while downsampling teaches the network scale invariance. Both convolution and downsampling result in the loss of spatial information. The decoder can, thus, be viewed as a stack of upsampling and decoding blocks. The upsampling block helps to reduce the loss of spatial resolution, whereas the decoding block assists the network in compensating for the loss of spatial information in the encoder.
In order to achieve information sharing between the encoder and decoder, skip connections are introduced into Y-Net, as in U-Net. Moreover, in the encoding portion, unlike U-Net, Y-Net uses ESP blocks as opposed to normal convolutional layers, which makes the network wide and deep, as larger CNN architectures tend to perform better than smaller ones.

2.2. Deep Residual Networks

A deep residual network (ResNet) is a typical deep learning network, proposed in [30], that performs very well in object detection, image classification and semantic segmentation. ResNet introduces residual learning to address the difficulty of optimizing very deep neural networks. Formally, we denote the optimal mapping as $H(x)$ and let the stacked nonlinear layers fit another mapping, $F(x) = H(x) - x$. The optimal mapping is then recast as $H(x) = F(x) + x$. The assumption is that the residual mapping is easier to optimize than the original mapping: in the extreme case where an identity mapping is optimal, it is easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers. The residual network is illustrated in Figure 2. $F(x) + x$ can be realized by adding a shortcut connection to the feedforward network. The shortcut connection bypasses one or more layers and performs an identity mapping without adding parameters or computational complexity, and the entire network can still be trained end-to-end using stochastic gradient descent (SGD) and backpropagation.
ResNet treats each group of layers spanned by a shortcut connection as a residual block, and each building block is defined as follows:
$$y = F(x, \{W_i\}) + x \tag{1}$$
where $x$ and $y$ are the input and output vectors of the building block and $F(x, \{W_i\})$ is the residual mapping to be trained. The dimensions of $x$ and $F$ in Equation (1) need to be the same. If the dimensions do not match, a linear projection $W_s$ is added to the shortcut connection in order to match the dimensions:
$$y = F(x, \{W_i\}) + W_s x \tag{2}$$
ResNet is different from a traditional convolutional neural network in several ways. One of the key distinctions is the presence of bypassed branches, which connect the input directly to later layers. This enables the later layers to learn the residuals directly, thus mitigating the loss of information that occurs when passing information through traditional convolutional or fully connected layers. ResNet is able to protect the integrity of the information by directly bypassing the input information to the output and only learning the input–output differences. This simplifies the learning objective and reduces the overall difficulty in training the network.
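The two building-block definitions above can be illustrated with a minimal NumPy sketch. It is not the ResNet implementation used in the paper: fully connected layers stand in for convolutions, the row-vector convention `x @ W` is used, and the names `residual_block`, `w1`, `w2` and `w_s` are hypothetical.

```python
import numpy as np


def relu(z):
    """Element-wise rectified linear unit."""
    return np.maximum(z, 0.0)


def residual_block(x, w1, w2, w_s=None):
    """Return F(x, {W_i}) + x, or F(x, {W_i}) + W_s x when w_s is given.

    Assumption: F is a two-layer mapping with dense layers instead of
    the convolutional layers used in a real ResNet.
    """
    f = relu(x @ w1) @ w2                     # residual mapping F(x, {W_i})
    shortcut = x if w_s is None else x @ w_s  # identity or linear projection
    return f + shortcut


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 samples, 8 features

# Identity shortcut: if the residual is pushed to zero, the block returns x.
y_same = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))

# Projection shortcut: F maps 8 -> 16 features, so W_s must project x to 16.
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 16))
w_s = rng.normal(size=(8, 16))
y_proj = residual_block(x, w1, w2, w_s)
```

With zero residual weights the block reduces to the identity mapping, which is the extreme case discussed above; the projection variant is only needed when the input and output dimensions differ.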

3. Proposed Cloud Image Joint Segmentation and Classification Approach

3.1. Overall Framework of the Approach

Figure 3 depicts the proposed structure of CloudY-Net. The U-Net segmentation network serves as the backbone and segmentation branch of the joint classification and segmentation dual-task network. This branch has two components: an encoder and a decoder. The encoder compresses the input cloud image layer by layer using convolutional layers, resulting in a compact intermediate (central) block. Subsequently, the decoder generates the segmentation mask using both transposed and regular convolutional layers. Skip connections are established between encoder and decoder feature maps of matching size to enhance the segmentation performance of the network. The Y-Net framework uses only the feature information in the central block of the U-Net to create an additional classification branch [29]. In contrast, our modified CloudY-Net structure extracts features collectively from multiple layers of the segmentation branch instead of relying solely on the central block. We believe that the features obtained only from the central block are too simple to be sufficient, whereas the other layers in the segmentation branch contain richer cloud shape features that can positively impact cloud classification if this feature information is properly processed.
To fully leverage these features, each of the four extracted feature maps is compressed using convolutional blocks. To ensure that the compressed feature vectors have a consistent scale, the size of the convolution block decreases in proportion to the size of each encoder layer’s feature map, from large to small. The four feature vectors, now on the same scale, are then stacked and passed to the multi-head self-attention mechanism. This step strengthens the interactions and correlations among the individual feature vectors and extracts deeper information. The shape of the features output by the scaled dot-product self-attention mechanism remains unchanged, but their representational power is increased.
The enhanced feature terms are then fed into the C-MoE module. C-MoE learns the significance of each feature term and outputs the corresponding weight coefficients. These coefficients are used to weight the different feature terms, which improves the utilization of each layer of features. The C-MoE weighted fusion produces a 2 × 256 dimensional feature vector, which serves as the input to the classifier.

3.2. Multi-Head Self-Attention

The multi-head self-attention [31] mechanism can simultaneously compute attention on multiple subspaces, allowing it to handle the information interaction of multi-layer features better and adapt to different feature map sizes. In classification networks, the multi-head self-attention mechanism can be used to improve the classification performance by enhancing the expressiveness of feature representations, extracting richer and more accurate feature representations and so on.
It performs a linear projection of Q, K and V using projection parameter matrices, applies the attention function to the projected inputs and repeats this process h times in parallel, as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{o} \tag{3}$$
where $\mathrm{head}_i$ is calculated as
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{4}$$
The parameter matrices $W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_q}$, $W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$ and $W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$ perform the linear projections of Q, K and V, respectively.
The structure of the multi-head self-attention module is depicted in Figure 4. The multi-head self-attention mechanism operates along a sequence dimension, so when it is applied to a classification network, the input data are viewed along the channel dimension so that they can be treated as a sequence. Accordingly, we concatenate the four layers of feature vectors into a single feature matrix that contains all the extracted features and serves as the input to the attention mechanism. The attention module then calculates the correlations among the features, concatenates the outputs of the individual heads and produces a more comprehensive feature description. The resulting representation combines information at multiple scales and semantic levels, which makes it more complete and guards against the loss of crucial information.
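The multi-head computation described in this subsection can be sketched in NumPy as follows. This is an illustrative sketch, not the module used in CloudY-Net: the feature width `d_model = 256`, the head count `h = 8`, the random weights and the function name `multi_head_self_attention` are all assumptions, and scaled dot-product attention is assumed as the attention function.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    """Self-attention over a stack of feature vectors.

    x:             (seq_len, d_model), here Q = K = V = x
    w_q, w_k, w_v: (h, d_model, d_head) per-head projection matrices
    w_o:           (h * d_head, d_model) output projection
    """
    heads = []
    for wq, wk, wv in zip(w_q, w_k, w_v):
        q, k, v = x @ wq, x @ wk, x @ wv          # linear projections
        scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled dot products
        heads.append(softmax(scores, axis=-1) @ v)
    return np.concatenate(heads, axis=-1) @ w_o   # Concat(head_1..head_h) W^o


rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 256, 8   # four stacked feature vectors (sizes assumed)
d_head = d_model // h
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(h, d_model, d_head)) * 0.05 for _ in range(3))
w_o = rng.normal(size=(h * d_head, d_model)) * 0.05

out = multi_head_self_attention(x, w_q, w_k, w_v, w_o)
```

The output has the same shape as the input, so the four feature vectors keep their dimensions while each one now mixes in information from the other three.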

3.3. Cloud Mixture-of-Experts

Mixture-of-Experts (MoE) is a method that combines several expert models [32], with each model processing different aspects of the data. The final results are determined by averaging the outputs of each expert, weighted according to their significance. Inspired by the MoE Decomp in FEDformer, the proposed Cloud Mixture-of-Experts (C-MoE) extends this technique to combine different feature maps or feature extractors. In this section, we provide a more detailed analysis of the C-MoE module, elucidating its significance and the rationale for its inclusion.
  • Significance of Feature Maps: Our approach leverages C-MoE to evaluate the significance of each layer’s feature maps. Unlike traditional methods, C-MoE autonomously learns the relevance of different feature maps, allowing it to adapt dynamically to the data. This dynamic learning capability empowers the model to identify critical features that may not be apparent through manual feature engineering.
  • Feature Weighting with MLP: To further enhance the feature representation, we employ a multi-layer perceptron (MLP). The MLP takes the extracted features as input and produces corresponding weight coefficients, which are crucial in combining features effectively. This added layer of adaptability improves the model’s capacity to capture intricate relationships within the data.
  • Softmax for Coefficient Conversion: To ensure that our weight coefficients are valid and range from 0 to 1, we employ the Softmax function. Softmax converts the output weight coefficients of the MLP into a valid probability distribution, allowing them to be used as weighting factors for the features. This transformation guarantees that the weights are proportional and suitable for feature combination.
In this method, each feature is multiplied by its weight coefficient, and the resulting values are summed. The final output is a feature vector used for classification. One notable advantage of this approach is that it reduces the feature dimensionality, thereby improving both the classifier’s computational efficiency and overall accuracy. For the input feature maps $X_i$, an MLP is used to estimate the confidence levels separately, and the coefficients $w_i$, which sum to 1, are then obtained by scaling these confidence levels using Softmax:
$$w_i = \mathrm{softmax}(\mathrm{MLP}(X_i)) \tag{5}$$
The final output is the weighted sum of the features $X_i$ with the coefficients $w_i$:
$$\mathrm{Output} = \sum_{i=0}^{n+1} w_i X_i \tag{6}$$
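The weighting scheme in the two equations above can be sketched as follows. This is a hedged illustration rather than the authors' C-MoE module: it assumes a shared two-layer MLP that maps each feature vector to one scalar confidence, a Softmax taken across the features, and a single fused vector of width 256; the function name `c_moe_fuse` and all layer sizes are hypothetical.

```python
import numpy as np


def c_moe_fuse(features, w1, b1, w2, b2):
    """Fuse n feature vectors with Softmax-normalized MLP confidences.

    features: (n, d) stack of feature vectors X_i
    w1, b1:   first MLP layer, shapes (d, hidden) and (hidden,)
    w2, b2:   second MLP layer, shapes (hidden, 1) and (1,)
    Returns (weights of shape (n,), fused vector of shape (d,)).
    """
    hidden = np.maximum(features @ w1 + b1, 0.0)   # ReLU hidden layer
    scores = (hidden @ w2 + b2).squeeze(-1)        # one confidence per feature
    scores = scores - scores.max()                 # stabilize the exponentials
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = weights @ features                     # sum_i w_i * X_i
    return weights, fused


rng = np.random.default_rng(0)
n, d, hidden_size = 4, 256, 32                     # assumed sizes
features = rng.normal(size=(n, d))
w1 = rng.normal(size=(d, hidden_size)) * 0.1
b1 = np.zeros(hidden_size)
w2 = rng.normal(size=(hidden_size, 1)) * 0.1
b2 = np.zeros(1)

weights, fused = c_moe_fuse(features, w1, b1, w2, b2)
```

If the MLP outputs the same score for every feature, the weights become uniform and the fusion reduces to a plain average, so the learned weights can be read as a departure from simple averaging.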
Although both the self-attention mechanism and the C-MoE can be used to improve the interactions among features, their functions do not exactly overlap.
The self-attention mechanism is primarily used to enhance the interactions among the individual features: it calculates the relationships among the inputs, maps them to a unified vector space and weights the correlation of each input with the others to obtain a more expressive feature representation.
On the other hand, C-MoE is primarily used to learn the weight of each feature automatically. These weights combine feature information from different levels, and the importance of each feature is determined during the model’s training. C-MoE can improve the classification accuracy by increasing the weights of important features. The two modules can be used in tandem to further enhance the model’s performance.

3.4. Cross-Entropy Loss Function

Our proposed model uses the cross-entropy loss function, which is commonly used in classification tasks. Its core idea is to measure the difference between the model’s predicted probability distribution and the true label distribution for each sample. By minimizing the cross-entropy loss, the model is encouraged to better fit the training data, thereby improving the classification performance.
In our task, each sample has an associated true label, indicating which category the sample belongs to. The model’s prediction results in a probability distribution representing the predicted probabilities for each category. The cross-entropy loss quantifies the difference between these two probability distributions and serves as the optimization objective.
Over a set of N samples, the cross-entropy loss is computed as follows:
$$L(y, P) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log(P_{i,j}) \tag{7}$$
Here, $L(y, P)$ represents the cross-entropy loss, N is the number of samples and C is the number of categories. $y_{i,j}$ denotes the true label probability for sample i belonging to category j, and $P_{i,j}$ represents the model’s predicted probability for sample i belonging to category j.
The computation of the cross-entropy loss involves comparing the true label distribution and the model’s predicted probability distribution for each category, evaluating the model performance. The optimization goal is to minimize this loss, enabling the model to make more accurate predictions about sample categories.
In our proposed model, the cross-entropy loss function is employed to optimize the performance of the 7-class classification task. By minimizing the cross-entropy loss, we encourage the model to generate predicted probability distributions that closely resemble the true label distribution. This helps to improve the classification accuracy, making the model better suited to handle different types of cloud images.
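As a small worked example of the loss defined above, the following NumPy sketch evaluates it for two samples and seven classes. The one-hot labels and the predicted probabilities are made up for illustration, and the clipping constant `eps` is an implementation assumption that avoids taking the logarithm of zero.

```python
import numpy as np


def cross_entropy(y_true, probs, eps=1e-12):
    """Mean over samples of -sum_j y_ij * log(P_ij).

    y_true: (N, C) true label distribution (one-hot here)
    probs:  (N, C) predicted class probabilities
    """
    probs = np.clip(probs, eps, 1.0)   # guard against log(0)
    return float(-(y_true * np.log(probs)).sum(axis=1).mean())


# Two samples, seven cloud classes (labels chosen arbitrarily).
y = np.zeros((2, 7))
y[0, 3] = 1.0
y[1, 0] = 1.0

# An uninformative prediction: every class gets probability 1/7.
p_uniform = np.full((2, 7), 1.0 / 7.0)
loss_uniform = cross_entropy(y, p_uniform)

# A confident, correct prediction: 0.94 on the true class, 0.01 elsewhere.
p_good = np.full((2, 7), 0.01)
p_good[0, 3] = 0.94
p_good[1, 0] = 0.94
loss_good = cross_entropy(y, p_good)
```

The uniform prediction gives a loss of log 7, while the confident correct prediction gives the much smaller value -log 0.94, which is the behavior that minimizing the loss rewards.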

4. Experiment

4.1. Data

4.1.1. Dataset for Classification

We conducted experiments on the multi-modal ground-based cloud dataset (MGCD) to demonstrate the efficacy of this method. The MGCD is a public ground-based cloud image dataset collected by a sky fish-eye camera in Tianjin, China, over 22 months in 2017–2018. It consists of 8000 ground-based cloud images with a resolution of 1024 × 1024 pixels in JPEG format. Based on the World Meteorological Organization’s (WMO) cloud genus definition, the approximate cloud appearance and the actual sky conditions, the dataset is classified into seven cloud types, as depicted in Figure 5: (a) cumulus, (b) altocumulus and cirrocumulus, (c) cirrus and cirrostratus, (d) clear sky, (e) stratocumulus, stratus and altostratus, (f) cumulonimbus and nimbostratus and (g) mixed cloud. Because the cloud images in the MGCD are temporally continuous, adjacent images are relatively similar. Randomly assigning images to the training and test sets therefore yields a high accuracy that does not reflect real conditions. Consequently, the dataset was split by acquisition time: the first 50% of the images were used as the training set and the remaining 50% as the test set, with 4000 samples in each set. Table 1 displays the quantities and proportions of the training set and test set samples from each cloud category in MGCD.
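The time-based split can be sketched in a few lines of Python. The record layout, a list of `(timestamp, path)` pairs, and the file names are hypothetical; the MGCD's actual file naming and metadata are not described here.

```python
def chronological_split(samples, train_fraction=0.5):
    """Split samples by acquisition time instead of at random.

    samples: list of (timestamp, path) pairs; any sortable timestamp works.
    Returns (train, test), where every training sample precedes every
    test sample in time.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records in arbitrary order; integers stand in for timestamps.
samples = [(t, f"img_{t:04d}.jpg") for t in (5, 1, 4, 2, 8, 7, 3, 6)]
train, test = chronological_split(samples)
```

Because near-duplicate neighboring frames end up on the same side of the cut, the test accuracy is not inflated by images that are almost identical to training images.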

4.1.2. Dataset for Segmentation

Because the MGCD is a classification dataset, the original images were annotated only with cloud categories. Training the segmentation branch of CloudY-Net requires segmentation annotations, so we annotated the ground-based cloud images of the MGCD prior to training using the EISeg interactive segmentation and annotation software. The cloud tag was used to annotate the cloud region, assigning a pixel value of 1. All non-cloud regions (the sky background and the black areas around the image) were annotated with the background tag, with a pixel value of 0. Each annotated image was saved as a file. In total, we labeled 4000 ground-based cloud images to create a cloud segmentation dataset called MGCD-Seg, which was used for the training and segmentation experiments of CloudY-Net’s segmentation branch to demonstrate the efficacy of this method for cloud segmentation. Because the label pixel values are low (0 and 1), the annotations become visible only after the cloud and non-cloud regions are filled with different colors, as depicted in Figure 6, where the white area is the cloud and the black area is the background.
In order to ensure that the trained neural network achieved good results for different cloud cover and cloud types, we labeled the ground-based cloud images of different cloud types. Among them, Cu, Ac & Cc, Ci & Cs, Cb & Ns and mixed were each labeled with 600 images, while the two types of clear sky and Sc, St & As were only labeled with 500 images each, because their cloud cover was 0% and 100%, respectively. The data statistics of the labeled dataset were calculated, and the ratio of the pixels labeled as clouds in each ground-based cloud image to the sky pixels was calculated, so as to obtain the cloud cover of each labeled cloud image. According to the amount of cloud cover, we divided the cloud image into five levels, 0% to 20%, 20% to 40%, 40% to 60%, 60% to 80% and 80% to 100%, and counted the proportion of cloud images with different cloud cover in MGCD-Seg, as shown in Table 2.
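The cloud cover statistic and the five-level binning can be sketched as follows. Since the background label covers both clear sky and the black border of the fish-eye image, this sketch assumes a separate boolean mask, here called `sky_region`, that marks the pixels inside the camera's field of view; that mask, the function names and the rule that a boundary value falls into the higher bin are all assumptions.

```python
import numpy as np

BIN_LABELS = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]


def cloud_cover(label, sky_region):
    """Fraction of sky pixels labeled as cloud.

    label:      2-D array, 1 = cloud, 0 = background
    sky_region: 2-D boolean array, True inside the field of view (assumed)
    """
    sky_pixels = int(sky_region.sum())
    cloud_pixels = int((label[sky_region] == 1).sum())
    return cloud_pixels / sky_pixels


def cover_level(cover):
    """Map a cloud cover fraction in [0, 1] to one of five 20% levels."""
    return BIN_LABELS[min(int(cover * 5), 4)]


# Toy 8 x 8 example: a circular field of view whose left half is cloud.
yy, xx = np.mgrid[:8, :8]
sky_region = (yy - 3.5) ** 2 + (xx - 3.5) ** 2 <= 16
label = np.zeros((8, 8), dtype=np.uint8)
label[:, :4] = 1
label[~sky_region] = 0   # the border outside the field of view is background

cover = cloud_cover(label, sky_region)
level = cover_level(cover)
```

In this toy case exactly half of the sky pixels are cloud, so the image falls into the 40% to 60% level.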

4.2. Experimental Details

4.2.1. Experimental Environment

The experiments were conducted on a server running the Ubuntu 18.04 OS, equipped with 128 GB of RAM, an Intel Xeon E5-2680 V4 processor operating at a maximum frequency of 2.40 GHz and four NVIDIA GeForce RTX 3090 Ti graphics cards with 24 GB of video memory. PyTorch 1.7.1 was used as the deep learning framework, and the CUDA version was 11.4.

4.2.2. Experimental Setup

Data augmentation was applied to the cloud images to improve the model’s robustness to noise. The operations included (1) a random horizontal flip; (2) a random vertical flip; (3) conversion to grayscale with a probability of 50%; (4) a random 45° rotation; and (5) random changes in brightness, contrast, saturation and hue. The input cloud images were resized to a resolution of 224 × 224, and transfer learning was employed for training. The network’s learning rate was fixed at $3 \times 10^{-4}$, and the Adam optimizer was used for optimization. The batch size was set to 32 for training and testing, the number of classes to 7 and the number of epochs to 200.

5. Results and Discussion

5.1. Segmentation

This section displays the results of cloud cluster segmentation on the dataset. We conducted cloud cluster segmentation experiments on the segmented dataset using the CloudY-Net network and assessed the model’s performance using three evaluation metrics: mean intersection over union (mIoU), mean pixel accuracy (mPA) and overall accuracy. These three metrics can provide a comprehensive performance evaluation as they provide different levels of performance information, which helps to understand how the model performs in different aspects.
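For reference, the three metrics can be computed from a confusion matrix as in the sketch below. It uses the usual definitions (per-class intersection over union and per-class pixel accuracy, each averaged over classes, plus the overall fraction of correct pixels), which are assumed to match those used in the experiments; the function name and the toy masks are made up, and every class is assumed to appear in the ground truth so that no division by zero occurs.

```python
import numpy as np


def segmentation_metrics(pred, target, num_classes=2):
    """Compute mIoU, mPA and overall accuracy for integer label maps."""
    idx = target.ravel().astype(int) * num_classes + pred.ravel().astype(int)
    # cm[t, p] counts pixels of true class t predicted as class p.
    cm = np.bincount(idx, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp)   # per-class IoU
    pa = tp / cm.sum(axis=1)                            # per-class pixel accuracy
    return {
        "mIoU": float(iou.mean()),
        "mPA": float(pa.mean()),
        "accuracy": float(tp.sum() / cm.sum()),
    }


# Toy 2 x 4 masks: 1 = cloud, 0 = background.
target = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0]])
pred = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0]])
metrics = segmentation_metrics(pred, target)
```

In this toy case each class has three correctly labeled pixels out of four, with one false positive and one false negative, giving an mIoU of 0.6 and an mPA and accuracy of 0.75.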
Table 3 shows the segmentation performance of CloudY-Net with three backbone choices: VGG16, RegNet and ResNet50. RegNet is the most recently proposed of the three and was expected to perform better than the other two. However, the experimental results show that with ResNet50 as the backbone, our model achieved the best results on MGCD-Seg, with an mIoU of 96.55%, an mPA of 98.26% and an accuracy of 98.33%. A possible explanation is that different backbone networks adapt differently to the characteristics and distribution of the data, and ResNet50 may simply be a better fit for our dataset.
In order to intuitively reflect the segmentation effect of CloudY-Net, we input a group of five ground-based cloud images into the network and obtained their segmented images. At the same time, the labeled images of these five figures were also listed as a group and are shown in Figure 7 for comparison, in which the green pixels represent the cloud and the black pixels represent the background of the segmented image. In the annotated images, white pixels represent the cloud and black pixels represent the background. In order to facilitate comparative observation, we superimposed the original image and segmentation image together to form a mask image. It can be seen that the green coverage area highly overlapped with the cloud area, and some edge and corner details were also handled very well. This shows that the segmentation effect of the network is excellent, with high accuracy and robustness.

5.2. Classification

5.2.1. Classification Method Experiment

The classification accuracy of each network model is shown in Table 4. The KNN algorithm, based on texture or spectral features [14], performed poorly on both datasets. Cloud images are richer in texture and semantic content than many other image types, so they cannot be classified using texture or spectral features alone; additional image characteristics are required. Owing to their strong feature extraction capacity, CNNs have been used extensively in recent years for ground-based cloud image classification. CloudNet [9], which is based on a CNN, achieves good results on the MGCD.
Additionally, HMF [23], which fuses multi-modal meteorological data on top of a traditional CNN, achieved a classification accuracy of 87.90%. Although other classical CNN models also achieved good classification accuracy, our CloudY-Net achieved the highest classification accuracy on the MGCD, 88.58%. The improved network strengthens cloud feature extraction and optimizes the weight distribution of the classification output vector. The comparison with recent algorithms and networks shows the advantage of CloudY-Net. Accordingly, CloudY-Net is effective for the classification of ground-based cloud images and is expected to have a significant impact on areas such as photovoltaic power prediction and meteorological cloud calculation.

5.2.2. Ablation Experiment

This subsection describes comparison experiments that test the effect of the modification whereby CloudY-Net acquires four feature layers from the encoder. Since the original Y-Net takes feature maps only from the central block for classification, we designed the experiments to start from the central block and introduce additional feature maps incrementally.
The experimental results are shown in Figure 8, where the bar graph represents the classification accuracy of the network and the line chart represents the growth rate of the accuracy. As the number of introduced feature maps increased, the classification accuracy improved. The line chart further shows that the accuracy improved the most when the third layer of feature maps was introduced. The third-layer feature map therefore had a large influence on the classification, which is also supported by Figure 9.
Figure 9 shows the feature maps transformed into more readily interpretable feature heat maps, which we use to confirm the positive impact of each layer of feature maps on the classification.
All four groups of feature heat maps show that the network focuses on the clouds in the images; brighter areas indicate regions to which the network pays more attention. Each feature heat map contains highlighted areas that coincide with the cloud contour, most clearly in the third-layer feature map. Consequently, each layer of features introduced by CloudY-Net is valid and useful for classification. Our proposed enhancements increase the amount of valid feature information acquired and thereby strengthen the network’s capacity for classification.
Next, we conducted comparative experiments on the use of the self-attention mechanism. Table 5 reports the results of three models: without self-attention, with dot product self-attention and with multi-head self-attention. Self-attention improved the classification accuracy, and the improvement from multi-head self-attention (87.9% to 88.58%) was greater than that from standard dot product self-attention (87.9% to 88.0%).
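To make the difference between the two attention variants concrete, the sketch below applies scaled dot-product attention either to whole feature vectors (one head) or to equal-sized chunks of them (several heads) and concatenates the per-head results. It is a simplified illustration under stated assumptions: the learned query, key, value and output projections of the standard mechanism [31] are replaced by the identity, and the four input vectors are hypothetical stand-ins for the per-layer features.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[t] for w, v in zip(weights, values))
            for t in range(len(values[0]))
        ])
    return out


def multi_head_self_attention(tokens, num_heads):
    """Split each token into `num_heads` chunks, attend per chunk, concatenate.

    Assumption: learned projections are omitted (identity) for brevity.
    """
    dim = len(tokens[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        chunk = [t[h * head_dim:(h + 1) * head_dim] for t in tokens]
        heads.append(attention(chunk, chunk, chunk))
    return [
        sum((heads[h][i] for h in range(num_heads)), [])
        for i in range(len(tokens))
    ]


# Hypothetical features: one 4-dimensional vector per encoder layer.
tokens = [
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
single_head = multi_head_self_attention(tokens, num_heads=1)
two_heads = multi_head_self_attention(tokens, num_heads=2)
```

With several heads, each chunk of the feature vector gets its own attention weights, so different sub-spaces can attend to different layers; with one head, a single set of weights is shared across the whole vector.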
For the C-MoE presented in this paper, we set up comparison experiments in which the weight of each feature layer at fusion was set manually rather than learned and assigned automatically as in C-MoE. As shown in Table 6, we considered several weight assignments and, because the third-layer feature performed best in the previous experiments, included one that increases its weight. When the weight of the third-layer feature was set to 0.4, the accuracy of the model reached 85.64%, higher than that of the other fixed-weight schemes. Even so, the highest classification accuracy, 88.58%, was obtained with the C-MoE approach.
The use of C-MoE is therefore superior to setting the weights manually or directly setting all layers to the same weight. It can improve the computational efficiency and accuracy of the classifier.
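The contrast between fixed and learned fusion weights can be sketched as follows. This is only an illustration of the general idea of a softmax gate over per-layer features and should not be read as the actual C-MoE implementation; `fuse`, `gate_weights`, the gate parameters and the feature vectors are all hypothetical.

```python
import math


def fuse(features, weights):
    """Weighted sum of equally sized feature vectors."""
    dim = len(features[0])
    return [
        sum(w * f[t] for w, f in zip(weights, features))
        for t in range(dim)
    ]


def gate_weights(features, gate_params):
    """Softmax over one scalar score per feature vector.

    `gate_params` stands in for parameters that would be learned during
    training (hypothetical values are used below).
    """
    scores = [
        sum(g * x for g, x in zip(params, feature))
        for params, feature in zip(gate_params, features)
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional features from the four encoder layers.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Fixed weights (a, b, c, d) as in the third fixed-weight row of Table 6.
fixed = fuse(features, [0.2, 0.2, 0.4, 0.2])

# Adaptive weights produced by the (hypothetical) gate.
gate_params = [[0.1, 0.1], [0.2, 0.0], [0.5, 0.5], [0.3, 0.3]]
adaptive_w = gate_weights(features, gate_params)
adaptive = fuse(features, adaptive_w)
```

In the fixed scheme the weights are the same for every image, whereas a gate of this kind recomputes them from the features of each input, which is the property that manual tuning cannot reproduce.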

6. Conclusions

In this study, we present an enhanced Y-Net-based technique for the joint classification and segmentation of ground-based cloud images. Using a single network, the proposed CloudY-Net can both segment and classify ground-based cloud images. To this end, we enhanced the classification branch of the original Y-Net framework. Feature maps were acquired from four different layers of the encoder for the classification task, and the acquired features were fed into a multi-head self-attention mechanism to obtain a feature representation with increased representational power.
Meanwhile, the C-MoE module learned the significance of each feature, assigned weights to them and combined them into a classification feature vector. In addition, we created a ground-based cloud image segmentation dataset, MGCD-Seg, with 4000 images for training the segmentation branch. To evaluate CloudY-Net, we conducted a series of experiments. The results indicate that the segmentation branch of the current network structure performs the segmentation of ground-based cloud images effectively: our model achieved an mIoU of 96.55%, an mPA of 98.26% and an accuracy of 98.33% on MGCD-Seg. The improved classification branch also obtained better classification accuracy than both traditional methods and recent networks. The traditional KNN model reached only 68.9%, and Inception V3, the strongest of the compared networks, reached 88.32%, whereas CloudY-Net achieved the highest classification accuracy of 88.58% on the MGCD. The improved network strengthens cloud feature extraction and optimizes the weight distribution of the classification output vector. CloudY-Net is therefore effective for the classification of ground-based cloud images and is expected to have an important impact on fields such as photovoltaic power generation prediction and meteorological cloud calculation.
Currently, our improvement over the original Y-Net is limited to the classification branch, and the accuracy of cloud segmentation still needs further improvement. Future work will therefore consider modifying and enhancing the segmentation branch of CloudY-Net to improve the model’s segmentation performance.

Author Contributions

Methodology, F.H. and B.H.; software, Y.Z.; validation, F.H., B.H. and W.Z.; formal analysis, W.Z. and Q.Z.; data curation, F.H., Y.Z. and Q.Z.; writing—original draft preparation, F.H. and B.H.; writing—review and editing, B.H. and W.Z.; supervision, Y.Z. and Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D Program of Zhejiang Province, No. 2021C04030; the Natural Science Foundation of Zhejiang Province, No. LGG21F030004; and the “Pioneer” and “Leading Goose” R&D Program of Zhejiang Province, No. 2022C04012.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from ref. [25] and are available (https://github.com/shuangliutjnu/Multimodal-Ground-based-Cloud-Database (accessed on 1 May 2021)) with the permission of ref. [25]. The MGCD used in this work is an open-source dataset available in the corresponding references within this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Huang, W.; Wang, Y.; Chen, X. Cloud detection for high-resolution remote-sensing images of urban areas using colour and edge features based on dual-colour models. Int. J. Remote Sens. 2018, 39, 6657–6675.
  2. Dagan, G.; Koren, I.; Kostinski, A.; Altaratz, O. Organization and oscillations in simulated shallow convective clouds. J. Adv. Model. Earth Syst. 2018, 10, 2287–2299.
  3. Goren, T.; Rosenfeld, D.; Sourdeval, O.; Quaas, J. Satellite observations of precipitating marine stratocumulus show greater cloud fraction for decoupled clouds in comparison to coupled clouds. Geophys. Res. Lett. 2018, 45, 5126–5134.
  4. Gorodetskaya, I.V.; Kneifel, S.; Maahn, M.; Van Tricht, K.; Thiery, W.; Schween, J.; Mangold, A.; Crewell, S.; Van Lipzig, N. Cloud and precipitation properties from ground-based remote-sensing instruments in East Antarctica. Cryosphere 2015, 9, 285–304.
  5. Zheng, Y.; Rosenfeld, D.; Zhu, Y.; Li, Z. Satellite-based estimation of cloud top radiative cooling rate for marine stratocumulus. Geophys. Res. Lett. 2019, 46, 4485–4494.
  6. Utrillas, M.P.; Marín, M.J.; Estellés, V.; Marcos, C.; Freile, M.D.; Gómez-Amo, J.L.; Martínez-Lozano, J.A. Comparison of Cloud Amounts Retrieved with Three Automatic Methods and Visual Observations. Atmosphere 2022, 13, 937.
  7. Fu, H.; Shen, Y.; Liu, J.; He, G.; Chen, J.; Liu, P.; Qian, J.; Li, J. Cloud detection for FY meteorology satellite based on ensemble thresholds and random forests approach. Remote Sens. 2018, 11, 44.
  8. Liu, S.; Li, M.; Zhang, Z.; Xiao, B.; Cao, X. Multimodal ground-based cloud classification using joint fusion convolutional neural network. Remote Sens. 2018, 10, 822.
  9. Zhang, J.; Liu, P.; Zhang, F.; Song, Q. CloudNet: Ground-based cloud classification with deep convolutional neural network. Geophys. Res. Lett. 2018, 45, 8665–8672.
  10. Wang, Y.; Wang, C.; Shi, C.; Xiao, B. A selection criterion for the optimal resolution of ground-based remote sensing cloud images for cloud classification. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1358–1367.
  11. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
  12. Nouri, B.; Kuhn, P.; Wilbert, S.; Hanrieder, N.; Prahl, C.; Zarzalejo, L.; Kazantzidis, A.; Blanc, P.; Pitz-Paal, R. Cloud height and tracking accuracy of three all sky imager systems for individual clouds. Sol. Energy 2019, 177, 213–228.
  13. Long, C.N.; Sabburg, J.M.; Calbó, J.; Pagès, D. Retrieving cloud characteristics from ground-based daytime color all-sky images. J. Atmos. Ocean. Technol. 2006, 23, 633–652.
  14. Heinle, A.; Macke, A.; Srivastav, A. Automatic cloud classification of whole sky images. Atmos. Meas. Tech. 2010, 3, 557–567.
  15. Kazantzidis, A.; Tzoumanikas, P.; Bais, A.F.; Fotopoulos, S.; Economou, G. Cloud detection and classification with the use of whole-sky ground-based images. Atmos. Res. 2012, 113, 80–88.
  16. Zhuo, W.; Cao, Z.; Xiao, Y. Cloud classification of ground-based images using texture–structure features. J. Atmos. Ocean. Technol. 2014, 31, 79–92.
  17. Dev, S.; Lee, Y.H.; Winkler, S. Multi-level semantic labeling of sky/cloud images. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 636–640.
  18. Zhu, W.; Chen, T.; Hou, B.; Bian, C.; Yu, A.; Chen, L.; Tang, M.; Zhu, Y. Classification of ground-based cloud images by improved combined convolutional network. Appl. Sci. 2022, 12, 1570.
  19. Roy, D.S. Forecasting the air temperature at a weather station using deep neural networks. Procedia Comput. Sci. 2020, 178, 38–46.
  20. Ye, L.; Cao, Z.; Xiao, Y.; Li, W. Ground-based cloud image categorization using deep convolutional visual features. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 4808–4812.
  21. Shi, C.; Wang, C.; Wang, Y.; Xiao, B. Deep convolutional activations based features for ground-based cloud classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 816–820.
  22. Li, M.; Liu, S.; Zhang, Z. Dual guided loss for ground-based cloud classification in weather station networks. IEEE Access 2019, 7, 63081–63088.
  23. Liu, S.; Duan, L.; Zhang, Z.; Cao, X. Hierarchical multimodal fusion for ground-based cloud classification in weather station networks. IEEE Access 2019, 7, 85688–85695.
  24. Huertas-Tato, J.; Martín, A.; Camacho, D. Cloud type identification using data fusion and ensemble learning. In Proceedings of the Intelligent Data Engineering and Automated Learning–IDEAL 2020: 21st International Conference, Guimaraes, Portugal, 4–6 November 2020; Proceedings, Part II 21. Springer: Berlin/Heidelberg, Germany, 2020; pp. 137–147.
  25. Liu, S.; Li, M.; Zhang, Z.; Xiao, B.; Durrani, T.S. Multi-evidence and multi-modal fusion network for ground-based cloud recognition. Remote Sens. 2020, 12, 464.
  26. Gyasi, E.K.; Swarnalatha, P. Cloud-MobiNet: An Abridged Mobile-Net Convolutional Neural Network Model for Ground-Based Cloud Classification. Atmosphere 2023, 14, 280.
  27. Xu, J.; Zhou, W.; Chen, Z.; Ling, S.; Le Callet, P. Binocular rivalry oriented predictive autoencoding network for blind stereoscopic image quality measurement. IEEE Trans. Instrum. Meas. 2020, 70, 1–13.
  28. Liu, S.; Liu, S.; Zhang, S.; Li, B.; Hu, W.; Zhang, Y.D. SSAU-Net: A spectral–spatial attention-based U-Net for hyperspectral image fusion. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–16.
  29. Mehta, S.; Mercan, E.; Bartlett, J.; Weaver, D.; Elmore, J.G.; Shapiro, L. Y-Net: Joint segmentation and classification for diagnosis of breast biopsy images. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018; Proceedings, Part II 11. Springer: Berlin/Heidelberg, Germany, 2018; pp. 893–901.
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
  32. Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv 2020, arXiv:2006.16668.
Figure 1. Comparison between U-Net and Y-Net [29].
Figure 2. Schematic diagram of residual network [30].
Figure 3. Architecture of the Y-Net convolutional neural network used for joint classification and segmentation of ground-based cloud images. Compared with the classic Y-Net architecture, we add three feature maps from the encoder part. The attention mechanism and C-MoE module are added to obtain deeper feature information.
Figure 4. Multi-head self-attention.
Figure 5. MGCD. (a) Cu; (b) Ac and Cc; (c) Ci and Cs; (d) clear sky; (e) Sc, St and As; (f) Cb and Ns; (g) mixed. The dataset merges altocumulus with cirrocumulus; cirrus with cirrostratus; stratocumulus, stratus and altostratus; and cumulonimbus with nimbostratus.
Figure 6. Example of split dataset label visualization.
Figure 7. Examples of segmentation results for ground-based cloud images using the proposed architecture. (a) Original input cloud images; (b) ground truth of the cloud; (c) network output (segmentation maps); (d) mask images obtained by overlaying the output images on the input images, to show the segmentation quality intuitively.
Figure 8. Classification accuracy of CloudY-Net with different feature maps on 8000 labeled MGCD cloud images.
Figure 9. Examples of feature heat maps for each encoder layer in CloudY-Net. (a) Original cloud image; (b–e) feature heat maps of the first, second, third and fourth encoder layers, respectively.
Table 1. Ground-based cloud image dataset.

| Cloud Type | Train | Test | Sum | Percentage |
|---|---|---|---|---|
| Cumulus | 690 | 748 | 1438 | 17.97% |
| Altocumulus, cirrocumulus | 400 | 331 | 731 | 9.14% |
| Cirrus, cirrostratus | 650 | 673 | 1323 | 16.54% |
| Clear sky | 650 | 688 | 1338 | 16.72% |
| Stratocumulus, stratus, altostratus | 500 | 463 | 963 | 12.04% |
| Cumulonimbus, nimbostratus | 600 | 587 | 1187 | 14.84% |
| Mixed | 510 | 510 | 1020 | 12.75% |
| Total images | 4000 | 4000 | 8000 | 100% |
Table 2. Data statistics of ground-based cloud images with different cloud cover and cloud types in MGCD-Seg.

| Cloud Type | Number | Cloud Cover | Number | Percentage |
|---|---|---|---|---|
| Cu | 600 | 0% to 20% | 1093 | 27.32% |
| Ac & Cc | 600 | 20% to 40% | 476 | 11.9% |
| Ci & Cs | 500 | 40% to 60% | 488 | 12.2% |
| Clear sky | 500 | 60% to 80% | 387 | 9.68% |
| Sc, St & As | 600 | 80% to 100% | 1596 | 39.9% |
| Cb & Ns | 600 | | | |
| Mixed | 600 | Total images | 4000 | 100% |
Table 3. Segmentation performance of CloudY-Net with different backbone choices (VGG16, RegNet and ResNet50) on the MGCD-Seg dataset.

| Method | Backbone | mIoU (%) | mPA (%) | Accuracy (%) |
|---|---|---|---|---|
| CloudY-Net | VGG16 | 95.49 | 97.71 | 98.16 |
| CloudY-Net | RegNet | 95.94 | 98.02 | 98.34 |
| CloudY-Net | ResNet50 | 96.55 | 98.26 | 98.33 |
Table 4. Comparative experiment with classification methods.

| Methods | Accuracy (%) |
|---|---|
| KNN [14] | 68.9 |
| CloudNet [9] | 81.14 |
| HMF [23] | 87.9 |
| MobileNet V2 | 86.92 |
| VGG16 | 87.2 |
| GoogleNet | 87.53 |
| ResNet50 | 88.05 |
| Inception V3 | 88.32 |
| Y-Net | 84.7 |
| CloudY-Net | 88.58 |
Table 5. The classification accuracy (%) with different self-attention mechanisms.

| Method | Self-Attention Mechanism | Accuracy (%) |
|---|---|---|
| CloudY-Net | No Self-Attention | 87.9 |
| CloudY-Net | Dot Product Self-Attention | 88.0 |
| CloudY-Net | Multi-Head Self-Attention | 88.58 |
Table 6. The classification accuracy (%) with different weight (a, b, c, d) settings.

| Method | Weight Calculation Method | Weights (a, b, c, d) | Accuracy (%) |
|---|---|---|---|
| CloudY-Net | Fixed weights | (0.4, 0.3, 0.2, 0.1) | 78.4 |
| CloudY-Net | Fixed weights | (0.1, 0.2, 0.3, 0.4) | 80.25 |
| CloudY-Net | Fixed weights | (0.2, 0.2, 0.4, 0.2) | 85.64 |
| CloudY-Net | C-MoE | Adaptive weight | 88.58 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, F.; Hou, B.; Zhu, W.; Zhu, Y.; Zhang, Q. CloudY-Net: A Deep Convolutional Neural Network Architecture for Joint Segmentation and Classification of Ground-Based Cloud Images. Atmosphere 2023, 14, 1405. https://doi.org/10.3390/atmos14091405
