A Novel Method for Ground-Based Cloud Image Classification Using Transformer

Abstract: In recent years, convolutional neural networks (CNNs) have achieved competitive performance in the field of ground-based cloud image (GCI) classification. Proposed CNN-based methods can fully extract the local features of images. However, due to the locality of the convolution operation, they cannot well establish long-range dependencies within images, and thus they cannot extract global features. Transformer has been applied to computer vision with great success due to its powerful global modeling capability. Inspired by this, we propose a Transformer-based GCI classification method that combines the advantages of the CNN and Transformer models. Firstly, the CNN model acts as a low-level feature extraction tool to generate local feature sequences of images. Then, the Transformer model is used to learn the global features of the images by efficiently extracting the long-range dependencies between the sequences. Finally, a linear classifier is used for GCI classification. In addition, we introduce a center loss function to address the problem of the simple cross-entropy loss not adequately supervising feature learning. Our method is evaluated on three commonly used datasets: ASGC, CCSN, and GCD. The experimental results show that the method achieves 94.24%, 92.73%, and 93.57% accuracy, respectively, outperforming other state-of-the-art methods. This demonstrates that Transformer has great potential for GCI classification tasks.


Introduction
Clouds are visible masses formed when water vapor in the atmosphere condenses, and they cover about 70% of the Earth's surface. The study of clouds and their properties plays a very important role in many applications, such as climate simulation, weather forecasting, meteorological studies, solar energy production, and satellite communications [1][2][3]. Clouds are also closely linked to the hydrological cycle, affecting the energy balance on local and global scales through interactions with radiation from the sun and the land [4][5][6][7][8][9]. Because different cloud types have different radiative effects on the Earth's surface-atmosphere system, the study of cloud type classification is of great importance [10].
There are two main methods of cloud observation: meteorological satellite observations [11][12][13] and ground-based remote sensing observations [14][15][16]. Satellite cloud images can capture large areas of clouds and allow for the direct observation of the effects of clouds on Earth radiation. However, the low resolution of the images prevents the study of more local cloud details. Ground-based cloud image (GCI) classification is widely used to monitor the texture and distribution of clouds in local areas and has the advantages of flexible observation sites and rich image information, so GCI classification has become a hot research topic. As more and more places need cloud monitoring, a large number of images are generated at the same time. Relying on experts alone to identify and classify these images is clearly a time-consuming task and is easily influenced by personal subjectivity.

The disadvantage of CNN models is that they do not handle global features well, which leads to the underutilization of features. In contrast, the recently emerged Transformer model can extract abundant global information. Transformer was originally proposed by Vaswani et al. [33] for natural language processing (NLP) problems, and the model introduced a self-attention mechanism to perform global computation on the input sequence. In the field of NLP, Transformer is gradually replacing recurrent neural networks (RNNs) [34,35]. Inspired by this, related works have applied Transformer to image processing, such as DETR for object detection [36] and SETR for semantic segmentation [37]. Meanwhile, many research results have been generated in the field of image classification. Parmar et al. [38] input the pixels of an image as a sequence into the Transformer, which achieved better results but had a high computational cost. Dosovitskiy et al.
[39] proposed the Vision Transformer (ViT), a model that reduces computational complexity by first dividing images into patches before feeding them into the Transformer. Touvron et al. [40] suggested a knowledge distillation strategy for Transformer that relies on a distillation token to ensure that the student network learns feature information from the teacher network through attention. Due to the success of Transformer in natural image processing, some researchers have applied it to other fields. Reedha et al. [41] used a transfer learning strategy to apply ViT to unmanned aerial vehicle (UAV) image classification, and its performance outperformed state-of-the-art CNN models. Chen et al. [42] proposed a LeViT-based method for classifying asphalt pavement images, which consists of convolutional layers, transformer stages, and classifier heads. Shome et al. [43] developed a ViT-based classification model for chest X-ray images, which outperformed previous methods. He et al. [44] proposed a Transformer-based hyperspectral image classification method that uses a CNN to extract spatial features while using a densely connected Transformer to capture sequential spectral relationships.
Transformer has proved to be very successful in some fields, but it has almost no reported applications for GCI classification, which is, in fact, a complex problem, with some images containing large cloud areas and others containing only a small portion. Thus, models for GCI classification should have the ability to extract both global and local features. Spurred by the above reasons, a novel Transformer-based GCI classification method is proposed in this paper, which first sends the images to a CNN model to extract low-level features and generate the local feature sequences of images, and then uses the Transformer to learn the relationships between the low-level feature sequences. It is able to capture both the local and global features of images, which improves the model's discrimination of the images. To the best of our knowledge, this is the first time that Transformer has been introduced to the field of GCI classification. The results of experiments on three GCI datasets show that the classification performance of the method exceeds that of available methods.
The main contributions of this paper are summarized as follows: (1) We apply Transformer to the GCI classification task and propose a Transformer-based classification method that combines the advantages of Transformer and CNN to extract both local and global features of images, maximizing their complementary advantages for GCI classification.
(2) We optimize the loss function to enhance supervised feature learning by supplementing the cross-entropy loss with the center loss.
(3) An experimental evaluation is performed on three datasets (ASGC, CCSN, and GCD), and the results show that the proposed method achieves better classification accuracy than existing methods.
The rest of the paper is structured as follows. Section 2 details the components and overall structure of the research method. Section 3 reports the different GCI datasets and the experimental setup used in this paper. The experimental results, as well as the discussion, are presented in Section 4. Section 5 provides the conclusion of this study.

Overview of Proposed Method
This section shows the overall architecture of the designed Transformer-based classification method. As shown in Figure 1, a specially designed CNN model is used to extract the low-level semantic feature maps of GCI. The convolution operation is highly localized, which is beneficial for learning the local feature information of images. Then, the feature maps are fed into the Transformer to learn the feature relationships between sequences and obtain the global feature representation of the images. Finally, a linear classifier is used to complete the classification. In addition, a loss function is supplemented for enhanced supervised feature learning. The specific parameter information of the model is described in Table 1, where the number in parentheses after a convolutional layer or module indicates the kernel size, and the number in parentheses after a block indicates the number of times the feature map is downsampled. The details of each part are described below.

The EfficientNet-Based CNN
Related studies have demonstrated that the local connection and weight-sharing characteristics of CNNs provide the network with strong effectiveness and robustness in the local feature extraction of images [45,46]. GCI classification can be regarded as a fine-grained classification task that requires more information for extracting the low-level features of images than natural image classification, so the use of a CNN is necessary. A CNN is mainly composed of convolutional layers, pooling layers, and fully connected layers. The convolutional layer learns the representation information in the image through different kernels; the pooling layer plays the role of downsampling, thus retaining useful feature information; and the fully connected layer maps the learned distributed feature representations to the sample label space, thus predicting the sample class.
EfficientNet [47] is an efficient network developed by Google Research using neural architecture search technology. It optimizes the three factors of network depth, number of channels, and input image resolution according to the same fixed scale factor, with the advantages of high efficiency and accuracy. EfficientNet is stacked from the Mobile Inverted Bottleneck Convolution (MBConv) of MobileNet [48]; the structure of MBConv is shown in Figure 2. In the main branch, the input feature map first passes through a convolutional layer with a kernel size of 1 × 1 and then a depthwise convolutional layer with a kernel size of 3 × 3 or 5 × 5, each followed by a Swish activation function [49] and a batch normalization (BN) layer; these are followed by a Squeeze-and-Excitation (SE) module [50], a convolutional layer, and a dropout layer. Finally, the outputs of the main branch and the input branch are summed to obtain the final output of the module. Among them, the first convolutional layer is used to increase the dimensionality of the feature map. When the expansion multiplier is 6, the module is called MBConv6; MBConv1 is defined analogously. The original EfficientNet-B0 contains thirty-nine MBConv layers, and using all of them is not the best choice for GCI feature extraction. Therefore, only some layers are selected to form the new CNN model in this paper. Specifically, we use thirteen MBConv layers and remove the pooling and fully connected layers from the original model; the detailed structure is shown in the CNN section of Figure 1.
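As an illustrative aside (not the authors' implementation), the squeeze-and-excitation step inside MBConv can be sketched in a few lines of NumPy; the channel count, reduction ratio, and all weights below are arbitrary toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(feature_map, w1, b1, w2, b2):
    """Squeeze-and-Excitation over a (C, H, W) feature map.

    Squeeze: global average pooling -> (C,) channel descriptor.
    Excite:  two FC layers (reduce then restore C) with a sigmoid gate.
    Scale:   reweight each channel of the input by its gate value.
    """
    squeezed = feature_map.mean(axis=(1, 2))      # (C,)
    hidden = np.maximum(0.0, w1 @ squeezed + b1)  # ReLU, (C/r,)
    gates = sigmoid(w2 @ hidden + b2)             # (C,), each in (0, 1)
    return feature_map * gates[:, None, None]     # broadcast over H, W

# Toy usage: 8 channels, reduction ratio 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))
w1, b1 = rng.standard_normal((2, 8)), np.zeros(2)
w2, b2 = rng.standard_normal((8, 2)), np.zeros(8)
y = squeeze_excite(x, w1, b1, w2, b2)
```

Because each gate lies in (0, 1), the module can only attenuate channels, which is how it reweights channel importance.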

Transformer Architecture

Vision Transformer
Vision Transformer (ViT) [39] was the first model to apply Transformer to training on large-scale image datasets, and its overall structure is shown in Figure 3a. The input of Transformer is a one-dimensional (1D) sequence of token embeddings, so two-dimensional (2D) images need to be processed accordingly. Firstly, the image, X ∈ R^(H×W×C), is divided into a number of nonoverlapping patches, X_P ∈ R^(P×P×C), where H × W is the size of the original image, C is the number of channels in the image, and P × P is the size of each patch. Then, each patch is linearly projected into a vector of fixed dimensions using a trainable embedding matrix. To preserve the spatial information in the image, a position embedding is added to each embedding vector to collectively form the input of the transformer encoder. In addition, an extra class embedding is also fed into the encoder to distinguish images from different classes. The structure of the transformer encoder is shown in Figure 3b. It is mainly composed of a multi-head self-attention (MSA) module and a feed-forward multilayer perceptron (MLP) [33]. MSA is the core module of the encoder and is explained in detail below. The MLP contains two fully connected layers with a Gaussian Error Linear Unit (GELU) activation function between them [51]. In addition, both parts of the encoder use residual connections, while a normalization layer is added before the input of each part. Finally, GCI classification is performed using the MLP head based on the features trained by the transformer encoder.
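The patch-and-embed step described above can be sketched in NumPy as follows; the image size, patch size, and embedding dimension are toy values, and the projection matrix, class token, and position embeddings are random stand-ins for learned parameters:

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping (patch*patch*C) vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    n_h, n_w = h // patch, w // patch
    x = image.reshape(n_h, patch, n_w, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                # (n_h, n_w, P, P, C)
    return x.reshape(n_h * n_w, patch * patch * c)

def embed(image, patch, e_matrix, cls_token, pos_embed):
    """Patchify, linearly project, prepend the class token, add positions."""
    tokens = patchify(image, patch) @ e_matrix    # (N, D)
    tokens = np.vstack([cls_token, tokens])       # (N + 1, D)
    return tokens + pos_embed                     # position embedding added

# Toy sizes: 8x8 RGB image, 4x4 patches -> 4 patch tokens, D = 16.
rng = np.random.default_rng(1)
img = rng.standard_normal((8, 8, 3))
E = rng.standard_normal((4 * 4 * 3, 16))
cls = rng.standard_normal((1, 16))
pos = rng.standard_normal((5, 16))
seq = embed(img, 4, E, cls, pos)
```

The resulting (N + 1) × D sequence is what the transformer encoder consumes.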
Multi-head self-attention (MSA) [33] is the core component of Transformer; the introduction of an attention mechanism allows the network to pay more attention to the relevant information in the input vectors. Figure 4b provides an illustration of MSA. It is composed of multiple self-attention heads, and the structure of self-attention is shown in Figure 4a. The input vectors are first transformed into three different matrices: the query matrix, Q; the key matrix, K; and the value matrix, V. The weight assigned to each value is determined by calculating the dot product of the query and key matrices of different input vectors, so that the individual input vectors can be connected to each other to achieve the effect of global modeling. The specific calculation is:

Attention(Q, K, V) = softmax(QK^T / √d_k)V, (1)

where d_k is the dimension of matrix K. The purpose of dividing by √d_k is to provide proper normalization and make the gradient more stable. In order to generate more correlations between different inputs, the multi-head self-attention mechanism splits the input vectors into several parts, computes the attention of each part in parallel, and finally concatenates all the attention outputs. The calculation processes are given in Equations (2) and (3):

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), (2)

MSA(Q, K, V) = Concat(head_1, ..., head_h)W^O, (3)

where i denotes the index of the input vector splits and W_i^Q, W_i^K, W_i^V, and W^O are all trainable parameter matrices.
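A minimal NumPy sketch of scaled dot-product attention and its multi-head extension (toy sizes; the random matrices stand in for the trainable parameter matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """softmax(QK^T / sqrt(d_k)) V -- scaled dot-product attention."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # (N, N) pairwise token affinities
    return softmax(scores) @ v

def multi_head(x, wq, wk, wv, wo, heads):
    """Split the model dimension into heads, attend per head, concat, project."""
    d = x.shape[-1] // heads
    outs = []
    for i in range(heads):
        s = slice(i * d, (i + 1) * d)
        outs.append(attention(x @ wq[:, s], x @ wk[:, s], x @ wv[:, s]))
    return np.concatenate(outs, axis=-1) @ wo

rng = np.random.default_rng(2)
x = rng.standard_normal((5, 16))      # 5 tokens, model dimension 16
wq, wk, wv, wo = (rng.standard_normal((16, 16)) for _ in range(4))
y = multi_head(x, wq, wk, wv, wo, heads=4)
```

Each softmax row sums to 1, so every output token is a convex combination of all value vectors, which is exactly the global-modeling property discussed above.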

Swin Transformer
Swin Transformer [52] is a recently proposed transformer model that is gradually becoming the backbone for various vision tasks. The original Transformer always keeps the same downsampling rate, and each patch needs to compute self-attention with all other patches. This results in extracting only single-scale features and incurs extremely high computational costs when classifying high-resolution images. For the above reasons, Swin Transformer builds a hierarchical structure imitating the CNN model, as shown in Figure 5. The hierarchical representation is achieved by a patch merging operation at each stage, which merges adjacent patches and applies linear transformations to set the dimensionality, so that the resolution of the feature maps becomes progressively smaller as the network deepens. In addition, the model uses nonoverlapping windows to compute self-attention, reducing the computational complexity from quadratic to linear in the image resolution. However, this approach results in a lack of information interaction between windows, which prevents a better understanding of contextual information. Therefore, Swin Transformer offers the shifted window method, which enables the interaction of information between neighboring windows. This is also the main difference from the original transformer model.

Different from the original MSA, the Swin Transformer block is constructed from window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA). W-MSA reduces the computational complexity of the model by computing self-attention within each window, but as a result it lacks communication between windows. Therefore, SW-MSA is proposed to achieve interaction between windows by dividing and merging the feature maps. The specific operation is shown in Figure 6. First, the feature map is partitioned into a number of nonoverlapping windows, and the attention between the different patches in each window is calculated with W-MSA. Then, the window positions are cyclically shifted to form a new feature map. Next, the new feature map is input into SW-MSA for the calculation of self-attention within its windows, thus achieving information interaction across windows. Finally, the feature map is reversed back to its original state for the next loop operation.

Swin Transformer blocks always appear in pairs due to the abovementioned shifted window characteristics. Figure 7 shows the structure of two successive Swin Transformer blocks. The composition of each Swin Transformer block is almost the same as that of the ordinary transformer encoder; the only difference is that the MSA in it is replaced by W-MSA and SW-MSA. Based on this window mechanism, the features of two successive Swin Transformer blocks are calculated as follows:

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1), (4)

z^l = MLP(LN(ẑ^l)) + ẑ^l, (5)

ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l, (6)

z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1), (7)

where ẑ^l and z^l represent the outputs of the (S)W-MSA module and the MLP module of the l-th block, respectively, and LN denotes layer normalization.
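The window partitioning and cyclic shift that underlie W-MSA and SW-MSA can be sketched in NumPy as follows (toy feature-map and window sizes; this only shows the index manipulation, not the attention computation itself):

```python
import numpy as np

def window_partition(fmap, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows."""
    h, w, c = fmap.shape
    x = fmap.reshape(h // win, win, w // win, win, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, c)

def cyclic_shift(fmap, win):
    """Roll the map by half a window so SW-MSA windows straddle old borders."""
    return np.roll(fmap, shift=(-(win // 2), -(win // 2)), axis=(0, 1))

rng = np.random.default_rng(3)
fmap = rng.standard_normal((8, 8, 4))
windows = window_partition(fmap, 4)                   # 4 windows of 4x4 patches
shifted = window_partition(cyclic_shift(fmap, 4), 4)  # windows after the shift
```

Rolling back by the same offset restores the original layout, matching the "reversed back to its original state" step in the text.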

Loss Function Design
In the linear classifier, the features output from the Swin Transformer are first fed into an average pooling layer, then passed through a normalization layer, and, finally, the prediction results are obtained using a fully connected layer. The loss function is used to evaluate the extent to which the predicted results of the model differ from the true results. Therefore, the loss function is of great importance in the optimization process of the model. The commonly used loss function in the field of image classification is the cross-entropy loss, which can be expressed by Equation (8):

L_CE = −(1/M) Σ_{x=1}^{M} Σ_{j=1}^{N} p_xj log(q_xj), (8)

where the size of the mini-batch and the number of classification classes are M and N, respectively, p_xj ∈ {0, 1} is the ground truth of sample x belonging to class j, and q_xj ∈ [0, 1] represents the probability of sample x being predicted as class j.
The GCI classification task can be regarded as fine-grained classification. The feature differences between cloud regions in the images are small, so the simple cross-entropy loss is not sufficient to adequately supervise the learning of such fine-grained features. Therefore, we introduce the center loss [53] to improve the supervisory ability of the model. The center loss function is defined in Equation (9):

L_C = (1/2) Σ_{i=1}^{M} ||x_i − c_{y_i}||_2^2, (9)

where M is the size of the mini-batch and x_i denotes the extracted feature of the i-th sample. c_{y_i} ∈ R^d denotes the y_i-th class center of the deep features, where d is the feature dimension, and c_{y_i} is updated as the deep features change. This function minimizes the intra-class distance while ensuring the separability of inter-class features, thus improving the discriminability between features. In conclusion, the final loss function of the whole model can be expressed as:

L = L_CE + λL_C, (10)

where λ is the hyperparameter that balances the two loss functions.
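The joint loss can be sketched in NumPy as follows (a minimal sketch, not the authors' implementation; the 1/M batch averaging in the cross-entropy term and the λ value are illustrative assumptions):

```python
import numpy as np

def cross_entropy(p, q):
    """Batch-mean of -sum_j p_xj * log(q_xj); p is one-hot ground truth."""
    return -(p * np.log(q + 1e-12)).sum(axis=1).mean()

def center_loss(feats, labels, centers):
    """0.5 * sum_i ||x_i - c_{y_i}||^2 over the mini-batch."""
    diff = feats - centers[labels]
    return 0.5 * (diff ** 2).sum()

def total_loss(p, q, feats, labels, centers, lam=0.01):
    """Joint supervision: L = L_CE + lambda * L_C."""
    return cross_entropy(p, q) + lam * center_loss(feats, labels, centers)

# Toy batch: M = 2 samples, N = 3 classes, 4-dim features.
p = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # one-hot ground truth
q = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # predicted probabilities
feats = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
centers = np.zeros((3, 4))                         # class centers (here zeros)
loss = total_loss(p, q, feats, np.array([0, 1]), centers, lam=0.01)
```

In training, the centers would be updated alongside the network weights as the deep features change; here they are frozen at zero purely for illustration.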

Datasets and Experimental Settings
In this section, we introduce three GCI datasets and then describe the relevant experimental settings.

Dataset Description
We use three GCI datasets in the experiments, including the All-Sky Ground-based Cloud (ASGC), Cirrus Cumulus Stratus Nimbus (CCSN), and the Ground-based Cloud Dataset (GCD).

All-Sky Ground-Based Cloud (ASGC)
The cloud images in this dataset were captured by the all-sky camera [54] located in Muztagh, Xinjiang (38.19°N, 74.53°E). The all-sky camera consists of a Sigma 4.5 mm fisheye lens and a Canon 700D camera, with a maximum field of view of 180° due to the fisheye lens. Different from conventional GCI, the sky in this dataset is mapped as a circle, where the center is the zenith and the boundary is the horizon. Images are captured every 20 min during the day and every 5 min at night. In addition, the exposure time is adjusted between 15 and 30 s according to the moon phase. All images are stored in a color JPEG format with a resolution of 460 × 460 pixels. To meet the training requirements, all images were uniformly adjusted to 448 × 448 pixels. A total of seven classes of cloud images are included in the original dataset: altocumulus (Ac), cumulonimbus (Cb), cirrus (Ci), clear (Cl), cumulus (Cu), mixed (Mi), and stratocumulus (Sc). The number of images in each class varies from 210 to 450, and the dataset employs data augmentation because the small number of samples is prone to overfitting. The specific data augmentation methods include horizontal flip, vertical flip, and random rotation. Example images of each class are shown in Figure 8. The sample distribution of the training set and testing set in the expanded dataset is described in Table 2.
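The augmentation operations named above (horizontal flip, vertical flip, random rotation) can be sketched in NumPy; note that np.rot90 is used here as a stand-in for the paper's random rotation, since an arbitrary-angle rotation would require an interpolating library:

```python
import numpy as np

def augment(image, rng):
    """Randomly flip an (H, W, C) image and apply a random 90-degree rotation."""
    if rng.random() < 0.5:
        image = np.fliplr(image)   # horizontal flip
    if rng.random() < 0.5:
        image = np.flipud(image)   # vertical flip
    # Rotate by a random multiple of 90 degrees in the image plane.
    return np.rot90(image, k=int(rng.integers(0, 4)), axes=(0, 1))

rng = np.random.default_rng(4)
img = np.arange(448 * 448 * 3, dtype=np.float32).reshape(448, 448, 3)
out = augment(img, rng)
```

Flips and right-angle rotations only permute pixels, so the augmented image contains exactly the original pixel values, which is why they are safe label-preserving transforms for square all-sky images.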
Cirrus Cumulus Stratus Nimbus (CCSN)

This dataset is an open-source dataset collected by the Nanjing University of Information Engineering, and it is categorized according to the World Meteorological Organization's genus-based classification proposal, which separates cloud images into the following types: altocumulus (Ac), altostratus (As), cumulonimbus (Cb), cirrocumulus (Cc), cirrus (Ci), cirrostratus (Cs), contrail (Ct), cumulus (Cu), nimbostratus (Ns), stratocumulus (Sc), and stratus (St). Due to the large number of cloud types, the cloud images in this dataset have large illumination variations and intra-class variations; more details are provided in [27].
All images are in color JPEG format with a resolution of 256 × 256 pixels. To meet the input size requirements of the model, the images were uniformly resized to 448 × 448 pixels using bilinear interpolation [55]. The number of cloud images in each class varies from 140 to 340. To avoid overfitting, data augmentation is also used to expand the dataset. Figure 9 shows example images from the dataset, and the number of samples used for training and testing is described in Table 3.
The images in this dataset were captured by camera sensors in nine Chinese provinces over a period of more than one year and have great diversity. This dataset classifies images according to the classification criteria published by the World Meteorological Organization, specifically, altocumulus (Ac), cumulonimbus (Cb), cirrus (Ci), clear (Cl), cumulus (Cu), mixed (Mi), and stratocumulus (Sc). All images are stored in JPEG format with a resolution of 512 × 512 pixels, and more details of the dataset are presented in [29]. In this paper, the image size is uniformly resized to 448 × 448 pixels. An example of the images in the dataset is shown in Figure 10. Table 4 describes the distribution of images in this dataset.

Implementation Details
The proposed method is implemented on a computer with an Intel(R) Core(TM) i7-8750H CPU @ 3.20 GHz and 32.0 GB of RAM, utilizing an NVIDIA GeForce GTX 2070 Super 16 G graphics processing unit (GPU). The code is written in Python, and PyTorch is chosen as the deep learning framework.
In order to improve the convergence speed and generalization ability of the model, we adopt transfer learning for training. A Swin Transformer model pretrained on the ImageNet-1K dataset provides the initial weights, and the model proposed in this paper is trained directly from these weights, which not only shortens the training time but also helps avoid overfitting. The Adaptive Momentum Estimation with Weight Decay (AdamW) [56] optimizer is used for optimization; it adds a decoupled L2 regularization term to the Adam optimizer, which mitigates parameter overfitting while retaining fast convergence. In addition, the model uses the CosineAnnealingLR [56] learning-rate schedule with an initial learning rate of 0.0016. This schedule lets the learning rate fall along a cosine curve and then restart, helping the optimizer escape local minima and approach the true global minimum.
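The schedule can be illustrated with the closed-form cosine annealing rule that CosineAnnealingLR implements. The initial learning rate 0.0016 comes from the text; the 100-step horizon below is an arbitrary choice for the sketch.

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float = 0.0016, lr_min: float = 0.0) -> float:
    """Learning rate at a given step under cosine annealing:
    it falls from lr_max to lr_min along a half cosine wave."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps))

# Trace the schedule over a hypothetical 100-step horizon.
lrs = [cosine_annealing_lr(t, 100) for t in range(101)]
```

In PyTorch the equivalent setup would be `torch.optim.AdamW(model.parameters(), lr=0.0016)` wrapped by `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=...)`; the `T_max` value is an assumption here.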

Evaluation Metrics
In order to comprehensively evaluate the classification performance of the proposed method for various types of images, accuracy, precision, and recall are calculated as the evaluation metrics in this paper. Accuracy can be calculated based on positive and negative samples as:

Accuracy(Acc) = (TP + TN) / (TP + TN + FP + FN)

where TP (True Positive) is the number of correctly classified samples of a specific class, TN (True Negative) is the number of correctly classified samples of the remaining classes, FP (False Positive) is the number of misclassified samples of the remaining classes, and FN (False Negative) is the number of misclassified samples of the specific class. Precision and recall can be expressed as:

Precision(Pr) = TP / (TP + FP)

Recall(Re) = TP / (TP + FN)

In addition, we also use the F1_score for evaluation, which is given in Equation (14) as:

F1_score = 2 × Pr × Re / (Pr + Re)
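The four metrics can be computed directly from the confusion counts; a minimal sketch with hypothetical counts for one class of a multi-class problem:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 90 of 110 samples of the class found, 10 false alarms.
acc, pr, re, f1 = classification_metrics(tp=90, tn=880, fp=10, fn=20)
```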

Results of GCI Classification
The classification results for each class in the different datasets and the overall classification accuracy are provided in Tables 5-7. It can be seen that the proposed method achieves accuracies of 94.24%, 92.73%, and 93.57% on the ASGC, CCSN, and GCD datasets, respectively.

In the ASGC dataset, the precision and recall are greater than 89% for all classes except Sc; the highest precision is 99.66% for Cu, and the highest recall is 100% for Cl, but both precision and recall are below 90% for Sc. Sorted by F1_score from largest to smallest, the seven classes rank as Cl, Cu, Mi, Ac, Ci, Cb, and Sc. In the CCSN dataset, the precision and recall are greater than 87% for all classes. The highest precision is 97.18% for Cb, and the highest recall is 98.67% for Ct, while the lowest precision and recall are 87.21% and 88.67%, respectively, for St. Sorted by F1_score from largest to smallest, the classes rank as Ct, Ci, Cb, Ac, Cu, As, Cc, Ns, Sc, Cs, and St. In the GCD dataset, both the precision and recall of all classes are greater than 85%. The highest precision is 98.62% for Cu, and the highest recall is 99.33% for Cl, while the lowest precision is 89.49% for Cb and the lowest recall is 85.33% for Sc. Sorted by F1_score from largest to smallest, the classes rank as Cl, Cu, Ac, Ci, Mi, Cb, and Sc. In conclusion, the method can reliably identify the various types of cloud images and is capable of automatic cloud image classification.

To analyze the misclassifications, the confusion matrices of the different datasets are shown in Figure 11. The horizontal axis in each figure indicates the true image class, the vertical axis indicates the predicted image class, and the values in the off-diagonal elements represent the number of misclassifications between classes. From the figure, it can be seen that, in the ASGC and GCD datasets, images of the Cl class are classified correctly most often, and the misclassified images mainly come from Cb and Sc. This is because some images of the Sc class are affected by illumination, causing the bottom of the clouds to appear dark black and thus easily confused with Cb. In addition, the movement of clouds can change the shooting viewpoint, which increases the difficulty of identification. In the CCSN dataset, images of the Ct class are classified correctly in the largest number. The misclassified images mainly come from St, Ns, and Sc, which is expected because these all belong to low-level clouds with relatively similar structure and transparency.

Parameter Analysis
To provide a comprehensive study of the proposed method, this section analyzes the effect of the hyperparameter λ in the loss function on the classification results. We vary λ from 0 to 0.1 to learn different models, and the accuracy of these models on the ASGC, CCSN, and GCD datasets is shown in Figure 12.

It can be seen that when λ is 0, the loss function contains only the cross-entropy loss, which cannot adequately supervise feature learning, resulting in poor performance.
When the center loss is added, the joint supervision improves the discriminative power of the deep features, thus improving the classification accuracy. In addition, the model performance remains largely stable over a wide range of λ, and the proposed method obtains the best results when λ is 0.01. Therefore, λ is set to 0.01 to obtain the best classification performance for GCI classification in the experiments on each dataset.
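The joint supervision can be sketched as cross-entropy plus λ times the center loss. This NumPy illustration assumes the standard center-loss form L_C = ½‖x_i − c_{y_i}‖² averaged over the batch, with fixed toy class centers for brevity (in the actual method the centers are learned alongside the network):

```python
import numpy as np

def joint_loss(features, logits, labels, centers, lam=0.01):
    """Joint supervision L = L_ce + lam * L_c, where L_c is the center
    loss 0.5 * ||x_i - c_{y_i}||^2 averaged over the batch."""
    # Numerically stable softmax cross-entropy.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # Center loss: squared distance of each feature to its class center.
    diff = features - centers[labels]
    center = 0.5 * (diff ** 2).sum(axis=1).mean()
    return ce + lam * center

features = np.array([[1.0, 0.0]])             # one sample, 2-D feature
logits = np.array([[2.0, 0.0]])               # two-class logits
labels = np.array([0])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])  # fixed toy class centers
loss_near = joint_loss(features, logits, labels, centers)
loss_far = joint_loss(features + np.array([[2.0, 0.0]]), logits, labels, centers)
```

Moving the feature away from its class center raises the loss by λ·ΔL_C (here 0.01 × 2 = 0.02) while the cross-entropy term is unchanged, which is exactly the pull-toward-the-center effect that the joint supervision adds.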

Feature Visualization
In order to illustrate the feature extraction ability of the proposed method more intuitively, we use the Gradient-weighted Class Activation Mapping (Grad-CAM) [57] method for feature visualization. The method highlights the image regions important for the prediction by generating a coarse attention map from the last layer of the model. The brighter the color in the attention map, the higher the importance of the corresponding region of the image. Some images are selected from the ASGC dataset for testing, and the results are shown in Figure 13. It can be seen that the proposed method highlights the true class of the object and has strong localization and recognition abilities.
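The Grad-CAM map can be computed from the last-layer activations and the gradients of the class score with respect to them. A minimal NumPy sketch of the standard formulation (channel weights are the spatially averaged gradients, followed by a ReLU and normalization), with toy inputs in place of real network tensors:

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM: weight each of the K feature maps by its spatially
    averaged gradient, sum over channels, apply ReLU, and normalize."""
    alpha = gradients.mean(axis=(1, 2))  # (K,) channel importance weights
    cam = np.maximum((alpha[:, None, None] * activations).sum(axis=0), 0.0)
    return cam / (cam.max() + 1e-8)      # scale into [0, 1]

acts = np.ones((2, 3, 3))   # toy activations: K = 2 maps of size 3x3
grads = np.ones((2, 3, 3))  # toy gradients of the class score w.r.t. them
cam = grad_cam(acts, grads)
```

In practice the activations and gradients would be captured with forward and backward hooks on the model's last layer, and the map upsampled to the input resolution before overlaying.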

The training and testing samples of each dataset were kept constant in the experiments, and the final testing results of each dataset are shown in Table 8.
It can be seen from the experimental results that EfficientNet-B0 achieves the highest accuracy among the CNN-based methods, with 91.47%, 89.97%, and 90.48% on the ASGC, CCSN, and GCD datasets, respectively. The classification accuracy of the Transformer-based methods is higher than that of the CNN-based methods, which demonstrates that Transformer is more capable of extracting image features. The accuracy of the original Swin-T on the three datasets reaches 92.86%, 91.06%, and 92.38%. The proposed method combines CNN with Transformer and optimizes the loss function, compensating for the missing local features and enhancing the supervision of feature learning, thus improving the classification accuracy. With the proposed method, the accuracy is improved by 1.38%, 1.67%, and 1.19% compared with the original Swin-T.

Conclusions
Transformer is a powerful deep neural network for processing sequences, but it has received little attention in the field of ground-based cloud image processing. In this paper, we apply Transformer to GCI and propose a novel GCI classification method. Different from traditional CNN-based methods, our method combines the Transformer and CNN models. Specifically, the CNN model is used as a low-level feature extraction tool to generate the local feature sequences of images, and Transformer effectively extracts the long-range dependencies between the sequences from the low-level features. Using this method, both the local features and the global features of cloud images can be extracted. In addition, the center loss is introduced to supplement the cross-entropy loss and enhance the supervision of feature learning. We evaluate the performance of the proposed method on three different GCI datasets. Compared with several other advanced methods, our method achieves the highest accuracy, with 94.24%, 92.73%, and 93.57% on the ASGC, CCSN, and GCD datasets, respectively. The proposed method shows the great potential of Transformer for GCI classification, and in future work we will continue to study improvements to Transformer to further enhance the classification performance.

Figure 1 .
Figure 1. Overall architecture of the proposed method.

Figure 3 .
Figure 3. The structure of (a) the Vision Transformer (ViT) and (b) the Transformer encoder.

Figure 4 .
Figure 4. The illustration of self-attention in Transformer. (a) The self-attention; (b) the multi-head self-attention.

Figure 5 .
Figure 5. Illustration of hierarchical feature maps.

Swin Transformer provides four versions of the model: Swin-T, Swin-S, Swin-B, and Swin-L. Considering the specificity and computational complexity of GCI, we chose Swin-T, whose overall structure is shown in part of Figure 1. The model is divided into four stages containing 2, 2, 6, and 2 blocks, respectively. Stage 1 contains two Swin Transformer blocks, and a linear embedding layer is added before the blocks to change the number of channels in the feature map. Stage 2 contains a patch merging layer and two Swin Transformer blocks, where patch merging acts like the pooling layer in a CNN, downsampling before each stage to adjust the dimensionality of the feature map. The structure of stage 3 and stage 4 is similar to that of stage 2, except that the total downsampling factor doubles at each stage and the number of blocks differs.
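The patch merging operation can be sketched as follows. This NumPy illustration assumes the standard Swin formulation (concatenate each 2×2 neighborhood of patches along the channel axis, then linearly project 4C → 2C); the toy feature map and weights are placeholders, not the model's learned parameters:

```python
import numpy as np

def patch_merging(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Swin-style patch merging: concatenate each 2x2 neighborhood of
    patches along the channel axis (C -> 4C), then project to 2C with a
    linear layer, halving the spatial resolution."""
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )  # shape (H/2, W/2, 4C)
    return merged @ weight  # weight has shape (4C, 2C)

x = np.arange(16.0).reshape(4, 4, 1)  # toy 4x4 feature map with C = 1
w = np.ones((4, 2))                   # toy projection matrix (4C -> 2C)
y = patch_merging(x, w)               # shape (2, 2, 2)
```

With the all-ones toy weight, each output value is simply the sum of its 2×2 input neighborhood, which makes the downsampling behavior easy to verify by hand.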

Figure 6 .
Figure 6. The illustration of the shifted window approach.

Here, ẑ^l and z^l represent the outputs of the (S)W-MSA module and the MLP module of the l-th block, respectively, and LN denotes layer normalization.

Figure 7 .
Figure 7. The structure of two successive Swin Transformer blocks.

Figure 12 .
Figure 12. Analysis of the performance results for the λ parameter.

Figure 13 .
Figure 13. Grad-CAM visualization results in the ASGC dataset. Each input image is shown on the second line, and attention maps are shown on the first line: (a) Ac; (b) Cb; (c) Ci; (d) Cu.

Table 1 .
The details of the proposed model.

Table 2 .
Training and testing samples for the ASGC dataset.

Table 3 .
Training and testing samples for the CCSN dataset.

Table 4 .
Training and testing samples for the GCD dataset.

Table 5 .
The classification results for the ASGC dataset.

Table 6 .
The classification results for the CCSN dataset.

Table 7 .
The classification results for the GCD dataset.

Table 8 .
The classification results of different methods based on different datasets.