Article

Self-Supervised Cloud Classification with Patch Rotation Tasks (SSCC-PR)

1 Information and Systems Science Institute, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
3 School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China
4 School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9051; https://doi.org/10.3390/app15169051
Submission received: 14 July 2025 / Revised: 10 August 2025 / Accepted: 15 August 2025 / Published: 16 August 2025

Abstract

Solar irradiance, which is closely influenced by cloud cover, significantly affects photovoltaic (PV) power generation efficiency. To improve cloud type recognition without relying on large amounts of labeled data, this paper proposes a self-supervised cloud classification method based on patch rotation prediction. In the pre-training stage, unlabeled ground-based cloud images are augmented through blockwise rotation, and high-level semantic representations are learned via a Swin Transformer encoder. In the fine-tuning stage, these representations are adapted to the cloud classification task using labeled data. Experimental results show that our method achieves 96.61% accuracy on the RCCD and 90.18% on the SWIMCAT dataset, outperforming existing supervised and self-supervised baselines by a clear margin. These results demonstrate the effectiveness and robustness of the proposed approach, especially in data-scarce scenarios. This research provides valuable technical support for improving the prediction of solar irradiance and optimizing PV power generation efficiency.

1. Introduction

As renewable energy is booming, photovoltaic (PV) power generation, a clean and eco-friendly energy source, has become an essential part of the global energy transition. But the efficiency of PV power generation is affected by multiple environmental factors. Among them, cloud cover is crucial in influencing the performance of PV systems. The type, thickness, density, and coverage of clouds directly determine the intensity of solar radiation reaching the ground, thus having an impact on the power output of PV panels. Especially in cloudy conditions, the variability and instability of irradiance bring great challenges to PV systems [1,2,3,4]. Thoroughly researching the impact of clouds on solar irradiance, especially the shading effects of different cloud types, is important for enhancing the accuracy of PV power generation forecasts and optimizing system operation strategies. Therefore, fully grasping cloud variations and their specific impacts on solar irradiance can not only offer a theoretical basis for the design, monitoring, and optimization of PV systems but also promote further progress in PV power generation technology.
Ground-based cloud imagery is mainly utilized for observing the distribution, movement, and alterations in cloud cover within a specific local area. In contrast to other observation techniques, it is more efficient in discerning cloud types, heights, and other features, offering strong support for local weather forecasting. Singh et al. utilized five distinct feature extraction methods, such as autocorrelation, the cooccurrence matrix, and edge frequency [5]. These were combined with k-nearest neighbor (KNN) and neural network classifiers, enabling the successful identification of cumulus and cumulonimbus clouds. Calbo and Sabburg created a classifier founded on the super-spectral parallelotope technique [6]. They made use of statistical texture features, Fourier transform-based image features, and threshold-based image features to classify eight weather conditions, attaining a classification consistency index of 76%. Liu et al. carried out preliminary research on feature extraction and cloud classification by using sky infrared images captured by the Whole Sky Infrared Cloud Measurement System (WSIRCMS) [7]. They explored seven structural features, including mean gray value (ME), estimated cloud fraction (ECF), and edge sharpness (ES). Additionally, they designed a rectangular method cloud classifier to classify cirrus clouds, wave clouds, and cumulus clouds, achieving an accuracy rate of 90.97%.
As convolutional neural networks (CNNs) have evolved, remarkable advancements have been achieved in the automatic classification of ground-based cloud images. Zhang et al. put forward a new CNN model and built an 11-class ground-based cloud dataset that complies with meteorological standards, called Cirrus Cumulus Stratus Nimbus [8]. The proposed CloudNet model showed excellent performance in meteorological cloud classification. Fang et al. augmented existing cloud type datasets through data augmentation techniques and employed transfer learning for parameter fine-tuning and training [9]. Eventually, they reached an accuracy of 96.55%. Shi et al. combined features from shallow convolutional layers with DCAF for a comprehensive assessment in cloud classification [10]. Experiments on two difficult public datasets verified that their approach outperformed traditional classification methods. Liu et al. introduced a Task-Based Graph Convolutional Network (TGCN) method that takes into account the relationships between images and established the Ground-Based Remote Sensing Cloud Dataset (GRSCD) [11]. Comparisons with other cloud classification methods proved the effectiveness of the TGCN in ground-based cloud classification.
Existing cloud classification methods typically rely on large amounts of annotated data, which are costly and time-consuming to obtain. Moreover, the existing publicly available datasets are limited in number (such as SWIMCAT, with only 784 images), making it difficult to train robust models. In addition, existing classifications often overlook the physical relationship between cloud types and solar irradiance, leading to suboptimal performance in downstream tasks such as photovoltaic energy forecasting. To solve these problems, this paper conducts cloud classification research by adopting a self-supervised learning method. It takes patch rotation as an auxiliary task and uses the Swin Transformer as the backbone network. Moreover, a ground-based cloud image dataset is created for experiments. This dataset is classified according to the impact of cloud cover on irradiance. The specific contributions are as follows:
  • A self-supervised learning approach is adopted for cloud classification research, and a patch rotation auxiliary task is designed for Pre-training.
  • Most existing cloud datasets are constructed from the image aspect, with little consideration given to their impact on photovoltaic power generation. Consequently, this paper develops a new cloud classification dataset centered around the influence that different cloud types have on irradiance.
  • The Swin Transformer serves as the backbone network for training. When compared with existing supervised and unsupervised models, it brings about substantial performance enhancements.

2. Related Works

2.1. Self-Supervised Learning

Existing cloud classification methods face critical limitations in integrating physical cloud–irradiance relationships and bridging satellite–ground disparities. Satellite-based approaches capture global physical features but lack real-time resolution, while ground-based methods offer high temporal detail but ignore irradiance impact. Hybrid models fuse data but treat physics as black-box inputs. In this paper, by designing a patch rotation prediction task (predicting the rotation angle of image patches), the model autonomously learns local morphological invariant features of clouds (such as edges, textures, and thickness gradients) from unlabeled cloud image data, addressing the bottleneck of traditional supervised learning that relies on scarce manually labeled data by meteorological experts.
Self-supervised learning (SSL) [12,13,14] is a specific subcategory of unsupervised learning that focuses on automatically generating labels or learning objectives from the raw data itself, without requiring manual annotation. Unlike general unsupervised learning, which aims to uncover inherent data structures (e.g., via clustering or dimensionality reduction), SSL defines surrogate (pretext) tasks—such as rotation prediction, contrastive instance discrimination, or masked prediction—that guide the model to learn semantically meaningful and transferable representations.
The objective of self-supervised learning is to extract useful feature representations from unlabeled data, eliminating the necessity for manually labeled tags. In self-supervised learning, the model generates pseudo-labels automatically by devising tasks and then uses these for training. It constructs predictive tasks like predicting missing parts, image rotation, or image completion. This enables the model to learn the underlying structures and patterns within the data. By doing so, the model can independently learn meaningful feature representations from vast amounts of unlabeled data, which benefits downstream tasks.
In recent years, with the application and progress of self-supervised learning techniques in natural images, self-supervised learning classification has emerged as a mainstream method in image analysis. Self-supervised learning is a special form of an unsupervised learning method. It uses dataset-inherent information to create pseudo-labels for representation learning (RL). As a subset of unsupervised learning, it does not need time-consuming, labor-intensive, and error-prone manual image annotations. Thus, when large-scale cloud datasets are scarce, applying self-supervised learning methods to extract discriminative information from unlabeled data is an effective way to conduct cloud-related studies.

2.2. Swin Transformer

The Transformer model, initially proposed by Vaswani, is built on the self-attention mechanism to model the global dependencies of input data [15]. In traditional Vision Transformers like the ViT, images are split into fixed-size patches [16]. These patches are linearly processed and then input into the Transformer as a sequence for global feature modeling. Compared with CNNs, Transformers are more flexible in feature modeling [17]. They can handle more complex global dependencies via the self-attention mechanism. However, the computational complexity of this global self-attention is extremely high, especially when dealing with high-resolution images, resulting in large computational and memory costs.
When the Swin Transformer was first applied to image classification on ImageNet, its efficient computation and excellent performance quickly made it one of the mainstream models in computer vision [18,19]. Experiments on several standard datasets, such as ImageNet and CIFAR, have demonstrated that the Swin Transformer outperforms traditional CNNs and other Transformer variants like the ViT in terms of performance [20].
In self-supervised learning, the Swin Transformer has been used for various auxiliary tasks, including image reconstruction, rotation prediction, and contrastive learning. The adoption of self-supervised learning allows the Swin Transformer to be pre-trained without a large quantity of labeled data, thus enhancing the model’s generalization ability. In this research, we used the patch rotation task as a self-supervised task and combined it with the Swin Transformer for cloud classification. By predicting the rotation angles of image patches, the model can learn better local and global feature representations, which is beneficial for improving the classification performance.
In the following content, we will elaborate on several aspects in detail. Initially, we will introduce the experimental methodology, with a focus on the experiment’s design concept and implementation procedures. Subsequently, we will describe the datasets employed, offering in-depth details about their origins and characteristics. After that, we will explain the experimental part, covering the experimental setup, evaluation metrics, parameter analysis, experimental outcomes, as well as the analysis and discussion of these results. Finally, in the summary and outlook segment, we will summarize the main work of this paper and explore potential future research directions or improvement plans.

2.3. Cloud Classification

Cloud classification methods are primarily divided into traditional computer vision methods and modern deep learning methods. Traditional methods rely on hand-crafted features (such as spectral thresholds for satellite cloud images or color, texture, and shape for ground-based cloud images) combined with shallow models like SVMs or decision trees. While highly interpretable, they are limited by the subjectivity of feature design and environmental factors (e.g., lighting, viewing angle); deep learning methods, by contrast, automatically extract high-level semantic features via deep models like CNNs or Transformers, achieving significantly higher accuracy but requiring large volumes of labeled data.
From the perspective of data sources, cloud classification can be divided into satellite cloud images (with global coverage and rich multi-spectral information, but low resolution (250–1000 m/pixel), making it difficult to capture detailed cloud morphology) and ground-based cloud images (with high resolution (1024 × 1024 pixels or higher), capable of clearly identifying cloud edges and structures (such as the anvil top of cumulonimbus), but with limited coverage). This study selects ground-based cloud images as the research object, and the core reason is that their high resolution can accurately capture subtle morphological features of clouds—this is critical for meteorological early warning (e.g., detection of severe convective clouds like cumulonimbus), while the low resolution of satellite cloud images limits the analysis of such details, making it unable to meet the needs of refined early warning.
In response to the rotation invariance of clouds in ground-based imagery (e.g., the anvil top of a cumulonimbus remains recognizable after rotation), we choose patch rotation as the self-supervised task. This task rotates image patches and requires the model to predict the rotation angle, which forces the model to learn cloud shape features (such as the contour of the anvil top)—precisely the core basis for cloud classification. Compared with generic self-supervised tasks such as color distortion (which alters cloud color, irrelevant to morphology) or instance discrimination (which requires large batches and high computational cost), patch rotation is better aligned with the morphology recognition needs of ground-based cloud images and is computationally efficient, making it suitable for high-resolution data and ultimately enhancing the model’s ability to recognize cloud morphology.

3. Approach

This paper presents a self-supervised learning classification method centered around patch rotation prediction tasks (SSCC-PR). This approach tackles the difficulties in cloud classification by making use of numerous unlabeled images. The framework is depicted in Figure 1. Based on the self-supervised learning framework, this paper devises a patch rotation prediction task. It capitalizes on the spatial structure similarity within cloud images to learn contextual spatial information. Simultaneously, it steers the model to concentrate on the foreground (the cloud region) of the image. The whole process is composed of two parts: the Pre-training phase and the fine-tuning phase. In the Pre-training phase, unlabeled ground-based cloud images first undergo patch rotation processing. Subsequently, these processed images are input into the Swin Transformer encoder for training. This enables the model to acquire a Transformer encoder network with a high semantic representation capacity. Next, in the fine-tuning phase, the encoder obtained from Pre-training serves as the basis for the cloud classification task. Through additional training for this specific task, the model gradually adjusts its parameters. This allows the pre-trained encoder to better adapt to the particular downstream task, thus optimizing the final classification performance. During the fine-tuning process, the model learns from labeled data, strengthening its ability to differentiate between various cloud types and ensuring the accuracy and robustness of cloud classification.

3.1. Pre-Training Stage

In the Pre-training stage, a patch rotation task is put forward in this paper to assist the model in learning the spatial structure and contextual information of images. Initially, the input image is split into numerous small patches, and some of them are randomly chosen for processing. For these selected patches, random clockwise rotation angles are applied. During this operation, each rotation angle is encoded so that the model can comprehend and handle this rotation-related information. These patches, along with their corresponding angle encodings, are jointly input into the backbone network, the Swin Transformer, for further processing. Thanks to its strong feature extraction ability, the Swin Transformer extracts spatial information from the image and offers high-dimensional feature representations for subsequent training. Subsequently, the decoder is utilized for training. The model’s output is a matrix that represents the rotation information of each patch. This entire process enables the model to learn the spatial relationships among different regions in cloud images, providing essential semantic support for accurate prediction in the cloud classification task.
Algorithm 1 shows the detailed algorithm description of the patch rotation prediction auxiliary task. To begin with, the input image is partitioned into multiple small patches. For example, an image of 224 × 224 pixels is divided into 4 × 4 patches, with each patch sized 56 × 56 pixels. This patching approach enables the model to better capture local information and process features of small regions in different positions. Next, a subset of these image patches is randomly rotated. The aim of this rotation task is to train the model to understand image content from various perspectives. The rotation angles are determined based on a unit rotation angle. For example, when the unit rotation angle is 90°, the rotation angles are set as four fixed values: 0°, 90°, 180°, and 270°. Since these rotation angles are discrete, one-hot encoding can be employed to convey the angle information to the network. For example, 0° is encoded as [1, 0, 0, 0], 90° is encoded as [0, 1, 0, 0], and so forth. This encoding method allows the angle information to be fed into the network in a suitable format, facilitating the network to learn the influence of the angle on the image.
Algorithm 1: Patch rotation prediction
Input: $I_i \in \mathbb{R}^{H \times W \times C}$, $\theta_i^{enc}$
Output: $\hat{\theta}_i^{enc}$
Step 1: $I_i^{rot} = \mathrm{rotate}(I_i, \theta_i)$, $\theta_i \in \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$
Step 2: $\theta_i^{enc} = \mathrm{one\_hot}(\theta_i)$
Step 3: $(I_i^{rot}, \theta_i^{enc}) \rightarrow \mathrm{Encoder}$
Step 4: Decoder: $\hat{\theta}_i^{enc} = \mathrm{predict\_angle}(I_i^{rot})$
Step 5: $L_{patch\_rotation} = -\sum_{i=1}^{4}\sum_{j=1}^{4} \theta_i^{enc}[j] \log\left(\hat{\theta}_i^{enc}[j]\right)$
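To make the procedure in Algorithm 1 concrete, the following is a minimal PyTorch sketch of how a single pre-training sample could be generated under the settings described above (a 4 × 4 grid of 56 × 56 patches, a 90° unit angle, and one-hot angle encoding). The function name and tensor layout are illustrative assumptions rather than the authors’ released code.

```python
import torch

def make_rotation_sample(image, grid=4, num_rotated=9, num_angles=4):
    """Rotate a random subset of patches and return the image plus one-hot targets.

    image: tensor of shape (C, H, W), e.g. (3, 224, 224) -> 4x4 patches of 56x56.
    num_angles=4 corresponds to a 90-degree unit angle (0/90/180/270 degrees).
    """
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    targets = torch.zeros(grid * grid, num_angles)
    targets[:, 0] = 1.0                      # default: 0 degrees (not rotated)

    chosen = torch.randperm(grid * grid)[:num_rotated]
    for idx in chosen:
        r, col = divmod(idx.item(), grid)
        k = torch.randint(0, num_angles, (1,)).item()   # multiple of the unit angle
        patch = image[:, r * ph:(r + 1) * ph, col * pw:(col + 1) * pw]
        # torch.rot90 rotates counter-clockwise; -k gives a clockwise rotation
        image[:, r * ph:(r + 1) * ph, col * pw:(col + 1) * pw] = torch.rot90(patch, -k, dims=(1, 2))
        targets[idx] = torch.eye(num_angles)[k]          # one-hot angle encoding

    return image, targets
```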
The Swin Transformer utilizes a window-based mechanism. This mechanism enables the network to conduct self-attention calculations within each local window, effectively capturing the relative relationships among small patches in the image. By doing so, the Swin Transformer can deeply explore the fine-grained features between image patches while maintaining computational efficiency. This unique mechanism facilitates the network in better comprehending the relationship between local regions and the global image information, thereby enhancing the model’s capacity to recognize different cloud classes. As a result, the Swin Transformer can perform cloud classification more accurately, further enhancing the accuracy and robustness of the task. Consequently, we opt to use the Swin Transformer as the backbone network for the cloud classification task. In our approach, the divided image patches, along with their encoded rotation angles, are jointly input and then passed to the Swin Transformer network for processing.
The network outputs a matrix, with each element in it corresponding to the rotation angle info of each small patch in the image. This task can be regarded as a classification problem. Here, the model has to predict the rotation angle of every image patch. Thanks to the decoder design, the model can generate a matrix that includes the rotation angles of all patches. The Pre-training decoder employs a standard three-layer multilayer perceptron (MLP) architecture designed for computational efficiency and implementation simplicity. Thus, it can accurately predict the rotation angles of each patch. The optimization goal is to train the model to make its predictions as close as possible to the real rotation angles. This way, the model’s accuracy and robustness can be improved, ensuring its effectiveness in related tasks.
In order to optimize the model, it is necessary to calculate the difference between the predicted angle encoding and the true angle encoding. Commonly, the cross-entropy loss is utilized to compute the loss, which is defined as Equation (1):
$$L_{patch\_rotation} = -\sum_{i=1}^{4}\sum_{j=1}^{4} \theta_i^{enc}[j] \log\left(\hat{\theta}_i^{enc}[j]\right)$$
where $\theta_i^{enc}[j]$ denotes the $j$-th element of the true angle encoding of patch $i$, and $\hat{\theta}_i^{enc}[j]$ denotes the $j$-th element of the predicted rotation angle encoding of patch $i$.
Ultimately, the network’s weights are optimized using backpropagation and the gradient descent algorithm. In each training iteration, the network parameters are updated according to the calculated loss value. This leads to a gradual adjustment of the weights, allowing the model to better learn the rotation angle features of image patches. This optimization process enables the network to steadily enhance its prediction ability. Eventually, it can accurately predict the rotation angles, thus improving the overall performance of the model.
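As an illustration of this optimization loop, the sketch below shows one possible pre-training step: encoder features are decoded by a three-layer MLP into per-patch angle logits and optimized with the cross-entropy loss of Equation (1). For simplicity, the angle encodings are used only as targets here, whereas in our pipeline they are also supplied alongside the patches; the layer widths, names, and assumed feature shape (batch, 16, dim) are illustrative choices rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class RotationDecoder(nn.Module):
    """Three-layer MLP that maps patch features to rotation-angle logits."""
    def __init__(self, dim=768, num_angles=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_angles),
        )

    def forward(self, features):          # features: (B, num_patches, dim)
        return self.mlp(features)         # logits:   (B, num_patches, num_angles)

def pretrain_step(encoder, decoder, optimizer, images, targets):
    """One optimization step of the patch rotation pretext task.

    images:  (B, 3, 224, 224) rotated cloud images
    targets: (B, 16, 4) one-hot rotation encodings of the 4 x 4 patches
    """
    features = encoder(images)            # assumed shape: (B, 16, dim)
    logits = decoder(features)            # (B, 16, 4) predicted angle logits
    # Cross-entropy between predicted and true angle encodings (Equation (1))
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, 4), targets.reshape(-1, 4).argmax(dim=1))
    optimizer.zero_grad()
    loss.backward()                       # backpropagation
    optimizer.step()                      # gradient descent update
    return loss.item()
```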

3.2. Fine-Tuning Stage

In the fine-tuning stage, we make use of the features acquired in the pre-training phase and then fine-tune the model with cloud image data to optimize its performance in cloud image classification. This process mainly consists of the following steps: loading the pre-trained model, replacing the classification head, adjusting the learning rate, conducting data augmentation, and carrying out the training process.
Firstly, we load the pre-trained model parameters into the Swin Transformer model. Then, based on the specific cloud image categories, we adjust the network’s output layer. This adjustment enables the network to meet the output demands of the cloud image classification task. As the model has learned numerous useful features during Pre-training, there is no need for major parameter adjustments in the fine-tuning stage. Instead, we reduce the learning rate to control the update scale. This way, we can avoid interfering with the effective features already learned.
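A minimal sketch of this setup is given below, assuming the timm library is used to instantiate the Swin Transformer; the checkpoint file name and model variant are placeholders, and the learning rate and weight decay follow the RCCD settings reported in Section 5.1.

```python
import torch
import timm

NUM_CLOUD_CLASSES = 5  # clear sky, cloudy, overcast, cumulus, cumulonimbus (RCCD)

# Instantiate the backbone with a fresh 5-class classification head
model = timm.create_model("swin_tiny_patch4_window7_224", num_classes=NUM_CLOUD_CLASSES)

# Load the weights learned in the self-supervised stage (placeholder file name);
# strict=False drops the pre-training decoder and any mismatched head weights
state = torch.load("ssl_pretrained_swin.pth", map_location="cpu")
model.load_state_dict(state, strict=False)

# A reduced learning rate keeps the update scale small and preserves the
# features learned during pre-training
optimizer = torch.optim.SGD(model.parameters(), lr=5e-5, weight_decay=0.01)
```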
The essence of the fine-tuning stage is to improve the model using the specific features of cloud images. Through additional training with cloud image data, the model can utilize the knowledge it gained during Pre-training to optimize its parameters and enhance classification performance. This process enables the model to recognize different cloud types more precisely, guaranteeing greater accuracy and robustness in practical applications.
In the cloud classification task, we calculate the difference between the predicted values and the true values by using cross-entropy loss. The specific formula is as follows (Equation (2)):
$$L_{classification} = -\sum_{i=1}^{N}\sum_{j=1}^{C} y_i[j] \log\left(\hat{y}_i[j]\right)$$
where $N$ is the total number of images, $C$ is the number of cloud image categories, $y_i[j]$ is the $j$-th element of the one-hot encoding of the true category label of image $i$, and $\hat{y}_i[j]$ is the probability predicted by the model that image $i$ belongs to category $j$.

4. Data

A deep learning-based model for ground-based cloud type recognition requires training, validation, and testing using a dataset. The model’s effectiveness is assessed based on its performance on this dataset. Therefore, creating a high-quality dataset is crucial for all-sky cloud observations.
As shown in Table 1, to investigate the effect of cloud types on irradiance, the paper categorizes cloud conditions into five types: clear sky, cloudy, overcast, cumulus, and cumulonimbus, based on how these clouds affect irradiance [21,22,23]. From this classification, the Radiance Cloud Classification Dataset (RCCD) was developed.
As illustrated in Figure 2, the dataset consists of 4700 images, each with a resolution of 1229 × 1229 pixels. These images were captured by the all-sky imager ASI16 at the Grim Point station in Tasmania, Australia, between February and July 2024. To reduce optical distortion and vignetting effects common in all-sky imagery, we cropped the outer 10% of each RCCD image from all sides, preserving the central 90% region for analysis. This ensures consistent cloud structure representation and minimizes errors caused by edge deformation. Our dataset effectively prevents temporal leakage through complete sequence randomization via Fisher–Yates shuffling, achieving near-maximum entropy (7.92 bits/8 bits ideal) to decouple temporal correlations. For model training, the dataset was divided into training and testing sets in an 8:2 ratio. This dataset offers a diverse range of samples for training and evaluating cloud classification models, making it a valuable resource for cloud classification research, particularly in studying the impact of cloud types on irradiance.
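The split described above can be reproduced with a sketch such as the following, where the seed and path handling are placeholders (Python’s random.shuffle implements the Fisher–Yates algorithm).

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=42):
    """Shuffle the full image list (Fisher-Yates) and apply an 8:2 split."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)   # decouples the temporal ordering of frames
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]      # training set, testing set
```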
To assess the performance of the proposed method, we also conducted a comparative evaluation using the publicly available SWIMCAT dataset [24]. This dataset was collected by Nanyang Technological University in Singapore using the WAHRSIS all-sky imager, with data captured between January 2013 and May 2014 in Singapore. As shown in Figure 3, the dataset contains 784 cloud images across five sky types: clear sky, patterned clouds, veil clouds, thick white clouds, and thick dark clouds. With its diverse cloud types and extensive image samples, the SWIMCAT dataset serves as a valuable resource for evaluating cloud classification models. By comparing it with the RCCD, we can thoroughly assess the proposed method’s performance and its ability to generalize across different datasets.
To ensure a fair and comprehensive comparison, we prepared the datasets and fine-tuned all baseline CNN models using consistent preprocessing and training settings. The RCCD, containing 4700 high-resolution images (1229 × 1229), was center-cropped and resized to 384 × 384 to fit the input requirements of most CNN and Transformer models while reducing computational costs. The SWIMCAT dataset, with 784 low-resolution images (125 × 125), was resized to 224 × 224 using bilinear interpolation to maintain consistency across models.
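The preprocessing pipelines sketched below follow this description; the exact crop size (roughly the central 90% of the 1229-pixel frame) and the ImageNet normalization statistics are illustrative assumptions.

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# RCCD: keep the central region of the 1229 x 1229 frame, then resize to 384 x 384
rccd_transform = transforms.Compose([
    transforms.CenterCrop(1106),   # roughly the central 90% of the all-sky image
    transforms.Resize((384, 384), interpolation=InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# SWIMCAT: upsample the 125 x 125 images to 224 x 224 with bilinear interpolation
swimcat_transform = transforms.Compose([
    transforms.Resize((224, 224), interpolation=InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```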

5. Experiment

This section provides a comprehensive introduction in four parts. First, we will describe the experimental environment, covering hardware configuration, software platforms, and the necessary tool libraries. Next, we will outline the evaluation metrics used to assess model performance, establishing the criteria for subsequent analysis. The third section will present a comparative analysis of key experimental parameters, examining how different settings affect the results and identifying the optimal configuration. Finally, we will validate the superiority of the proposed method through a comparative analysis with other models, showcasing its top performance across various metrics.

5.1. Experimental Configuration

The experimental environment for this study is based on a computer running the Windows 10 operating system, with an Intel Core i5-12600KF central processing unit (CPU) and an NVIDIA GeForce RTX 3070 Ti graphics processing unit (GPU). The programming language used is Python 3.8, with the general parallel computing architecture being CUDA 12.1, and cuDNN 8.9 is used as the GPU acceleration library for deep neural networks. To build the model and fine-tune its parameters, the study employs the PyTorch 2.2.1 deep learning framework. This experimental setup provides robust computational support for the cloud classification task, effectively accelerating the model training and optimization process.
For the RCCD, which contains high-resolution images and a relatively large number of labeled samples, we adopted a full fine-tuning strategy. The pre-trained Swin Transformer was initialized with weights learned from the self-supervised stage. All layers were unfrozen and updated during training. The model was optimized using the SGD optimizer with a learning rate of 5 × 10−5 and weight decay of 0.01. This setup allows the model to adapt fully to the high-resolution, information-rich data provided by the RCCD, which benefits from end-to-end training.
For the SWIMCAT dataset, we partially froze the lower layers of the Swin Transformer backbone (i.e., the first two stages) to retain the general visual features learned during self-supervised Pre-training. Only the higher layers and the classification head were updated using labeled data. This strategy helps reduce overfitting.
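The two strategies can be expressed as in the sketch below; the parameter-name prefixes (patch_embed, layers.0, layers.1) follow the common Swin Transformer implementation and are an assumption, and the SWIMCAT optimizer settings simply reuse the RCCD values for illustration.

```python
import torch

def configure_finetuning(model, dataset="RCCD"):
    """Full fine-tuning for RCCD; freeze the first two Swin stages for SWIMCAT."""
    if dataset == "RCCD":
        params = model.parameters()                # every layer is updated end-to-end
    else:
        for name, p in model.named_parameters():
            frozen = name.startswith(("patch_embed", "layers.0", "layers.1"))
            p.requires_grad = not frozen           # only higher stages + head update
        params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(params, lr=5e-5, weight_decay=0.01)
```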

5.2. Evaluation Indicators

To thoroughly evaluate the performance of the proposed model, we used accuracy, recall, precision, and F1-score as key metrics. These provide a well-rounded view of the model’s effectiveness in cloud image classification. Accuracy reflects the overall proportion of correct predictions, serving as a standard measure of performance. Recall focuses on the model’s ability to identify all actual positive samples, highlighting its sensitivity. Precision measures the proportion of true positives among predicted positives, indicating accuracy. The F1-score, which combines precision and recall through their harmonic mean, assesses the model’s ability to handle positive samples, especially in situations with class imbalance. By using these four metrics, we can evaluate the model from multiple angles, ensuring a comprehensive and balanced performance analysis in the cloud image classification task.
Accuracy is one of the most widely used metrics for evaluating classification models, as it reflects the overall correctness of the model in classifying all samples. A higher accuracy indicates better performance. The formula for calculating accuracy is Equation (3):
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$
Recall, or true positive rate, evaluates the model’s ability to correctly identify positive samples out of all actual positive instances. A higher recall means the model misses fewer positive samples and detects more of them. Recall is especially crucial in cloud classification tasks as it highlights the model’s effectiveness in identifying different cloud categories. The formula for recall is Equation (4):
$$Rec = \frac{TP}{TP + FN}$$
Precision, or Positive Predictive Value, measures the proportion of predicted positive samples that are actually true positives. A higher precision indicates that the model is more accurate in predicting positive samples, reducing the likelihood of false positives. The formula for precision is Equation (5):
$$Pre = \frac{TP}{TP + FP}$$
The F1-score is the harmonic mean of precision and recall, providing a balanced measure of both. It is particularly useful in situations with class imbalance. A higher F1-score indicates a better balance between precision and recall. The formula for the F1-score is Equation (6):
$$F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}$$
In the formulas, $TP$ represents the number of positive samples that are predicted as positive by the model, $TN$ represents the number of negative samples that are predicted as negative by the model, $FP$ represents the number of negative samples that are predicted as positive by the model, and $FN$ represents the number of positive samples that are predicted as negative by the model.
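For reference, the four metrics of Equations (3)–(6) can be computed from the confusion counts as in the short sketch below; in the multi-class setting we report them per class and then average, which is an implementation choice rather than a detail prescribed by the equations.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1-score from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)                      # Equation (3)
    rec = tp / (tp + fn) if (tp + fn) else 0.0                 # Equation (4)
    pre = tp / (tp + fp) if (tp + fp) else 0.0                 # Equation (5)
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0   # Equation (6)
    return acc, rec, pre, f1
```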

5.3. Parameter Analysis

The number of rotated blocks and the rotation angle size are key factors influencing the experimental outcomes. In this study, we divide the image into 16 smaller blocks (4 × 4) and randomly rotate 1 to 15 blocks for comparison analysis, with a training batch size of 100. As shown in Figure 4a, the model performs optimally when 9 blocks are randomly rotated. This allows the model to capture local features effectively while minimizing computational complexity. Using fewer blocks (1 to 3) may result in a lack of sufficient detail, while rotating too many blocks (13 to 15) increases computational cost and may cause the model to overly focus on local features, hindering its understanding of global context. As the number of blocks increases, computational complexity grows, especially with 15 blocks, leading to longer training times, higher resource demands, and suboptimal performance within 100 batches.
To investigate the impact of different rotation angles on the experimental results, we tested angles of 30°, 45°, 60°, and 90°, with the training batch size set to 100. As shown in Figure 4b, the model performs best with a 60° rotation angle. Smaller angles, like 30°, capture more details but make the training process more complex, with limited performance improvement within the given training time. Larger angles, such as 90°, enable effective classification, but the model’s accuracy slightly decreases compared to the 60° setting.

5.4. Comparison Experiment

To validate the effectiveness of the self-supervised method proposed in this paper for cloud classification, we conducted a series of comparative experiments and analyzed the results. The experiments are divided into two main sections: the first compares the performance of the proposed self-supervised cloud classification model with other supervised and unsupervised models using the self-built RCCD and the publicly available SWIMCAT dataset; the second examines the impact of various auxiliary tasks on the model’s performance by comparing classification results under different auxiliary task settings. These two parts provide a thorough evaluation of the self-supervised method’s performance in cloud image classification, exploring the effects of different training strategies and auxiliary tasks, and highlighting the advantages and potential of the proposed approach. In the comparative experiments, performance analysis was conducted using optimized parameters (number of rotations = 9, rotation units = 60). For the comparative experiments, we performed systematic hyperparameter tuning on all baseline models to achieve their best possible performance. For other models included in the comparison, we similarly conducted hyperparameter tuning and ensured consistent training conditions and preprocessing pipelines across all models.
In the comparative experiments, we selected several well-known convolutional neural networks (CNNs), including VGG16 [25], ResNet34 [26], UNet [27], EfficientNet-B4 [28], SENet [29], DenseNet-264 [30], Inception-V3 [31], and Xception [32], as well as Transformer-based models such as the Vision Transformer (ViT) and Swin Transformer. These models are widely used in image classification and segmentation tasks and are recognized for their strong feature extraction capabilities. To ensure a fair and reproducible comparison, all baseline models were trained in a fully supervised manner on the RCCD and the SWIMCAT dataset, which were divided into training and testing sets in an 8:2 ratio. During training, data augmentations including random cropping, horizontal flipping, and color jittering were applied. The models were trained for 100 epochs with the Adam optimizer, an initial learning rate of 1 × 10−4 (with cosine decay), and a batch size of 32. Early stopping was used based on validation accuracy. It is worth noting that while the Swin Transformer is also used in our proposed method, there is a fundamental difference in the training approach. In our method, the Swin Transformer backbone is first pre-trained using a self-supervised patch rotation task on unlabeled cloud images, followed by fine-tuning on labeled data. In contrast, the baseline Swin Transformer is trained directly in a supervised learning manner without any Pre-training.
In the comparative experiments for unsupervised methods, we selected several well-established self-supervised learning techniques, including rotation-based learning for visual representations (Rotation) [33], Bootstrap Your Own Latent (BYOL) [34], Momentum Contrast (MoCo) [35], SimCLR for contrastive learning of visual representations [36], and Pretext-Invariant Representation Learning (PIRL) [37]. These methods represent key approaches in the current unsupervised learning landscape, spanning various strategies from contrastive learning to rotation prediction, all designed to learn effective image representations from unlabeled data. By comparing these methods, we can more effectively evaluate the performance of the proposed self-supervised cloud classification approach and analyze the strengths and limitations of different unsupervised learning strategies. It is worth noting that traditional rotation tasks involve rotating the entire image (e.g., 0°, 90°, 180°, 270°), and the model is trained to classify global rotations. In contrast, the block rotation task in this article divides the image into smaller blocks, each of which may rotate differently. This enables the model to learn local features in different directions while maintaining awareness of the global structure.
In the experimental process, the model is first trained using self-supervised learning methods on the self-supervised training dataset. In this stage, the model is trained to classify and discriminate the rotation angles of different image blocks and extract local and global features of the cloud map. Table 2 shows the hyperparameters for each stage of the experiment. The Stochastic Gradient Descent (SGD) optimizer is used with an initial learning rate of 0.005 [38]. A cosine annealing strategy is employed to adaptively adjust the learning rate during training, helping the model converge smoothly and avoid premature local optima [39]. Once the self-supervised training is complete, the model has learned key features for cloud type recognition, which aid in the cloud classification task. The model then moves to the downstream classification phase, where it applies the learned features to classify cloud types on the labeled dataset, thus evaluating its performance and generalization ability for real-world cloud classification.
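The optimizer and scheduler configuration for this stage can be sketched as follows; the momentum value, the number of epochs used as T_max, and the placeholder model are assumptions not reported above.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 4)   # placeholder for the Swin encoder plus rotation decoder
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one pass over the unlabeled patch-rotation batches goes here ...
    scheduler.step()      # learning rate decays along a cosine curve from 0.005
```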
As shown in Table 3, the proposed self-supervised method SSCC-PR demonstrates strong performance across both datasets, outperforming traditional supervised and unsupervised methods in most metrics. On the RCCD, SSCC-PR achieves the highest accuracy of 96.61%, F1-score of 85.27%, precision of 90.79%, and recall of 92.87%, surpassing the best supervised model (Swin Transformer) by 1.87%, 2.05%, 0.92%, and 0.90%, respectively. The performance margin is even more significant when compared to the best unsupervised model (SimCLR), with improvements of 5.07%, 5.01%, 1.79%, and 4.23%, indicating the effectiveness of the proposed self-supervised Pre-training strategy in extracting meaningful features without reliance on labels. On the SWIMCAT dataset, which is considerably smaller in size, SSCC-PR still outperforms the top supervised method (Swin Transformer) by 1.61% in accuracy, 3.05% in F1-score, 1.95% in precision, and 2.21% in recall. Compared to the best-performing unsupervised method (SimCLR), the improvements reach 3.73%, 6.05%, 2.39%, and 4.64%, respectively. These results suggest that the proposed approach can generalize well even under data-scarce conditions. While the Transformer-based models (e.g., Swin Transformer, Vision Transformer) also achieve competitive results, they rely heavily on supervised fine-tuning and often exhibit overfitting on smaller datasets like SWIMCAT. The proposed SSCC-PR method, by contrast, benefits from self-supervised Pre-training, which improves generalization and robustness, especially when labeled data is limited. In summary, the proposed method not only performs competitively with state-of-the-art supervised models but also shows clear and consistent superiority over unsupervised baselines, reinforcing its applicability in practical cloud classification scenarios where labeled data is scarce.
As shown in Table 4, to provide a comprehensive evaluation, we also referred to existing studies that employed the SWIMCAT dataset. For instance, Liu et al. [40] proposed a deep convolutional neural network (CNN) trained from scratch to address the challenges of small-scale cloud classification. Fang et al. [41] explored cloud type classification using data augmentation and transfer learning techniques. Tang et al. [42] introduced an improved Region Covariance Descriptor (RCovD) combined with a Riemannian Bag-of-Features (BoF) model to achieve high-accuracy predictions of cloud categories. Compared with these approaches, our proposed SSCC-PR method achieves an accuracy of 90.18% on the SWIMCAT dataset, outperforming other models. Moreover, it does not require label supervision during Pre-training, demonstrating competitive or even superior performance.
To establish the Swin Transformer’s efficacy as our foundational architecture, we performed rigorous benchmarking against contemporary Vision Transformer variants, including DeiT, PVT, and Twins-SVT, under identical experimental conditions.
As shown in Table 5, research demonstrates that under identical experimental conditions, the Swin Transformer exhibits advantages over other ViT variants for our specific task, including fewer parameters, lower computational costs, and higher accuracy.
The block rotation auxiliary task divides the image into smaller blocks and trains the model to rotate them, allowing it to capture detailed local features and their rotational variations. Unlike traditional rotation tasks, block rotation can capture both local details and global rotation information. In this experiment, we compare the block rotation task with other common auxiliary tasks, including the rotation task, color restoration task, and occlusion prediction task.
As shown in Table 6, the block rotation auxiliary task outperforms other common tasks (such as rotation, color restoration, and occlusion prediction) in cloud image classification. It shows significant advantages in accuracy, recall, precision, and F1-score. This suggests that the block rotation task, by offering richer feature learning, enhances the model’s classification accuracy and generalization ability.

5.5. Performance Evaluation

To more comprehensively assess the model’s class-specific classification performance, we constructed a normalized confusion matrix. This matrix serves two key purposes: (1) quantifying per-class classification accuracy and (2) identifying critical misclassification patterns. By dissecting these patterns, we aim to uncover the model’s strengths and limitations, providing a data-driven foundation for targeted improvements.
As shown in Figure 5, we analyzed the constructed normalized confusion matrix. Overall, the model exhibited strong recall for all classes owing to their consistent visual features, with the exception of cumulus clouds (77.3% recall), for which the performance was slightly worse. Specifically, 13.8% of true cumulus clouds were incorrectly labeled as cumulonimbus, and 11.2% of true cumulonimbus clouds were mistakenly classified as cumulus. This mutual confusion is a critical issue for severe weather forecasting; it reflects the morphological similarity of the two classes during transitional stages (e.g., towering cumulus) and the model’s limited ability to integrate discriminative meteorological features.
To assess the stability of our results, we performed 10-fold cross-validation on both the RCCD and the SWIMCAT dataset and calculated the standard deviation (SD) of key performance indicators. On the RCCD, the accuracy of our model has an SD of ±0.8% (mean: 89.7%), indicating minimal variance across different data splits. For cumulonimbus recall, the SD is ±1.2% (mean: 82%), demonstrating consistent performance on this critical class. We also repeated the experiments 10 times with different random seeds (for training/test splits) and observed similar stability—this further confirms that our model’s performance is not sensitive to data partitioning.
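The cross-validation protocol can be sketched as below; train_and_evaluate stands in for the full SSCC-PR fine-tuning and evaluation pipeline and is a placeholder.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(samples, train_and_evaluate, seed=0):
    """10-fold cross-validation returning the mean accuracy and its SD."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    accs = [train_and_evaluate(train_idx, test_idx)
            for train_idx, test_idx in kf.split(samples)]
    return float(np.mean(accs)), float(np.std(accs))
```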

6. Conclusions

This paper introduces a self-supervised learning-based cloud classification network (SSCC-PR) for classifying ground-based cloud images. By incorporating a block rotation auxiliary task, the model’s feature extraction capabilities are improved, and the Swin Transformer is used as the backbone network to effectively capture both local and global interactions. To evaluate the method’s effectiveness, we performed comparative experiments on the self-built RCCD and the public SWIMCAT dataset. The results show that the proposed method outperforms others in performance and generalization. Beyond academic contributions, the proposed framework holds strong potential for integration into renewable energy forecasting systems, environmental monitoring networks, and real-time weather prediction pipelines. By improving the accuracy and stability of irradiance forecasts, our work can contribute to more efficient and sustainable energy management strategies.

7. Prospect

Our proposed rotation patch-augmented self-supervised method achieves strong ground-based cloud classification accuracy, supporting refined meteorological warning. However, key directions remain to enhance generality, interpretability, real-time performance, and practicality: First, the exploration of rotation patch angles and sizes is incomplete—future work will investigate the impact of different granularities on model performance. Second, current models rely solely on high-resolution but limited-coverage ground-based images, missing satellite-derived physical features critical for complex clouds (e.g., transitional cumulonimbus) [43]. In addition, our model often confuses optically thick cumulus (which causes a 60% reduction in irradiance) with optically thin altocumulus (which causes a 20% reduction in irradiance)—both have a fluffy white morphology, but their physical impacts are vastly different. This leads to inaccurate irradiance predictions. Through experiments, we found that for downstream cloud classification tasks, each cloud class requires a minimum of 500 labeled samples, with key classes such as cumulus and cumulonimbus needing at least 750 samples to ensure model performance. To address the data bottleneck, we will use generative technologies like GANs to expand the dataset in the future to improve model performance. Future work will fuse morphological features (ground-based edges/textures) with satellite physical features via multimodal frameworks, creating a “morphology–physics” model to improve complex cloud recognition and understanding of global cloud distribution [44].

Author Contributions

Conceptualization, W.Y. and X.X. (Xiong Xiong); methodology, W.Y.; software, X.G.; validation, W.Y., X.X. (Xiong Xiong) and Y.Z.; formal analysis, X.X. (Xinyuan Xia); investigation, W.Y.; resources, X.X. (Xinyuan Xia); data curation, X.X. (Xinyuan Xia); writing—original draft preparation, W.Y.; writing—review and editing, W.Y.; visualization, X.X. (Xinyuan Xia); supervision, X.X. (Xiong Xiong); funding acquisition, X.X. (Xiong Xiong). All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Natural Science Foundation of China under Grant Nos. 42205150 and 42275156, by the China Postdoctoral Science Foundation under Grant No. 2024M761470, and by the Postgraduate Research and Practice Innovation Program of Jiangsu Province under Grant No. KYCX24_1511.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rossow, W.B.; Garder, L.C.; Lacis, A.A. Global, seasonal cloud variations from satellite radiance measurements. Part I: Sensitivity of analysis. J. Clim. 1989, 2, 419–458. [Google Scholar] [CrossRef]
  2. Rossow, W.B.; Lacis, A.A. Global, seasonal cloud variations from satellite radiance measurements. Part II. Cloud properties and radiative effects. J. Clim. 1990, 3, 1204–1253. [Google Scholar] [CrossRef]
  3. Chiu, J.C.; Marshak, A.; Knyazikhin, Y.; Wiscombe, W.J.; Barker, H.W.; Barnard, J.C.; Luo, Y. Remote sensing of cloud properties using ground-based measurements of zenith radiance. J. Geophys. Res. Atmos. 2006, 111, D16201. [Google Scholar] [CrossRef]
  4. Zinner, T.; Mayer, B.; Schröder, M. Determination of three-dimensional cloud structures from high-resolution radiance data. J. Geophys. Res. Atmos. 2006, 111, D08204. [Google Scholar] [CrossRef]
  5. Singh, M.; Glennen, M. Automated ground-based cloud recognition. Pattern Anal. Appl. 2005, 8, 258–271. [Google Scholar] [CrossRef]
  6. Calbo, J.; Sabburg, J. Feature extraction from whole-sky ground-based images for cloud-type recognition. J. Atmos. Ocean. Technol. 2008, 25, 3–14. [Google Scholar] [CrossRef]
  7. Liu, L.; Sun, X.; Chen, F.; Zhao, S.; Gao, T. Cloud classification based on structure features of infrared images. J. Atmos. Ocean. Technol. 2011, 28, 410–417. [Google Scholar] [CrossRef]
  8. Zhang, J.; Liu, P.; Zhang, F.; Song, Q. CloudNet: Ground-based cloud classification with deep convolutional neural network. Geophys. Res. Lett. 2018, 45, 8665–8672. [Google Scholar] [CrossRef]
  9. Fang, C.; Jia, K.; Liu, P.; Zhang, L. Research on cloud recognition technology based on transfer learning. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November 2019; pp. 791–796. [Google Scholar]
  10. Shi, C.; Wang, C.; Wang, Y.; Xiao, B. Deep convolutional activations-based features for ground-based cloud classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 816–820. [Google Scholar] [CrossRef]
  11. Liu, S.; Li, M.; Zhang, Z.; Cao, X.; Durrani, T.S. Ground-based cloud classification using task-based graph convolutional network. Geophys. Res. Lett. 2020, 47, e2020GL087338. [Google Scholar] [CrossRef]
  12. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies 2020, 9, 2. [Google Scholar] [CrossRef]
  13. Zhai, X.; Oliver, A.; Kolesnikov, A.; Beyer, L. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1476–1485. [Google Scholar]
  14. Gui, J.; Chen, T.; Zhang, J.; Cao, Q.; Sun, Z.; Luo, H.; Tao, D. A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9052–9071. [Google Scholar] [CrossRef] [PubMed]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11 October 2021; pp. 10012–10022. [Google Scholar]
  19. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  20. Krizhevsky, A.; Hinton, G. Convolutional deep belief networks on cifar-10. Unpubl. Manuscr. 2010, 40, 1–9. [Google Scholar]
  21. Feng, L.; Hu, C. Cloud adjacency effects on top-of-atmosphere radiance and ocean color data products: A statistical assessment. Remote Sens. Environ. 2016, 174, 301–313. [Google Scholar] [CrossRef]
  22. Wang, T.; Shi, J.; Letu, H.; Ma, Y.; Li, X.; Zheng, Y. Detection and removal of clouds and associated shadows in satellite imagery based on simulated radiance fields. J. Geophys. Res. Atmos. 2019, 124, 7207–7225. [Google Scholar] [CrossRef]
  23. Smith, G.; Priestley, K.; Loeb, N.; Wielicki, B.; Charlock, T.; Minnis, P.; Doelling, D.; Rutan, D. Clouds and Earth Radiant Energy System (CERES), a review: Past, present and future. Adv. Space Res. 2011, 48, 254–263. [Google Scholar] [CrossRef]
  24. Dev, S.; Lee, Y.H.; Winkler, S. Categorization of cloud image patches using an improved texton-based approach. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 422–426. [Google Scholar]
  25. Simonyan, K. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  28. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR 2019, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  30. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  31. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  32. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  33. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv 2018, arXiv:1803.07728. [Google Scholar] [CrossRef]
  34. Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  35. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  36. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
37. Misra, I.; van der Maaten, L. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6707–6717. [Google Scholar]
  38. Amari, S. Backpropagation and stochastic gradient descent method. Neurocomputing 1993, 5, 185–196. [Google Scholar] [CrossRef]
39. Cazenave, T.; Sentuc, J.; Videau, M. Cosine annealing, mixnet and swish activation for computer Go. In Advances in Computer Games; Springer International Publishing: Cham, Switzerland, 2021; pp. 53–60. [Google Scholar]
  40. Liu, Z.; Zhou, S.; Wang, M.; Peng, S.; Shen, A.; Zhou, S. Ground-based visible-light cloud image classification based on a convolutional neural network. In Proceedings of the 2019 4th International Conference on Information Systems and Computer Networks (ISCON), Mathura, India, 21–22 November 2019; pp. 108–112. [Google Scholar]
  41. Fang, H.; Han, B.; Zhang, S.; Zhou, S.; Hu, C.; Ye, W.-M. Data Augmentation for Object Detection via Controllable Diffusion Models. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1246–1255. [Google Scholar] [CrossRef]
  42. Tang, Y.; Yang, P.; Zhou, Z.; Pan, D.; Chen, J.; Zhao, X. Improving cloud type classification of ground-based images using region covariance descriptors. Atmos. Meas. Tech. 2021, 14, 737–747. [Google Scholar] [CrossRef]
  43. Sinaga, K.P.; Yang, M.S. A Globally Collaborative Multi-View k-Means Clustering. Electronics 2025, 14, 2129. [Google Scholar] [CrossRef]
  44. Salazar, A.; Vergara, L.; Vidal, E. A proxy learning curve for the Bayes classifier. Pattern Recognit. 2022, 136, 109240. [Google Scholar] [CrossRef]
Figure 1. The framework of self-supervised cloud classification with patch rotation tasks (SSCC-PR). The decoder employs a standard three-layer multilayer perceptron (MLP) architecture.
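For readers who want a concrete picture of the Figure 1 layout, the following minimal PyTorch sketch pairs a Swin Transformer encoder (taken here from the timm library, an implementation choice of this sketch rather than the authors' released code) with a standard three-layer MLP decoder that outputs rotation logits. The class name `PatchRotationHead`, the hidden width, and the number of rotation classes are illustrative assumptions; for simplicity the sketch predicts a single rotation class from the pooled representation, whereas the per-patch label design follows the paper.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency; provides Swin Transformer backbones


class PatchRotationHead(nn.Module):
    """Swin Transformer encoder followed by a three-layer MLP decoder.

    Illustrative sketch of the Figure 1 layout only: the backbone variant,
    hidden width, and number of rotation classes are assumptions.
    """

    def __init__(self, num_rotation_classes: int = 4, hidden_dim: int = 512):
        super().__init__()
        # Backbone without its classification head; returns pooled features.
        self.encoder = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=0
        )
        feat_dim = self.encoder.num_features  # 768 for Swin-Tiny
        # Standard three-layer MLP decoder, as stated in the caption.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_rotation_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.encoder(x)     # (B, feat_dim) pooled representation
        return self.decoder(features)  # (B, num_rotation_classes) rotation logits


if __name__ == "__main__":
    logits = PatchRotationHead()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 4])
```

During pre-training such rotation logits would be trained with a cross-entropy loss against the rotation labels generated by the blockwise rotation transform.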
Figure 2. Radiance Cloud Classification Dataset (RCCD, N = 4700).
Figure 3. Singapore Whole sky IMaging CATegories database (SWIMCAT, N = 784).
Figure 4. Analysis of the number of rotating blocks and the rotation unit angle. (a) Analysis of the number of rotating blocks; (b) analysis of the rotation unit angle.
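As a companion to Figure 4, the sketch below (a hypothetical helper named `rotate_patches`) illustrates one way the two analysed hyperparameters, the number of rotating blocks and the rotation unit angle, could enter a blockwise rotation transform. It is a simplified illustration and not the authors' implementation: it assumes square patches and restricts the unit angle to multiples of 90° so that torch.rot90 can be used without resampling; arbitrary unit angles would require an interpolating rotation such as torchvision's rotate.

```python
import random
import torch


def rotate_patches(image: torch.Tensor, grid: int = 2, unit_angle: int = 90):
    """Split a (C, H, W) image into grid x grid square patches and rotate each
    by a random multiple of `unit_angle` degrees; return the augmented image
    and the per-patch rotation labels.

    Simplified illustration: assumes H == W, H divisible by `grid`, and
    `unit_angle` a multiple of 90.
    """
    _, h, w = image.shape
    ph, pw = h // grid, w // grid
    steps = 360 // unit_angle          # number of distinct rotation classes
    out = image.clone()
    labels = torch.empty(grid * grid, dtype=torch.long)

    for i in range(grid):
        for j in range(grid):
            k = random.randrange(steps)                      # rotation class
            patch = image[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            rotated = torch.rot90(patch, k=k * (unit_angle // 90), dims=(1, 2))
            out[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = rotated
            labels[i * grid + j] = k
    return out, labels


if __name__ == "__main__":
    aug, lab = rotate_patches(torch.randn(3, 224, 224), grid=2, unit_angle=90)
    print(aug.shape, lab.tolist())  # torch.Size([3, 224, 224]) and 4 labels in {0..3}
```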
Figure 5. Cloud classification confusion matrix. The rows of the matrix represent the true classes, and the columns represent the predicted classes. The diagonal values indicate the proportion of correct classifications by the model, while the off-diagonal values represent misclassification rates. Darker colors indicate higher values, and the red box highlights the cloud classes with the highest misclassification.
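A row-normalized confusion matrix of the kind shown in Figure 5 can be produced with scikit-learn as sketched below; the five class names are assumed to be the RCCD categories of Table 1, and the label arrays are placeholders to be replaced with real test-set predictions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Assumed class set (the five RCCD categories of Table 1).
class_names = ["Clear Sky", "Cloudy", "Overcast", "Cumulus", "Cumulonimbus"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, len(class_names), size=500)
y_pred = y_true.copy()  # placeholder: a perfect classifier

# Row-normalized matrix: each row (true class) sums to 1, as in Figure 5.
cm = confusion_matrix(y_true, y_pred, normalize="true")

fig, ax = plt.subplots(figsize=(5, 5))
im = ax.imshow(cm, cmap="Blues", vmin=0.0, vmax=1.0)
ax.set_xticks(range(len(class_names)))
ax.set_xticklabels(class_names, rotation=45, ha="right")
ax.set_yticks(range(len(class_names)))
ax.set_yticklabels(class_names)
ax.set_xlabel("Predicted class")
ax.set_ylabel("True class")
fig.colorbar(im, ax=ax)
fig.tight_layout()
plt.show()
```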
Table 1. Relationship between cloud types and irradiance, and the distribution of the RCCD data.

Varieties of Clouds | Description | Impact on Irradiance | Quantity
Clear Sky | The sky has no significant cloud layers or only a few clouds, with ample sunlight and high visibility. | Irradiance is high, with sunlight directly reaching the ground, typically representing the brightest moments. | 805
Cloudy | The sky is partially covered by clouds, with cloud cover typically ranging from 40% to 70%. Clear skies alternate with cloud layers, and sunlight may appear intermittently. This condition may include various cloud types such as cumulus, stratus, and cirrostratus. | Irradiance is lower than on a clear day because the clouds partially block the sunlight, reducing its intensity. | 1200
Overcast | The sky is completely covered by thick, dense clouds that prevent sunlight from penetrating. | Irradiance is very low, with sunlight almost completely blocked, creating a gray, overcast appearance. | 845
Cumulus | Distinct white, fluffy clouds with a flat base and a convex-shaped top. | Cumulus clouds block sunlight over a relatively small area, causing significant variations in irradiance; in some regions, irradiance may decrease due to their shading effect. | 1000
Cumulonimbus | A highly developed cumulus cloud, typically characterized by intense convective activity, with the cloud top potentially reaching the tropopause; it is the most powerful type of convective cloud. | It has a significant impact on irradiance, as it can completely block sunlight, resulting in extremely low irradiance; it is usually associated with precipitation, thunderstorms, and other meteorological hazards. | 850
Table 2. Hyperparameters used in each experimental stage.

Hyperparameter | Pre-Training | Fine-Tuning (RCCD) | Fine-Tuning (SWIMCAT)
Batch Size | 256 | 32 | 32
Epochs | 300 | 100 | 50
Optimizer | AdamW | SGD | AdamW
LR/Weight Decay | 0.05 | 5 × 10⁻⁵ | 1 × 10⁻⁴
LR Schedule | Cosine Annealing | Cosine Annealing | Cosine Annealing
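The Table 2 settings can be expressed directly with PyTorch's built-in optimizers and scheduler, as in the sketch below. The single value in the "LR/Weight Decay" column is treated here as the learning rate, and the SGD momentum and the placeholder model are assumptions of this sketch rather than settings reported in the table.

```python
import torch.nn as nn
from torch.optim import AdamW, SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(768, 5)  # placeholder module standing in for the backbone

# Pre-training: AdamW, batch size 256, 300 epochs, cosine annealing.
# The table's single "LR/Weight Decay" entry (0.05) is used as the LR here.
pretrain_opt = AdamW(model.parameters(), lr=0.05)
pretrain_sched = CosineAnnealingLR(pretrain_opt, T_max=300)

# Fine-tuning on RCCD: SGD, lr 5e-5, 100 epochs (momentum is an assumption).
rccd_opt = SGD(model.parameters(), lr=5e-5, momentum=0.9)
rccd_sched = CosineAnnealingLR(rccd_opt, T_max=100)

# Fine-tuning on SWIMCAT: AdamW, lr 1e-4, 50 epochs.
swimcat_opt = AdamW(model.parameters(), lr=1e-4)
swimcat_sched = CosineAnnealingLR(swimcat_opt, T_max=50)

# In a training loop, optimizer.step() is called per batch and
# scheduler.step() once per epoch.
```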
Table 3. Experimental comparison of SSCC-PR with different models.

Category | Model | RCCD Acc | RCCD Rec | RCCD Pre | RCCD F1 | SWIMCAT Acc | SWIMCAT Rec | SWIMCAT Pre | SWIMCAT F1
Supervised | VGG16 | 81.08 | 70.55 | 79.14 | 80.68 | 80.29 | 64.59 | 79.41 | 79.38
Supervised | Resnet34 | 83.71 | 71.94 | 81.29 | 82.43 | 82.46 | 71.97 | 80.41 | 81.86
Supervised | UNet | 77.58 | 65.83 | 76.97 | 77.29 | 76.85 | 65.33 | 76.29 | 77.91
Supervised | EfficientNet-B4 | 91.85 | 79.54 | 88.64 | 91.57 | 88.57 | 76.52 | 86.98 | 87.54
Supervised | SENet | 81.82 | 69.82 | 81.25 | 82.93 | 80.74 | 67.85 | 75.98 | 79.52
Supervised | DenseNet-264 | 88.53 | 77.69 | 85.94 | 97.85 | 86.45 | 75.85 | 84.54 | 84.22
Supervised | Inception-V3 | 85.84 | 74.56 | 84.18 | 85.49 | 82.57 | 71.76 | 82.05 | 79.84
Supervised | Xception | 86.88 | 74.93 | 85.09 | 85.49 | 85.97 | 69.85 | 81.25 | 81.99
Supervised | Vision Transformer | 92.67 | 80.73 | 89.31 | 90.19 | 82.95 | 69.86 | 80.54 | 81.46
Supervised | Swin Transformer | 94.74 | 83.22 | 89.87 | 91.97 | 83.46 | 69.59 | 81.64 | 82.28
Unsupervised | Rotation | 74.57 | 62.98 | 73.46 | 71.94 | 71.38 | 60.35 | 71.58 | 70.29
Unsupervised | BYOL | 85.61 | 76.51 | 83.85 | 83.87 | 81.62 | 70.66 | 81.88 | 80.25
Unsupervised | MoCo | 85.93 | 76.87 | 84.33 | 84.16 | 82.71 | 70.98 | 81.45 | 80.59
Unsupervised | SimCLR | 91.54 | 80.26 | 89.00 | 88.64 | 86.45 | 73.52 | 86.54 | 85.15
Unsupervised | PIRL | 82.44 | 71.59 | 81.61 | 80.58 | 82.26 | 71.14 | 82.64 | 81.57
Ours | SSCC-PR | 96.61 | 85.27 | 90.79 | 92.87 | 90.18 | 79.57 | 88.93 | 89.75
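For completeness, the accuracy, recall, precision, and F1 columns reported in Table 3 (and in Tables 5 and 6) can be computed with scikit-learn as sketched below; macro averaging over cloud classes is assumed here, since the exact averaging convention is not restated in the table, and the label arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder label arrays; replace with real test-set labels and predictions.
y_true = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
y_pred = np.array([0, 1, 2, 3, 3, 0, 1, 2, 3, 4])

acc = accuracy_score(y_true, y_pred)
# Macro averaging weights every cloud class equally (an assumption here).
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Acc={acc:.4f}  Rec={rec:.4f}  Pre={prec:.4f}  F1={f1:.4f}")
```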
Table 4. Comparison of results on the SWIMCAT dataset.

Method | Accuracy (%)
Fang et al. [41] | 88.1
Tang et al. [42] | 88.75
Liu et al. [40] | 85.81
Ours | 90.18
Table 5. Comparative experimental results of baseline models.

Baseline Model | Params (M) | FLOPs (G) | Inference Time (ms) | GPU Memory (GB) | Pre
ViT-Base | 86 | 17.6 | 86 | 23.5 | 82.34
DeiT-Small | 22 | 4.6 | 23 | 6.1 | 87.91
PVTv2-B3 | 45 | 6.9 | 34 | 9.2 | 85.61
Twins-SVT-B | 57 | 8.9 | 44 | 11.9 | 87.56
Swin-Tiny | 28 | 4.5 | 22 | 6.0 | 90.79
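The complexity figures of the kind reported in Table 5 can be approximated with standard PyTorch tooling, as in the sketch below for a Swin-Tiny backbone. The input resolution, the warm-up/timing loop, and the use of inference-time peak memory (the table's GPU memory may instead refer to training) are assumptions of this sketch, and FLOPs counting is omitted because it requires an external profiler.

```python
import time
import torch
import timm  # assumed dependency for the baseline backbones

device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("swin_tiny_patch4_window7_224", num_classes=5)
model = model.to(device).eval()

# Parameter count in millions.
params_m = sum(p.numel() for p in model.parameters()) / 1e6

# Average single-image inference latency over repeated runs.
x = torch.randn(1, 3, 224, 224, device=device)
with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / 50 * 1e3

# Peak GPU memory in GB (inference only; meaningful on CUDA devices).
mem_gb = torch.cuda.max_memory_allocated() / 1e9 if device == "cuda" else float("nan")

print(f"Params: {params_m:.1f} M | Inference: {latency_ms:.1f} ms | Peak memory: {mem_gb:.2f} GB")
```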
Table 6. Comparison of different auxiliary tasks in experiments.

Auxiliary Tasks | Acc | Rec | Pre | F1
Rotation | 89.46 | 78.63 | 84.67 | 85.37
Colorization | 90.45 | 79.88 | 85.53 | 85.74
Inpainting | 94.75 | 83.54 | 87.41 | 88.96
SSCC-PR (Ours) | 96.61 | 85.27 | 90.79 | 92.87