Article

Fine-Grained Segmentation Method of Ground-Based Cloud Images Based on Improved Transformer

1
State Power Investment Group Hebei Electric Power Co., Ltd., Shijiazhuang 050000, China
2
Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 156; https://doi.org/10.3390/electronics15010156
Submission received: 6 December 2025 / Revised: 26 December 2025 / Accepted: 27 December 2025 / Published: 29 December 2025
(This article belongs to the Special Issue Applications of Artificial Intelligence in Electric Power Systems)

Abstract

Solar irradiance is one of the main factors affecting the output of photovoltaic power stations, and the cloud distribution above a station largely determines how much solar irradiance is received. Estimating cloud amount is therefore another important factor affecting station output. Automated ground-based cloud observation is an important means of estimating cloud amount and distribution, and ground-based cloud image segmentation is a core component of such automated observation. Most previous ground-based cloud image segmentation methods rely on convolutional neural networks (CNNs) and lack modeling of long-range dependencies. Given the rich fine-grained attributes of ground-based cloud images, this paper proposes a new Transformer architecture for fine-grained segmentation of ground-based cloud images. The model follows an encoder-decoder design. To further mine the fine-grained features of the image, the BiFormer block replaces the original Swin Transformer block; to reduce the number of model parameters, an MLP replaces the original bottleneck layer; and to strengthen local features of ground-based clouds, a multi-scale dual-attention (MSDA) block is integrated into the skip connections, allowing the model to extract both local and global features. The model is analyzed quantitatively and qualitatively: it achieves the best segmentation accuracy among the compared methods, with mIoU reaching 65.18%, and ablation results confirm the contribution of the key components to segmentation accuracy.

1. Introduction

The continuous increase in global carbon dioxide (CO2) emissions has intensified the occurrence of extreme climate events, resulting in substantial environmental and socio-economic impacts worldwide [1]. In response, China has committed to the objectives of the Paris Agreement by proposing the strategic targets of “carbon peaking” and “carbon neutrality” and has implemented a series of national policies to promote a sustainable low-carbon transition [2]. Within this global trend, many countries and enterprises have expanded their investments in renewable energy technologies. Photovoltaic (PV) power generation, in particular, has become a dominant form of solar energy utilization due to its cleanliness, scalability, and declining cost [3]. However, the power output of PV systems is strongly influenced by solar irradiance and meteorological conditions, whose intermittent and stochastic characteristics can lead to significant variability in PV generation. This variability poses challenges for grid stability. Accurate forecasting of PV power output is therefore critically important for distribution network planning and real-time operational management, ensuring a balanced and resilient integration of renewable energy sources into the electrical system.
Clouds are among the most critical atmospheric phenomena, formed through the condensation of water vapor and the aggregation of ice crystals [4]. Their morphology and spatial distribution exhibit strong short-term variability driven primarily by local weather conditions rather than long-term climate change. For photovoltaic (PV) power applications, these rapid weather-induced changes in cloud cover have a direct and immediate impact on the real-time solar irradiance received by ground-based PV power stations. In this context, timely and accurate characterization of cloud cover is a key prerequisite for short-term PV power forecasting and real-time operational decision-making.
Currently, cloud images used for analysis and modeling are generally categorized into two types based on their acquisition methods: satellite-based cloud images [5] and ground-based cloud images [6]. Compared to satellite images, ground-based cloud imagery offers higher local spatial resolution and retains more detailed cloud texture features, which are essential for accurate cloud characterization. Moreover, since the background in ground-based imagery is predominantly the sky, these images are less susceptible to interference during cloud detection and recognition, enhancing their reliability for localized cloud analysis. The shape and spatial distribution of clouds play a critical role in climate and weather modeling, as well as in advancing our understanding of aerosol–cloud interactions. Accurate characterization of cloud cover is essential for the development of environmental prediction models that incorporate radiative transfer and cloud microphysical properties. Furthermore, the detection and interpretation of cloud coverage have been extensively studied for their importance in estimating and forecasting solar irradiance, which is a key factor in the performance prediction of PV power generation systems [7].
Ground-based cloud image segmentation is a fundamental aspect of ground-based cloud observation and is closely related to applications such as weather forecasting and PV power prediction. Existing segmentation methods primarily focus on separating clouds from the sky background, without further distinguishing between different cloud types [8]. However, for PV power generation, fluctuations in output are influenced by several factors, including local cloud coverage, solar irradiance, and photovoltaic cell performance. Among these, variations in local cloud coverage are one of the primary causes of power intermittency and instability, so accurate quantification of cloud coverage is essential for improving the precision of PV power forecasting. Notably, clouds exhibit rich fine-grained semantic features, and different cloud types exert varying influences on solar irradiance, which in turn have differential impacts on PV power output [9]. Coarse segmentation approaches that treat all clouds as a single category fail to capture this critical heterogeneity. To address this limitation, this study proposes a fine-grained segmentation method for ground-based cloud images based on visual deep learning techniques. By explicitly distinguishing multiple cloud categories within a single scene, the proposed method provides segmentation results that are not only accurate in a quantitative sense but also sufficiently informative for downstream PV power forecasting tasks, laying a robust data foundation for more accurate and reliable PV power prediction models. As demonstrated by the experimental results, the achieved segmentation performance satisfies the practical requirements of the intended application scenario, thereby enabling reliable cloud-aware PV power prediction.

2. Related Work

Ground-based cloud image segmentation methods can be broadly classified into three categories: threshold-based methods, traditional machine learning approaches, and deep learning-based techniques.
Threshold-based methods often rely on differences between the red (R) and blue (B) channels in cloud images to distinguish cloud pixels from the sky background [10]. However, these methods are highly sensitive to the choice of threshold and typically struggle with multimodal histogram distributions. To address these limitations, researchers have proposed a variety of improved thresholding techniques, including global adaptive thresholding, local thresholding, and hybrid thresholding methods, which enhance segmentation robustness and accuracy to varying degrees. Even so, thresholding-based cloud–sky segmentation remains limited to binary classification and relies primarily on intensity contrasts, making it unsuitable for distinguishing multiple fine-grained cloud types with subtle texture and structural differences.
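As a concrete illustration of this family of methods, the minimal sketch below thresholds the normalized blue-red ratio of an RGB sky image; the ratio form and the cutoff value 0.30 are illustrative choices, not taken from any specific method cited above.

```python
import numpy as np

def threshold_cloud_mask(rgb: np.ndarray, t: float = 0.30) -> np.ndarray:
    """Binary cloud/sky mask from the normalized blue-red ratio.

    Clear sky scatters blue light strongly, so (B - R) / (B + R) is large
    for sky pixels and small for whitish or grayish cloud pixels.
    """
    r = rgb[..., 0].astype(np.float32)
    b = rgb[..., 2].astype(np.float32)
    ratio = (b - r) / (b + r + 1e-6)  # small epsilon avoids division by zero
    return ratio < t                  # True where the pixel is classified as cloud
```

Such a global threshold works reasonably on clear-sky scenes but, as noted above, collapses all clouds into one class and fails on multimodal histograms.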
Traditional machine learning approaches for cloud segmentation primarily include pixel-level and superpixel-level learning. Pixel-level learning treats each pixel independently, applying classifiers to assign labels on a per-pixel basis. In contrast, superpixel-level learning involves first segmenting the image into homogeneous superpixels, then applying classification models to these grouped regions as unified entities, which can reduce noise and computational complexity [11].
In recent years, deep learning has been increasingly applied to the segmentation of ground-based cloud images, yielding significant advances in both performance and generalization. Various convolutional neural network (CNN) architectures have been adapted and extended for this task, including Fully Convolutional Networks (FCNs) [12], U-Net [13], and SegNet [14]. Wu and Shi proposed a locally clustered FCN-CNN model and an enhanced fully convolutional network for improved segmentation accuracy. Shi et al. further refined the U-Net architecture, introducing several variants such as CloudU-Net [6], CloudU-Netv2 [15] and CloudRAEDNet [16], each demonstrating improved capability in modeling cloud structures and boundaries. Similarly, SegNet-based models such as CloudSegNet, SegCloud, and LGCSegNet have been developed and employed for effective cloud image segmentation.
To better capture global contextual information, Liu et al. proposed TransCloudSeg [17], a hybrid architecture that integrates convolutional neural networks with Transformer-based modules. By leveraging self-attention mechanisms, the model effectively fuses heterogeneous feature maps, resulting in significantly improved segmentation accuracy over previous methods.

3. Methods

3.1. Swin Transformer

Although CNNs have demonstrated remarkable success in image processing tasks, their inherently limited receptive fields constrain their ability to model long-range contextual dependencies. This limitation is particularly critical in dense prediction tasks such as semantic segmentation, where capturing global context is essential for accurate classification at the pixel level. In the case of cloud image segmentation, where cloud structures often span large areas and exhibit complex patterns, global contextual understanding becomes indispensable.
Traditional CNNs have long been the backbone of many vision tasks, but they struggle to capture relationships between distant parts of an image. This stems mainly from the fixed size of convolutional filters, which limits the network's view to a local neighborhood. To work around this, researchers have explored several strategies. Dilated convolutions, for example, stretch the receptive field without increasing the number of parameters, allowing the network to gather more context. Other methods enlarge the convolution kernels or apply spatial pyramid pooling, which pools features at multiple scales to obtain a broader view. Still other approaches combine features from shallow and deep layers, mixing fine details with more abstract representations. Despite these advances, however, the fundamentally local nature of convolutions means that modeling truly long-range dependencies remains challenging; a small example of the dilation trick is sketched below.
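The following short PyTorch check illustrates the receptive-field effect of dilation; the channel count and input size are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

# Two 3x3 convolutions with identical parameter counts but different
# receptive fields: dilation=2 widens the effective kernel to 5x5
# without adding any weights.
conv_plain = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)
conv_dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 16, 64, 64)
# Both preserve the spatial size, so they are drop-in replacements.
assert conv_plain(x).shape == conv_dilated(x).shape == x.shape
```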
In recent years, Transformer architectures have gained significant attention in the computer vision community for their powerful capability to capture global interactions through self-attention. While the original vision Transformer achieves state-of-the-art performance on many tasks, it computes attention across all pairs of image patches at the same scale, resulting in prohibitively high computational costs when applied to high-resolution images. The Swin Transformer [18] alleviates this by restricting self-attention to local windows that are shifted between successive layers, yielding a hierarchical representation with computational complexity linear in image size; it serves as the basic architecture of our model.

3.2. Overall Framework

The overall architecture of the model is shown in Figure 1. The proposed model consists of four main parts: encoder, decoder, bottleneck layer, and MSDA. In Transformer-based models for image processing, we do not feed the raw pixel grid of the image directly into the network. Instead, a preprocessing step slices the image into smaller, regularly sized sections, commonly 16 × 16 or 32 × 32 pixels. Think of these sections as small tiles cut from a larger picture. Each tile, or patch, is then flattened into a one-dimensional array, and this array is passed through a trainable linear layer to create what is called an embedding.
This patch embedding step essentially re-frames the 2D image into a form that the Transformer can interpret more easily—as a sequence of vectors, similar to how it would process words in a sentence. To help the model understand where each patch came from in the original image, a positional encoding is added. This encoding tells the model something about the spatial order of the patches, so it does not lose track of the structure of the image. Some models use a fixed formula for this encoding, while others learn the position data during training. Once patch embeddings and their corresponding position information are ready, the data is fed into the Transformer layers. These layers can model relationships across distant parts of the image, which turns out to be very useful. For instance, in tasks like analyzing cloud patterns from ground-based sky images, it is important not just to see texture in a small area, but to understand how different parts of the sky relate to one another across the whole frame. The Transformer helps with that by capturing both local detail and the bigger picture.
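As a rough illustration of this pipeline, the sketch below implements patch embedding with a strided convolution plus a learned positional embedding. The patch size, embedding dimension, and input resolution here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Slice an image into P x P patches and project each to an embedding.

    A strided convolution is the standard trick: one Conv2d with
    kernel_size = stride = P is equivalent to flattening each patch and
    applying a shared linear layer.
    """
    def __init__(self, in_ch=3, embed_dim=96, patch=4, img_size=224):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        # Learned positional embedding, one vector per patch position.
        num_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                 # x: (B, 3, H, W)
        x = self.proj(x)                  # (B, C, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence
        return x + self.pos               # add spatial position information

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])
```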

3.3. BiFormer Block

In this stage, the ground-based cloud image first passes through the patch embedding and positional encoding steps described in Section 3.2, producing a sequence of position-aware patch vectors. These tokens are fed into the Transformer layers, which can model relationships across different regions of the image. This is particularly important for fine-grained segmentation of ground-based clouds: the model must not only capture local cloud textures but also understand the spatial distribution of clouds across the entire sky to accurately delineate cloud types and boundaries.
After patch embedding, the output is fed into the BiFormer block, whose structure is shown in Figure 2. In Transformer-based vision models, handling the trade-off between accuracy and computational cost is a key challenge, especially when working with high-resolution images. One approach that has gained attention is Bi-level Routing Attention (BRA) [19]. Instead of relying on a fixed pattern that computes attention across all tokens, which is expensive and often redundant, BRA takes a more flexible, content-driven approach. The attention mechanism runs in two stages. First, it looks at larger image regions and decides, based on semantic similarity, which areas are likely to be important; this initial filtering step discards irrelevant parts early on. Then, once the candidates are narrowed down, it applies token-level attention inside the selected regions. In effect, the model zooms out to get a general idea and then zooms in where it matters. The processing flow within the BiFormer block consists of three sequential stages: a convolutional stage for local feature enhancement, a BRA stage for global dependency modeling, and a feed-forward network stage.
Let $X_{in} \in \mathbb{R}^{H \times W \times C}$ denote the input feature map. First, a 3 × 3 convolution is applied to implicitly encode relative positional information, which is critical for dense prediction tasks such as cloud segmentation. This is formulated as:

$$X_{conv} = X_{in} + \mathrm{Conv}_{3 \times 3}(X_{in}) \quad (1)$$

where $\mathrm{Conv}_{3 \times 3}(\cdot)$ represents a depthwise convolution with a kernel size of 3 × 3, and the residual connection preserves the original feature information.
Subsequently, the feature map passes through the BRA module and the MLP module. Following the design of modern Vision Transformers, Layer Normalization (LN) is applied before the attention and MLP layers (Pre-Norm configuration), while residual connections are added after each module. The mathematical expressions for these steps are:
$$X_{att} = X_{conv} + \mathrm{BRA}(\mathrm{LN}(X_{conv})) \quad (2)$$

$$X_{out} = X_{att} + \mathrm{MLP}(\mathrm{LN}(X_{att})) \quad (3)$$

where $\mathrm{LN}(\cdot)$ denotes Layer Normalization and $\mathrm{BRA}(\cdot)$ represents the core Bi-level Routing Attention mechanism, which dynamically filters out irrelevant key-value pairs at a coarse region level before computing fine-grained token-to-token attention. $X_{out}$ is the final output of the BiFormer block and serves as the input to the next layer.
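To make the data flow concrete, the following PyTorch sketch wires up Equations (1)–(3). Since the full BRA implementation is beyond this outline, a plain multi-head attention stands in as a clearly labeled placeholder, and the token grid is assumed square; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class BiFormerBlockSketch(nn.Module):
    """Skeleton of Eqs. (1)-(3) on a (B, N, C) token sequence.

    The real BRA first routes each query to the top-k most relevant
    regions and only then computes token-level attention inside them;
    here nn.MultiheadAttention is a placeholder so the residual and
    Pre-Norm structure can be seen end to end.
    """
    def __init__(self, dim=96, heads=3, mlp_ratio=4, side=56):
        super().__init__()
        self.side = side
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise 3x3
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # BRA placeholder
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                                    # x: (B, N, C), N = side^2
        b, n, c = x.shape
        img = x.transpose(1, 2).reshape(b, c, self.side, self.side)
        x = x + self.dwconv(img).flatten(2).transpose(1, 2)  # Eq. (1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # Eq. (2), BRA -> MHA stand-in
        x = x + self.mlp(self.norm2(x))                      # Eq. (3)
        return x

out = BiFormerBlockSketch()(torch.randn(2, 56 * 56, 96))
print(out.shape)  # torch.Size([2, 3136, 96])
```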
One key feature that distinguishes BRA from earlier sparse attention methods is its dynamic, content-adaptive mechanism. Unlike fixed-pattern attention approaches that blindly compute relationships across all regions, BRA leverages the query information itself to determine which areas are most relevant, effectively focusing the model’s resources on the most informative parts of the image. This not only improves computational efficiency but also enhances the model’s ability to capture meaningful structures, avoiding wasted computation on irrelevant regions.
Another major advantage of BRA is its scalability. By selectively attending only to a subset of token pairs—specifically, those likely to contain valuable information—BRA significantly reduces computational overhead, even when handling high-resolution images. This property is particularly critical for tasks such as semantic segmentation, and it is especially beneficial in fine-grained segmentation of ground-based cloud imagery. In such scenarios, models must simultaneously understand the global layout of the sky and precisely capture local cloud textures and boundaries. BRA’s coarse-to-fine attention strategy enables the model to first identify semantically important regions (e.g., dense or complex cloud formations) and then focus on fine-grained details within those regions, ensuring accurate segmentation of cloud types and shapes.
By combining efficiency, dynamic adaptability, and fine-grained focus, BRA provides a powerful mechanism for modeling both local and global contextual relationships in cloud images. This makes it particularly well-suited for the challenges of ground-based cloud analysis, where subtle texture differences and spatial relationships are crucial for precise, high-resolution segmentation.

3.4. Multi-Scale Dual-Attention

Although Swin-UNet can rely on the Transformer to model global semantic information, its skip connections still transmit the shallow encoder features directly. These features usually contain substantial background noise, which manifests as blurred cloud-sky boundaries and incomplete thin-cloud structures in ground-based cloud images. Inspired by medical image segmentation [20], this study therefore proposes an MSDA fusion module: a fine-grained feature reconstruction module that integrates a position attention module (PAM), a channel attention module (CAM), and multi-scale convolution enhancement is embedded in each level of skip connection to improve the quality of the information passed from the encoder to the decoder. The detailed structure of MSDA is shown in Figure 3.
Specifically, the multi-scale depthwise convolution branch captures cloud structural features at different scales, enhancing local details such as broken and thin clouds. The PAM branch models global positional dependencies between pixels, allowing the skip-transferred features to focus on cloud morphology and boundary regions. The CAM branch adaptively assigns weights according to inter-channel semantic importance, highlighting features related to cloud texture while suppressing sky background and observation noise. Finally, a learnable fusion strategy dynamically fuses the three enhanced features with the original skip features, enabling more effective synergy between high-level semantics and low-level structural information at the decoder. This design preserves cloud edge details, integrates semantics across scales, and alleviates the original architecture's weakness in local boundary recognition: the network delineates the cloud-sky boundary more clearly, segments thin clouds and small-scale cloud clusters more accurately, and thus provides more reliable and detailed feature support for subsequent cloud classification, segmentation, and related tasks. This fine-grained feature extraction is of great significance for improving model performance in ground-based cloud image analysis.
To capture cloud structural features at different scales and enhance local details (such as stratus and cirrus clouds), MSDA employs an intermediate parallel convolutional branch. The input features $F_{in}$ are first reduced in dimensionality by a 1 × 1 convolution and then fed into three parallel depthwise separable convolution (DSConv) paths, which extract local-to-mesoscale structural features under different receptive fields. Depthwise separable convolutions increase the network's non-linear expressive power while maintaining low computational cost. The features extracted by each path are projected by a 1 × 1 convolution and then element-wise summed. The calculation process can be represented as follows:
$$F_{MS} = \sum_{i=1}^{3} C^{i}_{1 \times 1}\!\left(D^{i}_{k}\!\left(C^{in}_{1 \times 1}(F_{in})\right)\right) \quad (4)$$

Here, $C^{i}_{1 \times 1}$ represents a 1 × 1 convolution, and $D^{i}_{k}$ represents a depthwise separable convolution with a distinct kernel size or dilation rate, designed to capture multi-scale spatial details.
To suppress interference from the sky background on thin-cloud regions, the top branch introduces a position attention module (PAM). This branch first performs Global Average Pooling (GAP) on the input features to aggregate global semantic information and then generates globally context-enhanced features through a 1 × 1 convolution. PAM establishes pixel-level global dependencies, enabling the model to focus on cloud shapes and boundary regions during feature transfer. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, PAM refines the features by computing a spatial attention map $A_p \in \mathbb{R}^{(H \times W) \times (H \times W)}$:

$$F_{PAM} = \mathrm{PAM}\!\left(C_{1 \times 1}(\mathrm{GAP}(F_{in}))\right) \quad (5)$$
The bottom branch employs a channel attention mechanism (denoted PCM) to capture long-range dependencies and multi-scale semantic information. Similarly to the PAM branch, it first preprocesses the input with global average pooling and a 1 × 1 convolution, and then computes the channel attention weights. This module adaptively assigns weights based on the semantic importance of each channel, highlighting feature channels related to cloud texture while suppressing observation noise:

$$F_{PCM} = \mathrm{PCM}\!\left(C_{1 \times 1}(\mathrm{GAP}(F_{in}))\right) \quad (6)$$
To achieve effective synergy between high-level semantics and low-level structural information, MSDA employs a learnable fusion strategy. The features output from the three branches are channel-unified using 1 × 1 convolutions and dynamically fused with the original skip-connection features (involving element-wise addition and multiplication operations, as shown in Figure 3). The final output feature $F_{out}$ exhibits stronger discriminative power and spatial structure awareness:
$$F_{out} = F_{fuse}(F_{in},\, F_{MS},\, F_{PAM},\, F_{PCM}) \quad (7)$$
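A rough PyTorch sketch of this module is given below. The three depthwise kernel sizes, the simplified channel-wise gates standing in for the full PAM/PCM computations, and the concatenation-based fusion are illustrative assumptions rather than the paper's exact design; in particular, the real PAM computes a full $(H \times W) \times (H \times W)$ spatial attention map.

```python
import torch
import torch.nn as nn

class MSDASketch(nn.Module):
    """Sketch of the MSDA skip-connection module, following Eqs. (4)-(7).

    Three branches refine the encoder feature F_in: parallel depthwise
    convolutions at several kernel sizes (F_MS), a position-attention
    branch (F_PAM), and a channel-attention branch (F_PCM). Both attention
    branches are reduced here to GAP + 1x1 conv + sigmoid gates; the fusion
    F_fuse is a learnable 1x1 convolution over the concatenation.
    """
    def __init__(self, c):
        super().__init__()
        self.reduce = nn.Conv2d(c, c, 1)                       # C^in_{1x1}
        self.dw = nn.ModuleList([nn.Conv2d(c, c, k, padding=k // 2, groups=c)
                                 for k in (3, 5, 7)])          # D^i_k, assumed kernels
        self.proj = nn.ModuleList([nn.Conv2d(c, c, 1) for _ in range(3)])
        self.pam = nn.Conv2d(c, c, 1)   # stand-in for the full PAM branch
        self.pcm = nn.Conv2d(c, c, 1)   # stand-in for the full PCM/CAM branch
        self.fuse = nn.Conv2d(4 * c, c, 1)

    def forward(self, f_in):
        f_ms = sum(p(d(self.reduce(f_in))) for p, d in zip(self.proj, self.dw))  # Eq. (4)
        g = f_in.mean(dim=(2, 3), keepdim=True)                # GAP
        f_pam = f_in * torch.sigmoid(self.pam(g))              # simplified Eq. (5)
        f_pcm = f_in * torch.sigmoid(self.pcm(g))              # simplified Eq. (6)
        return self.fuse(torch.cat([f_in, f_ms, f_pam, f_pcm], dim=1))  # Eq. (7)

out = MSDASketch(32)(torch.randn(1, 32, 64, 64))
print(out.shape)  # torch.Size([1, 32, 64, 64])
```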
Overall, the MSDA module's design goes beyond simple feature overlay; it establishes a cross-layer semantic-detail complementarity mechanism. By dynamically integrating global contextual information from PAM with multi-scale local details from the MS-Conv branch during decoding, this design effectively mitigates the loss of low-level spatial information caused by repeated downsampling in the original architecture. In particular, the fusion strategy significantly sharpens the feature response at cloud-sky boundaries, enabling the model to maintain pixel-level continuity and consistency when processing blurred boundaries.
Furthermore, for cirrus clouds and fragmented small-scale cloud clusters commonly found in ground-based cloud images, MSDA suppresses background noise interference through a channel attention mechanism (CAM/PCM), enhancing the expressive power of weak texture features. This significantly improves the segmentation accuracy of small targets and reduces missed detections and fragmentation. Ultimately, this high-fidelity feature reconstruction capability provides more accurate and robust feature support for subsequent meteorological analysis tasks such as cloud classification, cloud cover estimation, and solar irradiance prediction, demonstrating significant application value.

3.5. MLP Bottleneck Layer

To make the segmentation model more efficient and practical for real-world deployment, particularly when handling high-resolution ground-based cloud images, we replaced the standard Swin Transformer bottleneck with a lightweight alternative based on token-wise MLPs. This adjustment reduces computational burden and memory usage without severely compromising the model's ability to learn expressive feature representations.
Rather than computing multi-head self-attention across local windows, as done in Swin Transformer, our method takes a different route. It begins by shifting the encoded latent features along spatial dimensions, first horizontally and then vertically. These shifted features are then passed through two distinct MLP blocks, with each block focusing on dependencies along a single axis. While this approach simplifies the computation, it still retains the essence of directional attention, offering a behavior similar to that of the shifted window mechanism in Swin Transformer, but at a fraction of the cost.
Once the bi-directional shifts and corresponding MLP transformations are complete, we apply a learnable linear projection to refine the feature representation. This is followed by Layer Normalization, which helps maintain numerical stability and supports faster convergence during training. Finally, a second MLP layer reprojects the features to the desired dimension, making them ready for subsequent decoding or fusion steps.
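The following sketch illustrates one plausible reading of this bottleneck, in the spirit of axial shift-MLP designs; the group count, shift offsets, and wrap-around behavior of torch.roll are assumptions for illustration, and the channel count is assumed divisible by the group count.

```python
import torch
import torch.nn as nn

def axial_shift(x: torch.Tensor, dim: int, groups: int = 5) -> torch.Tensor:
    """Split channels into groups and roll each group along one spatial axis
    by a different offset, mixing information along that axis before a
    purely token-wise MLP."""
    chunks = torch.chunk(x, groups, dim=1)
    shifts = range(-(groups // 2), groups // 2 + 1)  # e.g. -2..2
    return torch.cat([torch.roll(c, s, dims=dim) for c, s in zip(chunks, shifts)], dim=1)

class ShiftMLPBottleneck(nn.Module):
    """Horizontal shift + MLP, vertical shift + MLP, then a linear
    projection, LayerNorm, and a final token-wise MLP."""
    def __init__(self, c):
        super().__init__()
        self.mlp_h = nn.Conv2d(c, c, 1)   # token-wise MLP realized as 1x1 conv
        self.mlp_w = nn.Conv2d(c, c, 1)
        self.proj = nn.Conv2d(c, c, 1)    # learnable linear projection
        self.norm = nn.LayerNorm(c)
        self.mlp_out = nn.Conv2d(c, c, 1)

    def forward(self, x):                             # x: (B, C, H, W)
        x = self.mlp_h(axial_shift(x, dim=3))         # horizontal mixing
        x = self.mlp_w(axial_shift(x, dim=2))         # vertical mixing
        x = self.proj(x)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LN over channels
        return self.mlp_out(x)

out = ShiftMLPBottleneck(40)(torch.randn(1, 40, 16, 16))
print(out.shape)  # torch.Size([1, 40, 16, 16])
```

Because each MLP is applied per token, the only cross-token interaction comes from the cheap channel-group shifts, which is what keeps this bottleneck far lighter than windowed self-attention.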
In essence, this token-based MLP bottleneck serves as a middle ground between architectural simplicity and semantic modeling capacity. It enables the model to remain lightweight and efficient, which is especially valuable in time-sensitive or resource-constrained scenarios, without forgoing the contextual awareness needed for accurate segmentation.

4. Experiment

4.1. Parameter Settings

All experiments in this study were implemented using the PyTorch 2.0 deep learning framework. The training and evaluation processes were conducted on a system equipped with two NVIDIA GeForce RTX 4090 GPUs, each with 24 GB of memory. The models were developed and executed under the Ubuntu 22.04 Linux operating system with Python 3.8.
For all ground-based cloud image segmentation methods evaluated in this work, the maximum number of training epochs was set to 1000, and a batch size of 8 images was used. The Ranger optimizer was employed with an initial learning rate of 0.001. Considering that different segmentation networks may converge at different rates, an early stopping strategy was adopted to prevent overfitting and unnecessary computation. Specifically, training was automatically terminated if the loss on the validation set did not improve for 30 consecutive epochs.
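In code, the early-stopping rule amounts to the loop below; the train_one_epoch and validate callables and the checkpoint path are stand-ins, not the authors' actual training script.

```python
import torch

def fit(model, train_one_epoch, validate, max_epochs=1000, patience=30):
    """Early stopping as described above: halt when the validation loss has
    not improved for `patience` consecutive epochs. `train_one_epoch()` and
    `validate()` are caller-supplied callables; validate() returns a float."""
    best_val, stall = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_val:
            best_val, stall = val_loss, 0
            torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
        else:
            stall += 1
            if stall >= patience:
                break  # no improvement for `patience` epochs
    return best_val
```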
The input to each network consisted of RGB ground-based cloud images, and the output corresponded to the predicted segmentation maps. This experimental configuration was designed to ensure fair and consistent evaluation across all compared models.

4.2. Dataset

Cloud classification ideally requires an understanding of how clouds form, evolve, and interact with their environment. Traditional classification schemes used internationally are largely based on visual traits such as brightness, color, size, and altitude, rather than on their functional impact on solar radiation. For example, the widely recognized system developed by the World Meteorological Organization (WMO) does not account for how different cloud types influence solar irradiance. This limits its usefulness in areas like PV power forecasting, where radiation levels are directly affected by cloud presence and type.
In this study, we therefore adopt a revised cloud categorization tailored specifically to irradiance modeling. Ignoring cloud height, we regroup clouds into five categories according to their radiative characteristics, morphological features, and meteorological behavior. Cumulonimbus and nimbostratus clouds are characterized by large coverage, a dark-gray appearance, and substantial thickness, typically shading the entire sky. Cumulus clouds appear as distinct clumps, commonly observed on clear days, while stratocumulus clouds are thinner and more horizontally extended. Cirrus clouds exhibit fibrous and elongated structures, usually white and sparsely distributed or occurring at the edges of cumulus formations. In the annotations, green denotes cumulus, red denotes cirrus, orange denotes stratocumulus, purple denotes cumulonimbus and nimbostratus, and blue denotes sky. A total of 300 ground-based cloud images were selected and annotated at the pixel level to construct a fine-grained segmentation dataset of ground-based cloud images.
To quantify how each cloud type affects solar radiation, we collected and analyzed multiple cloud images, calculating their corresponding clear-sky index values. This offers a more practical basis for predicting solar irradiance variations, ultimately helping to improve the reliability of PV power output forecasting.
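The clear-sky index is conventionally defined as the ratio of measured global horizontal irradiance to its modeled clear-sky value at the same timestamp; a minimal helper for this standard definition might look as follows (the function name and epsilon guard are our own).

```python
import numpy as np

def clear_sky_index(ghi_measured: np.ndarray, ghi_clear: np.ndarray) -> np.ndarray:
    """Clear-sky index: measured GHI divided by modeled clear-sky GHI.
    Values near 1 indicate an unobstructed sun; thick cumulonimbus or
    nimbostratus cover drives the index well below 1."""
    return ghi_measured / np.maximum(ghi_clear, 1e-6)  # guard against zero at night
```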

4.3. Evaluation Metrics

In order to comprehensively evaluate the effectiveness of the proposed model, four commonly used metrics are selected as evaluation indicators. These metrics assess the segmentation performance of the network from multiple perspectives, ensuring a thorough and objective comparison with the baseline methods. They are computed as shown in Formulas (8)–(11):
$$\mathrm{mFscore} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{Fscore}_k \quad (8)$$

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{k=1}^{N} \frac{TP_k + TN_k}{TP_k + TN_k + FP_k + FN_k} \quad (9)$$

$$\mathrm{Precision} = \frac{1}{N} \sum_{k=1}^{N} \frac{TP_k}{TP_k + FP_k} \quad (10)$$

$$\mathrm{mIoU} = \frac{1}{N} \sum_{k=1}^{N} \frac{TP_k}{TP_k + FP_k + FN_k} \quad (11)$$
In the evaluation of semantic segmentation performance, let N represent the total number of categories. For each class k, four key statistical terms are defined. True Positives (TPk) refer to the number of pixels correctly predicted as belonging to class k. True Negatives (TNk) denote the number of pixels accurately identified as not belonging to class k. False Positives (FPk) represent the number of pixels incorrectly predicted as class k while actually belonging to other categories, and False Negatives (FNk) indicate the number of pixels that truly belong to class k but are misclassified as other classes.
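Given an N × N confusion matrix accumulated over the test set, all four metrics follow directly from these per-class counts. The helper below is a straightforward NumPy rendering of Formulas (8)–(11), with a small epsilon added to avoid division by zero for classes absent from a scene.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray, eps: float = 1e-12) -> dict:
    """Macro-averaged metrics from an N x N confusion matrix
    (rows = ground truth, columns = prediction), following Eqs. (8)-(11)."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp          # predicted class k, actually another class
    fn = conf.sum(axis=1) - tp          # actually class k, predicted as another
    tn = conf.sum() - tp - fp - fn
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    fscore = 2 * precision * recall / (precision + recall + eps)
    return {
        "mFscore": fscore.mean(),
        "Accuracy": ((tp + tn) / (tp + tn + fp + fn)).mean(),
        "Precision": precision.mean(),
        "mIoU": (tp / (tp + fp + fn + eps)).mean(),
    }
```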

4.4. Comparative Experiment

Table 1 presents a comprehensive quantitative comparison among the proposed method and several widely used semantic segmentation architectures on the fine-grained ground-based cloud segmentation task. Bold numbers indicate the best performance. The evaluated models include classical convolutional networks such as U-Net and HRNet [21], as well as more recent Transformer-based or hybrid designs represented by SwinU-Net [22], SegFormer [23], and TransUNet [24]. Four standard metrics are employed to assess segmentation quality: mean F1-score (mFscore), overall Accuracy, Precision, and mean Intersection-over-Union (mIoU). These metrics jointly reflect the capability of each model in terms of balanced class prediction, pixel-level correctness, reliability of positive predictions, and spatial coherence of segmented cloud regions.
The comparison clearly shows that the proposed method achieves the highest performance across all four metrics. Relative to the strongest baseline TransUNet, our method achieves noticeable absolute gains in mFscore, Accuracy, Precision, and mIoU. These improvements indicate that the proposed approach provides more reliable discrimination among visually similar cloud categories and more accurate boundary estimation in scenes containing fragmented, overlapping, or low-contrast cloud structures. Ground-based cloud imagery often contains diffuse edges, rapidly changing textures, and highly irregular cloud shapes. The consistent improvements across all indicators demonstrate that the proposed method is particularly effective in handling these domain-specific challenges.
The advantages of our model are also evident when compared with classical architectures such as U-Net. The large margins observed across all metrics highlight the limitations of traditional encoder–decoder frameworks for tasks requiring fine-grained recognition of atmospheric patterns, especially when long-range dependencies and multi-scale spatial cues must be captured simultaneously. In contrast, the proposed framework benefits from enhanced contextual reasoning, more expressive feature representations, and more stable optimization behavior during training. These strengths enable the model to better integrate global structural information with local texture cues, which is essential for distinguishing subtle differences between cloud types and for producing accurate segmentation masks with coherent boundaries.
A closer examination of the results reveals several important observations. The improvement in Precision suggests a substantial reduction in false detections, indicating that the model is better aligned with the true distribution of cloud categories. The higher mFscore reflects improved balance between precision and recall, which is critical for categories that occupy only small regions of the image or appear intermittently. The improvement in mIoU indicates that the predicted cloud regions match the ground truth more closely, both in shape and in spatial extent, which is particularly significant for cloud types that have irregular boundaries or thin structures. The higher overall Accuracy further confirms that the model generalizes well across diverse sky conditions, including scenes with strong illumination variations, partial occlusions, and multi-layer cloud formations.
Due to the relatively small size of the dataset, performance variability may exist. To demonstrate the stability of our proposed model, we show the training and validation loss curves in Figure 4. Both curves exhibit smooth and consistent convergence without significant fluctuations, indicating that the model training is stable and that the reported single-run metrics are representative of its performance.
Overall, the consistent improvements across all evaluation indicators confirm the effectiveness and robustness of the proposed framework. The results demonstrate that our method provides a more reliable solution for fine-grained ground-based cloud segmentation and has strong potential for practical application in atmospheric observation and related remote sensing fields. While overall mIoU is used as the primary evaluation metric in this study, per-category mIoU analysis is not included in the current experimental framework. Since different cloud types exhibit distinct radiative behaviors and segmentation difficulty levels, a detailed category-wise performance analysis will be conducted in future work to provide a more comprehensive understanding of model behavior.
To further verify the superiority of the model's segmentation performance, we visually compare the segmentation results, as shown in Figure 5. The first and second columns show the input ground-based cloud image and the corresponding fine-grained label, respectively. Three typical scenes were selected: in the first image, the sky occupies the majority of the frame, accompanied by long strips of cirrus; in the second, clouds and sky occupy roughly equal proportions, mainly cumulus with a small amount of stratus; in the third, clouds occupy most of the frame.
In the first image, U-Net failed to segment the cloud shape accurately, and its result is poor. HRNet performs slightly better than U-Net but confuses cumulus with cirrus. The proposed model best recovers the shape of the long cirrus strips. In the second image, the clouds are distributed mainly along the diagonal; both U-Net and HRNet fail to segment the stratus on the upper right, performing poorly when multiple cloud types appear in the scene. SwinU-Net can segment the stratus, but its edges are jagged and local cloud details are poorly captured. The proposed model performs best for clouds of different shapes and sizes. In the last image, the clouds are concentrated in the lower left, mainly cumulus, with small patches of cirrus and stratus at the edges. U-Net mistakenly classifies all of the cirrus and stratus above as cirrus; HRNet and SwinU-Net share the same problem. Only the proposed model achieves the best segmentation, correctly judging cloud boundaries and the different cloud types.
Table 2 reports the number of parameters, FLOPs, and inference time for all compared models. Measurements were obtained using input images of size 512 × 512 on an NVIDIA RTX 3090 GPU under consistent settings. The results show clear differences in computational cost among the architectures. U-Net and SegFormer have relatively small parameter counts, while HRNet and TransUNet are larger due to more complex structures. Our model has 58.2 million parameters, slightly higher than SwinU-Net but notably lower than TransUNet. This reflects a design that increases representational capacity while controlling model size. In terms of FLOPs, our model requires 110.7 GFLOPs, which is in the mid-range of the compared methods. This allows the network to capture rich features without excessive computational overhead. For inference time, our method processes a single image in 23.8 milliseconds. Although slightly slower than lightweight models, it remains efficient and offers higher segmentation accuracy.
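For reference, parameter counts and GPU latency of this kind are typically measured as sketched below; FLOPs additionally require a profiler library (e.g., fvcore or thop), which is omitted here. Warm-up iterations and explicit synchronization are needed for meaningful GPU timings.

```python
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def latency_ms(model: torch.nn.Module, size: int = 512, runs: int = 50) -> float:
    """Average single-image GPU latency for a (1, 3, size, size) input."""
    model = model.cuda().eval()
    x = torch.randn(1, 3, size, size, device="cuda")
    for _ in range(10):              # warm-up so lazy CUDA init is excluded
        model(x)
    torch.cuda.synchronize()         # GPU kernels run asynchronously
    t0 = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / runs * 1e3
```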
For the intended application of short-term photovoltaic power forecasting, fully real-time inference at the millisecond level is not strictly necessary. PV power predictions are typically updated on the order of seconds or minutes, corresponding to the temporal resolution of solar irradiance measurements and control decisions in distribution networks. In this context, the inference time of our model is sufficient to generate timely cloud segmentation inputs for accurate PV power prediction, without introducing significant latency in operational decision-making.
Although lighter models such as SegFormer can run faster, they achieve lower segmentation accuracy on fine-grained cloud types, which could negatively affect the precision of PV power forecasting. Conversely, heavier models such as TransUNet incur comparable or slightly higher computational cost without providing a better trade-off between accuracy and efficiency. Therefore, our model is particularly suitable for deployment in PV power forecasting systems where high-quality segmentation of diverse cloud types is critical, while inference speed remains sufficient for short-term prediction requirements.
Overall, the results indicate that the proposed model achieves a reasonable balance between performance and computational cost, making it suitable for fine-grained ground-based cloud segmentation in practical applications.

4.5. Ablation Experiment

To further verify the effectiveness of the proposed architecture and its key components, this section conducts four sets of ablation experiments. Table 3 shows the contribution of the different key components to segmentation accuracy; bold numbers indicate the best performance. The first group removes the MSDA module, adopts a plain skip-connection structure between encoder and decoder, and replaces the BiFormer block with the Swin Transformer block in the encoder. The second group uses the MSDA module and removes the BiFormer block. The third group removes the MSDA module and uses the BiFormer block. The last group is the complete model proposed in this paper. Both the MSDA module and the BiFormer block contribute to the improvement of segmentation accuracy. Compared with Experiment 1, adding the MSDA module in Experiment 2 raises mIoU to 61.31%, an increase of 4.49 percentage points, with the other indicators also improving. Adding the BiFormer block in Experiment 3 raises mIoU to 62.94%, an increase of 6.12 percentage points, again with gains in the other indicators.
Figure 6 shows the per-class IoU for representative cloud types along with Sky. Sky achieves the highest IoU (92.7%) due to its relatively uniform appearance and clear boundaries. Cumulus follows with a moderately high IoU (70.3%), benefiting from its well-defined structure and texture. Stratus/Stratocumulus achieves a lower IoU (54.8%) because of its thin and diffuse morphology, which makes boundary delineation more challenging. Cirrus reaches 50.2% IoU, reflecting its high-altitude, wispy features and weak texture contrast. The lowest IoU is observed for Cumulonimbus and Nimbostratus (36.7%), due to their complex vertical structure, irregular shapes, and frequent overlap with other cloud types. Background is not shown in the figure but is included in the overall mIoU metric reported in Table 1.
The per-class analysis also highlights the contribution of key model components. The MSDA module particularly improves the segmentation of thin and diffuse cloud types, such as Cirrus and Stratocumulus, by adaptively enhancing task-relevant channel and spatial features. In contrast, the dynamic sparse attention mechanism in the BiFormer backbone contributes more significantly to structurally complex cloud types, such as Cumulonimbus, by improving long-range dependency modeling. These results demonstrate that the proposed architecture effectively balances fine-grained feature extraction and global context modeling across diverse cloud categories.

5. Conclusions

Conventional ground-based cloud image segmentation only separates clouds from sky and therefore cannot provide the basic input and precise guidance that photovoltaic power prediction requires from fine-grained cloud attributes. We propose a fine-grained segmentation method for ground-based cloud images based on an improved Transformer architecture. To overcome the inability of convolutional structures to extract global features and model long-range dependencies, the model is built on the Swin Transformer. Because the encoder alone cannot fully mine the abundant fine-grained features of ground-based cloud images, a dynamic sparse attention mechanism based on bi-level routing is introduced to replace the original shifted-window attention. Since skip connections may introduce low-level noise or redundant information, MSDA is proposed to weight the semantic information carried by different channels and spatial positions, strengthening task-relevant feature responses. To further reduce computation and the resource burden, a lightweight MLP replaces the original bottleneck layer. On the constructed fine-grained segmentation dataset covering multiple cloud types, our model achieves the best segmentation accuracy, and ablation experiments confirm that the key components contribute to this improvement. This provides new possibilities for subsequent photovoltaic power prediction. More importantly, the achieved segmentation performance is sufficient to reliably capture the spatial distribution and semantic characteristics of the major cloud types, meeting the practical requirements for serving as an effective input to downstream PV-related applications such as solar irradiance estimation and short-term PV power forecasting. While the proposed method does not directly perform PV power prediction, it establishes a robust, application-oriented cloud information representation that bridges the gap between cloud observation and PV power analysis.
In future work, we plan to integrate the proposed segmentation framework with irradiance estimation and PV power forecasting models, enabling a quantitative evaluation of how segmentation accuracy influences prediction performance and operational decision-making in photovoltaic systems.

Author Contributions

Conceptualization, L.Z. and R.Z.; methodology, C.W. and D.S.; software, P.L.; validation, R.Z. and C.S.; formal analysis, P.L.; investigation, P.L.; resources, B.L. and B.J.; data curation, T.S.; writing—original draft preparation, L.Z.; writing—review and editing, R.Z.; visualization, R.Z.; supervision, R.Z.; project administration, C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Due to the confidentiality of the project and other reasons, it is not possible to directly access the datasets. The data will be available on request from the corresponding author.

Conflicts of Interest

Authors Lihua Zhang, Dawei Shi, Pengfei Li, Buwei Liu, Tongmeng Sun, Bo Jiao, Chunze Wang, and Rongda Zhang were employed by State Power Investment Group Hebei Electric Power Co., Ltd. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhang, J.; Zhao, L.; Deng, S.; Xu, W.; Zhang, Y. A critical review of the models used to estimate solar radiation. Renew. Sustain. Energy Rev. 2017, 70, 314–329. [Google Scholar] [CrossRef]
  2. Tan, Z.; Zhang, H.; Xu, J. Photovoltaic power generation in China: Development potential, benefits of energy conservation and emission reduction. J. Energy Eng. 2012, 138, 73–86. [Google Scholar] [CrossRef]
  3. Kabir, E.; Kumar, P.; Kumar, S.; Adelodun, A.A. Solar energy: Potential and future prospects. Renew. Sustain. Energy Rev. 2018, 82, 894–900. [Google Scholar] [CrossRef]
  4. Zhang, Z.; Yang, S.; Liu, S.; Xiao, B.; Cao, X. Ground-based cloud detection using multiscale attention convolutional neural network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8019605. [Google Scholar] [CrossRef]
  5. Hong, D.; Yokoya, N.; Chanussot, J.; Zhu, X.X. An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Trans. Image Process. 2019, 28, 1923–1938. [Google Scholar] [CrossRef] [PubMed]
  6. Shi, C.; Zhou, Y.; Qiu, B.; Guo, D.; Li, M. CloudU-Net: A deep convolutional neural network architecture for daytime and nighttime cloud images segmentation. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1688–1692. [Google Scholar] [CrossRef]
  7. Zhen, Z.; Zhang, X.; Mei, S.; Chang, X.; Chai, H. Ultra short-term irradiance forecasting model based on ground-based cloud image and deep learning algorithm. IET Renew. Power Gener. 2022, 16, 2604–2616. [Google Scholar] [CrossRef]
  8. Dev, S.; Nautiyal, A.; Lee, Y.H. CloudSegNet: A deep network for nychthemeron cloud image segmentation. IEEE Geosci. Remote Sens. Lett. 2021, 16, 1814–1818. [Google Scholar] [CrossRef]
  9. Ye, L.; Cao, Z.; Yang, Z. CCAD-Net: A cascade cloud attribute discrimination network for cloud genera segmentation in whole-sky images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6512105. [Google Scholar] [CrossRef]
  10. Dev, S.; Lee, Y.H.; Winkler, S. Color-based segmentation of sky/cloud images from ground-based cameras. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 10, 231–242. [Google Scholar] [CrossRef]
  11. Gacal, G.F.B.; Antioquia, C.; Lagrosas, N. Ground-based detection of nighttime clouds above Manila observatory (14.64° N, 121.07° E) using a digital camera. IEEE Geosci. Remote Sens. Lett. 2016, 55, 6040–6045. [Google Scholar]
  12. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  14. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  15. Shi, C.; Zhou, Y.; Qiu, B. CloudU-Netv2: A cloud segmentation method for ground-based cloud images based on deep learning. Neural Process. Lett. 2021, 53, 2715–2728. [Google Scholar] [CrossRef]
  16. Shi, C.; Zhou, Y.; Qiu, B. CloudRaednet: Residual attention-based encoder-decoder network for ground-based cloud images segmentation in nychthemeron. Int. J. Remote Sens. 2022, 43, 2059–2075. [Google Scholar] [CrossRef]
  17. Liu, S.; Zhang, J.; Zhang, Z. TransCloudSeg: Ground-based cloud image segmentation with transformer. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6121–6132. [Google Scholar] [CrossRef]
  18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  19. Zhu, L.; Wang, X.; Ke, Z. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  20. Sun, G.; Pan, Y.; Kong, W. DA-TransUNet: Integrating spatial and channel dual attention with transformer U-Net for medical image segmentation. Front. Bioeng. Biotechnol. 2024, 12, 1398237. [Google Scholar] [CrossRef] [PubMed]
  21. Sun, K.; Xiao, B.; Liu, D. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  22. Cao, H.; Wang, Y.; Chen, J. Swin-Unet: U-Net-like pure transformer for medical image segmentation. In Computer Vision—ECCV 2022 Workshops; Springer: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
  23. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online Conference, 6–14 December 2021; pp. 12077–12090. [Google Scholar]
  24. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. In Proceedings of the 24th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cham, Switzerland, 8 February 2021; pp. 485–495. [Google Scholar]
Figure 1. Overall architecture of the model.
Figure 2. Structure of a BiFormer block.
Figure 3. Overall structure of MSDA.
Figure 4. Training and validation loss curves of the proposed model.
Figure 5. Segmentation visualization comparison of different models.
Figure 6. Per-class IoU for representative cloud categories.
Table 1. Quantitative comparisons with counterparts. The best performance with respect to each metric is highlighted in boldface.

Methods          mFscore   Accuracy   Precision   mIoU
U-Net [13]       39.30     40.75      37.19       36.52
HRNet [21]       59.48     57.30      59.79       52.44
SwinU-Net [22]   60.71     62.03      60.54       57.02
SegFormer [23]   62.58     66.90      63.74       59.34
TransUNet [24]   64.48     67.53      65.90       61.60
Ours             69.85     71.47      70.20       65.18
Table 2. Model complexity comparison of different segmentation methods.

Methods          Params (M)   FLOPs (G)   Inference Time (ms/image)
U-Net [13]       31.0         55.2        12.8
HRNet [21]       65.9         84.7        18.4
SwinU-Net [22]   41.3         96.5        21.6
SegFormer [23]   27.5         62.3        14.2
TransUNet [24]   105.3        122.8       27.5
Ours             58.2         110.7       23.8
Table 3. Ablation experiments of different key components (✓ = component used; assignment of check marks follows the experiment descriptions in the text).

MSDA   BiFormer Block   mFscore   Accuracy   Precision   mIoU
–      –                62.10     63.56      59.48       56.82
✓      –                63.42     64.70      64.56       61.31
–      ✓                65.80     66.37      66.68       62.94
✓      ✓                69.85     71.47      70.20       65.18
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
