Article

SDGSAT-1 Cloud Detection Algorithm Based on RDE-SegNeXt

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(3), 470; https://doi.org/10.3390/rs17030470
Submission received: 28 November 2024 / Revised: 11 January 2025 / Accepted: 27 January 2025 / Published: 29 January 2025

Abstract:
This paper proposes an efficient cloud detection algorithm for Sustainable Development Scientific Satellite (SDGSAT-1) data. The core work includes the following: (1) constructing an SDGSAT-1 cloud detection dataset containing five types of elements: clouds, cloud shadow, snow, water body, and land, with a total of 15,000 samples; (2) designing a multi-scale convolutional attention unit (RDE-MSCA) based on a gated linear unit (GLU), with parallel re-parameterized convolution (RepConv) and detail-enhanced convolution (DEConv). This design focuses on improving the feature representation and edge detail capture capabilities for targets such as clouds, cloud shadow, and snow. Specifically, the RepConv branch focuses on learning a new global representation, reconstructing the original multi-branch deep convolution into a single-branch structure that efficiently fuses channel features, reducing computational and memory overhead. The DEConv branch uses differential convolution to enhance the extraction of high-frequency information and is equivalent, via re-parameterization, to a normal convolution during the inference stage without additional overhead; the GLU realizes adaptive channel-level information regulation during multi-branch fusion, which further enhances the model's discriminative power for easily confused objects. RDE-MSCA is integrated into the SegNeXt architecture to form the proposed RDE-SegNeXt. Experiments show that this model achieves 71.85% mIoU on the SDGSAT-1 dataset with only about 1/12 the computational complexity of the Swin-L model (a 2.71% improvement over Swin-L and a 5.26% improvement over the baseline SegNeXt-T). It also significantly improves the detection of clouds, cloud shadow, and snow. It achieves competitive results on both the 38-Cloud and LoveDA public datasets, verifying its effectiveness and versatility.

1. Introduction

It is estimated that approximately 66% of the Earth’s surface is typically covered by clouds [1]. The presence of clouds can impede the accurate identification of features [2,3], as well as the effectiveness of remote sensing images in target detection [4], semantic segmentation [5], and change detection [6], among other tasks.
On 5 November 2021, the SDGSAT-1 satellite was launched from the Taiyuan Satellite Launch Center on a Long March 6 carrier rocket. The satellite carries three payloads: a thermal infrared imager, a glimmer (low-light) imager, and a multispectral imager, intended for the monitoring and assessment of the global Sustainable Development Goals (SDGs). The satellite operates in an orbit at an altitude of 505 km with an inclination of 97°. All three payloads acquire data with a swath width of 300 km, enabling global coverage to be achieved in 11 days [7]. However, while the multispectral and thermal infrared data provided by SDGSAT-1 have significant scientific value, they are still affected by cloud contamination. In particular, distinguishing clouds from snow, and clouds from high-brightness surfaces, remains a significant challenge in cloud detection.
A multitude of cloud detection algorithms have been put forth to date. The most widely used are spectral thresholding methods, of which the most representative are the Automatic Cloud Cover Assessment (ACCA) algorithm and the Fmask (Function of Mask) algorithm. The ACCA algorithm, developed by Irish et al. [8], is an automated cloud coverage assessment method for Landsat 7 that utilises eight spectral combinations and thermal infrared band temperature signatures for cloud detection, achieving acceptable error levels. The Fmask algorithm, proposed by Zhu et al. [9], makes use of all Landsat bands, including the thermal infrared bands, and is one of the most widely used cloud detection methods for Landsat 7 and Sentinel-2 images. It divides the image into potential cloud pixels (PCPs) and clear-sky pixels, and calculates the cloud probability of each pixel using spectral and temperature features. Dynamic thresholding is then employed to screen the potential cloud pixels and determine the cloud pixels. Fmask makes heavy use of temperature characteristics, employing the thermal infrared bands to differentiate between clouds and snow. It is capable of distinguishing between multiple categories, from coarse to fine in scale and from land to water, and exhibits high detection accuracy. However, because Fmask relies on the thermal infrared bands to separate clouds from snow, the Luo–Trishchenko–Khlopenkov (LTK) algorithm [10] was proposed as an alternative that uses only a combination of red, blue, near-infrared (NIR), and short-wave infrared (SWIR) bands to detect clouds, thereby achieving favorable cloud detection results in the absence of TIR bands.
Threshold-based cloud detection methods built on spectral features have been developed further in subsequent studies. For example, Liu et al. [11] used dynamic thresholds in each of two channels to achieve automated cloud detection in images from Japan's GMS-5 geostationary meteorological satellite. Ma et al. [12] employed four channels in an integrated computation to achieve cloud detection, thereby reducing the impact of geographic location to some extent. Zhu et al. [13] dispensed with the thermal infrared band and incorporated a cirrus probability to enhance the precision of the conventional Fmask method, particularly for thin and cirrus clouds. Dong proposed the Cloud Detection Algorithm Generation (CDAG) method based on automatic thresholding for GF-6 WFV data [14]; the results are superior to those obtained with static thresholding, but the lack of a short-wave infrared band precludes differentiation between clouds and snow [15]. Wang used the SLIC superpixel segmentation algorithm to improve the detection precision of thick cloud edges in images from Chinese satellites [16]. Hu utilized a morphological matching algorithm to increase the detection accuracy of under-cloud shadow in GF-1 WFV data [17].
Currently, cloud detection research dedicated to SDGSAT-1 data is still relatively limited, and existing work mainly relies on traditional threshold-based methods or machine learning methods. For example, Ge et al. [18] proposed a dynamic threshold cloud detection method that incorporates gradient features to distinguish clouds and snow with similar spectral characteristics, since purely threshold-based methods are less robust and generalisable across data from different sources. By using gradient information in the cloud–snow transition area to supplement the spectral data, the method achieves excellent accuracy for cloud detection in snow-covered images, but it does not make full use of higher-level edge features and contextual information of thin clouds, cloud shadow, snow, and other features. Xie et al. [19] exploited the spectral characteristics of SDGSAT-1 thermal infrared data, proposed three normalized cloud detection indices (ITCDI1, ITCDI2, and ITCDI3), and used a small-sample SVM classification model combining irradiance features, thermal infrared spectral differences, and other multi-feature data to effectively improve cloud detection accuracy and construct an efficient cloud detection model. However, this method does not use information such as cloud shape and spatial context, the benefit of relying on thermal infrared data alone is limited, and thin clouds are not detected effectively.
In recent years, the rapid development of deep learning in computer vision and image processing has also opened up a series of new possibilities for the intelligent processing of remote sensing images. Jiao put forth a series of end-to-end U-Net models for the precise segmentation of cloud and cloud shadow edges in Landsat 8 OLI data [20]. Li advanced a multi-scale convolutional feature fusion approach [21]. Shao et al. proposed a cloud detection method based on multi-scale convolutional neural networks (CNNs) [22], which is effective in distinguishing between thick and thin clouds. Fan et al. [23] used transfer learning to apply the self-attention network Swin Transformer to the cloud detection task of the domestically produced GF-1/6 satellite series.
The accuracy of cloud detection has increased as more complex models have been introduced into the field. However, these models, which stack large numbers of convolutional layers, pooling layers, fully connected layers, and other components with ever greater depth and more parameters, require huge computational resources, with long training times and slow inference, making it difficult to meet the deployment requirements of practical engineering tasks. A current trend in deep learning is to reduce network size, but this usually relies on computationally expensive supervised learning techniques and fails to balance network size, computational complexity, and model performance. To address challenges such as insufficient accuracy in cloud and cloud shadow detection and the difficulty of distinguishing clouds from snow and bright surfaces in subsequent data production, while keeping the network lightweight, with low computational complexity, high inference speed, and excellent detection performance, we propose RDE-SegNeXt, an adaptive convolutional-attention network with re-parameterized convolution and detail-enhanced convolution, for the SDGSAT-1 cloud detection task.
In summary, our main contributions can be summarized as follows:
1.
We refined the labeling and constructed a new SDGSAT-1 cloud detection dataset. The dataset consists of 15,000 finely labeled samples covering five feature classes: clouds, cloud shadow, snow, water body, and land. It addresses the lack of data in the field of SDGSAT-1 cloud detection and promotes the subsequent application and dissemination of the data.
2.
We propose a new convolutional attention module, called RDE-MSCA (re-parameterized detail-enhanced multi-scale convolutional attention), which on the one hand establishes parameter links between the different channels of the convolutional kernels through re-parameterization, and on the other hand reduces computational complexity. Finally, the gated linear structure enables the model to adaptively weight the detail-enhanced convolution information, improving the expressiveness of the network so that feature information and computational resources are fully utilized to improve the accuracy and efficiency of cloud detection.
3.
Based on the RDE-MSCA module, we propose RDE-SegNeXt, a re-parameterized convolutional attention network with detail-enhanced convolution, for the cloud detection task of SDGSAT-1 data, and experimentally demonstrate the model's excellent ability to discriminate clouds from cloud shadow and clouds from snow in SDGSAT-1 data, while ensuring low computational complexity and high inference speed. In addition, we demonstrate its strong performance on the public cloud detection dataset 38-Cloud, and further show the competitive detection accuracy of RDE-SegNeXt on the public remote sensing dataset LoveDA for other remote sensing features, demonstrating the universal applicability and effectiveness of the model.
The following section outlines the structure of the paper. Section 2 provides an overview of the fundamental characteristics and critical parameters of SDGSAT-1, along with a detailed account of the SDGSAT-1 cloud detection algorithm, which is based on RDE-SegNeXt. The results of the experiments are presented and analyzed in Section 3. Section 4 presents a discussion of the advantages and disadvantages of the method. Section 5 provides a summary of the work presented in this paper and offers insights into potential future avenues of research.

2. Materials and Methods

2.1. SDGSAT-1

SDGSAT-1 (Sustainable Development Scientific Satellite 1) is the world’s first scientific satellite dedicated to serving the 2030 Agenda for Sustainable Development and the first Earth science satellite of the Chinese Academy of Sciences. This satellite is designed to provide important data support for the monitoring, assessment and scientific research of the global Sustainable Development Goals (SDGs). As part of the 2030 Agenda for Sustainable Development, the launch of SDGSAT-1 marks a major milestone in the field of global sustainable development.
SDGSAT-1 carries the Multispectral Sensor (MII), the Grayscale Sensor (GIU), and the Thermal Infrared Sensor (TIS), which enable the satellite to carry out all-day, all-weather Earth observation in 'TIS+MII', 'TIS+GIU', and 'MII+GIU' combined modes as well as single-payload observation modes. The multispectral sensor, which is one of the core payloads of SDGSAT-1 and the source of the image data for this study, is capable of making fine observations of the ground surface through multiple spectral bands, playing a key role especially in environmental monitoring and ecosystem assessment. With multispectral band data, the satellite can effectively monitor the color index and transparency of water body and analyze water quality changes in lakes, rivers, and oceans. Multispectral sensors also perform well in the analysis of glacier changes, snow melt, and vegetation cover changes, helping to assess the health of ecosystems and providing valuable data support for global climate change research. The band settings of multispectral satellites at home and abroad are shown in Table 1.
In terms of technical performance, all three payloads of SDGSAT-1 have a swath width of 300 km and a ground target revisit period of about 11 days. These specifications enable it to cover a wide geographical area and provide consistent time-series data for long-term environmental change monitoring. The data acquired by SDGSAT-1 have a wide range of applications not only in academic research, but also in providing policymakers with science-based data to support the development of strategies to address climate change, urbanization, natural resource management, and more. The technical specifications of SDGSAT-1 are shown in Table 2.

2.1.1. Areas

The study area of this experiment covers the entire territory of China, which is located in the eastern part of Asia along the west coast of the Pacific Ocean. The landscape of China slopes gradually from west to east, forming an obvious terraced layout. The continental coastline stretches about 5000 km from east to west, and the different combinations of temperature and rainfall have given rise to rich and varied climatic types. Specifically, the differences between the highlands in the west and the lowlands in the east of China form the overall topographical pattern, with mountains, plateaus and hills covering 67% of the country’s land area, and basins and plains covering about 33%. The mountain range trends are mainly east–west and northeast–southwest, with the Kunlun Mountains, the Himalayas and the Qinling Mountains being the most prominent. Due to its vast land area, large latitudinal range, significant differences between land and sea, and complex and changing topography, China has formed a variety of climate types, ranging from the monsoon climate in the east, the temperate continental climate in the northwest, to the unique alpine climate of the Qinghai-Tibet Plateau. In terms of temperature zones, China has tropical, subtropical, warm temperate, mesothermal, cold temperate and Tibetan plateau climate zones. Such complex and variable terrain and climate conditions provide an ideal environment for validating the algorithms used in this study.
The experimental data selected for this study are SDGSAT-1 multispectral imagery data with a spatial resolution of 10 metres, uniformly covering four regions, namely the northeast, northwest, southwest, and southeast of China, covering different topographic and climatic zones. The time frame consists of 27 images taken between January and December 2022, with cloud cover ranging from 10 to 80 percent, of which 6 are snow-covered images. The distribution of data used in the experiment is detailed in Figure 1.
The data come from the SDG Big Data Platform of the International Research Centre of Big Data for Sustainable Development Goals (CBAS). The SDGSAT-1 multispectral and thermal infrared image data provided by this platform have been rigorously geometrically fine-corrected and are distributed as standard L4A-level data products with radiometric calibration coefficients. The raw data are provided as Digital Number (DN) values, and researchers can obtain the corresponding calibration coefficients from the official website of the Sustainable Development Big Data Research Centre (https://www.sdgsat.ac.cn), with which the DN values are converted to Top of Atmosphere (TOA) reflectance.

2.1.2. Sample Production and Labeling Process

The construction of the SDGSAT-1 cloud detection dataset was achieved through the implementation of sample pair cropping and labeling on the aforementioned 27 SDGSAT-1 images. This process culminated in the acquisition of 15,000 pairs of data samples, encompassing sample images and labels, with a size of 512 × 512 . The comprehensive procedure can be delineated in three stages, as illustrated in Figure 2.
(1)
Automatic generation of initial labels using the threshold method
Our method is based on the threshold method proposed by Chang et al. [24] for GF satellites, and is improved by combining the multispectral and thermal infrared characteristics of SDGSAT-1 to obtain a spectral threshold method suitable for SDGSAT-1 data. This enhanced method takes into account the variations in spectral characteristics and brightness temperature characteristics of land features, including clouds and cloud shadow, and clouds and snow, in both the multispectral band and the thermal infrared band. The application of multi-band thresholds and thermal infrared temperature thresholds enables the automatic preliminary distinction between clouds and other land features, thereby generating an initial label map.
(2)
Manual Refinement
Clouds and cloud shadow, clouds and snow, and complex bright surfaces may produce cloud-like reflective or radiative signatures in certain wavebands. As a result, the automatic thresholding results inevitably contain misclassified or missed pixels. To ensure the quality of the labeled samples, manual optimization and quality correction steps are essential.
To improve the accuracy of the annotation, we carry out manual refinement and correction on the basis of the initial thresholding annotation. We superimposed the visualization on the automatically generated initial label. We then checked the suspicious areas and corrected the false detection areas by synthesizing the thermal infrared data with the data from other bands as a reference. We paid particular attention to cloud boundaries, cloud shadow, cloud-snow transition areas, mixed cloud pixels and areas with significant differences in cloud thickness, as well as other feature misclassification areas that may be caused by the heat island effect or the high reflectivity of snow surfaces.
The combination of the aforementioned thresholding and manual refinement significantly reduces the manual pixel-by-pixel labeling workload, while ensuring the accuracy and quality of labeling.
(3)
Sample Pair Cropping
Since the sample images and the annotated images are of considerable size and cannot be directly input into the deep learning network for learning, we crop each corrected SDGSAT-1 image into a number of 512 × 512 patches. To facilitate the subsequent promotion and application of SDGSAT-1 data in sustainable development tasks, each labeled pixel of our labeled samples belongs to the following six categories: land, water body, clouds, cloud shadow, snow, and fill values.
We excluded cropped patches with a padding value exceeding 70% to ensure the amount of valid information in the training samples. After performing the above cropping process on all images, 15,000 pairs of labeled images were obtained, and a certain proportion of sample pairs in each image region were randomly selected for testing. The relationship between the value of each pixel and its corresponding category is shown in Figure 3.
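To make the cropping and filtering step concrete, the following minimal Python/NumPy sketch crops an image–label pair into 512 × 512 patches and drops patches whose fill-value ratio exceeds 70%; the fill-value label id and the function name are illustrative assumptions rather than part of our released tooling.

```python
import numpy as np

PATCH = 512           # patch size used for the SDGSAT-1 samples
MAX_FILL_RATIO = 0.7  # patches with more than 70% fill values are discarded
FILL_LABEL = 5        # hypothetical label id for fill values

def crop_sample_pairs(image: np.ndarray, label: np.ndarray):
    """Crop an (H, W, C) image and its (H, W) label map into 512x512 patches,
    dropping patches dominated by fill values."""
    h, w = label.shape
    pairs = []
    for top in range(0, h - PATCH + 1, PATCH):
        for left in range(0, w - PATCH + 1, PATCH):
            img_patch = image[top:top + PATCH, left:left + PATCH]
            lbl_patch = label[top:top + PATCH, left:left + PATCH]
            fill_ratio = np.mean(lbl_patch == FILL_LABEL)
            if fill_ratio <= MAX_FILL_RATIO:       # keep only informative patches
                pairs.append((img_patch, lbl_patch))
    return pairs
```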
In addition, in order to ensure the training effect of the model on each category and to prevent training bias caused by the low sample share of certain categories (e.g., cloud shadow, snow, and water body), we performed sample balancing on the dataset. Specifically, on the one hand, we tried to ensure that the frequency of occurrence of each category is broadly similar when selecting data; on the other hand, to offset the apparent over-representation of clouds and land, we used data augmentation and oversampling to increase the number of samples in the under-represented categories. Data augmentation techniques include rotation, flipping, random cropping, color jittering, etc., to enrich the diversity and number of samples. Through these measures, we ensure that the number of samples for each category in the dataset is relatively balanced, so that the model can fully learn the characteristics of each feature class during training, thus enhancing its generalization ability and its performance in the cloud detection task.

2.2. RDE-MSCA

In order to better capture the detailed features of clouds, cloud shadow, snow, and other features in the cloud detection task, we propose a redesigned multi-scale convolutional attention module called RDE-MSCA (re-parameterized detail-enhanced multi-scale convolutional attention). This module aims to enhance the model’s ability to perceive high-frequency information (e.g., edges and textures) while keeping the network lightweight and efficient.
The core idea of the RDE-MSCA module is to improve the channel information fusion and feature extraction capabilities of the model in cloud detection by combining intensity level information and gradient level information while maintaining the number of model parameters. As shown in Figure 4, RDE-MSCA consists of two main components:
(1) DEConv branch: in remote sensing image cloud detection tasks, traditional methods [25,26,27] mainly use standard convolution for feature extraction. However, this approach searches a large solution space under unconstrained conditions, limiting the model's ability to express detailed features. High-frequency information (e.g., cloud boundaries, cloud shadow, and snow details) is crucial for cloud detection, and thus some studies [28,29,30] introduced edge priors to enhance the extraction of edge features. Based on this idea, we build a DEConv layer into SegNeXt's multi-scale convolutional attention to integrate this prior knowledge into the convolutional layer and better capture the detailed features of clouds, cloud shadow, snow, and other land features.
In the detail-enhanced convolutional branch, as shown in Figure 5, we deployed one standard convolution and four differential convolutions in parallel, which are responsible for capturing intensity-level and gradient-level information, respectively, in order to enhance the model’s ability to capture features with edge and detail information of clouds, cloud shadow, and other features. Specifically, we deployed a series of 3 × 3 convolutions in parallel, including vanilla convolution (VC), centre difference convolution (CDC), angular difference convolution (ADC), horizontal difference convolution (HDC) and vertical difference convolution (VDC):
(1)
Vanilla convolution: extracts the intensity level information of the image, i.e., traditional convolution operation.
(2)
Centre difference convolution: Calculates the centre difference of each pixel in the image, captures the difference between the centre pixel and its surrounding neighborhood, and is used to capture fine-grained gradient information. CDC modifies the convolution weights so that the centre weight is subtracted from the sum of the surrounding weights, and enhances sensitivity to edges during feature extraction by emphasizing the contrast between the centre and the surroundings.
(3)
Angular difference convolution: Extracts the gradient information at each angle in the image to effectively capture corner features and angular differences, which is very useful for detecting corner-like structures. The ADC adjusts the weights to compute differences in the angular direction and introduces a parameter θ to control the degree of angular difference.
(4)
Horizontal difference convolution: calculates the difference in the horizontal direction and is used to enhance horizontal edge features.
(5)
Vertical difference convolution: computes the difference in the vertical direction and is used to enhance vertical edge features.
In conclusion, vanilla convolution is employed to obtain intensity-level information, whereas differential convolution explicitly encodes gradient information through the calculation of pixel differences, thereby enhancing the model's perception of high-frequency features, which is highly beneficial for cloud detection in complex environments. Nevertheless, running multiple convolutional layers in parallel increases the number of parameters and the inference time of the model. To address this issue, we exploit the additivity of convolution kernels to collapse the parallel convolutional layers into a single standard convolution, thus maintaining computational efficiency. Re-parameterized convolution makes use of two fundamental properties of convolution operations: homogeneity and additivity. If several 2D kernels of the same size operate on the same input with the same stride and padding, and their outputs are summed to obtain the final output, then these kernels can be added at the corresponding positions to obtain an equivalent kernel that produces the same final output. By employing re-parameterization, DEConv achieves a strong enhancement of the input features while keeping the computational cost and inference time constant. This design renders DEConv particularly well-suited to cloud detection in remote sensing imagery: it effectively captures and enhances the detailed features of clouds, cloud shadow, snow, and other features, thereby improving the model's ability to discriminate complex scenes. In particular, DEConv produces the output feature F_out from the input feature F_in using the re-parameterization technique, with the same computational cost and inference time as a single standard convolution. The procedure is illustrated in Equation (1) (for simplicity, the bias has been omitted):
$$F_{\mathrm{out}} = \mathrm{DEConv}(F_{\mathrm{in}}) = \sum_{i=1}^{5} F_{\mathrm{in}} \times K_i = F_{\mathrm{in}} \times \sum_{i=1}^{5} K_i = F_{\mathrm{in}} \times K_{\mathrm{cvt}}$$
where DEConv(·) denotes the operation of our proposed detail-enhanced convolution, K_i (i = 1, …, 5) denotes the kernels of VC, CDC, ADC, HDC, and VDC, respectively, × denotes the convolution operation, and K_cvt denotes the transformation kernel that combines the parallel convolutions.
Figure 6 visualizes the exact process of re-parameterization in DEConv. In the backpropagation stage, the kernel weights of each of the five parallel convolutions are updated using the chain rule of gradient propagation. In the forward propagation stage, the kernel weights of the parallel convolutions are fixed, and the transformed kernel weights are computed by summing over the corresponding positions. Notably, the re-parameterization technique accelerates both the training and testing processes, since both include a forward propagation phase. Compared to ordinary convolutional layers, DEConv can therefore extract irregular detailed features in different directions while maintaining the parameter size and introducing no additional computational cost or memory burden in the inference phase.
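As a concrete illustration of the kernel additivity used in Equation (1), the following PyTorch sketch folds five parallel 3 × 3 kernels into a single equivalent kernel and checks that the merged convolution reproduces the summed branch outputs; the specific weight transforms of CDC/ADC/HDC/VDC are omitted here, so the kernels are random stand-ins for the five branches.

```python
import torch
import torch.nn.functional as F

def merge_deconv_kernels(kernels):
    """Fold parallel 3x3 kernels (VC, CDC, ADC, HDC, VDC) into one
    equivalent kernel K_cvt by element-wise summation (Equation (1))."""
    return torch.stack(kernels, dim=0).sum(dim=0)

# toy check of the additivity property exploited by DEConv
x = torch.randn(1, 8, 64, 64)                      # input feature F_in
ks = [torch.randn(8, 8, 3, 3) for _ in range(5)]   # five parallel 3x3 kernels

y_parallel = sum(F.conv2d(x, k, padding=1) for k in ks)       # sum of branch outputs
y_merged = F.conv2d(x, merge_deconv_kernels(ks), padding=1)   # single merged convolution

print(torch.allclose(y_parallel, y_merged, atol=1e-4))        # True up to float error
```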
(2) GLU-based RepConv branch [31,32]: based on the GLU structure, the reconstructed re-parameterized convolution is used as the convolutional attention weight branch, which enhances the control of information interaction between channels and the information selection ability of the network.
GLU is an improved multi-layer perceptron (MLP) with enhanced gating [33]. GLU has been demonstrated to be effective in numerous cases, as evidenced by the literature [34,35,36], and it is employed in cutting-edge Transformer language models [37,38]. The fundamental concept of GLU is to regulate the flow of information through a gating mechanism, which enables the model to assign varying degrees of importance to different pieces of information, thereby enhancing the network's capacity for expression, particularly when processing sequential or image data. GLU is well-suited for use in convolutional networks and Transformer models, as it can capture both local and global features efficiently. The specific process is shown in Figure 7. Assuming that the input tensor is X, the GLU regulates the transfer of information by introducing a gating mechanism, where X_1 and X_2 can be different mappings of the input X obtained by linear transformations of a network layer (e.g., a fully connected layer as shown in Figure 7, or a convolutional layer), and σ is an activation function (usually a sigmoid) used to constrain the gating values to the range [0,1], thus acting as a 'gate'. The gated linear unit can be represented in simplified form by Equation (2):
$$\mathrm{GLU}(X) = X_1 \otimes \sigma(X_2)$$
Specifically, in our module, X_1 corresponds to the DEConv branch and X_2 to the RepConv branch.
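The following minimal PyTorch sketch illustrates Equation (2): one branch carries the value signal and the other, passed through a sigmoid, gates it element-wise. The 1 × 1 convolutions are simple stand-ins for the DEConv and RepConv branches, not our actual implementation.

```python
import torch
import torch.nn as nn

class SimpleGLU(nn.Module):
    """Minimal gated linear unit: one branch carries the value signal,
    the other is squashed to [0, 1] by a sigmoid and acts as the gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # stand-in for the DEConv branch (X_1)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)   # stand-in for the RepConv branch (X_2)

    def forward(self, x):
        return self.value(x) * torch.sigmoid(self.gate(x))  # GLU(X) = X_1 ⊗ σ(X_2)

glu = SimpleGLU(32)
out = glu(torch.randn(2, 32, 64, 64))  # output has the same shape as the input
```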
In the re-parameterized convolution branch, we combine large-kernel multi-branch deep convolution with the GLU to form a convolution unit with multi-scale receptive fields and strong feature selection capability. As with DEConv, we use the same multi-branch convolutional structure as MSCA in the training phase, but we reconstruct the original 1 × N and N × 1 serial convolutions into single N × N convolutional layers. This increases the number of parameters in the training phase, but the large-kernel convolution performs better feature extraction; for this reason, we use RepConv in the inference phase instead of the multi-branch structure, which reduces both the computational cost and the number of parameters. Moreover, once the DEConv branch handles detail and edge feature extraction, the re-parameterized convolution of the RepConv branch can better integrate information from different channels; the experiments in Section 3.3 also confirm the effect of RepConv.
As shown in Figure 8, for a given input feature F_in, RepConv uses the re-parameterization technique to produce the output F_out with a single normal convolutional layer, at the same computational cost and inference time as a single standard convolution. The formula is as shown in Equation (3):
$$F_{\mathrm{out}} = \mathrm{RepConv}(F_{\mathrm{in}}) = \sum_{i=1}^{3} F_{\mathrm{in}} \times K_i$$
In RDE-MSCA, for the input feature F, the convolutional attention weight Att is produced via the RepConv branch, which allows the model to pay more attention to edge regions and increases sensitivity to boundary details. This is in fact fully equivalent to a gated linear unit structure, in which the RepConv branch controls the flow of information, selectively retaining important features and suppressing irrelevant or redundant information. This is especially critical for the extraction of edge features, as the edge features of clouds and snow play an important role in distinguishing them. The specific process is given in Equations (4) and (5):
$$Att = \mathrm{Conv}_{1\times 1}\left(\sum_{i=0}^{3} \mathrm{Scale}_i\big(\mathrm{DWConv}(F)\big)\right)$$
$$Out = Att \otimes \mathrm{DEConv}(F)$$
where F denotes the input features, Att and Out denote the attention weights and the output, respectively, and ⊗ is the element-wise matrix multiplication operation. DWConv denotes the depthwise convolution used to aggregate local information, Conv_1×1 denotes the ordinary convolution used to model the relationship between different channels, and Scale_i denotes the i-th branch constituted by depthwise convolutions of different scales in the RepConv branch. With this design, the module can capture rich features at different scales while maintaining computational efficiency. In the trend toward model lightweighting, RDE-MSCA provides an effective solution that balances model complexity and practicality, achieving a win–win in terms of model performance and training/inference efficiency.
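A structural sketch of Equations (4) and (5) in PyTorch is given below. The depthwise kernel sizes, the inclusion of an identity branch as Scale_0, and the plain 3 × 3 convolution standing in for DEConv are illustrative assumptions; the sketch only shows how the attention map produced by the RepConv-style branch gates the detail branch.

```python
import torch
import torch.nn as nn

class RDEMSCASketch(nn.Module):
    """Structural sketch of Equations (4)-(5): an attention map built from a
    depthwise convolution plus multi-scale branches gates a detail branch."""
    def __init__(self, dim: int, scales=(5, 7, 11)):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)              # DWConv in Eq. (4)
        self.scale_branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in scales   # Scale_i branches (assumed sizes)
        )
        self.channel_mix = nn.Conv2d(dim, dim, 1)                                # Conv_1x1 in Eq. (4)
        self.deconv = nn.Conv2d(dim, dim, 3, padding=1)                          # placeholder for the DEConv branch

    def forward(self, f):
        base = self.dwconv(f)
        att = base + sum(branch(base) for branch in self.scale_branches)         # Σ Scale_i(DWConv(F)), identity branch included
        att = self.channel_mix(att)                                              # Att
        return att * self.deconv(f)                                              # Out = Att ⊗ DEConv(F)

out = RDEMSCASketch(32)(torch.randn(2, 32, 64, 64))
```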
RDE-MSCA expands the receptive field by using large kernel deep convolutions to process multi-scale information, and implements channel feature fusion with RepConv. In addition, it introduces a shallow differential convolution to explicitly enhance gradient features, avoiding the loss of details caused by simply expanding the receptive field. Through a multi-branch design in the training stage and re-parameterization into a single-branch convolution during inference, RDE-MSCA not only captures rich multi-scale context, but also significantly reduces computational overhead and parameters, achieving a balance of efficiency and detail preservation.
In terms of the attention mechanism, unlike SE [39] and CBAM [40], which mostly use global pooling or pre-generated attention maps to weight channels or spaces, RDE-MSCA emphasizes the collaborative work of “parallel branches”, that is, the differential convolution branch DEConv is responsible for capturing high-frequency gradients and edge details, and RepConv performs global information screening and channel enhancement through GLU, and converges to an efficient single-convolution branch during inference. Compared to the global self-attention strategy of the Transformer, which often requires huge computational complexity and parameter size, RDE-MSCA achieves gated and feature fusion of multiple parallel integration branches by introducing GLU, and realizes adaptive selection of key channels with minimal model cost, while maintaining a balance between global and local features. This significantly reduces resource consumption during inference while maintaining high detection accuracy.

2.3. RDE-SegNeXt

Based on the RDE-MSCA module, we design and propose an improved semantic segmentation model RDE-SegNeXt (re-parameterized detail-enhanced SegNeXt) for the task of quality labeling of SDGSAT-1 remote sensing images. The model significantly improves the detection of clouds, cloud shadow, snow, and other features while maintaining the lightweight and efficient properties of SegNeXt. RDE-SegNeXt uses an encoder–decoder network architecture as shown in Figure 9. The model consists of four stages of encoders and corresponding decoders to progressively reduce spatial resolution and extract multi-scale features.
Encoder part: In the encoder, each stage contains a downsampling block and a set of building blocks, as in the SegNeXt network. The difference is that we use RDE-MSCA instead of the original MSCA. Given an image of size H × W × 3, after these four stages we obtain multilevel features at 1/4, 1/8, 1/16, and 1/32 of the original image resolution, with the number of channels increasing accordingly at each stage. We use a lightweight decoder structure that combines the features from different stages of the encoder. For the decoder, we use a classification head called LightHamBurger, which aggregates the features of the last three encoder stages to improve segmentation accuracy. This allows the model to capture edge and detail information more efficiently.
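The sketch below outlines this four-stage layout: the stem reduces the resolution to 1/4 and each subsequent stage halves it again, yielding features at 1/4, 1/8, 1/16, and 1/32 of the input size, of which the last three are passed to the decoder. The strided convolutions and channel widths (taken from the Tiny configuration in Section 3.2) are placeholders for the actual downsampling and RDE-MSCA building blocks.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Four-stage encoder sketch: each stage downsamples and returns a feature
    map at 1/4, 1/8, 1/16, and 1/32 of the input resolution."""
    def __init__(self, dims=(64, 128, 320, 512)):
        super().__init__()
        in_chs = (3,) + dims[:-1]
        strides = (4, 2, 2, 2)  # cumulative strides 4, 8, 16, 32
        self.stages = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=s, padding=1)  # placeholder stage
            for c_in, c_out, s in zip(in_chs, dims, strides)
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats[1:]  # the lightweight decoder aggregates the last three stages

feats = EncoderSketch()(torch.randn(1, 3, 512, 512))  # shapes: 1/8, 1/16, 1/32 of 512
```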
In our experimental section, we will show that our RDE-SegNeXt outperforms the latest state-of-the-art (SOTA) HRNet [41], Deeplabv3+ [42], Mask2former [43], Poolformer [44], BuildFormer [45], DANet [46], Segmenter [47], SP-PSPNet [48], ConvNeXt [49], Swin Transformer [50], and SegNeXt [51].
Based on the configurations of MSCA and SegNeXt-Tiny, we designed RDE-SegNeXt by replacing MSCA with RDE-MSCA, without changing the basic network structure of the original SegNeXt. The detailed network structure is shown in Table 3. In this table, "e.r." denotes the expansion ratio in the feed-forward network, and "C" and "L" are the number of channels and building blocks, respectively.

3. Experiments

3.1. Datasets and Evaluation Metrics

3.1.1. Datasets

In our implementation, we trained and tested our proposed model on the self-constructed SDGSAT-1 dataset and the publicly available 38-Cloud and LoveDA datasets to verify its effectiveness in different remote sensing image segmentation tasks.
The SDGSAT-1 dataset is a novel cloud detection dataset that we constructed using SDGSAT-1 multispectral remote sensing image data to fill the gap in SDGSAT-1 cloud detection data. The dataset was carefully collected and labeled by us with the aim of providing high-quality and diversified remote sensing data for model training and validation. It consists of 27 SDGSAT-1 satellite images selected from four regions of China (northeast, northwest, southwest, and southeast), each equipped with corresponding pixel-level labels covering six semantic categories: clouds, cloud shadow, water body, land, snow, and fill values. In total, 15,000 sample pairs were generated, and the dataset was divided into a training set and a validation set at a ratio of 7:3. Samples were randomly selected from each region to ensure a reasonable distribution of each category in the dataset and to improve the generalization ability of the model.
The 38-Cloud dataset is a semantic segmentation dataset for remote sensing image cloud detection task, containing 38 scene images from Landsat 8 satellites and their manually labeled pixel-level ground truth. The dataset is cropped to produce image blocks of size 384 × 384 pixels, with 8400 blocks in the training set and 9201 blocks in the test set, each with four spectral channels (red, green, blue, and near-infrared). The dataset is meticulously annotated with thin and thick cloud regions while preserving the black edges of Landsat 8 images to ensure the accurate reproduction of remote sensing image features. However, in this study, images with a background area greater than 70% are excluded to concentrate on the cloud detection task. This dataset is extensively utilized for training and evaluating deep learning-based remote sensing cloud detection models, thereby providing a standardized benchmark for algorithm performance. In this study, the 38-Cloud dataset is selected for single-class validation of the cloud detection performance of RDE-SegNeXt, with the aim of evaluating its ability to recognise high-brightness and morphologically variable cloud image elements, and further demonstrating the model’s generalisability and effectiveness.
The LoveDA dataset is a widely used semantic segmentation dataset for remote sensing, which aims to bridge the gap between urban and rural scenes. It consists of 5987 high-resolution remote sensing images from three different cities in China: Hangzhou, Changchun, and Chengdu [52]. The dataset is divided into urban and rural scenes, offering diverse environmental conditions that enhance the model's generalization. It is further split into three subsets: a training set of 2522 images for model training, a validation set of 1669 images for hyperparameter tuning and validation, and a test set of 1796 images for evaluating performance on unseen data. Each image is labeled with pixel-level annotations covering seven semantic categories: Building, Road, Water, Barren, Forest, Agricultural, and Other Background.

3.1.2. Evaluation Metrics

Global metrics (mFscore, aAcc, mIoU, mAcc): comprehensively reflect the performance of the model on the whole dataset and measure the accuracy, precision and recall of the segmentation and classification.
Category-level metrics (single-category IoU and Acc): a refined assessment of the model’s performance on each category, helping to identify recognition difficulties in specific categories.
Intersection over Union (IoU) is used to quantify the overlap between the predicted region and the ground truth region. In cloud detection, IoU reflects the degree of intersection between the segmented region (e.g., clouds or snow) and the ground truth region. The formula is as shown in Equation (6):
$$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}$$
TP_c (True Positive): the number of pixels whose true category is c and whose prediction is also c. FP_c (False Positive): the number of pixels whose true category is not c but whose prediction is c. FN_c (False Negative): the number of pixels whose true category is c but whose prediction is another category.
Mean Intersection over Union (mIoU) is an equally weighted average of the IoU over all categories and, in contrast to a single-category IoU, measures the overall segmentation performance of a model across multiple categories. The formula is as shown in Equation (7):
$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c$$
In this paper, 'Acc' denotes accuracy, that is, the pixel classification accuracy for each category c. Acc_c reflects the model's recall for a category, i.e., the proportion of that category that is correctly recognized. The formula is as shown in Equation (8):
$$\mathrm{Acc}_c = \frac{TP_c}{TP_c + FN_c}$$
In this paper, aAcc denotes overall pixel accuracy, i.e., the proportion of pixels correctly predicted by the model among all pixels. The formula is as shown in Equation (9):
$$\mathrm{aAcc} = \frac{\sum_{c=1}^{C} TP_c}{\text{Total Pixels}}$$
The mAcc (mean accuracy) is an equally weighted average of the accuracy Acc across all categories; it is more sensitive to class imbalance than aAcc and intuitively reflects whether the model recognizes each category consistently. The formula is as shown in Equation (10):
$$\mathrm{mAcc} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{Acc}_c$$
The F1-score for a single category is the harmonic mean of Precision and Recall; a higher value indicates more accurate and comprehensive identification of the target. The mFscore measures performance in a multi-category scenario by averaging the F1-score over all categories. The formulas are as shown in Equations (11)–(14):
$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}$$
$$\mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$
$$F1_c = \frac{2 \times \mathrm{Precision}_c \times \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}$$
$$\mathrm{mFscore} = \frac{1}{C} \sum_{c=1}^{C} F1_c$$
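For reference, the metrics of Equations (6)–(14) can be computed from a class confusion matrix as in the following NumPy sketch; the function name and the confusion-matrix convention (rows = ground truth, columns = prediction) are assumptions for illustration.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute the metrics of Equations (6)-(14) from a CxC confusion matrix,
    where conf[t, p] counts pixels of true class t predicted as class p.
    Classes absent from both prediction and ground truth would need special handling."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp

    iou = tp / (tp + fp + fn)                       # IoU_c
    acc = tp / (tp + fn)                            # Acc_c (per-class recall)
    precision = tp / (tp + fp)                      # Precision_c
    f1 = 2 * precision * acc / (precision + acc)    # F1_c

    return {
        "mIoU": iou.mean(),
        "aAcc": tp.sum() / conf.sum(),              # overall pixel accuracy
        "mAcc": acc.mean(),
        "mFscore": f1.mean(),
        "per_class_IoU": iou,
    }
```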
Efficiency metrics: The number of parameters and computational complexity are used to evaluate the computational efficiency of the model to ensure that real-time and resource constraints can be met in real-world applications.
FLOPs (Floating Point Operations): represents the number of floating point operations required by the model to process one piece of input data, and is typically used as a measure of the computational complexity of the model.
Inference time: the time it takes for the model to generate output results from the time it receives the input data.
Parameters: refers to the number of all parameters in the model that need to be trained, including the weights and biases of the convolutional kernel, etc.

3.2. Implementation Details

We implemented the proposed RDE-SegNeXt model on the PyTorch deep learning platform using a single NVIDIA RTX 3080 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). To evaluate the performance of the model on cloud detection and remote sensing image segmentation tasks, we trained and tested it on the custom SDGSAT-1 dataset and the publicly available LoveDA dataset. The AdamW optimizer was used, with β₁ = 0.9, β₂ = 0.999, a weight decay coefficient of 0.01, and an initial learning rate of 6 × 10⁻⁵. The batch sizes of the training and validation sets were 8 and 1, respectively. A parameter grouping strategy was used to tune the learning rate and weight decay for specific layers: the weight decay coefficients for the 'pos_block' and 'norm' layers were set to 0, and the learning rate multiplier for the 'head' layer was set to 10.
To schedule the learning rate, LinearLR is first used for the first 1500 iterations to linearly increase the learning rate from 1 × 10⁻⁶ to 6 × 10⁻⁵; a polynomial decay (PolyLR) strategy is then used, gradually reducing the learning rate from the initial value to 0 between the 1500th and the 71,250th iteration. The model was trained for a total of 71,250 iterations and validated every 2000 iterations.
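A minimal PyTorch equivalent of this optimization setup is sketched below (our actual experiments use the MMSegmentation training loop); the placeholder model, the warm-up start factor derived from 1 × 10⁻⁶ / 6 × 10⁻⁵, and the polynomial power of 1.0 are assumptions.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, PolynomialLR, SequentialLR

model = torch.nn.Conv2d(3, 6, kernel_size=3)  # placeholder for the RDE-SegNeXt network

optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-5, betas=(0.9, 0.999), weight_decay=0.01
)

warmup = LinearLR(optimizer, start_factor=1e-6 / 6e-5, total_iters=1500)   # 1e-6 -> 6e-5
poly = PolynomialLR(optimizer, total_iters=71_250 - 1_500, power=1.0)      # decay to 0 (power assumed)
scheduler = SequentialLR(optimizer, schedulers=[warmup, poly], milestones=[1_500])

# training loop (sketch): for each of the 71,250 iterations do
#   forward pass -> CrossEntropyLoss -> backward -> optimizer.step() -> scheduler.step(),
#   validating every 2,000 iterations
```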
For data pre-processing and augmentation, we resized the input image to 512 × 512 pixels, normalized the image using the mean [123.675, 116.28, 103.53] and standard deviation [58.395, 57.12, 57.375], and converted the image from BGR to RGB format. Data augmentation techniques include random rotation (90°, 180°, or 270°), random horizontal or vertical flipping, random cropping of 512 × 512 blocks from the original image, and color jittering of brightness, contrast, saturation, and hue [53].
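The following sketch illustrates this pre-processing pipeline on a single image–label pair; the flip probabilities, the brightness-only color jitter, and the tensor layout are illustrative assumptions rather than the exact MMSegmentation transforms we use.

```python
import random
import torch
import torchvision.transforms.functional as TF

MEAN = [123.675, 116.28, 103.53]
STD = [58.395, 57.12, 57.375]

def preprocess(img_bgr: torch.Tensor, label: torch.Tensor):
    """Augmentation sketch for a (3, H, W) uint8 BGR image tensor and its (H, W) label map."""
    img = img_bgr[[2, 1, 0], ...].float()                # BGR -> RGB
    k = random.choice([0, 1, 2, 3])                      # rotate by 0/90/180/270 degrees
    img = torch.rot90(img, k, dims=(1, 2))
    label = torch.rot90(label, k, dims=(0, 1))
    if random.random() < 0.5:                            # random horizontal flip
        img, label = torch.flip(img, dims=(2,)), torch.flip(label, dims=(1,))
    if random.random() < 0.5:                            # random vertical flip
        img, label = torch.flip(img, dims=(1,)), torch.flip(label, dims=(0,))
    img = TF.adjust_brightness(img / 255.0, 1 + random.uniform(-0.2, 0.2)) * 255.0  # brightness jitter only
    img = TF.normalize(img, MEAN, STD)                   # per-channel normalization on the 0-255 scale
    return img, label
```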
For the model configuration, MSCAN was used as the backbone network with embedding dimensions of [64, 128, 320, 512] and depths of [3, 3, 12, 3] for the four stages, initialized with Kaiming initialization, a dropout rate of 0.1, and trainable Batch Normalization (BN) for the normalization layers. LightHamHead was used as the decode head with input channels [128, 320, 512] and 512 channels; normalization in the Ham module was configured as Group Normalization (GN) with 32 groups, and CrossEntropyLoss was used as the loss function.
For the training strategy, we used IterBasedTrainLoop (MMSegmentation 2.2.0, OpenMMLab, Shanghai, China) to manage the learning rate adjustment using the ParamSchedulerHook (MMSegmentation 2.2.0, OpenMMLab, Shanghai, China). The default configuration includes saving model checkpoints every 2000 iterations, logging training logs every 50 iterations, ensuring random data loading in distributed training, logging elapsed time for each iteration, and visualizing results during training.

3.3. Ablation Study

In this ablation experiment, we used SegNeXt-Tiny without pre-training as the base model and gradually added or removed the following modules: (1) RepConv attention, (2) DEConv, (3) RDE-MSCA, and analyzed the contribution of each component. The results are shown in Table 4.
First, we use SegNeXt-Tiny as the baseline model, which achieves 66.59% mIoU without pre-training, indicating that the model already has some cloud detection capability without any enhancement module. SegNeXt-Tiny is the lightest semantic segmentation model in the SegNeXt family; although its parameter count and computation are small, its feature extraction and representation capabilities are limited. Specifically, due to the limited network depth and width, the model may not capture detailed features such as cloud edges, thin clouds, and cloud shadow well, limiting its ability to capture details; on the other hand, the shallow network structure may not adequately capture global contextual information, which affects the recognition of large cloud regions and leads to insufficient global feature extraction.
(1) RepConv attention: We reconstruct the original multi-branch convolution in MSCA into RepConv based on SegNeXt-Tiny. Compared to the baseline model, mIoU is improved by 0.91%, and mFscore and mAcc are improved accordingly. Although the improvement is not large, it demonstrates the effectiveness of RepConv; in cloud detection tasks, even small gains can noticeably improve the processing of edge regions and complex scenes. RepConv increases the nonlinear expressiveness of the model through a multi-branch large-kernel convolutional structure in the training stage, capturing richer features, and fuses information from multiple channels through re-parameterized convolution in the inference stage, reducing computational cost.
(2) DEConv: Based on SegNeXt-Tiny, we redesign the multi-branch parallel convolution in the original MSCA structure as DEConv. The differential convolutions in different directions in DEConv effectively help the network capture gradient-level information by calculating the differences between pixels, which improves the model's sensitivity to edges and details. Compared to the baseline model, the performance improvement from this enhancement is significant: mIoU increases by 2.30%, indicating that the segmentation accuracy for each category is significantly improved, and mFscore and mAcc are also significantly improved, indicating a higher correct recognition rate for each category. This proves that DEConv effectively captures detailed features such as cloud edges, thin clouds, and cloud shadow; the enhancement of detailed features helps the model to distinguish between clouds and snow with similar spectral characteristics, reducing false detections and improving the model's ability to recognize complex cloud morphology.
(3) RDE-MSCA: On the one hand, we performed a modular structural replacement: we replaced MSCA with the GLU-based convolutional attention structure that places DEConv and RepConv in parallel, replacing all layers in each stage of SegNeXt-Tiny with RDE-MSCA. Compared to the baseline model, the metrics improved significantly, with mFscore: 79.63% (a 2.30% increase), mIoU: 69.25% (a 2.66% increase), and mAcc: 78.00% (a 3.76% increase), among the best performances. This confirms the effectiveness of the parallel-branch synergy of the GLU structure: RepConv provides global feature extraction capability, DEConv enhances detailed feature capture, and the gated unit structure selectively filters information. RDE-MSCA enhances both global and detailed features at all scales, both of which help the model understand complex cloud structures and background relationships, and the deep feature enhancement also improves the model's capture of high-level semantic information.
On the other hand, it is worth noting that replacing all layers in each stage of SegNeXt-Tiny with RDE-MSCA adds a large computational overhead to training and inference, which contradicts our original intention of balancing accuracy and efficiency; moreover, for the four stages of SegNeXt, the information of the first layer in each stage plays a crucial role in the feature extraction of the subsequent layers. Therefore, we introduced or removed the above modules in different layers of the model (e.g., the first layer or all layers) and observed the effect on model performance. The experimental results show that replacing only the first layer of each stage improves the performance of the network: mFscore: 81.77% (a 4.44% increase), mIoU: 71.85% (a 5.26% increase), and mAcc: 81.83% (a 7.59% increase). This reflects an overall improvement in the model's ability to recognize the different feature types involved in cloud detection, and suggests that the location of the modules affects the feature extraction of the network. Introducing RDE-MSCA in the first layer of each stage improves the capture of detailed features in the initial stage of the model, providing rich detail information for the subsequent layers; detailed features captured in the shallow layers are thus preserved and reinforced rather than weakened in deeper layers, improving the overall performance of the model. More importantly, adding the RDE-MSCA module only in the first layer increases the computational cost to a limited extent while improving performance significantly. Especially under real-world resource constraints, this structural design has greater practical significance.
The experimental results, on the one hand, fully illustrate the importance of detail features, proving that edge and detail information is crucial for accurately identifying clouds, cloud shadow, and snow in the cloud detection task. On the other hand, replacing DEConv with RepConv in the first layer only may cause the model to introduce overly strong global features in the shallow layers, which interferes with the extraction of detail features in subsequent layers and leads to the loss of necessary detail information.
This ablation experiment thoroughly analyzes the effect of the modules on different feature classes, the importance of module placement, and the impact of the module design strategy on cloud detection performance, and arrives at the following main conclusions: DEConv is the key module for improving the model's performance on the cloud detection task, while RepConv plays an auxiliary role, mainly enhancing global feature extraction, with a comparatively limited contribution. The GLU-based multi-scale convolutional attention module with DEConv and RepConv in parallel exploits the representation capabilities of both local and global features; introduced in the shallow layers of the network, it significantly improves the model's ability to capture detailed features of clouds, cloud shadow, snow, and other classes without significantly increasing the computational cost and model parameters, maximizing the model's performance and providing an efficient and accurate solution for practical applications.

3.4. Comparison Experiments

The comparative experiments were conducted using the same dataset and hyperparameter settings. We compared several popular deep learning semantic segmentation models, including HRNet, Deeplabv3+, Mask2Former, PoolFormer, BuildFormer, DANet, Segmenter, Shift Pooling PSPNet (subsequently abbreviated as SP-PSPNet), the ConvNeXt series (Tiny, Small, Base, Large), the Swin Transformer series, the SegNeXt series, and RDE-SegNeXt. The main evaluation metrics include overall accuracy (aAcc), mean intersection over union (mIoU), and mean accuracy (mAcc). In the SDGSAT-1 cloud detection experiment, these indicators comprehensively reflect the performance of each model in segmenting the five types of land features (clouds, cloud shadow, water body, snow, and land surface). We focus our analysis on the detection results for the three key feature types in cloud detection: clouds, snow, and cloud shadow. In addition, we verified the performance of each network on the single-category cloud detection task on 38-Cloud, and used the LoveDA dataset to further verify the effectiveness and universality of RDE-SegNeXt.

3.4.1. Comparative Experiments of the SDGSAT-1 Dataset

I.
Quantitative Analysis
In the experiments conducted on the SDGSAT-1 dataset, the performance of the different models on the five types of features (clouds, cloud shadow, water body, snow, and ground surface) is as follows:
RDE-SegNeXt achieves 89.03% overall accuracy (aAcc) and 71.85% mean intersection over union (mIoU), showing clear competitiveness against the other models. BuildFormer and SP-PSPNet reach 69.01% and 69.12% mIoU, respectively, close to Swin-L (69.14%), and also perform well in overall accuracy and IoU. The mIoU of the Swin Transformer series (Swin-T, Swin-S, Swin-B, Swin-L) ranges from 66.61% to 69.14%, with good overall accuracy and stability but higher computational complexity. The mIoUs of Hrnet and Deeplabv3+ are 56.30% and 54.42%, respectively; they are relatively weak in multi-scale feature extraction from multispectral data, especially in distinguishing high-reflectance from low-reflectance features. DANet and Segmenter reach 62.58% and 64.26% mIoU, respectively, below RDE-SegNeXt but with stronger feature extraction than Hrnet and Deeplabv3+. The results are shown in Table 5.
Among the five categories, clouds, cloud shadow, and snow have always been the three most challenging for cloud detection. The experimental data in Table 6 and the visualization in Figure 10 show that RDE-SegNeXt clearly improves the discrimination of these key categories:
The cloud category has high reflectivity and variable morphology, and mixed pixels at thin clouds, cloud–snow transition zones, and cloud–cloud shadow transition zones are difficult to detect. RDE-SegNeXt achieves 87.89% IoU and 94.26% Acc for the cloud category, the best overall detection result among all compared methods, improving IoU by 2.62% over SegNeXt-L (85.27%) and Acc by 0.35% over the previous best, PoolFormer (93.91%). Its cloud IoU is also 2.7% and 2.62% higher than that of BuildFormer and SP-PSPNet, respectively.
The snow category has high spectral brightness and is easily confused with clouds. RDE-SegNeXt achieves 72.24% IoU and 87.17% Acc for snow, improving IoU by 5.48% and 1.78% over the previous best methods Swin-L (66.76%) and SegNeXt-L (70.46%), respectively, and improving Acc by 3.65% over the previous best of SegNeXt-L (83.52%), placing it at the top of all compared methods. This leading performance shows that RDE-SegNeXt discriminates well between high-brightness, spectrally similar features: in scenes where both clouds and snow are highly reflective, a model that cannot distinguish their texture and fine morphological features is prone to misclassification.
The cloud shadow category is dark and variable in shape. The cloud shadow IoU of RDE-SegNeXt is 33.55%, higher than most mainstream models (e.g., 30.31% for Swin-L and 27.72% for SegNeXt-L), a clear advantage in recognizing dark features. Traditional CNN architectures such as Hrnet (30.76%) and Deeplabv3+ (25.83%) lack effective modeling of dark shaded regions and therefore have lower cloud shadow detection accuracy.
For water (82.38%) and land (83.18%), RDE-SegNeXt also maintains a high level, verifying the model’s overall adaptability to multiple types of features in multispectral scenes.
II.
Efficiency analysis
We measure the efficiency of the model in terms of inference time, FLOPs, and number of parameters; the results are shown in Table 7.
Inference time: this metric reflects the time required for the model to process an input sample and is an important indicator of real-time performance. The inference time of RDE-SegNeXt is 0.14 s, 82.05% faster than Swin-L (0.78 s) and 68.89% faster than SegNext-L (0.45 s), while its performance (mIoU, mAcc, and other metrics) is also better, striking a good balance between efficiency and accuracy. Although it is slightly slower than SegNext-S (0.12 s), it remains faster than SegNext-B (0.20 s) and shows stronger performance advantages in both overall and category-level metrics.
Computational complexity (FLOPs): this metric measures the computational cost of the model. The complexity of RDE-SegNeXt is 31.818 GFLOPs, only 7.64% of Swin-L (416.77 GFLOPs) and a 50.83% reduction compared to SegNext-L (64.736 GFLOPs), close to SegNext-B (31.071 GFLOPs). Compared to Swin-L and Swin-B, RDE-SegNeXt shows much lower complexity while maintaining similar or better performance metrics (e.g., mIoU improved by 2.6% and 2.87%, respectively).
Parameters: this metric represents the size of the model and is a key indicator of its storage and deployment requirements. RDE-SegNeXt has 63.172 M parameters, about 26.59% of Swin-L (237.568 M). Compared to SegNext-L (47.989 M) this is a 31.65% increase, but both the overall segmentation performance and the category-level accuracy improve: RDE-SegNeXt achieves the best combined detection results for clouds, snow, and cloud shadow among all compared models, indicating that it finds a better compromise between network scale and accuracy.
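The sketch below shows one way such efficiency numbers can be gathered: parameter count and averaged forward-pass latency are measured directly in PyTorch, while FLOPs are normally obtained from an external profiler (e.g., an fvcore- or mmengine-style counter). The input shape and the toy model in the example are placeholders, not the configuration used in this paper.

```python
import time
import torch

def measure(model, input_shape=(1, 4, 512, 512), warmup=5, runs=20, device="cpu"):
    """Rough latency + parameter count for a model; FLOPs counting is left to external tools."""
    model = model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6  # parameters in millions
    with torch.no_grad():
        for _ in range(warmup):          # warm-up passes are excluded from timing
            model(x)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
    return {"params(M)": params_m, "time(s)": (time.perf_counter() - t0) / runs}

# Toy model only to show the call; replace with the segmentation network under test.
print(measure(torch.nn.Conv2d(4, 8, 3, padding=1)))
```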
III.
Qualitative analysis
For the cloud category, clouds are often characterized by uneven thickness and fuzzy edges; the top of a cloud often blends with the atmospheric background, and the transitions between clouds, cloud shadow, and snow are often indistinct. The DEConv branch in the RDE-MSCA module relies on multi-directional differential convolutions (CDC, ADC, HDC, and VDC) to capture fine-grained gradient variations. The ablation experiments showed that introducing RDE-MSCA in the first layer of each stage enhances the capture of detailed features at the initial stage of the model and provides rich detail information for the subsequent layers: the fine texture of clouds is effectively captured in the shallow layers, avoiding the risk of being weakened in deeper layers, and these details are continuously retained and enhanced in subsequent layers, yielding good detection of clouds and related features. In addition, cloud reflectance varies significantly across the multispectral bands, so clouds exhibit different color and brightness characteristics under different channel combinations. In the training stage, RepConv uses multi-branch large-kernel convolution together with the GLU to fully fuse global features across scales and channels, which makes the model more discriminative of cloud color and brightness variations and effectively reduces spectral confusion between clouds, the atmospheric background, and highly reflective features. Moreover, compared with pure attention models based on global self-attention (e.g., the Transformer series), the GLU-governed multi-branch channel fusion of RepConv is re-parameterized into a standard convolution at the inference stage, preserving recognition accuracy while retaining inference efficiency, and delivering superior cloud detection accuracy with lower computational complexity and fewer parameters. In contrast, BuildFormer relies heavily on deeper network structures to improve feature representation, which increases computational complexity and hinders real-time deployment; DANet, although strong in multi-scale feature fusion, does not capture high-frequency details as well as the DEConv branch of RDE-SegNeXt; Segmenter adopts a Transformer-based architecture that excels at global feature modeling but is weaker at handling fine-grained edge information; and SP-PSPNet achieves multi-scale feature fusion through its shift-pooling improvement but does not retain edge details as well as RDE-SegNeXt's detail-enhancement mechanism.
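To make the re-parameterization argument concrete, the short sketch below uses the common central-difference-convolution formulation (an assumed form, not necessarily the exact DEConv definition) to show that the training-time difference branch is exactly equivalent to a single ordinary convolution with a transformed kernel at inference time, so the detail branch adds no inference overhead.

```python
import torch
import torch.nn.functional as F

def cdc_equivalent_kernel(w):
    # Central difference conv: y = sum_p w_p * (x_{p0+p} - x_{p0})
    # == plain conv with kernel w' whose centre tap is reduced by the spatial sum of w.
    w_eq = w.clone()
    w_eq[:, :, 1, 1] -= w.sum(dim=(2, 3))
    return w_eq

cin, cout = 8, 8
w = torch.randn(cout, cin, 3, 3)
x = torch.randn(1, cin, 32, 32)

# Training-time (explicit difference) form of the CDC branch.
y_direct = F.conv2d(x, w, padding=1) - F.conv2d(x, w.sum(dim=(2, 3), keepdim=True))

# Inference-time form: one ordinary 3x3 convolution with the equivalent kernel.
y_reparam = F.conv2d(x, cdc_equivalent_kernel(w), padding=1)

print(torch.allclose(y_direct, y_reparam, atol=1e-5))  # True: no extra inference cost
```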
The snow category is easily confused with clouds because both have high reflectivity in the visible bands. However, the DEConv branch of RDE-MSCA extracts local differential cues such as snow-surface texture and terrain variation, and the gradient features of snow regions differ from those of clouds (especially thick clouds), which also aids separation. Although DANet is innovative in its two-branch structure, its feature selection mechanism is not as fine-grained as the GLU in RDE-SegNeXt, and Segmenter's Transformer architecture is advantageous for global information capture but less flexible for fine-grained feature selection. Furthermore, compared with the self-attention global feature modeling of the Swin Transformer, RDE-SegNeXt separates snow and cloud details in the shallow layers and then consolidates the discrimination through cross-scale fusion by RepConv in subsequent layers. The GLU-based RepConv branch enables selective feature amplification: for the multi-branch convolutional outputs of highly reflective features, the GLU strengthens the texture and gradient channels of clouds and snow while suppressing interfering channels, achieving fine-grained differentiation. This is more flexible than simply stacking CNN blocks or aggregating multi-scale features only at the decoder, and it helps significantly in discriminating spectrally similar features.
The cloud shadow category has complex morphology, sometimes appearing as elongated strips or irregular scattered patches. There is usually a spatial connection between clouds and cloud shadow, the transition boundaries between clouds (especially thin clouds) and cloud shadow are unclear, and light cloud shadow is spectrally similar to dark features such as water and mountain shadows, all of which easily lead to confusion. With the multi-directional detail extraction of the DEConv branch, the network captures the gradient features at light–dark boundaries in the shallow layers, reducing the misdetection of dark targets that occurs when relying solely on spectral, color, and brightness information. This complements the large-range global attention of the Swin Transformer series: self-attention captures macroscopic structure but tends to overlook edge detail differences in dark areas, whereas the detail-enhanced convolutional branch distinguishes cloud shadow boundaries more finely, separating cloud shadow from dark areas of clouds, water bodies, or the surface. In addition, the dark, variable shapes and usually small area of cloud shadow make most models less robust and generalizable when learning its features. RDE-SegNeXt acquires cloud shadow information at different scales during training through multi-branch large-kernel convolution and merges the branches into a single convolution kernel during inference, which significantly reduces the inference burden while preserving shadow context at different scales, improving generalization to the cloud shadow category. Compared with networks that only perform multi-scale concatenation at the decoder (e.g., the dilated convolutional pyramid of Deeplabv3+), RDE-SegNeXt's multi-scale convolutional attention pervades every stage of the encoder, allowing more stable tracking of cloud shadow boundary changes.
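The branch-merging step mentioned above can be illustrated with a small numerical check (a sketch under the usual linearity assumption, not the authors' implementation): parallel depthwise branches with different kernel sizes are folded into one kernel by zero-padding the smaller kernel and summing, so inference runs a single convolution.

```python
import torch
import torch.nn.functional as F

cin = 4
w5 = torch.randn(cin, 1, 5, 5)          # depthwise 5x5 branch
w3 = torch.randn(cin, 1, 3, 3)          # depthwise 3x3 branch
x = torch.randn(1, cin, 64, 64)

# Training-time view: two parallel multi-scale branches whose outputs are added.
y_branches = (F.conv2d(x, w5, padding=2, groups=cin) +
              F.conv2d(x, w3, padding=1, groups=cin))

# Inference-time view: embed the 3x3 kernel in a 5x5 kernel and run one convolution.
w_fused = w5 + F.pad(w3, (1, 1, 1, 1))
y_fused = F.conv2d(x, w_fused, padding=2, groups=cin)

print(torch.allclose(y_branches, y_fused, atol=1e-5))  # True
```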
Overall, the experimental results show that the Swin Transformer series, which relies on shifted-window self-attention, excels at macro scene segmentation (e.g., large cloud clusters) but is relatively weak at modeling the local differences between dark regions such as cloud shadow and bright regions such as snow. The SegNeXt series also adopts convolutional attention and a lightweight design, but compared with RDE-SegNeXt it lacks the enhancement of high-brightness/dark edge details provided by DEConv and the GLU-based selection between multi-scale branches, and is therefore slightly less accurate in the cloud shadow and snow categories. The ConvNeXt series, based on a modernized convolutional design, has some multi-scale capability, but its discrimination between bright regions and dark shadow is not pronounced; combined with its relatively higher inference time and FLOPs, its overall effect is not as good as that of RDE-SegNeXt.
Considering the characteristics of SDGSAT-1 multispectral data, the RDE-SegNeXt design, which combines the RepConv and DEConv branches within a two-branch convolutional attention governed by a gated linear unit, effectively handles typical interferences such as snow and cloud shadow during cloud detection, balancing the multi-scale receptive field with the capture of high-frequency details, and delivers strong performance in all three key cloud detection categories. Although the number of parameters and FLOPs of RDE-SegNeXt are slightly higher than those of some lightweight models (e.g., PoolFormer, SegNext-T, SegNext-S, SegNext-B), its accuracy and segmentation quality are significantly better. In resource-constrained or real-time application scenarios, RDE-SegNeXt demonstrates better recognition capability and deployability than comparable models on multispectral remote sensing imagery, striking a good balance between lightweight design and high accuracy, which further strengthens its application potential in downstream SDGSAT-1 cloud detection tasks.

3.4.2. 38-Cloud Public Cloud Dataset Comparison Experiments

The 38-Cloud dataset is a publicly available benchmark for the cloud detection task. It contains 38 Landsat-8 scenes whose spatial resolution and multi-band characteristics reflect the difficulty of cloud detection in real satellite remote sensing, and it is commonly used to verify the robustness of cloud detection algorithms under different cloud types, surface cover, and cloud coverage. In this study, we conduct experiments on the 38-Cloud dataset to further evaluate the generalization ability and stability of the proposed model for remote sensing data from different sources and imaging conditions, thereby validating the effectiveness and applicability of the proposed RDE-SegNeXt network for the cloud detection task. The results of the experiment are shown in Table 8.
I.
Quantitative analysis
The experimental results show that RDE-SegNeXt achieves 65.36% IoU, 83.65% Acc, and 79.05% Fscore for the cloud class, the best overall performance among all compared models. These metrics jointly reflect the model's accurate localization of cloud regions (IoU), its correct classification rate (Acc), and the combined recall and precision for cloud pixels (Fscore).
Compared to Swin-L (63.01% IoU, 77.31% Fscore), RDE-SegNeXt improves IoU by about 2.35% and Fscore by about 1.74%; compared to SegNeXt-L (64.52% IoU, 78.44% Fscore), the improvements are about 0.84% and 0.61%, respectively. Compared to SP-PSPNet, BuildFormer, DANet, and Segmenter, which are widely applied to remote sensing image segmentation, the IoU of RDE-SegNeXt is 4–7% higher; compared to lighter or more traditional architectures such as Hrnet, Deeplabv3+, and Mask2former, the improvement is even larger, reaching differences of 6–10%.
These quantitative metrics show that RDE-SegNeXt has all-round advantages in cloud detection: it captures cloud regions more accurately (higher IoU and Fscore) while maintaining high overall recognition accuracy (higher Acc), indicating significant improvements in both missed detections and false detections.
II.
Qualitative analysis
In a comprehensive comparison with other mainstream networks, Swin Transformer relies on shifted-window self-attention to capture large-scale scene information and is good at macro-structural segmentation, but slightly weak at fine-grained cloud detection. The original SegNeXt framework, although it also contains convolutional attention and a lightweight design, lacks a mechanism like RDE-SegNeXt's parallel screening through RepConv multi-scale channel fusion and the GLU at the key layers of the cloud detection task, which makes its cloud detection accuracy slightly inferior. The ConvNeXt series performs well on general vision tasks but does not treat the edges of high-brightness clouds and thin clouds differently in these scenes, resulting in a significantly lower IoU; multi-scale pyramid or multi-branch fusion methods such as Hrnet and Deeplabv3+ often lack explicit detailed gradient encoding and cannot adequately portray the high-reflectance gradient difference between clouds and background. In contrast, the RepConv branch in RDE-SegNeXt emphasizes multi-scale semantic fusion, which provides good coverage of large-area or multi-layered clouds; the DEConv branch is designed for cloud edges and fine spectral detail differences, overcoming misclassification caused by luminance differences or background mixing; and the GLU gives the network flexible, adaptive filtering when fusing the two branches, highlighting the features most useful for cloud detection.
In summary, the excellent cloud detection results of RDE-SegNeXt on the 38-Cloud dataset not only verify its generalization and robustness in a dedicated cloud detection task, but also demonstrate the advantages of the GLU-based parallel design of RepConv and DEConv, which combines a multi-scale receptive field with detail gradient enhancement under adaptive information control. This provides an effective and efficient solution for subsequent cloud detection, cloud shadow recognition, snow detection, and related tasks in remote sensing scenarios.

3.4.3. LoveDA Public Dataset Comparison Experiments

The LoveDA dataset is a challenging benchmark for semantic segmentation of remote sensing images. It contains a variety of urban and rural scenes covering seven feature classes (background, buildings, roads, water body, barren land, forests, and agricultural land). The comparison results with a variety of mainstream models are as follows:
I.
Quantitative analysis
(a)
Overall performance analysis
As shown in Table 9, RDE-SegNeXt achieved the highest overall performance, with 64.68% mFscore, 64.68% aAcc, 48.51% mIoU, and 61.33% mAcc. This is a 1.84% improvement in mIoU and a 1.9% improvement in mFscore over the previous top performer, PoolFormer (46.67% mIoU, 62.78% mFscore). Compared with the Swin Transformer series, which also achieves high accuracy, RDE-SegNeXt improves mIoU by 6–8% (e.g., 48.51% versus 42.44% for Swin-L) and the F1 score by about 5.77%, demonstrating better comprehensive segmentation capability and global accuracy.
(b)
Category analysis
According to the per-category IoU and Acc metrics in Table 10, RDE-SegNeXt achieves the best or near-best performance in most categories, demonstrating excellent adaptability to composite urban and natural scenes. In the roads category, it achieves the highest IoU (52.20%), thanks to the detail-enhanced convolution's effective capture of edges and fine structures, which particularly benefits narrow, continuous road targets. It also achieves the best performance on the barren land and agricultural land categories, indicating strength in recognizing complex features such as sparsely vegetated areas and farmland. In the buildings and forests categories, RDE-SegNeXt likewise performs well, showing the advantage of RepConv in capturing large-scale shapes alongside the parallel DEConv branch capturing local textures.
II.
Qualitative Analysis
Multi-scale detail-enhanced convolutional attention: The RepConv branch enables efficient channel fusion and feature selection with the help of GLU to suppress redundant information; the DEConv branch captures high-frequency edges and detailed textures through differential convolution, which is significantly helpful for targets that require precise contours, such as roads, or complex edges of buildings.
Improvement in edge detection capability: in LoveDA, both urban infrastructure (buildings, roads) and natural areas (forests, agricultural land) have variable and complex edge patterns. RDE-SegNeXt improves edge perception through DEConv, while the RepConv branch provides global fusion of information, so that the IoUs for roads, buildings, and barren land improve significantly.
Balanced performance in all categories: RDE-SegNeXt maintains the leading performance in most categories, which indicates that the model can achieve a better balance between various types of features, with both generalization ability and robustness.
Leading average accuracy: mAcc reaches 61.33%, which is at the highest level in the overall classification of multi-category scenes, indicating that the model is not prone to serious bias towards certain categories, thus ensuring a more balanced accuracy across categories.
In conclusion, the proposed RDE-SegNeXt shows better overall performance in the SDGSAT-1 cloud detection and LoveDA experiments, with significant advantages in computational complexity and number of parameters. Whereas the Swin Transformer series achieves good performance at high computational cost and with a huge number of parameters, RDE-SegNeXt effectively improves the edge detection accuracy of clouds, snow, cloud shadow, and other features by combining re-parameterized convolution with detail-enhanced convolution modules, and significantly reduces computational complexity while maintaining high accuracy, making it an ideal lightweight and efficient cloud detection model.

4. Discussion

The RDE-SegNeXt model proposed in this paper significantly improves feature extraction capability and generalization performance by introducing DEConv and RepConv under a GLU. DEConv uses multiple differential convolution layers (central, angular, horizontal, and vertical difference convolutions) to efficiently capture the edge and detail features of clouds, cloud shadow, snow, and other targets, enhancing the perception of high-frequency information and helping to distinguish features with similar spectral characteristics. RepConv, in turn, enhances channel-information fusion and feature selection through the gating mechanism, improving global feature extraction without increasing computational complexity. Experimental results show that RDE-SegNeXt achieves 71.85% mIoU and 81.77% mFscore and outperforms mainstream models in key categories such as clouds, cloud shadow, water body, and snow. Compared with the Swin Transformer series, it performs better while maintaining lower computational complexity, particularly in the snow category, where it reaches 72.24% IoU and 87.89% accuracy, validating its advantage in distinguishing spectrally similar features. In addition, adding the DEConv module only in the shallow layers significantly improves performance, indicating that the capture of early detailed features is crucial to the segmentation quality of the whole network. RDE-SegNeXt thus achieves a balance between performance and computational efficiency and provides an effective solution for semantic segmentation of remote sensing images.
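A conceptual sketch of the GLU-style gating used to fuse the two branches is given below (illustrative only; the layer shapes and the 1 × 1 projections are assumptions, not the exact RDE-MSCA definition): one projection produces values and another produces a sigmoid gate, so channels from the combined RepConv/DEConv features are amplified or suppressed adaptively.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """GLU-style channel gating over the sum of two branch outputs (conceptual sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.value = nn.Conv2d(dim, dim, kernel_size=1)  # value projection
        self.gate = nn.Conv2d(dim, dim, kernel_size=1)   # gate projection

    def forward(self, feat_rep, feat_de):
        fused = feat_rep + feat_de                        # combine global + detail branches
        return self.value(fused) * torch.sigmoid(self.gate(fused))

x_rep = torch.randn(1, 32, 64, 64)   # e.g., RepConv branch output
x_de = torch.randn(1, 32, 64, 64)    # e.g., DEConv branch output
print(GatedFusion(32)(x_rep, x_de).shape)   # torch.Size([1, 32, 64, 64])
```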
In future work, the design ideas behind RDE-SegNeXt can be applied to other remote sensing image analysis tasks, such as change detection and target identification and classification. Combining richer spectral and spatial information and exploring fusion of multi-source remote sensing data are expected to further improve the accuracy and robustness of the model. In addition, spatial attention mechanisms or graph convolutional networks could be introduced to enhance the model's understanding of global context and spatial relationships, and specialized loss functions or data augmentation strategies could be designed for the difficult cases of cloud shadow and high-brightness landforms to further improve performance.

5. Conclusions

In this paper, we propose an efficient cloud detection algorithm for SDGSAT-1 data, construct a finely labeled cloud detection dataset containing five types of elements (clouds, cloud shadow, water body, snow, and ground surface), and design RDE-SegNeXt. On the SDGSAT-1 dataset, RDE-SegNeXt achieves competitive performance with fewer parameters and lower computational complexity: its computational complexity is only about 1/12 that of the Swin-L model, and its mIoU reaches 71.85%, a 2.6% improvement over Swin-L, while cloud detection accuracy improves by 3.6%, cloud shadow detection by 10.32%, and snow detection by 6.75%. We then validate RDE-SegNeXt on single-category cloud detection using the 38-Cloud dataset, and the results again confirm its ability to accurately differentiate high-brightness, irregular cloud targets, reflecting good generalization. Finally, on the LoveDA public dataset, RDE-SegNeXt also shows good segmentation accuracy and applicability, further proving its effectiveness and generality. RDE-SegNeXt successfully balances accuracy and efficiency and provides strong support for applying SDGSAT-1 data to cloud detection and the identification of strongly correlated features, especially clouds, cloud shadow, and snow.

Author Contributions

Conceptualization, X.L.; methodology, X.L.; software, X.L.; validation, X.L.; data curation, X.L. and C.H.; writing—original draft, X.L.; writing—review and editing, X.L. and C.H.; visualization, X.L.; supervision, C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Civil Aerospace Technology Pre-research Project of China’s 14th Five-Year Plan (Guide No. D040404), the Youth Innovation Promotion Association, CAS (No. 2022127), and the “Future Star” Talent Plan of Aerospace Information Research Institute, Chinese Academy of Sciences (No. 2021KTYWLZX07).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The raw data used in this study are from SDGSAT-1 and are provided by the International Research Center of Big Data for Sustainable Development Goals (CBAS). As the data are subject to CBAS's use and sharing license, they can only be made available with CBAS's authorization; permission to share the data can be obtained upon request from the corresponding author.

Acknowledgments

The research findings are a component of the SDGSAT-1 Open Science Program, which is conducted by the International Research Center of Big Data for Sustainable Development Goals (CBAS). The data utilized in this study is sourced from SDGSAT-1 and provided by CBAS.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. King, M.D.; Platnick, S.; Menzel, W.P.; Ackerman, S.A.; Hubanks, P.A. Spatial and temporal distribution of clouds observed by MODIS onboard the Terra and Aqua satellites. IEEE Trans. Geosci. Remote Sens. 2013, 51, 3826–3852. [Google Scholar] [CrossRef]
  2. Meng, X.; Shen, H.; Yuan, Q.; Li, H.; Zhang, L.; Sun, W. Pansharpening for cloud-contaminated very high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2840–2854. [Google Scholar] [CrossRef]
  3. Shen, H.; Wu, J.; Cheng, Q.; Aihemaiti, M.; Zhang, C.; Li, Z. A spatiotemporal fusion based cloud removal method for remote sensing images with land cover changes. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 862–874. [Google Scholar] [CrossRef]
  4. Zou, Z.; Shi, Z. Ship detection in spaceborne optical image with SVD networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 5832–5845. [Google Scholar] [CrossRef]
  5. Ma, A.; Wang, J.; Zhong, Y.; Zheng, Z. FactSeg: Foreground activation-driven small object semantic segmentation in large-scale remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5606216. [Google Scholar] [CrossRef]
  6. Chen, H.; Li, W.; Shi, Z. Adversarial instance augmentation for building change detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603216. [Google Scholar] [CrossRef]
  7. Guo, H.; Dou, C.; Chen, H.; Liu, J.; Fu, B.; Li, X.; Zou, Z.; Liang, D. SDGSAT-1: The world’s first scientific satellite for sustainable development goals. Sci. Bull. 2023, 68, 34–38. [Google Scholar] [CrossRef]
  8. Irish, R.R.; Barker, J.L.; Goward, S.N.; Arvidson, T. Characterization of the Landsat-7 ETM+ automated cloud-cover assessment (ACCA) algorithm. Photogramm. Eng. Remote Sens. 2006, 72, 1179–1188. [Google Scholar] [CrossRef]
  9. Zhu, Z.; Woodcock, C.E. Object-based cloud and cloud shadow detection in Landsat imagery. Remote Sens. Environ. 2012, 118, 83–94. [Google Scholar] [CrossRef]
  10. Luo, Y.; Trishchenko, A.P.; Khlopenkov, K.V. Developing clear-sky, cloud and cloud shadow mask for producing clear-sky composites at 250-meter spatial resolution for the seven MODIS land bands over Canada and North America. Remote Sens. Environ. 2008, 112, 4167–4185. [Google Scholar] [CrossRef]
  11. Liu, X.; Xu, J.; Du, B. Automatic Cloud Detection of GMS-5 Images with Dual-Channel Dynamic Thresholding. J. Appl. Meteorol. 2005, 16, 434–444. [Google Scholar]
  12. Ma, F.; Zhang, Q.; Guo, N.; Zhang, J. Research on cloud detection methods for multi-channel satellite cloud maps. Atmos. Sci. 2007, 31, 119–128. [Google Scholar]
  13. Zhu, Z.; Wang, S.; Woodcock, C.E. Improvement and expansion of the Fmask algorithm: Cloud, cloud shadow, and snow detection for Landsats 4–7, 8, and Sentinel 2 images. Remote Sens. Environ. 2015, 159, 269–277. [Google Scholar] [CrossRef]
  14. Dong, Z.; Sun, L.; Liu, X.; Wang, Y.; Liang, T. CDAG-Improved Algorithm and Its Application to GF-6 WFV Data Cloud Detection. Acta Opt. Sin 2020, 40, 143–152. [Google Scholar]
  15. Sun, L.; Mi, X.; Wei, J.; Wang, J.; Tian, X.; Yu, H.; Gan, P. A cloud detection algorithm-generating method for remote sensing data at visible to short-wave infrared wavelengths. ISPRS J. Photogramm. Remote Sens. 2017, 124, 70–88. [Google Scholar] [CrossRef]
  16. Mi, W.; Zhiqi, Z.; Zhipeng, D.; Shuying, J.; SU, H. Stream-computing based high accuracy on-board real-time cloud detection for high resolution optical satellite imagery. Acta Geod. Cartogr. Sin. 2018, 47, 760. [Google Scholar]
  17. Changmiao, H.; Zheng, Z.; Ping, T. Research on multispectral satellite image cloud and cloud shadow detection algorithm of domestic satellite. Natl. Remote Sens. Bull. 2023, 27, 623–634. [Google Scholar]
  18. Ge, K.; Liu, J.; Wang, F.; Chen, B.; Hu, Y. A Cloud Detection Method Based on Spectral and Gradient Features for SDGSAT-1 Multispectral Images. Remote Sens. 2022, 15, 24. [Google Scholar] [CrossRef]
  19. Xie, Y.; Ma, C.; Wan, G.; Chen, H.; Fu, B. Cloud detection using SDGSAT-1 thermal infrared data. In Proceedings of the Remote Sensing of Clouds and the Atmosphere XXVIII, SPIE, Amsterdam, The Netherlands, 5–6 September 2023; Volume 12730, pp. 77–82. [Google Scholar]
  20. Jiao, L.; Huo, L.; Hu, C.; Tang, P. Refined UNet v3: Efficient end-to-end patch-wise network for cloud and shadow segmentation with multi-channel spectral features. Neural Networks 2021, 143, 767–782. [Google Scholar] [CrossRef] [PubMed]
  21. Li, Z.; Shen, H.; Cheng, Q.; Liu, Y.; You, S.; He, Z. Deep learning based cloud detection for medium and high resolution remote sensing images of different sensors. ISPRS J. Photogramm. Remote Sens. 2019, 150, 197–212. [Google Scholar] [CrossRef]
  22. Shao, Z.; Pan, Y.; Diao, C.; Cai, J. Cloud detection in remote sensing images based on multiscale features-convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4062–4076. [Google Scholar] [CrossRef]
  23. Fan, X.; Chang, H.; Huo, L.; Hu, C. Gf-1/6 satellite pixel-by-pixel quality tagging algorithm. Remote Sens. 2023, 15, 1955. [Google Scholar] [CrossRef]
  24. Chang, H.; Fan, X.; Huo, L.; Hu, C. Improving Cloud Detection in WFV Images Onboard Chinese GF-1/6 Satellite. Remote Sens. 2023, 15, 5229. [Google Scholar] [CrossRef]
  25. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915. [Google Scholar]
  26. Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10551–10560. [Google Scholar]
  27. Wu, H.; Liu, J.; Xie, Y.; Qu, Y.; Ma, L. Knowledge transfer dehazing network for nonhomogeneous dehazing. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 478–479. [Google Scholar]
  28. Zhang, H.; Patel, V.M. Densely connected pyramid dehazing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3194–3203. [Google Scholar]
  29. Bai, H.; Pan, J.; Xiang, X.; Tang, J. Self-guided image dehazing using progressive feature fusion. IEEE Trans. Image Process. 2022, 31, 1217–1229. [Google Scholar] [CrossRef] [PubMed]
  30. Wang, C.; Shen, H.Z.; Fan, F.; Shao, M.W.; Yang, C.S.; Luo, J.C.; Deng, L.J. EAA-Net: A novel edge assisted attention network for single image dehazing. Knowl.-Based Syst. 2021, 228, 107279. [Google Scholar] [CrossRef]
  31. Cai, Z.; Ding, X.; Shen, Q.; Cao, X. Refconv: Re-parameterized refocusing convolution for powerful convnets. arXiv 2023, arXiv:2310.10563. [Google Scholar]
  32. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  33. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 933–941. [Google Scholar]
  34. Shazeer, N. Glu variants improve transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar]
  35. Narang, S.; Chung, H.W.; Tay, Y.; Fedus, W.; Fevry, T.; Matena, M.; Malkan, K.; Fiedel, N.; Shazeer, N.; Lan, Z.; et al. Do transformer modifications transfer across implementations and applications? arXiv 2021, arXiv:2102.11972. [Google Scholar]
  36. Hua, W.; Dai, Z.; Liu, H.; Le, Q. Transformer quality in linear time. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 9099–9117. [Google Scholar]
  37. Du, N.; Huang, Y.; Dai, A.M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A.W.; Firat, O.; et al. Glam: Efficient scaling of language models with mixture-of-experts. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 5547–5569. [Google Scholar]
  38. Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. Lamda: Language models for dialog applications. arXiv 2022, arXiv:2201.08239. [Google Scholar]
  39. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  40. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  41. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  42. Chen, L.C. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  43. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  44. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
  45. Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  46. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 3146–3154. [Google Scholar]
  47. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272. [Google Scholar]
  48. Yuan, W.; Wang, J.; Xu, W. Shift pooling PSPNet: Rethinking PSPNet for building extraction in remote sensing images from entire local feature pooling. Remote Sens. 2022, 14, 4889. [Google Scholar] [CrossRef]
  49. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  50. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  51. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. Segnext: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  52. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733. [Google Scholar]
  53. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703. [Google Scholar]
  54. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
Figure 1. Distribution of multispectral data used in the experiment.
Figure 2. Sample production flowchart.
Figure 3. Label category cross reference. We chose yellow and green to highlight the detection of clouds and cloud shadow, and used labels close to their own color for the other feature categories.
Figure 4. Illustration of the RDE-MSCA structure. Here, D, 5 × 5 denotes the use of a 5 × 5 kernel size deep convolution D. We use RepConv to extract multi-scale features and then use them as convolutional attention weights to weight the input features extracted by DEConv.
Figure 5. DEConv. It contains five convolutional layers deployed in parallel including the following: vanilla convolution (VC), centre difference convolution (CDC), angle difference convolution (ADC), horizontal difference convolution (HDC), and vertical difference convolution (VDC).
Figure 6. DEConv re-parameterization process.
Figure 7. The gated linear unit structure.
Figure 8. Re-parameterized Convolution.
Figure 9. The architecture of RDE-SegNeXt.
Figure 10. Model prediction visualization comparison. We randomly selected some images containing five categories of clouds, cloud shadow, water body, and land within the SDGSAT-1 image data for visualization results comparison on HRNet, Deeplabv3+, Mask2former, Poolformer, ConvNeXt series, Swin Transformer series, and SegNeXt series models.
Table 1. Band settings of domestic and foreign multispectral satellites.
| Band | Landsat-8 | Sentinel-2A | Sentinel-2B | GF-1 | GF-6 | SDGSAT-1 |
|---|---|---|---|---|---|---|
| Coastal Blue | 0.433–0.453 | 0.432–0.453 | 0.432–0.453 | — | 0.40–0.45 | 0.37–0.427, 0.41–0.467 |
| Blue | 0.450–0.515 | 0.459–0.525 | 0.459–0.525 | 0.45–0.52 | 0.45–0.52 | 0.457–0.529 |
| Green | 0.525–0.600 | 0.542–0.578 | 0.541–0.577 | 0.52–0.59 | 0.52–0.59 | 0.51–0.597 |
| Yellow | — | — | — | — | 0.59–0.63 | — |
| Red | 0.630–0.680 | 0.649–0.680 | 0.650–0.681 | 0.63–0.69 | 0.63–0.69 | 0.618–0.696 |
| Red Edge 1–3 | — | 0.697–0.712, 0.733–0.748, 0.773–0.793 | 0.696–0.712, 0.732–0.747, 0.770–0.790 | — | 0.69–0.73, 0.73–0.77 | 0.744–0.813 |
| NIR | 0.845–0.885 | 0.780–0.886 | 0.780–0.886 | 0.77–0.89 | 0.77–0.89 | 0.798–0.911 |
| Narrow NIR | — | 0.854–0.875 | 0.853–0.875 | — | — | — |
| Water vapor | — | 0.935–0.955 | 0.933–0.954 | — | — | — |
| Cirrus | 1.360–1.390 | 1.358–1.389 | 1.362–1.392 | — | — | — |
| SWIR 1 | 1.560–1.660 | 1.568–1.659 | 1.563–1.657 | — | — | — |
| SWIR 2 | 2.100–2.300 | 2.115–2.290 | 2.093–2.278 | — | — | — |
| TIRS 1 | 10.60–11.19 | — | — | — | — | — |
| TIRS 2 | 11.50–12.51 | — | — | — | — | — |

Band ranges are given in μm.
Table 2. SDGSAT-1 satellite technical indicators.
| Type | Item of Indicators | Specific Indicators |
|---|---|---|
| Orbit | Track Type | Solar synchronous orbit |
| Orbit | Orbital Altitude | 505 km |
| Orbit | Orbital Inclination | 97.5° |
| MII | Image Width | 300 km |
| MII | Detection Spectrum | B1: 374–427 nm; B2: 410–467 nm; B3: 457–529 nm; B4: 510–597 nm; B5: 618–696 nm; B6: 744–813 nm; B7: 798–811 nm |
| MII | Pixel Resolution | 10 m |
Table 3. The RDE-SegNeXt network architecture.
| Stage | Output Size | e.r. | Tiny | Small | Base | Large |
|---|---|---|---|---|---|---|
| 1 | H/4 × W/4 × C | 8 | C = 32, L = 3 | C = 64, L = 2 | C = 64, L = 3 | C = 64, L = 3 |
| 2 | H/8 × W/8 × C | 8 | C = 64, L = 3 | C = 128, L = 2 | C = 128, L = 3 | C = 128, L = 5 |
| 3 | H/16 × W/16 × C | 4 | C = 160, L = 5 | C = 320, L = 4 | C = 320, L = 12 | C = 320, L = 27 |
| 4 | H/32 × W/32 × C | 4 | C = 256, L = 2 | C = 512, L = 2 | C = 512, L = 3 | C = 512, L = 3 |
Table 4. Impact of various improvements on the effectiveness of the model.
| Methods | mFscore | mIoU | mAcc |
|---|---|---|---|
| SegNext-Tiny | 77.33 | 66.59 | 74.24 |
| SegNext-Tiny + RepConv | 78.03 | 67.50 | 75.15 |
| SegNext-Tiny + DEConv | 79.19 | 68.89 | 77.16 |
| SegNext-Tiny + RepConv + DEConv (ALL) | 79.63 | 69.25 | 78.00 |
| SegNext-Tiny + RepConv + RepConv (First) | 75.21 | 66.95 | 77.78 |
| SegNext-Tiny + RepConv + DEConv (First) | 81.77 | 71.85 | 81.83 |
Table 5. Performance comparison of different models in cloud detection tasks (overall metrics).
| Model | mFscore | aAcc | mIoU | mAcc |
|---|---|---|---|---|
| Hrnet [41] | 68.82 | 85.90 | 56.30 | 62.76 |
| Deeplabv3+ [42] | 67.07 | 84.32 | 54.42 | 65.31 |
| Mask2former [43] | 72.88 | 79.89 | 59.52 | 70.90 |
| PoolFormer [44] | 77.69 | 88.09 | 67.64 | 76.93 |
| BuildFormer [45] | 79.63 | 88.07 | 69.01 | 78.82 |
| DANet [46] | 73.34 | 85.26 | 62.58 | 70.39 |
| Segmenter [47] | 75.26 | 85.89 | 64.26 | 72.61 |
| SP-PSPNet [48] | 79.62 | 88.06 | 69.12 | 79.35 |
| ConvNeXt-T [49] | 72.53 | 84.69 | 61.02 | 68.92 |
| ConvNeXt-S [49] | 76.98 | 86.56 | 66.12 | 75.85 |
| ConvNeXt-B [49] | 78.55 | 87.17 | 67.98 | 77.06 |
| ConvNeXt-L [49] | 78.73 | 87.78 | 68.29 | 76.60 |
| Swin-T [50] | 77.57 | 87.47 | 66.61 | 73.74 |
| Swin-S [50] | 78.44 | 87.76 | 67.61 | 76.05 |
| Swin-B [50] | 78.74 | 87.07 | 67.40 | 75.47 |
| Swin-L [54] | 79.69 | 88.21 | 69.14 | 76.55 |
| SegNext-T [51] | 77.33 | 86.98 | 66.59 | 74.24 |
| SegNext-S [51] | 78.51 | 87.38 | 67.95 | 76.42 |
| SegNext-B [51] | 79.46 | 87.75 | 68.98 | 77.37 |
| SegNext-L [51] | 79.53 | 88.03 | 69.25 | 77.50 |
| RDE-SegNext | 81.77 | 89.03 | 71.85 | 81.83 |
Table 6. Performance comparison of different models in cloud detection tasks (category metrics: land, water, shadow, snow, cloud).
| Model | Land IoU (%) | Land Acc (%) | Water IoU (%) | Water Acc (%) | Shadow IoU (%) | Shadow Acc (%) | Snow IoU (%) | Snow Acc (%) | Cloud IoU (%) | Cloud Acc (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Hrnet | 80.8 | 95.31 | 30.76 | 31.44 | 25.38 | 28.76 | 63.19 | 68.47 | 81.37 | 89.83 |
| Deeplabv3+ | 75.22 | 87.96 | 25.83 | 36.20 | 25.83 | 30.95 | 61.97 | 77.92 | 83.25 | 93.80 |
| Mask2former | 71.57 | 93.75 | 67.75 | 85.46 | 25.48 | 29.30 | 68.01 | 77.93 | 64.80 | 68.07 |
| PoolFormer | 82.80 | 92.65 | 79.93 | 92.04 | 21.59 | 23.21 | 69.25 | 82.83 | 84.60 | 93.91 |
| BuildFormer | 82.31 | 93.06 | 77.05 | 88.95 | 29.91 | 36.75 | 70.58 | 84.78 | 85.19 | 90.57 |
| DANet | 70.91 | 94.94 | 75.17 | 76.15 | 14.73 | 16.21 | 64.08 | 79.51 | 79.90 | 85.15 |
| Segmenter | 80.23 | 93.85 | 77.24 | 81.45 | 20.19 | 24.14 | 62.55 | 76.15 | 81.08 | 87.48 |
| SP-PSPNet | 82.16 | 93.27 | 78.10 | 90.84 | 29.20 | 35.08 | 70.88 | 87.78 | 85.27 | 89.79 |
| ConvNeXt-T | 78.32 | 93.93 | 69.08 | 71.00 | 16.30 | 18.92 | 61.96 | 74.96 | 79.44 | 85.78 |
| ConvNeXt-S | 80.94 | 93.31 | 79.53 | 86.51 | 23.69 | 27.27 | 64.72 | 84.48 | 81.72 | 87.71 |
| ConvNeXt-B | 81.18 | 93.31 | 79.99 | 87.98 | 26.23 | 31.67 | 69.42 | 83.59 | 83.08 | 88.77 |
| ConvNeXt-L | 82.17 | 92.24 | 80.23 | 88.59 | 26.21 | 30.28 | 68.99 | 78.47 | 83.86 | 93.42 |
| Swin-T | 81.43 | 94.97 | 72.43 | 74.47 | 25.56 | 29.69 | 69.21 | 80.38 | 84.41 | 89.17 |
| Swin-S | 81.95 | 92.66 | 75.86 | 84.46 | 27.50 | 32.86 | 67.83 | 77.50 | 84.82 | 92.75 |
| Swin-B | 81.17 | 95.06 | 74.57 | 78.30 | 31.75 | 43.56 | 65.50 | 74.29 | 84.01 | 86.14 |
| Swin-L | 82.45 | 94.49 | 79.98 | 83.16 | 30.31 | 37.59 | 66.76 | 76.85 | 86.19 | 90.66 |
| SegNext-T | 80.76 | 95.07 | 76.16 | 80.08 | 23.64 | 27.60 | 69.12 | 81.08 | 83.27 | 87.37 |
| SegNext-S | 81.32 | 94.51 | 78.82 | 85.66 | 26.05 | 30.98 | 69.76 | 82.95 | 83.78 | 87.99 |
| SegNext-B | 81.78 | 94.61 | 80.30 | 87.35 | 27.10 | 31.69 | 70.33 | 83.88 | 84.07 | 88.14 |
| SegNext-L | 81.98 | 94.82 | 81.36 | 88.51 | 27.73 | 32.35 | 70.80 | 84.61 | 84.42 | 88.55 |
| RDE-SegNext | 83.04 | 95.29 | 82.14 | 89.64 | 30.05 | 34.47 | 73.39 | 85.02 | 86.07 | 91.71 |
Table 7. Inference time, computational complexity (FLOPs), and number of parameters for different models.
| Model | Time (s) | FLOPs (G) | Params (M) |
|---|---|---|---|
| Hrnet | 0.77 | 93.78 | 65.849 |
| Deeplabv3+ | 0.28 | 260.10 | 60.211 |
| Mask2former | 0.29 | 221.617 | 48.548 |
| PoolFormer | 0.11 | 30.741 | 15.651 |
| BuildFormer | 0.15 | 116.614 | 38.354 |
| DANet | 0.19 | 216.064 | 47.463 |
| Segmenter | 0.15 | 37.388 | 25.978 |
| SP-PSPNet | 0.17 | 181.248 | 46.585 |
| ConvNeXt-T | 0.21 | 239.616 | 59.245 |
| ConvNeXt-S | 0.26 | 262.144 | 80.880 |
| ConvNeXt-B | 0.28 | 299.008 | 123.904 |
| ConvNeXt-L | 0.40 | 401.408 | 238.592 |
| Swin-T | 0.32 | 241.66 | 58.944 |
| Swin-S | 0.45 | 266.24 | 80.262 |
| Swin-B | 0.53 | 305.15 | 122.88 |
| Swin-L | 0.78 | 416.77 | 237.568 |
| SegNext-T | 0.08 | 6.299 | 4.226 |
| SegNext-S | 0.12 | 13.303 | 13.551 |
| SegNext-B | 0.20 | 31.071 | 27.046 |
| SegNext-L | 0.45 | 64.736 | 47.989 |
| RDE-SegNext | 0.14 | 31.818 | 63.172 |
Table 8. Performance comparison of different models on 38-Cloud dataset.
| Model | IoU (%) | Acc (%) | Fscore (%) |
|---|---|---|---|
| Hrnet | 61.34 | 78.57 | 76.04 |
| Deeplabv3+ | 58.68 | 71.78 | 73.96 |
| Mask2former | 53.62 | 66.12 | 69.81 |
| PoolFormer | 63.80 | 71.16 | 77.90 |
| BuildFormer | 58.04 | 69.34 | 73.45 |
| DANet | 60.42 | 74.32 | 75.33 |
| Segmenter | 55.66 | 74.21 | 71.51 |
| SP-PSPNet | 60.57 | 75.04 | 75.44 |
| ConvNeXt-T | 26.79 | 28.84 | 42.25 |
| ConvNeXt-S | 46.79 | 53.44 | 63.75 |
| ConvNeXt-B | 52.86 | 63.72 | 69.17 |
| ConvNeXt-L | 61.39 | 82.21 | 76.08 |
| Swin-T | 55.26 | 65.16 | 71.18 |
| Swin-S | 57.94 | 71.08 | 73.37 |
| Swin-B | 61.72 | 74.55 | 76.33 |
| Swin-L | 63.01 | 81.99 | 77.31 |
| SegNext-T | 56.24 | 75.24 | 72.00 |
| SegNext-S | 56.64 | 77.90 | 72.32 |
| SegNext-B | 61.75 | 81.79 | 76.35 |
| SegNext-L | 64.52 | 81.35 | 78.44 |
| RDE-SegNext | 65.36 | 83.65 | 79.05 |
Table 9. Performance comparison of different models on LoveDA dataset (overall metrics).
| Model | mFscore | aAcc | mIoU | mAcc |
|---|---|---|---|---|
| Deeplabv3+ | 57.06 | 61.15 | 40.75 | 55.48 |
| Hrnet | 56.62 | 57.64 | 40.35 | 57.27 |
| Mask2former | 56.23 | 52.40 | 39.73 | 52.40 |
| PoolFormer | 62.78 | 67.12 | 46.67 | 61.48 |
| BuildFormer | 60.82 | 64.32 | 44.97 | 57.96 |
| DANet | 46.54 | 53.50 | 30.99 | 44.60 |
| Segmenter | 44.25 | 52.22 | 29.15 | 44.61 |
| SP-PSPNet | 61.86 | 65.34 | 45.86 | 58.73 |
| ConvNeXt-T | 47.52 | 51.61 | 31.91 | 47.38 |
| ConvNeXt-S | 47.71 | 53.37 | 32.35 | 48.37 |
| ConvNeXt-B | 50.41 | 54.99 | 34.61 | 50.38 |
| ConvNeXt-L | 52.78 | 57.10 | 36.98 | 52.35 |
| Swin-T | 56.29 | 60.13 | 39.78 | 53.55 |
| Swin-S | 56.47 | 61.65 | 40.14 | 53.68 |
| Swin-B | 57.29 | 61.22 | 40.76 | 54.77 |
| Swin-L | 58.91 | 63.13 | 42.44 | 55.32 |
| SegNext-T | 50.17 | 54.85 | 31.35 | 48.43 |
| SegNext-S | 51.62 | 55.91 | 32.90 | 49.26 |
| SegNext-B | 53.42 | 59.33 | 34.51 | 51.60 |
| SegNext-L | 55.48 | 60.45 | 36.12 | 52.97 |
| RDE-SegNext | 57.38 | 63.23 | 38.24 | 54.68 |
Table 10. Comparison of different class performance of different models on LoveDA dataset.
| Model | Background IoU (%) | Background Acc (%) | Building IoU (%) | Building Acc (%) | Road IoU (%) | Road Acc (%) | Water IoU (%) | Water Acc (%) | Barren IoU (%) | Barren Acc (%) | Forest IoU (%) | Forest Acc (%) | Agricultural IoU (%) | Agricultural Acc (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deeplabv3+ | 46.14 | 77.77 | 43.97 | 70.74 | 50.77 | 58.58 | 42.22 | 44.78 | 17.12 | 19.91 | 39.22 | 66.95 | 45.86 | 49.65 |
| Hrnet | 40.10 | 59.21 | 25.83 | 82.60 | 49.81 | 59.37 | 57.52 | 63.80 | 27.18 | 32.24 | 34.47 | 51.73 | 47.57 | 51.96 |
| Mask2former | 49.14 | 87.90 | 45.96 | 55.32 | 46.40 | 51.23 | 44.04 | 55.07 | 21.23 | 29.65 | 36.03 | 50.08 | 35.34 | 37.53 |
| PoolFormer | 51.35 | 74.50 | 57.08 | 77.44 | 48.41 | 53.13 | 54.49 | 87.31 | 22.73 | 24.91 | 40.08 | 54.01 | 52.52 | 59.03 |
| BuildFormer | 49.05 | 83.07 | 59.21 | 76.86 | 45.87 | 56.45 | 50.28 | 66.82 | 27.12 | 32.12 | 41.47 | 59.90 | 50.35 | 56.03 |
| DANet | 47.42 | 72.01 | 48.97 | 75.51 | 40.32 | 48.19 | 46.17 | 60.23 | 21.62 | 22.44 | 36.09 | 54.22 | 42.58 | 46.79 |
| Segmenter | 44.25 | 60.12 | 39.81 | 54.23 | 41.77 | 46.85 | 44.06 | 53.67 | 24.23 | 26.18 | 31.62 | 47.93 | 37.19 | 43.58 |
| SP-PSPNet | 51.78 | 75.46 | 55.25 | 70.93 | 44.52 | 53.28 | 48.92 | 59.18 | 27.79 | 30.22 | 42.03 | 58.45 | 49.65 | 53.88 |
| ConvNeXt-T | 47.52 | 51.61 | 31.91 | 47.38 | 50.77 | 58.58 | 42.22 | 44.78 | 17.12 | 19.91 | 39.22 | 66.95 | 45.86 | 49.65 |
| ConvNeXt-S | 47.71 | 53.37 | 32.35 | 48.37 | 49.81 | 59.37 | 57.52 | 63.80 | 27.18 | 32.24 | 34.47 | 51.73 | 47.57 | 51.96 |
| ConvNeXt-B | 50.41 | 54.99 | 34.61 | 50.38 | 46.40 | 51.23 | 44.04 | 55.07 | 21.23 | 29.65 | 36.03 | 50.08 | 35.34 | 37.53 |
| ConvNeXt-L | 52.78 | 57.10 | 36.98 | 52.35 | 48.41 | 53.13 | 54.49 | 87.31 | 22.73 | 24.91 | 40.08 | 54.01 | 52.52 | 59.03 |
| Swin-T | 56.29 | 60.13 | 39.78 | 53.55 | 50.77 | 58.58 | 42.22 | 44.78 | 17.12 | 19.91 | 39.22 | 66.95 | 45.86 | 49.65 |
| Swin-S | 56.47 | 61.65 | 40.14 | 53.68 | 49.81 | 59.37 | 57.52 | 63.80 | 27.18 | 32.24 | 34.47 | 51.73 | 47.57 | 51.96 |
| Swin-B | 57.29 | 61.22 | 40.76 | 54.77 | 46.40 | 51.23 | 44.04 | 55.07 | 21.23 | 29.65 | 36.03 | 50.08 | 35.34 | 37.53 |
| Swin-L | 58.91 | 63.13 | 42.44 | 55.32 | 48.41 | 53.13 | 54.49 | 87.31 | 22.73 | 24.91 | 40.08 | 54.01 | 52.52 | 59.03 |
| SegNext-T | 50.17 | 58.39 | 35.67 | 52.73 | 47.11 | 58.28 | 43.39 | 48.42 | 17.84 | 20.18 | 38.23 | 59.88 | 44.67 | 50.17 |
