Article

Physics-Guided Multi-Representation Learning with Quadruple Consistency Constraints for Robust Cloud Detection in Multi-Platform Remote Sensing

School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing 210044, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 2946; https://doi.org/10.3390/rs17172946
Submission received: 23 July 2025 / Revised: 20 August 2025 / Accepted: 21 August 2025 / Published: 25 August 2025

Abstract

With the rapid expansion of multi-platform remote sensing applications, cloud contamination significantly impedes cross-platform data utilization. Current cloud detection methods face critical technical challenges in cross-platform settings, including neglect of atmospheric radiative transfer mechanisms, inadequate multi-scale structural decoupling, high intra-class variability coupled with inter-class similarity, cloud boundary ambiguity, cross-modal feature inconsistency, and noise propagation in pseudo-labels within semi-supervised frameworks. To address these issues, we introduce a Physics-Guided Multi-Representation Network (PGMRN) that adopts a student–teacher architecture and fuses tri-modal representations—Pseudo-NDVI, structural, and textural features—via atmospheric priors and intrinsic image decomposition. Specifically, PGMRN first incorporates an InfoNCE contrastive loss to enhance intra-class compactness and inter-class discrimination while preserving physical consistency; subsequently, a boundary-aware regional adaptive weighted cross-entropy loss integrates PA-CAM confidence with distance transforms to refine edge accuracy; furthermore, an Uncertainty-Aware Quadruple Consistency Propagation (UAQCP) enforces alignment across structural, textural, RGB, and physical modalities; and finally, a dynamic confidence-screening mechanism that couples PA-CAM with information entropy and percentile-based thresholding robustly refines pseudo-labels. Extensive experiments on four benchmark datasets demonstrate that PGMRN achieves state-of-the-art performance, with Mean IoU values of 70.8% on TCDD, 79.0% on HRC_WHU, and 83.8% on SWIMSEG, outperforming existing methods.

1. Introduction

With the relentless advancement of earth observation technology, the capacity to acquire multi-platform remote sensing data has undergone exponential expansion in spatial, temporal, and spectral dimensions. The synergistic integration of satellite and airborne platforms enables a multidimensional observational matrix, furnishing rich complementary information for surface monitoring [1,2]. Intelligent fusion of such heterogeneous data is poised to substantially enhance accuracy and robustness in feature identification, particularly for mission-critical applications like cloud detection [3]. Nevertheless, pervasive cloud contamination imposes severe impediments to cross-platform data utilization, rendering vast quantities of valuable observations underutilized due to persistent occlusion or cumbersome preprocessing requirements [4]. This fundamentally constrains the potential of multi-platform data synergy.
Despite remarkable strides in deep learning-based semantic segmentation (e.g., DeepLabV3+ and SegFormer [5]), a critical annotation bottleneck persists in cloud detection: delineating cloud masks necessitates synthesizing spectral signatures (e.g., abrupt NIR reflectance attenuation) with textural morphology, elevating per-scene annotation costs to 4.2-fold higher than conventional tasks [6]. This compelling challenge necessitates semi-supervised learning paradigms. To transcend annotation constraints, semi-supervised learning has gained considerable traction in multimodal remote-sensing analytics, yielding incremental advances in pseudo-label refinement [7], contrastive representation learning [8], and cross-platform collaborative training [9].
Despite incremental advances in semi-supervised learning within multimodal remote sensing analytics [10], existing cloud detection methodologies continue to encounter substantial technical bottlenecks in cross-platform applications. Conventional approaches universally neglect atmospheric radiative transfer mechanisms [11] while simultaneously exhibiting inadequate capacity for decoupling multi-scale intrinsic cloud structures from RGB inputs, thereby constraining model generalizability. Under multi-platform environments, these limitations manifest more prominently: high intra-class variability and inter-class similarity challenges become increasingly complex across diverse imaging conditions, with spectral confusion between clouds and snow/bright white features proving more pervasive in cross-platform datasets; cloud boundary ambiguity and detail preservation deficiencies impede effective integration of complementary boundary information from multimodal data; cross-modal feature inconsistency issues restrict comprehensive exploitation of complementary information from multi-source remote sensing data; pseudo-label noise propagation problems become particularly acute in multi-platform semi-supervised learning frameworks [12], lacking reliable cross-platform confidence assessment mechanisms.
Specifically, high intra-class variability and inter-class similarity challenges become substantially more complex in multi-platform environments, where cloud imagery acquired from different platforms exhibits significant disparities in imaging conditions, spatial resolutions, and spectral responses, while spectral confusion phenomena between clouds and snow/bright white features become more prevalent across cross-platform datasets. Furthermore, cloud boundary ambiguity and detail preservation issues manifest prominently in multimodal data fusion scenarios, where traditional pixel-wise loss functions fail to effectively integrate complementary boundary information from disparate platforms, particularly demonstrating suboptimal cross-platform consistency performance in thin and fractured cloud regions [13]. Moreover, cross-modal feature inconsistency problems constrain collaborative utilization of multi-platform data, as single-modal features prove insufficient for adequately exploiting complementary information from multi-source remote sensing data, thereby limiting comprehensive understanding of multi-scale physical cloud properties. Finally, pseudo-label noise propagation issues become especially severe in multi-platform semi-supervised learning frameworks, where the absence of reliable cross-platform confidence assessment mechanisms leads to erroneous propagation of low-quality pseudo-labels, severely compromising model generalization performance across different platforms [14].
To comprehensively leverage complementary information from multi-platform multimodal remote sensing data and address the aforementioned critical challenges, this paper proposes a Physics-Guided Multi-Representation Network (PGMRN). The framework adheres to the principal pipeline of “physical a priori injection—multi-representation feature generation—consistency constraint optimization—hybrid supervised learning”, establishing tri-modal feature representations encompassing Pseudo-NDVI, structural, and textural components through deep fusion of atmospheric physical priors and image intrinsic decomposition. The framework employs a student-teacher dual-network architecture, integrating Physics-Augmented Class Activation Maps (PA-CAM) and Uncertainty-Aware Quadruple Consistency Propagation (UAQCP) mechanisms to establish an end-to-end comprehensive cloud detection system supporting cross-platform data fusion. Validation across ground-based all-sky imagers (SWINSEG/SWIMSEG), dual-platform datasets (TCDD), and high-resolution optical data (HRC_WHU) demonstrates the method’s efficacy in multi-platform environments.
The principal contributions of this work encompass four key aspects:
(1)
InfoNCE Contrastive Loss Mechanism: Addressing high intra-class variability and inter-class similarity challenges in multi-platform cloud detection, we propose an InfoNCE-based contrastive loss function. This mechanism constructs compact intra-class structures and well-separated inter-class configurations by drawing similar pixel features closer while pushing dissimilar feature centers apart, effectively enhancing discriminative capabilities across cross-platform data. The contrastive loss maintains consistency with physical priors (cloud regions exhibit spatial continuity), significantly improving robustness of multimodal feature representations.
(2)
Boundary-Aware Regional Adaptive Weighted Cross-Entropy Loss: To resolve cloud boundary ambiguity and detail preservation issues in multi-platform data fusion, we design a BARA-C-E Loss module that integrates PA-CAM confidence with Euclidean distance transforms. Through morphological operations for precise boundary extraction and adaptive weight assignment, this mechanism effectively balances categorical centers and edge information across different platform data, facilitating superior recovery of cross-modal edge features and addressing deficiencies of traditional pixel-wise losses in multi-platform boundary learning.
(3)
Uncertainty-Aware Quadruple Consistency Propagation (UAQCP): Targeting cross-modal feature inconsistency problems, we construct a UAQCP encompassing structural consistency, textural consistency, RGB consistency, and physical consistency. Structural consistency constraints compel student models to learn large-scale cloud structures conforming to atmospheric fluid dynamics principles; textural consistency constraints enhance capture capabilities for high-frequency oscillatory features at cloud edges driven by turbulent diffusion; RGB consistency and physical consistency, respectively, ensure cross-network consistency of multi-channel features and Pseudo-NDVI spectral responses, effectively resolving physics-violating issues such as “cloud-snow confusion” in multi-platform environments while comprehensively exploiting complementary information from multi-source remote sensing data.
(4)
PA-CAM and Information Entropy Dynamic Confidence Screening Mechanism: To address pseudo-label noise propagation issues in multi-platform semi-supervised learning, we propose the Physics-guided Activation Map Dynamic Confidence (PAMC) mechanism. This approach integrates PA-CAM physical confidence with information entropy from teacher network prediction probability distributions, implementing percentile-based dynamic threshold screening. The physics-probability dual filtering significantly enhances cross-platform pseudo-label accuracy, circumventing limitations of fixed thresholds under varying imaging conditions and ensuring selective propagation of high-quality pseudo-labels in multi-platform environments.

2. Related Work

2.1. Cloud Segmentation

Cloud segmentation constitutes a fundamental task in remote sensing image processing, designed to accurately separate cloud regions from remote sensing images to provide clean data for downstream applications such as surface parameter inversion and disaster monitoring. Traditional methods predominantly utilize spectral features and geometric constraints, including threshold methods that distinguish clouds by setting thresholds for visible and near-infrared bands [15], edge detection methods that extract cloud boundaries [16], and clustering algorithms based on pixel spectral features [17]. However, these approaches are vulnerable to interference from illumination changes and background factors, and their generalization ability is constrained by manually designed features.
Deep learning methods have emerged as the predominant approach for cloud segmentation by automatically learning sophisticated features. U-Net variants (e.g., ResU-Net, Attention U-Net) adopt encoder–decoder structures with attention mechanisms to efficiently concentrate on key cloud features [18,19]. Transformer-based methods utilize self-attention mechanisms to model long-range dependencies, effectively handling irregular cloud shapes and blurred boundaries [20]. Multi-spectral fusion methods integrate information from multiple bands (visible, near-infrared, short-wave infrared) to enhance segmentation of challenging cloud types such as thin and cirrus clouds [21]. Conversely, conventional approaches universally neglect atmospheric radiative transfer mechanisms while simultaneously exhibiting inadequate capacity for decoupling multi-scale intrinsic cloud structures from RGB inputs, thereby constraining model generalizability across diverse imaging conditions.

2.2. Mean-Teacher and Consistent Learning

The Mean-Teacher framework represents a classic semi-supervised learning methodology in which the teacher model is updated as an exponential moving average (EMA) of the student model, while the student learns from pseudo-labels to extract value from unlabeled data [22]. Consistency learning reduces overfitting by constraining model output consistency under input perturbations. However, the traditional framework’s stability is substantially dependent on EMA update rates, and pseudo-label noise can compromise effectiveness in complex tasks [23].
Recent advancements have significantly improved the Mean-Teacher framework for complex scenarios. The Stable Mean Teacher approach introduces Error Recovery (EoR) modules and Difference of Pixels (DoP) constraints for video action detection [24]. Other studies have applied Mean-Teacher with mutual consistency supervision to remote sensing semantic segmentation, efficiently leveraging unlabeled data [25]. However, pseudo-label noise propagation remains particularly acute in multi-platform semi-supervised learning frameworks, which lack reliable cross-platform confidence assessment mechanisms and therefore allow erroneous propagation of low-quality pseudo-labels.

2.3. Physics-Guided Remote Sensing

Physics-Guided Remote Sensing (PGRS) integrates physical modeling with machine learning to enhance accuracy and interpretability of remote sensing data processing. Traditional approaches include physical model-based inversion such as the PROSAIL model connecting leaf biochemical parameters with canopy spectra [26], and radiative transfer models (RTM) simulating electromagnetic wave transmission for surface parameter inversion [27]. Combined physics-machine learning methods use physical model outputs as inputs to SVMs and random forests, but are hindered by computational complexity and limited interpretability.
Recent physics-constrained deep learning approaches introduce physical constraints as regularization terms in neural networks [28]. Physically informed neural networks (PINNs) embed physical equations as soft constraints for parameter inversion, successfully integrating the advantages of data-driven and physical models [29]. Nevertheless, existing physics-constrained approaches face significant limitations in cross-modal feature inconsistency problems, where single-modal features prove insufficient for adequately exploiting complementary information from multi-source remote sensing data, thereby limiting comprehensive understanding of multi-scale physical cloud properties.

2.4. Multimodal Fusion

Multi-feature fusion represents a critical technique for integrating different feature representations to utilize complementary information and improve model performance. Traditional methods include early fusion (feature concatenation, summation), mid-level fusion (feature interaction, attention), and late fusion (decision voting, weighted averaging) [30]. Early fusion is simple but susceptible to noise, mid-level fusion captures relationships but exhibits high complexity, while late fusion is robust but neglects feature-level interactions.
Advanced fusion strategies emphasize feature alignment and sophisticated integration mechanisms. Cross-modal attention mechanisms highlight complementary information by modeling attention weights between different features [31]. Transformer-based methods utilize self-attention to capture long-distance dependencies for efficient fusion. Adaptive fusion methods dynamically adjust weights according to feature reliability, enhancing robustness to noisy inputs [32]. Notwithstanding the above advances, cloud boundary ambiguity and detail preservation deficiencies persist: traditional pixel-wise loss functions fail to effectively integrate complementary boundary information from disparate platforms, and they demonstrate particularly suboptimal cross-platform consistency in thin and fractured cloud regions.

3. Methods

3.1. Overall Framework

The Physics-Guided Multi-Representation Network (PGMRN) (shown in Figure 1) proposed in this paper follows the main pipeline of “physical a priori injection—multi-representation feature generation—consistency constraint optimization—hybrid supervised learning” and establishes a comprehensive cloud detection system to address the core problems of traditional cloud detection, including weak physical interpretability, data dependency, and insufficient robustness under label sparsity. The framework takes a single, easily accessible RGB image as input. The network is first pre-trained on labeled data to generate Physically Augmented Class Activation Maps (PA-CAM); unlabeled data are then processed by a multi-representation decomposition module to extract three features: Pseudo-NDVI, Structure, and Texture. The fundamental principle underlying this approach is the deep fusion of atmospheric physical priors and image feature decomposition. Subsequently, the generated tri-modal features and raw images are concurrently fed into the student–teacher network, while labeled images are processed through the student network. To address the problems of “cross-modal feature inconsistency” and “poor robustness in cloud-polluted areas”, the framework incorporates the UAQCP mechanism, which achieves multi-representation and cross-network feature consistency through dynamic consistency constraints utilizing PA-CAM and information entropy.

3.2. Physics-Guided Feature Generation (PGFG)

Traditional methods neglect the atmospheric radiative transfer mechanism, while simultaneously, the RGB input proves inadequate for decoupling the multi-scale intrinsic structure of clouds, thereby constraining the generalizability of the model. Nevertheless, the statistical results in Figure 2 demonstrate that the NDVI physical attributes differ from the structural and textural attributes of the image itself in two key statistics: mean and variance. Furthermore, distinct variations exist in the standard deviation patterns of the R, G, and B channels within the structural and textural attributes, which indicates the necessity to capture inter-channel information differences. We integrate the atmospheric radiation principles with image eigen-decomposition to construct interpretable and physically traceable multi-representation inputs, which provide a priori knowledge injection for deep learning, thereby overcoming the bottleneck of limited physical interpretability inherent in end-to-end models through the synergistic collaboration of physically interpretable operators (TV-L1 optimization) and deep learning.

3.2.1. Pseudo-NDVI

Cloud segmentation in RGB remote sensing imagery encounters a fundamental challenge—spectral confusion of clouds with snow/bright white features. The Normalized Difference Vegetation Index (NDVI) can significantly differentiate vegetation (high values) from clouds (low values), but consumer-grade drones or satellites are often equipped exclusively with RGB sensors. To incorporate this fundamental atmospheric physical principle, a cloud-sensitive spectral proxy utilizing solely RGB bands is required. To generate multi-representation content from a single input, we apply a linear RGB fusion method based on physical principles to model the interaction between cloud-sensitive bands in NDVI and construct an atmospheric response prior.
According to the radiative transfer mechanism established by Gitelson et al. (TGRS 2003) [33], clouds exhibit very low reflectance in the red band and significantly high reflectance in the green band, while vegetation demonstrates the opposite characteristics. Clouds exhibit distinct spectral signatures due to their optical properties. In contrast to vegetation indices that exploit chlorophyll absorption, our Pseudo-NDVI leverages the fundamental difference in cloud reflectance patterns: high reflectance in green bands due to Mie scattering from water droplets, and relatively lower reflectance in red bands due to absorption. This spectral contrast, $(R - G)/(R + G)$, where R and G represent red and green band reflectance, respectively, effectively discriminates clouds from clear sky and surface features based on atmospheric radiative transfer principles rather than vegetation characteristics. We use the physically based Green-Red Normalized Vegetation Index (GRNDVI) to approximate the NDVI, defined as:
$\mathrm{Pseudo\text{-}NDVI} = \dfrac{R - G}{R + G + \epsilon},$
where $\epsilon = 10^{-6}$, and R and G denote the red and green bands of the image, respectively. The Pseudo-NDVI response is strong for clouds with high optical thickness and spatial continuity, whereas in thin or broken cloud regions the Pseudo-NDVI values oscillate or even become negative against the background.
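As a concrete illustration, the following minimal Python sketch computes the Pseudo-NDVI proxy defined above from an RGB array; the function name and array layout (H × W × 3, channels ordered R, G, B) are illustrative assumptions rather than the paper’s implementation.

import numpy as np

def pseudo_ndvi(rgb, eps=1e-6):
    # Pseudo-NDVI = (R - G) / (R + G + eps), computed per pixel
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    return (r - g) / (r + g + eps)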

3.2.2. TV-L1 Multi-Representation Decomposition

Clouds inherently exhibit multi-scale physical properties, encompassing macroscale structures (S) such as coherent cloud clusters, microscale textures (T) such as turbulent diffusion at cloud boundaries, and noise (R) perturbations induced by photon scattering or sensor artifacts in optically thin clouds. Conventional RGB imagery fails to explicitly resolve these distinct components, necessitating mathematical decomposition frameworks to decouple intrinsic signatures. Within the variational image decomposition paradigm, the energy functional is formulated as [34]:
$\min_{S,T,R}\; \|\nabla S\|_{1} + \lambda \|T\|_{1} + \beta \|R\|_{1},$
where $\|\cdot\|_1$ denotes the L1 norm. $\|\nabla S\|_1$ is the total variation of the structural part S, which measures the edge information or local variation in S, while $\|T\|_1$ measures the detailed (textural) part of the image. $\lambda$ is the regularization parameter, and the balance parameter $\beta$ controls the trade-off between the noise fidelity term and the regularization terms. The Chambolle-Pock primal-dual algorithm [35] is used for the iterative solution.
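For readers who want to prototype the decomposition, the sketch below approximates the S/T/R split with scikit-image’s TV (Chambolle) denoiser applied at two smoothing strengths, rather than the exact TV-L1 energy and Chambolle-Pock solver used in the paper; the function name and weight values are illustrative assumptions.

import numpy as np
from skimage.restoration import denoise_tv_chambolle

def tv_decompose(rgb, w_struct=0.15, w_detail=0.03):
    # Approximate structure/texture/residual split of an RGB image in [0, 1]:
    #   S: strong TV smoothing            -> large-scale cloud structure
    #   T: light minus strong smoothing   -> boundary texture / detail
    #   R: remaining high-frequency part  -> residual noise
    img = rgb.astype(np.float64)
    S = denoise_tv_chambolle(img, weight=w_struct, channel_axis=-1)   # scikit-image >= 0.19
    light = denoise_tv_chambolle(img, weight=w_detail, channel_axis=-1)
    T = light - S
    R = img - light
    return S, T, R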
Pseudo-NDVI requires only RGB bands, which ensures compatibility across diverse imaging platforms including consumer UAVs, ground-based sky imagers, and satellite systems, while TV-L1 decomposition is resolution-agnostic and supports cross-platform data fusion from different sensing modalities. This design addresses the fundamental challenge of multi-platform remote sensing where different systems provide varying spectral configurations and spatial resolutions, making our approach truly platform-independent.

3.2.3. Physics-Augmented Class Activation Map (PA-CAM)

Structural integrity constitutes a mission-critical attribute in remote sensing imagery. Whereas traditional CAM solely relies on RGB-derived deep features, it neglects atmospheric physical responses of clouds. Concurrently, existing confidence estimation overlooks complementary information across multi-representation features and lacks adaptability to cross-platform heterogeneity. To address these dual limitations, we propose Physics-Augmented CAM (PA-CAM), which integrates TV-L1-derived physical components with UNet deep features to establish a physics-data dual-driven confidence framework.
As shown in Figure 3, the input RGB image is decomposed into its structural component (S) by TV-L1 regularization to obtain the 3-channel physics-guided complementary input $I_{structure}$, which is fed into the pre-trained UNet model. The convolutional feature F is extracted from the last decoder layer, which contains the richest semantic information, and is up-sampled to the input image size by bilinear interpolation. The Channel-wise Increase of Confidence (CIC) [36] for the activation map is defined as:
$\mathrm{CIC}_c = f\!\left(X \odot F_c\right) - f(X),$
where $f(\cdot)$ denotes the output of the pre-trained UNet network and $\odot$ is the Hadamard (element-wise) product. We use the CIC scores as channel weights, and the Physics-Augmented CAM is defined as:
$\mathrm{PA\text{-}CAM} = \mathrm{ReLU}\!\left(\sum_{c=1}^{C} \mathrm{CIC}_c \, F_c\right).$
Specific implementation steps are given in Algorithm 1. The physics-augmented CAM effectively exploits the atmospheric physical response prior: it not only improves confidence estimation accuracy but also establishes a new paradigm for fusing physical laws with deep features.
Algorithm 1 Physics-Augmented CAM algorithm
Require: Structure image X, model f(·)
Ensure: L_PA-CAM
Initialization;
// get the last-layer decoder activation F and the logit;
F, logit ← f(X)
predict_class ← sigmoid(logit)
Cloud_mask ← (predict_class ≥ 0.5)
Sky_mask ← (predict_class < 0.5)
M ← [ ]
C ← the number of channels in F
for c in [0, …, C − 1] do
    M_c ← F_c
    // normalize the activation map;
    M_c ← s(M_c)
    // Hadamard product with the input;
    M.append(M_c ⊙ X)
end for
M ← Batchify(M)
// f_c(·) is the logit of the class;
CIC_c ← f_c(M)
CIC_c ← Sigmoid(CIC_c) ⊙ Cloud_mask
L_PA-CAM ← ReLU(Σ_c CIC_c · F_c)
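A PyTorch-style sketch of Algorithm 1 is given below. It assumes a hypothetical model that returns (decoder_features, logits) for an input tensor, and it collapses the per-channel CIC map to a scalar weight for simplicity; these names, shapes, and the simplification are assumptions, not the paper’s exact implementation.

import torch
import torch.nn.functional as F_nn

@torch.no_grad()
def pa_cam(model, x_structure):
    # x_structure: (B, 3, H, W) TV-L1 structural input; model is assumed to return
    # (decoder_features (B, C, h, w), logits (B, 1, H, W)).
    feats, logits = model(x_structure)
    B, C, _, _ = feats.shape
    H, W = x_structure.shape[-2:]
    cloud_mask = (torch.sigmoid(logits) >= 0.5).float()

    # upsample and min-max normalize each activation channel
    acts = F_nn.interpolate(feats, size=(H, W), mode="bilinear", align_corners=False)
    a_min = acts.amin(dim=(2, 3), keepdim=True)
    a_max = acts.amax(dim=(2, 3), keepdim=True)
    acts_n = (acts - a_min) / (a_max - a_min + 1e-6)

    cam = torch.zeros(B, 1, H, W, device=x_structure.device)
    for c in range(C):
        masked = x_structure * acts_n[:, c:c + 1]           # Hadamard product with the input
        _, logit_c = model(masked)
        cic = torch.sigmoid(logit_c - logits) * cloud_mask   # channel-wise increase of confidence
        weight = cic.mean(dim=(2, 3), keepdim=True)          # collapse the CIC map to a scalar weight
        cam += weight * acts[:, c:c + 1]
    return torch.relu(cam)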

3.3. Hybrid Supervision with Prior Knowledge

3.3.1. Contrastive Loss

Cloud detection confronts significant challenges from high intra-class variability and inter-class similarity [37]. Traditional pixel-wise losses prove inadequate for learning discriminative representations. We therefore employ an InfoNCE contrastive loss [38] to enhance intra-class consistency and inter-class discrimination. We first obtain the cluster centers $c_b^0$ and $c_b^1$ of the two pixel classes; for a pixel feature $f_{b,i}$ belonging to class c, the loss is defined as the negative log-probability of the feature’s similarity to its own class center relative to the similarities to the centers of all classes, with the temperature coefficient $\tau$ controlling the degree of distributional sharpening. The final loss is the average of the two classes of pixel losses over all valid images, defined as:
$\mathcal{L}_{contrastive} = \frac{1}{|B|}\sum_{b \in B}\left(\frac{1}{N_b^0}\sum_{i=1}^{N_b^0} l_{b,i}^0 + \frac{1}{N_b^1}\sum_{j=1}^{N_b^1} l_{b,j}^1\right),$
where the loss for a single pixel l is defined as:
$l_{b,i}^c = -\log \dfrac{\exp\!\left(\langle f_{b,i}, c_b^c\rangle \,/\, (\tau\,\|f_{b,i}\|\,\|c_b^c\|)\right)}{\sum_{k \in \{0,1\}} \exp\!\left(\langle f_{b,i}, c_b^k\rangle \,/\, (\tau\,\|f_{b,i}\|\,\|c_b^k\|)\right)},$
where B denotes the set of images in the batch, $N_b^c$ denotes the number of pixels of category c in image b, and $f_{b,i}$ and $c_b^c$ denote the feature vector of pixel i in image b and the center vector of category c, respectively. $\langle\cdot,\cdot\rangle$ is the vector dot product, and $\|\cdot\|$ denotes the L2 norm. The temperature hyperparameter $\tau$ is set to 0.1 in the experiments.
The loss utilizes labeled data to construct a compact intra-class and well-separated inter-class structure in the feature space by pulling similar pixel features closer together and pushing dissimilar feature centers farther apart. This optimization is consistent with the physical prior (cloud regions have spatial continuity) and complements the consistency constraints, jointly enhancing the robustness of the Mean-Teacher framework.
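A minimal PyTorch sketch of this class-center InfoNCE loss for a single image is shown below; the function name, the assumption that embeddings arrive as a (C, H, W) tensor with a binary (H, W) label mask, and the skip rule for images missing a class are illustrative choices, not the paper’s code.

import torch
import torch.nn.functional as F_nn

def pixel_contrastive_loss(features, labels, tau=0.1):
    # features: (C, H, W) pixel embeddings; labels: (H, W) binary mask in {0, 1}
    C = features.shape[0]
    feats = features.permute(1, 2, 0).reshape(-1, C)    # (HW, C)
    labs = labels.reshape(-1)
    if (labs == 0).sum() == 0 or (labs == 1).sum() == 0:
        return feats.new_zeros(())                      # skip images missing one class

    # class centers and cosine similarity of every pixel to both centers
    centers = torch.stack([feats[labs == k].mean(dim=0) for k in (0, 1)])   # (2, C)
    sim = F_nn.cosine_similarity(feats.unsqueeze(1), centers.unsqueeze(0), dim=-1) / tau
    log_prob = F_nn.log_softmax(sim, dim=1)             # (HW, 2)

    # average the negative log-probability of each pixel's own class, per class
    loss = 0.0
    for k in (0, 1):
        loss = loss - log_prob[labs == k, k].mean()
    return loss / 2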
Figure 4 delineates the region-adaptive weight generation workflow, centered on boundary-aware optimization. The BARA-C-E Loss module (Figure 4d,e) transforms ground-truth boundary geometry into quantifiable weight assignments, thereby resolving the pervasive issue of boundary neglect in conventional models. Through synergistic integration of PA-CAM-derived physical confidence and Euclidean distance transforms, it delivers pixel-precise weighting guidance for edge-sensitive segmentation.

3.3.2. Boundary-Aware Regional Adaptive Weighted Cross-Entropy Loss (BARA-C-E Loss)

To enhance the model’s learning of cloud boundaries, we propose a regionally adaptive edge-weighted loss, which centers on extracting precise boundaries and assigning adaptive weights through morphological operations. The boundary-aware regional adaptive weighted cross-entropy loss is computed by assigning a regional weight to each pixel:
$\mathcal{L}_{region} = \frac{1}{|\Omega|}\sum_{(i,j) \in \Omega} \omega_{i,j}\, \mathrm{CE}\!\left(y_{i,j}, \hat{y}_{i,j}\right),$
where $|\Omega|$ is the total number of pixels in the image, $y_{i,j}$ is the true label, $\hat{y}_{i,j}$ is the probabilistic prediction of the model, and $\mathrm{CE}(\cdot)$ is the standard cross-entropy function. The weighted cross-entropy loss is calculated by combining the weight map with the predicted probability map, where the weights $\omega_{i,j}$ are defined as:
$\omega_{i,j} = \dfrac{\mathrm{PA\text{-}CAM}_{i,j} + \omega_{edge}(i,j)}{2},$
where
$\omega_0(i,j) = e^{-d(i,j)},$
$\omega_{edge}(i,j) = \dfrac{\omega_0(i,j) - \omega_{MIN}}{\omega_{MAX} - \omega_{MIN}},$
where $d(i,j)$ is the Euclidean distance from the current pixel to the nearest boundary, and $\omega_{MAX}$ and $\omega_{MIN}$ are the maximum and minimum weights, respectively. Equation (10) normalizes $\omega_{edge}$ (when $\omega_{edge}(i,j) < 0.1$, it is set to 0.1). The region-adaptive cross-entropy loss $\mathcal{L}_{region}$ weighted by PA-CAM (as in Figure 4) effectively balances category-center and edge information and helps the model better recover edge features.
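The sketch below illustrates how such a boundary-aware weight map could be built with SciPy, following the reconstruction above (exponential decay of the boundary distance, min-max normalization, a floor of 0.1, then averaging with PA-CAM); the function names and the morphological-gradient boundary extraction are illustrative assumptions.

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion, distance_transform_edt

def edge_weight_map(gt_mask, floor=0.1):
    # gt_mask: boolean ground-truth cloud mask
    # boundary = morphological gradient of the mask
    boundary = binary_dilation(gt_mask) ^ binary_erosion(gt_mask)
    d = distance_transform_edt(~boundary)               # distance to the nearest boundary pixel
    w0 = np.exp(-d)
    w_edge = (w0 - w0.min()) / (w0.max() - w0.min() + 1e-6)
    return np.clip(w_edge, floor, None)                 # clamp small weights to 0.1

def bara_weights(pa_cam, gt_mask):
    # per-pixel weight: average of the PA-CAM confidence and the edge weight
    return (pa_cam + edge_weight_map(gt_mask)) / 2.0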

3.4. Uncertainty Aware Propagation

3.4.1. Dynamic Confidence Level of PA-CAM

The intrinsically low prior probability of cloud pixels coupled with high background confusion renders fixed-threshold methods ineffective. To overcome deficiencies in traditional pseudo-label confidence estimation (e.g., physical agnosticism, static thresholds), we propose Physics-guided Activation Map Dynamic Confidence (PAMC). This mechanism synthesizes physical attributes with deep features while incorporating information entropy, yielding cloud-specific confidence estimates. Specifically, PA-CAM localizes physically significant regions, while dynamic thresholding selectively filters noise-corrupted pseudo-labels. The confidence metric is formalized as:
$\omega_{PAMC} = \mathbb{1}\!\left(H_t < \lambda\right) \cdot M(i,j),$
where λ is a predefined confidence threshold to filter noisy labels. H t represents the information entropy of the probability distribution predicted by the teacher model.
$H_t = -\sum_i \left[F_t(x_u)\right]_i \log \left[F_t(x_u)\right]_i,$
where i denotes the i-th class and M is the confidence mask, defined as:
$M(i,j) = \begin{cases} 1, & \text{if } \mathrm{PAMC}(i,j) \geq T_p \\ 0, & \text{otherwise,} \end{cases}$
where
$T_p = \mathrm{sorted}_{\max}(\mathrm{PAMC})\left[\,p \cdot \mathrm{num\_pixel}\,\right],$
i.e., the PAMC score at the p-th position of the values sorted in descending order (from the maximum).
PA-CAM fuses the spectral, textural, and other physical characteristics of clouds, so its confidence estimate is closer to the true cloud distribution. Meanwhile, the dynamic adaptive threshold, computed per sample from the PA-CAM distribution, avoids the one-size-fits-all behavior of a fixed threshold, and the probabilistic-physical dual filtering, which combines the model’s predictive probabilities with the physical-feature heat map, significantly improves pseudo-label accuracy.
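A minimal PyTorch sketch of this dynamic confidence screening is shown below; the threshold values lam and p are illustrative placeholders, and the single-image input shapes are assumptions rather than the paper’s configuration.

import torch

def pamc_confidence(pa_cam, teacher_probs, lam=0.7, p=0.8):
    # pa_cam: (H, W) physics-augmented activation map
    # teacher_probs: (K, H, W) softmax output of the teacher network
    # information entropy of the teacher prediction at each pixel
    h_t = -(teacher_probs * torch.log(teacher_probs.clamp_min(1e-8))).sum(dim=0)

    # percentile-based dynamic threshold over the PA-CAM scores
    flat = pa_cam.flatten().sort(descending=True).values
    idx = min(int(p * flat.numel()), flat.numel() - 1)
    t_p = flat[idx]

    m = (pa_cam >= t_p).float()          # confidence mask
    return (h_t < lam).float() * m       # physics-probability dual filtering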

3.4.2. Uncertainty-Aware Quadruple Consistency Propagation (UAQCP)

To enhance the efficacy of the model, we design a cross-modal consistency constraint mechanism and reduce the probability of uncertainty propagation through dynamic confidence screening with PA-CAM. The quadruple consistency constraint loss is defined as:
$\mathcal{L}_{con} = \mathcal{L}_{struct} + \mathcal{L}_{texture} + \mathcal{L}_{RGB} + \mathcal{L}_{phy}.$
The macroscopic morphology of clouds follows the laws of atmospheric hydrodynamics and exhibits piecewise-smooth structural properties. In the Mean-Teacher framework, the structural prediction of the teacher model is more stable, while the student model is susceptible to noise interference. Constraining the consistency of the S-component predictions therefore forces the student model to learn large-scale cloud structures that are more consistent with physical laws and avoids cloud fragmentation caused by local noise. The structural consistency loss is:
$\mathcal{L}_{struct} = \sum_{H,W} \omega_{PAMC} \cdot \left\| P_s^S - P_t^S \right\|_2^2,$
where $P_s^S$ and $P_t^S$ denote the probability distributions of structural features predicted by the student and teacher networks, respectively. Cloud edges are driven by turbulent diffusion and exhibit high-frequency oscillating texture characteristics, whereas traditional consistency constraints focus on the overall prediction and neglect the consistency of edge details. Constraining the consistency of the T component enhances the student model’s ability to capture cloud edges and alleviates the common “edge blurring” problem in cloud segmentation. It is defined as:
$\mathcal{L}_{texture} = \sum_{H,W} \omega_{PAMC} \cdot \left\| P_s^T - P_t^T \right\|_2^2,$
where $P_s^T$ and $P_t^T$ denote the probability distributions of texture features predicted by the student and teacher networks, respectively. Low-confidence pixels in labeled data (e.g., mixed regions of thin cloud and surface) introduce noise and reduce the effectiveness of the consistency constraints. PAMC dynamically filters high-confidence pixels by fusing multi-scale features to generate pixel-level confidence maps; weighting the RGB consistency loss with these confidence maps focuses the constraint on regions where the teacher model’s predictions are more reliable and avoids noise propagation. It is defined as:
$\mathcal{L}_{RGB} = \sum_{H,W} \omega_{PAMC} \cdot \left\| P_s^{RGB} - P_t^{RGB} \right\|_2^2,$
where $P_s^{RGB}$ and $P_t^{RGB}$ denote the RGB-channel features extracted by the student and teacher networks, respectively. The Pseudo-NDVI channel simulates the spectral response of clouds. According to atmospheric physical laws, the Pseudo-NDVI values in cloud-covered areas should have a low correlation with structural features. Constraining the consistency of the Pseudo-NDVI features between the student and teacher models forces the model to learn segmentation results that comply with physical laws and mitigates the problem of “confusing clouds with snow/whitish features”. The physical consistency is defined as:
$\mathcal{L}_{phy} = \sum_{H,W} \omega_{PAMC} \cdot \left\| \mathrm{MSE}\!\left(\mathrm{NDVI}_t, \mathrm{NDVI}_s\right) \right\|_2^2,$
where $\mathrm{MSE}(\cdot)$ denotes the mean square error. Finally, the total loss is defined as:
$\mathcal{L} = \mathcal{L}_{region} + \beta_1 \mathcal{L}_{con} + \beta_2 \mathcal{L}_{contrast},$
where $\beta_1$ and $\beta_2$ are balancing hyperparameters.
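The following PyTorch sketch illustrates how the PAMC-weighted quadruple consistency term could be assembled; the dictionary interface for per-modality predictions, the tensor shapes, and the use of detached teacher outputs are illustrative assumptions rather than the paper’s implementation.

import torch

def uaqcp_loss(student_out, teacher_out, w_pamc):
    # student_out / teacher_out: dicts with keys 'struct', 'texture', 'rgb', 'ndvi',
    # each a (B, K, H, W) prediction map; w_pamc: (B, 1, H, W) PAMC confidence weights.
    def weighted_sq_diff(p_s, p_t):
        return (w_pamc * (p_s - p_t.detach()).pow(2)).sum(dim=(1, 2, 3)).mean()

    l_struct = weighted_sq_diff(student_out["struct"], teacher_out["struct"])
    l_texture = weighted_sq_diff(student_out["texture"], teacher_out["texture"])
    l_rgb = weighted_sq_diff(student_out["rgb"], teacher_out["rgb"])
    l_phy = weighted_sq_diff(student_out["ndvi"], teacher_out["ndvi"])
    return l_struct + l_texture + l_rgb + l_phy

# total objective, using the balancing weights reported in the ablation study:
# L = L_region + 0.15 * L_con + 0.2 * L_contrast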

4. Experiments

4.1. Datasets

We conducted experiments on four datasets—SWIMSEG/SWINSEG [39], TCDD [40], and HRC_WHU [41]—that were strategically chosen to validate our Physics-Guided Multi-Representation Network (PGMRN) across divergent platform characteristics and, consequently, to demonstrate genuine multi-platform capability.
SWIMSEG/SWINSEG: This dataset comprises 1128 hemispherical sky images acquired via ground-based WAHRSIS all-sky imagers in Singapore (2016), featuring meteorologist-validated binary cloud masks. The SWIMSEG subset contains 1013 sky/cloud patch images, while SWINSEG contains 115 images. These datasets target unimodal RGB segmentation under extreme fisheye distortion and diurnal radiative shifts. Dataset partitioning follows established protocols, employing a semi-supervised learning strategy where 1/10 of the training set is allocated for supervised training. Specifically, 101 images constitute the labeled training set, 913 images form the unlabeled training set, and 114 images serve as the test set. All images maintain an original resolution of 500 × 500 pixels and are uniformly resized to 224 × 224 pixels via bilinear interpolation during training.
TCDD: This comprehensive dataset contains 2300 ground-based cloud images spanning 80 geographic regions across nine Chinese provinces (2019–2022), collaboratively annotated by meteorologists at 512 × 512 resolution. Following the semi-supervised learning framework, 187 images are designated for labeled data training, 1687 images for unlabeled data training, and 426 images for testing validation. Images are resized from their original 512 × 512 pixel resolution to 224 × 224 pixels through interpolation methods during training.
HRC_WHU: This high-resolution dataset encompasses 150 scenes (0.5–15 m GSD) covering five global biomes (water/vegetation/urban/snow/barren), with expert-derived cloud masks for fine-boundary assessment. The dataset configuration allocates 12 images for labeled training, 108 images for unlabeled training, and 30 images for testing evaluation. Original images with 720 × 1280 pixel resolution are preprocessed to the standard 224 × 224 pixel input size through interpolation methods prior to model training.
This multi-platform testbed orchestrates a systematic validation continuum: SWIMSEG/SWINSEG first subjects TV-L1 decomposition to stringent geometric stress testing through severe radial distortion and pronounced vignetting; subsequently, TCDD probes the extrapolation limits of physics-encoded priors by furnishing highly heterogeneous atmospheric states spanning diverse radiative-transfer regimes; finally, HRC_WHU presents subtle spectral confounders—most notably cloud–snow ambiguity—thereby demanding rigorous appraisal of pseudo-NDVI discriminability while simultaneously benchmarking boundary-localization precision. Collectively, these datasets forge an unbroken gradient from terrestrial to orbital vantage points, enabling exhaustive evaluation of geometric fidelity, atmospheric generalizability, spectral discriminative power, and spatial scalability, all rigorously harmonized with our physics-guided architecture.

4.2. Implementation Details

This method is implemented in Python 3.10 with the torch 2.4.0+cu124 framework and runs on hardware equipped with an NVIDIA RTX A6000 (48 GB) GPU (NVIDIA Corporation, Santa Clara, CA, USA). The network architecture uses U-Net as the encoder–decoder backbone and connects to a shallow fully connected network as the segmentation head. The training process uses the Adam optimizer with a batch size of 16 and an initial learning rate of $1 \times 10^{-4}$. During the pre-training phase, image augmentation techniques such as color jittering and Gaussian noise are employed; in the semi-supervised training phase, spatial transformation augmentation strategies including random rotation, translation, and horizontal/vertical flipping are introduced.

4.3. Comparison Experiments

To demonstrate the efficacy of the Physics-Guided Multi-Representation Network (PGMRN) developed herein, we conduct comprehensive comparisons with representative semi-supervised segmentation methods: U2PL [42], BHPC [43], UCMT [44], FixMatch [45], UniMatch [46], SoftMatch [47], and CorrMatch [48]. The experiments focus on three critical dimensions: (1) physics-guided modal generation effectiveness across heterogeneous platform architectures, (2) quadruple consistency constraint performance under varying imaging conditions, and (3) cross-platform transferability when deploying across fundamentally disparate sensing modalities (ground-based fisheye systems, standard visual sensors, and high-resolution satellite/aerial platforms). This experimental design rigorously validates the multi-platform capabilities articulated in our contributions by demonstrating consistent performance enhancements across heterogeneous sensing systems, while also highlighting core benefits such as the gain from physics-guided modal generation and the validity of the UAQCP.
Experimental results on the TCDD dataset demonstrate PGMRN’s robust advantages across three characteristic scenarios. For sporadically distributed small cloud fragments (Figure 5, row 1), the PGFG module leverages TV-L1 decomposition to disambiguate multi-scale cloud properties into distinct macroscopic structures (S), boundary textures (T), and residual noise (R). This decomposition furnishes isolated and purified structural representations for small targets, effectively eliminating spectral aliasing between targets and background inherent in conventional RGB inputs. Simultaneously, the pseudo-NDVI derived from atmospheric radiative transfer principles—expressed as (R − G)/(R + G)—endows small cloud clusters with unique spectral fingerprints, decisively resolving spectral ambiguities between clouds and snow/bright surfaces. By contrast, baseline methods exhibit significant limitations: U2PL relies exclusively on entropy thresholds to ascertain pixel reliability, frequently misclassifying high-entropy small clouds as unreliable pixels subject to removal. Its negative sample mining further suffers from inadequate spatial continuity constraints, resulting in detection omissions. BHPC’s image-level hard positive sample contrast prioritizes feature similarity but lacks physical priors, consequently confounding highly reflective surfaces and introducing spurious labels. Quantitatively, PGMRN achieves a recall of 85.4% in this scenario (Table 1), exceeding the suboptimal baseline CorrMatch by 3.6%, thereby validating substantial improvements in small target detection.
Transitioning to semi-transparent thin clouds (Figure 5, rows 2–3), PGMRN’s BARA-C-E Loss mechanism synergistically couples Euclidean distance transforms with morphological weighting to assign elevated importance to thin cloud edges, compelling the model to concentrate on marginal features. The structural consistency term (L_struct) within the quadruple consistency framework further ensures the student network acquires smooth structures conforming to atmospheric fluid dynamics, preventing fragmentation in thin cloud regions. Conversely, while UCMT’s tri-network collaboration generates predictive divergence, its UMIX operation merely substitutes pixels based on uncertainty without physical constraints, yielding segmentation outputs that violate atmospheric principles in thin cloud areas. FixMatch’s weak-to-strong consistency employs a fixed threshold, excessively suppressing low-confidence thin clouds and incurring significant information loss. Experiments confirm PGMRN attains a mean Intersection-over-Union (MIoU) of 70.8% for thin clouds, surpassing UCMT by 3.5%, while reducing the error rate to 10.5%, representing a 1.4% improvement over CorrMatch.
For morphologically complex cloud formations (Figure 5, row 4), the InfoNCE contrastive loss constructs compact intra-class clusters while maximizing inter-class separation in the latent space, effectively accommodating high intra-class cloud variability. The Physics-Aware Class Activation Module (PA-CAM) integrates TV-L1 structural components with deep UNet features to deliver physically interpretable confidence via Class-specific Interpretable Confidence (CIC), thereby generating reliable spatial attention for intricate cloud morphologies. In contrast, UniMatch’s DualPerturb lacks physical regularizations, impairing its capacity to preserve complex cloud geometries. SoftMatch’s truncated Gaussian weighting mitigates rigid thresholding limitations but depends on domain-agnostic parameter estimation, exhibiting insufficient adaptability to irregular cloud forms. Ultimately, PGMRN achieves an F1-score of 80.0% in this scenario, outperforming the nearest competitor by 2.3%, confirming its efficacy in processing complex morphologies.
On the HRC_WHU high-resolution aerial imagery, PGMRN’s proficiency in fine thin cloud segmentation (Figure 6, row 1) primarily stems from the atmospheric radiative transfer-inspired pseudo-NDVI $(R - G)/(R + G + \epsilon)$ proposed by Gitelson et al., which establishes a robust spectral demarcation between thin clouds and terrestrial surfaces. Additionally, the dynamically computed confidence threshold T_p, adaptively derived from PA-CAM distributions, circumvents the inflexibility of fixed thresholds by enabling image-specific decision boundaries. Whereas CorrMatch’s correlation maps model pixel relationships without physical priors—inducing error propagation in spectrally homologous regions—and SoftMatch’s uniform alignment strategy addresses class imbalance but lacks specialized design for boundary-ambiguous thin clouds, PGMRN achieves a recall of 90.1% (Table 2), exceeding UCMT by 1.5%, demonstrating superior capability in capturing high-resolution fragmented clouds.
For irregular cloud formations with convoluted contours (Figure 6, row 2), the TV-L1 variational framework preserves geometric integrity of structural component S through the energy functional, while texture component T delineates fine boundaries without distortion. An auxiliary physical consistency constraint enforces network outputs to comply with atmospheric physics, thereby eliminating cloud-snow confusion. BHPC’s pixel-level hard positive sample contrast emphasizes local similarity but omits global morphological constraints, resulting in loss of bifurcated structures. UCMT’s teacher update delays premature convergence, yet its UMIX operation lacks intrinsic shape preservation mechanisms. Experiments indicate PGMRN attains an MIoU of 79.0%, outperforming CorrMatch by 1.3%, further substantiating its morphological preservation efficacy.
In high-resolution scenes densely populated by small cloud fragments (Figure 6, row 3), PGMRN’s UAQCP employs ω P A M C to dynamically select high-confidence pixels, where entropy H t coupled with physical features significantly enhances pseudo-label accuracy. The cross-modal consistency loss jointly optimizes structural, textural, RGB, and physical constraints, reducing uncertainty in small fragment regions. U2PL’s strategy of treating unreliable pixels as negative samples proves inadequate due to its linearly decaying dynamic threshold α_t’s poor adaptation to complex small targets. FixMatch’s hard confidence suppression directly filters low-confidence small fragments, causing detection omissions. Consequently, PGMRN achieves a minimal error rate of 9.1%, robustly validating its precision in detecting dense small cloud fragments.
Multi-scenario validation on SWIMSEG/SWINSEG (Figure 7) reveals PGMRN’s exceptional robustness under severe diurnal illumination variations. This resilience originates from its pseudo-NDVI—rooted in Mie scattering and absorption mechanisms—exhibiting inherent insensitivity to absolute illumination intensity. Concurrently, the structural, textural, and residual components derived from TV-L1 decomposition possess illumination invariance, circumventing the prevalent issue where conventional RGB features confuse clouds with overexposed regions. Traditional semi-supervised approaches like U2PL and UCMT, being overly reliant on RGB information, exhibit surging false positive rates under intense illumination. FixMatch and UniMatch demonstrate unstable augmentation efficacy under extreme lighting, degrading pseudo-label quality. PGMRN attains an MIoU of 83.8% on SWIMSEG daytime data (Table 3), surpassing CorrMatch by 4.7%, with both recall and F1-score exceeding 0.9, underscoring the exceptional illumination robustness conferred by physical modalities.
In low-illumination nocturnal environments, PGMRN sustains its superiority. The pseudo-NDVI maintains cloud-background discrimination independently of absolute illumination through its red-green band ratio formulation. The TV-L1 structural component, encoding solely geometric morphology, remains impervious to illumination intensity, providing stable nocturnal cloud detection. RGB-dependent methods such as SoftMatch and CorrMatch suffer from drastically diminished spectral discriminability at night; avoiding missed detections necessitates threshold relaxation, producing artificially elevated recall (>99%) at the expense of F1-score collapse (78.1%). PGMRN maintains 93.1% recall and 76.1% MIoU on SWINSEG nighttime data, reaffirming the nocturnal adaptability of its physics-guided paradigm.
Assessing temporal stability across combined diurnal-nocturnal data, PGMRN’s foundation in atmospheric radiative transfer principles and TV-L1 decomposition furnishes a temporally consistent feature substrate. UAQCP further reinforce temporal robustness by perpetually aligning multi-modal representations under varying illumination. Purely data-driven methods exhibit pronounced performance volatility across time series due to training distribution shifts, exacerbated by feature representations devoid of physical constraints. Ultimately, PGMRN achieves an MIoU of 83.5% on the day–night composite scenario, exceeding UCMT by 6.5%, comprehensively validating the exceptional stability of its quadruple consistency in multi-temporal applications.
The employed datasets represent distinct sensing platforms: Ground-based fisheye systems (SWIMSEG/SWINSEG) are characterized by a 180° field of view, pronounced geometric distortions, and highly variable illumination. Multi-regional ground-based standard sensors (TCDD) provide 512 × 512 imagery across disparate geographic and meteorological contexts. High-resolution satellite and aerial platforms (HRC_WHU) deliver 0.5–15 m resolution imagery with fundamentally different viewing geometries and atmospheric effects. Despite these pronounced divergences, cross-platform experiments (Table 4) reveal remarkable generalization capability for PGMRN. When trained on fisheye data and evaluated on standard ground sensors (TCDD), our approach attains 50.3% MIoU, surpassing the strongest baseline (UCMT, 40.9%) by a substantial 23.5% relative improvement. This significant gain corroborates the capacity of our physics-guided design to extract platform-invariant atmospheric features. The underlying robustness stems from pseudo-NDVI generation and TV-L1 decomposition jointly encoding atmospheric physical properties that are consistent across modalities. Additionally, the PA-CAM mechanism seamlessly adapts to varying spatial resolutions and viewing geometries without platform-specific calibration, thereby demonstrating genuine multi-platform scalability.

4.4. Ablation Experiments

Our ablation experiments validate the four core contributions across both ground-based (SWINSEG + SWIMSEG) and aerial (HRC_WHU) domains. The InfoNCE Contrastive Loss consistently improves performance in both domains (+1.16% and +1.28% IoU, respectively) by constructing compact intra-class and well-separated inter-class feature structures, thereby demonstrating its effectiveness in addressing high intra-class variability and inter-class similarity in multi-platform cloud detection. This improvement is quantitatively summarized in Table 5.
The Boundary-Aware Regional Adaptive Weighted Cross-Entropy Loss exhibits domain-specific advantages: it yields moderate improvement in the ground-based domain (79.91% IoU), but achieves exceptional performance in the aerial domain (92.26% F-score). This underscores the critical importance of precise boundary preservation in high-resolution aerial imagery, where our morphological operations and adaptive weight assignment mechanisms excel at balancing categorical center alignment and edge information fidelity.
The Uncertainty-Aware Quadruple Consistency Constraint achieves robust gains across both domains by jointly enforcing structural, textural, RGB, and physical consistency. This quadruple mechanism effectively mitigates the “cloud–snow confusion” problem and ensures cross-modal feature coherence. Notably, the structural constraint captures the principles of atmospheric fluid dynamics, while the textural constraint models turbulent diffusion patterns at cloud edges.
PA-CAM and Dynamic Confidence Screening demonstrate progressive performance improvements. PA-CAM alone achieves 80.23% IoU (ground) and 78.14% IoU (aerial); integrating pseudo-NDVI increases performance to 83.06% and 78.40%, respectively; and incorporating full TV-L1 decomposition further enhances it to 83.49% and 79.02%. These results validate the effectiveness of our PAMC mechanism, which combines physics-guided and probabilistic filtering for high-quality pseudo-label propagation. The stronger impact of pseudo-NDVI in ground-based observations reflects the enhanced atmospheric radiative transfer effects captured in these scenarios.
Finally, the synergistic effect of all modules reduces the error rates from 13.28% → 10.01% (ground) and 11.28% → 9.12% (aerial), demonstrating that our physics-guided framework enables adaptive generalization through domain-specific physical priors, while maintaining strong cross-platform applicability.
To mitigate the influence of noisy pseudo-labels in our semi-supervised framework, we implemented a dynamic entropy-based filtering rule [25]. This mechanism leverages Monte Carlo Dropout to quantify prediction uncertainty by performing multiple stochastic forward passes through the teacher model, generating a collection of softmax probability vectors for each voxel. The predictive entropy, derived from the mean probability distribution, serves as the uncertainty measure. The filtering mechanism exclusively incorporates reliable predictions—those with an uncertainty value below a dynamically adjusted threshold H—into the consistency loss calculation. This threshold H is progressively increased from an initial value of three-quarters of the maximum uncertainty to the full maximum uncertainty using a Gaussian ramp-up function, enabling the student model to gradually learn from highly confident to moderately uncertain predictions. Furthermore, in alignment with the established Mean Teacher framework, the exponential moving average (EMA) decay rate was set to 0.99 to balance the contribution between historical and current states during the teacher model update. For the auxiliary TV-L1 decomposition process, the regularization coefficients were set to λ = 1.5 and β = 1.5 as recommended in Reference [34] to ensure an effective trade-off between fidelity and smoothness.
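As an illustration of the threshold schedule described above, the sketch below implements a standard Gaussian ramp-up that raises the uncertainty threshold H from 0.75 of the maximum uncertainty to the full maximum over training; the exact functional form and ramp length are assumptions based on common Mean-Teacher practice, not values taken from the paper.

import numpy as np

def uncertainty_threshold(step, rampup_steps, h_max):
    # Gaussian ramp-up: exp(-5 * (1 - t)^2) rises smoothly from ~0 to 1
    t = np.clip(step / max(rampup_steps, 1), 0.0, 1.0)
    ramp = np.exp(-5.0 * (1.0 - t) ** 2)
    return (0.75 + 0.25 * ramp) * h_max   # from 0.75 * h_max up to h_max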
To determine the optimal weighting factors $\beta_1$ and $\beta_2$ for the consistency and contrastive loss terms, respectively, we conducted comprehensive ablation studies evaluating a range of values. The sensitivity of these hyperparameters was systematically assessed under different conditions of label scarcity, using 10% of labeled training data as shown in Figure 8, and 20% as shown in Figure 9. By analyzing the trade-offs in performance across key metrics guided by these ablation curves, we balanced the contribution of each loss component and selected the values that yielded the most stable and effective learning dynamics. The final chosen values were set to $\beta_1 = 0.15$ for the consistency loss and $\beta_2 = 0.2$ for the contrastive loss.

5. Conclusions

In summary, this paper proposes a Physics-Guided Multi-Representation Network (PGMRN), realizing a deep fusion of physical laws and data-driven approaches through the integration of atmospheric physical priors and feature decomposition. PGMRN provides a cost-effective solution for multi-platform cloud detection, requiring only a single RGB sensing modality across diverse platforms including ground-based systems, UAV platforms, and satellite/aerial imagery. Specifically, we embed atmospheric radiative transfer laws (e.g., pseudo-NDVI spectral proxies) into the TV-L1 feature decomposition process, generating complementary feature representations (pseudo-NDVI, structural components, reflectance components) from a single RGB input. To address label scarcity, we design an Uncertainty-Aware Quadruple Consistency Propagation (UAQCP) that efficiently leverages cross-component consistency information in unlabeled data by dynamically filtering high-confidence pixels using Physically Augmented Class Activation Maps (PA-CAM).
Quantitative experimental results demonstrate significant improvements across all benchmark datasets. On the TCDD dataset, PGMRN achieves MIoU of 70.8% (+3.3%), F1-score of 80.0% (+2.3%), and Recall of 85.4% (+3.6%), surpassing the best competing method. Similarly, on the HRC_WHU dataset, the model attains MIoU 79.0% (+1.3%), F1-score 88.2% (+0.6%), and Recall 90.1% (+1.5%). On the SWIMSEG/SWINSEG benchmarks, PGMRN reaches MIoU 83.8% (+4.7%) under daytime conditions and 83.5% (+6.5%) in combined day–night scenarios. Critically, in a cross-platform transfer (trained on SWIMSEG/SWINSEG, evaluated on TCDD), PGMRN still delivers MIoU 50.3%, outperforming existing approaches by 9.4% (absolute) and thus confirming the platform-independence inherent in our physically guided approach. The framework requires only a single RGB image to generate cloud-sensitive feature representations, eliminating dependence on multi-band data. Visualizations confirm that PA-CAM and boundary weights effectively identify uncertain regions (Figure 4), contributing to enhanced generalization.
Future work will extend PGMRN to land cover classification and change detection, exploring fusion with radiative transfer models to further strengthen physical interpretability. This study pioneers a practical paradigm for integrating physics with deep learning using minimal input data, promoting the efficient utilization of multi-platform remote sensing resources.

Author Contributions

Conceptualization, Q.X. and Z.Z.; methodology, Z.Z.; software, Y.C.; validation, Q.X., Y.C. and Z.Z.; formal analysis, G.W.; investigation, Y.C.; resources, Z.Z.; data curation, Q.X.; writing—original draft preparation, Z.Z.; writing—review and editing, Y.C.; visualization, Z.Z.; supervision, Q.X.; project administration, Y.C.; funding acquisition, Q.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Q.; Yuan, Q.; Zeng, C.; Li, X.; Wei, Y. Missing Data Reconstruction in Remote Sensing Image With a Unified Spatial–Temporal–Spectral Deep Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4274–4288. [Google Scholar] [CrossRef]
  2. Ghamisi, P.; Rasti, B.; Yokoya, N.; Wang, Q.M.; Hofle, B.; Bruzzone, L.; Bovolo, F.; Chi, M.M.; Anders, K.; Gloaguen, R.; et al. Multisource and Multitemporal Data Fusion in Remote Sensing: A Comprehensive Review of the State of the Art. IEEE Geosci. Remote Sens. Mag. 2019, 7, 6–39. [Google Scholar] [CrossRef]
  3. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  4. Xu, C.; Zhan, Y.; Wang, Z.; Yang, J. Multimodal fusion based few-shot network intrusion detection system. Sci. Rep. 2025, 15, 21986. [Google Scholar] [CrossRef]
  5. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  6. Papadomanolaki, M.; Vakalopoulou, M.; Karantzalos, K. A Deep Multitask Learning Framework Coupling Semantic Segmentation and Fully Convolutional LSTM Networks for Urban Change Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7651–7668. [Google Scholar] [CrossRef]
  7. Dong, M.; Yang, A.; Wang, Z.; Li, D.; Yang, J.; Zhao, R. Uncertainty-aware consistency learning for semi-supervised medical image segmentation. Knowl.-Based Syst. 2025, 309, 112890. [Google Scholar] [CrossRef]
  8. Xue, Z.; Yang, G.; Yu, X.; Yu, A.; Guo, Y.; Liu, B.; Zhou, J. Multimodal self-supervised learning for remote sensing data land cover classification. Pattern Recognit. 2025, 157, 110959. [Google Scholar] [CrossRef]
  9. Han, J.; Yang, W.; Wang, Y.; Chen, L.; Luo, Z. Remote Sensing Teacher: Cross-Domain Detection Transformer With Learnable Frequency-Enhanced Feature Alignment in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  10. Wang, S.; Sun, X.; Chen, C.; Hong, D.; Han, J. Semi-Supervised Semantic Segmentation for Remote Sensing Images via Multiscale Uncertainty Consistency and Cross-Teacher–Student Attention. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15. [Google Scholar] [CrossRef]
  11. Zucker, S.; Batenkov, D.; Segal Rozenhaimer, M. Physics-informed neural networks for modeling atmospheric radiative transfer. J. Quant. Spectrosc. Radiat. Transf. 2025, 331, 109253. [Google Scholar] [CrossRef]
  12. Yao, X.; Guo, Q.; Li, A. Cloud Detection in Optical Remote Sensing Images With Deep Semi-Supervised and Active Learning. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  13. Xu, F.; Shi, Y.; Ebel, P.; Yang, W.; Zhu, X.X. Multimodal and Multiresolution Data Fusion for High-Resolution Cloud Removal: A Novel Baseline and Benchmark. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  14. Xia, K.; Wang, L.; Zhou, S.; Hua, G.; Tang, W. Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Localization. Proc. IEEE Int. Conf. Comput. Vis. 2023, 10126–10135. [Google Scholar] [CrossRef]
  15. Wang, X.; Fan, Z.; Jiang, Z.; Yan, Y.; Yang, H. EDFF-Unet: An Improved Unet-Based Method for Cloud and Cloud Shadow Segmentation in Remote Sensing Images. Remote Sens. 2025, 17, 1432. [Google Scholar] [CrossRef]
  16. Li, X.; Yang, X.; Li, X.; Lu, S.; Ye, Y.; Ban, Y. GCDB-UNet: A novel robust cloud detection approach for remote sensing images. Knowl.-Based Syst. 2022, 238, 107890. [Google Scholar] [CrossRef]
  17. Gan, X.; Li, W.; Zhang, Y.; Long, W.; Lu, Y.; Chen, Z. Prior Information-Guided Semi-Supervised Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16. [Google Scholar] [CrossRef]
  18. Buttar, P.K.; Sachan, M.K. Semantic segmentation of clouds in satellite images based on U-Net++ architecture and attention mechanism. Expert Syst. Appl. 2022, 209, 118380. [Google Scholar] [CrossRef]
  19. Roy, S.K.; Deria, A.; Hong, D.; Rasti, B.; Plaza, A.; Chanussot, J. Multimodal Fusion Transformer for Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–20. [Google Scholar] [CrossRef]
  20. Chen, S.; Fang, Z.; Wan, S.; Zhou, T.; Chen, C.; Wang, M.; Li, Q. Geometrically aware transformer for point cloud analysis. Sci. Rep. 2025, 15, 16545. [Google Scholar] [CrossRef]
  21. Jonnala, N.S.; Bheemana, R.C.; Prakash, K.; Bansal, S.; Jain, A.; Pandey, V.; Faruque, M.R.I.; Al-Mugren, K.S. DSIA U-Net: Deep Shallow Interaction with Attention Mechanism UNet for Remote Sensing Satellite Images. Sci. Rep. 2025, 15, 549. [Google Scholar] [CrossRef]
  22. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204. [Google Scholar]
  23. Wang, J.; Ding, H.Q.; Chen, C.; He, C.; Luo, B. Semi-Supervised Remote Sensing Image Semantic Segmentation via Consistency Regularization and Average Update of Pseudo-Label. Remote Sens. 2020, 12, 3603. [Google Scholar] [CrossRef]
  24. Kumar, A.; Mitra, S.; Rawat, Y.S. Stable Mean Teacher for Semi-supervised Video Action Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; Volume 39, pp. 4419–4427. [Google Scholar]
  25. Chen, Y.; Yang, Z.; Zhang, L.; Cai, W. A semi-supervised boundary segmentation network for remote sensing images. Sci. Rep. 2025, 15, 2007. [Google Scholar] [CrossRef]
  26. Zhang, K.; Li, P.; Wang, J. A Review of Deep Learning-Based Remote Sensing Image Caption: Methods, Models, Comparisons and Future Directions. Remote Sens. 2024, 16, 4113. [Google Scholar] [CrossRef]
  27. Liu, S.; Zhang, T.; Deng, R.; Liu, X.; Liu, H. Physics-guided deep learning framework with attention for image denoising. Vis. Comput. 2025, 41, 6671–6685. [Google Scholar] [CrossRef]
  28. Zérah, Y.; Valero, S.; Inglada, J. Physics-constrained deep learning for biophysical parameter retrieval from Sentinel-2 images: Inversion of the PROSAIL model. Remote Sens. Environ. 2024, 312, 114309. [Google Scholar] [CrossRef]
  29. Wang, Y.; Gong, J.; Wu, D.L.; Ding, L. Toward Physics-Informed Neural Networks for 3-D Multilayer Cloud Mask Reconstruction. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  30. Li, Z.; Zhao, W.; Du, X.; Zhou, G.; Zhang, S. Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning. Remote Sens. 2024, 16, 196. [Google Scholar] [CrossRef]
  31. Guan, J.; Shu, Y.; Li, W.; Song, Z.; Zhang, Y. PR-CLIP: Cross-Modal Positional Reconstruction for Remote Sensing Image–Text Retrieval. Remote Sens. 2025, 17, 2117. [Google Scholar] [CrossRef]
  32. Yang, X.; Li, C.; Wang, Z.; Xie, H.; Mao, J.; Yin, G. Remote Sensing Cross-Modal Text-Image Retrieval Based on Attention Correction and Filtering. Remote Sens. 2025, 17, 503. [Google Scholar] [CrossRef]
  33. Gitelson, A.A.; Kaufman, Y.J.; Stark, R.; Rundquist, D. Novel algorithms for remote estimation of vegetation fraction. Remote Sens. Environ. 2002, 80, 76–87. [Google Scholar] [CrossRef]
  34. Yin, W.; Goldfarb, D.; Osher, S. The Total Variation Regularized L1 Model for Multiscale Decomposition. Multiscale Model. Simul. 2007, 6, 190–211. [Google Scholar] [CrossRef]
  35. Chambolle, A.; Pock, T. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. J. Math. Imaging Vis. 2011, 40, 120–145. [Google Scholar] [CrossRef]
  36. Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; Hu, X. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; IEEE Computer Society: Los Alamitos, CA, USA, 2020; pp. 111–119. [Google Scholar]
  37. Yu, L.; Wang, S.; Li, X.; Fu, C.-W.; Heng, P.-A. Uncertainty-Aware Self-Ensembling Model for Semi-Supervised 3D Left Atrium Segmentation. Lect. Notes Comput. Sci. 2019, 11765, 605–613. [Google Scholar]
  38. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  39. Zhao, H.; Kong, X.; He, J.; Qiao, Y.; Dong, C. Efficient Image Super-Resolution Using Pixel Attention. Lect. Notes Comput. Sci. 2020, 12537, 56–72. [Google Scholar] [CrossRef]
  40. Zhang, Z.; Yang, S.; Liu, S.; Xiao, B.; Cao, X. Ground-Based Cloud Detection Using Multiscale Attention Convolutional Neural Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  41. Liu, J.; Ji, S. A Novel Recurrent Encoder-Decoder Structure for Large-Scale Multi-View Stereo Reconstruction From an Open Aerial Dataset. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6049–6058. [Google Scholar]
  42. Wang, Y.; Wang, H.; Shen, Y.; Fei, J.; Li, W.; Jin, G.; Wu, L.; Zhao, R.; Le, X. Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
  43. Tang, C.; Zeng, X.; Zhou, L.; Zhou, Q.; Wang, P.; Wu, X.; Ren, H.; Zhou, J.; Wang, Y. Semi-supervised medical image segmentation via hard positives oriented contrastive learning. Pattern Recognit. 2024, 146, 110020. [Google Scholar] [CrossRef]
  44. Shen, Z.; Cao, P.; Yang, H.; Liu, X.; Yang, J.; Zaiane, O.R. Co-training with high-confidence pseudo labels for semi-supervised medical image segmentation. arXiv 2023, arXiv:2301.04465. [Google Scholar]
  45. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.-L. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608. [Google Scholar]
  46. Yang, L.; Qi, L.; Feng, L.; Zhang, W.; Shi, Y. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7236–7246. [Google Scholar]
  47. Chen, H.; Tao, R.; Fan, Y.; Wang, Y.; Wang, J.; Schiele, B.; Xie, X.; Raj, B.; Savvides, M. SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning. arXiv 2023, arXiv:2301.10921. [Google Scholar]
  48. Sun, B.; Yang, Y.; Yuan, W.; Zhang, L.; Cheng, M.M.; Hou, Q. CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–24 June 2024. [Google Scholar]
Figure 1. Overview of the proposed algorithm. The framework adopts a student–teacher dual-network architecture. It takes a single, easily obtainable RGB image as input, integrates atmospheric physical priors and multi-representation features derived from intrinsic image decomposition, and incorporates a quadruple consistency propagation mechanism based on uncertainty perception to achieve end-to-end cloud detection. The U-Net weights pretrained in Training Stage 1 are used to initialize both the teacher and student models in Training Stage 2, ensuring stable convergence and knowledge transfer.
Figure 2. Box plot of the mean and standard deviation of cloud images of different cloud types. Statistical analysis was conducted on the spectral and structural characteristics of different cloud types. The box plot shows the distribution of mean values (first row) and standard deviation values (second row) of the cloud image features (pseudo-NDVI, structure, and texture components). Each box spans the interquartile range and shows the median line, while the whiskers extend to the 5th and 95th percentiles.
Figure 3. Physics-Augmented CAM structure. The diagram shows the integration of TV-L1 structural decomposition with deep UNet features to create physically informed confidence maps. The input RGB image undergoes TV-L1 regularization to extract structural components, which are then processed by a pre-trained UNet model. The last decoder layer features (F) are extracted and upsampled to input image resolution. Channel-wise Importance of Confidence (CIC) scores are computed using the difference between masked and unmasked network outputs, serving as channel weights for generating the final PA-CAM through ReLU activation of weighted feature summation.
Figure 4. Schematic diagram of regional adaptive weights. The figure demonstrates the six-step process: (a) Original RGB image showing cloud formations; (b) Ground truth binary label with clear cloud boundaries; (c) PA-CAM confidence map highlighting reliable regions; (d) Euclidean distance transform map computing pixel distances to nearest boundaries; (e) Exponentially weighted distance map emphasizing boundary proximity; (f) Final boundary-aware weight map combining PA-CAM confidence with edge enhancement weights.
Figure 5. TCDD dataset comparison test results. These results illustrate four representative scenarios: the first row represents the task of small fragment cloud segmentation; the second and third rows represent thin cloud detection; the fourth row represents complex-shaped clouds. Each row demonstrates the outstanding performance of PGMRN in handling complex cloud detection scenarios under different meteorological conditions. The red boxes highlight the key differences in segmentation quality.
Figure 6. Comparison test results of HRC_WHU dataset. This comparison covers three challenging scenarios related to aerial/satellite images: the first row is the thin cloud segmentation task; the second row involves irregular cloud shapes; the third row includes fragmented small clouds. The red boxes highlight the key differences in segmentation quality.
Figure 7. Comparison test results of SWIMSEG, SWINSEG datasets. This visualization presents four challenging scenarios: the first row represents small-scale cloud fragments; the second row features blurred cloud boundaries; the third row showcases complex cloud formations; and the fourth row contains distinct thin cloud areas. The red boxes highlight the key differences in segmentation quality.
Figure 8. Hyperparameter sensitivity analysis across different loss components and evaluation metrics. The sensitivity curves show the impact of parameter variations on model performance using 10% labeled training data, evaluated across six metrics (Dice_BG, Dice_FG, Error_Rate, F_Score, IoU, and Recall) for three key components: InfoNCE consistency loss, Pseudo-NDVI, and TV-L1 regularization.
Figure 9. Hyperparameter sensitivity analysis across different loss components and evaluation metrics. The sensitivity curves show the impact of parameter variations on model performance using 20% labeled training data, evaluated across six metrics (Dice_BG, Dice_FG, Error_Rate, F_Score, IoU, and Recall) for three key components: InfoNCE consistency loss, Pseudo-NDVI, and TV-L1 regularization.
Table 1. Performance comparison on the TCDD dataset, with optimal results highlighted in red and suboptimal results in blue. Values are presented as mean ± standard deviation across multiple experimental trials. Statistical significance relative to the proposed method is indicated by * (p < 0.05) and ○ (p ≥ 0.05), based on paired t-tests.
Method | Recall ↑ | F1-Score ↑ | Error Rate ↓ | MIoU ↑
U2PL [42] | 0.840 ± 0.012 | 0.771 ± 0.012 * | 0.127 ± 0.009 * | 0.669 ± 0.008 *
BHPC [43] | 0.778 ± 0.014 * | 0.779 ± 0.011 ○ | 0.114 ± 0.017 ○ | 0.679 ± 0.013 ○
UCMT [44] | 0.789 ± 0.011 * | 0.783 ± 0.016 | 0.112 ± 0.018 | 0.684 ± 0.014
FixMatch [45] | 0.829 ± 0.014 ○ | 0.755 ± 0.013 * | 0.131 ± 0.007 * | 0.648 ± 0.010 *
UniMatch [46] | 0.774 ± 0.012 * | 0.767 ± 0.010 * | 0.126 ± 0.011 * | 0.664 ± 0.018 *
SoftMatch [47] | 0.781 ± 0.012 * | 0.775 ± 0.007 ○ | 0.116 ± 0.017 ○ | 0.674 ± 0.013 ○
CorrMatch [48] | 0.818 ± 0.014 ○ | 0.777 ± 0.018 ○ | 0.119 ± 0.013 ○ | 0.675 ± 0.009 ○
Ours | 0.854 ± 0.017 | 0.800 ± 0.013 | 0.105 ± 0.015 | 0.708 ± 0.015
↑ indicates that a higher value is better; ↓ indicates that a lower value is better.
Table 2. Performance comparison on the HRC_WHU dataset, with optimal results highlighted in red and suboptimal results in blue. Values are presented as mean ± standard deviation across multiple experimental trials. Statistical significance relative to the proposed method is indicated by * (p < 0.05) and ○ (p ≥ 0.05), based on paired t-tests.
Method | Recall ↑ | F1-Score ↑ | Error Rate ↓ | MIoU ↑
U2PL | 0.809 ± 0.017 * | 0.860 ± 0.016 * | 0.100 ± 0.014 * | 0.756 ± 0.010 *
BHPC | 0.876 ± 0.015 ○ | 0.873 ± 0.014 ○ | 0.096 ± 0.019 ○ | 0.776 ± 0.014 ○
UCMT | 0.886 ± 0.012 | 0.874 ± 0.007 ○ | 0.097 ± 0.011 ○ | 0.777 ± 0.011 ○
FixMatch | 0.786 ± 0.017 * | 0.853 ± 0.014 * | 0.103 ± 0.016 * | 0.745 ± 0.012 *
UniMatch | 0.814 ± 0.016 * | 0.863 ± 0.014 * | 0.098 ± 0.018 ○ | 0.761 ± 0.017 *
SoftMatch | 0.856 ± 0.018 * | 0.871 ± 0.009 ○ | 0.096 ± 0.012 ○ | 0.773 ± 0.013 ○
CorrMatch | 0.876 ± 0.018 ○ | 0.876 ± 0.014 | 0.094 ± 0.010 | 0.780 ± 0.016
Ours | 0.901 ± 0.018 | 0.882 ± 0.015 | 0.091 ± 0.018 | 0.790 ± 0.016
↑ indicates that a higher value is better; ↓ indicates that a lower value is better.
Table 3. Performance comparison on the SWIMSEG and SWINSEG datasets, with optimal results highlighted in red and suboptimal results in blue. Values are presented as mean ± standard deviation across multiple experimental trials. Statistical significance relative to the proposed method is indicated by * (p < 0.05) and ○ (p ≥ 0.05), based on paired t-tests.
Method | Recall ↑ | F1-Score ↑ | Error Rate ↓ | MIoU ↑
Day Time (SWIMSEG)
U2PL | 0.793 ± 0.009 * | 0.847 ± 0.017 * | 0.154 ± 0.018 ○ | 0.741 ± 0.011 ○
BHPC | 0.840 ± 0.014 ○ | 0.845 ± 0.018 ○ | 0.160 ± 0.016 * | 0.750 ± 0.018 *
UCMT | 0.855 ± 0.014 * | 0.858 ± 0.012 ○ | 0.150 ± 0.014 ○ | 0.760 ± 0.017 *
FixMatch | 0.845 ± 0.017 ○ | 0.838 ± 0.017 * | 0.170 ± 0.017 * | 0.730 ± 0.014 ○
UniMatch | 0.850 ± 0.010 * | 0.853 ± 0.013 * | 0.155 ± 0.010 ○ | 0.755 ± 0.009 ○
SoftMatch | 0.816 ± 0.006 ○ | 0.870 ± 0.019 * | 0.128 ± 0.014 ○ | 0.779 ± 0.010 *
CorrMatch | 0.876 ± 0.016 * | 0.882 ± 0.017 * | 0.128 ± 0.009 | 0.791 ± 0.015 *
Ours | 0.924 ± 0.012 | 0.911 ± 0.021 | 0.098 ± 0.015 | 0.838 ± 0.018
Night Time (SWINSEG)
U2PL | 0.966 ± 0.016 ○ | 0.788 ± 0.015 * | 0.246 ± 0.007 * | 0.650 ± 0.013 *
BHPC | 0.830 ± 0.007 * | 0.835 ± 0.013 ○ | 0.170 ± 0.012 ○ | 0.740 ± 0.019 ○
UCMT | 0.850 ± 0.016 * | 0.853 ± 0.017 | 0.155 ± 0.017 ○ | 0.755 ± 0.017
FixMatch | 0.835 ± 0.016 ○ | 0.822 ± 0.021 * | 0.180 ± 0.014 * | 0.720 ± 0.014 ○
UniMatch | 0.845 ± 0.013 * | 0.848 ± 0.010 ○ | 0.160 ± 0.011 * | 0.750 ± 0.018 ○
SoftMatch | 0.993 ± 0.012 | 0.781 ± 0.012 * | 0.263 ± 0.011 * | 0.640 ± 0.011 *
CorrMatch | 0.991 ± 0.011 | 0.788 ± 0.018 * | 0.252 ± 0.014 * | 0.650 ± 0.013 *
Ours | 0.931 ± 0.018 | 0.864 ± 0.013 | 0.138 ± 0.017 | 0.761 ± 0.011
Day + Night Time (SWINSEG)
U2PL | 0.818 ± 0.021 * | 0.846 ± 0.008 * | 0.160 ± 0.020 * | 0.738 ± 0.019 ○
BHPC | 0.866 ± 0.014 ○ | 0.872 ± 0.014 * | 0.137 ± 0.009 * | 0.775 ± 0.013 *
UCMT | 0.885 ± 0.010 * | 0.878 ± 0.021 | 0.134 ± 0.011 * | 0.784 ± 0.003 *
FixMatch | 0.864 ± 0.014 ○ | 0.817 ± 0.019 * | 0.156 ± 0.015 * | 0.721 ± 0.016 *
UniMatch | 0.867 ± 0.017 * | 0.820 ± 0.019 * | 0.150 ± 0.010 ○ | 0.724 ± 0.014 *
SoftMatch | 0.839 ± 0.018 * | 0.867 ± 0.016 * | 0.137 ± 0.012 ○ | 0.773 ± 0.012 *
CorrMatch | 0.893 ± 0.017 * | 0.876 ± 0.013 * | 0.137 ± 0.014 * | 0.783 ± 0.015 *
Ours | 0.929 ± 0.013 | 0.909 ± 0.011 | 0.100 ± 0.008 | 0.835 ± 0.009
↑ indicates that a higher value is better; ↓ indicates that a lower value is better.
Table 4. Cross-platform comparison test results, with optimal results highlighted in red and suboptimal results in blue. Values are presented as mean ± standard deviation across multiple experimental trials. Statistical significance relative to the proposed method is indicated by * (p < 0.05) and ○ (p ≥ 0.05), based on paired t-tests.
Method | Recall ↑ | F1-Score ↑ | Error Rate ↓ | MIoU ↑
Training on SWIMSEG, SWINSEG and testing on TCDD
U2PL | 0.336 ± 0.017 * | 0.420 ± 0.009 * | 0.256 ± 0.013 * | 0.302 ± 0.009 *
BHPC | 0.352 ± 0.018 ○ | 0.445 ± 0.012 ○ | 0.244 ± 0.005 ○ | 0.325 ± 0.015 *
UCMT | 0.450 ± 0.014 | 0.527 ± 0.011 ○ | 0.208 ± 0.020 | 0.409 ± 0.011 *
FixMatch | 0.215 ± 0.013 * | 0.269 ± 0.007 * | 0.306 ± 0.012 * | 0.186 ± 0.013 *
UniMatch | 0.431 ± 0.014 * | 0.529 ± 0.015 * | 0.215 ± 0.007 ○ | 0.403 ± 0.012 *
SoftMatch | 0.291 ± 0.013 * | 0.368 ± 0.010 ○ | 0.276 ± 0.012 ○ | 0.262 ± 0.016 *
CorrMatch | 0.399 ± 0.005 ○ | 0.472 ± 0.012 ○ | 0.239 ± 0.014 ○ | 0.355 ± 0.016 ○
Ours | 0.553 ± 0.012 | 0.627 ± 0.014 | 0.171 ± 0.009 | 0.503 ± 0.011
Training on TCDD and testing on SWIMSEG, SWINSEG
U2PL | 0.840 ± 0.011 | 0.771 ± 0.011 * | 0.127 ± 0.009 * | 0.669 ± 0.012 *
BHPC | 0.778 ± 0.019 * | 0.779 ± 0.017 ○ | 0.114 ± 0.010 ○ | 0.679 ± 0.018 *
UCMT | 0.789 ± 0.018 * | 0.783 ± 0.012 | 0.112 ± 0.019 | 0.684 ± 0.020
FixMatch | 0.829 ± 0.010 * | 0.755 ± 0.009 * | 0.131 ± 0.008 * | 0.648 ± 0.016 *
UniMatch | 0.774 ± 0.023 * | 0.767 ± 0.018 * | 0.126 ± 0.018 ○ | 0.664 ± 0.019 *
SoftMatch | 0.781 ± 0.031 * | 0.775 ± 0.028 ○ | 0.116 ± 0.022 ○ | 0.674 ± 0.033 ○
CorrMatch | 0.818 ± 0.026 * | 0.777 ± 0.025 ○ | 0.119 ± 0.017 ○ | 0.675 ± 0.023 *
Ours | 0.854 ± 0.018 | 0.800 ± 0.018 | 0.105 ± 0.017 | 0.708 ± 0.019
↑ indicates that a higher value is better; ↓ indicates that a lower value is better.
Table 5. Ablation Study of the Physics-Guided Multi-Representation Network (PGMRN) on Ground-Based (SWINSEG + SWIMSEG) and Aerial (HRC_WHU) Datasets.
Method | IoU ↑ | F_Score ↑ | Recall ↑ | Error_Rate ↓ | Dice_BG ↑ | Dice_FG ↑
Ground-Based Domain (SWINSEG + SWIMSEG)
Baseline (Mean-Teacher) | 78.05% | 87.53% | 85.95% | 13.28% | 75.76% | 84.67%
+InfoNCE | 79.21% | 88.12% | 90.64% | 13.46% | 75.74% | 86.55%
+BARA-C-E loss | 79.91% | 88.68% | 92.50% | 13.00% | 72.16% | 85.72%
+PA-CAM | 80.23% | 88.91% | 90.18% | 12.21% | 76.24% | 86.85%
+Pseudo-NDVI | 83.06% | 90.64% | 90.25% | 10.05% | 82.87% | 88.40%
+TV-L1 | 83.49% | 90.90% | 92.87% | 10.01% | 82.85% | 89.11%
Aerial-Based Domain (HRC_WHU)
Baseline (Mean-Teacher) | 75.58% | 81.77% | 91.41% | 11.28% | 90.08% | 84.31%
+InfoNCE | 76.86% | 85.11% | 89.05% | 10.14% | 91.21% | 85.15%
+BARA-C-E loss | 77.10% | 92.26% | 82.55% | 9.40% | 92.04% | 85.31%
+PA-CAM | 78.14% | 88.98% | 86.59% | 9.18% | 92.10% | 86.22%
+Pseudo-NDVI | 78.40% | 89.34% | 86.63% | 9.14% | 92.09% | 86.27%
+TV-L1 | 79.02% | 88.22% | 90.07% | 9.12% | 91.91% | 87.30%
↑ indicates that a higher value is better; ↓ indicates that a lower value is better.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
