Latent Diffusion Model for Chlorophyll Remote Sensing Spectral Synthesis Integrating Bio-Optical Priors and Band Attention Mechanisms

Liu, Jinming; Zhang, Haoran; Huang, Jianlong; Wen, Hanbin; Chen, Qinpei; Liu, Jiayi; Wen, Chaowen; Tang, Huiling; Sun, Zhaohua

doi:10.3390/app16083892


by Jinming Liu ¹,†, Haoran Zhang ¹,†, Jianlong Huang ², Hanbin Wen ³, Qinpei Chen ³, Jiayi Liu ⁴, Chaowen Wen ⁴, Huiling Tang ³,*,‡ and Zhaohua Sun ¹,*,§

¹ School of Innovation and Entrepreneurship, Southern University of Science and Technology, Shenzhen 518055, China

² Anhua Ocean Intelligent Equipment Co., Ltd., Shenzhen 518000, China

³ School of Physics & Optoelectronic Engineering, Guangdong University of Technology, Guangzhou 510006, China

⁴ College of Aviation and Drone Technology, Guangdong Communication Polytechnic, Guangzhou 510650, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

‡ This author should be considered the second corresponding author.

§ This author should be considered the first corresponding author.

Appl. Sci. 2026, 16(8), 3892; https://doi.org/10.3390/app16083892

Submission received: 22 January 2026 / Revised: 12 February 2026 / Accepted: 2 March 2026 / Published: 17 April 2026


Abstract

Global freshwater resources face severe water quality degradation, with chlorophyll-a (Chl-a) concentration serving as a critical eutrophication indicator. While deep learning methods enable accurate Chl-a retrieval from remote sensing reflectance (Rrs) spectra, the scarcity of paired Rrs-Chl-a samples limits model generalization and causes overfitting, particularly in optically complex inland waters. To address this data bottleneck, we propose a physics-constrained latent diffusion model for synthesizing high-fidelity paired Rrs-Chl-a data to augment limited training sets for deep learning-based water quality retrieval. Our framework integrates three key innovations: (1) a lightweight variational autoencoder achieving 8.6:1 latent space compression, reducing computational overhead while preserving spectral features; (2) band-selective attention mechanisms targeting chlorophyll-sensitive wavelengths (440, 550, 680, and 700–750 nm) based on bio-optical principles; and (3) physics-guided conditional encoding that captures concentration-dependent spectral responses across oligotrophic to eutrophic regimes. Evaluated on the GLORIA dataset, our model demonstrates superior performance in spectral similarity (0.535), sample diversity (0.072), and distribution matching (Fréchet distance 0.0008) compared to conventional generative models. When applied to data augmentation, synthetic spectra improved downstream Chl-a retrieval from R²= 0.75 to 0.91, reducing RMSE by 39%. This physics-informed generative approach addresses data scarcity in aquatic remote sensing research, supporting global needs for enhanced understanding of inland and coastal water quality dynamics in data-limited regions.

Keywords: aquatic optics; generative models; domain knowledge integration; data augmentation; earth observation; water quality

1. Introduction

Global freshwater resources face severe water quality degradation. Approximately 2.2 billion people lack access to safely managed drinking water services, and by 2030, water quality deterioration could threaten the livelihoods of 4.8 billion people [1]. Chlorophyll-a (Chl-a) concentration serves as a critical indicator for monitoring water eutrophication, with accurate measurements essential for early warning of harmful algal blooms and protecting aquatic ecosystem health [2,3]. Chlorophyll-a exhibits unique absorption–reflection characteristics at specific wavelengths, enabling large-scale dynamic water quality estimation through remote sensing reflectance (Rrs) spectra. This remote sensing approach overcomes the spatial–temporal constraints of conventional field sampling [4,5]. However, acquiring high-quality paired Rrs-Chl-a data remains a fundamental challenge due to the high costs of satellite–ground synchronization, limited field station coverage, and the need for simultaneous spectral and laboratory measurements (Figure 1). This data scarcity severely constrains the development of data-driven models for operational water quality monitoring.

Deep learning has enabled novel approaches for Chl-a retrieval from Rrs spectra. Unlike conventional band ratio algorithms and semi-analytical models, deep neural networks automatically learn high-dimensional nonlinear relationships between Rrs spectra and Chl-a concentrations without manual feature engineering [6,7]. Multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent architectures have demonstrated superior performance in complex inland water Chl-a inversion compared to traditional machine learning methods, owing to their hierarchical representation of spectral characteristics [8,9,10,11]. Recent advances in CNN-LSTM hybrid models have further improved temporal prediction accuracy, achieving R² values exceeding 0.98 in multi-source remote sensing applications [12]. However, the generalization capability of deep learning models critically depends on the scale and diversity of labeled training data [13]. In water quality remote sensing, the scarcity of paired Rrs-Chl-a samples and limited coverage of diverse water body types restrict model generalization [14,15]. This data bottleneck fundamentally constrains the data-driven paradigm in operational water quality inversion.

Generative models offer a promising solution by learning data distributions and synthesizing new samples for augmentation. Variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models have demonstrated effectiveness in remote sensing applications. Conditional GANs have improved crop classification through synthetic satellite imagery, while conditional diffusion models have reduced overfitting in small-sample hyperspectral classification [16,17]. However, current water quality spectral synthesis models face significant limitations: GANs suffer from mode collapse, producing limited sample diversity [18,19]; VAEs generate conservative samples concentrated around training means [20]; and standard diffusion models encounter computational bottlenecks in high spectral dimensions [21]. The high dimensionality of hyperspectral data—often exceeding 100 spectral bands—poses particular challenges for diffusion-based generation, requiring computationally expensive noise prediction across all spectral channels [22]. More critically, these purely data-driven methods fail to incorporate water optical physics—chlorophyll absorption–reflection properties at characteristic wavelengths (440, 550, and 680 nm)—limiting the physical fidelity of generated spectra.

To address these limitations, this paper proposes a physics-constrained latent diffusion model (PCLA) for generating high-fidelity paired Rrs-Chl-a data. The primary objective is to develop a physics-informed generative framework that synthesizes realistic spectral–parameter pairs to augment limited training data, thereby enhancing deep learning model performance in aquatic remote sensing applications. Unlike purely data-driven approaches, our method explicitly integrates bio-optical principles governing chlorophyll–water interactions. The model achieves this through three synergistic innovations:

(1) Latent diffusion architecture: A lightweight VAE compresses 551-dimensional Rrs spectra into a 64-dimensional latent space, enabling efficient diffusion-based generation (8.6:1 compression ratio).
(2) Band-selective attention: Multi-head attention mechanisms focus on chlorophyll-sensitive wavelength bands (440, 550, 680, and 700–750 nm), encoding bio-optical response patterns into the network architecture.
(3) Physics-guided conditioning: A hierarchical encoder segments chlorophyll concentration ranges based on bio-optical priors (oligotrophic <5 mg/m³, mesotrophic 5–50 mg/m³, and eutrophic >50 mg/m³), ensuring concentration-dependent spectral synthesis.

2. Related Work

2.1. Remote Sensing Inversion Methods for Water Quality Parameters

Remote sensing water quality inversion has progressed from empirical models through semi-analytical models to data-driven deep learning models. Early empirical algorithms, such as the OC3/OC4 band ratio algorithms described by O’Reilly et al. [23], estimate chlorophyll-a from polynomial relationships with blue–green band ratios. These algorithms, however, assume Case-1 water conditions and lose accuracy in optically complex waters, where colored dissolved organic matter and non-algal particulates vary independently of chlorophyll [24]. To address the limitations of empirical models, Lee et al. [25] developed the Quasi-Analytical Algorithm (QAA), which establishes quantitative relationships between remote sensing reflectance and inherent optical properties derived from radiative transfer equations. The GSM model of Maritorena et al. [26] performs simultaneous multiparameter inversion with the aid of simulated annealing optimization. Deep learning approaches offer strong nonlinear modeling performance. CNNs automatically identify spectral features through hierarchical convolutions [9], and LSTMs excel at modeling long-range spectral dependencies [27]. Ali et al. [8] trained an MLP on GLORIA data for Sentinel-2 chlorophyll-a inversion, which outperformed standard machine learning algorithms such as XGBoost and random forests. Such end-to-end models do not rely on manual band selection and learn effective spectral representations via backpropagation.

However, data-driven model performance depends heavily on annotated sample quantity and diversity. Deep networks have far more parameters than traditional machine learning models, requiring larger training sets to prevent overfitting and achieve generalization [28]. Paired data accumulation is constrained by sampling costs, satellite–ground synchronization windows, and meteorological conditions [29]. For example, the GLORIA dataset [30], despite compiling hyperspectral observations from 450 global water bodies, provides insufficient samples with complete multiparameter annotations for complex network training. This limitation motivates exploring generative data augmentation strategies.

2.2. Generative Models for Data Augmentation

Generative data augmentation alleviates the large-scale labeled data requirements of deep learning models. GANs and VAEs have been applied in remote sensing. GANs, proposed by Goodfellow et al. [29], learn data distributions through adversarial training between generators and discriminators. Lekavičius et al. [31] used pix2pix conditional GANs to generate synthetic images for solar panel segmentation, demonstrating that GAN-generated samples improve segmentation IoU. Douzas and Bacao [32] applied conditional GANs to oversample imbalanced datasets by learning minority class distributions. Kingma and Welling [33] proposed the VAE, achieving probabilistic latent space modeling through reparameterization. However, the VAE suffers from posterior collapse, causing information loss and limited sample diversity [34].

Diffusion models demonstrate superior generation quality and stability through progressive denoising, overcoming GAN training instability and VAE generative ambiguity. Ho et al.’s DDPM [35] achieved an FID of 3.17 on CIFAR-10, outperforming contemporary GANs. However, pixel-space iterative denoising incurs high computational costs. Rombach et al. [36] introduced LDM, shifting diffusion to the VAE-encoded latent space to reduce computational costs while maintaining quality. Diffusion models have been applied to remote sensing classification, super-resolution, and fusion [37] but have not addressed spectral–parameter paired data generation or water optical physics modeling. This study addresses water quality inversion data scarcity by proposing a latent diffusion framework for spectral–chlorophyll paired data generation. A physics-guided conditional mechanism enforces radiative transfer laws.

3. Materials and Methods

3.1. Overall Architecture

To address the bottleneck of scarce paired spectral–parameter data in water quality remote sensing inversion, this paper proposes a physics-guided latent diffusion framework for generating remote sensing reflectance spectra under specified chlorophyll concentration conditions. Given a target chlorophyll concentration $c \in \mathbb{R}^{+}$, the model synthesizes a spectrum $R_{rs} \in \mathbb{R}^{551}$ (corresponding to 350–900 nm at 1 nm resolution) that satisfies two criteria: (1) it aligns with the statistical distribution of true observations; (2) it adheres to chlorophyll-induced bio-optical principles. As shown in Figure 2, the framework employs a two-stage design: Stage 1 pre-trains a lightweight variational autoencoder (VAE) to compress the 551-dimensional spectrum into a 64-dimensional latent space (compression ratio 8.6:1), significantly reducing computational overhead for subsequent diffusion while preserving spectral features. Stage 2 freezes the VAE weights and trains a conditional denoising network to model the latent distribution.

Beyond this two-stage design, the framework relies on three components to improve the physical consistency and conditional controllability of the generated spectra. (1) The latent diffusion mechanism performs denoising in the VAE-compressed space rather than in the original 551-dimensional spectral domain, enabling efficient iterative sampling at reduced computational cost. (2) The physics-guided conditional encoder employs a partitioned subnetwork architecture to encode spectral responses for low (<5 mg/m³), medium (5–50 mg/m³), and high (>50 mg/m³) chlorophyll concentration ranges based on bio-optical priors. Soft-layered weights enable smooth transitions across concentration boundaries. (3) The band-selective attention module is embedded in the denoising network and applies learnable Gaussian masks to the chlorophyll-sensitive key bands (440 nm blue absorption peak, 550 nm green reflection peak, 680 nm red absorption peak, and 700–750 nm near-infrared reflection zone). Multi-head attention mechanisms enable cross-band feature interactions, explicitly capturing wavelength-dependent bio-optical characteristics.

3.2. Latent Diffusion Framework

Diffusion models demonstrate strong generation quality through progressive denoising but face computational bottlenecks when applied directly in high-dimensional spectral space. To address this, we adopt latent diffusion. Using a pre-trained VAE to compress spectral data into a 64-dimensional latent space, the diffusion model focuses on semantic features while reducing computational complexity from $O(L)$ to $O(d)$ ($L = 551$, $d = 64$).

Moreover, latent compression addresses the curse of dimensionality in small-sample settings. With only 341 training samples against 551 spectral bands, the sample-to-dimension ratio falls below 1:1, making direct high-dimensional distribution learning statistically ill-conditioned. The 64-dimensional latent space preserves essential spectral semantics while removing redundant inter-band correlations, enabling more stable diffusion training.
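The ratios quoted above can be checked with a few lines of arithmetic. This is only a sanity check of the numbers stated in the text (551 bands, 64 latent dimensions, 341 samples), not part of the authors' code.

```python
# Dimensions quoted in the text.
n_bands = 551      # 350-900 nm at 1 nm resolution
latent_dim = 64    # VAE latent size
n_samples = 341    # GLORIA samples used in the study

compression_ratio = n_bands / latent_dim
ratio_spectral = n_samples / n_bands     # samples per spectral dimension
ratio_latent = n_samples / latent_dim    # samples per latent dimension

print(f"bands from 350 to 900 nm: {900 - 350 + 1}")
print(f"compression ratio: {compression_ratio:.1f}:1")
print(f"samples per dimension, spectral space: {ratio_spectral:.2f}")
print(f"samples per dimension, latent space:   {ratio_latent:.2f}")
```

In the spectral domain there is less than one sample per dimension, whereas the latent space has several samples per dimension, which is the statistical argument made in the paragraph above.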

As Figure 3 shows, the latent diffusion model comprises two Markov processes: forward diffusion and backward denoising. The forward process incrementally adds Gaussian noise to the latent vector $z_0$ over $T$ time steps, with transition probabilities

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t; \sqrt{1 - \beta_t}\, z_{t-1}, \beta_t I\right) \tag{1}$$

where $\beta_t \in (0, 1)$ is the noise scheduling coefficient controlling noise intensity at each step. Applying this transition step by step is inefficient during training. Leveraging the additivity of Gaussians, the noisy latent vector at any time step can instead be sampled directly from the initial state:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{2}$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ denotes the cumulative signal retention rate. We employ cosine-squared scheduling for $\beta_t$, enabling smooth $\bar{\alpha}_t$ decay during diffusion and avoiding the early-stage information loss of linear scheduling.
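A minimal NumPy sketch of the forward process in Equation (2) with a squared-cosine schedule follows. The exact schedule constants are not given in the text, so the offset (0.008) and the cap on the betas (0.999) are assumptions taken from the commonly used form of this schedule; the function names are illustrative.

```python
import numpy as np

def cosine_betas(T=1000, offset=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha-bar curve (constants are assumptions)."""
    def alpha_bar(u):
        return np.cos((u + offset) / (1 + offset) * np.pi / 2) ** 2
    t = np.arange(T)
    betas = 1.0 - alpha_bar((t + 1) / T) / alpha_bar(t / T)
    return np.minimum(betas, max_beta)

def q_sample(z0, t, alpha_bar, rng):
    """Equation (2): draw z_t directly from z_0 (t is a 0-based index)."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

betas = cosine_betas()
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal retention

rng = np.random.default_rng(0)
z0 = rng.standard_normal(64)          # stand-in for an encoded spectrum
zt, eps = q_sample(z0, t=499, alpha_bar=alpha_bar, rng=rng)
```

Because `alpha_bar` stays close to 1 for the first steps and decays smoothly afterwards, early latents keep most of their signal, which is the property the text attributes to the cosine-squared schedule.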

The backward process learns the inverse mapping of the forward process, progressively recovering the original latent vector from noise $z_T \sim \mathcal{N}(0, I)$. Guided by the condition $c$ (the chlorophyll concentration embedding), the backward transition probability is

$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\left(z_{t-1}; \mu_\theta(z_t, t, c), \sigma_t^2 I\right) \tag{3}$$

where the mean $\mu_\theta$ is parameterized by a neural network. Ho et al. [35] demonstrated that predicting the noise $\epsilon$ yields more stable training than directly predicting the mean. Thus, the training objective is the mean squared error of noise prediction:

$$L_{diff} = \mathbb{E}_{z_0, \epsilon, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|^2\right] \tag{4}$$
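The text does not spell out how the mean in Equation (3) is obtained from the predicted noise. The sketch below assumes the standard DDPM parameterization, $\mu_\theta = (z_t - \beta_t \epsilon_\theta / \sqrt{1 - \bar{\alpha}_t}) / \sqrt{\alpha_t}$, and uses a toy linear schedule purely to stay self-contained; both choices are assumptions rather than details confirmed by the source.

```python
import numpy as np

def posterior_mean(zt, eps_pred, t, betas, alpha_bar):
    """DDPM mean computed from predicted noise; t is a 0-based step index."""
    alpha_t = 1.0 - betas[t]
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    return (zt - coef * eps_pred) / np.sqrt(alpha_t)

def noise_mse(eps, eps_pred):
    """Equation (4) for one sample: squared L2 norm of the noise error."""
    return float(np.sum((eps - eps_pred) ** 2))

# Toy linear schedule, only to make the example self-contained.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
z0 = rng.standard_normal(64)
eps = rng.standard_normal(64)

# At the first step, a perfect noise prediction recovers z_0.
z1 = np.sqrt(alpha_bar[0]) * z0 + np.sqrt(1.0 - alpha_bar[0]) * eps
mu = posterior_mean(z1, eps, 0, betas, alpha_bar)
```

The first-step check illustrates why minimizing the noise error in Equation (4) is sufficient: an accurate noise estimate determines the denoised mean.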

The denoising network $\epsilon_\theta$ uses a 1D U-Net architecture with downsampling, bottleneck, and upsampling paths, totaling six residual blocks. To perceive the current denoising stage, time step $t$ is mapped to an embedding $t_{emb}$ via sinusoidal position encoding. The conditional embedding $c$ and $t_{emb}$ undergo independent linear projections before being added to each residual block’s intermediate features:

$$h' = h + \mathrm{MLP}_t(t_{emb}) + \mathrm{MLP}_c(c) \tag{5}$$

This injection propagates temporal and conditional information uniformly across network layers. During inference, starting from $z_T \sim \mathcal{N}(0, I)$, the DDPM sampler performs 50 denoising steps. To enhance conditional control, we introduce classifier-free guidance:

$$\hat{\epsilon} = \epsilon_\theta(z_t, t, \varnothing) + s \cdot \left(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing)\right) \tag{6}$$

where $s = 2.0$ is the guidance scale and $\varnothing$ denotes the empty condition. Finally, the denoised $z_0$ is mapped to the spectral space via the VAE decoder to obtain the generated Rrs spectrum.
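Two pieces of this subsection are easy to illustrate in isolation: the sinusoidal time-step embedding and the guidance rule of Equation (6). The embedding width (128) and the frequency base (10000) below are assumptions, since the text does not state them.

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Sinusoidal embedding of a scalar time step (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def guided_noise(eps_uncond, eps_cond, s=2.0):
    """Equation (6): move from the unconditional toward the conditional prediction."""
    return eps_uncond + s * (eps_cond - eps_uncond)

emb = timestep_embedding(t=10)

eps_u = np.array([0.0, 1.0])   # toy unconditional prediction
eps_c = np.array([1.0, 1.0])   # toy conditional prediction
eps_hat = guided_noise(eps_u, eps_c, s=2.0)
```

With `s = 1` the rule returns the conditional prediction unchanged, and with `s > 1` it extrapolates past it, which is how the guidance scale strengthens the influence of the chlorophyll condition.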

3.3. Lightweight Spectral VAE

In latent diffusion, the VAE compresses high-dimensional spectra into a low-dimensional space. However, standard VAEs use single-scale convolutional kernels, which limits the capture of multi-scale features such as narrow absorption peaks and broad reflection plateaus. To address this, we propose a lightweight VAE for spectral data that introduces multi-kernel parallel convolutional blocks matched to the multi-scale characteristics of Rrs spectra.

As Figure 4 shows, the encoder compresses 551-dimensional spectra into a 64-dimensional latent space using four cascaded multi-kernel blocks with downsampling. Each multi-kernel block contains four parallel branches with kernel sizes $k \in \{1, 3, 5, 7\}$ to capture information ranging from point-wise features to broad plateaus. Branch outputs are concatenated and fused via a $1 \times 1$ convolution. The encoder reduces the spectral length as 551 → 276 → 138 → 69. After flattening, it produces the mean $\mu \in \mathbb{R}^{64}$ and log variance $\log \sigma^2 \in \mathbb{R}^{64}$, with the latent vector sampled via reparameterization:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{7}$$
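The length sequence 551 → 276 → 138 → 69 is consistent with three stride-2 convolutions using kernel size 3 and padding 1. Those layer settings are an assumption (the text gives only the lengths); the sketch below checks them with the standard output-length formula and shows the reparameterization step of Equation (7).

```python
import numpy as np

def conv_out_len(length, kernel=3, stride=2, padding=1):
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * padding - kernel) // stride + 1

def reparameterize(mu, log_var, rng):
    """Equation (7): z = mu + sigma * eps, with sigma = exp(0.5 * log_var)."""
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * rng.standard_normal(mu.shape)

lengths = [551]
for _ in range(3):                      # three downsampling steps
    lengths.append(conv_out_len(lengths[-1]))
print(lengths)  # [551, 276, 138, 69]

rng = np.random.default_rng(0)
mu = np.zeros(64)
log_var = np.full(64, -20.0)            # near-zero variance for illustration
z = reparameterize(mu, log_var, rng)    # stays very close to mu
```

Sampling through `mu + sigma * eps` keeps the draw differentiable with respect to `mu` and `log_var`, which is what allows the encoder to be trained by backpropagation.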

The decoder uses a symmetric architecture, reconstructing latent vectors into 551-dimensional spectral representations $\hat{R}_{rs}$ using three upsampling layers and multi-kernel blocks. The training objective is

$$L_{VAE} = \left\| R_{rs} - \hat{R}_{rs} \right\|^2 + \beta \cdot D_{KL}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\right) \tag{8}$$

where $\beta$ is the KL weight, distinct from the noise schedule $\beta_t$ in Equation (1).

To mitigate posterior collapse, we employ KL annealing: $\beta$ increases linearly from 0 to 0.001 over the first 20 epochs, so that the model learns effective reconstruction before the KL term takes effect. The lightweight VAE captures multi-scale Rrs spectral features through multi-kernel convolutions, providing 8.6:1 compression for latent diffusion.
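A minimal sketch of the annealed objective in Equation (8) is shown below. The closed-form KL term for a diagonal Gaussian is standard; how epochs are counted (0-based, with the full weight reached at epoch 20) is an assumption, and the function names are illustrative.

```python
import numpy as np

def kl_weight(epoch, warmup_epochs=20, beta_max=1e-3):
    """Linear KL annealing: 0 at epoch 0, beta_max from warmup_epochs onward."""
    return beta_max * min(epoch / warmup_epochs, 1.0)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

def vae_loss(x, x_hat, mu, log_var, epoch):
    """Equation (8): squared reconstruction error plus the weighted KL term."""
    recon = float(np.sum((x - x_hat) ** 2))
    return recon + kl_weight(epoch) * kl_to_standard_normal(mu, log_var)

x = np.linspace(0.0, 0.01, 551)         # toy spectrum
loss_early = vae_loss(x, x * 0.9, np.ones(64), np.zeros(64), epoch=0)
loss_late = vae_loss(x, x * 0.9, np.ones(64), np.zeros(64), epoch=20)
```

At epoch 0 the loss reduces to pure reconstruction error, so the encoder is free to use the latent space before the KL penalty starts pulling the posterior toward the prior.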

3.4. Waveband Selective Attention

Diffusion model quality depends on precise noise prediction by the denoising network. However, standard denoising networks apply uniform convolutions across all dimensions, implicitly assuming that every dimension contributes equally. This conflicts with domain knowledge about spectral generation: chlorophyll exhibits wavelength-selective Rrs modulation, with concentration variations primarily manifesting in the 440 nm blue absorption peak, 550 nm green reflection peak, 680 nm red absorption peak, and 700–750 nm near-infrared scattering region. Processing all bands uniformly limits the capture of correlations between these critical bands and chlorophyll concentration, compromising the physical consistency of the generated spectra. To address this, we design a band-selective attention module embedded in the U-Net residual blocks, guiding the network to focus on chlorophyll-sensitive bands during denoising.

As Figure 5 shows, this module receives the FiLM-modulated latent feature $h \in \mathbb{R}^{C}$ as input. First, soft Gaussian masks $M_i$ are constructed for the four chlorophyll-sensitive bands, extracting the corresponding features via element-wise multiplication:

$$f_i = M_i \odot h, \quad i \in \{1, 2, 3, 4\} \tag{9}$$

Soft Gaussian masks center on each band’s wavelength with smooth boundary decay instead of hard clipping, preventing discontinuities. Learnable importance weights $\alpha_i$ are applied to each band feature, enabling adaptive adjustment of band contributions:

$$f_i' = \alpha_i \cdot f_i, \quad \sum_{i=1}^{4} \alpha_i = 1 \tag{10}$$

Weighted band features are concatenated and input to a 4-head multi-head attention layer for cross-band feature interaction:

$$\mathrm{Attended} = \mathrm{MHA}(Q, K, V), \quad Q = K = V = [f_1', f_2', f_3', f_4'] \tag{11}$$

Finally, the attention output is fused with the original features via adaptive gating. The gate signal $g$ is generated by an MLP and controls the mixing ratio:

$$\mathrm{out} = g \odot \mathrm{Attended} + (1 - g) \odot h \tag{12}$$

This gating dynamically balances attention enhancement against preservation of the original features. The band-selective attention module embeds chlorophyll bio-optical priors into denoising, allowing the network to prioritize spectral regions correlated with chlorophyll concentration, which improves the band-specific response and physical plausibility of the generated spectra.
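The data flow of Equations (9) to (12) can be sketched in NumPy as follows. Several simplifications are assumptions made for readability: the features are indexed directly by wavelength (the text applies the masks to latent features without stating how positions map to wavelengths), the mask widths are hypothetical, the 700–750 nm region is represented by a single mask centered at 725 nm, attention is single-head without learned projections, and the gate is a fixed scalar instead of an MLP output.

```python
import numpy as np

WAVELENGTHS = np.arange(350, 901)                  # 551 bands, 1 nm spacing
CENTERS = np.array([440.0, 550.0, 680.0, 725.0])   # 725 nm stands in for 700-750 nm
WIDTHS = np.array([15.0, 15.0, 15.0, 25.0])        # hypothetical mask widths (nm)

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def gaussian_masks(wl, centers, widths):
    """One soft mask per band, shape (4, len(wl)), equal to 1 at each center."""
    return np.exp(-0.5 * ((wl[None, :] - centers[:, None]) / widths[:, None]) ** 2)

def band_selective_attention(h, alpha_logits, gate):
    masks = gaussian_masks(WAVELENGTHS, CENTERS, WIDTHS)
    f = masks * h[None, :]                         # Eq. (9): f_i = M_i * h
    alpha = softmax(alpha_logits)                  # Eq. (10): weights sum to 1
    f = alpha[:, None] * f
    # Eq. (11), reduced to single-head self-attention over the 4 band tokens.
    scores = f @ f.T / np.sqrt(f.shape[1])
    attended = (softmax(scores, axis=-1) @ f).sum(axis=0)  # merge tokens (assumption)
    return gate * attended + (1.0 - gate) * h      # Eq. (12): gated fusion

rng = np.random.default_rng(0)
h = rng.standard_normal(WAVELENGTHS.size)
out = band_selective_attention(h, alpha_logits=np.zeros(4), gate=0.5)
```

Setting `gate=0.0` returns the input unchanged, which shows how the gate lets the network fall back to the original features when the attended signal is not useful.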

3.5. Physics-Guided Conditioning

The diffusion model feeds the chlorophyll concentration into the denoising network as a conditional embedding $c$. Standard approaches use an MLP to encode the scalar chlorophyll concentration directly into an embedding vector, treating conditional encoding as a purely data-driven mapping. However, according to the bio-optical model framework [38], remote sensing reflectance is governed by $R_{rs} \propto b_b / (a + b_b)$, where $a$ and $b_b$ denote the total absorption and backscattering coefficients. Chlorophyll-a modulates these inherent optical properties in a concentration-dependent and nonlinear manner.
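As a small worked example of the proportionality above, the ratio $b_b / (a + b_b)$ can be evaluated for a fixed backscattering coefficient and increasing absorption. The coefficient values are purely illustrative and are not taken from the paper.

```python
def reflectance_ratio(a, bb):
    """u = b_b / (a + b_b); R_rs is proportional to u in the model cited in the text."""
    return bb / (a + bb)

bb = 0.01                          # backscattering coefficient, 1/m (illustrative)
absorptions = (0.05, 0.2, 0.8)     # total absorption, 1/m (illustrative)
ratios = [reflectance_ratio(a, bb) for a in absorptions]

for a, u in zip(absorptions, ratios):
    print(f"a = {a:.2f}  ->  u = {u:.4f}")
```

The ratio falls as absorption grows, and it does so nonlinearly, which is one reason a single linear mapping from concentration to spectrum is a poor fit.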

Specifically, different concentration regimes display qualitatively distinct spectral signatures: low concentrations (<5 mg/m³) are dominated by molecular scattering with minimal pigment absorption, producing strong blue reflection; intermediate concentrations (5–50 mg/m³) are influenced by chlorophyll-specific absorption $a_{ph}^{*}(\lambda)$ at 440 and 680 nm interacting with cellular backscattering, producing green reflection peaks; and high concentrations (>50 mg/m³) exhibit saturated red-band absorption and intense near-infrared scattering. A simple MLP encoder maps all concentration values through a single shared set of features and therefore cannot represent this segmented pattern of bio-optical responses.

To address this, we propose a physics-guided conditional encoder that incorporates a bio-optical prior via zone-wise encoding of conditional representations. As Figure 6 illustrates, this encoder uses independent subnetworks for the low, medium, and high concentration intervals. The three subnetworks share the same two-layer MLP architecture but have independent parameters, so that each learns the spectral response patterns of its own interval, producing encoded vectors $e_{low}$, $e_{mid}$, and $e_{high}$.

Soft-layered weights are produced by a weight predictor from the input chlorophyll concentration $c$:

$$w = [w_{low}, w_{mid}, w_{high}] = \mathrm{Softmax}(f_w(c)) \tag{13}$$

where $f_w$ is a lightweight MLP. Soft-layered weights have two benefits over hard-threshold segmentation: they allow continuous gradient propagation because they introduce no non-differentiable boundaries, and they can activate the encoders of adjacent intervals simultaneously, which yields smooth feature transitions. The subnetwork outputs are combined using the soft weights:

$$e_{physics} = w_{low} \cdot e_{low} + w_{mid} \cdot e_{mid} + w_{high} \cdot e_{high} \tag{14}$$

To retain a data-driven representation, the encoder also has a base MLP branch that encodes the concentration directly, outputting $e_{base}$. Finally, the conditional embedding combines the physics-guided and base features through a fusion layer:

$$c_{fused} = \mathrm{MLP}_{fusion}([e_{physics}; e_{base}]) \tag{15}$$

where $[\cdot; \cdot]$ denotes concatenation. The fused conditional embedding $c_{fused} \in \mathbb{R}^{64}$ is injected into the residual blocks of the denoising network through FiLM modulation, which applies an affine transformation to the intermediate features: $\gamma(c) \odot h + \beta(c)$. This encoder encodes both the concentration value and the associated bio-optical response patterns, yielding semantically richer conditional guidance. By producing regime-partitioned response features, it bridges the gap between a scalar concentration input and the complex spectral modulation that concentration induces, achieving more precise conditional control and better physical plausibility of the generated spectra.
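The structure of Equations (13) to (15) and the FiLM step can be sketched with untrained, randomly initialized layers. Everything beyond that structure is an assumption made for illustration: the hidden sizes, the tanh activation, the log10 scaling of the concentration input, and the helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 64

def mlp2(in_dim, hidden, out_dim):
    """Randomly initialised two-layer MLP (untrained), returned as a function."""
    w1 = rng.standard_normal((in_dim, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, out_dim)) * 0.1
    return lambda x: np.tanh(x @ w1) @ w2

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

enc_low, enc_mid, enc_high = (mlp2(1, 32, EMB) for _ in range(3))
enc_base = mlp2(1, 32, EMB)
weight_net = mlp2(1, 16, 3)          # f_w in Eq. (13)
fusion = mlp2(2 * EMB, 128, EMB)     # MLP_fusion in Eq. (15)
film = mlp2(EMB, 128, 2 * EMB)       # produces gamma(c) and beta(c)

def encode_condition(chl):
    x = np.array([np.log10(chl)])    # log scaling is an assumption
    w = softmax(weight_net(x))                                            # Eq. (13)
    e_phys = w[0] * enc_low(x) + w[1] * enc_mid(x) + w[2] * enc_high(x)   # Eq. (14)
    c_fused = fusion(np.concatenate([e_phys, enc_base(x)]))               # Eq. (15)
    return w, c_fused

def film_modulate(h, c_fused):
    """FiLM: feature-wise affine transform gamma(c) * h + beta(c)."""
    gamma, beta = np.split(film(c_fused), 2)
    return gamma * h + beta

w, c_fused = encode_condition(25.0)  # 25 mg/m^3, inside the medium interval
h_mod = film_modulate(rng.standard_normal(EMB), c_fused)
```

Because the three regime weights come from a softmax, a concentration near an interval boundary can draw on two subnetworks at once instead of switching abruptly between them.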

3.6. Training Strategy

The framework employs two-stage training for effective component learning. As Figure 7 shows, the first stage trains a lightweight spectral VAE for Rrs spectral compression and reconstruction. This stage uses the AdamW optimizer with a learning rate of $5 \times 10^{-4}$ and trains for 100 epochs. The KL divergence weight $\beta$ linearly increases from 0 to 0.001 over the first 20 epochs to mitigate posterior collapse. After training, the VAE encoder and decoder weights are frozen, providing a stable latent space for diffusion.

The second stage trains a conditional diffusion model in the frozen latent space. The VAE encoder encodes real spectra into latent vectors $z$, which undergo standard deviation normalization for numerical stability. Forward diffusion adds noise to normalized latent vectors, while the denoising network learns to predict noise guided by temporal and conditional embeddings. This stage uses the AdamW optimizer with a learning rate of $5 \times 10^{-5}$, training for 200 epochs with cosine annealing and 500 warm-up steps. For classifier-free guidance, conditions are set to zero with 10% probability during training, enabling both conditional and unconditional noise prediction.
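The two training-time mechanisms in this paragraph, the warm-up plus cosine learning-rate schedule and the 10% condition dropout, can be sketched with the standard library alone. The total number of optimizer steps is not stated in the text, so `total` below is hypothetical, as are the function names; a linear warm-up shape is also an assumption.

```python
import math
import random

def learning_rate(step, total_steps, base_lr=5e-5, warmup_steps=500):
    """Linear warm-up to base_lr, then cosine annealing to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def maybe_drop_condition(cond, p_drop=0.1, rng=random):
    """Replace the condition by zeros with probability p_drop (the empty condition)."""
    return [0.0] * len(cond) if rng.random() < p_drop else cond

total = 5000  # hypothetical number of optimizer steps
schedule = [learning_rate(s, total) for s in range(total + 1)]

rng = random.Random(0)
dropped = sum(maybe_drop_condition([1.0], rng=rng) == [0.0] for _ in range(10000))
```

Training on zeroed conditions part of the time is what gives the network the unconditional prediction that classifier-free guidance needs at inference.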

During inference, the target chlorophyll concentration is normalized and passed through the physics-guided encoder to produce the conditional embedding $c$. Latent sampling starts from Gaussian noise $z_{50} \sim \mathcal{N}(0, I)$ and undergoes 50-step iterative denoising via the DDPM sampler. At each step, classifier-free guidance enhances the conditional response, using the same rule as Equation (6):

$$\hat{\epsilon} = \epsilon_\theta(z_t, t, \varnothing) + s \cdot \left(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing)\right) \tag{16}$$

where the guidance scale is $s = 2.0$. After denoising, the latent vector is denormalized, mapped to the spectral space via the frozen VAE decoder, and spectrally denormalized to obtain the generated Rrs spectrum. Inference takes approximately 2 s per sample on a single GPU and 8 s for 64-sample batch generation, occupying 2.1 GB GPU memory.

3.7. Implementation Details

The model is implemented in PyTorch 2.0 and trained on a single NVIDIA A100 GPU (40 GB). Stage 1 (VAE pre-training) uses batch size 128 and converges in approximately 3 h. Stage 2 (diffusion training) uses batch size 64 with gradient accumulation over 2 steps, requiring approximately 12 h for 200 epochs. The VAE encoder–decoder contains 4 multi-kernel convolutional blocks with channel dimensions [64, 128, 256, 512]. The U-Net denoiser consists of 6 residual blocks at each resolution level with base channel dimension 128. Total model parameters: VAE 2.3 M and diffusion network 15.7 M. The diffusion process employs 1000 time steps during training but only 50 steps during inference via DDPM sampling. Physics loss weight in the conditional encoder is set to $\lambda_{physics} = 0.5$. All experiments use the same random seed (42) for reproducibility. The code and pre-trained models will be made publicly available upon publication.

4. Experimental Results

4.1. Datasets and Experimental Setup

4.1.1. Datasets

We validate PCLA on 341 samples from GLORIA [39], a globally representative hyperspectral dataset spanning 450 water bodies (Table 1). Spectral data covers 350–900 nm (551 bands, 1 nm intervals), with Chl-a ranging within 0.65–201 mg/m³ (median: 25.57), encompassing oligotrophic to hyper-eutrophic conditions. The sample distribution is: <5 mg/m³ (14.1%), 5–25 mg/m³ (34.9%), 25–50 mg/m³ (27.3%), and >50 mg/m³ (23.7%).

The dataset reflects realistic field challenges: 19.64% missing bands (linearly interpolated) and 7.0% trace negatives ($|R_{rs}| < 0.001$, retained for realism). Preprocessing includes Z-score normalization, GLORIA_ID matching, and exclusion of samples with missing Chl-a.
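The two numerical preprocessing steps named above, linear interpolation of missing bands and Z-score normalization, might look as follows in NumPy. Whether the authors normalize per band or globally is not stated, so the per-band choice here is an assumption, and the synthetic spectrum is only for demonstration.

```python
import numpy as np

def fill_missing_bands(wavelengths, spectrum):
    """Linearly interpolate NaN entries from the valid neighbouring bands.

    Gaps at either end are filled with the nearest valid value (np.interp behaviour).
    """
    valid = ~np.isnan(spectrum)
    return np.interp(wavelengths, wavelengths[valid], spectrum[valid])

def zscore(x, mean=None, std=None):
    """Z-score normalisation per column; pass training statistics to reuse them."""
    mean = x.mean(axis=0) if mean is None else mean
    std = x.std(axis=0) if std is None else std
    return (x - mean) / (std + 1e-8), mean, std

wl = np.arange(350, 901, dtype=float)
spec = 0.002 + 1e-5 * (wl - 350)      # synthetic linear spectrum
spec[100:110] = np.nan                # simulate a gap of missing bands
filled = fill_missing_bands(wl, spec)

batch = np.stack([filled, 2.0 * filled, 3.0 * filled])
normed, mean, std = zscore(batch)
```

Returning the mean and standard deviation makes it possible to invert the normalization after generation, which the inference procedure in Section 3.6 requires.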

4.1.2. Evaluation Metrics

We assess generation quality via spectral similarity, physical plausibility, and distribution matching.

Spectral Similarity: Spectral Angle Mapper (SAM) measures angular divergence:

$$\mathrm{SAM}(x_{gen}, x_{real}) = \arccos\left(\frac{x_{gen} \cdot x_{real}}{\|x_{gen}\| \, \|x_{real}\|}\right) \tag{17}$$

Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) measure point-wise deviations:

$$\mathrm{RMSE} = \sqrt{\frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \left(x_{ij}^{gen} - x_{ij}^{real}\right)^2}, \quad \mathrm{MAE} = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \left|x_{ij}^{gen} - x_{ij}^{real}\right| \tag{18}$$

Pearson correlation evaluates trend consistency:

$$r = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{Cov}(x_i^{gen}, x_i^{real})}{\sigma(x_i^{gen}) \, \sigma(x_i^{real})} \tag{19}$$
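Equations (17) to (19) translate directly into NumPy. The sketch below uses random toy spectra and a scaled copy as the "generated" set to show that SAM and Pearson correlation ignore a pure amplitude change while RMSE and MAE do not; the function names are illustrative.

```python
import numpy as np

def sam(x_gen, x_real):
    """Equation (17): spectral angle in radians between two spectra."""
    cos = np.dot(x_gen, x_real) / (np.linalg.norm(x_gen) * np.linalg.norm(x_real))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def rmse(gen, real):
    """Equation (18), left: root mean square error over all samples and bands."""
    return float(np.sqrt(np.mean((gen - real) ** 2)))

def mae(gen, real):
    """Equation (18), right: mean absolute error over all samples and bands."""
    return float(np.mean(np.abs(gen - real)))

def mean_pearson(gen, real):
    """Equation (19): Pearson r per spectrum pair, averaged over the N pairs."""
    return float(np.mean([np.corrcoef(g, r)[0, 1] for g, r in zip(gen, real)]))

rng = np.random.default_rng(0)
real = rng.random((5, 551)) * 0.02    # toy "real" spectra
gen = 2.0 * real                      # same shape, doubled amplitude

angle = sam(gen[0], real[0])
corr = mean_pearson(gen, real)
err = rmse(gen, real)
```

This is why the paper reports shape-oriented and point-wise metrics side by side: a model can match spectral shape well while still being off in magnitude, and the reverse.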

Distribution Matching: Fréchet distance evaluates distribution divergence:

$$\mathrm{FD} = \left\| \mu_{gen} - \mu_{real} \right\|^2 + \mathrm{Tr}\left(\Sigma_{gen} + \Sigma_{real} - 2 \left(\Sigma_{gen} \Sigma_{real}\right)^{1/2}\right) \tag{20}$$
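Equation (20) can be computed without an explicit matrix square root, because the trace of $(\Sigma_{gen} \Sigma_{real})^{1/2}$ equals the sum of the square roots of the eigenvalues of $\Sigma_{gen} \Sigma_{real}$. The sketch below uses that identity. The text does not say in which feature space the authors fit the Gaussians, so applying it to raw sample vectors is an assumption.

```python
import numpy as np

def frechet_distance(gen, real):
    """Equation (20) between Gaussians fitted to two sample sets (rows = samples)."""
    mu_g, mu_r = gen.mean(axis=0), real.mean(axis=0)
    cov_g = np.cov(gen, rowvar=False)
    cov_r = np.cov(real, rowvar=False)
    # Tr((cov_g cov_r)^(1/2)) is the sum of square roots of the eigenvalues of
    # cov_g @ cov_r, which are real and non-negative for PSD inputs.
    eig = np.linalg.eigvals(cov_g @ cov_r)
    tr_sqrt = np.sum(np.sqrt(np.clip(eig.real, 0.0, None)))
    mean_term = np.sum((mu_g - mu_r) ** 2)
    return float(mean_term + np.trace(cov_g) + np.trace(cov_r) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))
fd_same = frechet_distance(real, real)           # identical sets
fd_shift = frechet_distance(real + 3.0, real)    # mean shifted by 3 in all 8 dims
```

A pure mean shift leaves the covariance terms untouched, so `fd_shift` reduces to the squared distance between the means (8 × 3² = 72 here).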

Physical Plausibility: We use peak wavelength error ($|\lambda_{peak}^{gen} - \lambda_{peak}^{real}|$), shape similarity (first-order derivative correlation), positive value ratio ($R_{rs} \geq 0$), reasonable range ratio ($-0.01 \leq R_{rs} \leq 0.15$), and smoothness ($1 / (1 + \mathrm{mean}(|\nabla^2 x|))$).
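The plausibility metrics listed above can be sketched per spectrum as follows. Finite differences stand in for the derivative and for $\nabla^2$, and the peak is taken as the global maximum; both are assumptions about details the text leaves open.

```python
import numpy as np

def peak_wavelength_error(gen, real, wavelengths):
    """Absolute difference between the wavelengths of the global maxima."""
    return float(abs(wavelengths[np.argmax(gen)] - wavelengths[np.argmax(real)]))

def shape_similarity(gen, real):
    """Pearson correlation between first-order differences."""
    return float(np.corrcoef(np.diff(gen), np.diff(real))[0, 1])

def positive_ratio(gen):
    """Fraction of bands with R_rs >= 0."""
    return float(np.mean(gen >= 0))

def range_ratio(gen, lo=-0.01, hi=0.15):
    """Fraction of bands with lo <= R_rs <= hi."""
    return float(np.mean((gen >= lo) & (gen <= hi)))

def smoothness(gen):
    """1 / (1 + mean |second difference|); equals 1 for a straight line."""
    return 1.0 / (1.0 + float(np.mean(np.abs(np.diff(gen, n=2)))))

wl = np.arange(350, 901)
real = 0.01 * np.exp(-0.5 * ((wl - 560) / 40.0) ** 2)   # toy peak at 560 nm
gen = 0.01 * np.exp(-0.5 * ((wl - 565) / 40.0) ** 2)    # toy peak at 565 nm

peak_err = peak_wavelength_error(gen, real, wl)
shape = shape_similarity(gen, real)
```

Shape similarity compares derivatives rather than raw values, so it rewards spectra whose peaks and troughs line up with the reference even when the absolute level differs.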

4.2. Synthetic Data Quality

We compare PCLA against eight conditional generative models (Table 2): GAN-based (conditional GAN, WGAN-GP, and CSN-GAN), VAE-based (CVAE), diffusion-based (CDDPM and CDDIM), and others (C-Transformer and C-RealNVP). All models are trained with identical GLORIA preprocessing with five-run averaging.

PCLA achieves the best shape similarity (0.535), diversity (0.072), and Fréchet distance (0.0008), outperforming the next best model by 21.9%, 16.1%, and 33.3%, respectively. Band-selective attention enables precise modeling of the chlorophyll-a-sensitive bands (440, 550, 680, and 700–750 nm), while physics-constrained guidance prevents mode collapse. CVAE excels in point-wise metrics (SAM 0.183, RMSE 0.0025) but suffers from poor diversity (0.012), producing overly conservative samples. GANs reach shape similarities of only 0.02–0.17, struggling with high-dimensional spectral details. Diffusion models (CDDPM/CDDIM) offer balanced performance but lag behind PCLA in shape similarity (0.03) and diversity (0.05–0.06).

4.3. Ablation Studies

We progressively remove latent diffusion (VAE), band-selective attention, and physics-constrained guidance to validate component contributions (Table 3).

Removing the VAE degrades shape similarity by 95% (0.509 → 0.025) and worsens FD 6-fold (0.0008 → 0.0049), validating that 64-dimensional compression resolves high-dimensional challenges. Removing band-selective attention reduces shape similarity by 23% and increases peak error by 56%, confirming that targeted chlorophyll-band modeling enhances fidelity. Removing physics-guided conditioning degrades shape similarity by 25%, demonstrating that concentration-segmented priors ensure consistent generation across nutrient regimes. The full model outperforms the baseline roughly 20-fold in shape similarity (0.509 vs. 0.024) and reduces FD by 83%.

4.4. Model Sensitivity Analysis

We perform a controlled-variable analysis on nine hyperparameters (Table 4), comprising 171 experiments (60 single-factor and 111 two-factor interaction runs) evaluated via RMSE, correlation, and SAM.

Sensitivity score quantifies parameter impact:

S = 100 \times Δ R + 50 \times Δ C + 200 \times Δ A + 10 \times Δ D

(21)

where $Δ R$, $Δ C$, $Δ A$, and $Δ D$ denote the variations in RMSE, correlation, SAM, and diversity, respectively. SAM receives the highest weight as it directly measures spectral shape.
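
As a worked check of Equation (21), the sketch below recomputes the score from the variations reported in Table 5. Table 5 does not list the diversity variation, so it is set to zero here as an assumption; under that assumption the computed scores agree with the tabulated ones to within the rounding of the reported deltas. The function name `sensitivity_score` is hypothetical.

```python
def sensitivity_score(d_rmse, d_corr, d_sam, d_div=0.0):
    """Weighted sensitivity score of Equation (21)."""
    return 100.0 * d_rmse + 50.0 * d_corr + 200.0 * d_sam + 10.0 * d_div

# (delta RMSE, delta Corr., delta SAM) for the three top-ranked rows of Table 5.
# The diversity variation is not tabulated and is assumed to be 0 here.
table5_rows = {
    "vae_lr": (0.0021, 0.3738, 0.3290),
    "diffusion_lr": (0.0011, 0.1599, 0.1601),
    "latent_dim": (0.0015, 0.2099, 0.1383),
}
scores = {name: sensitivity_score(*deltas) for name, deltas in table5_rows.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The weights make the SAM term dominant: for vae_lr it contributes 65.8 of the total, against 18.69 from correlation and 0.21 from RMSE.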

Key findings (Table 5, Figure 8): vae_lr = 0.0005 provides optimal stability; latent_dim = 64 achieves the lowest RMSE (0.0031) at 8.6:1 compression; and diffusion_lr shows tolerance near the baseline ($5 \times 10^{-5}$). Low-to-medium sensitivity parameters (batch_size, epochs) exhibit robustness ($Δ$RMSE < 0.001). The squaredcos_cap_v2 scheduler yields the best correlation (0.785), aligning with diffusion theory.
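
The identifier squaredcos_cap_v2 matches the name of a beta-schedule option in the Hugging Face diffusers schedulers, where it denotes a squared-cosine schedule with betas capped at 0.999. Assuming that this is the schedule meant here, it can be sketched in NumPy as follows. The function name is hypothetical, and the choice of 1000 training steps is an assumption inferred from the timestep range mentioned in Section 4.6.

```python
import math
import numpy as np

def squaredcos_cap_v2_betas(num_steps, max_beta=0.999):
    """Squared-cosine beta schedule with a cap on the largest beta."""
    def alpha_bar(t):
        # Cumulative signal level at continuous time t in [0, 1].
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        # beta_i = 1 - alpha_bar(t2) / alpha_bar(t1), capped for stability.
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas)

betas = squaredcos_cap_v2_betas(1000)
alphas_cumprod = np.cumprod(1.0 - betas)
print(betas[0], betas[-1])                   # small first beta, last beta capped at 0.999
print(alphas_cumprod[0], alphas_cumprod[-1]) # signal level decays from about 1 toward 0
```

Compared with a linear schedule, this form adds noise more gently at the start and end of the forward process, which is one possible reason for its favorable correlation in the sensitivity study.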

Figure 9 reveals an interaction between latent_dim and diffusion_epochs: latent_dim = 32 achieves the minimum RMSE at epochs = 150 and the maximum correlation (0.862) at epochs = 100, demonstrating that 17.2:1 compression maintains quality. Higher dimensions require longer training to compensate for redundancy. The valley pattern on the 3D surface (Figure 9d) indicates that the optimal region lies at low dimensions and medium epochs, consistent with dual-boundary constraints. This validates that the 32-dimensional latent space captures the essential chlorophyll bands (440/550/680 nm), enabling 17× lighter models for edge deployment.
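
The compression ratios quoted in this section follow directly from the 551 spectral bands of the dataset (Table 1) divided by the latent dimension. A short arithmetic check, with variable names chosen for this example only:

```python
N_BANDS = 551  # spectral bands per sample (Table 1)

# Compression ratio for the two latent sizes discussed in the text.
ratios = {latent_dim: N_BANDS / latent_dim for latent_dim in (32, 64)}
for latent_dim, ratio in ratios.items():
    print(f"latent_dim = {latent_dim}: {ratio:.1f}:1")
```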

4.5. Downstream Task Validation

We assess the value of the synthetic data through data augmentation. The GLORIA subset is split into 256 training and 85 test samples. Each generative model produces 400 synthetic samples (20 concentration points with 20 samples each). MobileNetV3-Small is trained on the original data (baseline) and on the augmented data, and evaluated on the test set (Table 6).
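
The sampling plan can be sketched as a vector of conditioning concentrations. Two points are assumptions of this example and are not stated in the text: the 20 concentration points are placed on a log-spaced grid over the Chl-a range of Table 1, and the augmented training set is the union of the real training samples and the synthetic samples.

```python
import numpy as np

N_POINTS, N_PER_POINT = 20, 20
CHL_MIN, CHL_MAX = 0.65, 201.0   # Chl-a range of the subset (Table 1), mg/m^3

# Assumption: log-spaced grid; the spacing of the points is not given in the text.
chl_grid = np.geomspace(CHL_MIN, CHL_MAX, N_POINTS)
# One conditioning value per synthetic sample: each grid point repeated 20 times.
chl_conditions = np.repeat(chl_grid, N_PER_POINT)

n_train_real, n_test = 256, 85
# Assumption: augmentation concatenates real and synthetic training samples.
n_train_augmented = n_train_real + chl_conditions.size
print(chl_conditions.shape, n_train_augmented)
```

Under these assumptions the regressor sees 656 training spectra instead of 256, while the 85 test spectra remain untouched real measurements.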

Augmentation with the full model achieves the largest improvement: $R^{2}$ increases to 0.9085 (+20.85%) and RMSE decreases to 12.77 mg/m³ (−39.29%), outperforming CVAE (+20.71% in $R^{2}$, 39.17% RMSE reduction) and DDPM (+20.35%, 38.07%). GANs show weaker gains (SN-GAN: +15.57%, 27.31%), and the Transformer shows the lowest (+2.90%, 4.50%). High generation quality does not guarantee augmentation effectiveness: despite its superior point-wise metrics, CVAE trails the full model because of its poor diversity (0.012 vs. 0.072), highlighting the importance of variability for downstream generalization.
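
The improvement percentages can be recomputed from the absolute values in Table 6. The sketch below does so for the full model; the helper name `relative_change` is hypothetical. Small differences in the last digit relative to the table are expected, because the inputs used here are themselves rounded.

```python
def relative_change(baseline, value):
    """Signed relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline

# Baseline and full-model values from Table 6.
r2_gain = relative_change(0.7518, 0.9085)          # positive: R^2 increases
rmse_reduction = -relative_change(21.03, 12.77)    # reported as a positive reduction
print(f"R2 gain: {r2_gain:.2f}%, RMSE reduction: {rmse_reduction:.2f}%")
```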

4.6. Comprehensive Evaluation of Generative Quality

Figure 10 validates spectral fidelity, physical plausibility, and downstream performance.

Generated spectra reproduce key bio-optical features (Figure 10a): 443 nm blue absorption, 560 nm green peak, 665 nm red absorption, and a 700–750 nm NIR plateau. Conditional generation (Figure 10b) responds to nutrient levels: NIR dominates at 1.0 mg/m³ and red absorption strengthens at 100.0 mg/m³. The latent space (Figure 10c) shows continuous real–synthetic overlap along Chl-a gradients, confirming that the 64-dimensional latent space captures the nonlinear spectrum–concentration mapping. Band-selective attention (Figure 10d) adapts to trophic state: oligotrophic waters prioritize NIR (weight 0.40) and eutrophic waters enhance red (0.45).

Diffusion denoising (Figure 10e) establishes the energy distribution early ($t = 999 \to 500$) and refines absorption and reflection features late ($t = 250 \to 0$). Peak wavelength (Figure 10f) and derivative distributions (Figure 10g) confirm smoothness ($| \nabla R_{rs} | < 2 \times 10^{-4}$) without artifacts. Downstream testing (Figure 10h) shows an $R^{2}$ improvement of 0.75 → 0.91 (+21%) and an RMSE reduction of 21.03 → 12.77 mg/m³ (−39%), outperforming CVAE/CGAN/CDDPM.

5. Conclusions

This study proposes PCLA, a physics-constrained latent diffusion framework addressing paired data scarcity in aquatic remote sensing. Validation on the GLORIA dataset demonstrates that synthetic Rrs-Chl-a pairs improve downstream Chl-a concentration retrieval via MobileNetV3-Small regression from $R^{2}$ = 0.75 to 0.91 with a 39% RMSE reduction, confirming the viability of generative data augmentation for small-sample water quality monitoring.

The key contribution lies in bridging mechanistic bio-optical models and data-driven deep learning. By explicitly encoding chlorophyll absorption–scattering principles into neural architectures through physics-guided conditioning and band-selective attention, the framework generates physically plausible spectra rather than purely statistical approximations. This hybrid approach offers practical advantages for operational agencies: the lightweight design (5.8 M parameters, 2.1 GB memory) enables edge deployment on satellite platforms for real-time processing, while synthetic data reduces field campaign costs, which typically range from USD 500 to 1000 per sample. Beyond chlorophyll estimation, the hierarchical conditional framework provides a template for other optically active parameters such as CDOM and TSM, supporting comprehensive inland water constituent monitoring.

Current limitations include geographic bias in training data toward temperate lakes, performance degradation at concentration extremes (oligotrophic < 2 mg/m³, hyper-eutrophic > 100 mg/m³) due to sample imbalance, and cross-sensor generalization challenges when adapting hyperspectral models to multispectral satellites. Computational requirements suit batch augmentation but may constrain real-time applications requiring sub-100 ms latency.

Future research should prioritize three directions. First, cross-dataset validation using independent archives (PACE and EMIT) will establish sensor-agnostic performance boundaries and identify failure modes across diverse optical water types. Second, extending to multiparameter joint generation can capture constituent correlations (e.g., CDOM-chlorophyll co-variation), moving toward holistic water quality synthesis. Third, integrating spatiotemporal conditioning with lake morphology and meteorological drivers could enable algal bloom forecasting with 3–7-day horizons, transitioning from reactive monitoring to proactive management. Developing Bayesian diffusion variants for uncertainty quantification would further support risk-aware decision-making in water safety assessments.

This work demonstrates that integrating domain knowledge into generative models can overcome data scarcity while maintaining physical consistency, a principle applicable beyond aquatic remote sensing to other data-limited Earth observation domains. As satellite constellations expand and operational water quality monitoring increasingly relies on deep learning, methods balancing computational efficiency with physical fidelity will be essential for translating remote sensing into actionable environmental intelligence supporting global water security.

Author Contributions

Conceptualization, J.L. (Jinming Liu) and H.Z.; methodology, J.L. (Jinming Liu) and H.Z.; software, J.L. (Jinming Liu) and J.H.; validation, J.H., H.W., Q.C., J.L. (Jiayi Liu) and C.W.; formal analysis, H.W. and Q.C.; investigation, J.L. (Jinming Liu), H.Z., J.L. (Jiayi Liu) and C.W.; resources, H.T. and Z.S.; data curation, J.L. (Jinming Liu), J.L. (Jiayi Liu) and C.W.; writing—original draft preparation, J.L. (Jinming Liu) and H.Z.; writing—review and editing, H.T. and Z.S.; visualization, J.L. (Jinming Liu) and J.L. (Jiayi Liu); supervision, H.T. and Z.S.; project administration, H.T. and Z.S.; funding acquisition, H.T. and Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Southern University of Science and Technology Research Project on Optical Intelligent Sensors and Intelligent Algorithms (Grant No. Y01422314), the Wuhan East Lake “Air-Space-Ground-Water” Integrated Observation Project (Grant No. K2542Z011), and the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2020B1515130001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available at https://doi.org/10.1594/PANGAEA.948492.

Conflicts of Interest

Author Jianlong Huang was employed by the company Anhua Ocean Intelligent Equipment Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Remote sensing reflectance ($R_{rs}$) data acquisition methods and challenges: (1) satellite platforms (Landsat-8/9 OLI, Sentinel-2 MSI, and MODIS/VIIRS) provide global coverage but require synchronized field Chl-a validation; (2) field instruments (TriOS RAMSES and ASD FieldSpec) yield high-quality data but are costly and spatially limited; and (3) emerging technologies (UAV micro-spectrometers and HydroColor app) offer flexibility but lack sufficient spectral resolution for deep learning. These economic, temporal, and spatial constraints create a fundamental data bottleneck for remote sensing-based water quality inversion.

Figure 2. Overall architecture of the two-stage latent diffusion framework. The first stage pre-trains a multi-kernel convolutional VAE for spectral compression. The second stage performs latent diffusion with frozen VAE weights, integrating a physics-guided conditional encoder and a band-selective attention-enhanced denoising network. During inference, conditional spectral generation is accomplished within 50 steps via classifier-free guidance.

Figure 3. Latent diffusion backbone network. The forward process progressively adds noise, while the reverse process iteratively recovers the latent vector through a U-Net denoising network under conditional guidance.

Figure 4. Lightweight spectral VAE architecture. (A) The spectral encoder compresses 551-dimensional spectra into a 64-dimensional latent space through multi-kernel convolutional blocks and downsampling. (B) Multi-kernel convolutional blocks capture multi-scale spectral features via parallel branches with kernel sizes of 1, 3, 5, and 7. (C) The spectral decoder adopts a symmetric structure for spectrum reconstruction. Orange modules represent our improved components.

Figure 5. Band-selective attention module architecture. Left: U-Net residual block structure with band-selective attention embedded after FiLM modulation. Right: Detailed module design, extracting features from four chlorophyll-sensitive bands via soft Gaussian masks, followed by importance weighting and multi-head attention interaction, with output fused through adaptive gating.

Figure 6. Physics-guided conditional encoder architecture. Left: Encoder structure showing chlorophyll concentration input to both a baseline encoder and three regime-specific encoders, with a weight predictor generating soft stratified weights for weighted fusion of regime encodings, which are then concatenated with a baseline encoding and processed through a fusion network to output the conditional embedding. Right: Conditional injection mechanism showing the conditional embedding injected into U-Net residual blocks via FiLM modulation.

Figure 7. Two-stage training and inference framework. (A) The first stage pre-trains the VAE to learn compressed representations and reconstructions of spectra. (B) The second stage freezes VAE weights and trains the conditional diffusion model in latent space, with conditional embeddings generated by the physics-guided encoder. (C) The inference pipeline starts from random noise and generates spectra under target concentration conditions through 50-step DDPM sampling with classifier-free guidance.

Figure 8. Single-factor sensitivity analysis: 9 critical hyperparameter performance curves (dual Y-axis: correlation and RMSE). ★ indicates baseline. (a–c) High, (d–f) medium, and (g–i) low sensitivity. Baselines are located in optimal/stable ranges.

Figure 9. Interaction effects between the latent dimension and diffusion training epochs. (a,b) Heatmaps showing RMSE and correlation variations, where the star (★) marks the optimal configuration (latent dimension = 32, diffusion epochs = 150) achieving the lowest RMSE; (c) line plots illustrating the effect of epochs at different latent dimensions; (d) 3D surface plot revealing the nonlinear trade-off landscape.

Figure 10. Comprehensive analysis of model generation quality. (a) Comparison between real and generated spectra (Chl-a = 10 mg/m³); (b) conditional generation under different chlorophyll concentrations; (c) t-SNE distribution in latent space; (d) band-selective attention heatmap; (e) diffusion denoising process; (f) peak wavelength distribution; (g) spectral first-order derivative distribution; (h) downstream model performance after data augmentation.

Table 1. Statistical information of the GLORIA dataset subset.

Item	Value
Number of samples	341
Number of spectral bands	551
Wavelength range (nm)	350–900
Wavelength resolution (nm)	1.0
Chl-a range (mg/m³)	0.65–201
Chl-a median (mg/m³)	25.57
Chl-a standard deviation (mg/m³)	40.96

Table 2. Comparison of generation quality across different models. Bold indicates the best performance for each metric.

Model	Params (M)	Time (s)	SAM (rad)	RMSE	MAE	Corr.	Peak Err. (nm)	Shape Sim.	Pos. Ratio	Reas. Range	Smooth.	Div.	FD
Full Model	5.82	48.65 ± 1.91	0.339 ± 0.017	0.0034 ± 0.0004	0.0023 ± 0.0003	0.762 ± 0.030	42.46 ± 8.53	0.535 ± 0.019	0.995 ± 0.003	1.000	1.000	0.072 ± 0.010	0.0008
Cond. GAN	1.67	11.09 ± 0.94	0.310 ± 0.007	0.0033 ± 0.0002	0.0023 ± 0.0001	0.827 ± 0.005	23.25 ± 4.51	0.131 ± 0.023	0.992 ± 0.004	1.000	1.000	0.052 ± 0.002	0.0013
CVAE	0.88	5.92 ± 0.17	0.183 ± 0.007	0.0025	0.0017	0.936 ± 0.004	11.51 ± 0.62	0.439 ± 0.027	1.000	1.000	1.000	0.012 ± 0.002	0.0028
WGAN-GP	1.67	47.51 ± 1.66	0.337 ± 0.023	0.0032 ± 0.0001	0.0022 ± 0.0001	0.797 ± 0.026	25.15 ± 3.60	0.168 ± 0.017	0.996 ± 0.001	1.000	1.000	0.055 ± 0.001	0.0012
CSN-GAN	1.67	10.57 ± 0.14	0.606 ± 0.005	0.0034	0.0025	0.576 ± 0.011	9.08 ± 0.16	0.020 ± 0.003	0.946 ± 0.007	1.000	0.999	0.049 ± 0.003	0.0062
CDDPM	1.09	6.35 ± 0.30	0.501 ± 0.003	0.0031	0.0022	0.665 ± 0.005	10.16 ± 0.47	0.029 ± 0.002	0.982 ± 0.001	1.000	0.999	0.058	0.0047
CDDIM	1.09	6.55 ± 0.20	0.479 ± 0.002	0.0031	0.0021	0.685 ± 0.004	11.31 ± 0.49	0.031 ± 0.001	0.993 ± 0.001	1.000	0.999	0.055	0.0045
C-Trans.	3.74	30.29 ± 0.91	0.362 ± 0.070	0.0031 ± 0.0004	0.0022 ± 0.0002	0.797 ± 0.070	15.75 ± 9.94	0.055 ± 0.017	0.995 ± 0.009	1.000	1.000	0.000	0.0052
C-RealNVP	3.32	31.82 ± 0.73	0.521 ± 0.006	0.0032	0.0021	0.665 ± 0.008	18.41 ± 2.31	0.027 ± 0.002	0.968 ± 0.001	1.000	0.999	0.062	0.0044

Params: number of parameters; Time: training time per epoch; SAM: Spectral Angle Mapper; Corr.: Pearson correlation; Peak Err.: peak wavelength error; Shape Sim.: shape similarity; Pos. Ratio: positive value ratio; Reas. Range: reasonable range ratio; Smooth.: smoothness; Div.: diversity; FD: Fréchet distance. Standard deviations ± 0.0000 are omitted for clarity.

Table 3. Ablation study results. Bold indicates the full model results.

Configuration	Params (M)	Time (s)	SAM (rad)	RMSE	MAE	Corr.	Peak Err. (nm)	Shape Sim.	Pos. Ratio	Smooth.	Div.	FD
Full Model	5.82	71.40	0.306 ± 0.112	0.0035	0.0025	0.832 ± 0.154	29.1 ± 44.2	0.509 ± 0.177	0.994	1.000	0.069 ± 0.030	0.0008
w/o Waveband Attn	5.82	67.55	0.365 ± 0.122	0.0037	0.0027	0.737 ± 0.185	45.3 ± 48.0	0.394 ± 0.229	0.996	1.000	0.079 ± 0.036	0.0013
w/o Physics Guided	5.78	59.59	0.373 ± 0.151	0.0037	0.0025	0.737 ± 0.213	39.8 ± 45.0	0.384 ± 0.271	0.986	1.000	0.081 ± 0.042	0.0008
w/o VAE (Direct Diff)	1.55	43.81	0.524 ± 0.025	0.0032	0.0023	0.644 ± 0.043	10.3 ± 20.9	0.025 ± 0.012	0.979	0.999	0.062 ± 0.002	0.0049
Baseline (No Innov.)	1.51	43.32	0.520 ± 0.026	0.0032	0.0023	0.653 ± 0.045	10.7 ± 22.3	0.024 ± 0.012	0.976	0.999	0.061 ± 0.002	0.0049

w/o: without; Attn: Attention; Innov.: Innovations; Params: number of parameters; Time: training time per epoch; SAM: Spectral Angle Mapper; Corr.: Pearson correlation; Peak Err.: peak wavelength error; Shape Sim.: shape similarity; Pos. Ratio: positive value ratio; Smooth.: smoothness; Div.: diversity; FD: Fréchet distance.

Table 4. Hyperparameter baseline configuration and testing ranges.

Parameter	Baseline Value	Testing Range
latent_dim	64	[16, 32, 48, 64, 96, 128, 192, 256]
vae_lr	$5 \times 10^{- 4}$	[ $2 \times 10^{- 3}$ , $1 \times 10^{- 3}$ , $5 \times 10^{- 4}$ , $2 \times 10^{- 4}$ , $1 \times 10^{- 4}$ , $5 \times 10^{- 5}$ ]
diffusion_lr	$5 \times 10^{- 5}$	[ $2 \times 10^{- 4}$ , $1 \times 10^{- 4}$ , $5 \times 10^{- 5}$ , $2 \times 10^{- 5}$ , $1 \times 10^{- 5}$ , $5 \times 10^{- 6}$ ]
vae_epochs	50	[20, 30, 40, 50, 60, 80, 100]
diffusion_epochs	150	[50, 80, 100, 150, 200, 250, 300]
batch_size	32	[8, 16, 24, 32, 48, 64, 96, 128]
guidance_scale	2.0	[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
num_timesteps	50	[20, 30, 40, 50, 70, 100, 150, 200]
beta_schedule	squaredcos_cap_v2	[linear, squaredcos_cap_v2, cosine]

Table 5. Hyperparameter sensitivity ranking and baseline value verification.

Rank	Parameter	Sensitivity Score	$Δ$ RMSE	$Δ$ Corr.	$Δ$ SAM
1	vae_lr	84.70	0.0021	0.3738	0.3290
2	diffusion_lr	40.12	0.0011	0.1599	0.1601
3	latent_dim	38.30	0.0015	0.2099	0.1383
4	beta_schedule	35.71	0.0002	0.1451	0.1422
5	vae_epochs	35.52	0.0012	0.1627	0.1363
6	num_timesteps	34.28	0.0011	0.1959	0.1218
7	guidance_scale	33.42	0.0006	0.1477	0.1298
8	batch_size	30.72	0.0008	0.1697	0.1108
9	diffusion_epochs	29.69	0.0008	0.1523	0.1100

Sensitivity score is calculated as $S = 100 \times Δ RMSE + 50 \times Δ Corr + 200 \times Δ SAM + 10 \times Δ Div$.

Table 6. Comparison of data augmentation effects. Bold indicates the best performance.

Data Source	$R^{2}$	RMSE	MAE	MAPE (%)	$R^{2}$ Improv. (%)	RMSE Improv. (%)
Baseline (Original)	0.7518	21.03	10.62	72.83	–	–
Full Model	0.9085	12.77	8.45	35.08	+20.85	+39.29
Conditional VAE	0.9075	12.79	8.15	44.19	+20.71	+39.17
Conditional DDPM	0.9048	13.02	7.74	29.34	+20.35	+38.07
Conditional RealNVP	0.9044	13.05	8.19	42.75	+20.30	+37.94
Conditional DDIM	0.8882	14.11	8.75	88.64	+18.14	+32.88
Conditional GAN	0.8881	14.12	8.85	36.37	+18.13	+32.86
Conditional WGAN-GP	0.8823	14.48	9.02	32.51	+17.37	+31.15
Conditional SN-GAN	0.8688	15.29	9.36	34.66	+15.57	+27.31
Conditional Transformer	0.7736	20.08	12.39	74.02	+2.90	+4.50

Improv.: Improvement; MAPE: Mean Absolute Percentage Error. RMSE and MAE are in mg/m³.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, J.; Zhang, H.; Huang, J.; Wen, H.; Chen, Q.; Liu, J.; Wen, C.; Tang, H.; Sun, Z. Latent Diffusion Model for Chlorophyll Remote Sensing Spectral Synthesis Integrating Bio-Optical Priors and Band Attention Mechanisms. Appl. Sci. 2026, 16, 3892. https://doi.org/10.3390/app16083892
