4.2. Lightweight Encoder–Decoder
As the medium through which diffusion models process images efficiently, the latent space directly influences both the performance and the efficiency of the model. In Latent Diffusion Models (LDMs) [31], the latent space is constructed through an encoder–decoder, which maps high-dimensional pixel-space images to a low-dimensional latent space, thereby reducing the computational complexity of training and inference. An LDM employs a pre-trained VQ-VAE [32] or VQ-GAN [33] architecture as its encoder–decoder. The encoder compresses input images into latent codes, while the decoder reconstructs latent codes back into pixel-space images. Through joint training, they achieve accurate mapping between the latent space and the pixel space.
However, in low-light remote sensing scenarios, the general-purpose encoder–decoder of an LDM has significant limitations. Low-light remote sensing images not only exhibit characteristics such as low illumination, high noise, and blurred details but also present issues specific to remote sensing, including weak spectral information of ground objects and imbalanced contrast between large dark areas and local bright targets (e.g., night lights and reflective buildings).
Designed to handle a wide variety of natural images, general-purpose encoder–decoders lack targeted optimization for low-light remote sensing scenarios. During compression, the encoder tends to lose scarce but critical ground-object details, such as road textures in dark areas and building edges, so the latent codes fail to retain the core features required for enhancement. During reconstruction, the decoder struggles to accurately restore the illumination distributions unique to low-light remote sensing images (e.g., light-intensity differences between urban and suburban areas, or uneven illumination caused by topographic shadows), which undermines the authenticity of the enhanced results and further interferes with subsequent remote sensing interpretation tasks.
In addition, general-purpose encoder–decoders adopt complex structures with large parameter counts in pursuit of versatility. Low-light remote sensing images, however, are often large and high-resolution, so such models incur a great deal of redundant computation in enhancement tasks and struggle to meet fast-processing requirements.
To address the above issues, we constructed and trained a lightweight, U-Net-like encoder–decoder specifically for low-light remote sensing enhancement. Its core design revolves around the feature requirements of low-light remote sensing scenarios, achieving a lightweight model while preserving the integrity of the latent-code information.
Compared with the U-Net, our improvements mainly include separating the encoding and decoding structures, introducing Depth-wise Separable Convolutions (DSConvs) [34], adding residual blocks to feature layers at different scales, and adopting a flexible feature fusion mechanism that combines multi-scale features of the low-light image during latent-code decoding. These designs make the model better suited to the characteristics of low-light remote sensing image enhancement. Following [23], we trained the encoder–decoder on approximately 10,000 pairs of images with different exposures.
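For concreteness, the sketch below shows one way a DSConv-based residual block of the kind described above could be implemented in PyTorch; the channel width, normalization, and activation choices are illustrative assumptions rather than the exact configuration of our encoder–decoder.

```python
import torch
import torch.nn as nn

class DSConvResBlock(nn.Module):
    """Depth-wise separable convolution block with a residual connection."""

    def __init__(self, channels: int):
        super().__init__()
        # Depth-wise 3x3 convolution: each channel is filtered independently.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        # Point-wise 1x1 convolution: mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.norm = nn.GroupNorm(8, channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.norm(self.pointwise(self.depthwise(x))))
        return x + out  # residual connection helps preserve dark-area details

# A standard 3x3 convolution with 64 channels has 36,864 weights; the
# depth-wise/point-wise pair above has only 576 + 4,096 = 4,672 weights.
block = DSConvResBlock(64)
y = block(torch.randn(1, 64, 128, 128))   # shape preserved: (1, 64, 128, 128)
```

The parameter comparison in the final comment illustrates why depth-wise separable convolutions keep the encoder–decoder lightweight.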
To verify the advantages of the proposed lightweight encoder–decoder in low-light remote sensing enhancement tasks, in Section 5.2 we systematically compare it with the general-purpose VQ-VAE-based encoder–decoder commonly used in LDMs, covering two dimensions: encoding–decoding quality and model efficiency. Because the general-purpose encoder–decoder causes a substantial drop in efficiency, with a single high-resolution remote sensing image taking several seconds or even tens of seconds to process, and because the performance degradation is already evident in those experiments, we did not conduct separate ablation studies on this component.
4.3. End-to-End Trained Single-Step Residual Diffusion Model
The core design of the standard diffusion model in Section 3 follows the "noise prediction–iterative denoising" paradigm. Its target task in the training phase is defined as noise estimation: the model learns to predict the added noise from noisy images, thereby modeling the data distribution. In the sampling phase, based on the pre-trained noise-estimation network, the model starts from pure noise and removes it step by step through multi-step iterations, ultimately generating images that conform to the target distribution. However, in image reconstruction tasks, such as low-light image enhancement, the actual target task is reconstructing normal-light images from input low-light images. This mismatch between noise estimation and image reconstruction leads to significant limitations in standard diffusion models.
To address this core contradiction, we propose an end-to-end trained single-step diffusion reconstruction mechanism. By redesigning the training objective and the training process, the diffusion model is made to serve the image reconstruction task directly. The core idea is as follows: during the training phase, the reverse sampling process of the diffusion model is also executed, and only the reconstruction loss is used as the optimization objective, achieving task consistency between training and sampling.
Diffusion models generally require 50–100 sampling steps [35], and end-to-end training under such settings would greatly prolong the training time. To accelerate training, reduce the number of sampling steps, and achieve efficient single-step reconstruction, we introduce the residual diffusion model [20] to replace the standard diffusion model at the core of our method. The residual diffusion model decouples the traditional single denoising diffusion process into a dual process of residual diffusion and noise diffusion, which makes it possible to reduce the number of sampling steps and achieve efficient single-step reconstruction.
Forward process: The forward diffusion process of the residual diffusion model involves gradually introducing Gaussian noise and residual terms into the original data, aiming to transform the original data distribution into a noise-added degraded image through a series of minor noise perturbations and residuals. This process can be described as a $T$-step Markov chain, where each step depends only on the result of the previous step, specifically as follows:

$$q\left(I_t \mid I_{t-1}, I_{res}\right) = \mathcal{N}\!\left(I_t;\; I_{t-1} + \alpha_t I_{res},\; \beta_t^{2}\mathbf{I}\right), \tag{9}$$

where $I_{res} = I_{in} - I_0$, $I_{in}$ is the degraded image, and $I_0$ is the target image. The residual diffusion and noise diffusion are controlled by independent coefficient schedules $\{\alpha_t\}$ and $\{\beta_t\}$, respectively, to regulate their diffusion rates. Equation (9) can also be expressed as

$$I_t = I_{t-1} + \alpha_t I_{res} + \beta_t \epsilon_{t-1}, \tag{10}$$

where $\epsilon_{t-1} \sim \mathcal{N}(0, \mathbf{I})$. Similar to standard diffusion models, $I_t$ at any step can be directly represented by the initial data $I_0$, residuals, and accumulated noise:

$$I_t = I_0 + \bar{\alpha}_t I_{res} + \bar{\beta}_t \epsilon, \tag{11}$$

where $\bar{\alpha}_t = \sum_{i=1}^{t}\alpha_i$, $\bar{\beta}_t = \sqrt{\sum_{i=1}^{t}\beta_i^{2}}$, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. By default, we set $\bar{\alpha}_t$ as a uniformly increasing sequence, and $\bar{\alpha}_T = 1$; when $t = T$, the endpoint of the forward diffusion process is $I_T = I_{in} + \bar{\beta}_T \epsilon$, which presents the degraded image with random perturbations added.
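A minimal sketch of the forward process in Equation (11) is given below, assuming a uniformly increasing $\bar{\alpha}_t$ with $\bar{\alpha}_T = 1$; the concrete form of the noise schedule $\bar{\beta}_t$ and the function names are assumptions for illustration only.

```python
import torch

def make_schedules(T: int, beta_bar_T: float = 1.0):
    """Coefficient schedules for the dual diffusion process (illustrative).

    alpha_bar_t rises uniformly to 1 so that I_T = I_in + beta_bar_T * eps;
    beta_bar_t is one simple monotone choice for the accumulated noise scale.
    """
    t = torch.arange(1, T + 1, dtype=torch.float32)
    alpha_bar = t / T
    beta_bar = beta_bar_T * torch.sqrt(t / T)
    return alpha_bar, beta_bar

def q_sample(I0: torch.Tensor, I_in: torch.Tensor, t: torch.Tensor,
             alpha_bar: torch.Tensor, beta_bar: torch.Tensor):
    """Sample I_t from Eq. (11): I_t = I_0 + alpha_bar_t * I_res + beta_bar_t * eps."""
    I_res = I_in - I0                       # residual between degraded input and target
    eps = torch.randn_like(I0)
    a = alpha_bar[t].view(-1, 1, 1, 1)      # per-sample coefficients
    b = beta_bar[t].view(-1, 1, 1, 1)
    return I0 + a * I_res + b * eps, I_res, eps
```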
In the dual diffusion process, the residual diffusion part represents the directional diffusion from the target image to the degraded input image, explicitly providing guidance for the reverse generation process of image restoration. This directionality enables the model to focus on the key information to be transmitted in image restoration and reduces the number of implicit sampling steps: because the degraded image is known, there is no need for the lengthy reverse process that standard diffusion models must run starting from pure noise. The noise diffusion part, in turn, represents the random perturbations in the diffusion process. While ensuring the diversity of model outputs, its coefficient schedule, which is independent of that of the residual diffusion, allows the model to balance determinism and diversity more flexibly.
Reverse process: The reverse process of the residual diffusion model starts from the degraded image with added random perturbations, gradually reducing the noise and residuals, and ultimately generating a clean target data distribution. If $I_{res}$ and $\epsilon$ are given, where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, the reverse process is defined as

$$I_{t-1} = I_t - \left(\bar{\alpha}_t - \bar{\alpha}_{t-1}\right) I_{res} - \left(\bar{\beta}_t - \sqrt{\bar{\beta}_{t-1}^{2} - \sigma_t^{2}}\right)\epsilon + \sigma_t \epsilon_t,$$

where $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$, $\sigma_t^{2}$ denotes the variance of the random perturbation in the reverse process, and $\sigma_t = 0$ for a deterministic generation process. When $\sigma_t = 0$, we can obtain any one-step reverse process:

$$I_{t'} = I_t - \left(\bar{\alpha}_t - \bar{\alpha}_{t'}\right) I_{res} - \left(\bar{\beta}_t - \bar{\beta}_{t'}\right)\epsilon, \qquad 0 \le t' < t.$$

The training objective of the residual diffusion model is to fit a residual prediction model that predicts $I_{res}$ by using a U-Net and optimizing the network parameters $\theta$, so as to indirectly obtain $\epsilon$ from Equation (11), and then calculate $I_{t-1}$, so as to realize the gradual recovery of the original data.
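The sketch below illustrates, under our reading of Equation (11) and the reverse process above, how a predicted residual can be turned into a noise estimate and a deterministic reverse step; the interface and variable names are hypothetical.

```python
import torch

def reverse_step(I_t, I_in, I_res_pred, t, t_prev, alpha_bar, beta_bar):
    """One deterministic reverse step (sigma_t = 0) given a predicted residual.

    The noise is recovered indirectly from Eq. (11) using the estimate
    I_0 = I_in - I_res, then the reverse transfer from step t to t_prev is applied.
    t and t_prev are integer step indices; t_prev < 0 returns the clean estimate.
    """
    I0_hat = I_in - I_res_pred                                   # estimated target image
    eps_hat = (I_t - I0_hat - alpha_bar[t] * I_res_pred) / beta_bar[t]
    if t_prev < 0:
        return I0_hat
    return (I_t
            - (alpha_bar[t] - alpha_bar[t_prev]) * I_res_pred    # remove part of the residual
            - (beta_bar[t] - beta_bar[t_prev]) * eps_hat)        # remove part of the noise
```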
For our end-to-end one-step reconstruction training, following [30], we implement one-step implicit sampling, modify the training objective to indirectly predict $\hat{I}_0$ through the reverse process, and only calculate the reconstruction loss between $\hat{I}_0$ and the ground-truth image $I_0$. The training and inference processes can refer to Algorithms 1 and 2. Following [36], we use the L1 loss and the MS-SSIM loss as the reconstruction loss; the objective function is formulated as

$$\mathcal{L}_{rec} = \lambda\,\mathcal{L}_{1}\!\left(\hat{I}_0, I_0\right) + \left(1 - \lambda\right)\mathcal{L}_{\text{MS-SSIM}}\!\left(\hat{I}_0, I_0\right),$$

where $\lambda$ is a weighting coefficient that balances the two terms.
Algorithm 1: End-to-end residual diffusion model training.
Algorithm 2: End-to-end residual diffusion model inference.
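The PyTorch-style sketch below captures the spirit of Algorithms 1 and 2 as we read them (single forward diffusion to $t = T$, one implicit reverse step, reconstruction loss only); the network interface, the use of the third-party pytorch_msssim package, and the loss weight lam are assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim   # assumed third-party dependency

def train_step(net, prior_extractor, I_low, I_gt, beta_bar, T, lam=0.16):
    """Algorithm 1 (sketch): end-to-end single-step training, reconstruction loss only."""
    # Forward diffusion straight to the endpoint t = T: degraded image + perturbation.
    I_T = I_low + beta_bar[T - 1] * torch.randn_like(I_low)
    prior = prior_extractor(I_low)                    # illumination-invariant priors (S, C, W)
    t = torch.full((I_low.shape[0],), T - 1, dtype=torch.long, device=I_low.device)
    I_res_pred = net(I_T, t, I_low, prior)            # residual prediction by the U-Net
    I0_hat = I_low - I_res_pred                       # one-step implicit sampling
    # Reconstruction loss: weighted L1 + MS-SSIM (weight lam is an assumption).
    loss = (lam * F.l1_loss(I0_hat, I_gt)
            + (1 - lam) * (1 - ms_ssim(I0_hat, I_gt, data_range=1.0)))
    return loss

@torch.no_grad()
def infer(net, prior_extractor, I_low, beta_bar, T):
    """Algorithm 2 (sketch): single-step inference."""
    I_T = I_low + beta_bar[T - 1] * torch.randn_like(I_low)
    prior = prior_extractor(I_low)
    t = torch.full((I_low.shape[0],), T - 1, dtype=torch.long, device=I_low.device)
    return I_low - net(I_T, t, I_low, prior)
```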
4.4. Physical Prior Extractor
Although the end-to-end trained single-step diffusion mechanism significantly improves the efficiency of low-light image enhancement, in practice a drastic reduction in the number of sampling steps degrades model performance to some extent. Standard multi-step diffusion models gradually correct their outputs through iteration; without this multi-step correction, single-step diffusion models are prone to structural inconsistencies or illumination anomalies caused by local feature-fitting deviations. The problem is especially acute for low-light images: because the structural information of the original image is inherently blurred, it is harder for a single-step model to accurately capture the structural mapping from low light to normal light, which further exacerbates the performance loss.
To address the performance degradation issue in single-step diffusion, we propose an innovative strategy that derives physical priors based on the Kubelka–Munk theory to guide the generation process of the residual diffusion model. The Kubelka–Munk theory describes the spectral energy of light reflected from the surface of an object, and its basic formula is

$$E(\lambda, x) = e(\lambda, x)\left[\left(1 - i(x)\right)^{2} R_{\infty}(\lambda, x) + i(x)\right],$$

where $\lambda$ denotes the wavelength; $x$ denotes the spatial location; $E(\lambda, x)$ is the spectral energy of the reflected light; $e(\lambda, x)$ is the light source spectrum; $i(x)$ is the specular reflection coefficient; and $R_{\infty}(\lambda, x)$ is the intrinsic reflectivity of the material, which is only related to material properties and is independent of the illumination.
Through reasonable simplifying assumptions on the above model, illumination-related factors such as the light source spectrum $e$ and the specular reflection $i$ can be eliminated, leaving only the intrinsic material property $R_{\infty}$, thereby obtaining the illumination-invariant physical priors.
Assuming that the light source energy is uniform, which means independent of wavelength, $e(\lambda, x)$ degenerates into $e(x)$; thus, the model is simplified to

$$E(\lambda, x) = e(x)\left[\left(1 - i(x)\right)^{2} R_{\infty}(\lambda, x) + i(x)\right].$$

By taking the ratio of the first derivative $E_{\lambda}$ to the second derivative $E_{\lambda\lambda}$ of the spectral energy, i.e., $S = E_{\lambda}/E_{\lambda\lambda}$, $e(x)$ and $i(x)$ can be eliminated, resulting in an illumination invariant $S$ related only to $R_{\infty}$, that is,

$$S = \frac{E_{\lambda}}{E_{\lambda\lambda}} = \frac{\partial R_{\infty}(\lambda, x)/\partial\lambda}{\partial^{2} R_{\infty}(\lambda, x)/\partial\lambda^{2}}.$$
Assuming that the object has a diffuse reflection surface, specular reflection is negligible, i.e., $i(x) \approx 0$; thus, the model is simplified to

$$E(\lambda, x) = e(x)\,R_{\infty}(\lambda, x).$$

By taking the ratio of the first derivative $E_{\lambda}$ of the spectral energy to the spectral energy $E$ itself, i.e., $C = E_{\lambda}/E$, $e(x)$ can be eliminated, resulting in an illumination-invariant $C$ related only to $R_{\infty}$, that is,

$$C = \frac{E_{\lambda}}{E} = \frac{\partial R_{\infty}(\lambda, x)/\partial\lambda}{R_{\infty}(\lambda, x)}.$$
The last illumination-invariant prior is from CIConv [37]. Specifically, under further simplifying assumptions, considering the spectral energy $E$; its spatial derivatives $E_x$ and $E_y$; and the spatial derivatives $E_{\lambda x}$, $E_{\lambda y}$, $E_{\lambda\lambda x}$, and $E_{\lambda\lambda y}$ of the first-/second-order spectral derivatives $E_{\lambda}$ and $E_{\lambda\lambda}$, the first derived quantities are defined as follows:

$$W_x = \frac{E_x}{E}, \qquad W_{\lambda x} = \frac{E_{\lambda x}}{E}, \qquad W_{\lambda\lambda x} = \frac{E_{\lambda\lambda x}}{E}.$$

Similarly, $W_y$, $W_{\lambda y}$, and $W_{\lambda\lambda y}$ can be obtained. On this basis, the illumination invariant $W$ is defined as the gradient magnitude of these derived quantities:

$$W = \sqrt{W_x^{2} + W_{\lambda x}^{2} + W_{\lambda\lambda x}^{2} + W_y^{2} + W_{\lambda y}^{2} + W_{\lambda\lambda y}^{2}}.$$
$W$ can stably reflect the intrinsic edge characteristics of materials without being disturbed by the light intensity, color changes, or specular reflection, thereby achieving robustness against illumination variations. In practice, the invariant is log-scaled and standardized:

$$\hat{W} = \frac{\log\left(W^{2} + \epsilon\right) - \mu}{\sigma},$$

where $\epsilon$ represents a small perturbation, and $\mu$ and $\sigma$ refer to the sample mean and standard deviation, respectively.
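A compact sketch of how the three priors could be computed from the spectral energy and its derivatives is given below; following the text, only $W$ is log-scaled and standardized, while tensor shapes and the stabilizing constant are assumptions.

```python
import torch

def illumination_invariants(E, El, Ell, Ex, Ey, Elx, Ely, Ellx, Elly, eps=1e-5):
    """Compute the priors S, C, and W from spectral energy maps (sketch).

    E, El, Ell: spectral energy and its first/second spectral derivatives.
    Ex ... Elly: their spatial derivatives in the x and y directions.
    eps: small constant for numerical stability.
    """
    S = El / (Ell + eps)                                 # ratio of spectral derivatives
    C = El / (E + eps)                                   # spectral derivative over energy
    # Derived quantities W_x = E_x / E, etc., in both spatial directions.
    Wx, Wlx, Wllx = Ex / (E + eps), Elx / (E + eps), Ellx / (E + eps)
    Wy, Wly, Wlly = Ey / (E + eps), Ely / (E + eps), Elly / (E + eps)
    W = torch.sqrt(Wx**2 + Wlx**2 + Wllx**2 + Wy**2 + Wly**2 + Wlly**2)
    # Log-scale and standardize W with the sample mean and standard deviation.
    z = torch.log(W**2 + eps)
    W_hat = (z - z.mean()) / (z.std() + eps)
    return S, C, W_hat
```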
Our prior extractor is shown in Figure 2. After extracting the priors, we concatenate them and inject them as a condition into the U-Net used by the residual diffusion model to guide the model training.
The Gaussian color model is used to estimate the spectral energy and its derivatives from RGB images, providing a basis for the calculation of the illumination-invariant physical priors. As a bridge connecting RGB images and spectral energy features, it maps RGB channel values, through a linear transformation, to variables related to the spectral energy, namely the spectral energy $\hat{E}$, the first-order spectral derivative $\hat{E}_{\lambda}$, and the second-order spectral derivative $\hat{E}_{\lambda\lambda}$, thereby supporting the derivation of the subsequent illumination-invariant priors $S$, $C$, and $W$. For the input RGB image, the Gaussian color model performs a linear transformation through a 3 × 3 matrix $M$ with the formula

$$\begin{pmatrix} \hat{E}(x) \\ \hat{E}_{\lambda}(x) \\ \hat{E}_{\lambda\lambda}(x) \end{pmatrix} = M \begin{pmatrix} R(x) \\ G(x) \\ B(x) \end{pmatrix},$$

where $\hat{E}(x)$ represents the estimated spectral energy; $\hat{E}_{\lambda}(x)$ and $\hat{E}_{\lambda\lambda}(x)$ represent the first-order and second-order spectral derivatives, respectively; and $x$ is the pixel position. After obtaining $\hat{E}$, $\hat{E}_{\lambda}$, and $\hat{E}_{\lambda\lambda}$, we calculate $E$, $E_{\lambda}$, and $E_{\lambda\lambda}$ by convolving them with Gaussian color smoothing and derivative filters of scale $\sigma$, where $\sigma$ is predicted from the input image using three DSConv layers [34]. The spatial derivatives $E_x$, $E_{\lambda x}$, and $E_{\lambda\lambda x}$ are calculated with the Gaussian derivative filter in the $x$ direction, and $E_y$, $E_{\lambda y}$, and $E_{\lambda\lambda y}$ are calculated with the Gaussian derivative filter in the $y$ direction.
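For illustration, the sketch below estimates $\hat{E}$, $\hat{E}_{\lambda}$, and $\hat{E}_{\lambda\lambda}$ with a Gaussian color model matrix and applies separable Gaussian smoothing and derivative filters; the matrix coefficients are the values commonly used in the color-invariance literature, and the fixed scale stands in for the DSConv-predicted scale, so both are assumptions rather than our exact settings.

```python
import torch
import torch.nn.functional as F

# Gaussian color model matrix commonly used in the color-invariance literature (assumed).
M = torch.tensor([[0.06, 0.63, 0.27],
                  [0.30, 0.04, -0.35],
                  [0.34, -0.60, 0.17]])

def rgb_to_spectral(img: torch.Tensor) -> torch.Tensor:
    """Linearly map an RGB image (B, 3, H, W) to (E_hat, E_hat_l, E_hat_ll) channels."""
    return torch.einsum("ij,bjhw->bihw", M, img)

def gaussian_kernels(sigma: float, radius: int = 3):
    """1-D Gaussian smoothing kernel and its first derivative at scale sigma."""
    x = torch.arange(-radius, radius + 1, dtype=torch.float32)
    g = torch.exp(-x**2 / (2 * sigma**2))
    g = g / g.sum()
    dg = -x / sigma**2 * g
    return g, dg

def spatial_derivatives(channel: torch.Tensor, sigma: float = 1.0):
    """Separable Gaussian smoothing plus x/y Gaussian-derivative filtering of one channel."""
    g, dg = gaussian_kernels(sigma)
    pad = g.numel() // 2
    row = lambda k: k.view(1, 1, 1, -1)   # horizontal 1-D kernel
    col = lambda k: k.view(1, 1, -1, 1)   # vertical 1-D kernel
    smooth = F.conv2d(F.conv2d(channel, row(g), padding=(0, pad)), col(g), padding=(pad, 0))
    dx = F.conv2d(F.conv2d(channel, row(dg), padding=(0, pad)), col(g), padding=(pad, 0))
    dy = F.conv2d(F.conv2d(channel, row(g), padding=(0, pad)), col(dg), padding=(pad, 0))
    return smooth, dx, dy
```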
The latent codes of normal-light images, along with the priors extracted from their corresponding input images, are shown in Figure 3.