1. Introduction
With the rapid deployment of remote-sensing-enabled Internet of Things systems, high-resolution aerial and satellite images and videos are increasingly used as perceptual inputs for large-scale, continuous monitoring applications. High-resolution remote sensing images typically contain dense man-made structures and heterogeneous land-cover types. However, under nighttime acquisition, illumination is often dominated by weak, phase-dependent moonlight, which makes remote sensing images prone to severe underexposure, color degradation, and loss of fine details, thereby reducing the visibility of thin structures such as roads and roof boundaries. This inevitably affects downstream tasks that rely on high-resolution remote sensing imagery, including disaster assessment, wildlife monitoring, and environmental protection. Therefore, the development of algorithms dedicated to enhancing low-light remote sensing images is crucial.
Recent CNN-based and Transformer-based low-light enhancement methods predominantly follow single-pass feed-forward pipelines. CNN-based methods typically learn direct illumination compensation or Retinex decomposition and restoration from paired supervision, aiming to lift exposure while suppressing noise and recovering details [1,2,3]. In low-light remote sensing, the same paradigm is extended with domain-oriented representations and high-resolution modeling, including dual-domain feature fusion and data-efficient adaptation for challenging shadowed regions [1,2]. Transformer-based approaches further introduce global dependency modeling and long-range interaction to improve region-wise correction and detail recovery under spatially varying illumination [4,5,6]. Despite these advances, both CNN and Transformer paradigms typically adopt single-pass feed-forward mappings optimized with pixel-wise objectives, which often lead to over-averaged solutions in severely underexposed regions, thereby compromising fine details and structural consistency. This degradation is particularly detrimental for remote sensing, as it can obscure small-scale targets and blur object boundaries.
To mitigate the limitations of single-pass pixel-wise optimization, recent studies increasingly formulate low-light enhancement within diffusion frameworks, where restoration is modeled as an iterative denoising trajectory over image distributions, enabling progressive refinement of exposure, contrast, and structural details [7,8,9,10,11]. Such formulations provide a flexible backbone for incorporating additional guidance signals during sampling, which is particularly beneficial for handling severe degradation and non-uniform illumination. Existing methods typically rely on external modulation or guidance to stabilize the denoising process, including priors such as Fourier-domain constraints and Retinex-inspired decomposition [7,8], as well as conditioning through pretrained latent diffusion models for zero-shot enhancement and training-free attribute guidance [9,11,12]. More recently, prompt-based conditioning has emerged as an effective mechanism to inject auxiliary cues into diffusion models, where learnable or retrieved prompts modulate intermediate representations to provide degradation-aware guidance without modifying the backbone architecture [13,14,15,16,17]. Despite these advances, the guidance adopted in existing diffusion-based low-light enhancement is often single and static, remaining fixed across the denoising trajectory and tightly coupled to a specific prior or prompt form. Such monolithic guidance makes it difficult to simultaneously accommodate spatially varying illumination correction and preserve semantically meaningful structures, particularly when real-world scene conditions diverge from the priors assumed by the enhancement model. In remote sensing images, this often manifests as region-wise inconsistent correction and distorted local structures, which can obscure small-scale targets and blur object boundaries. These observations indicate that effective diffusion-based enhancement requires complementary and adaptive guidance mechanisms, which motivates the design of dual, learnable prompts to cooperatively steer the diffusion process.
This decomposition is necessary because low-light remote sensing enhancement must address two different objectives at the same time. The first is adaptive correction of spatially varying underexposure. The second is preservation of semantic structure and object boundaries under low contrast. A single guidance signal tends to couple these two roles and is therefore less suitable for handling both illumination adjustment and structure preservation in a stable manner.
To address these issues, we propose a Complementary Illumination–Semantic Prompt Diffusion (CISPD) framework tailored for low-light remote sensing image enhancement. Instead of relying on a single, fixed guidance mechanism, CISPD separates illumination correction from structure preservation through two complementary prompts injected into the denoiser. Specifically, we introduce a self-learned illumination-aware prompt (IAP) retrieved from a learnable prompt pool conditioned on the current latent context, which provides adaptive exposure-related guidance to handle spatially varying underexposure. Meanwhile, a semantic-invariant prompt (SIP) extracted from a vision foundation model supplies stable structural cues that are less affected by illumination variation, improving geometric consistency and suppressing artifacts. During the denoising trajectory, CISPD applies the illumination-aware prompt first to correct non-uniform brightness, and then uses the semantic-invariant prompt to reinforce structures and recover details after exposure correction. To prevent the two prompts from collapsing into redundant guidance, we further impose a contrastive prompt constraint loss (CPL) that keeps their representations distinct, encouraging complementary information flow throughout refinement. With this design, CISPD steers diffusion refinement with adaptive illumination control and a semantic-invariant prompt in a coordinated manner, leading to well-exposed results with faithful structures on low-light remote sensing imagery.
Our contributions are summarized as follows:
We develop CISPD as a prompt-guided residual diffusion framework for low-light remote sensing image enhancement, which couples adaptive illumination-aware prompt retrieval with semantic-invariant structural guidance to address spatially varying underexposure during iterative refinement.
We design a self-learned illumination-aware prompt and a semantic-invariant prompt, injected sequentially into the diffusion denoiser, together with a contrastive constraint to encourage complementary guidance.
Extensive experiments on both low-light remote sensing and natural-image datasets demonstrate that CISPD achieves superior enhancement quality and robust generalization under spatially varying illumination conditions.
3. Methodology
3.1. Residual Diffusion with Implicit Priors
As shown in Figure 1, we propose CISPD, which aims to enhance a low-light remote sensing image by learning the illumination-degradation residual between the low-light input and its clear reference. In the denoising UNet, the illumination-aware prompt is injected first and the semantic-invariant prompt is injected afterward through sequential cross-attention. Diffusion is performed on a latent representation: the forward diffusion process adds noise to the clean latent to yield a noisy latent at timestep t.
The proposed framework optimizes a conditional denoiser to predict the noise, guided by two synergistic prompts that separate the semantic reflectance from the complex illumination context. This design follows the task decomposition in low-light remote sensing enhancement, where illumination correction requires adaptive degradation-aware guidance, while structure preservation requires stable semantic guidance that is less affected by illumination variation.
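A minimal sketch of the forward noising step, assuming the standard DDPM closed form together with the linear schedule from 0.0001 to 0.02 reported in Section 4.3. The number of timesteps `T = 1000`, the latent shape, and the names `linear_beta_schedule` and `q_sample` are illustrative assumptions, not values or identifiers taken from the paper.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (range as reported in Section 4.3)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(z0, t, alpha_bar, rng):
    """Sample z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps (standard DDPM)."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

T = 1000                                  # assumed number of timesteps
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal-retention factor

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))       # hypothetical clean latent
zt, eps = q_sample(z0, t=500, alpha_bar=alpha_bar, rng=rng)
```

The denoiser would then be trained to predict `eps` from `zt`, the timestep, and the two prompts; the signal fraction `alpha_bar[t]` shrinks monotonically as `t` grows.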
3.2. Semantic-Invariant Prompting
The semantic component of a remote sensing image represents the intrinsic physical attributes of land-cover objects, which should remain invariant regardless of illumination shifts. To preserve these intrinsic properties, we introduce the Semantic-Invariant Prompt (SIP).
We leverage the vision foundation model DINOv3 [47] as the feature extractor for SIP. We choose DINOv3 because SIP is designed to encode semantic structure rather than illumination intensity, and a pretrained vision foundation model provides semantically richer and more stable descriptors than low-level image features used directly. This is suitable for remote sensing images, where land-cover layouts and object boundaries should remain stable under illumination variation. Given the input image, we extract a semantic feature that provides stable geomorphological descriptors. This feature is projected into the prompt space via a projection layer to form the static semantic feature that constitutes the SIP. The SIP serves as a semantic anchor in the diffusion process, ensuring that geomorphological integrity is preserved during intensity restoration.
3.3. Self-Learned Illumination-Aware Prompting
We propose a self-learned illumination-aware prompt to complement the semantic-invariant prompt by modeling illumination-specific variation rather than semantic structure. This prompt is designed for adaptive exposure correction under spatially varying degradation. We define a Prompt Pool as a learnable library of degradation prompt embeddings, where each embedding implicitly represents a latent mode of illumination-related degradation.
For a specific input, we extract its contextual feature from the latent space of the denoising UNet and perform top-k sparse retrieval to fit its unique degradation distribution. We calculate the similarity between a query and the learnable keys associated with the pool. To provide a local inductive bias for non-uniform lighting, we then select the top-k most relevant descriptors to construct the illumination-aware prompt.
This sparse selection mechanism allows the model to fit complex illumination distributions without being constrained by idealized physical assumptions. In CISPD, illumination adjustment is not controlled by a hand-crafted exposure threshold. Instead, it is governed by similarity-based sparse retrieval, where the similarity scores and the top-k selection determine which illumination descriptors are activated for the current latent context.
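The retrieval step can be sketched as follows, under stated assumptions: the query is a pooled context vector, similarity is cosine similarity, and the selected prompts are concatenated along the length axis. The pool size, feature widths, and the function name `retrieve_illumination_prompt` are hypothetical; only the prompt length of 64 follows the default reported in the ablation study.

```python
import numpy as np

def retrieve_illumination_prompt(query, keys, prompts, k):
    """Select the k pool entries whose keys best match the query.

    query:   (d,)       context feature pooled from the denoiser latent
    keys:    (n, d)     one learnable key per pool entry
    prompts: (n, L, c)  pool of prompt embeddings (length L, width c)
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    kn = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    scores = kn @ q                        # cosine similarity, shape (n,)
    top_idx = np.argsort(-scores)[:k]      # indices of the k best matches
    # Assumed aggregation: concatenate the selected prompts along length.
    selected = prompts[top_idx].reshape(-1, prompts.shape[-1])
    return selected, top_idx, scores

rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 32))               # hypothetical pool of 16 keys
prompts = rng.standard_normal((16, 64, 48))        # prompt length 64
query = keys[3] + 0.01 * rng.standard_normal(32)   # query close to key 3
iap, idx, scores = retrieve_illumination_prompt(query, keys, prompts, k=4)
```

Because only the `k` best-matching entries contribute, raising `k` trades retrieval selectivity for coverage, which is the behavior examined in the top-k ablation.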
3.4. Illumination–Semantic Prompt Injection via Sequential Cross-Attention
To ensure that the prompts precisely guide the denoising process, we implement a Sequential Cross-Attention mechanism within the latent space of the denoising UNet. The denoiser interacts with the IAP and the SIP in a cascading manner to progressively refine the latent features.
In the first stage, the intermediate feature map of the UNet attends to the illumination descriptor provided by the IAP to perform localized brightness compensation. Here, the query, key, and value representations are generated by learnable linear projections, and dim denotes the feature dimension used for scaled dot-product normalization.
In the second stage, the updated feature attends to the semantic-invariant anchor provided by the SIP to reinforce geomorphological structures and eliminate potential artifacts introduced by illumination enhancement.
This sequential design follows the restoration requirement of low-light remote sensing images. The latent feature should first be corrected for spatially varying underexposure through the IAP. After this correction, the SIP provides semantic-invariant structural guidance to preserve boundaries and suppress artifacts. Therefore, brightness lifting is followed by structure-aware correction, which helps suppress over-enhancement and preserve local details during illumination adjustment. The comparison with alternative prompt integration strategies is provided in Section 4.7.
Formally, the two prompts condition the denoiser through the intermediate latent feature transformation defined by Equations (5) and (6). Therefore, the denoiser can be written as a prompt-conditioned noise predictor, where the dependence on the IAP and the SIP is realized by replacing the original intermediate feature with the prompt-refined feature inside the denoising UNet. In this way, prompt information is injected into the reverse process through latent feature modulation rather than through an external sampling guidance term.
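A minimal sketch of the two-stage injection, assuming single-head scaled dot-product cross-attention with residual additions. The residual connections, the shared channel width, the random projection matrices, and all function names are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(feat, prompt, w_q, w_k, w_v):
    """Queries come from the feature map; keys and values come from the prompt."""
    q, k, v = feat @ w_q, prompt @ w_k, prompt @ w_v
    dim = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)   # (tokens, prompt_len)
    return attn @ v

def sequential_injection(feat, iap, sip, proj_iap, proj_sip):
    """Stage 1 attends to the IAP, stage 2 attends to the SIP.

    The residual additions are an assumption of this sketch.
    """
    feat = feat + cross_attention(feat, iap, *proj_iap)
    feat = feat + cross_attention(feat, sip, *proj_sip)
    return feat

def random_projections(rng, c):
    """Three hypothetical (c, c) projection matrices for Q, K, and V."""
    return tuple(rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))

rng = np.random.default_rng(0)
c = 48                                    # hypothetical channel width
feat = rng.standard_normal((256, c))      # flattened 16x16 feature map
iap = rng.standard_normal((64, c))        # illumination-aware prompt tokens
sip = rng.standard_normal((32, c))        # semantic-invariant prompt tokens
proj_iap = random_projections(rng, c)
proj_sip = random_projections(rng, c)
out = sequential_injection(feat, iap, sip, proj_iap, proj_sip)
```

Swapping the two calls inside `sequential_injection` gives the reverse SIP → IAP order that the ablation study compares against.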
3.5. Illumination–Semantic Prompt Disentanglement via Contrastive Prompt Learning
To further ensure that the SIP and the IAP are truly separated into the semantic and illumination domains, we implement a Contrastive Prompt Loss (CPL). The loss constrains the cosine similarity between the semantic-invariant anchor and the illumination-contextual descriptor.
Here, B denotes the batch size, and b denotes the sample index within the batch. The SIP and the IAP are taken from the b-th sample, and the function sim denotes cosine similarity. By minimizing the CPL, we decrease the similarity between the SIP and the IAP while increasing the similarity of the positive pair. Therefore, the SIP is anchored to semantic-invariant features, and the IAP is encouraged to encode information that is distinct from this semantic anchor. In this way, the loss suppresses representational overlap between the two prompts and promotes functional disentanglement between illumination guidance and structure guidance.
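One loss with the behavior described above can be sketched as a batch-averaged difference between a negative-pair and a positive-pair cosine similarity. This is an illustrative form only and not necessarily the exact CPL: it assumes that each prompt is pooled to one vector per sample and that the positive for the SIP is a semantic feature vector (`sem_feat`, a hypothetical name).

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (B, d) arrays."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

def contrastive_prompt_loss(sip, iap, sem_feat):
    """Assumed form: mean over the batch of sim(SIP, IAP) - sim(SIP, sem_feat)."""
    neg = cosine_sim(sip, iap)         # negative pair, pushed down
    pos = cosine_sim(sip, sem_feat)    # positive pair, pulled up
    return float(np.mean(neg - pos))

sip = np.array([[1.0, 0.0], [0.0, 1.0]])
sem = sip.copy()                                    # positives aligned with the SIP
iap_distinct = np.array([[0.0, 1.0], [1.0, 0.0]])   # orthogonal to the SIP
iap_collapsed = sip.copy()                          # identical to the SIP

loss_distinct = contrastive_prompt_loss(sip, iap_distinct, sem)
loss_collapsed = contrastive_prompt_loss(sip, iap_collapsed, sem)
```

In this toy batch the loss is lower when the IAP is orthogonal to the SIP than when the two prompts coincide, which is the redundancy that the constraint is meant to penalize.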
3.6. Sampling and Optimization
The final training objective is a weighted combination of the residual diffusion loss, the contrastive prompt loss, and a pixel-level fidelity term, where three weighting hyperparameters balance the contribution of each loss component. We set all three weights to 1 as a unified default configuration for all datasets. This equal-weight setting avoids additional dataset-specific manual tuning and keeps the optimization setup consistent across experiments.
The residual diffusion loss supervises the denoiser to recover the illumination-degradation residual, and therefore mainly drives illumination enhancement in the residual diffusion space. The contrastive prompt loss enforces the functional separation between the two prompts, so illumination-related correction and semantic-invariant structure guidance do not collapse into redundant conditioning. The pixel-level fidelity term constrains the restored result to remain close to the clear reference, which helps reduce structural distortion during enhancement. As a result, the optimization promotes exposure restoration while preserving structural consistency.
Accordingly, the reverse transition is parameterized by the prompt-conditioned noise prediction network, and the prompts affect the reverse process only through this network. During inference, the denoising process is implemented through the Sequential Cross-Attention mechanism, which updates the intermediate latent feature before noise prediction.
The enhanced remote sensing image is ultimately reconstructed from the low-light input and the predicted residual. By pivoting on the diffusion residual, the framework restores localized radiance gradients and topographic details in an efficient sampling pass.
4. Experiments
4.1. Dataset
We evaluate the proposed method on both low-light remote sensing datasets and paired natural Low-Light Image Enhancement (LLIE) datasets.
Remote sensing datasets. We adopt two datasets for low-light remote sensing image enhancement. iSAID-dark [1] is a paired dataset constructed from high-resolution remote sensing images. It selects 751 images as the base set and generates paired low-/normal-light samples via a synthetic degradation process. To increase scene diversity and make training feasible, multiple random crops are extracted and resized to a fixed resolution, yielding 3755 training image pairs and 66 validation image pairs. darkrs [1] contains 86 real nighttime remote sensing images captured by drones. Since paired ground truth is unavailable, it is mainly used to evaluate real-world generalization through qualitative comparisons.
Together, these two remote sensing datasets cover complementary challenging conditions. iSAID-dark evaluates restoration under paired low-light degradation with strong spatial illumination variation, while darkrs evaluates robustness on real nighttime scenes with complex illumination and sensor-dependent noise.
Natural image datasets. We further evaluate on three common paired datasets with official splits. LOLv1 [48] contains 500 paired low/normal-light images captured in real-world environments, where 485 pairs are used for training and 15 pairs for testing. LOLv2-Real [49] provides 689 training pairs and 100 testing pairs collected from real-scene captures. LOLv2-Syn [49] is a synthetic paired dataset with 900 training pairs and 100 testing pairs, constructed to simulate diverse low-light scenarios.
4.2. Metrics
For paired datasets, we report PSNR and SSIM to measure reconstruction fidelity and structural similarity. To evaluate perceptual quality, we adopt LPIPS for remote sensing evaluation on iSAID-dark and FID for natural image evaluation on LOLv1, LOLv2-Real, and LOLv2-Syn. We retain these metrics because they are widely used in low-light enhancement and provide direct comparability with existing baselines. Higher PSNR and SSIM indicate better fidelity and structure preservation, while lower LPIPS and FID reflect better perceptual quality.
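For reference, PSNR follows the standard definition 10·log10(MAX² / MSE). The sketch below assumes images scaled to [0, 1]; the peak value `max_val` and the function name `psnr` are assumptions of the sketch, since the dynamic range used in evaluation is not stated in this section.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)         # uniform error of 0.1 gives MSE = 0.01
value = psnr(pred, target)             # about 20 dB
```

Each tenfold reduction in MSE adds 10 dB, so a gain of about 3 dB corresponds to roughly halving the mean squared error.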
4.3. Training Schedules
The proposed diffusion framework is implemented in PyTorch 2.1.0 and trained on two NVIDIA RTX 4090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA). The learning rate is decayed from its initial value via cosine annealing. The Adam optimizer is used for parameter optimization, and an exponential moving average with a weight of 0.995 is maintained over the model weights. The diffusion process uses a linear noise schedule whose values range from 0.0001 to 0.02 across the timesteps. Image inputs are processed as fixed-size pixel patches with a batch size of 2. Data augmentation includes horizontal flips and random rotations at fixed angles.
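The schedule and weight averaging described above can be sketched in a few lines. The EMA weight of 0.995 comes from the text; the base learning rate `1e-4`, the step count, and the function names are placeholders chosen only for illustration.

```python
import math

def cosine_annealed_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_update(ema_weights, weights, decay=0.995):
    """Exponential moving average of model weights with the stated weight 0.995."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

base_lr = 1e-4                         # placeholder value, not from the paper
lrs = [cosine_annealed_lr(s, 100, base_lr) for s in range(101)]
ema = ema_update([1.0, 2.0], [0.0, 0.0])
```

With a weight of 0.995, each update moves the averaged weights only 0.5% of the way toward the current weights, which smooths out step-to-step fluctuations in the evaluated model.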
4.4. Qualitative Evaluation
We provide qualitative comparisons on both remote sensing and natural image datasets, including iSAID-dark, darkrs, LOLv1, and LOLv2-Real. Among them, Figure 2 and Figure 3 correspond to the target remote sensing task, while Figure 4 and Figure 5 are included as supplementary cross-domain validation on standard paired natural-image low-light benchmarks to show that the proposed prompt-guided diffusion mechanism is not restricted to remote sensing data. For a fair visual assessment, we focus on three key aspects: (i) whether severely underexposed regions are sufficiently lifted, (ii) whether object boundaries, structural layouts, and low-contrast details are preserved after enhancement, and (iii) whether common artifacts are avoided.
Results on iSAID-dark. Figure 2 presents qualitative comparisons on iSAID-dark. This dataset is characterized by large-scale underexposure and strong spatial illumination variation, where effective enhancement requires lifting dark regions while preserving thin structures and low-contrast textures in aerial views. As shown in Figure 2, some competing methods increase global brightness but tend to compress local contrast, which weakens the visibility of subtle scene cues such as parking-slot markings and boundary transitions on the asphalt. In contrast, methods that emphasize conservative correction may leave shadowed regions insufficiently recovered, yielding limited visibility gains in severely dark areas. More specifically, CUE exhibits noticeable over-brightening in the underexposed region, leading to washed-out appearances and reduced texture separability. SCI introduces evident chromatic noise-like patterns over relatively homogeneous areas, which distracts structural perception and harms visual consistency. NeRCo produces relatively limited illumination lifting, so the dark region remains less informative compared with other results. By comparison, CISPD achieves a better balance between exposure correction and structure preservation: it lifts the shadowed area to reveal meaningful scene content while keeping the parking-line patterns and low-contrast object boundaries clearer after brightness enhancement, and it avoids the obvious color corruption observed in SCI and the contrast collapse caused by aggressive brightening.
Results on darkrs. Figure 3 shows comparisons on darkrs, which consists of real nighttime remote sensing images with complex illumination conditions and sensor-dependent noise, where the enhancement quality is largely reflected by whether the method avoids over-amplification and maintains stable color statistics across the scene. Overall, several competing approaches produce overly strong exposure lifting, which makes the scene appear over-exposed and reduces the visibility of structural transitions, especially around the circular layout and surrounding paths. For instance, FourLLIE and LLFormer significantly brighten the entire image, resulting in a pale appearance that weakens tonal separation between different regions and reduces depth cues in the layout. NeRCo introduces a clear warm color bias, making the overall tone deviate from a natural nighttime distribution and affecting the consistency between illuminated and non-illuminated areas. SCI shows an evident brightness over-correction accompanied by a strong tint, which further harms visual realism. In contrast, CISPD performs a more controlled illumination adjustment: it improves visibility in dark regions while maintaining stable global tone and avoiding the strong color shift seen in NeRCo, and it better preserves the structural layout and boundary transitions of the scene without the over-exposure effect that appears in several baselines, indicating stronger real-scene generalization on darkrs.
This result also indicates that CISPD remains stable under challenging real nighttime imaging conditions, where non-uniform illumination and sensor-dependent noise appear simultaneously but paired ground truth is unavailable.
Results on LOLv1. Figure 4 demonstrates that CISPD recovers natural brightness and contrast while maintaining faithful colors and textures. On this dataset, a common failure mode of competing methods is to improve global brightness but sacrifice local contrast or color fidelity, resulting in washed-out regions, tone shifts, or detail loss in challenging areas. Some methods also introduce over-smoothing when suppressing noise, which removes high-frequency textures. In contrast, CISPD yields cleaner details and more visually coherent results: it enhances dark regions without excessive saturation, maintains more natural tone transitions, and preserves textures and edges in local areas.
Focusing on the zoomed regions in Figure 4, RUAS tends to leave the red-box area under-enhanced, where the fur texture and the boundary against the background remain indistinct. LLFormer largely lifts the exposure, yet the red-box crop shows weakened micro-texture on the fur, and the green-box crop exhibits a more saturated yarn tone with less clear thread patterns. RetinexFormer suppresses noise but also smooths fine structures in the green-box region, leading to reduced high-frequency details. In contrast, CISPD restores the exposure in both crops while retaining the fur strands in the red-box region and preserving yarn grooves and grape boundaries in the green-box region, without introducing noticeable saturation drift.
Results on LOLv2-Real. Figure 5 further validates CISPD on LOLv2-Real, which contains more diverse real-scene degradations and is generally more difficult than LOLv1. Existing approaches can produce inconsistent correction across regions, leading to remaining dark areas or over-brightened outputs with degraded local structures. In addition, uneven illumination correction may cause local contrast collapse or unnatural appearance. CISPD delivers more coherent illumination adjustment across the image and preserves local geometric structures with fewer artifacts, indicating improved robustness under diverse real-world lighting conditions.
The zoomed crops in Figure 5 further reveal the local behavior under real-scene degradations. QuadPrior shows a visible color cast and reduced local contrast in the green-box crop, and the red-box crop exhibits softened contours around the cables and the bag. CWNet improves overall brightness, but the green-box region presents weaker separation between adjacent stripe transitions, and the red-box region still contains less stable edge definition. In comparison, CISPD preserves clearer stripe patterns in the green-box crop and maintains sharper cable boundaries in the red-box crop, while keeping the global tone consistent, which leads to more reliable local structures on LOLv2-Real.
4.5. Quantitative Evaluation
We quantitatively compare CISPD with state-of-the-art LLIE methods on the paired remote sensing dataset iSAID-dark and the paired natural image datasets LOLv1, LOLv2-Real, and LOLv2-Syn. We report fidelity-oriented metrics together with perceptual metrics to reflect both reconstruction accuracy and perceptual quality and to ensure consistent comparison with prior low-light enhancement methods.
Results on iSAID-dark. Table 1 reports results under two settings: directly evaluating on iSAID-dark and retraining on iSAID-dark for comparison. The column headed iSAID-dark denotes direct evaluation on the iSAID-dark test set without retraining on iSAID-dark, which is used to assess cross-domain generalization. The column headed iSAID-dark retrain denotes the results obtained after retraining on the iSAID-dark training split, which is used to assess in-domain performance under dataset-specific supervision. Without retraining, CISPD achieves the best 21.51 dB in PSNR and 0.707 in SSIM, outperforming the second-best method by 3.07 dB in PSNR and 0.128 in SSIM. Although LPIPS is not the best under direct cross-dataset evaluation, CISPD still shows the strongest PSNR and SSIM margins. After retraining, CISPD achieves the best overall performance with 26.53 dB in PSNR, 0.856 in SSIM, and 0.101 in LPIPS. Compared with the second-best method, CISPD improves PSNR by 1.26 dB and SSIM by 0.035, and further reduces LPIPS by 0.028. These results show that CISPD remains effective under a stronger and more up-to-date remote sensing comparison.
Results on LOLv1 and LOLv2. Table 2 summarizes quantitative comparisons on LOLv1, LOLv2-Real and LOLv2-Syn. On LOLv1, CISPD achieves the best PSNR and the lowest FID among methods that report this metric, while PyDiff reports a slightly higher SSIM. On LOLv2-Real, CISPD remains competitive with 23.31 dB in PSNR and 0.888 in SSIM, while CUGD reports a slightly higher SSIM and PyDiff reports a higher PSNR. On LOLv2-Syn, CISPD achieves the best PSNR and SSIM, exceeding the second-best PSNR by 1.23 dB and the second-best SSIM by 0.004. Overall, the updated comparison shows that CISPD remains highly competitive across standard paired natural-image benchmarks. We note that CISPD does not obtain the top PSNR, SSIM, or FID on LOLv2-Real. LOLv2-Real contains more diverse real-scene degradations than LOLv1 and LOLv2-Syn, where residual misalignment and sensor-specific noise can penalize pixel-wise fidelity and feature-distribution metrics. In such cases, CISPD still maintains competitive quantitative results and strong qualitative structure preservation.
4.6. Efficiency Analysis
We analyze the model complexity of diffusion-based approaches on the LOLv1 dataset. Since diffusion models can be computationally demanding, we report parameter count and MACs to characterize computational cost at the architecture level. As summarized in Table 3, CISPD has a moderate parameter count among the compared diffusion-based baselines and requires substantially fewer MACs. These results indicate that CISPD achieves a favorable balance between restoration performance and computational complexity in terms of parameter count and MACs. We do not claim deployment-oriented efficiency from these metrics alone.
4.7. Ablation Study
Most ablation experiments are conducted on the iSAID-dark dataset under the same training and evaluation setting as the main comparison, and we report PSNR and SSIM for quantitative analysis. We further supplement cross-dataset validation on LOLv1 for the key hyperparameters of IAP length and top-k selection.
Effect of IAPs and SIPs. Table 4 evaluates the contributions of the two prompts by removing each component from CISPD. When removing IAPs, PSNR drops from 26.53 dB to 26.32 dB and SSIM decreases from 0.856 to 0.821, indicating that illumination-aware guidance is necessary for improving exposure correction and maintaining structural consistency. When removing SIPs, performance further degrades to 26.24 dB in PSNR and 0.818 in SSIM, showing that the semantic-invariant guidance plays a key role in preserving stable structures during enhancement. We also evaluate a variant without learnable keys of IAPs, which obtains 26.39 dB in PSNR and 0.826 in SSIM. Compared with the full model, this variant shows lower SSIM, suggesting that learnable keys help retrieve illumination prompts that better match the latent context and thus improve structural fidelity. Overall, combining SIPs and IAPs achieves the best performance, validating the necessity of using dual prompts in CISPD.
Effect of the IAP length. Table 5 studies the length of IAPs by varying it among 32, 64, and 128. Using length 64 yields the best results with 26.53 dB in PSNR and 0.856 in SSIM. A shorter prompt length of 32 reduces performance to 26.31 dB in PSNR and 0.829 in SSIM, indicating insufficient representational capacity for capturing illumination cues needed by the denoiser. Increasing the length to 128 does not improve performance, and instead results in 26.33 dB in PSNR and 0.831 in SSIM. This suggests that excessively long prompts may introduce redundant information and weaken the effectiveness of prompt injection. Therefore, we adopt length 64 as the default setting in CISPD. Its cross-dataset stability is further validated on LOLv1 in Table 6.
Effect of the contrastive prompt loss. Table 7 validates the role of the contrastive prompt loss that encourages the two prompts to remain complementary. Without this loss, performance decreases to 26.09 dB in PSNR and 0.802 in SSIM, which indicates that simply using dual prompts is insufficient, and their interaction needs explicit regulation. We further examine two reduced variants by removing the negative or positive component. Removing the negative component yields 26.19 dB in PSNR and 0.817 in SSIM, while removing the positive component yields 26.15 dB in PSNR and 0.813 in SSIM. Both variants perform worse than the full loss formulation, showing that the two terms contribute jointly to learning distinct and useful prompt representations. This result is consistent with the design goal of CPL, which is to enforce functional disentanglement between illumination guidance and structure guidance. With the full contrastive prompt loss, CISPD achieves 26.53 dB in PSNR and 0.856 in SSIM, giving a gain of 0.44 dB in PSNR and 0.054 in SSIM over the variant without the contrastive constraint.
Effect of the number of IAPs. Figure 6 investigates the influence of the IAP pool size on iSAID-dark. As the pool size increases from a small scale, PSNR improves steadily and reaches its best value at a moderate pool size. When the pool is too small, the retrieved illumination cues are less diverse, which limits the ability of CISPD to handle spatially varying exposure. When the pool becomes excessively large, performance drops from the peak, indicating that enlarging the candidate set does not necessarily improve retrieval quality and can introduce less relevant prompts that weaken the guidance signal. Therefore, we adopt the pool size corresponding to the best performance in Figure 6 as the default setting.
Effect of the top-k selection. Figure 6 also studies the top-k selection used to construct IAPs from retrieved candidates. The results show that intermediate top-k values achieve the best performance. Using a very small top-k may miss cues required for correcting non-uniform brightness. Using an excessively large top-k tends to mix less relevant candidates, which reduces the selectivity of retrieval and weakens the illumination prior. This indicates that illumination adjustment in CISPD is practically governed by retrieval selectivity rather than by a fixed brightness threshold. Accordingly, we use the top-k value that achieves the best performance in Figure 6 as the default setting. Its cross-dataset stability is further validated on LOLv1 in Table 6.
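To make the interplay between pool size and top-k concrete, the following minimal sketch retrieves a prompt from a small pool by key matching. It is an illustration under stated assumptions: cosine similarity as the matching score, softmax-weighted averaging of the selected entries, and the names `retrieve_prompt`, `keys`, and `prompts` are all hypothetical and are not taken from the CISPD implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def retrieve_prompt(query, keys, prompts, k):
    """Blend the k pool entries whose keys best match the query.

    Assumed aggregation: softmax over the selected similarity scores,
    followed by a weighted average of the corresponding prompt vectors.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    scores = [cosine(query, key) for key in keys]
    # Indices of the k highest-scoring keys (all of them if k exceeds the pool).
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected entries.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(prompts[0])
    blended = [sum(w * prompts[i][d] for w, i in zip(weights, top))
               for d in range(dim)]
    return top, blended
```

In this toy form, a small `k` keeps only the closest entries, while a large `k` lets weakly matching entries contribute to the blend, which mirrors the selectivity trade-off described above.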
Effect of the prompt integration strategy. We compare three prompt integration strategies on the iSAID-dark dataset. As shown in Table 8, direct input concatenation gives the weakest performance, which indicates that simple joint fusion cannot effectively separate illumination correction from structure preservation. Sequential fusion improves the results under both orders. Among them, the proposed IAP → SIP strategy achieves the best PSNR and SSIM, while the reverse order remains inferior. This result suggests that correcting illumination-related distortion before semantic-invariant structural refinement is more effective for low-light remote sensing image enhancement. Therefore, CISPD adopts IAP → SIP as the default integration strategy.
Cross-dataset validation of key hyperparameters. We further validate the two key hyperparameters on LOLv1 to examine whether the default settings selected on iSAID-dark remain stable across datasets. As shown in Table 6, the default IAP length 64 still achieves the best performance on LOLv1, while both shorter and longer prompts lead to inferior results. A similar trend is observed for top-k selection, where the default value also gives the best PSNR and SSIM. These results suggest that the selected hyperparameters are not specific to a single dataset and remain stable across both remote sensing and natural-image low-light enhancement benchmarks.
5. Conclusions
In this work, we studied low-light remote sensing image enhancement under the practical challenges of spatially varying illumination, sensor noise, and scene-dependent degradations, where preserving thin structures and boundary fidelity is critical for reliable remote sensing interpretation. To this end, we proposed CISPD, a complementary illumination–semantic prompt diffusion framework that reformulates enhancement as an iterative denoising process and introduces complementary prompt guidance to separate illumination correction from structure preservation. Concretely, CISPD retrieves a self-learned illumination-aware prompt from a learnable prompt pool conditioned on the current latent context, providing adaptive cues for correcting non-uniform underexposure. Meanwhile, a semantic-invariant prompt extracted from a vision foundation model supplies stable structural priors that help maintain geometric consistency and suppress artifacts after brightness correction. By injecting the two prompts sequentially along the diffusion trajectory, CISPD enables targeted exposure adjustment while retaining scene structures and fine details. In addition, we adopt a contrastive prompt constraint to prevent redundant guidance and encourage the two prompts to encode complementary information, which stabilizes refinement and improves structural fidelity.
Extensive experiments on paired low-light remote sensing datasets and real nighttime remote sensing imagery demonstrate that CISPD consistently improves both reconstruction fidelity and perceptual quality, while producing more coherent region-wise correction and clearer structures in qualitative comparisons. We further validated the generalization of CISPD on standard paired natural-image datasets, showing that the proposed guidance mechanism is not limited to a single domain. Efficiency analysis indicates that CISPD attains competitive model complexity with reduced computational cost compared with recent diffusion-based baselines, making it more practical for high-resolution enhancement. Ablation studies further confirm the roles of the illumination-aware and semantic-invariant prompts, the prompt design choices, and the contrastive constraint, and verify that appropriate prompt pool size and retrieval configuration are important for stable performance.
The current remote sensing evaluation is constrained by the available benchmark setting, which consists of one paired synthetic dataset and one real nighttime dataset without paired ground truth. Future work will extend the validation to broader remote sensing benchmarks and more diverse acquisition conditions. We will also optimize actual inference speed and memory usage for practical high-resolution processing. Finally, we plan to explore more robust prompt retrieval under domain shifts and real sensor noise, so that the framework can better adapt to diverse acquisition settings without additional supervision.