1. Introduction
Images acquired by drones in low-illumination environments, such as at night, are often difficult to use directly. This greatly restricts their ability to meet the demand for around-the-clock applications in fields such as disaster response and public safety. If the images captured under these low-illumination conditions could be enhanced to satisfy application requirements, the operational capabilities of drones would be significantly improved.
Research on low-illumination image enhancement can be broadly categorized into three types: (1) traditional enhancement methods, which center on image processing and physical modeling, aiming to improve image brightness and contrast; (2) deep learning-based enhancement methods, which can be roughly divided into supervised and unsupervised approaches, using machine learning to establish a mapping relationship from low-illumination images to normally lit images; and (3) methods for fusing and enhancing Nighttime Color Visible (NCV) images and Thermal infrared (TIR) images, which utilize thermal infrared data to compensate for the information lost by NCV images in low-illumination environments, and further employ machine learning algorithms to construct a mapping relationship between the fused images and normal images.
(1) Traditional Low-Illumination Image Enhancement Methods: In the early stages, traditional low-illumination image enhancement methods were primarily based on fundamental linear or nonlinear transformations, such as linear brightness stretching or nonlinear Gamma correction. These methods are characterized by simple principles and high computational efficiency. However, they lack adaptability to local image features, resulting in limited enhancement effects.
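As a concrete illustration of these early point-wise operators, the sketch below (a minimal NumPy example, not code from any cited work) implements a linear percentile stretch and a nonlinear gamma correction; note that both apply a single global curve and therefore cannot adapt to local image features.

```python
import numpy as np

def gamma_correct(img, gamma=0.5):
    """Nonlinear gamma correction on values in [0, 1]:
    gamma < 1 brightens dark regions, gamma > 1 darkens them."""
    return np.clip(img, 0.0, 1.0) ** gamma

def linear_stretch(img, low=0.02, high=0.98):
    """Linear brightness stretch: map the [low, high] percentile range to [0, 1]."""
    lo, hi = np.quantile(img, [low, high])
    return np.clip((img - lo) / max(hi - lo, 1e-8), 0.0, 1.0)

dark = np.full((4, 4), 0.04)           # a uniformly dark patch
print(gamma_correct(dark, 0.5)[0, 0])  # ≈ 0.2 — dark pixels are lifted
```

The same gamma value is applied to every pixel, which is exactly why such transforms cannot simultaneously brighten shadows and preserve already well-exposed regions.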
Subsequently, methods based on global and local histogram adjustment became dominant. Representative works include those by Pisano et al. (1998) [1,2], Lee et al. (2013) [3], and RNC et al. (2016) [4]. Since these methods adjust images solely at the level of pixel intensity distribution and do not address the physical essence of image formation, they struggle to balance contrast enhancement with color fidelity.
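For reference, global histogram equalization, the simplest member of this family, can be sketched in a few lines of NumPy (an illustrative implementation, not the code of the cited works); it redistributes intensities through the cumulative histogram, which is precisely the purely distribution-level adjustment criticized above.

```python
import numpy as np

def hist_equalize(img_u8):
    """Global histogram equalization for an 8-bit grayscale image:
    remap intensities through the normalized cumulative histogram."""
    hist = np.bincount(img_u8.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()              # cdf of the lowest intensity present
    denom = max(int(cdf[-1] - cdf_min), 1)
    lut = np.round((cdf - cdf_min) / denom * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img_u8]

# a low-contrast image squeezed into [100, 120] spreads out to the full range
rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 121, size=(64, 64)).astype(np.uint8)
eq = hist_equalize(low_contrast)
```

The lookup table is the same for every pixel regardless of its neighborhood, so contrast is stretched globally while local structure and color relationships receive no explicit treatment.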
Another category of traditional methods is based on the Retinex theory, which can effectively address the shortcomings of histogram adjustment methods. The Retinex theory was first proposed by Land et al. (1971) [5] and has been continuously improved by researchers such as Jobson et al. (1997) [6], Rahman et al. (1996) [7], Palma-Amestoy et al. (2009) [8], Fu et al. (2015) [9], and Guo et al. (2017) [10]. These improvements have led to good enhancement results for low-illumination images. Nevertheless, issues such as noise amplification and over-enhancement of illumination still persist.
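The core of these Retinex variants is the decomposition of an image into illumination and reflectance. A minimal single-scale Retinex (SSR) sketch is shown below; it is illustrative only, and substitutes a box blur for the Gaussian surround used in the original formulations.

```python
import numpy as np

def box_blur(img, radius=2):
    """Separable box blur, a stand-in for the Gaussian surround of classic SSR
    (assumption: a Gaussian kernel would normally be used)."""
    k = 2 * radius + 1
    pad = np.pad(img, radius, mode="edge")
    out = np.stack([pad[:, i:i + img.shape[1]] for i in range(k)]).mean(axis=0)
    out = np.stack([out[i:i + img.shape[0], :] for i in range(k)]).mean(axis=0)
    return out

def single_scale_retinex(img, radius=2, eps=1e-6):
    """SSR: reflectance = log(image) - log(estimated illumination)."""
    illumination = box_blur(img, radius)
    return np.log(img + eps) - np.log(illumination + eps)

flat = np.full((16, 16), 0.5)
r = single_scale_retinex(flat)   # a constant image has zero reflectance response
```

The log-difference amplifies whatever deviates from the smooth illumination estimate, including sensor noise, which is exactly the noise-amplification failure mode noted above.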
Overall, traditional methods are inherently limited by hand-crafted priors and distribution optimization. This leads to a trade-off between denoising, edge preservation, and color fidelity, making it difficult to completely eliminate artifacts such as blocky patterns, halos, noise amplification, and color casts. Furthermore, their computational complexity increases sharply with image resolution, making it difficult to satisfy the demand for high-quality, real-time enhancement. To address these limitations, the research focus has gradually shifted toward data-driven deep learning methods. By training on large-scale datasets, these networks automatically learn the mapping from low-illumination (NCV) images to Daytime Color Visible (DCV) images, thereby overcoming the shortcomings of hand-crafted priors and seeking a new balance between contrast, detail, color, and processing speed.
(2) Deep Learning-Based Low-Illumination Image Enhancement Methods: Deep learning-based methods can be broadly categorized into supervised and unsupervised learning approaches, depending on the training datasets employed. Supervised learning refers to the use of strictly pixel-level paired NCV and DCV images, whereas unsupervised learning refers to the use of unpaired data.
Supervised learning methods, represented by the pix2pix model proposed by Isola et al. (2017) [11] and the LightenNet model proposed by Li et al. (2018) [12], benefit from strict semantic consistency constraints provided by paired images, thus generating transformation results with high geometric accuracy. However, the performance of these models depends entirely on the quality and scale of the paired image datasets. Most importantly, the datasets must feature pixel-level precise alignment to achieve ideal training results.
Although these methods can guarantee effective enhancement through explicit mapping relationships, they are extremely difficult to apply to outdoor scenarios. For instance, drones cannot acquire strictly paired data of the same area at different times. Even with post-processing techniques, it is difficult to overcome interference caused by dynamic changes in ground objects (such as moving vehicles and people) and parallax variations caused by different imaging angles when the drone captures the same scene at different times. Manual pixel-level registration and annotation of such image data by professionals would be prohibitively expensive, rendering the construction of large-scale paired datasets infeasible.
Due to the stringent requirements for dataset construction in supervised enhancement methods, researchers have attempted to train neural networks using unpaired datasets. Among these, Generative Adversarial Networks (GANs) are typical unsupervised learning models. They do not rely on explicit labels or strictly paired samples; instead, by designing specific loss functions and structural constraints, these models can achieve high-quality image enhancement and transformation.
The CycleGAN model proposed by Zhu et al. (2017) [13] was first applied to low-illumination image enhancement by Qu et al. (2019) [14]. This model utilizes a cycle-consistency loss function to ensure high similarity between the converted image and the original. However, the cycle-consistency constraints and dual discriminators often lead to halos, loss of detail, and unstable training in the output images. To address detail degradation and color distortion in low-illumination images, Mao et al. (2023) proposed the Retinex-GAN model [15], which combines the classic Retinex image decomposition theory with the GAN mechanism. This unsupervised method brightens low-illumination images without relying on large paired datasets while also adapting to complex scenes.
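The cycle-consistency term at the heart of CycleGAN can be written compactly: with generators G: X→Y and F: Y→X, the loss penalizes ||F(G(x)) − x||₁ and ||G(F(y)) − y||₁. The NumPy sketch below is a toy illustration in which invertible functions stand in for trained networks.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F, lam=10.0):
    """L1 cycle-consistency term from CycleGAN:
    lam * ( ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1 ), averaged per element.
    G maps domain X -> Y (e.g., night -> day); F maps Y -> X."""
    loss_x = np.abs(F(G(x)) - x).mean()
    loss_y = np.abs(G(F(y)) - y).mean()
    return lam * (loss_x + loss_y)

# toy "generators": G brightens by a gain and F is its exact inverse,
# so the cycle loss is zero (purely illustrative, not a trained network)
G = lambda img: img * 2.0
F = lambda img: img / 2.0
night = np.random.rand(8, 8)
day = np.random.rand(8, 8)
print(cycle_consistency_loss(night, day, G, F))  # 0.0 for exact inverses
```

Because the constraint only forces round-trip reconstruction, it says nothing about *which* plausible day-domain image G should produce, which is one reason halos and detail loss can survive training.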
Furthermore, to address the task of low-illumination image enhancement, some studies have proposed more lightweight GAN architectures. For example, UEGAN enhances image brightness and contrast by learning the illumination mapping relationship between low-illumination and normally lit images, achieving good results in natural image enhancement tasks [16].
Despite the good enhancement effects achieved by unsupervised methods such as the original GAN, CycleGAN, Retinex-GAN, and UEGAN using unpaired datasets, the transformed or enhanced images often exhibit abnormal spots, edge artifacts, and other color and semantic mutations. The primary reason is that unpaired datasets provide no structural or semantic correlation between image pairs, so these methods lack the structural and semantic constraints that paired images supply. For example, the PSGAN model proposed by Liu et al. (2020) [17] for remote sensing image enhancement tends to generate artifacts and struggles to maintain consistency in ground object structures when processing remote sensing images with complex ground scenes.
(3) Methods for Fusion and Enhancement of NCV and TIR Images: In low-illumination environments, the imaging process inevitably loses a large amount of detail, and ideal information restoration and supplementation are difficult to achieve with a single enhancement algorithm or deep learning model. To address this issue, fusing NCV and TIR imagery to enhance image quality has become a mainstream solution. Representative research includes the adaptive multi-scale TIR and NCV image fusion method proposed by Pei et al. (2024) [18]. Although TIR images can compensate to some extent for information lost during NCV imaging, they exhibit significant shortcomings in color reproduction and detail representation. Consequently, the visual quality of the fused images still struggles to match that of DCV images.
To further improve the structural representation capability of NCV and TIR images, Yang et al. (2024) proposed the HDCGAN model [19], which introduces Hyperspherical Directional Cosine (HDC) space to enhance semantic consistency during the image enhancement process. However, this model adopts unpaired training datasets. Due to the lack of an effective semantic consistency constraint mechanism in unpaired data, the generated enhanced images still suffer from problems such as missing detail information, color block artifacts, and structural contour distortion. Furthermore, although the CHITNet fusion network proposed by Du et al. (2025) [20] innovates in the information transformation mechanism for TIR and CV image fusion, that research primarily focuses on structural information transformation and lacks specific modeling of natural color restoration and visual consistency in the fused images.
In recent years, fusion methods based on deep learning models, such as DenseFuse (2019) [21] and NestFuse (2020) [22] proposed by Li et al., SeAFusion (2022) [23] proposed by Ma et al., and the more recently proposed RFN-Nest [24] and DDcGAN [25], have achieved significant progress in the structural fusion of TIR and CV images, effectively enhancing edge and texture information. However, these methods typically output grayscale fusion results or pseudo-color images. Their research focus is concentrated on structural information fusion rather than natural color reconstruction. Achieving natural and stable color restoration while maintaining fused structural information remains an important open problem in infrared and visible image fusion and enhancement.
The above analysis shows that various improved GAN models [26] are currently the most competitive models for enhancing image quality. Their most prominent advantage is that training does not rely on strictly paired datasets. However, precisely because unpaired datasets lack effective constraints on semantic consistency, GAN training results inevitably suffer from color and semantic distortions. To address this issue, this paper proposes the use of Weakly Paired Image Datasets (WPIDs) instead of unpaired datasets.
Beyond the development of the enhancement methods discussed above, the accurate evaluation of enhanced image quality and its effectiveness in downstream applications remains another critical concern. Existing image quality evaluation metrics have certain limitations; most metrics focus on perceptual experience (e.g., sharpness, color) and semantic consistency, failing to establish effective evaluation criteria for the critical issue of whether image structures are completely preserved. They also do not fully consider the practical application effects in downstream tasks such as image segmentation.
Currently, mainstream image transformation and enhancement quality evaluation methods primarily utilize three categories of metrics: (1) pixel-level error metrics; (2) perception-based metrics; and (3) no-reference image quality metrics.
(1) Pixel-Level Error Metrics: These primarily include MSE, PSNR, and SSIM. Mean Squared Error (MSE) [27] is the most fundamental metric in image quality assessment. Its advantages lie in simple calculation and differentiability, making it convenient as a loss function during training. However, MSE essentially measures only brightness error and struggles to reflect subjective quality in tasks such as image enhancement. Peak Signal-to-Noise Ratio (PSNR) [28] is the logarithmic form of MSE, used to measure the ratio between signal and noise. The Structural Similarity Index (SSIM) [29] is one of the most widely used pixel-level structural perception metrics, integrating three dimensions (luminance, contrast, and structure) to reflect perceptual similarity between images. It effectively supplements the shortcomings of PSNR but still has blind spots regarding fine-grained differences in complex structures.
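As a reference point, MSE and PSNR are straightforward to compute; the NumPy sketch below (assuming an 8-bit dynamic range, MAX = 255) shows that PSNR is simply a logarithmic re-expression of MSE.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images."""
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, max_val=255.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE); higher means closer to the reference."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

ref = np.zeros((8, 8))
noisy = ref + 16.0          # constant error of 16 -> MSE = 256
print(psnr(ref, noisy))     # ≈ 24.05 dB
```

Because PSNR is a monotone function of MSE, it inherits MSE's blindness to *where* an error occurs, which is the gap SSIM was designed to fill.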
(2) Perception-Based Metrics: These primarily include FID, KID, and LPIPS. FID [1] is currently one of the most representative deep perceptual metrics, effectively measuring the realism and statistical consistency of images as a whole, and is widely used in image enhancement tasks. KID [30] uses Maximum Mean Discrepancy to provide an unbiased estimate of the difference between image distributions. Compared to FID, it is more stable, being especially suitable for small samples and scenes with large distribution deviations. In addition, KID avoids the numerical instability caused by matrix square-root operations and has gradually been adopted as a supplement or alternative to FID in recent years. LPIPS [31] is based on pre-trained convolutional networks and evaluates visual consistency by comparing distances between local image features; it captures fine-grained perceptual differences, making it suitable for evaluating visual improvements in tasks such as image enhancement and style transfer.
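KID's unbiased squared-MMD estimate with a polynomial kernel can be sketched directly in NumPy (an illustrative implementation on raw feature vectors; in practice the features come from a pre-trained Inception network).

```python
import numpy as np

def polynomial_kernel(X, Y, degree=3, gamma=None, coef0=1.0):
    """Kernel commonly used for KID: k(x, y) = (gamma * x.y + coef0) ** degree,
    with gamma defaulting to 1/d (d = feature dimension)."""
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    return (gamma * X @ Y.T + coef0) ** degree

def kid_mmd2(X, Y):
    """Unbiased squared MMD between feature sets X (real) and Y (generated);
    diagonal terms are excluded, so the estimate is unbiased and can go
    slightly negative for identical distributions."""
    m, n = len(X), len(Y)
    Kxx = polynomial_kernel(X, X)
    Kyy = polynomial_kernel(Y, Y)
    Kxy = polynomial_kernel(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = rng.normal(size=(200, 16))               # two samples of one distribution
shifted = rng.normal(loc=2.0, size=(200, 16))   # a clearly shifted distribution
```

No matrix square root appears anywhere in the estimate, which is the numerical-stability advantage over FID mentioned above.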
(3) No-Reference Image Quality Metrics: These primarily include NIQE and BRISQUE. NIQE [32] assesses how far a given image deviates from natural image statistics, but has stability issues in certain cases. BRISQUE [33] models the local spatial structure statistics of images and can detect quality defects such as blur and noise, but is highly sensitive to low-quality images.
In summary, although the aforementioned metrics can evaluate the quality of enhanced images to a reasonable extent, they all have shortcomings. The primary issues include difficulty in assessing local distortions and color shifts, as well as reliance on high-quality prior data. Furthermore, they cannot evaluate the performance of generated images in downstream tasks (such as object detection and classification). To address this, Han et al. [34] proposed the RmAP metric, whose main idea is to compare the mAP (mean Average Precision) of object detection between real and generated images, thereby reflecting the structural restoration quality of the generated images and their effectiveness in downstream tasks like object detection. However, RmAP requires real annotation data and cannot reflect information in non-target areas (such as background textures), which limits its application scenarios.
To evaluate the performance of enhanced images in downstream tasks as accurately and objectively as possible, this paper introduces an evaluation method based on the segmentation results of the Segment Anything Model (SAM). SAM is a universal vision segmentation foundation model proposed by Meta with zero-shot segmentation capabilities [35]. Although it has not been specifically fine-tuned for UAV data, experiments show that it has relatively stable structural perception capabilities in remote sensing scenarios. We attempt to establish a stable and objective structural perception evaluation mechanism for enhanced image quality to solve the problem of accurately evaluating the enhancement effects of low-illumination images.
The main contributions of this paper are as follows:
- (1) This work presents a Weakly Paired Image Dataset (WPID) construction strategy for UAV remote sensing image acquisition under low-illumination conditions, which addresses the challenge of building strictly paired datasets in outdoor scenarios.
- (2) An HDCGAN+ image enhancement approach trained on the WPID is introduced, which improves semantic consistency and stability throughout the image transformation process.
- (3) We further develop a structural consistency evaluation framework built upon SAM segmentation outputs. Integrated with conventional perceptual metrics, this framework improves evaluation accuracy for practical performance in downstream remote sensing tasks.
4. Discussion and Analysis
This study addresses issues such as the poor quality of UAV remote sensing imagery in low-illumination environments, the data dependency conflicts of existing enhancement methods, and the limitations of evaluation systems. It proposes technical solutions, including the LRR-FusionColor multimodal fusion method, the use of a weak pairing mechanism to improve the enhancement performance of HDCGAN, and a structural consistency evaluation based on SAM segmentation results. The effectiveness of these methods was validated using day–night dual-modal UAV imagery from Xianghe City, Hebei Province. However, this study still has some shortcomings and areas for improvement, which are discussed and analyzed below.
4.1. Sensitivity Analysis of WPID Misalignment
The dataset used in this study was collected from nighttime UAV scenes in Xianghe County and Hancheng, Hebei Province, mainly covering remote sensing images of typical suburban areas. Although it covers various ground object types and spatial structures and reflects the actual application environment of UAV nighttime remote sensing to a certain extent, the current data come mainly from a single region and do not cover extreme weather conditions (e.g., rain and fog) or more complex cross-regional scenarios. Therefore, the experimental results mainly verify the effectiveness of the method under conventional UAV nighttime imaging conditions. Future research will extend to different regions and public datasets to systematically evaluate the cross-scene applicability and generalization ability of the model.
To address the core limitations of difficulty in acquiring strictly paired data and easy semantic distortion of unpaired data in low-illumination UAV remote sensing image enhancement, this study proposes a WPID mechanism. This mechanism only requires images captured day and night in the same area with consistent scene semantics, without strict pixel-level registration.
Since applications such as UAV inspection, security monitoring, and emergency observation usually adopt a repeated route acquisition strategy, daytime and nighttime images of the same area often have high semantic consistency, but inevitably have a certain spatial offset. Based on the characteristics of this practical application, WPID is proposed to reduce the cost of data collection while retaining necessary semantic correspondence.
Experimental results show that under the combined action of the weakly paired training mechanism and the structural improvement of HDCGAN+, the improved model outperforms typical methods such as CycleGAN in FID, KID and other indicators, and can better restore vegetation hue and building texture in subjective visual effects, indicating that the mechanism can maintain good enhancement performance while reducing data acquisition costs.
To further verify the rationality of the weakly paired mechanism and analyze the influence of different degrees of weak pairing on model performance, this paper adds a WPID offset sensitivity experiment. By controlling the spatial offset of daytime color visible (DCV) images relative to nighttime color visible (NCV) images, datasets with offset ratios of 0%, 10%, 20%, 30%, 50%, and 100% were constructed, and HDCGAN+ was trained under the same reference settings.
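The construction of the offset datasets can be illustrated schematically. In the sketch below (an assumption about the cropping procedure, since the exact implementation is not reproduced here), the DCV crop window is shifted horizontally by offset_ratio × patch size relative to the NCV window, so 0% yields a pixel-aligned pair and 100% a completely non-overlapping one.

```python
import numpy as np

def weakly_paired_crop(ncv, dcv, top, left, size, offset_ratio):
    """Cut a training pair in which the DCV (day) patch is shifted relative to
    the NCV (night) patch by offset_ratio * patch size.
    offset_ratio = 0.0 -> pixel-aligned pair; 1.0 -> no spatial overlap."""
    shift = int(round(offset_ratio * size))
    night = ncv[top:top + size, left:left + size]
    day = dcv[top:top + size, left + shift:left + shift + size]
    return night, day

H, W, size = 256, 1024, 128
ncv = np.zeros((H, W))
# encode the column index into the day image so the shift is visible
dcv = np.tile(np.arange(W, dtype=float), (H, 1))
n, d = weakly_paired_crop(ncv, dcv, 0, 0, size, 0.3)   # day patch starts at column 38
```

At a 30% ratio the two windows still share roughly 70% of their columns, which matches the experimental observation that moderate misalignment preserves enough semantic correspondence for training.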
Table 6 shows the results of traditional evaluation indicators under different offset ratios.
It can be seen in Table 6 that when the offset ratio is in the range of 0–30%, FID, KID, and the structural consistency indicators (ΔN, ΔA) change only mildly, indicating that moderate spatial misalignment does not significantly affect enhancement performance as long as a degree of semantic correspondence is maintained. As the offset ratio increases further, the quality of the generated results gradually declines. When the offset ratio reaches 100% (completely unpaired), FID and KID rise sharply, indicating that the gap between the generated and real image distributions widens markedly and enhancement performance deteriorates noticeably.
In addition, the BIoU indicator mainly measures the performance of enhanced images in terms of semantic region overlap and boundary structure retention. It can be seen in Table 6 that at small offset ratios (0–30%), BIoU remains relatively stable, again indicating that moderate spatial offset does not significantly harm the structural consistency of enhanced images. As the offset ratio rises to 50% and above, BIoU gradually decreases, showing that semantic region overlap and boundary matching begin to weaken. When the offset ratio reaches 100% (completely unpaired), BIoU drops sharply, indicating that the spatial correspondence between semantic objects is destroyed and the structural expressiveness of the enhanced images is greatly reduced.
The increase in FID and KID with the offset ratio shows that spatial offset does degrade the visual quality of the output images, whereas ΔN and ΔA do not vary consistently with the offset ratio. This illustrates the discrepancy between human visual judgments of enhanced images and how those images are processed by downstream algorithms. In addition, because the loss function of a GAN does not converge to a well-defined optimum during training, termination relies on manual visual judgment; the resulting "optimal" model is therefore highly vision-dependent rather than task-oriented, an issue that warrants further in-depth discussion.
Figure 10 shows the enhancement results under different offset ratios. From left to right are the daytime reference image and the enhancement results at offset ratios of 0%, 10%, 20%, 30%, 50%, and 100%. Overall, as the spatial overlap between images decreases, the quality of the enhanced images gradually declines. In the 0–30% range, the results still preserve the main ground object structures and texture details, such as roads and buildings; as the offset ratio increases further, the ability to retain structure weakens markedly. Under the 100% (completely non-overlapping) condition, the lack of effective semantic correspondence causes the fidelity and consistency of some building boundaries and road structures to degrade progressively relative to the Day Truth, and the enhanced image quality is at its worst.
To further analyze the influence of different offset ratios on the enhanced images, the SAM model was used to segment the results at each ratio; the segmentation masks are visualized in Figure 11. From left to right are the segmentation results of the daytime reference image and of the enhanced images obtained at offset ratios of 0%, 10%, 20%, 30%, 50%, and 100%.
The segmentation results in Figure 11 are largely consistent with the ΔN and ΔA evaluations in Table 6; for example, the result at 50% offset is not worse than that at 10% offset. There are three main reasons for this. First, the training termination condition of HDCGAN+ is the visually optimal result as judged by human observers, which is not linked to any parameter or information from the task stage. Second, although WPID greatly strengthens the semantic consistency constraint between training pairs, the structure of the GAN model itself prevents it from fully exploiting this semantic consistency to improve enhancement quality. Third, the SAM model was not specifically trained on the data in this study, and its segmentation results contain mis-segmentations in some samples, which may also affect the structural consistency indicators.
In addition, although the WPID mechanism shows clear advantages, it still has shortcomings. First, it is difficult to apply in extreme low-illumination environments: the experiments only verify performance under conventional nighttime conditions and do not cover moonless nights or heavy haze, where visible-image noise rises sharply, information loss is severe, and the model is prone to generating texture artifacts. Second, the computational cost is high: after the input resolution is extended to 512 × 512, the per-image computation increases substantially, and training and inference still depend on high-performance server hardware. The model cannot yet run on the low-power hardware onboard UAVs, making real-time "capture-and-enhance" applications (e.g., live monitoring of nighttime disaster scenes) infeasible. Finally, when visible images suffer severe overexposure or structural loss, the structural prior relied on by the LRR-FusionColor module may degrade, limiting the final color restoration effect.
The experimental data in this paper come mainly from UAV images collected in Xianghe County, Hebei Province. Although the scenes include typical ground object types such as roads, buildings, vegetation, and water bodies, the overall regional scope remains limited and does not cover wider regional differences or more complex surface types. To further verify the applicability of the method under different scene conditions, a supplementary comparative experiment using UAV image data collected in the suburban area of Hancheng Town, Tangshan, is included in the experimental section. In addition, among currently public UAV visible–thermal infrared multimodal datasets, resources with cross-temporal acquisition, natural spatial offset, and weakly paired characteristics remain scarce, which limits systematic comparative verification across regions or datasets. Future research will expand the data collection regions and scene types and verify the generalization ability of the method under more data conditions.
To address these problems, future research will optimize the WPID mechanism and its supporting models in two directions. First, at the data level, adaptability to extreme low-illumination scenes will be improved by constructing WPIDs that include severe weather conditions, combined with multi-scale noise suppression strategies to increase robustness under complex imaging conditions. Second, the GAN model itself will be improved: exploring architectures adapted to WPID data, lightweight designs, better enhancement quality, more natural convergence of the training process, and lower computational complexity, so as to make deployment in real-time UAV remote sensing applications feasible.
4.2. Verification of Generalization Ability on Cross-Regional Data
To further verify the applicability of the proposed method in different regional scenes, supplementary experiments are carried out on UAV nighttime image data collected in Hancheng Town, Tangshan City. The dataset contains a total of 220 pairs of image samples, including nighttime color visible (NCV) images, thermal infrared (TIR) images and daytime color visible (DCV) images. Although the data have been registered, there may still be a certain spatial offset between images due to factors such as cross-temporal acquisition and UAV attitude changes. Therefore, this experiment mainly analyzes the enhancement effects of different methods through visual comparison without strict quantitative evaluation.
In this experiment, TIR and NCV images are first structurally fused by LRRNet, and the color information of nighttime images is restored by the FusionColor method; then NCV and DCV images are input into the HDCGAN+ model for enhancement and compared with the original HDCGAN method. It should be noted that the original HDCGAN method uses completely unpaired training data, so its generated results may be somewhat different from the method in this paper in local areas. To ensure the consistency of comparison, this paper shows the enhancement results of different methods on the same input samples.
Figure 12 shows the enhancement results of different methods on the Tangshan dataset. From left to right are the enhancement results of FusionColor, CycleGAN, EnlightenGAN, RetinexFormer, HDCGAN, HDCGAN+ and the daytime reference image (Day Truth). From the perspective of human visual judgment, HDCGAN+ performs more stably than the other methods in brightness restoration, texture detail, and structure retention, and its results are visually closest to the daytime reference image, indicating that the proposed method transfers well under cross-regional data conditions.
4.3. Structural Consistency Evaluation Index Based on SAM Segmentation Results
In the evaluation of low-illumination UAV remote sensing image enhancement, the block-based structural consistency evaluation system built on SAM in this study compensates for the limitations of traditional indicators, which emphasize perceptual quality while neglecting structural integrity and downstream task adaptability. Existing mainstream indicators (e.g., PSNR, SSIM, FID) quantify only pixel errors or global perceptual similarity and cannot effectively assess whether image structure is completely retained. However, downstream UAV remote sensing tasks (e.g., target detection, ground object classification) demand high structural consistency; for example, the integrity of building outlines in disaster emergency scenarios directly affects the assessment of damage. The structural consistency indicators based on SAM segmentation results proposed in this paper address image quality evaluation from the perspective of these downstream tasks.
In addition, this study uses the zero-shot segmentation ability of SAM to realize full-image structural blocking without additional annotation, and designs two indicators: block number difference and block coverage area difference, focusing on quantifying the enhancement effect from the perspective of global structure retention. The similarities and differences between traditional distribution indicators (FID/KID) and structural consistency indicators (ΔA/ΔN/BIoU) reflect the natural trade-off between perceptual quality and structural retention. The indicators in this paper pay more attention to the structural stability in subsequent remote sensing applications, so SAM indicators are more interpretable at the task level.
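The two block-based indicators can be sketched as follows (an illustrative NumPy implementation; the paper's exact formulas and any weighting scheme are not reproduced here). ΔN is the absolute difference in segment count, ΔA the absolute difference in covered-area fraction, and BIoU is approximated as the mean best-match IoU between day-image and enhanced-image segments.

```python
import numpy as np

def structural_consistency(masks_day, masks_enh):
    """Block-based structural consistency sketch. Each argument is a list of
    boolean SAM masks for one image.
      dN:   absolute difference in segment (block) count
      dA:   absolute difference in total covered-area fraction
      biou: mean best-match IoU between day and enhanced segments"""
    dN = abs(len(masks_day) - len(masks_enh))
    area = lambda ms: float(np.mean(np.any(np.stack(ms), axis=0))) if ms else 0.0
    dA = abs(area(masks_day) - area(masks_enh))
    ious = []
    for m in masks_day:
        best = 0.0
        for k in masks_enh:
            inter = np.logical_and(m, k).sum()
            union = np.logical_or(m, k).sum()
            best = max(best, inter / union if union else 0.0)
        ious.append(best)
    biou = float(np.mean(ious)) if ious else 0.0
    return dN, dA, biou

# toy check: identical segmentations give dN = 0, dA = 0, biou = 1
m = np.zeros((8, 8), bool); m[:4, :4] = True
dN, dA, biou = structural_consistency([m], [m.copy()])
```

Lower ΔN and ΔA and higher BIoU indicate that the enhanced image preserves the same block structure that SAM finds in the daytime reference, which is the relative-comparison use of these indicators described above.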
Experimental results show that the enhanced images of the HDCGAN+ model perform best on all structural consistency indicators, and their segmentation masks closely match the daytime real images in road continuity and building outline integrity. The indicators effectively identify "pseudo-enhancement" cases that score acceptably on traditional metrics while missing structural content, and provide a quantitative basis for linking enhancement effect to practical application value.
However, this evaluation method still has three shortcomings. First, this paper directly adopts the general-purpose SAM segmentation model, which itself carries segmentation errors; moreover, it has not been specially optimized for the small ground objects (e.g., rural paths, farmland ridges) or low-contrast ground objects (e.g., green belts under low light) common in UAV remote sensing. These factors may compound the errors and lead to missed or incorrect segmentations, degrading the accuracy of the structural metrics. Second, the proposed metric weights lack scene adaptability. Different application scenarios emphasize different evaluation dimensions: public security scenarios stress the structural integrity of people and vehicles, while disaster emergency scenarios stress the clarity of building damage boundaries. The current system uses equal weights, and its effectiveness across scenarios remains to be tested and improved. Third, although image quality theoretically affects SAM's segmentation, there is currently no quantitative method to assess this influence; the SAM-based metrics are therefore intended for relative comparison of structural consistency rather than absolute semantic accuracy. Future research could verify the reliability of the evaluation system with UAV-specific segmentation models or manually annotated data.
This evaluation method can subsequently be optimized in three directions. First, improve SAM's segmentation adaptability to remote sensing ground objects: fine-tune the SAM decoder on UAV remote sensing datasets annotated with buildings, roads, vegetation, and small ground-object categories to strengthen segmentation of low-contrast and small objects, and introduce ground-object-type weight coefficients that assign higher penalties to segmentation errors on core objects in a scene, thereby improving metric accuracy. Second, test the metric system on other downstream tasks such as target detection and ground object classification, and incorporate evaluations from these tasks into the current system where necessary. Third, design a scene-adaptive weighting mechanism: build a weight-prediction model driven by scene features (e.g., ground-object density, lighting conditions, task type) that dynamically adjusts each metric's weight from the input scene information, improving the scene adaptability of the evaluation system.
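As a sketch of the scene-adaptive weighting direction, the dynamic adjustment could be as simple as selecting a weight profile per scene before aggregating the normalized metric scores. The scene names, weight values, and metric keys below are purely hypothetical placeholders for whatever a trained weight-prediction model would output:

```python
# Hypothetical per-scene weight profiles (all values illustrative);
# each profile sums to 1 so aggregated scores remain comparable.
SCENE_WEIGHTS = {
    "public_security":    {"delta_n": 0.2, "delta_a": 0.2, "biou": 0.6},
    "disaster_emergency": {"delta_n": 0.3, "delta_a": 0.3, "biou": 0.4},
    "default":            {"delta_n": 1/3, "delta_a": 1/3, "biou": 1/3},
}

def weighted_score(indicators, scene="default"):
    """Aggregate normalized indicator scores with scene-dependent weights."""
    weights = SCENE_WEIGHTS.get(scene, SCENE_WEIGHTS["default"])
    return sum(weights[k] * indicators[k] for k in weights)
```

In a full system the static table would be replaced by a model that predicts the weights from scene features rather than a fixed lookup.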
4.4. Analysis of Differences Among Evaluation Indicators
The experimental results show that methods can diverge between traditional perceptual quality metrics (e.g., FID and KID) and SAM-based structural consistency metrics; for example, HDCGAN performs well on FID/KID but comparatively poorly on the structural consistency metrics. This divergence stems from the different goals of the metrics.
FID and KID mainly measure the similarity between the feature distributions of generated and real images, and thus reflect overall visual perceptual quality and statistical consistency. Such metrics do not directly evaluate spatial structure accuracy or ground-object boundary consistency, so an image with a good overall visual appearance may still contain locally shifted or distorted structures.
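For reference, FID is the Fréchet distance between Gaussian fits of deep-feature statistics, FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}), which makes clear why it is blind to spatial structure: only distribution moments enter. A minimal sketch under the simplifying assumption of diagonal covariances (so the matrix square root reduces to elementwise terms; a full implementation uses the complete covariance matrices of Inception features):

```python
import math

def fid_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances.

    mu*: per-dimension feature means; var*: per-dimension variances.
    With diagonal covariances, Tr(S1 + S2 - 2*(S1*S2)^(1/2)) reduces
    to a sum of elementwise terms (v1 + v2 - 2*sqrt(v1*v2)).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term
```

Two images with very different layouts can yield near-identical feature statistics and hence a near-zero distance, which is exactly the gap the structural metrics target.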
In contrast, the SAM-based structural consistency metrics are computed directly from segmentation results, using boundary-related quantities such as region count and area, and therefore emphasize the integrity of ground-object structure and the consistency of spatial layout. Such metrics are closer to the practical requirements of downstream remote sensing tasks such as target recognition, change detection, and semantic analysis.
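The excerpt does not spell out how BIoU is defined; one plausible reading, matching each reference block to its best-overlapping block in the enhanced image's segmentation and averaging the IoUs, can be sketched as follows (the set-of-pixels mask representation and the greedy matching strategy are both assumptions):

```python
def mean_block_iou(masks_ref, masks_enh):
    """Average best-match IoU of reference blocks vs. enhanced blocks.

    Each mask is modeled as a set of pixel coordinates, a simplified
    stand-in for a binary segmentation mask.
    """
    def iou(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0

    if not masks_ref:
        return 1.0 if not masks_enh else 0.0
    # For each reference block, take its best-overlapping enhanced block
    return sum(max((iou(r, e) for e in masks_enh), default=0.0)
               for r in masks_ref) / len(masks_ref)
```

A score near 1 means the enhanced image's segments align closely with the reference layout; missing or displaced structures pull the score down even when global feature statistics look good.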
Therefore, the divergence between FID/KID and the SAM-based metrics does not indicate an inconsistent evaluation; it reflects the trade-off each method makes between perceptual quality and structural consistency. The HDCGAN+ proposed in this paper performs better on the metrics tied to specific downstream tasks, indicating that it better preserves the integrity of ground-object structures, which matters more for the subsequent application and processing of remote sensing images.