1. Introduction
In recent years, generative models have significantly advanced natural image generation technologies, encompassing various domains, such as text-to-image generation, video synthesis, image translation, and editing [1,2,3,4]. These advancements have found widespread application in fields such as artistic design, data augmentation, and multimodal intelligent systems. Despite the notable improvements in image quality and diversity, the generated scenarios remain predominantly focused on everyday human activities, with relatively limited research on satellite remote sensing image generation. However, satellite remote sensing images hold substantial value for a wide range of remote sensing applications.
For instance, generating realistic satellite remote sensing images based on a scene description or a provided sketch is highly valuable for tasks such as urban planning, data augmentation, and pseudo-label generation in weakly supervised learning. This capability is particularly critical for the construction of remote sensing datasets. Research on intelligent processing algorithms for remote sensing data has demonstrated that abundant and diverse data samples enhance a model's ability to learn and understand remote sensing image features, improving its generalization performance and accuracy [5,6,7,8]. However, because of limitations in acquisition conditions and meteorological factors, existing satellite remote sensing datasets often suffer from insufficient data volumes and a lack of diverse features. Consequently, leveraging generative models to synthesize artificial satellite remote sensing images to supplement and expand these datasets has emerged as a significant research direction.
Compared to other natural images, remote sensing images contain a diverse range of objects with varying shapes, textures, and colors. The spatial arrangement of these objects must adhere to reasonable relative positions and topological relationships. Artificially generated remote sensing image data must simultaneously ensure the correctness of multiple spatial elements, including shape features, texture features, topological features, and color characteristics. Consequently, the task of remote sensing image generation presents significantly greater challenges.
Currently, research on remote sensing image generation primarily focuses on methods based on generative adversarial networks (GANs) [9,10,11,12] and diffusion models [2,13,14,15]. For instance, Reed et al. developed StackGAN [16], which utilizes stacked generators to produce high-resolution remote sensing images with dimensions of 256 × 256. However, GAN-based methods often suffer from instability during training. In contrast, diffusion models demonstrate superior generative capabilities and offer a relatively stable training process.
The emergence of diffusion models has greatly enhanced the effectiveness of text-to-image generation [2,13,14,15]. These models can generate high-quality images guided by textual prompts and, after extensive training on large-scale datasets, can produce diverse images, ranging from photorealistic landscapes to fantastical scenes.
To enhance spatial control, conditional pixel-level diffusion models (DMs), such as ControlNet [17], T2I-Adapter [18], and ControlNet++ [19], introduce mechanisms that allow users to input guiding sketches for image generation, enabling the models to produce highly controllable images. However, their performance in remote sensing image generation remains unsatisfactory, as illustrated in Figure 1. They often fail to produce images that encompass rich remote sensing information, such as the quantity of objects, object shapes, and the spatial relationships between objects. GeoSynth [20] employed OpenStreetMap (OSM) images as conditional guidance for image generation, achieving notable results. Compared to Canny edge maps or segmentation maps, OSM images contain spatial information, such as roads and buildings, enabling more effective guidance. Nevertheless, OSM data only include major structural features, like buildings and roads, while omitting finer details, such as trees and building shadows. This limitation leads to generated images that lack precision and realism in their fine details.
To address the aforementioned challenges, this study proposes OP-Gen, which is designed to enhance the detail fidelity and realism of pre-trained DMs in remote sensing image generation. The model consists of two branches: ControlNet and OSM-prompt (OP). The ControlNet branch extracts structural information from OSM data; however, because OSM data lack detailed content, relying solely on this branch may result in missing fine-grained details. While textual descriptions can provide additional guidance, they primarily convey high-level semantics and struggle to supplement fine details effectively.
To overcome this limitation, we introduce the OP branch, which leverages a Contrastive Language-Image Pretraining (CLIP) [21] image encoder to extract features from OSM images and a CLIP text encoder to obtain textual features. These features are subsequently fused using an OP-Controller, enabling the integration of textual information with the image structure to enhance the generation quality. The information extracted from the ControlNet and OP branches is injected into the diffusion model, guiding image generation. This ensures that the model preserves the global layout (e.g., the number and arrangement of buildings) while enriching fine details, producing images that closely resemble real-world remote sensing imagery. Furthermore, to optimize the injection of fine details from the OP branch, we propose the T-control mechanism. In standard diffusion models, early denoising steps focus on learning high-level features, while later steps refine fine details [22,23]. If the OP branch information is injected at all time steps, the model is prone to overfitting, leading to artifacts or distortions in the generated images. To mitigate this, T-control injects CLIP-encoded high-level textual information during the early denoising steps, while, in the later steps, it injects fused fine-grained details from both OSM images and text. This strategy prevents overfitting so that the generated remote sensing images maintain structural coherence while preserving high-fidelity details.
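To make the dual-branch conditioning and the T-control schedule concrete, the following PyTorch sketch shows one possible way to fuse CLIP text and OSM-image features and to gate their injection by the denoising time step. The module and parameter names (OPController, t_threshold) and the fusion via cross-attention are illustrative assumptions, not the released OP-Gen implementation.

```python
import torch
import torch.nn as nn

class OPController(nn.Module):
    """Illustrative OP-Controller: fuses CLIP text tokens with CLIP OSM-image tokens."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_feats: torch.Tensor, osm_feats: torch.Tensor) -> torch.Tensor:
        # Text tokens (queries) attend to OSM-image tokens (keys/values)
        # to pick up fine-grained structural cues.
        fused, _ = self.cross_attn(text_feats, osm_feats, osm_feats)
        return self.proj(fused) + text_feats  # residual keeps the high-level semantics


def t_control_condition(t: int, text_feats, osm_feats, controller, t_threshold: int = 500):
    """T-control sketch: time steps count down from T to 0 during denoising.
    Early (high-noise) steps receive only high-level text features; later
    (detail-refining) steps receive the fused text + OSM detail features."""
    if t >= t_threshold:
        return text_feats
    return controller(text_feats, osm_feats)
```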
The main contributions of this work are as follows.
A novel dual-branch remote sensing image generation algorithm. This study proposes an efficient two-branch framework, where the ControlNet branch extracts the structural framework of the image, while the OP branch, designed for detail extraction based on the image structure, addresses the issue of detail loss in existing remote sensing image generation methods. By incorporating this OP branch, the model generates images with significantly enhanced detail fidelity and richness.
A time-step-based training and inference strategy. This work introduces a temporal control strategy to mitigate the overfitting of detail guidance information in the training process of diffusion models, which often leads to image artifacts and distortions. By employing this strategy, we significantly improve the realism and structural coherence of the generated remote sensing images.
Figure 1. Remote sensing images generated by stable-diffusion-3.5-large [24]. They exhibit rich colors and high photorealism; however, compared to real remote sensing images, their style deviates significantly, and the generated images primarily capture localized regions, resulting in a lower level of information richness than authentic remote sensing data.
4. Results
To validate the effectiveness of the proposed OP-Gen model, we compared it with state-of-the-art remote sensing image generation models. For a quantitative evaluation of the performance of different models, we used standard metrics such as the Fréchet inception distance (FID) [47], structural similarity index measure (SSIM) [48], and CLIP score [49], which are widely adopted in assessing the quality of image generation models. In the quantitative evaluation, we selected GeoSynth [20] as the baseline model. GeoSynth not only utilizes OSM and text for guidance but also incorporates location information. For a fair comparison, we evaluated GeoSynth using only OSM and text guidance, consistent with the setup of our model. Additionally, we compared our model with other state-of-the-art algorithms, all of which used text and images for guidance. The test results for these algorithms were obtained from the respective authors' official websites. Further details can be found in Section 4.4.
Apart from the quantitative evaluation, we performed a qualitative analysis of the generated images. In Section 4.5, we compare the images generated by our model with those produced by other models, focusing on aspects such as the framework structure, spatial information, and detail accuracy.
4.1. Evaluation Metrics
The FID is a metric used to assess the difference between the distributions of generated images and real data. It extracts image features using a pre-trained Inception network (typically Inception-v3). For each image, the Inception network generates a high-dimensional feature vector representing certain high-level semantic information of the image. The mean and covariance of the features of the real images and generated images are then computed, and the distance between these two feature distributions is calculated to obtain the FID score. A smaller FID indicates better quality of the generated images, as it implies a distribution closer to that of the real images. The calculation formula is as follows:

$$\mathrm{FID} = \left\lVert \mu_r - \mu_g \right\rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

where $\mu_r$ and $\mu_g$ represent the feature means of the real and generated images, respectively; $\Sigma_r$ and $\Sigma_g$ are the feature covariance matrices of the real and generated images, respectively; and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.
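For reference, a minimal NumPy/SciPy sketch of this computation from precomputed Inception features is shown below; it mirrors the formula above and is not tied to any particular FID library.

```python
import numpy as np
from scipy import linalg

def compute_fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```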
The SSIM is an indicator used to measure the similarity between two images in terms of structure, brightness, contrast, and other aspects. Its value ranges from 0 to 1, where 1 indicates that the two images are identical and 0 indicates that they are completely different. The SSIM calculation formula consists of three main components: brightness, contrast, and structure. With the original image denoted as $x$ and the processed image denoted as $y$, the SSIM calculation formula can be expressed as follows:

$$\mathrm{SSIM}(x, y) = \left[ l(x, y) \right]^{\alpha} \left[ c(x, y) \right]^{\beta} \left[ s(x, y) \right]^{\gamma}$$

where $l(x, y)$ represents the luminance comparison measure, $c(x, y)$ represents the contrast comparison measure, and $s(x, y)$ represents the structural comparison measure.
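A simplified sketch of these three components, computed from global image statistics, is given below; practical evaluations normally use a windowed implementation (e.g., skimage.metrics.structural_similarity), and the constants follow the common choice k1 = 0.01, k2 = 0.03.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, data_range: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Simplified SSIM between two grayscale images using global statistics."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)

    mu_x, mu_y = x.mean(), y.mean()
    sig_x, sig_y = x.std(), y.std()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sig_x * sig_y + c2) / (sig_x ** 2 + sig_y ** 2 + c2)
    structure = (cov_xy + c2 / 2) / (sig_x * sig_y + c2 / 2)
    return luminance * contrast * structure
```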
The CLIP score measures the semantic relevance between text and image. It is based on the CLIP model, which calculates the cosine similarity between the text feature vector and the image feature vector, thereby providing a semantic relevance score between them. For a given text description $T$ and an image $I$, the CLIP score can be expressed as follows:

$$\mathrm{CLIPScore}(T, I) = \cos\left( E_T, E_I \right) = \frac{E_T \cdot E_I}{\lVert E_T \rVert \, \lVert E_I \rVert}$$

where $E_T$ and $E_I$ denote the CLIP text and image feature embeddings, respectively. The score ranges from −1 to 1, where a value closer to 1 indicates higher semantic relevance between the text and the image.
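The sketch below computes this cosine similarity with a Hugging Face CLIP checkpoint; the specific backbone (openai/clip-vit-base-patch32) is an assumption, since the paper does not state which CLIP model is used for scoring.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(text: str, image: Image.Image) -> float:
    """Cosine similarity between the CLIP text and image embeddings."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
```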
4.2. Dataset
In this study, we utilized the dataset curated by Srikumar et al. [20], which consists of paired high-resolution satellite images and OSM images. To annotate each satellite image, we employed the LLaVA [50] multimodal LLM. During dataset selection, we filtered out image pairs that consisted solely of bare land, water bodies, or forests. This filtering process was necessary because our method relies on OSM images as conditions to guide image generation, leveraging structured information, such as roads and buildings, to assist the model in generating more realistic remote sensing images. However, OSM images corresponding to areas dominated by bare land, water, or forests lack meaningful structured information, providing little valuable guidance for the model. As a result, retaining such samples would not only fail to improve the generation quality but also increase the computational overhead and training time. Therefore, we excluded these image pairs from our dataset. The final dataset contained 44,848 image pairs, with each image having a resolution of 512 × 512 pixels.
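The paper does not describe this curation step in code; one simple heuristic consistent with the stated criterion, keeping only OSM tiles that contain drawn structures (roads, buildings) rather than a near-uniform land/water/forest background, might look as follows. The threshold value is a hypothetical choice.

```python
import numpy as np
from PIL import Image

def has_structured_content(osm_tile_path: str, min_foreground_fraction: float = 0.01) -> bool:
    """Keep an OSM tile only if a non-trivial fraction of pixels differ from the
    dominant background colour, i.e., the tile contains roads, buildings, or
    other rendered structures rather than uniform bare land, water, or forest."""
    img = np.asarray(Image.open(osm_tile_path).convert("RGB"))
    colours, counts = np.unique(img.reshape(-1, 3), axis=0, return_counts=True)
    background_fraction = counts.max() / counts.sum()
    return (1.0 - background_fraction) >= min_foreground_fraction
```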
Additionally, in Section 4.4, we present a comparative analysis between our model and the work of Srikumar et al. [20].
4.3. Experimental Setup
The experiments were conducted in a computing environment equipped with four A6000 GPUs, with a total training duration of 40 h. The implementation was based on the PyTorch 1.12.0 framework. To enhance the model's training effectiveness, we adopted a staged training strategy consisting of two sequential steps. In the first stage, we trained only the ControlNet branch while keeping the parameters of the U-Net frozen, so that ControlNet sufficiently learned the image control information. During this phase, we employed the AdamW optimizer to stabilize the optimization process. In the second stage, we trained the ControlNet and OP branches simultaneously. The ControlNet branch was initialized with the weights obtained from the first stage to retain the learned control information. At this stage, the U-Net parameters remained frozen, and we continued using the AdamW optimizer with the same learning rate. Additionally, we introduced our proposed time-step-based T-control training strategy to further refine the model's performance. The empirical results demonstrated that this two-stage training approach enabled the model to effectively learn the overall structural composition of the image while preserving rich fine-grained details.
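The following sketch illustrates the staged parameter freezing and optimizer setup described above; the learning rate shown is a placeholder, and the module names (unet, controlnet, op_branch) are assumptions rather than the exact training script.

```python
from torch.optim import AdamW

def configure_stage(unet, controlnet, op_branch, stage: int):
    """Stage 1: train ControlNet only; Stage 2: train ControlNet + OP branch.
    The U-Net backbone stays frozen in both stages."""
    for p in unet.parameters():
        p.requires_grad_(False)

    trainable = list(controlnet.parameters())
    if stage == 2:
        # ControlNet weights are assumed to be loaded from the stage-1 checkpoint.
        trainable += list(op_branch.parameters())
    else:
        for p in op_branch.parameters():
            p.requires_grad_(False)

    return AdamW(trainable, lr=1e-5)  # placeholder learning rate
```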
4.4. Quantitative Results
We conducted a quantitative comparison between our proposed OP-Gen and other algorithms using the FID, SSIM, and CLIP score metrics.
In this experiment, we selected five recently proposed diffusion-based remote sensing image generation methods for comparison, all of which were published in 2024 or 2025 and have demonstrated strong performance in remote sensing image generation tasks. We chose GeoSynth as one of the baselines primarily because it is similar to our approach in that both utilize OSM images as conditions to guide image generation; it was presented at the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) in 2024. In addition to GeoSynth, we compared our method against DiffusionSat, CRS-Diff, RSDiff, and RSVQ-Diffusion. Specifically, DiffusionSat was introduced at the International Conference on Learning Representations (ICLR) in 2024, and CRS-Diff was published in the IEEE Transactions on Geoscience and Remote Sensing (TGRS) in 2024. Since these approaches represent the latest advancements in the field and have demonstrated strong generation capabilities, we included them as baselines to comprehensively evaluate the generation quality and advantages of our proposed method.
The detailed comparison results are presented in Table 1. Since some of the compared works did not have open-source code available, we were unable to retrain them on the same dataset; thus, their metrics were taken from the respective papers. As shown, compared to the other methods, our approach reduces the FID by more than five points, improves the SSIM by more than 25%, and increases the CLIP score by more than 6%. These results demonstrate that our OP-Gen algorithm, when using text and image guidance, achieves the best performance across all evaluated metrics.
While our quantitative comparisons with other algorithms were conducted using only images and text as guidance for image generation, some algorithms incorporate additional prior information beyond these two conditions, such as the image capture location and time, which are significantly more difficult to obtain and limit their practical applicability. For the purpose of exploration, we also compared our method against these algorithms, with the results presented in Appendix A. As shown, even when these algorithms leverage additional and less accessible prior information, our method continues to demonstrate superior performance.
4.5. Qualitative Results
Our qualitative results demonstrate that OP-Gen exhibits strong capabilities in generating high-quality remote sensing images. It effectively responds to spatial information in OSM images to guide the image generation process, producing images with rich detail. The generated images are shown in Figure 5. The overall architectural and road frameworks of the generated images align well with the OSM images, reflecting the controllability of the OP-Gen algorithm in the image generation process. Furthermore, the generated images contain abundant details, such as trees along the roads and building shadows. The inclusion of these fine-grained details enhances the realism of the images generated by OP-Gen.
In Figure 6, we compare the images generated by OP-Gen with real satellite remote sensing images. The comparison results indicate that the images generated by OP-Gen exhibit a high degree of similarity to real satellite images in both the overall structural composition and detailed information. Whether in terms of the spatial distribution and shape of ground objects or the texture and color details, OP-Gen effectively reflects the characteristics of real remote sensing imagery, demonstrating a high level of realism.
We also conducted a qualitative comparison of the image generation performance with that of other algorithms. The comparison results are shown in Figure 7. Compared to the baseline models SD2.1 and GeoSynth, the images generated by our method exhibit significantly superior results, with clear structures, such as buildings and streets, and rich detail (e.g., shadow directions).
6. Conclusions
Existing satellite remote sensing image generation algorithms often fall short in terms of detail, leading to a noticeable gap between the generated and real images. To address this issue, we propose OP-Gen, a remote sensing image generation algorithm guided by OSM images and text. This algorithm dynamically injects fused image and text details based on the time step, enabling more detailed information to be incorporated into the generated satellite remote sensing images. In this paper, we provide qualitative and quantitative comparisons between our work and other state-of-the-art algorithms. Qualitatively, the images generated by our model exhibit superior detail compared to those generated by other models. Quantitatively, our model achieves an FID of 45.01, an SSIM score of 0.1904, and a CLIP score of 0.3071, all of which represent the best results among existing algorithms of the same type.
In our future research, we plan to explore how to reduce the dependency on conditional images during the image generation process, with the goal of generating high-quality remote sensing images from simple hand-drawn sketches.