OP-Gen: A High-Quality Remote Sensing Image Generation Algorithm Guided by OSM Images and Textual Prompts

Xiong, Huolin; Li, Zekun; Lv, Qunbo; Zhu, Baoyu; Zhang, Yu; Yu, Chaoyang; Tan, Zheng

doi:10.3390/rs17071226

Open AccessArticle

OP-Gen: A High-Quality Remote Sensing Image Generation Algorithm Guided by OSM Images and Textual Prompts

by

Huolin Xiong

^1,2,3,†,

Zekun Li

⁴,

Qunbo Lv

^1,2,3,†,

Baoyu Zhu

^1,2,3,

Yu Zhang

^1,2,3,

Chaoyang Yu

^1,2,3 and

Zheng Tan

^1,2,3,*

¹

Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China

²

School of Optoelectronics, University of Chinese Academy of Sciences, No. 19(A) Yuquan Road, Shijingshan District, Beijing 100049, China

³

Department of Key Laboratory of Computational Optical Imagine Technology, Chinese Academy of Sciences, No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China

⁴

National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Remote Sens. 2025, 17(7), 1226; https://doi.org/10.3390/rs17071226

Submission received: 24 January 2025 / Revised: 14 March 2025 / Accepted: 27 March 2025 / Published: 30 March 2025

Download

Browse Figures

Versions Notes

Abstract

The application of diffusion models in the field of remote sensing image generation has significantly improved the performance of generation algorithms. However, existing methods still exhibit certain limitations, such as the inability to generate images with rich texture details and minimal geometric distortions in a controllable manner. To address these shortcomings, this work introduces an innovative remote sensing image generation algorithm, OP-Gen, which is guided by textual descriptions and OpenStreetMap (OSM) images. OP-Gen incorporates two information extraction branches: ControlNet and OSM-prompt (OP). The ControlNet branch extracts structural and spatial information from OSM images and injects this information into the diffusion model, providing guidance for the overall structural framework of the generated images. In the OP branch, we design an OP-Controller module, which extracts detailed semantic information from textual prompts based on the structural information of the OSM image. This information is subsequently injected into the diffusion model, enriching the generated images with fine-grained details, aligning the generated details with the structural framework, and thus significantly enhancing the realism of the output. The proposed OP-Gen algorithm achieves state-of-the-art performance in both qualitative and quantitative evaluations. The qualitative results demonstrate that OP-Gen outperforms existing methods in terms of structural coherence and texture detail richness. Quantitatively, the algorithm achieves a Fréchet inception distance (FID) of 45.01, a structural similarity index measure (SSIM) of 0.1904, and a Contrastive Language-Image Pretraining (CLIP) score of 0.3071, all of which represent the best performance among the current algorithms of the same type.

Keywords:

remote sensing image generation; diffusion model; deep learning; ControlNet

1. Introduction

In recent years, generative models have significantly advanced natural image generation technologies, encompassing various domains, such as text-to-image generation, video synthesis, image translation, and editing [1,2,3,4]. These advancements have found widespread application in fields such as artistic design, data augmentation, and multimodal intelligent systems. Despite the notable improvements in image quality and diversity, the generated scenarios remain predominantly focused on everyday human activities, with relatively limited research on satellite remote sensing image generation. However, satellite remote sensing images hold substantial value for applications in the field of remote sensing.

For instance, generating realistic satellite remote sensing images based on a scene description or a provided sketch is highly valuable for tasks such as urban planning, data augmentation, and pseudo-label generation in weakly supervised learning. This capability is particularly critical for the construction of remote sensing datasets. Research on intelligent processing algorithms for remote sensing data has demonstrated that abundant and diverse data samples enhance a model’s ability to learn and understand remote sensing image features, improving its generalization performance and accuracy [5,6,7,8]. However, because of limitations in the acquisition conditions and meteorological factors, existing satellite remote sensing datasets often suffer from insufficient data volumes and a lack of diverse features. Consequently, leveraging generative models to synthesize artificial satellite remote sensing images in order to supplement and expand these datasets has emerged as a significant research direction.

Compared to other natural images, remote sensing images contain a diverse range of objects with varying shapes, textures, and colors. The spatial arrangement of these objects must adhere to reasonable relative positions and topological relationships. Artificially generated remote sensing image data must simultaneously ensure the correctness of multiple spatial elements, including shape features, texture features, topological features, and color characteristics. Consequently, the task of remote sensing image generation presents significantly greater challenges.

Currently, research on remote sensing image generation primarily focuses on methods based on generative adversarial networks (GANs) [9,10,11,12] and diffusion models [2,13,14,15]. For instance, Reed et al. developed StackGAN [16], which utilizes stacked generators to produce high-resolution remote sensing images with dimensions of 256 × 256. However, GAN-based methods often suffer from instability during training. In contrast, diffusion models demonstrate superior generative capabilities and offer a relatively stable training process.

The emergence of diffusion models has greatly enhanced the effectiveness of text-to-image generation [2,13,14,15]. These models can generate high-quality images guided by textual prompts and, after extensive training on large-scale datasets, can produce diverse images, ranging from photorealistic landscapes to fantastical scenes.

To enhance spatial control, conditional pixel-level diffusion models (DMs), such as ControlNet [17], T2I-Adapter [18], and ControlNet++ [19], introduce mechanisms that allow users to input guiding sketches for image generation. These sketches enable these models to produce highly controllable images. However, their performance in remote sensing image generation remains unsatisfactory, as illustrated in Figure 1. They often fail to produce images that encompass rich remote sensing information, such as the quantity of objects, object shapes, and spatial relationships between objects. GeoSynth [20] employed OpenStreetMap (OSM) images as conditional guidance for image generation, achieving notable results. Compared to Canny edge maps or segmentation maps, OSM images contain spatial information, such as roads and buildings, enabling more effective guidance. Nevertheless, OSM data only include major structural features, like buildings and roads, while omitting finer details, such as trees and building shadows. This limitation leads to the generated images lacking in precision and realism in terms of fine details.

To address the aforementioned challenges, this study proposes OP-Gen, which is designed to enhance the detail fidelity and realism of pre-trained DMs in remote sensing image generation. The model consists of two branches: ControlNet and OSM-prompt (OP). The ControlNet branch extracts structural information from OSM data; however, because of its lack of detailed content, relying solely on this branch may result in missing fine-grained details. While textual descriptions can provide additional guidance, they primarily convey high-level semantics and struggle to supplement fine details effectively.

To overcome this limitation, we introduce the OP branch, which leverages a Contrastive Language-Image Pretraining (CLIP) [21] image encoder to extract features from OSM images and a CLIP text encoder to obtain textual features. These features are subsequently fused using an OP-Controller, enabling the integration of textual information with the image structure to enhance the generation quality. The information extracted from the ControlNet and OP branches is injected into the diffusion model, guiding image generation. This ensures that the model preserves the global layout (e.g., the number and arrangement of buildings) while enriching fine details, producing images that closely resemble real-world remote sensing imagery. Furthermore, to optimize the injection of fine details from the OP branch, we propose the T-control mechanism. In standard diffusion models, early denoising steps focus on learning high-level features, while later steps refine fine details [22,23]. If the OP branch information is injected at all time steps, the model is prone to overfitting, leading to artifacts or distortions in the generated images. To mitigate this, T-control injects CLIP-encoded high-level textual information during the early denoising steps, while, in the later steps, it injects fused fine-grained details from both OSM images and text. This strategy prevents overfitting so that the generated remote sensing images maintain structural coherence while preserving high-fidelity details.

The main contributions of this work are as follows.

A novel dual-branch remote sensing image generation algorithm. This study proposes an efficient two-branch framework, where the ControlNet branch extracts the structural framework of the image, while the OP branch, designed for detail extraction based on the image structure, addresses the issue of detail loss in existing remote sensing image generation methods. By incorporating this OP branch, the model generates images with significantly enhanced detail fidelity and richness.
A time-step-based training and inference strategy. This work introduces a temporal control strategy to mitigate the overfitting of detail guidance information in the training process of diffusion models, which often leads to image artifacts and distortions. By employing this strategy, we significantly improve the realism and structural coherence of the generated remote sensing images.

Figure 1. Remote sensing images generated by stable-diffusion-3.5-large [24]. They exhibit rich colors and high photorealism; however, compared to real remote sensing images, their style significantly deviates, and the generated images primarily capture localized regions, resulting in a lower level of information richness than authentic remote sensing data.

2. Related Work

2.1. Diffusion Models

DMs are generative models that iteratively denoise images to synthesize high-quality outputs, achieving state-of-the-art performance in image generation. Ho et al. [22] introduced the Denoising Diffusion Probabilistic Model (DDPM), which employs a variance-preserving diffusion process to transform noise into realistic images. However, the DDPM suffers from high computational costs due to its extensive sampling steps.

To improve the efficiency, researchers have proposed methods such as denoising diffusion implicit models (DDIMs) [23] and the Pseudo Linear Multistep Sampler (PLMS) [25]. These approaches [23,25,26,27] accelerate sampling by reducing the number of steps while maintaining the generation quality and significantly enhancing the inference speed.

Recent advancements further optimize the computational efficiency during training and sampling by projecting data into a lower-dimensional latent space. Representative models, including LSGM [28], the latent diffusion model (LDM) [1], and DALL-E 2 [2], leverage latent-space diffusion to balance efficiency and quality. Notably, the LDM employs a variational autoencoder (VAE) [29] to encode images into compact latent codes, substantially reducing the computational costs while enabling high-resolution image synthesis.

2.2. Conditional Generation Based on Diffusion Models

Beyond generating diverse and high-fidelity images, DMs excel in conditional generation. Among the various conditioning modalities, text is the most prevalent [1,24,30,31,32]. For example, Imagen [13] integrates pre-trained language models like T5 [33] with a cascaded diffusion framework to generate photorealistic images. Similarly, the LDM [1], also known as Stable Diffusion (SD), utilizes cross-attention in a latent space to introduce conditional control, enhancing the precision while reducing the computational costs.

To extend diffusion models to complex scenarios, recent research has explored multimodal and multi-condition generation. For instance, UniControl [34] unifies multiple conditioning inputs (e.g., text, sketches, depth maps) within a single model, improving the control and efficiency. RPG [35] enhances text-to-image (T2I) synthesis by leveraging the multimodal reasoning capabilities of large language models (LLMs), while Ranni [36] introduces semantic panels to refine complex scene generation.

For fine-grained control, models have been optimized to incorporate structured conditioning, such as normal maps and depth maps. T2I-Adapter [18] and SCEdit [37] introduce fine-grained attribute conditioning to improve local details. ControlNet [17] extends diffusion models with learnable conditional branches, enabling precise control over sketches, depth, and edge maps. Additionally, KNOBGEN [38] employs a dual-path framework to adapt to varying sketch granularity, offering greater flexibility in sketch-based image synthesis and fine-grained feature injection.

2.3. Remote Sensing Image Generation

In recent years, remote sensing image generation has garnered increasing attention. However, the number of studies in this domain remains relatively limited compared to those on natural image generation. Most existing works focus on fine-tuning LDMs on remote sensing datasets [39,40,41,42] or designing specialized models for specific tasks.

For instance, HSIGene [43] demonstrates strong performance in generating hyperspectral images, while Changen2 [44] enables the generation of temporally dynamic remote sensing images. MetaEarth [45] achieves high-resolution remote sensing image generation, and SatCLIP [46] integrates geographical location information into deep learning models. Furthermore, GeoSynth [20], which combines SatCLIP and ControlNet, utilizes OSM images, textual descriptions, and geographical location data as conditions to generate impressive remote sensing images. However, a notable limitation of GeoSynth is the lack of detail fidelity in the generated images.

In contrast, this work also employs OSM images and text as conditions, but without relying on geographical location data. Instead, it extracts fine-grained details from textual descriptions based on the spatial information in OSM images and incorporates these details into the diffusion model. This approach enables the generation of high-quality remote sensing images while supporting controllable synthesis through the integration of detailed information, significantly enhancing the realism and usability of the generated satellite images.

3. Method

3.1. Preliminaries

The SD model [22] is a DM that operates in a latent space, where the generation process is defined by progressively adding noise to the input data

z_{0}

. The training objective of the model is to create a denoising network

ϵ_{θ}

that is capable of predicting and removing noise at each time step t. The SD model consists of the following three main components: VAE, CLIP, and U-Net. The VAE encoder maps images into the latent space, while the VAE decoder reconstructs the latent representations back into real images. The CLIP text encoder is responsible for projecting tokenized text sequences into the embedding space. The U-Net model predicts the Gaussian noise added at each time step. The loss function during the training process is formulated as follows:

L_{L D M} = E_{ϵ (x), t, ϵ \sim N (0, 1)} [| | ϵ - ϵ_{θ} (z_{t}, t) {| |}_{2}^{2}]

(1)

where t is uniformly sampled from time steps

{1, \dots, T}

and

z_{t}

is the t-step latent variable of the input.

ϵ_{θ}

is the noise prediction model.

Conditional guidance in SD involves integrating various conditions, such as text, reference images, masks, and other control signals, into the SD model using different techniques. ControlNet is a common method for the introduction of external control conditions. Its training objective is to predict noise at different stages using a learnable branch network, denoted as

ϵ_{θ}

. Given the latent variable

z_{0}

, the model progressively adds noise to reach

z_{t}

, where t represents the number of noise addition steps. Here,

c_{t}

indicates the textual control condition, and

c_{f}

stands for a specific control condition (e.g., Canny edge images, depth maps). The loss function

L_{c o n d i t i o n}

during the training process is expressed as follows:

L_{c o n d i t i o n} = E_{z_{0}, t, c_{t}, c_{f}, ϵ \sim N (0, 1)} [| | ϵ - ϵ_{θ} (z_{t}, t, c_{t}, c_{f}) {| |}_{2}^{2}]

(2)

3.2. Structure of the OP-Gen Algorithm

In this work, we propose an efficient remote sensing image generation framework, OP-Gen, built upon the SD2.1 base model. The framework consists of two branches: the ControlNet branch and the OP branch. The ControlNet branch extracts crucial boundary and spatial information related to buildings and roads from OSM images, which is essential in generating accurate remote sensing images. The OP branch, on the other hand, is designed to extract detailed semantic information from the text based on the structure of the OSM image. During the noise prediction process in the U-Net of the SD model, the information extracted from both branches is injected, guiding the generation of high-quality remote sensing images. The overall structure of OP-Gen is shown in Figure 2.

3.2.1. ControlNet Branch

The ControlNet branch adopts the classic ControlNet [17] architecture, which primarily involves replicating and training the encoder and middle parts of the U-Net in the SD model. The process of this branch begins by encoding the OSM image into a latent space using convolutional layers. Subsequently, the encoded image features are processed within the latent space using the replicated training structure. After feature processing, a zero convolution module is applied, and the resulting features are integrated into the decoder module of the SD U-Net through skip connections. The role and effect of the ControlNet branch in the image generation process of our algorithm are illustrated in the upper part of Figure 2.

3.2.2. OP Branch

The OP branch consists of three components: the image encoder, the text encoder, and the OP-Controller module. Both the image and text encoders utilize the CLIP modules to extract features from text and images, respectively, aligning them into a shared feature space. The primary function of the OP-Controller module is to merge the text features and image features while extracting relevant detailed textual information based on the spatial structure of the OSM image. The OP branch is capable of extracting detailed semantic information by leveraging the spatial structural features of the image and injecting this information into the SD model through cross-attention. The detailed process is shown in the lower part of Figure 2.

The structure of the OP-Controller module is shown in Figure 3, where

x_{p}

represents the features of the prompt encoded by the CLIP text encoder and

x_{i}

represents the features of the image encoded by the CLIP image encoder. First, the module performs the initial extraction of text and image features separately using convolutional layers. Then, multiple Transformer modules are employed to deeply fuse the two types of features. After each Transformer fusion, a skip connection mechanism is applied to prevent information loss and enhance the feature representation capability. Finally, a convolutional layer is used to further process the fused features and produce the output. Through this series of feature extraction and fusion operations, the OP-Controller module effectively extracts detailed semantic features that incorporate the spatial structure of the image.

3.3. T-Control Module

In the OP branch, injecting the information extracted by the OP-Controller module at every time step may lead to the overfitting of detailed information, causing the generated images to suffer from hallucinations and distortions. This is confirmed experimentally in Section 5.1.2

The T-control module determines, during training and inference, whether the features injected into the SD model at each time step are the text features extracted by CLIP or the fused features extracted by the OP-Controller module. The specific operation of the T-control module is illustrated in Figure 4. In the early steps, the high-level text semantic features extracted by CLIP are injected, while, in the later steps, the fused features extracted by the OP-Controller module are injected. This approach helps to enrich the image details while preventing overfitting.

4. Results

To validate the effectiveness of the proposed OP-Gen model, we compared it with the state-of-the-art remote sensing image generation models. For a quantitative evaluation of the performance of different models, we used standard metrics such as the Fréchet inception distance (FID) [47], structural similarity index measure (SSIM) [48], and CLIP score [49], which are widely adopted in assessing the quality of image generation models. In the quantitative evaluation, we selected GeoSynth [20] as the baseline model. GeoSynth not only utilizes OSM and text for guidance but also incorporates location information. For a fair comparison, we evaluated GeoSynth using only OSM and text guidance, consistent with the setup of our model. Additionally, we compared our model with other state-of-the-art algorithms, all of which used text and images for guidance. The test results for these algorithms were obtained from the respective authors’ official websites. Further details can be found in Section 4.4.

Apart from the quantitative evaluation, we performed a qualitative analysis of the generated images. In Section 4.5, we compare the images generated by our model with those produced by other models, focusing on aspects such as the framework structure, spatial information, and detail accuracy.

4.1. Evaluation Metrics

The FID is a metric used to assess the differences between the distributions of generated images and real data. It extracts image features using a pre-trained Inception network (typically Inception-v3). For each image, the Inception network generates a high-dimensional feature vector representing certain high-level semantic information of the image. The mean and covariance of the features of the real images and generated images are then computed, and the FID between these two sets of features is calculated to obtain the FID score. A smaller FID indicates the better quality of the generated images, as it implies a closer distribution to that of the real images. The calculation formula is as follows:

FID = ∥ μ_{r} - μ_{g} ∥^{2} + Tr (Σ_{r} + Σ_{g} - 2 {(Σ_{r} Σ_{g})}^{1 / 2})

(3)

where

μ_{r}

and

μ_{g}

represent the feature means of the real and generated images, respectively;

Σ_{r}

and

Σ_{g}

are the feature covariance matrices of the real and generated images, respectively; and

T r

denotes the trace of the matrix.

The SSIM is an indicator used to measure the similarity between two images in terms of structure, brightness, contrast, and other aspects. Its value ranges from 0 to 1, where 1 indicates that the two images are identical and 0 indicates that they are completely different. The SSIM calculation formula consists of three main components: brightness, contrast, and structure. With the original image denoted as x and the processed image denoted as y, the SSIM calculation formula can be expressed as follows:

SSIM (x, y) = {[l (x, y)]}^{α} \cdot {[c (x, y)]}^{β} \cdot {[s (x, y)]}^{γ}

(4)

where

l (x, y)

represents the luminance comparison measure,

c (x, y)

represents the contrast comparison measure, and

s (x, y)

represents the structural comparison measure.

The CLIP score measures the semantic relevance between the text and image. It is based on the CLIP model, which calculates the cosine similarity between the text feature vector and the image feature vector. The CLIP model then provides a semantic relevance score between them. For a given text description T and an image I, the CLIP score can be expressed as follows:

CLIP-Score (T, I) = cosine_similarity (CLIP_text (T), CLIP_image (I))

(5)

The score typically ranges from

- 1

to 1, where a value closer to 1 indicates higher semantic relevance between the text and the image.

4.2. Dataset

In this study, we utilized the dataset curated by Srikumar et al. [20], which consists of paired high-resolution satellite images and OSM images. To annotate each satellite image, we employed the LLaVA [50] multimodal LLM. During dataset selection, we filtered out image pairs that consisted solely of bare land, water bodies, or forests. This filtering process was necessary because our method relies on OSM images as conditions to guide image generation, leveraging structured information, such as roads and buildings, to assist the model in generating more realistic remote sensing images. However, OSM images corresponding to areas dominated by bare land, water, or forests lack meaningful structured information, providing little valuable guidance for the model. As a result, retaining such samples would not only fail to improve the generation quality but also increase the computational overhead and training time. Therefore, we excluded these image pairs from our dataset. The final dataset contained 44,848 image pairs, with each image having a resolution of 512 × 512 pixels. Additionally, in Section 4.4, we present a comparative analysis between our model and the work of Srikumar et al. [20].

4.3. Experimental Setup

The experiments were conducted in a computing environment equipped with four A6000 GPUs, with a total training duration of 40 h. The implementation was based on the PyTorch 1.12.0 framework. To enhance the model’s training effectiveness, we adopted a staged training strategy, which consisted of two sequential steps. In the first stage, we trained only the ControlNet branch while keeping the parameters of the U-Net frozen so that ControlNet sufficiently learned the image control information. During this phase, we employed the AdamW optimizer with a learning rate of 1

\times 10^{- 5}

to stabilize the optimization process. In the second stage, we trained the ControlNet and OP branches simultaneously. The ControlNet branch was initialized with the weights obtained from the first stage to retain the learned control information. At this stage, the U-Net parameters remained frozen, and we continued using the AdamW optimizer with a learning rate of 1

\times 10^{- 5}

. Additionally, we introduced our proposed time-step-based T-control training strategy to further refine the model’s performance. The empirical results demonstrated that this two-stage training approach enabled the model to effectively learn the overall structural composition of the image while preserving rich fine-grained details.

4.4. Quantitative Results

We conducted a quantitative comparison of our proposed OP-Gen and other algorithms, using metrics such as the FID, SSIM, and CLIP score.

In this experiment, we selected five recently proposed diffusion-based remote sensing image generation methods for comparison. All these methods were published in 2024 or 2025 and have demonstrated strong performance in remote sensing image generation tasks. We chose GeoSynth as one of the baselines, primarily because it shares similarities with our approach, as both utilize OSM images as conditions to guide image generation. Furthermore, GeoSynth was presented at the Conference on Computer Vision and Pattern Recognition (CVPRW) in 2024. In addition to GeoSynth, we compared our method against DiffusionSat, CRS-Diff, RSDiff, and RSVQ-Diffusion, all of which are diffusion-based remote sensing image generation methods published in 2024 or 2025. Specifically, DiffusionSat was introduced at the International Conference on Learning Representations (ICLR) in 2024, while CRS-Diff was published in Transactions on Geoscience and Remote Sensing (TGRS) in 2024, and these methods have exhibited outstanding performance in remote sensing image generation tasks. Since these approaches represent the latest advancements in the field and have demonstrated strong generation capabilities, we included them as baselines to comprehensively evaluate the generation quality and advantages of our proposed method.

The detailed comparison results are presented in Table 1. Since some of the compared works did not have open-source code available, we were unable to retrain them on the same dataset; thus, the metrics were taken from the respective papers. As shown, compared to other methods, our approach reduces the FID by more than five points, improves the SSIM by >25%, and increases the CLIP score by more than 6%. These results demonstrate that our OP-Gen algorithm, when using text and image guidance, achieves the best performance across all evaluated metrics.

While our quantitative comparisons with other algorithms were conducted using only images and text as guidance for image generation, some algorithms incorporate additional prior information beyond these two conditions, such as the image capture location and time, which are significantly more difficult to obtain and limit their practical applicability. For the purpose of exploration, we also compared our method against these algorithms, with the results presented in Appendix A. As shown, even when these algorithms leverage additional and less accessible prior information, our method continues to demonstrate superior performance.

4.5. Qualitative Results

Our qualitative results demonstrate that OP-Gen exhibits strong capabilities in generating high-quality remote sensing images. It effectively responds to spatial information in OSM images to guide the image generation process, producing images with rich detail. The generated images are shown in Figure 5. The overall architectural and road frameworks of the generated images align well with the OSM images, reflecting the controllability of the OP-Gen algorithm in the image generation process. Furthermore, the generated images contain abundant details, such as trees along the roads and building shadows. The inclusion of these fine-grained details enhances the realism of the images generated by OP-Gen.

In Figure 6, we compare the images generated by OP-Gen with real satellite remote sensing images. The comparison results indicate that the images generated by OP-Gen exhibit a high degree of similarity to real satellite images in both the overall structural composition and detailed information. Whether in terms of the spatial distribution and shape of ground objects or the texture and color details, OP-Gen effectively reflects the characteristics of real remote sensing imagery, demonstrating a high level of realism.

We also conducted a qualitative comparison of the image generation performance with that of other algorithms. The comparison results are shown in Figure 7. Compared to the baseline models SD2.1 and GeoSynth, the images generated by our method exhibit significantly superior results, with clear structures, such as buildings and streets, and rich, detailed information (e.g., shadow directions).

5. Discussion

Key innovations in this work include the introduction of the OP branch in addition to the ControlNet branch and the design of the T-control training and inference strategy for more efficient information injection. Section 5.1 includes the results of the ablation experiments conducted to validate the effectiveness of these innovative methods.

In Section 5.2, we further investigate the optimal application point of T-control to maximize its regulatory effectiveness.

5.1. Ablation Experiment

5.1.1. Ablation Experiment on the OP-Controller Module

A key contribution of this work is the introduction of the OP-Controller within the OP branch, which can extract detailed information from both images and text and fuse this information for injection into the SD model for guidance. To validate the effectiveness of the OP-Controller, we conducted an ablation study by removing the OP-Controller module and injecting only CLIP-encoded text information. The model was trained with the same dataset, and the image generation results are shown in Figure 8. After removing the OP-Controller module, the generated images contained very few details, capturing only the general framework. Compared to real images, the generated results were visually much less accurate.

To more clearly illustrate the differences in the richness and detail in the generated images after the ablation of the OP-Controller module, we applied Laplacian feature extraction to the images. This operation highlights texture, shape, and other structural information. As shown in Figure 9, the removal of the OP-Controller module resulted in a significant reduction in texture information in the generated images.

Finally, we conducted a quantitative analysis. The results are shown in Table 2. After removing the OP-Controller module, the model’s FID, SSIM, and CLIP score were all worse.

The quantitative and qualitative analysis results show that the OP-Controller module that we propose effectively enhances the model’s image generation capabilities.

5.1.2. Ablation Experiment on T-Control

Another contribution of this work is the introduction of T-control, a method for step-based information injection during training and inference. To validate the effectiveness of T-control in the model training process, we removed T-control during training and retrained the OP-Gen model using the same dataset. After training, we compared the model with and without T-control, and the image generation results are shown in Figure 10. After removing T-control, the model generates images with more detailed information, but overfitting occurs regarding the detailed semantic information based on the OSM image structure framework extracted by the OP-branch’s OP-Controller. This overfitting results in distortions and artifacts, particularly at the building edges.

To provide a more intuitive comparison, we extracted the Laplacian feature maps of the images generated before and after the ablation of the T-control module to observe changes in texture and shape. Additionally, we converted the images to the HSV color space to analyze changes in the topological relationships of the colors. The results are presented in Figure 11. After the removal of the T-control module, the texture information in the generated images becomes more abundant but appears disorganized and inconsistent, with reasonable topological relationships. Furthermore, when converted to the HSV color space, the color topological relationships in the images also become less coherent and less realistic.

In addition to the qualitative analysis, we performed a quantitative analysis of the changes in various metrics after removing T-control. As shown in Table 3, we observed worse FID, SSIM, and CLIP score values after the removal of T-control.

The quantitative and qualitative analysis results show that T-control plays a crucial role in enhancing the performance of our model.

5.2. Optimal Injection Point for T-Control

To explore the optimal time steps for the injection of the fused information extracted by the OP-Controller and the text features encoded by CLIP for high-quality, low-hallucination remote sensing image generation, we conducted a series of comparative experiments. First, we investigated at which time step during training the fused information extracted by the OP-Controller should be injected. As shown in Figure 12, the best image generation results are achieved when the fused information is injected at time step

t \leq 200

.

We also performed a quantitative analysis, and the results, which are presented in Table 4, were consistent with those of our qualitative analysis.

Similarly, we explored the optimal time step for the injection of the fused information extracted by the OP-Controller during the inference process. We tested the influence of injecting the fused information at different time steps during inference, using the model that performed the best according to Table 4. The generated images are shown in Figure 13. Again, injecting the fused information at time step

t \leq 200

yielded the best image generation results.

We also conducted a quantitative analysis; the results are provided in Table 5 and align with the conclusions drawn from the qualitative analysis.

Finally, we analyzed the variation in the image generation quality across different time steps at which the injected influence occurred, as shown in Figure 14. The results indicate that injecting the detailed fused information extracted by the OP-Controller into the diffusion model during the later denoising steps significantly enhances the image generation quality. This observation aligns with the findings of Song et al. [23], which suggest that early denoising steps in diffusion models focus on generating high-level features, while later steps refine intricate details. Conversely, applying the influence during earlier time steps leads to a decline in image quality, which is consistent with our previous analysis. Specifically, applying the influence across too many time steps results in overfitting to the detailed information, ultimately degrading the overall quality of the generated images.

6. Conclusions

Existing satellite remote sensing image generation algorithms often fall short in terms of detail, leading to a noticeable gap between the generated and real images. To address this issue, we propose OP-Gen, a remote sensing image generation algorithm guided by OSM images and text. This algorithm dynamically injects fused image and text details based on time steps, enabling more detailed information to be incorporated into the generated remote sensing satellite images. In this paper, we provide qualitative and quantitative comparisons between our work and other state-of-the-art algorithms. Qualitatively, the images generated by our model exhibit superior detail compared to those generated by other models. Quantitatively, our model achieves an FID of 45.01, an SSIM score of 0.1904, and a CLIP score of 0.3071, all of which are the best performance outcomes among existing algorithms of the same type.

In our future research, we plan to explore how to reduce the dependency on conditional images during the image generation process, with the goal of generating high-quality remote sensing images from simple hand-drawn sketches.

Author Contributions

Conceptualization, H.X. and Z.T.; methodology, H.X.; software, H.X.; investigation, H.X., B.Z., Y.Z., C.Y. and Z.L.; writing—original draft preparation, H.X.; writing—review and editing, H.X. and Z.T.; project administration, Q.L.; funding acquisition, Z.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (No. 2022YFB3904800).

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

GeoSynth [20] and DiffusionSat [40] are currently state-of-the-art multi-condition remote sensing image generation algorithms. However, in addition to utilizing text and imagery as conditions, these methods rely on additional, less readily accessible information as input conditions. In contrast, OP-Gen only uses easily obtainable image and text inputs as conditions. Nevertheless, we conducted quantitative and qualitative comparisons with GeoSynth and DiffusionSat. The qualitative comparison of the FID, SSIM, and CLIP score values is presented in Table A1. Our algorithm achieves the best CLIP score and demonstrates competitive results in terms of the FID and SSIM.

Table A1. Quantitative comparison of OP-Gen and other methods in terms of FID, SSIM, and CLIP score when using text and image guidance for image generation.

Method	FID	SSIM	CLIP Score
GeoSynth [20]	11.90	0.2910	0.3030
DiffusionSat [40]	15.80	0.1703	0.1862
OP-Gen (Ours)	45.01	0.1904	0.3071

We further compared the visual quality of the images generated by OP-Gen, GeoSynth, and DiffusionSat, where OP-Gen relied solely on image and text as conditions. The results are shown in Figure A1.

Figure A1. Comparison of images generated by OP-Gen, GeoSynth, and DiffusionSat.

Although our algorithm utilizes fewer prior conditions, the images generated by our method are superior to those those produced by GeoSynth and DiffusionSat in terms of overall structural consistency and fine-grained details.

Appendix A.2

To provide a more comprehensive evaluation of our model’s performance, we assessed its complexity and compared the parameter count of our model with that of the base model, Stable Diffusion 2.1, augmented with ControlNet, as shown in Table A2. Our model achieves a significant improvement in remote sensing image generation capabilities while only increasing the parameter count by 24.8 percent.

Table A2. Comparison of the parameter count between ControlNet-augmented SD2.1 and OP-Gen.

Method	U-Net	VAE	ControlNet	CLIP-t	CLIP-i	OP-Controller	Total
SD2.1 (with ControlNet)	865 M	83 M	144 M	354 M	*	*	1.44 B
OP-Gen (Ours)	865 M	83 M	144 M	354 M	278 M	79 M	1.797 B (0.248↑)

* It indicates that the algorithm does not have this module.

References

Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
Li, Y.; Liu, H.; Wu, Q.; Mu, F.; Yang, J.; Gao, J.; Li, C.; Lee, Y.J. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22511–22521. [Google Scholar]
Bourdis, N.; Marraud, D.; Sahbi, H. Constrained optical flow for aerial image change detection. In Proceedings of the 2011 IEEE International Geoscience and Remote Sensing Symposium, Vancouver, BC, Canada, 24–29 July 2011; pp. 4176–4179. [Google Scholar]
Kolos, M.; Marin, A.; Artemov, A.; Burnaev, E. Procedural synthesis of remote sensing images for robust change detection with neural networks. In Proceedings of the Advances in Neural Networks–ISNN 2019: 16th International Symposium on Neural Networks, ISNN 2019, Moscow, Russia, 10–12 July 2019; Proceedings, Part II 16. Springer: Berlin/Heidelberg, Germany, 2019; pp. 371–387. [Google Scholar]
Song, J.; Chen, H.; Yokoya, N. SyntheWorld: A Large-Scale Synthetic Dataset for Land Cover Mapping and Building Change Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 8287–8296. [Google Scholar]
Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928. [Google Scholar]
Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1316–1324. [Google Scholar]
Ruan, S.; Zhang, Y.; Zhang, K.; Fan, Y.; Tang, F.; Liu, Q.; Chen, E. DAE-GAN: Dynamic aspect-aware gan for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13960–13969. [Google Scholar]
Tao, M.; Tang, H.; Wu, F.; Jing, X.Y.; Bao, B.K.; Xu, C. DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16515–16525. [Google Scholar]
Zhou, Y.; Zhang, R.; Chen, C.; Li, C.; Tensmeyer, C.; Yu, T.; Gu, J.; Xu, J.; Sun, T. Towards Language-Free Training for Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17907–17917. [Google Scholar]
Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
Navard, P.; Yilmaz, A. A Probabilistic-based Drift Correction Module for Visual Inertial SLAMs. arXiv 2024, arXiv:2404.10140. [Google Scholar] [CrossRef]
Zhao, R.; Shi, Z. Text-to-remote-sensing-image generation with structured generative adversarial networks. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8010005. [Google Scholar] [CrossRef]
Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 3836–3847. [Google Scholar]
Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4296–4304. [Google Scholar]
Li, M.; Yang, T.; Kuang, H.; Wu, J.; Wang, Z.; Xiao, X.; Chen, C. ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback. arXiv 2024, arXiv:2404.07987. [Google Scholar]
Sastry, S.; Khanal, S.; Dhakal, A.; Jacobs, N. GeoSynth: Contextually-Aware High-Resolution Satellite Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 460–470. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the Forty-first International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
Liu, L.; Ren, Y.; Lin, Z.; Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. arXiv 2022, arXiv:2202.09778. [Google Scholar]
Salimans, T.; Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Adv. Neural Inf. Process. Syst. 2022, 35, 5775–5787. [Google Scholar]
Vahdat, A.; Kreis, K.; Kautz, J. Score-based generative modeling in latent space. Adv. Neural Inf. Process. Syst. 2021, 34, 11287–11302. [Google Scholar]
Kingma, D.P. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
Xue, Z.; Song, G.; Guo, Q.; Liu, B.; Zong, Z.; Liu, Y.; Luo, P. RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
Chen, J.; Yu, J.; Ge, C.; Yao, L.; Xie, E.; Wang, Z.; Kwok, J.; Luo, P.; Lu, H.; Li, Z. PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
Qin, C.; Zhang, S.; Yu, N.; Feng, Y.; Yang, X.; Zhou, Y.; Wang, H.; Niebles, J.C.; Xiong, C.; Savarese, S.; et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv 2023, arXiv:2305.11147. [Google Scholar]
Yang, L.; Yu, Z.; Meng, C.; Xu, M.; Ermon, S.; Bin, C. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
Feng, Y.; Gong, B.; Chen, D.; Shen, Y.; Liu, Y.; Zhou, J. Ranni: Taming text-to-image diffusion for accurate instruction following. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4744–4753. [Google Scholar]
Jiang, Z.; Mao, C.; Pan, Y.; Han, Z.; Zhang, J. Scedit: Efficient and controllable image diffusion generation via skip connection editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 8995–9004. [Google Scholar]
Navard, P.; Monsefi, A.K.; Zhou, M.; Chao, W.L.; Yilmaz, A.; Ramnath, R. KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models. arXiv 2024, arXiv:2410.01595. [Google Scholar]
Ou, R.; Yan, H.; Wu, M.; Zhang, C. A Method of Efficient Synthesizing Post-disaster Remote Sensing Image with Diffusion Model and LLM. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 1549–1555. [Google Scholar]
Khanna, S.; Liu, P.; Zhou, L.; Meng, C.; Rombach, R.; Burke, M.; Lobell, D.B.; Ermon, S. DiffusionSat: A Generative Foundation Model for Satellite Imagery. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Tang, D.; Cao, X.; Hou, X.; Jiang, Z.; Meng, D. CRS-Diff: Controllable Generative Remote Sensing Foundation Model. arXiv 2024, arXiv:2403.11614. [Google Scholar]
Sebaq, A.; ElHelw, M. Rsdiff: Remote sensing image generation from text using diffusion model. Neural Comput. Appl. 2024, 36, 23103–23111. [Google Scholar] [CrossRef]
Pang, L.; Cao, X.; Tang, D.; Xu, S.; Bai, X.; Zhou, F.; Meng, D. Hsigene: A foundation model for hyperspectral image generation. arXiv 2024, arXiv:2409.12470. [Google Scholar]
Zheng, Z.; Ermon, S.; Kim, D.; Zhang, L.; Zhong, Y. Changen2: Multi-temporal remote sensing generative change foundation model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 725–741. [Google Scholar] [CrossRef]
Yu, Z.; Liu, C.; Liu, L.; Shi, Z.; Zou, Z. MetaEarth: A Generative Foundation Model for Global-Scale Remote Sensing Image Generation. arXiv 2024, arXiv:2405.13570. [Google Scholar] [CrossRef]
Klemmer, K.; Rolf, E.; Robinson, C.; Mackey, L.; Rußwurm, M. Satclip: Global, general-purpose location embeddings with satellite imagery. arXiv 2023, arXiv:2311.17179. [Google Scholar]
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R.L.; Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. arXiv 2021, arXiv:2104.08718. [Google Scholar]
Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2024, 36, 34892–34916. [Google Scholar]
Gao, X.; Fu, Y.; Jiang, X.; Wu, F.; Zhang, Y.; Fu, T.; Li, C.; Pei, J. RSVQ-Diffusion Model for Text-to-Remote-Sensing Image Generation. Appl. Sci. 2025, 15, 1121. [Google Scholar] [CrossRef]

Figure 2. The network architecture of OP-Gen. The upper part represents the ControlNet branch, and the lower part represents the OP branch. From left to right, the image features undergo a multi-step denoising process, progressively generating a remote sensing image. Throughout this process, the ControlNet branch continuously injects the structural framework information from the OSM image, while the OP branch, guided by T-control, dynamically injects either textual information or a fusion of textual and image information, depending on the time step. During training, only the parameters of the ControlNet and OP-Controller modules are updated.

Figure 3. Structure of the OP-Controller module.

Figure 4. Schematic diagram of the role of T-control during training or inference.

Figure 5. Images generated by OP-Gen when guided using OSM images and text.

Figure 6. Images generated by OP-Gen when guided using OSM images and text compared with real satellite remote sensing images.

Figure 7. Comparison of images generated using OP-Gen, SD2.1, and GeoSynth.

Figure 8. Comparison of images generated by the OP-Gen model after the ablation of the OP-Controller module.

Figure 9. Comparison of Laplacian feature maps of images generated by OP-Gen before and after the ablation of the OP-Controller module.

Figure 10. Comparison of images generated by the OP-Gen model after the ablation of the T-control module.

Figure 11. Comparison of Laplacian feature maps and HSV space-converted images generated by OP-Gen before and after the ablation of the T-control module.

Figure 12. Comparison of images generated by models trained with influences applied at different time steps during training. From left to right, the time steps at which the SD model is injected with the fused features processed by the OP-Controller progressively increase.

Figure 13. Comparison of the images generated by the model during inference when influence is applied at different time steps. From left to right, the time steps at which the SD model is injected with the fused features processed by the OP-Controller progressively increase.

Figure 14. The trends of the FID, SSIM, and CLIP score as functions of the optimal time steps selected for T-control. The red lines represent the influence applied during the training phase, while the black lines denote the influence applied during the inference phase.

Table 1. Quantitative comparison of OP-Gen with other methods in terms of FID, SSIM, and CLIP score when using text and image guidance for image generation.

Method	FID	SSIM	CLIP Score
GeoSynth (CVPRW 2024) [20]	86.82	0.1517	0.2888
DiffusionSat (ICLR 2024) [40]	75.34	0.1412	0.1631
CRS-Diff (TGRS 2024) [41]	50.72	0.1503	0.2033
RSDiff (NCA 2024) [42]	66.49	*	*
RSVQ-Diffusion (Applied Sciences 2025) [51]	90.36	*	0.2722
OP-Gen (with minimum change compared to baseline)	45.01 (5.71↓)	0.1904 (0.0387↑)	0.3071 (0.0183↑)

* The metric is not mentioned in the paper, and the code is not available. ↓ and ↑ represent “the smaller, the better” and “the larger, the better”, respectively.

Table 2. Comparison of FID, SSIM, and CLIP score for the OP-Gen model after the ablation of the OP-Controller module.

	FID	SSIM	CLIP Score
OP-Gen	45.01	0.1904	0.3071
OP-Gen (without OP-Controller)	85.92	0.1506	0.2792

Table 3. Comparison of FID, SSIM, and CLIP score for the OP-Gen model after the ablation of the T-control module.

	FID	SSIM	CLIP Score
OP-Gen	45.01	0.1904	0.3071
OP-Gen (without T-control)	51.11	0.1803	0.3001

Table 4. Comparison of FID, SSIM, and CLIP score for the OP-Gen model trained with influences applied at different time steps during training.

OP-Gen Time Step	FID	SSIM	CLIP Score
$t \leq 200$	45.01	0.1904	0.3071
$t \leq 400$	46.30	0.1872	0.3072
$t \leq 600$	49.63	0.1857	0.2987
$t \leq 800$	51.01	0.1809	0.2989

Table 5. Comparison of FID, SSIM, and CLIP score for the OP-Gen model during inference when influence is applied at different time steps.

OP-Gen Time Step	FID	SSIM	CLIP Score
$t \leq 200$	45.01	0.1904	0.3071
$t \leq 400$	52.41	0.1882	0.2913
$t \leq 600$	67.68	0.1607	0.2645
$t \leq 800$	73.16	0.1526	0.2218

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xiong, H.; Li, Z.; Lv, Q.; Zhu, B.; Zhang, Y.; Yu, C.; Tan, Z. OP-Gen: A High-Quality Remote Sensing Image Generation Algorithm Guided by OSM Images and Textual Prompts. Remote Sens. 2025, 17, 1226. https://doi.org/10.3390/rs17071226

AMA Style

Xiong H, Li Z, Lv Q, Zhu B, Zhang Y, Yu C, Tan Z. OP-Gen: A High-Quality Remote Sensing Image Generation Algorithm Guided by OSM Images and Textual Prompts. Remote Sensing. 2025; 17(7):1226. https://doi.org/10.3390/rs17071226

Chicago/Turabian Style

Xiong, Huolin, Zekun Li, Qunbo Lv, Baoyu Zhu, Yu Zhang, Chaoyang Yu, and Zheng Tan. 2025. "OP-Gen: A High-Quality Remote Sensing Image Generation Algorithm Guided by OSM Images and Textual Prompts" Remote Sensing 17, no. 7: 1226. https://doi.org/10.3390/rs17071226

APA Style

Xiong, H., Li, Z., Lv, Q., Zhu, B., Zhang, Y., Yu, C., & Tan, Z. (2025). OP-Gen: A High-Quality Remote Sensing Image Generation Algorithm Guided by OSM Images and Textual Prompts. Remote Sensing, 17(7), 1226. https://doi.org/10.3390/rs17071226

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

OP-Gen: A High-Quality Remote Sensing Image Generation Algorithm Guided by OSM Images and Textual Prompts

Abstract

1. Introduction

2. Related Work

2.1. Diffusion Models

2.2. Conditional Generation Based on Diffusion Models

2.3. Remote Sensing Image Generation

3. Method

3.1. Preliminaries

3.2. Structure of the OP-Gen Algorithm

3.2.1. ControlNet Branch

3.2.2. OP Branch

3.3. T-Control Module

4. Results

4.1. Evaluation Metrics

4.2. Dataset

4.3. Experimental Setup

4.4. Quantitative Results

4.5. Qualitative Results

5. Discussion

5.1. Ablation Experiment

5.1.1. Ablation Experiment on the OP-Controller Module

5.1.2. Ablation Experiment on T-Control

5.2. Optimal Injection Point for T-Control

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A

Appendix A.1

Appendix A.2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI