Article

Synthesizing Remote Sensing Images from Land Cover Annotations via Graph Prior Masked Diffusion

Kai Deng, Siyuan Wei, Shiyan Pang, Huiwei Jiang and Bo Su
1 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
2 Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
3 National Geomatics Center of China, Beijing 100830, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(13), 2254; https://doi.org/10.3390/rs17132254
Submission received: 8 May 2025 / Revised: 28 June 2025 / Accepted: 28 June 2025 / Published: 30 June 2025
(This article belongs to the Special Issue Fifth Anniversary of “AI Remote Sensing” Section)

Abstract

Semantic image synthesis (SIS) in remote sensing aims to generate high-fidelity satellite imagery from land use/land cover (LULC) labels, supporting applications such as map updating, data augmentation, and environmental monitoring. However, the existing methods typically focus on pixel-level semantic-to-image translation, neglecting the spatial and semantic relationships among land cover objects, which hinders accurate scene structure modeling. To address this challenge, we propose GMDiT, an enhanced conditional diffusion model that extends the masked DiT architecture with graph-prior modeling. By jointly incorporating relational graph structures and semantic labels, GMDiT explicitly captures the object-level spatial and semantic dependencies, thereby improving the contextual coherence and structural fidelity of the synthesized images. Specifically, to effectively capture inter-object dependencies, we first encode the semantics of each node using CLIP and then employ a simple yet effective graph transformer to model the spatial interactions among nodes. Additionally, we design a scene similarity sampling strategy for the reverse diffusion process, improving contextual alignment while maintaining generative diversity. Experiments on the OpenEarthMap dataset show that GMDiT achieves superior performance in terms of FID and other metrics, demonstrating its effectiveness and robustness in the generation of structured remote sensing images.

1. Introduction

Semantic image synthesis in remote sensing aims to generate realistic satellite imagery from rasterized land use/land cover (LULC) annotations. This technique is crucial for simulating diverse land cover patterns and producing high-quality training and validation data, especially in scenarios with limited annotated imagery or imbalanced class distributions. Current approaches to this task can be broadly categorized into two main groups: methods based on generative adversarial networks (GANs) [1,2,3,4] and diffusion-based methods [5,6].
Methods based on generative adversarial networks (GANs) [1,2,3,4] have significantly advanced remote sensing image synthesis through adversarial learning techniques and innovative architectural designs. Despite these advances, GAN-based models face inherent challenges that limit their practical applicability in remote sensing [7]. Training instability often leads to erratic convergence behavior, requiring careful calibration of the hyperparameters and the discriminator’s balance [3]. Mode collapse is another major concern: the model may produce limited variations and fail to capture the full distribution of diverse land cover types, especially in heterogeneous scenes [8]. Furthermore, GANs often struggle to generate high-fidelity representations of large-scale and structurally complex geographical areas, such as urban regions with irregular road networks or agricultural zones with mixed land use patterns [8].
Diffusion models [9] provide more stable training than GANs, and latent diffusion models [7] enable high-resolution synthesis in compressed latent space. However, even advanced diffusion models often struggle to capture spatial relationships at the object level, leading to suboptimal performance in complex scenes [10]. Masked diffusion transformers [6,11] partially mitigate this issue by focusing attention on important regions to improve contextual modeling and semantic consistency. In parallel, graph-based generative methods, such as GraphGAN [12] and recent graph diffusion models [13,14], explicitly incorporate relational structures into the generative process, effectively modeling complex node- and edge-level dependencies. These techniques are particularly successful in structured data domains, including molecular graph generation and network topology synthesis.
Despite the effectiveness of the aforementioned methods in the synthesis of remote sensing imagery, we found that the existing semantic image synthesis approaches predominantly focus on pixel-level mapping transformations, inadequately modeling the spatial structural relationships during the image synthesis process. Semantic labels not only accurately represent the category of each object but also convey the spatial layout and correlations between adjacent objects. This oversight results in suboptimal performance when generating images that require a detailed understanding of spatial contexts and complex interactions between different objects.
In this paper, we propose a novel graph-conditioned diffusion transformer (GMDiT) method that employs graph structures to represent semantic labels and integrates them into the image synthesis process. This approach more effectively models the global structural information of scenes, thereby significantly enhancing the quality and overall performance of the generated images. In addition, sampling strategies play a crucial role in balancing the quality and diversity of the generated results. Inspired by the classifier-free guidance sampling strategy, we propose a sampling method based on similar-scene condition guidance to further enhance the diversity of the generated images while maintaining their fidelity and semantic consistency. Specifically, we first employ the K-means clustering algorithm to select scenes with similar semantic labels from the semantic library as conditional inputs. Based on this, we optimize the sampling strategy by integrating the diffusion model’s predicted features conditioned on both the target and similar semantic masks. By interpolating between these two condition-guided score estimates, the generated samples achieve higher visual fidelity while better preserving spatial structural integrity and semantic accuracy under the constraints of the semantic mask. As shown in Figure 1, the optimized sampling strategy effectively improves the generation capability of the model in complex scenes, balancing visual quality and semantic relevance and thus improving both the diversity and semantic consistency of the generated results.
To demonstrate the effectiveness of our method, we conducted experiments on the OpenEarthMap dataset. Both the quantitative and qualitative results verify that our framework can produce high-fidelity and diverse results, achieving superior performance compared to previous methods. Overall, the contributions are summarized as follows:
  • We propose a novel graph-prior masked diffusion-based remote sensing image synthesis framework using the guidance of a semantic map. The model encodes the semantic map using a graph transformer, which effectively captures the spatial dependencies and contextual relationships between different land cover classes, thus enhancing the fidelity and consistency of the generated results.
  • We propose a scene-aware guidance strategy to enhance the diversity of the generated images by leveraging the semantic context of similar scenes.
  • Our method achieves state-of-the-art performance on the OpenEarthMap dataset.

2. Related Work

2.1. Diffusion Models

Diffusion models constitute a generative framework that progressively denoises a noisy signal to synthesize new data samples, effectively reversing a predefined noise process to approximate the original data distribution [15]. Recent developments in diffusion probabilistic models (DPMs) [9] and score-based generative models [9] have led to remarkable advances in image generation [16,17,18], often surpassing the performance of previously dominant generative adversarial networks (GANs) [19] on various tasks. These models demonstrate superior fidelity, diversity, and stability, establishing diffusion models as state-of-the-art approaches in the field of generative modeling.
Following these advances, latent diffusion models (LDMs) were introduced to improve the efficiency of diffusion-based models by operating in the latent space, significantly reducing computational costs while enabling high-resolution image generation [7]. Similarly, diffusion transformers (DiTs)  [10] leverage the powerful transformer architecture to scale diffusion models for complex tasks, such as high-quality image synthesis and semantic generation. In addition, the masked diffusion transformer [6] further extends this approach by incorporating masking mechanisms to improve spatial structural understanding as well as generation fidelity and consistency in intricate scenes. In our framework, we use the masked diffusion model because it effectively captures the spatial structural relationships in complex scenes by incorporating masking mechanisms, thereby improving the quality and consistency of the generated images.

2.2. Semantic Image Synthesis

Semantic image synthesis aims to generate realistic images conditioned on semantic annotations. Early works relied primarily on generative adversarial networks (GANs) to control image layouts [3,20]. For example, Pix2pix [20] introduced an encoder–decoder generator paired with a PatchGAN [21] discriminator for semantic-to-image translation. Building on this, Pix2pixHD [4] improved high-resolution image synthesis by incorporating coarse-to-fine generators and multiscale discriminators. Subsequently, SPADE [3] significantly improved image quality by proposing a spatially adaptive normalization module that injects semantic information into multiple layers. In addition, CC-FPSE [2] advanced this line of work by introducing a conditional convolutional kernel prediction based on semantic layouts, combined with a feature pyramid semantic embedding discriminator, leading to more detailed and semantically consistent synthesis.
Unlike GANs, diffusion models progressively simulate the data generation process by denoising random noise [22,23]. Semantic diffusion models (SDM) [18] feed noisy images to a U-Net encoder while incorporating semantic layouts into the decoder through multilayer spatially adaptive normalization operators. FreestyleNet [24] proposes a Rectified Cross-Attention (RCA) module that enables flexible layout-to-image synthesis by leveraging semantic masks. PLACE [25] further refines this process by adapting the integration of the layout and semantic features over diffusion time steps, while GeoSynth [26] targets satellite image synthesis, blending global style with image-driven layout control.
Despite these significant advances, the existing methods lack deep object-level semantic analysis and fail to explicitly model the interactions between different objects within the scene. This limitation results in reduced semantic consistency and fidelity in the generated images, especially when dealing with complex or densely structured environments, motivating the need for more context-aware semantic image synthesis approaches.

3. Method

Consider a land cover semantic map $S \in \mathbb{R}^{H \times W \times C}$ with $C$ classes; our goal is controllable remote sensing image synthesis conditioned on this map. As shown in Figure 2, the semantic map is first converted to a graph structure, where each category is represented as a node. An edge matrix captures the implicit interactions between these nodes, reflecting the spatial and semantic relationships among different land cover categories. The nodes are embedded using a graph transformer encoder, which integrates multimodal features through CLIP [27] and incorporates temporal conditions using time semantic embeddings. To guide image generation, we adopt a masked diffusion transformer (MDiT) framework [11], an architecture that extends traditional diffusion models by introducing token-wise masking. Instead of predicting the full latent representation at each denoising step, MDiT reveals only a subset of latent tokens and learns to reconstruct the masked ones. This mechanism encourages localized modeling and improves the controllability of the model, which is particularly important for structured image synthesis tasks. During the training phase, a diffusion transformer (DiT) [10] architecture, consisting of an encoder and a decoder, is used to synthesize images based on the node information through cross-attention mechanisms. The process gradually refines noisy latent images to align with the semantic map, ultimately transforming graph-based node information into realistic remote sensing images while effectively preserving both spatial and temporal dependencies. For efficiency, all computations are performed in the latent space of a variational autoencoder (VAE) [28].

3.1. Preliminary

Song et al. [29] extended the discrete Markov process of DDPM into a continuous-time SDE, thus transforming the discrete denoising process of DDPM into a continuous-time dynamic process. In the forward process, diffusion models gradually perturb the real data $x_0 \sim p_{\text{data}}(x_0)$ toward a noise distribution $x_T \sim \mathcal{N}(0, \sigma_{\max}^2 I)$ using the following stochastic differential equation (SDE):

$$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w \tag{1}$$

where $f$ is the drift coefficient (vector-valued), $g$ is the diffusion coefficient, $w$ is a standard Wiener process, and time $t$ progresses from 0 to $T$. In the reverse process, samples can be generated using the following SDE:

$$\mathrm{d}x = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w} \tag{2}$$

where $\bar{w}$ is a standard Wiener process in reverse time, and $\mathrm{d}t$ represents an infinitesimal negative time step. The reverse SDE (2) can be reformulated as a probability flow ordinary differential equation (ODE) [29]:

$$\mathrm{d}x = \left[ f(x, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x) \right] \mathrm{d}t \tag{3}$$

which shares the same marginal distributions $p_t(x)$ as the forward SDE (1) at all time points $t$. We closely follow the EDM formulation [22], setting $f(x, t) := 0$ and $g(t) := \sqrt{2t}$. Under this specification, the forward SDE reduces to $x = x_0 + n$ with $n \sim \mathcal{N}(0, t^2 I)$, and the probability flow ODE simplifies to

$$\mathrm{d}x = -t\, \nabla_x \log p_t(x)\, \mathrm{d}t \tag{4}$$

To learn the score function $s(x, t) := \nabla_x \log p_t(x)$, EDM employs a denoising function $D_\theta(x, t)$ that minimizes the score-matching loss:

$$\mathbb{E}_{x_0 \sim p_{\text{data}}} \mathbb{E}_{n \sim \mathcal{N}(0, t^2 I)} \left\| D_\theta(x_0 + n, t) - x_0 \right\|^2 \tag{5}$$

This enables the estimation of the score $\hat{s}(x, t) = \left( D_\theta(x, t) - x \right) / t^2$.
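To make this concrete, here is a minimal PyTorch sketch of the EDM-style denoising loss and the corresponding score estimate; `denoiser` stands in for the network $D_\theta$, and the handling of the per-sample noise level $t$ is a simplification of the full EDM training recipe.

```python
import torch

def edm_denoising_loss(denoiser, x0, t):
    """Score-matching loss ||D_theta(x0 + n, t) - x0||^2 with n ~ N(0, t^2 I).

    x0: (B, C, H, W) clean latents; t: (B,) per-sample noise levels.
    """
    n = torch.randn_like(x0) * t.view(-1, 1, 1, 1)  # noise scaled by the level t
    x_noisy = x0 + n                                # forward process: x = x0 + n
    x_denoised = denoiser(x_noisy, t)               # D_theta(x, t)
    return ((x_denoised - x0) ** 2).mean()

def estimate_score(denoiser, x, t):
    """Score estimate s_hat(x, t) = (D_theta(x, t) - x) / t^2."""
    return (denoiser(x, t) - x) / (t.view(-1, 1, 1, 1) ** 2)
```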

3.2. Semantic Graph Construction

Semantic maps are widely used in remote sensing for land use/land cover (LULC) classification, where different regions, such as buildings, roads, vegetation, and water bodies, are labeled to reflect their functional and physical categories [30]. These maps provide dense pixel-wise annotations, with each pixel assigned a semantic label, enabling the detailed spatial characterization of the Earth’s surface. Although conventional semantic maps are proficient in identifying the presence of elements within a scene, they exhibit inherent limitations in their capacity to represent the spatial and semantic interactions between different types of land cover. Specifically, they lack the explicit structural representations necessary to describe inter-object interactions, such as adjacency, functional linkage, or hierarchical organization. Consequently, this restricts their applicability in tasks that demand structured scene understanding, spatial reasoning, or context-aware generation.
To address this, we converted semantic maps into structured graph representations. In this graph-based formulation, each object or homogeneous land cover region is treated as a node, and its spatial or semantic relationships are modeled as edges. These edges can encode a range of dependencies, including direct adjacency (e.g., a building next to a road), topological proximity (e.g., a forest bordering a river), or learned co-occurrence patterns derived from domain-specific priors. This reframing transforms pixel-based maps into high-level relational graphs that more accurately reflect the structural composition of geographic scenes.
By introducing graph-based reasoning, models can explicitly capture object-level interactions, enforce layout consistency, and condition generation processes on interpretable semantic structures. This not only enhances the representational capacity of LULC inputs but also improves performance on tasks such as remote sensing image synthesis, layout-constrained image generation, and scene interpretation, where spatial structure and semantic alignment are critical.
Formally, let the semantic map contain $N$ distinct objects, denoted as

$$\mathcal{N} = \{ N_1, N_2, \ldots, N_n \}$$

where each object $N_i$ corresponds to an independent region (e.g., a building, road, or water body). For each object $N_i$, we associate a textual description $T_i$ that semantically characterizes the attributes of the object, enabling richer representations that integrate both spatial and semantic information.
The relationships between objects are captured in an edge matrix $E \in \mathbb{R}^{N \times N}$, where each element $E_{ij}$ indicates the existence and strength of the interaction between nodes $N_i$ and $N_j$. Edges are established based on spatial proximity or semantic affinity: spatial adjacency is determined by analyzing the geometric boundaries of regions, and semantic similarity is derived from the textual embeddings. Thus, the constructed semantic graph is represented as $G = (\mathcal{N}, E)$, where nodes carry rich multimodal features and edges explicitly encode contextual dependencies. This graph-based formulation enables the model to reason about object-level relationships and spatial structures, providing a more informative and controllable foundation for downstream remote sensing image generation.
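As a rough illustration of this construction, the sketch below derives nodes from connected regions of a label raster and builds an adjacency-based edge matrix from dilated region masks; the SciPy-based adjacency test and the class-name list are illustrative assumptions, not the exact procedure used in the paper.

```python
import numpy as np
from scipy import ndimage

# Illustrative LULC class names; assumes integer class IDs index this list.
CLASS_NAMES = ["bareland", "rangeland", "developed", "road",
               "tree", "water", "agriculture", "building"]

def semantic_map_to_graph(sem_map):
    """Convert an HxW integer label map into (node_classes, edge_matrix).

    Each connected region of one class becomes a node; an edge is set when
    two regions' dilated masks overlap, i.e., the regions are spatially adjacent.
    """
    node_masks, node_classes = [], []
    for c in np.unique(sem_map):
        labeled, n_regions = ndimage.label(sem_map == c)   # connected components of class c
        for r in range(1, n_regions + 1):
            node_masks.append(labeled == r)
            node_classes.append(CLASS_NAMES[int(c)])
    n = len(node_masks)
    edges = np.zeros((n, n), dtype=np.float32)
    dilated = [ndimage.binary_dilation(m, iterations=2) for m in node_masks]
    for i in range(n):
        for j in range(i + 1, n):
            if (dilated[i] & node_masks[j]).any():          # spatial adjacency test
                edges[i, j] = edges[j, i] = 1.0
    return node_classes, edges
```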

3.3. Graph Transformer

After constructing the semantic graph $G = (\mathcal{N}, E)$, we initialize the node features using a lightweight feature embedding module. Each node feature $n_i \in \mathbb{R}^{d_{\text{in}}}$ is projected into a lower-dimensional semantic space via a learnable linear transformation followed by a nonlinear activation:

$$h_i = \sigma(W n_i + b)$$

where $\sigma(\cdot)$ denotes the LeakyReLU activation, and $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ and $b \in \mathbb{R}^{d_{\text{out}}}$ are learnable parameters.
After embedding the node features, the features are processed through a multistage graph transformer module [31]. The graph transformer applies self-attention operations across the graph structure, allowing nodes to aggregate information from neighboring nodes conditioned on the edge connections. Formally, for each stage, the node feature update is defined as

$$h_i' = \sum_{j \in \mathcal{N}(i)} \alpha_{ij} W_v h_j$$

where $\mathcal{N}(i)$ denotes the neighbors of node $i$, $W_v$ is a learnable value projection matrix, and $\alpha_{ij}$ are attention weights computed as

$$\alpha_{ij} = \mathrm{softmax}_j\!\left( \frac{(W_q h_i)^\top (W_k h_j)}{\sqrt{d_k}} \right)$$

where $W_q$ and $W_k$ are the query and key projection matrices, respectively, and $d_k$ is the dimension of the key vectors.
Finally, the graph transformer outputs updated node embeddings at different levels, denoted as $g_f$, which are used for subsequent cross-modal fusion and semantic diffusion generation.
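The update rule above can be prototyped as a single edge-masked attention stage in PyTorch; the added self-loops (for numerical stability) and the fused embedding step are conveniences of this sketch rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionStage(nn.Module):
    """One attention stage over the semantic graph: nodes attend only to their neighbors."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(d_in, d_out), nn.LeakyReLU())  # h_i = sigma(W n_i + b)
        self.wq = nn.Linear(d_out, d_out, bias=False)
        self.wk = nn.Linear(d_out, d_out, bias=False)
        self.wv = nn.Linear(d_out, d_out, bias=False)

    def forward(self, node_feats, edges):
        # node_feats: (N, d_in) initial node features; edges: (N, N) binary adjacency matrix
        h = self.embed(node_feats)
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        scores = q @ k.t() / (h.shape[-1] ** 0.5)                     # (W_q h_i)·(W_k h_j)/sqrt(d_k)
        adj = edges + torch.eye(edges.shape[0], device=edges.device)  # self-loops avoid empty rows
        scores = scores.masked_fill(adj == 0, float("-inf"))          # restrict attention to N(i)
        alpha = F.softmax(scores, dim=-1)                             # attention weights alpha_ij
        return alpha @ v                                              # sum_j alpha_ij W_v h_j
```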

3.4. Graph-Masked Diffusion Transformer Framework

The graph-masked diffusion transformer (GMDiT) framework extends the classical diffusion model by introducing token-level masking during the denoising process. Unlike standard diffusion models that predict the full representation of the data at each step, GMDiT operates on partially masked latent tokens. At each time step, only a subset of tokens is denoised, while masked tokens are predicted based on the available contextual information. This masking strategy improves the robustness of the model, accelerates convergence, and enhances the semantic controllability during generation.
Formally, given an input semantic layout or condition, we first obtain a latent representation using a variational autoencoder (VAE) [28]. Specifically, the input is encoded into a latent variable $z \in \mathbb{R}^{N \times d}$, where $N$ is the number of latent tokens and $d$ is the token embedding dimension. During training, Gaussian noise is added to the latent variable to simulate the forward diffusion process, producing a noisy latent representation $z_t$ at time step $t$.
At each time step, the model aims to predict the clean latent variable $z_0$ from its noisy version $z_t$ using a denoising function $\epsilon_\theta$. To enhance spatial modeling and semantic reasoning, a binary mask $M_t \in \{0, 1\}^N$ is applied to partially occlude a subset of latent tokens. This token-level masking strategy compels the model to infer the occluded regions from contextual information, thereby enhancing its comprehension of the structural layout.
The denoising objective under the masked setting is defined as

$$\hat{z}_0 = \epsilon_\theta(z_t \odot M_t, t)$$

where $\odot$ denotes element-wise multiplication, and $\epsilon_\theta$ represents the masked diffusion transformer that operates over partially visible latent representations. The masking operation encourages the model to learn strong dependencies among spatial tokens and enhances its robustness in reconstructing structurally complex scenes.
To incorporate semantic control into the diffusion process, the graph-based node embeddings $g_f \in \mathbb{R}^{K \times d}$, obtained from the graph transformer, are introduced as conditional inputs. Specifically, $g_f$ is injected into MDiT via a cross-attention mechanism at each denoising step, in which the image tokens attend to $g_f$ as keys and values, and the attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^\top}{\sqrt{d}} \right) V$$

where the query $Q$ is derived from the unmasked image tokens, and $K$ and $V$ are projected from the graph embeddings $g_f$.
This conditional denoising strategy ensures that the generated images not only resemble realistic distributions but also conform to the spatial structures and semantic relationships encoded in the semantic graph. By progressively applying graph-prior conditions during masked denoising, GMDiT achieves fine-grained controllability, semantic consistency, and improved sample quality in remote sensing image synthesis. All experiments and model implementations in this work were built on the MDT-XL/2 architecture [6], which employs 24 transformer encoder layers and 2 decoder layers, a hidden dimension of 1024, and 16 attention heads. This high-capacity configuration enables the effective modeling of complex spatial layouts and ensures sufficient capacity for graph-conditioned generation.
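A minimal sketch of this conditioning path, using PyTorch’s built-in multi-head attention: the image tokens act as queries while the graph embeddings $g_f$ supply keys and values. The residual wiring and module boundaries are simplifications of the actual MDiT block structure.

```python
import torch.nn as nn

class GraphCrossAttention(nn.Module):
    """Cross-attention block: latent image tokens query the graph node embeddings g_f."""

    def __init__(self, d_token, d_graph, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_token, n_heads,
                                          kdim=d_graph, vdim=d_graph,
                                          batch_first=True)

    def forward(self, img_tokens, graph_emb):
        # img_tokens: (B, N_visible, d_token) unmasked latent tokens (queries Q)
        # graph_emb:  (B, K, d_graph)         graph transformer outputs (keys K and values V)
        out, _ = self.attn(img_tokens, graph_emb, graph_emb)
        return img_tokens + out  # residual injection of graph context into the token stream
```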

3.5. Loss Function

As illustrated in Figure 2, the optimization objective comprises two mean absolute error (MAE) reconstruction losses. The first loss reconstructs the masked tokens toward their original noise tokens (the noisy input before masking) and is defined as follows:

$$\mathcal{L}_m = \frac{1}{|M|} \sum_{i \in M} \left| x_i^{\mathrm{rec}} - x_i^{\mathrm{noise}} \right|$$

where $M$ denotes the set of masked pixel positions, and $x_i^{\mathrm{rec}}$ and $x_i^{\mathrm{noise}}$ represent the reconstructed pixel and the noisy pixel before masking at position $i$, respectively. The second loss reconstructs the unmasked tokens toward the corresponding clean image and is defined as

$$\mathcal{L}_{um} = \frac{1}{|U|} \sum_{j \in U} \left| x_j^{\mathrm{rec}} - x_j^{\mathrm{clean}} \right|$$

where $U$ represents the set of unmasked pixel positions, and $x_j^{\mathrm{clean}}$ is the clean pixel value at position $j$. Therefore, the overall optimization objective combines these two losses as follows:

$$\mathcal{L} = \mathcal{L}_m + \mathcal{L}_{um}$$
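Assuming token-aligned tensors and a boolean mask over token positions, the combined objective can be expressed as the short function below; the shapes and naming are illustrative.

```python
import torch

def gmdit_loss(x_rec, x_noise, x_clean, mask):
    """Combined MAE objective L = L_m + L_um.

    x_rec, x_noise, x_clean: (B, N, d) reconstructed, pre-masking noisy, and clean tokens.
    mask: (B, N) boolean tensor, True where a token was masked.
    """
    l_m = (x_rec - x_noise).abs()[mask].mean()    # masked tokens regress to their noisy values
    l_um = (x_rec - x_clean).abs()[~mask].mean()  # unmasked tokens regress to the clean target
    return l_m + l_um
```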

3.6. Scene-Aware Guidance

In traditional classifier-free guidance (CFG) methods [32], conditional generation is typically achieved by interpolating between predictions obtained with and without conditional information. This linear combination strategy helps to control the trade-off between fidelity and diversity, making it widely adopted in text-to-image or layout-to-image diffusion models. However, such methods usually rely on condition signals that are fixed or randomly sampled, without considering the semantic compatibility between the input layout and the condition. In the context of remote sensing image synthesis, where spatial and semantic consistency is crucial, this randomness can result in guidance that is semantically irrelevant, structurally incoherent, or even misleading. For example, conditioning a rural semantic map with an urban layout reference may produce unrealistic or contradictory results.
To address this issue, we propose a scene-aware guidance strategy, which replaces random condition selection with a more principled retrieval mechanism based on semantic similarity. Instead of interpolating between arbitrary conditional and unconditional predictions, our method retrieves scene-relevant semantic priors from a curated semantic library, thereby providing contextually aligned and structurally meaningful guidance.
Figure 2. Overview of the training framework. A latent noisy image is iteratively denoised using a VAE-based encoder–decoder. Semantic maps and implicit node interactions are embedded as graphs and processed by a graph transformer. CLIP encodes node semantics, and cross-attention links node features with image tokens. The model is conditioned on the semantic layout and time steps and is supervised by masked alignment and reconstruction losses.
As illustrated in Figure 3, we first construct a semantic library composed of pre-annotated maps from various geographic scenes. These semantic maps are embedded using a pretrained VGG16 network [33], which extracts scene-level features, capturing both spatial composition and category distribution. We then apply K-means clustering to these feature vectors, grouping the semantic maps into K = 10 representative scene clusters, such as dense urban areas, sparse rural regions, forest landscapes, or coastal zones. This clustering step enables the model to retrieve guidance from semantically and structurally coherent subsets rather than from the entire heterogeneous dataset.
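A minimal sketch of this library-building step, assuming torchvision’s pretrained VGG16 and scikit-learn’s KMeans; rendering the label maps as three-channel inputs and global-average pooling the VGG feature maps are illustrative choices made here, not details taken from the paper.

```python
import torch
from torchvision.models import vgg16, VGG16_Weights
from sklearn.cluster import KMeans

def build_scene_clusters(semantic_rgb_maps, k=10):
    """Cluster a semantic library into k scene groups.

    semantic_rgb_maps: (M, 3, 224, 224) float tensor of color-rendered label maps.
    Returns the fitted KMeans model and the pooled scene features.
    """
    backbone = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
    with torch.no_grad():
        feats = backbone(semantic_rgb_maps)           # (M, 512, 7, 7) VGG16 feature maps
        feats = feats.mean(dim=(2, 3)).cpu().numpy()  # global-average-pooled scene features
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    return km, feats  # km.labels_ assigns each semantic map to a scene cluster
```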
During the denoising process of the GMDiT model, our approach replaces the conventional CFG mechanism with semantically aware conditioning. Specifically, the input semantic map $S_{\text{input}}$ is encoded via VGG16 to produce a feature vector $f_{\text{input}} \in \mathbb{R}^d$. Using this feature, we identify its closest scene cluster and randomly sample a reference map $S_{\text{ref}}$ from within that cluster. This reference provides two levels of conditional information: a local semantic layout (Cond) and a global scene context (SceneCond). The layout captures object placement and structure, while the scene context captures the overall style, texture, or composition characteristics.
Formally, the denoising step at time step $t$ is expressed as

$$\hat{\epsilon}_{\text{guided}} = \epsilon_\theta(x_t, t, \mathrm{Cond}, \mathrm{SceneCond})$$

where $x_t$ is the noisy latent representation at time $t$, and $\epsilon_\theta$ denotes the denoising network conditioned on both the spatial layout and scene-level priors. By injecting these structured condition signals, the model can dynamically align the generation process with the relevant semantic patterns and spatial constraints.
This scene-aware strategy enables the diffusion process to generate images that are not only visually plausible but also spatially coherent and semantically aligned with real-world scene structures. It mitigates the instability caused by random conditions, improves inter-object consistency, and provides a scalable path toward controllable and context-aware remote sensing image synthesis.
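Building on the clustering sketch above, the retrieval and guided denoising call could be wired roughly as follows; `kmeans` and `library_maps` are the hypothetical outputs of that sketch, and the single conditioned call abstracts how Cond and SceneCond are actually injected into the network.

```python
import numpy as np
import torch

def retrieve_reference(f_input, kmeans, library_maps, rng=np.random):
    """Pick a reference semantic map S_ref from the scene cluster closest to the input."""
    cluster = int(kmeans.predict(f_input[None])[0])       # nearest scene cluster
    candidates = np.where(kmeans.labels_ == cluster)[0]   # library maps belonging to that cluster
    return library_maps[int(rng.choice(candidates))]

def scene_aware_step(eps_theta, x_t, t, cond, scene_cond):
    """One guided denoising call conditioned on the layout (Cond) and scene context (SceneCond)."""
    with torch.no_grad():
        return eps_theta(x_t, t, cond, scene_cond)
```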

4. Experiment Setups

Training Details. During training, a masking ratio of 0.5 was applied to the input tokens. We adopted the AdamW optimizer [34] with a fixed learning rate of $1 \times 10^{-5}$ and a weight decay of 0.002. The model was trained for 100 epochs using four NVIDIA RTX 3090 GPUs, with a batch size of 12 per GPU (i.e., a total batch size of 48). We employed mixed precision training to improve computational efficiency and reduce memory usage. All experiments were conducted using the PyTorch 2.4 framework, and model checkpoints were saved based on the best validation performance.
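For reference, a training loop matching these reported hyperparameters might look like the sketch below; `model`, `loader`, and `train_step` are placeholders for the GMDiT network, the patch dataloader, and the loss computation, so this is an outline under stated assumptions rather than the authors’ training script.

```python
import torch

def train_gmdit(model, loader, train_step, epochs=100):
    """Training loop: AdamW (lr 1e-5, weight decay 0.002) with mixed precision."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.002)
    scaler = torch.cuda.amp.GradScaler()
    for _ in range(epochs):
        for batch in loader:                      # e.g., batch size 12 per GPU
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():       # mixed precision forward pass
                loss = train_step(model, batch)   # e.g., the masked MAE objective above
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
```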
Sample Details. We adopted the EDM sampling strategy proposed in [22] to generate high-quality remote sensing images. Specifically, we followed the standard configuration of the EDM sampler, which progressively denoises a Gaussian noise input $x_T$ through a series of learned denoising steps to obtain the final image $x_0$. During sampling, we set the number of denoising steps to 40 to balance generation quality and efficiency. The time-step schedule followed a cosine distribution, and classifier-free guidance was applied during sampling with a guidance scale of 3.5. Semantic conditions (Cond) and scene-level context (SceneCond) were injected at each step to steer the denoising trajectory, ensuring that the generated images aligned with the intended semantic layout.
Evaluation Metrics. We quantitatively assessed the image generation performance of GMDiT using the Fréchet Inception Distance (FID) [20], a widely adopted metric that evaluates the similarity between the distributions of generated and real images based on deep feature representations. FID compares the mean and covariance of features extracted from a pretrained Inception network and is formally defined as

$$\mathrm{FID} = \left\| \mu_r - \mu_g \right\|^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the means and covariances of the real and generated image features, respectively. Lower FID values indicate that the distribution of the generated images is closer to that of the real images, implying better synthesis quality.
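Given Inception features for the real and generated image sets, the FID defined above can be computed with a few lines of NumPy/SciPy; this is a generic sketch rather than the exact evaluation code used in the paper.

```python
import numpy as np
from scipy import linalg

def compute_fid(feats_real, feats_gen):
    """FID between two sets of Inception features of shape (num_images, feature_dim)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real            # discard tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```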

4.1. Dataset

OpenEarthMap is a highly challenging and diverse dataset for understanding land cover in remote sensing [35]. It consists of more than 2.2 million labeled segments derived from 5000 high-resolution aerial and satellite images, covering 97 geographically diverse regions across 44 countries and 6 continents [35]. Each image is annotated with 8 land cover classes, with ground sampling distances ranging from 0.25 to 0.5 m per pixel.
To prepare the dataset for model training, all images were uniformly cropped into non-overlapping patches of 256 × 256 pixels. The resulting patches were then divided into training and validation subsets in a 7:3 ratio. This preprocessing step yielded 46,404 patches for training and 7734 for validation, providing a rich and well-structured data foundation for developing and evaluating robust generative and change detection models.
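A sketch of this patch-extraction step, assuming the image and label rasters are NumPy arrays and that border remainders smaller than a full patch are simply dropped:

```python
import numpy as np

def crop_to_patches(image, label, patch=256):
    """Split an aligned image/label pair into non-overlapping patch x patch tiles."""
    h, w = label.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append((image[y:y + patch, x:x + patch],
                            label[y:y + patch, x:x + patch]))
    return patches
```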

4.2. Comparison with SOTA Models

To demonstrate the effectiveness and superiority of our proposed method, we conducted comparisons with several representative semantic image synthesis approaches, including Pix2PixHD, SPADE, SDM, LDM, PLACE, Freestyle, and GeoSynth. Pix2PixHD [4] is a high-resolution image-to-image translation framework based on conditional GANs. It introduces multiscale generators and discriminators to better preserve local and global details, enabling realistic outputs from semantic layouts.
SPADE [3] incorporates spatially adaptive normalization layers, which modulate features based on the input semantic map. This design allows the generator to retain the spatial structure more effectively during synthesis.
SDM [18] introduces a class-conditional diffusion framework that leverages semantic segmentation maps to guide the image generation process. The model incorporates spatial semantic priors directly into the diffusion steps, enabling semantically controllable image synthesis.
LDM [7] (latent diffusion model) operates in a compressed latent space to improve efficiency while maintaining the resolution and quality of the output. It applies the diffusion process in a learned feature space rather than in a pixel space.
PLACE [25] incorporates layout priors into the diffusion process to guide structure-aware image synthesis. It uses explicit spatial layout conditions to shape the denoising trajectory.
Freestyle [24] allows for open vocabulary and region-level control for the generation of semantic images. It supports flexible prompting mechanisms using both visual and textual inputs.
GeoSynth [26] constitutes a semantic-to-image diffusion model specifically engineered for remote sensing applications. It assimilates regional semantic features with generation guidance to enhance congruence with geospatial layouts.
For a fair comparison, all baseline models were retrained for 100 epochs using their default configurations. The best-performing checkpoint for each method, based on validation performance, was selected for evaluation.
Table 1 provides a comprehensive comparison of the representative generative models evaluated on the OpenEarthMap validation set, focusing on image quality (FID) and inference efficiency (time per sample). Our proposed GMDiT demonstrates highly competitive performance across all metrics.
In terms of image fidelity, GMDiT achieves the lowest FID score of 23.15, substantially outperforming both GAN-based and diffusion-based baselines. Compared to GAN models such as Pix2PixHD (FID: 51.74) and SPADE (FID: 44.56), GMDiT roughly halves the FID, reflecting its superior capability to generate semantically consistent and realistic images. Among the diffusion-based approaches, GMDiT consistently outperforms PLACE (FID: 49.51), Freestyle (FID: 30.28), and GeoSynth (FID: 26.43), with FID improvements of 26.36, 7.13, and 3.28, respectively, demonstrating its strong capability to preserve layout fidelity under the guidance of graph-structured priors.
In terms of inference speed, GMDiT achieves a per-image inference time of 2.01 s, significantly faster than SDM (7.86 s), PLACE (8.87 s), and Freestyle (7.46 s), and comparable to LDM (2.59 s) and GeoSynth (2.05 s). These results suggest that GMDiT not only excels in image quality but also maintains practical computational efficiency, making it a strong candidate for real-world remote sensing image generation tasks.
Figure 4 further validates the effectiveness of our proposed method. In urban areas, our method has clear advantages over Pix2PixHD and SPADE in terms of structural integrity and semantic consistency. Although SDM maintains good structural alignment, it suffers from significant color distortion. PLACE achieves relatively high semantic alignment with the input layout, but its outputs tend to be overly smooth, leading to the loss of fine-grained details. GeoSynth performs well for individual buildings but still suffers from over-smoothing and incomplete architectural structures. Both Freestyle and our approach strike a good balance between structure and semantic alignment. Compared to Freestyle, our method generates more coherent color and texture for natural land cover such as forests and grasslands.
Similar observations can be made for rural areas. Compared with other models, our method produces more detailed textures and smoother spatial transitions between different land covers, preserving the highest level of consistency with the input semantic maps. These results demonstrate that the integration of graph-structured priors enables more effective modeling of spatial relationships and object interactions, contributing to both semantic fidelity and visual realism.

4.3. Ablation Study

To investigate the effectiveness of the key components in our proposed framework, we conducted a series of ablation experiments on the OpenEarthMap validation set. Specifically, we focused on two core modules: (1) the use of graph-structured semantic priors and (2) the scene-aware retrieval mechanism based on K-means clustering [36].
Table 2 highlights the impact of the graph prior. When our model incorporates the graph-structured semantic prior, the FID score is reduced from 26.41 to 23.15, demonstrating a notable improvement in image fidelity. Figure 5 presents the qualitative comparison. As shown in Figure 5b, compared to the baseline in Figure 5a, the generated results exhibit clearer structures, enhanced contrast, and improved consistency among identical semantic categories.
These observations confirm that the graph prior effectively models spatial dependencies and object-level relationships, which play a crucial role in maintaining semantic layout coherence. By explicitly encoding structured spatial information, the proposed method significantly enhances the realism and structural accuracy of the synthesized remote sensing imagery.
To further evaluate the effectiveness of the scene-aware retrieval mechanism, we conducted an ablation study by varying the number of K-means clusters used to group the semantic library. Specifically, we tested five configurations with $K \in \{0, 5, 10, 15, 20\}$, where $K = 0$ denotes random retrieval from the entire semantic set without clustering. As shown in Table 3, increasing the number of groups improves the FID up to a point, with the best performance achieved at $K = 10$. This indicates that moderate scene granularity provides sufficient semantic alignment for conditional guidance. However, further increasing $K$ to 15 or 20 results in slight performance degradation, possibly due to over-fragmentation and reduced sample diversity within clusters. These results suggest that balancing specificity and diversity in scene retrieval is essential for optimizing synthesis quality. As shown in Figure 6, when $K = 0$, the model performs conditional generation using traditional classifier-free guidance (CFG), where condition signals are randomly sampled from the entire semantic library without scene-level distinction. As $K$ increases, the K-means clustering enables the model to retrieve semantically similar guidance more effectively, leading to gradual improvements in FID, with the best performance again observed at $K = 10$ before a slight drop at $K = 15$ and $K = 20$ due to over-segmentation of the scene space and reduced diversity within clusters.

5. Discussion

In this work, we propose a graph-aware, semantically guided diffusion framework (GMDiT) tailored for high-fidelity remote sensing image synthesis. In extensive quantitative and qualitative evaluations, the proposed method shows strong performance in preserving spatial structures, maintaining semantic consistency, and generating realistic textures.
One key finding is that the integration of graph-structured semantic priors significantly enhances generation quality by explicitly modeling object-level relationships and spatial dependencies. Compared with baseline diffusion and GAN-based models, our approach exhibits clearer boundaries, better inter-class consistency, and improved FID scores. Ablation studies further confirm the critical role of graph priors in controlling object layout and improving sample coherence.
Another important contribution lies in the use of a scene-aware retrieval strategy based on K-means clustering. By retrieving condition signals from semantically similar regions, the model avoids the randomness and instability often observed in conventional classifier-free guidance. Our experiments show that moderate clustering granularity (e.g., K = 10 ) yields optimal results, highlighting the importance of balancing scene specificity and retrieval diversity.
Despite these strengths, the proposed framework still has limitations. First, the clustering quality is inherently dependent on the feature extractor used (e.g., VGG16), which may not always capture fine-grained semantic variations. Second, while graph priors improve controllability, their construction relies on predefined semantic maps, which may limit generalizability to unseen domains or classes.
In future work, we aim to explore adaptive graph construction methods that automatically infer inter-object relations from raw inputs without relying on external segmentation. We also plan to investigate lightweight graph encoders and scene-retrieval modules to reduce inference time and improve scalability.

6. Conclusions

In this paper, we propose GMDiT, a graph-aware, scene-guided diffusion framework for remote sensing image synthesis. The model incorporates graph-structured semantic priors to capture object-level spatial relationships and employs a K-means-based scene retrieval mechanism to enhance conditional guidance during generation. By introducing masked token prediction and graph-conditioned denoising, GMDiT achieves fine-grained semantic controllability and high-fidelity image generation.
Extensive experiments on the OpenEarthMap dataset demonstrate that GMDiT outperforms both GAN-based and state-of-the-art diffusion models in terms of FID, semantic consistency, and structural clarity. Ablation studies further validate the individual contributions of graph priors and scene-aware retrieval, highlighting the importance of structured semantic modeling for layout-preserving synthesis.
Overall, our work provides a novel and effective framework for controllable remote sensing image generation, particularly beneficial for land use/land cover (LULC) simulation tasks. By enabling the generation of diverse and semantically consistent LULC imagery, GMDiT supports downstream applications such as data augmentation, synthetic training sample generation, and change scenario simulation.

Author Contributions

Conceptualization, H.J.; investigation, B.S.; methodology, K.D.; writing—original draft preparation, K.D., S.W.; writing—review and editing, K.D., S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Young Talent Support Program of China Association for Science and Technology (CAST).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is publicly available at: https://open-earth-map.org/.

Acknowledgments

This work was supported by the Young Talent Support Program of China Association for Science and Technology (CAST).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  2. Liu, X.; Yin, G.; Shao, J.; Wang, X.; Li, H. Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis. arXiv 2019, arXiv:1910.06809. [Google Scholar]
  3. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic Image Synthesis With Spatially-Adaptive Normalization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2332–2341. [Google Scholar]
  4. Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  5. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  6. Zheng, H.; Nie, W.; Vahdat, A.; Anandkumar, A. Fast Training of Diffusion Models with Masked Transformers. arXiv 2023, arXiv:2306.09305. [Google Scholar]
  7. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  8. Alharmi, G.; Al-Khazraji, A. Generative adversarial networks: A recent survey. In Proceedings of the 6th Smart Cities Symposium (SCS 2022), Bahrain, 6–8 December 2022; IET: Singapore, 2022; Volume 2022, pp. 547–552. [Google Scholar]
  9. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. arXiv 2022, arXiv:2010.02502. [Google Scholar]
  10. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4195–4205. [Google Scholar]
  11. Gao, S.; Zhou, P.; Cheng, M.M.; Yan, S. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 23164–23173. [Google Scholar]
  12. Wang, H.; Wang, J.; Wang, J.; Zhao, M.; Zhang, W.; Zhang, F.; Xie, X.; Guo, M. GraphGAN: Graph Representation Learning with Generative Adversarial Nets. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  13. Tseng, A.M.; Diamant, N.; Biancalani, T.; Scalia, G. GraphGUIDE: Interpretable and Controllable Conditional Graph Generation with Discrete Bernoulli Diffusion. arXiv 2023, arXiv:2302.03790. [Google Scholar]
  14. Vignac, C.; Krawczuk, I.; Siraudin, A.; Wang, B.; Cevher, V.; Frossard, P. DiGress: Discrete Denoising Diffusion for Graph Generation. arXiv 2022, arXiv:2209.14734. [Google Scholar]
  15. Sohl-Dickstein, J.N.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv 2015, arXiv:1503.03585. [Google Scholar]
  16. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
  17. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  18. Wang, W.; Bao, J.; Zhou, W.; Chen, D.; Chen, D.; Yuan, L.; Li, H. Semantic image synthesis via diffusion models. arXiv 2022, arXiv:2207.00050. [Google Scholar]
  19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  20. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640. [Google Scholar]
  21. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  22. Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the Design Space of Diffusion-Based Generative Models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  23. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  24. Xue, H.; Huang, Z.; Sun, Q.; Song, L.; Zhang, W. Freestyle Layout-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  25. Lv, Z.; Wei, Y.; Zuo, W.; Wong, K.Y.K. PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  26. Sastry, S.; Khanal, S.; Dhakal, A.; Jacobs, N. GeoSynth: Contextually-Aware High-Resolution Satellite Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 460–470. [Google Scholar]
  27. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  28. Kingma, D.P. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  29. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  30. Lou, C.; Al-qaness, M.A.; AL-Alimi, D.; Dahou, A.; Abd Elaziz, M.; Abualigah, L.; Ewees, A.A. Land use/land cover (LULC) classification using hyperspectral images: A review. Geo-Spat. Inf. Sci. 2024, 28, 345–386. [Google Scholar] [CrossRef]
  31. Yang, J.; Liu, Z.; Xiao, S.; Li, C.; Lian, D.; Agrawal, S.; Singh, A.; Sun, G.; Xie, X. Graphformers: Gnn-nested transformers for representation learning on textual graph. Adv. Neural Inf. Process. Syst. 2021, 34, 28798–28810. [Google Scholar]
  32. Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598. [Google Scholar]
  33. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  34. Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  35. Xia, J.; Yokoya, N.; Adriano, B.; Broni-Bediako, C. Openearthmap: A benchmark dataset for global high-resolution land cover mapping. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6254–6264. [Google Scholar]
  36. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics; University of California Press: Oakland, CA, USA, 1967; Volume 5, pp. 281–298. [Google Scholar]
Figure 1. Visual results of our proposed method. (a,c,e) Land cover semantic maps; (b,d,f) corresponding synthesized remote sensing images.
Figure 3. Semantic-retrieval-enhanced diffusion sample process.
Figure 4. Qualitative comparison of generation results of different methods.
Figure 5. (a) With graph, (b) without graph prior.
Figure 6. Effect of different numbers of K-means clusters on image generation quality.
Table 1. Performance comparison of generative models on OpenEarthMap validation set. Lower FID indicates better performance (↓).

Method         Type        FID ↓    Inference Time (s) ↓
Pix2PixHD      GAN         51.74    0.007
SPADE          GAN         44.56    0.010
SDM            Diffusion   101.46   7.86
PLACE          Diffusion   49.51    8.87
Freestyle      Diffusion   30.28    7.46
LDM            Diffusion   29.19    2.59
GeoSynth       Diffusion   26.43    2.05
GMDiT (Ours)   Diffusion   23.15    2.01
Table 2. FID comparison with and without graph-structured semantic priors. Lower FID indicates better performance (↓).

Type                  FID ↓
Without Graph Prior   26.41
With Graph Prior      23.15
Table 3. FID comparison under different numbers of K-means clusters. Lower FID indicates better performance (↓).

Number of Clusters (K)   FID ↓
0 (CFG)                  26.83
5                        24.65
10                       23.15
15                       23.78
20                       24.12

