GCCG-RSI: Ground LiDAR and Image-Guided Geometry-Constrained Controllable Generation for Remote Sensing Image

Hu, Di; Qin, Riyu; Yuan, Xia; Yang, Shuting; Zhao, Chunxia

doi:10.3390/rs18101512


by Di Hu ^1,2,†, Riyu Qin ^2,†, Xia Yuan ^2,*, Shuting Yang ^3,4 and Chunxia Zhao ^2

^1 College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China
^2 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
^3 School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
^4 Institute of Agricultural Economy and Information Technology, Ningxia Academy of Agriculture and Forestry Sciences, Yinchuan 750002, China
^* Author to whom correspondence should be addressed.
^† These authors contributed equally to this work.

Remote Sens. 2026, 18(10), 1512; https://doi.org/10.3390/rs18101512

Submission received: 3 March 2026 / Revised: 20 April 2026 / Accepted: 4 May 2026 / Published: 11 May 2026

(This article belongs to the Special Issue Automatic Segmentation, Reconstruction, and Modelling from Laser Scanning Data)


Highlights

What are the main findings?

A novel geometry-constrained controllable generation model is proposed to synthesize remote sensing images from ground-level images and corresponding point clouds.
A dual remote sensing feature fusion module that leverages the complementary characteristics of image and point cloud data is designed to guide the diffusion model for generating realistic remote sensing imagery.

What are the implications of the main finding?

This approach significantly enhances the fidelity and realism of synthesized remote sensing images while effectively reducing spatial structural randomness.
It establishes a robust and efficient solution for cross-modal and cross-view image generation, offering significant value for observation in inaccessible areas like UAV no-fly zones and underground regions.

Abstract

Remote sensing image analysis is crucial for many research fields, yet acquiring frequent high-quality remote sensing imagery is not always feasible due to prohibitive costs and logistical efforts. As a solution, ground-to-satellite cross-view image generation has emerged as a promising approach for synthesizing remote sensing images from readily available ground sensor data. However, existing methods face two critical limitations that bottleneck their performance, including instability in object structural attributes in ground views and reduced image fidelity and consistency due to environmental occlusions. To address these challenges, this paper proposes a geometrically constrained controllable generation model specifically tailored for remote sensing image generation, called GCCG-RSI. To overcome the limitation of structural instability, GCCG-RSI introduces LiDAR ranging accuracy to constrain the geometric shapes of the generated image. To mitigate occlusion-induced fidelity issues, GCCG-RSI employs an attention mechanism to derive a unified fused representation that integrates texture and spatial structure information. The representation is utilized as a conditional control signal to guide the diffusion model in accurately synthesizing remote sensing imagery. Experimental results demonstrate that, compared with state-of-the-art methods, GCCG-RSI infers remote sensing images with superior realism and fidelity using ground-view images and point clouds with limited perspective. Overall, the proposed method provides an effective image preprocessing approach that contributes to significantly narrowing the domain discrepancy between ground and satellite images, thereby facilitating the execution of downstream tasks.

Keywords:

remote sensing image; cross-view image generation; LiDAR and image fusion; attention mechanism; diffusion model

1. Introduction

Cross-view image generation has been a fundamental challenge in computer vision, aiming to infer the target view from an observed perspective [1]. In robotics, ground-view images are readily accessible in real time, whereas acquiring corresponding and up-to-date overhead or satellite imagery often incurs substantial logistical and financial costs for frequent updates [2]. While platforms like Google Maps (https://developers.google.com/maps/documentation/maps-static/intro, accessed on 15 November 2025) offer access to pre-existing geospatial data and historical satellite imagery, these resources inherently lack real-time currency [3]. Unmanned aerial vehicles (UAVs) offer a potential solution for acquiring proximate and real-time aerial views within controlled environments. However, their operational feasibility is severely constrained by adverse weather conditions and by ubiquitous regulations that establish extensive no-fly zones. These gaps motivate the task of generating remote sensing images (RSIs) from street-view images. However, the substantial viewpoint differences between street-level and aerial perspectives make this task particularly challenging, especially when objects are occluded. Furthermore, similar objects in one view may appear drastically different in another, introducing the view-invariance problem. The complexity escalates when the scene contains multiple objects, as underlying variability factors proliferate, significantly increasing the difficulty of generating plausible target views.

In the field of street-to-aerial image generation, early methods primarily relied on generative adversarial networks (GANs) [4,5] or variational autoencoders (VAEs) [6,7], but these approaches often suffered from issues such as mode collapse, limited diversity, and difficulties in capturing high-level semantic consistency [1]. Consequently, GAN-based and VAE-based methods have been gradually replaced by diffusion-based models, which demonstrate superior generation quality and enhanced controllability [8,9,10]. Nevertheless, these diffusion models still face challenges, including slow inference speed and constraints in handling multi-view consistency. Recent progress has addressed some of these limitations by employing joint diffusion models, which enable more coherent generation across different perspectives [11]. However, these methods remain restricted in their applicability, as they can only generate certain types of panoramic images [12]. Furthermore, these methods often entail computationally demanding fine-tuning processes on large-scale panoramic datasets, thereby restricting their scalability and practical utility.

For cross-view image generation based on generative methods, relying on a single image for transformation often encounters significant challenges. On one hand, occluded regions in the ground-level original images affect the transformation process, leading to distortion artifacts in the generated outputs [13]. These occlusions, caused by buildings, vegetation, or other obstacles, introduce incomplete or misleading visual cues that disrupt the projection between perspectives. Furthermore, perspective transformation involves image resampling. In the transformed image, regions distant from the camera correspond to fewer pixels in the original image, resulting in reduced resolution and blurriness. On the other hand, substantial perspective differences between viewpoints introduce inherent randomness in the spatial attributes of generated images, such as object positions, orientations, and scales [3,14].

To address these challenges, this paper proposes GCCG-RSI, a controllable RSI generation model guided by both colored point cloud-based RSI (CP-RSI) and image-projected RSI (I-RSI), as illustrated in Figure 1. The core novelty of our work is the strategic fusion of LiDAR point clouds with ground-view imagery to constrain and enhance cross-view satellite image synthesis. To address the limitation that a single modality struggles to adequately represent the spatial attributes of targets, GCCG-RSI leverages the precise geometric structural information from LiDAR to guide the denoising process. This mechanism effectively directs the diffusion model to generate the RSI with accurate and logical road layouts. Furthermore, the proposed method integrates rich pixel-level information extracted from I-RSI to accurately capture key ground-object features. By fusing the two RSI representations, the model is guided to generate an RSI in which road structures appear both realistic and precise. This synergistic fusion strategy enables GCCG-RSI to overcome the semantic ambiguity of single-source data, thereby significantly enhancing the fidelity and controllability of cross-view remote sensing image generation. To summarize, the main contributions of this paper are as follows:

(1): We propose a geometry-constrained controllable generation model called GCCG-RSI. This model mitigates geometric structural inaccuracies arising from the inherent randomness of remote sensing image generation.
(2): We design a dual remote sensing image feature fusion module that leverages an attention mechanism to facilitate mutual guidance and information complementarity between the ground image and point clouds. The approach effectively enhances the realism and geometric fidelity of the generated images by incorporating fused features as control conditions into the diffusion model.
(3): We conduct a comprehensive experimental evaluation on two datasets across diverse environments. The experimental results demonstrate that our proposed method robustly and consistently generates remote sensing images, thereby serving as valuable references for downstream tasks.

2. Related Work

2.1. Image Generation in Cross-View Localization

Image generation-based methods have demonstrated distinct advantages in the field of cross-view localization [15,16]. These approaches learn the mapping between images from different perspectives, facilitating transformation from the source to the target view. Regmi et al. [17] introduced XFork and X-Sequence, two generative adversarial network (GAN) based models that synthesize scene images and their corresponding segmentation images, thereby mitigating domain disparities arising from varying perspectives. Zhao et al. [18] employed hemisphere projection to transform ground panoramic images into geometric representations approximating satellite perspectives. By integrating image synthesis and retrieval tasks into an end-to-end trainable multitasking architecture, they pioneered the solution to the problem of top-down perspective conversion from street view to satellite images. Wu et al. [19] proposed PanoGAN, an adversarial feedback GAN framework. It enhances generation performance by feeding the discriminator’s feature responses back to the generator.

Shi et al. [20] enhanced information interaction across domains by explicitly establishing geometric correspondences between satellite-view and street-view images. At the core of this method lies the satellite-to-street projection (S2SP) module. It first estimates the height probability distribution, constructs a satellite-view multi-plane image (MPI), transforms it into a street-view MPI, and finally renders the street-view image. Li et al. [21] investigated effective strategies for leveraging unlabeled data in large-scale cross-view localization, encompassing both unsupervised and semi-supervised settings. They proposed an unsupervised framework that incorporates cross-view projection to facilitate the retrieval of initial pseudo-labels, along with a fast reordering mechanism. Overall, these approaches mitigate cross-view domain discrepancies by simulating visual projection [22]. However, in scenarios characterized by extreme perspective variations or complex scenes, the generated images may struggle to faithfully reconstruct all the intricate textures and geometric details of the true target perspective.

2.2. Diffusion Model for Image Generation

Diffusion models [23,24] have emerged as a prominent class of generative models, achieving excellent performance in sample quality across diverse image generation benchmarks [25], such as class-conditional image generation [26], text-to-image generation [23], and image-to-image translation [27]. Dhariwal et al. [28] proposed ADM-G, enabling diffusion models to incorporate class label conditioning. In the framework, gradients from a classifier trained on noisy images can be incorporated into the image during the sampling process. Ho et al. [29] introduced a classifier-free training and sampling strategy that involves interpolating between the predictions of a diffusion model with and without conditional input. To accelerate training and sampling efficiency, Rombach et al. [23] proposed the latent diffusion model (LDM), in which images are first compressed to a lower resolution, followed by denoising training in the latent space. These models can be conditioned on diverse inputs, such as images [23], depth maps, edges, poses [30], or text [31], demonstrating highly impressive performance. However, the necessity of conducting a large number of sampling steps (typically 50) during inference to generate high-quality samples has constrained their deployment in real-time applications and limited their broader applicability.

In addition, recent advancements in diffusion models have significantly propelled the field of satellite image synthesis, addressing challenges ranging from geometric misalignment to large-scale controllable generation. Lin et al. [13] proposed a geometry-guided cross-view diffusion framework that establishes explicit 3D geometric correspondences to mitigate pose ambiguity. By leveraging diffusion models to capture the intrinsic one-to-many mapping nature of the task, the approach enables diverse and geometrically consistent image generation in both satellite-to-ground and ground-to-satellite directions. Arrabi et al. [32] introduced the two-stage Geometric Preserving Ground-to-Aerial (GPG2A) framework that bridges the domain gap by predicting Bird’s Eye View (BEV) layouts from ground images to synthesize diffusion-based aerial imagery conditioned on both geometric layouts and textual descriptions. Ye et al. [33] developed SkyDiffusion, which leverages a Curved-BEV transformation and multi-to-one mapping strategy within a diffusion framework to effectively handle occlusion challenges in dense urban environments. Yu et al. [34] proposed MetaEarth, a resolution-guided self-cascading generative framework designed to synthesize global-scale, multi-resolution, and unbounded remote sensing imagery. By employing novel noise sampling strategies, the model achieves seamless tile stitching while maintaining high fidelity. Concurrently, Liu et al. [35] introduced Text2Earth, a diffusion model comprising 1.3 billion parameters, which was trained on the extensive Git-10M dataset. The approach incorporates resolution guidance and dynamic condition-adaptation mechanisms to enable versatile applications, thereby establishing new benchmarks for text-driven remote-sensing image generation.

3. Materials and Methods

As illustrated in Figure 2, GCCG-RSI comprises three modules, including image-point clouds geometric projection (IP-GP), dual remote sensing image feature fusion (DRF), and geometric-constrained conditional diffusion model (GCDM). Specifically, IP-GP projects the ground image and point clouds onto the aerial perspective via geometric transformations to synthesize preliminary remote sensing images. Subsequently, DRF employs the attention mechanism to fuse features from the two remote sensing images, yielding a unified feature representation that incorporates both global context and precise geographic information. On this basis, GCDM leverages the fused features as conditional control signals to guide the diffusion model in generating remote sensing images that are not only geometrically consistent with the real structure but also possess texture authenticity.

3.1. Image-Point Clouds Geometric Projection

In our method, IP-GP serves as the critical geometric and appearance alignment stage. Its primary objective is to bridge the substantial disparity between ground and satellite perspectives, thereby providing accurate and informative guidance for the subsequent diffusion model. This module constructs two complementary RSI representations, CP-RSI and I-RSI, by deeply integrating the image and the point clouds.

The synthesis of CP-RSI aims to integrate the rich texture information of ground images with the precise geometric structure of point clouds, thereby generating an RSI representation that combines accurate spatial information and realistic visual appearance. The implementation of this process requires prior time synchronization and coordinate system calibration between the LiDAR and the camera. The camera intrinsic parameter matrix $K$ and the transformation matrix from the LiDAR coordinate system to the camera coordinate system, i.e., the extrinsic matrix $E$, are obtained based on the joint calibration. The definitions are as follows:

$$K = \begin{bmatrix} f_{x} & 0 & u_{0} \\ 0 & f_{y} & v_{0} \\ 0 & 0 & 1 \end{bmatrix}, \quad E = \begin{bmatrix} R_{3 \times 3} & t_{3 \times 1} \\ 0_{1 \times 3} & 1 \end{bmatrix},$$

where $K$ denotes the camera intrinsic parameter matrix, and $f_{x}$ and $f_{y}$ represent the focal lengths in pixels along the x and y axes, respectively. $(u_{0}, v_{0})$ specifies the principal point coordinates. $E$ represents the extrinsic matrix, which encodes the rigid body transformation from the LiDAR coordinate system to the camera coordinate system. $R_{3 \times 3}$ denotes the 3 × 3 rotation matrix, $t_{3 \times 1}$ denotes the 3 × 1 translation vector, and $0_{1 \times 3}$ indicates the zero vector.

For a point $p_{l} = (x, y, z)$ in the LiDAR coordinate system, its corresponding coordinates in the camera coordinate system are denoted as $p_{c} = (x_{c}, y_{c}, z_{c})$. $p_{l}$ is first converted into homogeneous coordinates and then transformed into the camera coordinate system to derive $p_{c}$. The formula is as follows:

$$p_{c} = R_{3 \times 3} \, p_{l} + t_{3 \times 1}, \quad \text{i.e.,} \quad \begin{bmatrix} x_{c} \\ y_{c} \\ z_{c} \\ 1 \end{bmatrix} = \begin{bmatrix} R_{3 \times 3} & t_{3 \times 1} \\ 0_{1 \times 3} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix},$$

where $t_{3 \times 1}$ is the translation vector of the extrinsic matrix. The orientation of the camera determines the new Z-axis direction. Leveraging the pinhole camera model, the three-dimensional (3D) point, expressed in camera coordinates as $p_{c}$, is projected onto the normalized imaging plane, yielding dimensionless normalized coordinates $(x^{'}, y^{'}) = (x_{c} / z_{c}, \, y_{c} / z_{c})$. The normalized coordinates are then transformed into discrete pixel coordinates $(u, v)$ on the image based on the camera intrinsic matrix $K$:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_{x} & 0 & u_{0} \\ 0 & f_{y} & v_{0} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x^{'} \\ y^{'} \\ 1 \end{bmatrix}.$$

For the point cloud data, each point $(x, y, z)$ is traversed to calculate its corresponding image pixel coordinates based on the above formula. If the calculated coordinates fall within the valid image boundaries, the RGB values of the respective pixel are extracted and assigned to the 3D point, resulting in a colored point cloud represented by a six-dimensional vector $(x, y, z, r, g, b)$.
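The colorization step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `lidar_to_pixel` and `colorize_points` are hypothetical, the image is assumed to be a list of rows of `(r, g, b)` tuples, and points behind the camera ($z_{c} \le 0$) are assumed to be discarded, a check the text does not state explicitly.

```python
import math


def lidar_to_pixel(p_l, R, t, K):
    """Map a LiDAR-frame point (x, y, z) to pixel coordinates (u, v).

    Returns None when the point lies behind the camera (assumed filter).
    """
    # Rigid transform into the camera frame: p_c = R p_l + t
    x_c, y_c, z_c = [
        sum(R[row][col] * p_l[col] for col in range(3)) + t[row]
        for row in range(3)
    ]
    if z_c <= 0:
        return None
    # Normalized image plane, then the intrinsic matrix K
    x_n, y_n = x_c / z_c, y_c / z_c
    u = K[0][0] * x_n + K[0][2]
    v = K[1][1] * y_n + K[1][2]
    return u, v


def colorize_points(points, image, R, t, K):
    """Attach RGB values from `image` to every point that lands inside it."""
    height, width = len(image), len(image[0])
    colored = []
    for p_l in points:
        uv = lidar_to_pixel(p_l, R, t, K)
        if uv is None:
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= col < width and 0 <= row < height:
            r, g, b = image[row][col]
            colored.append((p_l[0], p_l[1], p_l[2], r, g, b))
    return colored
```

Points that project outside the image are simply dropped, which matches the boundary check described in the text.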

Finally, the colored point clouds are projected onto a two-dimensional remote sensing image. The processing region is defined as a 40 × 40 m rectangular area centered on the robot, yielding an RSI with a resolution of 512 × 512 pixels. During the projection, each 3D point is mapped to its corresponding pixel in the RSI based on its coordinates. When multiple points are projected to the same pixel, only the color of the point with the largest z-coordinate is retained, yielding a clear and realistic CP-RSI.
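A minimal sketch of this rasterization is given below, assuming a 40 m extent mapped onto 512 × 512 pixels (about 0.078 m per pixel) with the robot at the image center. The axis convention (x increasing to the right, y increasing upward) and the name `rasterize_bev` are assumptions made for illustration; the paper does not specify them.

```python
import math


def rasterize_bev(colored_points, extent_m=40.0, size_px=512):
    """Project (x, y, z, r, g, b) points onto a top-down RGB grid.

    When several points fall into one pixel, the one with the largest z wins.
    """
    res = extent_m / size_px          # metres per pixel
    half = extent_m / 2.0
    top = {}                          # (row, col) -> (z, (r, g, b))
    for x, y, z, r, g, b in colored_points:
        col = math.floor((x + half) / res)
        row = math.floor((half - y) / res)
        if not (0 <= row < size_px and 0 <= col < size_px):
            continue                  # outside the processing region
        if (row, col) not in top or z > top[(row, col)][0]:
            top[(row, col)] = (z, (r, g, b))
    image = [[(0, 0, 0)] * size_px for _ in range(size_px)]
    for (row, col), (_, rgb) in top.items():
        image[row][col] = rgb
    return image
```

Keeping the highest point per pixel approximates what an overhead sensor would see, since lower surfaces are hidden by whatever lies above them.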

For the generation of I-RSI, we follow the projection method proposed by Wang et al. [36]. This approach leverages spherical geometry and establishes the correspondence between remote sensing image and ground image pixels via a virtual camera model. The projection formulation is expressed as follows:

$$\begin{cases} u_{p} = \left[ 1 - \operatorname{arctan2} \left( W_{b} / 2 - u_{b}, \; H_{b} / 2 - v_{b} \right) / \pi \right] W_{p} / 2, \\ v_{p} = \left[ 0.5 - \operatorname{arctan2} \left( - f, \; \sqrt{(W_{b} / 2 - u_{b})^{2} + (H_{b} / 2 - v_{b})^{2}} \right) / \pi \right] H_{p}, \end{cases} \qquad (1)$$

where $H_{p} \times W_{p}$ denotes the size of the original image, and $H_{b} \times W_{b}$ represents the size of the generated RSI following spherical transformation. The pixel coordinates on the imaging plane of satellite perspective are denoted as $(u_{b}, v_{b})$. By applying the above transformation to each pixel in the ground image, the I-RSI can be generated.
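Equation (1) translates directly into code. The sketch below evaluates the mapping for a single RSI pixel; the function name `bev_to_ground_pixel` is hypothetical, and `f` is passed through as a free parameter because the text does not define it.

```python
import math


def bev_to_ground_pixel(u_b, v_b, W_b, H_b, W_p, H_p, f):
    """Evaluate Equation (1): RSI pixel (u_b, v_b) -> ground-image pixel (u_p, v_p)."""
    dx = W_b / 2 - u_b
    dy = H_b / 2 - v_b
    # Azimuth around the RSI centre selects the horizontal position
    u_p = (1 - math.atan2(dx, dy) / math.pi) * W_p / 2
    # Angle below the horizon selects the vertical position
    v_p = (0.5 - math.atan2(-f, math.hypot(dx, dy)) / math.pi) * H_p
    return u_p, v_p
```

For a pixel directly above the RSI center (`dx = 0`, `dy > 0`), the azimuth term vanishes and `u_p` equals `W_p / 2`. With a positive `f`, `v_p` falls in the lower half of the ground image, consistent with ground content lying below the horizon.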

3.2. Dual Remote Sensing Images Feature Fusion

The module of DRF constitutes a core component of GCCG-RSI, designed to achieve effective fusion of CP-RSI and I-RSI. Specifically, encoders process the two RSIs to generate preliminary feature images, denoted as $F_{cp}$ from CP-RSI and $F_{i}$ from I-RSI. The design motivation originates from the inherent complementarity between these two modalities in scene representation. The joint feature representation that synthesizes precise geometric structures with rich semantic textures is constructed through the deep integration of the two modalities, thereby providing high-quality and strongly constrained control signals.

To facilitate effective feature integration, DRF employs the attention-based fusion that adaptively prioritizes salient features, establishes contextual dependencies, and learns optimal fusion weights in a data-driven manner. It consists of two primary branches, including a cross-attention branch and a self-attention branch. The detailed architecture of the module is illustrated in Figure 3.

3.2.1. Cross Attention Branch

DRF employs a bidirectional interactive design to achieve deep fusion between $F_{cp}$ and $F_{i}$, which exhibit consistency and represent complementary features. Initially, the feature images $F_{cp}$ and $F_{i}$ are linearly projected to generate their respective query, key, and value tensors:

$$\begin{matrix} Q_{cp} = F_{cp} W_{cp}^{Q}, & K_{cp} = F_{cp} W_{cp}^{K}, & V_{cp} = F_{cp} W_{cp}^{V}, \\ Q_{i} = F_{i} W_{i}^{Q}, & K_{i} = F_{i} W_{i}^{K}, & V_{i} = F_{i} W_{i}^{V}, \end{matrix}$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ denote learnable projection weight matrices. Subsequently, the mechanism facilitates bidirectional information interaction via two parallel cross-attention modules to fully exploit the complementary information between the two feature representations. In the first module, attention is directed from I-RSI to CP-RSI, where $F_{i}$ serves as the source of $Q_{i}$, and $F_{cp}$ supplies $K_{cp}$ and $V_{cp}$. This design enables $F_{i}$ to actively retrieve and aggregate the most relevant structural contextual information from $F_{cp}$, thereby providing structural guidance and correction for the I-RSI features. Subsequently, DRF aggregates $Q_{i}$, $K_{cp}$, and $V_{cp}$ via a multi-head cross-attention mechanism (MHCA) to obtain the output $F_{1}$, computed as follows:

$$F_{1} = \mathrm{MHCA} (Q_{i}, K_{cp}, V_{cp}).$$

The second cross-attention module performs attention from CP-RSI to I-RSI. In this module, $F_{cp}$ serves as the source of $Q_{cp}$, whereas $F_{i}$ provides $K_{i}$ and $V_{i}$. This mechanism enables each feature vector in CP-RSI to retrieve and aggregate the corresponding detailed texture information from the feature space in I-RSI. Consequently, the rich texture is integrated into the abstract geometric structures. The output $F_{2}$ of this module is computed as follows:

$$F_{2} = \mathrm{MHCA} (Q_{cp}, K_{i}, V_{i}).$$

Finally, two complementary feature images $F_{1}$ and $F_{2}$ are concatenated to yield the feature $F_{cross}$, which encapsulates bidirectional interactive cross-attention information with enhanced representational capacity:

$$F_{cross} = F_{1} \oplus F_{2}.$$
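The bidirectional design can be illustrated with a small single-head sketch in plain Python. It is a simplification under stated assumptions: the paper uses multi-head attention, whereas the code below uses one head with standard scaled dot-product attention, feature images are flattened to token lists of shape N × C, and the helper names (`attention`, `cross_attention_branch`) and the dictionary layout of the projection weights are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, single head."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def cross_attention_branch(f_cp, f_i, w_cp, w_i):
    """Return F_cross = F_1 (+) F_2 for token matrices f_cp and f_i (N x C)."""
    q_cp, k_cp, v_cp = (matmul(f_cp, w_cp[name]) for name in ("Q", "K", "V"))
    q_i, k_i, v_i = (matmul(f_i, w_i[name]) for name in ("Q", "K", "V"))
    f_1 = attention(q_i, k_cp, v_cp)   # I-RSI queries read CP-RSI structure
    f_2 = attention(q_cp, k_i, v_i)    # CP-RSI queries read I-RSI texture
    # Concatenate along the channel dimension, token by token
    return [row_1 + row_2 for row_1, row_2 in zip(f_1, f_2)]
```

Because each output token is a convex combination of value vectors from the other modality, $F_{1}$ stays within the span of the CP-RSI values and $F_{2}$ within the span of the I-RSI values; the concatenation then doubles the channel count.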

3.2.2. Self Attention Branch

In the self-attention branch, the input features $F_{cp}$ and $F_{i}$ are initially concatenated along the channel dimension to generate a preliminary fused feature $F_{icp}$:

$$F_{icp} = F_{i} \oplus F_{cp}.$$

To capture long-range dependencies and global contextual relationships within the mixed features, $F_{icp}$ serves simultaneously as the queries, keys, and values for a multi-head self-attention (MHSA) module. This module leverages the self-attention mechanism to capture intrinsic correlations and inter-dependencies between the two distinct feature sources. The output feature $F_{self}$ is computed as follows:

$$F_{self} = \mathrm{MHSA} (F_{icp}).$$

Subsequently, the features $F_{cross}$ and $F_{self}$ are normalized using layer normalization (LN) to stabilize training dynamics and unify feature scales. The normalized features are then concatenated along the channel dimension to yield the preliminary fusion output $F_{cs}$ of the DRF module:

$$F_{cs} = \mathrm{LN} (F_{cross}) \oplus \mathrm{LN} (F_{self}).$$

The concatenation substantially increases the number of feature channels, necessitating the dimensionality reduction of $F_{cs}$ to mitigate model complexity and computational overhead. To address this issue, a $1 \times 1$ convolutional layer is employed to compress the channels of $F_{cs}$, yielding the final output feature $F_{fused}$. This feature integrates cross-modal complementary information captured by the cross-attention mechanism with internal global context information enhanced by the self-attention mechanism. Consequently, $F_{fused}$ emerges as a fused RSI feature characterized by accurate geometric structure, rich texture details, and robust representation capability. It provides high-quality and geometrically controllable conditional information for subsequent diffusion models.
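The final normalization, concatenation, and channel compression can be sketched as follows. The code assumes features stored as one channel vector per spatial location, layer normalization without learnable scale and shift, and a $1 \times 1$ convolution without bias; these are simplifying assumptions, and the names `layer_norm`, `conv1x1`, and `fuse` are illustrative only.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one channel vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def conv1x1(tokens, weight):
    """A 1 x 1 convolution is a per-location linear map over channels.

    `weight` has shape out_channels x in_channels.
    """
    return [
        [sum(w * x for w, x in zip(w_row, tok)) for w_row in weight]
        for tok in tokens
    ]


def fuse(f_cross, f_self, weight):
    """F_cs = LN(F_cross) (+) LN(F_self), then compress channels to F_fused."""
    f_cs = [layer_norm(a) + layer_norm(b) for a, b in zip(f_cross, f_self)]
    return conv1x1(f_cs, weight)
```

Since the $1 \times 1$ kernel never mixes neighboring locations, the compression changes only the channel count and leaves the spatial layout of the fused feature untouched.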

3.3. Geometric-Constrained Conditional Diffusion Model

The fundamental mechanism of image diffusion models involves a progressive denoising process designed to reconstruct image samples that align with the training data distribution from random noise. The denoising procedure is typically conducted in the latent space following encoding to enhance computational efficiency. This paradigm has been employed in representative methods such as Stable Diffusion [23]. In the proposed method, a pre-trained latent diffusion model is utilized as the foundational framework.

Image diffusion models progressively denoise images and generate samples from the training domain. The model consists of two processes: forward diffusion and reverse denoising. In the forward process, the diffusion model gradually adds Gaussian noise to the latent representation $z_{0}$ of a ground-truth image according to a predetermined schedule $\beta_{1}, \beta_{2}, \dots, \beta_{T}$:

$$q (z_{t} \mid z_{t - 1}) = \mathcal{N} (z_{t}; \sqrt{1 - \beta_{t}} \, z_{t - 1}, \beta_{t} I),$$

where $z_{t}$ denotes the noisy latent representation at step $t$. The generation of RSI corresponds to the inverse of the above forward process, accomplished through a series of learnable denoising steps. This process can be formalized as $p_{\theta} (z_{t - 1} \mid z_{t})$, where a denoising network $\epsilon_{\theta}$ parameterized by $\theta$ is trained to predict the noise $\epsilon$ present in the current noisy latent representation $z_{t}$. Upon training completion, the model can generate the RSI by starting from pure noise $z_{T}$ and performing $T$ denoising iterations. In each iteration, the trained $\epsilon_{\theta}$ predicts the noise, and the latent representation is updated following the DDIM sampling algorithm [37]. This iterative procedure continues until a clear latent representation $z_{0}$ is obtained, which is subsequently reconstructed into a high-quality remote sensing image $x_{0} = D (z_{0})$ via the decoder $D$.
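The forward process admits the well-known closed form $z_{t} = \sqrt{\bar{\alpha}_{t}} z_{0} + \sqrt{1 - \bar{\alpha}_{t}} \, \epsilon$ with $\bar{\alpha}_{t} = \prod_{s \le t} (1 - \beta_{s})$, which is what allows training to jump to an arbitrary step $t$. The sketch below illustrates it on a flat list of latent values. The linear schedule endpoints are common defaults chosen here for illustration and are not taken from the paper; the function names are hypothetical.

```python
import math


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule beta_1..beta_T (assumed values, T >= 2)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * s for s in range(T)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), stored at index t - 1."""
    abars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        abars.append(prod)
    return abars


def q_sample(z0, t, abars, eps):
    """Draw z_t directly from z_0 using the noise sample `eps` (1 <= t <= T)."""
    a = abars[t - 1]
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
```

With `eps` set to zero the result reduces to $\sqrt{\bar{\alpha}_{t}} z_{0}$, the same signal attenuation obtained by applying the single-step mean $\sqrt{1 - \beta_{t}} \, z_{t - 1}$ repeatedly.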

To enable controllable generation of geometric structures in RSI, our method leverages the principle of ControlNet [38] and proposes a generation framework that employs fused RSI features as the control condition. Specifically, the feature $F_{fused}$ derived from DRF serves as the primary control condition. The approach injects geometric, structural, and appearance information into the generation process through latent-space alignment and feature-fusion mechanisms. In particular, I-RSI and CP-RSI are projected into conditional embeddings aligned with the latent space of the diffusion model via the dedicated conditional encoding network.

Subsequently, the two feature representations are deeply integrated via DRF. Inspired by the architecture of ControlNet, the structure and weights of the encoder and intermediate blocks are replicated from the pre-trained Stable Diffusion model to construct a conditional control branch. Within this branch, conditional embeddings are processed via the zero convolution and incorporated as residuals into the corresponding layers of the U-Net. The zero convolution initialization ensures that the control branch output remains approximately zero during the initial training phase, thereby preserving the generative capability of the pre-trained main model and facilitating stable training.
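The effect of zero convolution can be shown with a toy example: when the weights and bias of the $1 \times 1$ convolution start at zero, the control residual vanishes and the main branch output passes through unchanged. The sketch below is an illustration only, with features stored as one channel vector per location and with hypothetical function names; it does not reproduce the actual ControlNet layers.

```python
def conv1x1(tokens, weight, bias):
    """Per-location linear map; `weight` is out_channels x in_channels."""
    return [
        [sum(w * x for w, x in zip(w_row, tok)) + b for w_row, b in zip(weight, bias)]
        for tok in tokens
    ]


def add_control_residual(main_feat, control_feat, weight, bias):
    """Add the zero-convolved control features to the main U-Net features."""
    residual = conv1x1(control_feat, weight, bias)
    return [
        [m + r for m, r in zip(main_tok, res_tok)]
        for main_tok, res_tok in zip(main_feat, residual)
    ]


def zero_init(out_channels, in_channels):
    """Zero-initialized weight matrix and bias vector."""
    return [[0.0] * in_channels for _ in range(out_channels)], [0.0] * out_channels
```

At initialization, `add_control_residual` therefore returns the main features exactly, so the pre-trained model behaves as before; the control signal gains influence only as gradient updates move the weights away from zero.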

For the given training data, the forward process of the diffusion model progressively introduces noise into the ground-truth satellite image. After $t$ noising steps, a noisy satellite image $S_{t}$ is obtained. Specifically, the proposed model learns to perform denoising conditioned on three factors: the time step $t$, $F_{cp}$, and $F_{i}$. The $F_{cp}$ and $F_{i}$ representations are processed to construct a unified conditional representation $F_{fused}$, which guides the model in predicting the noise added to $S_{t}$. The training objective of the network is defined as follows:

$$\mathcal{L} = \mathbb{E}_{S_{t}, t, F_{cp}, F_{i}, \epsilon \sim \mathcal{N} (0, I)} \left[ \left\| \epsilon - \epsilon_{\theta} (S_{t}, t, F_{fused}) \right\|_{2}^{2} \right].$$

To elucidate the model training and inference processes, we present a formal description via pseudocode in Algorithm 1.

Algorithm 1 Controlled Latent Diffusion for Cross-View Satellite Image Generation

Require: Ground-truth satellite image $x_{0}$, feature images $F_{cp}$ and $F_{i}$, fused feature $F_{fused}$

Parameters: Encoder $E$, decoder $D$, denoising U-Net $\epsilon_{\theta}$ with ControlNet branch, noise schedule $\{\beta_{t}\}_{t = 1}^{T}$, learning rate $\eta$

Training phase:

1: $z_{0} \leftarrow E (x_{0})$
2: Sample $t \sim \mathrm{Uniform} (\{1, 2, \dots, T\})$
3: Sample $\epsilon \sim \mathcal{N} (0, I)$
4: $\alpha_{t} \leftarrow 1 - \beta_{t}$, $\bar{\alpha}_{t} \leftarrow \prod_{s = 1}^{t} \alpha_{s}$
5: $z_{t} \leftarrow \sqrt{\bar{\alpha}_{t}} \, z_{0} + \sqrt{1 - \bar{\alpha}_{t}} \, \epsilon$
6: $\hat{\epsilon} \leftarrow \epsilon_{\theta} (z_{t}, t, F_{fused})$
7: $\mathcal{L} \leftarrow \| \epsilon - \hat{\epsilon} \|_{2}^{2}$
8: $\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}$

Inference phase:

1: $z_{T} \sim \mathcal{N} (0, I)$
2: for $t = T, T - 1, \dots, 1$ do
3: $\hat{\epsilon} \leftarrow \epsilon_{\theta} (z_{t}, t, F_{fused})$
4: $z_{t - 1} \leftarrow \mathrm{DDIMSampler} (z_{t}, \hat{\epsilon}, t, \beta_{t})$
5: end for
6: $x_{0} \leftarrow D (z_{0})$
7: return $x_{0}$
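The inference loop of Algorithm 1 can be sketched with the deterministic DDIM update ($\eta = 0$ in the DDIM formulation). In the sketch below, `predict_noise` is a stand-in for the trained, condition-aware network $\epsilon_{\theta} (z_{t}, t, F_{fused})$, latents are flat lists of floats, and every step of the schedule is visited; practical DDIM sampling usually skips steps. The function names and these simplifications are assumptions for illustration.

```python
import math


def ddim_step(z_t, eps_hat, abar_t, abar_prev):
    """One deterministic DDIM update from step t to step t - 1."""
    # Estimate the clean latent implied by the predicted noise
    z0_pred = [
        (z - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        for z, e in zip(z_t, eps_hat)
    ]
    # Re-noise the estimate to the (lower) noise level of step t - 1
    return [
        math.sqrt(abar_prev) * p + math.sqrt(1.0 - abar_prev) * e
        for p, e in zip(z0_pred, eps_hat)
    ]


def ddim_sample(z_T, abars, predict_noise):
    """Run t = T, ..., 1; `abars[t - 1]` holds alpha_bar_t, and alpha_bar_0 = 1."""
    z = z_T
    for t in range(len(abars), 0, -1):
        abar_prev = abars[t - 2] if t > 1 else 1.0
        eps_hat = predict_noise(z, t)
        z = ddim_step(z, eps_hat, abars[t - 1], abar_prev)
    return z
```

A useful sanity check is to pass an oracle predictor that returns the true noise: the loop then recovers the original latent up to floating-point error, which confirms that the update is the exact inverse of the closed-form noising under a perfect noise estimate.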

4. Results

4.1. Experimental Data and Evaluation Metrics

Experiments are conducted on the KITTI [39] and Ford Multi-AV [40] datasets. The KITTI dataset was acquired using a vehicle platform equipped with a 64-channel LiDAR, covering urban, rural, and highway scenarios. The Ford Multi-AV dataset comprises point clouds gathered from four 32-channel LiDAR units over multiple traversals at various times, seasons, and weather conditions. Following the data augmentation strategy proposed by Shi et al. [41] with satellite imagery, we established a training set of region-specific samples. Furthermore, two dedicated testing sets are constructed: the same subset for intra-domain evaluation, maintaining geographical consistency with the training region, and the cross subset to rigorously assess cross-domain generalization performance by incorporating distinct regions. For the remote sensing image generation task in this paper, we define the spatial extent of the ground-truth satellite images as 40 × 40 m. This extent aligns with the FoV area projected from the ground image onto the remote sensing image. To account for the limited FoV angle in ground images, we impose a binary mask on the ground-truth satellite images, thereby simulating a restricted FoV angle with a range of 84.5°.

For the experimental evaluation, this paper employs a group of widely used image quality assessment metrics in the field of image generation [42,43] to comprehensively assess the content consistency and visual realism of generated images. The metrics include structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS), and Fréchet inception distance (FID). PSNR quantifies reconstruction accuracy at the pixel level by computing the mean squared error (MSE) between two images, serving as a fundamental metric for assessing image fidelity. The formula is expressed as follows:

\mathrm{PSNR} = 10 \cdot \log_{10} \frac{V_{max}^{2}}{\mathrm{MSE}},

where $V_{max}$ denotes the upper bound of the dynamic range of pixel values. MSE is employed to quantify the pixel-level discrepancy between the generated and the ground-truth images, expressed as:

\mathrm{MSE} = \frac{1}{m n} \sum_{i = 0}^{m - 1} \sum_{j = 0}^{n - 1} {[I (i, j) - K (i, j)]}^{2},

where $m n$ denotes the total number of pixels in the image, and $I(i, j)$ and $K(i, j)$ represent the pixel values of the ground-truth and generated images at location $(i, j)$, respectively. A higher PSNR value indicates less image distortion and therefore better reconstruction quality. SSIM evaluates image similarity across three dimensions: luminance, contrast, and structure. Compared with PSNR, it aligns more closely with the perceptual characteristics of the human visual system. The general form of SSIM for the generated image x and the real image y is defined as follows:

\mathrm{SSIM}(x, y) = {[l (x, y)]}^{α} \cdot {[c (x, y)]}^{β} \cdot {[s (x, y)]}^{γ},

where $l(x, y)$, $c(x, y)$, and $s(x, y)$ denote the luminance, contrast, and structure comparison functions, respectively, and $α$, $β$, and $γ$ are parameters adjusting the relative weights of each component. The formulas of these comparison functions are given as follows:

l (x, y) = \frac{2 μ_{x} μ_{y} + c_{1}}{μ_{x}^{2} + μ_{y}^{2} + c_{1}}, \quad c (x, y) = \frac{2 σ_{x} σ_{y} + c_{2}}{σ_{x}^{2} + σ_{y}^{2} + c_{2}}, \quad s (x, y) = \frac{σ_{x y} + c_{3}}{σ_{x} σ_{y} + c_{3}} .

In the formula, $μ_{x}$ and $μ_{y}$ denote the local pixel means of images x and y, respectively, serving as estimates of brightness. $σ_{x}$ and $σ_{y}$ represent the standard deviations, utilized to quantify contrast. $σ_{x y}$ indicates the covariance between the two images, employed to assess structural similarity. The constants $c_{1}$, $c_{2}$, and $c_{3}$ are small positive values introduced to prevent division by zero. The SSIM index is at most 1 and, for natural images, typically lies between 0 and 1, with higher values corresponding to greater structural similarity between the two images. LPIPS leverages a deep convolutional neural network to extract high-level image features and quantifies perceptual discrepancies between image patches within the feature space. This metric effectively models human perception regarding image quality and semantic similarity, making it suitable for assessing the structural fidelity of generated images. Given a pair of images $x$ and $x_{0}$, the LPIPS distance is formulated as follows:

d (x, x_{0}) = \sum_{l} \frac{1}{H_{l} W_{l}} \sum_{h, w} \| w_{l} ⊙ (\hat{y}_{l}^{h w} - \hat{y}_{0 l}^{h w}) \|_{2}^{2},

where l denotes the l-th layer of the network, $H_{l}$ and $W_{l}$ are the height and width of its feature map, and $(h, w)$ represents the positional index in two-dimensional space. $\hat{y}_{l}$ and $\hat{y}_{0 l}$ are the normalized feature maps of $x$ and $x_{0}$, and $w_{l}$ represents the channel weights learned from large-scale human perception data. A lower LPIPS value indicates that the two images are more perceptually similar. FID provides a comprehensive evaluation of the visual authenticity and diversity of generated images by quantifying the distributional discrepancy in the feature space. Specifically, FID computes the Fréchet distance between two multivariate Gaussian distributions fitted to the feature vectors of real and generated images, respectively:

\mathrm{FID} = \| μ_{r} - μ_{g} \|^{2} + \mathrm{Tr} (Σ_{r} + Σ_{g} - 2 (Σ_{r} Σ_{g})^{1 / 2}) .

Here, $μ_{r}$ and $μ_{g}$ denote the mean vectors of the real and generated features, respectively, while $Σ_{r}$ and $Σ_{g}$ represent their corresponding covariance matrices. $\mathrm{Tr}(\cdot)$ indicates the trace of a matrix. A lower FID score signifies that the distribution of the generated images more closely approximates that of the real images, reflecting superior realism and enhanced diversity. Considering that PSNR and SSIM assess only pixel-level similarity, and LPIPS lacks prior knowledge specific to satellite imagery, these metrics alone are insufficient for evaluating the similarity between satellite images. Consequently, this paper adopts the viewpoint similarity metric $Sim_{s}$, as proposed in GPG2A [32], to evaluate the similarity between the ground-truth and generated images:

Sim_{s} = \frac{1}{N} \sum_{i = 1}^{N} \frac{2 - 2 \times (f_{a} \cdot \hat{f}_{a})}{4},

where N denotes the number of samples, and $f_{a}$ and $\hat{f}_{a}$ represent the L2-normalized features extracted by SAFA [44] from the real and generated images, respectively. As the generated image exhibits greater realism and geometric consistency with the real image, the distance between $f_{a}$ and $\hat{f}_{a}$ diminishes, resulting in a reduced $Sim_{s}$ score; lower values are therefore better.
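The PSNR, FID, and $Sim_{s}$ definitions above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the evaluation code used in the paper: the function names are hypothetical, the feature arrays are random stand-ins for Inception and SAFA features, and the trace of the matrix square root is obtained from the eigenvalues of $Σ_{r} Σ_{g}$, which is valid for positive semi-definite covariance matrices. Note that for unit vectors $2 - 2 (f_{a} \cdot \hat{f}_{a})$ equals the squared Euclidean distance between them, so $Sim_{s}$ lies in [0, 1].

```python
import numpy as np

def psnr(gt, gen, v_max=255.0):
    """PSNR in dB from the MSE between two same-sized images."""
    mse = np.mean((gt.astype(np.float64) - gen.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(v_max ** 2 / mse)

def fid_from_features(feat_r, feat_g):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    sig_r = np.cov(feat_r, rowvar=False)
    sig_g = np.cov(feat_g, rowvar=False)
    # Tr((S_r S_g)^(1/2)) is the sum of square roots of the eigenvalues of S_r S_g.
    eig = np.linalg.eigvals(sig_r @ sig_g).real
    tr_sqrt = np.sqrt(np.clip(eig, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sig_r) + np.trace(sig_g) - 2.0 * tr_sqrt)

def sim_s(f_real, f_gen):
    """Mean of (2 - 2 f.f_hat) / 4 over N pairs of feature vectors."""
    f_real = f_real / np.linalg.norm(f_real, axis=1, keepdims=True)
    f_gen = f_gen / np.linalg.norm(f_gen, axis=1, keepdims=True)
    cos = np.sum(f_real * f_gen, axis=1)
    return float(np.mean((2.0 - 2.0 * cos) / 4.0))

# Toy inputs: constant images that differ by 10 grey levels (MSE = 100).
gt = np.full((4, 4), 100, dtype=np.uint8)
gen = np.full((4, 4), 110, dtype=np.uint8)
psnr_val = psnr(gt, gen)

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 4))      # random stand-ins for deep features
fid_same = fid_from_features(feats, feats)
fid_shift = fid_from_features(feats, feats + 3.0)

basis = np.eye(2)
sim_identical = sim_s(basis, basis)
sim_opposite = sim_s(basis, -basis)
```

Identical feature sets give an FID of (numerically) zero, while shifting every feature by 3 in four dimensions leaves the covariance unchanged and adds the squared mean offset. Identical and opposite unit features give the two extremes of $Sim_{s}$.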

The proposed method is implemented on top of the pre-trained Stable Diffusion v1.5 [23], with the parameters of the diffusion decoder unfrozen for training and the classifier-free guidance scale set to 9.0 [29]. In the inference phase, a 50-step DDIM sampling strategy [37] is employed. Regarding data processing, the experiments consistently follow real geographic coordinates and vehicle heading directions. The original satellite images are cropped to a resolution of 512 × 512 pixels to serve as reference ground-truth data. Meanwhile, the ground-collected images are preprocessed to a uniform resolution of 512 × 1024 pixels for model input. In the geometric projection component, the projection parameters, namely the vertical field of view (FoV), pitch angle, and scaling ratio, are set to 17.5°, 1.9°, and 4, respectively, to ensure geometric consistency during projection transformations.
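For readers unfamiliar with the two inference settings, the sketch below shows the usual form of classifier-free guidance with a scale of 9.0 and one common way of choosing 50 sampling timesteps. It is a generic illustration: the paper does not specify which condition is dropped for the unconditional branch or how the timesteps are spaced, and the 1000 training timesteps, the evenly spaced subsequence, and the function names are assumptions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale=9.0):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and beyond) the conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def ddim_timesteps(num_train_steps=1000, num_inference_steps=50):
    """Evenly spaced, descending subsequence of 0-based training timesteps."""
    stride = num_train_steps // num_inference_steps
    return np.arange(0, num_train_steps, stride)[::-1]

eps_u = np.zeros((4, 8, 8))    # stand-in for the prediction without the condition
eps_c = np.ones((4, 8, 8))     # stand-in for the prediction with the condition
eps_guided = cfg_combine(eps_u, eps_c)
steps = ddim_timesteps()
```

A scale of 1.0 reduces to the purely conditional prediction; larger values such as 9.0 push the sample more strongly toward the control condition at the cost of diversity.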

In addition, the number of attention heads in the multi-head attention mechanism is set to 8 to strike a balance between the model’s expressive power and computational efficiency. For quantitative evaluation, the weight parameters for luminance, contrast, and structure in SSIM are all set to 1.0, following the standard SSIM calculation method. $Sim_{s}$ is calculated using the SAFA model pre-trained on the KITTI dataset. All experiments are conducted under a unified hardware environment, employing an RTX 3090 GPU with a fixed batch size of 1, thereby ensuring the comparability and reproducibility of the experimental results. The detailed experimental parameter configuration is presented in Table 1.
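To make the SSIM setting concrete, the sketch below evaluates the three comparison functions with all exponents equal to 1.0. It is a simplification under stated assumptions: statistics are computed over the whole image rather than over local windows, and the constants $c_{1} = (0.01 V_{max})^{2}$, $c_{2} = (0.03 V_{max})^{2}$, and $c_{3} = c_{2} / 2$ are the conventional defaults, which the paper does not state explicitly. The function name `ssim_global` is hypothetical.

```python
import numpy as np

def ssim_global(x, y, alpha=1.0, beta=1.0, gamma=1.0, v_max=255.0):
    """SSIM from global image statistics (no sliding window)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * v_max) ** 2
    c2 = (0.03 * v_max) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sd_x, sd_y = x.std(), y.std()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    con = (2 * sd_x * sd_y + c2) / (sd_x ** 2 + sd_y ** 2 + c2)
    struct = (cov_xy + c3) / (sd_x * sd_y + c3)
    return lum ** alpha * con ** beta * struct ** gamma

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
ssim_same = ssim_global(img, img)
ssim_inverted = ssim_global(img, 255 - img)   # photographic negative
```

An image compared with itself scores 1, while its negative has negative covariance and therefore a negative structure term. With the exponents at 1.0 and $c_{3} = c_{2} / 2$, the contrast and structure terms also collapse into the single factor $(2 σ_{x y} + c_{2}) / (σ_{x}^{2} + σ_{y}^{2} + c_{2})$.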

4.2. Evaluation Results

To assess the performance of the proposed method, comparative experiments were conducted on the KITTI and Ford Multi-AV datasets. The proposed method was benchmarked against SelGAN [45], GPG2A [32], Instr-p2p [46], ControlNet [38], SkyDiffusion [33], and GCC [13]. All comparative models utilized identical ground images as input. The experimental results are summarized in Table 2.

As shown in the experimental results, the proposed method achieves superior performance, thereby validating its overall effectiveness. Specifically, it outperforms all comparative methods across PSNR, SSIM, LPIPS, and FID, demonstrating its advantage in generating high-fidelity and realistic remote sensing images. This improvement is attributed to the incorporation of ground image projection as a control condition. The projection provides preliminary road structure information while preserving the texture and color of objects such as houses, trees, and roads, thereby facilitating the model’s inference of object layout and color in the generated image. In comparison, GCC [13], GPG2A [32], and SelGAN [45] leverage semantic information to assist image generation. However, the semantic images lack realistic textures, and the substantial perspective discrepancy between the ground and satellite images poses challenges for models to accurately predict texture transformations. Consequently, the generated results exhibit lower realism compared to the proposed method. Instr-p2p [46] enhances generation quality by constructing high-quality text-image pairs and employing descriptive text to guide image synthesis. Nevertheless, text-based descriptions remain deficient at capturing detailed image texture features compared with image-based control conditions.

Among the evaluated approaches, GPG2A [32], SkyDiffusion [33], and GCC [13] represent state-of-the-art diffusion-based frameworks specifically tailored for ground-to-satellite image synthesis. Notably, SkyDiffusion [33] and GCC [13] achieve significant performance gains over their counterparts on PSNR, SSIM, LPIPS, and FID, attributable to their advanced architectural designs that better preserve texture consistency during the view transformation process. Compared with these three approaches, the proposed method achieves a better $Sim_{s}$, demonstrating its superiority in restoring accurate road geometric structures. This is primarily attributed to the incorporation of the colored point cloud projection as an additional control condition, which furnishes precise road geometry information. Such a constraint directs the model to accurately infer the structural transformations from ground-level to satellite perspectives, thereby yielding generated images that align with the geometric structure of real satellite imagery. Conversely, GPG2A [32] and GCC [13] employ a network-predicted semantic image as a geometric prior. However, the inherent errors in this prior lead to a degradation in the $Sim_{s}$ metric relative to the proposed method. Moreover, ControlNet [38], SkyDiffusion [33], and Instr-p2p [46] also exhibit inferior performance in terms of the $Sim_{s}$ metric due to the absence of explicit geometric prior guidance.

In addition, a comparison of performance across the two datasets reveals that variations in scenarios inevitably result in performance discrepancies. The Ford Multi-AV dataset, which is dominated by highway scenes, imposes stricter requirements on the model’s capacity to infer geometric structures. Consequently, most comparative methods exhibit a worse $Sim_{s}$ on the Ford Multi-AV dataset than on the KITTI dataset. In contrast, the proposed method leverages DRF to fuse a colored point cloud projection, which provides both accurate geometric and dense texture information as conditional guidance. As a result, its $Sim_{s}$ on the Ford Multi-AV dataset does not degrade and is even better than on the KITTI dataset.

To further validate the effectiveness of the proposed method, qualitative experiments were conducted on the KITTI and Ford Multi-AV datasets. Figure 4 illustrates the remote sensing image generation results obtained by SelGAN [45], Instr-p2p [46], ControlNet [38], SkyDiffusion [33], GCC [13], and the proposed approach. Figure 5 presents the control conditions employed by our method, alongside visualized generation results. As shown in Figure 5, the images generated by our method may exhibit minor pixel-level inaccuracies along certain high-frequency edges, which stem from two inherent limitations. First, the perspective discrepancy introduces sampling artifacts and quantization errors during 3D-to-2D projection, which fundamentally limits the fidelity of high-frequency details in the control condition. Second, the projected guidance produces blurred textures while preserving coarse geometry, acting as a low-pass filter that attenuates high-frequency edge information. The model therefore receives an already-smoothed representation of boundaries, making it difficult to reconstruct sharp, pixel-accurate edges. Despite these errors, the overall geometric structure of the generated image aligns well with the colored point cloud projection, while its color and texture features closely approximate those of the real image. This can be attributed to the accurate road structure guidance provided by the colored point clouds. Furthermore, their color information is integrated with the blurred textures in the projected image via DRF, collectively enabling the model to generate a remote sensing image that better resembles the overall layout of the real satellite image.

In comparison, the images generated by other methods still exhibit discernible discrepancies from real satellite images, particularly those of SelGAN [45] and Instr-p2p [46], as shown in Figure 4. While SkyDiffusion [33] and GCC [13] closely approximate the ground truth in terms of basic terrain and general green landforms, significant discrepancies remain when capturing specific structural features such as buildings. These comparative methods lack precise geometric prior support, which hinders the model’s ability to accurately infer road structures. Furthermore, they do not pre-align ground images to the satellite perspective, impeding the direct inference of object features from ground-level imagery. This limitation manifests as inconsistencies in the shapes, positions, and colors of trees, buildings, and other ground elements.

4.3. Ablation Study

To systematically assess the effectiveness of the proposed method, we present ablation studies evaluating various combinations of the IP-GP module and DRF module on the KITTI dataset. The experimental results are summarized in Table 3. Compared with the baseline that does not incorporate IP-GP and DRF, introducing DRF alone enables the model to fuse the features of ground images with those of the original point clouds. This integration combines the overall scene information of the point clouds with the texture details of the ground image, thereby enhancing the realism of the generated image. However, due to the absence of structured alignment between the point clouds and the image, the model struggles to capture accurate geometric road information. Consequently, the geometric structure of the generated image exhibits discrepancies compared with real satellite images, leading to only a limited improvement in the $Sim_{s}$ metric.

When applying the IP-GP module independently, the model initially aligns ground-view images with the satellite perspective and integrates point clouds to generate a colored synthetic image, which serves as a geometric control condition. Under the joint guidance of these two conditions, PSNR, LPIPS, FID, and $Sim_{s}$ all improve. Notably, $Sim_{s}$ decreases from 0.445 to 0.382 (a relative reduction of about 14%), demonstrating that the structure of the generated image more closely approximates that of real satellite imagery. This empirically validates the efficacy of utilizing colored point clouds as a geometric control condition to guide the model to generate structurally accurate results.

Furthermore, incorporating DRF into the IP-GP framework yields further enhancements across all performance metrics. This improvement is primarily attributed to the integration of self-attention and cross-attention mechanisms within DRF. More specifically, the self-attention mechanism operates on the concatenated features of the ground image and point clouds. By modeling intra-feature dependencies, it adaptively captures and integrates latent correlations across heterogeneous feature sources, thereby enabling the intrinsic fusion of rich color textures from I-RSI and precise road geometry information from CP-RSI. Concurrently, the cross-attention mechanism is specifically designed to capture complementary information between I-RSI and CP-RSI feature images. On the one hand, inaccurate road features in the I-RSI representation can be queried and refined using geometric cues from the CP-RSI. On the other hand, coarse color information in the CP-RSI can be retrieved and enriched by leveraging fine-grained texture features from the I-RSI. Through the synergistic integration of these two components, the model effectively exploits the complementary guidance from the dual remote sensing image representations, thereby substantially enhancing its ability to synthesize high-quality and high-fidelity images.
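The interplay of self-attention and bidirectional cross-attention described above can be sketched as follows. This is a deliberately simplified, single-head illustration without learned projections, not the DRF implementation: the token-wise concatenation, the additive combination of the two branches, and the names `attention` and `fuse` are all assumptions.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for (tokens, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fuse(f_img, f_pc):
    """Toy dual-feature fusion: self-attention over the concatenated tokens
    plus cross-attention in both directions (hypothetical combination)."""
    joint = np.concatenate([f_img, f_pc], axis=0)     # (2N, D) token sequence
    self_out = attention(joint, joint, joint)
    img_refined = attention(f_img, f_pc, f_pc)        # I-RSI queries CP-RSI geometry
    pc_refined = attention(f_pc, f_img, f_img)        # CP-RSI queries I-RSI texture
    cross_out = np.concatenate([img_refined, pc_refined], axis=0)
    return self_out + cross_out

rng = np.random.default_rng(0)
f_img = rng.standard_normal((16, 8))   # stand-in I-RSI features: 16 tokens, dim 8
f_pc = rng.standard_normal((16, 8))    # stand-in CP-RSI features
fused = fuse(f_img, f_pc)
```

In the first cross-attention call the I-RSI tokens act as queries over the CP-RSI tokens, which mirrors the idea of refining inaccurate road features with geometric cues; the second call swaps the roles so that coarse colors can be enriched with texture.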

To investigate the role of different remote sensing image representations as control conditions in image generation, this paper further evaluates the model performance using I-RSI and CP-RSI as control inputs under various combination conditions on the KITTI dataset. The experimental results are presented in Table 4. Compared to the baseline without RSI guidance, the incorporation of I-RSI alone yields significant improvements in PSNR, LPIPS, FID, and $Sim_{s}$. This indicates that the preliminarily aligned road structure and texture information encoded in I-RSI effectively guide the model to synthesize higher-quality images. When CP-RSI is employed as a standalone control condition, the SSIM, FID, and $Sim_{s}$ metrics exhibit further improvements relative to the baseline. These results demonstrate that the precise road geometry information provided by CP-RSI significantly enhances the capacity of the method to model complex road structures and generate outputs that align closely with real satellite images in terms of spatial layout. Furthermore, the semantically associated color information preserved in CP-RSI supplies weak supervision signals, enabling the approach to generate plausible semantic categories and appearances in corresponding regions.

Employing I-RSI and CP-RSI concurrently as control conditions yields comprehensive and significant improvements across all evaluation metrics. This demonstrates the high complementarity of the two representations at the informational level. I-RSI encapsulates rich, detailed texture and local appearance information, whereas CP-RSI contributes precise road topology and global geometric information. By integrating the two types of remote sensing images via a DRF module equipped with a hybrid attention mechanism, the model synergistically leverages both information sources. This approach enhances visual realism while preserving geometric accuracy, thereby generating higher-quality images. The experimental results further validate the efficacy of the IP-GP framework.

4.4. Downstream Applications

To assess the effectiveness of the proposed remote sensing image generation algorithm, we performed comprehensive qualitative and quantitative evaluations utilizing cross-view localization as a downstream task. Cross-view localization aims to determine the three-degree-of-freedom (3-DoF) pose of ground-level sensors within remote sensing imagery by matching ground data to overhead references. Existing methods are broadly categorized into image-to-image matching frameworks and LiDAR-to-image approaches. To evaluate the practical utility of the generated remote sensing images, we reproduced ten representative algorithms on the KITTI dataset using both ground-truth satellite images and their generated counterparts, as shown in Table 5.

Among image-to-image methods, Song et al. [47] achieve the best performance on generated imagery, with an average distance error of only 2.62 m and an orientation error of 3.16°. On the original images, the same method attains 1.48 m and 0.49°, so the gap introduced by generation is 1.14 m and 2.67°. LiDAR-to-image methods generally demonstrate larger absolute errors than image-to-image approaches, reflecting the significant domain gap and inherent information loss associated with translating sparse 3D point clouds into dense RGB imagery. Nevertheless, Hu et al. [3] demonstrate the most competitive performance in this category (4.58 m, 2.86°), suggesting that their pipeline retains more localization-salient features than its counterparts. Overall, although the cross-view localization algorithms perform somewhat worse on generated images than on original images, they maintain substantial localization accuracy.

Table 5. Comparison of cross-view localization performance between generated and original images on the KITTI dataset.

Type	Method	Generated Image: Dist (m)	Generated Image: $θ$ (°)	Original Image: Dist (m)	Original Image: $θ$ (°)
Image-to-image	LM [41]	14.62	4.91	12.08	3.72
	SliceMatch [48]	8.52	5.05	7.96	4.12
	CCVPE [49]	3.77	4.51	1.22	0.67
	Hu et al. [50]	4.83	5.72	2.10	3.94
	Song et al. [47]	2.62	3.16	1.48	0.49
LiDAR-to-image	Zhang et al. [51]	6.39	5.12	5.88	3.42
	Zhang et al. [52]	5.76	3.90	4.49	2.27
	Sun et al. [53]	7.14	3.25	5.25	2.83
	Wang et al. [54]	8.47	5.31	6.76	3.21
	Hu et al. [3]	4.58	2.86	3.66	1.85
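The degradation caused by replacing original satellite images with generated ones can be read off Table 5 by subtracting the two column pairs. The short script below does this for every method; the values are copied from the table, while the variable and function names are illustrative.

```python
# Rows copied from Table 5:
# (generated Dist, generated theta, original Dist, original theta).
table5 = {
    "LM [41]": (14.62, 4.91, 12.08, 3.72),
    "SliceMatch [48]": (8.52, 5.05, 7.96, 4.12),
    "CCVPE [49]": (3.77, 4.51, 1.22, 0.67),
    "Hu et al. [50]": (4.83, 5.72, 2.10, 3.94),
    "Song et al. [47]": (2.62, 3.16, 1.48, 0.49),
    "Zhang et al. [51]": (6.39, 5.12, 5.88, 3.42),
    "Zhang et al. [52]": (5.76, 3.90, 4.49, 2.27),
    "Sun et al. [53]": (7.14, 3.25, 5.25, 2.83),
    "Wang et al. [54]": (8.47, 5.31, 6.76, 3.21),
    "Hu et al. [3]": (4.58, 2.86, 3.66, 1.85),
}

def degradation(row):
    """Return (distance gap in m, orientation gap in degrees), generated minus original."""
    gen_dist, gen_theta, orig_dist, orig_theta = row
    return round(gen_dist - orig_dist, 2), round(gen_theta - orig_theta, 2)

gaps = {method: degradation(row) for method, row in table5.items()}
mean_dist_gap = sum(g[0] for g in gaps.values()) / len(gaps)
mean_theta_gap = sum(g[1] for g in gaps.values()) / len(gaps)
best_on_generated = min(table5, key=lambda m: table5[m][0])

for method, (d_gap, t_gap) in gaps.items():
    print(f"{method}: +{d_gap:.2f} m, +{t_gap:.2f} deg")
```

For example, `gaps["Song et al. [47]"]` evaluates to `(1.14, 2.67)`, and `best_on_generated` is `"Song et al. [47]"`, the method with the lowest distance error on generated imagery.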

Figure 6 illustrates qualitative localization results on the KITTI dataset using generated images. The poses estimated from the generated imagery closely match those obtained from ground-truth references, even in challenging scenarios involving occlusions and dynamic objects. Overall, the quantitative metrics and qualitative visual examples together indicate that our remote sensing image generation algorithm yields imagery that can stand in for real satellite acquisitions in cross-view localization at a moderate cost in accuracy. It preserves the feature distributions and geometric integrity essential for precise spatial reasoning. This validates that our approach not only achieves high perceptual fidelity but also maintains the task-specific quality required for reliable downstream deployment. Consequently, generated imagery can serve as a cost-effective substitute for expensive large-scale satellite data collection in the training and operation of localization systems.

5. Discussion

The proposed method represents a significant advancement in cross-view remote sensing image generation by addressing the fundamental challenge of bridging substantial perspective discrepancies while preserving realistic textures. Unlike text-driven synthesis methods that struggle to capture fine-grained spatial and textural details, our approach leverages dual-modal control conditions—geometric guidance from colored point clouds and appearance guidance from ground-view imagery. This hybrid strategy fills a critical gap in the literature where existing techniques either prioritize semantic alignment at the expense of photorealism or generate plausible textures without adequate geometric fidelity. By systematically decomposing the problem into perspective alignment and feature fusion sub-tasks, our framework demonstrates that explicit geometric modeling is essential for generating satellite imagery that is not only visually convincing but also structurally consistent with real-world layouts. The substantial improvement in structural similarity metrics validates the importance of incorporating 3D spatial information for cross-view translation tasks, offering a promising direction for applications in autonomous navigation, urban planning, and large-scale mapping, where geometric accuracy is paramount.

The principal strength of our approach lies in its modular design, which disentangles geometric transformation from texture synthesis, enabling targeted improvements in each domain. The IP-GP module’s ability to project ground-view imagery into the satellite perspective while integrating point cloud data establishes a robust geometric control mechanism, as evidenced by the significant reduction in $Sim_{s}$ from 0.445 to 0.382 in Table 3. This explicit alignment circumvents the structural discrepancies observed when relying solely on learned transformations. Complementarily, the DRF module enhances photorealism by fusing point cloud semantics with fine-grained textures from ground images, addressing the limitation of previous methods that produced overly smooth or unrealistic surface appearances. The synergistic combination of these modules outperforms either component in isolation, demonstrating that geometric priors and textural details are mutually reinforcing rather than redundant. Furthermore, the use of colored point clouds as a geometric scaffold provides interpretable control over the generation process, unlike black-box generative models, thereby facilitating error diagnosis and potential integration with downstream geometric verification systems.

Despite these advantages, several limitations warrant acknowledgment. First, the efficacy of the IP-GP module is contingent upon the quality and density of input point clouds. Sparse or noisy LiDAR data may degrade the accuracy of perspective projection and subsequently compromise the geometric fidelity of synthesized images. Second, the proposed method relies on accurately calibrated LiDAR and camera data. It may therefore be sensitive to LiDAR-camera calibration errors and to heavy occlusion, and the 50-step DDIM inference is relatively slow. Third, the current implementation assumes a relatively accurate initial pose estimate between the ground and satellite views. Future work should explore more robust point cloud completion techniques to handle incomplete data, investigate adaptive fusion strategies that preserve edge sharpness, and develop lightweight architectures to enhance practical applicability. More specifically, we advocate for exploring self-supervised calibration algorithms that jointly learn sensor alignment and view transformation. Additionally, alternative depth representations should be investigated to reduce the framework’s sensitivity to precise sensor registration.

6. Conclusions

This paper presents GCCG-RSI, a controllable framework designed to address the task of remote sensing image generation. The objective is to enhance generation accuracy by synergistically leveraging CP-RSI, which provides reliable geometric structures and coarse color information, and I-RSI, which offers fine-grained texture details. Specifically, GCCG-RSI initially transforms the ground-view image into I-RSI via the IP-GP module, yielding a representation that provides rich texture cues and achieves preliminary geometric alignment with the target viewpoint. Subsequently, the model integrates point clouds and images to generate CP-RSI, thereby introducing more precise geometric priors and auxiliary color information.

To effectively integrate the two types of RSI information, GCCG-RSI introduces the DRF module. This module not only models the global dependencies of the concatenated dual RSI features but also facilitates bidirectional feature interaction between I-RSI and CP-RSI, thereby fully capturing the complementary information inherent in different representations. Through this process, the model generates a fused RSI feature that combines precise geometric structure with detailed texture information. This fused representation serves as a condition to guide the diffusion model in generating an RSI that is geometrically consistent with real satellite images. Experimental results on the KITTI and Ford Multi-AV datasets demonstrate the effectiveness and superiority of GCCG-RSI in the RSI generation task. This method provides an effective image generation approach for cross-view localization, substantially reducing the domain gap between ground and satellite images. In future work, we plan to incorporate textual information to enhance semantic guidance during the image generation process.

Author Contributions

All authors contributed to this manuscript. Conceptualization, D.H. and X.Y.; methodology, D.H. and R.Q.; software, D.H.; validation, D.H., R.Q., and S.Y.; formal analysis, R.Q.; investigation, S.Y.; resources, C.Z.; data curation, R.Q.; writing—original draft preparation, D.H.; writing—review and editing, X.Y.; visualization, R.Q.; supervision, C.Z.; project administration, S.Y.; funding acquisition, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by NingXia Academy of Agriculture and Forestry Sciences Science and Technology Innovation Guidance Technology Research Project, “Research and Demonstration of Key Technologies for Smart Planting of Wine Grapes in Ningxia,” under grant NKYG-23-02.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Hu, D.; Yuan, X.; Xu, X.; Zhao, C. A Review of Ground-to-Aerial Cross-View Localization Research. J. Electron. Inf. Technol. 2025, 47, 5016–5032. [Google Scholar]
Szász, B.; Heil, B.; Kovács, G.; Mészáros, D.; Czimber, K. Comparison of Advanced Terrestrial and Aerial Remote Sensing Methods for Above-Ground Carbon Stock Estimation—A Comparative Case Study for a Hungarian Temperate Forest. Remote Sens. 2025, 17, 2173. [Google Scholar] [CrossRef]
Hu, D.; Yuan, X.; Xi, H.; Li, J.; Song, Z.; Xiong, F.; Zhang, K.; Zhao, C. Road structure inspired UGV-satellite cross-view geo-localization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16767–16786. [Google Scholar] [CrossRef]
Zhu, Y.; Chen, S.; Lu, X.; Chen, J. Cross-view image synthesis from a single image with progressive parallel GAN. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4701513. [Google Scholar] [CrossRef]
Niu, Z.; Li, Y.; Gong, Y.; Zhang, B.; He, Y.; Zhang, J.; Tian, M.; He, L. Multi-Class Guided GAN for Remote-Sensing Image Synthesis Based on Semantic Labels. Remote Sens. 2025, 17, 344. [Google Scholar] [CrossRef]
Lai, Z.; Tang, C.; Lv, J. Multi-view image generation by cycle CVAE-GAN networks. In Proceedings of the International Conference on Neural Information Processing; Springer: Cham, Switzerland, 2019; pp. 43–54. [Google Scholar]
Cai, H.; Huang, W.; Yang, S.; Ding, S.; Zhang, Y.; Hu, B.; Zhang, F.; Cheung, Y.M. Realize generative yet complete latent representation for incomplete multi-view learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3637–3652. [Google Scholar] [CrossRef]
Sordo, Z.; Chagnon, E.; Hu, Z.; Donatelli, J.J.; Andeer, P.; Nico, P.S.; Northen, T.; Ushizima, D. Synthetic scientific image generation with VAE, GAN, and diffusion model architectures. J. Imaging 2025, 11, 252. [Google Scholar] [CrossRef]
Li, W.; He, J.; Ye, J.; Zhong, H.; Zheng, Z.; Huang, Z.; Lin, D.; He, C. Crossviewdiff: A cross-view diffusion model for satellite-to-street view synthesis. arXiv 2024, arXiv:2408.14765. [Google Scholar]
Seo, M.; Jung, J.; Choi, D.G. Improved flood insights: Diffusion-based SAR-to-EO image translation. Remote Sens. 2025, 17, 2260. [Google Scholar] [CrossRef]
Guo, Z.; Hu, W.; Zheng, S.; Zhang, B.; Zhou, M.; Peng, J.; Yao, Z.; Feng, M. Efficient Conditional Diffusion Model for SAR Despeckling. Remote Sens. 2025, 17, 2970. [Google Scholar] [CrossRef]
Lee, Y.; Kim, K.; Kim, H.; Sung, M. Syncdiffusion: Coherent montage via synchronized joint diffusions. Adv. Neural Inf. Process. Syst. 2023, 36, 50648–50660. [Google Scholar]
Lin, T.J.; Wang, W.; Shi, Y.; Perincherry, A.; Vora, A.; Li, H. Geometry-guided cross-view diffusion for one-to-many cross-view image synthesis. In Proceedings of the 2025 International Conference on 3D Vision (3DV); IEEE Computer Society: Washington, DC, USA, 2025; pp. 866–881. [Google Scholar]
Hu, D.; Yuan, X.; Zhao, C. Active layered topology mapping driven by road intersection. Knowl.-Based Syst. 2025, 315, 113305. [Google Scholar] [CrossRef]
Ding, X.; Zhang, X.; Song, S.; Li, B.; Hui, L.; Dai, Y. Cross-View Geo-Localization via 3D Gaussian Splatting-Based Novel View Synthesis. Remote Sens. 2025, 17, 3673.
Bajbaa, K.; Anwar, A.; Saqib, M.; Anwar, H.; Sharma, N.; Usman, M. From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis. arXiv 2025, arXiv:2509.24369.
Regmi, K.; Borji, A. Cross-view image synthesis using geometry-guided conditional GANs. Comput. Vis. Image Underst. 2019, 187, 102788.
Zhao, L.; Zhou, Y.; Hu, X.; Gan, W.; Huang, G.; Zhang, C.; Hou, M. Street-to-satellite view synthesis for cross-view geo-localization. In Proceedings of the International Conference on Remote Sensing Technology and Survey Mapping (RSTSM 2024); SPIE: Bellingham, WA, USA, 2024; Volume 13166, pp. 48–54.
Wu, S.; Tang, H.; Jing, X.Y.; Zhao, H.; Qian, J.; Sebe, N.; Yan, Y. Cross-view panorama image synthesis. IEEE Trans. Multimed. 2022, 25, 3546–3559.
Shi, Y.; Campbell, D.; Yu, X.; Li, H. Geometry-guided street-view panorama synthesis from satellite imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 10009–10022.
Li, G.; Qian, M.; Xia, G.S. Unleashing unlabeled data: A paradigm for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 16719–16729.
Ze, X.; Song, Z.; Wang, Q.; Lu, J.; Shi, Y. Controllable satellite-to-street-view synthesis with precise pose alignment and zero-shot environmental control. arXiv 2025, arXiv:2502.03498.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 10684–10695.
Wang, X.; Cai, W.; Ding, Y.; Di, X.; Li, S.; Yin, Z.; Jia, H.; Fu, J. RGB to Infrared Image Translation Based on Diffusion Bridges Under Aerial Perspective. Remote Sens. 2025, 17, 3703.
Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869.
Zheng, G.; Li, S.; Wang, H.; Yao, T.; Chen, Y.; Ding, S.; Li, X. Entropy-Driven Sampling and Training Scheme for Conditional Diffusion Generation. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 754–769.
Kawar, B.; Elad, M.; Ermon, S.; Song, J. Denoising Diffusion Restoration Models. arXiv 2022, arXiv:2201.11793.
Dhariwal, P.; Nichol, A.Q. Diffusion Models Beat GANs on Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. arXiv 2021, arXiv:2207.12598.
Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Singapore, 2024; Volume 38, pp. 4296–4304.
Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
Arrabi, A.; Zhang, X.; Sultani, W.; Chen, C.; Wshah, S. Cross-view meets diffusion: Aerial image synthesis with geometry and text guidance. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 5356–5366.
Ye, J.; He, J.; Li, W.; Lv, Z.; Lin, Y.; Yu, J.; Yang, H.; He, C. Leveraging BEV paradigm for ground-to-aerial image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2025; pp. 28451–28461.
Yu, Z.; Liu, C.; Liu, L.; Shi, Z.; Zou, Z. MetaEarth: A generative foundation model for global-scale remote sensing image generation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1764–1781.
Liu, C.; Chen, K.; Zhao, R.; Zou, Z.; Shi, Z. Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model. IEEE Geosci. Remote Sens. Mag. 2025, 13, 238–259.
Wang, X.; Xu, R.; Cui, Z.; Wan, Z.; Zhang, Y. Fine-grained cross-view geo-localization using a correlation-aware homography estimator. Adv. Neural Inf. Process. Syst. 2023, 36, 5301–5319.
Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502.
Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 3836–3847.
Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
Agarwal, S.; Vora, A.; Pandey, G.; Williams, W.; Kourous, H.; McBride, J. Ford multi-AV seasonal dataset. Int. J. Robot. Res. 2020, 39, 1367–1376.
Shi, Y.; Li, H. Beyond cross-view image retrieval: Highly accurate vehicle localization using satellite image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 17010–17020.
Toker, A.; Zhou, Q.; Maximov, M.; Leal-Taixé, L. Coming down to earth: Satellite-to-street view synthesis for geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 6488–6497.
Lu, X.; Li, Z.; Cui, Z.; Oswald, M.R.; Pollefeys, M.; Qin, R. Geometry-aware satellite-to-ground image synthesis for urban areas. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 859–867.
Shi, Y.; Liu, L.; Yu, X.; Li, H. Spatial-aware feature aggregation for image based cross-view geo-localization. Adv. Neural Inf. Process. Syst. 2019, 32, 1–11.
Tang, H.; Xu, D.; Sebe, N.; Wang, Y.; Corso, J.J.; Yan, Y. Multi-channel attention selection GAN with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 2417–2426.
Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 18392–18402.
Song, Z.; Lu, J.; Shi, Y. Learning dense flow field for highly-accurate cross-view camera localization. Adv. Neural Inf. Process. Syst. 2024, 36, 70612–70625.
Lentsch, T.; Xia, Z.; Caesar, H.; Kooij, J.F. SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 17225–17234.
Xia, Z.; Booij, O.; Kooij, J.F.P. Convolutional Cross-View Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3813–3831.
Hu, W.; Zhang, Y.; Liang, Y.; Han, X.; Yin, Y.; Kruppa, H.; Ng, S.K.; Zimmermann, R. PetalView: Fine-grained Location and Orientation Extraction of Street-view Images via Cross-view Local Search. In Proceedings of the 31st ACM International Conference on Multimedia; ACM: New York, NY, USA, 2023; pp. 56–66.
Zhang, Y.; Wang, J.; Wang, X.; Li, C.; Wang, L. 3D LiDAR-based intersection recognition and road boundary detection method for unmanned ground vehicle. In Proceedings of the 2015 IEEE 18th International Conference on Intelligent Transportation Systems, Gran Canaria, Spain, 15–18 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 499–504.
Zhang, Y.; Wang, J.; Wang, X.; Dolan, J.M. Road-Segmentation-Based Curb Detection Method for Self-Driving via a 3D-LiDAR Sensor. IEEE Trans. Intell. Transp. Syst. 2018, 19, 3981–3991.
Sun, P.; Zhao, X.; Xu, Z.; Wang, R.; Min, H. A 3D LiDAR data-based dedicated road boundary detection algorithm for autonomous vehicles. IEEE Access 2019, 7, 29623–29638.
Wang, G.; Wu, J.; He, R.; Tian, B. Speed and Accuracy Tradeoff for LiDAR Data Based Road Boundary Detection. IEEE/CAA J. Autom. Sin. 2021, 8, 1210–1220.

Figure 1. Schematic diagram of GCCG-RSI. The model takes the ground image and point clouds as conditional inputs to jointly guide the generation of RSI that aligns with the geometric structure of real satellite imagery.

Figure 2. The framework of GCCG-RSI, where $F_{i}$ and $F_{cp}$ denote the feature images extracted from the ground image and colored point clouds via conditional encoding networks, respectively. $z_{t}$ represents the latent representation at the $t$-th iteration.

Figure 3. The architecture of the DRF module.

Figure 4. Comparison of the generated images for urban and highway scenes in the KITTI and Ford Multi-AV datasets.

Figure 5. Visualization of control conditions for RSI generation on the KITTI and Ford Multi-AV datasets. (a,b) show results in town scenes, whereas (c–f) show results on highways. The red line delineates the road structure, the yellow box highlights the vegetation texture, and the green box marks the road texture.

Figure 6. Qualitative results of cross-view localization on generated images in the KITTI dataset. The red arrows indicate the ground truth pose, and the yellow arrows represent the predicted pose. (a) LM [41]. (b) SliceMatch [48]. (c) CCVPE [49]. (d) Hu et al. [50]. (e) Song et al. [47]. (f) Zhang et al. [51]. (g) Zhang et al. [52]. (h) Sun et al. [53]. (i) Wang et al. [54]. (j) Hu et al. [3].

Table 1. Parameter configuration in the experiments.

Parameter	Setting Value
Resolution of satellite image	512 × 512
Resolution of ground image	512 × 1024
Vertical FoV of projection	17.5°
Pitch angle of projection	1.9°
Scale of projection	4
Number of attention heads	8
$(α, β, γ)$ in SSIM	(1.0, 1.0, 1.0)
Batch size	1
Learning rate	0.00001
Learning rate scheduler	Cosine annealing
Optimizer	AdamW
Number of epochs	300
Scale factor	0.18215
GPU model	NVIDIA GeForce RTX 3090
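
The learning-rate settings in Table 1 can be read as a concrete schedule. The following is a minimal sketch and not the authors' training code: it assumes the cosine annealing is stepped once per epoch across all 300 epochs with a minimum learning rate of 0, neither of which is stated in Table 1. The names `TrainConfig` and `cosine_lr` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 1; the field names are illustrative."""
    satellite_resolution: tuple = (512, 512)
    ground_resolution: tuple = (512, 1024)
    batch_size: int = 1
    base_lr: float = 1e-5          # "Learning rate" row
    epochs: int = 300              # "Number of epochs" row
    latent_scale_factor: float = 0.18215  # "Scale factor" row


def cosine_lr(cfg: TrainConfig, epoch: int, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at the given epoch.

    Assumption: one scheduler step per epoch and min_lr = 0.
    """
    progress = epoch / cfg.epochs
    return min_lr + 0.5 * (cfg.base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


cfg = TrainConfig()
for epoch in (0, 150, 300):
    # Start, midpoint, and end of the assumed schedule.
    print(epoch, cosine_lr(cfg, epoch))
```

Under these assumptions the rate starts at the Table 1 value, halves at the midpoint, and decays to zero at the final epoch; a per-iteration schedule or a nonzero floor would change the curve.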

Table 2. Experimental results of the comparison on KITTI and Ford Multi-AV datasets. Best and second-best results shown in bold and underline, respectively. ↑ denotes that a larger value represents better performance, while ↓ indicates that a smaller value is preferable.

Dataset	Method	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓	${Sim}_{s}$ ↓
KITTI	SelGAN [45]	11.02	0.108	0.724	80.90	0.512
	GPG2A [32]	10.08	0.117	0.710	67.92	0.399
	Instr-p2p [46]	10.14	0.126	0.701	52.18	0.495
	ControlNet [38]	10.93	0.135	0.676	42.21	0.392
	SkyDiffusion [33]	12.26	0.153	0.655	39.39	0.367
	GCC [13]	11.65	0.172	0.671	47.36	0.382
	Proposed	11.33	0.145	0.684	39.58	0.334
Ford	SelGAN [45]	10.92	0.109	0.775	91.02	0.523
	GPG2A [32]	9.81	0.120	0.752	65.57	0.412
	Instr-p2p [46]	9.89	0.118	0.741	59.46	0.513
	ControlNet [38]	10.88	0.140	0.723	47.36	0.421
	SkyDiffusion [33]	11.69	0.131	0.713	45.52	0.377
	GCC [13]	11.19	0.164	0.736	53.18	0.394
	Proposed	11.25	0.139	0.698	45.87	0.326
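
Because the ↑/↓ convention decides which entry counts as best, the ranking can be recomputed directly from the table values. The snippet below is a minimal sketch using the KITTI rows of Table 2; the helper name `top_two` and the data layout are hypothetical.

```python
# True means a larger value is better (↑); False means smaller is better (↓).
METRICS = {"PSNR": True, "SSIM": True, "LPIPS": False, "FID": False, "Sim_s": False}

# KITTI rows copied from Table 2, in the column order of METRICS.
KITTI = {
    "SelGAN": (11.02, 0.108, 0.724, 80.90, 0.512),
    "GPG2A": (10.08, 0.117, 0.710, 67.92, 0.399),
    "Instr-p2p": (10.14, 0.126, 0.701, 52.18, 0.495),
    "ControlNet": (10.93, 0.135, 0.676, 42.21, 0.392),
    "SkyDiffusion": (12.26, 0.153, 0.655, 39.39, 0.367),
    "GCC": (11.65, 0.172, 0.671, 47.36, 0.382),
    "Proposed": (11.33, 0.145, 0.684, 39.58, 0.334),
}


def top_two(rows, metrics):
    """Return {metric: (best_method, second_best_method)}."""
    result = {}
    for col, (name, higher_is_better) in enumerate(metrics.items()):
        # Sort descending for ↑ metrics and ascending for ↓ metrics.
        ranked = sorted(rows, key=lambda m: rows[m][col], reverse=higher_is_better)
        result[name] = (ranked[0], ranked[1])
    return result


for metric, (best, second) in top_two(KITTI, METRICS).items():
    print(f"{metric}: best={best}, second={second}")
```

The same function applies to the Ford rows once they are entered in the same layout.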

Table 3. Ablation study of the IP-GP and DRF modules on the KITTI dataset (bold: best). ↑ denotes that a larger value represents better performance, while ↓ indicates that a smaller value is preferable. ✓ signifies the inclusion of the module in the method, whereas ✕ denotes its exclusion.

IP-GP	DRF	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓	${Sim}_{s}$ ↓
✕	✕	10.05	0.134	0.731	43.88	0.445
✕	✓	10.16	0.135	0.718	43.51	0.437
✓	✕	10.31	0.132	0.713	42.12	0.382
✓	✓	11.33	0.145	0.684	39.58	0.334

Table 4. Ablation study for combinations of control conditions on the KITTI dataset (bold: best). ↑ denotes that a larger value represents better performance, while ↓ indicates that a smaller value is preferable. ✓ signifies the inclusion of the control condition in the method, whereas ✕ denotes its exclusion.

I-RSI	CP-RSI	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓	${Sim}_{s}$ ↓
✕	✕	10.05	0.134	0.731	43.88	0.432
✕	✓	10.28	0.134	0.721	42.18	0.362
✓	✕	10.38	0.124	0.708	43.58	0.412
✓	✓	11.33	0.145	0.684	39.58	0.334

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
