MDD-VIR: Vis-to-IR Remote Sensing Image Generation Method Based on Mechanism-Data Dual-Driven Strategy

Li, Yue; Sun, Dechang; Wang, Xiaorui; Ren, Fafa; Zhang, Chao

doi:10.3390/rs18101502

by Yue Li, Dechang Sun, Xiaorui Wang, Fafa Ren and Chao Zhang *

School of Optoelectronic Engineering, Xidian University, Xi’an 710071, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1502; https://doi.org/10.3390/rs18101502

Submission received: 10 March 2026 / Revised: 24 April 2026 / Accepted: 6 May 2026 / Published: 11 May 2026

(This article belongs to the Special Issue AI-Driven Remote Sensing Image Restoration and Generation)

Highlights

What are the main findings?

A Vis-to-IR remote sensing image generation method based on mechanism-data dual-driven strategy (MDD-VIR) is proposed, which couples the global radiation scattering mechanism of cross-band remote sensing imagery with a deep generative model for infrared remote sensing images.
Diverse experimental results demonstrate that MDD-VIR exhibits outstanding accuracy and effectiveness in complex terrain scenarios and multi-band infrared remote sensing image generation tasks, with the generated images achieving an average structural similarity index measure (SSIM) value of 91.07%.

What are the implications of the main findings?

This method addresses critical challenges limiting the accuracy and fidelity of traditional simulation models through a synergistic mechanism-data dual-driven design, and achieves multiple objectives encompassing strong physical consistency, high fidelity, and high efficiency.
This method synergistically exploits the unique advantages of mechanism-driven and data-driven paradigms, significantly improving the overall performance of generative models and providing an interpretable, more comprehensive solution for remote sensing image generation across visible to infrared wavelengths.

Abstract

High-fidelity infrared remote sensing imagery serves as a critical foundation for the development of technologies such as infrared scene simulation and long-range imaging detection. Existing methods fall into two categories, each with core limitations: traditional physical modeling methods suffer from low fidelity and efficiency, while deep learning-based generation methods suffer from insufficient interpretability and weak generalization. To address these limitations, we propose a visible-to-infrared (Vis-to-IR) remote sensing image generation method based on the multi-dimensional features of scene elements and a mechanism-data dual-driven strategy (MDD-VIR). First, a scene element multi-dimensional feature extractor (SEMFE) is designed by analyzing and reconstructing limited datasets, bridging physical mechanisms and intelligent learning. From a game-theoretic perspective, we present a Unet3+-based frequency-domain adaptive spatial channel reconstruction convolution module (FASCRC_Unet3+) and a feature fusion discrimination method based on proactive material weighting (FFD_PMW) to enhance the model’s ability to learn and transform high-value regional and multi-scale features. Furthermore, a collaborative optimization loss function (Loss_CO) is designed to integrate the advantages of the dual-driven paradigm and facilitate efficient iteration. Experiments show that the average SSIM of the images generated by MDD-VIR reaches 91.07%. By fusing physical algorithms with intelligent models, this approach enables the Vis-to-IR remote sensing image generation model to achieve the multiple objectives of robust physical consistency, high fidelity, and high efficiency.

Keywords:

remote sensing image generation; mechanism-data dual-driven; Vis-to-IR; scene element multi-dimensional feature extractor (SEMFE)

1. Introduction

Compared with visible-light remote sensing, which relies on sunlight, infrared remote sensing—particularly thermal infrared remote sensing—can operate continuously day and night. Thanks to its all-day capability and ability to penetrate clouds and fog to some extent, it is widely used in fields such as environmental monitoring [1,2], agricultural management [3,4], facility maintenance [5,6], energy exploration [7,8], and military security [9,10]. However, in contrast to easily accessible visible-light remote sensing data, existing ground-truth infrared remote sensing imagery generally suffers from data scarcity and information deficiencies, owing to strict hardware requirements, high acquisition costs, and complicated data processing procedures [11,12,13,14,15]. This makes it difficult to meet the urgent demand for multispectral infrared remote sensing data in technologies such as infrared scene simulation, intelligent interpretation, and big data model construction [16,17,18,19,20,21,22,23,24,25,26,27]. Consequently, methods for generating infrared remote sensing imagery have emerged.

Researchers worldwide have conducted extensive and diverse studies on infrared image generation methods [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44]. For example, the digital imaging and remote sensing image generation (DIRSIG) model [30,31], a classic tool from the early stage of physics-based modeling and simulation, comprehensively accounts for multiple physical factors, including atmospheric transmission, solar radiation, and inter-object heat conduction. It thereby enables the simulation of relatively realistic infrared scene remote sensing images and provides a foundational framework for subsequent research. In 2025, DIRSIG was upgraded and optimized, with a focus on enhancing its simulation capabilities for thermal infrared bands [32].

In contrast, deep learning-based image generation methods are represented by the foundational generative adversarial networks (GANs) [45] framework proposed by Goodfellow et al. Through end-to-end adversarial game training, these methods learn the characteristics of data distributions, delivering advantages including high generation efficiency and rich texture details in the generated infrared remote sensing images. The classic models Pix2Pix [46] and CycleGAN [47] have become general-purpose network frameworks for most subsequent image conversion and generation tasks. In 2025, X. Yang et al. proposed AGMD-GAN [33], a conversion model for unpaired visible-to-infrared (Vis-to-IR) images, based on attention mechanisms and multi-scale feature discrimination methods. Özkanoğlu et al. proposed the InfraGAN [34] architecture by introducing structural similarity index constraints, thereby improving the similarity of simulated infrared remote sensing images. In recent years, diffusion models (DMs) [35,36] have also made significant progress in image generation. By progressively adding noise to simulate physical diffusion processes, they learn to generate high-clarity images [37,38,39].

With the iterative updates of these two major categories of methods and the gradual exposure of their bottlenecks, the integration of physical imaging mechanisms and intelligent network models has become an inevitable trend in emerging infrared image generation technologies [40,41,42,43,44]. For instance, O. Berman embedded image temperature prior information into image transformation, developing the PETIT-GAN [42] model for converting panchromatic images to thermal imaging data, and Y. M. Fang proposed a Physics-Informed Diffusion (PID) [43] model for converting RGB images into infrared images that adhere to physical laws. While these approaches strengthen the physical constraints of the models to a certain extent, they do not resolve the limited comprehensive performance of the models and adapt poorly to the texture features of remote sensing images. To address this issue, we proposed a high-fidelity infrared remote sensing image generation method (HFIRSIGM_GRSMP) [44] in late 2024, which couples global radiation scattering mechanisms with an intelligent network model. This approach achieved initial breakthroughs in key intelligent computing integration technologies, improving the model’s simulation accuracy and physical interpretability. However, when dealing with complex remote sensing imaging scenarios and targeted image generation tasks involving high-value targets, the model still has room for improvement in its comprehensive performance. A more comprehensive and efficient generation model that fuses physical mechanisms with intelligent computing is therefore needed to overcome the core limitations of existing technologies.

Against this backdrop, we propose a Vis-to-IR remote sensing image generation method based on a mechanism-data dual-driven strategy (MDD-VIR). Specifically, a scene element multi-dimensional feature extractor (SEMFE) is first constructed, which enables an in-depth analysis and structured reconstruction of limited known data, thereby providing support for the collaborative modeling of “physical constraints–deep generative model.” Second, with Pix2Pix adopted as the baseline network, we design one component for each side of the adversarial game: a Unet3+-based [48] frequency-domain adaptive spatial channel reconstruction convolution module (FASCRC_Unet3+) for the generator, and a feature fusion discrimination method based on proactive material weighting (FFD_PMW) for the discriminator. Additionally, a collaborative optimization loss function ($Loss_{CO}$) is proposed based on valid multidimensional feature information, which further improves the comprehensive performance of the model. Finally, multiple experimental results demonstrate that MDD-VIR delivers excellent accuracy and validity in both complex landform scenarios and multi-band infrared remote sensing image generation tasks. The method proposed in this paper not only provides critical data support for the digital simulation of infrared scenes, large-scale model training, and downstream technologies such as object detection, but also promotes interdisciplinary integration.

The main contributions of this paper are as follows:

The SEMFE module bridges the gap between physical mechanisms and intelligent learning.
The FASCRC_Unet3+ module enhances the generator’s ability to extract multi-scale features of complex terrain.
The FFD_PMW module improves the discriminator’s ability to learn features of high-value objects across multiple scales and improves its classification accuracy.
The $Loss_{CO}$ significantly enhances the overall performance of the generative model through intelligent computing fusion inference strategies.
The MDD-VIR achieves the multiple objectives of strong physical consistency, high fidelity, and high efficiency through a synergistic mechanism-data dual-driven strategy.

2. Materials and Methods

2.1. Problem Analysis

During remote sensing imaging, the coupled radiation scattering mechanism involving multiple scene elements gives rise to high complexity and strong uncertainty in the light field. The radiant energy distribution of the image is affected by a combination of factors, including light source characteristics, observation angles, landform material properties, and atmospheric conditions.

Moreover, as different land cover types exhibit distinct reflection, absorption, transmission, and scattering properties for electromagnetic waves at different wavelengths, remote sensing images corresponding to different spectral bands show significant differences in their imaging modalities and spectral features. Figure 1 illustrates the full-chain remote sensing imaging mechanisms, covering key links in Vis-to-IR remote sensing image simulation. The remote sensing imaging chains for visible light and infrared modalities are fundamentally consistent, and both follow the multi-element scene light field coupling mechanism. However, the inherent differences in spectral radiation characteristics between these two modalities lead to considerable disparities in the texture details and physical properties of the corresponding images. Figure 2 illustrates the primary causes and manifestations of these imaging characteristic differences between visible light and infrared remote sensing.

In visible light remote sensing imaging, the radiant energy received by detectors is dominated by reflection, which depends on the surface reflectance characteristics of different land cover types. Infrared remote sensing is divided into several sub-bands, including the near-infrared (760–1000 nm) and short-wave infrared (1000–2500 nm) bands, which primarily capture solar radiation reflected by land features, as well as the mid-infrared (3000–5000 nm) and thermal infrared (8000–14000 nm) bands, which primarily detect radiation emitted by the land features themselves. Additionally, the atmosphere causes part of the light to be scattered and absorbed by gases and aerosols, thereby altering the reflected spectral distribution. This principle not only serves as the foundation for remote sensing image analysis but also acts as the key to improving the quality and resolution of Vis-to-IR remote sensing image generation.

For the task of generating remote sensing images ranging from visible light to infrared, traditional physical modeling and simulation methods quantitatively model the imaging chain based on infrared radiation transfer theory to generate infrared images, offering good physical interpretability [16,17,18,19,28]. However, such methods struggle to capture the nonlinear effects of complex radiation fields and suffer from drawbacks such as modeling complexity and insufficient image realism [17,43].

Among deep learning-based generative methods, DMs rely on iterative denoising to generate images rich in detail with a uniform distribution, offering advantages in high-dimensional data processing [35,36]; however, they consume significant computational resources and involve high training complexity [37,38]. In contrast, GANs achieve nonlinear mapping learning and autonomous feature extraction through the adversarial training of a generator and a discriminator [45,46,47], offering high simulation efficiency, superior visual quality, and low computational resource requirements [28,33,47]. Given the combined requirements for efficiency and accuracy in remote sensing image generation tasks, this paper employs a GAN as the underlying network model; however, its “black-box” nature results in insufficient physical interpretability, thereby reducing the reliability and fidelity of the generated results [49,50].

In summary, both purely physics-driven and data-driven approaches have inherent strengths and limitations. Thus, there is an urgent need to develop an intelligent computing fusion innovation paradigm driven by both physical mechanisms and prior data [43,44,45,46,47], which will provide a critical pathway for technological breakthroughs in Vis-to-IR remote sensing image generation.

2.2. Overall Architecture of the MDD-VIR

The overall architecture of the MDD-VIR is illustrated in Figure 3. It comprises four major components: the SEMFE module, the generation module, the discrimination module, and the $Loss_{CO}$ module.

First, the SEMFE is designed. This module conducts in-depth mining and structured reconstruction of the physical attributes of land cover types, spectral mapping patterns, and measured infrared data features. It provides data support for precise feature extraction in the subsequent generation module and effective quantitative evaluation in the discrimination module, and further lays the foundation for the collaborative optimization modeling of “physical constraints–deep generative model.” Second, based on Pix2Pix, the FASCRC_Unet3+ module is designed and integrated into the generator, while the FFD_PMW method is proposed for the discriminator. Notably, by leveraging the spectral characteristic differences and transformation relationships of coupled radiative scattering components among multiple elements in remote sensing scenes, the $Loss_{CO}$ is designed based on valid multi-dimensional feature information, which guides the model to converge and iterate rapidly while adhering to real radiative laws. Ultimately, the MDD-VIR model is established through an innovative intelligent computing fusion architecture, enabling the large-scale, rapid, and high-fidelity conversion and generation of Vis-to-IR remote sensing images.

2.3. SEMFE

Infrared remote sensing image generation methods typically face two major challenges: the insufficient representation of physical information and a lack of targeted constraints during neural network training. These limitations often prevent models from achieving the simulation fidelity and scene adaptability required for practical applications. To address these, we propose the SEMFE. As illustrated in Figure 4, SEMFE employs a parallel processing workflow—weighting ground object materials, mapping radiative spectral features, and extracting measured infrared data—to convert limited raw input into structured feature representations that preserve both physical meaning and representational power. This approach facilitates collaborative modeling that integrates physical constraints with deep generative models, enhancing both the realism and adaptability of the generated infrared images.

In SEMFE, the ground object weight allocation step serves to enhance regional specificity. It uses material-classified images (obtained by segmenting visible-light remote sensing images by material) as input, supported by ground-truth data, classical infrared radiation mechanisms, and publicly available material spectral reflectivity datasets. By leveraging the differences in radiative properties among various land cover types and assessing the regional importance for specific image generation tasks, the method constructs material weighting coefficients ($\sigma_k \in (0, 1]$) to quantitatively characterize the importance of ground object category $k$ in the image generation task. Specifically, this paper classifies ground object types into five typical categories: soil, vegetation, water bodies, buildings, and roads. Under the assumption that high-value areas such as military bases, transportation hubs, and airport runways are designated as key scene elements, the weights for these five materials are set to 0.3, 0.2, 0.1, 0.5, and 0.7, respectively. These weights are then applied during subsequent network training to quantitatively differentiate the model’s attention to different regions.

This step overcomes the limitation of “uniform processing” in traditional image generation by enabling the weighted enhancement of high-value key areas. In turn, it directs the generative model to focus on critical regions, thereby improving the informational effectiveness and application value of the generated images. These weights cannot be directly transferred to arbitrary datasets or scenarios; they must be calibrated based on the ground cover material categories in the dataset and the specific objectives of the image generation task to ensure that the physical constraints remain reliable under the experimental setup.
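
As a minimal sketch of the weighting step described above, the following NumPy snippet turns a material-class map into a per-pixel weight image using the five weights listed in the text. The integer class IDs and the function name `build_weight_map` are assumptions made for illustration; the paper does not specify how the classes are encoded.

```python
import numpy as np

# Material weights sigma_k as listed in the text. The integer class IDs are
# an assumption of this sketch, not part of the paper.
MATERIAL_WEIGHTS = {
    0: 0.3,  # soil
    1: 0.2,  # vegetation
    2: 0.1,  # water bodies
    3: 0.5,  # buildings
    4: 0.7,  # roads
}


def build_weight_map(class_map: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer material-class image to an (H, W) weight image."""
    # Build a lookup table indexed by class ID, then index it with the map.
    lut = np.zeros(max(MATERIAL_WEIGHTS) + 1, dtype=np.float32)
    for class_id, sigma in MATERIAL_WEIGHTS.items():
        lut[class_id] = sigma
    return lut[class_map]


# Toy 2x3 material-classified image.
class_map = np.array([[0, 1, 2],
                      [3, 4, 4]])
weight_map = build_weight_map(class_map)
```

For a different dataset or task, only the `MATERIAL_WEIGHTS` table would need recalibrating, in line with the caveat above.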

Furthermore, the spectral feature mapping stage serves as a critical link in the transition from visible to infrared light. Its core objective is to establish a spectral mapping relationship between visible and infrared images based on the differences in their radiative energy spectra, utilizing measured multimodal data and the spectral radiative characteristics of ground objects. This stage provides reliable physical radiative constraints for the subsequent intelligent network generation process. The quantitative characterization model for $Data_{MAP}(k, \lambda_i, t_i, \alpha_{ki}, m, n)$ is shown below:

$$Data_{MAP}(k, \lambda_i, t_i, \alpha_{ki}, m, n) = \frac{\max L_{All}(\lambda_p, t_p, \alpha_{kp}) - \min L_{All}(\lambda_p, t_p, \alpha_{kp})}{L_{All}(k, \lambda_p, t_p, \alpha_{kp}, m_k, n_k) - \min L_{All}(k, \lambda_p, t_p, \alpha_{kp})} \cdot \frac{L_{All}(k, \lambda_q, t_q, \alpha_{kq}, m_k, n_k) - \min L_{All}(k, \lambda_q, t_q, \alpha_{kq})}{\max L_{All}(\lambda_q, t_q, \alpha_{kq}) - \min L_{All}(\lambda_q, t_q, \alpha_{kq})} \tag{1}$$

$$L_{All}(k, \lambda_i, t_i, \alpha_{ki}) = \left[ L_{bb}(k, \lambda_i, t_i) \cdot (1 - \alpha_{ki}) + \left( L_{Sun}(\lambda_i, t_i) + L_{Sky}(\lambda_i, t_i) \right) \cdot \alpha_{ki} \right] \times \tau_{Atm} + L_{Path}(\lambda_i, t_i) \tag{2}$$

where $k$ denotes the material category; $\lambda_p$ and $\lambda_q$ represent the visible and infrared bands, respectively; $\alpha_{ki}$ is the reflectance of material $k$ in the $\lambda_i$-band, which can be obtained from material spectral reflectivity databases (such as those maintained by the United States Geological Survey (USGS) and Johns Hopkins University (JHU)); $L_{All}(k, \lambda_i, t_i, \alpha_{ki}, m_k, n_k)$ is the total radiance of each pixel $(m_k, n_k)$ of material $k$ at time $t_i$ in the $\lambda_i$-band; $L_{All}(\lambda_p, t_p, \alpha_{kp})$ is the total radiance of the entire feature scene at time $t_p$ in the $\lambda_p$-band; $L_{All}(\lambda_q, t_q, \alpha_{kq})$ is the total radiance of the entire feature scene at time $t_q$ in the $\lambda_q$-band; $L_{bb}(k, \lambda_i, t_i)$ is the blackbody radiance at the same temperature as the material; $L_{Sun}(\lambda_i, t_i)$ is the solar radiance received in the target region; $L_{Sky}(\lambda_i, t_i)$ is the sky background radiance received in the target region; $\tau_{Atm}$ is the atmospheric transmittance; and $L_{Path}(\lambda_i, t_i)$ is the atmospheric path radiance.
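
To make Equations (1) and (2) concrete, the sketch below evaluates them with NumPy. Read this way, Equation (1) is the ratio of the min-max-normalized infrared radiance to the min-max-normalized visible radiance. Several details are assumptions of this sketch rather than statements from the paper: the scene-wide maxima and minima are taken over the whole image, the material-wise minima over the pixels selected by `mask_k`, and the small constant `eps` is added only to avoid division by zero.

```python
import numpy as np


def total_radiance(l_bb, l_sun, l_sky, alpha, tau_atm, l_path):
    """Equation (2): emitted plus reflected radiance, attenuated by the
    atmosphere, plus atmospheric path radiance."""
    return (l_bb * (1.0 - alpha) + (l_sun + l_sky) * alpha) * tau_atm + l_path


def data_map(l_vis, l_ir, mask_k, eps=1e-6):
    """Equation (1) evaluated on the pixels of one material k.

    l_vis, l_ir: (H, W) total-radiance images in the visible / infrared band.
    mask_k:      (H, W) boolean mask of the pixels belonging to material k.
    eps:         hypothetical stabiliser (not in the paper) against 0-division.
    """
    vis_range = l_vis.max() - l_vis.min()  # scene-wide range, visible band
    ir_range = l_ir.max() - l_ir.min()     # scene-wide range, infrared band
    vis_k, ir_k = l_vis[mask_k], l_ir[mask_k]
    out = np.zeros_like(l_vis, dtype=np.float64)
    out[mask_k] = (vis_range / (vis_k - vis_k.min() + eps)
                   * (ir_k - ir_k.min()) / (ir_range + eps))
    return out


# Toy scene with a single material, where the IR radiance is exactly twice
# the visible radiance, so the normalized ratio is close to 1.
l_vis = np.array([[1.0, 2.0], [3.0, 5.0]])
l_ir = np.array([[2.0, 4.0], [6.0, 10.0]])
mapping = data_map(l_vis, l_ir, mask_k=np.ones((2, 2), dtype=bool))
```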

The measured-data extraction stage anchors the authenticity of the texture. Valid single-channel grayscale distribution data are extracted from measured infrared images and used to supervise the training of the neural network model, providing a reference for the grayscale tone and realism of the generated images. This supervision applies only during training and does not extend to the inference stage.

Following the aforementioned steps, the SEMFE module ultimately outputs a three-channel multidimensional feature information image ($I_{MDF}$), whose three RGB channels each carry core information of a different dimension. Specifically, the R channel stores material weight data, which quantifies the weight distribution across different ground object material regions, serving as a priority constraint for the discriminator module. It guides the discriminator to focus on the feature conversion quality of high-value critical areas, thereby improving the information density and application value of the generated images. The G channel stores radiative spectral feature mapping data, guiding the adversarial game process to ensure the conversion accuracy from visible light to infrared in terms of physical radiation, and providing underlying support for the physical authenticity of the generated images. The B channel, meanwhile, stores grayscale information of measured infrared images as a reference benchmark, ensuring that the overall quality of the generated images aligns with the imaging characteristics of the measured infrared images.
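
The channel layout described above can be illustrated with a short NumPy sketch that stacks the three feature maps into one image. The function name `build_mdf` and the use of unscaled floating-point values are assumptions of this sketch; the paper does not state how the channels are quantized or stored.

```python
import numpy as np


def build_mdf(weight_map, spectral_map, ir_gray):
    """Stack three (H, W) maps into an (H, W, 3) image in R, G, B order.

    R: material weights, G: radiative spectral mapping, B: measured IR gray.
    """
    if not (weight_map.shape == spectral_map.shape == ir_gray.shape):
        raise ValueError("all three feature maps must share the same shape")
    stacked = np.stack([weight_map, spectral_map, ir_gray], axis=-1)
    return stacked.astype(np.float32)


# Toy 2x2 inputs (illustrative values only).
r = np.array([[0.3, 0.2], [0.5, 0.7]])
g = np.array([[1.0, 0.9], [1.1, 1.0]])
b = np.array([[0.4, 0.5], [0.6, 0.8]])
i_mdf = build_mdf(r, g, b)
```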

It is evident that multi-dimensional feature information images not only serve as pixel-level representations of target-band infrared images but also embody the quantitative characteristics of core regions in image generation tasks. More importantly, they act as a condensed reflection of the imaging modalities and spectral features inherent in both visible light and infrared remote sensing. As the core hub connecting physical imaging mechanisms and intelligent network models, this module provides the underlying data support and a constraint framework for enhancing the performance of generators and discriminators, as well as their adversarial iteration, thereby laying a critical foundation for constructing infrared remote sensing image generation models driven by both physical mechanisms and prior data.

2.4. Generator

Pix2Pix generators typically adopt the U-Net [51] architecture, which offers fundamental image-to-image conversion capabilities. However, when tackling remote sensing image conversion tasks involving diverse ground object types and complex scene environments, the comprehensiveness of their feature extraction and their ability to capture multi-scale information exhibit significant limitations, making it challenging to accurately focus on the features of key regions.

To address this issue, this paper designs and proposes a generator architecture based on the FASCRC_Unet3+, as illustrated in Figure 5. On the one hand, this module draws inspiration from the UNet3+ design philosophy: by adding multi-level skip connections between the encoder and decoder, it enables simultaneous image feature extraction across different scale levels. This allows the model to comprehensively integrate image information from local details to global spatial distribution, and such preservation of multi-scale contextual information effectively enhances the model’s robustness and generation accuracy, thereby ensuring the integrity of details and spatial information richness of the generated remote sensing images. On the other hand, the FASCRC module is embedded into the architecture to improve the generator’s modeling effectiveness for multi-level features and its capacity for learning frequency-domain characteristics.

As shown in Figure 6, the FASCRC module builds upon spatial and channel reconstruction convolution (SCConv) [52]. By simultaneously modeling feature correlations across spatial and channel dimensions, it achieves the collaborative reconstruction of dual-dimensional information, thereby solidifying the foundational capability of feature representation. Furthermore, the introduction of fast Fourier transform (FFT) [53] feature skip connections specifically enhances the network’s focus on high-frequency image details, boosts the model’s tolerance to noise and interference, and drives the generator to output clearer and more accurate remote sensing images.

It is evident that integrating the UNet3+ architecture with the FASCRC module in an embedded and coupled manner for Vis-to-IR remote sensing image generation tasks facilitates the comprehensive acquisition of multi-dimensional image features, thereby significantly improving the quality and fidelity of the generated infrared remote sensing images.

2.5. Discriminator

In GANs, the generator and discriminator maintain a symbiotic relationship of co-evolution. The core function of the discriminator is to quantify the similarity between the generated images and real images. By outputting probabilistic feedback, it guides the generator to dynamically adjust its generation strategy, ultimately achieving iterative improvements in the generation quality.

The PatchGAN [46] discriminator adopted by the Pix2Pix model averages the authenticity judgments of all image patches to output a single authenticity determination. This “treat-all-equally” mechanism does not direct the model’s attention toward core objectives. Specifically, it does not prioritize high-value regional features in remote sensing images, which may cause the model’s iterative direction to deviate from expectations.

To address the core requirements for high-value region feature extraction and conversion accuracy in specific generation tasks within complex remote sensing scenarios, this paper proposes the FFD_PMW method, whose architecture is illustrated in Figure 7.

The FFD_PMW applies proactive weighting constraints to the discriminative mechanism using the material weight data from $I_{MDF}$, guiding the discriminator to focus on high-value ground object regions. It enhances the discriminator’s ability to learn multi-scale high-value ground object features and improves its discriminative accuracy via multi-scale fusion convolution. Meanwhile, it weakens the focus on non-priority regions, thereby boosting the model’s iterative optimization efficiency. The core optimized discriminative mechanism of the FFD_PMW is shown in Equation (3).

$$\hat{P}_i = I_{R\_weight} \cdot \min\left( \max\left( \left( 1 - \chi \left( \tfrac{1}{2} - P_i \right) \right) \cdot P_i,\ 0 \right),\ 1 \right) + \left( 1 - I_{R\_weight} \right) \cdot P_i (1 - \chi) \tag{3}$$

where $\hat{P}_i$ represents the feature-weighted discrimination result of the discriminator for the $i$-th paired image group; $P_i$ denotes the initial discrimination result (probability value) of the discriminator for the $i$-th paired image group; $I_{R\_weight}$ is the feature material weight image, derived from the R channel of $I_{MDF}$, which quantifies the discriminator’s active attention to each feature material through $\sigma_k \in (0, 1]$, with higher values indicating greater attention; and $\chi \in (0, 1]$ is the adjustment intensity coefficient. This design overcomes the limitation of “uniform processing” in traditional discriminators, enhancing the model’s ability to represent details in critical regions and improving the scene adaptability and information validity of the generative model.
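
The weighting rule in Equation (3) can be sketched element-wise with NumPy as follows. This is an illustration under stated assumptions, not the paper’s implementation: the weight image is assumed to have already been resampled to the same grid as the discriminator output, and the function name `weighted_scores` and the example value of $\chi$ are hypothetical.

```python
import numpy as np


def weighted_scores(p, r_weight, chi):
    """Equation (3): material-weighted discriminator scores.

    p:        initial discriminator probabilities in [0, 1].
    r_weight: material weights in (0, 1], same shape as p (assumed to be
              resampled to the discriminator output grid beforehand).
    chi:      adjustment intensity coefficient in (0, 1].
    """
    # Weighted branch: rescale p by (1 - chi * (1/2 - p)), then clamp to [0, 1].
    adjusted = np.clip((1.0 - chi * (0.5 - p)) * p, 0.0, 1.0)
    # Unweighted branch: uniformly damp the score by (1 - chi).
    damped = p * (1.0 - chi)
    return r_weight * adjusted + (1.0 - r_weight) * damped


p = np.array([0.5, 0.5, 0.8])
w = np.array([1.0, 0.0, 0.7])
scores = weighted_scores(p, w, chi=0.5)
```

With full weight (`w = 1`) a neutral score of 0.5 is left unchanged, whereas with zero weight it is damped to 0.25, which reflects how low-priority regions contribute less to the discriminator feedback.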

2.6. Collaborative Loss Module

For the Vis-to-IR remote sensing image conversion task, this paper addresses the core limitations of traditional physical modeling and simulation methods (low fidelity and simulation efficiency) and the inherent lack of interpretability and weak generalization capabilities of deep learning generation methods. Taking the SEMFE module as the link, this approach establishes a bridge connecting physical imaging mechanisms with the interpretability of intelligent models, and constructs the $Loss_{CO}$ based on multidimensional feature information, whose structural composition is illustrated in Figure 8.

The $Loss_{CO}$ is calculated as the weighted sum of the losses from each component, as given in Equation (4):

$$Loss_{CO} = c_1 L_{GD} + c_2 Loss_{Wei\_IR} + c_3 Loss_{GP} + c_4 Loss_{FSM} \tag{4}$$

$$Loss_{Wei\_IR} = \begin{cases} \frac{1}{2N} \sum_{i=1}^{N} \left[ 1 - I_{R\_weight} (\delta - \gamma) \right] \cdot \gamma^{2}, & \text{if } |\gamma| < 1 \\ \frac{1}{N} \sum_{i=1}^{N} \left[ 1 - I_{R\_weight} (\delta - \gamma) \right] \cdot |\gamma| - \frac{1}{2}, & \text{otherwise} \end{cases} \tag{5}$$

$$\gamma = I_{G\_mapping} \cdot I_{input} - G(I_{input}) \tag{6}$$

where $L_{GD}$ denotes the base loss function of Pix2Pix, comprising the adversarial loss ($L_{GAN}(G, D)$) and the conditional loss ($L_{L1}(G)$); $Loss_{GP}$ and $Loss_{FSM}$ denote the gradient penalty loss and the feature space matching loss, respectively [47]; and $c_1, c_2, c_3, c_4$ are the weights of the respective loss terms. The incorporation of $Loss_{GP}$ and $Loss_{FSM}$ further enhances the stability of the training process and improves the model’s ability to extract and transform image texture features and spatial structures. The object-weighted radiometric spectral characteristic loss ($Loss_{Wei\_IR}$), shown in Equation (5), represents the quantitative physical constraints imposed by scene element radiometric spectral properties on Vis-to-IR remote sensing image generation. In Equation (6), $I_{G\_mapping}$ denotes the G channel of $I_{MDF}$, which stores $Data_{MAP}(k, \lambda_i, t_i, \alpha_{ki}, m, n)$ and signifies the mapping relationship between scene radiometric spectral characteristics from the input image to the target image. The $\delta$ represents the pixel difference threshold, determined based on the image grayscale quantization bit depth.
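
As a hedged illustration of Equations (4) to (6), the NumPy sketch below evaluates the weighted radiometric loss per pixel and then forms the weighted sum. The per-pixel reading of the piecewise rule in Equation (5), with the $-\frac{1}{2}$ term applied inside the mean as in a smooth-L1 loss, is an assumption of this sketch, as are the function names and the toy values; the paper does not give an implementation.

```python
import numpy as np


def loss_wei_ir(i_input, generated, g_mapping, r_weight, delta):
    """Equations (5) and (6), evaluated per pixel and averaged over N pixels."""
    gamma = g_mapping * i_input - generated        # Equation (6)
    weight = 1.0 - r_weight * (delta - gamma)      # bracketed factor in (5)
    abs_gamma = np.abs(gamma)
    per_pixel = np.where(
        abs_gamma < 1.0,
        0.5 * weight * gamma ** 2,                 # quadratic branch
        weight * abs_gamma - 0.5,                  # linear branch
    )
    return float(per_pixel.mean())


def loss_co(l_gd, l_wei_ir, l_gp, l_fsm, c):
    """Equation (4): weighted sum of the four loss terms, c = (c1, c2, c3, c4)."""
    c1, c2, c3, c4 = c
    return c1 * l_gd + c2 * l_wei_ir + c3 * l_gp + c4 * l_fsm


# Toy single-pixel examples (illustrative values only).
small = loss_wei_ir(np.array([[0.5]]), np.array([[0.3]]),
                    np.array([[1.0]]), np.array([[0.5]]), delta=0.1)
large = loss_wei_ir(np.array([[2.5]]), np.array([[0.5]]),
                    np.array([[1.0]]), np.array([[0.0]]), delta=0.1)
total = loss_co(1.0, 2.0, 3.0, 4.0, c=(1.0, 1.0, 1.0, 1.0))
```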

As shown in Figure 8, the $I_{MDF}$ provides effective multidimensional data support for the $Loss_{CO}$. Specifically, the ground object material-weighted modulation based on the R channel enhances the model’s feature learning accuracy for key regions; the radiative spectral characteristic loss based on the G channel ensures the radiometric rationality of the Vis-to-IR spectral conversion; and the pixel grayscale loss based on the B channel guarantees grayscale consistency between the generated images and the measured infrared images.

In summary, $Loss_{CO}$ guides the generative-adversarial training through the joint optimization of multidimensional losses. By integrating the data-driven paradigm with imaging physics, it overcomes the limitations of traditional physical simulation methods in modeling complex scenarios. By imposing physical constraints on the generated results, it suppresses non-physical outputs and enforces physical consistency, which improves the simulation accuracy and operational efficiency of the model and provides a new solution for the practical application of intelligent fusion generation methods for infrared remote sensing images.

3. Results

3.1. Experimental Setup

The experiments were run on Windows 10 with an Intel(R) Core(TM) i7-10700F CPU, an NVIDIA GeForce RTX 4070 GPU, 16 GB of RAM, Python 3.8, CUDA 11.1, and cuDNN 8.9.7. The batch size was set to 1, and the Adam optimizer was used with its two momentum parameters set to 0.5 and 0.999, respectively. The network was trained for a total of 200 epochs: the learning rate was held at 0.0002 for the first 100 epochs and then decayed linearly from 0.0002 to 0 over the subsequent 100 epochs. Based on the differences in numerical magnitude among the loss terms and the need to strengthen the constraints on pixel-level accuracy and infrared physical characteristics, the weight coefficients for $L_{L1}(G)$ and $Loss_{Wei\_IR}$ were set to 100, while those for the remaining losses were set to 10. When the weight coefficients for $L_{L1}(G)$ and $Loss_{Wei\_IR}$ varied within the range of 80–120 and those for the remaining losses varied within the range of 5–15, the Fréchet Inception Distance (FID) fluctuated by less than 5%, the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) remained stable, and model performance showed no significant degradation, indicating that the adopted loss configuration is relatively insensitive to the exact weight values.
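The learning-rate schedule described above (constant for 100 epochs, then linear decay to zero over the next 100) can be sketched as follows. This is an illustrative sketch rather than the authors' training code: the function name `learning_rate` and the 1-indexed epoch convention, under which epoch 200 is the one that reaches 0, are assumptions.

```python
def learning_rate(epoch, base_lr=2e-4, constant_epochs=100, decay_epochs=100):
    """Learning rate for a 1-indexed epoch: constant, then linear decay to 0."""
    last_epoch = constant_epochs + decay_epochs
    if epoch < 1 or epoch > last_epoch:
        raise ValueError("epoch out of range")
    if epoch <= constant_epochs:
        return base_lr
    # Remaining fraction of the decay phase (assumption: zero at the last epoch).
    return base_lr * (last_epoch - epoch) / decay_epochs


schedule = [learning_rate(e) for e in range(1, 201)]
```

With these defaults, `schedule[0]` and `schedule[99]` equal 0.0002, the value halves by epoch 150, and the final epoch uses a rate of 0.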

3.2. Datasets

To train the model and evaluate its effectiveness and generalization capability for infrared remote sensing image generation tasks, Landsat 8 [54] and Sentinel-2 [55] satellite data were employed as experimental data sources. Diverse data covering different imaging times, geographic locations, and landform features were randomly downloaded. Through preprocessing, a training set of 7200 image sets and an independent test set of 800 image sets were constructed, with each image sized 256 × 256 pixels. The Landsat 8 portions of the training and test sets account for three-quarters of the total dataset, with SWIR (1.566–1.651 μm), MWIR (3.0–5.0 μm), and LWIR (10.6–11.19 μm) data in equal proportions. Except for the LWIR images, which have a resolution of 100 m, all other band images have a resolution of 30 m. The Sentinel-2 portion includes only Vis-SWIR paired data, with an image resolution of 10 m. As shown in Figure 9, the dataset covers diverse terrain scenarios, such as ports, roads, airports, urban buildings, cultivated land and vegetation, mountains and rivers, and the Gobi desert.

3.3. Evaluation Indicators

This study has established a comprehensive multidimensional evaluation system, in which the evaluation criteria and calculation formulas for each indicator are elaborated below.

Peak Signal-to-Noise Ratio (PSNR) [56]

A pixel-level metric that quantifies the difference between the target image ($x$) and the generated image ($y$). A higher value indicates closer agreement between the pixel values of $x$ and those of $y$.

PSNR = 10 \times \log_{10} \left( \frac{MAX_{pix}^{2}}{\frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} ‖x(i, j) - y(i, j)‖^{2}} \right)

(7)
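As a worked illustration of Equation (7), the sketch below computes PSNR for two small images stored as lists of rows. It assumes 8-bit data, so $MAX_{pix}$ defaults to 255, and it treats $m \times n$ as the image size; the function name `psnr` is chosen here for illustration only.

```python
import math


def psnr(x, y, max_pix=255.0):
    """PSNR in dB between two equally sized 2-D images (lists of rows)."""
    m, n = len(x), len(x[0])
    mse = sum(
        (x[i][j] - y[i][j]) ** 2 for i in range(m) for j in range(n)
    ) / (m * n)
    if mse == 0:
        return math.inf  # identical images: the ratio is unbounded
    return 10.0 * math.log10(max_pix ** 2 / mse)


target = [[0, 0], [0, 0]]
generated = [[10, 10], [10, 10]]
value = psnr(target, generated)  # mean squared error is 100 here
```

Because the denominator is the mean squared error, halving the per-pixel error raises PSNR by about 6 dB.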

Structural Similarity Index Measure (SSIM) [57]

Its core lies in characterizing the degree of structural and luminance matching between generated images and real target images. An increase in this metric directly corresponds to enhanced similarity in structural features between the two images.

SSIM(x, y) = L(x, y) \times C(x, y) \times S(x, y) = \frac{2 μ_{x} μ_{y} + C_{1}}{μ_{x}^{2} + μ_{y}^{2} + C_{1}} \times \frac{2 σ_{x} σ_{y} + C_{2}}{σ_{x}^{2} + σ_{y}^{2} + C_{2}} \times \frac{σ_{xy} + C_{3}}{σ_{x} σ_{y} + C_{3}}

(8)

where $μ_{x}, σ_{x}$ and $μ_{y}, σ_{y}$ represent the mean and standard deviation of $x$ and $y$, respectively; $σ_{xy}$ denotes the covariance of $x$ and $y$; and $C_{1}$, $C_{2}$, and $C_{3}$ are constants.
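The three factors of Equation (8) can be made concrete with a short sketch. Several choices below are assumptions rather than details given in the text: statistics are taken over the whole image instead of the sliding local windows used in common SSIM implementations, the constants follow the widely used convention $C_{1} = (0.01 \cdot MAX_{pix})^{2}$, $C_{2} = (0.03 \cdot MAX_{pix})^{2}$, $C_{3} = C_{2}/2$ from [57], and the name `ssim_global` is hypothetical.

```python
import math


def ssim_global(x, y, max_pix=255.0, k1=0.01, k2=0.03):
    """Three-term SSIM of Eq. (8) from whole-image statistics (flat lists)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    c1, c2 = (k1 * max_pix) ** 2, (k2 * max_pix) ** 2
    c3 = c2 / 2.0
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure


ramp = [10, 20, 30, 40]
same = ssim_global(ramp, ramp)            # identical images
flipped = ssim_global(ramp, ramp[::-1])   # same mean and spread, reversed structure
```

Reversing the ramp leaves the luminance and contrast terms at 1 but makes the covariance negative, so only the structure term drops; this is the behavior the third factor is meant to capture.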

Universal Quality Index (UQI) [58]

By integrating three core dimensions—luminance consistency, contrast matching, and structural similarity—it provides a comprehensive assessment of image similarity, reflecting the perceived quality of an image.

UQI(x, y) = \frac{4 μ_{x} μ_{y} σ_{xy}}{(μ_{x}^{2} + μ_{y}^{2}) (σ_{x}^{2} + σ_{y}^{2})}

(9)
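A small worked example of Equation (9) follows. As with the other sketches, it uses whole-image statistics on flat lists (published UQI implementations typically average over sliding windows), and the function name `uqi` is an illustrative choice.

```python
def uqi(x, y):
    """Universal Quality Index of Eq. (9) from whole-image statistics."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    denom = (mu_x ** 2 + mu_y ** 2) * (var_x + var_y)
    if denom == 0:
        raise ValueError("UQI is undefined for constant or all-zero images")
    return 4 * mu_x * mu_y * cov / denom


# y is x scaled by 2: structure is identical, luminance and contrast differ.
score = uqi([1, 2, 3, 4], [2, 4, 6, 8])
```

For this pair the means are 2.5 and 5, the variances 1.25 and 5, and the covariance 2.5, giving 125 / 195.3125 = 0.64; the index equals 1 only when the two images are identical.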

Fréchet Inception Distance (FID) [59]

A metric that quantifies the similarity between generated and real images by measuring the difference in feature space distributions derived from deep learning models.

FID = ‖μ_{r} - μ_{g}‖^{2} + Tr\left(Σ_{r} + Σ_{g} - 2 (Σ_{r} Σ_{g})^{\frac{1}{2}}\right)

(10)

where $μ_{r}$ and $μ_{g}$ are the mean vectors of the deep-network features extracted from $x$ and $y$, respectively; $Σ_{r}$ and $Σ_{g}$ are the feature covariance matrices of $x$ and $y$, respectively; and $Tr(\cdot)$ denotes the trace of a matrix.
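To show how the two terms of Equation (10) behave, the sketch below evaluates the Fréchet distance in the special case where both covariance matrices are diagonal. This is a deliberate simplification and an assumption of this example: the metric as normally reported uses full covariance matrices of Inception features and a matrix square root. In the diagonal case the trace term reduces to a sum of squared differences of per-dimension standard deviations. The name `frechet_distance_diag` is hypothetical.

```python
import math


def frechet_distance_diag(mu_r, var_r, mu_g, var_g):
    """Eq. (10) when both covariance matrices are diagonal.

    With diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) equals the
    sum over dimensions of (sqrt(var_r) - sqrt(var_g))^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Two-dimensional toy features: means differ by (3, 4), std devs by (1, 1).
d = frechet_distance_diag([0.0, 0.0], [1.0, 4.0], [3.0, 4.0], [4.0, 9.0])
```

Here the mean term contributes 25 and the covariance term 2, so the distance is 27; identical feature statistics give 0, which is why lower FID indicates closer distributions.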

Learned Perceptual Image Patch Similarity (LPIPS) [60]

A metric that evaluates the visual perceptual similarity of images by matching multi-layer feature responses of a deep convolutional neural network.

LPIPS(x, y) = \sum_{l} \frac{1}{N_{l}} ‖ϕ_{l}(x) - ϕ_{l}(y)‖_{2}^{2}

(11)

where $ϕ_{l}(x)$ and $ϕ_{l}(y)$ represent the features extracted from $x$ and $y$, respectively, at layer $l$ of the pre-trained network, and $N_{l}$ denotes the feature dimension of that layer.

3.4. Experiment Analysis

3.4.1. Ablation Experiment Analysis

To validate the independent effects and synergistic performance of the innovative strategies proposed in this paper, a series of ablation experiments were designed and conducted using the Landsat 8 visible-shortwave infrared remote sensing dataset as the experimental data source, with the Pix2Pix model serving as the baseline framework. The experimental group designs and results are presented in Table 1.

Through an analysis of the quantitative results of the ablation experiments in Table 1, the following core conclusions can be drawn:

The baseline group (Experiment 1) exhibited mediocre performance, with all evaluation metrics at relatively low levels, confirming the limitations of the original Pix2Pix model in shortwave infrared remote sensing image generation tasks.
Compared with the baseline group, the SSIM of the $Loss_{CO}$-incorporated group (Experiment 2) increased by 4.41%, with significant improvements also observed in PSNR, FID, and LPIPS. This finding indicates that incorporating $Loss_{CO}$ effectively enhances the accuracy and detail representation capability of feature mapping, thereby improving the realism and quality of the generated images, and further validates the practical value of fusing physical mechanisms with deep learning in infrared remote sensing image generation tasks.
The generator module optimization group (Experiment 3) achieved improvements across all metrics; specifically, the UQI rose by 4.75% compared to the baseline group. This demonstrates that embedding the FASCRC_Unet3+ module effectively strengthens the model’s ability to transform and generate key image features, significantly boosting the precision of feature learning.
For the discriminator module optimization group (Experiment 4), the SSIM was 4.03% higher than that of the baseline group. This reveals that the combined optimization strategy of FASCRC_Unet3+ and FFD_PMW exerts a positive gain effect on model performance, enhancing the model’s ability to learn complex image features and improving its feature representation adaptability in complex remote sensing scenes.
The SEMFE module contribution group (Experiment 5) achieved a 3.67% improvement in the SSIM metric compared to the baseline group, while the FID metric decreased by 19.7653. The synergistic effect of the combination of $Loss_{Wei\_IR}$ and FFD_AMW directly reflects the SEMFE module’s contribution to model optimization. The results demonstrate that the SEMFE module effectively enhances the similarity between simulated and measured images, and that its gains are comparable to those of the combined generator and discriminator optimization strategy, validating the advantages of the SEMFE-based mechanism-data dual-driven strategy.
The full strategy fusion group (Experiment 6) saw SSIM and UQI improve by 10.67% and 9.59%, respectively, compared to the baseline group, with PSNR reaching 30.2602 dB, FID 83.0685, and LPIPS 0.1405.

These results confirm that the infrared remote sensing images generated by the proposed optimized model based on visible light images are highly similar to the measured infrared images in both pixel details and feature structures.

3.4.2. Subjective Experimental Analysis

To evaluate the performance of the proposed generative model from a visual perception perspective, this study adopts a unified experimental data configuration and selects CycleGAN, Pix2Pix, UGATIT [61], InfraGAN, HFIRSIGM_GRSMP, PID, and MDD-VIR as comparative models. All models take visible light remote sensing images as input and generate SWIR, MWIR, and LWIR remote sensing images, respectively. Examples of the subjective experimental results are illustrated in Figure 10.

Subjective evaluation analysis reveals that the selected representative comparative models generally exhibit common issues, such as missing detailed information and blurred edge contours, across different test datasets (Landsat 8, Sentinel-2) and infrared remote sensing image generation tasks for various bands. Among these models, images generated by CycleGAN and UGATIT are notably blurred with significant brightness deviations, leading to poor overall visual quality. Pix2Pix-generated images suffer from detail loss, tonal discrepancies, and texture distribution inconsistencies; InfraGAN can only partially enhance the structural expressiveness of the generated images; HFIRSIGM_GRSMP-generated images show high similarity to reference images in terms of overall texture distribution and image quality characteristics, yet closer inspection still reveals texture feature deficiencies; PID-generated images are consistent with the measured images in terms of overall structural distribution, but there are noticeable differences in pixel quality and visual clarity.

In contrast, the MDD-VIR-generated images achieve high alignment with the measured infrared remote sensing images across key dimensions—including color-texture distribution, edge contour integrity, and semantic information consistency—boasting excellent visual fidelity. More importantly, the MDD-VIR maintains stable and superior performance in the test generation of multi-band infrared remote sensing images (Vis-SWIR/MWIR/LWIR) derived from different satellite data and encompassing diverse terrain features, validating its robust effectiveness and generalization capability.

3.4.3. Objective Experimental Analysis

To validate the reliability of subjective experimental conclusions from an objective data perspective, a series of objective comparative verification experiments were designed for the visible-to-multi-band infrared remote sensing image generation task. The obtained experimental data are presented in Table 2, Table 3 and Table 4, respectively.

The objective experimental data demonstrate that, across infrared remote sensing image generation tasks involving different spectral bands, data sources, and terrain scenarios, images generated by MDD-VIR exhibit excellent performance in both pixel detail fidelity and structural feature consistency, significantly outperforming the other comparative models. Specifically, compared to the benchmark model Pix2Pix, MDD-VIR achieves a 10.44% improvement in the average SSIM metric for multi-band simulated images. These objective results support the accuracy, adaptability, and generalization capability of the MDD-VIR model.

To further quantitatively analyze the simulation performance of MDD-VIR from the perspective of physical consistency, this study verifies the radiation distribution similarity of the generated test samples. A unified quantitative calibration criterion is employed to establish the mapping relationship between image grayscale and radiant luminance. The radiant luminance distributions of the generated images and measured images are calculated separately, and the deviation between their radiation distributions is quantitatively compared.

Figure 11 illustrates a comparative example of the radiant energy distribution between the measured and generated images, with the calculated relative radiant errors for SWIR, MWIR, and LWIR test samples being 5.64%, 7.50%, and 8.19%, respectively. This result demonstrates that the infrared remote sensing images generated by MDD-VIR exhibit good physical consistency and environmental adaptability compared with measured images.

Meanwhile, when taking 100 sets of visible light remote sensing images as input, MDD-VIR requires only approximately 1.5 s to generate high-fidelity infrared remote sensing images of the desired bands. This represents a nearly thousandfold reduction in time consumption compared to traditional physical modeling and simulation methods, significantly reducing simulation costs and improving simulation efficiency, thus verifying the high efficiency of the proposed model.

4. Discussion

This study confirms that $Loss_{CO}$ and the FASCRC_Unet3+ module yield significant gains in the model’s overall performance. Specifically, $Loss_{CO}$ fully leverages the structured reconstruction of limited known data by SEMFE to deeply explore multi-dimensional image features. It strengthens regional pertinence using material weight data, establishes physical relationships for the Vis-to-IR conversion via radiative spectral feature mapping data, and anchors texture authenticity with measured infrared data. Through a collaborative optimization training mechanism, it effectively enhances the physical consistency and training efficiency of the simulation model. The FASCRC_Unet3+ module strengthens the generative network’s ability to focus on key image features and improves the generator’s efficiency in extracting frequency-domain features, providing core support for simulation accuracy. Additionally, the FFD_PMW module enhances the discriminator’s learning capability and discriminative precision for multi-scale high-value ground object features in remote sensing images, which further improves model performance.

Based on the ablation experiment results, MDD-VIR maximizes the quality and fidelity of generated images through the functional complementarity and synergistic interaction of its optimization modules, effectively boosting the model’s competitiveness in visible-to-multi-band infrared remote sensing image generation tasks. Furthermore, comprehensive comparative experiments utilizing diverse data from different satellites, various spectral bands, and diverse scene elements—alongside results from typical mainstream models—have fully demonstrated the robustness, reliability, and generalization capability of the MDD-VIR model.

To intuitively demonstrate the competitiveness of MDD-VIR in the task of Vis-to-IR remote sensing image generation, 100 sets of test results were selected to statistically analyze the proportion of samples—generated by CycleGAN, Pix2Pix, UGATIT, InfraGAN, HFIRSIGM_GRSMP, PID, and MDD-VIR—whose SSIM values fall into the intervals [0, 0.8), [0.8, 0.9), and [0.9, 1.0], respectively. The results are illustrated in Figure 12.
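The interval statistics described above amount to a simple binning of per-sample SSIM scores. The sketch below shows one way to compute the shares; the function name `ssim_interval_shares` and the sample values are illustrative assumptions and are not data from this study.

```python
def ssim_interval_shares(ssim_values):
    """Fraction of samples in [0, 0.8), [0.8, 0.9), and [0.9, 1.0]."""
    if not ssim_values:
        raise ValueError("at least one SSIM value is required")
    counts = {"[0, 0.8)": 0, "[0.8, 0.9)": 0, "[0.9, 1.0]": 0}
    for s in ssim_values:
        if not 0.0 <= s <= 1.0:
            raise ValueError("SSIM value outside the binned range [0, 1]")
        if s < 0.8:
            counts["[0, 0.8)"] += 1
        elif s < 0.9:
            counts["[0.8, 0.9)"] += 1
        else:
            counts["[0.9, 1.0]"] += 1
    total = len(ssim_values)
    return {interval: c / total for interval, c in counts.items()}


# Made-up scores for five samples, for illustration only.
shares = ssim_interval_shares([0.5, 0.85, 0.95, 0.9, 1.0])
```

Note that the boundary value 0.9 falls in the top interval, matching the half-open intervals stated in the text.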

Notably, approximately 57% of the samples generated by MDD-VIR have an SSIM greater than 0.9, representing a significant improvement over all other comparative models. This further validates the simulation accuracy and stability of the MDD-VIR model, while also underscoring the value and significance of continuing this research despite the prior existence of the HFIRSIGM_GRSMP model.

5. Conclusions

Driven by the strong demand for infrared remote sensing image data in technologies such as digital simulation of long-range infrared combat scenarios, this paper addresses the core challenges limiting the fidelity of simulated images and the reliability of models. The SEMFE module is designed to establish a bridge linking physical information and the interpretability of intelligent models. This study further develops targeted collaborative optimization strategies (FASCRC_Unet3+, FFD_PMW, and $Loss_{CO}$) for each functional module of the image generation network, thereby constructing the MDD-VIR model. To validate the model performance, a total of 8000 image sets were used for multidimensional validation experiments. The results indicate that images generated by the MDD-VIR model achieve an average PSNR of 31.2144 dB, SSIM of 91.07%, UQI of 92.10%, FID of 80.3632, and LPIPS of 0.1433. These results confirm the high quality and high fidelity of the images generated by the proposed method. More importantly, the model integrates the interpretability advantage of physical modeling with the powerful feature extraction capability of deep generative models. It not only significantly improves the fidelity and reliability of simulated infrared remote sensing images but also enhances the model’s adaptability to image conversion tasks involving different scene elements and its generalization capability for Vis-to-multi-band (SWIR, MWIR, and LWIR) IR remote sensing image generation. This method overcomes the core contradictions of traditional approaches, achieving the multiple objectives of “robust physical consistency–high fidelity–high efficiency” and providing a novel solution for the practical implementation of intelligent computing fusion generation methods for visible-to-infrared remote sensing images.

Future research will focus on developing efficient, intelligent technologies for generating high-quality infrared remote sensing images that are adaptable to cross-regional, cross-sensor, and cross-seasonal conditions. Concurrently, in-depth validation studies will be conducted for downstream detection tasks to further advance the iteration and application of image generation technologies.

Author Contributions

Conceptualization, Y.L. and C.Z.; methodology, Y.L.; software, Y.L. and D.S.; validation, Y.L. and F.R.; formal analysis, Y.L. and X.W.; investigation, Y.L., D.S. and F.R.; resources, X.W.; data curation, Y.L., D.S. and F.R.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L., C.Z. and X.W.; visualization, Y.L. and D.S.; supervision, C.Z. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62475205) and the Fundamental Research Funds for the Central Universities (XJSJ24031, ZYTS25282).

Data Availability Statement

The data presented in this study are derived from the following publicly available sources: Landsat and Sentinel satellite imagery. These data are available on the Geospatial Data Cloud or the USGS website at https://www.gscloud.cn/ and https://earthexplorer.usgs.gov/.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Teng, Y.; Ren, H.; Hu, Y.; Dou, C. Land surface temperature retrieval from SDGSAT-1 thermal infrared spectrometer images: Algorithm and validation. Remote Sens. Environ. 2024, 315, 114412. [Google Scholar] [CrossRef]
Wang, D.; Cao, L.; Du, Y.; Xiong, H.; Ye, F.; Zhong, Y. Tow noise-resilient retrieval of land surface temperature and emissivity using airborne thermal infrared hyperspectral imagery. ISPRS J. Photogramm. Remote Sens. 2026, 231, 532–551. [Google Scholar] [CrossRef]
Kanneh, J.E.; Wang, J.; Li, C.; Ma, Y.; Li, S.; Zhong, D.; Wang, Z.; Kpalari, D.F.; Collela, M.B.E. Novel indices and multi-source data fusion for monitoring plant moisture stress in winter wheat fields. Sci. Rep. 2026, 16, 3836. [Google Scholar] [CrossRef]
Zhou, Z.; Majeed, Y.; Naranjo, G.D.; Gambacorta, E.M.T. Assessment for crop water stress with infrared thermal imagery in precision agriculture: A review and future prospects for deep learning applications. Comput. Electron. Agric. 2021, 182, 106019. [Google Scholar] [CrossRef]
Robinson, A.J.; Lesage, F.J.; Reilly, A.; McGranaghan, G.; Byrne, G.; O’Hegarty, R.; Kinnane, O. A new transient method for determining thermal properties of wall sections. Energy Build. 2017, 142, 139–146. [Google Scholar] [CrossRef]
Li, Z.L.; Tang, B.H.; Wu, H.; Zhao, W.; Duan, S.B.; Ren, H.Z.; Zhao, E.Y.; Tang, R.L.; Si, M.L.; Leng, P.; et al. Development and prospects of thermal infrared remote sensing. Natl. Remote Sens. Bull. 2025, 29, 1529–1550. [Google Scholar] [CrossRef]
Takodjou Wambo, J.D.; Nomo Negue, E.; Traore, M.; Asimow, P.D.; Ganno, S.; Beiranvand Pour, A.; Ngounouno, F.Y.; Nzenti, J.P. Integrating multispectral remote sensing and geological investigation for gold prospecting in the Borongo-Mborguene gold field, eastern Cameroon. Adv. Space Res. 2024, 74, 4574–4597. [Google Scholar] [CrossRef]
Scafutto, R.D.M.; Lievens, C.; Hecker, C.; van der Meer, F.D.; de Souza Filho, C.R. Detection of petroleum hydrocarbons in continental areas using airborne hyperspectral thermal infrared data (SEBASS). Remote Sens. Environ. 2021, 256, 112323. [Google Scholar] [CrossRef]
Sharif, S.S.; Banad, Y.M. Revolutionizing infrared detection in defense applications: A nanophotonic approach leveraging 2D materials for enhanced mid-IR absorption. In Proceedings of the 2024 IEEE Research and Applications of Photonics in Defense Conference (RAPID), Miramar Beach, FL, USA, 13–15 May 2024; pp. 1–2. [Google Scholar] [CrossRef]
Kumar, N.; Singh, P. Small and dim target detection in infrared imagery: A review, current techniques and future directions. Neurocomputing 2025, 630, 129640. [Google Scholar] [CrossRef]
Han, Z.; Zhang, Z.; Zhang, S.; Zhang, G.; Mei, S. Aerial visible-to-infrared image translation: Dataset, evaluation, and baseline. J. Remote Sens. 2023, 3, 96. [Google Scholar] [CrossRef]
Pu, R.; Bonafoni, S. Thermal infrared remote sensing data downscaling investigations: An overview on current status and perspectives. Remote Sens. Appl. Soc. Environ. 2023, 29, 100921. [Google Scholar] [CrossRef]
Wang, S.; Xu, W.; Guo, T. Advances in thermal infrared remote sensing technology for geothermal resource detection. Remote Sens. 2024, 16, 1690. [Google Scholar] [CrossRef]
Rogalski, A. Infrared detectors: Status and trends. Prog. Quantum Electron. 2003, 27, 59–210. [Google Scholar] [CrossRef]
Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36. [Google Scholar] [CrossRef]
Wang, Y. Research on Infrared Image Expansion Methods Based on Actual Measurement Data. Master’s Thesis, Xidian University, Xi’an, China, 2021. [Google Scholar]
Zhang, R.G. Simulation Research on Infrared Remote Sensing Images from Space-Based Platforms. Master’s Thesis, Harbin Institute of Technology, Harbin, China, 2022. [Google Scholar]
Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv 2015. [Google Scholar] [CrossRef]
Weng, L. Background Simulation Methods for Space-Based Infrared Imaging of Earth. Master’s Thesis, University of Electronic Science and Technology of China, Chengdu, China, 2024. [Google Scholar]
Zhang, X.; Huang, J.; Zhang, L. Any2RSI: Controllable remote sensing text-to-image generation via any control and enriched description. In Proceedings of the AAAI 2026, Singapore, 20–27 January 2026. [Google Scholar] [CrossRef]
Zhang, X.; Ma, J.; Wang, G.; Zhang, Q.; Zhang, H.; Zhang, L. Perceive-IR: Learning to perceive degradation better for all-in-one image restoration. IEEE Trans. Image Process. 2025, 35, 2018–2033. [Google Scholar] [CrossRef]
Chen, S.; Zhang, L.; Zhang, L. Cross-scope spatial-spectral information aggregation for hyperspectral image super-resolution. IEEE Trans. Image Process. 2024, 33, 5878–5891. [Google Scholar] [CrossRef]
Willers, C.J.; Willers, M.S.; Lapierre, F. Signature modelling and radiometric rendering equations in infrared scene simulation systems. In Proceedings of SPIE Security + Defence, Prague, Czech Republic, 6 October 2011; SPIE: Cergy-Pontoise, France, 2011; Volume 8187, p. 81870R. [Google Scholar] [CrossRef]
Li, M.; Xu, Z.; Xie, H.; Xing, Y. Infrared image generation method and detail modulation based on visible light images. Infrared Technol. 2018, 40, 34–38. Available online: http://hwjs.nvir.cn/en/article/id/hwjs201801007 (accessed on 25 May 2025).
Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014. [Google Scholar] [CrossRef]
Cao, X.; Zou, H.; Li, J.; Chen, H.; Ying, X.; He, S.; Wang, Y.; Pan, L. Multimodal image generation and fusion through content-style hybrid disentanglement. Knowl.-Based Syst. 2025, 330, 114597. [Google Scholar] [CrossRef]
Ma, D.; Xian, Y.; Su, J.; Li, S.; Li, B. Visible-to-infrared image translation based on an improved conditional generative adversarial nets. Acta Photonica Sin. 2023, 52, 0410003. [Google Scholar] [CrossRef]
Wang, S.; Sun, G.; Dong, L.; Zheng, B. PAS-GAN: A GAN based on the pyramid across-scale module for visible-infrared image transformation. Infrared Phys. Technol. 2024, 139, 105314. [Google Scholar] [CrossRef]
Liu, M.Z. Research on Shore Target and Environmental Characteristics Based on Physical Material Rendering and Full-Path Infrared Imaging Simulation. Master’s Thesis, Nanjing University of Science and Technology, Nanjing, China, 2023. [Google Scholar]
Rankin-Parobek, D.; Salvaggio, C.; Gallagher, T.W.; Schott, J.R. Instrumentation and procedures for validation of synthetic infrared image generation models. In Proceedings of SPIE, Los Angeles, CA, USA, 19–20 January 1993; SPIE: Cergy-Pontoise, France; Volume 1762, pp. 584–600. [CrossRef]
Goodenough, A.A.; Brown, S.D. DIRSIG5: Next-generation remote sensing data and image simulation framework. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 4818–4833. [Google Scholar] [CrossRef]
RIT DIRS Laboratory. Digital Imaging and Remote Sensing Laboratory Fiscal Year 2024–2025 Annual Report; College of Science, Rochester Institute of Technology: Rochester, NY, USA, 2025; Available online: https://www.rit.edu/dirs/DIRS_Annual_Report_24-25 (accessed on 3 January 2026).
Yang, X.; Liu, H.; Lv, M.; Shi, L.; Weng, L.; Hu, L.; Li, Q. AGMD-GAN: Attention-based generator with multi-scaled feature extraction discriminator for unpaired visible to infrared image translation. Infrared Phys. Technol. 2026, 152, 106266. [Google Scholar] [CrossRef]
Özkanoğlu, M.A.; Ozer, S. InfraGAN: A GAN architecture to transfer visible images to infrared domain. Pattern Recognit. Lett. 2022, 155, 69–76. [Google Scholar] [CrossRef]
Ran, L.; Wang, L.; Wang, G.; Wang, P.; Zhang, Y. DiffV2IR: Visible-to-infrared diffusion model via vision-language understanding. arXiv 2025. [Google Scholar] [CrossRef]
Liu, Y.; Yue, J.; Xia, S.; Ghamisi, P.; Xie, W.; Fang, L. Diffusion models meet remote sensing: Principles, methods, and perspectives. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4708322. [Google Scholar] [CrossRef]
Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 105. [Google Scholar] [CrossRef]
Han, Y.; Han, A.; Huang, W.; Lu, C.; Zou, D. Can diffusion models learn hidden inter-feature rules behind images? arXiv 2025. [Google Scholar] [CrossRef]
Hu, X.; Liu, X.; Duan, Q.; Hong, D.; Jiang, L.; Yang, H.; Zhan, D. Recent advances in diffusion models for hyperspectral image processing and analysis: A review. arXiv 2025. [Google Scholar] [CrossRef]
Jia, X. Research on Infrared Image Generation Algorithm Based on GAN and Physical Model. Master’s Thesis, Beijing University of Posts and Telecommunications, Beijing, China, 2022. [Google Scholar]
Wang, D.-Y.; Bie, S.-H.; Chen, X.-H.; Yu, W.-K. Physics-driven generative adversarial networks empower single-pixel infrared hyperspectral imaging. arXiv 2023. [Google Scholar] [CrossRef]
Berman, O.; Oz, N.; Mendlovic, D.; Sochen, N.; Cohen, Y.; Klapp, I. PETIT-GAN: Physically enhanced thermal image-translating generative adversarial network. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1607–1616. [Google Scholar] [CrossRef]
Mao, F.; Mei, J.; Lu, S.; Liu, F.; Chen, L.; Zhao, F.; Hu, Y. PID: Physics-informed diffusion model for infrared image generation. Pattern Recognit. 2026, 169, 111816. [Google Scholar] [CrossRef]
Li, Y.; Wang, X.; Zhang, C.; Zhang, Z.; Ren, F. High-fidelity infrared remote sensing image generation method coupled with the global radiation scattering mechanism and Pix2PixGAN. Remote Sens. 2024, 16, 4350. [Google Scholar] [CrossRef]
Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014.
Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976.
Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251.
Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y. UNet 3+: A full-scale connected UNet for medical image segmentation. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059.
Li, G.W.; Shi, Z.G.; Zhang, Y. Image transformation technology based on generative adversarial networks. J. Terahertz Sci. Electron. Inf. Technol. 2021, 19, 724.
Ye, M.; Shi, C.; Hao, Y.; Li, D. Infrared image conversion technology based on improved pix2pix. Laser Infrared 2024, 54, 1157–1163.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9351, pp. 234–241.
Li, J.; Wen, Y.; He, L. SCConv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162.
Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301.
Roy, D.P.; Wulder, M.A.; Loveland, T.R.; Woodcock, C.E.; Allen, R.G.; Anderson, M.C.; Helder, D.; Irons, J.R.; Johnson, D.M.; Kennedy, R.; et al. Landsat-8: Science and product vision for terrestrial global change research. Remote Sens. Environ. 2014, 145, 154–172.
Drusch, M.; Del Bello, U.; Carlier, S.; Colin, O.; Fernandez, V.; Gascon, F.; Hoersch, B.; Isola, C.; Laberinti, P.; Martimort, P.; et al. Sentinel-2: ESA’s optical high-resolution mission for GMES operational services. Remote Sens. Environ. 2012, 120, 25–36.
Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical Recipes: The Art of Scientific Computing, 3rd ed.; Cambridge University Press: Cambridge, UK, 2007; pp. 28–735.
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84.
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv 2017.
Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
Kim, J.; Kim, M.; Kang, H.; Lee, K. U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv 2019.

Figure 1. Full-chain remote sensing imaging mechanism. The dashed line connecting to “IR” in the figure indicates that reflected radiation is not present across the entire infrared spectrum, but is primarily concentrated in the near-infrared and short-wave infrared regions.

Figure 2. Differences in Vis-IR remote sensing imaging.

Figure 3. Overall architecture of the proposed method. The temperature map is derived from satellite thermal infrared (10.60–11.19 μm) imagery that has undergone preprocessing steps such as geometric correction, radiometric calibration, and atmospheric correction. These data are used for the subsequent theoretical analysis and radiative calculations, thereby enabling the effective coupling of physical mechanisms with the deep generative model.

Figure 4. SEMFE schematic diagram.

Figure 5. Generator architecture.

Figure 6. FASCRC.

Figure 7. FFD_PMW.

Figure 8. Schematic diagram of the $Loss_{CO}$.

Figure 9. Sample dataset examples.

Figure 10. Generated examples of multi-band infrared remote sensing images: (a) Vis-to-SWIR Image Generation Results (Landsat 8); (b) Vis-to-SWIR Image Generation Results (Sentinel-2); (c) Vis-to-MWIR Image Generation Results (Landsat 8); (d) Vis-to-LWIR Image Generation Results (Landsat 8).

Figure 11. Comparison of radiance distributions: (a) SWIR-true image, (b) SWIR-generated image, (c) MWIR-true image, (d) MWIR-generated image, (e) LWIR-true image, and (f) LWIR-generated image.

Figure 12. Statistical comparison results of SSIM. An upward arrow indicates that larger values correspond to better image quality.

Table 1. Ablation experiment results.

Num	Experiment	PSNR↑	SSIM↑	UQI↑	FID↓	LPIPS↓
1	Pix2Pix (baseline)	24.3028	0.8095	0.8226	109.0354	0.2515
2	1 + $Loss_{CO}$	29.0275	0.8536	0.8812	87.9927	0.2029
3	1 + FASCRC_Unet3+	28.3216	0.8334	0.8701	90.6418	0.2058
4	3 + FFD_PMW	29.1084	0.8498	0.8755	88.3576	0.1797
5	1 + FFD_PMW + $Loss_{Wei\_IR}$	28.9516	0.8462	0.8793	89.1701	0.2164
6	MDD-VIR	30.2602	0.9162	0.9185	83.0685	0.1405

In the table, an upward arrow indicates that larger values correspond to better image quality, and a downward arrow indicates that smaller values correspond to better image quality. Boldface marks the best value obtained for each metric. The plus sign “+” indicates that the listed component is added to the model of the referenced experiment. For example, “Experiment (Number 4): 3 + FFD_PMW” means that Experiment 4 adds FFD_PMW to the model of Experiment 3.
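The arrow convention can be applied mechanically to read the ablation results as gains over the baseline. The following is a minimal sketch, not the authors' evaluation code: the values are transcribed from Table 1, while the names `ABLATION`, `HIGHER_IS_BETTER`, and `improvement_over_baseline` are hypothetical and introduced only for illustration. Differences are signed so that a positive number always means an improvement, whichever way the arrow points.

```python
# Hypothetical helper: signed improvement of each ablation variant over the baseline.
# Values are transcribed from Table 1; all names here are illustrative assumptions.
METRICS = ("PSNR", "SSIM", "UQI", "FID", "LPIPS")

# Upward arrow (True): larger is better. Downward arrow (False): smaller is better.
HIGHER_IS_BETTER = {"PSNR": True, "SSIM": True, "UQI": True, "FID": False, "LPIPS": False}

ABLATION = {
    "Pix2Pix (baseline)": (24.3028, 0.8095, 0.8226, 109.0354, 0.2515),
    "1 + Loss_CO": (29.0275, 0.8536, 0.8812, 87.9927, 0.2029),
    "1 + FASCRC_Unet3+": (28.3216, 0.8334, 0.8701, 90.6418, 0.2058),
    "3 + FFD_PMW": (29.1084, 0.8498, 0.8755, 88.3576, 0.1797),
    "1 + FFD_PMW + Loss_Wei_IR": (28.9516, 0.8462, 0.8793, 89.1701, 0.2164),
    "MDD-VIR": (30.2602, 0.9162, 0.9185, 83.0685, 0.1405),
}


def improvement_over_baseline(rows, baseline="Pix2Pix (baseline)"):
    """Return {variant: {metric: gain}}, where gain > 0 means better than the baseline."""
    base = dict(zip(METRICS, rows[baseline]))
    gains = {}
    for name, values in rows.items():
        if name == baseline:
            continue
        gains[name] = {
            metric: (value - base[metric]) if HIGHER_IS_BETTER[metric] else (base[metric] - value)
            for metric, value in zip(METRICS, values)
        }
    return gains


if __name__ == "__main__":
    for variant, row in improvement_over_baseline(ABLATION).items():
        summary = ", ".join(f"{metric} {gain:+.4f}" for metric, gain in row.items())
        print(f"{variant}: {summary}")
```

Under this sign convention, every variant in Table 1 shows a positive gain on all five metrics, and the full MDD-VIR configuration shows the largest gain on each.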

Table 2. Evaluation indicators results for SWIR.

Landsat 8	PSNR↑	SSIM↑	UQI↑	FID↓	LPIPS↓
CycleGAN	21.2918	0.7386	0.7741	114.9850	0.2834
Pix2Pix	24.3028	0.8095	0.8226	109.0354	0.2515
UGATIT	19.1432	0.6599	0.7169	138.3082	0.3712
InfraGAN	26.3268	0.8662	0.8673	111.7826	0.2798
HFIRSIGM_GRSMP	29.3215	0.8814	0.8706	88.1474	0.1726
PID	21.7167	0.7862	0.8091	102.0756	0.2359
MDD-VIR	30.2602	0.9162	0.9185	83.0685	0.1405
Sentinel-2	PSNR↑	SSIM↑	UQI↑	FID↓	LPIPS↓
CycleGAN	21.0985	0.7206	0.7298	132.5103	0.3698
Pix2Pix	26.6589	0.8218	0.8798	91.9159	0.2084
UGATIT	20.0148	0.6727	0.7180	139.0348	0.4507
InfraGAN	28.7611	0.8259	0.8329	130.2104	0.3826
HFIRSIGM_GRSMP	30.0163	0.8882	0.8813	85.4927	0.1714
PID	23.6948	0.8038	0.8435	94.3284	0.2482
MDD-VIR	31.8324	0.9073	0.9156	80.1642	0.1316

In the table, an upward arrow indicates that larger values correspond to better image quality, and a downward arrow indicates that smaller values correspond to better image quality. Boldface marks the best value obtained for each metric.

Table 3. Evaluation indicators results for MWIR.

Model	PSNR↑	SSIM↑	UQI↑	FID↓	LPIPS↓
CycleGAN	20.6425	0.7204	0.7812	138.5219	0.4299
Pix2Pix	25.1128	0.7745	0.8238	109.6057	0.2703
UGATIT	16.7854	0.6391	0.7506	144.9924	0.4435
InfraGAN	25.8432	0.7856	0.8304	109.2356	0.2918
HFIRSIGM_GRSMP	29.9417	0.8752	0.8293	83.2494	0.1886
PID	21.0254	0.7463	0.7990	112.4027	0.3015
MDD-VIR	31.6625	0.8991	0.9075	80.1126	0.1603

In the table, an upward arrow indicates that larger values correspond to better image quality, and a downward arrow indicates that smaller values correspond to better image quality. Boldface marks the best value obtained for each metric.

Table 4. Evaluation indicators results for LWIR.

Model	PSNR↑	SSIM↑	UQI↑	FID↓	LPIPS↓
CycleGAN	21.7352	0.7941	0.8026	114.1687	0.2911
Pix2Pix	26.4836	0.8192	0.8257	112.0376	0.2687
UGATIT	17.1358	0.6094	0.6773	150.1583	0.5842
InfraGAN	28.9641	0.8375	0.8529	99.8452	0.2719
HFIRSIGM_GRSMP	31.0156	0.9011	0.9134	80.1225	0.1603
PID	23.8043	0.8057	0.8121	117.0687	0.3348
MDD-VIR	31.1025	0.9203	0.9424	78.1073	0.1406

In the table, an upward arrow indicates that larger values correspond to better image quality, and a downward arrow indicates that smaller values correspond to better image quality. Boldface marks the best value obtained for each metric.
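The Highlights report an average SSIM of 91.07% for the generated images. As a consistency check, the sketch below averages the MDD-VIR SSIM entries of Tables 2–4. It assumes that the reported figure is the unweighted mean over the four test configurations (SWIR on Landsat 8, SWIR on Sentinel-2, MWIR, and LWIR), which this section does not state explicitly. The name `MDD_VIR_SSIM` is hypothetical.

```python
# SSIM of MDD-VIR transcribed from Tables 2-4.
# Assumption: the headline average is an unweighted mean over these four configurations.
MDD_VIR_SSIM = {
    "SWIR (Landsat 8)": 0.9162,
    "SWIR (Sentinel-2)": 0.9073,
    "MWIR": 0.8991,
    "LWIR": 0.9203,
}

mean_ssim = sum(MDD_VIR_SSIM.values()) / len(MDD_VIR_SSIM)
print(f"Mean SSIM over {len(MDD_VIR_SSIM)} configurations: {mean_ssim:.2%}")
```

Under that assumption the mean is 0.910725, which rounds to the 91.07% quoted in the Highlights.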
