Review

Infrared and Visible Image Fusion Techniques for UAVs: A Comprehensive Review

School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710129, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2025, 9(12), 811; https://doi.org/10.3390/drones9120811
Submission received: 16 October 2025 / Revised: 13 November 2025 / Accepted: 14 November 2025 / Published: 21 November 2025

Highlights

What are the main findings?
  • Identifies UAV-specific gaps in current IR–VIS fusion, including issues with small/fast targets, misalignment, and data bias.
  • Advocates for task-specific metrics and lightweight, alignment-aware fusion methods for resource-constrained UAV platforms.
What is the implication of the main finding?
  • Promotes task-guided fusion approaches to improve detection, segmentation, and tracking robustness in real-world UAV applications.
  • Highlights the need for UAV-specific benchmarks and cross-modal robustness to advance the field.

Abstract

Infrared–visible (IR–VIS) image fusion is becoming central to unmanned aerial vehicle (UAV) perception, enabling robust operation across day–night cycles, backlighting, haze or smoke, and large viewpoint or scale changes. However, for practical applications some challenges still remain: visible images are illumination-sensitive; infrared imagery suffers thermal crossover and weak texture; motion and parallax cause cross-modal misalignment; UAV scenes contain many small or fast targets; and onboard platforms face strict latency, power, and bandwidth budgets. Given these UAV-specific challenges and constraints, we provide a UAV-centric synthesis of IR–VIS fusion. We: (i) propose a taxonomy linking data compatibility, fusion mechanisms, and task adaptivity; (ii) critically review learning-based methods—including autoencoders, CNNs, GANs, Transformers, and emerging paradigms; (iii) compare explicit/implicit registration strategies and general-purpose fusion frameworks; and (iv) consolidate datasets and evaluation metrics to reveal UAV-specific gaps. We further identify open challenges in benchmarking, metrics, lightweight design, and integration with downstream detection, segmentation, and tracking, offering guidance for real-world deployment. A continuously updated bibliography and resources are provided and discussed in the main text.

1. Introduction

In recent years, rapid advances in artificial intelligence (AI) have accelerated the deployment of unmanned aerial vehicles (UAVs) for inspection, security, emergency response, and agroforestry remote sensing, owing to their reliable scene perception in complex environments [1]. The electromagnetic spectrum, ranging from radio waves to gamma rays, provides complementary channels for onboard sensors that support reliable scene understanding in AI-enabled UAVs. Put simply, the visible band spans about 400–700 nm (violet to red) and occupies only a small slice of the broader electromagnetic spectrum. Outside the visible band are the ultraviolet, infrared, microwave [2], and X-ray regions [3], whose differing frequencies and energies drive different matter interactions and hence complementary sensing and imaging properties [4,5]. In contemporary UAV perception, the principal spectral modalities include visible light, infrared, multispectral imaging, and microwave radar.
With AI and mission complexity advancing together, there is a pressing demand for UAVs to deliver high-reliability perception in real time under nighttime, backlighting, haze or smoke, and large-scale variation. Even with powerful models and quality imagery, reliance on a single modality is intrinsically suboptimal, restricting scene comprehension. The two principal bands used on UAVs exhibit complementary yet incomplete traits: visible imagery is acutely sensitive to illumination and weather, providing limited cues after dark or against backlight [6]; infrared imagery, though highlighting thermal contrast, is susceptible to background temperature fields and thermal crossover and often sacrifices detail and edge sharpness, with atmospheric transmission posing additional limits [7]. Low-altitude attitude perturbations, parallax variation, and abundant small targets exacerbate these modality-specific weaknesses. Consequently, multimodal fusion has emerged, taking advantage of cross-spectral integration to remedy the shortcomings of single-modality perception [8,9,10].
Among the available sensing sources, IR and VIS images represent the primary data forms that anchor multimodal UAV perception [11]. From an imaging standpoint, VIS records reflected visible-band radiance with fine detail, whereas IR senses emitted thermal radiation and is less affected by lighting, enhancing target visibility after dark, under backlighting, and in haze/smoke. From a fusion perspective, VIS anchors contextual detail and geometry, while IR delivers stable target saliency, together boosting environmental adaptability and robustness. Realizing their joint advantages requires explicit IR–VIS fusion, rather than isolated processing. A straightforward option runs two streams separately and merges detection boxes, masks, or confidences at the decision level. However, decision-level fusion, because it operates only on outputs, underutilizes cross-spectral correlations, complicates confidence calibration and synchronization, and often roughly doubles inference cost [12,13]. In contrast, fusion-map methods first synthesize a single or multichannel representation that couples IR saliency with VIS detail at the front end; this fused map can then be ingested by standard single-modality pipelines. In UAV deployments, it integrates cleanly with legacy pipelines, cuts redundant inference to save power, storage, and link capacity, and provides a unified fused input that serves detection, segmentation, and tracking while improving end-to-end latency and robustness [14,15].
Since 2018, propelled by the expressive function-approximation capacity of deep learning [16,17], fused-map IR–VIS image fusion has advanced rapidly. Compared with classical signal-processing pipelines, learning-based approaches generally yield higher perceptual quality, stronger robustness, and better computational efficiency, thereby attracting broad interest. The research emphasis has shifted from early designs centered on visual appeal—prioritizing contrast, detail, and texture for subjective clarity—toward two complementary priorities: data compatibility and task adaptivity. On the compatibility front, recent methods produce network-friendly outputs and reduce cross-sensor/domain discrepancies through misregistration-tolerant architectures, alignment modules, and cross-modal normalization. On the adaptivity front, models increasingly replace static weighting with content- and task-aware fusion mechanisms, leveraging cross-modal attention, frequency–spatial decoupling, semantic/detection distillation, and multi-task objectives to balance fine-detail fidelity and thermal saliency across operating conditions (day/night, haze/smoke, backlighting) and downstream tasks (detection, segmentation) [18,19].
Building on the foregoing, our survey provides a comprehensive review of recent deep learning approaches to infrared-visible (IR-VIS) fusion, with particular emphasis on UAV-centric methods oriented to perception tasks. Unlike prior surveys [20,21,22] that typically adopt a single-axis taxonomy or consider algorithmic families in isolation, we take an end-to-end perspective that explicitly links data compatibility, fusion mechanisms, and task adaptivity across the perception pipeline. Our scope is deliberately centered on task-driven fusion-map methods viewed through the UAV lens, distinguishing this work from earlier overviews that primarily cataloged traditional infrared and visible image fusion (IVIF) techniques or generic learning-based methods decoupled from UAV constraints. In summary, this paper presents a comprehensive survey of UAV-oriented IR–VIS image fusion; our main contributions are as follows:
  • UAV-Centric Fused-Map Taxonomy: To the best of our knowledge, this is the first UAV-oriented survey of IR–VIS fusion. We introduce a fused-map perspective and a unified taxonomy that link data compatibility, fusion mechanisms, and task adaptivity across tasks including detection and segmentation.
  • Systematic Analysis: We provide a structured synthesis of learning-based fused-map methods across AE/CNN/GAN/Transformer families, comparing architectural primitives, cross-spectral interaction strategies, and loss designs.
  • Research Directions: We offer evaluation guidance (datasets, metrics) and deployment recommendations under power, bandwidth, and latency constraints, and outline concrete directions including misregistration-aware and hardware-efficient fusion, robustness to cross-modality perturbations, scalable UAV benchmarks beyond urban scenes, and task-aligned metrics.

1.1. Scope

In this review, we primarily focus on the fusion of infrared and visible images and, where appropriate, selectively include approaches tailored to image fusion on unmanned aerial vehicle (UAV) platforms. As summarized in Table 1, the survey covers literature published from 2018 to 2025, drawing from top-tier journals (e.g., TPAMI, TIP, IJCV) and premier conferences (e.g., CVPR, ICCV, ICML). Beyond presenting the various technical approaches to image fusion, we also summarize commonly used datasets and outline future research directions.

1.2. Organization

The rest of the paper is organized as follows. Section 1 motivates infrared–visible (IR–VIS) image fusion for UAV perception, outlines the limitations of single-modality sensing, and surveys recent deep-learning advances. Section 2 reviews fusion methods—including autoencoder-based, CNN-based, GAN-based, and Transformer-based approaches, as well as emerging diffusion and Mamba models and task-driven fusion; we also discuss representative architectures and loss-function designs. Section 3 addresses data compatibility, catalogs commonly used aligned datasets, and compares explicit and implicit registration strategies alongside general-purpose fusion frameworks. Section 4 summarizes widely adopted benchmarks and evaluation metrics, covering both reference-based and no-reference metrics. Section 5 reports qualitative and quantitative comparisons on standard datasets and traces performance trends across paradigms. Section 6 highlights avenues for future work—including benchmark curation, improved metrics, lightweight and efficient design, and integration with downstream UAV tasks. Finally, Section 7 concludes the paper.

2. Methods for IVIF on UAV Platforms

2.1. Visual Enhancement Oriented Fusion

2.1.1. AE-Based Approaches

Autoencoder (AE)–based image fusion methods [24,25,26,27,66,67,68,69] typically follow a two-stage pipeline (Figure 1). In the first stage, an autoencoder is pre-trained: the encoder is optimized to learn reconstructible cross-modal representations, while the decoder acquires robust reconstruction capability. In the second stage, a fusion operation is inserted between the pre-trained encoder and decoder, after which the decoder reconstructs the fused output. Fusion can be performed using handcrafted/rule-based strategies, or implemented as a trainable module whose weights are learned in the second stage from infrared–visible paired data. Under this two-stage paradigm, existing AE methods can be grouped into two categories:
Rule-based Fusion: This line advances fusion heuristics and data-integration rules to improve the synthesis of multimodal features. A representative early design is DenseFuse [23], which pre-trains an encoder–decoder on natural images and then performs fusion between the frozen encoder and decoder via two handcrafted strategies: (i) direct addition of encoder feature maps, and (ii) an activity-driven mechanism that, at each feature-map spatial location, computes the channel-wise L1 norm to form an activity map, applies local averaging and a softmax to obtain weights, and linearly combines the modality features before the shared decoder reconstructs the fused image. The encoder employs a dense block to preserve multi-level cues, and training uses a combination of pixel loss and SSIM loss to stabilize reconstruction. This pipeline is clear and interpretable, but it relies on fixed heuristics for feature weighting. Building on this paradigm, DMRO-Fusion [30] retains the AE backbone yet upgrades the “rules” to be data-adaptive through two-level modulation: at the image level, a Global Coefficient Modulation (GCM) network blends Gaussian-filtered structure and texture components into an auxiliary modulation map; at the feature level, a Hyper-Prior (HYP) network predicts modulation coefficients that dynamically update features produced by a recurrent–octave convolution encoder (ROA) [70]. The octave decomposition and subsequent reassembly, together with recurrent connections, capture long- and cross-frequency dependencies. This progression preserves the staged AE recipe while shifting from static, handcrafted weighting toward learnable, modality-aware modulation, yielding more diverse fused outputs and stronger robustness.
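To make the activity-driven rule concrete, the sketch below implements the L1-norm weighting described above for a pair of encoder feature maps. It is a minimal illustration of the rule rather than DenseFuse's released code; the function name and the kernel size are our own choices.

```python
import torch
import torch.nn.functional as F

def l1_activity_fusion(feat_ir: torch.Tensor, feat_vis: torch.Tensor, ksize: int = 3) -> torch.Tensor:
    """Fuse two feature maps (B, C, H, W) with an L1-norm activity rule:
    channel-wise L1 norm -> local averaging -> softmax weights -> weighted sum."""
    # Channel-wise L1 norm gives a per-pixel activity map for each modality.
    act_ir = feat_ir.abs().sum(dim=1, keepdim=True)    # (B, 1, H, W)
    act_vis = feat_vis.abs().sum(dim=1, keepdim=True)
    # Local averaging smooths the activity maps.
    act_ir = F.avg_pool2d(act_ir, ksize, stride=1, padding=ksize // 2)
    act_vis = F.avg_pool2d(act_vis, ksize, stride=1, padding=ksize // 2)
    # Softmax across the two modalities yields per-pixel fusion weights.
    weights = torch.softmax(torch.cat([act_ir, act_vis], dim=1), dim=1)  # (B, 2, H, W)
    w_ir, w_vis = weights[:, 0:1], weights[:, 1:2]
    return w_ir * feat_ir + w_vis * feat_vis
```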
Structural Innovations: Moving beyond fixed fusion rules, recent AE-based IVIF studies redesign the encoder–fusion–decoder topology to better capture cross-scale structure, frequency components, and high-level semantics. Luo et al. [28] propose a full-scale hierarchical encoder–decoder with a triple-fusion mechanism, maximum edge-image fusion, single-scale shallow fusion, and full-scale semantic fusion via dual attention augmented by a full-scale non-local affinity module and a cascading edge-prior branch that injects fused edge cues into the decoder; an SSIM–intensity–edge loss further sharpens boundaries and contrast. Building on this idea, Liu et al. [29] split the autoencoder into two branches: an invertible high-frequency path (DWT+INN) that preserves and reconstructs fine details under a composite high-frequency objective comprising wavelet, content, and reconstruction terms, and a Transformer-based low-frequency path that models long-range context. The decoder integrates both to produce detail-rich, globally coherent fusion, with gains verified on detection and segmentation. Leveraging pretraining priors, MaeFuse [31] replaces the encoder with a pretrained MAE to extract omni-level features and introduces a guided training strategy that aligns the fusion layer with the MAE feature domain, mitigating ViT-block artifacts while retaining texture and semantics without coupling to downstream tasks. Finally, Wang et al. [32] reorganize the pipeline around task-aware supervision: a self-supervised cross-modal reconstruction phase learns a cross-modal adaptive decoder, followed by a weakly supervised, segmentation-guided cross mixer that mines complementary features to form a complete fused representation, thereby eschewing handcrafted fusion losses and improving both fusion quality and semantic segmentation [71,72].

2.1.2. CNN-Based Approaches

Figure 2 illustrates the basic structure of CNN-based IVIF approaches. The principal advantages of convolutional neural networks (CNNs) stem from the locality of the convolution operator and their translation equivariance [73], which jointly bias the model toward low-level primitives (e.g., edges, textures, gradients) and enable robust learning under small-sample regimes and noisy annotations [74,75]. In addition, architectural designs such as U-Net [76] and pyramid modules [17,77] effectively capture large-scale context while preserving high-frequency detail [26,78,79].
One class of approaches focuses on architectural optimizations designed to enhance the extraction and fusion of multimodal features [33,34,35,37,39]. To address scene distortion and information redundancy, Liu et al. [42] introduce a third class of scene-common features in addition to conventional infrared- and visible-specific features. These are stable, modality-invariant cues shared across images from different modalities and minimally dependent on specific imaging mechanisms. Using these cross-modal commonalities as anchors both preserves key structures and markedly alleviates scene distortion, while selectively superimposing each modality’s advantageous details as needed, thereby improving performance on downstream tasks. Zheng et al. [40] observe that, in the frequency domain, phase encodes structure/semantics while amplitude reflects style/appearance. Accordingly, FISCNet fuses infrared and visible images via phase summation with visible amplitude retention, enhancing infrared saliency and preserving visible textures; spatial and channel attention further refine details. Gao et al. [43] present HalVFusion for hazy scenes: SIRNet restores fogged visible textures; DFNet fuses with infrared while suppressing residual dehazing noise and emphasizing salient, high-texture cues; an RGB-space constraint enforces chromatic consistency with the dehazed output.
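As a toy illustration of the frequency-domain observation above (amplitude as appearance, phase as structure), the sketch below retains the visible amplitude spectrum and combines the two phase spectra by summation. This is one plausible reading of the description of FISCNet, not the authors' implementation, and the function name is ours.

```python
import numpy as np

def frequency_domain_fusion(ir: np.ndarray, vis: np.ndarray) -> np.ndarray:
    """Toy frequency-domain fusion for single-channel float images of equal size:
    keep the visible amplitude spectrum, combine the phase spectra, invert the FFT."""
    IR, VIS = np.fft.fft2(ir), np.fft.fft2(vis)
    amplitude = np.abs(VIS)               # style/appearance taken from the visible band
    phase = np.angle(IR) + np.angle(VIS)  # structure/semantics carried by the phases
    fused = np.fft.ifft2(amplitude * np.exp(1j * phase))
    return np.real(fused)
```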
Another strand of methods is loss-centric, guiding the fusion outcome by designing or refining loss formulations [36,38]. To accommodate divergent downstream-task requirements in infrared–visible image fusion, Wu et al. [41] propose AdFusion, a tunable framework that enables users to dynamically modulate the relative contributions of visible and infrared information by adjusting a global coefficient α. Xiao [44] proposes a lightweight, semantics-guided Mutually Reinforcing Network (SMR-Net) that couples infrared–visible image fusion with salient object detection (SOD). The model first fuses infrared and visible inputs to obtain semantics-focused fused features, which are then fed into an SOD module; the SOD loss is incorporated into the fusion objective, thereby encouraging the fused output to emphasize salient regions.
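A minimal sketch of the tunable-weighting idea is shown below: a single coefficient α trades off fidelity to the visible input against fidelity to the infrared input. The specific loss terms are illustrative placeholders, not AdFusion's actual objective.

```python
import torch
import torch.nn.functional as F

def tunable_fusion_loss(fused: torch.Tensor, ir: torch.Tensor, vis: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Illustrative alpha-weighted objective: alpha steers the fused image toward the
    visible input, (1 - alpha) toward the infrared input. Real methods typically add
    gradient/SSIM terms on top of these intensity terms."""
    return alpha * F.l1_loss(fused, vis) + (1.0 - alpha) * F.l1_loss(fused, ir)
```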

2.1.3. GAN-Based Approaches

As presented in Figure 3, the core merit of GAN-based methods lies in leveraging a generator–discriminator interplay with adversarial loss to produce visually realistic fusions that both resemble the visible modality and preserve infrared saliency and fine textures [80,81,82,83,84]. Moreover, these approaches typically obviate hand-crafted fusion rules and can be trained in an unsupervised paradigm. Existing GAN-based fusion methods can be grouped into two principal categories.
Single discrimination: Rao et al. [48] address the susceptibility of infrared–visible fusion to noise, extreme contrast, and redundant textures under adverse conditions by proposing AT-GAN, an adversarial fusion network. A structure/quality evaluation module assesses source-image reliability and dynamically modulates the relative contributions of infrared and visible cues, enabling a robust fusion balance in low illumination, strong light, smoke, and other degraded scenarios. Ref. [45] first trains a Texture-Conditional GAN (TC-GAN) to learn a combined texture map—conditioned on visible-image textures and optimized with adversarial and gradient losses—which then guides adaptive guided filtering to generate multiple decision maps for weighted fusion, thereby preserving infrared thermal targets while enhancing visible details and achieving superior subjective and objective results on TNO and RoadScene compared with numerous baselines.
Dual discrimination: Sui [49] proposes IG-GAN (Interactive Guided GAN) for visible–infrared fusion. The method decomposes fusion into a detail stream, guided by the modality with richer fine-grained detail, and a content stream, guided by the modality with more complete, environment-robust content—jointly preserving detail sharpness and content integrity. Two discriminators enable adversarial learning to generate high-fidelity fused images in an unsupervised manner. Gao et al. [47] propose DCDR-GAN, which disentangles high-level content features (e.g., contours, details, objects) from modality features (e.g., contrast, color/texture style), processes the two branches separately, and then recombines them to synthesize the fused image—thereby mitigating cross-modality interference from mutually exclusive characteristics. Zhou et al. [46] propose a semantics-guided dual-discriminator GAN (SDDGAN): an IQD module estimates per-object IR/visible contributions to form a weight map M, and two discriminators separately supervise IR- and VIS-favored regions, preserving thermal cues and visible textures while reducing cross-modality interference.

2.1.4. Transformer-Based Approaches

With the rapid adoption of Transformers in computer vision, their strong global modeling capacity and cross-modal interaction capability have been increasingly applied to infrared and visible image fusion (IVIF). In UAV applications, Transformer-based approaches provide powerful long-range dependency modeling and effective multi-modal integration under complex environments. As shown in Figure 4, self-attention modules replace traditional CNN blocks to build Transformer-based fusion frameworks. Unlike convolutional networks that rely on local receptive fields, Transformers compute global relationships among tokens via multi-head self-attention [85], enabling context-aware feature aggregation and dynamic receptive fields. Recent vision-oriented variants such as ViT [86], DeiT [87], Swin Transformer [88], and PVT [89] introduce hierarchical and multi-scale designs to enhance efficiency and scalability. Further improvements like MaxViT [90], DETR [91], and efficient attention mechanisms such as FlashAttention [92] demonstrate the Transformer’s adaptability across dense prediction and end-to-end perception tasks, establishing it as a robust backbone for UAV-based vision applications.
Some methods focus on cross-modal feature modeling. For example, ref. [57] leverages self-attention and cross-attention mechanisms to enhance both intra- and inter-modal complementarity. Ref. [61] further introduces frequency-aware and multi-scale feature extraction to improve robustness in complex and out-of-distribution scenarios. Similarly, ref. [55] employs spatial and channel-wise cross-modal Transformers to suppress redundancy and highlight complementary information, generating fused images with enhanced details and structure.
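The sketch below shows a generic cross-modal attention block of the kind these methods build on: infrared tokens attend to visible tokens and vice versa. It is a minimal PyTorch illustration with assumed dimensions and names, not the architecture of any specific method cited above.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Minimal cross-attention block: each modality queries the other, so infrared
    tokens aggregate complementary visible context and vice versa."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.ir_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_from_ir = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ir_tokens: torch.Tensor, vis_tokens: torch.Tensor):
        # ir_tokens, vis_tokens: (B, N, dim) flattened patch embeddings
        ir_ctx, _ = self.ir_from_vis(ir_tokens, vis_tokens, vis_tokens)   # Q = IR, K = V = VIS
        vis_ctx, _ = self.vis_from_ir(vis_tokens, ir_tokens, ir_tokens)   # Q = VIS, K = V = IR
        return ir_tokens + ir_ctx, vis_tokens + vis_ctx                   # residual update
```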
Other approaches emphasize hybrid designs that combine CNN and Transformers. IFT [50] proposes a Spatio-Transformer strategy, where CNN branches capture local details while Transformer branches model long-range dependencies for improved fusion. SwinFusion [51] exploits the shifted window mechanism to establish intra- and cross-domain dependencies, enabling efficient fusion across arbitrary image sizes, making it suitable for both general-purpose fusion and UAV-specific scenarios.
A further line of research introduces semantic guidance and task-driven fusion. MDDPFuse [59] employs ViT to extract multi-frequency features, combined with adaptive perception and semantic injection modules for task-aware fusion. PromptFusion [58] and Text-IF [56] incorporate priors from vision-language models (VLMs): the former leverages learnable prompt tokens with frequency-domain decomposition, while the latter uses CLIP-based text embeddings to modulate Transformer decoders, enabling semantic-driven, interactive, and degradation-aware fusion. These designs better support UAV downstream tasks such as detection and segmentation.
Finally, several methods integrate adversarial learning or invertible structures. TGFuse [53] combines Transformers with GANs, employing dual discriminators to enforce modality alignment and preservation of detail. YDTR [52] introduces a Y-shaped dynamic Transformer architecture to capture global dependencies in the backbone, balancing salient infrared targets with visible details. CDDFuse [54] proposes a dual-branch decomposition framework that applies Lite Transformers to low-frequency global features and invertible networks to high-frequency details, with correlation-driven losses to enhance cross-modal alignment.
Overall, Transformer-based approaches have emerged as a key direction in IVIF research. They not only preserve structural and detail fidelity but also improve semantic adaptability, showing strong potential for UAV-specific fusion tasks in diverse and challenging environments.

2.1.5. Other Approaches

Beyond the mainstream AE/CNN/GAN/Transformer families, an emerging body of IVIF research explores alternative modeling paradigms that better balance image quality, robustness, and compute under UAV constraints. These methods are often tailored to aerial idiosyncrasies (low illumination and haze, platform motion and residual misalignment, scale variation across altitude, and strict onboard latency/power budgets) and frequently couple fusion with restoration, registration, or detection. Two fast-maturing directions exemplify this trend: (i) diffusion-based generative formulations that inject strong priors via score matching to stabilize training and recover fine textures (with practical attention to accelerating iterative sampling), and (ii) Mamba formulations that replace quadratic attention with linear-time selective scanning to enable long-range cross-modal interaction at low memory and latency, with plug-and-play integration into detector backbones. The subsections below review representative methods along these two lines and summarize their supervision strategies, computational profiles, and deployment trade-offs for UAV applications.
Diffusion-Based: Recently, diffusion models have become a major paradigm in generative modeling, based on gradually adding Gaussian noise to data and learning to reverse the process to generate high-quality samples [93,94,95,96]. Compared with GANs, diffusion models exhibit superior training stability and detail fidelity, achieving state-of-the-art performance in image synthesis, editing, and cross-modal generation. Inspired by their success, researchers have begun applying them to infrared and visible image fusion (IVIF). A representative method, Diff-IF [63], integrates fusion priors with a conditional diffusion framework, where priors guide the forward process and high-quality fused images are generated in the reverse process, effectively mitigating the instability of GAN-based methods and enhancing fine-detail retention. Another work, Dif-Fusion [62], concatenates the infrared single channel with visible RGB channels as a four-channel input, learns cross-modal complementary features in the latent space through iterative noise injection and removal, and employs multi-channel gradient and intensity losses to ensure texture, structural, and color fidelity, thereby producing high-quality color fusion results. Overall, diffusion-based approaches demonstrate clear advantages in stability, fidelity, and color consistency, offering promising potential for UAV tasks such as surveillance, navigation, and target detection.
Mamba-Based: Recent studies introduce state-space models (Mamba) to IVIF for UAV perception, exploiting linear-time selective scanning to realize efficient cross-modal interaction. Fusion-Mamba [65] inserts a plug-and-play block at the last three backbone stages that (i) performs state-space channel swapping to prime RGB–IR correlation and (ii) applies dual state-space fusion with gated interaction to suppress spurious targets while preserving complementary cues; embedded in a YOLOv8-style detector, it attains state-of-the-art results on LLVIP, M3FD, DroneVehicle, and FLIR-Aligned with lower latency than Transformer-based fusion. Building on this, COMO [64] targets RGB–IR misalignment by confining Mamba-based cross-modal interaction to high-level (coarser) features, coupling a global–local selective scan with an offset-guided multiscale fusion that uses high-level cues to steer low-level aggregation; analyses of offset prevalence and experiments on DroneVehicle and allied benchmarks show consistent SOTA accuracy and real-time inference speed, underscoring the suitability of SSMs for onboard UAV deployment.

2.2. Task-Driven Fusion for UAV Perception

Unlike enhancement-oriented IVIF that optimizes for photographic quality alone, task-driven fusion explicitly shapes the fused representation to benefit onboard perception under UAV-specific constraints—rapid viewpoint changes, small/weak targets, illumination extremes, residual RGB–IR misalignment, and tight compute/power budgets. Recent evidence and analysis in the application-oriented literature show that coupling fusion with downstream objectives (e.g., detection/segmentation losses, object-aware priors, semantic injection, expert gating) consistently yields larger gains on perception metrics than optimizing only classical fusion scores [97]; moreover, robust pipelines must confront pixel-level misregistration and adverse conditions rather than assuming perfectly aligned inputs [98]. Consequently, we organize this subsection around the two core perception tasks emphasized by the survey—object detection and semantic segmentation—followed by other tasks (e.g., tracking), and we adopt evaluation that reflects end-task utility (mAP/mIoU) alongside traditional fusion metrics, while keeping an eye on lightweight, hardware-friendly designs necessary for UAV deployment [99].
Detection Oriented Fusion: Sun et al. [100] present DetFusion, which explicitly couples object detection with infrared–visible image fusion. Modality-specific detectors for IR and VIS supply attention maps; a shared attention stream is then injected into the fusion decoder to bias the network toward detector-identified object regions. Zhang et al. [101] propose QFDet, a quality-aware RGBT detector for drone-view tiny persons. A quality-aware learning scheme (SIWD + QAF with a Manhattaness prior) stabilizes supervision for tiny boxes, while a lightweight PreHead predicts per-modality quality maps to drive region-adaptive cross-modal enhancement, delivering consistent gains on UAV benchmarks. Fu et al. [102] introduce CF-Deformable DETR, an end-to-end, alignment-free RGBT detector that leverages cross-modal deformable attention together with a hyperbolic-space point-level consistency loss to learn point correspondences across modalities. The design maintains accuracy under RGB–IR misalignment without incurring explicit alignment overhead.
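A minimal sketch of the task-driven coupling idea follows: a conventional fusion objective is combined with a detection loss back-propagated from a detector applied to the fused image. The weighting and the specific fusion terms are illustrative assumptions, not the formulation of DetFusion, QFDet, or CF-Deformable DETR.

```python
import torch

def horizontal_gradient(x: torch.Tensor) -> torch.Tensor:
    """Absolute horizontal intensity differences (a crude gradient proxy)."""
    return torch.abs(x[..., :, 1:] - x[..., :, :-1])

def joint_fusion_detection_loss(fused, ir, vis, det_loss, w_fusion=1.0, w_det=0.5):
    """Illustrative task-driven objective: a generic fusion term (track the brighter
    intensity and the stronger gradient of the two sources) plus a weighted detection
    loss obtained by running a downstream detector on the fused image."""
    intensity_term = torch.mean(torch.abs(fused - torch.maximum(ir, vis)))
    gradient_term = torch.mean(torch.abs(
        horizontal_gradient(fused) - torch.maximum(horizontal_gradient(ir),
                                                   horizontal_gradient(vis))))
    return w_fusion * (intensity_term + gradient_term) + w_det * det_loss
```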
Segmentation Oriented Fusion: Jiang et al. [103] propose HSFusion, a high-level task–driven IR–VI fusion framework that bidirectionally transforms between semantic and geometric domains via two CycleGAN-based branches with non-shared weights for VIS and IR. Features from the reconstruction (geometry) path feed a segmentation-guided fusion network. During fusion, semantic masks from the segmentation output partition thermal targets from background, enabling the network to upweight IR features in foreground thermal regions and VIS features in background areas—preserving fine details while improving downstream segmentation. Dong et al. [104] develop EGFNet, which injects modality-agnostic prior edge maps (derived from RGB and thermal Sobel cues) into a unified fusion pipeline. A dedicated fusion module, complemented by global- and semantic-information modules, enables boundary-aware aggregation; multi-task deep supervision on edge and semantic maps sharpens contours and boosts urban scene parsing. Wang et al. [105] propose SGFNet, an asymmetric, semantic-guided fusion network in which a TIR branch provides stable semantic guidance (SGH) to steer multi-level fusion within the RGB encoder via coordination–distillation and cross-level enhancement units. An edge-aware Lawin-ASPP decoder further refines boundaries, yielding robust segmentation under adverse illumination.
Other Perception Tasks: Li et al. [106] first apply a salient object segmentation (SOS) network to the infrared image to obtain a binary foreground mask that highlights target regions. Guided by this mask, the method performs foreground/background-separated fusion of IR and VIS, followed by reconstruction. This design preserves thermal targets, suppresses IR background noise, and retains rich visible-texture details. Zhang et al. [107] recast tracking as task-driven fusion via cross-modality distillation (CMD). A teacher–student framework transfers modality-specific/common cues (SCFD) and multi-path fusion behaviors (MPSD) from a strong two-stream tracker to a lightweight single-stream model. A hard-focused response distillation mitigates distractors, achieving competitive accuracy on standard RGB–T benchmarks with a fraction of the parameters and computation, well aligned with UAV resource constraints.

2.3. Summary and Discussion

2.3.1. Architectures

In UAV scenarios, the architectural design of infrared and visible image fusion (IVIF) has evolved from traditional convolutional frameworks to Transformers, and further to semantics- and task-guided paradigms. These architectures not only pursue enhanced visual quality but also increasingly emphasize adaptability to complex environments and support for downstream tasks. Therefore, categorizing and comparing existing architectures helps clarify their technical trajectories and development trends. Current UAV-oriented IVIF architectures can be broadly grouped into several categories:
  • Autoencoder-based Encoder–Decoder Models: Early works [23,24,25] adopt a three-stage encoder–fusion–decoder framework. Features are extracted via CNN encoders, fused through simple addition or attention strategies, and reconstructed by decoders. These models are straightforward but limited in capturing long-range dependencies.
  • Transformer-driven Fusion Networks: Recent advances leverage Transformers for global modeling. CDDFuse [54] combines Lite Transformer and INN for low/high-frequency decomposition; SwinFusion [51] and IFT [50] adopt shifted-window and spatio-transformer modules to capture both local and global contexts; YDTR [52] and CrossFuse [57] further exploit cross-modal attention and dynamic Transformer units. These methods improve semantic consistency and long-range interaction, becoming the mainstream paradigm.
  • Semantics- and Language-guided Fusion: Emerging methods integrate high-level semantics or textual prompts. PromptFusion [58] and Text-IF [56] use CLIP-based semantic prompts to guide fusion; SMR-Net [44] introduces saliency detection as auxiliary supervision; MDDPFuse [59] and HaIVFusion [43] inject semantic priors or haze-recovery modules to enhance robustness in complex UAV scenarios. This reflects the trend of IVIF evolving from pure visual enhancement to semantic-driven integration.
  • GAN-based Approaches: GAN frameworks [46,48,49,53] employ adversarial training with dual discriminators, edge-aware constraints, or guided attention blocks. These methods enhance perceptual realism and target saliency, though stability and interpretability remain challenges.
  • Cross-modal Registration and Task-driven Fusion: Beyond generic fusion, some UAV-focused methods integrate registration or downstream tasks. MulFS-CAP [108] performs cross-modal alignment before fusion, Collaborative Fusion and Registration [109] jointly optimizes registration and fusion, while COMO [64] and Fusion-Mamba [65] embed fusion into detection pipelines. These works highlight the necessity of coupling fusion with UAV-specific applications.
Overall, IVIF architectures for UAV applications exhibit a trend of diversification and integration. From early encoder–decoder models, to Transformer-based global modeling, and further to semantic-driven and task-aware designs, these approaches have become increasingly complementary. They not only improve image quality but also demonstrate stronger potential in semantic consistency and task adaptability, thus laying a solid foundation for future UAV applications.

2.3.2. Loss Function

Beyond architecture design, the choice of loss functions plays an equally critical role in determining the performance of IVIF models. Different loss formulations often reflect varied emphases of researchers—some prioritize visual fidelity, others stress task adaptability, while still others focus on cross-modal alignment. Hence, summarizing and comparing these loss functions provides a comprehensive understanding of their role in UAV-specific fusion.
  • Reconstruction and Similarity Losses: Most autoencoder-based methods employ pixel-level losses (L2/L1) combined with structural similarity (SSIM), such as DenseFuse [23], RFN-Nest [26], and SEDRFuse [25]. DIDFuse [24] additionally introduces gradient-consistency constraints to enhance edge preservation.
  • Frequency- and Gradient-based Constraints: To better retain fine details, some methods add gradient or frequency terms into the loss. Examples include the Texture/Gradient Loss in SwinFusion [51], as well as wavelet/frequency consistency constraints in FISCNet [40]. These designs improve detail sharpness and texture fidelity.
  • Semantic- and Task-guided Losses: Semantic-driven approaches often incorporate semantic segmentation or saliency supervision. For example, MDDPFuse [59] introduces semantic injection losses, SMR-Net [44] adopts saliency detection losses, and Text-IF uses semantic modulation losses. Such designs ensure visual quality while improving performance in downstream tasks.
  • Adversarial and Feature-consistency Losses: GAN-based methods typically include adversarial objectives. For instance, TGFuse [53] employs feature differences from discriminators, while IG-GAN [49], AT-GAN [48], and SDDGAN [46] leverage dual discriminators or region-specific supervision. These are often combined with content, edge, or compatibility constraints to ensure fused images align with modality distributions while maintaining visual sharpness.
  • Alignment and Correlation Losses: To address cross-modal alignment, some works introduce explicit correlation losses. For example, CDDFuse [54] applies correlation-driven decomposition to enforce low-frequency sharing and high-frequency decorrelation; MulFS-CAP [108] employs relative/absolute local correlation losses; and Collaborative Fusion and Registration [109] incorporates cyclic consistency and smoothness regularization in the registration stage. These designs improve spatial alignment and modality disentanglement in fused results.
In summary, the evolution of loss function design in IVIF has moved beyond early pixel- and similarity-based formulations toward a more comprehensive and task-aware paradigm. Each class of loss contributes distinct advantages: frequency- and gradient-based terms enhance fine-grained texture reconstruction; semantic- and task-guided losses introduce high-level contextual awareness that bridges fusion and downstream understanding; adversarial and feature-consistency objectives improve perceptual realism and cross-modal coherence; and correlation-driven constraints explicitly address spatial misalignment and modality disentanglement. The interplay among these objectives reveals a broader trend—from purely visual optimization to joint optimization across perception, semantics, and geometry.
For UAV-oriented IVIF, this shift is particularly meaningful: UAV imagery often suffers from illumination variability, viewpoint distortion, and atmospheric interference, where simple reconstruction losses fail to capture complex modality relations. Hence, hybrid loss designs that integrate low-level fidelity with high-level semantic and structural consistency are proving more effective. Future research is expected to explore adaptive and data-dependent weighting strategies, contrastive or diffusion-based alignment losses, and self-supervised objectives that can dynamically adjust to diverse UAV scenarios, ultimately enabling more generalizable and robust fusion models.

3. Data Compatibility

Table 2 summarizes widely used aligned IR–VIS datasets, which are overwhelmingly ground-based and only partially representative of UAV imagery. Scene types span fixed/horizontal (TNO), driving (RoadScene, MS, MFNet), surveillance (LLVIP), and mixed viewpoints (VIFB, M3FD, FMB). Resolutions range from 640 × 480 to 1280 × 720 , and scales vary from small benchmarks (VIFB, 21 pairs) to large training corpora (LLVIP, 16,836; MS, 2999). Night/“challenge” coverage also differs. LLVIP is entirely low-light, while TNO and M3FD include curated difficult scenes—making them popular stress tests. FMB and MFNet emphasize pedestrian/traffic content. Prior surveys similarly conclude that current IVIF benchmarks are ground-centric and that data compatibility (e.g., misalignment) remains a bottleneck for practical use.
For UAVs, the main gaps are geometric and dynamic: altitude-driven scale changes, nadir/oblique views, depth-dependent parallax, platform jitter and rolling-shutter offsets, and atmospheric/thermal drift [110,111,112]. Models trained solely on Table 2 therefore tend to transfer poorly to aerial sorties. A practical recipe is to (i) pretrain on large aligned sets to learn low-level priors (LLVIP/MS/MFNet), (ii) stress-test robustness with TNO and M3FD challenge scenes, and (iii) narrow the domain gap via UAV-style augmentations, random homographies/parallax, temporal offsets, motion blur, scale/rotation jitter, small-target patch sampling, followed by fine-tuning on in-house UAV IR–VIS pairs when available.
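As a concrete illustration of the UAV-style augmentations listed above, the sketch below perturbs an aligned IR–VIS pair with a small random homography (simulating residual parallax and platform jitter) and applies motion blur to the visible image. All parameter values and names are placeholders chosen for illustration.

```python
import cv2
import numpy as np

def uav_style_augment(ir: np.ndarray, vis: np.ndarray, max_shift: int = 8,
                      max_blur: int = 7, rng=None):
    """Illustrative UAV-style augmentation for an aligned IR-VIS pair."""
    rng = rng or np.random.default_rng()
    h, w = ir.shape[:2]
    # Random homography: jitter the four image corners by a few pixels.
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
    H, _ = cv2.findHomography(src, dst)
    ir_warped = cv2.warpPerspective(ir, H, (w, h), borderMode=cv2.BORDER_REFLECT)
    # Linear motion-blur kernel of random (odd) length for the visible image.
    k = int(rng.integers(3, max_blur + 1)) | 1
    kernel = np.zeros((k, k), np.float32)
    kernel[k // 2, :] = 1.0 / k          # horizontal streak; rotate the kernel for other angles
    vis_blurred = cv2.filter2D(vis, -1, kernel)
    return ir_warped, vis_blurred
```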
Table 2. Characteristics of common aligned infrared–visible fusion datasets.

| Dataset | Image Pairs | Resolution | Camera Angle | Nighttime Pairs | Challenge Scenes |
|---|---|---|---|---|---|
| TNO [113] | 261 | 768 × 576 | horizontal | 65 | ✓ |
| RoadScene [17] | 221 | Various | driving | 122 | × |
| VIFB [114] | 21 | Various | multiple | 10 | × |
| MS [115] | 2999 | 768 × 576 | driving | 1139 | × |
| LLVIP [116] | 16,836 | 1280 × 720 | surveillance | all | × |
| M3FD [117] | 4200 | 1024 × 768 | multiple | 1671 | ✓ |
| MFNet [118] | 1569 | 640 × 480 | driving | 749 | × |
| FMB [119] | 1500 | 800 × 600 | multiple | 826 | × |

Here, ✓ indicates that the dataset contains challenge scenes, whereas × indicates that no challenge scenes are reported. All dataset download links are organized in our public repository on GitHub (https://github.com/JJLibra/IVIF-in-UAV-Imagery, accessed on 13 November 2025).

3.1. Registration-Free

In multimodal image fusion, varying imaging conditions can yield disparate observations of the same scene, so image registration is needed to map different modalities into a shared coordinate frame [120,121,122]. Approaches largely fall into two families: (i) Explicit registration, which directly estimates a geometric transform to warp the moving image onto a reference, offering strong interpretability, greater robustness to large parallax or severe misalignment, and convenient reuse of the estimated deformation for downstream tasks such as detection and segmentation; (ii) Implicit registration, which forgoes an explicit deformation field and instead aligns within the network’s feature space (i.e., align-while-fusing), typically improving robustness to cross-modal discrepancies and avoiding the interpolation artifacts introduced by explicit resampling.
Explicit registration: Yang et al. [109] proposed a joint registration fusion optimization framework: visible and infrared modalities are used as reciprocal references to generate one fused image per modality; cycle consistency is enforced by requiring that both fused results be consistent with the aligned ground truth; and the fusion loss is back-propagated to supervise the registration module. The overall objective is a weighted sum of the registration and fusion losses. UM-Fusion [123] first applies cross-modal image translation to homogenize the visible image to a pseudo-infrared counterpart. A multistage registration network (MRRN) then estimates a displacement field to perform explicit single-modality registration, followed by fusion. This pipeline substantially suppresses ghosting/artifacts and is particularly effective under medium-to-large misalignments. ReCoNet [124] integrates a deformation-correction module into the fusion network to explicitly compensate for geometric distortions and pairs it with attention to suppress residual artifacts; the approach targets efficient fusion under mild misregistration. SuperFusion [125] provides an integrated semantics-sensitive registration fusion framework. Its registration branch jointly estimates bidirectional deformation fields and leverages semantic constraints to improve alignment and the fidelity of target regions. The approach is intended for general-purpose scenarios. MURF [126] formulates registration and fusion as a closed-loop joint optimization: registration enhances fusion, and fusion, in turn, supervises registration. It is tailored for scenarios with pronounced misregistration and large cross-modal discrepancies. One-Stage Progressive Dense Registration (C-MPDR) [127] introduces a one-stage, progressive dense registration scheme that, within an end-to-end framework, estimates multiscale deformations in a coarse-to-fine manner while working with a fusion subnetwork, thus mitigating the error propagation characteristic of two-stage “register-then fuse” pipelines. The method is designed for generally misaligned inputs.
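The common building block of these explicit pipelines is warping the moving image onto the reference frame with an estimated dense displacement field. The sketch below shows this step in PyTorch, assuming a pixel-unit flow field produced by some registration network; it is illustrative and not taken from any of the cited methods.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(moving: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a 'moving' image (B, C, H, W) using a dense displacement field 'flow'
    (B, 2, H, W) in pixels, where channel 0 is the x offset and channel 1 the y offset.
    Bilinear resampling is done with grid_sample."""
    b, _, h, w = moving.shape
    # Base sampling grid in pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).to(device=moving.device, dtype=moving.dtype)  # (H, W, 2)
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)    # add per-pixel offsets -> (B, H, W, 2)
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * grid[..., 0] / (w - 1) - 1.0
    gy = 2.0 * grid[..., 1] / (h - 1) - 1.0
    norm_grid = torch.stack([gx, gy], dim=-1)
    return F.grid_sample(moving, norm_grid, mode="bilinear", align_corners=True)
```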
Implicit registration: Translation-Robust Fusion [98] introduces a dynamic feature-alignment plus re-aggregation strategy that primarily targets translational misalignment; during fusion, it adaptively estimates and compensates for cross-modal displacement to achieve alignment, thereby obviating the need for explicit pixel-level registration. The “Without Strict Registration” framework [128] performs cross-modal alignment in the feature space by coupling a CNN–Transformer Hybrid feature extractor—CNN–Transformer Hierarchical Interactive Embedding (CTHIE)—with a Dynamic Re-aggregation Feature Representation (DRFR) module. The design emphasizes that robust fusion can be achieved without strict pre-registration, and is thus well suited to scenarios with unknown or mixed forms of misalignment. Li et al. [108] likewise address the complexity and resource inefficiency of two-stage registration–fusion pipelines by proposing MulFS-CAP, a single-stage infrared–visible image fusion framework that dispenses with explicit registration by reformulating it as alignment-aware, feature-level rearrangement. Compared with conventional two-stage approaches, MulFS-CAP is more lightweight and delivers superior fusion quality.

3.2. General

General-purpose fusion seeks a single pixel-level model that transfers across heterogeneous families (e.g., infrared–visible, multi-exposure, multi-focus, medical, remote sensing), while preserving low-level structure (contrast, gradients, textures) and maintaining high-level utility for downstream perception. Recent advances converge along two complementary axes: (1) loss function design that stabilizes unsupervised optimization and encodes cross-modality consistency, and (2) architectural innovations that enable effective cross-domain interaction with deployment-friendly complexity [129].
Loss-function design: Task-unified, unsupervised objectives have become standard for general fusion. U2Fusion [17] formulates fusion as adaptive information preservation: the network estimates source importance and is trained to preserve measured similarity without task-specific ground truth, covering multimodal, multi-exposure, and multi-focus settings within a single model. PMGI [130] further unifies supervision via proportional maintenance of gradient and intensity, decoupling optimization into two complementary pathways governed by a uniform loss that transfers naturally across tasks. Adversarial formulations provide a general envelope: FusionGAN [131] encourages fused images with dominant infrared intensities and strengthened visible gradients through a generator–discriminator game, removing hand-crafted rules; GANMcC [132] extends this with multi-class discrimination to jointly align visible/infrared distributions and a content loss that balances contrast and detail by treating gradient and intensity as primary/auxiliary factors. Self-evolving/self-supervised objectives have also emerged. MUFusion [133] introduces a memory unit that leverages intermediate fused outputs to supervise subsequent iterations; its unified loss (content + memory with adaptive weights) improves training without external labels and transfers across four fusion families. DDBFusion [134] constructs a self-supervised dual decomposition objective, augmented with Bézier-curve perturbations, to filter redundant information and learn effective components prior to fusion, avoiding task-specific annotations. Finally, efficiency-aware learning couples measurement with objective weighting: a lightweight unified network [135] refines gradient/intensity cues and converts them into adaptive loss weights, improving fusion quality under strict budgets and validating utility for detection and segmentation.
Architectural innovations: Unified decoders paired with modality-aware encoders remain effective. UNFusion [77] adopts a multi-scale encoder–decoder with dense skip connections and $L_p$-normalized spatial/channel attentions to aggressively reuse features across scales, preserving bright thermal targets and rich textures. PMGI [130] realizes a two-path design (gradient vs. intensity) with path-wise transfer to exchange complementary cues, yielding a fast, unified backbone. Transformer-style cross-domain interaction strengthens long-range dependencies: SwinFusion [51] couples intra-domain self-attention with inter-domain cross-attention under a unified objective with SSIM-, texture-, and intensity-based terms, enabling global intensity control and detail preservation across modalities and photographic conditions. Beyond attention, recent works explore memory, decomposition, and efficient operators. MUFusion's [133] memory unit provides iterative self-guidance for feature reconstruction; DDBFusion [134] explicitly decomposes inputs into effective vs. redundant components via hierarchical dual decomposition prior to fusion. For deployment, lightweight pixel-level unified networks [135] emphasize hardware-friendly design while retaining unified learning. Most recently, Cao et al. [60] show that language prompts and semantic masks can serve as unifying guidance within an efficient recurrent (RWKV-like) backbone [136,137], introducing a multimodal fusion module that exchanges information without quadratic attention and reporting state-of-the-art results across diverse tracks.

4. Benchmark & Evaluation Metric

4.1. Benchmark

To comprehensively evaluate infrared and visible image fusion techniques for UAVs, diverse datasets have been established, covering scenarios such as urban monitoring, nighttime surveillance, and adverse weather conditions. These datasets differ in resolution, modality alignment, and annotation availability, thus providing tailored benchmarks for different research objectives. Representative image pairs sampled from these datasets highlight the diversity of conditions and scene complexity faced in UAV-based fusion tasks (Figure 5).

4.2. Evaluation Metric

To more effectively evaluate the image fusion quality achieved by different methods, this section summarizes seven commonly used fusion metrics, divided into two groups based on whether the source images are required: four source-referenced metrics (MI, VIF, SCD, and $Q^{AB/F}$) and three no-reference metrics (EN, SD, and SF).
Reference-based metrics: Mutual Information (MI) [138] quantifies how much information the fused image shares with the source images; higher MI indicates that more source information is transferred into the fusion result. Let A and B denote the two source images and F the fused result; the specific definition is as follows:
$$\mathrm{MI} = \mathrm{MI}_{A,F} + \mathrm{MI}_{B,F},$$
where $\mathrm{MI}_{X,F}$ ($X \in \{A, B\}$) quantifies the information transferred from source image $X$ to the fused image $F$, given by
$$\mathrm{MI}_{X,F} = \sum_{x}\sum_{f} P_{X,F}(x,f)\,\log \frac{P_{X,F}(x,f)}{P_X(x)\,P_F(f)}.$$
In this expression, $P_X(x)$ and $P_F(f)$ denote the marginal probability mass functions of the source image $X$ and the fused image $F$ (obtained by normalizing their intensity histograms), while $P_{X,F}(x,f)$ is their joint probability mass function (obtained by normalizing the joint histogram). The variables $x$ and $f$ range over the corresponding gray levels.
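A histogram-based implementation of the MI definition above can be sketched as follows; the bin count and logarithm base are implementation choices.

```python
import numpy as np

def mutual_information(x: np.ndarray, f: np.ndarray, bins: int = 256) -> float:
    """MI between a source image x and the fused image f, computed from the
    normalized joint and marginal gray-level histograms."""
    joint, _, _ = np.histogram2d(x.ravel(), f.ravel(), bins=bins)
    p_xf = joint / joint.sum()
    p_x = p_xf.sum(axis=1, keepdims=True)     # marginal of x
    p_f = p_xf.sum(axis=0, keepdims=True)     # marginal of f
    nz = p_xf > 0                             # skip zero-probability cells (avoid log 0)
    return float(np.sum(p_xf[nz] * np.log(p_xf[nz] / (p_x @ p_f)[nz])))

def mi_fusion_metric(ir: np.ndarray, vis: np.ndarray, fused: np.ndarray) -> float:
    """MI = MI(A, F) + MI(B, F)."""
    return mutual_information(ir, fused) + mutual_information(vis, fused)
```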
Visual Information Fidelity (VIF) [139] measures the information preserved in the fused image relative to a reference under a natural-scene-statistics/HVS model; larger VIF values generally align better with human visual perception. Formally,
$$\mathrm{VIF}_{X,F} = \frac{\sum_{i \in \mathcal{B}} I\!\left(C_{S,i}, F_{S,i} \mid R_{S,i}\right)}{\sum_{i \in \mathcal{B}} I\!\left(C_{S,i}, X_{S,i} \mid R_{S,i}\right)}.$$
Here, $i$ ranges over the subband set $\mathcal{B}$; $C_{S,i}$ denotes the coefficients of the reference natural image in subband $i$ after passing through the visual channel $S$; $X_{S,i}$ and $F_{S,i}$ are the corresponding coefficients for the source image $X$ and the fused image $F$, respectively; $R_{S,i}$ is the local gain/variance field; and $I(\cdot,\cdot \mid \cdot)$ is conditional mutual information.
Sum of Correlation Differences (SCD) [140] evaluates the correlations between difference images derived from the sources and the fused image; a higher SCD suggests that more source information is retained in the fusion. Let A and B be the two source images and F the fused image. A commonly used definition first constructs two difference images
$$D_1 = F - B, \qquad D_2 = F - A,$$
and then computes
$$\mathrm{SCD} = r(D_1, A) + r(D_2, B),$$
where $r(\cdot,\cdot)$ denotes the Pearson correlation coefficient. Explicitly,
$$r(U, V) = \frac{\sum_{i}\sum_{j}\left(U_{ij} - \bar{U}\right)\left(V_{ij} - \bar{V}\right)}{\sqrt{\sum_{i}\sum_{j}\left(U_{ij} - \bar{U}\right)^{2}}\,\sqrt{\sum_{i}\sum_{j}\left(V_{ij} - \bar{V}\right)^{2}}},$$
with $\bar{U}$ and $\bar{V}$ denoting the pixel means of $U$ and $V$, respectively. Intuitively, the more complementary information $F$ preserves from $A$ (or $B$), the stronger the correlation between $A$ and $F - B$ (or between $B$ and $F - A$); hence, a larger SCD indicates richer information from the sources contained in the fused image.
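The SCD definition above translates directly into a few lines of NumPy; the small epsilon guarding the denominator is our addition for numerical safety.

```python
import numpy as np

def scd(ir: np.ndarray, vis: np.ndarray, fused: np.ndarray) -> float:
    """Sum of Correlation Differences with A = ir, B = vis, F = fused:
    correlate each difference image with the other source."""
    def pearson(u, v):
        u, v = u.ravel() - u.mean(), v.ravel() - v.mean()
        return float(np.sum(u * v) / (np.sqrt(np.sum(u**2)) * np.sqrt(np.sum(v**2)) + 1e-12))
    d1 = fused.astype(np.float64) - vis   # F - B, compared against A
    d2 = fused.astype(np.float64) - ir    # F - A, compared against B
    return pearson(d1, ir) + pearson(d2, vis)
```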
No-reference metrics: Entropy (EN) [141] gauges the information content of the fused image, with larger entropy typically indicating richer detail. It can be expressed mathematically as follows:
$$H(F) = -\sum_{i=0}^{L-1} p_F(i)\,\log_b p_F(i),$$
where $p_F(i)$ is the normalized gray-level histogram of $F$ at level $i$, $L$ is the number of gray levels (for 8-bit images, $L = 256$), and $\log_b$ denotes the logarithm with base $b$. In general, a larger $H(F)$ indicates a more dispersed intensity distribution and richer information content, suggesting that the fused image tends to preserve more detail. In practice, consistent quantization and histogram normalization should be used, with a small smoothing term for zero-probability bins to avoid numerical issues in $\log 0$.
Standard Deviation (SD) [142] reflects global contrast by measuring the spread of pixel intensities; higher SD corresponds to stronger contrast in the fused image. Let the image size be $M \times N$, and define the mean
$$\mu = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} F(i,j).$$
The standard deviation is then
$$\mathrm{SD}(F) = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(F(i,j) - \mu\right)^{2}}.$$
Spatial Frequency (SF) [143] quantifies the overall activity/detail level of an image based on inter-pixel intensity differences, and is widely used to assess how well edges and textures are preserved in a fused image $F$. For an image (or image block) $F$ of size $M \times N$, SF is defined as
$$\mathrm{SF} = \sqrt{\mathrm{RF}^{2} + \mathrm{CF}^{2}},$$
where the row and column frequencies are
$$\mathrm{RF} = \sqrt{\frac{1}{MN}\sum_{m=1}^{M}\sum_{n=2}^{N}\left(F(m,n) - F(m,n-1)\right)^{2}},$$
$$\mathrm{CF} = \sqrt{\frac{1}{MN}\sum_{m=2}^{M}\sum_{n=1}^{N}\left(F(m,n) - F(m-1,n)\right)^{2}}.$$
In general, a larger SF indicates richer edge and texture detail and a higher level of image activity.
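The three no-reference measures defined above (EN, SD, SF) can be computed directly from the formulas; the sketch below assumes single-channel 8-bit inputs.

```python
import numpy as np

def entropy(f: np.ndarray, bins: int = 256) -> float:
    """EN: Shannon entropy of the gray-level histogram (log base 2, 8-bit assumed)."""
    hist, _ = np.histogram(f.ravel(), bins=bins, range=(0, bins))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]                               # drop zero-probability bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

def standard_deviation(f: np.ndarray) -> float:
    """SD: spread of pixel intensities around the mean (global contrast)."""
    f = f.astype(np.float64)
    return float(np.sqrt(np.mean((f - f.mean()) ** 2)))

def spatial_frequency(f: np.ndarray) -> float:
    """SF: combines row- and column-wise intensity-difference energies."""
    f = f.astype(np.float64)
    m, n = f.shape
    rf = np.sqrt(np.sum((f[:, 1:] - f[:, :-1]) ** 2) / (m * n))
    cf = np.sqrt(np.sum((f[1:, :] - f[:-1, :]) ** 2) / (m * n))
    return float(np.sqrt(rf**2 + cf**2))
```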
Q A B / F Edge-Preservation Quality Index Q A B / F [144]: to assess how well a fused image F preserves edge information from source images A and B, the index Q A B / F is commonly adopted. At the pixel level, the edge-preservation measure from a source image X { A , B } to the fused image is defined as
$$Q^{X,F}(i,j) = Q_g^{X,F}(i,j)\,Q_a^{X,F}(i,j),$$
where $Q_g^{X,F}(i,j)$ and $Q_a^{X,F}(i,j)$ quantify, respectively, the retention of edge strength and edge orientation at pixel $(i,j)$ after fusion. Edge strength and orientation are typically estimated using the Sobel operator. Based on this, a weight map $w^{X}(i,j)$ is introduced to reflect the local importance of source image X at $(i,j)$ (e.g., derived from edge saliency or contrast). The pixelwise measures are then aggregated via a weighted normalization to obtain the global score:
$$Q^{AB/F} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\bigl[Q^{A,F}(i,j)\,w^{A}(i,j) + Q^{B,F}(i,j)\,w^{B}(i,j)\bigr]}{\sum_{i=1}^{M}\sum_{j=1}^{N}\bigl[w^{A}(i,j) + w^{B}(i,j)\bigr]},$$
where $M \times N$ is the image size. A larger value of $Q^{AB/F}$ indicates that the fused image better preserves the edge strength and orientation information from both source images, and thus exhibits higher fusion quality.
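A simplified sketch of this index is given below. It mirrors the structure described above (Sobel-based edge strength and orientation, sigmoid preservation terms, edge-strength weighting), but the sigmoid constants and the weighting exponent are values commonly quoted in the literature and should be treated as assumptions; a reference implementation is preferable for benchmarking.

```python
import numpy as np
from scipy.ndimage import sobel

def _edges(x: np.ndarray):
    """Sobel edge strength and orientation of a grayscale image."""
    sx = sobel(x.astype(np.float64), axis=1)
    sy = sobel(x.astype(np.float64), axis=0)
    return np.hypot(sx, sy), np.arctan(sy / (sx + 1e-12))

def qabf_sketch(a: np.ndarray, b: np.ndarray, f: np.ndarray,
                L=1.5, Tg=0.9994, kg=-15.0, Dg=0.5,
                Ta=0.9879, ka=-22.0, Da=0.8) -> float:
    """Simplified edge-preservation index in the spirit of Q^{AB/F}."""
    gf, af = _edges(f)

    def q_and_weight(x: np.ndarray):
        gx, ax = _edges(x)
        with np.errstate(divide="ignore", invalid="ignore"):
            # relative edge strength: min(gx, gf) / max(gx, gf), set to 0 where both vanish
            g_rel = np.where(gx > gf,
                             np.where(gx > 0, gf / gx, 0.0),
                             np.where(gf > 0, gx / gf, 0.0))
        a_rel = 1.0 - np.abs(ax - af) / (np.pi / 2)       # orientation preservation
        qg = Tg / (1.0 + np.exp(kg * (g_rel - Dg)))       # edge-strength term Q_g
        qa = Ta / (1.0 + np.exp(ka * (a_rel - Da)))       # orientation term Q_a
        return qg * qa, gx ** L                           # pixel score and weight map

    qa_f, wa = q_and_weight(a)
    qb_f, wb = q_and_weight(b)
    return float((qa_f * wa + qb_f * wb).sum() / ((wa + wb).sum() + 1e-12))
```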

4.3. Limitations of Existing Metrics for UAV-Oriented IVIF

Although the above measures (EN, SD, SF, MI, SCD, VIF, $Q^{AB/F}$) are widely adopted in the IVIF community, they mainly capture low-level gray-scale statistics under an assumption of near-perfect pixel-wise alignment [11,114,145]. As a result, they often fail to reflect either human visual preference or the utility of the fused images for downstream UAV perception tasks. For example, methods that aggressively boost global contrast or sharpen noise can obtain higher EN/SD/SF scores [146,147], even though the resulting images look unnatural and may degrade detection performance. Conversely, fusion strategies that preserve small, low-contrast targets (typical in UAV scenes) may only change these global averages marginally, despite being clearly preferable in practice.
In addition, most reference-based metrics such as MI, VIF, SCD, and $Q^{AB/F}$ implicitly assume strictly aligned source pairs. This assumption rarely holds on UAV platforms, where altitude changes, oblique viewpoints, parallax, platform jitter, and rolling-shutter effects induce residual RGB–IR misalignment [101,128]. Under such conditions, a fused image that is geometrically reasonable and visually consistent can still receive a low score simply because corresponding pixels no longer occupy identical positions. This sensitivity to misregistration makes conventional metrics unreliable for evaluating fusion robustness in realistic aerial sorties.
Another limitation is that almost all existing metrics treat the image as a spatially uniform grid, without paying special attention to regions of interest. However, in UAV applications, the targets of interest (vehicles, pedestrians, small aircraft) often occupy a tiny fraction of the field of view. Global measures such as EN, SD, SF, MI, and VIF are dominated by the background and therefore insensitive to improvements on small targets [117]. At the same time, none of these metrics is explicitly correlated with task performance [148], e.g., object-detection mAP or segmentation mIoU. Our analysis in Section 5 shows that methods with similar fusion scores can exhibit noticeably different performance on downstream detection, highlighting the need for task-aware or region-aware evaluation protocols.
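To illustrate what a region-aware protocol could look like, the sketch below restricts one global statistic (standard deviation) to annotated target boxes; the function name and the (x1, y1, x2, y2) box format are illustrative assumptions rather than an established metric.

```python
import numpy as np

def target_region_sd(f: np.ndarray, boxes) -> float:
    """Standard deviation computed only inside annotated target bounding boxes."""
    mask = np.zeros(f.shape[:2], dtype=bool)
    for x1, y1, x2, y2 in boxes:          # boxes given in pixel coordinates
        mask[y1:y2, x1:x2] = True
    region = f[mask].astype(np.float64)
    return float(region.std()) if region.size else 0.0
```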
Overall, traditional fusion metrics provide a convenient first-order summary but are insufficient to fully characterize UAV-oriented IR–VIS fusion, especially under misregistration and small-target conditions. We therefore view them as complementary to, rather than substitutes for, task-driven and perception-based metrics. This motivates the research directions outlined in Section 6.2, including the development of metrics that explicitly account for human perception, cross-modal alignment, and downstream UAV tasks such as detection, segmentation, and tracking.

5. Performance Summary and Analysis

In this section, we employ one of the most commonly used fusion datasets, RoadScene, to compare the performance of various advanced fusion methods, using the pretrained models released by the original authors.

5.1. Qualitative Evaluation

Figure 6 presents a clear visual progression across the daytime and nighttime scenes. The earliest encoder–decoder baselines make hot targets stand out, but textures in the green crop appear waxy and faint halos cling to bright edges around the person in the red crop. Models with stronger local feature extractors bring pedestrians and road markings into sharper focus, yet texture rendition varies between over-smoothed and over-sharpened, and bright edges can ring. Attention-driven designs then reconcile the two inputs more gracefully: thermal saliency is retained while fine visible details are recovered, so boundaries read cleaner, background transitions are smoother, and halos largely disappear in both crops. The newest iterative denoising generators polish the frame further, suppressing noise and evening out brightness (especially at night), although the very finest textures can look slightly smoother than in the attention-based results. Overall, the progression moves from simply making hot regions bright toward natural, globally consistent fusion that preserves structure and appears plausible at a glance.

5.2. Quantitative Evaluation

As shown in Table 3, we benchmark seven common fusion metrics on 221 RoadScene pairs. These metrics jointly probe information richness (EN/MI), contrast and activity (SD/SF), complementary cues (SCD), perceptual fidelity (VIF), and edge preservation ($Q^{AB/F}$).
The earliest baselines set a clear reference point. They pass along a fair amount of source information and contrast, yet edges are not strongly maintained and correlation-style indicators remain modest—consistent with pipelines that blend features with limited global context. This establishes the floor from which later families improve.
The next wave strengthens local representation. Models centered on richer neighborhood operators lift indicators tied to content transfer and spatial detail, and edges become more dependable than in the baselines. However, because receptive fields remain primarily local, gains are uneven across the table: information/texture metrics rise, but global consistency and cross-modal complementarity do not dominate across all methods.
A further step appears once long-range interactions are modeled explicitly. Attention-driven designs push several metrics together: information richness and contrast move up, perceptual fidelity improves, and edge preservation becomes markedly stronger. Representative entries in the table—such as correlation-driven decompositions and text-guided variants—illustrate this balanced profile, reflecting the benefit of coordinating infrared saliency with visible textures under global context.
Most recently, generative denoising strategies deliver stable, clean outputs that score well on global and complementary-information measures. At the same time, the edge-focused index typically trails the best attention-based results by a small margin, hinting at a mild smoothing tendency from the denoising prior and a need for stronger boundary-aware cues. Diffusion-style methods listed in Table 3 exemplify this trade-off.
Taken together, the table sketches a steady ascent: from a reference floor, through locally strengthened detail transfer, to globally coordinated fusion, and finally to noise-robust generative polishing. Each stage becomes the prevailing practice of its time while collectively pushing the metric suite forward on RoadScene.

6. Future Perspectives and Open Problems

6.1. Developing Benchmarks

The development of benchmark datasets has played a pivotal role in advancing infrared and visible image fusion (IVIF). Early datasets such as TNO [113] and RoadScene [17] provided pixel-level aligned image pairs, but they were limited in resolution and scene diversity and focused mainly on static surveillance or road environments. Later, larger-scale datasets such as LLVIP [116] and MS [115] extended the scope by including more varied scenarios and annotations for higher-level tasks like object detection. However, these datasets remain restricted to specific domains (e.g., traffic monitoring or nighttime pedestrians) and still lack sufficient coverage of complex weather conditions, dynamic UAV missions, and cross-domain generalization. More recent datasets like M3FD [117] and MFNet [118] have attempted to address these gaps by introducing adverse weather conditions and multi-task labels, yet they still fall short in providing comprehensive coverage and realistic UAV adaptability.
These limitations highlight that, although existing benchmarks have mitigated the scarcity of IVIF datasets, they remain inadequate to fully represent the challenges faced in UAV applications. Future benchmark development should therefore focus on:
  • Constructing UAV-specific datasets with explicit annotations of cross-modal registration errors to simulate sensor misalignment and dynamic in-flight variations;
  • Expanding scene diversity to include extreme weather (fog, rain, snow), challenging illumination (backlight, low-light), and multi-task UAV missions (search and rescue, inspection, environmental monitoring);
  • Incorporating high-level task labels to promote IVIF applications in detection, segmentation, and multi-object tracking;
  • Establishing a unified, open, and extensible evaluation platform that integrates both subjective and objective measures, enabling more representative and practical benchmarks for UAV fusion research across academia and industry.

6.2. Better Evaluation Metrics

Evaluation metrics are another crucial component in advancing infrared and visible image fusion (IVIF). Early studies primarily relied on low-level objective measures such as Entropy (EN) [141], Mutual Information (MI) [138], Spatial Frequency (SF) [143], and Standard Deviation (SD) [142]. These metrics capture the information content and sharpness of fused images but often fail to correlate with human perception or downstream task performance. Later, perceptual-inspired measures such as the Structural Similarity Index (SSIM), Visual Information Fidelity (VIF) [139], and Qabf [144] were introduced to better align with human visual preferences, yet they remain largely confined to pixel-level, image-quality perspectives.
With the rapid expansion of UAV applications, the shortcomings of current metrics become increasingly evident. First, they often overlook task relevance, failing to directly assess the contribution of fused images to downstream tasks such as object detection, tracking, or semantic segmentation. Second, most existing metrics depend on strictly aligned pixel pairs, an assumption that rarely holds in real UAV scenarios involving parallax and geometric distortions. Third, there is a lack of unified standards and integrated evaluation frameworks that combine both subjective and objective aspects.
Future directions include:
  • Developing task-driven metrics, such as evaluating fusion quality based on detection accuracy or segmentation performance, to better reflect UAV mission requirements.
  • Integrating human perception models, leveraging deep-learning-based perceptual quality assessment methods (e.g., LPIPS, DISTS) to align more closely with subjective human judgments (see the sketch after this list).
  • Promoting multi-dimensional evaluation frameworks that jointly consider low-level quality, perceptual similarity, consistency, and task performance.
  • Designing robust metrics for misaligned data, accommodating the non-strict registration and dynamic scenes commonly encountered in UAV operations.
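As a concrete example of the second direction, learned perceptual scores can be reported alongside the classical metrics. The sketch below uses the publicly available lpips package with an AlexNet backbone and assumes RGB tensors of shape (N, 3, H, W) scaled to [-1, 1]; the choice of reference image (e.g., the visible input) is an evaluation-protocol decision rather than part of the metric itself.

```python
import torch
import lpips  # pip install lpips

lpips_model = lpips.LPIPS(net="alex")  # AlexNet-based perceptual metric

def perceptual_distance(fused: torch.Tensor, reference: torch.Tensor) -> float:
    """LPIPS distance between fused and reference images in [-1, 1], shape (N, 3, H, W).

    Lower values indicate closer perceptual appearance.
    """
    with torch.no_grad():
        return float(lpips_model(fused, reference).mean())
```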

6.3. Lightweight Design

In UAV applications, lightweight design is essential for deploying infrared and visible image fusion (IVIF) methods. Unlike ground-based platforms, UAVs are constrained by limited computational power, battery capacity, and onboard memory. However, fusion models often involve high-resolution feature extraction and complex cross-modal interactions, resulting in substantial computational and storage costs. To address these constraints, researchers have explored various efficiency-oriented strategies. Compact convolutional architectures, such as MobileNet and ShuffleNet, aim to reduce floating-point operations while maintaining accuracy [149,150]. Knowledge distillation [151] transfers high-level representations from large teacher models to smaller student networks, enhancing performance under limited resources. Network pruning [152] and quantization [153] further compress parameters and decrease precision requirements, enabling real-time inference on embedded devices. More recently, lightweight Vision Transformers (e.g., MobileViT, TinyViT) have been proposed to balance global representation capability and computational efficiency [154,155]. Despite these advances, achieving both real-time performance and high fusion quality on UAV platforms remains an open challenge, driving ongoing research into model compression, token pruning, and adaptive multimodal fusion mechanisms.
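To make the compact-architecture idea concrete, the sketch below shows a MobileNet-style depthwise-separable convolution block that could stand in for a standard convolution in a fusion encoder; the channel widths and the two-channel IR + visible input are illustrative assumptions rather than a recommended design.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise + pointwise convolution.

    Costs (k*k*C_in + C_in*C_out)*H*W multiply-accumulates, versus
    k*k*C_in*C_out*H*W for a standard k x k convolution of the same widths.
    """
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Illustrative stem for a concatenated IR + visible (two-channel) input.
stem = DepthwiseSeparableConv(in_ch=2, out_ch=16)
features = stem(torch.randn(1, 2, 256, 256))  # -> shape (1, 16, 256, 256)
```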
Future directions for lightweight design include:
  • Developing lightweight Transformers or efficient cross-modal attention mechanisms that balance representational power with reduced computational cost.
  • Leveraging dynamic inference and adaptive computation, selectively activating modules based on scene complexity and mission requirements.
  • Advancing software–hardware co-design, tailoring model architectures to UAV hardware characteristics.
  • Establishing energy-aware evaluation metrics that jointly consider fusion quality, inference latency, power consumption, and endurance, thereby enabling practical, real-time UAV deployment.

6.4. Combination with Various Tasks

Infrared and visible image fusion (IVIF) holds great potential for integration with a wide range of UAV tasks. Beyond improving visual quality and human interpretability, fused images can directly support higher-level mission objectives [76,131]. For instance, in object detection and recognition, fusion enhances target saliency and suppresses background interference, improving detection accuracy under low-light or occluded conditions. In semantic segmentation, fused representations provide richer structural and thermal information, enabling more precise boundary delineation between objects and background, thus contributing to robust scene understanding. For target tracking, multimodal fusion offers complementary spatiotemporal cues, maintaining stability under dynamic illumination and motion. Moreover, IVIF can empower higher-level UAV missions. In search-and-rescue or security surveillance, fused imagery accelerates the detection of hidden or low-contrast targets, facilitating decision-making under adverse weather or nighttime conditions. For autonomous navigation and path planning, fusion provides clearer environmental perception, enhancing obstacle avoidance and terrain awareness. In multimodal situational awareness, fused visual data can be integrated with radar, LiDAR, or audio streams to support comprehensive multi-source reasoning. Through such integration, IVIF is expected to evolve from a standalone image enhancement technique into a fundamental enabler of intelligent UAV perception, autonomy, and decision-making.

7. Conclusions

In this UAV-centric review, we clarified why infrared–visible fusion sits at the core of robust aerial perception and synthesized recent progress through a fused-map perspective that links data compatibility, fusion mechanisms, and task adaptivity. We organized and compared learning paradigms (AE/CNN/GAN/Transformer and emerging diffusion/state-space models), examined explicit and implicit registration together with general fusion frameworks, and consolidated datasets and evaluation metrics to provide actionable guidance for deployment under real-time, power-constrained conditions. Looking ahead, we spotlight four priorities for the community: building scalable UAV benchmarks, developing multi-dimensional metrics that couple low-level fidelity with downstream task utility, advancing misregistration-aware and hardware-efficient designs, and more deeply integrating fusion with detection, segmentation, and tracking. We hope this survey serves both as a practical handbook for practitioners and a springboard for continued innovation toward reliable, efficient, and task-aligned IVIF in the wild.

Author Contributions

Conceptualization, H.Z. and J.L.; methodology, J.L., C.F. and C.O.; software, C.O. and C.F.; validation, J.L., C.F. and C.O.; formal analysis, J.L.; investigation, J.L., C.F. and C.O.; resources, H.Z.; data curation, C.F. and C.O.; writing—original draft preparation, J.L., C.F. and C.O.; writing—review and editing, H.Z., J.L., C.F. and C.O.; visualization, C.O. and J.L.; supervision, H.Z.; project administration, H.Z.; funding acquisition, H.Z. J.L., C.F. and C.O. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62401471, in part by the 2024 Gusu Innovation and Entrepreneurship Leading Talents Program (Young Innovative Leading Talents) under Grant ZXL2024333, and in part by the Xi'an Science and Technology Plan University Institute Talent Service Enterprise Project (23GXFW0038).

Data Availability Statement

No new data were created or analyzed in this study. All datasets discussed in this review are publicly available in the cited references. A curated list of dataset download links is available at our GitHub repository (https://github.com/JJLibra/IVIF-in-UAV-Imagery, accessed on 13 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Miclea, V.C.; Nedevschi, S. Monocular depth estimation with improved long-range accuracy for UAV environment perception. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602215. [Google Scholar] [CrossRef]
  2. Slater, J.C. Microwave electronics. Rev. Mod. Phys. 1946, 18, 441. [Google Scholar] [CrossRef]
  3. Bearden, J.A. X-ray wavelengths. Rev. Mod. Phys. 1967, 39, 78. [Google Scholar] [CrossRef]
  4. Marpaung, D.; Yao, J.; Capmany, J. Integrated microwave photonics. Nat. Photonics 2019, 13, 80–90. [Google Scholar] [CrossRef]
  5. Simon, C.J.; Dupuy, D.E.; Mayo-Smith, W.W. Microwave ablation: Principles and applications. Radiographics 2005, 25, S69–S83. [Google Scholar] [CrossRef]
  6. Simone, G.; Farina, A.; Morabito, F.C.; Serpico, S.B.; Bruzzone, L. Image fusion techniques for remote sensing applications. Inf. Fusion 2002, 3, 3–15. [Google Scholar] [CrossRef]
  7. Yokoya, N.; Grohnfeldt, C.; Chanussot, J. Hyperspectral and multispectral data fusion: A comparative review of the recent literature. IEEE Geosci. Remote Sens. Mag. 2017, 5, 29–56. [Google Scholar] [CrossRef]
  8. Xiao, H.; Liu, S.; Zuo, K.; Xu, H.; Cai, Y.; Liu, T.; Yang, Z. Multiple adverse weather image restoration: A review. Neurocomputing 2025, 618, 129044. [Google Scholar] [CrossRef]
  9. Gallagher, J.E.; Oughton, E.J. Assessing thermal imagery integration into object detection methods on air-based collection platforms. Sci. Rep. 2023, 13, 8491. [Google Scholar] [CrossRef]
  10. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
  11. Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
  12. Meher, B.; Agrawal, S.; Panda, R.; Abraham, A. A survey on region based image fusion methods. Inf. Fusion 2019, 48, 119–132. [Google Scholar] [CrossRef]
  13. Tang, K.; Ma, Y.; Miao, D.; Song, P.; Gu, Z.; Tian, Z.; Wang, W. Decision fusion networks for image classification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 36, 3890–3903. [Google Scholar] [CrossRef]
  14. Liu, Y.; Liu, S.; Wang, Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf. Fusion 2015, 24, 147–164. [Google Scholar] [CrossRef]
  15. Wang, H.; Liu, J.; Dong, H.; Shao, Z. A Survey of the Multi-Sensor Fusion Object Detection Task in Autonomous Driving. Sensors 2025, 25, 2794. [Google Scholar] [CrossRef]
  16. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336. [Google Scholar] [CrossRef]
  17. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 502–518. [Google Scholar] [CrossRef] [PubMed]
  18. Luo, Y.; Luo, Z. Infrared and visible image fusion: Methods, datasets, applications, and prospects. Appl. Sci. 2023, 13, 10891. [Google Scholar] [CrossRef]
  19. Ma, J.; Liang, P.; Yu, W.; Chen, C.; Guo, X.; Wu, J.; Jiang, J. Infrared and visible image fusion via detail preserving adversarial learning. Inf. Fusion 2020, 54, 85–98. [Google Scholar] [CrossRef]
  20. Liu, J.; Wu, G.; Liu, Z.; Wang, D.; Jiang, Z.; Ma, L.; Zhong, W.; Fan, X. Infrared and visible image fusion: From data compatibility to task adaption. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2349–2369. [Google Scholar] [CrossRef]
  21. Zhang, X.; Demiris, Y. Visible and infrared image fusion using deep learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10535–10554. [Google Scholar] [CrossRef] [PubMed]
  22. Karim, S.; Tong, G.; Li, J.; Qadir, A.; Farooq, U.; Yu, Y. Current advances and future perspectives of image fusion: A comprehensive review. Inf. Fusion 2023, 90, 185–217. [Google Scholar] [CrossRef]
  23. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef]
  24. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Li, P.; Zhang, J. DIDFuse: Deep image decomposition for infrared and visible image fusion. arXiv 2020, arXiv:2003.09210. [Google Scholar]
  25. Jian, L.; Yang, X.; Liu, Z.; Jeon, G.; Gao, M.; Chisholm, D. SEDRFuse: A symmetric encoder–decoder with residual block network for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2020, 70, 5002215. [Google Scholar] [CrossRef]
  26. Li, H.; Wu, X.J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86. [Google Scholar] [CrossRef]
  27. Wang, Z.; Wu, Y.; Wang, J.; Xu, J.; Shao, W. Res2Fusion: Infrared and visible image fusion based on dense Res2net and double nonlocal attention models. IEEE Trans. Instrum. Meas. 2022, 71, 5005012. [Google Scholar] [CrossRef]
  28. Luo, X.; Wang, J.; Zhang, Z.; Wu, X.j. A full-scale hierarchical encoder-decoder network with cascading edge-prior for infrared and visible image fusion. Pattern Recognit. 2024, 148, 110192. [Google Scholar] [CrossRef]
  29. Liu, H.; Mao, Q.; Dong, M.; Zhan, Y. Infrared-visible image fusion using dual-branch auto-encoder with invertible high-frequency encoding. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 2675–2688. [Google Scholar] [CrossRef]
  30. Zhang, J.; Qin, P.; Zeng, J.; Zhao, L. DMRO-Fusion: Infrared and Visible Image Fusion Based on Recurrent-Octave Auto-Encoder via Two-Level Modulation. IEEE Trans. Instrum. Meas. 2025, 74, 5038617. [Google Scholar] [CrossRef]
  31. Li, J.; Jiang, J.; Liang, P.; Ma, J.; Nie, L. MaeFuse: Transferring omni features with pretrained masked autoencoders for infrared and visible image fusion via guided training. IEEE Trans. Image Process. 2025, 34, 1340–1353. [Google Scholar] [CrossRef]
  32. Wang, W.; Zhao, W.; Wang, H.; He, Y. Weakly-supervised Cross Mixer for Infrared and Visible Image Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5629413. [Google Scholar] [CrossRef]
  33. Zhao, Z.; Xu, S.; Zhang, J.; Liang, C.; Zhang, C.; Liu, J. Efficient and model-based infrared and visible image fusion via algorithm unrolling. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1186–1196. [Google Scholar] [CrossRef]
  34. Li, H.; Cen, Y.; Liu, Y.; Chen, X.; Yu, Z. Different input resolutions and arbitrary output resolution: A meta learning-based deep framework for infrared and visible image fusion. IEEE Trans. Image Process. 2021, 30, 4070–4083. [Google Scholar] [CrossRef] [PubMed]
  35. Raza, A.; Liu, J.; Liu, Y.; Liu, J.; Li, Z.; Chen, X.; Huo, H.; Fang, T. IR-MSDNet: Infrared and visible image fusion based on infrared features and multiscale dense network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3426–3437. [Google Scholar] [CrossRef]
  36. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Trans. Instrum. Meas. 2021, 70, 5009513. [Google Scholar] [CrossRef]
  37. Xu, D.; Zhang, N.; Zhang, Y.; Li, Z.; Zhao, Z.; Wang, Y. Multi-scale unsupervised network for infrared and visible image fusion based on joint attention mechanism. Infrared Phys. Technol. 2022, 125, 104242. [Google Scholar] [CrossRef]
  38. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83, 79–92. [Google Scholar] [CrossRef]
  39. Li, H.; Xu, T.; Wu, X.J.; Lu, J.; Kittler, J. Lrrnet: A novel representation learning guided fusion network for infrared and visible images. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11040–11052. [Google Scholar] [CrossRef]
  40. Zheng, N.; Zhou, M.; Huang, J.; Zhao, F. Frequency integration and spatial compensation network for infrared and visible image fusion. Inf. Fusion 2024, 109, 102359. [Google Scholar] [CrossRef]
  41. Wu, B.; Nie, J.; Wei, W.; Zhang, L.; Zhang, Y. Adjustable Visible and Infrared Image Fusion. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 13463–13477. [Google Scholar] [CrossRef]
  42. Liu, X.; Huo, H.; Yang, X.; Li, J. A three-dimensional feature-based fusion strategy for infrared and visible image fusion. Pattern Recognit. 2025, 157, 110885. [Google Scholar] [CrossRef]
  43. Gao, X.; Gao, Y.; Dong, A.; Cheng, J.; Lv, G. HaIVFusion: Haze-free Infrared and Visible Image Fusion. IEEE/CAA J. Autom. Sin. 2025, 12, 2040–2055. [Google Scholar] [CrossRef]
  44. Xiao, G.; Liu, X.; Lin, Z.; Ming, R. SMR-Net: Semantic-Guided Mutually Reinforcing Network for Cross-Modal Image Fusion and Salient Object Detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8637–8645. [Google Scholar] [CrossRef]
  45. Yang, Y.; Liu, J.; Huang, S.; Wan, W.; Wen, W.; Guan, J. Infrared and visible image fusion via texture conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4771–4783. [Google Scholar] [CrossRef]
  46. Zhou, H.; Wu, W.; Zhang, Y.; Ma, J.; Ling, H. Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network. IEEE Trans. Multimed. 2021, 25, 635–648. [Google Scholar] [CrossRef]
  47. Gao, Y.; Ma, S.; Liu, J. DCDR-GAN: A densely connected disentangled representation generative adversarial network for infrared and visible image fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 549–561. [Google Scholar] [CrossRef]
  48. Rao, Y.; Wu, D.; Han, M.; Wang, T.; Yang, Y.; Lei, T.; Zhou, C.; Bai, H.; Xing, L. AT-GAN: A generative adversarial network with attention and transition for infrared and visible image fusion. Inf. Fusion 2023, 92, 336–349. [Google Scholar] [CrossRef]
  49. Sui, C.; Yang, G.; Hong, D.; Wang, H.; Yao, J.; Atkinson, P.M.; Ghamisi, P. IG-GAN: Interactive guided generative adversarial networks for multimodal image fusion. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5634719. [Google Scholar] [CrossRef]
  50. Vs, V.; Valanarasu, J.M.J.; Oza, P.; Patel, V.M. Image fusion transformer. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3566–3570. [Google Scholar]
  51. Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217. [Google Scholar] [CrossRef]
  52. Tang, W.; He, F.; Liu, Y. YDTR: Infrared and visible image fusion via Y-shape dynamic transformer. IEEE Trans. Multimed. 2022, 25, 5413–5428. [Google Scholar] [CrossRef]
  53. Rao, D.; Xu, T.; Wu, X.J. TGFuse: An infrared and visible image fusion approach based on transformer and generative adversarial network. IEEE Trans. Image Process. 2023; online ahead of print. [Google Scholar] [CrossRef]
  54. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 18–22 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 5906–5916. [Google Scholar]
  55. Park, S.; Vien, A.G.; Lee, C. Cross-modal transformers for infrared and visible image fusion. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 770–785. [Google Scholar] [CrossRef]
  56. Yi, X.; Xu, H.; Zhang, H.; Tang, L.; Ma, J. Text-if: Leveraging semantic text guidance for degradation-aware and interactive image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 27026–27035. [Google Scholar]
  57. Li, H.; Wu, X.J. CrossFuse: A novel cross attention mechanism based infrared and visible image fusion approach. Inf. Fusion 2024, 103, 102147. [Google Scholar] [CrossRef]
  58. Liu, J.; Li, X.; Wang, Z.; Jiang, Z.; Zhong, W.; Fan, W.; Xu, B. PromptFusion: Harmonized semantic prompt learning for infrared and visible image fusion. IEEE/CAA J. Autom. Sin. 2024, 12, 502–515. [Google Scholar] [CrossRef]
  59. Wang, M.; Pan, Y.; Zhao, Z.; Li, Z.; Yao, S. MDDPFuse: Multi-driven dynamic perception network for infrared and visible image fusion via data guidance and semantic injection. Knowl.-Based Syst. 2025, 327, 114027. [Google Scholar] [CrossRef]
  60. Cao, Z.H.; Liang, Y.J.; Deng, L.J.; Vivone, G. An Efficient Image Fusion Network Exploiting Unifying Language and Mask Guidance. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 9845–9862. [Google Scholar] [CrossRef] [PubMed]
  61. Shi, Y.; Shi, C.; Weng, Z.; Tian, Y.; Xian, X.; Lin, L. Crossfuse: Learning infrared and visible image fusion by cross-sensor top-k vision alignment and beyond. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 7579–7591. [Google Scholar] [CrossRef]
  62. Yue, J.; Fang, L.; Xia, S.; Deng, Y.; Ma, J. Dif-fusion: Toward high color fidelity in infrared and visible image fusion with diffusion models. IEEE Trans. Image Process. 2023, 32, 5705–5720. [Google Scholar] [CrossRef] [PubMed]
  63. Yi, X.; Tang, L.; Zhang, H.; Xu, H.; Ma, J. Diff-IF: Multi-modality image fusion via diffusion model with fusion knowledge prior. Inf. Fusion 2024, 110, 102450. [Google Scholar] [CrossRef]
  64. Liu, C.; Ma, X.; Yang, X.; Zhang, Y.; Dong, Y. COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection. Inf. Fusion 2024, 125, 103414. [Google Scholar] [CrossRef]
  65. Dong, W.; Zhu, H.; Lin, S.; Luo, X.; Shen, Y.; Guo, G.; Zhang, B. Fusion-mamba for cross-modality object detection. IEEE Trans. Multimed. 2025, 27, 7392–7406. [Google Scholar] [CrossRef]
  66. Zhu, J.; Dou, Q.; Jian, L.; Liu, K.; Hussain, F.; Yang, X. Multiscale channel attention network for infrared and visible image fusion. Concurr. Comput. Pract. Exp. 2021, 33, e6155. [Google Scholar] [CrossRef]
  67. Zhao, F.; Zhao, W.; Yao, L.; Liu, Y. Self-supervised feature adaption for infrared and visible image fusion. Inf. Fusion 2021, 76, 189–203. [Google Scholar] [CrossRef]
  68. Liu, J.; Wu, Y.; Huang, Z.; Liu, R.; Fan, X. Smoa: Searching a modality-oriented architecture for infrared and visible image fusion. IEEE Signal Process. Lett. 2021, 28, 1818–1822. [Google Scholar] [CrossRef]
  69. Wang, J.; Xi, X.; Li, D.; Li, F. FusionGRAM: An infrared and visible image fusion framework based on gradient residual and attention mechanism. IEEE Trans. Instrum. Meas. 2023, 72, 5005412. [Google Scholar] [CrossRef]
  70. Chen, Y.; Fan, H.; Xu, B.; Yan, Z.; Kalantidis, Y.; Rohrbach, M.; Yan, S.; Feng, J. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3435–3444. [Google Scholar]
  71. Yang, Z.; Zhang, Y.; Li, H.; Liu, Y. Instruction-driven fusion of Infrared–visible images: Tailoring for diverse downstream tasks. Inf. Fusion 2025, 121, 103148. [Google Scholar] [CrossRef]
  72. Tang, W.; He, F.; Liu, Y. ITFuse: An interactive transformer for infrared and visible image fusion. Pattern Recognit. 2024, 156, 110822. [Google Scholar] [CrossRef]
  73. Cohen, T.; Welling, M. Group equivariant convolutional networks. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; PMLR: New York, NY, USA, 2016; pp. 2990–2999. [Google Scholar]
  74. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  75. Long, Y.; Jia, H.; Zhong, Y.; Jiang, Y.; Jia, Y. RXDNFuse: A aggregated residual dense network for infrared and visible image fusion. Inf. Fusion 2021, 69, 128–141. [Google Scholar] [CrossRef]
  76. Li, H.; Wu, X.J.; Durrani, T. NestFuse: An Infrared and Visible Image Fusion Architecture Based on Nest Connection and Spatial/Channel Attention Models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656. [Google Scholar] [CrossRef]
  77. Wang, Z.; Wang, J.; Wu, Y.; Xu, J.; Zhang, X. UNFusion: A unified multi-scale densely connected network for infrared and visible image fusion. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 3360–3374. [Google Scholar] [CrossRef]
  78. Zhang, Y.; Liu, Y.; Sun, P.; Yan, H.; Zhao, X.; Zhang, L. IFCNN: A general image fusion framework based on convolutional neural network. Inf. Fusion 2020, 54, 99–118. [Google Scholar] [CrossRef]
  79. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  80. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
  81. Wang, Z.; She, Q.; Ward, T.E. Generative adversarial networks in computer vision: A survey and taxonomy. ACM Comput. Surv. (CSUR) 2021, 54, 1–38. [Google Scholar] [CrossRef]
  82. Gui, J.; Sun, Z.; Wen, Y.; Tao, D.; Ye, J. A review on generative adversarial networks: Algorithms, theory, and applications. IEEE Trans. Knowl. Data Eng. 2021, 35, 3313–3332. [Google Scholar] [CrossRef]
  83. Shamsolmoali, P.; Zareapoor, M.; Granger, E.; Zhou, H.; Wang, R.; Celebi, M.E.; Yang, J. Image synthesis with adversarial networks: A comprehensive survey and case studies. Inf. Fusion 2021, 72, 126–146. [Google Scholar] [CrossRef]
  84. Yi, X.; Walia, E.; Babyn, P. Generative adversarial network in medical imaging: A review. Med. Image Anal. 2019, 58, 101552. [Google Scholar] [CrossRef]
  85. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  86. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  87. Touvron, H.; Cord, M.; Jégou, H. Deit iii: Revenge of the vit. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 516–533. [Google Scholar]
  88. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022. [Google Scholar]
  89. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 568–578. [Google Scholar]
  90. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxvit: Multi-axis vision transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 459–479. [Google Scholar]
  91. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  92. Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]
  93. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  94. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  95. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
  96. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10684–10695. [Google Scholar]
  97. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42. [Google Scholar] [CrossRef]
  98. Li, H.; Zhao, J.; Li, J.; Yu, Z.; Lu, G. Feature dynamic alignment and refinement for infrared–visible image fusion: Translation robust fusion. Inf. Fusion 2023, 95, 26–41. [Google Scholar] [CrossRef]
  99. Zhang, X.; Zhai, H.; Liu, J.; Wang, Z.; Sun, H. Real-time infrared and visible image fusion network using adaptive pixel weighting strategy. Inf. Fusion 2023, 99, 101863. [Google Scholar] [CrossRef]
  100. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Detfusion: A detection-driven infrared and visible image fusion network. In Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), Lisbon, Portugal, 10–14 October 2022; ACM: New York, NY, USA, 2022; pp. 4003–4011. [Google Scholar]
  101. Zhang, Y.; Xu, C.; Yang, W.; He, G.; Yu, H.; Yu, L.; Xia, G.S. Drone-based RGBT tiny person detection. ISPRS J. Photogramm. Remote Sens. 2023, 204, 61–76. [Google Scholar] [CrossRef]
  102. Fu, H.; Yuan, J.; Zhong, G.; He, X.; Lin, J.; Li, Z. CF-Deformable DETR: An end-to-end alignment-free model for weakly aligned visible-infrared object detection. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, Republic of Korea, 3–9 August 2024; pp. 758–766. [Google Scholar]
  103. Jiang, C.; Liu, X.; Zheng, B.; Bai, L.; Li, J. HSFusion: A high-level vision task-driven infrared and visible image fusion network via semantic and geometric domain transformation. arXiv 2024, arXiv:2407.10047. [Google Scholar]
  104. Dong, S.; Zhou, W.; Xu, C.; Yan, W. EGFNet: Edge-aware guidance fusion network for RGB–thermal urban scene parsing. IEEE Trans. Intell. Transp. Syst. 2023, 25, 657–669. [Google Scholar] [CrossRef]
  105. Wang, Y.; Li, G.; Liu, Z. SGFNet: Semantic-guided fusion network for RGB-thermal semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7737–7748. [Google Scholar] [CrossRef]
  106. Li, G.; Qian, X.; Qu, X. SOSMaskFuse: An infrared and visible image fusion architecture based on salient object segmentation mask. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10118–10137. [Google Scholar] [CrossRef]
  107. Zhang, T.; Guo, H.; Jiao, Q.; Zhang, Q.; Han, J. Efficient rgb-t tracking via cross-modality distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 5404–5413. [Google Scholar]
  108. Li, H.; Yang, Z.; Zhang, Y.; Jia, W.; Yu, Z.; Liu, Y. MulFS-CAP: Multimodal Fusion-Supervised Cross-Modality Alignment Perception for Unregistered Infrared-Visible Image Fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3673–3690. [Google Scholar] [CrossRef]
  109. Yang, X.; Xing, H.; Xu, L.; Wu, L.; Zhang, H.; Zhang, W.; Yang, C.; Zhang, Y.; Zhang, J.; Yang, Z. A Collaborative Fusion and Registration Framework for Multimodal Image Fusion. IEEE Internet Things J. 2025, 12, 29584–29600. [Google Scholar] [CrossRef]
  110. Rizzoli, G.; Barbato, F.; Caligiuri, M.; Zanuttigh, P. Syndrone-multi-modal uav dataset for urban scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2210–2220. [Google Scholar]
  111. Perera, A.G.; Wei Law, Y.; Chahl, J. UAV-GESTURE: A dataset for UAV control and gesture recognition. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 117–128. [Google Scholar]
  112. Bouguettaya, A.; Zarzour, H.; Kechida, A.; Taberkit, A.M. Vehicle detection from UAV imagery with deep learning: A review. IEEE Trans. Neural Networks Learn. Syst. 2021, 33, 6047–6067. [Google Scholar] [CrossRef] [PubMed]
  113. Toet, A. The TNO multiband image data collection. Data Brief 2017, 15, 249. [Google Scholar] [CrossRef] [PubMed]
  114. Zhang, X.; Ye, P.; Xiao, G. VIFB: A visible and infrared image fusion benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 468–478. [Google Scholar]
  115. Takumi, K.; Watanabe, K.; Ha, Q.; Tejero-De-Pablos, A.; Ushiku, Y.; Harada, T. Multispectral object detection for autonomous vehicles. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, Mountain View, CA, USA, 23–27 October 2017; pp. 35–43. [Google Scholar]
  116. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3496–3504. [Google Scholar]
  117. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 5802–5811. [Google Scholar]
  118. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5108–5115. [Google Scholar]
  119. Liu, J.; Liu, Z.; Wu, G.; Ma, L.; Liu, R.; Zhong, W.; Luo, Z.; Fan, X. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 8115–8124. [Google Scholar]
  120. Jiang, X.; Ma, J.; Xiao, G.; Shao, Z.; Guo, X. A review of multimodal image matching: Methods and applications. Inf. Fusion 2021, 73, 22–71. [Google Scholar] [CrossRef]
  121. Li, L.; Han, L.; Ye, Y.; Xiang, Y.; Zhang, T. Deep learning in remote sensing image matching: A survey. ISPRS J. Photogramm. Remote Sens. 2025, 225, 88–112. [Google Scholar] [CrossRef]
  122. Geng, Z.; Liu, H.; Duan, P.; Wei, X.; Li, S. Feature-based multimodal remote sensing image matching: Benchmark and state-of-the-art. ISPRS J. Photogramm. Remote Sens. 2025, 229, 285–302. [Google Scholar] [CrossRef]
  123. Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration. arXiv 2022, arXiv:2205.11876. [Google Scholar] [CrossRef]
  124. Huang, Z.; Liu, J.; Fan, X.; Liu, R.; Zhong, W.; Luo, Z. Reconet: Recurrent correction network for fast and efficient multi-modality image fusion. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 539–555. [Google Scholar]
  125. Tang, L.; Deng, Y.; Ma, Y.; Huang, J.; Ma, J. SuperFusion: A versatile image registration and fusion network with semantic awareness. IEEE/CAA J. Autom. Sin. 2022, 9, 2121–2137. [Google Scholar] [CrossRef]
  126. Xu, H.; Yuan, J.; Ma, J. Murf: Mutually reinforcing multi-modal image registration and fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12148–12166. [Google Scholar] [CrossRef]
  127. Wang, D.; Liu, J.; Ma, L.; Liu, R.; Fan, X. Improving misaligned multi-modality image fusion with one-stage progressive dense registration. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 10944–10958. [Google Scholar] [CrossRef]
  128. Li, H.; Liu, J.; Zhang, Y.; Liu, Y. A deep learning framework for infrared and visible image fusion without strict registration. Int. J. Comput. Vis. 2024, 132, 1625–1644. [Google Scholar] [CrossRef]
  129. Zheng, K.; Huang, J.; Yu, H.; Zhao, F. Efficient multi-exposure image fusion via filter-dominated fusion and gradient-driven unsupervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vancouver, BC, Canada, 18–22 June 2023; pp. 2805–2814. [Google Scholar]
  130. Zhang, H.; Xu, H.; Xiao, Y.; Guo, X.; Ma, J. Rethinking the Image Fusion: A Fast Unified Image Fusion Network based on Proportional Maintenance of Gradient and Intensity. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12797–12804. [Google Scholar] [CrossRef]
  131. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  132. Ma, J.; Zhang, H.; Shao, Z.; Liang, P.; Xu, H. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2020, 70, 5005014. [Google Scholar] [CrossRef]
  133. Cheng, C.; Xu, T.; Wu, X.J. MUFusion: A general unsupervised image fusion network based on memory unit. Inf. Fusion 2023, 92, 80–92. [Google Scholar] [CrossRef]
  134. Zhang, Z.; Li, H.; Xu, T.; Wu, X.J.; Kittler, J. DDBFusion: An unified image decomposition and fusion framework based on dual decomposition and Bézier curves. Inf. Fusion 2025, 114, 102655. [Google Scholar] [CrossRef]
  135. Liu, J.; Li, S.; Liu, H.; Dian, R.; Wei, X. A lightweight pixel-level unified image fusion network. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 18120–18132. [Google Scholar] [CrossRef]
  136. He, Q.; Zhang, J.; Peng, J.; He, H.; Li, X.; Wang, Y.; Wang, C. Pointrwkv: Efficient rwkv-like model for hierarchical point cloud learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 3410–3418. [Google Scholar]
  137. Fei, Z.; Fan, M.; Yu, C.; Li, D.; Huang, J. Diffusion-rwkv: Scaling rwkv-like architectures for diffusion models. arXiv 2024, arXiv:2404.04478. [Google Scholar]
  138. Qu, G.; Zhang, D.; Yan, P. Information measure for performance of image fusion. Electron. Lett. 2002, 38, 313–315. [Google Scholar] [CrossRef]
  139. Han, Y.; Cai, Y.; Cao, Y.; Xu, X. A new image fusion performance metric based on visual information fidelity. Inf. Fusion 2013, 14, 127–135. [Google Scholar] [CrossRef]
  140. Aslantas, V.; Bendes, E. A new image quality metric for image fusion: The sum of the correlations of differences. Aeu-Int. J. Electron. Commun. 2015, 69, 1890–1896. [Google Scholar] [CrossRef]
  141. Roberts, J.W.; Van Aardt, J.A.; Ahmed, F.B. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J. Appl. Remote Sens. 2008, 2, 023522. [Google Scholar]
  142. Jagalingam, P.; Hegde, A.V. A review of quality metrics for fused image. Aquat. Procedia 2015, 4, 133–142. [Google Scholar] [CrossRef]
  143. Li, S.; Kwok, J.T.; Wang, Y. Combination of images with diverse focuses using the spatial frequency. Inf. Fusion 2001, 2, 169–176. [Google Scholar] [CrossRef]
  144. Xydeas, C.S.; Petrovic, V. Objective image fusion performance measure. Electron. Lett. 2000, 36, 308–309. [Google Scholar] [CrossRef]
  145. Guan, D.; Wu, Y.; Liu, T.; Kot, A.C.; Gu, Y. Rethinking the Evaluation of Visible and Infrared Image Fusion. arXiv 2024, arXiv:2410.06811. [Google Scholar] [CrossRef]
  146. Liu, Y.; Qi, Z.; Cheng, J.; Chen, X. Rethinking the effectiveness of objective evaluation metrics in multi-focus image fusion: A statistic-based approach. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5806–5819. [Google Scholar] [CrossRef]
  147. Chen, H.; Varshney, P.K. A human perception inspired quality metric for image fusion based on regional information. Inf. Fusion 2007, 8, 193–207. [Google Scholar] [CrossRef]
  148. Liu, Z.; Liu, J.; Zhang, B.; Ma, L.; Fan, X.; Liu, R. PAIF: Perception-aware infrared-visible image fusion for attack-tolerant semantic segmentation. In Proceedings of the 31st ACM International Conference on Multimedia (MM ’23), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3706–3714. [Google Scholar]
  149. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  150. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  151. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  152. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  153. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2704–2713. [Google Scholar]
  154. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  155. Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. Tinyvit: Fast pretraining distillation for small vision transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 68–85. [Google Scholar]
Figure 1. The basic phased processes of AE-based IVIF methods.
Figure 2. The basic phased processes of CNN-based IVIF methods.
Figure 3. The basic phased processes of GAN-based IVIF methods.
Figure 4. The basic phased processes of Transformer-based IVIF methods.
Figure 5. Examples of selected IVIF datasets.
Figure 6. Comparison of several state-of-the-art fusion methods on a typical RoadScene image pair.
Table 1. An overview of representative deep learning-based IVIF methods.

| Aspects | Methods | Publication | Core Ideas |
| --- | --- | --- | --- |
| Auto-Encoder | DenseFuse [23] | TIP 2018 | Dense connection feature extraction with ℓ1-norm fusion rule. |
| | DIDFuse [24] | IJCAI 2020 | DIDFuse learns deep background–detail decomposition to fuse effectively. |
| | SEDRFuse [25] | TIM 2020 | Symmetric encoder–decoder with residual block, attention fusion, choose-max. |
| | RFNNest [26] | IF 2021 | Residual fusion network with Nest connections replaces rules, preserving saliency. |
| | Res2Fusion [27] | TIM 2022 | Res2Net with nonlocal attention for multiscale long-range dependencies. |
| | FSFuse [28] | PR 2024 | Full-scale encoder–decoder with non-local attention and a cascading edge prior. |
| | IHFDBAE [29] | TCSVT 2024 | Dual-branch autoencoder fusing invertible-wavelet HF and transformer LF. |
| | DMROFuse [30] | TIM 2025 | Two-level modulation with a recurrent-octave auto-encoder. |
| | MaeFuse [31] | TIP 2025 | MAE encoder with guided two-stage fusion aligns domains to integrate omni features. |
| | WSCM [32] | TGRS 2025 | Segmentation-guided cross mixer with a shared adaptive decoder for loss-free IVIF. |
| CNN | IVIF-Net [33] | TCSVT 2021 | Unroll two-scale optimization into a trainable encoder–decoder. |
| | MetaFusion [34] | TIP 2021 | Meta-learning IR–visible fusion with different input and arbitrary output resolutions. |
| | MSDNet [35] | JSTARS 2021 | IR–visible fusion with encoder–decoder and channel attention. |
| | STDFusion [36] | TIM 2021 | Saliency-guided fusion via pseudo-siamese CNN with region-weighted pixel losses. |
| | MAFusion [37] | TIM 2022 | IR–visible fusion with skip connections and feature-preserving loss. |
| | PIAFusion [38] | IF 2022 | Illumination-aware IR–visible fusion with intensity and texture losses. |
| | LRR-Net [39] | TPAMI 2023 | IR–visible fusion via LLRR and detail-to-semantic loss. |
| | FISCNet [40] | IF 2024 | Frequency-phase IR–visible fusion with spatial compensation. |
| | AdFusion [41] | TCSVT 2024 | Global control coefficients and semantics-aware pixel modulation. |
| | D3Fuse [42] | PR 2025 | D3Fuse introduces a scene-common third modality, building a 3D space. |
| | HaIVFusion [43] | JAS 2025 | Texture restoration plus denoised, color-corrected haze fusion. |
| | SMR-Net [44] | AAAI 2025 | Semantic coupling of fusion (PCI) and SOD (BPS) via a fused-image third modality. |
| GAN | TC-GAN [45] | TCSVT 2021 | Visible-texture-conditioned guidance for adaptive-filter-based fusion. |
| | DDGAN [46] | TMM 2021 | Semantics-guided IR–visible fusion with modality-specific dual discriminators. |
| | DCDRGAN [47] | TCSVT 2022 | Content–modality decoupling: fuse content, inject modality via AdaIN. |
| | AT-GAN [48] | IF 2023 | Adversarial fusion with intensity-aware attention and semantic transfer. |
| | IG-GAN [49] | TGRS 2024 | Strong-modality-guided dual-stream GAN for unsupervised alignment and fusion. |
| Transformer | IFT [50] | ICIP 2022 | CNNs with Transformers enhance fusion. |
| | SwinFuse [51] | JAS 2022 | Modeling intra- and cross-domain dependencies enhances fusion. |
| | YDTR [52] | TMM 2022 | Y-shaped structure with dynamic Transformers fuses infrared and visible features. |
| | TGFuse [53] | TIP 2023 | Transformers with adversarial learning enhance fusion quality. |
| | CDDFuse [54] | CVPR 2023 | Transformers model global context, INNs preserve details to enhance fusion. |
| | CMTFuse [55] | TCSVT 2023 | Cross-modal Transformers enhance infrared–visible fusion. |
| | Text-IF [56] | CVPR 2024 | Transformers with textual semantic guidance enable interactive infrared–visible fusion. |
| | CrossFuse [57] | IF 2024 | Self- and cross-attention boost infrared–visible fusion. |
| | PromptFuse [58] | JAS 2024 | Transformers with frequency and prompt learning enhance fusion. |
| | MDDPFuse [59] | KBS 2025 | Dynamic weighting and semantics boost IR–visible fusion. |
| | RWKVFuse [60] | TPAMI 2025 | Linear attention with semantic guidance improves fusion. |
| | Crossfuse [61] | TCSVT 2025 | Multi-view with Transformers/CNNs boosts robustness. |
| Other | DiFFusion [62] | TIP 2023 | Build latent diffusion over RGB+IR to preserve color fidelity. |
| | Diff-IF [63] | IF 2024 | Train conditional diffusion with fusion prior, generating images without ground truth. |
| | COMO [64] | IF 2024 | Cross-Mamba interactions handle offset misalignment in detection. |
| | FusMamba [65] | TMM 2025 | Use Mamba state-space with cross-modal fusion capturing long-range context. |
Table 3. Quantitative comparison of infrared–visible image fusion methods on the RoadScene dataset.

| Methods | EN (↑) | SD (↑) | SF (↑) | MI (↑) | SCD (↑) | VIF (↑) | Qabf (↑) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MaeFuse [31] | 7.19 | 44.28 | 12.24 | 1.71 | 1.71 | 0.56 | 0.46 |
| FSFusion [28] | 7.07 | 48.18 | 12.51 | 2.34 | 1.67 | 0.64 | 0.46 |
| FISCNet [40] | 7.09 | 43.85 | 16.62 | 2.20 | 1.26 | 0.59 | 0.55 |
| PIAFusion [38] | 6.98 | 42.70 | 12.13 | 2.47 | 1.47 | 0.68 | 0.44 |
| CDDFuse [54] | 7.43 | 54.66 | 16.36 | 2.30 | 1.81 | 0.69 | 0.52 |
| TGFuse [53] | 7.17 | 43.14 | 14.03 | 1.85 | 1.42 | 0.61 | 0.53 |
| PromptFusion [58] | 7.39 | 53.15 | 16.24 | 2.38 | 1.92 | 0.68 | 0.50 |
| Text-IF [56] | 7.37 | 49.67 | 14.80 | 2.08 | 1.85 | 0.70 | 0.59 |
| Diff-IF [63] | 7.11 | 43.73 | 14.61 | 2.06 | 1.21 | 0.66 | 0.51 |
| Dif-Fusion [62] | 7.17 | 42.33 | 15.51 | 1.99 | 1.32 | 0.55 | 0.51 |
↑ indicates that higher values correspond to better fusion performance for the respective evaluation metric.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
