1. Introduction
Image fusion is an effective means of information integration, and the fusion of infrared and visible images is one of its most important and widely used forms. Infrared imagery reflects the thermal radiation characteristics of objects, particularly under complex conditions such as smoke, fog, or low illumination, but it lacks texture detail. Visible images contain abundant texture and background information, and fusing them with infrared images yields a more comprehensive composite image. However, infrared and visible images usually come from different types of sensors with differing viewpoints and fields of view, which may introduce geometric distortion and spatial misalignment during fusion. The differences between the sensors lie not only in their imaging principles (infrared based on thermal radiation, visible based on reflected light), but also in their resolution, dynamic range, and spectral response. Additionally, although infrared–visible image fusion has important applications across many fields, devices capable of simultaneously capturing both modalities remain relatively scarce. In summary, acquiring high-quality, directionally consistent infrared and visible data of the same scene is challenging, and sensor registration and imaging mismatch further complicate bimodal image fusion. Designing efficient and accurate bimodal fusion strategies, while addressing sensor registration and synchronization, has therefore become a critical issue in image fusion research [1,2,3].
To address the alignment problem in multimodal fused images, Ma et al. proposed a unified model based on Gaussian field criteria that simultaneously adapts to infrared and visible image features [4]. Seong G. Kong et al. adopted a multiscale processing approach to enhance the accuracy and robustness of face recognition [5]. Jingyuan Gao et al. achieved geometric alignment of different modalities by incorporating sparse 3D reconstruction [6]. Chen et al. innovatively used color images as guidance and employed convolutional neural networks (CNNs) to achieve high-quality thermal image super-resolution [7]. Bruce D. Lucas et al. adjusted the spatial transformation relationships between images to achieve precise spatial alignment of multimodal data [8]. Ma et al. proposed a feature-guided Gaussian mixture model (GMM) approach, delivering more reliable image alignment for clinical applications such as disease monitoring and treatment planning [9]. David A. Clausi et al. introduced algorithm variants into the automatic registration of remote sensing images (ARRSI) to improve model accuracy and address remote sensing registration problems [10].
Since its initial proposal [11], the neural radiance field (NeRF) has become a core method for novel view synthesis, though it suffers from slow training and limited robustness. Plenoxels accelerated training using sparse voxels [12]. NeRF++ and Mega-NeRF extended NeRF to complex scenes and large-scale environments [13,14]. Semantic-NeRF integrates semantic information by mapping spatial coordinates to semantic labels, aiding object recognition [15]. PixelNeRF learns scene priors to enable novel view synthesis from sparse image sets [16]. VT-NeRF combines vertex and texture latent codes to improve the modeling accuracy of dynamic human scenes, while CLIP-NeRF introduces multimodal control over 3D object editing into NeRF using text and image prompts [17,18]. UAV-ENeRF enables large-scale UAV scene editing [19]. SeaThru-NeRF incorporates a SeaThru-based scattering-medium rendering model into the NeRF framework, combined with an adapted network architecture that jointly learns scene information and medium parameters, enabling realistic novel view synthesis and medium removal in underwater and foggy environments [20]. Jonathan T. Barron et al. replaced ray sampling with cone tracing, improving anti-aliasing and training efficiency and reducing texture flickering [21]. Alex Yu et al. converted NeRF's volumetric rendering into a sparse octree structure to enable real-time, high-frame-rate rendering without sacrificing quality [22]. Thomas Müller et al. applied multiresolution hash encoding to significantly accelerate NeRF training [23]. These NeRF variants have improved reconstruction accuracy, training speed, robustness, and the range of applicable scenarios.
Early image fusion methods mainly relied on multiscale transforms such as wavelets or the Nonsubsampled Contourlet Transform (NSCT) and lacked semantic understanding [24,25]. Later, convolutional sparse representation (CSR) methods were used for feature transformation, addressing the dependency on image decomposition [26]. With the advancement of deep learning, CNNs and Generative Adversarial Networks (GANs) emerged, improving both the speed and quality of image fusion [27,28]. A residual Swin Transformer fusion network based on saliency detection was proposed, which effectively highlights thermal targets in infrared images while preserving texture details [29]. As image fusion algorithms diversify, more quality evaluation metrics are needed. Zhang et al. [30] proposed VIFB, a benchmark dataset for visible and infrared image fusion, providing a unified platform for evaluating fusion algorithms. Haghighat et al. [31] designed a no-reference image fusion quality metric based on the mutual information of image features, addressing the limitations of traditional metrics in reflecting subjective quality. Wang et al. [32] introduced a reference-based image super-resolution method that matches high- to low-level features and fuses complementary information from reference images to improve reconstruction quality. Similarly, Kumar and Bawa [33] developed a no-reference image quality assessment metric based on regional mutual information, enabling finer-grained quality evaluation. You et al. [34] proposed a fusion quality metric that combines mutual information and structural similarity, balancing statistical features and structural consistency to make image quality evaluation more comprehensive and practical.
Among them, some algorithms performed well. Liu Yu et al. proposed a CNN-based fusion method for infrared and visible images, achieving state-of-the-art image quality with fast computation [35,36]. Liu et al. combined multiscale transform (MST) and sparse representation (SR) to improve the quality and robustness of multimodal image fusion [37]. Yonghua Li et al. proposed a fusion method based on saliency detection and LatLRR-FPDE, which enhanced infrared target saliency and texture detail expression, achieving superior visual quality and information preservation in multiscale fusion [38].
Existing studies on visible and infrared image registration suffer from low accuracy due to geometric differences and modality-specific features, making it difficult to align images captured with different camera poses and intrinsic parameters. In terms of image fusion methods, the lack of semantic understanding limits the ability to integrate complementary information from multimodal images. Regarding image data acquisition, it is challenging to obtain high-quality registered visible–infrared image pairs simultaneously.
To address these issues, this study utilizes existing infrared and visible image data and employs neural radiance fields (NeRFs) to reconstruct the 3D structure of objects and perform image augmentation, generating accurately registered infrared and visible images from novel viewpoints. Multiple infrared–visible fusion methods are then applied to integrate the multimodal data and obtain fused images from the new viewpoints. The main innovations of this study are summarized as follows:
To address the difficulty of achieving high-precision alignment with traditional image registration techniques, a novel image augmentation and registration method based on an improved NeRF is proposed. This method exploits NeRF's ability to synthesize novel-viewpoint images and incorporates precise camera-pose constraints to generate accurately aligned images across modalities, significantly improving the registration accuracy of infrared and visible images.
Improved NeRF models tailored to the characteristics of infrared and visible-light scenes, NeRF-IR and NeRF-RGB, are developed to generate novel-viewpoint images. Compared with the well-performing NeRFacto model, the proposed models generate images of higher quality.
Three different image fusion methods (MS-SRIF, PCA-MSIF, and CNN-LPIF) were employed, and an experimental analysis was conducted to evaluate the performance of each fusion algorithm. The results show the applicability and advantages of different algorithms in various scenarios. Notably, by automatically extracting latent differences through a data-driven approach, CNN-LPIF helps alleviate the issue of detail suppression, demonstrating improved robustness and generalization ability, and achieving better overall image fusion performance.
2. Principles of NeRF Technology
The core idea of the neural radiance field (NeRF) model is to fit and represent a complex 3D scene with a fully connected neural network. The scene is modeled as a radiance field that, given a spatial location and a viewing direction, predicts the color and volume density at that point, enabling augmentation. Specifically, the scene is represented as a five-dimensional function, taking as input the 3D spatial coordinates $\mathbf{p} = (x, y, z)$ and the 2D viewing direction $\mathbf{d} = (\theta, \phi)$, and outputting a color $\mathbf{c} = (r, g, b)$ and a volume density $\sigma$.
Because the observed color varies with viewing angle, the color value $\mathbf{c}$ depends on both the spatial location $\mathbf{p}$ and the viewing direction $\mathbf{d}$. The parameter $\sigma$ represents the volume density at a point, indicating the probability of ray termination, and depends only on the position $\mathbf{p}$.
NeRF implicitly represents the entire scene's features using a fully connected multilayer perceptron (MLP). First, the model processes the input 3D position $\mathbf{p}$ through fully connected layers to produce the volume density $\sigma$ and a high-dimensional feature vector. Next, the feature vector is concatenated with the viewing direction $\mathbf{d}$ and passed through additional fully connected layers to predict the color $\mathbf{c}$ at that point. In essence, the continuous scene is encoded as a five-dimensional function as follows:

$$F_{\Theta}: (\mathbf{p}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$$
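To make this mapping concrete, the following is a minimal PyTorch sketch of such a five-dimensional radiance field (the models in this paper are built on Nerfstudio, which is PyTorch-based); the class name TinyNeRF, layer widths, and frequency counts are illustrative assumptions, not the exact architecture used here.

```python
# Minimal sketch of F_Theta: (p, d) -> (c, sigma), assuming sinusoidal positional
# encoding; widths and frequency counts are illustrative, not the paper's settings.
import torch
import torch.nn as nn


def positional_encoding(v: torch.Tensor, n_freqs: int) -> torch.Tensor:
    """Map coordinates to [v, sin(2^k v), cos(2^k v)] features for k < n_freqs."""
    feats = [v]
    for k in range(n_freqs):
        feats.append(torch.sin((2.0 ** k) * v))
        feats.append(torch.cos((2.0 ** k) * v))
    return torch.cat(feats, dim=-1)


class TinyNeRF(nn.Module):
    def __init__(self, pos_freqs: int = 10, dir_freqs: int = 4, width: int = 128):
        super().__init__()
        self.pos_freqs, self.dir_freqs = pos_freqs, dir_freqs
        pos_dim = 3 * (1 + 2 * pos_freqs)          # encoded position size
        dir_dim = 3 * (1 + 2 * dir_freqs)          # encoded direction size
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)      # density depends on position only
        self.color_head = nn.Sequential(           # color depends on position and direction
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, p: torch.Tensor, d: torch.Tensor):
        h = self.trunk(positional_encoding(p, self.pos_freqs))
        sigma = torch.relu(self.sigma_head(h))     # non-negative volume density
        rgb = self.color_head(torch.cat([h, positional_encoding(d, self.dir_freqs)], dim=-1))
        return rgb, sigma
```

The split into a position-only density head and a position-plus-direction color head mirrors the dependency structure stated above: $\sigma$ depends only on $\mathbf{p}$, while $\mathbf{c}$ depends on both $\mathbf{p}$ and $\mathbf{d}$.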
In a trained NeRF model, image synthesis from novel viewpoints is accomplished through volume rendering. Specifically, rays are cast from each target pixel through the scene, and the color of the pixel is obtained by integrating the color and volume density along the ray. This process is mathematically formulated as follows:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt$$

where $T(t)$ denotes the accumulated transmittance, representing the probability that light is not absorbed along the path from the near bound $t_n$ to the point $t$. It is computed as follows:

$$T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)$$
Due to the computational complexity of continuous integration, NeRF and related follow-up studies generally adopt a discrete stratified sampling approach for approximation. Specifically, the interval $[t_n, t_f]$ is partitioned into $N$ equal sub-intervals, within each of which a sample point is randomly selected. The distance between two adjacent samples is denoted as $\delta_i = t_{i+1} - t_i$. With this discrete sampling scheme, the rendering equation can be approximated as follows:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$$

where $T_i$ denotes the approximated accumulated transmittance. Using this discrete sampling approach, the expected color of each pixel is computed, and a fully rendered image is synthesized by aggregating all pixel values. To optimize the network, NeRF minimizes the error between predicted and ground-truth colors, which is used as the loss function and defined as follows:

$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|_2^2$$

where $\hat{C}(\mathbf{r})$ denotes the color value predicted by the model and $C(\mathbf{r})$ represents the ground-truth color.
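As a companion to the equations above, the short sketch below implements the discrete rendering sum for a single ray and the photometric loss; the per-ray tensor shapes are assumptions for illustration.

```python
# Discrete volume rendering for one ray and the MSE training loss (sketch).
import torch


def render_ray(sigma: torch.Tensor, rgb: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """sigma: (N,) densities, rgb: (N, 3) colors, delta: (N,) sample spacings."""
    alpha = 1.0 - torch.exp(-sigma * delta)                        # per-segment opacity
    # T_i = prod_{j<i} (1 - alpha_j) = exp(-sum_{j<i} sigma_j * delta_j)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = trans * alpha                                        # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)                # expected pixel color C_hat


def nerf_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Squared error between rendered and ground-truth pixel colors."""
    return ((pred - target) ** 2).mean()
```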
Figure 2 shows the flowchart of the NeRF algorithm, which is divided into four parts: real-image sampling, fully connected neural network training, 3D reconstruction, and augmented image generation [11].
3. Infrared–Visible Image Generation
Images of the same scene in the two modalities are acquired with a visible-light camera and an infrared camera. NeRF is then used to train the 3D structure of the scene independently in the infrared and visible domains, yielding two neural networks that encode the infrared and visible characteristics of the scene, respectively. Given a new viewing direction, visible and infrared images are generated, which serves as the foundation for subsequent image fusion.
3.1. Scene Selection and Image Acquisition
3.1.1. Scene Selection
For scene selection, to ensure the effectiveness and accuracy of dual-modal reconstruction from visible and infrared images, this study selects typical scenes characterized by pronounced temperature differences and rich visible-light texture details. The first is a real car in an outdoor environment; the second is a container model placed on a heating platform in a laboratory setting. These two scenes respectively simulate representative multimodal imaging scenarios under complex outdoor and controlled indoor conditions, as illustrated in Figure 3.
3.1.2. Image Acquisition Equipment and Methods
In this study, the image acquisition equipment includes a FLIR T620 infrared thermal imaging camera and a Realme GT5 visible-light camera, which were used to capture the infrared and visible-light image data, respectively. The specific parameters are shown in Table 1.
A visible-light camera and an infrared camera were used to capture the same target during the experiment. The image acquisition did not require synchronization and could be conducted from different viewpoints.
In terms of acquisition strategy, to obtain sufficient disparity information and enhance scene depth perception, images of the target were captured from 20 to 30 different viewpoints. Based on the pre-experiment results, an angular interval of 30° proved most effective, so one image was captured every 30° to achieve full 360° surround coverage of the target. Additionally, to fully capture the spatial structure of the object, several image acquisition loops were performed from overhead and low-angle viewpoints, improving the completeness and accuracy of the spatial reconstruction. A schematic diagram of the image acquisition is shown in Figure 4.
3.2. Improvement of NeRF Model
Infrared and visible-light images are captured in different spectral bands. Visible images primarily represent visual features such as surface color, texture, and shape, while infrared images primarily convey thermal characteristics and temperature distribution. Due to substantial differences in imaging principles and information representation, this study employs two improved models, NeRF-IR and NeRF-RGB, to process infrared and visible images, respectively.
This study builds upon the NeRFacto model developed in NeRFstudio, using the two improved models, NeRF-IR and NeRF-RGB, to implicitly reconstruct 3D models from visible and infrared image data. Through training iterations, 3D voxel information is encoded into fully connected neural networks. The network adopts a dual-input NeRF architecture that processes visible and infrared images separately; the two input channels simultaneously learn the visible-light and infrared thermal features of the scene, ultimately producing two fully connected network models: NeRF-IR for infrared scene reconstruction and NeRF-RGB for visible scene reconstruction [22,23]. For the two models, their network structures and parameters were adjusted according to the data characteristics of the different modalities, as shown in Table 2.
Firstly, considering the inherent physical continuity and structural sensitivity of infrared images with respect to the temperature field, we significantly increased the weight of the distortion loss from 0.002 to 0.01. Infrared imaging primarily captures the thermal radiation emitted from object surfaces, where the temperature distribution typically exhibits strong spatial smoothness and physical consistency. Any geometric distortion or local artifact in the reconstructed image may be mistakenly interpreted as a thermal source, crack, or material boundary, thereby severely affecting the accuracy of downstream physical analysis. Therefore, in infrared image reconstruction, maintaining structural fidelity is more critical than in RGB images. In contrast, RGB images can tolerate moderate distortions in local texture or color without significantly compromising the overall visual or semantic understanding.
Secondly, in terms of network architecture design, we propose a modality-aware framework that accounts for the representational differences between RGB and infrared images, aiming to balance reconstruction fidelity and computational efficiency. For the RGB model, given its rich color information and intricate texture patterns, the reconstruction task imposes higher representational demands. Accordingly, the backbone network is configured with 128 hidden units, and the color branch uses a hidden dimension of 128. In addition, the number of feature maps per level is increased to 4 to enhance the network's capacity for capturing high-frequency details. A denser ray-sampling strategy (coarse sampling) is also adopted to extract spatial variations in illumination and color more effectively, thereby improving the accuracy of image detail restoration.
In contrast, the infrared model focuses on capturing the spatial distribution of thermal intensity, which is relatively smooth and contains fewer texture details. To mitigate overfitting, especially against sparsely distributed thermal backgrounds, we reduce the complexity of the color branch and decrease the number of ray samples per view. Specifically, the sampling strategy is optimized by setting the fine sampling per ray to (128, 64), which preserves structural consistency while significantly reducing redundant computation and thereby accelerating convergence.
Despite the lower sampling density in the infrared model, the increased weight on distortion loss effectively guides the model to maintain 3D structural coherence and physical plausibility throughout training. Overall, the modality-specific design not only considers the representational disparities between RGB and infrared modalities, but also demonstrates adaptive control over network complexity and sampling strategy. This ensures improved modeling accuracy and efficiency across both visible and infrared imaging scenarios.
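As a reference point, the settings discussed above can be expressed as configuration overrides in Nerfstudio, on which both models are built. The field names below follow Nerfstudio's NerfactoModelConfig; any value not explicitly stated in the text (for example, the infrared hidden widths and the RGB sample count) is an illustrative assumption, not the exact configuration used in this work.

```python
# Hedged sketch: modality-specific settings from Table 2 expressed as Nerfstudio
# NerfactoModelConfig overrides. Unstated values are illustrative assumptions.
from nerfstudio.models.nerfacto import NerfactoModelConfig

# NeRF-RGB: larger capacity and denser coarse sampling for rich texture and color.
nerf_rgb_config = NerfactoModelConfig(
    hidden_dim=128,                           # backbone width ("hidden units = 128")
    hidden_dim_color=128,                     # color-branch width ("hidden dim = 128")
    num_nerf_samples_per_ray=64,              # denser sampling; illustrative value
    distortion_loss_mult=0.002,               # default distortion weight kept for RGB
)

# NeRF-IR: lighter color branch, fewer samples, stronger distortion regularization.
nerf_ir_config = NerfactoModelConfig(
    hidden_dim=64,                            # illustrative reduced capacity
    hidden_dim_color=64,
    num_proposal_samples_per_ray=(128, 64),   # "fine sampling = (128, 64)" in the text
    distortion_loss_mult=0.01,                # raised from 0.002 to 0.01
)
```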
3.3. New Perspective Registration Image Generation
To achieve image registration and further enable infrared–visible image fusion, this study addresses the alignment problem between generated visible and infrared images by using identical camera poses to ensure augmentation from the same field of view. Specifically, the two improved fully connected networks—NeRF-IR and NeRF-RGB—are trained to learn the implicit representation of the 3D scene, allowing for novel view synthesis via re-rendering. By inputting identical augmentation camera poses into the two trained NeRF models (infrared and visible), preliminary dual-modality image registration is achieved under new viewpoints.
3.3.1. Principle of Image Augmentation
The augmentation process consists of the following four main steps:
Obtain a new camera pose:
Camera pose includes both position and orientation. Before performing visible and infrared image augmentation, a camera pose different from the original captured images must be selected to generate augmented images from a new viewpoint.
Cast rays from the new camera viewpoint:
For each pixel during rendering, a ray is emitted from the virtual camera through the pixel into the 3D scene space. To determine the color and opacity of each pixel, the ray must be sampled at its intersections with the 3D scene. Each ray can be represented as

$$\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$$

where $\mathbf{o}$ denotes the camera position, $\mathbf{d}$ is the ray direction, and $t$ is the depth parameter along the ray, representing the distance from the camera origin to a specific point in the scene.
Neural network prediction of color and density:
By inputting the 3D coordinates $\mathbf{x}$ of a sampled point and its viewing direction $\mathbf{d}$ into the NeRF-IR and NeRF-RGB networks, the networks output the corresponding color $\mathbf{c}$ and volume density $\sigma$.
Volume rendering for pixel color computation:
The final pixel color along each ray is computed using the volume rendering equation as follows:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) \mathbf{c}_i$$

where $T_i$ denotes the accumulated transmittance from the first sampled point to the current one, $\sigma_i$ represents the density at the $i$-th sampled point, $\delta_i$ is the depth interval between adjacent samples, and $\mathbf{c}_i$ is the color value of the $i$-th point.
Finally, the pixel-size parameter of the camera is determined, and the above steps are repeated for the rays emitted from every pixel to compute their color values. The resulting pixel colors are then aggregated to generate the augmented image [11].
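To illustrate how a shared camera pose drives both models, the sketch below casts one ray per pixel from a camera-to-world pose and pinhole intrinsics and composites the colors with the render_ray and TinyNeRF sketches given in Section 2; the function names and the pinhole convention are illustrative assumptions, not the rendering code used in Nerfstudio.

```python
# Sketch: cast rays r(t) = o + t*d from a shared pose and render with a trained field.
import torch


def generate_rays(H, W, fx, fy, cx, cy, c2w):
    """One ray per pixel in world coordinates, given a 4x4 camera-to-world matrix."""
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs = torch.stack([(i - cx) / fx, -(j - cy) / fy, -torch.ones_like(i)], dim=-1)
    rays_d = (dirs[..., None, :] * c2w[:3, :3]).sum(-1)            # rotate into world frame
    rays_d = rays_d / rays_d.norm(dim=-1, keepdim=True)            # unit ray directions d
    rays_o = c2w[:3, 3].expand(H, W, 3)                            # camera position o
    return rays_o, rays_d


def render_view(model, rays_o, rays_d, near=0.5, far=6.0, n_samples=64):
    """Uniformly sample each ray, query the field, and composite the pixel color."""
    t = torch.linspace(near, far, n_samples)
    delta = torch.full((n_samples,), (far - near) / n_samples)
    H, W = rays_o.shape[:2]
    image = torch.zeros(H, W, 3)
    for u in range(H):
        for v in range(W):
            pts = rays_o[u, v] + t[:, None] * rays_d[u, v]         # (n_samples, 3) points
            rgb, sigma = model(pts, rays_d[u, v].expand(n_samples, 3))
            image[u, v] = render_ray(sigma.squeeze(-1), rgb, delta)
    return image


# Feeding the SAME (H, W, fx, fy, cx, cy, c2w) to the trained NeRF-RGB and NeRF-IR
# fields yields a visible/infrared image pair that is registered by construction.
```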
3.3.2. Principle of Image Registration
Directly fusing captured images can result in misalignment due to variations in shooting angle, position, and lighting conditions. Such misregistration may cause noticeable seams or distortions in the composite image, compromising the naturalness and accuracy of the final result. Image registration is therefore an indispensable step in the fusion process, ensuring accurate alignment between images to eliminate misregistration and improve the quality of the fused result. As shown in Figure 5, when unregistered real infrared and visible images are cropped and fused directly, the result exhibits evident seams and distortions.
Upon completion of the reconstruction, an efficient image registration strategy is proposed to achieve precise spatial matching between different modalities. Specifically, the augmented viewing angles were unified, allowing both visible and infrared images to be augmented under the same observation conditions, thus eliminating inconsistencies caused by differences in viewing angles. Next, to further enhance registration accuracy, the same intrinsic parameters and camera poses were set, ensuring that both visible and infrared images were augmented using identical camera configurations. This method successfully mitigates the influence of differences in camera parameters and view angles, leading to high-precision spatial registration between the augmented visible and infrared images. This ensures the consistency and accuracy of the images in the physical space.
Specifically, it is necessary to determine the transformation relationship between the visible-light camera coordinate system and the infrared camera coordinate system, as follows:

$$\mathbf{P}_{C} = s\,\mathbf{R}\,\mathbf{P}_{W} + \mathbf{T}$$

where $\mathbf{P}_{W}$ denotes a point in the world coordinate system and $\mathbf{P}_{C}$ represents the corresponding point in the visible-light camera coordinate system, $\mathbf{T}$ is the translation vector, $s$ is the scale factor, and $\mathbf{R}$ is the rotation matrix, which can be expressed in terms of the roll angle ($\phi$), pitch angle ($\theta$), and yaw angle ($\psi$) as follows:

$$\mathbf{R} = \mathbf{R}_z(\psi)\,\mathbf{R}_y(\theta)\,\mathbf{R}_x(\phi)$$
The estimation of $\mathbf{R}$, $s$, and $\mathbf{T}$ is achieved using the Umeyama algorithm, which performs point-to-point matching between the two point sets $P = \{\mathbf{p}_i\}$ (visible point cloud) and $Q = \{\mathbf{q}_i\}$ (infrared point cloud), with the objective of computing an optimal rigid transformation that minimizes the residual distance between the two point sets after transformation, as follows:

$$\min_{s,\,\mathbf{R},\,\mathbf{T}} \sum_{i=1}^{n} \left\| \mathbf{q}_i - \left( s\,\mathbf{R}\,\mathbf{p}_i + \mathbf{T} \right) \right\|_2^2$$
The optimal rotation matrix $\mathbf{R}$ and translation vector $\mathbf{T}$ are obtained by constructing the covariance matrix and applying singular value decomposition (SVD). The simulation results are shown in Figure 6: (a) and (b) depict the augmented camera views, presenting the scene-rendering effects under the two modalities, while (c) and (d) display the spatial layout of the 3D scene, clearly showing the spatial relationship between the model and the cameras. Through this registration strategy, more reliable, high-precision input data can be provided for subsequent image processing and fusion.
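For concreteness, the following is a minimal NumPy sketch of the Umeyama estimation described above, assuming two already-corresponded point sets P (visible) and Q (infrared) of shape (N, 3); it illustrates the standard algorithm rather than the exact implementation used in the experiments.

```python
# Umeyama similarity-transform estimation (sketch): returns (s, R, T) minimizing
# sum_i || Q_i - (s * R @ P_i + T) ||^2 for corresponded point sets P and Q.
import numpy as np


def umeyama(P: np.ndarray, Q: np.ndarray):
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    cov = Qc.T @ Pc / P.shape[0]                      # 3x3 cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:      # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt                                    # optimal rotation
    var_p = (Pc ** 2).sum() / P.shape[0]
    s = np.trace(np.diag(D) @ S) / var_p              # optimal scale
    T = mu_q - s * R @ mu_p                           # optimal translation
    return s, R, T


# Example: s, R, T = umeyama(vis_points, ir_points) maps points of P (visible)
# into the coordinate frame of Q (infrared).
```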
3.3.3. Image Registration Results
To achieve precise spatial alignment between infrared and visible-light modalities, we first export the reconstructed 3D point cloud model based on NeRF-IR in PLY format, facilitating subsequent analysis and registration processing. Considering non-ideal initial conditions in real-world scenarios (e.g., noise interference, occlusions, and viewpoint discrepancies), a series of robustness-enhancing preprocessing steps were introduced. Specifically, we perform spatial cropping on the reconstructed point cloud to extract the target region subset, followed by point cloud downsampling, normal estimation, and feature enhancement using Fast Point Feature Histograms (FPFH), thereby improving point correspondence quality and registration stability.
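For reproducibility, a hedged sketch of this preprocessing pipeline using the Open3D library is given below; the file names, voxel size, and search radii are illustrative assumptions, and the RANSAC step shown here stands in for a generic FPFH-based coarse alignment (a recent Open3D release with the pipelines.registration API is assumed).

```python
# Sketch: point-cloud preprocessing (downsampling, normals, FPFH) and FPFH-based
# coarse alignment with Open3D. Paths and parameter values are illustrative only.
import open3d as o3d


def preprocess(path: str, voxel: float = 0.05):
    pcd = o3d.io.read_point_cloud(path)                       # cropped PLY exported from NeRF
    pcd = pcd.voxel_down_sample(voxel)                        # downsample for stability
    pcd.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100))
    return pcd, fpfh


vis_pcd, vis_fpfh = preprocess("nerf_rgb_cropped.ply")        # visible point cloud
ir_pcd, ir_fpfh = preprocess("nerf_ir_cropped.ply")           # infrared point cloud

# Coarse alignment by RANSAC over FPFH correspondences; scaling is enabled because
# the two reconstructions may differ in metric scale.
result = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
    ir_pcd, vis_pcd, ir_fpfh, vis_fpfh,
    mutual_filter=True,
    max_correspondence_distance=0.1,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(True),
)
print(result.transformation)                                  # 4x4 similarity transform
```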
Figure 7 illustrates the extracted visible and infrared point cloud models used in the experiments. To focus on key geometric structures, we crop the exported 3D point clouds and retain only the local region containing the target object.
Next, to achieve spatial alignment of the multimodal point clouds, we adopt a registration method based on feature matching and rigid transformation estimation. This approach estimates the transformation between the infrared point cloud and the visible point cloud, which consists of three components: a translation vector $\mathbf{T}$ that describes the positional offset between point cloud centroids, a scaling factor $s$ to address size discrepancies between models, and a rotation matrix $\mathbf{R}$ to capture orientation differences between coordinate systems. Taking an outdoor vehicle target as an example, Figure 8 compares the visible and infrared point clouds before and after registration within a unified coordinate system.
As shown in the figure, the registered infrared point cloud exhibits high geometric consistency with the visible point cloud, indicating that the proposed method achieves accurate alignment performance under complex multimodal conditions.
At the parameter level, the estimated transformation results obtained from the registration process are summarized in Table 3.
In terms of spatial registration accuracy, the proposed method achieves an 86.13% point-matching ratio in typical test scenarios, demonstrating high precision. In comparison with state-of-the-art (SOTA) methods, LoFTR [39] achieves an AUC@5° of 52.8% in standard pose estimation tasks, R2D2 [40] reports a pose accuracy of 62.7%, and D2-Net [41] obtains a matching accuracy of 78.4%. Although these approaches perform well in image registration tasks, they often struggle with multimodal point cloud alignment due to texture inconsistencies and viewpoint variations.
In contrast, the point cloud registration method adopted in this work, based on geometric consistency, integrates both coarse pre-alignment and rigid optimization strategies. This effectively suppresses cross-modality error propagation and offers a significant advantage in registration accuracy, making it particularly suitable for infrared–visible fusion scenarios dominated by structural information.
3.4. Image Augmentation Results and Analysis
3.4.1. Model Performance Optimization Verification
During the NeRF training process, the dataset was split into 90% for training and 10% for testing so that the model's generalization ability could be evaluated on unseen viewpoints. During testing, real captured images from the test set were selected, and their corresponding camera poses were input into the model to generate synthetic images from the same viewpoints for similarity comparison. Two experimental scenes were selected: a car on a playground and a container on a heating platform. The control group employed the NeRFacto model from the NeRFstudio community, whereas the experimental group consisted of the enhanced NeRF-RGB model tailored for visible-light scenes and the modified NeRF-IR model adapted for infrared scenes [21].
The NeRFacto method, as an improved algorithm, has become one of the widely adopted benchmarks in 3D scene reconstruction due to its effective balance between training efficiency and rendering quality. Additionally, the NeRFacto method exhibits strong scalability and adaptability, making it suitable for scene reconstruction tasks of varying scales and complexities. Given these advantages, this study selects the NeRFacto method as the experimental control group to objectively evaluate the performance of the newly proposed algorithms. This choice aligns with standard benchmarking practice in computer vision, ensuring the reliability and comparability of the experimental results.
Both the proposed NeRF-RGB and NeRF-IR models are implemented based on the open-source framework Nerfstudio and trained/tested on a computing platform equipped with an NVIDIA GeForce RTX 4060 Laptop GPU. The average training time per session is approximately 25 min, indicating high training efficiency. The developed models and related source codes have been publicly released to facilitate academic exchange and further research.
During the experiment, a real car and a container model on a heating platform were selected as the scenes, with augmented images from the different models shown in Figure 9. In Figure 9, (a), (d), and (g) are real captured images; (b), (e), and (h) are augmented images generated by the NeRFacto model; (c) is an augmented image generated by the NeRF-IR model; and (f) and (i) are augmented images generated by the NeRF-RGB model.
To assess the realism of the augmented images, multiple metrics were employed to evaluate the similarity between the augmented and real captured images. These metrics include structural similarity index (SSIM), cosine similarity, mutual information, peak signal-to-noise ratio (PSNR), and histogram similarity.
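These metrics can be computed with standard libraries; the sketch below (scikit-image, scikit-learn, OpenCV) shows one common way to do so for a pair of aligned grayscale uint8 images. The 64-bin joint-histogram estimator for mutual information and the histogram-correlation measure are assumptions for illustration, not necessarily the exact estimators used for Table 4.

```python
# Sketch: the five similarity metrics for an (aligned) real/augmented image pair.
import cv2
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio
from sklearn.metrics import mutual_info_score


def similarity_report(real: np.ndarray, synth: np.ndarray) -> dict:
    """real, synth: aligned single-channel uint8 images of the same size."""
    r, s = real.ravel().astype(np.float64), synth.ravel().astype(np.float64)
    hist_r = cv2.calcHist([real], [0], None, [256], [0, 256])
    hist_s = cv2.calcHist([synth], [0], None, [256], [0, 256])
    joint, _, _ = np.histogram2d(real.ravel(), synth.ravel(), bins=64)
    return {
        "SSIM": structural_similarity(real, synth),
        "PSNR": peak_signal_noise_ratio(real, synth),
        "cosine": float(r @ s / (np.linalg.norm(r) * np.linalg.norm(s))),
        "MI": mutual_info_score(None, None, contingency=joint),    # discrete MI estimate
        "hist_similarity": float(cv2.compareHist(hist_r, hist_s, cv2.HISTCMP_CORREL)),
    }
```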
Table 4 presents the similarity metrics between the augmented images generated by the different models and their corresponding real images. Table 5 shows the improvements in these metrics obtained by comparing the proposed models with the NeRFacto model.
Evaluation of Model Enhancement Results:
NeRF-IR Model (Car Scene)
The SSIM and PSNR scores of the NeRF-IR model improved by 7.90% and 18.44%, respectively, demonstrating considerable enhancement in both structural similarity and visual quality, with clearer and more detailed images. Mutual information increased by 28.63%, indicating that the refined model captures a greater amount of meaningful content. Histogram similarity rose by 25.31%, which shows that color channels are better distributed, enhancing the natural appearance of colors.
NeRF-RGB Model (Container Scene)
A 12.60% increase in SSIM for the NeRF-RGB model highlights its significant improvement in preserving structural details. Mutual information rose by 32.72%, demonstrating a stronger capacity for capturing essential scene information. The PSNR improved by 26.06%, reflecting reduced noise and improved image fidelity. A 14.95% gain in histogram similarity reflects better color consistency and precision in the augmented images.
The enhanced NeRF-IR and NeRF-RGB models demonstrated consistent improvements in major metrics such as PSNR, SSIM, mutual information, and histogram similarity. These enhancements not only improved image quality and structural fidelity, but also significantly increased the informational content and visual naturalness of the augmented images.
3.4.2. Validation of Model Augmentation Capability Under Restricted Perspective
To validate the generalization ability and limitations of the proposed NeRF-IR model under constrained viewing conditions, we designed a set of experiments simulating practical application scenarios. Specifically, only partial angular views were used for training, addressing the common real-world challenge of lacking full 360° data acquisition. In this experiment, multiview images were collected within an azimuth range of approximately 120° from the side of the target model, as illustrated in Figure 10.
Based on these limited observations, the improved NeRF-IR model was employed to perform 3D reconstruction and synthesize infrared images within the captured view range (i.e., within the 120° sector). A qualitative comparison between the synthesized and real infrared images is presented in Figure 11. The visual results show that the synthesized images maintain high consistency with the real ones in terms of structural detail restoration and thermal feature representation, preliminarily validating the model's effectiveness under limited-view conditions.
To further quantify the model's performance in novel view synthesis, several mainstream image quality assessment metrics were adopted, including the structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), cosine similarity, mutual information (MI), and histogram similarity. The evaluation results between the synthesized and real images are summarized in Table 6.
From both the quantitative and qualitative analyses, it is evident that the NeRF-IR model demonstrates strong 3D reconstruction capability and high-quality infrared image synthesis even under limited-view training conditions. Notably, for previously unseen views, the synthesized images achieve high scores across multiple metrics, indicating robust spatial generalization and thermal structure preservation. These findings validate the scalability and practical value of NeRF-based approaches for infrared image modeling and generation in data-limited scenarios, offering a promising direction for infrared perception and reconstruction in complex or inaccessible environments.
5. Conclusions
This study proposes a multimodal image registration and fusion technique based on an improved NeRF method, which demonstrates significant advantages in addressing the registration problem between visible light and infrared images. By introducing a geometry-consistent point cloud registration method, combined with pre-registration and rigid optimization strategies, we achieved an 86.13% point cloud matching rate in typical test scenarios. This significantly outperforms existing state-of-the-art (SOTA) algorithms such as LoFTR, R2D2, and D2-Net; while these methods excel in single-modal registration tasks, they often exhibit instability when dealing with multimodal point cloud alignment due to perspective changes and texture differences. In contrast, the proposed method effectively suppresses cross-modal error propagation, maintaining high-precision registration performance, especially in infrared–visible fusion scenarios dominated by structural information.
The improved NeRF-IR and NeRF-RGB models show significant improvements across multiple key metrics compared with the widely recognized NeRFacto model, particularly in image quality and structural similarity. The NeRF-IR model demonstrated a 7.90% improvement in SSIM and an 18.44% improvement in PSNR, indicating a notable enhancement in image clarity and detail, and its increase in mutual information highlights the model's ability to capture more useful information. The NeRF-RGB model also exhibited substantial improvements across several metrics, particularly in noise reduction and information extraction, with SSIM and PSNR improving by 12.60% and 26.06%, respectively. These improvements result in more natural and accurate color distribution in the generated images. Overall, the improved models not only enhance the visual quality and structural restoration capabilities of the images, but also increase their information content, enabling the rendered images to exhibit higher naturalness and realism in multimodal fusion tasks.
Additionally, the images generated by NeRF exhibit high spatial continuity and smoothness, which poses challenges for traditional image fusion methods but can be effectively handled by deep-learning-based fusion methods such as CNN-LPIF. These methods adaptively determine pixel-level fusion strategies, reducing detail loss and thus achieving superior fusion quality and robustness compared with traditional methods.
Specifically, to address the challenge of obtaining accurately registered images for visible light and infrared dual-modal fusion, this study employs an improved NeRF method for 3D scene reconstruction and image enhancement using visible light and infrared images captured from multiple angles. The study utilizes uniform camera parameters and pose settings for both infrared and visible light point cloud scenes to achieve spatially registered dual-modal image data, which is then used for image fusion. This approach effectively resolves the difficulty of precise registration between visible light and infrared images during acquisition, providing a high-quality data foundation and an accurate registration solution for subsequent multimodal image fusion. The main innovation of this work lies in the following:
By constructing a dual-model architecture of NeRF-RGB (visible scenes) and NeRF-IR (infrared scenes), the performance of multimodal image augmentation was significantly improved. With network structure adjustment and hyperparameter optimization, NeRF-IR achieved a 28.63% increase in mutual information (MI) and an 18.44% improvement in PSNR for infrared image reconstruction, while NeRF-RGB achieved 12.60% and 26.06% improvements in SSIM and PSNR, respectively, for visible-light scenes.
Various image fusion methods, such as MS-SRIF, PCA-MSIF, and CNN-LPIF, were employed. Metric analysis shows that different algorithms exhibit advantages in edge detail preservation, gradient retention, image smoothness, and noise resistance, and their applicable scenarios are discussed.
After 3D scene reconstruction, new camera parameters and pose settings were applied to generate augmented images, providing an innovative solution for multimodal image registration. This method enables the generation of image information from unknown viewpoints based on known infrared/visible data, overcoming the dependence of traditional methods on data completeness.