4.1. Dataset and Implementation Details
A comprehensive evaluation of our proposed registration method was performed through comparative and ablation studies using the publicly available LLVIP dataset (visible-infrared paired dataset for low-light vision), with representative samples illustrated in Figure 5. Rigorous quantitative metrics and qualitative assessments were employed to validate the method’s effectiveness for multimodal image registration in UAV applications, demonstrating its robustness under challenging low-light conditions.
The LLVIP dataset consists of 15,488 high-resolution image pairs, each containing an infrared and a visible light image. Of these, 12,025 pairs are designated for training, while 3,463 pairs are reserved for testing. This train/test split of 12,025/3,463 pairs (approximately 78%/22%) follows the official recommendation of the LLVIP dataset authors and offers several advantages: the training set of 12,025 pairs provides sufficient diversity to support comprehensive network learning, while the independent test set of 3,463 pairs is large enough to yield statistically meaningful evaluation results and to reliably expose overfitting. This division also aligns with the widely adopted practice of allocating 70–80% of data for training and 20–30% for testing, facilitating comparison with other methods and ensuring reproducibility.

Designed to simulate UAV perspectives, the dataset predominantly features urban road scenes with diverse targets, including vehicles, pedestrians, buildings, and trees. It covers a wide range of scenarios, from well-lit daytime conditions to challenging low-light nighttime environments. A key feature of the LLVIP dataset is the rigorous pre-registration of infrared and visible image pairs, which ensures precise spatial alignment between the two modalities. This alignment facilitates the application of controlled transformations, enabling comprehensive evaluation of algorithm performance across different conditions. Furthermore, the dataset’s aerial-to-ground perspective closely mirrors typical UAV imaging geometry, providing reliable data for algorithm assessment.
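For readers who want to reproduce the data handling, the paired organization of LLVIP can be consumed with a loader along the lines of the sketch below. This is not part of the proposed method; the directory layout (per-split visible and infrared folders with matching filenames) is an assumption based on the public release and may need adjusting to a local copy.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class LLVIPPairs(Dataset):
    """Minimal paired visible/infrared loader (layout assumed, not prescribed by the paper)."""

    def __init__(self, root, split="train", transform=None):
        self.vis_dir = os.path.join(root, "visible", split)
        self.ir_dir = os.path.join(root, "infrared", split)
        self.names = sorted(os.listdir(self.vis_dir))  # filenames assumed shared across modalities
        self.transform = transform

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        visible = Image.open(os.path.join(self.vis_dir, name)).convert("RGB")
        infrared = Image.open(os.path.join(self.ir_dir, name)).convert("L")
        if self.transform is not None:
            visible, infrared = self.transform(visible, infrared)
        return visible, infrared
```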
The proposed method was implemented using the PyTorch (2.7.1) framework, with the training process divided into two distinct phases. In the first phase, the multi-scale style transfer network was independently trained to generate high-quality pseudo-infrared images. Upon completion, the parameters of this network were frozen, and the generated pseudo-infrared images, along with visible light images, were used as inputs to train the multi-scale cascaded registration network in the second phase. All training was conducted on a high-performance computing platform equipped with an Intel® Xeon® CPU E5-2678 v3 @ 2.50 GHz processor, an NVIDIA® GeForce® RTX® 3090 GPU, and 96 GB of RAM.
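The two-phase schedule can be summarized as follows: after phase one, CSTNet acts purely as a fixed pseudo-infrared generator while only MCRNet receives gradient updates. The sketch below illustrates this hand-off in PyTorch; the Conv2d stand-ins, the loss term, and the learning rate are placeholders rather than the actual networks and objectives.

```python
import torch
import torch.nn as nn

# Stand-ins so the sketch is self-contained; the real CSTNet/MCRNet architectures
# are described earlier in the paper and are not reproduced here.
cstnet = nn.Conv2d(3, 1, 3, padding=1)            # placeholder for the trained style-transfer network
for p in cstnet.parameters():
    p.requires_grad_(False)                       # phase 2: CSTNet parameters are frozen
cstnet.eval()

mcrnet = nn.Conv2d(2, 2, 3, padding=1)            # placeholder for the registration network
optimizer = torch.optim.Adam(mcrnet.parameters(), lr=1e-4)  # learning rate is illustrative

visible = torch.rand(2, 3, 256, 256)              # dummy visible batch
infrared = torch.rand(2, 1, 256, 256)             # dummy (transformed) infrared batch

with torch.no_grad():
    pseudo_ir = cstnet(visible)                   # fixed pseudo-infrared generation, no gradients

flow = mcrnet(torch.cat([pseudo_ir, infrared], dim=1))   # registration forward pass
loss = flow.abs().mean()                          # placeholder for the registration losses
optimizer.zero_grad()
loss.backward()
optimizer.step()
```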
To enhance the model’s robustness, the input images for CSTNet undergo a series of preprocessing steps designed to address specific challenges in cross-modal UAV image registration. First, the images are resized to a fixed resolution and then randomly cropped to 256 × 256 pixels. This two-step approach serves a dual purpose: the initial resizing ensures consistent scale normalization across the dataset, while the subsequent random cropping introduces spatial variability that simulates the unpredictable framing variations encountered during UAV operations. The images are finally normalized to the range [−1, 1] to match the output range of the tanh activation function commonly used in GAN generators, ensuring stable gradient flow during adversarial training.
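A minimal torchvision sketch of this preprocessing is given below; the intermediate resize resolution is an illustrative assumption (its exact value is not restated here), while the crop size and the [−1, 1] normalization follow the description above.

```python
import torchvision.transforms as T

# Sketch of the CSTNet input preprocessing for visible (RGB) images.
preprocess = T.Compose([
    T.Resize((288, 288)),                                     # scale normalization (resolution illustrative)
    T.RandomCrop(256),                                        # spatial variability via random cropping
    T.ToTensor(),                                             # PIL image -> [0, 1] tensor
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),   # map [0, 1] -> [-1, 1] to match tanh
])
```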
The training process utilizes the Adam optimizer, selected for its adaptive learning rate and momentum-based updates, which help navigate the complex loss landscape of adversarial networks. CSTNet is trained for 200 epochs on a single GPU; the learning rate is held constant for the first 100 epochs, following established best practices for stable GAN convergence, and then decays linearly to 0 over the remaining 100 epochs to prevent oscillations and ensure fine-tuned convergence.
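This schedule can be implemented with a LambdaLR scheduler, as sketched below; the base learning rate shown is illustrative, since its exact value is not restated here.

```python
import torch

model = torch.nn.Linear(8, 8)                                   # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)       # base learning rate is illustrative

def linear_decay(epoch, total_epochs=200, flat_epochs=100):
    # Factor 1.0 for the first 100 epochs, then a linear ramp down to 0 at epoch 200.
    if epoch < flat_epochs:
        return 1.0
    return max(0.0, (total_epochs - epoch) / (total_epochs - flat_epochs))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)

for epoch in range(200):
    # ... one CSTNet training epoch over the LLVIP training split ...
    scheduler.step()
```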
For MCRNet inputs, similar preprocessing steps are applied, including random cropping to 256 × 256 pixels to maintain consistency with the CSTNet output dimensions. Additionally, random affine transformations are applied to the infrared images to simulate realistic UAV motion patterns. These transformations combine random translations (expressed as a fraction of the image dimensions) with random rotations, with ranges determined through empirical analysis of typical UAV flight dynamics and camera stability characteristics. Furthermore, deformable transformations based on Gaussian random fields are implemented to model non-rigid distortions caused by atmospheric turbulence, thermal effects, and sensor noise. A Gaussian convolutional kernel with a standard deviation (σ) of 32 was selected to generate realistic local deformations that preserve global structure while introducing sufficient complexity to train robust registration networks capable of handling real-world imaging conditions.
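The deformable augmentation can be approximated by smoothing white-noise displacement fields with a Gaussian kernel of the stated standard deviation and warping the image with the result. The sketch below is an illustrative re-implementation rather than the exact augmentation code; the noise amplitude, kernel truncation, and warping details are assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel1d(sigma, truncate=2.0):
    # 1-D Gaussian kernel, truncated at `truncate` standard deviations (truncation is illustrative).
    radius = int(truncate * sigma)
    x = torch.arange(-radius, radius + 1, dtype=torch.float32)
    k = torch.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def random_deformation(h, w, sigma=32.0, amplitude=10.0):
    # Smooth a white-noise displacement field with a separable Gaussian (sigma = 32 as stated above);
    # the +/- `amplitude` pixel rescaling is an illustrative choice.
    noise = torch.randn(1, 2, h, w)
    k = gaussian_kernel1d(sigma)
    pad = k.numel() // 2
    kx = k.view(1, 1, 1, -1).repeat(2, 1, 1, 1)
    ky = k.view(1, 1, -1, 1).repeat(2, 1, 1, 1)
    noise = F.conv2d(noise, kx, padding=(0, pad), groups=2)
    noise = F.conv2d(noise, ky, padding=(pad, 0), groups=2)
    return noise / noise.abs().max() * amplitude          # displacement field, shape (1, 2, h, w)

def warp(image, flow):
    # Bilinearly resample `image` (matching batch size) with a pixel-space displacement field.
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)
    return F.grid_sample(image, grid, align_corners=True)

infrared = torch.rand(1, 1, 256, 256)
deformed = warp(infrared, random_deformation(256, 256))
```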
Regarding computational efficiency, the complete pipeline (CSTNet + MCRNet) processes a 256 × 256 image pair in approximately 0.12 s on an RTX 3090 GPU, making it suitable for near-real-time UAV applications. The diffeomorphic integration accounts for roughly 25% of the total computation time, which can be further optimized through the strategies mentioned above for time-critical applications.
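Timings of this kind can be reproduced with CUDA events, as in the brief sketch below; a Conv2d stand-in replaces the actual CSTNet + MCRNet pipeline, and a CUDA-capable GPU is required.

```python
import torch

pipeline = torch.nn.Conv2d(3, 1, 3, padding=1).cuda().eval()   # stand-in for the full pipeline
pair = torch.rand(1, 3, 256, 256, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    _ = pipeline(pair)                     # warm-up to exclude one-time initialization costs
torch.cuda.synchronize()

start.record()
with torch.no_grad():
    _ = pipeline(pair)
end.record()
torch.cuda.synchronize()
print(f"per-pair time: {start.elapsed_time(end) / 1000.0:.4f} s")   # elapsed_time returns milliseconds
```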
MCRNet is trained for 1200 epochs using the Adam optimizer on a single GPU. The learning rate is reduced by a factor of 0.1 every 800 epochs to ensure stable convergence throughout the extended training period.
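This step decay corresponds directly to a StepLR scheduler, sketched below with an illustrative initial learning rate, since the exact value is not restated here.

```python
import torch

model = torch.nn.Linear(8, 8)                        # stand-in for MCRNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial learning rate is illustrative
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=800, gamma=0.1)

for epoch in range(1200):
    # ... one MCRNet training epoch ...
    scheduler.step()
```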
To objectively evaluate the performance of image registration methods, four commonly used evaluation metrics were selected: mean squared error (MSE), normalized cross-correlation (NCC), local normalized cross-correlation (LNCC), and mutual information (MI). These metrics comprehensively assess the similarity between registered and fixed images from different perspectives.
MSE quantifies pixel-level differences between the registered and fixed images, with lower values indicating better registration accuracy.
NCC, a statistical correlation measure ranging from −1 to 1, indicates better registration as its value approaches 1. Compared to MSE, NCC offers the advantage of being insensitive to intensity variations, making it particularly suitable for scenarios with significant illumination changes.
LNCC, the localized version of NCC, enhances the measurement of nonlinear local variations by computing normalized cross-correlation within each local region.
MI measures the mutual dependence between registered and fixed images, proving particularly effective for multimodal image registration. Higher MI values indicate greater information sharing between images, corresponding to improved registration outcomes.
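To make these definitions concrete, the four metrics can be computed roughly as follows; these are illustrative implementations, and the LNCC window size and MI histogram bin count are assumptions rather than the settings used in the experiments. Inputs are single-channel images with intensities in [0, 1].

```python
import torch
import torch.nn.functional as F

def mse(a, b):
    # Mean squared pixel difference; lower is better.
    return torch.mean((a - b) ** 2)

def ncc(a, b):
    # Global normalized cross-correlation in [-1, 1]; higher is better.
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

def lncc(a, b, win=9):
    # Mean of per-window normalized cross-correlation (window size is an assumption).
    a, b = a[None, None], b[None, None]
    kernel = torch.ones(1, 1, win, win) / (win * win)
    mu_a, mu_b = F.conv2d(a, kernel), F.conv2d(b, kernel)
    var_a = F.conv2d(a * a, kernel) - mu_a ** 2
    var_b = F.conv2d(b * b, kernel) - mu_b ** 2
    cov = F.conv2d(a * b, kernel) - mu_a * mu_b
    return (cov / torch.sqrt(var_a.clamp(min=1e-8) * var_b.clamp(min=1e-8))).mean()

def mi(a, b, bins=64):
    # Mutual information from a joint intensity histogram (bin count is an assumption).
    ia = (a.flatten() * (bins - 1)).long().clamp(0, bins - 1)
    ib = (b.flatten() * (bins - 1)).long().clamp(0, bins - 1)
    joint = torch.zeros(bins, bins)
    joint.index_put_((ia, ib), torch.ones_like(ia, dtype=torch.float32), accumulate=True)
    pxy = joint / joint.sum()
    px, py = pxy.sum(1, keepdim=True), pxy.sum(0, keepdim=True)
    nz = pxy > 0
    return (pxy[nz] * torch.log(pxy[nz] / (px @ py)[nz])).sum()
```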
4.2. Comparative Experiment Analysis
A comprehensive comparison was conducted between our proposed method and current mainstream image registration approaches. Due to the limited availability of multimodal image registration methods, the proposed CSTNet was first employed for modality unification to ensure a fair comparison. Subsequently, the proposed MCRNet was compared with other single-modal image registration methods. The comparative methods were categorized into three main groups: (1) traditional feature-based registration methods, including SIFT and ORB; (2) optimization theory-based registration approaches, such as MSE and NCC; and (3) deep-learning-based deformable registration methods, specifically VoxelMorph and VTN.
To comprehensively evaluate the performance of registration methods on UAV platforms, four sets of comparative experiments were designed. These experiments were systematically conducted under various image transformation conditions to assess the robustness and accuracy of the proposed method in comparison to existing mainstream approaches. Specifically, their performance was evaluated in handling translation, rotation, and deformation transformations. The experimental design was tailored to simulate the registration challenges commonly encountered during actual UAV flight operations.
The comparative performance of our method against other mainstream approaches on the LLVIP dataset is detailed in Table 2, Table 3, Table 4 and Table 5, corresponding to Experiments I, II, III, and IV, respectively. These tables report the average values of key evaluation metrics, where a lower MSE indicates higher registration accuracy, while higher values for NCC, LNCC, and MI reflect improved registration performance. Figure 6, Figure 7, Figure 8 and Figure 9 present a comprehensive visual comparison of registration outcomes across the various experimental conditions.
This condition (Exp. I) tests registration under small translation (0–5%) and large rotation. Small translation reflects minor UAV positional adjustments, while large rotation simulates significant attitude changes, such as sharp turns or camera tilts. It evaluates the method’s ability to handle rotational distortions with minimal translational impact, which is crucial for scenarios such as UAV hovering with rotational instability.
Table 2.
Quantitative evaluation of image registration methods across metrics (MSE, NCC, LNCC, and MI) in Exp. I.
Method | MSE ↓ | NCC ↑ | LNCC ↑ | MI ↑ |
---|---|---|---|---|
Initial | 0.0374 | 0.6040 | 0.2302 | 1.0728 |
SIFT | 0.0192 | 0.8334 | 0.8349 | 3.8997 |
ORB | 0.0211 | 0.8149 | 0.6551 | 2.3317 |
NCC | 0.0197 | 0.8291 | 0.7528 | 2.6590 |
MSE | 0.0191 | 0.8347 | 0.8303 | 3.6144 |
VoxelMorph | 0.0085 | 0.8983 | 0.5340 | 3.7800 |
VTN | 0.0067 | 0.9189 | 0.6379 | 3.9611 |
Ours | 0.0069 | 0.9240 | 0.6749 | 3.9874 |
Figure 6.
Visual comparison of image registration results for different methods in Exp. I.
This condition (Exp. II) evaluates registration performance under large translation (5–10%) and small rotation. Large translation simulates significant UAV positional shifts, while small rotation represents minor orientation changes. It tests the method’s ability to handle substantial displacement with minimal rotational impact, reflecting scenarios such as UAV drift or rapid lateral movement.
Figure 7.
Visual comparison of image registration results for different methods in Exp. II.
Table 3.
Quantitative evaluation of image registration methods across metrics (MSE, NCC, LNCC, and MI) in Exp. II.
Method | MSE ↓ | NCC ↑ | LNCC ↑ | MI ↑ |
---|---|---|---|---|
Initial | 0.0376 | 0.6093 | 0.1536 | 0.9655 |
SIFT | 0.0176 | 0.9279 | 0.8421 | 3.6670 |
ORB | 0.0182 | 0.9223 | 0.8188 | 3.0580 |
NCC | 0.0189 | 0.9151 | 0.7447 | 2.5266 |
MSE | 0.0138 | 0.8683 | 0.4961 | 3.8942 |
VoxelMorph | 0.0139 | 0.8486 | 0.3182 | 3.4034 |
VTN | 0.0153 | 0.8330 | 0.2921 | 3.3627 |
Ours | 0.0058 | 0.9375 | 0.6646 | 3.7249 |
This condition (Exp. III) examines registration under large translation (5–10%) and large rotation, simulating extreme UAV movements such as high-speed flight with sharp turns. It assesses the method’s robustness in handling severe misalignments, which is critical for dynamic UAV operations with complex motion.
Table 4.
Quantitative evaluation of image registration methods across metrics (MSE, NCC, LNCC, and MI) in Exp. III.
Method | MSE ↓ | NCC ↑ | LNCC ↑ | MI ↑ |
---|---|---|---|---|
Initial | 0.0369 | 0.5053 | 0.1344 | 0.7111 |
SIFT | 0.0243 | 0.6897 | 0.8425 | 3.6647 |
ORB | 0.0249 | 0.6829 | 0.8379 | 3.3728 |
NCC | 0.0291 | 0.6292 | 0.5286 | 1.7102 |
MSE | 0.0243 | 0.6910 | 0.8407 | 3.5573 |
VoxelMorph | 0.0150 | 0.7627 | 0.2347 | 3.0362 |
VTN | 0.0106 | 0.8322 | 0.2906 | 3.1866 |
Ours | 0.0068 | 0.9249 | 0.7433 | 3.6471 |
Figure 8.
Visual comparison of image registration results for different methods in Exp. III.
This condition (Exp. IV) tests registration under large translation, large rotation, and deformable transformation. Large translation simulates significant UAV positional shifts, while large rotation reflects extreme attitude changes, such as sharp turns or camera reorientations. The deformable transformation, modeled using a Gaussian random field as described above, captures local distortions caused by airflow, sensor noise, and thermal radiation. This experiment evaluates the method’s ability to handle severe misalignments and non-rigid deformations, which is critical for dynamic UAV operations with complex motion and environmental interference.
Table 5.
Quantitative evaluation of image registration methods across metrics (MSE, NCC, LNCC, and MI) in Exp. IV.
Method | MSE ↓ | NCC ↑ | LNCC ↑ | MI ↑ |
---|---|---|---|---|
Initial | 0.0585 | 0.3183 | 0.1699 | 0.9583 |
SIFT | 0.0405 | 0.5993 | 0.5465 | 2.0077 |
ORB | 0.0675 | 0.3746 | 0.3018 | 1.3458 |
NCC | 0.0319 | 0.6784 | 0.6332 | 2.2468 |
MSE | 0.0319 | 0.6769 | 0.5991 | 2.1446 |
VoxelMorph | 0.0068 | 0.9106 | 0.6017 | 1.9960 |
VTN | 0.0059 | 0.9225 | 0.6513 | 2.1691 |
Ours | 0.0047 | 0.9384 | 0.8022 | 2.6189 |
Figure 9.
Visual comparison of image registration results for different methods in Exp. IV.
The MSE metric quantifies the average squared difference in pixel values between two images, demonstrating high sensitivity to noise. As evidenced by the results presented in Table 2, Table 3, Table 4 and Table 5, and Figure 10a, deep-learning-based deformable registration methods consistently outperform feature-based matching and optimization theory-based approaches across various experimental conditions in terms of MSE. This superior performance suggests that deep-learning methods can more effectively extract hierarchical image features and establish more robust pixel-wise correspondences, thereby achieving higher registration accuracy under diverse transformation scenarios. Notably, our proposed method exhibits enhanced robustness in noisy environments and delivers more precise registration results compared to all other evaluated approaches.
The NCC metric evaluates the overall similarity between images by comparing their intensity values, effectively eliminating the influence of brightness and contrast variations. As shown in Table 2, Table 3, Table 4 and Table 5, and Figure 10b, deep-learning-based deformable registration methods achieve comparable NCC performance to other approaches when dealing with affine transformations (including translation and rotation). However, the introduction of deformable transformations reveals significant limitations in feature-based matching and optimization theory-based methods, as they primarily rely on affine transformations and consequently fail to effectively address misalignment caused by deformable changes, resulting in reduced overall image similarity. In contrast, deep-learning-based deformable registration methods demonstrate superior capability in handling deformable transformations, yielding significantly improved registration accuracy. Notably, our proposed method outperforms all comparative approaches, indicating its exceptional capability in global feature matching and precise alignment of detailed textures within images.
The LNCC metric assesses the regional similarity between images. As shown in Table 2, Table 3, Table 4 and Table 5, and Figure 10c, deep-learning-based deformable registration methods generally exhibit inferior LNCC performance compared to other approaches when handling affine transformations (including translation and rotation). This phenomenon can be attributed to the inherent characteristics of feature-based matching and optimization theory-based methods, which primarily rely on affine transformations that maintain global mapping relationships, thereby preventing local misalignment. In contrast, deep-learning methods establish pixel-wise correspondences for registration, which may not strictly guarantee precise matching of local details, consequently resulting in relatively lower LNCC scores. However, the introduction of deformable transformations reveals the superior capability of deep-learning methods in capturing complex deformation features and establishing pixel-to-pixel mapping relationships, enabling better handling of local detail variations, while other methods struggle with such deformable changes. Comprehensive evaluation demonstrates that our proposed method outperforms all comparative approaches, showcasing its remarkable robustness in handling deformable transformations. These results further validate the superiority of our method in local feature matching, enabling the generation of more accurate registration outcomes.
The MI metric quantifies the degree of shared information between images, demonstrating high sensitivity to overall image similarity. As shown in Table 2, Table 3, Table 4 and Table 5, and Figure 10d, deep-learning-based deformable registration methods achieve comparable MI performance to both feature-based and optimization theory-based approaches across various experimental conditions. Nevertheless, our proposed method demonstrates superior overall performance, indicating its enhanced capability in aligning global image structures and consequently achieving more precise image registration.
As illustrated in Figure 6, Figure 7, Figure 8 and Figure 9, a visual analysis reveals that methods based on deep learning, feature matching, and optimization theory achieve varying degrees of success. However, approaches relying on feature point matching and optimization theory depend on affine transformations, which, when applied to large-scale translations and rotations, often produce extensive blank regions in the registration outcomes. This significantly compromises the visualization quality of the results and hinders subsequent image fusion processes. In contrast, deep-learning-based methods leverage deformable registration models to learn pixel-level mapping relationships, effectively mitigating these blank regions and enhancing registration accuracy.
The introduction of deformable transformations exacerbates the limitations of feature point matching and optimization-based methods, as they fail to rectify misalignments induced by such transformations, resulting in localized distortions in the registered images. For instance, the trash can in Figure 9b–e exhibits noticeable edge curvature. Although deep-learning-based approaches, such as VoxelMorph and VTN, demonstrate some capability in managing deformable transformations, their performance falls short of our proposed method, primarily due to the absence of robust strategies for handling large-scale deformations.