Article

Lightweight Unsupervised Homography Estimation for Infrared and Visible Images Based on UAV Perspective Enabling Real-Time Processing in Space–Air–Ground Integrated Network

1 School of Computer and Artificial Intelligence, Civil Aviation Flight University of China, Guanghan 618307, China
2 Information Center, Civil Aviation Flight University of China, Guanghan 618307, China
3 Key Laboratory of Flight Techniques and Flight Safety, CAAC, Guanghan 618307, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3884; https://doi.org/10.3390/rs17233884
Submission received: 29 September 2025 / Revised: 20 November 2025 / Accepted: 28 November 2025 / Published: 29 November 2025

Highlights

What are the main findings?
  • We propose LFHomo, a lightweight method for homography estimation of infrared and visible images in low-altitude scenarios.
  • Experimental results show that the proposed method outperforms existing methods on UAV-perspective infrared and visible datasets, with clear advantages in computational complexity and inference speed.
What is the implication of the main finding?
  • LFHomo balances model accuracy and computational efficiency, demonstrating the potential of deep networks for low-altitude multimodal image registration tasks with excellent scalability.
  • We constructed a novel unregistered UAV-based infrared and visible image dataset, which provides support for research on multimodal UAV remote sensing image registration and fusion.

Abstract

Homography estimation of infrared and visible light images is a key visual technique that enables drones to perceive their environment and perform autonomous localization in low-altitude environments. Its potential lies in integration with edge computing and 5G technologies, enabling real-time control of drones within air–ground integrated networks. However, research on homography estimation techniques for low-altitude dynamic viewpoints remains scarce. Additionally, images in low-altitude scenarios suffer from issues such as blurring and jitter, presenting new challenges for homography estimation tasks. To address these issues, this paper proposes a lightweight homography estimation method, LFHomo, comprising two components: two anti-blurring feature extractors with non-shared parameters and a lightweight homography estimator, LFHomoE. The anti-blurring feature extractors introduce inverted residual layers and feature displacement modules to capture sufficient contextual information in blurred regions and to enable lossless and rapid propagation of feature information. In addition, a spatial-reduction-based channel shuffle and spatial joint attention module is designed to suppress redundant features introduced by lossless transmission, allowing efficient extraction and refinement of informative features at low computational cost. The homography estimator LFHomoE adopts a CNN–GNN hybrid architecture to efficiently model geometric relationships between cross-modal features and to achieve fast prediction of homography matrices. Meanwhile, we construct and annotate an unregistered infrared and visible image dataset from drone perspectives for model training and evaluation. Experimental results show that LFHomo maintains high registration accuracy while significantly reducing model size and inference time.

1. Introduction

In recent years, image perception systems have become increasingly important for unmanned aerial vehicles (UAVs) in intelligent flight, emergency response, and autonomous cruising tasks [1,2]. Concurrently, the rapid development of 5G communication and edge computing technologies has enabled the deployment of real-time, efficient vision models directly on UAV platforms [3,4]. However, abrupt illumination changes and adverse weather conditions, such as rain and fog, significantly degrade the reliability of single-modal visual perception in complex environments. To address this issue, fusing multimodal information, such as infrared and visible spectra, and exploiting their complementary imaging properties has been shown to be an effective solution [5], achieving notable success in tasks including object tracking [6,7], semantic segmentation [8], anomaly detection [9,10], and depth estimation [11,12].
Due to differences in viewpoint and platform motion, images captured by different sensors often exhibit substantial geometric discrepancies. Furthermore, typical low-altitude scenes are affected by motion blur and low image resolution, thus rendering multimodal images difficult to use directly for information fusion. In this context, homography estimation between infrared and visible images serves as a fundamental visual task that models the geometric transformation between two images [13], thereby providing reliable preprocessing for higher-level tasks such as image registration [14,15], object detection [16], and localization or recognition [17]. Nonetheless, on UAV platforms, image quality degradation and constrained onboard computing resources impose stringent requirements on both the robustness and real-time performance of homography estimation methods.
Existing homography estimation approaches can be broadly divided into two categories: traditional methods and deep learning-based methods [5]. Traditional methods depend on feature descriptor operators [18,19] and utilize algorithms such as RANSAC [20] and MAGSAC++ [21] to remove outliers during matching, followed by solving the homography matrix using Direct Linear Transformation (DLT). These methods demonstrate efficacy in high-quality, single-source image scenarios, but tend to fail under multimodal conditions, low resolution, or strong noise. In contrast, deep learning-based methods regress the parameters required for homography estimation via a trainable end-to-end network [22,23,24,25,26], and then obtain the homography matrix using DLT. Such methods demonstrate greater robustness in multi-source image homography tasks. However, this advantage frequently comes at the cost of larger model sizes and heavier computational burdens, and their performance remains sensitive to factors such as noise and motion blur. Consequently, it remains difficult for these models to simultaneously satisfy both accuracy and real-time constraints on resource-limited low-altitude UAV platforms.
Recently, there has been increased focus on high-accuracy yet lightweight network designs, which hold considerable potential for scenarios with stringent real-time requirements. The MobileNet family [27,28,29] significantly reduces parameter counts and computational costs through depthwise separable convolutions and inverted residual structures. GhostNet [30] further demonstrates that equally rich feature representations can be obtained with fewer operations. Researchers have also begun to explore hybrid architectures that combine CNNs with Transformers or Graph Neural Networks (GNNs) to enhance feature representation while preserving compactness, such as MobileViT [31], EfficientFormer [32], and MobileViG [33]. These works imply that appropriate architectural design can facilitate achieving a favorable balance between accuracy and efficiency.
Inspired by the above observations, this paper proposes a lightweight homography estimation method, LFHomo, for infrared–visible image homography estimation from a UAV perspective in low-altitude application scenarios. The model consists of two anti-blurring feature extractors with non-shared parameters and a lightweight homography estimator, LFHomoE. In the feature extractors, a Shift Module (SM) [34] is embedded into the inverted residual structure [28] to capture local contextual information and enable nearly lossless feature transmission. Furthermore, a Spatial-Reduction Channel-Sequential Shuffle Attention (SRCSSA) mechanism is designed to enhance the perception and integration of contextual information in blurred regions while suppressing redundant features at a low computational cost. LFHomoE draws on the lightweight design of MobileViG [33] and the graph attention mechanism of Vision GNN (ViG) [35] to build a CNN–GNN hybrid architecture, achieving a balance between performance and efficiency in homography matrix estimation. To alleviate the scarcity of unregistered UAV-based infrared and visible image datasets, we also construct a new dataset from a UAV perspective to support model training and validation.
Accordingly, this work formulates the following testable hypothesis: compared with representative multimodal homography estimation methods, LFHomo can reduce the average corner error (ACE) on the proposed UAV-based infrared and visible images dataset by approximately 5%, while maintaining lower model complexity and shorter inference time.
In summary, the main contributions of this paper are as follows:
  • We introduce an Inverted Residual Shift Convolution (IRSC) block that embeds a shift module into an inverted residual structure to capture local contextual features in blurred low-altitude UAV images.
  • We design a Spatial-Reduction Channel-Sequential Shuffle Attention (SRCSSA) module that suppresses redundant and enhances informative features through spatial reduction, channel grouping and shuffling, and attention-based fusion.
  • We develop a lightweight CNN–GNN hybrid homography estimator, LFHomoE, which achieves a favorable trade-off between accuracy and efficiency and delivers fast inference on both synthetic benchmark datasets and unregistered infrared–visible UAV image pairs.
The structure of this paper is as follows: Section 2 briefly introduces existing related work. Section 3 details the proposed method. Section 4 describes the dataset construction process and presents experimental results on infrared and visible image datasets from the perspective of a drone. In Section 5, we discuss the limitations of our method and provide prospects for future work. Finally, we conclude the paper in Section 6.

2. Related Work

2.1. Homography Estimation for Multimodal Images

Homography estimation methods can be roughly divided into traditional approaches and deep learning–based approaches [13]. Traditional methods typically decompose homography estimation into three steps: feature descriptor extraction, descriptor matching, and homography matrix estimation. In practice, feature descriptors are first extracted using algorithms such as Scale Invariant Feature Transform (SIFT) [18], Speeded Up Robust Features (SURF) [36], Oriented FAST and Rotated BRIEF (ORB) [19], and BEBLID [37]. Then, matching algorithms are applied to pair descriptors, and incorrect correspondences are removed by robust estimators such as RANSAC [20] or MAGSAC++ [21]. Finally, the homography matrix is solved using the Direct Linear Transformation (DLT) algorithm. Some works also adopt trainable deep models to replace the local descriptor or matching steps in this pipeline, such as LIFT [38], SuperPoint [39], SuperGlue [40], and LoFTR [41]. Nevertheless, these methods generally require high-quality images and are easily affected by modality discrepancies, motion blur, noise, and low resolution.
In contrast, deep learning-based homography estimation methods convert the multi-step procedure of traditional approaches into a unified, trainable end-to-end framework [5] and have gradually become the mainstream direction for multi-source image homography estimation. We summarize representative multimodal homography methods and their characteristics in Table 1. For example, Luo et al. [23] employ a Generative Adversarial Network (GAN) to estimate homographies. This approach, however, is difficult to generalize across varying illumination conditions. Transformer-based methods [24,25] also exhibit sensitivity to noise and blur, and their computational complexity remains high. Liao et al. [26] propose homoViG, which introduces a graph attention mechanism to alleviate the impact of modality differences. Nonetheless, its generalization ability in low-altitude UAV scenarios with motion blur, adverse weather, or low resolution has not been systematically validated.
Overall, these methods offer valuable insights into multimodal homography estimation. Yet they often require substantial computational resources and show limited robustness in complex real-world scenarios. In low-altitude UAV applications, lightweight and robust homography estimation methods for infrared and visible images are still lacking.

2.2. Lightweight Hybrid Backbones

With the increasing use of Transformers and GNNs across various vision tasks, the development of lightweight hybrid architectures that integrate CNNs with diverse network paradigms has emerged as a significant research direction. Mehta et al. [31] proposed MobileViT, a lightweight hybrid architecture combining CNNs and Transformers. It replaces part of the convolutional blocks in MobileNetV2 [28] with transformer blocks, enhancing global modeling capability while maintaining low computational cost. Yan et al. [32] introduced EfficientFormer, which achieves high inference speed on mobile devices through dimension-consistent and stage-wise design. Huang et al. [33] further proposed MobileViG, based on ViG [35], by constructing a CNN–GNN fusion framework and designing a Sparse Vision Graph Attention (SVGA) mechanism for message passing on local vision graphs, achieving a favorable efficiency–accuracy trade-off on image classification and object detection benchmarks.
Building on these developments, LFHomoE integrates the design principles of ViG [35] and MobileViG [33]. On one hand, ViG stacks a large number of ViG blocks [35] in its backbone, and the frequent graph convolution and adjacency reconstruction introduce considerable computational overhead. In contrast, LFHomoE inserts only a few ViG blocks at the end of each stage. The remaining blocks are replaced by inverted residual structures, which significantly reduce the computational burden and bring the parameter size and FLOPs closer to those of a lightweight CNN backbone. On the other hand, MobileViG [33] introduces SVGA only at the lowest-resolution stage to control the cost. Nonetheless, it constructs the same static graph structure [42] for all images, making it difficult to adapt to geometric variations across different scenes and contents. LFHomoE replaces SVGA with ViG blocks and distributes them evenly across all stages, thereby strengthening multi-scale modeling of geometric relationships between multimodal features. As a result, the model can more flexibly describe local correspondences between the two modalities in different spatial regions and improve the accuracy of homography matrix estimation.

3. Method

3.1. Network Structure

LFHomo consists of two anti-blurring fast feature extractors with non-shared parameters and a lightweight homography estimator, LFHomoE. The overall network architecture is illustrated in Figure 1.
Given two input images, an infrared image $I_{inf}$ and a visible image $I_{rgb}$, both of size $H \times W \times 1$, the anti-blurring fast feature extractors are first employed to obtain two feature maps, $F_{inf}$ and $F_{rgb}$, of the same size $H \times W \times 1$. These feature maps are then concatenated along the channel dimension to form the fused feature map $F_c$, which is fed into LFHomoE to rapidly predict the homography matrix. LFHomoE is composed of an embedding layer, four sequentially stacked stages, and two fully connected layers. Each stage comprises several inverted residual blocks [28] and one ViG block [35], collectively referred to as an LFHomoE block. The embedding layer divides the fused feature map $F_c$ into patches and incorporates positional encoding. The patches are then processed through the four stages in sequence to establish the mapping between cross-modal features. The output of the final stage is passed through two fully connected layers and projected into an 8-dimensional offset vector. Using the direct linear transformation algorithm, the offset vector is then converted into a homography matrix $H_{inf\_r}$. During the training and testing phases, the order of the input image pair is also reversed to generate another homography matrix $H_{rgb\_i}$; these two homography matrices are mutually inverse.
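To make this data flow concrete, a minimal PyTorch-style sketch of the pipeline is given below. The two extractor modules and the estimator are passed in as placeholders for the components detailed in Sections 3.2 and 3.3, so the snippet should be read as an illustrative outline rather than the released implementation.

```python
import torch
import torch.nn as nn

class LFHomo(nn.Module):
    """Illustrative outline of the LFHomo pipeline (module arguments are placeholders)."""
    def __init__(self, extractor_inf: nn.Module, extractor_rgb: nn.Module, estimator: nn.Module):
        super().__init__()
        # Two anti-blurring feature extractors with non-shared parameters (Section 3.2).
        self.extractor_inf = extractor_inf
        self.extractor_rgb = extractor_rgb
        # Lightweight CNN-GNN homography estimator LFHomoE (Section 3.3).
        self.estimator = estimator

    def forward(self, img_inf: torch.Tensor, img_rgb: torch.Tensor):
        # img_inf, img_rgb: (B, 1, H, W) single-channel inputs.
        f_inf = self.extractor_inf(img_inf)      # (B, 1, H, W) feature map F_inf
        f_rgb = self.extractor_rgb(img_rgb)      # (B, 1, H, W) feature map F_rgb
        f_c = torch.cat([f_inf, f_rgb], dim=1)   # fused feature map F_c, (B, 2, H, W)
        offsets = self.estimator(f_c)            # (B, 8) corner offset vector
        return offsets, f_inf, f_rgb             # offsets are converted to H via DLT
```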

3.2. Anti-Blurring Feature Extractor

The anti-blurring feature extractor comprises four Inverted Residual Shift Convolution (IRSC) blocks and a Spatial-Reduction Channel-Sequential Shuffle Attention (SRCSSA) module. Notably, the SRCSSA block contains a downsampling layer, a Channel-Sequential Shuffle Attention (CSSA) block [43], a Global Spatial Attention (GSA) block, and an upsampling layer. The layer structure and the output information of the feature extractor are displayed in Table 2.

3.2.1. Inverted Residual Shift Convolution Block

The IRSC block contains an inverted residual convolutional block [28] and a shift module [34]. Its structure is shown in Figure 2a. Since the number of input and output channels per layer varies within the feature extractor, we did not incorporate residual connections within the IRSC. For each IRSC, taking a feature map F as an example, an inverted residual layer is first utilized to extract feature information. Specifically, the input feature map undergoes a 1 × 1 pointwise convolution for dimensionality expansion. Subsequently, a 3 × 3 depthwise separable convolution [27] is applied in the high-dimensional space to efficiently capture local spatial features. Finally, another 1 × 1 pointwise convolution maps the number of channels back to the input dimension.
However, in drone imagery, blurred regions, particularly at edges, are still constrained by the limited receptive field of the depthwise convolution in the inverted residual block. This constraint easily leads to the loss of local contextual information in these regions, although such information often contains the true edge structure. To address this limitation, a shift module [34] is incorporated into the inverted residual block, enabling efficient transfer and supplementation of feature information.
The processing pipeline of the shift module is illustrated in Figure 3. Let the input feature map of the shift module be $F_{shift}^{in} \in \mathbb{R}^{H \times W \times C}$. First, it is evenly split along the channel dimension into four sub-tensors of equal size: $F_{shift}^{in} = \{F_1, F_2, F_3, F_4\}$, $F_n \in \mathbb{R}^{H \times W \times \frac{C}{4}}$. Each sub-tensor is then circularly shifted by one pixel in the positive or negative direction along the x-axis or y-axis. We define the circular shift function in Equation (1):
$F_n' = CS_{(\Delta x, \Delta y)}(F_n)$ (1)
where $CS_{(\Delta x, \Delta y)}(\cdot)$ denotes the circular shift operation and $(\Delta x, \Delta y)$ specifies the shift along the x-axis and y-axis, respectively. After the four sub-tensors are shifted, they are concatenated in the original order to restore the feature map shape. The overall computation of the shift module can thus be written as Equation (2):
$F_{shift}^{out} = \mathrm{Shift}(F_{shift}^{in}) = \mathrm{Concat}\big(CS_{(+1,0)}(F_1),\ CS_{(-1,0)}(F_2),\ CS_{(0,+1)}(F_3),\ CS_{(0,-1)}(F_4)\big)$ (2)
where $\mathrm{Shift}(\cdot)$ denotes the shift module, $\mathrm{Concat}(\cdot)$ is the concatenation operation, and $F_{shift}^{out}$ is the output feature map of the shift module. Based on the above, the computation of the IRSC block can be expressed as Equation (3):
$F_{IRSC}^{out} = \mathrm{Shift}\big(PWConv_{1\times 1}(DWConv_{3\times 3}(PWConv_{1\times 1}(F_{IRSC}^{in})))\big)$ (3)
where $F_{IRSC}^{out}$ and $F_{IRSC}^{in}$ denote the output and input feature maps of each IRSC block, respectively, $PWConv_{1\times 1}(\cdot)$ is a 1 × 1 pointwise convolution, and $DWConv_{3\times 3}(\cdot)$ denotes a 3 × 3 depthwise separable convolution. In this way, the IRSC block can transfer feature information to the subsequent modules with minimal loss, while incorporating the shifted contextual information produced by the shift module.
The combination of the inverted residual block [28] and the shift module [34] exhibits a clear complementarity. The inverted residual block extracts local edge features through per-channel depthwise convolutions, while the shift module explicitly propagates these features along the spatial dimensions. This integration facilitates the perception of motion-blurred boundaries. In other words, IRSC strengthens the complementarity and fusion among key features in blurred areas without introducing a significant increase in parameters or computational cost.
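For illustration, a minimal PyTorch sketch of the shift module (Equations (1) and (2)) and the IRSC block (Equation (3)) is given below. The expansion factor of 4 follows the setting stated in Section 3.3, whereas the normalization and activation layers are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

def shift_module(x: torch.Tensor) -> torch.Tensor:
    """Circular shift of Eq. (2): split channels into four groups and roll each by one pixel."""
    f1, f2, f3, f4 = torch.chunk(x, 4, dim=1)   # each roughly (B, C/4, H, W)
    f1 = torch.roll(f1, shifts=1, dims=3)       # +1 along the x-axis (width)
    f2 = torch.roll(f2, shifts=-1, dims=3)      # -1 along the x-axis
    f3 = torch.roll(f3, shifts=1, dims=2)       # +1 along the y-axis (height)
    f4 = torch.roll(f4, shifts=-1, dims=2)      # -1 along the y-axis
    return torch.cat([f1, f2, f3, f4], dim=1)   # concatenate back in the original order

class IRSC(nn.Module):
    """Inverted Residual Shift Convolution block (Eq. (3)). No residual connection is used,
    since the input and output channel counts differ between layers of the extractor."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 4):
        super().__init__()
        hidden = in_ch * expand
        self.pw1 = nn.Sequential(nn.Conv2d(in_ch, hidden, 1, bias=False),
                                 nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True))
        self.dw = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
                                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True))
        self.pw2 = nn.Sequential(nn.Conv2d(hidden, out_ch, 1, bias=False),
                                 nn.BatchNorm2d(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pointwise expansion -> depthwise 3x3 -> pointwise projection -> shift (Eq. (3)).
        return shift_module(self.pw2(self.dw(self.pw1(x))))
```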

3.2.2. Spatial Reduction-Based Channel Sequence Shuffle Attention

The Spatial-Reduction Channel-Sequential Shuffle Attention (SRCSSA) module suppresses interfering features in blurred regions at a low computational cost while better capturing informative structures; its architecture is illustrated in Figure 4. To reduce the computational complexity of the attention mechanism and improve inference efficiency, the spatial size of the input feature map is first reduced from $H \times W$ to $RH \times RW$ by a downsampling operation, where $R$ denotes the spatial reduction ratio. The downsampled feature map is then divided into $G$ groups along the channel dimension, each group containing $\frac{C}{G}$ channels. Spatial average pooling is applied to each group to produce a descriptor of size $1 \times 1 \times \frac{C}{G}$. In our experiments, we set the number of groups to $G = 4$ and the spatial reduction ratio to $R = \frac{1}{2}$. This process can be expressed as Equation (4):
$F' = \mathrm{AvgPool}\big(G(\mathrm{Down}(F_{SRCSSA}^{in}))\big)$ (4)
where $F_{SRCSSA}^{in}$ denotes the input feature map of SRCSSA and $F'$ represents the intermediate feature vector after grouping and pooling. $\mathrm{Down}(\cdot)$ denotes the downsampling operation, $G(\cdot)$ is the grouping operation, and $\mathrm{AvgPool}(\cdot)$ denotes average pooling. Then, a channel-sequence shuffling operation is used to mix channel features across different groups, as shown in Equation (5):
$\bar{F} = f_{shuff}(F') = \big[X_1^1, X_2^1, \ldots, X_G^1, X_1^2, \ldots, X_G^2, X_1^3, \ldots, X_G^3, \ldots, X_1^{C/G}, \ldots, X_G^{C/G}\big]$ (5)
where $f_{shuff}$ denotes the sequence shuffling operation and $X_j^i$ denotes the pooled feature of the i-th channel within the j-th group, with $j = 1, 2, \ldots, G$ and $i = 1, 2, \ldots, \frac{C}{G}$.
After shuffling, a group convolution with group size $G = 4$ is applied to obtain channel-wise attention weights. A second shuffling operation is then used to restore the original channel order [43]. This procedure corresponds to the CSSA part in Figure 4 and can be written as Equation (6):
$\bar{F}_{CSSA} = f_{shuff}^{T}\big(\mathrm{Concat}(\mathrm{GConv}(f_{shuff}(F')))\big)$ (6)
where $\mathrm{GConv}(\cdot)$ denotes group convolution and $\bar{F}_{CSSA}$ is the channel attention weight map. Finally, the attention weights are applied to the downsampled feature map to obtain the channel-weighted feature map $\bar{F}_{CSSA}^{out}$, as in Equation (7):
$\bar{F}_{CSSA}^{out} = \bar{F}_{CSSA} \odot \mathrm{Down}(F_{SRCSSA}^{in})$ (7)
where $\odot$ denotes element-wise multiplication. In addition, we use a Global Spatial Attention (GSA) branch to model long-range dependencies within the feature map, thereby enlarging the receptive field and enhancing global representation capability. The GSA is computed as Equation (8):
$\bar{F}_{GSA} = f_{GSA}(\bar{F}_{CSSA}^{out}) = \sigma\big(\mathrm{Conv}(\mathrm{Concat}(\mathrm{MaxPool}(\bar{F}_{CSSA}^{out}),\ \mathrm{AvgPool}(\bar{F}_{CSSA}^{out})))\big)$ (8)
where $\mathrm{MaxPool}(\cdot)$ and $\mathrm{AvgPool}(\cdot)$ denote max pooling and average pooling, respectively, $\mathrm{Conv}(\cdot)$ is a convolution layer, and $\sigma(\cdot)$ is the sigmoid function. The final SRCSSA output is obtained by weighting the channel-attended feature map with the spatial attention map, as in Equation (9):
$F_{SRCSSA}^{out} = \bar{F}_{GSA} \odot \bar{F}_{CSSA}^{out}$ (9)
where $F_{SRCSSA}^{out}$ denotes the weighted feature map produced by SRCSSA. Finally, $F_{SRCSSA}^{out}$ is upsampled back to the original resolution to match the input size required by the subsequent loss computation.
It is worth noting that, thanks to the introduction of the shift module [34], the feature maps learn richer local structural information across different channel groups. However, this cross-channel propagation may also introduce redundancy, such as background responses and image noise. SRCSSA enhances inter-channel interaction through channel shuffling combined with attention, thereby emphasizing complementary features from different directions while suppressing non-critical features.
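The following sketch illustrates one plausible PyTorch realization of SRCSSA under the settings G = 4 and R = 1/2 (Equations (4)-(9)). The sigmoid activations, the 7 × 7 kernel of the GSA convolution, and the CBAM-style channel-wise max/average pooling are assumptions made for completeness rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCSSA(nn.Module):
    """Sketch of Spatial-Reduction Channel-Sequential Shuffle Attention (Eqs. (4)-(9))."""
    def __init__(self, channels: int, groups: int = 4, reduction: float = 0.5):
        super().__init__()
        self.groups = groups
        self.reduction = reduction
        # Grouped 1x1 convolution producing per-channel attention weights (CSSA branch).
        self.gconv = nn.Conv2d(channels, channels, kernel_size=1, groups=groups)
        # Convolution over the concatenated [max; avg] maps for the GSA branch (assumed 7x7).
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    @staticmethod
    def _shuffle(x: torch.Tensor, g: int) -> torch.Tensor:
        # Standard channel shuffle with g groups; shuffling with C/g groups inverts it.
        b, c, h, w = x.shape
        return x.view(b, g, c // g, h, w).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        # Eq. (4): spatial reduction, then a pooled channel descriptor.
        x_down = F.interpolate(x, scale_factor=self.reduction, mode="bilinear", align_corners=False)
        desc = F.adaptive_avg_pool2d(x_down, 1)                       # (B, C, 1, 1)
        # Eqs. (5)-(6): shuffle -> grouped conv -> inverse shuffle, then sigmoid weights.
        mixed = self._shuffle(desc, self.groups)
        weights = self._shuffle(self.gconv(mixed), desc.size(1) // self.groups)
        weights = torch.sigmoid(weights)
        # Eq. (7): channel-weighted, spatially reduced feature map.
        x_cssa = x_down * weights
        # Eq. (8): global spatial attention from channel-wise max and average maps.
        pooled = torch.cat([x_cssa.max(dim=1, keepdim=True).values,
                            x_cssa.mean(dim=1, keepdim=True)], dim=1)
        spatial = torch.sigmoid(self.spatial_conv(pooled))
        # Eq. (9), followed by upsampling back to the input resolution.
        return F.interpolate(x_cssa * spatial, size=(h, w), mode="bilinear", align_corners=False)
```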

3.3. Lightweight and Fast Homography Estimator

In this section, we develop a lightweight homography estimator, LFHomoE, which combines CNN and GNN structures to balance model accuracy, inference speed, and computational cost. LFHomoE takes the feature map $F_c$ as input and consists of an embedding layer, four sequential LFHomoE blocks, and two fully connected layers. Each LFHomoE block is treated as one stage. The embedding layer first performs patch embedding and positional encoding on $F_c$. Then, the four LFHomoE blocks are applied in order for feature modeling and matching. The configuration of each LFHomoE layer is detailed in Table 3.
The main difference between the inverted residual block in LFHomoE and the IRSC block lies in the presence of residual connections and the absence of the shift module. The structure of the inverted residual block in LFHomoE is shown in Figure 2b. Compared with standard convolutions, inverted residual blocks achieve deep nonlinear feature transformations with fewer parameters and lower computational cost, while providing skip connections in low-dimensional space to alleviate the vanishing-gradient problem. In both IRSC and LFHomoE, the channel expansion factor of the inverted residual block is set to 4. The shift module is not introduced into LFHomoE because the anti-blurring fast feature extractors already provide sufficiently accurate local contextual features. In addition, LFHomoE contains a relatively large number of inverted residual blocks. Adding shift modules to these blocks would introduce extra computation and degrade the overall inference speed.
In LFHomoE, each ViG block [35] consists of a maximum relative graph convolution [44] and a feed-forward network (FFN) formed by two fully connected layers. Initially, the input feature map is projected by a linear layer into a unified graph feature space, where each spatial position is regarded as a graph node. The maximum relative graph convolution then aggregates information from neighboring nodes to model their dependencies, providing cues for subsequent feature matching. The FFN applies nonlinear transformations to enhance the representation capacity and mitigate potential over-smoothing. To maintain topological structure consistency between the input and output, the ViG block reshapes features into sequences for graph operations and then restores them to their original layout.
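A compact sketch of such a ViG block is shown below, using k-nearest-neighbour graph construction in feature space and max-relative aggregation on the reshaped node sequence. The neighbourhood size k, the absence of normalization layers, and other implementation details are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class MRGraphConv(nn.Module):
    """Max-relative graph convolution over k-nearest neighbours built in feature space."""
    def __init__(self, dim: int, k: int = 9):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(dim, dim)      # project nodes into a shared graph feature space
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C), where N = H * W graph nodes (the reshaped feature map).
        x = self.proj(x)
        b, n, c = x.shape
        idx = torch.cdist(x, x).topk(self.k, largest=False).indices       # (B, N, k) neighbours
        neigh = torch.gather(x.unsqueeze(1).expand(b, n, n, c), 2,
                             idx.unsqueeze(-1).expand(b, n, self.k, c))   # (B, N, k, C)
        rel = (neigh - x.unsqueeze(2)).max(dim=2).values                  # max-relative feature
        return self.fc(torch.cat([x, rel], dim=-1))                       # (B, N, C)

class ViGBlock(nn.Module):
    """Graph convolution followed by a two-layer FFN, both with residual connections."""
    def __init__(self, dim: int, k: int = 9, ffn_ratio: int = 4):
        super().__init__()
        self.graph = MRGraphConv(dim, k)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * ffn_ratio), nn.GELU(),
                                 nn.Linear(dim * ffn_ratio, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.graph(x)     # aggregate neighbour information (dependency modelling)
        return x + self.ffn(x)    # nonlinear transform to mitigate over-smoothing
```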
As summarized in Table 3, LFHomoE adopts a top–down pyramid-style design. This differs from the RFViG backbone used in homoViG [26], where a large number of ViG blocks with Attention Feature Fusion (AFF) modules [45] are heavily stacked in each stage. In LFHomoE, most of these graph-based components are replaced with inverted residual convolution blocks, which preserve the ability to encode cross-modal geometric constraints while substantially reducing the overhead of repeated graph convolutions and reshape operations. The pyramid design also leads to a more balanced allocation of computation across stages. In the first two stages, the network processes feature maps with relatively high spatial resolution but fewer channels, focusing on extracting fine-grained geometric cues at larger spatial scales. In the last two stages, the spatial resolution is progressively reduced while the number of channels increases, enabling high-dimensional semantic and structural relationships to be modeled on more compact feature maps. This design effectively controls the growth of parameters and FLOPs. Quantitative results in Section 4.4 further confirm that the proposed method achieves lower parameter counts, peak memory usage, and inference time than existing counterparts, while maintaining competitive homography estimation performance.
Finally, the output of the last stage is fed into two fully connected layers to obtain an 8-dimensional coordinate offset vector, which the DLT algorithm then converts into the homography matrix of the corresponding image pair.
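For reference, the conversion from the predicted 8-dimensional offset vector to a 3 × 3 homography can be sketched as a batched DLT solve over the four corner correspondences; the corner ordering and the assumption that offsets are expressed in pixels are illustrative choices rather than details taken from the paper.

```python
import torch

def offsets_to_homography(offsets: torch.Tensor, corners: torch.Tensor) -> torch.Tensor:
    """DLT: recover the 3x3 homography from four corner correspondences.

    offsets: (B, 8) predicted corner displacements (dx1, dy1, ..., dx4, dy4), in pixels.
    corners: (B, 4, 2) source corner coordinates of the patch.
    """
    b = offsets.size(0)
    src = corners
    dst = corners + offsets.view(b, 4, 2)
    A = torch.zeros(b, 8, 8, dtype=offsets.dtype, device=offsets.device)
    rhs = torch.zeros(b, 8, dtype=offsets.dtype, device=offsets.device)
    for i in range(4):
        x, y = src[:, i, 0], src[:, i, 1]
        u, v = dst[:, i, 0], dst[:, i, 1]
        A[:, 2 * i, 0], A[:, 2 * i, 1], A[:, 2 * i, 2] = x, y, 1.0
        A[:, 2 * i, 6], A[:, 2 * i, 7] = -u * x, -u * y
        A[:, 2 * i + 1, 3], A[:, 2 * i + 1, 4], A[:, 2 * i + 1, 5] = x, y, 1.0
        A[:, 2 * i + 1, 6], A[:, 2 * i + 1, 7] = -v * x, -v * y
        rhs[:, 2 * i], rhs[:, 2 * i + 1] = u, v
    h = torch.linalg.solve(A, rhs)                         # eight homography parameters
    ones = torch.ones(b, 1, dtype=offsets.dtype, device=offsets.device)
    return torch.cat([h, ones], dim=1).view(b, 3, 3)       # h33 fixed to 1
```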

3.4. Loss Function

The loss function comprises three parts: Detail Feature Loss (DFL) [22], Feature Identity Loss (FIL) [46], and homography loss. The total loss $L_{total}$ is shown in Equation (10):
$L_{total} = L_{DFL}(I_{inf}, I_{rgb}) + L_{DFL}(I_{rgb}, I_{inf}) + \lambda\big(L_{FIL}(I_{inf}, H_{inf\_r}) + L_{FIL}(I_{rgb}, H_{rgb\_i})\big) + \mu L_h$ (10)
where $L_{DFL}$ represents the detail feature loss between the warped image and the target image, and $L_{FIL}$ is the feature identity loss. $H_{inf\_r}$ denotes the homography matrix predicted by the model that maps image $I_{inf}$ to $I_{rgb}$, while $H_{rgb\_i}$ is the homography matrix predicted after swapping the order of the input image pair $I_{inf}$ and $I_{rgb}$. $L_h$ denotes the homography loss, which imposes a reversibility constraint. The hyperparameters $\lambda$ and $\mu$ are set to 1.0 and 0.01, respectively. We compare the effects of the hyperparameters $\lambda$ and $\mu$ under different settings in Section 4.6.3.

3.4.1. Detail Feature Loss

The Detail Feature Loss (DFL), proposed by Luo et al. [22], computes a triplet loss on the feature map $F_{inf}$ and the warped feature map $F_{inf}'$ obtained via the homography matrix. The original triplet loss [47], computed on the input image $I_{inf}$ and the warped image, may introduce irrelevant background information and noise during computation, thereby affecting network optimization. Performing the calculation on feature maps enables the network to more effectively identify features that are significant for homography estimation, minimizing the impact of extraneous noise present in the original image. The formulation is given in Equation (11):
$L_{DFL}(F_{inf}', F_{inf}, F_{rgb}) = \|F_{inf}' - F_{rgb}\|_1 - \|F_{inf} - F_{rgb}\|_1$ (11)
where $F_{inf}' = f_{STN}(F_{inf})$ is the result of warping the feature map $F_{inf}$ with the homography matrix $H_{inf\_r}$, and $f_{STN}(\cdot)$ represents the warping operation implemented with a Spatial Transformer Network (STN) [48]. During training, as the input order of the images is also reversed, the total loss function includes another detail feature loss $L_{DFL}(F_{rgb}', F_{rgb}, F_{inf})$ between $I_{rgb}$ and $I_{inf}$.

3.4.2. Feature Identity Loss

However, during training, DFL does not necessarily minimize $\|F_{inf}' - F_{rgb}\|_1$ while maximizing $\|F_{inf} - F_{rgb}\|_1$, which hampers network optimization. The Feature Identity Loss (FIL) is introduced to address this limitation [46]. FIL allows the network to disregard the order between image warping and feature extraction during optimization: whether features are extracted before warping, or the original image is warped first and features are extracted afterwards, the resulting feature maps should be substantially similar. This facilitates more effective minimization of the DFL [24]. The computation of FIL is given in Equation (12):
$L_{FIL}(I_{inf}, H_{inf\_r}) = \|f_{STN}(f_{IR}(I_{inf}), H_{inf\_r}) - f_{IR}(f_{STN}(I_{inf}, H_{inf\_r}))\|_1$ (12)
where $f_{IR}(\cdot)$ denotes the feature extractor of the infrared branch and $f_{STN}(\cdot, H)$ denotes the warping operation defined above. For the same reason, we also compute another feature identity loss $L_{FIL}(I_{rgb}, H_{rgb\_i})$. Finally, the homography loss adds a reversibility constraint on $H_{inf\_r}$ and $H_{rgb\_i}$ to ensure that they are mutually inverse, as demonstrated in Equation (13):
$L_h = \|H_{inf\_r} H_{rgb\_i} - E\|_2^2$ (13)
where $E$ denotes the $3 \times 3$ identity matrix.
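A minimal sketch of the full objective in Equations (10)-(13) is given below; the mean-reduced L1 distances and the function signatures (feature extractor and warping callables passed as arguments) are illustrative assumptions rather than the exact implementation.

```python
import torch

def dfl(f_src_warped, f_src, f_tgt):
    """Detail Feature Loss (Eq. (11)): pull warped source features towards the target
    while pushing the unwarped source features away (L1 distances, mean-reduced)."""
    return (f_src_warped - f_tgt).abs().mean() - (f_src - f_tgt).abs().mean()

def fil(img, H, extract, warp):
    """Feature Identity Loss (Eq. (12)): extracting features and then warping should
    give the same result as warping the image first and then extracting features."""
    return (warp(extract(img), H) - extract(warp(img, H))).abs().mean()

def homo_loss(H_inf_r, H_rgb_i):
    """Homography loss (Eq. (13)): the two predicted matrices should be mutual inverses."""
    eye = torch.eye(3, device=H_inf_r.device, dtype=H_inf_r.dtype)
    return ((H_inf_r @ H_rgb_i - eye) ** 2).sum(dim=(-2, -1)).mean()

def total_loss(feats, imgs, Hs, extract_inf, extract_rgb, warp, lam=1.0, mu=0.01):
    """Eq. (10). feats = (F_inf, F_rgb, F_inf_warped, F_rgb_warped),
    imgs = (I_inf, I_rgb), Hs = (H_inf_r, H_rgb_i)."""
    f_inf, f_rgb, f_inf_w, f_rgb_w = feats
    i_inf, i_rgb = imgs
    H_inf_r, H_rgb_i = Hs
    loss = dfl(f_inf_w, f_inf, f_rgb) + dfl(f_rgb_w, f_rgb, f_inf)
    loss = loss + lam * (fil(i_inf, H_inf_r, extract_inf, warp)
                         + fil(i_rgb, H_rgb_i, extract_rgb, warp))
    return loss + mu * homo_loss(H_inf_r, H_rgb_i)
```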

4. Experimental Results

In this section, we first introduce the datasets used in our experiments and the implementation details. To evaluate the effectiveness of the proposed method in low-altitude application scenarios, we construct and relabel two infrared and visible datasets captured from a UAV perspective. The evaluation metrics employed in the experiments are then briefly described. Subsequently, we conduct qualitative and quantitative experiments on these two datasets and compare the proposed method with existing unsupervised end-to-end approaches for infrared and visible image homography estimation. It is worth noting that, although methods such as SuperGlue [40] and LoFTR [41] perform well in the feature-matching stage of traditional methods, they rely on local feature priors and are essentially used as matching modules for subsequent homography estimation. As a result, it is difficult to compare these methods with our proposed framework under strictly identical conditions. Finally, ablation studies are carried out to verify the effectiveness of each proposed component.

4.1. Dataset and Implementation Details

4.1.1. Dataset

Infrared and Visible Image Dataset Based on UAV Perspective. Currently, there is a lack of unregistered infrared and visible light image datasets from UAV perspectives that can be used for training. To address this gap, this paper constructs a new unregistered benchmark dataset for infrared and visible images from a UAV perspective, based on three real multimodal UAV datasets: VEDAI [49], DroneRGBT [50], and DroneVehicle [51], and re-annotates it. The comparison between the original datasets and the constructed UAV perspective infrared and visible image dataset is shown in Table 4.
Specifically, to construct datasets under identical conditions, we first uniformly and randomly crop all images to 320 × 240. Subsequently, we randomly crop 150 × 150 image patches from the RGB images and apply random offsets within a certain range to the four corner points of the cropped regions. This process yields the homography matrix H and its inverse matrix. The infrared images are then warped using the inverse matrix, and a 150 × 150 image patch is cropped from the warped image at the corresponding position. The resulting image pairs, each of size 150 × 150, are used as training or testing samples [22]. Considering that the infrared images in VEDAI [49] lie in the near-infrared band, we construct a separate Near-Infrared UAV Homography Benchmark Dataset (NIUHBD), comprising 452 training and 24 test pairs. The datasets generated from DroneRGBT [50] and DroneVehicle [51] are merged into a unified UAV Homography Benchmark Dataset (UHBD), comprising 3998 training and 29 test pairs. The training–testing split for both datasets is determined based on the total number of samples and scene distributions, aiming to ensure sufficient diversity in the training set and independence in the test set. In practice, the splits are manually defined so that training and testing samples do not overlap, thereby improving the reliability of the evaluation. Since these datasets are mainly used to assess homography estimation performance, no category labels are involved. The process of generating the dataset is presented in Figure 5.
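A minimal OpenCV sketch of this pair-generation procedure is shown below; the maximum corner offset `rho` is a placeholder value, since the exact perturbation range is not specified above.

```python
import cv2
import numpy as np

def make_pair(rgb, infrared, patch=150, rho=32, rng=None):
    """Generate one unregistered training pair following the procedure described above.
    `rho` (maximum corner offset, in pixels) is a placeholder value, not taken from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = rgb.shape[:2]
    # 1. Randomly place a patch x patch crop inside the 320 x 240 image.
    x = int(rng.integers(rho, w - patch - rho))
    y = int(rng.integers(rho, h - patch - rho))
    corners = np.float32([[x, y], [x + patch, y], [x + patch, y + patch], [x, y + patch]])
    # 2. Randomly perturb the four corners to define the ground-truth homography H.
    perturbed = (corners + rng.uniform(-rho, rho, size=(4, 2))).astype(np.float32)
    H = cv2.getPerspectiveTransform(corners, perturbed)
    # 3. Warp the infrared image with the inverse homography and crop both patches.
    infrared_warped = cv2.warpPerspective(infrared, np.linalg.inv(H), (w, h))
    rgb_patch = rgb[y:y + patch, x:x + patch]
    inf_patch = infrared_warped[y:y + patch, x:x + patch]
    return rgb_patch, inf_patch, H
```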
The newly established benchmark dataset is divided into three distinct subsets: the visible image set $I_{RGB}$, the infrared image set $I_{INF}$, and the ground-truth infrared image set $I_{GT}$. Among these, $I_{RGB}$ and $I_{INF}$ are unaligned, while $I_{RGB}$ and $I_{GT}$ are aligned. $I_{GT}$ exists only in the test set and is used for qualitative comparison.
Synthetic Benchmark Dataset. We adopt the same synthetic benchmark dataset as [22] to evaluate the proposed method. The dataset is constructed from several public datasets, including the OSU Color-Thermal Database, INO, and TNO, and covers a wide range of indoor and outdoor scenes with structured objects such as pedestrians, vehicles, buildings, and natural landscapes. Following the data partitioning protocol in [22], the synthetic benchmark consists of unregistered infrared–visible image pairs of size 150 × 150, with 49,378 pairs for training, 29 pairs for validation, and 42 pairs for testing. For the synthetic benchmark dataset's test set, each image pair provides the ground-truth infrared image $I_{GT}$ and four sets of corresponding ground-truth corner coordinates. The ground-truth infrared image $I_{GT}$ is utilized for qualitative comparisons, aiming to visually demonstrate the channel-blending results between the predicted image and the true image. The corner coordinates are used for quantitative comparisons, enabling the computation of the relevant evaluation metrics.
Real-world Dataset. To further evaluate the generalization ability of the proposed method, we adopt the CVC Multimodal Stereo Dataset [52] as a real-world benchmark. The CVC dataset primarily contains long-wave infrared and visible images captured under various environments, with a resolution of 506 × 408. Since accurate ground-truth corner coordinates are not provided in CVC, we manually annotate four matched corner points for each test image pair in order to compute the evaluation metrics.

4.1.2. Experimentation Details

The proposed method was implemented using the PyTorch framework (version 1.10.0) and trained on an NVIDIA GeForce RTX 3090 (NVIDIA, Santa Clara, CA, USA). Training was performed with the Adam optimizer using an initial learning rate of $1 \times 10^{-5}$ and a decay factor of 0.8 per epoch. The batch size was 64, and the number of epochs was 50. In addition, to demonstrate the effectiveness of the proposed method, we compared its performance with that of the latest methods under the same environment on various datasets.
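The optimizer and schedule described above can be reproduced with a few lines of PyTorch; `model`, `train_loader`, and `compute_loss` are placeholders for the LFHomo network, the data loader, and the loss of Equation (10).

```python
import torch

def train(model, train_loader, compute_loss, epochs=50):
    """Training loop with the settings reported above: Adam, initial LR 1e-5,
    learning-rate decay factor 0.8 per epoch. Arguments are placeholders for the
    LFHomo network, the UHBD/NIUHBD data loader, and the loss of Equation (10)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
        scheduler.step()   # multiply the learning rate by 0.8 once per epoch
```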

4.2. Evaluation Metric

In the experiments, we employ a set of metrics to evaluate the performance of the proposed method, including Average Corner Error (ACE), Structural Similarity (SSIM), Adaptive Feature Registration Rate (AFRR) [22], average inference time, number of parameters, and peak memory usage.

4.2.1. Average Corner Error

ACE evaluates the accuracy of homography estimation by measuring the average $\ell_2$ distance between the four corner points obtained by the predicted homography and those obtained by the ground-truth homography. The definition is given in Equation (14). A lower ACE indicates a more accurate estimated homography matrix.
$ACE = \frac{1}{4}\sum_{i=1}^{4} \|x_i - y_i\|_2$ (14)
where $x_i$ represents the i-th corner point after transformation by the estimated homography matrix and $y_i$ denotes the corresponding corner point after transformation by the ground-truth homography matrix.
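For clarity, a small NumPy implementation of Equation (14) is given below; it assumes the homographies act on pixel coordinates of the four patch corners.

```python
import numpy as np

def average_corner_error(H_pred: np.ndarray, H_gt: np.ndarray, corners: np.ndarray) -> float:
    """Eq. (14): mean L2 distance between the four corners mapped by the predicted
    and ground-truth homographies. `corners` is a (4, 2) array of source points."""
    def project(H):
        pts = np.hstack([corners, np.ones((len(corners), 1))])  # homogeneous coordinates
        mapped = pts @ H.T
        return mapped[:, :2] / mapped[:, 2:3]                   # back to Cartesian coordinates
    return float(np.linalg.norm(project(H_pred) - project(H_gt), axis=1).mean())
```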

4.2.2. Structural Similarity

SSIM is used to measure the similarity between the warped source image and the target image after the homography transformation. A higher SSIM value indicates greater similarity and thus a more accurate homography. SSIM is defined in Equation (15):
$SSIM(x, y) = \dfrac{(2 m_x m_y + \alpha)(2 C_{xy} + \beta)}{(m_x^2 + m_y^2 + \alpha)(s_x^2 + s_y^2 + \beta)}$ (15)
where $x$ and $y$ denote the warped image and the target image, respectively; $m_x$ and $m_y$ are the mean values of all pixels in the images; $s_x$ and $s_y$ are their standard deviations; $C_{xy}$ is the covariance between $x$ and $y$; and $\alpha$ and $\beta$ are constants used to maintain numerical stability.

4.2.3. Adaptive Feature Registration Rate

AFRR [22] reflects homography accuracy by computing the proportion of correctly registered feature point pairs within an adaptive range. For each matched feature pair, the Euclidean distance $D_i$ between the two points is first computed. A threshold $\chi$ is used to discard obvious outliers; matches with $D_i < \chi$ are retained, and their distances are denoted as $D_i'$. A second threshold, $\delta$, is then applied to determine whether a retained match is accurate. The AFRR is defined in Equation (16):
$AFRR = \frac{1}{N}\sum_{i=1}^{N} \delta(D_i')$ (16)
where $N$ is the number of matches that satisfy $D_i < \chi$, and $\delta(D_i')$ indicates whether a retained match is judged accurate at the threshold $\delta$ (i.e., the match is regarded as accurate when $D_i' < \delta$).
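A small NumPy sketch of Equation (16) is given below; the threshold values shown are placeholders, as the actual values of χ and δ are not stated here.

```python
import numpy as np

def afrr(dist: np.ndarray, chi: float = 10.0, delta: float = 3.0) -> float:
    """Eq. (16): fraction of retained matches whose point distance is below `delta`.
    `dist` holds the Euclidean distances of all matched feature pairs; `chi` and `delta`
    are placeholder thresholds, not the values used in the paper."""
    kept = dist[dist < chi]              # discard obvious outliers (keep D_i < chi)
    if kept.size == 0:
        return 0.0
    return float((kept < delta).mean())  # proportion judged accurate at threshold delta
```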

4.2.4. Average Inference Time, Parameters, and Peak Memory Usage

To verify the effectiveness of the proposed method in terms of lightweight design and low computational cost, we also introduce the average inference time, number of parameters, and peak memory usage. In general, lower values of these metrics indicate a better trade-off between the method’s computational complexity and inference speed.

4.3. Qualitative Comparison

In this section, a qualitative comparison of the proposed method and existing approaches is conducted across three datasets. First, our method is compared with feature-based homography estimation methods, which comprise three feature descriptors (SIFT [18], ORB [19], BRISK [53]) and two outlier removal algorithms (RANSAC [20] and MAGSAC++ [21]). The experimental results are shown in Figure 6. The images in rows 1 and 5 come from the synthetic benchmark dataset, while rows 2, 3, 6, and 7 show images from the UAV-perspective infrared and visible image datasets constructed in this paper, UHBD and NIUHBD. The results demonstrate that feature-based methods exhibit significant image distortion, ghosting, and matching failures across all three datasets: the synthetic benchmark, UHBD, and NIUHBD. The significant modality differences between infrared and visible images likely make feature-based methods prone to errors during feature matching. Moreover, low-altitude UAV images often exhibit shake-induced blur, which hinders the extraction of sufficiently accurate features for matching with conventional methods. In contrast, the proposed method demonstrates more stable and accurate homography estimation on all three datasets.
Secondly, we compare the proposed method with existing deep learning-based homography estimation approaches, including DADHN [22], HomoMGAN [23], FCTrans [24], LCTrans [25], and homoViG [26]. The results are illustrated in Figure 7. To evaluate alignment accuracy, we blended the G and B channels of the predicted infrared image with the R channel of the real infrared image, and then visualized the degree of ghosting. The experimental results indicate that, on the synthetic benchmark dataset, where motion blur is minimal, the proposed method exhibits limited advantages in ghosting reduction. However, on the dataset based on the drone’s perspective, LFHomo demonstrated fewer ghosting artifacts and superior edge alignment performance under low-resolution and jitter-blurred conditions. This advantage primarily stems from the anti-blurring feature extractor, where the IRSC and SRCSSA blocks effectively compensate for feature information loss caused by motion blur or low resolution. This results in higher precision and robustness in UHBD and NIUHBD.

Analysis of Non-Advantageous Cases on UHBD

It is important to note that the visual difference between LFHomo and homoViG on the UHBD is indeed quite subtle. Nevertheless, it can be observed that, across all examples, after RGB channel blending, interpolation, and field-of-view cropping induced by the homography transformation, red artifacts often appear along the image boundaries. In the UHBD example shown in Figure 7, obvious red ghosting along the right edge indicates a relatively large alignment error; as the red ghosting on the right edge weakens or disappears, the alignment quality improves. To explain this phenomenon, Figure 8 further illustrates the RGB fusion of the pre-warp infrared image and the ground-truth infrared image. The ghosting in the G and B channels appears mainly to the left of the red ghosting, suggesting that the true homography shifts the image to the right in this case. The homoViG result shows almost no red artifacts at either edge, whereas LFHomo retains slight red artifacts on the left edge; however, both preserve a consistent spatial relationship between the G and B channels and the R-channel ghosting. This suggests that, although the two methods are difficult to distinguish visually, LFHomo may provide finer geometric alignment that is not immediately evident from a single qualitative visualization. The results in Section 4.4 (Quantitative Comparison) also confirm this.

4.4. Quantitative Comparison

In this section, we undertake a quantitative comparison of the proposed method with both feature-based and deep learning-based methods. Qualitative comparisons reveal that feature-based methods struggle to meet the accuracy requirements of low-altitude applications. To more precisely illustrate the accuracy gap between the proposed method and all other methods, we compare their corner error metrics on the synthetic benchmark dataset. Moreover, the results were categorized into three difficulty levels based on corner error: Easy (top 0–30%), Medium (top 30–60%), and Hard (top 60–100%) [42]. The detailed comparison results are shown in Table 5. In this context, the entry "-" indicates that the method failed on multiple occasions during testing, leaving no data for that level. The entry "I3×3" in row 2 represents the identity transformation, where the calculated corner error denotes the original distance between point pairs. Rows 2 to 8 report the corner errors of the feature-based methods at the three levels, and rows 9 to 14 present the results of the deep learning-based methods at each level. The final two columns of Table 5 present the mean corner error and failure rate, respectively.
As evidenced in Table 5, all feature-based methods experienced algorithmic failures, and their corner errors were higher at all levels compared to those of deep learning-based methods. This indicates that the performance of feature-based homography estimation methods in multimodal image scenarios falls short of practical requirements, and we believe that this finding can also be generalized to UAV-based infrared and visible image datasets. In contrast, deep learning-based methods demonstrate superior performance and robustness across all levels. The proposed method achieves the best result only at the Easy level, yet its final average corner error reaches 4.73, second only to the 4.68 achieved by homoViG [26].
To further evaluate the performance of the proposed method, we conduct quantitative comparisons between LFHomo and five deep learning-based approaches on three datasets, as reported in Table 6. It can be observed that, although LFHomo attains a slightly higher ACE than homoViG [26] on the synthetic benchmark dataset, it achieves the best ACE on both UAV datasets. Compared with the second-best method, its ACE is reduced by approximately 4–6% on average. For SSIM and AFRR, the differences among all methods are negligible. We speculate that this is because image warping introduces invalid pixel regions near the boundaries, which in turn affect the calculation of these two metrics. Therefore, ACE is adopted as the primary evaluation metric in the subsequent experiments.
Meanwhile, Table 7 compares the average inference time, number of parameters, and peak GPU memory usage during inference for the different methods. All experiments on the three datasets are conducted under a unified software–hardware configuration, with identical input and output sizes. Therefore, for a given model, the reported peak GPU memory usage is similar across the three datasets. As shown in Table 7, LFHomo achieves the lowest inference time, parameter count, and peak GPU memory usage among all competitors. This indicates that the method significantly improves the trade-off between computational efficiency and model complexity while maintaining competitive accuracy. Overall, these results are consistent with the hypothesis formulated in the introduction.
In addition, both the qualitative and quantitative comparisons show that the performance gap between homoViG and LFHomo is relatively small. To analyze whether these differences are statistically meaningful, we conduct paired t-tests on ACE across the three datasets and present the boxplot distributions of LFHomo and homoViG in Figure 9.
On the synthetic benchmark dataset, the ACE distributions of the two methods are highly similar, with a p-value of about 0.73, indicating that their accuracies are essentially at the same level in this scenario. By contrast, on the UHBD and NIUHBD UAV datasets, the ACE distributions of LFHomo are slightly shifted downward, with both the median and mean values being lower than those of homoViG. Although the corresponding p-values (0.14 and 0.27, respectively) do not reach the conventional 0.05 significance level under the current sample, LFHomo still shows a more consistent trend of error reduction while maintaining low computational complexity.

4.5. Comparison on the Real-World Dataset

We further conduct a quantitative comparison on the CVC real-world dataset [52], and the results are reported in Table 8. Similarly to the synthetic benchmark dataset, traditional methods also exhibit a large number of alignment failures on CVC, whereas deep learning-based approaches show more stable performance. Also, LFHomo does not achieve the best results on the Easy difficulty level, but attains the lowest errors on the other two difficulty levels. A plausible explanation is that, under the Easy setting, the image quality is relatively high, allowing methods such as homoViG [26] to exploit richer features and larger model capacity to achieve slightly better accuracy. In contrast, LFHomo sacrifices only a small amount of accuracy in these simple cases, while achieving significant error reductions on more challenging samples with poorer image quality. This suggests that the proposed method offers stronger robustness in more complex real-world scenarios.

4.6. Ablation Studies

4.6.1. Effectiveness of Proposed Components

To verify the contribution of each module to the model's performance, we conducted a series of ablation experiments, and the results are shown in Table 9. First, in the first two rows, we verify the complementarity between the shift module [34] and SRCSSA. As can be seen, when used separately, the shift module and the SRCSSA module each yield only limited improvements in performance. However, when employed in combination, they clearly enhance the homography estimation accuracy of the model, demonstrating a synergistic effect in capturing edge features.
Subsequently, to validate the benefit of the proposed homography estimator in achieving a balance between efficiency and accuracy, we compared the performance of six backbones: ResNet-34 [54], ViT [55], ViG [35], MobileViT [31], MobileViG [33], and RFViG [26]. Among them, although MobileViT [31] has the smallest number of parameters, its homography estimation accuracy is relatively low. RFViG [26] achieves the best accuracy, but LFHomo attains a more favorable trade-off among inference speed, parameter count, and estimation accuracy.

4.6.2. Effectiveness of Attention

To further evaluate the effectiveness of different attention mechanisms in mitigating motion blur and vibration, we compared the proposed SRCSSA with several commonly used attention mechanisms, including SENet [56], CBAM [57], ECA [58], Coordinate Attention (CA) [59], Triple Attention (TA) [60], and Efficient Local Attention (ELA) [61], as presented in Table 10. The results demonstrate that channel-wise attention mechanisms, such as SENet and ECA, yield a slightly lower computational cost and higher efficiency. Nonetheless, their accuracy is notably inferior to that of other attention mechanisms. This may be attributed to the fact that single-dimensional channel attention neglects the modeling of multi-dimensional feature relationships, leading to reduced model precision. In contrast, attention mechanisms such as CA and TA exhibit comparable computational efficiency and load to SRCSSA, but the average corner errors across the three datasets are still higher than SRCSSA.
Additionally, we investigate the impact of different group numbers and spatial reduction ratios on the performance of SRCSSA. The results are summarized in Table 11. In the first row, G is set to “S”, indicating a standard channel shuffle operation with four groups. The experiments show that SRCSSA achieves better performance than standard channel shuffle. As the number of groups varies, the ACE on all datasets first decreases and then increases, and the best performance is obtained when G = 4 and the spatial reduction ratio is set to 0.5.

4.6.3. Hyperparameter Settings in Loss Function

To analyze the impact of the loss hyperparameters on model performance, we compare several combinations of the weights λ and μ in the loss function. Table 12 reports the ACE values on the three datasets under different loss-weight settings. It can be observed that when μ becomes smaller (e.g., 0.001) or larger (e.g., 0.05), the error increases. A similar degradation occurs when λ deviates from 1.0 . The best results on all three datasets are obtained when λ = 1.0 and μ = 0.01 , which is consistent with the hyperparameter configuration adopted in [26].

5. Discussion

5.1. Analysis of SRCSSA and Standard Attention Mechanisms

In this subsection, we compare SRCSSA with standard attention mechanisms from two perspectives: attention structure design and module synergistic effect. SENet [56] and ECA [58] compute global weights only along the channel dimension and do not explicitly model spatial structural information. CBAM [57] applies channel and spatial attention sequentially, but its channel branch still follows a global weighting paradigm, which limits its ability to fully exploit information across different channel subspaces. By contrast, SRCSSA performs attention computation at a lower spatial resolution. This effectively reduces the computational cost and the effect of blur and background noise. Furthermore, the channel-sequential shuffle operation facilitates enhanced cross-channel information exchange, thereby accentuating key features in the images.
Conversely, CA [59], TA [60], and ELA [61] further introduce the concept of cross-dimensional spatial interaction, providing richer contextual representations along different dimensions. Yet, these methods are designed primarily for general-purpose scenarios and are not specifically tailored to the motion-blur conditions considered in this work. In our design, the Shift Module explicitly introduces local contextual information in blurred regions through pixel-level shifts, while SRCSSA focuses and selectively enhances this information. The two modules act in a complementary manner and provide more accurate feature representations for homography estimation in infrared and visible images than generic attention mechanisms. As shown in Table 10 of Section 4.6.2, SRCSSA achieves better or more stable homography estimation performance than several representative attention mechanisms, with comparable parameter counts and inference times.
From the viewpoint of convolutional design, SRCSSA is also closely related to depthwise separable convolutions. Depthwise separable convolutions efficiently capture local spatial patterns within each channel and enhance information flow. SRCSSA is not intended to replace these operators, but to act on top of them as a lightweight refinement module. By performing grouped channel attention and channel-sequential shuffle at a reduced spatial resolution, SRCSSA selectively reweights the features generated by depthwise separable convolutions. This process redistributes information across channels, strengthening informative subspaces and suppressing redundant responses. In this way, the convolutional blocks provide efficient local feature extraction, whereas SRCSSA focuses on cross-channel interaction and feature selection.

5.2. Limitations and Future Work

Although the proposed method extends the applicability of homography estimation in low-altitude scenarios, several challenges remain to be addressed. Firstly, in real UAV applications, RGB and infrared images captured by onboard sensors often exhibit significant depth variations. However, most existing methods disregard this factor, resulting in inadequate adaptability in complex 3D scenes. Possible reasons include the lack of unregistered low-altitude datasets with depth information and the absence of model architectures that explicitly account for depth variation. Inspired by studies on spatial–spectral interaction in multispectral image processing [62], we are planning to construct a low-altitude multimodal dataset with depth annotations and to introduce interaction modules between depth and multimodal image features. We hypothesize that this is a promising research direction for mitigating depth information loss during registration.
Secondly, with respect to the real-time requirements of low-altitude applications, LFHomo achieves relatively fast inference, but direct deployment on UAV platforms with very limited onboard resources remains difficult. In future work, we plan to explore deployment schemes on edge servers by leveraging 5G communication and edge computing technologies. Existing studies have shown that the end-to-edge collaborative paradigm can effectively alleviate onboard computing bottlenecks: through a “perception–transmission–computation–feedback” pipeline, computation-intensive tasks are offloaded from the UAV to edge nodes [63]. This paradigm benefits in particular from the high data rate and low latency of 5G/6G networks [64], which provide a foundation for building efficient low-altitude computing networks. On this basis, we will further incorporate model compression strategies such as network pruning, knowledge distillation, and low-bit quantization, and systematically measure changes in accuracy and inference time under different compression configurations.
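A natural first step for that study is a simple measurement loop that records accuracy and latency for each compression configuration. The sketch below outlines one possible form; the lfhomo model handle, the test_pairs loader, and the dynamic-quantization call applied to them are placeholders for illustration, not part of any released code.

```python
import time
import torch

def benchmark(model: torch.nn.Module, pairs, device: str = "cpu"):
    """Record mean corner error and mean per-pair inference time for one model
    variant. `pairs` is assumed to yield (infrared, visible, gt_corner_offsets)
    tensors; this data interface is a placeholder used only for illustration."""
    model.eval().to(device)
    errors, times = [], []
    with torch.no_grad():
        for ir, rgb, gt_offsets in pairs:
            start = time.perf_counter()
            pred_offsets = model(ir.to(device), rgb.to(device))
            times.append(time.perf_counter() - start)
            errors.append((pred_offsets.cpu() - gt_offsets).norm(dim=-1).mean().item())
    return sum(errors) / len(errors), sum(times) / len(times)

# Hypothetical comparison between the original model and a dynamically quantized
# variant (both `lfhomo` and `test_pairs` are placeholders).
# ace_fp32, t_fp32 = benchmark(lfhomo, test_pairs)
# lfhomo_int8 = torch.quantization.quantize_dynamic(lfhomo, {torch.nn.Linear}, dtype=torch.qint8)
# ace_int8, t_int8 = benchmark(lfhomo_int8, test_pairs)
```

The same loop can be reused unchanged for pruned or distilled variants, so that all configurations are compared under identical measurement conditions.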
Finally, because related research and publicly available data resources remain limited, LFHomo has so far been validated only in a subset of scenarios. Future work will expand the proposed UHBD and NIUHBD to cover more complex environments, such as mountainous areas, forests, and sea surfaces. For each scene type and weather condition, the distributions of ACE, matching success rates, and mismatch ratios will be computed, enabling a quantitative analysis of model robustness across diverse environments. In addition, insights from multispectral imaging research on complex environmental perception [65] will be leveraged to assess the generalization capability of the proposed method more comprehensively. These extensions are also expected to facilitate its application to environmental monitoring [66], disaster assessment [67,68], 3D reconstruction [69,70], and search-and-rescue tasks [71,72]. We are also interested in extending the method to hyperspectral image processing tasks [73,74] to evaluate its potential contribution to complex environmental perception and decision-making.
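For the per-scene robustness analysis, ACE can be computed directly from the predicted and ground-truth positions of the four corner points and then aggregated by scene label. The sketch below is a minimal version of such an evaluation, assuming NumPy arrays of corner coordinates; the scene-grouping interface and the 3-pixel success threshold are illustrative assumptions.

```python
from collections import defaultdict
import numpy as np

def average_corner_error(pred_corners: np.ndarray, gt_corners: np.ndarray) -> float:
    """ACE: mean Euclidean distance between the predicted and ground-truth
    positions of the four corner points; both arrays have shape (4, 2)."""
    return float(np.linalg.norm(pred_corners - gt_corners, axis=1).mean())

def summarize_by_scene(results, success_threshold: float = 3.0):
    """`results` yields (scene_label, pred_corners, gt_corners) triples. Returns
    per-scene mean ACE and the fraction of pairs below the success threshold."""
    per_scene = defaultdict(list)
    for scene, pred, gt in results:
        per_scene[scene].append(average_corner_error(pred, gt))
    return {
        scene: {
            "mean_ace": float(np.mean(errors)),
            "success_rate": float(np.mean([e < success_threshold for e in errors])),
        }
        for scene, errors in per_scene.items()
    }
```

Reporting the full distribution per scene, rather than a single overall mean, makes failure modes in specific environments easier to locate.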

6. Conclusions

This paper addresses the problem of homography estimation between infrared and visible images from a UAV perspective and proposes a lightweight, unsupervised, end-to-end model, LFHomo. A corresponding UAV-based multimodal image dataset is also constructed to support model training and evaluation. Experimental results on multiple infrared–visible datasets show that LFHomo achieves competitive estimation accuracy while significantly reducing model complexity and inference time, and that it maintains good robustness under motion blur. We believe that this study provides a new perspective on efficient geometric registration of multimodal images in UAV scenarios and suggests potential applications for real-time perception and multi-source information fusion in space–air–ground integrated networks.

Author Contributions

Conceptualization, Y.L. (Yanhao Liao) and Y.L. (Yinhui Luo); methodology, Y.L. (Yanhao Liao) and Y.L. (Yinhui Luo); formal analysis, Y.L. (Yanhao Liao), Y.L. (Yinhui Luo), J.Q. and Y.W.; investigation, J.Q., Y.L. (Yanhao Liao) and Y.W.; resources, J.Q., Y.L. (Yinhui Luo) and C.L.; data curation, Y.L. (Yanhao Liao), Y.W. and H.C.; writing—original draft preparation, Y.L. (Yinhui Luo) and Y.L. (Yanhao Liao); writing—review and editing, Y.L. (Yinhui Luo), Y.L. (Yanhao Liao), J.Q., Y.W. and C.L.; visualization, Y.L. (Yinhui Luo), Y.L. (Yanhao Liao) and H.C.; supervision, J.Q., Y.W., C.L. and H.C.; project administration, Y.L. (Yanhao Liao), and Y.L. (Yinhui Luo); funding acquisition, Y.L. (Yinhui Luo), J.Q., Y.W. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Fundamental Research Funds for The Central Universities (program no. ZJ2022-004, and no. ZHMH2022-006), in part by the Open Fund of Key Laboratory of Flight Techniques and Flight Safety under Grant (program no. F2024KF09C), in part by The Fundamental Research Funds for the Central Universities, The Funds for CAAC the Key Laboratory of Flight Techniques and Flight Safety (program no. FZ2025ZX34), in part by the Civil Aviation Information Technology Research Center, the Civil Aviation Flight University of China (program no. 25CAFUC09010), in part by Henan Province Key R&D Special Fund (program no. 251111242100) and in part by the College Students’ innovation and entrepreneurship training program of Civil Aviation Flight University of China (program no. 202510624039).

Data Availability Statement

Data derived from public domain resources.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable suggestions, which were of great help in improving the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SIFT: Scale Invariant Feature Transform
ORB: Oriented FAST and Rotated BRIEF
DLT: Direct Linear Transformation
OAN: Order Aware Network
LCTrans: Local Correlation Transformer
GNNs: Graph Neural Networks
K-NN: K-Nearest Neighbor
UAV: Unmanned Aerial Vehicle
ViT: Vision Transformers
SM: Shift Module
IRSC: Inverted Residual Shift Convolution
SRCSSA: Spatial Reduction Channel-Sequential Shuffle Attention
SURF: Speeded Up Robust Features
BEBLID: Boosted Efficient Binary Local Image Descriptor
RANSAC: Random Sample Consensus
MAGSAC++: Marginalizing Sample Consensus
LIFT: Learned Invariant Feature Transform
LoFTR: Local Feature Matching with Transformers
GAN: Generative Adversarial Network
FCTrans: Feature Correlation Transformer
CNN: Convolutional Neural Network
NAS: Network Architecture Search
ViG: Vision in Graph Neural Network
SVGA: Sparse Vision Graph Attention
CSSA: Channel Sequence Shuffling Attention
GSA: Global Spatial Attention
AFF: Attention Feature Fusion
IRBlock: Inverse Residual Block
DFL: Detail Feature Loss
FIL: Feature Identity Loss
NIUHBD: Near-Infrared UAV Homography Benchmark Dataset
UHBD: UAV Homography Benchmark Dataset
ACE: Average Corner Error
STN: Spatial Transform Network
SENet: Squeeze and Excitation Network
CBAM: Convolutional Block Attention Module
ECA: Efficient Channel Attention
CA: Coordinate Attention
TA: Triple Attention
SSIM: Structural Similarity
AFRR: Adaptive Feature Registration Rate
ELA: Efficient Local Attention

References

  1. Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-time object detection network in UAV-vision based on CNN and transformer. IEEE Trans. Instrum. Meas. 2023, 72, 2505713. [Google Scholar] [CrossRef]
  2. Kakaletsis, E.; Symeonidis, C.; Tzelepi, M.; Mademlis, I.; Tefas, A.; Nikolaidis, N.; Pitas, I. Computer vision for autonomous UAV flight safety: An overview and a vision-based safe landing pipeline example. ACM Comput. Surv. CSUR 2021, 54, 1–37. [Google Scholar] [CrossRef]
  3. Chen, Q.; Zhu, H.; Yang, L.; Chen, X.; Pollin, S.; Vinogradov, E. Edge computing assisted autonomous flight for UAV: Synergies between vision and communications. IEEE Commun. Mag. 2021, 59, 28–33. [Google Scholar] [CrossRef]
  4. McEnroe, P.; Wang, S.; Liyanage, M. A Survey on the Convergence of Edge Computing and AI for UAVs: Opportunities and Challenges. IEEE Internet Things J. 2022, 9, 15435–15459. [Google Scholar] [CrossRef]
  5. Zhao, D.; Zhou, L.; Li, Y.; He, W.; Arun, P.V.; Zhu, X.; Hu, J. Visibility estimation via near-infrared bispectral real-time imaging in bad weather. Infrared. Phys. Technol. 2024, 136, 105008. [Google Scholar] [CrossRef]
  6. Qin, H.; Xu, T.; Li, T.; Chen, Z.; Feng, T.; Li, J. MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 16882–16891. [Google Scholar]
  7. Zhao, D.; Hu, B.; Jiang, W.; Zhong, W.; Arun, P.V.; Cheng, K.; Zhao, Z.; Zhou, H. Hyperspectral video tracker based on spectral difference matching reduction and deep spectral target perception features. Opt. Lasers Eng. 2025, 194, 109124. [Google Scholar] [CrossRef]
  8. Ramos, L.; Sappa, A.D. Multispectral semantic segmentation for land cover classification: An overview. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14295–14336. [Google Scholar] [CrossRef]
  9. Memari, M.; Shekaramiz, M.; Masoum, M.A.; Seibi, A.C. Data Fusion and Ensemble Learning for Advanced Anomaly Detection Using Multi-Spectral RGB and Thermal Imaging of Small Wind Turbine Blades. Energies 2024, 17, 673. [Google Scholar] [CrossRef]
  10. Zhao, D.; Yan, W.; You, M.; Zhang, J.; Arun, P.V.; Jiao, C.; Wang, Q.; Zhou, H. Hyperspectral Anomaly Detection Based on Empirical Mode Decomposition and Local Weighted Contrast. IEEE Sens. J. 2024, 24, 33847–33861. [Google Scholar] [CrossRef]
  11. Zhao, D.; Asano, Y.; Gu, L.; Sato, I.; Zhou, H. City-scale distance sensing via bispectral light extinction in bad weather. Remote Sens. 2020, 12, 1401. [Google Scholar] [CrossRef]
  12. Shin, U.; Park, K.; Lee, B.-U.; Lee, K.; Kweon, I.S. Self-Supervised Monocular Depth Estimation from Thermal Images via Adversarial Multi-Spectral Adaptation. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 5787–5796. [Google Scholar]
  13. Luo, Y.; Wang, X.; Liao, Y.; Fu, Q.; Shu, C.; Wu, Y.; He, Y. A Review of Homography Estimation: Advances and Challenges. Electronics 2023, 12, 4977. [Google Scholar] [CrossRef]
  14. Lin, B.; Xu, X.; Shen, Z.; Yang, X.; Zhong, L.; Zhang, X. A Registration Algorithm for Astronomical Images Based on Geometric Constraints and Homography. Remote Sens. 2023, 15, 1921. [Google Scholar] [CrossRef]
  15. Debaque, B.; Perreault, H.; Mercier, J.P.; Drouin, M.A.; David, R.; Chatelais, B.; Duclos-Hindié, N.; Roy, S. Thermal and Visible Image Registration Using Deep Homography. In Proceedings of the 2022 25th International Conference on Information Fusion (FUSION), Linköping, Sweden, 4–7 July 2022; pp. 1–8. [Google Scholar]
  16. Bazargani, H.; Bilaniuk, O.; Laganière, R. A Fast and Robust Homography Scheme for Real-Time Planar Target Detection. J. Real-Time Image Proc. 2018, 15, 739–758. [Google Scholar] [CrossRef]
  17. Lu, F.; Dong, S.; Zhang, L.; Liu, B.; Lan, X.; Jiang, D.; Yuan, C. Deep homography estimation for visual place recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 10341–10349. [Google Scholar]
  18. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  19. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An Efficient Alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  20. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  21. Barath, D.; Noskova, J.; Ivashechkin, M.; Matas, J. MAGSAC++, a Fast, Reliable and Accurate Robust Estimator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1304–1312. [Google Scholar]
  22. Luo, Y.; Wang, X.; Wu, Y.; Shu, C. Detail-Aware Deep Homography Estimation for Infrared and Visible Image. Electronics 2022, 11, 4185. [Google Scholar] [CrossRef]
  23. Luo, Y.; Wang, X.; Wu, Y.; Shu, C. Infrared and Visible Image Homography Estimation Using Multiscale Generative Adversarial Network. Electronics 2023, 12, 788. [Google Scholar] [CrossRef]
  24. Wang, X.; Luo, Y.; Fu, Q.; Rui, Y.; Shu, C.; Wu, Y.; He, Z.; He, Y. Infrared and Visible Image Homography Estimation Based on Feature Correlation Transformers for Enhanced 6G Space–Air–Ground Integrated Network Perception. Remote Sens. 2023, 15, 3535. [Google Scholar] [CrossRef]
  25. Wang, X.; Luo, Y.; Fu, Q.; He, Y.; Shu, C.; Wu, Y.; Liao, Y. Coarse-to-Fine Homography Estimation for Infrared and Visible Images. Electronics 2023, 12, 4441. [Google Scholar] [CrossRef]
  26. Liao, Y.; Luo, Y.; Fu, Q.; Shu, C.; Wu, Y.; Liu, Q.; He, Y. Deep Unsupervised Homography Estimation for Single-Resolution Infrared and Visible Images Using GNN. Electronics 2024, 13, 4173. [Google Scholar] [CrossRef]
  27. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  28. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  29. Howard, A.; Pang, R.; Adam, H.; Le, Q.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  30. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586. [Google Scholar]
  31. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the ICLR, Virtual-Only, 25 April 2022. [Google Scholar]
  32. Li, Y.; Yuan, G.; Wen, Y.; Hu, J.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. Efficientformer: Vision transformers at mobilenet speed. Adv. Neural Inf. Process. Syst. 2022, 35, 12934–12949. [Google Scholar]
  33. Munir, M.; Avery, W.; Marculescu, R. Mobilevig: Graph-based sparse attention for mobile vision applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2211–2219. [Google Scholar]
  34. Gao, T.; Zhang, Y.; Zhang, Z.; Geng, T.; Li, A.; Fang, Z.; Shi, L.; Di, X.; Li, H. BHViT: Binarized Hybrid Vision Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 3563–3572. [Google Scholar]
  35. Han, K.; Wang, Y.; Guo, J.; Tang, Y.; Wu, E. Vision gnn: An image is worth graph of nodes. In Proceedings of the 35th Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 8291–8303. [Google Scholar]
  36. Bay, H.; Tuytelaars, T.; Gool, L.V. Surf: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar]
  37. Suárez, I.; Sfeir, G.; Buenaposada, J.M.; Baumela, L. BEBLID: Boosted efficient binary local image descriptor. Pattern Recognit. Lett. 2020, 133, 366–372. [Google Scholar] [CrossRef]
  38. Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. Lift: Learned Invariant Feature Transform. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 10–16 October 2016; pp. 467–483. [Google Scholar]
  39. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  40. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4938–4947. [Google Scholar]
  41. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8918–8927. [Google Scholar]
  42. Munir, M.; Avery, W.; Rahman, M.M.; Marculescu, R. GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 17–21 June 2024; pp. 6118–6127. [Google Scholar]
  43. Li, B.; Zhao, H.; Wang, W.; Liu, J.; Jiang, P.; Liu, Y. MAIR: A Locality-and Continuity-Preserving Mamba for Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 7491–7501. [Google Scholar]
  44. Li, G.; Muller, M.; Thabet, A.; Ghanem, B. Deepgcns: Can Gcns Go as Deep as Cnns? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 9267–9276. [Google Scholar]
  45. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  46. Hong, M.; Lu, Y.; Ye, N.; Lin, C.; Zhao, Q.; Liu, S. Unsupervised Homography Estimation with Coplanarity-Aware GAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17663–17672. [Google Scholar]
  47. Zhang, J.; Wang, C.; Liu, S.; Jia, L.; Ye, N.; Wang, J.; Zhou, J.; Sun, J. Content-Aware Unsupervised Deep Homography Estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 653–669. [Google Scholar]
  48. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Proceedings of the 29th Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025. [Google Scholar]
  49. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
  50. Peng, T.; Li, Q.; Zhu, P. Rgb-t crowd counting from drone: A benchmark and mmccn network. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  51. Sun, Y.; Cao, B.; Zhu, P. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
  52. Aguilera, C.; Barrera, F.; Lumbreras, F.; Sappa, A.D.; Toledo, R. Multispectral Image Feature Points. Sensors 2012, 12, 12661–12672. [Google Scholar] [CrossRef]
  53. Leutenegger, S.; Chli, M.; Siegwart, R.Y. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2548–2555. [Google Scholar]
  54. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  55. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  56. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 7132–7141. [Google Scholar]
  57. Sanghyun, W.; Jongchan, P.; Joon-Young, L.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  58. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  59. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  60. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 3139–3148. [Google Scholar]
  61. Xu, W.; Wan, Y. ELA: Efficient Local Attention for Deep Convolutional Neural Networks. arXiv 2024, arXiv:2403.01123. [Google Scholar] [CrossRef]
  62. Zhao, D.; Zhong, W.; Ge, M.; Jiang, W.; Zhu, X.; Arun, P.V.; Zhou, H. SiamBSI: Hyperspectral video tracker based on band correlation grouping and spatial-spectral information interaction. Infrared Phys. Technol. 2025, 151, 106063. [Google Scholar] [CrossRef]
  63. Filali, A.; Abouaomar, A.; Cherkaoui, S.; Kobbane, A.; Guizani, M. Multi-access edge computing: A Survey. IEEE Access 2020, 8, 197017–197046. [Google Scholar] [CrossRef]
  64. Qi, Q.; Chen, X.; Khalili, A.; Zhong, C.; Zhang, Z.; Ng, D.W.K. Integrating Sensing, Computing, and Communication in 6G Wireless Networks: Design and Optimization. IEEE Trans. Commun. 2022, 70, 6212–6227. [Google Scholar] [CrossRef]
  65. Zhao, D.; Tang, L.; Arun, P.V.; Asano, Y.; Zhang, L.; Xiong, Y.; Tao, X.; Hu, J. City-scale distance estimation via near-infrared trispectral light extinction in bad weather. Infrared Phys. Technol. 2023, 128, 104507. [Google Scholar] [CrossRef]
  66. Asadzadeh, S.; de Oliveira, W.J.; de Souza Filho, C.R. UAV-based remote sensing for the petroleum industry and environmental monitoring: State-of-the-art and perspectives. J. Pet. Sci. Eng. 2022, 208, 109633. [Google Scholar] [CrossRef]
  67. Daud, S.M.S.M.; Yusof, M.Y.P.M.; Heo, C.C.; Khoo, L.S.; Singh, M.K.C.; Mahmood, M.S.; Nawawi, H. Applications of drone in disaster management: A scoping review. Sci. Justice 2022, 62, 30–42. [Google Scholar] [CrossRef]
  68. Khan, A.; Gupta, S.; Gupta, S.K. Emerging UAV technology for disaster detection, mitigation, response, and preparedness. J. Field Robot. 2022, 39, 905–955. [Google Scholar] [CrossRef]
  69. Wudunn, M.; Zakhor, A.; Touzani, S.; Granderson, J. Aerial 3d building reconstruction from rgb drone imagery. Geospat. Inform. X 2020, 11398, 9–19. [Google Scholar]
  70. Maboudi, M.; Homaei, M.; Song, S.; Malihi, S.; Saadatseresht, M.; Gerke, M. A Review on Viewpoints and Path Planning for UAV-Based 3D Reconstruction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5026–5048. [Google Scholar] [CrossRef]
  71. Mittal, M.; Mohan, R.; Burgard, W.; Valada, A. Vision-based autonomous UAV navigation and landing for urban search and rescue. Springer Proc. Adv. Robot. 2022, 20, 575–592. [Google Scholar]
  72. Du, Y. Multi-UAV Search and Rescue with Enhanced A* Algorithm Path Planning in 3D Environment. Int. J. Aerosp. Eng. 2023, 2023, 8614117. [Google Scholar] [CrossRef]
  73. Zhao, D.; Zhang, H.; Arun, P.V.; Jiao, C.; Zhou, H.; Xiang, P.; Cheng, K. SiamSTU: Hyperspectral video tracker based on spectral spatial angle mapping enhancement and state aware template update. Infrared Phys. Technol. 2025, 151, 105919. [Google Scholar] [CrossRef]
  74. Zhao, D.; Zhang, H.; Huang, K.; Zhu, X.; Arun, P.V.; Jiang, W.; Li, S.; Pei, X.; Zhou, H. SASU-Net: Hyperspectral video tracker based on spectral adaptive aggregation weighting and scale updating. Expert Syst. Appl. 2025, 272, 126721. [Google Scholar] [CrossRef]
Figure 1. Overall network structure, where I_inf and I_rgb denote the input infrared and visible images, respectively, and f_inf(·) and f_rgb(·) represent the two anti-blurring fast feature extractors that do not share weight parameters. LFHomoE comprises three components: an Embedding layer, an LFHomoE block, and two fully connected layers.
Figure 2. Inverse residual displacement convolution block and inverse residual convolution block.
Figure 3. The operation process of the shift module. Different colors are used only for visualization to highlight example spatial regions before and after shifting, and do not encode additional information.
Figure 4. The structure of the spatial reduction channel sequence shuffling attention module.
Figure 5. The process of generating the infrared and visible images dataset based on UAV Perspectives. In Step 2, the red dashed corner squares outline the offset range, and the arrows indicate the random displacements of the four corner points.
Figure 6. Qualitative comparison of the proposed method and feature-based methods. To visualize the results, we mix the blue and green channels of the distorted infrared image with the red channel of the ground-truth infrared image, so that unaligned pixels appear as yellow, blue, red, or green ghosts. The same visualization is applied to the other results. A “-” entry indicates that the method failed to align this pair of test images.
Figure 7. Qualitative comparison with deep learning-based methods. We also used yellow and red boxes to highlight areas prone to ghosting for clearer comparison and analysis.
Figure 8. RGB channel-mixing visualization of the UHBD example in Figure 7, where the G and B channels are taken from the infrared image and the R channel is taken from the corresponding ground-truth infrared image.
Figure 9. Box plot distributions of ACE for LFHomo and homoViG on Synthetic Benchmark Dataset, UHBD and NIUHBD.
Table 1. Deep learning-based methods for multimodal homography estimation. The table lists the core idea, advantages, and limitations of each method.
Method | Core Strategy | Advantages | Limitations
DADHN [22] | Designs a fine-grained feature extractor to support homography estimation. | Preserves channel and spatial information of multi-scale features, enhancing meaningful feature representations. | Limited shallow feature extraction capability; struggles in heavily blurred regions.
HomoMGAN [23] | Formulates the homography estimation process as a generative adversarial process. | Self-optimizes the homography matrix via GAN without explicitly optimizing images or attention maps. | Sensitive to illumination changes.
FCTrans [24] | Uses cross-image attention to explicitly model feature correlations. | Converts the multi-source homography problem into a single-source one. | Difficult to handle large-baseline image pairs.
LCTrans [25] | Adopts a coarse-to-fine strategy to iteratively refine the homography matrix. | Obtains multi-scale feature maps within a single network without additional matrix fusion. | Sensitive to noise.
homoViG [26] | First applies a graph attention mechanism to homography estimation. | Uses graph structures to mitigate modality discrepancies and strengthen feature matching. | Generalization to motion blur, adverse weather, and low-resolution UAV scenes remains unexplored.
Table 2. The details of the anti-blurring feature extractor.
Layer | Type | Input Channels | Output Channels | Output Size
Layer-1 | IRSC | 1 | 8 | H × W
Layer-2 | IRSC | 8 | 16 | H × W
Layer-3 | IRSC | 16 | 32 | H × W
Layer-4 | Downsampling | 32 | 32 | H/2 × W/2
Layer-5 | CSSA and GSA | 32 | 32 | H/2 × W/2
Layer-6 | Upsampling | 32 | 16 | H × W
Layer-7 | IRSC | 16 | 1 | H × W
Table 3. The structure of each stage of LFHomoE. Among them, Conv represents the convolutional layer, IRBlock represents the inverse residual block.
Stage | Type | Input Channels | Output Channels | Input Size | Output Size | Para (M) | FLOPs
Embedding | Conv × 3 | 2 | 42 | H × W | H/4 × W/4 | 0.008 | 1.94 × 10^7
Stage 1 | IRBlock × 2, ViG Block × 1 | 42 | 42 | H/4 × W/4 | H/4 × W/4 | 0.047 | 9.65 × 10^7
Downsampling | Conv | 42 | 84 | H/4 × W/4 | H/8 × W/8 | 0.032 | 1.63 × 10^7
Stage 2 | IRBlock × 2, ViG Block × 1 | 84 | 84 | H/8 × W/8 | H/8 × W/8 | 0.182 | 9.34 × 10^7
Downsampling | Conv | 84 | 168 | H/8 × W/8 | H/16 × W/16 | 0.127 | 1.62 × 10^7
Stage 3 | IRBlock × 5, ViG Block × 1 | 168 | 168 | H/16 × W/16 | H/16 × W/16 | 1.413 | 1.81 × 10^8
Downsampling | Conv | 168 | 192 | H/16 × W/16 | H/32 × W/32 | 0.291 | 9.29 × 10^6
Stage 4 | IRBlock × 2, ViG Block × 1 | 192 | 192 | H/32 × W/32 | H/32 × W/32 | 0.935 | 2.99 × 10^7
FC | Pooling & MLP | 192 | 8 | H/32 × W/32 | 1 × 1 | 0.001 | 3.01 × 10^3
Table 4. Comparison of the original dataset and the infrared and visible images dataset based on UAV Perspective.
Type | Dataset | Original Dataset | Original Number | Original Resolution | Unregistered Number | Unregistered Resolution
Training Set | NIUHBD | VEDAI | 1246 | 512 × 512 | 452 | 150 × 150
 | UHBD | DroneRGBT | 1807 | 640 × 512 | 293 | 150 × 150
 | | DroneVehicle | 17,990 | 840 × 712 | 3705 | 150 × 150
Test Set | NIUHBD | VEDAI | 1246 | 512 × 512 | 24 | 150 × 150
 | UHBD | DroneRGBT | 1800 | 640 × 512 | 6 | 150 × 150
 | | DroneVehicle | 8980 | 840 × 712 | 23 | 150 × 150
Table 5. Comparison of corner errors between the proposed algorithm and all other methods on the synthetic benchmark dataset.
(1) | Method | Easy | Moderate | Hard | Average | Failure Rate
(2) | I3×3 | 4.59 | 5.71 | 6.77 | 5.79 | 0%
(3) | SIFT + RANSAC | 50.87 | - | - | 50.87 | 93%
(4) | SIFT + MAGSAC++ | 131.72 | - | - | 131.72 | 93%
(5) | ORB + RANSAC | 82.64 | 118.29 | 313.74 | 160.89 | 17%
(6) | ORB + MAGSAC++ | 85.99 | 109.14 | 142.54 | 109.13 | 19%
(7) | BRISK + RANSAC | 104.06 | 126.8 | 244.01 | 143.2 | 24%
(8) | BRISK + MAGSAC++ | 101.37 | 136.01 | 234.14 | 143.4 | 24%
(9) | DADHN | 3.84 | 5.01 | 6.09 | 5.08 | 0%
(10) | HomoMGAN | 3.85 | 4.99 | 6.05 | 5.06 | 0%
(11) | FCTrans | 3.75 | 4.70 | 5.94 | 4.91 | 0%
(12) | LCTrans | 3.66 | 4.65 | 5.77 | 4.80 | 0%
(13) | homoViG | 3.72 | 4.50 | 5.55 | 4.68 | 0%
(14) | LFHomo (Ours) | 3.65 | 4.59 | 5.68 | 4.73 | 0%
The black bold number indicates the best result.
Table 6. Quantitative comparison results of each deep learning-based method on different datasets. The black bold number indicates the best result.
Dataset | Method | ACE | SSIM | AFRR
Synthetic Benchmark | DADHN [22] | 5.08 | 0.97 | 0.74
 | HomoMGAN [23] | 5.06 | 0.97 | 0.72
 | FCTrans [24] | 4.91 | 0.97 | 0.72
 | LCTrans [25] | 4.80 | 0.97 | 0.71
 | homoViG [26] | 4.68 | 0.96 | 0.70
 | LFHomo (Ours) | 4.73 | 0.97 | 0.72
UHBD | DADHN [22] | 7.03 | 0.95 | 0.87
 | HomoMGAN [23] | 6.69 | 0.95 | 0.85
 | FCTrans [24] | 6.59 | 0.96 | 0.88
 | LCTrans [25] | 6.41 | 0.95 | 0.87
 | homoViG [26] | 6.30 | 0.94 | 0.82
 | LFHomo (Ours) | 6.14 | 0.95 | 0.88
NIUHBD | DADHN [22] | 6.59 | 0.94 | 0.87
 | HomoMGAN [23] | 6.54 | 0.94 | 0.88
 | FCTrans [24] | 6.39 | 0.95 | 0.87
 | LCTrans [25] | 6.36 | 0.92 | 0.85
 | homoViG [26] | 6.32 | 0.94 | 0.86
 | LFHomo (Ours) | 6.12 | 0.94 | 0.87
Table 7. Average inference time, parameters, and peak memory usage for each method. The black bold number denotes the best result.
Method | Time (s) | Parameter (M) | Memory (M)
DADHN [22] | 0.36 | 85.23 | 85.32
HomoMGAN [23] | 0.31 | 20.44 | 20.44
FCTrans [24] | 0.36 | 20.12 | 20.12
LCTrans [25] | 0.30 | 21.99 | 21.99
homoViG [26] | 0.17 | 5.243 | 21.27
LFHomo (Ours) | 0.09 | 3.233 | 19.35
Table 8. Comparison of corner errors between the proposed algorithm and all other methods on the real-world dataset.
(1) | Method | Easy | Moderate | Hard | Average | Failure Rate
(2) | I3×3 | 2.56 | 4.13 | 6.17 | 4.28 | -
(3) | SIFT + RANSAC | 81.19 | - | - | 81.19 | 89%
(4) | SIFT + MAGSAC++ | 81.91 | - | - | 81.91 | 98%
(5) | ORB + RANSAC | 112.31 | - | - | 112.31 | 81%
(6) | ORB + MAGSAC++ | 64.29 | - | - | 64.29 | 81%
(7) | BRISK + RANSAC | 63.58 | - | - | 63.58 | 79%
(8) | BRISK + MAGSAC++ | 52.86 | - | - | 52.86 | 81%
(9) | DADHN | 2.17 | 3.95 | 5.50 | 4.03 | 0%
(10) | HomoMGAN | 2.47 | 3.89 | 5.52 | 4.12 | 0%
(11) | FCTrans | 2.26 | 3.76 | 5.44 | 3.98 | 0%
(12) | LCTrans | 2.14 | 3.68 | 5.39 | 3.90 | 0%
(13) | homoViG | 2.03 | 3.64 | 5.36 | 3.82 | 0%
(14) | LFHomo (Ours) | 2.17 | 3.31 | 4.59 | 3.57 | 0%
The black bold number denotes the best result.
Table 9. Ablation experiment results of different proposed components. Among them, ✓ and × represent whether the component is enabled.
Shift Module | SRCSSA | Backbone | ACE (Synthetic Benchmark) | ACE (UHBD) | ACE (NIUHBD) | Time (s) | Parameter (M)
✓ | × | LFHomoE | 5.02 | 6.33 | 6.23 | 0.09 | 3.232
× | ✓ | LFHomoE | 5.01 | 6.31 | 6.21 | 0.09 | 3.233
 | | ResNet-34 [54] | 5.08 | 6.63 | 6.52 | 0.27 | 23.53
 | | ViT [55] | 5.11 | 6.49 | 6.35 | 0.13 | 20.45
 | | ViG [35] | 4.68 | 6.26 | 6.20 | 0.11 | 5.877
 | | MobileViT [31] | 5.17 | 6.64 | 6.45 | 0.09 | 1.848
 | | MobileViG [33] | 4.76 | 6.25 | 6.19 | 0.09 | 3.232
 | | RFViG [26] | 4.65 | 6.22 | 6.17 | 0.12 | 5.879
 | | LFHomoE (Ours) | 4.73 | 6.14 | 6.12 | 0.09 | 3.233
Black bold font represents the best results.
Table 10. Comparison of average corner error and parameters of different attentions in feature extractors on three datasets.
Attention | ACE (Synthetic Benchmark) | ACE (UHBD) | ACE (NIUHBD) | Time (s) | Parameter (M)
SENet [56] | 5.09 | 6.71 | 6.67 | 0.09 | 3.232
CBAM [57] | 4.98 | 6.54 | 6.60 | 0.11 | 3.233
ECA [58] | 5.02 | 4.56 | 6.64 | 0.09 | 3.232
CA [59] | 4.85 | 6.48 | 6.41 | 0.10 | 3.234
TA [60] | 4.80 | 6.25 | 6.29 | 0.10 | 3.233
ELA [61] | 4.82 | 6.46 | 6.53 | 0.10 | 3.233
SRCSSA (Ours) | 4.73 | 6.14 | 6.12 | 0.09 | 3.233
Black bold font indicates the best results.
Table 11. SRCSSA performance changes under different group numbers and space reduction rate settings. The best results are indicated in black bold.
G | R | ACE (Synthetic Benchmark) | ACE (UHBD) | ACE (NIUHBD)
S | 0.5 | 4.99 | 6.34 | 6.25
1 | 0.5 | 5.08 | 6.58 | 6.51
2 | 0.5 | 5.06 | 6.53 | 6.44
4 | 0.5 | 4.73 | 6.14 | 6.12
8 | 0.5 | 5.02 | 6.42 | 6.37
4 | 1.0 | 4.76 | 6.17 | 6.16
4 | 0.25 | 4.76 | 6.33 | 6.28
Table 12. The performance of LFHomo under different combinations of loss weights λ and μ . Bold values indicate the best result.
λ | μ | ACE (Synthetic Benchmark) | ACE (UHBD) | ACE (NIUHBD)
0.1 | 0.001 | 4.93 | 6.41 | 6.32
1.0 | 0.001 | 4.86 | 6.24 | 6.14
1.0 | 0.01 | 4.73 | 6.14 | 6.12
1.0 | 0.05 | 5.11 | 6.62 | 6.56
0.1 | 0.01 | 5.06 | 6.42 | 6.27
2.0 | 0.001 | 4.80 | 6.35 | 6.24
2.0 | 0.01 | 5.10 | 6.44 | 6.40
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
